Vetted Talent

Priyanshu Gandhi


Data Engineer II with 6 years of experience specializing in JavaScript, Java, Python, and SQL. Proficient in the PySpark and Django frameworks. Skilled in working with the JSON data format and the PyTorch library. Experienced in developing applications using Flask and NumPy. Knowledgeable in pandas data manipulation and Streamlit for data visualization.

  • Role

    Senior Data & Databricks Engineer

  • Years of Experience

    6 years

  • Professional Portfolio

    View here

Skillsets

  • Keboola
  • dbt
  • Delta Lake
  • Docker
  • FastAPI
  • GCP
  • Git
  • Hadoop
  • Hive
  • Iceberg
  • Kafka
  • CI/CD
  • LlamaIndex
  • Mixpanel
  • MongoDB
  • MySQL
  • Oracle
  • Redshift
  • Teradata
  • Terraform
  • Vector databases
  • Keras
  • SQL - 5 Years
  • Snowflake - 4 Years
  • Apache Spark - 4 Years
  • AWS - 4 Years
  • Databricks - 2 Years
  • Dask
  • Django
  • Flask
  • Java
  • JSON
  • Python - 5 Years
  • LangChain
  • NumPy
  • pandas
  • PySpark
  • PyTorch
  • Streamlit
  • Airflow
  • Azure
  • BigQuery

Vetted For

13 Skills
  • Data Engineer II (Remote) - AI Screening
  • 81%
  • Skills assessed: Airflow, Data Governance, Machine Learning and Data Science, BigQuery, ETL processes, Hive, Relational DB, Snowflake, Hadoop, Java, PostgreSQL, Python, SQL
  • Score: 73/90

Professional Summary

6 Years
  • Jan 2024 - Present · 2 yr 1 month

    Senior Data Engineer

    Tripadvisor
  • Jan 2023 - Jan 2024 · 1 yr

    Software Development Engineer II

    Groupon
  • Jan 2019 - Jan 2023 · 4 yr

    Lead Data Engineer

    LTIMindtree

Applications & Tools Known

  • Python
  • Snowflake
  • Spark SQL
  • Apache Airflow

Work History

6 Years

Senior Data Engineer

Tripadvisor
Jan 2024 - Present · 2 yr 1 month
    • Architected and scaled a real-time event tracking and analytics platform using AWS (S3, Glue, Lambda), Snowflake, and dbt, processing 2TB+ of clickstream and product data daily. Enabled dynamic engagement and achieved 98% data accuracy through a unified ingestion template with automated schema evolution and governance.
    • Engineered and optimized Snowflake-to-Mixpanel data pipelines to automate ingestion of 2B+ user events monthly, reducing processing time from 5 weeks to 4 days, enabling real-time funnel analysis and cohort tracking for product teams, improving time-to-analytics, and reducing manual data prep by 60%.
    • Partnered with MLOps teams to develop LLM-driven systems, including an AI-powered image ranking model analyzing 1M+ hotel and attraction images (boosting CTR by 18%) and a semantic data layer with automatic PII detection and policy-based access control, improving compliance coverage by 95%.
    • Built an internal LLM-powered chatbot for self-service data exploration, integrating semantic search and natural-language querying, eliminating manual SQL and dashboards and improving ad-hoc analytics delivery by 40%.

Software Development Engineer II

Groupon
Jan 2023 - Jan 2024 · 1 yr
    • Engineered scalable ELT pipelines and a modular data architecture on Keboola, integrating REST APIs and GCP storage layers to process 500GB+ of data daily, reducing manual intervention by 30%, improving latency by 45%, and enabling self-service access to analytics-ready datasets for product teams.
    • Developed and optimized a petabyte-scale data warehouse on Google Cloud Platform (BigQuery, Dataproc, Composer), improving pipeline performance by 35% and ensuring 99.9% reliability across business-critical datasets.
    • Collaborated with product and engineering teams to design and implement a unified data model, enabling personalized insights and improving targeting accuracy by 20% across key engagement channels.
    • Championed best practices for metadata management, data lineage, and cost-efficient orchestration, enhancing data observability and reducing operational spend by 25%.

Lead Data Engineer

LTIMindtree
Jan 2019 - Jan 2023 · 4 yr
    • Led the design and engineering of LTIMindtree's flagship data modernization platform, Canvas PolarSled, built using Python, Flask, React, and microservices, reducing manual migration effort by 60% and driving a $100M+ annual revenue stream across global clients.
    • Architected and migrated a 100+ TB enterprise data warehouse to Snowflake, modernizing 100K+ data objects and 5,000+ ETL pipelines across 16 business lines, accelerating migration timelines by 70%.
    • Developed an intelligent Python-based transcompiler engine for automated legacy-to-modern code translation, plus Spark-based automated data migration and reconciliation frameworks, accelerating modernization efforts and earning the Best Innovation Award at the Snowflake Summit among 100+ partners.

Testimonial

Nordea Bank

Nordic

Working with Priyanshu has been a game-changer for our project. Their expertise in data engineering has not only streamlined our processes but has also significantly improved our data infrastructure. They consistently delivered high-quality solutions that were tailored to our specific needs, and their attention to detail ensured that every aspect of the project was meticulously executed. Their proactive approach to problem-solving and their ability to adapt to changing requirements were truly commendable. Thanks to Priyanshu, we were able to overcome complex challenges and achieve our project goals effectively and efficiently. We highly recommend Priyanshu to anyone looking for a dedicated and skilled data engineer in data and analytics space.

Major Projects

3 Projects

Columbus Migration

    Conceptualized a modernized data warehouse architecture and developed a graph-based clustering application.

Canvas PolarSled

    Streamlined and engineered microservices for product migration; developed an intelligent transcompiler engine.

STARS Replatforming

Nov 2020 - Aug 2021 · 9 months
    Enhanced the efficiency of applications for a Nordic financial group and automated the code deployment process.

Education

  • Bachelor of Engineering with Honors; Major in Computer Science

    Rajiv Gandhi Proudyogiki Vishwavidyalaya (2019)

Certifications

  • SnowPro Core

  • AWS Solutions Architect

  • Databricks Developer Associate

  • System Design

  • Scaler - Data Structures and Algorithms

  • Azure Data Engineer

Interests

  • Trekking
  • Photography
  • Baking

AI-Interview Questions & Answers

    Hi, this is Priyanshu. I have been in the data engineering industry for around four-plus years, primarily as a data engineer working on building, optimizing, and maintaining pipelines, and I have been part of large-scale data migration programs. Currently I'm working at Groupon, where we have a petabyte-scale warehouse spread across GCP infrastructure, orchestrated by Airflow, with the data warehouse managed on BigQuery as well as Teradata. I'm responsible for managing that petabyte-scale warehouse day to day: optimizing our current pipelines, handling the large volumes coming in from our source systems, and making sure the pipelines meet their SLAs. Previously, at LTIMindtree, my role combined product development with data engineering. As a data engineer I was responsible for designing and modeling customers' target systems on the cloud. We ran large-scale migrations for customers across banking and financial services, advertising, travel, and other industries: we would go through the complete legacy system, classify everything, and create a roadmap for the full migration to the cloud. Then we would carry out the migration end to end, from migrating objects and data through validating the data, making sure every aspect was taken care of. The final, and most important, part was governance: I helped multiple clients with governance, making sure queries ran as expected without running up high costs, and that data quality stayed in control. These are some of the primary roles I have played across my career.

    The way I understand this question, I need a pipeline that takes data coming in from multiple source systems and uses Hadoop and Hive to process it. The first and foremost thing while designing an ETL pipeline on Hadoop/Hive infrastructure is to make sure the data we receive from the source is split well. We don't want a case where we ingest everything at once and load it straight into the target table; that would be a badly configured ETL system. During the extraction step, the data should be broken down into smaller components, which you could call partitions, and stored in Hadoop. For the transformation step you can use Hive queries: query the data for analytics purposes, make sure you use the right partitions in your queries, optimize the queries, and use the right kinds of joins. Once that is done, you serve the end result to your users. The whole flow can be orchestrated on top of the Hadoop infrastructure itself, using any available orchestration tool, something like Airflow, or a regular scheduler. That is what I think an ideal pipeline looks like when the infrastructure is backed by Hadoop and Hive.
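The extract-partition-transform flow described above can be sketched in plain Python. This is a minimal in-memory stand-in, not a real Hadoop/Hive pipeline: the `event_date` and `amount` fields and both function names are illustrative, and Hive partitions are modelled as dictionary buckets.

```python
from collections import defaultdict

def partition_by_date(records, key="event_date"):
    """Split incoming source records into date partitions,
    mirroring how a Hive table would be partitioned on load."""
    partitions = defaultdict(list)
    for record in records:
        partitions[record[key]].append(record)
    return dict(partitions)

def transform_partition(partition):
    """Per-partition aggregation, analogous to a Hive query that
    targets a single partition instead of scanning the full table."""
    return sum(r["amount"] for r in partition)

records = [
    {"event_date": "2024-01-01", "amount": 10},
    {"event_date": "2024-01-01", "amount": 5},
    {"event_date": "2024-01-02", "amount": 7},
]
parts = partition_by_date(records)
daily_totals = {day: transform_partition(p) for day, p in parts.items()}
```

In a real deployment the dictionary buckets would be HDFS directories and the aggregation a partitioned Hive query, but the shape of the logic is the same.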

    If I had to design a system to monitor the health and performance of an ETL pipeline, the first thing to decide is which metrics we are measuring: are we worried about throughput, about latency, what exactly matters to us? Once we have settled on the metrics, the next step is to implement a system that can monitor the whole ETL pipeline, for example a logging framework or an integration with a monitoring tool. That matters because pipelines may not be performing well: the data might be skewed, or queries might be running longer than usual. Once monitoring is in place, we should add audit checks that keep track of how much data we are processing and whether it falls outside the range we see every day. Again, this can be achieved by capturing the logs from the ETL and placing data quality checks on top of those logs.
    On top of that, you can set up an alerting mechanism to notify users that a pipeline has breached its SLA or has exceeded the volume it normally sees on the platform. The last step is security: we might be handling very sensitive data, so we should implement role-based access control on the database and make sure sensitive fields are masked correctly. Once all of these steps are in place within the pipeline, its health and performance are properly taken care of.
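The metric-threshold-alert step described above can be sketched as a small pure-Python check. The metric names and threshold values here are hypothetical; in practice they would come from the pipeline's logs or a monitoring tool.

```python
def check_pipeline_metrics(metrics, thresholds):
    """Compare captured run metrics against agreed thresholds and
    return the list of alerts to raise (empty list means healthy)."""
    alerts = []
    if metrics["runtime_minutes"] > thresholds["max_runtime_minutes"]:
        alerts.append("SLA breach: runtime exceeded")
    if metrics["rows_processed"] < thresholds["min_rows"]:
        alerts.append("Volume anomaly: fewer rows than expected")
    if metrics["error_count"] > 0:
        alerts.append("Errors logged during run")
    return alerts

# A run that finished but took too long: one SLA alert fires.
run = {"runtime_minutes": 95, "rows_processed": 1_200_000, "error_count": 0}
limits = {"max_runtime_minutes": 60, "min_rows": 1_000_000}
alerts = check_pipeline_metrics(run, limits)
```

The returned list would then feed whatever notification channel the team uses (email, SMS, pager).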

    To ensure data integrity and data quality throughout the ETL process (some of this I covered in my previous answers), the important thing is to have regular checks after each transformation within the pipeline. For example, say you get data from a source, apply some transformations, load it into dimension tables, and finally land it in a fact table used to calculate a metric, something like daily sales. There will be days with more traffic, and there will be cases where patterns appear in the data that should not be there. How do you detect those patterns? That is where data quality checks come into the picture. Once the data is in the final target table, you should have different types of checks defined on it. One type is null checks: certain business keys should never be null, as that would violate the business rules. Another is count checks: if you expect sales every day, the daily count should never drop below a particular level.
    You can define an average expected volume, say 100K orders per day, and check whether the day's count falls within that range. You can also check that the final load is not empty: sometimes a process fails and no data lands in the final table at all. Building these data quality checks into the pipeline means you are covering it from start to end.
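The null and count checks described above can be sketched as two small functions. The `order_id` key and the 100K expected volume are the hypothetical examples from the answer, not values from a real system.

```python
def check_no_nulls(rows, business_keys):
    """Return the rows whose business keys contain NULLs,
    i.e. the rows that violate the business-key rule."""
    return [r for r in rows if any(r.get(k) is None for k in business_keys)]

def check_count_in_range(row_count, expected, tolerance=0.5):
    """Daily count should stay within +/- tolerance of the
    expected average volume (50% by default)."""
    return abs(row_count - expected) <= expected * tolerance

rows = [
    {"order_id": 1, "amount": 120.0},
    {"order_id": None, "amount": 80.0},  # violates the null rule
]
violations = check_no_nulls(rows, ["order_id"])
volume_ok = check_count_in_range(row_count=95_000, expected=100_000)
```

In a warehouse these checks would typically run as post-load SQL assertions; the Python form just makes the logic explicit.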

    Let me think through data storage. Broadly, if I'm talking about a relational database and have to optimize data storage: the first thing is to avoid data duplication. In relational databases we often see a lot of redundant data stored in a table, and the way to address that is to apply normalization principles, using the various normal forms, so redundant data isn't stored. The next thing is to have proper indexes defined on the tables. Many developers miss defining indexes, yet they are essential for speeding up query retrieval. Another important consideration is compressing the data as you store it; many relational database systems offer a feature to compress data at rest. Finally, we often don't need to keep all the data in one system: some of it is historical and can move to an archival tier that is only accessed when needed, where higher latency is acceptable. These are the steps I would take, at a high level, when optimizing data storage.
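The indexing point above can be demonstrated concretely. This sketch uses Python's built-in `sqlite3` as a stand-in for a relational store (the `orders` table is illustrative); `EXPLAIN QUERY PLAN` shows the lookup switching from a full table scan to an index search once the index exists. Other engines expose the same idea through their own `EXPLAIN` syntax.

```python
import sqlite3

# In-memory database standing in for the relational store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL)")
conn.executemany("INSERT INTO orders (customer_id, amount) VALUES (?, ?)",
                 [(i % 100, i * 1.5) for i in range(1000)])

# Without an index, lookups by customer_id scan the whole table.
plan_before = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchone()

conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")

# With the index, the planner switches to an index search.
plan_after = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM orders WHERE customer_id = 42").fetchone()
```

The fourth column of each plan row holds the planner's description ("SCAN ..." versus "SEARCH ... USING INDEX ...").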

    I would say I have not worked much with PostgreSQL specifically, but thinking at a high level, if I had to design a PostgreSQL schema for optimal querying: one important thing, again, is to avoid a lot of duplication in the data, and, repeating my earlier point, to apply proper indexes, because that makes query retrieval faster. I would also study the query patterns, the frequently executed queries fired against the database, and based on those decide what the indexes on each table should be. Another thing to consider while designing the system is defining the proper data types for the columns; a lot of the time we declare columns far wider than they need to be, which doesn't make sense, so we should choose the right data types. It's also important to have the right relationships: the right primary keys and foreign keys between tables, so that joins can be optimal. There is also the option of materialized views on top of the tables, so that for the more frequent queries we don't have to go back to the base tables and reapply all the processing; the materialized view stores that result, and it can be reused across queries. That could be one of the ways to do it.
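The typed-columns, key-relationships, and materialized-view points can be sketched together. Again `sqlite3` stands in for PostgreSQL (the schema is hypothetical): SQLite has no `CREATE MATERIALIZED VIEW`, so a precomputed summary table plays that role here, while the foreign-key constraint shows referential integrity being enforced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite requires opting in
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE orders (
    order_id    INTEGER PRIMARY KEY,
    customer_id INTEGER NOT NULL REFERENCES customers(customer_id),
    amount      REAL NOT NULL
);
-- Precomputed summary standing in for a materialized view.
CREATE TABLE customer_totals AS
    SELECT customer_id, SUM(amount) AS total
    FROM orders GROUP BY customer_id;
""")

conn.execute("INSERT INTO customers VALUES (1, 'Acme')")
conn.execute("INSERT INTO orders VALUES (10, 1, 99.0)")

# The foreign key rejects an order for a non-existent customer.
try:
    conn.execute("INSERT INTO orders VALUES (11, 999, 5.0)")
    fk_enforced = False
except sqlite3.IntegrityError:
    fk_enforced = True
```

In real PostgreSQL the summary would be a `CREATE MATERIALIZED VIEW` refreshed on a schedule, and foreign keys are enforced by default.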

    So we are taking data from a table: we create a cursor from the connection, execute the query, and while we have rows we fetch one row at a time, break when the row is None, then process and close. One of the ways I would go about improving this: the first thing we can avoid is repeatedly opening and closing the cursor while streaming the data. We could hold the cursor in a shared variable and reuse it across queries, so we don't pay for the extra open/close round trips on the server. The second thing: we are firing a SELECT *, which we should avoid; we should select only the columns required for our use case. Next, if we have identifiers that can be used to detect changed data, basically CDC logic, then we should query only the new data from the table rather than scanning the complete table and reading it row by row. Also, rather than fetching rows one by one, we can batch the data coming off the stream and process it in batches. These are a few ways I think we could solve this particular problem.
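The fixes described above (explicit columns, a CDC-style predicate, and batched fetching with one reused cursor) can be sketched with `sqlite3`; the `events` table, its columns, and the `last_seen_id` watermark are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT, updated_at INTEGER)")
conn.executemany("INSERT INTO events VALUES (?, ?, ?)",
                 [(i, f"e{i}", i) for i in range(10)])

def process_new_events(conn, last_seen_id, batch_size=4):
    """Reuse a single cursor, select only the needed columns,
    filter to changed rows (a simple CDC predicate), and fetch
    in batches instead of row by row."""
    cur = conn.cursor()
    cur.execute("SELECT id, payload FROM events WHERE id > ?", (last_seen_id,))
    processed = []
    while True:
        batch = cur.fetchmany(batch_size)
        if not batch:
            break
        processed.extend(payload for _, payload in batch)
    cur.close()
    return processed

result = process_new_events(conn, last_seen_id=5)
```

`fetchmany` keeps memory bounded while avoiding a round trip per row; the `id > ?` predicate stands in for whatever change-tracking column the real table has.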

    Looking at this code for bugs and for potential issues with the way exceptions are being handled: we extract the data, transform it, and load it. The issue I see is in how the exception is handled. Whenever there is a failure anywhere in the pipeline, we log it and raise a single custom exception. Within that handler we could also call a notification service, say SMS, to tell the user that this block of code has failed. But the bigger problem with this function's logic is that it's very hard to detect at which step the failure happened. We understand the ETL job failed, but where did it fail? At the extract step, the transform step, or the load step? That is not ideal. Ideally, each function should have its own try/except block that captures the exception, so that when something fails we immediately know whether extract, transform, or load broke, and we can dig into the problem at the right place.
    Those two changes are what I see could be improved in this code, and with them it should work as expected.
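The per-step exception handling described above can be sketched as a step runner; the step names, the `ETLStepError` class, and the lambda stages are all illustrative, not from the code under review.

```python
import logging

logging.basicConfig(level=logging.ERROR)
log = logging.getLogger("etl")

class ETLStepError(Exception):
    """Carries the name of the step that failed."""
    def __init__(self, step, cause):
        super().__init__(f"ETL failed at step '{step}': {cause}")
        self.step = step

def run_etl(steps, payload=None):
    """Run each (name, fn) stage with its own try/except, so a
    failure is logged and re-raised with the failing step named."""
    for name, fn in steps:
        try:
            payload = fn(payload)
        except Exception as exc:
            log.error("step %s failed: %s", name, exc)
            raise ETLStepError(name, exc) from exc
    return payload

steps = [
    ("extract", lambda _: [1, 2, 3]),
    ("transform", lambda rows: [r * 2 for r in rows]),
    ("load", lambda rows: len(rows)),
]
loaded = run_etl(steps)

# A failing extract step is reported by name instead of opaquely.
try:
    run_etl([("extract", lambda _: 1 / 0)])
    failed_step = None
except ETLStepError as err:
    failed_step = err.step
```

The `notify`-by-SMS idea from the answer would slot into the `except` branch next to the `log.error` call.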

    How would I employ Python to programmatically enforce ACID properties on non-transactional data stores? This is a bit trickier; I haven't faced this scenario. ACID properties make obvious sense for transactional data, and enforcing them on a non-transactional store is an unusual use case, but thinking at a high level: for atomicity, making sure a transaction either completes fully or doesn't happen at all, we can use locks in Python, defining a lock around a function or a piece of data so that an operation only completes once it has acquired the lock, and otherwise waits. From a consistency standpoint, we can apply validation rules or checks to make sure the data stays consistent. From an isolation point of view, we could use something like Python's multiprocessing library to process each block of work in parallel, isolated from the others. And from a durability standpoint, making sure results survive even if the system crashes, whatever results we produce we write to a persistent store so we don't lose them.
    So in Python, whenever we finish a function or produce output, we would write it to a file rather than holding intermediate values in a Python variable. That is how I would think about it, though this use case still looks a bit odd to me.
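The atomicity, consistency, and durability points above can be combined into one small sketch: a lock serializes writers, a validation check guards consistency, and a write-to-temp-then-rename makes the file update atomic and durable. The store path and record shape are hypothetical, and this is a single-process illustration, not a substitute for a real transactional store.

```python
import json
import os
import tempfile
import threading

_lock = threading.Lock()

def atomic_durable_write(path, record):
    """Write a record so readers see either the old file or the
    new one, never a half-written state: serialize writers with a
    lock, validate the record, write to a temp file, then rename."""
    with _lock:  # atomicity/isolation for writers in this process
        if not isinstance(record, dict):  # a simple consistency check
            raise ValueError("record must be a dict")
        fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
        with os.fdopen(fd, "w") as f:
            json.dump(record, f)
        os.replace(tmp, path)  # atomic rename on POSIX and Windows

path = os.path.join(tempfile.gettempdir(), "acid_sketch_store.json")
atomic_durable_write(path, {"balance": 100})
with open(path) as f:
    stored = json.load(f)
```

For cross-process isolation one would need file locks or a coordinating service; the rename trick, however, is a standard durability pattern.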

    How would I apply SOLID design principles in Python for large, complex data tasks? Looking at SOLID at a high level: the first part is the single responsibility principle, making sure every function, or for that matter every class we define in Python, does one particular task rather than several things at once. That is the first thing I would ensure when writing Python code. The second is the open/closed principle: classes should be open for extension, so we can easily add new functions or functionality, but closed for modification, so that users aren't making lots of changes to an existing class or function, which would violate the principle. If a new feature is required, we should be able to incorporate it by extending the class or function, not by heavily modifying it. The third part is the Liskov substitution principle: objects of a parent class should be replaceable with objects of its child classes. It should never be the case that we can't use the parent and child classes interchangeably; we must be able to substitute one for the other. The fourth principle, interface segregation, says that clients using our code should not be forced to depend on interfaces they don't use; by interface I mean any entity, a class or a function. We can avoid that by breaking entities down into smaller, more specific pieces that clients actually need. The last part is dependency inversion: we should avoid cases where high-level functions depend directly on low-level functions; a broad, high-level function shouldn't consume results directly from low-level implementation details. We decouple them, typically by depending on abstractions.
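Three of the principles above (single responsibility, Liskov substitution, dependency inversion) can be shown in one short sketch; the `Sink` abstraction and both implementations are invented for illustration.

```python
from abc import ABC, abstractmethod

class Sink(ABC):
    """Abstraction the pipeline depends on (dependency inversion);
    any subclass must be usable wherever a Sink is expected (Liskov)."""
    @abstractmethod
    def write(self, rows): ...

class ListSink(Sink):
    """Keeps rows in memory."""
    def __init__(self):
        self.rows = []
    def write(self, rows):
        self.rows.extend(rows)

class CountingSink(Sink):
    """Only counts rows; substitutable for ListSink anywhere."""
    def __init__(self):
        self.count = 0
    def write(self, rows):
        self.count += len(rows)

def run_pipeline(rows, sink: Sink):
    """Single responsibility: this function only moves rows to a sink.
    It depends on the Sink abstraction, not on a concrete class."""
    sink.write(rows)

mem = ListSink()
counter = CountingSink()
run_pipeline([1, 2, 3], mem)
run_pipeline([1, 2, 3], counter)
```

Adding a new destination (say, an S3 sink) means adding a subclass, not modifying `run_pipeline`, which is the open/closed principle in practice.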

    Yes, this is something I've worked on for clients. When we talk about data governance, there are multiple pillars we take care of. The first pillar is making sure the data is accurate: whatever data lands in your target systems matches the business requirements and the expected results, in terms of both counts and values, representing what the business process expects. The second is the overall cost of the system: it should not be the case that the queries we fire against the system run up very high costs, so we need proper governance around spend as well. The third is performance: when there is no proper data governance across the system, queries tend to run slowly, the system doesn't perform as expected, and reports get refreshed late. So we have to design a system that meets all three pillars: cost, data quality, and performance. In terms of cost, especially on a cloud system, don't violate the basic principles: process only the data that is required, so avoid things like SELECT *, select only the columns you need, and have proper partitions defined on the tables. In terms of data quality, make sure you have the right checks: check your counts, check your nulls, make sure the values in the tables are as expected, capture your query patterns and confirm the results match them, and keep track of the min/max values coming into your data so you know the normal ranges across your systems; apply those quality checks to your processes. The last pillar is performance. The way data is processed can be tricky: a lot of the time we do row-by-row inserts, which we should avoid in favour of batch-level inserts, which are much more optimized than single-row inserts. Likewise in queries, use the right join strategies rather than doing a product join and pulling all the data, and apply the right partition filters so that pruning happens and you get the more optimized, better query results you expect.
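The batch-versus-row-by-row insert point can be made concrete with `sqlite3` standing in for the warehouse; the `sales` table is illustrative. `executemany` sends the whole batch through one prepared statement instead of a round trip per row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day TEXT, amount REAL)")

rows = [("2024-01-01", 10.0), ("2024-01-01", 5.5), ("2024-01-02", 7.25)]

# One batched statement instead of one INSERT per row.
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM sales").fetchone()[0]
```

On a cloud warehouse the same idea shows up as bulk `COPY`/stage loads rather than per-row inserts; the cost and performance benefit is the same batching effect.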