
Vivek Gupta is a seasoned Software Engineer with over 10 years of experience in Data Engineering and Science. Proficient in Python, SQL, Apache Spark, and machine learning frameworks, he excels in database management and predictive analytics. Gupta has led projects at Experis IT Pvt. Ltd. and Data Theta, optimizing operations, enhancing data governance, and implementing statistical methods for cost savings. He's skilled in AWS, MongoDB, and Azure tools, with a history of streamlining ETL processes and developing AI-based solutions. Gupta holds a Master's in Computer Engineering and a Bachelor's in the same field from Rajasthan universities.
Senior Data Engineer (Backend Developer), Experis IT Pvt. Ltd. (Client: AT&T)
Senior Consultant, Data Theta
Senior Consultant (Data Engineering), Data Theta
Consultant, ARRJ MS Pvt. Ltd.
Consultant, Dezired Solutions
Associate Consultant (Data Science & Engineering), TCS
Senior Software Engineer (Data Science & Engineering), ARRJ MS Pvt. Ltd.
Senior Software Engineer, ARRJ MS Pvt. Ltd.
Software Engineer, NextGen Compusoft Ltd.
MongoDB
Azure Data Lake Storage Gen2 (ADLS)
AWS CloudWatch
Amazon S3
Amazon Redshift
Azure Active Directory
Microsoft Azure SQL Database
Azure Data Factory
GraphQL
React
Node.js
So, starting with my background: I completed my bachelor's in computer engineering in 2008 and immediately started working in the IT industry. Until 2016 I worked as a back-end engineer and also on the data side, working with databases. In those eight years I spent a lot of time on back-end services, designing highly scalable back-end APIs for clients and designing scalable systems so that we could build products to the required quality. I worked as a full-stack developer with JavaScript on both the front end and the back end, and I also used Python, SQL, some data-related technologies, and some basic machine learning algorithms. On the cloud side I started with Azure at that time, but in a very limited way.

From 2016 to 2018 I did my master's in AI and machine learning, and during my master's I published a research paper in a reputed Taylor & Francis journal. After completing my master's I restarted my career in data engineering and data science. I had already picked up exposure to ETL processes and complete data pipelines around 2015-16, and I started cloud data engineering in 2018, working mainly on Azure and AWS. I got exposure to orchestration tools like Azure Data Factory, and I also used Azure Synapse Analytics and Azure Databricks, using Spark features to optimize transformations, working with highly scalable loads, and building complete end-to-end pipelines.

I handled a lot of projects from 2018 to 2022 while working for a single company, switched companies in 2022, and since then I have been working across different domains: e-commerce, healthcare, and pharmaceuticals. In my current project I am working on a server utilization dashboard, where we are designing the system end to end: pulling data from APIs into blob storage, performing the transformations, and presenting the results through Power BI. Right now I am using PySpark, Databricks, Python, and SQL. That is my complete tech stack and background.
Some of the advanced SQL techniques that are beneficial for optimizing Python ETL scripts: I use broadcast joins along with a few other Spark optimization techniques, and on the SQL side, common table expressions (CTEs), effective joins, and window functions. These are quite useful for optimizing the ETL scripts across the whole pipeline. Using SQL techniques we can optimize a lot of things; indexing is another optimization we use in our ETL process. Beyond SQL, for optimizing Python ETL scripts we can tune the Spark configuration, optimize the partitions, and perform repartitioning. A few more points: optimize the joins, avoid unnecessary SELECT DISTINCT, use WHERE clauses to filter the data early, avoid deeply nested queries in favour of CTEs, and, most importantly, don't use joins blindly; they should be optimized.
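A minimal PySpark sketch of a few of these techniques (broadcast join, early filtering, a CTE with a window function, repartitioning); the table paths and column names are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import broadcast

    spark = SparkSession.builder.appName("etl_optimization_sketch").getOrCreate()

    # Hypothetical inputs: a large fact table and a small dimension table.
    orders = spark.read.parquet("/data/orders")
    regions = spark.read.parquet("/data/regions")

    # Broadcast join: ship the small table to every executor instead of shuffling the large one.
    enriched = orders.join(broadcast(regions), on="region_id", how="left")
    enriched.createOrReplaceTempView("enriched_orders")

    # CTE plus a window function instead of a nested subquery; filter early, select only needed columns.
    monthly = spark.sql("""
        WITH monthly_revenue AS (
            SELECT region_id, month, SUM(revenue) AS revenue
            FROM enriched_orders
            WHERE order_status = 'COMPLETED'
            GROUP BY region_id, month
        )
        SELECT region_id, month, revenue,
               LAG(revenue) OVER (PARTITION BY region_id ORDER BY month) AS prev_revenue
        FROM monthly_revenue
    """)

    # Repartition before a heavy write so the output files are evenly sized.
    monthly.repartition("region_id").write.mode("overwrite").parquet("/data/monthly_revenue")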
Can you propose a method for doing incremental data load in Python? Okay. There are a few strategies we can use to minimize resource usage in the Python pipeline. First, track the last loaded timestamp or ID, meaning record when the data was last loaded. Then query the data source for records that have been added or updated since that last load, using the stored timestamp. Load that increment into the pipeline, and after a successful load update the stored last-loaded timestamp. We can then schedule the incremental loads. We also need to handle data deletions: if the data source supports deletes, the pipeline has to account for them. We should optimize the queries so the incremental data is retrieved with minimal resources: index the columns used in the filter criteria, select only the necessary columns that are required, and use efficient joins. Finally, monitor and tune the performance of the incremental loads.
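A small sketch of this watermark-based approach; SQLite stands in for both source and target just to keep it self-contained, and the table and column names are made up:

    import sqlite3
    from datetime import datetime, timezone

    SOURCE_DB = "source.db"      # assumed to contain an orders table with an updated_at ISO timestamp
    TARGET_DB = "warehouse.db"

    def get_last_watermark(target):
        # The watermark table stores the timestamp of the last successful load.
        target.execute("""CREATE TABLE IF NOT EXISTS etl_watermark
                          (pipeline TEXT PRIMARY KEY, last_loaded_at TEXT)""")
        row = target.execute(
            "SELECT last_loaded_at FROM etl_watermark WHERE pipeline = 'orders'"
        ).fetchone()
        return row[0] if row else "1970-01-01T00:00:00"

    def incremental_load():
        source = sqlite3.connect(SOURCE_DB)
        target = sqlite3.connect(TARGET_DB)
        watermark = get_last_watermark(target)

        # Pull only rows added or updated since the last load; updated_at should be indexed at the source.
        rows = source.execute(
            "SELECT id, amount, updated_at FROM orders WHERE updated_at > ?",
            (watermark,),
        ).fetchall()

        # Upsert the increment into the target table.
        target.execute("""CREATE TABLE IF NOT EXISTS orders
                          (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT)""")
        target.executemany(
            "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)", rows
        )

        # Advance the watermark only after the load succeeds.
        target.execute(
            "INSERT OR REPLACE INTO etl_watermark VALUES ('orders', ?)",
            (datetime.now(timezone.utc).isoformat(),),
        )
        target.commit()
        source.close()
        target.close()

    if __name__ == "__main__":
        incremental_load()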
Can you mention a strategy to handle exceptions in Python while loading data into a SQL database? Okay. Yes, we can handle exceptions while managing the data load. The simple approach is a try/except block: wrap the code responsible for loading data into the database, which allows us to catch and handle any exceptions that occur during the data loading process. In the try block we write the code that connects to the database and, using a cursor, performs the load; in the except block we catch the error, print or log it, and handle the failure of the load.
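A minimal sketch of that pattern; SQLite is used only so the example is self-contained, and a real pipeline would catch the specific database driver's exceptions instead:

    import sqlite3
    import logging

    logging.basicConfig(level=logging.INFO)

    def load_rows(rows):
        # Hypothetical target table; the point is the try/except/finally structure around the load.
        conn = None
        try:
            conn = sqlite3.connect("warehouse.db")
            cur = conn.cursor()
            cur.execute("CREATE TABLE IF NOT EXISTS sales (id INTEGER PRIMARY KEY, revenue REAL)")
            cur.executemany("INSERT INTO sales (id, revenue) VALUES (?, ?)", rows)
            conn.commit()
        except sqlite3.Error as exc:
            # Roll back the partial batch and surface the failure instead of silently continuing.
            if conn is not None:
                conn.rollback()
            logging.error("Failed to load batch: %s", exc)
            raise
        finally:
            if conn is not None:
                conn.close()

    load_rows([(1, 100.0), (2, 250.5)])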
How would you efficiently paginate API requests in a Python ETL script? In that case, the strategy I would follow in the script: first, understand the pagination parameters. Most APIs support pagination through query parameters like page and per_page, or an offset. So first we have to understand how the API handles pagination, which tells us how to structure the requests. Then set up the pagination loop: create a loop that iterates over the paginated results until all the data is fetched, incrementing the pagination parameter on each iteration to fetch the next page of data. Then implement rate limiting: if the API has rate limits, write the script so it respects them and avoids hitting those limits. Handle errors and exceptions so that any failure during the pagination requests is dealt with gracefully. Finally, optimize the request frequency: experiment to find the frequency that balances minimizing the total time to fetch all the results against overwhelming the API server. That is the process I would follow to efficiently paginate API requests in a Python script.
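A small sketch of that loop using the requests library; the endpoint, parameter names, and response shape are assumptions:

    import time
    import requests

    BASE_URL = "https://api.example.com/v1/orders"   # hypothetical endpoint
    PER_PAGE = 100

    def fetch_all_pages():
        results, page = [], 1
        while True:
            try:
                resp = requests.get(
                    BASE_URL,
                    params={"page": page, "per_page": PER_PAGE},
                    timeout=30,
                )
                resp.raise_for_status()
            except requests.RequestException as exc:
                # Handle errors gracefully; a real script might retry with backoff here.
                print(f"Request for page {page} failed: {exc}")
                break

            batch = resp.json()          # assumed to return a JSON list of records
            if not batch:                # an empty page means everything has been fetched
                break
            results.extend(batch)
            page += 1
            time.sleep(0.2)              # simple rate limiting between requests
        return results

    records = fetch_all_pages()
    print(f"Fetched {len(records)} records")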
Design a Python ETL solution that can scale to accommodate growing data volumes. Okay. The approach I would use here is a highly scalable, distributed system: a framework through which we can leverage distributed processing, like Apache Spark, which I am also using in my current project. Then modularize the ETL components: break the ETL process down into smaller tasks or components that can be scaled independently. For example, separate the extraction, transformation, and loading steps into individual modules or services so that each component can run and scale independently based on demand. Then use cloud services for data storage and processing: we have a lot of options like S3, cloud storage, and data lake storage for storing and processing huge amounts of data, and these services offer scalability, reliability, and managed infrastructure. Then use optimized data storage, chosen based on the characteristics of the data; for columnar data, formats like ORC or Parquet work efficiently. For growing data volumes, process the load incrementally: design the ETL process to support incremental processing, so that only new or changed data is processed in each run. Finally, monitor and fine-tune performance, handle fault tolerance, leverage caching and memoization, and apply data governance.
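A minimal PySpark sketch of that modular, incremental structure; the bucket paths, partition column, and last-run date are hypothetical:

    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("scalable_etl_sketch").getOrCreate()

    RAW_PATH = "s3a://my-bucket/raw/events/"        # assumed raw zone
    CURATED_PATH = "s3a://my-bucket/curated/events/"

    def extract(last_run_date):
        # Partition pruning: only data newer than the last run is read (event_date assumed ISO-formatted).
        return spark.read.parquet(RAW_PATH).where(F.col("event_date") > F.lit(last_run_date))

    def transform(df):
        return (df.dropDuplicates(["event_id"])
                  .withColumn("amount", F.col("amount").cast("double")))

    def load(df):
        # Columnar output, partitioned by date, appended incrementally.
        (df.write.mode("append")
           .partitionBy("event_date")
           .parquet(CURATED_PATH))

    if __name__ == "__main__":
        load(transform(extract("2024-01-01")))   # the last run date would come from a watermark store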
A simple Python code block designed to send a batch of messages appears to have an oversight; explain the potential issue that could lead to an error or unexpected behavior. So, basically, the code is using the boto3 library: it builds a list of dictionaries and sends each one by invoking a Lambda function. In this code block, the issue I see is with the invoke method of the Lambda client: the string passed to the FunctionName parameter, the process_message function, is built with improper string interpolation, and there appears to be a mismatched character in that function name string. That is the oversight that could lead to an exception at runtime.
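For reference, a hedged sketch of what a corrected batch invoke might look like; since the original snippet is not reproduced here, the message shape, region, and invocation type are assumptions:

    import json
    import boto3

    lambda_client = boto3.client("lambda", region_name="us-east-1")   # assumed region

    messages = [{"id": 1, "body": "first"}, {"id": 2, "body": "second"}]   # assumed batch shape

    for message in messages:
        response = lambda_client.invoke(
            FunctionName="process_message",      # plain string literal, no broken interpolation
            InvocationType="Event",              # asynchronous invocation per message
            Payload=json.dumps(message),
        )
        # A non-2xx StatusCode here would indicate the invocation itself failed.
        print(message["id"], response["StatusCode"])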
Okay, looking at this SQL query: it computes LAG(revenue) ordered by month over the sales data. First, the window function as written may not be valid: depending on the SQL dialect being used, the LAG function may not be supported or may require specific syntax, so we have to verify that the dialect supports LAG. Second, LAG requires an ordering to determine the previous month's revenue; if the month column in the sales_data table is not properly ordered, or there are ties between values, the result of the LAG function will not be deterministic. Third, we have to check data quality, which is also missing here: missing or NULL values in the revenue or month columns, or inconsistencies like duplicate months, could lead to unexpected results. How would I debug it? Check the dialect compatibility for LAG ... OVER, inspect the data in the sales_data table to ensure the revenue and month columns contain valid, non-null, properly formatted values, and then execute the query in the SQL environment to test it. And if LAG is not supported, we can rewrite it using CTEs.
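A sketch of both forms in Spark SQL, assuming a registered sales_data view with a month column stored as a DATE and a revenue column:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lag_debug_sketch").getOrCreate()

    # Window-function form, for dialects that support LAG; the ORDER BY makes the
    # "previous month" deterministic as long as month values are unique and non-null.
    with_lag = spark.sql("""
        SELECT month,
               revenue,
               LAG(revenue) OVER (ORDER BY month) AS prev_revenue
        FROM sales_data
    """)

    # CTE-based fallback (a self-join on the previous month) for dialects without LAG.
    fallback = spark.sql("""
        WITH ordered AS (
            SELECT month, revenue FROM sales_data WHERE month IS NOT NULL
        )
        SELECT cur.month,
               cur.revenue,
               prev.revenue AS prev_revenue
        FROM ordered cur
        LEFT JOIN ordered prev
          ON prev.month = add_months(cur.month, -1)
    """)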
What is an efficient way to debug a Python application that is experiencing performance issues during complex SQL data transformations? In that case, we can use profiling tools like cProfile or line_profiler to analyze the execution time and resource usage of different parts of the Python application; this helps identify which functions or sections of the code are contributing most to the performance issue. The most important thing is to identify the performance bottlenecks first: the specific areas of the code or the SQL queries where the degradation is happening. As I said, profiling lets us pinpoint those functions and SQL statements. Then optimize the Python code as much as possible to improve efficiency, and analyze and optimize the SQL queries: look at the query execution plans to understand the performance of the complex queries, and use database-specific features like EXPLAIN and EXPLAIN ANALYZE to understand how the database engine is executing them. We can also cache results: if some data transformations are computationally expensive and don't change frequently, we can cache them. And finally, incremental testing and benchmarking are what I would use to verify the improvements.
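A small sketch of the profiling step with cProfile and pstats; run_transformations is a placeholder for the real ETL entry point:

    import cProfile
    import io
    import pstats

    def run_transformations():
        # Placeholder for the real code that runs the complex SQL transformations.
        return sum(i * i for i in range(1_000_000))

    profiler = cProfile.Profile()
    profiler.enable()
    run_transformations()
    profiler.disable()

    # Print the 10 functions with the highest cumulative time to find the bottleneck.
    stream = io.StringIO()
    pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(10)
    print(stream.getvalue())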
How would you optimize web application load times when using React and Streamlit in Python for the front end? (A good-to-have question; not answering it won't affect the overall screening score.) So, to optimize web application load times, first of all we can minimize the initial bundle size: reduce the size of the initial JS and CSS files as much as possible by minifying and compressing them, using tools like webpack or Parcel to handle that. Another approach, which my team is also using in my project, is lazy loading of components and resources: instead of loading everything immediately, load them as they are required. Then optimize the images: compress and optimize the images used in the web application to reduce their size without sacrificing quality. We can split the code into modules or chunks and load them dynamically based on user interaction. And we can minimize external dependencies, use server-side rendering, optimize the data loading, and monitor performance.
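On the Streamlit/Python side, caching the expensive data load is one of the simpler wins for load time; a minimal sketch, with the data source, values, and TTL as assumptions:

    import time
    import streamlit as st

    @st.cache_data(ttl=3600)
    def load_dashboard_data():
        # Stands in for a slow SQL query or API call; cached for an hour after the first run.
        time.sleep(2)
        return {"servers": 120, "avg_utilization": 0.63}

    st.title("Server Utilization")
    data = load_dashboard_data()
    st.metric("Servers", data["servers"])
    st.metric("Average utilization", f"{data['avg_utilization']:.0%}")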
How do you manage state effectively? Okay, so in that case, first of all, I'm not fully sure about this one, but we can define clear data contracts, we can centralize the state management so that state is used consistently, and we can use REST API services. I'm not sure, but we can also implement asynchronous data fetching, optimize the data transforms, and use real-time communication where needed.