Vetted Talent

Vivek Gupta

Vivek Gupta is a seasoned Software Engineer with over 10 years of experience in Data Engineering and Science. Proficient in Python, SQL, Apache Spark, and machine learning frameworks, he excels in database management and predictive analytics. Gupta has led projects at Experis IT Pvt. Ltd. and Data Theta, optimizing operations, enhancing data governance, and implementing statistical methods for cost savings. He's skilled in AWS, MongoDB, and Azure tools, with a history of streamlining ETL processes and developing AI-based solutions. Gupta holds a Master's in Computer Engineering and a Bachelor's in the same field from Rajasthan universities.

Role
Python Cloud ETL Engineer
Years of Experience
14.00 years

Skillsets

PyTorch
Deep Learning
Hadoop
Hive
Keras
NumPy
pandas
PowerBI
PySpark
Data Modeling
R
Seaborn
Spark
Tableau
TensorFlow
data transformation
Data Loading
Backend - 10 Years
Azure
MySQL - 12 Years
AWS - 3 Years
Data Engineering - 6 Years
Data Processing - 6 Years
SQL - 8 Years
Big Data - 6 Years
Apache Spark - 6 Years
Python - 10 Years

Vetted For

9 Skills

Roles & Skills
Results
Details

Python Cloud ETL Engineer (Remote) AI Screening
61%

Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
Score: 55/90

Professional Summary

14.00 Years

Jun, 2023 - Present 2 yr 9 months
Senior Data Engineering (Backend Developer)
Experis IT Pvt. Ltd. (Client: AT&T)
Mar, 2023 - May, 2023 2 months
Senior Consultant
Data Theta
Mar, 2023 - May, 2023 2 months
Senior Consultant (Data Engineering)
Data Theta
Sep, 2018 - May, 2022 3 yr 8 months
Consultant
ARRJ MS Pvt. Ltd
Sep, 2018 - May, 2022 3 yr 8 months
Consultant
Dezired Solutions
Jun, 2022 - Mar, 2023 9 months
Associate Consultant (Data Science & Engineering)
TCS
Jul, 2014 - Jul, 2016 2 yr
Senior Software Engineer (Data Science & Engineering)
ARRJ MS Pvt. Ltd
Aug, 2008 - Jul, 2016 7 yr 11 months
Senior Software Engineer
ARRJ MS Pvt Ltd.
Aug, 2008 - Jun, 2014 5 yr 10 months
Software Engineer
NextGen Compusoft Ltd.

Applications & Tools Known

MongoDB
Azure Data Lake Storage Gen2 (ADLS)
AWS CloudWatch
Amazon S3
Amazon Redshift
Azure Active Directory
Microsoft Azure SQL Database
Azure Data Factory
GraphQL
React
Node.js

Work History

14.00 Years

Senior Data Engineering (Backend Developer)

Experis IT Pvt. Ltd. (Client: AT&T)

Jun, 2023 - Present 2 yr 9 months

Designed, developed, and implemented a server utilization dashboard to provide enterprise-wide visibility into on-prem hardware and infrastructure capacity utilization metrics, enabling stakeholders to optimize performance and reduce costs. Developed backend infrastructure to process 500,000 transactions/day, increasing throughput by 40% in 3 months. Created Power BI reports and dashboards to visualize and analyze server utilization metrics.

Senior Consultant

Data Theta

Mar, 2023 - May, 2023 2 months

Streamlined data governance process: Implemented a multiprocessing solution to improve concurrent data frame writing, resulting in 15% faster storage and retrieval.
Maximizing data effectiveness: Through automation, enhanced visibility and efficiency by 20% via automated logging and statistical reporting for running jobs.

Senior Consultant (Data Engineering)

Data Theta

Mar, 2023 - May, 2023 2 months

Automated logging and statistical reporting for running notebooks, improving visibility and efficiency by 20%. Conducted ETL on data from Cosmos DB to Databricks and stored it in a data lake, streamlining data ingestion and enabling faster analytics by 30%. Implemented multiprocessing in a job to write data frames to multiple database locations simultaneously, optimizing data storage and retrieval by 15%.

Associate Consultant (Data Science & Engineering)

TCS

Jun, 2022 - Mar, 2023 9 months

Designed and implemented a robust data pipeline using Azure Data Factory, connecting multiple data sources (including on-premises databases and cloud data stores) to a centralized data lake, improving data accessibility and accuracy. Scheduled, automated, and monitored data pipelines using Azure and Databricks. Ran data quality checks and worked with data pipelines in production.

Consultant

Dezired Solutions

Sep, 2018 - May, 2022 3 yr 8 months

Leveraged AWS machine learning services like Amazon SageMaker and AWS Deep Learning AMIs to implement deep learning and transfer learning models, enhancing search accuracy by 85% and efficiency by 25% on the AWS Cloud.
Built predictive models using Azure Machine Learning Studio's drag-and-drop interface and automated machine learning capabilities (AutoML), contributing to a 25% increase in target user engagement on Azure-hosted applications.
Conducted comprehensive analysis of customer behavior by integrating structured data from AWS databases (Amazon RDS, Amazon Redshift) and unstructured data from Azure Data Lake Storage, leading to a 25% reduction in customer churn for cloud-hosted services.
Optimized the data modeling workflow by integrating Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure Databricks, and AWS AI/ML services, resulting in a 30% accelerated development pace for cloud-based data projects.
Spearheaded the implementation of an AWS data lake on Amazon S3, coupled with Azure Data Factory for data ingestion and Azure Databricks for data processing, resulting in a 50% improvement in data accessibility and enabling more data-driven decision-making.
Architected and deployed highly scalable and cost-effective machine learning pipelines on Azure using services like Azure Machine Learning, Azure Databricks, Azure Kubernetes Service (AKS), and Azure Functions, accelerating model training and deployment processes.

Consultant

ARRJ MS Pvt. Ltd

Sep, 2018 - May, 2022 3 yr 8 months

Designed and implemented a machine learning model using user behavior logs to predict anomalous behavior that may indicate insider threats or policy violations, enhancing security measures and risk management protocols. The anomaly detection model achieved 95% accuracy in identifying anomalous behavior. Utilized Azure Databricks to process big data and implement machine learning models for predictive analysis, optimize data processing, and improve decision-making capabilities. The machine learning models reduced data processing time by 50%. Designed an ETL process and an algorithm to detect anomalies in IoT sensor health data on AWS. The anomaly detection algorithm identified 98% of all anomalies in IoT sensor data. Developed a backend service for recommendation systems and object detection on unstructured data using unsupervised machine learning techniques. This enabled more accurate and efficient data processing, which improved decision-making capabilities by 15%. Collaborated with team members to perform cohort analysis on fraud detection, identifying patterns and trends in fraudulent behavior. This led to the development of effective risk management and prevention strategies, which reduced fraud by 10%.

Senior Software Engineer (Data Science & Engineering)

ARRJ MS Pvt. Ltd

Jul, 2014 - Jul, 2016 2 yr

Ingested data from various sources, including SQL, Google Analytics API, and Salesforce API, using Python to create data views for BI tools such as Tableau, improving data accessibility and accuracy by 90%. Designed and implemented RESTful APIs using Python and Django that could handle over 100,000 users, enabling smooth and efficient data communication between applications and improving user experience by 20%. Developed a classification model to predict customer loan eligibility by applying machine learning algorithms such as Decision Tree, Gradient Boosting, and XGBoost, improving decision-making capabilities and risk management protocols by 15%.

Senior Software Engineer

ARRJ MS Pvt Ltd.

Aug, 2008 - Jul, 2016 7 yr 11 months

Enhanced the data ingestion process utilizing Python, resulting in a remarkable 90% increase in accuracy and facilitating valuable insights through effective data mining techniques.
Enhanced data communication between applications by 20% by crafting and deploying RESTful APIs using Python and Django, promoting seamless interaction and interoperability.
Fine-tuned data mining procedures, leading to a notable 15% enhancement in data accuracy and compliance standards, ultimately bolstering overall data quality and integrity.

Software Engineer

NextGen Compusoft Ltd.

Aug, 2008 - Jun, 2014 5 yr 10 months

Created micro-services and Web Services (incl. SOA/SOAP/REST/XML) to improve customer experience, resulting in a 15% increase in customer satisfaction. Deployed and integrated software engineered by the team, resulting in a 20% reduction in deployment time. Updated scripts to streamline continuous integration practices, resulting in a 30% increase in the number of successful deployments. Optimized database performance by identifying and resolving performance bottlenecks.

Achievements

Reduced AT&T cost waste by identifying and decommissioning non-essential, under-utilized servers and VMs

Major Projects

3 Projects

Server Utilization

AT&T

Jun, 2023 - Present 2 yr 9 months

Utilize Databricks on on-prem hardware for monitoring hardware and infrastructure capacity metrics.
Leverage Databricks' unified analytics platform to seamlessly ingest, process, and analyze data from various sources.
Integrate Databricks with data pipelines to efficiently collect metrics and perform real-time processing.
Derive actionable insights from the processed data, ensuring consistent visibility into infrastructure utilization.
Empower informed decision-making and optimization strategies using Databricks for on-prem hardware and infra-capacity utilization metrics.

Image Search Engine and recommendation system

Dezired

Aug, 2020 - Nov, 2021 1 yr 3 months

Developed an AI-based search engine and recommendation system using AWS services.
Utilized Amazon SageMaker for machine learning model training and deployment.
Leveraged Amazon Elasticsearch Service for fast and scalable search capabilities.
Integrated with Amazon Personalize to enhance the system with personalized recommendations.
Provided users with tailored content suggestions based on their preferences and behavior.
Conducted thorough testing and optimization to ensure high performance and accuracy.

customer behavior utilizing

Dezired Solutions

Jun, 2019 - Jul, 2021 2 yr 1 month

Led a thorough customer behavior analysis initiative utilizing structured and unstructured data sources.
Valuable insights and patterns were discovered through the analysis.
Implemented targeted interventions based on the insights gained.
Achieved a significant 25% reduction in customer churn as a result.
Stabilized the customer base and improved overall business performance measurably.

Education

Master of Technology; Computer Engineering
Rajasthan Technical University, Kota (2018)
Bachelor of Engineering; Computer Engineering
University of Rajasthan, Jaipur (2008)

AI-interview Questions & Answers

Starting with my background: I completed my bachelor's in computer engineering in 2008 and immediately started working in the IT industry. Until 2016 I worked as a back-end engineer and also on the data side, such as databases. In those eight years I spent a lot of time on back-end services, designing highly scalable back-end APIs and scalable systems for clients so that we could build products to the required quality. At that time I worked with JavaScript frameworks across front end and back end as a full-stack developer, and I also used Python, SQL, some data-related technologies, and some basic machine learning algorithms. On the cloud side I started with Azure then, but in a very limited way. From 2016 to 2018 I did my master's, two years in AI and machine learning, and during it I published a research paper in a reputed journal from Taylor & Francis. After completing my master's I restarted my career in data engineering and data science. I had already had exposure to ETL processes and complete data pipelines in 2015-16, so in 2018 I started in cloud data engineering, working mainly on Azure and AWS. I got exposure to orchestration tools such as Azure Data Factory, and I also used Synapse Analytics and Azure Databricks, using Spark features, optimizing transactions, working with highly scalable loads, and building complete end-to-end pipelines. I handled a lot of projects from 2018 to 2022 while working for a single company. In 2022 I switched to another company, and since then I have worked with companies in different domains: e-commerce, healthcare, and pharmaceuticals. In my current project I am working on a server utilization dashboard, where we are designing the complete system: getting the data from APIs, pulling it into blob storage, applying transformations, and presenting it through Power BI. Right now I am using PySpark, Databricks, Python, and SQL. That is my complete tech stack and background.

Some advanced SQL techniques that are beneficial for optimizing Python ETL scripts: one is the broadcast join, along with some of the Spark optimization techniques I use. Using CTEs (common table expressions), using joins effectively, and using window functions are quite useful for optimizing the ETL scripts across the complete pipeline. Indexing is another optimization technique we use in our ETL process. Beyond SQL, to optimize Python ETL scripts we can tune Spark configurations, optimize partitions, and repartition where needed. On the query side: avoid unnecessary SELECT DISTINCT, use WHERE clauses to filter out data early, avoid nested queries and use CTEs instead, and most importantly optimize the joins rather than joining tables directly without thought.
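
Three of the query-side techniques named in this answer (a CTE in place of a nested query, filtering early with WHERE, and indexing the filter column) can be sketched with Python's built-in sqlite3 module. This is only an illustration: the `orders` table, its columns, and the data are hypothetical, and the Spark-specific techniques such as broadcast joins are not covered.

```python
import sqlite3

def top_customers(conn, since):
    """Total spend per customer since a date, using a CTE instead of a nested query."""
    query = """
        WITH recent AS (              -- CTE: filter early, keep only needed columns
            SELECT customer_id, amount
            FROM orders
            WHERE order_date >= ?
        )
        SELECT customer_id, SUM(amount) AS total
        FROM recent
        GROUP BY customer_id
        ORDER BY total DESC
    """
    return conn.execute(query, (since,)).fetchall()

# Hypothetical table and rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2024-01-05", 10.0), (1, "2024-02-01", 20.0),
     (2, "2024-02-03", 50.0), (2, "2023-12-30", 5.0)],
)
# Index the column used in the WHERE clause so the filter can avoid a full scan.
conn.execute("CREATE INDEX idx_orders_date ON orders (order_date)")

rows = top_customers(conn, "2024-01-01")
```

Here `rows` holds one `(customer_id, total)` pair per customer, largest total first; the 2023 order is filtered out before aggregation.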

Can you propose a method for doing incremental data load in Python? For an incremental load there are some strategies we can use to minimize resource usage in the Python pipeline. First, we have to track the last loaded timestamp or ID, meaning the point up to which data has already been loaded. Then we query the data source for records that have been added or updated since the last load, using that last loaded timestamp. Next we load the incremental data into the pipeline and update the last loaded timestamp. Then we schedule the incremental loads, and if the data source allows deletions we need to handle those as well. After that we optimize the queries so that they retrieve the incremental data with minimum resource usage: indexing the columns used in the filter criteria, selecting only the necessary columns, and using efficient joins. Finally, we monitor and tune performance to measure how the incremental load performs.
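
The watermark steps described in this answer can be sketched as follows. It is a minimal illustration using two in-memory SQLite databases as stand-ins for the source and target; the `events` and `watermark` tables and all column names are assumptions, not part of the interview.

```python
import sqlite3

def incremental_load(source, target):
    """Copy rows changed since the stored watermark; return how many were loaded."""
    # 1. Read the last loaded timestamp (the watermark).
    (last_ts,) = target.execute("SELECT last_loaded FROM watermark").fetchone()
    # 2. Query only rows added or updated since then, selecting needed columns.
    rows = source.execute(
        "SELECT id, value, updated_at FROM events "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_ts,),
    ).fetchall()
    if not rows:
        return 0
    # 3. Upsert the increment and advance the watermark in one transaction.
    with target:
        target.executemany(
            "INSERT OR REPLACE INTO events (id, value, updated_at) VALUES (?, ?, ?)",
            rows,
        )
        target.execute("UPDATE watermark SET last_loaded = ?", (rows[-1][2],))
    return len(rows)

# Hypothetical source and target, for illustration only.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value TEXT, updated_at TEXT)")
source.executemany("INSERT INTO events VALUES (?, ?, ?)",
                   [(1, "a", "2024-01-01"), (2, "b", "2024-01-02")])

target = sqlite3.connect(":memory:")
target.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, value TEXT, updated_at TEXT)")
target.execute("CREATE TABLE watermark (last_loaded TEXT)")
target.execute("INSERT INTO watermark VALUES ('')")

first = incremental_load(source, target)    # initial run picks up both rows
source.execute("UPDATE events SET value = 'b2', updated_at = '2024-01-03' WHERE id = 2")
second = incremental_load(source, target)   # later run picks up only the changed row
```

This sketch does not detect deleted rows; a timestamp watermark cannot see them, so deletions would need a separate mechanism such as a soft-delete flag on the source.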

Can you mention a strategy to handle exceptions in Python when loading data into a SQL database? Yes, we can handle exceptions while managing the data. The simple approach is a try-except block: we wrap the code responsible for loading data into the database, which allows us to catch and handle any exceptions that occur during the data loading process. In the try block we write the code to connect to the database and use a cursor to run the load, and in the except block we track and print out the errors raised while loading data into the database.
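
A minimal sketch of the try-except pattern from this answer, again using sqlite3 as a stand-in database. The `users` table is hypothetical, and logging plus rollback are added here as assumptions beyond the "print out the errors" approach described.

```python
import logging
import sqlite3

logger = logging.getLogger("etl.load")

def load_rows(conn, rows):
    """Insert rows atomically; return True on success, False if the batch failed."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.executemany("INSERT INTO users (id, email) VALUES (?, ?)", rows)
        return True
    except sqlite3.IntegrityError as exc:
        # Constraint violations (duplicate key, NOT NULL) are data problems.
        logger.error("Rejected batch: %s", exc)
    except sqlite3.Error as exc:
        # Any other database error, such as a locked file or malformed SQL.
        logger.error("Database error during load: %s", exc)
    return False

# Hypothetical table, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

ok = load_rows(conn, [(1, "a@example.com"), (2, "b@example.com")])
failed = load_rows(conn, [(3, "c@example.com"), (1, "dup@example.com")])  # duplicate id
count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
```

Catching the specific `sqlite3.IntegrityError` before the general `sqlite3.Error` lets bad data be handled differently from infrastructure failures, and the rollback keeps a half-loaded batch out of the table.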

Paginate API requests in a Python script for ETL, an efficient way of paginating. First of all we have to understand the pagination parameters: the API supports pagination through query parameters such as page, per_page, or offset. Understanding how the API handles pagination helps us understand the structure of the requests. Then we set up the pagination loop: a loop that iterates over the paginated results until all the data is fetched, incrementing the pagination parameters with each iteration to fetch the next page of data. Then we implement rate limiting: if the API has rate limits, we write the script so that it stays within them. We also handle errors and exceptions gracefully during the pagination requests. After that we optimize the request frequency: we have to experiment to find a frequency that balances minimizing the time to fetch all the results against overwhelming the API server. That is the process we can follow to paginate API requests efficiently in a Python script.
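
The pagination loop described in this answer might look like the sketch below. To keep it runnable without a network, the HTTP call is replaced by a hypothetical `fake_api` function; `fetch_all`, its parameters, and the page-numbering scheme are assumptions, and a real API may use offsets or cursors instead.

```python
import time

def fetch_all(fetch_page, per_page=100, min_interval=0.0, max_retries=3):
    """Collect every record from a page-numbered API.

    fetch_page(page, per_page) must return a list of records, empty when exhausted.
    """
    records, page = [], 1
    while True:
        for attempt in range(max_retries):
            try:
                batch = fetch_page(page, per_page)
                break
            except ConnectionError:
                if attempt == max_retries - 1:
                    raise  # give up after the final attempt
                time.sleep(min_interval * (2 ** attempt))  # back off before retrying
        if not batch:
            return records
        records.extend(batch)
        page += 1
        time.sleep(min_interval)  # simple client-side rate limit

# Stand-in for a real HTTP call, serving 250 fake records.
DATA = list(range(250))
calls = []

def fake_api(page, per_page):
    calls.append(page)
    start = (page - 1) * per_page
    return DATA[start:start + per_page]

result = fetch_all(fake_api, per_page=100)
```

The `min_interval` pause is the knob for the frequency trade-off mentioned above: a larger value is gentler on the server but makes the full fetch take longer.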

Design a Python ETL solution that can scale to accommodate growing data volumes. The approach we can use here is a highly scalable distributed system, a framework through which we can leverage distributed processing, such as Apache Spark, which I am also using in my current project. Then we can modularize the ETL components: break the ETL process down into smaller tasks or components that can be scaled independently. For example, separate the extraction, transformation, and loading steps into individual modules or services, so that each component can run and scale independently based on demand. Then we can use cloud services for data storage and processing, such as S3, Cloud Storage, or Data Lake Storage, to store and process huge amounts of data; these services offer scalability, reliability, and managed infrastructure. Then we can choose an optimized data storage format based on the characteristics of the data; for example, for columnar data we can use ORC or Parquet efficiently. For growing data volumes we have to design the ETL process to support incremental processing, where only new or changed data is processed in each ETL run. Beyond that, we can monitor and fine-tune performance, handle fault tolerance, leverage caching and memoization, and apply data governance.
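
The modularization idea in this answer can be illustrated with a small standard-library sketch in which extract, transform, and load are separate functions connected by generators, so data flows through in chunks. It does not use Apache Spark, which the answer names as the distributed framework; the CSV layout and the `sales` table are hypothetical.

```python
import csv
import io
import sqlite3
from itertools import islice

def extract(file_obj):
    """Yield rows one at a time so the whole file never sits in memory."""
    yield from csv.DictReader(file_obj)

def transform(rows):
    """Convert types for each row; skip records with a missing amount."""
    for row in rows:
        if row["amount"]:
            yield (int(row["id"]), float(row["amount"]))

def load(conn, records, chunk_size=2):
    """Write records in fixed-size chunks; return the number loaded."""
    total = 0
    records = iter(records)
    while chunk := list(islice(records, chunk_size)):
        with conn:  # one transaction per chunk
            conn.executemany("INSERT INTO sales (id, amount) VALUES (?, ?)", chunk)
        total += len(chunk)
    return total

# Hypothetical input and target table, for illustration only.
raw = io.StringIO("id,amount\n1,10.5\n2,\n3,7\n4,2.5\n")
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (id INTEGER PRIMARY KEY, amount REAL)")

loaded = load(conn, transform(extract(raw)))
```

Because each stage only depends on the iterator it receives, a stage could later be swapped for a distributed or cloud-backed implementation without touching the others.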

A simple Python code block designed to send a batch of messages; there appears to be an oversight. Explain the potential issue that could lead to an error or exceptional behavior. Basically, it is using the boto3 library, and through the Lambda client we are creating a list of dictionaries. I think the issue is with the invoke method of the Lambda client: there is an evident error in the string passed to the FunctionName parameter, the process message function, around the closing quote. It is an improper use of string interpolation in the FunctionName parameter of the invoke method; specifically, there is a mismatched quote character in the function name string.

Looking at this SQL query: sales, LAG(revenue), ORDER BY month. First of all, the window function written here may not be valid, because depending on the SQL dialect being used, the LAG function may not be supported or may require specific syntax. So first we have to verify that the SQL dialect supports LAG. Second, the LAG function requires an ordering to determine the previous month's revenue. If the month column in the sales data table is not properly ordered, or if there are ties between values, the result of the LAG function will not be reliable. Third, we have to check the data quality, which is also missing here: missing or null values in the revenue or month column, or inconsistencies such as duplicate months, could lead to unexpected results. So how can we debug it? We check the dialect compatibility, meaning how the LAG function can be used with OVER; then we inspect the sales data table to ensure the revenue and month columns contain valid, non-null, properly formatted values; and then we test the query in the SQL environment. If LAG is not supported, we can use CTEs instead.
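
The query under discussion is not shown in the transcript, so the following is a hypothetical reconstruction from the names mentioned (a sales data table, revenue, month) showing how LAG with an explicit ORDER BY behaves. It assumes the SQLite bundled with Python is version 3.25 or later, which is when window functions were added.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table; rows are inserted out of order on purpose.
conn.execute("CREATE TABLE sales_data (month TEXT, revenue REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [("2024-03", 150.0), ("2024-01", 100.0), ("2024-02", 120.0)],
)

query = """
    SELECT month,
           revenue,
           LAG(revenue) OVER (ORDER BY month) AS prev_revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS change
    FROM sales_data
    ORDER BY month
"""
rows = conn.execute(query).fetchall()
```

The first month has no predecessor, so its `prev_revenue` and `change` come back as NULL (`None` in Python); downstream code has to expect that. Duplicate months would make the ordering ambiguous, which is the tie problem raised in the answer.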

Efficient way to debug a Python application that is experiencing performance issues during complex SQL data transformations. We can use profiling tools such as cProfile or line_profiler to analyze the execution time and resource usage of different parts of the Python application. This helps identify which functions or sections of the code contribute most to the performance issue. First of all we have to identify the performance bottlenecks, the specific areas of the code or SQL queries where the performance degradation is happening; as I said, profiling pinpoints the functions and SQL statements. Then we optimize the Python code as much as possible to improve efficiency. Then we analyze and optimize the SQL queries: we can look at query execution plans to analyze the performance of complex SQL queries, using database-specific features such as EXPLAIN and ANALYZE to understand how the database engine executes them. We can also cache the results when data transformations are computationally expensive and do not change frequently. Incremental testing and benchmarking are the other things we can use.
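
As a small illustration of the profiling step, the sketch below runs a deliberately slow function under cProfile and captures the report sorted by cumulative time. The `slow_transform` and `pipeline` functions are hypothetical stand-ins for real transformation code.

```python
import cProfile
import io
import pstats

def slow_transform(n):
    """Deliberately slow: nested loops with repeated string concatenation."""
    out = ""
    for i in range(n):
        for j in range(n):
            out += str(i * j % 10)
    return len(out)

def pipeline():
    return slow_transform(200)

# Profile only the code between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = pipeline()
profiler.disable()

# Sort by cumulative time and keep the top entries in a string report.
buffer = io.StringIO()
pstats.Stats(profiler, stream=buffer).sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
```

Printing `report` lists the functions that dominate the run time, which is where optimization effort should go first. Slow SQL statements need the database's own plan tools, since cProfile only sees the time Python spends waiting on them.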

Load times when using React and Streamlit in Python for a front end. (This is a good-to-have question; answering it won't affect your overall screening score.) To optimize web application load times, first of all we can minimize the initial code size: we reduce the size of the initial JS and CSS files by minifying or compressing them, using tools such as webpack and Parcel. The other approach, which my team is also using in my project, is lazy loading of components and resources: we load them as required rather than immediately. Then we can optimize the images, compressing the images used in the web application to reduce their size without sacrificing quality. We can split the code into modules or chunks and load them dynamically based on user interaction. We can also minimize external dependencies, use server-side rendering, optimize data loading, and monitor performance.

How do you manage state effectively? First of all, I'm not sure about that. We can define clear data contracts and centralize the state management so that state is used consistently, and we can use REST API services. I'm not sure, but we can implement asynchronous data fetching, optimize the data transformations as well, and use real-time communication.
