Vetted Talent

Vivek Gupta

Vivek Gupta is a seasoned Software Engineer with over 10 years of experience in Data Engineering and Science. Proficient in Python, SQL, Apache Spark, and machine learning frameworks, he excels in database management and predictive analytics. Gupta has led projects at Experis IT Pvt. Ltd. and Data Theta, optimizing operations, enhancing data governance, and implementing statistical methods for cost savings. He's skilled in AWS, MongoDB, and Azure tools, with a history of streamlining ETL processes and developing AI-based solutions. Gupta holds a Master's in Computer Engineering and a Bachelor's in the same field from Rajasthan universities.

  • Role

    Python Cloud ETL Engineer

  • Years of Experience

    14.00 years

Skillsets

  • PyTorch
  • Deep Learning
  • Hadoop
  • Hive
  • Keras
  • NumPy
  • pandas
  • PowerBI
  • PySpark
  • Data Modeling
  • R
  • Seaborn
  • Spark
  • Tableau
  • TensorFlow
  • data transformation
  • Data Loading
  • Backend - 10 Years
  • Azure
  • MySQL - 12 Years
  • AWS - 3 Years
  • Data Engineering - 6 Years
  • Data Processing - 6 Years
  • SQL - 8 Years
  • Big Data - 6 Years
  • Apache Spark - 6 Years
  • Python - 10 Years

Vetted For

9 Skills
  • Python Cloud ETL Engineer (Remote) AI Screening
  • 61%
  • Skills assessed: SFMC, Streamlit, API, AWS, ETL, JavaScript, Python, React Js, SQL
  • Score: 55/90

Professional Summary

14.00 Years
  • Jun, 2023 - Present 2 yr 11 months

    Senior Data Engineering (Backend Developer)

    Experis IT Pvt. Ltd. (Client: AT&T)
  • Mar, 2023 - May, 2023 2 months

    Senior Consultant

    Data Theta
  • Mar, 2023 - May, 2023 2 months

    Senior Consultant (Data Engineering)

    Data Theta
  • Sep, 2018 - May, 2022 3 yr 8 months

    Consultant

    ARRJ MS Pvt. Ltd
  • Sep, 2018 - May, 2022 3 yr 8 months

    Consultant

    Dezired Solutions
  • Jun, 2022 - Mar, 2023 9 months

    Associate Consultant (Data Science & Engineering)

    TCS
  • Jul, 2014 - Jul, 2016 2 yr

    Senior Software Engineer (Data Science & Engineering)

    ARRJ MS Pvt. Ltd
  • Aug, 2008 - Jul, 2016 7 yr 11 months

    Senior Software Engineer

    ARRJ MS Pvt Ltd.
  • Aug, 2008 - Jun, 2014 5 yr 10 months

    Software Engineer

    NextGen Compusoft Ltd.

Applications & Tools Known

  • MongoDB
  • Azure Data Lake Storage Gen2 (ADLS)
  • AWS CloudWatch
  • Amazon S3
  • Amazon Redshift
  • Azure Active Directory
  • Microsoft Azure SQL Database
  • Azure Data Factory
  • GraphQL
  • React
  • Node.js

Work History

14.00 Years

Senior Data Engineering (Backend Developer)

Experis IT Pvt. Ltd. (Client: AT&T)
Jun, 2023 - Present 2 yr 11 months
    Designed, developed, and implemented a server utilization dashboard to provide enterprise-wide visibility into on-prem hardware and infrastructure capacity utilization metrics, enabling stakeholders to optimize performance and reduce costs. Developed backend infrastructure to process 500,000 transactions/day, increasing throughput by 40% in 3 months. Created Power BI reports and dashboards to visualize and analyze server utilization metrics.

Senior Consultant

Data Theta
Mar, 2023 - May, 2023 2 months
    • Streamlined data governance process: Implemented multiprocessing solution to improve concurrent data frame writing, resulting in 15% faster storage & retrieval.
    • Maximizing Data Effectiveness: Through automation, I enhanced visibility and efficiency by 20% via automated logging and statistical reporting for running jobs.

Senior Consultant (Data Engineering)

Data Theta
Mar, 2023 - May, 2023 2 months
    Automated logging and statistical reporting for running notebooks, improving visibility and efficiency by 20%. Conducted ETL on data from Cosmos DB to Databricks and stored it in a data lake, streamlining data ingestion and enabling 30% faster analytics. Implemented multiprocessing in a job to write data frames to multiple database locations simultaneously, optimizing data storage and retrieval by 15%.

Associate Consultant (Data Science & Engineering)

TCS
Jun, 2022 - Mar, 2023 9 months
    Designed and implemented a robust data pipeline using Azure Data Factory, connecting multiple data sources (including on-premises databases and cloud data stores) to a centralized data lake, improving data accessibility and accuracy. Scheduled, automated, and monitored data pipelines using Azure and Databricks. Ran data quality checks and worked with data pipelines in production.

Consultant

Dezired Solutions
Sep, 2018 - May, 2022 3 yr 8 months
    • Leveraged AWS machine learning services like Amazon SageMaker and AWS Deep Learning AMIs to implement deep learning and transfer learning models, enhancing search accuracy by 85% and efficiency by 25% on the AWS Cloud.
    • Built predictive models using Azure Machine Learning Studio's drag-and-drop interface and automated machine learning capabilities (AutoML), contributing to a 25% increase in target user engagement on Azure-hosted applications.
    • Conducted comprehensive analysis of customer behavior by integrating structured data from AWS databases (Amazon RDS, Amazon Redshift) and unstructured data from Azure Data Lake Storage, leading to a 25% reduction in customer churn for cloud-hosted services.
    • Optimized the data modeling workflow by integrating Azure Synapse Analytics (formerly Azure SQL Data Warehouse), Azure Databricks, and AWS AI/ML services, resulting in a 30% accelerated development pace for cloud-based data projects.
    • Spearheaded the implementation of an AWS data lake on Amazon S3, coupled with Azure Data Factory for data ingestion and Azure Databricks for data processing, resulting in a 50% improvement in data accessibility and enabling more data-driven decision-making.
    • Architected and deployed highly scalable and cost-effective machine learning pipelines on Azure using services like Azure Machine Learning, Azure Databricks, Azure Kubernetes Service (AKS), and Azure Functions, accelerating model training and deployment processes.

Consultant

ARRJ MS Pvt. Ltd
Sep, 2018 - May, 2022 3 yr 8 months
    Designed and implemented a machine learning model using user behavior logs to predict anomalous behavior that may indicate insider threats or policy violations, enhancing security measures and risk management protocols. Anomaly detection model achieved 95% accuracy in identifying anomalous behavior. Utilized Azure Databricks to process big data and implement machine learning models for predictive analysis, optimize data processing, and improve decision-making capabilities. Machine learning models reduced data processing time by 50%. Designed ETL and algorithm to detect anomalies in IoT sensors health data on AWS. Anomaly detection algorithm identified 98% of all anomalies in IoT sensor data. Developed a backend service for recommendation systems and object detection on unstructured data using unsupervised machine learning techniques. This enabled more accurate and efficient data processing, which improved decision-making capabilities by 15%. Collaborated with team members to perform cohort analysis on fraud detection, identifying patterns and trends in fraudulent behavior. This led to the development of effective risk management and prevention strategies, which reduced fraud by 10%.

Senior Software Engineer (Data Science & Engineering)

ARRJ MS Pvt. Ltd
Jul, 2014 - Jul, 2016 2 yr
    Ingested data from various sources, including SQL, Google Analytics API, and Salesforce API, using Python to create data views for BI tools such as Tableau, improving data accessibility and accuracy by 90%. Designed and implemented RESTful APIs using Python and Django that could handle over 100,000 users, enabling smooth and efficient data communication between applications and improving user experience by 20%. Developed a classification model to predict customer loan eligibility by applying machine learning algorithms such as Decision Tree, Gradient Boosting, and XGBoost, improving decision-making capabilities and risk management protocols by 15%.

Senior Software Engineer

ARRJ MS Pvt Ltd.
Aug, 2008 - Jul, 2016 7 yr 11 months
    • Enhanced the data ingestion process utilizing Python, resulting in a remarkable 90% increase in accuracy and facilitating valuable insights through effective Data Mining techniques.
    • Enhanced data communication between applications by 20% by crafting and deploying RESTful APIs using Python and Django, promoting seamless interaction and interoperability.
    • Fine-tuned data mining procedures, leading to a notable 15% enhancement in data accuracy and compliance standards, ultimately bolstering overall data quality and integrity.

Software Engineer

NextGen Compusoft Ltd.
Aug, 2008 - Jun, 2014 5 yr 10 months
    Created micro-services and Web Services (incl. SOA/SOAP/REST/XML) to improve customer experience, resulting in a 15% increase in customer satisfaction. Deployed and integrated software engineered by the team, resulting in a 20% reduction in deployment time. Updated scripts to streamline continuous integration practices, resulting in a 30% increase in the number of successful deployments. Optimized database performance by identifying and resolving performance bottlenecks.

Achievements

  • Reduced AT&T cost waste by identifying and decommissioning non-essential, under-utilized servers and VMs

Major Projects

3 Projects

Server Utilization

AT&T
Jun, 2023 - Present 2 yr 11 months
    • Utilize Databricks on on-prem hardware for monitoring hardware and infrastructure capacity metrics.
    • Leverage Databricks' unified analytics platform to seamlessly ingest, process, and analyze data from various sources.
    • Integrate Databricks with data pipelines to efficiently collect metrics and perform real-time processing.
    • Derive actionable insights from the processed data, ensuring consistent visibility into infrastructure utilization.
    • Empower informed decision-making and optimization strategies using Databricks for on-prem hardware and infra-capacity utilization metrics.

Image Search Engine and recommendation system

Dezired
Aug, 2020 - Nov, 2021 1 yr 3 months
    • Developed an AI-based search engine and recommendation system using AWS services.
    • Utilized Amazon SageMaker for machine learning model training and deployment.
    • Leveraged Amazon Elasticsearch Service for fast and scalable search capabilities.
    • Integrated with Amazon Personalize to enhance the system with personalized recommendations.
    • Provided users with tailored content suggestions based on their preferences and behavior.
    • Conducted thorough testing and optimization to ensure high performance and accuracy.

customer behavior utilizing

Dezired Solutions
Jun, 2019 - Jul, 2021 2 yr 1 month
    • Led a thorough customer behavior analysis initiative utilizing structured and unstructured data sources.
    • Valuable insights and patterns were discovered through the analysis.
    • Implemented targeted interventions based on the insights gained.
    • Achieved a significant 25% reduction in customer churn as a result.
    • Stabilized the customer base and improved overall business performance measurably.

Education

  • Master of Technology; Computer Engineering

    Rajasthan Technical University, Kota (2018)
  • Bachelor of Engineering; Computer Engineering

    University of Rajasthan, Jaipur (2008)

AI-interview Questions & Answers

To start with my background: I completed my bachelor's in computer engineering in 2008 and immediately started working in the IT industry. Until 2016, I worked as a back-end engineer and on data-related work such as databases. In those eight years, I spent a lot of time on back-end services, including designing highly scalable back-end APIs and systems for clients so that we could deliver quality products. During that time I worked with JavaScript frameworks for both front end and back end, covering complete UI and full-stack development. I also used Python, SQL, and other data technologies, as well as some basic machine learning algorithms. On the cloud side, I started with Azure at that time, but only in a limited capacity. From 2016 to 2018, I did my master's, completing it in two years with a focus on AI and machine learning. During my master's, I also published a research paper in a reputed Taylor & Francis journal. After completing my master's, I resumed my career in data engineering and data science. I had already gained exposure to the ETL process and complete data pipelines in 2015-16, and I started cloud data engineering in 2018, working on both Azure and AWS. I gained experience with orchestration tools such as Azure Data Factory, as well as Synapse Analytics and Databricks on the Azure side, using Spark features to optimize transactions, handling highly scalable loads, and building complete end-to-end pipelines. I handled many projects from 2018 to 2022 while working for a single company. In 2022, I switched companies and have since worked across different domains, including e-commerce, healthcare, and pharmaceuticals.
Right now, in my current project, I'm working on a server utilization dashboard, where we're designing the complete system: getting the data from the API, landing it in blob storage, transforming it, and presenting it through Power BI. My current tech stack is Spark, Databricks, Python, and SQL.

Some of the advanced SQL techniques that help optimize Python ETL scripts are broadcast joins and the other Spark optimization techniques I use. Using common table expressions (CTEs), effective joins, and window functions is quite useful for optimizing the ETL scripts across the complete pipeline. Indexing is another optimization technique we use in our ETL process. Beyond that, we can tune Spark configurations and optimize partitions, for example by repartitioning. On the query side: optimize joins, avoid unnecessary SELECT DISTINCT, use WHERE clauses to filter data early, and avoid nested queries in favor of CTEs. The most important point is that joins should be optimized rather than written naively.
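
As a rough illustration of a few of these points (an index on the filter column, an early WHERE filter, and a CTE instead of a nested query), here is a minimal sketch using Python's built-in sqlite3 module. The orders table, its columns, and the totals_since helper are hypothetical names chosen only for the example; Spark-specific techniques such as broadcast joins are not shown.

```python
import sqlite3

# Hypothetical table and data, used only to illustrate the query shape.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        amount REAL,
        order_date TEXT
    );
    -- Index the column used in the WHERE filter.
    CREATE INDEX idx_orders_date ON orders (order_date);
""")
conn.executemany(
    "INSERT INTO orders (customer_id, amount, order_date) VALUES (?, ?, ?)",
    [
        (1, 10.0, "2024-01-05"),
        (1, 25.0, "2024-02-10"),
        (2, 40.0, "2024-02-11"),
        (2, 5.0, "2023-12-01"),
    ],
)


def totals_since(connection, start_date):
    """Total amount per customer for orders on or after start_date."""
    query = """
        WITH recent AS (              -- CTE instead of a nested subquery
            SELECT customer_id, amount
            FROM orders
            WHERE order_date >= ?     -- filter early, select only needed columns
        )
        SELECT customer_id, SUM(amount) AS total
        FROM recent
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return connection.execute(query, (start_date,)).fetchall()


rows = totals_since(conn, "2024-01-01")
print(rows)
```

The same query shape carries over to other SQL engines, although index syntax and the optimizer's behavior differ between them.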

Can you propose a method for doing incremental data load in Python? Okay, so in this case, incremental loading is one of the strategies we can use to optimize a Python pipeline and minimize resource usage. First, track the last loaded timestamp or ID. Then query the data source for records that have been added or updated since the last load, using that last loaded timestamp. Next, load the incremental data into the pipeline, update the last loaded timestamp, and schedule the incremental loads. We also need to handle deletions if the data source deletes data, and optimize the queries so that the incremental data is retrieved with minimal resource usage: index the columns used in the filter criteria, select only the necessary columns, and use efficient joins. Finally, we monitor and tune the incremental load by measuring its performance.
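
The watermark idea described above can be sketched roughly as follows, using Python's built-in sqlite3 module for both source and target. The orders and etl_watermark tables, the updated_at column, and the incremental_load function are assumptions made for the example, and deletions in the source are not handled here.

```python
import sqlite3


def incremental_load(source, target, job="orders"):
    """Copy rows changed since the stored watermark; return how many were loaded."""
    row = target.execute(
        "SELECT last_loaded FROM etl_watermark WHERE job = ?", (job,)
    ).fetchone()
    last_loaded = row[0] if row else ""  # empty string sorts before any timestamp

    # Pull only rows added or updated since the last successful load.
    changed = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded,),
    ).fetchall()
    if not changed:
        return 0

    with target:  # single transaction: data and watermark move together
        target.executemany(
            "INSERT OR REPLACE INTO orders (id, amount, updated_at) VALUES (?, ?, ?)",
            changed,
        )
        target.execute(
            "INSERT OR REPLACE INTO etl_watermark (job, last_loaded) VALUES (?, ?)",
            (job, changed[-1][2]),
        )
    return len(changed)


def make_db(ddl):
    conn = sqlite3.connect(":memory:")
    conn.executescript(ddl)
    return conn


source = make_db(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);"
)
target = make_db("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, updated_at TEXT);
    CREATE TABLE etl_watermark (job TEXT PRIMARY KEY, last_loaded TEXT);
""")

source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T00:00:00"), (2, 20.0, "2024-01-02T00:00:00")],
)
source.commit()
first = incremental_load(source, target)   # initial load picks up both rows

source.execute(
    "UPDATE orders SET amount = 25.0, updated_at = '2024-01-03T00:00:00' WHERE id = 2"
)
source.commit()
second = incremental_load(source, target)  # only the updated row is reloaded
third = incremental_load(source, target)   # nothing changed since the watermark
print(first, second, third)
```

Updating the watermark in the same transaction as the data means a failed run leaves the old watermark in place, so the next run retries the same window.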

Can you mention a strategy to handle exceptions in Python when loading data into a SQL database? Okay. Yes, we can handle exceptions while loading the data. The simplest approach is a try-except block. We wrap the code responsible for loading data into the database, which allows us to catch and handle any exceptions that occur during the loading process. In the try block, we write the code to connect to the database and execute the queries through a cursor. In the except block, we handle the exceptions by printing out the errors that occur, so that a failure while loading data into the database is dealt with.
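
A minimal sketch of that try-except pattern, again using the built-in sqlite3 module so it runs without a database server. The customers table and the load_rows function are hypothetical; with another database driver the exception classes would differ.

```python
import sqlite3


def load_rows(conn, rows):
    """Insert rows as one transaction; report the error and return False on failure."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO customers (id, email) VALUES (?, ?)", rows
            )
    except sqlite3.IntegrityError as exc:
        # Constraint problems such as duplicate keys or NULL in a NOT NULL column.
        print(f"Load rejected, constraint violated: {exc}")
        return False
    except sqlite3.Error as exc:
        # Any other database-level failure.
        print(f"Database error during load: {exc}")
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT NOT NULL)")

ok = load_rows(conn, [(1, "a@example.com"), (2, "b@example.com")])
# The second batch reuses id 1, so the whole batch is rolled back.
failed = load_rows(conn, [(3, "c@example.com"), (1, "dup@example.com")])
count = conn.execute("SELECT COUNT(*) FROM customers").fetchone()[0]
print(ok, failed, count)
```

Catching the specific exception first and the general one second keeps constraint violations distinguishable from connection or syntax problems.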

Paginate API requests in a Python script for ETL. Okay, a way to paginate efficiently. First of all, we have to understand the pagination parameters, that is, whether the API supports query parameters like page, per page, or offset. Understanding how the API handles pagination tells us how to structure the requests. Then we set up the pagination loop: a loop that iterates over the paginated results until all the data is fetched, incrementing the pagination parameters on each iteration to fetch the next page. If the API has rate limits, we implement rate limiting in the script so that we stay within them. We also handle any errors or exceptions that occur during the paginated requests. After that, we can optimize the request frequency: we have to experiment to find the frequency that balances minimizing the time to fetch all the results against overloading the API server. That is the process we can follow to efficiently paginate API requests in a Python script.
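
The loop described above might look roughly like this. To keep the sketch runnable without network access, fetch_page is a stand-in for a real HTTP call; its page and per_page parameters, the retry count, and the delay are all assumptions for illustration rather than any particular API's contract.

```python
import time


def fetch_page(page, per_page):
    """Stand-in for an HTTP request; pretends the API holds 7 records."""
    data = list(range(1, 8))
    start = (page - 1) * per_page
    return data[start:start + per_page]


def fetch_all(fetch, per_page=3, delay_seconds=0.0, max_retries=3):
    """Collect every record by walking pages until a short or empty page appears."""
    records = []
    page = 1
    while True:
        # Retry transient failures with a growing pause between attempts.
        for attempt in range(1, max_retries + 1):
            try:
                batch = fetch(page, per_page)
                break
            except ConnectionError:
                if attempt == max_retries:
                    raise
                time.sleep(delay_seconds * attempt)

        if not batch:
            break                      # empty page: nothing left to fetch
        records.extend(batch)
        if len(batch) < per_page:
            break                      # short page: this was the last one
        page += 1
        time.sleep(delay_seconds)      # simple pause to respect rate limits
    return records


all_records = fetch_all(fetch_page)
print(all_records)
```

APIs that paginate with cursors or next-page links instead of page numbers need a different loop condition, but the retry and pacing logic stays the same.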

The approach we can use here is a highly distributed, scalable system that leverages distributed processing, such as Apache Spark, which I am also using in my current project. Then we can modularize the ETL components, breaking the ETL process into smaller tasks or components. For example, we can separate the extraction, transformation, and loading steps into individual modules or services so that each one can run and scale independently based on demand. Then, obviously, we can use cloud services for data storage and processing: S3, cloud storage, or data lake storage can store and process huge amounts of data, and these services offer scalability, reliability, and managed infrastructure. We can also choose optimized data storage based on the characteristics of the data; for example, for columnar data we can use Redshift or Presto. To cope with growing data volumes, we can process the load incrementally: we design the ETL process to support incremental processing, so that only new or changed data is processed in each ETL run. Monitoring follows: we monitor and fine-tune performance, handle fault tolerance, and leverage caching and memoization. We can also apply data governance.
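
To make the modularization point concrete, here is a small single-process sketch in which extraction, transformation, and loading are separate functions connected only by iterators. The record fields, the in-memory sink, and the batch size are hypothetical; in a real deployment each stage could be its own service or Spark job.

```python
from typing import Iterable, Iterator


def extract(source: Iterable[dict]) -> Iterator[dict]:
    """Extraction stage: yield raw records one at a time."""
    yield from source


def transform(records: Iterable[dict]) -> Iterator[dict]:
    """Transformation stage: drop incomplete rows and normalise values."""
    for record in records:
        if record.get("amount") is None:
            continue
        yield {
            "customer": record["customer"].strip().lower(),
            "amount": float(record["amount"]),
        }


def load(records: Iterable[dict], sink: list, batch_size: int = 2) -> int:
    """Loading stage: write to the sink in batches; return the number written."""
    batch, written = [], 0
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            sink.extend(batch)
            written += len(batch)
            batch = []
    sink.extend(batch)  # flush whatever is left over
    return written + len(batch)


raw = [
    {"customer": " Alice ", "amount": "10.5"},
    {"customer": "Bob", "amount": None},
    {"customer": "CAROL", "amount": 3},
]
warehouse = []
loaded = load(transform(extract(raw)), warehouse)
print(loaded, warehouse)
```

Because each stage only depends on the iterator it receives, any one of them can be replaced or scaled without touching the others.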

A simple Python code block is designed to send a batch of messages, and there appears to be an oversight; explain the potential issue that could lead to an error or unexpected behavior. Okay. So, basically, the code uses the boto3 library, and for the Lambda call it creates a list of dictionaries. In this code block, I think the issue is with the invoke method of the Lambda client. There is an evident error in the string passed to the FunctionName parameter, that is, the process message function: the closing quote. So the oversight that could lead to the exception is improper string handling in the FunctionName parameter of the invoke method. Specifically, there is a mismatched quote character in the function name string.
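
The code under discussion is not reproduced in this transcript, so the following is only a sketch of what a correctly quoted call could look like. The keyword arguments match boto3's Lambda client invoke method (FunctionName, InvocationType, Payload); the function name "process_message", the send_batch helper, and the FakeLambdaClient stand-in (used so the example runs without AWS credentials) are assumptions.

```python
import json


def send_batch(lambda_client, function_name, messages):
    """Invoke the function asynchronously once per message; return the status codes."""
    statuses = []
    for message in messages:
        response = lambda_client.invoke(
            FunctionName=function_name,   # plain string with balanced quotes
            InvocationType="Event",       # asynchronous invocation
            Payload=json.dumps(message).encode("utf-8"),
        )
        statuses.append(response["StatusCode"])
    return statuses


class FakeLambdaClient:
    """Stand-in for boto3.client("lambda") that records the calls it receives."""

    def __init__(self):
        self.calls = []

    def invoke(self, **kwargs):
        self.calls.append(kwargs)
        return {"StatusCode": 202}  # status AWS returns for accepted Event invocations


client = FakeLambdaClient()
statuses = send_batch(client, "process_message", [{"id": 1}, {"id": 2}])
print(statuses)
```

With real credentials, passing boto3.client("lambda") in place of the fake client would send the same arguments to AWS.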

Okay, look at this. In this SQL query, what I see is sales, LAG of revenue, ordered by month. First of all, the window function as written here may not be valid, because depending on the SQL dialect being used, the LAG function may not be supported or may require specific syntax. So, first, we have to verify that the SQL dialect supports LAG. Second, the LAG function requires an ordering to determine the previous month's revenue, so if the month column in the sales data table is not properly ordered, or there are ties between values, the result of the LAG function won't be reliable. Third, we have to check the data quality: missing or null values in the revenue or month column, or inconsistencies such as duplicate months, could lead to unexpected results. So how do we debug it? We check the dialect compatibility, that is, how the LAG function can be used with OVER. Then we inspect the data in the sales data table to ensure that the revenue and month columns contain valid, non-null, properly formatted values. Then we test the query by executing it in the SQL environment. And if LAG is not supported, we can use CTEs to get the same result.
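
The original query is not shown in this transcript, so the following is an assumed reconstruction of the kind of query being discussed, run through Python's built-in sqlite3 module (LAG requires SQLite 3.25 or newer). The sales_data table and its month and revenue columns are hypothetical. A CTE with a self-join is included as a fallback for dialects without LAG.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales_data (month TEXT PRIMARY KEY, revenue REAL)")
conn.executemany(
    "INSERT INTO sales_data VALUES (?, ?)",
    [("2024-01", 100.0), ("2024-02", 120.0), ("2024-03", 90.0)],
)

# Window-function version: LAG needs an explicit ORDER BY inside OVER.
lag_query = """
    SELECT month,
           revenue,
           revenue - LAG(revenue) OVER (ORDER BY month) AS change
    FROM sales_data
    ORDER BY month
"""

# Fallback without LAG: join each month to the latest earlier month.
cte_query = """
    WITH ordered AS (
        SELECT month, revenue FROM sales_data
    )
    SELECT c.month,
           c.revenue,
           c.revenue - p.revenue AS change
    FROM ordered AS c
    LEFT JOIN ordered AS p
           ON p.month = (SELECT MAX(month) FROM ordered WHERE month < c.month)
    ORDER BY c.month
"""

lag_rows = conn.execute(lag_query).fetchall()
cte_rows = conn.execute(cte_query).fetchall()
print(lag_rows)
```

The first month has no predecessor, so its change is NULL in both versions; duplicate months would make the ordering ambiguous, which is why the example declares month as a primary key.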

To debug a Python application experiencing performance issues during complex SQL data transformations, there are several efficient approaches. One of the most efficient is to use profiling tools like cProfile or line_profiler to analyze the execution time of different parts of the Python application and identify which function or section of the code is contributing to the performance issue. To identify the performance bottlenecks, we first need to pinpoint the specific area of the code or SQL query where the performance degradation is happening. We can use profiling to identify the functions and SQL statements that are causing the issue. Once we have identified the bottlenecks, we can optimize the Python code to improve its efficiency. We can also analyze the SQL queries and optimize them using database-specific features like EXPLAIN and ANALYZE to understand how the database engine is executing them. Additionally, we can cache the results of computationally expensive data transformations that don't change frequently. Incremental testing and benchmarking are also useful for identifying performance issues and optimizing the application.
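
As a small illustration of the profiling step, the sketch below wraps a function call with cProfile and formats the result with pstats, both from the standard library. The slow_transform function is a hypothetical stand-in for an expensive transformation, and profile_call is a helper written for this example.

```python
import cProfile
import io
import pstats


def slow_transform(n):
    """Deliberately wasteful stand-in for an expensive transformation."""
    total = 0
    for i in range(n):
        total += sum(range(i % 100))
    return total


def profile_call(func, *args):
    """Run func under cProfile; return its result and a text report."""
    profiler = cProfile.Profile()
    profiler.enable()
    result = func(*args)
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
    return result, buffer.getvalue()


result, report = profile_call(slow_transform, 2000)
print(report)
```

Sorting by cumulative time highlights the functions whose call trees dominate the run, which is usually the first place to look before drilling down with a line-level profiler.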

Using React and Streamlit in Python for a front end; answering this won't affect your overall screening score. So, my approach to optimizing web application load times is as follows. First of all, we can minimize the initial code size by reducing the size of the JavaScript and CSS files through minifying and compressing them, using tools like Webpack and Parcel. Another approach, which my team also uses in our project, is lazy loading of components and resources: we load them not immediately, but when they are required. We can also compress and optimize images without sacrificing quality. Additionally, we can split the code into modules or chunks and load them dynamically based on user interaction. We can also minimize external dependencies, use server-side rendering, optimize data loading, and monitor performance.

How do you manage state effectively? Okay, so in that case, first of all, I'm not sure about that. We can define clear data contracts. We can centralize state management so that state is used consistently, and rely on REST API services. We can implement asynchronous data fetching, optimize data transformations, and use real-time communication.