Accomplished Data Scientist with over 13 years of experience conceptualizing, architecting, and maintaining scalable Machine Learning models with automated training, validation, monitoring, and reporting. Proficient across advanced domains including GenAI, Natural Language Processing, Pattern Recognition, Sequence Analysis, Time Series, and Prediction. Demonstrated track record of leading and delivering more than 20 end-to-end machine learning and AI initiatives. Seeking a challenging Data Scientist position where I can leverage my expertise in developing and deploying complex data science and analytics projects to support the commercial organization.
Staff Data Analytics Engineer (Architect), Avalara
Senior Data Scientist, Visa
Senior Data Engineer (Lead), Maersk
Software Engineer (Data Analytics), Tech Mahindra
Application Development Analyst (Data Analytics), Accenture
Hive
Spark
Teradata
Informatica
MSBI
Microsoft Azure
SQL DB
Power BI
Azure ML
Hadoop
Sqoop
Tableau
AWS
Oracle
Kafka
Airflow
SAP BO
Postman
SAS
Ab Initio
GCP
Docker
Kubernetes
Jenkins
DevOps
Kubeflow
MLflow
Hi, I'm Jemin. I have over 12 years of experience in the data and analytics space. I have worked across data warehousing, data engineering, and data science, and I have also worked extensively on migration projects: on-prem to on-prem, on-prem to cloud, cloud to cloud, and cloud to on-prem. I have worked with technologies such as Microsoft Azure, AWS, Teradata, Informatica, SAP S/4HANA, and SAP BI/BW. On the big data stack, I have worked on Hadoop, Hive, Pig, Airflow, Spark, and Kafka. On the machine learning front, I have built classification, regression, and time series models, deep learning models such as neural networks and transformers, and AI applications spanning natural language processing, large language models, and various other generative AI use cases. In terms of domains, I have worked in finance, insurance, banking, payments, transportation and logistics, and telecommunications.
The approach we would take to secure sensitive financial data during migration from SAP to Azure is to build encryption into the movement itself: while data is moving from on-prem SAP systems to Azure, it should be encrypted in flight with a strong encryption algorithm so that it is difficult to decrypt while in motion. Once the data has landed in Azure, decryption logic can transform it back to its original form. If there is a requirement that the sensitive financial data never be displayed in the clear, the encrypted data can continue to flow from SAP into Azure, and within Azure as well, to protect it from unwanted access and to ensure that even if there is unauthorized access, no sensitive or personalized information is exposed.
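As a rough illustration of the field-level encryption idea, here is a minimal sketch assuming a Python extraction step and the cryptography package; SAP and Azure tooling may provide their own in-flight encryption, and the key handling and field names below are simplified, hypothetical placeholders.

```python
# Minimal sketch: symmetric field-level encryption before the data leaves the
# source side, using the `cryptography` package (an assumption). In practice
# the key would live in a vault such as Azure Key Vault, not in the script.
from cryptography.fernet import Fernet

key = Fernet.generate_key()   # store securely, e.g. in a key vault
cipher = Fernet(key)

def encrypt_field(value: str) -> str:
    """Encrypt a single sensitive field before it is written to the transfer layer."""
    return cipher.encrypt(value.encode("utf-8")).decode("utf-8")

def decrypt_field(token: str) -> str:
    """Decrypt the field once it has landed in Azure (only where allowed)."""
    return cipher.decrypt(token.encode("utf-8")).decode("utf-8")

# Hypothetical record being staged for migration
record = {"invoice_id": 1001, "iban": "DE89370400440532013000"}
record["iban"] = encrypt_field(record["iban"])
```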
We would generally normalize data when you have a wide array of dimensions, each holding its own metrics; to keep the number of joins manageable and to make sure the data is unique within each of those dimensions, we end up normalizing the data model. On the other hand, when a wide variety of data is coming in and there is a need to access it constantly, and you are willing to accept the redundancy, duplication, and overhead it brings to the system, you can keep it in denormalized form. When you want all of the data accessible in one place and do not want to repeat the same joins over and over again at multiple points in the data pipeline, a denormalized data model is the better choice.
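To make the contrast concrete, here is an illustrative PySpark sketch; the table and column names are hypothetical and the write path is a placeholder.

```python
# Contrast between the normalized and denormalized shapes described above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("normalization-demo").getOrCreate()

customers = spark.createDataFrame(
    [(1, "Acme Corp"), (2, "Globex")], ["customer_id", "customer_name"])
orders = spark.createDataFrame(
    [(10, 1, 250.0), (11, 2, 90.0)], ["order_id", "customer_id", "amount"])

# Normalized: each dimension is stored once; queries join as needed.
normalized_view = orders.join(customers, "customer_id")

# Denormalized: the join is paid once up front and the wide table is reused,
# accepting redundancy in exchange for simpler, repeated reads downstream.
denormalized = normalized_view.select(
    "order_id", "customer_id", "customer_name", "amount")
denormalized.write.mode("overwrite").parquet("/tmp/orders_denormalized")
```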
When migrating large datasets from S/4HANA to Azure or any other cloud, the general approach people follow is to move the data over the network. The challenge with that is that, depending on your subscription and the services you have opted for from the cloud provider, bandwidth can be limited, and the file type, data volume, and sheer width of the data all matter: a table can have 10 fields, 100 fields, or 1,000 fields, and moving a 10-field table is very different from moving a 1,000-field table. These are the key aspects to keep in mind when moving from on-prem S/4HANA to the cloud over a network. The preferred approach is to create a separate sandbox that is directly connected to Azure, with the necessary connectors and drivers installed, and have the data read directly from that sandbox into the Azure network or the Azure Data Lake. The advantage this brings is total isolation of the data migration and full availability of both servers, S/4HANA for pulling the data and ADLS for receiving it. This is how you would address performance bottlenecks in real-time data replication; we achieved this in one of my previous organizations when we migrated from on-prem Teradata to Microsoft Azure's cloud stack.
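As a sketch of the landing step from such a sandbox, assuming a Spark session already configured with credentials for an ADLS Gen2 account; the account, container, and path names below are hypothetical.

```python
# Land a staged S/4HANA extract into ADLS Gen2 as partitioned Parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("s4hana-extract-load").getOrCreate()

# Hypothetical extract staged by the sandbox (e.g. as Parquet files).
extract_df = spark.read.parquet("/sandbox/exports/s4hana/fi_documents/")

# Partition by load date so each run lands in its own partition and large
# tables can be moved and validated in manageable chunks.
(extract_df
    .write
    .mode("append")
    .partitionBy("load_date")
    .parquet("abfss://raw@examplelake.dfs.core.windows.net/s4hana/fi_documents/"))
```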
Whenever you are moving data across multiple data warehouses and data lakes, one way to maintain data lineage is the manual method: whenever any change is implemented, appropriate documentation is maintained that gives complete information about the lineage, and that documentation is updated along with the change. If that is not available, you can use the data lineage tools provided by various vendors, which can also connect to Spark if you are on the big data stack. Teradata has a built-in mechanism itself: you can query the DBC dictionary tables and see the whole lineage, including how the data flows and which table is written or used in which script, macro, or stored procedure. When it comes to data lakes, with big data components such as Hadoop, Hive, Spark, or Kafka, there are dedicated tools that help you achieve this; if not, you build data pipelines to process the data and keep track of lineage through those pipelines. In Azure, for example, you have Azure Data Factory, where you configure the steps, what should run, when, and where, and on that basis you can keep track of how the data has been flowing and trace it back when the time comes.
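A hedged sketch of the Teradata dictionary approach, assuming the teradatasql driver and read access to the DBC.TablesV view; the host, credentials, and the object being traced are placeholders.

```python
# Mine Teradata's data dictionary for lineage clues: which objects reference
# a given table in their defining SQL.
import teradatasql

query = """
SELECT  DataBaseName,
        TableName,
        TableKind,          -- T = table, V = view, M = macro, P = procedure
        RequestText         -- the defining SQL text, useful for tracing references
FROM    DBC.TablesV
WHERE   RequestText LIKE '%FACT_SALES%'   -- hypothetical object being traced
"""

with teradatasql.connect(host="td-prod.example.com", user="lineage_ro", password="***") as con:
    with con.cursor() as cur:
        cur.execute(query)
        for database, table, kind, request_text in cur.fetchall():
            print(database, table, kind)
```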
To load incremental data between S/4HANA and a target warehouse, since we are talking about a warehouse, we can implement SCDs, slowly changing dimensions, and maintain the target in SCD Type 2 format: if a record does not exist in the target table it is inserted, and if it already exists and any of the key metrics or parameters within that record have changed, it is updated with history preserved. The other way to do it is to feed the data into the target warehouse using a partitioning mechanism, where the partitions are created by date, daily, or at whatever frequency you intend to load, so that each new dataset lands only in its own partitions. The drawback of the partitioning approach is that if a record that has already been processed changes, you end up duplicating it; you can add a flag of sorts that identifies the latest record in case of duplication. Otherwise, maintain it as SCD Type 2 to avoid any confusion, so that as the data arrives it is either inserted or updated based on the changes.
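Here is a minimal sketch of the insert-or-update step, assuming the target is a Delta Lake table (an assumption, since the source does not name the storage format); paths and key columns are hypothetical. This shows the simplest Type 1 style merge; a full SCD Type 2 would additionally close the old row and insert a new one with effective-date and current-flag columns.

```python
# Incremental upsert into the target table using a MERGE.
from delta.tables import DeltaTable
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-load").getOrCreate()

incoming = spark.read.parquet("abfss://raw@examplelake.dfs.core.windows.net/s4hana/customers/latest/")
target = DeltaTable.forPath(spark, "abfss://curated@examplelake.dfs.core.windows.net/dw/dim_customer/")

(target.alias("t")
    .merge(incoming.alias("s"), "t.customer_id = s.customer_id")
    .whenMatchedUpdateAll()      # existing record with changed attributes -> update
    .whenNotMatchedInsertAll()   # new record -> insert
    .execute())
```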
Since the data has been passed as an array of arrays, more like a nested data dictionary, it would throw an error because it would fail to understand which field maps to what. It is advisable to break them out separately, for example id = [1, 2, 3], name = ['John Doe', 'Jane Doe', 'Mike Brown'], and salary = [12000, 15000, None]. That way it understands what each of these fields is, and it would create the DataFrame without an issue.
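A minimal sketch of the fix described above, assuming pandas; the original question's exact input is not shown here, so the values simply mirror the answer (the same dict-of-lists shape can also be handed to a Spark session via spark.createDataFrame on the pandas DataFrame).

```python
# Build the DataFrame from one flat list per column instead of nested arrays.
import pandas as pd

data = {
    "id": [1, 2, 3],
    "name": ["John Doe", "Jane Doe", "Mike Brown"],
    "salary": [12000, 15000, None],   # the blank salary becomes None/NaN
}

df = pd.DataFrame(data)   # one column per key, equal-length lists -> no error
print(df)
```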
Since a long chain of transformations has been written directly in the return statement, a DataFrame.withColumn chain, when the return tries to hand back the DataFrame it is simultaneously trying to perform all the transformations written in that return expression. The better way is to build the DataFrame before the return statement and then return the final DataFrame; that way all the necessary transformations are done and cached first, and when the return is invoked it simply hands the result back to the calling function, which can then do whatever it has been asked to do. So it is advisable not to put such a long list of transformations in the return statement, and to keep it simple by returning either the DataFrame or the final, fully transformed DataFrame.
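A sketch of that restructuring, assuming PySpark; the column names and transformations are hypothetical stand-ins.

```python
# Build the transformed DataFrame step by step, then return the finished result.
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def enrich_orders(orders: DataFrame) -> DataFrame:
    # Apply the transformations before the return statement.
    enriched = (orders
                .withColumn("amount_usd", F.col("amount") * F.col("fx_rate"))
                .withColumn("load_ts", F.current_timestamp()))
    final_df = enriched.filter(F.col("amount_usd") > 0)
    # Return the finished DataFrame rather than a long chain inside `return`.
    return final_df
```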
Whenever you have to prioritize migration tasks, you first need to understand the entire application and how it has been designed in the SAP system, and you also have to be clear on the final objective of that particular application or flow. If it is a critical financial application whose output goes to, say, exchanges or securities regulators, you first have to identify all the key dependencies leading into it. By key dependencies I mean dependencies on multiple data sources along the data pipeline, as well as dependencies on user inputs, any manual inputs that may exist. So when planning migration tasks for a project involving an SAP module: first, identify all the source systems feeding into that application or data pipeline. Second, identify all the stakeholders responsible for any manual inputs or manual activities performed throughout the flow; it could be something as simple as passing in an exchange rate or currency value, but regardless, identify all the key manual steps involved and what it would take to automate them while migrating to the next platform. Third, break the data pipeline down into critical, medium-critical, and less critical items, meaning all the critical business transformations responsible for your final reports, calculations, and metrics should be separated out and worked on first, while simultaneously seeing whether the low- and medium-critical items can be progressed as well, so that in parallel we cover all the ground we need to.
While you are using Azure Data Factory, the best way to handle errors is, first, by writing custom error handling during the configuration stage, where you define the Azure Data Factory pipelines. Second, whenever there is a job failure, the job fails and we can get alerts; but if it is a non-critical step, or if something can wait while the next steps run, we can pass conditions saying that although this activity has failed, since nothing depends on it the subsequent steps should still execute, and we can address the failure afterwards.
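A hedged sketch of that dependency pattern, expressed as the Data Factory pipeline JSON held in Python dicts; the activity names are hypothetical. Setting the dependency condition to "Completed" (rather than the default "Succeeded") lets the downstream step run whether or not the non-critical activity failed, and a separate branch on "Failed" can raise the alert.

```python
# Downstream copy runs regardless of whether the optional lookup refresh failed.
non_critical_then_continue = {
    "name": "LoadMainFactTable",
    "type": "Copy",
    "dependsOn": [
        {
            "activity": "RefreshOptionalLookup",   # non-critical step
            "dependencyConditions": ["Completed"]  # run on success or failure
        }
    ],
}

# A separate alerting branch hangs off the failure path of the same activity.
alert_on_failure = {
    "name": "SendFailureAlert",
    "type": "WebActivity",
    "dependsOn": [
        {"activity": "RefreshOptionalLookup", "dependencyConditions": ["Failed"]}
    ],
}
```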
When you are trying to balance immediate data availability against resource optimization, the approach should generally favor resource optimization, because it is not just a question of one application or the particular data pipeline in question; it is the entire node or network on which all of the data processing happens. Even if there is scope for immediate data availability, if it consumes a high volume of resources and thereby impacts other teams and applications on the network, it is advisable to focus more on resource optimization. It is possible to optimize resources and still get near-immediate data availability, but you will have to optimize the entire data pipeline and data flow and set the right configurations. For example, if you are joining against a small dataset, you can use a broadcast join so that the smaller table is not shuffled in and out of memory over and over again, but is instead kept on each executor until the processing has completed.
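A minimal sketch of the broadcast-join optimization mentioned above, assuming PySpark; the table names and paths are hypothetical.

```python
# Broadcast the small dimension so the large fact table is not shuffled for the join.
from pyspark.sql import SparkSession
from pyspark.sql.functions import broadcast

spark = SparkSession.builder.appName("broadcast-join-demo").getOrCreate()

transactions = spark.read.parquet("/data/transactions/")     # large fact table
currency_dim = spark.read.parquet("/data/currency_codes/")   # small dimension

# Broadcasting ships the small dimension to every executor once, avoiding a
# full shuffle of the large table during the join.
joined = transactions.join(broadcast(currency_dim), "currency_code")
```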