Vetted Talent

Tarang Gupta

Data Scientist with 5+ years of experience building data-intensive products, from data ingestion through data processing, predictive modeling, visualization, and deployment at scale, to deliver actionable insights and solve complex AI/ML problems.

Role
Machine Learning Engineer III
Years of Experience
7.17 years

Skillsets

Boosting
Hypothesis testing
Jenkins
Keras
Logistic Regression
LSTM
Mistral
NLTK
NumPy
pandas
RNN
SQL
Transfer Learning
Word2Vec
AWS Batch
Bagging
GRU
Cloudera
EC2
ECS
Flask
GenAI
Lambda
LangChain
LLM agents
OpenAI
Oracle
S3
SciPy
Xturing
HuggingFace
RedHat
Recommendation Engines
TensorFlow - 6.0 Years
PyTorch - 6.0 Years
Docker - 6.0 Years
BERT
GPT
LLAMA
Neural Networks
NLU
PySpark
Scikit-learn
spaCy
Jupyter
Linear Regression
Tree methods
clustering
Python - 7.0 Years
NLG
NER
POS Tagging
Redis
PostgreSQL
A/B testing
CNN
Conversational AI
DNN
Gensim
Git

Vetted For

16 Skills

Roles & Skills
Results
Details

Machine Learning Engineer, AI/ML, Search & Discovery (Remote) AI Screening
66%

Skills assessed: Collaboration, Communication, CI/CD, Data preprocessing, Deep Learning, Feature Engineering, Model evaluation, Natural Language Processing, PyTorch, Reinforcement Learning, TensorFlow, Good Team Player, Machine Learning, NLP, Problem Solving Attitude, Python
Score: 59/90

Professional Summary

7.17 Years

Jan, 2026 - Present 4 months
Senior Data Scientist
Walmart Global Tech
Jan, 2025 - Jan, 2026 1 yr
Machine Learning Engineer III
Procore Technologies
May, 2023 - Jan, 2025 1 yr 8 months
Sr Data Scientist
IQVIA
Oct, 2022 - Feb, 2023 4 months
Data Scientist II
Tata 1mg
Aug, 2021 - Oct, 2022 1 yr 2 months
Software Engineer - Machine Learning
Gartner
Nov, 2018 - Aug, 2021 2 yr 9 months
Asst System Engineer
Tata Consultancy Services

Applications & Tools Known

Python
MySQL
PostgreSQL
Jira
LLM
LangChain
PySpark
T-SQL
Oracle SQL Developer
Docker
Git
Pandas
Jupyter
Core Java
Scikit-learn
Keras
NLTK
OpenAI
AWS Batch
EC2
ECS
S3
Lambda
SQL
Oracle
Redis
Jenkins
PyTorch
Gensim

Work History

7.17 Years

Senior Data Scientist

Walmart Global Tech

Jan, 2026 - Present 4 months

Machine Learning Engineer III

Procore Technologies

Jan, 2025 - Jan, 2026 1 yr

Spearheaded a Proof of Concept (POC) for semantic search, using Databricks Vector Search to enhance information retrieval.
Modernized the project's Python environment and dependency management by evaluating tools such as Poetry and uv, then executing the transition from Poetry to uv for improved performance and efficiency.
Contributed to dataset selection and preparation for a company-type prediction model aimed at determining customer progression through the onboarding funnel.

Sr Data Scientist

IQVIA

May, 2023 - Jan, 2025 1 yr 8 months

Automated report creation for FDA meetings published on YouTube using GPT-4o, and implemented speech separation to identify speakers using pyannote and Whisper, achieving an 85% reduction in manual effort.
Built a PubMed agent leveraging LangChain and the Chroma vector database to efficiently handle PubMed-related queries.
Developed a FastAPI-based middleware API, enabling efficient data retrieval for the UI team and providing real-time results from individual LLM agents.
Fine-tuned Mistral and the Llama family of models on the PubMed and MMLU datasets using PEFT techniques such as LoRA and QLoRA, rigorously evaluating their performance for benchmarking purposes.
Fine-tuned Mistral, Llama, CodeLlama, and Phind LLM models on the WikiTableQuestions (WTQ) dataset using PEFT techniques to generate Python code.
Executed a Proof of Concept (POC) for a virtual doctor use case involving Conversational AI for automated medical consultations, leveraging ChatDoctor consultation data and constructing a RAG pipeline with LangChain and Pinecone.
Designed and implemented an XGBoost model to identify patients diagnosed with nephropathy, achieving 12% precision in the 5% recall bin.
Designed and implemented a logistic regression model to classify patients diagnosed with stroke, conducted a comparative analysis against a more complex XGBoost model, and presented the results and insights to relevant stakeholders.
Applied K-means clustering to categorize patients based on nephropathy progression, assessing the optimal number of clusters through silhouette and elbow curve analyses.
Applied the dimensionality reduction techniques PCA and UMAP to visualize patient disease progression data and identify potentially separable classes.
Designed and implemented an ETL-based rule engine using PySpark to detect individuals at risk of hip arthroplasty within a healthcare context.

Data Scientist II

Tata 1mg

Oct, 2022 - Feb, 2023 4 months

Enhanced the API that serves inferences from a multi-armed bandit (MAB) reinforcement learning model to multiple platforms.
Improved the data pipeline for capturing widget clicks and impressions according to changing business requirements.
Analyzed user behavior on the 1mg homepage to calculate click-through rate (CTR).

Software Engineer - Machine Learning

Gartner

Aug, 2021 - Oct, 2022 1 yr 2 months

Designed ETL data pipelines to calculate Gartner customers' engagement with assets from multiple data sources such as Oracle, Postgres, and SQL Server, and applied prioritization logic based on that engagement for customer retention.
Built CI/CD pipelines using Git and Jenkins for deployment to AWS Batch, EC2, and ECS, and managed cloud infrastructure through Terraform scripts.
Supported development of a machine learning model for customer risk analysis using an XGBoost classifier, and refactored data science code for scalable deployment with MLOps practices.
Performed POCs of big data technologies such as Spark and EMR to check feasibility for business use cases.

Asst System Engineer

Tata Consultancy Services

Nov, 2018 - Aug, 2021 2 yr 9 months

Supported chatbot development using the Rasa framework for customer queries, deployed on Slack for 24/7 support.
Contributed to a stacked deep neural network based hand gesture recognition model for remotely controlling a television, built for a television brand customer.
Automated XML code generation for a SOAP web service to update bulk user data in Planview.

Achievements

Designed and implemented an XGBoost model with 12% precision at 5% recall.
Implemented a logistic regression model for stroke classification.
Applied K-means clustering for patient categorization.
Utilized PCA and UMAP for patient data visualization.
Developed an ETL-based rule engine with PySpark.
Spearheaded cross-national dataset analysis for rare disease identification.
Executed PoC for virtual doctor using Conversational AI.

Major Projects

2 Projects

PubMed Agent

Built a PubMed agent leveraging LangChain and Chroma vector database to handle PubMed-related queries effectively.

Virtual Doctor Consultation

Executed a Conversational AI POC leveraging ChatDoctor consultation data and constructed a RAG pipeline utilizing LangChain and Pinecone.

Education

MSc: Artificial Intelligence and Machine Learning
Liverpool John Moores University (2022)
Postgraduate Diploma
IIIT Bangalore
Bachelor of Computer Applications
Maharishi Markandeshwar University (2018)
Data Analysis with Pandas and Python
Udemy
Applied Machine Learning Algorithms
LinkedIn
DeepLearning.AI TensorFlow Developer
Coursera
Generative AI Fundamentals
Databricks
Finetuning Large Language Models
DeepLearning.AI
Prompt Engineering for Developers
DeepLearning.AI

Certifications

DeepLearning.AI TensorFlow Developer - Coursera
Applied Machine Learning Algorithms - LinkedIn
Finetuning Large Language Models - DeepLearning.AI
Generative AI Fundamentals - Databricks
Data Analysis with Pandas and Python - Udemy
Prompt Engineering for Developers - DeepLearning.AI

Interests

Technology Research

AI-interview Questions & Answers

Hello. Hi. I'm Tarang Gupta, and I'm currently working with IQVIA as a senior data scientist, where I'm helping healthcare professionals make better decisions through AI insights for patient wellness. So, prior to IQVIA, I worked with companies like Tata 1mg, which is an Indian healthcare startup, and Gartner and Tata Consultancy Services. So I got really good exposure to working with a variety of datasets, ranging from tabular data to natural language processing or computer vision kinds of problems, right from Tata Consultancy Services itself. And at that time, I also took a deep dive into the area of artificial intelligence by completing my master's in the field of artificial intelligence itself. Then I joined Gartner, where I worked in the customer analytics area, building multiple data and ML pipelines for customer prioritization and customer risk analysis. I was also a contributor to a document recommendation engine, which is available on the gartner.com website. Then I joined IQVIA. In IQVIA, I'm primarily working with healthcare datasets, which are electronic medical records. And with that, we're trying to predict rare diseases in patients, so that doctors can identify them beforehand and save patients. Also, I'm working on the benchmarking of multiple large language models for healthcare datasets. Working with 7-billion-parameter models, I'm trying to fine-tune on the dataset, which is a PubMed-based dataset. I'm trying to do benchmarking by training multiple models and choosing which one works best for the healthcare dataset. So, yeah, that's all about me and my past experience. I've done my undergrad from a bachelor's program in India, which is Maharishi Markandeshwar University in Ambala. I also have a degree from Sarimpot, which is in Uttar Pradesh. So, yeah, that's all about me. Thank you.

What method would you employ to combine predictions of multiple machine learning models? So the method I would choose would be stacking. So when we have multiple machine learning models, we can definitely choose stacking as an option, where we leverage the power of different machine learning models and come up with the best prediction, with all these models having contributed, and choose the one which has the highest probability. So we actually make the predictions and choose the one which has the maximum score or the maximum probability. Those predictions are the ones chosen. This is what we call stacking of the models.

Can you improve the training speed of a deep reinforcement learning model without compromising its performance? I think in this case, I should be using batch normalization layers as well, so I can speed up the process.

I would approach building a neural network model to process multilingual text data by considering sequence modeling with RNNs. I can choose some RNNs, such as LSTM or GRU layers, to process this multilingual data. I can play around with the architecture, for example, using 2 LSTMs or 1 GRU, or stacking multiple LSTMs. I can also leverage large pre-trained models today and fine-tune them on a particular task.

Right balance between precision and recall for a classification problem. So for the right balance, I have to plot the precision-recall curve, and I'm choosing the right threshold to divide my dataset, considering a binary classification problem. So there I can plot the recall and precision and see where they intersect. But it's also driven from the business point of view: maybe I'm going to focus more on my recall, not on the precision. Or maybe I focus more on the precision, not on the recall. In that case, my approach could vary, or the threshold which I choose to classify my data points can vary. So one way to start is the AUC score, which gives the overall robustness of my model, and then I can simply keep checking my precision and recall at each threshold, maybe at 10%, 20%, 30%, like that. And then I see what makes more sense to the business, and accordingly, I choose the right threshold and the right balance between precision and recall.
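The threshold sweep described in this answer can be sketched in plain Python. This is only an illustration under stated assumptions: the labels, scores, candidate thresholds, and the `min_recall` business constraint are all hypothetical, and the helper names `precision_recall_at` and `pick_threshold` are invented for this sketch rather than taken from the interview.

```python
def precision_recall_at(y_true, scores, threshold):
    """Return (precision, recall) when scores >= threshold are labelled positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, y_true) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, y_true) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, y_true) if not p and y == 1)
    # Conventions for empty denominators: no positive predictions -> precision 1.0,
    # no positive labels -> recall 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(y_true, scores, min_recall, candidates):
    """Among thresholds that meet min_recall, return the first with the best precision."""
    best = None
    for t in candidates:
        p, r = precision_recall_at(y_true, scores, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best


# Hypothetical labels and model scores, for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.8, 0.9]
candidates = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

# Business constraint assumed here: recall must be at least 75%.
best = pick_threshold(y_true, scores, min_recall=0.75, candidates=candidates)
print(best)  # (threshold, precision, recall)
```

Swapping the constraint (a minimum precision, then maximizing recall) gives the opposite business preference mentioned in the answer.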

Convert a trained TensorFlow model to a more mobile-friendly format. Yeah, we can do quantization for the same. First, for the mobile-friendly format, we can use TensorFlow Lite to build the model. This will actually be a very lightweight model. We can also convert the model to ONNX format, which is designed for tackling these kinds of problems. The quantized model is very friendly for mobile devices and low-power devices like edge devices as well. In that case, it works absolutely fine to improve the model.

We have loaded the model. We are loading the model. Okay? We have the static method. This is here. Prediction, we are making a prediction. If the model is None, then we raise an exception, "model not found." Else, we return the model's predictions. Okay. So this is a static method, first of all, and we're trying to use a class variable, which is not possible. Either the method has to be a class method, not a static method, or we simply remove the static method decorator here and define the model as self.model, which is available across the whole class. I can use that one. Or I simply make it a class method, because in a static method I can't just use the class variable. That violates the rules of a static method. So, this is the main problem here in the design.
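The code under review is not part of the transcript, so the following is only a guess at its shape: a hypothetical `Predictor` class whose prediction method needs a class-level `model` attribute. A static method receives neither the instance nor the class as an argument, so it can reach the attribute only by naming the class explicitly; a class method, as suggested in the answer, gets the class passed in as `cls`.

```python
class ModelNotFoundError(Exception):
    """Raised when predict is called before a model has been loaded."""


class Predictor:
    # Class-level attribute shared by all callers (hypothetical name).
    model = None

    @classmethod
    def load(cls, model):
        # cls is the class itself, so class attributes can be set through it.
        cls.model = model

    @classmethod
    def predict(cls, features):
        if cls.model is None:
            raise ModelNotFoundError("model not found")
        return cls.model(features)


# Stand-in "model": any callable works for this illustration.
Predictor.load(lambda xs: [x * 2 for x in xs])
result = Predictor.predict([1, 2, 3])
print(result)
```

The other fix mentioned in the answer, dropping the decorator and storing the model on `self.model`, is the better fit when different instances need different models.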

The Python function is intended for feature scaling in a machine learning preprocessing pipeline; identify and explain potential issues. So we're just taking the data frame and a column. Okay. We're taking the minimum and maximum of it. Okay. We are simply normalizing it manually. We are doing column minus minimum value divided by the max minus minimum value. Okay. So it's converted not to a z-score, but to the min-max scale. First of all, this function only works for features which are numerical in nature, that is, continuous features. But I'm not sure how we are using it, because there could be a possibility that in the data frame there's a categorical feature coming. In that case, this function won't work and will produce errors, because we can't simply take the minimum or maximum of a categorical feature. So that's the first point. And since the intent is machine learning preprocessing, I think that's a major issue. Also, in this case, I'm modifying my data frame in place: the column is completely overwritten with the min-max scaled values. Okay. The min-max logic itself is fine. But we could use the scikit-learn implementation of the min-max scaler as well. That would also work, and it would be more optimized and faster, because it works in a vectorized fashion. So these are the two points I would consider to improve this particular function.
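The function being reviewed is not shown in the transcript. As a rough sketch of the concerns raised, the version below uses a plain list as a stand-in for a DataFrame column, rejects non-numeric (categorical) input, and returns a new list instead of overwriting the original. The constant-column guard is an extra assumption not mentioned in the answer, and the name `min_max_scale` is hypothetical.

```python
from numbers import Real


def min_max_scale(values):
    """Scale numeric values to [0, 1]; a plain list stands in for a DataFrame column."""
    if not values:
        return []
    # Reject categorical or otherwise non-numeric input up front.
    # bool is excluded explicitly because it is a subclass of int.
    if any(isinstance(v, bool) or not isinstance(v, Real) for v in values):
        raise TypeError("min_max_scale expects numeric values only")
    lo, hi = min(values), max(values)
    if hi == lo:
        # Constant column: avoid dividing by zero.
        return [0.0 for _ in values]
    # Build a new list so the caller's data is left untouched.
    return [(v - lo) / (hi - lo) for v in values]


column = [10, 20, 30, 50]
scaled = min_max_scale(column)
print(scaled)
```

In a real pipeline, a library scaler that is fitted on training data and then reused on new data would usually be preferable, as the answer suggests with scikit-learn.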

How would you use PyTorch to implement a feature that could perform style transfer between two images? We might want to implement GANs here. So I think I can implement GANs to do the same.

On the use of graph neural networks and their potential applications. Graph neural networks can be used when we want to build a solution linking multiple documents, or maybe multiple datasets, and we want to learn something from a variety of datasets. Maybe I have 3 or 4 datasets to solve a particular problem, and I want to establish the connections between them for learning. In those kinds of scenarios, I can use graph neural networks.

How might you apply convolutional neural networks to an unconventional dataset such as audio time series? I can apply them, but I think time series is a sequential problem. So I don't think we can apply CNNs for time series. For time series, we can apply RNNs, but not CNNs. But for an audio kind of problem, when you want to extract some high-level information from the audio, we can definitely leverage CNNs. We can leverage CNNs not only for images, but also for audio and textual data as well. That is when we want to extract some high-level features and then pass them on to RNNs, because audio is also a sequential problem. So we can add initial layers of CNNs to extract the high-level information, then pass it on to sequential modeling.
