Vetted Talent

Tarang Gupta

Data Scientist with 5+ years of experience building data-intensive products, from data ingestion through data processing, predictive modeling, visualization, and deployment at scale, to deliver actionable insights and solve complex AI/ML problems.
  • Role

    Machine Learning Engineer III

  • Years of Experience

    7.17 years

Skillsets

  • Boosting
  • Hypothesis testing
  • Jenkins
  • Keras
  • Logistic Regression
  • LSTM
  • Mistral
  • NLTK
  • NumPy
  • pandas
  • RNN
  • SQL
  • Transfer Learning
  • Word2Vec
  • AWS Batch
  • Bagging
  • GRU
  • Cloudera
  • EC2
  • ECS
  • Flask
  • GenAI
  • Lambda
  • LangChain
  • LLM agents
  • OpenAI
  • Oracle
  • S3
  • SciPy
  • Xturing
  • HuggingFace
  • RedHat
  • Recommendation Engines
  • TensorFlow - 6.0 Years
  • PyTorch - 6.0 Years
  • Docker - 6.0 Years
  • BERT
  • GPT
  • LLAMA
  • Neural Networks
  • NLU
  • PySpark
  • Scikit-learn
  • spaCy
  • Jupyter
  • Linear Regression
  • Tree methods
  • clustering
  • Python - 7.0 Years
  • NLG
  • NER
  • POS Tagging
  • Redis
  • PostgreSQL
  • A/B testing
  • CNN
  • conversational AI
  • DNN
  • Gensim
  • Git

Vetted For

16 Skills
  • Machine Learning Engineer, AI/ML, Search & Discovery (Remote), AI Screening
  • 66%
  • Skills assessed: Collaboration, Communication, CI/CD, Data Preprocessing, Deep Learning, Feature Engineering, Model Evaluation, Natural Language Processing, PyTorch, Reinforcement Learning, TensorFlow, Good Team Player, Machine Learning, NLP, Problem Solving Attitude, Python
  • Score: 59/90

Professional Summary

7.17 Years
  • Jan, 2026 - Present 4 months

    Senior Data Scientist

    Walmart Global Tech
  • Jan, 2025 - Jan, 2026 1 yr

    Machine Learning Engineer III

    Procore Technologies
  • May, 2023 - Jan, 2025 1 yr 8 months

    Sr Data Scientist

    IQVIA
  • Nov, 2018 - Aug, 2021 2 yr 9 months

    Asst System Engineer

    Tata Consultancy Services
  • Aug, 2021 - Oct, 2022 1 yr 2 months

    Software Engineer - Machine Learning

    Gartner
  • Oct, 2022 - Feb, 2023 4 months

    Data Scientist II

    Tata 1mg

Applications & Tools Known

  • Python
  • MySQL
  • PostgreSQL
  • Jira
  • LLM
  • LangChain
  • PySpark
  • T-SQL
  • Oracle SQL Developer
  • Docker
  • Git
  • Pandas
  • Jupyter
  • Core Java
  • Scikit-learn
  • Keras
  • NLTK
  • OpenAI
  • AWS Batch
  • EC2
  • ECS
  • S3
  • Lambda
  • SQL
  • Oracle
  • Redis
  • Jenkins
  • PyTorch
  • Gensim

Work History

7.17 Years

Senior Data Scientist

Walmart Global Tech
Jan, 2026 - Present 4 months

Machine Learning Engineer III

Procore Technologies
Jan, 2025 - Jan, 2026 1 yr
    Spearheaded a Proof of Concept (POC) for semantic search, utilizing Databricks Vector Search to enhance information retrieval. Modernized the project's Python environment and dependency management by evaluating tools like Poetry and uv and executing the transition from Poetry to uv for improved performance & efficiency. Contributed to the dataset selection and preparation process for a company type prediction model, aimed at determining customer progression through the onboarding funnel.

Sr Data Scientist

IQVIA
May, 2023 - Jan, 2025 1 yr 8 months
    • Automated report creation for FDA YouTube meetings using GPT-4o, and developed a speech-separation solution that identifies speakers using pyannote and Whisper, achieving an 85% reduction in manual effort.
    • Built a PubMed agent leveraging LangChain and the Chroma vector database to efficiently handle PubMed-related queries.
    • Developed a FastAPI-based middleware API, enabling efficient data retrieval for the UI team and providing real-time results from individual LLM agents.
    • Fine-tuned Mistral and the Llama family of models on the PubMed and MMLU datasets using PEFT techniques such as LoRA and QLoRA, rigorously evaluating their performance for benchmarking purposes.
    • Fine-tuned Mistral, Llama, CodeLlama, and Phind LLMs on the WikiTableQuestions (WTQ) dataset using PEFT techniques to generate Python code.
    • Executed a Proof of Concept (POC) for a virtual doctor use case involving Conversational AI for automated medical consultations, leveraging ChatDoctor consultation data and constructing a RAG pipeline with LangChain and Pinecone.
    • Designed and implemented an XGBoost model to identify patients diagnosed with nephropathy, achieving 12% precision at the 5% recall bin.
    • Designed and implemented a logistic regression model to classify patients diagnosed with stroke, conducted a comparative analysis against a more complex XGBoost model, and presented the results and insights to relevant stakeholders.
    • Applied K-means clustering to categorize patients based on nephropathy progression, assessing optimal clustering through silhouette and elbow curve analyses.
    • Applied the dimensionality reduction techniques PCA and UMAP to visualize patient disease progression data and identify potentially separable classes.
    • Designed and implemented an ETL-based rule engine using PySpark to detect individuals at risk of hip arthroplasty within a healthcare context.

Data Scientist II

Tata 1mg
Oct, 2022 - Feb, 2023 4 months
    Enhanced the API that serves inferences from the MAB RL model to multiple platforms. Improved the data pipeline for capturing widget clicks and impressions according to changing business requirements. Analyzed user behavior on the 1mg homepage to calculate click-through rate (CTR).

Software Engineer - Machine Learning

Gartner
Aug, 2021 - Oct, 2022 1 yr 2 months
    Designed ETL data pipelines to calculate Gartner customers' engagement with assets from multiple data sources such as Oracle, Postgres, and SQL Server, and applied prioritization logic based on that engagement for customer retention. Built CI/CD pipelines using Git and Jenkins for deployment to AWS Batch, EC2, and ECS, and managed cloud infrastructure through Terraform scripts. Supported development of a machine learning model for customer risk analysis using an XGBoost classifier and refactored data science code for scalable deployment with MLOps practices. Performed POCs of big data technologies such as Spark and EMR to check feasibility for business use cases.

Asst System Engineer

Tata Consultancy Services
Nov, 2018 - Aug, 2021 2 yr 9 months
    Supported chatbot development using the RASA framework for customer queries and deployed it on Slack for 24/7 support. Contributed to a stacked deep neural network based hand gesture recognition model for remotely controlling a television, built for a television brand customer. Automated XML code generation for a SOAP web service to update bulk user data in Planview.

Achievements

  • Designed and implemented XGBoost model with 12% precision at 5% recall.
  • Implemented Logistic regression model for stroke disease classification.
  • Applied K-means clustering for patient categorization.
  • Utilized PCA and UMAP for patient data visualization.
  • Developed ETL-based rule engine with PySpark.
  • Spearheaded cross-national dataset analysis for rare disease identification.
  • Executed PoC for virtual doctor using Conversational AI.

Major Projects

2 Projects

PubMed Agent

    Built a PubMed agent leveraging LangChain and Chroma vector database to handle PubMed-related queries effectively.

Virtual Doctor Consultation

    Executed a Conversational AI POC leveraging ChatDoctor consultation data and constructed a RAG pipeline utilizing LangChain and Pinecone.

Education

  • MSc: Artificial Intelligence and Machine Learning

    Liverpool John Moores University (2022)
  • Postgraduate Diploma

    IIIT Bangalore
  • Bachelor of Computer Applications

    Maharishi Markandeshwar University (2018)
  • Data Analysis with Pandas and Python

    Udemy
  • Applied Machine Learning Algorithms

    LinkedIn
  • DeepLearning.AI TensorFlow Developer

    Coursera
  • Generative AI Fundamentals

    Databricks
  • Finetuning Large Language Models

    deeplearning.ai
  • Prompt Engineering for Developers

    deeplearning.ai

Certifications

  • DeepLearning.AI TensorFlow Developer - Coursera
  • Applied Machine Learning Algorithms - LinkedIn
  • Finetuning Large Language Models - deeplearning.ai
  • Generative AI Fundamentals - Databricks
  • Data Analysis with Pandas and Python - Udemy
  • Prompt Engineering for Developers - deeplearning.ai

Interests

  • Technology Research

AI-interview Questions & Answers

    Hello. Hi. I'm Tarang Gupta, and I'm currently working with IQVIA as a senior data scientist, where I'm helping healthcare professionals make better decisions through AI insights for patient wellness. Prior to IQVIA, I worked with companies like Tata 1mg, which is an Indian healthcare startup, Gartner, and Tata Consultancy Services. So I got really good exposure to working with a variety of datasets, ranging from tabular data to natural language processing and computer vision problems, right from Tata Consultancy Services itself. And at that time, I also took a deep dive into the area of artificial intelligence by completing my master's in the field of artificial intelligence itself. Then I joined Gartner, where I worked in the customer analytics area, building multiple data and ML pipelines for customer prioritization and customer risk analysis. I was also a contributor to a document recommendation engine, which is available on the gartner.com website. Then I joined IQVIA. At IQVIA, I'm primarily working with healthcare datasets, which are electronic medical records. With that, we're trying to predict rare diseases in patients, so that doctors can identify them beforehand and save patients. Also, I'm working on the benchmarking of multiple large language models for healthcare datasets. Working with a variety of 7-billion-parameter models, I'm fine-tuning them on a PubMed-based dataset. I'm doing the benchmarking by training multiple models and choosing which one works best for the healthcare dataset. So, yeah, that's all about me and my past experience. I did my undergrad in a bachelor's program in India, at Maharishi Markandeshwar University in Ambala. I also have a degree from Sarimpot, which is in Uttar Pradesh. So, yeah, that's all about me. Thank you.

    What method would you employ to combine predictions of multiple machine learning models? So the method I would choose would be stacking. When we have multiple machine learning models, we can definitely choose stacking as an option, where we leverage the power of the different machine learning models and come up with the best result, with all of these models contributing, and choose the one which has the highest probability. So we actually make the predictions and choose the one which has the maximum score or the maximum probability. Those predictions are the ones chosen. This is what we call stacking of the models.

    Can you improve the training speed of a deep reinforcement learning model without compromising its performance? I think in this case, I should be using batch normalization layers as well, so I can speed up the process.

    I would approach building a neural network model to process multilingual text data by considering sequence modeling with RNNs. I can choose RNN layers such as LSTM or GRU to process this multilingual data. I can play around with the architecture, for example, using 2 LSTMs or 1 GRU, or stacking multiple LSTMs. I can also leverage the large pre-trained models available today and fine-tune them on a particular task.

    Right balance between precision and recall for a classification problem. So for the right balance, I have to plot the precision against the recall curve, and I'm choosing the right threshold to divide my dataset, considering a binary classification problem. There I can plot the recall and precision and see where they intersect. But it's also driven from the business point of view: maybe I'm going to focus more on my recall, not on the precision, or maybe I focus more on the precision, not on the recall. In that case, my approach could vary, or the threshold I choose to classify my data points could vary. So one way to choose is the AUC score, which gives the overall robustness of my model, and then I can simply keep checking my precision and recall at each threshold, or maybe at 10%, 20%, 30%, and so on. Then I see what makes more sense to the business, and accordingly I choose the right threshold and the right balance between precision and recall.
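
    The threshold sweep described in this answer can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the candidate's code: the function names `precision_recall_at` and `pick_threshold`, the toy labels and scores, and the selection rule (maximize precision subject to a recall floor set by the business) are all hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = fp = fn = 0
    for label, score in zip(y_true, scores):
        predicted_positive = score >= threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == 0:
            fp += 1
        elif not predicted_positive and label == 1:
            fn += 1
    # Convention (an assumption): precision is 1.0 when nothing is predicted positive.
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(y_true, scores, min_recall):
    """Return (threshold, precision, recall) with the best precision among
    thresholds whose recall meets the floor, or None if none qualifies."""
    best = None
    for threshold in sorted(set(scores)):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if recall < min_recall:
            continue
        if best is None or precision > best[1]:
            best = (threshold, precision, recall)
    return best


# Toy data, made up for illustration only.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.1, 0.2, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]
print(pick_threshold(y_true, scores, min_recall=0.75))
```

    With a recall floor of 0.75 the sketch selects threshold 0.6 (precision 0.75, recall 0.75); raising the floor to 1.0 pushes the threshold down to 0.4, and precision falls to about 0.67.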

    Convert a trained TensorFlow model to a more mobile-friendly format. Yeah, we can do quantization for that. For a mobile-friendly format, we can first use TensorFlow Lite to build the model, which gives a very lightweight model. We can also convert the model to ONNX format, which is designed for tackling these kinds of problems. The quantized model is very friendly for mobile devices and low-powered devices like edge devices as well. In that case, it works absolutely fine.
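
    To make the quantization step concrete, here is a dependency-free sketch of symmetric int8 quantization. It is an illustration only: the helper names `quantize_int8` and `dequantize` and the sample weights are hypothetical, and real converters such as TensorFlow Lite use more elaborate variants (per-channel scales, zero points, calibration data).

```python
def quantize_int8(weights):
    """Symmetric int8 quantization: map floats to integers in [-127, 127]."""
    max_abs = max((abs(w) for w in weights), default=0.0)
    # One float scale is stored alongside the integers; fall back to 1.0 for all-zero input.
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Made-up weights for illustration.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
quantized, scale = quantize_int8(weights)
restored = dequantize(quantized, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(quantized, max_error)
```

    Each weight now fits in one byte instead of four, and the round-trip error stays within half a quantization step, which is the trade-off that makes the model lighter on mobile and edge devices.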

    We are loading the model here. Okay, then we have a static method for prediction, where we are making a prediction. If the model is None, we raise an exception, "model not found." Else, we return the model's predictions. Okay. This is a static method, first of all, and we're trying to use a class variable in it, which is not possible. Either the method has to be a class method, not a static method, or we simply remove the static method decorator and define the model as self.model, which is available across the whole class. Or I simply make it a class method, because inside a static method I can't just use the class variable. That violates the rules of a static method. So this is the main design problem here.
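
    The code under review is not reproduced in the transcript, so the following is only a guess at its shape, with the class-method fix applied. The names `ModelService`, `load`, and `predict` are hypothetical. For context: a static method receives neither `cls` nor `self`, so it can reach a class attribute only by hard-coding the class name, whereas a class method gets the class passed in.

```python
class ModelService:
    # Class-level attribute shared by every caller; None until a model is loaded.
    model = None

    @classmethod
    def load(cls, model):
        """Store a callable model on the class."""
        cls.model = model

    @classmethod
    def predict(cls, features):
        """Run the loaded model, or fail loudly if none has been loaded."""
        if cls.model is None:
            raise RuntimeError("model not found")
        return cls.model(features)
```

    A caller would run `ModelService.load(some_callable)` once and then `ModelService.predict(features)`; calling `predict` first raises `RuntimeError("model not found")`.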

    The Python function is intended for feature scaling in a machine learning preprocessing pipeline; identify and explain potential issues. So we're scaling a column in the data frame. Okay. We're taking the minimum and maximum of it. We are simply standardizing it manually: column minus minimum value, divided by max minus minimum value. Okay. So it's converted not to a z-score, but to the min-max scale. First of all, this function only works for features which are numerical in nature, that is, continuous features. But I'm not sure how we are using it, because there could be a categorical feature coming in the data frame. In that case, this function won't work and will produce errors, because we can't simply extract a minimum or maximum of a categorical feature. So that's the first point, and since the intent is machine learning preprocessing, I think that's a major issue. Also, in this case, I'm completely converting my data frame: the column is completely changed and converted to min-max scaled values. Okay. The min-max function itself is fine. But we could use the scikit-learn implementation of the min-max scaler as well. That would also work, and it would be more optimized and faster, because it works in a vectorized fashion. So these are the two points I would consider to improve this particular function.
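
    The issues raised in this answer can be illustrated with a small defensive version of min-max scaling. The original function operated on a data frame and is not shown in the transcript; this sketch works on a plain dict of column lists so it runs without dependencies. The name `min_max_scale` and the sample data are hypothetical, and the handling of constant columns (mapped to 0.0) is an added assumption the answer does not mention.

```python
from numbers import Real


def min_max_scale(columns):
    """Return a new dict with numeric columns scaled to [0, 1].

    Non-numeric (categorical) columns are copied unchanged, and the input
    is never modified in place.
    """
    scaled = {}
    for name, values in columns.items():
        numeric = bool(values) and all(
            isinstance(v, Real) and not isinstance(v, bool) for v in values
        )
        if not numeric:
            scaled[name] = list(values)  # leave categorical columns untouched
            continue
        lo, hi = min(values), max(values)
        span = hi - lo
        # A constant column would divide by zero; map it to 0.0 instead.
        scaled[name] = [0.0 if span == 0 else (v - lo) / span for v in values]
    return scaled


# Made-up data for illustration.
data = {"age": [20, 30, 40], "city": ["a", "b", "c"], "const": [5, 5, 5]}
print(min_max_scale(data))
```

    In a real pipeline, scikit-learn's `MinMaxScaler` covers the same ground and also remembers the training minimum and maximum so that the identical transform can be applied to new data.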

    How would you use PyTorch to implement a feature that could perform style transfer between 2 images? To implement a feature that could perform style transfer between 2 images, we might want to implement GANs here. So I think I can implement GANs to do the same.

    On the use of graph neural networks and their potential use. Graph neural networks can be used when we want to build a solution linking multiple documents, or maybe multiple datasets, and we want to learn something from a variety of datasets. Maybe I have 3 or 4 datasets to solve a particular problem, and I want to establish the connections between them and learn from them together. In those kinds of scenarios, I can learn from graph networks.

    How might you apply convolutional neural networks to an unconventional dataset such as audio time series? I can apply them, but I think time series is a sequential problem, so I don't think we can apply CNNs for time series. For time series, we can apply RNNs, but not CNNs. But for an audio kind of problem, when you want to extract some high-level information from the audio, we can definitely leverage CNNs. We can leverage CNNs not only for images, but also for audio and textual data as well. That is when we want to extract some high-level features and then pass them on to RNNs, because audio is also a sequential problem. So we can add initial layers of CNNs to extract the high-level information, then pass it on to sequential modeling.
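
    The convolutional front end mentioned in this answer slides a small kernel along the time axis of the signal. The sketch below shows that operation in plain Python with a hand-picked kernel; the function name `conv1d` and the toy signal are hypothetical, and in a trained network (for example PyTorch's `torch.nn.Conv1d`) the kernel values would be learned rather than fixed.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1D convolution as used in CNN layers (cross-correlation, no padding)."""
    width = len(kernel)
    return [
        sum(signal[start + j] * kernel[j] for j in range(width))
        for start in range(0, len(signal) - width + 1, stride)
    ]


# Toy "audio" frame: silence, a burst, then silence again.
signal = [0, 0, 1, 1, 1, 0, 0]
# A difference kernel that responds to onsets (+1) and offsets (-1).
edges = conv1d(signal, [-1, 1])
print(edges)  # [0, 1, 0, 0, -1, 0]
```

    The output marks where the burst starts and ends, which is the kind of local feature a convolutional layer can extract before a recurrent layer models the longer sequence.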