Vetted Talent

Tarang Gupta

Data Scientist with 5+ years of experience building data-intensive products, from data ingestion and processing through predictive modeling, visualization, and deployment at scale, to deliver actionable insights and solve complex AI/ML problems.
  • Role

    Machine Learning Engineer III

  • Years of Experience

    7 years

Skillsets

  • Boosting
  • Hypothesis testing
  • Jenkins
  • Keras
  • Logistic Regression
  • LSTM
  • Mistral
  • NLTK
  • NumPy
  • pandas
  • RNN
  • SQL
  • Transfer Learning
  • Word2Vec
  • AWS Batch
  • Bagging
  • GRU
  • Cloudera
  • EC2
  • ECS
  • Flask
  • GenAI
  • Lambda
  • LangChain
  • LLM agents
  • OpenAI
  • Oracle
  • S3
  • SciPy
  • xTuring
  • HuggingFace
  • RedHat
  • Recommendation Engines
  • TensorFlow - 6.0 Years
  • PyTorch - 6.0 Years
  • Docker - 6.0 Years
  • BERT
  • GPT
  • LLAMA
  • Neural Networks
  • NLU
  • PySpark
  • Scikit-learn
  • spaCy
  • Jupyter
  • Linear Regression
  • Tree methods
  • Clustering
  • Python - 7.0 Years
  • NLG
  • NER
  • POS Tagging
  • Redis
  • PostgreSQL
  • A/B testing
  • CNN
  • Conversational AI
  • DNN
  • Gensim
  • Git

Vetted For

16 Skills
  • Machine Learning Engineer, AI/ML, Search & Discovery (Remote) - AI Screening
  • 66%
  • Skills assessed: Collaboration, Communication, CI/CD, Data preprocessing, Deep Learning, Feature Engineering, Model evaluation, Natural Language Processing, PyTorch, Reinforcement Learning, TensorFlow, Good Team Player, Machine Learning, NLP, Problem Solving Attitude, Python
  • Score: 59/90

Professional Summary

7 Years
  • Jan, 2025 - Present (1 yr 2 months)

    Machine Learning Engineer III

    Procore Technologies
  • May, 2023 - Jan, 2025 (1 yr 8 months)

    Sr AI Engineer

    IQVIA
  • Oct, 2022 - Apr, 2023 (6 months)

    Data Scientist II

    Tata 1mg
  • Aug, 2021 - Oct, 2022 (1 yr 2 months)

    Software Engineer - Machine Learning

    Gartner Inc.
  • Nov, 2018 - Aug, 2021 (2 yr 9 months)

    Asst System Engineer

    Tata Consultancy Services

Applications & Tools Known

  • Python
  • MySQL
  • PostgreSQL
  • Jira
  • LLM
  • LangChain
  • PySpark
  • T-SQL
  • Oracle SQL Developer
  • Docker
  • Git
  • pandas
  • Jupyter
  • Core Java
  • Scikit-learn
  • Keras
  • NLTK
  • OpenAI
  • AWS Batch
  • EC2
  • ECS
  • S3
  • Lambda
  • SQL
  • Oracle
  • Redis
  • Jenkins
  • PyTorch
  • Gensim

Work History

7 Years

Machine Learning Engineer III

Procore Technologies
Jan, 2025 - Present (1 yr 2 months)
    Spearheaded a Proof of Concept (POC) for semantic search, utilizing Databricks Vector Search to enhance information retrieval. Modernized the project's Python environment and dependency management by evaluating tools like Poetry and uv and executing the transition from Poetry to uv for improved performance & efficiency. Contributed to the dataset selection and preparation process for a company type prediction model, aimed at determining customer progression through the onboarding funnel.

Sr AI Engineer

IQVIA
May, 2023 - Jan, 2025 (1 yr 8 months)
    Automated report creation for FDA YouTube meetings using GPT-4o, and developed and implemented a speech separation solution to identify speakers using pyannote and Whisper, achieving an 85% reduction in manual effort. Built a PubMed agent leveraging LangChain and the Chroma vector database to efficiently handle PubMed-related queries. Developed a FastAPI-based middleware API, enabling efficient data retrieval for the UI team and providing real-time results from individual LLM agents. Fine-tuned Mistral and the Llama family of models on the PubMed and MMLU datasets using PEFT techniques like LoRA and QLoRA, rigorously evaluating their performance for benchmarking purposes. Fine-tuned Mistral, Llama, CodeLlama, and Phind LLM models on the WikiTableQuestions (WTQ) dataset using PEFT techniques to generate Python code. Executed a Proof of Concept (POC) for a virtual doctor use case involving Conversational AI for automated medical consultations, leveraging ChatDoctor consultation data and constructing a RAG pipeline with LangChain and Pinecone. Designed and implemented an XGBoost model to identify patients diagnosed with nephropathy, achieving 12% precision at the 5% recall bin. Designed and implemented a logistic regression model to classify patients diagnosed with stroke, compared it against a more complex XGBoost model, and presented the results and insights to relevant stakeholders. Applied K-means clustering to categorize patients based on nephropathy progression, assessing the optimal number of clusters through silhouette and elbow curve analyses. Applied the dimensionality reduction techniques PCA and UMAP to visualize patient disease progression data and identify potentially separable classes. Designed and implemented an ETL-based rule engine using PySpark to detect individuals at risk of hip arthroplasty within a healthcare context.

Data Scientist II

Tata 1mg
Oct, 2022 - Apr, 2023 (6 months)
    Enhanced the API that serves inferences from the MAB RL model to multiple platforms. Improved the data pipeline for capturing widget clicks and impressions according to changing business requirements. Analyzed user behavior on the 1mg homepage to calculate click-through rate (CTR).

Software Engineer - Machine Learning

Gartner Inc.
Aug, 2021 - Oct, 2022 (1 yr 2 months)
    Designed ETL data pipelines to calculate Gartner customers' engagement with assets from multiple data sources such as Oracle, Postgres, and SQL Server, and applied prioritization logic based on that engagement for customer retention. Built CI/CD pipelines using Git and Jenkins for deployment on AWS Batch, EC2, and ECS, and managed cloud infrastructure through Terraform scripts. Supported development of a machine learning model for customer risk analysis using an XGBoost classifier and refactored data science code for scalable deployment with MLOps practices. Performed POCs of big data technologies like Spark and EMR to check feasibility for business use cases.

Asst System Engineer

Tata Consultancy Services
Nov, 2018 - Aug, 2021 (2 yr 9 months)
    Supported chatbot development using the RASA framework for customer queries and deployed it on Slack for 24/7 support. Contributed to a stacked deep neural network based hand gesture recognition model to remotely control a television for a television brand customer. Automated XML code generation for a SOAP web service to update bulk user data in Planview.

Achievements

  • Designed and implemented an XGBoost model with 12% precision at 5% recall.
  • Implemented a logistic regression model for stroke disease classification.
  • Applied K-means clustering for patient categorization.
  • Utilized PCA and UMAP for patient data visualization.
  • Developed an ETL-based rule engine with PySpark.
  • Spearheaded cross-national datasets analysis for rare diseases identification.
  • Executed PoC for virtual doctor using Conversational AI.

Major Projects

2 Projects

PubMed Agent

    Built a PubMed agent leveraging LangChain and Chroma vector database to handle PubMed-related queries effectively.

Virtual Doctor Consultation

    Executed a Conversational AI POC leveraging ChatDoctor consultation data and constructed a RAG pipeline utilizing LangChain and Pinecone.

Education

  • MSc: Artificial Intelligence and Machine Learning

    Liverpool John Moores University (2022)
  • Postgraduate Diploma

    IIIT Bangalore
  • Bachelor of Computer Applications

    Maharishi Markandeshwar University (2018)
  • Data Analysis with Pandas and Python

    Udemy
  • Applied Machine Learning Algorithms

    LinkedIn
  • DeepLearning.AI TensorFlow Developer

    Coursera
  • Generative AI Fundamentals

    Databricks
  • Finetuning Large Language Models

    deeplearning.ai
  • Prompt Engineering for Developers

    deeplearning.ai

Certifications

  • DeepLearning.AI TensorFlow Developer - Coursera
  • Applied Machine Learning Algorithms - LinkedIn
  • Finetuning Large Language Models - deeplearning.ai
  • Generative AI Fundamentals - Databricks
  • Data Analysis with Pandas and Python - Udemy
  • Prompt Engineering for Developers - deeplearning.ai

Interests

  • Technology Research

AI-interview Questions & Answers

    Hello. Hi. I'm Tarang Gupta, and I'm currently working with IQVIA as a senior data scientist, where I'm helping health care professionals make better decisions through AI insights for patient wellness. Prior to IQVIA, I worked at companies like Tata 1mg, which is an Indian health care start-up, Gartner, and Tata Consultancy Services. So I got really good exposure to working with a variety of datasets, ranging from tabular data to natural language processing and computer vision problems, right from Tata Consultancy Services itself. At that time, to dive deeper into the field of artificial intelligence, I also enrolled in and completed my master's in artificial intelligence. Then I joined Gartner, where I worked in the customer analytics area, building multiple data and ML pipelines for customer prioritization and customer risk analysis, and on a recommendation engine, which is a document recommendation engine for the gartner.com website. Then I joined IQVIA. At IQVIA, I'm primarily working with health care data, which is electronic medical records. With that, we are trying to predict rare diseases in patients so that doctors can identify them beforehand and we can help the patient earlier. I'm also working on natural language processing: we're benchmarking multiple large language models on health care datasets only. We are looking at 7-billion-parameter models, and I'm fine-tuning them on a PubMed-based dataset. I'm doing the benchmarking by training multiple models and choosing the one that works best for the health care dataset. So, yeah, that's all about me and my past experience. As for academics, I did my undergrad, a bachelor's, in India, at Maharishi Markandeshwar University in Ambala. I hail from Sarimpot, which is in Uttar Pradesh. So, yeah, that's all about me. Thank you.

    What method would you employ to combine predictions of multiple machine learning models? The method I would choose would be stacking. When we have multiple machine learning models, we can definitely choose stacking as an option, where we leverage the power of different machine learning models, let all of them contribute, and choose the prediction that has the highest probability. So say I have one decision tree, one random forest, one XGBoost, and then one logistic regression as well. Out of these four, I make the predictions and choose the one that has the maximum score or the maximum probability. Those predictions are the ones chosen. This is what we call stacking of machine learning models.
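
    For reference, stacking as it is usually defined feeds the base models' predictions into a second-level meta-learner instead of picking the single highest-probability output. The sketch below illustrates that variant in plain Python. The two base scorers, the toy held-out data, and every function name are hypothetical stand-ins, not anything taken from the interview.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical, already-fitted base models: each one scores a single feature.
def base_model_a(x):
    return sigmoid(4.0 * (x[0] - 0.5))

def base_model_b(x):
    return sigmoid(4.0 * (x[1] - 0.5))

BASE_MODELS = [base_model_a, base_model_b]

def base_predictions(x):
    """Base-model probabilities become the meta-learner's input features."""
    return [model(x) for model in BASE_MODELS]

def fit_meta_learner(meta_features, labels, lr=0.5, epochs=500):
    """Logistic regression fitted with batch gradient descent."""
    weights = [0.0] * len(meta_features[0])
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for feats, y in zip(meta_features, labels):
            err = sigmoid(bias + sum(w * f for w, f in zip(weights, feats))) - y
            grad_w = [g + err * f for g, f in zip(grad_w, feats)]
            grad_b += err
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def stacked_predict(x, weights, bias):
    """Final probability: the meta-learner applied to the base predictions."""
    feats = base_predictions(x)
    return sigmoid(bias + sum(w * f for w, f in zip(weights, feats)))

# Toy held-out data (hypothetical), used only to fit the meta-learner.
X_holdout = [(0.1, 0.2), (0.2, 0.1), (0.3, 0.4), (0.8, 0.9), (0.9, 0.7), (0.7, 0.8)]
y_holdout = [0, 0, 0, 1, 1, 1]

weights, bias = fit_meta_learner([base_predictions(x) for x in X_holdout], y_holdout)
```

    In practice the meta-learner is normally fitted on out-of-fold predictions so that it never sees base-model outputs for rows those models were trained on; scikit-learn's `StackingClassifier` handles that bookkeeping.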

    Can you improve the training speed of a deep reinforcement learning model without compromising its performance? Speed. In this case, I think I should be using batch normalization layers as well, so I can speed up the process.

    How do you approach building a neural network model to process multilingual text data? I think in this case the multilingual data would be a sequence-to-sequence modeling problem. I can choose some RNNs, recurrent neural networks; I can go with LSTM or GRU layers. I can play around with the architecture: maybe two LSTMs or one GRU would make sense, or maybe a stack of multiple LSTMs. So, yeah, my approach would be to do sequence modeling using RNNs. I can also leverage today's large language models, the pre-trained models, and maybe fine-tune one on a particular task. That is also something I can think of.

    Right balance between precision and recall for a classification problem. For the right balance, I have to plot the precision-recall curve. Choosing the right balance between precision and recall is actually choosing the right threshold to divide my dataset, considering a binary classification problem. So there I can plot recall and precision and see where they intersect. But it is also driven by the business point of view: maybe I'm going to focus more on my recall and not on the precision, or maybe I focus more on the precision and not on the recall. In that case, my approach could vary, and the threshold I choose to classify my data points can also vary. One way to choose is the AUC score, which gives the overall robustness of my model, and then I can simply keep checking my precision and recall at each threshold. Or maybe I take 10% as the threshold and divide my dataset, then 20%, 30%, and so on. Then I see what makes more sense to the business, and accordingly I choose the right threshold and the right balance between precision and recall.
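
    The threshold sweep described in this answer can be sketched in a few lines of plain Python. This is a minimal illustration only: the scores, labels, and the `min_recall` business floor are made-up values, and the function names are hypothetical.

```python
def precision_recall_at(scores, labels, threshold):
    """Precision and recall when everything scoring >= threshold is predicted positive."""
    predicted = [s >= threshold for s in scores]
    tp = sum(1 for p, y in zip(predicted, labels) if p and y == 1)
    fp = sum(1 for p, y in zip(predicted, labels) if p and y == 0)
    fn = sum(1 for p, y in zip(predicted, labels) if not p and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def pick_threshold(scores, labels, min_recall):
    """Return (threshold, precision, recall) with the best precision whose
    recall still meets the business-driven floor, or None if none qualifies."""
    best = None
    for t in sorted(set(scores)):
        p, r = precision_recall_at(scores, labels, t)
        if r >= min_recall and (best is None or p > best[1]):
            best = (t, p, r)
    return best

# Toy model scores and ground-truth labels (hypothetical).
scores = [0.1, 0.3, 0.4, 0.6, 0.8, 0.9]
labels = [0, 0, 1, 0, 1, 1]

recall_first = pick_threshold(scores, labels, min_recall=0.9)     # recall-focused
precision_first = pick_threshold(scores, labels, min_recall=0.6)  # precision-focused
```

    With a recall floor of 0.9 the sweep settles on the 0.4 threshold; relaxing the floor to 0.6 moves it up to 0.8, trading recall for precision, which is the business-driven choice the answer refers to.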

    Convert a trained TensorFlow model to a more mobile-friendly format. Yeah. We can do quantization for the same. First of all, for a mobile-friendly format, we can use TensorFlow Lite to build the model. This will be a very lightweight model. We can also convert the model to ONNX format, which is designed to tackle these kinds of problems, and use a quantized model, which is very friendly for mobile and low-level devices, or maybe edge devices as well. In that case, those work absolutely fine.
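
    To make the quantization idea concrete, here is a minimal sketch of affine (scale and zero-point) 8-bit quantization in plain Python. It illustrates the arithmetic only and does not use the TensorFlow Lite or ONNX toolchains; the function names and the sample weights are hypothetical.

```python
def quantize(values, num_bits=8):
    """Map floats to unsigned integers using a scale and a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo = min(min(values), 0.0)  # the range must include 0.0 so it maps exactly
    hi = max(max(values), 0.0)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # fall back to 1.0 for all-zero input
    zero_point = round(qmin - lo / scale)
    quantized = [
        max(qmin, min(qmax, round(v / scale) + zero_point)) for v in values
    ]
    return quantized, scale, zero_point

def dequantize(quantized, scale, zero_point):
    """Recover approximate floats from the integer representation."""
    return [(q - zero_point) * scale for q in quantized]

# Hypothetical layer weights.
weights = [-1.0, -0.5, 0.0, 0.25, 1.0]
q_weights, scale, zero_point = quantize(weights)
restored = dequantize(q_weights, scale, zero_point)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
```

    Each weight now fits in one byte instead of four, at the cost of a rounding error on the order of `scale`; that size reduction is the reason quantized models suit mobile and edge devices.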

    Part of a machine learning prediction service. What is the potential design problem here? How could you address it? Prediction service. Model is None. There is a static method where we load the model. Okay. Then we have the static method for prediction: if the model is None, then we have an exception, model not found; else, return the model predictions. Okay. This is a static method, first of all, and we're trying to use a class variable in it, which is not possible. So either it has to be a class method, not a static method, or we simply remove the static method decorator and define the model as self.model, which is available across the whole class. I can use that one. Or I simply make it a class method, because I cannot have a static method that uses the class variable. That violates the rules of a static method. So this is the main problem in the design.
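
    The code under review is not part of the transcript, so the sketch below is a hypothetical reconstruction of the pattern described, with made-up class names and a stand-in model. It shows why a bare `model` inside a static method fails and how a class method addresses it.

```python
class BrokenPredictionService:
    """Hypothetical reconstruction of the pattern the answer criticizes."""

    model = None

    @staticmethod
    def predict(features):
        # A bare `model` is resolved as a local or global name, never as a
        # class attribute, so this line raises NameError at call time.
        if model is None:
            raise RuntimeError("model not found")
        return model(features)


class PredictionService:
    """Same service with class methods, so `cls.model` is reachable."""

    model = None  # class-level cache shared by all callers

    @classmethod
    def load_model(cls):
        # Stand-in for real deserialization: a trivial callable "model".
        if cls.model is None:
            cls.model = lambda features: sum(features) > 1.0
        return cls.model

    @classmethod
    def predict(cls, features):
        return cls.load_model()(features)
```

    A static method could still reach the attribute by spelling out `BrokenPredictionService.model`, but that hard-codes the class name and ignores subclasses, which is why a class method is the tidier fix.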

    The Python function is intended for feature scaling in a machine learning preprocessing pipeline. Identify and explain potential issues. So it's feature scaling: we take the data frame and the column. Okay? We're doing min-max scaling of it. We are simply normalizing it manually: column minus the minimum value, divided by the max minus the minimum value. So it's converted not to a z-score but to min-max scaled values. Okay. First of all, this function is only for features that are numerical in nature, that is, continuous features. But I'm not sure how we are using it, because there could be a possibility that categorical features are coming in the data frame. In that case, this function won't work and will produce errors, because at no point can we extract a minimum or maximum of a categorical feature. So that is the first point, and I think that's a major issue. Also, in this case, I'm completely overwriting my data frame: the column is replaced with the min-max scaled values. The min-max logic itself is fine. But we could leverage the scikit-learn implementation of the min-max scaler as well. That would also work, and it would be more optimized and faster, because it works in a vectorized fashion. So these are the two points I would consider to improve this particular function.
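
    The function under review is not shown in the transcript. As a rough illustration of the issues raised, the sketch below separates fitting from transforming, rejects non-numeric columns, and leaves the input untouched. It uses plain lists rather than a data frame, and every name in it is hypothetical. The zero-range guard is an extra point not mentioned in the answer.

```python
def fit_min_max(values):
    """Learn the scaling range from training data; reject non-numeric columns."""
    for v in values:
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"min-max scaling needs numeric values, got {v!r}")
    return min(values), max(values)

def transform_min_max(values, lo, hi):
    """Return a new scaled list; the input list is left untouched."""
    span = hi - lo
    if span == 0:
        # Constant column: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

train_column = [10, 20, 30]
lo, hi = fit_min_max(train_column)
scaled_train = transform_min_max(train_column, lo, hi)
scaled_new = transform_min_max([40], lo, hi)  # reuses the training range
```

    Keeping the learned `lo` and `hi` around mirrors the fit/transform split of scikit-learn's `MinMaxScaler`, so unseen data is scaled with the training range rather than its own.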

    How would you use PyTorch to implement a feature that could perform style transfer between two images? Implement a feature that could perform style transfer between two images. I might want to implement GANs here. So I think I can implement GANs to do the same.

    Use of graph neural networks and potential use cases. Graph neural networks can be used when we want to establish a link between multiple documents, or maybe multiple datasets, and you want to learn something from a variety of datasets. Maybe I have three or four datasets to solve a particular problem, and I want to establish connections for learning across them. In those kinds of scenarios, I can learn using graph networks.
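
    As a small illustration of how a graph network propagates information between linked items, the sketch below runs one round of mean-aggregation message passing over a toy document graph. The node names, features, and edges are hypothetical, and a real GNN layer would add learned weights and a nonlinearity on top of this step.

```python
def mean_aggregate(features, edges):
    """One message-passing round: each node averages itself and its neighbours."""
    neighbours = {node: {node} for node in features}
    for u, v in edges:
        neighbours[u].add(v)
        neighbours[v].add(u)
    updated = {}
    for node, group in neighbours.items():
        dims = len(features[node])
        updated[node] = [
            sum(features[other][d] for other in group) / len(group)
            for d in range(dims)
        ]
    return updated

# Three linked documents with two-dimensional toy features (hypothetical).
features = {"doc_a": [1.0, 0.0], "doc_b": [0.0, 1.0], "doc_c": [1.0, 1.0]}
edges = [("doc_a", "doc_b"), ("doc_b", "doc_c")]
embeddings = mean_aggregate(features, edges)
```

    After one round, `doc_a` already carries information from `doc_b`; stacking further rounds lets information flow across longer paths in the graph.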

    How might you apply convolutional neural networks to an unconventional dataset such as audio time series? I can apply them, but I think time series is a sequential problem, so I don't think we can apply CNNs for time series. For time series, we can apply RNNs, but not CNNs. But for an audio kind of problem, when you want to extract some high-level information from the audio, we can definitely leverage CNNs. We can leverage CNNs not only for images but for audio and textual data as well. That is when we want to extract some high-level features and then pass them on to RNNs, because audio is also a sequential problem. So we can add CNNs as the initial layers to extract the high-level information, then pass it on to sequential modeling.
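
    For context, one-dimensional convolutions are commonly applied directly along the time axis of audio and other sequences. The sketch below shows the core operation in plain Python on a toy waveform, with no deep learning library involved. The kernel, signal, and function names are hypothetical, and a trained network would learn its kernels instead of using a fixed edge detector.

```python
def conv1d(signal, kernel, stride=1):
    """Valid 1-D cross-correlation: slide the kernel along the time axis."""
    width = len(kernel)
    return [
        sum(signal[start + k] * kernel[k] for k in range(width))
        for start in range(0, len(signal) - width + 1, stride)
    ]

def max_pool1d(values, size):
    """Keep the strongest response in each non-overlapping window."""
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]

# Toy waveform with one burst of energy (hypothetical).
waveform = [0.0, 0.0, 1.0, 1.0, 0.0, 0.0]
edge_kernel = [-1.0, 1.0]  # responds to rising and falling edges

feature_map = conv1d(waveform, edge_kernel)
pooled = max_pool1d(feature_map, 2)
```

    The pooled feature map is a shorter sequence of local features, which is the kind of output that can then be handed to an RNN for sequential modeling, as the answer suggests.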