Vetted Talent

Rohit Rawat

I am currently working as a Data Scientist at Grainger Canada, deploying machine learning models for customer-facing applications and working towards productionizing LLMs. I have a Master's in Data Science and two degrees in Mathematics, and have worked as an Analytics Manager in the FinTech domain. I'm a Databricks Certified Machine Learning Associate.

I've developed machine learning models for recommending mutual fund schemes and used clustering methods to segment customers. Presently, I'm interested in MLOps and LLMs. I actively engage in Kaggle Competitions to discover novel methods to tackle varied data problems. My philosophy is to practice what I have learned and learn what I have not read before.

  • Role

    Senior Applied Scientist-I (Machine Learning Engineer)

  • Years of Experience

    9.3 years

  • Professional Portfolio

    View here

Skillsets

  • LLMs - 2 Years
  • VLMs
  • vLLM
  • Triton
  • SLMs
  • Pinecone
  • NoSQL
  • Milvus
  • HuggingFace
  • FastAPI
  • Azure
  • LoRA
  • RAG
  • Kubeflow
  • Databricks
  • CI/CD - 5 Years
  • LangChain - 2 Years
  • Airflow
  • Flask - 5 Years
  • Docker - 5 Years
  • Spark
  • Kubernetes - 2 Years
  • Git
  • SQL - 5 Years
  • Python - 5 Years
  • AWS - 5 Years
  • Snowflake - 2 Years
  • PyTorch - 5 Years
  • MLflow - 5 Years

Vetted For

12 Skills
  • Data Scientist (Remote) AI Screening
  • 79%
  • Skills assessed: Communication Skills, Jira, Retrieval-Augmented Generation, Computer Vision, Deep Learning, PyTorch, TensorFlow, GitLab, Machine Learning, NLP, NoSQL, Python
  • Score: 71/90

Professional Summary

9.3 Years
  • Apr, 2025 - Present 1 yr 2 months

    Senior Applied Scientist - I

    Accrete
  • Oct, 2024 - Apr, 2025 6 months

    Machine Learning Engineer

    Accrete
  • Aug, 2022 - Oct, 2024 2 yr 2 months

    Data Scientist

    Grainger Canada
  • Jun, 2017 - Jul, 2017 1 month

    Data Analyst Intern

    BASIX Sub-K iTransactions Ltd
  • Jul, 2018 - May, 2021 2 yr 10 months

    Analytics Manager

    SBI MUTUAL FUND
  • Apr, 2022 - Jul, 2022 3 months

    Research Intern

  • Feb, 2014 - Dec, 2014 10 months

    Volunteer

    Shades of Happiness
  • Aug, 2013 - Apr, 2015 1 yr 8 months

    Team Member

    Enactus Hindu College

Applications & Tools Known

  • SOLR
  • MLflow
  • Databricks
  • Streamlit
  • GitHub Copilot
  • Locust
  • Splunk
  • Grafana
  • CRM
  • Power BI
  • Google Colab
  • SQL Server
  • Google Analytics
  • ETL
  • Flask
  • Superset

Work History

9.3 Years

Senior Applied Scientist - I

Accrete
Apr, 2025 - Present 1 yr 2 months

Machine Learning Engineer

Accrete
Oct, 2024 - Apr, 2025 6 months
    Agentic Video Curation: Built an end-to-end pipeline for automated short-form video curation using multimodal retrieval with ColPALI and Qwen3-VL-8B-Instruct, semantic chunking, and vector indexing in Pinecone. Engineered a stateful system using LangGraph and GPT-4o for agentic orchestration, enabling intent-driven video and thumbnail generation. Improved burned-in caption recognition by applying LoRA to Phi-3.5 Vision-Instruct, increasing ROUGE-2 by 10%. Fine-tuned YOLOv11 for logo detection on synthetic sports video backgrounds, achieving 80% improvement over baselines. Built an MCP-based vision model serving layer with Gradio interface for product demos.

Data Scientist

Grainger Canada
Aug, 2022 - Oct, 2024 2 yr 2 months
    Built and deployed ML microservices for search intent optimization using NER and a BGE neural re-ranker, improving relevance and retrieval by 25%. Developed a LLaMA-3.1-8B-Instruct product description generator with RAG and Milvus, increasing engagement by 20% and conversions by 11%. Migrated FastText classification to MLflow and Databricks with PySpark, reducing training time by 37% and improving recall by 8%. Improved production reliability with Locust load-testing and Grafana monitoring.

Research Intern

Apr, 2022 - Jul, 2022 3 months

Analytics Manager

SBI MUTUAL FUND
Jul, 2018 - May, 2021 2 yr 10 months
    Fine-tuned DistilBERT on 30K+ customer reviews for sentiment classification, reducing negative-feedback response time by 40%. Built clustering and segmentation models (K-Means, DBSCAN) for investors and brokers, enabling targeted notifications and increasing platform usage by 25%. Developed a collaborative filtering fund recommendation system, increasing digital sales by 11%. Managed two data analysts and improved SQL reporting efficiency.

Data Analyst Intern

BASIX Sub-K iTransactions Ltd
Jun, 2017 - Jul, 2017 1 month

Volunteer

Shades of Happiness
Feb, 2014 - Dec, 2014 10 months
    Took part in their Each One Teach One and NEEV initiatives. My responsibility involved weekly hours of teaching one or two students.

Team Member

Enactus Hindu College
Aug, 2013 - Apr, 2015 1 yr 8 months
    Worked on the Marketing Team of Project SHRESHTH. The initiative aimed to socially and economically empower women from underprivileged sections of society through a small-scale incense-stick manufacturing business.

Achievements

  • Top 2% (16/829) in Kaggle Team competition, WiDS Hackathon 2022 | Forecast Energy Consumption
  • Top 3% (87/3537, Solo Silver) in Kaggle competition, PetFinder.my | CNN | Predict Image Popularity
  • Received DST-INSPIRE scholarship (Top 1% in India ISC Exams) from Govt. of India

Major Projects

1 Project

Head Impact Detection in Sports Videos

Apr, 2022 - Jul, 2022 3 months
    Developed a machine learning pipeline for identifying head collisions in contact sports, integrating YOLOv5 for helmet localization and ResNet 3D for temporal analysis, achieving a 0.91 recall score and reducing manual video analysis time by 60%.

Education

  • M.S. Data Science

    The University of British Columbia (2022)
  • M.Sc. Mathematics

    Indian Institute of Technology Bombay (2018)
  • B.Sc. Mathematics

    Hindu College, University of Delhi (2016)

Certifications

  • Databricks Generative AI Fundamentals

  • Databricks Certified Machine Learning Associate

  • Databricks Lakehouse Fundamentals

  • Building Transformer-Based NLP Applications (NVIDIA)

  • Deep Learning Specialization (Coursera)

  • SQL Intermediate (HackerRank)

  • Machine Learning in Production (DeepLearning.AI)

Interests

  • Running Marathon
  • Books

AI-interview Questions & Answers

    Hi, I'm Rohit. I've been working as a data scientist at Grainger for the past 2 years. My total work experience is 5 years, all of it in the data science domain. My academic background is in mathematics: I have a master's in mathematics and a master's in data science. My interests lie in both natural language processing and computer vision. In my current job, I'm working on improving the search experience of our customers. My previous job involved working on a variety of machine learning models, including customer sentiment analysis, recommendations, and customer segmentation.

    In my academic project during my master's in data science, our problem was to detect head impacts of players in sports videos. It was a twofold problem. First, we had to detect helmets frame by frame in the video. Second, we had to detect whether two helmets interacted and whether that interaction was a head impact. So it was a combination of two computer vision models: YOLOv5, which detected the helmets, and a ResNet 3D model, which detected whether an impact occurred. We primarily used PyTorch and torchvision.

    In terms of classes, when I'm doing anomaly detection with PyTorch, I think you'll basically use the torch library. In addition, you can use metrics from scikit-learn. Since you are detecting anomalies, it's a prediction problem, so you have to assess whether recall or precision is more important to you. Based on that, you can use precision, recall, accuracy, and even the F1 score to some extent. In terms of validation, we'll have to assess what kind of dataset we have. We can use a cross-validation strategy where we split the dataset into, for example, 5 folds. You train on 4 folds and validate on the 5th. Then you rotate: train on a different set of 4 folds and treat the remaining one as the validation set. In total you train the PyTorch model 5 times, and the validation score is the average over the 5 folds.
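
    The 5-fold procedure described above can be sketched roughly as follows. This is a plain-Python illustration rather than the PyTorch and scikit-learn setup mentioned in the answer: the threshold "model" (`fit_threshold`, `predict_threshold`) and all function names are hypothetical stand-ins, and recall is chosen as the metric only as an example.

```python
import random
from statistics import mean, pstdev


def k_fold_indices(n_samples, k=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # fixed seed keeps the folds reproducible
    folds = [indices[i::k] for i in range(k)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for n, fold in enumerate(folds) if n != i for j in fold]
        yield train_idx, val_idx


def recall(y_true, y_pred):
    """Fraction of true anomalies (label 1) that were flagged; 0.0 if the fold has none."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def fit_threshold(X_train, y_train):
    """Hypothetical stand-in model: flag values far above the normal training samples."""
    normal = [x for x, label in zip(X_train, y_train) if label == 0]
    return mean(normal) + 3 * pstdev(normal)


def predict_threshold(threshold, X_val):
    return [1 if x > threshold else 0 for x in X_val]


def cross_validate(X, y, fit, predict, k=5):
    """Train k times, validate on the held-out fold each time, and average the scores."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        preds = predict(model, [X[i] for i in val_idx])
        scores.append(recall([y[i] for i in val_idx], preds))
    return mean(scores)
```

    With a real network, `fit` would wrap the training loop and `predict` the inference step; the fold rotation and the final averaging stay the same.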

    Vector databases can be really helpful when we want to derive better context for our machine learning problem. Let's say our model is trying to generate product descriptions. Instead of just using a plain large language model, it would be a better idea to give it a reference set of examples. That set could be, for example, the product descriptions in the product catalog that you have. You can store every product description as a dictionary and store each dictionary in a vector database using some chunking strategy. When you pass in a prompt, you retrieve the 10 best chunks, or 10 best dictionaries, from the vector database. These augment the prompt and give you better item retrieval or better item descriptions.
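
    A minimal sketch of the retrieve-then-augment step described above. It makes several assumptions: an in-memory list stands in for the vector database, a toy bag-of-words counter stands in for a neural embedding model, and the catalog entries and function names (`embed`, `retrieve`, `build_prompt`) are invented for illustration.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a real system would use an embedding model)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, store, k=10):
    """Return the k stored records most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item["vector"]), reverse=True)
    return ranked[:k]


def build_prompt(query, store, k=10):
    """Augment the prompt with the retrieved product dictionaries."""
    context = "\n".join(
        f"- {item['record']['name']}: {item['record']['description']}"
        for item in retrieve(query, store, k)
    )
    return f"Reference descriptions:\n{context}\n\nTask: {query}"


# Hypothetical catalog: each product description is stored as a dictionary.
catalog = [
    {"name": "Cordless drill", "description": "18V cordless drill with two batteries"},
    {"name": "Safety gloves", "description": "cut resistant safety gloves for warehouse work"},
    {"name": "Hammer drill", "description": "corded hammer drill for concrete and masonry"},
]
store = [
    {"record": r, "vector": embed(r["name"] + " " + r["description"])} for r in catalog
]
```

    Calling `build_prompt("write a description for a cordless drill", store, k=2)` places the two drill entries ahead of the task text, which is the augmented prompt that would then be sent to the language model.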

    I think it depends on the use case. Let's say you have a problem where you want it to be really fast and the inferences happen in real time, not in batches, so the inference latency you want in the machine learning pipeline is very small. Then it makes sense to use a very light version of tokenization and sentiment classification. For tokenization, text cleaning, and processing, you might use spaCy, which is lighter. And for sentiment analysis, you could go with a bag of words or a word2vec strategy, which would be lighter. But let's say you have a problem where it's not real time: it happens in batches, and you're not really worried about the inference time. Then it makes sense to use a bigger model. A popular assortment of models is available on Hugging Face, so you can use the transformers library. From there, you can use AutoTokenizer and AutoModelForSequenceClassification. You can download both pre-trained components using the Hugging Face CLI, and then use them for tokenization and for sentiment analysis.

    What is your approach for training? I think, for an imbalanced dataset, it makes sense to resample the dataset. Let's say the imbalanced dataset has 2 labels, positive and negative, and the negative one has the lower count: the positive ones are 90% and the negative ones are 10%. There's a higher propensity for the model to be skewed towards the positive class. So it makes sense either to downsample the 90% positive cases or to upsample the 10% negative cases. This reduces the gap between the positive and negative reviews in the training set and reduces the imbalance. To keep the model's performance robust, you ensure that the distribution of positive and negative classes you feed into the network remains consistent. And if you're using any sampling or randomization to pick samples, make sure you fix a seed. For example, in sklearn you can set a random state, such as 24 or 25, so that any time you downsample or upsample, you get the same training examples. This way the model's results stay reproducible, and you can use PyTorch or TensorFlow for the training itself.
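
    The seeded upsampling idea can be sketched as below. This is a standard-library illustration, not the sklearn call referred to in the answer; the function name `upsample_minority` is hypothetical, and the default seed of 24 simply mirrors the example value mentioned above.

```python
import random
from collections import Counter


def upsample_minority(samples, labels, seed=24):
    """Resample smaller classes with replacement until every class matches the largest."""
    rng = random.Random(seed)  # fixed seed: the same extra examples are drawn on every run
    counts = Counter(labels)
    target = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, l in zip(samples, labels) if l == label]
        extra = rng.choices(pool, k=target - count)  # empty list for the majority class
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels
```

    For a 90/10 split this returns 90 examples of each class. Resampling should be applied to the training split only, so that the validation data keeps its original distribution.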

    With PCA, one of the problems here might be that all the values in your X_train dataset need to be numeric. So it makes sense to filter in the pipeline and apply PCA only to the numeric columns. That way you avoid the problem, because PCA can't deal with categorical values. But if you intend to use categorical values, then it makes sense to use a label encoder to convert them into numeric fields, and then you can use PCA on the entire set. So it all depends on how the X_train set that you have is structured.
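
    The preprocessing choice described above (numeric columns only, or label-encode the categorical ones first) might look like this before the data reaches PCA. This sketch covers only the column handling, not PCA itself; rows are assumed to be dictionaries, and the helper names are hypothetical.

```python
def split_columns(rows):
    """Separate column names into numeric and categorical, based on the values present."""
    numeric, categorical = [], []
    for col in rows[0]:
        values = [row[col] for row in rows]
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
        )
        (numeric if is_numeric else categorical).append(col)
    return numeric, categorical


def label_encode(values):
    """Map each distinct category to an integer code (codes follow sorted order)."""
    mapping = {v: i for i, v in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping


def to_numeric_matrix(rows, encode_categoricals=False):
    """Build the all-numeric matrix that a PCA step would consume."""
    numeric, categorical = split_columns(rows)
    columns = list(numeric)
    matrix = [[float(row[c]) for c in numeric] for row in rows]
    if encode_categoricals:
        for col in categorical:
            codes, _ = label_encode([row[col] for row in rows])
            for vector, code in zip(matrix, codes):
                vector.append(float(code))
            columns.append(col)
    return columns, matrix
```

    With `encode_categoricals=False` the categorical columns are dropped, which matches the first option in the answer; setting it to `True` matches the second. Label codes impose an arbitrary ordering on the categories, which is worth keeping in mind when interpreting the resulting components.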

    I think a basic error that this code won't be able to handle is when, let's say, the maximum and the minimum value are the same. Then maximum minus minimum would be 0, and that is used as the denominator here, so it would raise an error. There might be cases where the maximum and minimum values of the dataset are the same, so we should catch that case here to avoid the error.
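
    The code under discussion is not shown in the transcript, so the following is only a guess at its shape: a min-max scaling function with the zero-denominator case handled explicitly. Returning all zeros for a constant input is an assumption made here; raising a descriptive error would be an equally reasonable choice.

```python
def min_max_scale(values):
    """Scale values to [0, 1], guarding against max == min."""
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # Every value is identical, so (v - lo) / span would divide by zero.
        # Assumption: map a constant column to zeros instead of failing.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]
```

    For example, `min_max_scale([2, 4, 6])` spreads the values over the unit interval, while `min_max_scale([3, 3, 3])` takes the guarded branch instead of raising `ZeroDivisionError`.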

    You'll have to combine 2 pipelines here, side by side, because you're trying to extract insights from visual data alongside text. There are 2 ways to do this. In the first, you take the visual data, let's say an image, and use a CNN to convert that 2D or 3D image into a flat vector, which you then append to the NLP embedding you have for the dataset. In the second, you feed the image to a vision-language model, which generates a description of the image. You append that description to the textual data you already have, so you end up with a single string: the original text plus text derived from the visual data. Then it becomes a single NLP problem, which can be used to generate a summary, assess the mood, or anything else you want to do with it. I would say the second approach is the better idea: generate a description of the image, however long you want it to be, and append it to the textual data. You then have a single piece of textual data, and it's easier to build one NLP pipeline for it.
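
    The two fusion options can be contrasted in a small sketch. Everything here is a simplified stand-in: `flatten_image` replaces a real CNN feature extractor, `captioner` is any callable that plays the role of a vision-language model, and the function names are hypothetical.

```python
def flatten_image(image):
    """Stand-in for CNN features: flatten a 2D grid of pixel values into one vector."""
    return [float(pixel) for row in image for pixel in row]


def early_fusion(image, text_embedding):
    """Option 1: append the image vector to the text embedding."""
    return flatten_image(image) + list(text_embedding)


def caption_fusion(image, text, captioner):
    """Option 2: describe the image in words and append that to the existing text."""
    return f"{text}\n[Image description] {captioner(image)}"
```

    Option 1 produces one longer numeric vector for a downstream model, while option 2 produces one string that a text-only NLP pipeline can consume directly.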

    Can you illustrate? So, version control is extremely important if you want to fall back to a working version any time you make a release. Whether you make a major or a minor release, version control lets you revert to a version that was actually working. There are cases when you're working in a remote environment and some changes break the pipeline, so it's better to have a fallback mechanism. That fallback could be a previous version of the code in GitLab, which you can pick up and rebuild the entire pipeline from, because it was already working. It's also very important to have plenty of integration and unit tests attached, so that you can confirm the previous pipeline was working before you move on to the next one.

    To ensure real-time performance when using streaming data, how you store that data is very important, because it makes the data easier to retrieve. If you're using a vanilla search engine, there might be cases where you have to search over a larger space. A better approach would be to use an NLP-based algorithm, such as entity recognition or multilevel classification, to recognize valuable signals that help the search engine run in real time, and then you can use