
I am currently working as a Data Scientist at Grainger Canada, deploying machine learning models for customer-facing applications and working towards productionalizing LLMs. I have a Master's in Data Science and two degrees in Mathematics, and I have worked as an Analytics Manager in the FinTech domain. I'm a Databricks Certified Machine Learning Associate.
I've developed machine learning models for recommending mutual fund schemes and used clustering methods to segment customers. Presently, I'm interested in MLOps and LLMs. I actively engage in Kaggle Competitions to discover novel methods to tackle varied data problems. My philosophy is to practice what I have learned and learn what I have not read before.
Senior Applied Scientist - I
Accrete
Machine Learning Engineer
Accrete
Data Scientist
Grainger Canada
Data Analyst Intern
BASIX Sub-K iTransactions Ltd
Analytics Manager
SBI MUTUAL FUND
Research Intern
Volunteer
Shades of Happiness
Team Member
Enactus Hindu College
SOLR

MLflow
Databricks

Streamlit
GitHub Copilot

Locust

Splunk
Grafana

CRM

Power BI

Google Colab

SQL Server

Google Analytics

PowerBI

SQL Server

ETL
Flask

MLflow

Superset
Hi, I'm Rohit. I've been working as a data scientist at Grainger for the past 2 years. My total work experience is 5 years, and I've been in the data science domain for about 5 years now. My academic background is in mathematics: I have a master's in mathematics, and I also have a master's in data science. My interests lie in both natural language processing and computer vision. In my current job, I'm working on improving the search experience of our customers. My previous job involved working on a variety of machine learning models, including customer sentiment analysis, recommendations, and customer segmentation.
In my academic project during my master's in data science, our problem was to detect head impacts of players in a sports video. It was a twofold problem. First, we had to detect helmets in the video, frame by frame. Second, we had to detect whether two helmets interacted in the video and whether that interaction was a head impact or not. So it was a combination of two computer vision models: YOLOv5, which detected the helmets, and a 3D ResNet model, which detected whether an impact occurred. We primarily used PyTorch and torchvision.
In terms of classes, when I'm doing anomaly detection with PyTorch, I think you'll basically use the torch library. In addition to torch, you can use metrics from scikit-learn, since detecting anomalies is a prediction problem. You have to assess whether recall or precision is more important to you, and based on that, you can make use of precision, recall, accuracy, and even the F1 score to some extent. In terms of validation, I would say we'll have to assess what kind of dataset we have. We can use a cross-validation strategy where we split the dataset into, for example, 5 folds. You train on 4 folds and validate on the 5th. Then you move on to another set of 4 folds, treating the remaining one as the validation set. In this way, you train the model in PyTorch 5 times, and the validation score is the average over the 5 folds.
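The 5-fold procedure described above can be sketched without any library. This is a minimal illustration, not the speaker's actual code: `fit_threshold` is a hypothetical stand-in for training a PyTorch model, and the toy `values` and `labels` are made up. In practice, scikit-learn's `KFold` and its metric functions would usually replace the hand-written helpers.

```python
from statistics import mean

def k_fold_indices(n_samples, k=5):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def precision_recall_f1(y_true, y_pred, positive=1):
    """Compute precision, recall, and F1 for the positive (anomaly) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def fit_threshold(values, labels):
    """Hypothetical stand-in for model training: midpoint between class means."""
    normal = [v for v, y in zip(values, labels) if y == 0]
    anomalous = [v for v, y in zip(values, labels) if y == 1]
    return (mean(normal) + mean(anomalous)) / 2

# Toy data (assumed): anomalies (label 1) have large values, one per fold.
values = [0.1, 0.2, 9.0, 0.3, 0.2, 8.5, 0.1, 0.4, 9.2, 0.3, 0.2, 8.8, 0.1, 0.3, 9.1]
labels = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1]

fold_f1 = []
for train_idx, val_idx in k_fold_indices(len(values), k=5):
    # Train on 4 folds, validate on the held-out fold.
    threshold = fit_threshold([values[i] for i in train_idx], [labels[i] for i in train_idx])
    preds = [1 if values[i] > threshold else 0 for i in val_idx]
    _, _, f1 = precision_recall_f1([labels[i] for i in val_idx], preds)
    fold_f1.append(f1)

cv_score = mean(fold_f1)  # average validation score over the 5 folds
```

Whether to average F1, recall, or precision depends on which error is more costly for the anomaly use case, as discussed above.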
Vector databases can be really helpful when we want to derive better context for our machine learning problem. Let's say our AI is trying to generate product descriptions. Instead of just using a plain large language model, it would be a better idea to use a training set, for example the product descriptions in the product catalog that you have. What you can do is store every product description as a dictionary, and store each individual dictionary in a vector database using some chunking strategy. When you pass in a prompt, you find the 10 best chunks or 10 best dictionaries you can retrieve from the vector database, which can augment the prompt and give you better item retrieval or a better item description.
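The retrieve-then-augment idea above can be sketched with an in-memory list standing in for the vector database. Everything here is an assumption for illustration: the `catalog` entries are invented, `embed` is a toy bag-of-words embedding rather than a real embedding model, and a real system would use a vector database client instead of a Python list.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a real embedding model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Hypothetical catalog: each product description is stored as a dictionary.
catalog = [
    {"sku": "A1", "description": "cordless drill with two lithium batteries"},
    {"sku": "B2", "description": "steel toe safety boots for construction"},
    {"sku": "C3", "description": "cordless impact driver with charger"},
]

# The "vector database": each dictionary paired with its embedding.
index = [(item, embed(item["description"])) for item in catalog]

def retrieve(query, k=2):
    """Return the k catalog entries most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda pair: cosine(q, pair[1]), reverse=True)
    return [item for item, _ in ranked[:k]]

def build_prompt(query, k=2):
    """Augment the prompt with the retrieved descriptions as context."""
    context = "\n".join(f"- {item['description']}" for item in retrieve(query, k))
    return f"Reference descriptions:\n{context}\n\nTask: {query}"

prompt = build_prompt("write a description for a cordless drill")
```

The `k=2` default only suits this three-item toy catalog; the answer above suggests retrieving around 10 chunks.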
Python tools with new useful. Okay. I think it depends on the use case. Let's say you have a problem where you want it to be really fast and the inferences happen in real time, not in batches, so the latency that you want at inference time in the machine learning pipeline is very small. Then it makes sense to use a very light version of tokenization and text sentiment classification. For tokenization, text cleaning, and processing, you might use spaCy, which is lighter. And for sentiment analysis, you could go with a bag of words or even a word2vec strategy, which would be a little lighter. But let's say you have a problem where it's not real time. It happens in batches, and you're not really worried about the inference time. Then it makes sense to use a bigger model. A popular assortment of models is available on Hugging Face, so you can use the transformers library from Hugging Face. From there, you can use AutoTokenizer and an auto model for text classification, such as AutoModelForSequenceClassification. You can download both using the Hugging Face CLI, get the tokens, load both as pre-trained models, and then use these two tools for tokenization and for sentiment analysis.
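As an illustration of the lightweight path above, here is a minimal bag-of-words sentiment classifier using multinomial Naive Bayes with add-one smoothing. It is a sketch under assumptions: the whitespace `tokenize` function stands in for a spaCy tokenizer, the training sentences are invented, and Naive Bayes is just one simple choice of classifier on top of bag-of-words counts.

```python
import math
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase whitespace tokenizer; a stand-in for a spaCy tokenizer."""
    return text.lower().split()

def train_naive_bayes(examples):
    """examples: list of (text, label). Returns word counts, label counts, vocabulary."""
    word_counts = defaultdict(Counter)
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(tokenize(text))
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, label_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability for the text."""
    word_counts, label_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in label_counts:
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokenize(text):
            if token in vocab:  # ignore words never seen in training
                count = word_counts[label][token]
                score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Invented training examples for illustration only.
model = train_naive_bayes([
    ("great product works well", "pos"),
    ("love it great value", "pos"),
    ("terrible quality broke fast", "neg"),
    ("bad product waste of money", "neg"),
])
```

A model like this is cheap at inference time, which is the trade-off described above; the batch scenario would swap it for a pre-trained transformer.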
What is your approach for training? I think, for an imbalanced dataset, it makes sense to resample the dataset. Let's say the imbalanced dataset has 2 labels, positive and negative, and the negative one has a lower count. Let's assume the positive ones are 90% and the negative ones are 10%. There's a higher propensity that the model will be skewed towards the positive ones. So it makes sense either to downsample the 90% positive cases or to upsample the 10% negative cases. This reduces the gap between the positive and negative reviews in the training set and reduces the imbalance. In terms of ensuring the model's performance remains robust, you ensure that the distribution of positive and negative examples you feed into the network remains consistent. And if you're using any sampling or randomization to pick samples, ensure you set a seed. For example, you can use sklearn and set a random state, such as 24 or 25, and fix it so that any time you downsample or upsample, you get the same training examples. This way, you'll ensure the model remains robust, and you can use PyTorch or TensorFlow to achieve this.
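The seeded upsampling step above might look like the following sketch. It uses Python's standard `random` module rather than sklearn so that it is self-contained; the function name `upsample_minority` and the toy reviews are assumptions, and the seed value 24 mirrors the example in the answer.

```python
import random
from collections import Counter

def upsample_minority(samples, labels, seed=24):
    """Resample smaller classes with replacement until all classes match the largest."""
    rng = random.Random(seed)  # fixed seed so the resampled set is reproducible
    counts = Counter(labels)
    majority_count = max(counts.values())
    out_samples, out_labels = list(samples), list(labels)
    for label, count in counts.items():
        pool = [s for s, y in zip(samples, labels) if y == label]
        extra = [rng.choice(pool) for _ in range(majority_count - count)]
        out_samples.extend(extra)
        out_labels.extend([label] * len(extra))
    return out_samples, out_labels

# Toy data (assumed): 90% positive, 10% negative.
reviews = [f"positive review {i}" for i in range(9)] + ["negative review 0"]
labels = ["pos"] * 9 + ["neg"]

balanced_reviews, balanced_labels = upsample_minority(reviews, labels, seed=24)
```

Calling the function twice with the same seed returns the same training examples, which is the reproducibility point made above.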
With PCA, one of the problems here might be that all the values in your X_train dataset need to be numeric. So it makes sense to filter in the pipeline and only use PCA on the numeric columns. That way you avoid the problem, because PCA can't deal with categorical values. But if you intend to use categorical values, then it makes sense to use a label encoder to convert them into numeric fields, and then you can use PCA on the entire set. So it all depends on how the X_train set that you have is structured.
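The two preprocessing options above (numeric columns only, or label-encoding the categorical ones) can be sketched as follows. This covers only the step before PCA; the PCA itself would come from a library such as scikit-learn. The column names and rows are invented for illustration.

```python
def split_columns(rows):
    """Split column names into numeric and categorical, based on the values present."""
    numeric, categorical = [], []
    for name in rows[0]:
        values = [row[name] for row in rows]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool) for v in values):
            numeric.append(name)
        else:
            categorical.append(name)
    return numeric, categorical

def label_encode(values):
    """Map each distinct category to an integer code (sorted for determinism)."""
    mapping = {cat: i for i, cat in enumerate(sorted(set(values)))}
    return [mapping[v] for v in values], mapping

# Hypothetical training rows with one categorical column.
rows = [
    {"price": 10.0, "weight": 1.2, "category": "tools"},
    {"price": 25.5, "weight": 3.4, "category": "safety"},
    {"price": 7.25, "weight": 0.8, "category": "tools"},
]

numeric_cols, categorical_cols = split_columns(rows)

# Option 1: feed only the numeric columns to PCA.
numeric_only = [[row[c] for c in numeric_cols] for row in rows]

# Option 2: label-encode the categorical column and append it.
codes, mapping = label_encode([row["category"] for row in rows])
with_encoded = [features + [code] for features, code in zip(numeric_only, codes)]
```

One caveat worth keeping in mind: label encoding assigns an arbitrary order to the categories, and PCA will treat that order as meaningful.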
A section of call open. And then this call. I think a basic error that it won't be able to capture is when the maximum and the minimum value are the same. Then maximum minus minimum would be 0, and that has been used as the denominator here, so it would raise a division-by-zero error. There might be cases where the maximum and the minimum value of the dataset are the same, so we should catch that case here to avoid the error.
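The code under discussion is not shown in the transcript, so the following is only a sketch of the guard described above, assuming the function is a min-max normalization. The name `min_max_scale` and the choice to return zeros for constant input are assumptions; raising a descriptive error would be an equally valid design.

```python
def min_max_scale(values):
    """Scale values to [0, 1], guarding against a zero range."""
    if not values:
        raise ValueError("values must not be empty")
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # All values are identical: max - min is 0, so avoid dividing by it.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

scaled = min_max_scale([2, 4, 6])
constant = min_max_scale([3, 3, 3])
```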
A new device, a Python works, should it apply? Oh, wow. So, yes, you'll have to combine 2 pipelines here, side by side, because you're trying to extract insights from visual data. There are 2 ways to do this. In the first, let's say the visual data is an image: using a CNN, you can convert that 2D or 3D image into a flat vector, and then append it to the NLP embedding that you have for the dataset. In the second, you feed the image to a vision language model. A vision transformer would give descriptions of the images, so you use that first pipeline to generate descriptions of the images and then append them to the textual data that you have. Then you have a string of textual data plus textual data derived from the visual data, and it becomes a single NLP problem. That NLP pipeline can be used to generate a summary, to assess the mood, or for anything else you want to do with it. So, there are 2 approaches: converting the image or other visual data to a flat vector and appending it to the NLP embedding, or, which I would say is the better idea, generating a description of the image, however long you want it to be, and appending it to the textual data. Then you have a single piece of textual data, and it becomes easier to build an NLP pipeline for it.
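The two fusion approaches above can be contrasted in a few lines. This is a deliberately simplified sketch: `flatten_image` just flattens raw pixels and stands in for CNN feature extraction, the caption is hard-coded where a vision language model would generate it, and all values are invented.

```python
def flatten_image(pixels):
    """Flatten a 2D pixel grid into a flat vector; stand-in for CNN features."""
    return [float(v) for row in pixels for v in row]

def early_fusion(image_vector, text_embedding):
    """Approach 1: append the image vector to the text embedding."""
    return list(image_vector) + list(text_embedding)

def caption_fusion(text, caption):
    """Approach 2: append a generated image description to the text."""
    return f"{text}\nImage description: {caption}"

# Hypothetical inputs.
pixels = [[0, 255], [128, 64]]
text_embedding = [0.1, 0.7, 0.2]

fused_vector = early_fusion(flatten_image(pixels), text_embedding)
fused_text = caption_fusion(
    "Customer reports a cracked helmet.",
    "A white helmet with a visible crack.",  # would come from a vision language model
)
```

The second approach leaves a single string, so one text-only pipeline can handle summarization or sentiment downstream.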
Can you illustrate? So, version control is extremely important if you want to fall back to a working version any time you make a release. Whether you make a major release or a minor release, version control helps you revert to a version that was actually working. There are cases when you're working in a remote environment and some changes might break the pipeline, so it's better to have a fallback mechanism. A fallback mechanism could be a previous version of the code in GitLab, which you can pick up and rebuild the entire pipeline from, because it was already working. It's also very important to have a lot of integration and unit tests attached to it, so that you can confirm the previous pipeline was working before moving on to the next one.
How would you ensure your live moments by the end? Okay. So, to ensure that the performance is real-time when using streaming data, it's very important how you store that data, because that makes it easier to retrieve the data as well. If you're using a vanilla search engine, there might be cases where you have to search in a larger space. A better approach would be to use an NLP-related algorithm. It could be entity recognition or multilevel classification, so you can recognize some valuable things that might help the search engine get the algorithm running in real time, and then you can use