
I am Midhilesh, a seasoned Data Scientist with over 5 years of experience crafting data-driven solutions, specializing in Regression Models and Data Mining Algorithms. Proficient in Python, MySQL, NLP, TensorFlow, and PySpark, I have also delved into Software Development, contributing to web applications for machine learning using Streamlit and Django. Personally, my curiosity extends to exploring diverse data types, staying updated on AI advancements, and actively participating in coding competitions on platforms like HackerEarth, CodeChef, and Kaggle. Excited about the prospect of contributing to your team and company, I am confident that my skills and passion align seamlessly with your objectives.
Senior ML Engineer, Microsoft
Senior ML Engineer, Walmart
ML Engineer II, Dell Technologies
Data Scientist, TCS
Sr. Data Scientist, Ernst & Young
MySQL

Git

Python

MongoDB

Visual Studio Code

Apache

PostgreSQL
REST API

Javascript
AWS (Amazon Web Services)

Azure Machine Learning Studio

AWS Athena
Docker

Kubernetes

Apache Airflow

MLFlow

Multithreading

Multiprocessing

OOPS

Natural Language Processing

NLP

LLMs

Transformers

Neural Networks

Machine Learning
Pyspark

Django

Recommender Systems

Hive

Hadoop

AWS

Airflow

MLFlow

CICD
Flask
FastAPI

Redis

Apache Kafka
Celery

RabbitMQ

Postgres

Pandas

Scikit-learn

Keras

MLFlow

Redis

Airflow

Plotly

CloudWatch

Spark

Zephyr

Redis

AWS Glue

Plotly
Beautiful Soup
Jenkins
Grafana

Splunk

Apache Cassandra

Redis

Plotly

Camelot

S3

Azure Cognitive Services

Redis
Yeah, I graduated in 2015 and joined TCS in 2016, where I spent about five years. There I worked on NLP topic modeling using LDA. I used Spark MLlib, and for training, validating, and creating endpoints with Lambda I used SageMaker for everything; for data transformations I used AWS Athena. Later I worked on a hospitality project, a predictive modeling project where I tried to predict user behavior: what kind of hotel and what kind of room a user would choose, and whether he would take the discount coupons offered to him or not. So that's one project. After that I worked at EY for a very short time, on data engineering and a little bit of backend engineering: building data pipelines and extracting data from the invoices they provided. Currently I'm working at Dell Technologies, where I've been working on cross-sell and upsell recommendations, the "people who bought this also bought this" scenario when an item is added to the cart. I also worked on next best action using LSTMs, Siamese networks, and Transformers: if a user has visited different pages on dell.com, what brand or category, what item, is he going to buy next? Another one was an intent model, a classification model, where we used XGBoost classification to analyze what kind of user he is, whether he is coming to shop or coming to browse. Shopping means purchasing items on the website; browsing means he is looking for services regarding his laptop. I've handled these projects end to end: I built data pipelines, model pipelines, and inference pipelines, created roadmaps, and ran demos for the stakeholders.
I generally use Docker, Kubernetes, MLflow for model versioning, and Airflow for scheduling across all the projects I've been working on here. For the past six or seven months I have been working on LLMs, basically building RAG pipelines for content ingestion. I started using Apache Kafka for this only recently; the remaining parts are creating chunks and embeddings using LangChain's recursive text splitters and other NLP preprocessing techniques, and creating tags using GPT-3.5 Turbo and open-source LLMs that are already hosted on our on-prem servers, like the Llama 2 and Zephyr models I am currently using; I am about to test Llama 3 as well. So I created a pipeline for this: all the data is loaded into the DL, then chunking strategies are applied and embeddings are created. These vectors are loaded into a PostgreSQL DB, and from there, for retrieval, I use chain-of-thought style prompt engineering, testing and improving it. I am also part of the team that provides SDKs for the feature store. So this is my entire experience.
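The chunking step mentioned in this answer can be illustrated with a minimal sketch. It is not LangChain's recursive text splitter and not the pipeline described in the interview; it is a simplified word-window splitter with overlap, and the function name `chunk_words` and its parameters are assumptions made for illustration.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into word-based chunks that share `overlap` words.

    Hypothetical helper: a simplified stand-in for a recursive text splitter.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


if __name__ == "__main__":
    sample = " ".join(f"w{i}" for i in range(10))
    for chunk in chunk_words(sample, chunk_size=4, overlap=1):
        print(chunk)
```

The overlap keeps some shared context between neighboring chunks, which is the usual reason for overlapping windows before computing embeddings.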
Let's say we have a multilingual dataset. First it needs to be categorized: which languages are present, so we split the dataset by language. Preprocessing built for one language will not work for other languages, although punctuation is mostly the same across languages. So I would go for language-specific text preprocessing: removing punctuation, removing stop words using NLTK, phrase matching using spaCy, these kinds of techniques. LLMs may offer much better options for this, but I haven't worked on exactly this in the past, so let me walk through a scenario. Step by step: for text preprocessing we can use text normalization, for example removing punctuation, lowercasing, and standardization. Then we can tokenize, splitting the text into words and splitting long paragraphs into sentences. Tokenization can be language specific, so different languages will probably need different techniques, and I would check whether any LLMs handle this better. Part-of-speech tagging is different and named entity recognition is different for each language, so I would detect the language, handle the encoding, and check for pre-trained models for these tasks, and then do things like tagging and phrase matching. Basically, whatever we do in English we can try to emulate in other languages. The implementations of chunking, splitting, lemmatization, and stemming mostly differ from language to language, but the overall idea is the same.
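The normalization and stop-word steps described above can be sketched with the standard library alone. The tiny stop-word sets and the `preprocess` function are assumptions for illustration; a real project would more likely load these resources from NLTK or spaCy, as the answer suggests.

```python
import re
import unicodedata

# Tiny illustrative stop-word lists (assumption); real lists are much longer.
STOP_WORDS = {
    "en": {"the", "is", "a", "of", "and"},
    "es": {"el", "la", "es", "un", "de", "y"},
}


def preprocess(text: str, lang: str) -> list[str]:
    """Normalize, lowercase, strip punctuation, and drop stop words."""
    text = unicodedata.normalize("NFKC", text).lower()
    # \w is Unicode-aware in Python 3, so accented letters stay inside tokens
    # while punctuation such as "!" or the inverted "¡" is discarded.
    tokens = re.findall(r"\w+", text)
    stop = STOP_WORDS.get(lang, set())
    return [t for t in tokens if t not in stop]


if __name__ == "__main__":
    print(preprocess("The hotel is near the beach!", "en"))
    print(preprocess("¡El hotel es un sueño!", "es"))
```

This regex tokenizer only suits languages that separate words with spaces or punctuation; languages such as Chinese or Japanese need a dedicated segmenter, which is one reason tokenization is language specific.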
I haven't used reinforcement learning in my experience and I'm not very familiar with it, so I cannot comment much on that part. But here is one place where we could use it. We are building a Dell chat application, and for that we are creating embeddings and loading them into PyTorch Geometric's Vector. Once this application is in production, users can ask the chatbot something like "My laptop has this kind of issue, what should I do?" and it will provide some answers. The user may or may not be satisfied with those answers, so we can ask the user for feedback or a review: were the answers what you were looking for, did we solve the problem? We can use that feedback as part of a chain of thought and send it, together with the input, to the LLM. If we add a reinforcement learning structure at that point, I think the system can learn from it, just like in a chat where the user says, "Your answer is wrong, it's not exactly what I'm looking for." The Dell chat application belongs to our company's content monetization team, and ideally reinforcement learning has to be added to it at some point. I know the idea that applies here is RLHF, but I cannot go into the details because I don't have hands-on experience with reinforcement learning.
Yeah, knowledge graphs are very useful when there are lots of reviews and summaries involved, as on travel websites like TripAdvisor, Booking.com, Expedia, or maybe Trivago, these kinds of travel aggregators. When users give lots of reviews, we can generate tags from them. In the knowledge graph, each tag can be a relationship between one point and another. Let's say two users have given feedback; we can find how similar those two users are based on the keywords they used, the content, and the polarity of their feedback. Based on all these factors, we create the knowledge graph in an embedding space, a high-dimensional space. Then a new user comes to the website and asks about, say, the best hotels with a beachside view, or something with particular facilities. Based on the feedback provided by other users, we calculate where this query lands among the points in the embedding space. The backend traversal of knowledge graphs is probably DFS and BFS, and those calculations determine where the query can be placed in the knowledge graph and which points it relates to. I'm not exactly sure which backend algorithm calculates this part; I haven't worked on that. I've worked more on vector embeddings than on knowledge graphs. But I think this is how we can use a knowledge graph in AI development. Users keep commenting, asking queries, and posting reviews. From those reviews, especially the tags, the metadata, and the words in the summary, we can find related content. Let's say two words are very similar in their word2vec embeddings; that closeness also exists in the knowledge graph. So how do we connect this summary to that summary?
Which is the closest one? Whichever review is closest is the one the user will see. I think that's how TripAdvisor shows reviews: if I ask about some point, it automatically gives summaries, feedback, and what other people said, in a concise manner without losing the user's context. TripAdvisor provides content like this as well, and knowledge graphs can be useful there.
A transformer-based model for language translation. Language translation, got it: you give text in one language and get it in another. As a starting point this is an encoder-decoder model. The core idea behind the transformer is of course the self-attention mechanism, which lets the inputs interact with each other and weighs the significance of each input based on its position and context in the sequence. First I would do data preparation, where the dataset contains lots of paired samples, and use a library like PyTorch or TensorFlow; it doesn't matter which framework. The model consists of several layers. The first step is tokenizing the input and creating the embedding layers, which convert tokens to vectors, then positional encoding, which adds positional information to the input embeddings, since transformers do not have recurrent layers. Then comes the encoder, which is composed of a stack of identical layers, each with a self-attention sublayer and a feed-forward neural network. After that, the decoder: almost the same as the encoder, but with an additional sublayer that performs multi-head attention over the encoder's output. The final layer is the decoder output layer, which projects the output to the size of the vocabulary. That is roughly how I would do it at a high level. The loss function can be cross-entropy loss, since predicting each token is a classification task, and the optimizer can be Adam, AdaGrad, or RMSprop. Then do evaluation and testing on a separate test set: if you are translating Spanish to English, take Spanish texts that the model has not seen and evaluate on those,
using a metric like the BLEU score to assess the translations.
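The positional encoding step mentioned in this answer can be made concrete with a short sketch. The answer does not say which scheme would be used; the code below assumes the sinusoidal encoding from the original Transformer design, and learned positional embeddings would be an equally valid choice. The function name is hypothetical.

```python
import math


def positional_encoding(seq_len: int, d_model: int) -> list[list[float]]:
    """Sinusoidal positional encoding (assumed scheme).

    Even dimensions use sin, odd dimensions use cos, and each sin/cos pair
    shares one frequency: 1 / 10000^(2k / d_model) for pair index k.
    """
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


if __name__ == "__main__":
    for row in positional_encoding(seq_len=3, d_model=4):
        print([round(v, 4) for v in row])
```

Each row would be added element-wise to the token embedding at that position, which is how the model gets order information without recurrent layers.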
Generally, overfitting can be addressed in a few ways. If the data has more features than you need, try to cut them down a little. Sometimes the model is learning too much complexity and too many patterns in the data, so reduce some of that complexity and use dropout and regularization so that it cannot memorize the entire dataset. I am saying this in general, not just for personalized recommender systems. L1 and L2 regularization, monitoring the validation loss, and hyperparameter tuning are some of the techniques we generally use on a machine learning model. But specifically for recommending personalized travel itineraries, the best approach is to understand the metrics and what kind of predictions we are actually giving. Once the model is deployed, we can get data from the product analytics team and the consumption analytics team and see what kind of recommendations we are giving and whether users are satisfied with them. We can also ask for user annotations. We can do this for personalized travel itineraries and for most recommendation problems. Why? That is why I said to deploy a basic model first and build from there: real user engagement will be very different from whatever we see offline. Whatever happened in previous years might not hold at the current time. In some other ML projects the patterns are fairly stable, but here behavior changes a lot. So we deploy the model, get some analytics, and figure out what has gone wrong, for example whether there is too much data drift. That is one point.
That way we can reduce overfitting, alongside the usual techniques: regularization, dropout layers, feature selection, feature engineering, reducing model complexity, and removing multicollinearity between features.
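Monitoring the validation loss, one of the techniques listed above, is often turned into early stopping. The following is a minimal sketch of that idea over a list of per-epoch validation losses; the function name and the `patience` and `min_delta` parameters are assumptions for illustration and are not taken from the interview.

```python
def early_stopping_epoch(val_losses, patience=2, min_delta=0.0):
    """Return (best_epoch, best_loss) under a simple early-stopping rule.

    Training is considered stopped once the validation loss has failed to
    improve by more than `min_delta` for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break  # stop here; later epochs are never evaluated
    return best_epoch, best_loss


if __name__ == "__main__":
    losses = [0.9, 0.7, 0.6, 0.65, 0.7, 0.5]
    print(early_stopping_epoch(losses, patience=2))
```

In the example the loss at the last epoch is lower, but training has already stopped after two non-improving epochs, which shows the trade-off that the `patience` setting controls.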
So, I am not exactly sure what "select k best" is; I am probably thinking of SelectKBest, which is based on the chi-square test for a statistical relationship with the target. Sometimes this is not ideal. You cannot say it will always have a negative impact on model evaluation, and it sometimes gives positive results, but along with it you have to check whether there is too much of a relationship among the other variables. For example, if there is high collinearity between variables, we can drop some features, which we can do using the variance inflation factor. For continuous regression variables, if two of them have almost the same importance for the target variable, I think we can delete one. Doing it this way probably works, but we cannot say for certain whether it gives the wrong answer. The metric we have been given is the accuracy score; I assume accuracy works for this project, but ideally it may not. That is one step. For the feature selection part, we can also use LASSO regression, which tries to reduce the number of features. As for a negative impact on the model: it happens mostly if you lose a lot of information from the dropped features that would have been helpful for the model. But in general we cannot say exactly how much negative impact, or what kind, it will have on model evaluation. One more thing: data leakage is likely here, because of where the train-test split happens relative to the feature selection.
So do not validate on seen data in that case. Do your validation on unseen data, data that has not been through this feature engineering and selection at all. Only when we test against the outside world can we understand how much negative impact it has. As of now we cannot completely assess the impact, but there might be an issue because we are not considering the other features.
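The leakage concern above comes down to fitting any feature selection on the training split only. The sketch below uses a plain pairwise-correlation filter, which is a simpler stand-in for the variance inflation factor mentioned in the answer and not the same technique. The column names, the threshold, and both helper functions are assumptions for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0


def select_features(train_cols: dict[str, list[float]], threshold: float = 0.9) -> list[str]:
    """Keep a feature only if it is not highly correlated with one already kept.

    Pass in columns from the TRAINING split only, so that nothing about the
    validation or test rows influences which features survive.
    """
    kept = []
    for name, values in train_cols.items():
        if all(abs(pearson(values, train_cols[k])) < threshold for k in kept):
            kept.append(name)
    return kept


if __name__ == "__main__":
    train = {
        "price": [1, 2, 3, 4, 5],
        "price_eur": [0.9, 1.8, 2.7, 3.6, 4.5],  # a rescaled copy of price
        "nights": [3, 1, 4, 1, 5],
    }
    print(select_features(train))
```

The returned list of names is then applied unchanged to the validation and test splits, so the held-out rows stay genuinely unseen.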
Okay, for this the best one is MLflow, though some people use Kubeflow. In my view MLflow is the best option: it is open source and integrates seamlessly with different cloud environments such as AWS, Azure, Databricks, and GCP. On the multiple-data-scientists part, I am actually one of the people building the platform for our team to use MLflow. I'm writing code that works for any kind of project in MLflow. Teams have to use this code and its functions: you can write any other functions you like, but in the end you call your function inside one of the MLflow operations functions we provide. Version control is the best part. Once your code has been pushed to the repo, the CI/CD pipeline automatically runs the code in the background and finds the best model based on the hyperparameters. That model is stored in MLflow as the production model, not the archived one. The choice is based on user-defined metrics. Say I'm running several runs in my experiment: linear regression with five hyperparameter settings, logistic regression with five, and XGBoost with five, or maybe deep learning. Whichever gives the highest accuracy, precision, recall, or whatever user-defined metric we use, especially for recommendation projects, such as click rate, conversion rate, or click-to-purchase conversion rate, that best model goes to production. This way, other data scientists can work in the future on the models stored in MLflow.
We should all share one MLflow instance. All the data scientists using it, even on different projects, can fetch a model or log a model. They can take a previously logged model, use it, and evaluate it, and they can collaborate with different people. So MLflow should be the best option for version control of models. For code, I think GitLab is the best choice, as that is what I'm working with. There are other tools, like Jenkins, that we also use with GitHub, but GitLab has everything built in for CI/CD, so there is no need for a separate CD tool. This is how multiple data scientists can collaborate across different projects and models.
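The "best run goes to production" rule described above can be sketched independently of any tracking tool. The code below deliberately does not call the MLflow API; the `Run` dataclass and `best_run` function are hypothetical stand-ins for whatever the tracking server returns, and the metric values are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Run:
    """Hypothetical record of one training run (not an MLflow class)."""
    model_name: str
    params: dict
    metrics: dict = field(default_factory=dict)


def best_run(runs, metric, higher_is_better=True):
    """Pick the run with the best value of a user-chosen metric."""
    scored = [r for r in runs if metric in r.metrics]
    if not scored:
        raise ValueError(f"no run reports the metric {metric!r}")
    pick = max if higher_is_better else min
    return pick(scored, key=lambda r: r.metrics[metric])


if __name__ == "__main__":
    runs = [
        Run("logistic_regression", {"C": 1.0}, {"precision": 0.71}),
        Run("xgboost", {"max_depth": 6}, {"precision": 0.78}),
        Run("xgboost", {"max_depth": 3}, {"precision": 0.74}),
    ]
    winner = best_run(runs, "precision")
    print(winner.model_name, winner.params)
```

In a CI/CD job, the winning run would then be the one promoted to the production stage of the model registry; that promotion step is left out here because it depends on the registry in use.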
As I said, I have not worked on this type of project, and I have not worked on reinforcement learning before. However, since it is a chat-like conversational AI approach... I am sorry, I am confused. So, the user will interact with the website, and travelers will keep asking questions, getting responses, and saying this is wrong, this is right. Whatever the user marks as wrong or right, the back end has to take in this feedback efficiently. Once the user starts a session, the responses generated have to be kept in a cache, maybe Redis, because there is a token limit, whether it is 16,000 tokens, 70,000, or 1 million; let's say 100,000 tokens. 100,000 tokens is a lot, so the back end keeps summarizing this content using an LLM. We can build a reinforcement learning system there: when the user says the answer is wrong, it immediately has to retrain and check, go back to the RAG pipeline in the back end, and do a bit more work, like changing the temperature of the query, or increasing top k, so instead of the top 3 documents make it 4 or 5. More content will be retrieved. From that, we try to pick the best response based on some kind of score; I don't remember exactly which score. You score each of the responses generated from the retrieved content and give the user whichever is best. If he again says it is wrong, go back and ask him what exactly he is looking for. Based on the keywords he sends, take that extra information as new input. This input will also have keywords, tags, summaries, and polarity, positive or negative.
Based on that, I can now retrieve again, adding this input and the previous responses. The previous responses should be in the cache, of course, because you have to know that the previous response was wrong in order to make a different choice. So at this point we could use a reinforcement learning system. I have not actually worked on reinforcement learning, but I think this is the spot where we can use it.
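The "increase top k when the user says the answer is wrong" idea can be sketched with toy vectors. This is plain retry logic over cosine similarity, not reinforcement learning and not the production RAG pipeline. The `is_satisfied` callback is a hypothetical stand-in for user feedback, and the document IDs and vectors are made up.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, docs, top_k):
    """Return the IDs of the top_k documents most similar to the query."""
    ranked = sorted(docs, key=lambda d: cosine(query_vec, d[1]), reverse=True)
    return [doc_id for doc_id, _ in ranked[:top_k]]


def retrieve_with_feedback(query_vec, docs, is_satisfied, start_k=3, max_k=5):
    """Widen retrieval one document at a time until feedback is positive."""
    if start_k < 1 or start_k > max_k:
        raise ValueError("need 1 <= start_k <= max_k")
    hits = []
    for k in range(start_k, max_k + 1):
        hits = retrieve(query_vec, docs, k)
        if is_satisfied(hits):
            break  # positive feedback: stop widening
    return hits


if __name__ == "__main__":
    docs = [("a", [1.0, 0.0]), ("b", [0.9, 0.1]), ("c", [0.5, 0.5]), ("d", [0.0, 1.0])]
    wants_c = lambda hits: "c" in hits  # pretend the user needs document "c"
    print(retrieve_with_feedback([1.0, 0.0], docs, wants_c, start_k=1, max_k=4))
```

If feedback never turns positive, the function returns the widest retrieval it tried, at which point the flow described in the answer would ask the user what exactly they are looking for.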
I have done a lot of this. For example, one of the NLP projects I deployed at TCS used topic modeling. The idea is this: once a user raises a ticket, the system automatically creates tags for it. Based on these tags, which are called topics in LDA, it automatically routes the ticket to the right agent. As a result, the business SLA for this project, which had been 3 to 5 days, was reduced. The issue we faced was with tickets like "my Ultimatix has been blocked" or "my Ultimatix access has been blocked." Ultimatix is the name of a TCS website, so the system thinks this person's account got locked and automatically sends the ticket to the Ultimatix team. But actually his credentials are fine; it is his access that is broken. To address this, I adjusted the machine learning model to use bigrams, trigrams, and phrase matches instead of only unigrams. This way, whether it's a positive or a negative scenario, the system can understand what exactly the user is asking about. We pushed these changes, and where initially we were getting correct results for only 33% of queries, we now get good results for around 60-70%. There are many situations like this where we have to adapt to changing business environments. Another example is the cross-sell recommendations I have been working on. I set up the entire pipeline initially, and because it works well and is scalable and reliable, we were able to expand to Canada, the UK, EMEA and other European countries, and APJC as well. In Canada and the UK, we deployed the models within 14 days, or two weeks. To accommodate changing business requirements, we should have a robust pipeline in place so that when something new comes up, we can deploy it easily. I believe that is the way to do it.
I have also modified the machine learning model to add blacklisted items, applying filters to sort the reviews and feedback and finally give the right cross-sell recommendations. I have also used rule-based techniques, for example in the final scoring part of the recommendation system. Different projects obviously have different business requirements, so I have adapted accordingly.
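The switch from unigrams to bigrams and trigrams described in this answer can be illustrated with a small sketch. The ticket texts are made-up examples using a generic "portal" keyword, and `ngrams` and `ticket_features` are hypothetical helpers, not the code used in the project.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ticket_features(text, max_n=3):
    """Unigram, bigram, and trigram features for a ticket (whitespace tokens)."""
    tokens = text.lower().split()
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(tokens, n))
    return features


if __name__ == "__main__":
    locked = set(ticket_features("my portal account has been locked"))
    access = set(ticket_features("my portal access has been blocked"))
    # Both tickets share the unigram "portal", which is why a unigram-only
    # model can route them to the same queue.
    print("portal" in locked and "portal" in access)
    # The bigram "portal access" appears in only one of them.
    print(sorted(f for f in access - locked if " " in f)[:5])
```

With only unigrams the shared keyword dominates; once multi-word features are included, phrases such as "portal access" give the model something to separate the two cases.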
Yeah, for this part we can have unit tests and integration tests built into the code pipeline itself, in the CI/CD tooling. For every piece of code you write, try to write a unit test; this is how you get code coverage. Every organization has DevOps maturity scores, and we have them too, and based on those you have to report code coverage. Code coverage measures how much of the code the unit test cases exercise. Unit tests have to run before the model is pushed to production, to MLflow. Also check for problems in the data, such as whether data distortions are currently occurring. For data pipelines we can use DVC, and for automated testing we can use integration tests. An integration test does not run on the entire data; it only checks whether the pipeline works from start to end: whether the unit test cases run properly, whether code coverage is there, whether it properly builds the Docker image, creates the secrets in Kubernetes, and hosts the API. All of these can run as part of integration tests. For a machine learning pipeline, we check whether it trains, whether it loads the results to the DB, whether it logs the model to MLflow, and, if it is not a batch job, whether inference is fast enough. We can test all aspects of the machine learning pipeline. The main idea is to use integration tests to run the entire pipeline and check that it works, and then move to the next stage, which is deployment to production, to the Kubernetes cluster, or whatever stages your machine learning CI/CD pipeline has. This is how we can ensure data integrity as well as the health of the entire pipeline.