
I am Midhilesh, a seasoned Data Scientist with over 5 years of experience crafting data-driven solutions, specializing in Regression Models and Data Mining Algorithms. Proficient in Python, MySQL, NLP, TensorFlow, and PySpark, I have also delved into Software Development, contributing to web applications for machine learning using Streamlit and Django. Personally, my curiosity extends to exploring diverse data types, staying updated on AI advancements, and actively participating in coding competitions on platforms like HackerEarth, CodeChef, and Kaggle. Excited about the prospect of contributing to your team and company, I am confident that my skills and passion align seamlessly with your objectives.
Senior ML Engineer, Walmart
Senior ML Engineer, Dell Technologies
Sr. Data Scientist, Ernst & Young
Data Scientist, TCS
MySQL

Git

Python

MongoDB

Visual Studio Code

Apache

PostgreSQL
REST API

Javascript
AWS (Amazon Web Services)

Azure Machine Learning Studio

AWS Athena
Docker

Kubernetes

Apache Airflow

MLFlow

Multithreading

Multiprocessing

OOPS

Natural Language Processing

NLP

LLMs

Transformers

Neural Networks

Machine Learning
Pyspark

Django

Recommender Systems

Hive

Hadoop

AWS

Airflow

MLFlow

CICD
Flask
FastAPI

Redis

Apache Kafka
Celery

RabbitMQ

Postgres

Pandas

Scikit-learn

Keras

MLFlow

Redis

Airflow

Plotly

CloudWatch

Spark

Zephyr

Redis

AWS Glue

Plotly
Beautiful Soup
Jenkins
Grafana

Splunk

Apache Cassandra

Redis

Plotly

Camelot

S3

Azure Cognitive Services

Redis
Yeah, I completed my graduation in 2015 and joined TCS in 2016, where I spent about 5 years. There I worked on NLP topic modeling using LDA. I used Spark MLlib for training and validation and created the endpoint using Lambda. I used SageMaker throughout, and AWS Athena for data transformations. Later I worked on a hospitality project, a predictive modeling project where I tried to predict what kind of hotel and what kind of room the user would choose, and basically the user's behavior: whether he would take the discount coupons or not. So that's one project. After that I worked at EY for a short time, on data engineering and a little backend engineering: building data pipelines and extracting data from the invoices they provided. Currently I'm working at Dell Technologies, where I have been working on cross-sell and upsell recommendations: when an item is added to the cart, it says that people who bought this also bought that, those kinds of scenarios. I also worked on next best action using LSTMs, Siamese networks, and Transformers: what brand or category the user is going to buy next in the sequence. If he has visited different webpages on dell.com, what is the next item he is going to buy? The other one was an intent model, a classification model. Here we used execution classification to analyze what kind of user he is, whether he is coming to shop or to browse. Shop means he is purchasing some items on the website; browse means he is looking for some services regarding his laptop, which is maybe 41 or something like that.
Here I have worked end to end: I build data pipelines, model pipelines, and inference pipelines, create roadmaps, and run illustrations for the stakeholders. I generally use Docker, Kubernetes, MLflow for model versioning, and Airflow for scheduling across all the projects I have worked on here. For the past 6 to 7 months I have been working on LLMs, basically building RAG pipelines for content ingestion. We are using Apache Kafka, which I started with very recently. The remaining parts are creating embeddings and chunks using LangChain recursive text splitters and other NLP techniques, and creating tags using GPT-3.5 Turbo and open-source LLMs already hosted on our on-prem servers, such as Llama 2 and Zephyr, which I am currently using; I am about to test Llama 3 as well. So I created a pipeline for this: all the data is loaded into the DL, then chunking strategies are applied and embeddings are created. These vectors are loaded into a PostgreSQL DB, and for retrieval we use chain-of-thought style prompt engineering, testing and improving it as we go. I am also part of the team that provides SDKs for the feature store. So this is my entire experience.
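A minimal sketch of the chunking step in an ingestion pipeline like the one described above. It assumes simple word-based chunks with a fixed overlap; the function name and sizes are illustrative and this is not LangChain's recursive splitter or the actual pipeline code.

```python
def chunk_words(text: str, chunk_size: int = 50, overlap: int = 10) -> list[str]:
    """Split text into chunks of at most chunk_size words.

    Consecutive chunks share `overlap` words so that context spanning
    a chunk boundary is not lost when each chunk is embedded separately.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once the current chunk has reached the end of the text.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Each returned chunk would then be passed to an embedding model and stored with its vector; the overlap is a tuning choice that trades storage for boundary context.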
Yeah, let's say we have a multilingual dataset. First we need to identify which languages are present and categorize the dataset by language, because preprocessing built for one language will not necessarily work for another. Punctuation is largely the same across languages, so I would start with standard text preprocessing techniques: removing punctuation, removing stop words using NLTK, phrase matching using spaCy. LLMs probably offer better options for this, but I haven't worked with them for this purpose, so let me walk through a scenario. Step by step: for text preprocessing we can use text normalization, for example removing punctuation, lowercasing, and standardization. Then we tokenize, splitting the text into words and splitting large paragraphs into sentences. Tokenization can be language specific, so different languages will probably need different techniques, and I would check whether any LLMs or pre-trained models handle this better. Part-of-speech tagging and named entity recognition also differ by language, so I would detect the language, handle the encoding, and look at pre-trained models for these tasks, with some checks on tags and phrase matches. So basically, whatever we do in English we try to emulate in the other languages. The specifics of chunking, splitting, lemmatization, and stemming differ a little by language, but the idea is the same.
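The normalization and tokenization steps above can be sketched with the standard library alone. This is a hedged illustration: the tiny `STOP_WORDS` table and the function names are assumptions made up for the example, and a real pipeline would use NLTK or spaCy resources per language.

```python
import unicodedata

# Illustrative stop-word lists only; real lists would come from NLTK or spaCy.
STOP_WORDS = {
    "en": {"the", "a", "is"},
    "es": {"el", "la", "es"},
}

def normalize(text: str) -> str:
    """Unicode-normalize, case-fold, and replace punctuation with spaces."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return "".join(
        " " if unicodedata.category(ch).startswith("P") else ch for ch in text
    )

def preprocess(text: str, lang: str) -> list[str]:
    """Whitespace-tokenize normalized text and drop that language's stop words."""
    stop = STOP_WORDS.get(lang, set())
    return [tok for tok in normalize(text).split() if tok not in stop]
```

For example, `preprocess("The laptop is slow!", "en")` keeps only `laptop` and `slow`. Whitespace tokenization is the language-specific part: it is reasonable for English or Spanish but would not segment languages written without spaces, which is why the tokenizer should be chosen per language.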
I haven't used reinforcement learning and I'm not very familiar with it from my experience, so I cannot comment much on that part. But one place where we could use it is the Dell chat application we are building. For that, we are creating embeddings and loading them into pgvector. Once the application goes to production, users might ask the chatbot something like "my laptop has this kind of issue, what should I do?", and it will provide some answers. Sometimes the user will be satisfied with the answers and sometimes not. So we can ask the user to provide feedback or a review: is the answer what you were looking for, did we solve the problem? We can use that feedback as part of a chain of thought and send it as input to the LLM. If we add a reinforcement learning structure to the problem at that point, it can learn from it, the same way a user in a chat can say your answer is wrong, or your answer is not exactly what I am looking for. The Dell chat application belongs to what our company calls the content monetization team, and ideally reinforcement learning has to be added there at some point. As of now I cannot answer this question more clearly. I know the idea of reinforcement learning, RLHF, can be used here, but I cannot give more details because I do not have experience with it. I can explain the idea behind it, but I cannot go into the details.
Yeah, knowledge graphs are very useful when there are lots of reviews involved, as on travel websites like TripAdvisor, Booking.com, Expedia, or Trivago. When users give lots of reviews, we can generate tags from them and put the tags in a knowledge graph, where each tag defines a relationship from one point to another. Let's say two users have each given some feedback. How similar are these two users? That depends on the keywords they used, the content, and the polarity of the feedback. Based on all these factors, we create a knowledge graph over the embedding space, a high-dimensional space. When a new user comes to the website and asks for the best hotels with a beachside view, or for particular facilities, the backend uses the points created from earlier users' feedback. The traversal behind knowledge graphs is probably DFS and BFS, and based on that it works out where this query sits in the knowledge graph. I am not exactly sure which backend algorithm does this calculation; I have not worked on that. I have worked on vector embeddings more than on knowledge graphs. But ideally, I think this is how we can use a knowledge graph in AI development. Users keep asking queries and posting reviews. Based on those reviews, especially the tags and metadata, we know what each review says and what kind of words it contains, and we can find related content. Let us say two words are almost similar in word2vec embeddings; that closeness is present in the knowledge graph too. So how do we connect one review to another? We take whichever is closest, and those are the reviews the user sees. I think that is how TripAdvisor shows reviews: if I ask about some point, it automatically gives the feedback other people gave, in a concise manner, without losing the user's context. So this is where knowledge graphs can be useful.
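The "closest review" idea above can be sketched as a nearest-neighbor lookup by cosine similarity. This is only an illustration of the embedding side, which is what the answer describes in most detail: the three-dimensional vectors and review ids are made up, and no graph traversal is shown.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def nearest_reviews(query_vec: list[float],
                    reviews: dict[str, list[float]],
                    top_k: int = 2) -> list[str]:
    """Return the ids of the top_k reviews most similar to the query vector."""
    ranked = sorted(reviews, key=lambda rid: cosine(query_vec, reviews[rid]),
                    reverse=True)
    return ranked[:top_k]
```

With toy vectors where the first dimension stands for "beach", a query of `[1.0, 0.0, 0.0]` ranks beach-related reviews first. In a real system the vectors would come from an embedding model and the lookup from a vector index.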
Transformer-based model for language translation. Language translation, got it: you give one sequence and get another, so basically this is an encoder-decoder model as the starting point. The core idea behind the transformer is of course the self-attention mechanism, which allows the inputs to interact with each other and weighs the significance of each input regardless of its position in the sequence. First I would do data preparation, where the data contains lots of sample pairs, and use libraries in PyTorch or TensorFlow; it does not matter which framework. The model generally consists of several layers. The first step is the embedding layer, which converts input tokens to vectors. Then comes positional encoding, which adds positional information to the input embeddings, since transformers do not have recurrent layers. Then there is the encoder, composed of a stack of almost identical layers, each with an attention sub-layer and a feed-forward neural network. After that comes the decoder. Almost the same thing happens in the decoder, but with an additional sub-layer that performs multi-head attention over the encoder's output. The final layer projects the decoder output to the size of the vocabulary. So that is the architecture at a high level. The loss function can be cross-entropy loss, since this is a classification-type problem, and the optimizer can be AdaGrad, Adam, or RMSProp. Then do evaluation and testing on a separate test set: if you are translating Spanish to English, take Spanish text the model has not seen and work on that. For metrics, I think BLEU score or something similar is what assesses this.
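Since the answer names self-attention as the core idea, here is a minimal pure-Python sketch of scaled dot-product attention for a single head. It assumes plain lists of floats for the query, key, and value matrices; a real model would use PyTorch or TensorFlow tensors with learned projections and multiple heads.

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q: list[list[float]],
              K: list[list[float]],
              V: list[list[float]]) -> list[list[float]]:
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output
```

In the encoder, `Q`, `K`, and `V` all come from the same sequence (self-attention). In the decoder's extra sub-layer, `Q` comes from the decoder and `K`, `V` from the encoder output.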
Generally, overfitting can be addressed in a few ways depending on the scenario. Getting more training data helps where that is possible. Sometimes the model is learning too much complexity and too many patterns in the data, so try to reduce some of that complexity and use dropout and regularization so that it does not memorize the entire dataset; that way we can reduce overfitting. I am saying this in general, not just for personalized recommender systems. So L1 and L2 regularization, monitoring the validation loss, hyperparameter tuning, and all the techniques we generally use on a machine learning model are one part. But specifically for recommending personalized travel itineraries, the best idea is to understand the metrics and understand what kind of predictions we are actually giving. Once the model is deployed, we get information from the product analytics or consumption analytics team about what kind of recommendations we are giving. Are the users satisfied with them? We can ask for user annotations, and users can give us some answers. These are the things we can do for personalized travel itineraries, or for anything based on recommendations. That is why I said to deploy a basic model first and start building from there, because user engagement will be very different from whatever we see offline. What happened in previous years might not hold now. This is not like some other ML projects where the patterns stay stable; here things change a lot. So deploy the model, get some analytics, and figure out what has gone wrong, for example whether there is too much data drift. That way we can reduce overfitting, along with the normal techniques like regularization, dropout layers, feature selection, feature engineering, reducing complexity, and removing multicollinearity between features.
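Monitoring the validation loss, mentioned above, is usually turned into an early-stopping rule. A minimal sketch, assuming the per-epoch validation losses are already available as a list; the `patience` and `min_delta` parameters are illustrative names, not from any particular framework.

```python
def early_stopping_epoch(val_losses: list[float],
                         patience: int = 2,
                         min_delta: float = 0.0) -> tuple[int, float]:
    """Return (best_epoch, best_loss), stopping after `patience` epochs
    in a row without an improvement of more than `min_delta`."""
    best_loss = float("inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break  # validation loss stopped improving: likely overfitting
    return best_epoch, best_loss
```

The model weights saved at `best_epoch` are the ones to keep. Note that training stops before seeing any later epochs, so a late improvement after the patience window is deliberately ignored.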
So, I am not exactly sure, but I think SelectKBest here is based on the chi-square statistical test. You cannot say that it will always have a negative impact on model evaluation; sometimes it gives positive results. But along with it you have to check whether there is too much relationship among the other variables. For example, if there is too much collinearity, we can drop some features using the variance inflation factor. For continuous regression variables, if two of them have almost the same importance for the target variable, I think we can delete one. Doing it this way will probably work, but we cannot say for certain that the other approach gives a wrong answer. The metric given is accuracy score. I assume accuracy works for this project, but ideally it may not. For feature selection we can also use LASSO regression, which penalizes the features that are less important to the model and so reduces the number of features. As for the negative impact: if it occurs, it is mostly because you are losing a lot of information from the dropped features, which may have been very helpful for the model. In general we cannot say exactly how much negative impact there will be or what kind. One more thing: data leakage may be an issue here because of how the train-test splitting is done at the start. So do not validate on seen data. Do your validation on unseen data that has not gone through this feature engineering. Only when we test against the outside world can we understand how much negative impact there is. For now we cannot fully assess it, but there may be an issue because we are not considering the other features.
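To make the leakage point concrete, here is a hedged pure-Python sketch of chi-square based top-k feature selection that is fitted on the training rows only and then applied to held-out rows. It is a stand-in written for illustration, not scikit-learn's `SelectKBest`; the function names are assumptions, and features must be non-negative counts or frequencies.

```python
def chi2_scores(X: list[list[float]], y: list[int]) -> list[float]:
    """Chi-square statistic per feature for non-negative features."""
    n_rows, n_features = len(X), len(X[0])
    totals = [sum(row[j] for row in X) for j in range(n_features)]
    scores = [0.0] * n_features
    for label in sorted(set(y)):
        rows = [row for row, lab in zip(X, y) if lab == label]
        class_prob = len(rows) / n_rows
        for j in range(n_features):
            observed = sum(row[j] for row in rows)
            expected = class_prob * totals[j]
            if expected > 0:
                scores[j] += (observed - expected) ** 2 / expected
    return scores

def select_k_best(X_train: list[list[float]], y_train: list[int], k: int) -> list[int]:
    """Pick the k highest-scoring column indices using training data only."""
    scores = chi2_scores(X_train, y_train)
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])

def transform(X: list[list[float]], columns: list[int]) -> list[list[float]]:
    """Keep only the selected columns."""
    return [[row[j] for j in columns] for row in X]
```

The order matters: split first, call `select_k_best` on the training split, then `transform` both splits with the same column indices. Selecting features on the full dataset before splitting lets test labels influence which columns survive, which inflates the validation score.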
Okay, the best one is MLflow, or some people may use Kubeflow. In my view MLflow is mostly the best one. It is open source, and it integrates seamlessly with different cloud environments like AWS, Azure, Databricks, and GCP. As for multiple data scientists: I am the one building the platform for our team to use MLflow. I am creating code that works for any kind of project in MLflow. Teams have to use this code and always go through these functions. They can use any functions of their own, but at the end they call the MLflow operations functions we provide. Version control is the main benefit. Once your code has been pushed to the repo, the CI/CD pipeline automatically runs the code and finds the best model across the hyperparameter runs happening in the background. That model is stored in MLflow as the production model, not as archived. The choice is based on user-defined metrics. Let's say I am running five test runs in my experiment: linear regression with five hyperparameter settings, logistic regression with five, and XGBoost with five, or maybe deep learning. Whichever gives the highest accuracy, precision, recall, or user-defined metric goes to production. For recommendation projects especially, those metrics are things like click rate, conversion rate, and click-to-purchase conversion rate. This way, other data scientists can later work on the models stored in MLflow. We should all share one MLflow instance. Data scientists, even on different projects, can fetch a model, log a model, or take a previously logged model and evaluate it. So MLflow should be the best option for version control of models. For code, I think GitLab is a good choice, since that is what I am working with. There are other setups, like Jenkins with GitHub, but GitLab has CI/CD built in, so there is no need for a separate tool. Based on this, multiple data scientists can collaborate across different teams, projects, and models.
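The "best run goes to production" step can be sketched independently of any tracking server. This is a hedged illustration: the `Run` record and `pick_best_run` helper are hypothetical names, and in practice the runs would be read from MLflow rather than built by hand.

```python
from dataclasses import dataclass, field

@dataclass
class Run:
    """One training run: which model, which hyperparameters, which metrics."""
    model_name: str
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)

def pick_best_run(runs: list[Run], metric: str, higher_is_better: bool = True) -> Run:
    """Return the run with the best value of a user-defined metric."""
    candidates = [run for run in runs if metric in run.metrics]
    if not candidates:
        raise ValueError(f"no run logged the metric {metric!r}")
    choose = max if higher_is_better else min
    return choose(candidates, key=lambda run: run.metrics[metric])
```

A CI/CD job would call something like `pick_best_run(runs, "precision")` after the experiment finishes and then promote that run's model in the registry. Keeping the metric name as a parameter lets recommendation projects select on click or conversion rates instead of accuracy.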
As I said, I have not worked on reinforcement learning before. But if it is a chat-like conversational AI approach, this is what we can do. Users interact with the website: travelers keep asking questions, they get responses, and they say this is wrong or this is right. Whatever the user says, right or wrong, the system has to take that feedback into the back end efficiently. Once the user has started a conversation, the responses we keep generating have to be held in a cache, maybe Redis. Let us say my token limit is 16,000 tokens or 70,000 or 100,000 tokens. 100,000 tokens is a lot, and the back end keeps summarizing this content using an LLM. We can build a reinforcement learning system there. When the user says an answer is wrong, it immediately has to re-check. If there is a RAG pipeline in the back end, it goes back to the pipeline and does a little more work, for example changing the temperature of the query, or if top-k documents is 3, making it 4 or 5, so that more content is retrieved. From that it tries to get the best response based on some kind of score; I do not remember exactly which score. You score each of the responses retrieved from the RAG pipeline and give the user whichever is best. If he again says it is wrong, go back and ask him what exactly he is looking for. Based on the keywords he sends, take that additional information as new input. This input also carries keywords, tags, summaries, and polarity, positive or negative. Then I can retrieve again using this together with the previous responses. The previous responses must be in the cache, of course, because the system has to know which response was wrong before making the next choice. At this point we can use a reinforcement learning system. I have not actually worked on reinforcement learning, but I think this is the spot where we can use it.
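The retry loop described above can be sketched as follows. This is a simplified illustration and not reinforcement learning itself: it only shows caching rejected answers, widening `top_k` after negative feedback, and re-querying with the user's clarification. The word-overlap retriever and the in-memory list standing in for Redis are assumptions made for the example.

```python
def retrieve(query: str, documents: list[str], top_k: int) -> list[str]:
    """Hypothetical retriever: rank documents by words shared with the query."""
    query_words = set(query.lower().split())
    ranked = sorted(documents,
                    key=lambda doc: len(query_words & set(doc.lower().split())),
                    reverse=True)
    return ranked[:top_k]

class FeedbackSession:
    """Tracks one conversation: rejected answers and the current top_k."""

    def __init__(self, documents: list[str], top_k: int = 3, max_top_k: int = 5):
        self.documents = documents
        self.top_k = top_k
        self.max_top_k = max_top_k
        self.rejected: list[str] = []  # stands in for a Redis-backed cache

    def answer(self, query: str):
        """Return the best retrieved document not already rejected, or None."""
        hits = retrieve(query, self.documents, self.top_k)
        fresh = [doc for doc in hits if doc not in self.rejected]
        return fresh[0] if fresh else None

    def reject(self, response: str, query: str, clarification: str):
        """Record a wrong answer, widen retrieval, and retry with more detail."""
        self.rejected.append(response)
        self.top_k = min(self.top_k + 1, self.max_top_k)
        return self.answer(f"{query} {clarification}")
```

The accumulated (query, rejected answer, accepted answer) triples are the kind of signal that a later RLHF-style training step could consume; the sketch stops at collecting and acting on them within one session.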
I have done this many times. One example is an NLP project I deployed at TCS using topic modeling. The agenda was: once a user raises a ticket, automatically create tags for the ticket, and based on those tags, which are called topics in LDA, automatically route the ticket to the right agent. The business SLA has been reduced from 3 to 5 days for this. The issue we faced was this. Suppose a user raises a ticket saying "my Ultimatix access has been blocked". Ultimatix is a keyword, a TCS website, so the model might think this person's account got locked, and the ticket automatically goes to the Ultimatix team. But his credentials are actually fine; it is his access that is broken. So the adjustment I made to the machine learning model was to use bigrams, trigrams, and phrase matches instead of only unigrams, to capture whether the phrase is positive or negative and what exactly the user is describing. With these changes made according to the requirements, we pushed the model. Initially only about 33% were correct, meaning out of 100 tickets only 33 were routed well; after the changes, around 60 to 70% were. That is one example. There are many other situations with changing business environments. For example, for the cross-sell recommendations I have been working on, I initially set up the entire pipeline. Since the pipeline works well and is scalable and reliable, we were able to scale to Canada, the UK, EMEA and other European countries, and APJC as well. Canada and the UK were completed within about 2 weeks, at most one sprint. So within 14 days we were able to deploy the models for Canada and the UK. To accommodate business requirements that change from time to time, I think we should have a robust pipeline in place, so that when something new comes up we can deploy it easily. I have also modified machine learning models themselves, for example adding blacklist items to the model, adding filters to sort out the reviews and feedback, and finally giving the right cross-sell recommendations. I have also used some rule-based techniques, such as in the final scoring part of the recommendation system. Different projects will obviously have different business requirements, and that is how I have handled them.
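The move from unigrams to bigrams and trigrams can be illustrated with a few lines of standard-library Python. The function names are illustrative; in the LDA setup described above, these n-grams would be fed into the vocabulary before topic modeling.

```python
def ngrams(text: str, n: int) -> list[str]:
    """Return all word n-grams of the lowercased text, in order."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_features(text: str, max_n: int = 3) -> list[str]:
    """Unigrams up to max_n-grams combined into one feature list."""
    features = []
    for n in range(1, max_n + 1):
        features.extend(ngrams(text, n))
    return features
```

For the ticket "my Ultimatix access has been blocked", the unigram `ultimatix` alone points at an account problem, while the bigram `ultimatix access` and the trigram `access has been` carry the access context that lets the ticket be routed differently.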
Yeah, for this part we can have integration tests and unit tests, and all of this has to be built into the code pipeline itself, in the CI/CD tools. For every piece of code you create, try to write a unit test. That is how code coverage gets covered. Every organization has DevOps maturity scores, and we have them as well, so you have to report code coverage, and you get it by writing unit test cases. Unit tests have to run before the model is pushed to production or pushed to MLflow. Also check for vulnerabilities in the data and whether data distortions are currently occurring. For this we can use DVC for data engineering and data pipelines, and integration tests for automated testing. The integration test does not run on the entire dataset. It only checks whether the pipeline performs well from start to end: are the unit test cases running properly, is code coverage there, is the Docker image created properly, are the secrets created in Kubernetes, can it host the API. All these things can run as part of the integration test. Or, if we are looking only at a machine learning pipeline: is it training, is it loading the results to the DB, is it loading the model to MLflow, and, if it is not a batch job, is inference fast enough. This way we can test all aspects of the machine learning pipeline. So the main idea is to use integration tests to run the entire pipeline and check that it works, and only then move to the next stage, which is deploying to production, to the Kubernetes cluster, or to whatever stages you have in the machine learning CI/CD pipeline. This is how we can ensure data integrity as well as the health of the entire pipeline.
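An end-to-end integration test of the kind described above can be sketched on a toy pipeline. Everything here is hypothetical and simplified: `load_data`, `train`, and the dictionary used as a model registry stand in for the real data source, training job, and MLflow. The point is that the test runs every stage on a small sample and checks each handoff.

```python
def load_data() -> list[tuple[float, float]]:
    """Toy data source: a small sample following y = 2x + 1."""
    return [(float(x), 2.0 * x + 1.0) for x in range(10)]

def train(rows: list[tuple[float, float]]) -> dict:
    """Fit y = a*x + b by ordinary least squares."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    var = sum((x - mean_x) ** 2 for x, _ in rows)
    slope = cov / var
    return {"a": slope, "b": mean_y - slope * mean_x}

def predict(model: dict, x: float) -> float:
    return model["a"] * x + model["b"]

def run_pipeline(registry: dict) -> dict:
    """Run load -> train -> register, as the CI job would on a small sample."""
    model = train(load_data())
    registry["latest"] = model  # stands in for logging the model to a registry
    return model

def test_pipeline_end_to_end() -> None:
    registry: dict = {}
    model = run_pipeline(registry)
    assert registry["latest"] is model           # model was registered
    assert abs(predict(model, 20.0) - 41.0) < 1e-6  # inference works on unseen input
```

In a CI/CD pipeline this test would run after the unit tests and before the deploy stage, so a failure in any handoff blocks promotion.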