
I’m a Senior Software Engineer in Machine Learning with 8+ years of experience building scalable backend systems and end-to-end ML platforms. Currently at Toast, I design and optimize ML infrastructure on AWS, accelerating deployment speed and improving model reliability for large-scale production systems.
My expertise spans MLOps, distributed systems, cloud architecture, and real-time model serving using tools like MLflow, TensorFlow Serving, Kubernetes, and Kafka. I’ve led initiatives that improved deployment efficiency, enhanced predictive accuracy, and strengthened system availability to 99.9% at scale.
I’m passionate about building robust ML platforms, enabling data teams, and transforming complex machine learning workflows into reliable, production-ready systems.
ML Ops Engineer III, CrowdStrike
Senior Software Engineer, Toast
Machine Learning Operations Engineer III, Ninjacart
Software Engineer, TEKsystems
Software Engineer II, HERE Technologies
Machine Learning Ops Engineer II, Ninjacart
Backend Engineer, Infosys
Backend Engineer, Wipro Limited
MySQL

mlflow

whylabs

evidently.ai
Docker

Prometheus
Grafana

Kubernetes

Vue.js

SAML

MLFlow

Airflow

React.js

Gitlab CI/CD

MLFlow

TFX

React.js
Hey, hi. This is Rajesh. I have around 8-plus years of experience in the IT industry. For the last 4 years, I would say, I've been mostly working on machine learning operations, and before that I was more of a back-end engineer. I started my career at Wipro, where I worked with Java as my primary programming language and Oracle SQL as the back-end database. Then I moved to Infosys, where I worked as a MuleSoft developer, basically on integration testing. After that I was looking for opportunities related to machine learning and other AI-related fields, and machine learning engineering, where software development is combined with ML, really caught my eye. So I moved to a different company, HERE Maps. Over there, I started learning Python, and we also built ML solutions. For example, we worked closely with data scientists, helping them deploy models, and focused more on scaling and scale-out functionality. I built CloudWatch dashboards and also worked on alerting systems using Grafana dashboards and Prometheus. These are the major areas I worked on. Also, during COVID there was a little reshuffling of the teams, so I got a chance to work on a streaming application, which is called a passive link. What we do is get real-time updates of location data, process them, and send the data to the back-end system, where it is used to update the navigational map. People would also use it for building 3D maps. So that's where my journey into machine learning started. Then, post-COVID, I was studying at Scaler, trying to upskill myself in a bunch of other things.
And then what specifically caught my eye was machine learning operations, where you build a platform and help the data scientists accelerate model deployment by building model-tracking servers, creating automated pipelines, and defining strategies for performing deployments and so on. Then I moved on to the current company, where I'm mostly involved in building the ML platform. Here, I built a strategy called 1-click deployment, with a better UI design as well as a back-end design, so that a model can be deployed quickly. I also had to implement a lot of elevation strategies. Right now, I've just started with a schedule-based elevation strategy, but I'm now focusing on model-out-based elevations, which should be driven by a trigger. And, yeah, I'm also working on some LLM-related applications. I'm just starting to pick that up. My major use case right now is mimicking the ChatGPT response: instead of showing all the generated tokens in the front end at once, we have to stream the tokens to the front end as they are generated. So, these are the things I have worked on. Thank you.
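To illustrate the token-streaming idea mentioned above, here is a minimal sketch rather than the actual implementation. It assumes the back end sends Server-Sent Events frames to the front end; `fake_token_source` is a hypothetical stand-in for an LLM client, and the `[DONE]` sentinel is just one common convention for marking the end of a stream.

```python
import json
from typing import Iterable, Iterator


def fake_token_source(text: str) -> Iterator[str]:
    """Hypothetical stand-in for an LLM client that yields tokens one at a time."""
    for word in text.split():
        yield word + " "


def sse_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Wrap each token in a Server-Sent Events frame as soon as it is available."""
    for token in tokens:
        # One frame per token: the front end can append it to the visible answer.
        yield f"data: {json.dumps({'token': token})}\n\n"
    # Assumed sentinel frame so the front end knows the response is complete.
    yield "data: [DONE]\n\n"


if __name__ == "__main__":
    for frame in sse_stream(fake_token_source("streamed one token at a time")):
        print(frame, end="")
```

Because `sse_stream` is a generator, nothing is buffered: each frame can be written to the HTTP response as soon as the model produces the corresponding token.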
So I remember, I've worked on some use cases where we deploy containers using Docker running a Python program, and we wanted to understand whether the memory being consumed is in the right proportion. For example, we configure the CPUs and memory, but we didn't have infrastructure to monitor whether the memory allocated matches the memory these instances actually need. For that, I did a trial where I built a monitoring solution for memory and performance. As a sidecar, we deploy cAdvisor, the container advisor, which helps us understand the memory and RAM configuration associated with the specific container that is deployed, and then we try to diagnose the performance bottleneck from the Python point of view. As for the bottlenecks themselves, I mostly fixed issues related to the back-end database, where the indexing was not right, and you fix that and then rewrite the query against the back-end server. With respect to Python, the fix was not to keep the cache inside the application container itself but to put it in an external service. That helps us avoid a lot of issues in the system itself. That's one thing. The other thing that I used is cAdvisor. You can call it the container advisor, which is developed by Google. So we attached cAdvisor to the Docker container, and that's how we did it. Apart from that, another application performance bottleneck was related to the machine learning model. Let's say we deploy a model onto a GPU machine: the first inference always takes time. So what we do is, we have the FastAPI service and we spin up the instance, but I don't let the instance become healthy until the model is loaded and has done its first default inference, so that the subsequent requests on a GPU machine or CPU machine get very low latency.
That's something I found out very recently: this specific approach is applicable to one of the GPU-based models, and it's also applicable to one of the CPU-based models, which is on LightGBM, where the first inference also took one second or something like that. So these are the various issues. Also, sometimes an external call to a server adds a lot of latency. So we try to understand what actually happens: is it the code that's causing the problem, or is it the external system that we interact with before the model inference? So these are the different use cases that we deal with. And yeah, mostly, these are the different performance bottlenecks that I try to identify.
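The warm-up approach described above can be sketched roughly as follows. This is a framework-agnostic illustration, not the FastAPI code from the project: `WarmModelService`, `load_dummy_model`, and the status codes returned by `health` are all assumptions made for the example. The point is only that the health check stays unhealthy until the model has been loaded and has served one default inference.

```python
import time
from typing import Callable, Optional, Sequence

Model = Callable[[Sequence[float]], float]


class WarmModelService:
    """Reports healthy only after the model is loaded and warmed up once."""

    def __init__(self, load_model: Callable[[], Model],
                 warmup_input: Sequence[float]) -> None:
        self._load_model = load_model
        self._warmup_input = warmup_input
        self._model: Optional[Model] = None
        self.ready = False
        self.warmup_seconds: Optional[float] = None

    def start(self) -> None:
        started = time.perf_counter()
        self._model = self._load_model()
        # The first inference is the slow one, so pay that cost before
        # the load balancer is allowed to send real traffic.
        self._model(self._warmup_input)
        self.warmup_seconds = time.perf_counter() - started
        self.ready = True

    def health(self) -> int:
        # 503 keeps the instance out of rotation until warm-up has finished.
        return 200 if self.ready else 503

    def predict(self, features: Sequence[float]) -> float:
        if not self.ready or self._model is None:
            raise RuntimeError("model is not warmed up yet")
        return self._model(features)


def load_dummy_model() -> Model:
    """Hypothetical loader; a real one would read weights from disk or a registry."""
    return lambda features: sum(features) / len(features)


if __name__ == "__main__":
    service = WarmModelService(load_dummy_model, warmup_input=[0.0, 0.0])
    print(service.health())   # not ready before start()
    service.start()
    print(service.health(), service.predict([1.0, 3.0]))
```

In a web framework, `start` would typically run in a startup hook and `health` would back the endpoint that the load balancer or orchestrator polls.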
How could you implement the CI/CD? I'm not sure what you mean. To be honest, I have not really worked on Jenkins. Maybe I can take an example of CI/CD from the GitHub and GitLab perspective. So whenever code is committed to a specific branch, let's say a non-main branch, which I would like to call a feature branch, the push triggers certain checks. For example, we can have unit test cases with pytest. pytest would run all the Python tests under the test folder. On top of that, we can have a CI/CD pipeline step that checks the code coverage and fails the pipeline when the coverage is less than the expected one. This is one kind of automated testing that we can do, and I would say it's unit testing. On top of it, we can have integration testing as well, where we load our own fixtures and perform inference on them. Once the inference is done, we verify the results, and if the results are up to the mark, we go ahead. That's one thing. And, three, is related to the deployment of the machine learning model. There are two things with respect to a machine learning model. One is the code version that changes. The other is the model version that changes. When the code version changes, you build a Docker image pointing to the existing model version. Let's say the model registry is an MLflow server. I was thinking it could be some other MLflow artifact as well, but let's consider the MLflow model registry as the standard. So whenever we make a commit, we can point to a specific experiment and the model version that is registered for a specific image. You build your Docker image, download the model, package them together in a Docker container, and then deploy. This is one way.
Or else, you download the model when the actual container instance is spinning up, and then start serving in real time. Let's say when the Docker container starts, it reads the model version from a parameter, like an environment variable. You just set it, and the container gets the model from the registry and then starts doing inference. So these are the different ways we can do it, but I've really not worked on Jenkins. I worked mostly on GitHub- and GitLab-based configurations, like the .gitlab-ci.yml file.
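The second option above, resolving the model version when the container starts, might look something like this sketch. The environment variable names `MODEL_NAME` and `MODEL_VERSION` are assumptions for illustration; the `models:/<name>/<version>` form is the URI scheme the MLflow model registry uses for registered model versions.

```python
import os
from typing import Mapping


def resolve_model_uri(env: Mapping[str, str] = os.environ) -> str:
    """Build a model registry URI from container environment variables.

    MODEL_NAME and MODEL_VERSION are hypothetical variable names that the
    deployment would set on the container.
    """
    name = env.get("MODEL_NAME")
    version = env.get("MODEL_VERSION")
    if not name or not version:
        # Fail fast at startup instead of serving without a model.
        raise ValueError("MODEL_NAME and MODEL_VERSION must both be set")
    return f"models:/{name}/{version}"


if __name__ == "__main__":
    uri = resolve_model_uri({"MODEL_NAME": "demand-forecast", "MODEL_VERSION": "12"})
    print(uri)
    # With MLflow installed and a tracking server configured, the URI could be
    # handed to mlflow.pyfunc.load_model(uri) before the service starts serving.
```

With this approach, rolling out a new model version only requires changing the environment variable and restarting the container; the Docker image itself stays the same.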
Okay, one thing that I can see is that we can set up a specific instance where the Jenkins pipeline is running, and we can tag a specific task to point to a specific Docker image. From that Docker image, we can start executing the scripts that we have. For example, Docker for setting up a repeatable build environment for machine learning projects. Let's say we have a project that uses PyTorch as a base image. On this base image we can have the pip packages installed, but it needs to be built on an NVIDIA base image with the CUDA driver enabled and certain prerequisites. Right? When I build for a GPU machine, I choose the right base image for the application and also install the libraries that are actually necessary for CUDA and any other libraries or drivers associated with NVIDIA. Then comes the basic installation or Poetry installation. We structure our project so that it is packaged with its dependencies managed by Poetry. And then, only for inference, there are specific dependencies that are required. You can install them there and then execute the entry point of your Docker image, which would take the shell script to run. The project you set up could be an online inference model, which serves through HTTP APIs, or it could be an offline job, which can be run from Airflow or other orchestration tools that execute the job and complete it. So you can use that image and build on it. But using Jenkins, I am not really sure how to set up a repeatable build environment, though we can always pin a specific image on an instance. It's like Docker inside Docker, where you can run some of your Docker commands within Docker, which supports building environment-specific features. And for every environment, what we want to tag, let's say, AWS.
You specify that key, import it into your build step, and push your Docker images to that specific environment, based on the ECR repository for which the Docker build and the Docker push are configured.
Do you provision and scale machine learning? Okay. Here's what I understand. When do you want to provision in AWS? Right? Okay. Let's assume the models are deployed in AWS ECS clusters. Every ECS cluster is tied to an ASG, an Auto Scaling group. With this Auto Scaling group, we can also apply certain scaling policies to the specific service that is created. There are a lot of metrics that we can use here. I personally have used the CPU utilization metric. Let's say, for a given 5-minute period, if the CPU metric is more than 20 or, let's say, 40%, then trigger the auto scaling and scale out. That's something we can do. We can also configure scale-in. So these are the 2 items that I have worked on. I personally have not tried GPU-based scaling, because by default a service that we create in a cluster does not show the GPU utilization and the reward margin as well. That's something we have to enable. And provisioning? Yeah. Let's say, to handle high-load prediction. Right? In my understanding, when we say high load, it means a huge amount of incoming traffic to the model. So based on the model's needs, we have to set the minimum and maximum number of instances, and we have to do performance testing first. From the outcome of the performance testing, we can set the minimum number of instances that should always be available for a given service. As the requests start increasing or decreasing, the scaling policy applies, and instances spin up or down based on that. This is one area. Another thing we can do is have an EventBridge scheduler that invokes Lambda functions, which in turn update the services. For example, from 12 to 6, there is no load, so during that time we can reduce the number of instances. After that, we can increase the minimum number of instances again.
This is one of the techniques that we can follow as a cost-saving procedure. And sometimes the GPU machines take a lot of time to load and spin up, so we can maintain some warm, cool, hot instances in the cluster. Then all the cluster has to do is start the Docker image directly rather than wait for the GPU machine to spin up. For example, I checked this out very recently. I was trying to upgrade from a g5 to a g6 instance, which is more cost-efficient. When I tried it, the GPU machine went from the provisioning state to the pending state, and from pending to running it took more than 10 minutes to spin up. So, yeah, these are certain things that we can consider to handle high-load predictions.
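The two scaling ideas in this answer, a CPU-threshold scale-out rule and a time-based minimum instance count, reduce to logic like the sketch below. This is plain Python for illustration only; in AWS the same rules would be expressed as scaling policies and scheduled actions rather than application code. The 40% threshold comes from the answer, while the instance counts and the reading of "12 to 6" as midnight to 6 a.m. are assumptions.

```python
from statistics import mean
from typing import Sequence


def should_scale_out(cpu_percent_samples: Sequence[float],
                     threshold: float = 40.0) -> bool:
    """True when average CPU over the evaluation window exceeds the threshold."""
    return bool(cpu_percent_samples) and mean(cpu_percent_samples) > threshold


def scheduled_min_instances(hour: int, quiet_min: int = 1, busy_min: int = 4,
                            quiet_start: int = 0, quiet_end: int = 6) -> int:
    """Minimum instance count for a given hour of the day (0-23).

    Assumes the quiet period runs from midnight to 6 a.m.; the counts are
    placeholders that performance testing would normally determine.
    """
    if not 0 <= hour <= 23:
        raise ValueError("hour must be between 0 and 23")
    return quiet_min if quiet_start <= hour < quiet_end else busy_min


if __name__ == "__main__":
    five_minute_window = [35.0, 45.0, 50.0, 42.0, 48.0]  # one sample per minute
    print(should_scale_out(five_minute_window))
    print(scheduled_min_instances(3), scheduled_min_instances(14))
```

A scheduled job, such as the Lambda function mentioned above, would call something like `scheduled_min_instances` and then update the service's minimum capacity accordingly.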
Okay, I am not sure which AWS security features are meant for model protection. Okay. When you say AWS services that are part of the ML pipeline, I assume we are talking about, maybe, the SageMaker instance, the S3 buckets where we save the models, the ECS cluster we deploy to, or the offline jobs that run as training jobs. We also get the training data from other services. So I would put all of this into a single VPC, which will have its own security groups, where I would allow only specific IP addresses rather than all traffic. One thing you can set is your inbound rules. The other thing you can set is your outbound rules. In general, outbound rules could be just allow all, but inbound rules are something that should be controlled. And it would be better to put all your services into a single VPC and a single region. I have not personally worked on multi-region AWS features, and I have not handled that. But, yeah, security groups, which means adding inbound and outbound rules, and then I would check the VPC and also what its CIDR blocks are. Mostly, these are the basic security features that I would look for. Most probably, I might even look at the AWS gateway services as well and also have certain security configurations for the load balancers, with their own rules for which specific parameters a request is accepted and routed, or else blocked. So, yeah, these are the major areas I would be thinking of in terms of security, and mostly it deals with security groups and the VPC. Yeah. Pretty much. That's it.
Your container. Sorry. I don't think I can Google it out, but what I can see is that here we have mentioned the container port, but this container port cannot be accessed from the actual machine port. For example, when we run a Docker container from a specific image, we specify a parameter called -p, which is the port mapping. You give the machine port number and then the container port number. So you have to add these two. Basically, the machine's port is not mapped through the Docker bridge to container port 80, so that mapping is not proper, I believe. That is the issue here. The Docker container might run, but the port is not actually open for the machine itself to access it. So we need an exact mapping between the instance port and the Docker container port. That is what is missing here. Other than that, I don't see any other problem that could affect the model service and how you might connect to it. Oh, okay. How will I correct it? Adding the port mapping. That should solve the problem. Yeah. Yeah, that's the thing from my end.
I think we fit the model, but we actually did not transform it. I believe the transform function is the one which actually trains the model. It is actually kind of preprocessing and having the data fit into the model, but we need to call fit and transform for the training. And for predictions, yes. The test is fine. I need to check the dot count function, but before that I feel we have to check whether we are just splitting the training data and then starting to predict on the test set. We are not actually checking the metrics for the data that we've fit into the model and verifying whether the recall and precision are right or not. That is one thing. Syntax-wise, everything looks good to me. And, yes, I think transform is something that is missing from the code. And, yes, we have to train the model on the training data and then start predicting for the x test. So for the x test also, I think we need to do the fit first and then start the prediction rather than directly predicting it. I think that is something that should be done. Yeah.
Do you design an optimal high-throughput system for real-time machine learning? Right? There are many ways to do it. One is optimizing the model inference. Right? For example, we train a model on a GPU machine, and sometimes we come out with model weights that are much smaller. I mean, the model weights are an output, and we can deploy them on either CPU machines or GPU machines. In general, if the inference latency is too high on a CPU machine, we can opt for a GPU machine for model deployment. This is one thing. Or else, we have certain techniques. For example, you can convert the floating-point numbers from float32 to float16, which will actually reduce some performance of the model, but it might increase the model inferencing speed. This is one way. And I'm not sure whether something like model pruning can be done, or we can also optimize by converting the model through ONNX, through the ONNX converter, which would give us the model with certain changes in the model inferencing graph. So inferencing could be faster by using, for example, the ONNX converter, which would give us the model weights. So, these are, like, three different ways to look at real-time machine learning model inferencing. And if it's a high-throughput system, of course, I need auto-scaling. Also, understanding CPU utilization. Also, let's say you're performing real-time inferencing. Right? For example, say the model is serving inference over HTTP: make sure the network latency is not high. Based on that, deploy the models in the specific place or environment that is closest to the customer calls. These are different things that we can think through in terms of model inferencing. I would also choose the right, lightweight models to train, like LightGBM models or some other models.
If the network is too huge, maybe serve the requests using batching, continuous batching. Let's say a Ray cluster uses a technique called continuous batching. When the model is enabled for batching, instead of inferencing on each request one by one, it creates a micro-batch, performs the inferencing all together at once, and then sends the results to the callers. So, these are the various items that we can think of for model inferencing techniques. Yeah.
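The micro-batching idea above can be illustrated with a small sketch. Note that this shows plain fixed-size micro-batching over requests that are already queued, which is simpler than the continuous batching a serving framework performs; `micro_batches`, `predict_in_batches`, and `double_all` are hypothetical names used only for this example.

```python
from typing import Callable, Iterator, List, Sequence, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def micro_batches(items: Sequence[T], max_batch_size: int) -> Iterator[Sequence[T]]:
    """Split queued requests into consecutive batches of at most max_batch_size."""
    if max_batch_size < 1:
        raise ValueError("max_batch_size must be at least 1")
    for start in range(0, len(items), max_batch_size):
        yield items[start:start + max_batch_size]


def predict_in_batches(requests: Sequence[T],
                       batch_predict: Callable[[Sequence[T]], List[R]],
                       max_batch_size: int = 8) -> List[R]:
    """Run one model call per micro-batch and return results in request order."""
    results: List[R] = []
    for batch in micro_batches(requests, max_batch_size):
        # A single call handles the whole batch instead of one call per request.
        results.extend(batch_predict(batch))
    return results


if __name__ == "__main__":
    def double_all(batch: Sequence[int]) -> List[int]:
        """Hypothetical batch model: doubles every input."""
        return [2 * x for x in batch]

    print(predict_in_batches([1, 2, 3, 4, 5], double_all, max_batch_size=2))
```

The gain comes from models, especially on GPUs, that process a batch of inputs in roughly the time of a single input, so fewer model calls mean higher throughput at the cost of a small queuing delay.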
What would be your... I personally have not worked on recommendation systems. What would be your strategy to architect our recommendations? Okay. So, let's think of a use case first, and maybe I can answer based on that. I have no idea how I should answer this otherwise. Take, for example, the Hotstar streaming system. Right? That's a very big use case. When viewership starts increasing a lot, we can have good recommendation results ready for a specific location and specific types of users. The results are already saved and then shown in real time, rather than being computed when the user logs in. Otherwise you fetch the ad, the ad results are not displayed in time, and we don't know what to show the user. Right? It would be better to have precomputed recommendation outputs based on user location or user behavior. Then, when a specific user falls into a certain pool or category, we show them these ads. Yeah. Based on that, we can do this. Also, we can make sure that we don't show repeated ads to the user. So, we always have to track all this information while building this with a microservices approach in AWS. Yeah. Pretty much, this is my thought process on this. There could be a lot of potential here, but I'd like to clarify this question and discuss it more. However, I'd like to say that I have not worked on recommendation systems personally, so I'm not able to think it through in a very short time. Yeah.
How would you integrate, let's say, Kafka with MLflow to improve the model tracking for real-time predictions? How do you integrate Apache Kafka with MLflow? For the model tracking? My question is, why do we have to integrate Apache Kafka with MLflow? Kafka is a consumer. Okay. I mean, it publishes messages to the consumers as well; it consumes and publishes. Okay. With MLflow, to improve the model tracking for a real-time prediction service. Maybe, for example, we can try out something like this with a newly trained model: if its accuracy is really good, we don't have to integrate MLflow with Kafka, but we can send the new model's predictions to a Kafka topic, which can be read by a process, let's say a Lambda function or something. If the Lambda function sees that the actual performance of the new model is better for a given experiment than the previous one, then it can go and trigger a new deployment. This is something we can do for tracking a real-time prediction service. Yeah. But from the deployment perspective, I can't think of it. What is this improving of the model tracking for a real-time prediction service? What model tracking are we trying to do? Is it, like, a new model version that is enabled? Then we can emit an event saying, hey, there's a new model that has come. Once the new model is available, the message is available in Kafka. Based on that, create an event and trigger it, maybe via AWS EventBridge or something, to trigger certain functions. Hey, I got a new message now, take this out, do you want to do something else? Provide this specific MLflow version to the AWS services, like the ECS clusters. Restart them so they pick up the new model version, which is the new model tag, and spin up. This is one way. Or else, we need to have the A/B testing kind of setup, where we create a different service with the new version altogether. And how do I put this? Yeah.
That can be done so that you have an A/B testing kind of setup, where you route 50% of the traffic to model A and 50% to model B and then start the inferencing. So that is something we can do. This is what I am thinking of, but I'm not really sure if this is what is being expected from this question. Maybe I need to clarify this question a lot more rather than just answering it straight away. Yeah. Thank you.
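A 50/50 traffic split like the one described above could be sketched as follows. This is an assumption-laden illustration, not a description of any existing service: `assign_variant` and the variant names are hypothetical, and hashing the request or user ID is just one common way to make the split deterministic, so the same caller always reaches the same model.

```python
import hashlib


def assign_variant(request_id: str, share_for_b: float = 0.5) -> str:
    """Deterministically route a request to 'model_a' or 'model_b'.

    share_for_b is the fraction of traffic (0.0 to 1.0) sent to model B.
    """
    if not 0.0 <= share_for_b <= 1.0:
        raise ValueError("share_for_b must be between 0.0 and 1.0")
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "model_b" if bucket < share_for_b else "model_a"


if __name__ == "__main__":
    counts = {"model_a": 0, "model_b": 0}
    for i in range(10_000):
        counts[assign_variant(f"user-{i}")] += 1
    print(counts)  # roughly an even split
```

Logging the assigned variant alongside each prediction then allows the two model versions to be compared on the same metrics before one of them takes all the traffic.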