
Lead Engineer, AI, Logility
Technical Lead ML, Encora Inc.
Senior Engineer, Acquia
Software Developer, Tech Mahindra
Software Engineer, TIBCO
Senior Consultant, Capgemini
Airflow

MLflow

Kubernetes
Snowflake

GitHub Actions

SonarQube

Checkmarx

Git

Confluence

Rally
Docker

ArgoCD

Helm Charts
Flask

Django
FastAPI

Elasticsearch

Airflow

MLflow

AWS
Azure

ArgoCD

Checkmarx

Rally
Datadog
I am Sashank from Delwal. I have 11 years of experience in the IT industry, including 8 years in Python and 5 years in machine learning and machine learning operations, that is, model operations and MLOps. I work on classical ML algorithms, deep learning, and other areas. Now I'm focusing more on GenAI: LLMs, prompt engineering, LangChain, AI, and all these things. So this was a brief overview about me. Thank you.
How can you optimize resource allocation in a Kubernetes cluster running heavy Python-based machine learning workloads without overprovisioning?
So, the first thing we can do is right-size the resource requests and limits. We can set appropriate CPU requests and appropriate CPU limits, and also an appropriate number of nodes with a limit on the node count. The second thing is to use node affinity to schedule ML workloads on nodes with specific characteristics, for example, nodes with GPUs or high-memory nodes. Then there are taints and tolerations. Use taints and tolerations to control which pods can be scheduled on certain nodes, which helps to isolate ML workloads from other, less critical workloads. Then we can use autoscaling, which is the HPA, to automatically scale the number of replicas based on CPU or memory usage. This ensures the application can handle varying loads without manual intervention. Then we can use the cluster autoscaler to automatically adjust the size of the Kubernetes cluster based on the resource requests. This ensures that the cluster can scale up to accommodate increased workloads and scale down to save costs when demand decreases.
Then we have resource quotas. We can set resource quotas at the namespace level to control the aggregate resource consumption of all pods within a namespace. This prevents resource starvation and ensures fair resource distribution. Then we have efficient resource utilization. We can use spot instances for noncritical or batch ML workloads, and for GPUs we can use Jobs to utilize the GPU resources efficiently. That's pretty much it. Apart from this, we can have a monitoring and logging system so that we can monitor continuously and set alerts if we see any hiccups.
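As a rough illustration of the right-sizing and node affinity points above, the following sketch derives requests from observed usage samples and builds a pod spec fragment as a plain dictionary. The 95th-percentile rule, the `limit_factor` value, the `accelerator=nvidia-gpu` label, the `ml-workload` taint key, and the image name are all assumptions made for this example; they do not come from the interview.

```python
import math
from typing import Any, Dict, Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def right_size(cpu_millicores: Sequence[float],
               memory_mib: Sequence[float],
               limit_factor: float = 1.5) -> Dict[str, Dict[str, str]]:
    """Derive requests from the 95th percentile of usage (an assumed rule)
    and limits as a multiple of the request."""
    cpu_request = math.ceil(percentile(cpu_millicores, 95))
    mem_request = math.ceil(percentile(memory_mib, 95))
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi"},
        "limits": {
            "cpu": f"{math.ceil(cpu_request * limit_factor)}m",
            "memory": f"{math.ceil(mem_request * limit_factor)}Mi",
        },
    }


def ml_pod_spec(resources: Dict[str, Dict[str, str]]) -> Dict[str, Any]:
    """Pod spec fragment pinning the workload to GPU nodes.
    The label, taint key, and image name are hypothetical."""
    return {
        "containers": [{
            "name": "inference",
            "image": "example/ml-inference:latest",
            "resources": resources,
        }],
        "affinity": {"nodeAffinity": {
            "requiredDuringSchedulingIgnoredDuringExecution": {
                "nodeSelectorTerms": [{"matchExpressions": [{
                    "key": "accelerator",
                    "operator": "In",
                    "values": ["nvidia-gpu"],
                }]}],
            },
        }},
        "tolerations": [{
            "key": "ml-workload",
            "operator": "Exists",
            "effect": "NoSchedule",
        }],
    }


resources = right_size(
    cpu_millicores=[200, 250, 300, 400, 900],
    memory_mib=[512, 600, 700, 800, 1024],
)
spec = ml_pod_spec(resources)
print(resources)
```

The usage samples would normally come from a metrics source such as the metrics server; here they are hard-coded so the sketch runs on its own.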
We have different log levels like debug, info, warning, error, and critical, so we have to define these log levels and use them wisely. Then we should also use a logging configuration. With Python's built-in logging module, we can configure the log levels, formats, and handlers. Then, structured logging can be used to make logs more readable and easier to parse. Libraries like python-json-logger can be used to format logs as JSON. Next, we should use a centralized logging solution. We can store our logs on a volume and create a Persistent Volume Claim (PVC) so that the logs persist even if the cluster scales up or down. Alternatively, we can send the logs to a monitoring tool like Datadog. This will allow us to view logs without needing an ELK stack or advanced knowledge.
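The logging setup described above could look roughly like the sketch below. It uses only the standard library, so the small `JsonFormatter` class stands in for a library such as python-json-logger; the logger name `inference` and the choice of writing to stdout are assumptions for the example.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Format each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def configure_logging(level: str = "INFO") -> logging.Logger:
    """Configure level, format, and handler for a hypothetical service logger."""
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("inference")
    logger.setLevel(level)
    logger.handlers = [handler]   # replace handlers so repeated calls do not duplicate output
    logger.propagate = False
    return logger


logger = configure_logging()
logger.info("model loaded")
logger.debug("suppressed because the level is INFO")
```

Writing JSON lines to stdout keeps the application unaware of where the logs end up; a log agent or a mounted volume can then collect them.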
Let's go through the steps to containerize a Python-based machine learning inference service using Docker. First, you should have four things: your source code folder, which will contain your model file; your unit test suite folder; a requirements.txt file; and a Dockerfile. The reason for the test suite is that before every deployment you should run the unit tests, so you can check that the code runs correctly. This will reduce unnecessary deployment cycles.
Then we create the Dockerfile. In the Dockerfile, we start with a base image, like Python 3.10, depending on the requirements we have. Then we set a working directory and copy all our code there. Then we install the requirements with pip. Whatever port we want to expose for our app, we expose. If we want some environment variables, we can set them. At the end, we add the command that starts the application in CMD, with brackets and each word in quotes.
Then we build the Docker image with docker build, with the -t flag and whatever repository name or tag we want to give it. Then we run a container from that build locally and test the inference service locally. So the first check is the unit tests and the second one is this smoke test. Once everything is working, we can push the Docker image to whichever registry we want to use. Then we can have Helm charts and deploy them. In the Helm chart, we have to mention the registry URL and the tag of the image. These are the steps.
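To make the order of the Dockerfile instructions concrete, here is a small sketch that renders a Dockerfile as text from Python. The function name, the `/app` working directory, port 8000, the `MODEL_PATH` variable, and the `python app.py` command are assumptions for the example, not details from the interview. Copying requirements.txt before the rest of the code is an added choice so the dependency layer can be reused when only the code changes.

```python
import json
from typing import Dict, Optional, Sequence


def render_dockerfile(base_image: str = "python:3.10",
                      workdir: str = "/app",
                      port: int = 8000,
                      env: Optional[Dict[str, str]] = None,
                      command: Sequence[str] = ("python", "app.py")) -> str:
    """Return Dockerfile text following the steps described in the answer."""
    lines = [
        f"FROM {base_image}",            # 1. start from a base image
        f"WORKDIR {workdir}",            # 2. set the working directory
        "COPY requirements.txt .",       # 3. copy the dependency list first
        "RUN pip install --no-cache-dir -r requirements.txt",
        "COPY . .",                      # 4. copy source code and model file
    ]
    for key, value in (env or {}).items():
        lines.append(f"ENV {key}={value}")   # 5. optional environment variables
    lines.append(f"EXPOSE {port}")           # 6. port the service listens on
    # 7. exec-form CMD: a JSON array, i.e. brackets with each word in quotes
    lines.append("CMD " + json.dumps(list(command)))
    return "\n".join(lines) + "\n"


dockerfile_text = render_dockerfile(env={"MODEL_PATH": "model.pkl"})
print(dockerfile_text)
```

The rendered text can be saved as `Dockerfile` and then built and run with `docker build -t <name> .` and `docker run`, as described in the answer.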
What approach will you take to troubleshoot performance bottlenecks in a Python-based machine learning API running on Kubernetes?
That's a very interesting question, and I have some good experience here, so I can try to answer it. First, we have to collect metrics and identify the bottlenecks. We cannot directly go and fix things, because we first have to find the bottleneck. For that purpose, we set up monitoring. We can do the monitoring with Grafana or with Datadog. We can also use the metrics server, which gives us resource usage metrics. Then, what should we monitor? We should monitor pod metrics, like CPU and memory usage, the number of restarts, and the resource requests and limits. Then come node metrics, the overall resource usage across our cluster. Then come custom metrics, such as model inference time, number of concurrent requests, and request time.
Then there is the performance analysis. We should analyze the resource utilization, such as resource usage per node, and we should investigate the logs as well. We should send the logs to Datadog or have an ELK setup where we can aggregate and analyze them. We should try to find patterns in the logs and in the metrics and build a story out of them. One more thing we should do is profile the application. Once all this is set up, we run load testing. If load testing reveals an issue, we try to recreate it, and if we can recreate it, we find the bottleneck around it using everything I described earlier. Based on that, we optimize the code or the configuration, whichever is required. We also adjust the resources: if we are underutilizing or overutilizing them, we change them accordingly. We also tune the network and storage configuration.
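One way to collect the custom metric mentioned above, model inference time, is a timing decorator around the prediction function. This is a minimal in-process sketch; the `predict` function is a stand-in for real model inference, and in practice the samples would be exported to a monitoring tool rather than kept in a module-level dictionary.

```python
import functools
import math
import statistics
import time
from typing import Callable, Dict, List

# Collected latencies per metric name, in seconds (kept in memory for the sketch).
LATENCIES: Dict[str, List[float]] = {}


def timed(name: str) -> Callable:
    """Decorator that records the wall-clock duration of every call."""
    def decorator(func: Callable) -> Callable:
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                LATENCIES.setdefault(name, []).append(time.perf_counter() - start)
        return wrapper
    return decorator


def summarize(name: str) -> Dict[str, float]:
    """Return count, mean, nearest-rank p95, and max for a recorded metric."""
    samples = sorted(LATENCIES[name])
    p95_index = max(0, math.ceil(0.95 * len(samples)) - 1)
    return {
        "count": len(samples),
        "mean_s": statistics.mean(samples),
        "p95_s": samples[p95_index],
        "max_s": samples[-1],
    }


@timed("predict")
def predict(features: List[float]) -> float:
    # Hypothetical stand-in for model inference.
    return sum(features) / len(features)


for _ in range(20):
    predict([1.0, 2.0, 3.0])
print(summarize("predict"))
```

Comparing the mean against the p95 under load testing is one quick way to see whether a slowdown affects every request or only the tail.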
Outline the process for converting stateless machine learning APIs in Python to stateful services in Kubernetes for complex processing needs.
First, we have to define the state requirements: determine what state information has to be maintained across requests, for example, user sessions, intermediate features passed between APIs, or the cache or models. Then, we have to implement state management. We have to decide how and where the state will be stored: in a database, locally, or in memory. Accordingly, we modify the API to handle the state. We integrate a state storage mechanism, such as MongoDB, to store and retrieve this state information. Then, we have to upgrade the app and update the Docker configuration to add any necessary dependencies. We also have to implement the stateful logic in the application. Finally, we deploy this stateful service to Kubernetes.
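The state management step can be sketched as a small storage interface that the API handler depends on. The in-memory class below is only a stand-in so the example runs by itself; a MongoDB-backed class with the same `get` and `put` methods would replace it in a real deployment. The `request_count` and `running_total` fields are hypothetical examples of per-session state.

```python
from typing import Any, Dict, List, Optional, Protocol


class StateStore(Protocol):
    """Minimal interface the API needs from any state backend."""

    def get(self, key: str) -> Optional[Dict[str, Any]]: ...

    def put(self, key: str, value: Dict[str, Any]) -> None: ...


class InMemoryStateStore:
    """Stand-in backend; state is lost when the process restarts."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def get(self, key: str) -> Optional[Dict[str, Any]]:
        return self._data.get(key)

    def put(self, key: str, value: Dict[str, Any]) -> None:
        self._data[key] = dict(value)


def handle_request(store: StateStore, session_id: str,
                   features: List[float]) -> Dict[str, Any]:
    """Load the session state, update it with this request, and persist it."""
    previous = store.get(session_id) or {"request_count": 0, "running_total": 0.0}
    updated = {
        "request_count": previous["request_count"] + 1,
        "running_total": previous["running_total"] + sum(features),
    }
    store.put(session_id, updated)
    return updated


store = InMemoryStateStore()
handle_request(store, "user-1", [1.0, 2.0])
print(handle_request(store, "user-1", [3.0]))
```

Keeping the state behind an interface means the handler itself stays unchanged when the backend moves from memory to a database.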
Dockerfile issues that could potentially break the build when leveraging the cache. The first issue I can point out is the order of the instructions. We are first initializing from the base image, and I don't see a problem there. But the RUN pip install cannot come right after that. What we have to do is first initialize from the base image, then create a working directory, which is app. Once the app working directory is there, we can copy the requirements and other things into it. Also, I don't think we should use ADD; we should use the COPY command instead. These are the issues.
So for this, we have to ensure that the necessary resources and tools are properly set up. We have to define the caching mechanism. First, we define what to cache and which part to cache. Then we choose the caching method. We can use either Redis or Memcached. If we need an in-memory data structure store, then we should use Redis; it is widely used. But if we want distributed memory object caching, then we go with Memcached. Because it's Kubernetes, there is an inclination to go with Memcached, but for Airflow we use Redis. It depends on the actual use case. You cannot make a blanket statement that one of them should always be used.
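Whichever backend is chosen, the application-side pattern is usually the same: look in the cache first and compute only on a miss. The sketch below shows that pattern with a small in-process time-to-live cache. It is an illustration only; the class name, the TTL value, and the `embedding` key are assumptions, and a Redis or Memcached client would take the place of the internal dictionary in a shared deployment.

```python
import time
from typing import Any, Callable, Dict, Hashable, Tuple


class TTLCache:
    """In-process cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return the cached value for key, recomputing it if missing or expired."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # cache hit
        value = compute()                         # cache miss: do the expensive work
        self._entries[key] = (now + self._ttl, value)
        return value


def expensive_feature_lookup() -> Dict[str, float]:
    # Hypothetical stand-in for a slow feature computation or model call.
    return {"embedding_norm": 1.0}


cache = TTLCache(ttl_seconds=60.0)
first = cache.get_or_compute("embedding", expensive_feature_lookup)
second = cache.get_or_compute("embedding", expensive_feature_lookup)  # served from cache
print(first is second)
```

The injectable `clock` parameter exists so expiry can be exercised in tests without sleeping.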
We should use a multi-zone cluster to ensure high availability, so that if one availability zone goes down, the others can continue to serve. We use a cluster autoscaler to automatically adjust the size of the cluster based on resource usage and workload demands. Second, we should deploy the service with multiple replicas to ensure redundancy. We should also use the Horizontal Pod Autoscaler (HPA) to scale the replicas based on resource usage. For storage and data management, we can use Persistent Volumes (PVs) to manage storage for stateful components like databases or artifacts. We can also use distributed storage solutions like Amazon EFS, Google Cloud Filestore, or Azure Files for high availability. We should also think about monitoring and logging. We should create backups of critical data regularly, including artifacts, configuration files, and databases, and we should have an automated setup that takes these backups on a regular basis. For load balancing and traffic management, we should use an ingress controller like NGINX to manage external access to services and provide load balancing. Finally, we should have security and compliance in place. We should have role-based access control and network policies defined to control access to resources and ensure secure connections between pods.
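For intuition on how the HPA decides the replica count, the Kubernetes documentation gives the core rule as desired replicas = ceil(current replicas × current metric / target metric). The sketch below applies that rule with a min/max clamp. It deliberately ignores the tolerance band and stabilization behavior of the real controller, and the default bounds are assumptions for the example.

```python
import math


def desired_replicas(current_replicas: int,
                     current_metric: float,
                     target_metric: float,
                     min_replicas: int = 1,
                     max_replicas: int = 10) -> int:
    """Core HPA scaling rule, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 4 replicas averaging 90% CPU against a 60% target scale out to 6.
print(desired_replicas(4, current_metric=90, target_metric=60))
# 4 replicas averaging 30% CPU against a 60% target scale in to 2.
print(desired_replicas(4, current_metric=30, target_metric=60))
```

Setting the minimum replica count to at least two is a common way to keep redundancy even when the load is low.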
Prioritize key considerations when selecting AWS cloud services for deploying a scalable Python machine learning application.
When selecting AWS cloud services for deploying a scalable machine learning application, we should consider the following things. First, we should consider low latency. Second, we should consider scalability. For scalability, we can use Amazon EC2 Auto Scaling. We can also use AWS Lambda, which gives a serverless, event-driven architecture. We can use EKS, which has clusters and pods that scale up and down. Then we should set up high availability and fault tolerance. If we use a database, we can use Amazon RDS Multi-AZ, or we can use Amazon S3. We can also set up Amazon Route 53, so that we have a highly reliable, scalable DNS service for routing traffic. Then come ELBs, which distribute the incoming traffic across multiple targets.
Then we should think about performance optimization. For this, we have to talk about instance types: we should choose instances according to our workload, whether it is compute-optimized or memory-optimized. If we need a high-performance file system for high-speed processing of large datasets, we can use Amazon FSx. If we have to improve the global availability and performance of the application, we introduce AWS Global Accelerator. Then we come to data management and storage. Next comes security and compliance, which I also talked about in the previous answer. Here we have to define IAM roles and set up KMS. We should also use AWS Shield and AWS WAF. For monitoring all of this, I would introduce AWS CloudTrail, which records the API calls so that we can do auditing around them.
And the important thing is cost efficiency. Whatever we are doing, are we underutilizing or overutilizing it? If we have any batch workloads, we can use spot instances.
Elaborate on the role of MLflow in simplifying the management of Python machine learning model life cycles on a Kubernetes-based platform.
So, MLflow is used for tracking our models. When we train a model, we track the run using MLflow and upload all the artifacts around the model, including the model itself. Once we are done with multiple experiments, we can compare these models, including the graphs and metrics stored in MLflow, across all the runs. We can then choose a model and register it. We can host that model on a Kubernetes-based platform and create an endpoint. We can use that endpoint to generate statistics for monitoring purposes, send those statistics to a monitoring tool like Datadog, and compare the results. If there is any data drift or concept drift, we should go back to testing the data and repeat the training process. So, much of the model life cycle can be managed using MLflow.
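The answer does not say how drift would be detected. As one possible illustration, the sketch below computes the population stability index (PSI) between a training-time feature sample and a production sample, using only the standard library. PSI is a technique swapped in for this example, and the bin count, smoothing constant, and `should_retrain` threshold are assumptions that would need tuning per feature.

```python
import math
from typing import List, Sequence


def population_stability_index(expected: Sequence[float],
                               actual: Sequence[float],
                               bins: int = 10,
                               eps: float = 1e-6) -> float:
    """PSI between a reference sample and a new sample of one feature.
    Bin edges come from the reference sample; out-of-range values are
    clamped into the first or last bin."""
    low, high = min(expected), max(expected)
    width = (high - low) / bins or 1.0   # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - low) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(count / len(values), eps) for count in counts]

    reference = proportions(expected)
    current = proportions(actual)
    return sum((c - r) * math.log(c / r) for r, c in zip(reference, current))


def should_retrain(psi: float, threshold: float = 0.25) -> bool:
    """Hypothetical decision rule: flag retraining above the threshold."""
    return psi > threshold


training_sample = [i / 10 for i in range(100)]
production_sample = [value + 3.0 for value in training_sample]
psi = population_stability_index(training_sample, production_sample)
print(should_retrain(psi))
```

A check like this could run on the statistics collected from the endpoint, with a positive result feeding back into the retraining loop described in the answer.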