
Senior Technical Leader - DevOps, Espire Infolabs
DevOps Lead Engineer, AccuKnox
DevOps Technical Lead, Celestial Systems Inc.
System Engineer, ValueFirst
DevOps Engineer, one.com
DevOps Engineer, Celestial Systems Inc.
Linux System Administrator, CHI Networks
Senior Analyst, HCL Technologies Ltd.
Amazon Web Services
Azure

Kubernetes
Docker

CI/CD Tools

Terraform

Ansible

Bash

Prometheus
Grafana
ELK Stack

Linux

Apache

Nginx

Git
Jenkins

GitHub Actions

ArgoCD

Travis CI

Terraform

Serverless Framework

Linux

Nginx

Prometheus
Could you help me understand more about your background?
Sure. My name is Divyans Shrivastava. I have a decade of experience in the DevOps and cloud space, and in these 10 years I have worked with a variety of companies. I have worked with product-based companies, like ValueFirst Digital Media and my current organization, AccuKnox. I have worked with one of the leading web hosting companies, one.com. I have also worked with classical service-based companies, such as HCL. By working in these companies, I have gained good exposure, understanding, and experience with DevOps tools, frameworks, and technology as a whole. I have also been in management positions in my last two organizations. I led the DevOps team in my last organization, where I built it from two to 22 people, and I am leading the team in my current organization, where I am working hard to build the DevOps framework and processes. I am proficient in designing the entire DevOps infrastructure from the first commit to production, in managing Kubernetes clusters, and in creating infrastructure architecture on cloud for various business use cases. I can also set up observability for microservices and infrastructure if needed. In all, my experience spans the technical as well as the managerial aspects, and I can handle both, including leading the team or mentoring junior DevOps engineers when needed. So that's all about me. My detailed skill set is mentioned in my resume, which you can go through for a better understanding. Thanks.
Propose a logging strategy for a Python machine learning application on Kubernetes that balances detail with storage considerations.
I think a logging strategy is crucial for any application in general. For a Python-based application, I think we should go for a centralized logging system: a single server or log aggregator that receives the logs from all application instances. We can use the ELK stack. We can have an ELK server and install an agent on the instances where the Python application runs. From there, the logs would be exported to the centralized server, where they are processed through Logstash. Then the logs can be queried through Elasticsearch and viewed in Kibana. For exporting the logs, we have different agents. We can make use of Beats or Fluentd, depending on which one is more efficient for our use case. In most cases Beats works well, so I think we can go with that. For storage considerations, it is better to offload all the logs to one place, which is our centralized server. We can rotate the logs after a month or after 45 days, depending on the log storage policy that is in place. We can offload the logs to an S3 bucket to reduce the overall EBS storage cost if we are on AWS, or to equivalent object storage to reduce the disk storage cost if we are on Azure. So the strategy is very simple: offload the logs from the application instances to a centralized server, and then rotate the logs after a certain interval to an archive solution that costs less from the storage perspective.
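As a rough illustration of the application side of this strategy, the sketch below writes one JSON object per log line so that a shipping agent such as Beats or Fluentd can parse it without custom patterns, and it lets the log level be lowered or raised to trade detail against storage. The names `JsonFormatter` and `build_logger` and the field names are assumptions made for this example, not something taken from the interview.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line so a log shipper can parse it."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_logger(name: str, level: str = "INFO", stream=sys.stdout) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream (stdout by default).

    A higher level (for example WARNING) means fewer lines shipped and stored.
    """
    logger = logging.getLogger(name)
    logger.setLevel(level)
    logger.handlers.clear()  # avoid duplicate handlers if called twice
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A typical call would be `build_logger("ml-app", os.environ.get("LOG_LEVEL", "INFO"))` (with `import os`), so that verbosity can be changed per environment without rebuilding the image; the `LOG_LEVEL` variable name is likewise hypothetical.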
Develop a strategy to implement A/B testing of new Python machine learning models in a Kubernetes environment while ensuring minimal impact on production.
I think it's a very good question, and I would say that we can make use of advanced deployment methodologies like canary deployment, in combination with GitOps. If I had to implement this solution, I would do it using the GitOps methodology with Argo CD as the tool, and then implement canary-based deployment. In a canary-based deployment, only a certain percentage of the traffic is exposed to the new version, while the rest continues to be served by the existing version. If you find that the new version is working well, you gradually increase the percentage of traffic that is exposed to it. This can be done with Argo Rollouts, the Argo project's progressive delivery controller that works alongside Argo CD and supports canary deployments, where you specify what percentage of traffic goes to the new version and what percentage stays on the previous version. For A/B testing, I think going ahead with canary deployment makes sense, as per
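The idea of exposing only a percentage of traffic to the new model can be sketched in a few lines. In an Argo-based setup the traffic weighting is normally done by the rollout controller together with an ingress or service mesh, so the following is only an application-level illustration of the same principle; the function name `assign_variant` and the variant labels are assumptions for this example.

```python
import hashlib


def assign_variant(user_id: str, canary_percent: int) -> str:
    """Return "canary" for roughly canary_percent of users, otherwise "stable".

    Hashing the user id keeps each user on the same variant across requests,
    and a user who is in the canary at 10% stays in it when the share grows.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in the range 0..99
    return "canary" if bucket < canary_percent else "stable"
```

Raising `canary_percent` step by step (for example 5, 25, 50, 100) mirrors the gradual increase described above, and setting it back to 0 acts as a rollback.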
How can you optimize resource allocation in a Kubernetes cluster running heavy Python-based machine learning workloads without overprovisioning?
The very first step for machine learning models would be to analyze the requirements: benchmark the application and see how much CPU or GPU it needs. Suppose we are working with TensorFlow and it needs 1 GPU to process 1 task. Then we know that if we go with multi-processing, it would need at least 2 or 3 GPUs at a time. Once we have that benchmarking data, we can put the resource requests into the configuration for those ML model pods, which helps us allocate resources without overprovisioning. We can also put limits, so that a pod is not allowed to go beyond, say, 3 or 4 GPUs. In my opinion, benchmarking the application is very important: knowing the memory, CPU, and GPU requirements of the ML models, and then configuring the resource requests and limits based on that. For benchmarking, I think we can make use of tools like Locust, and there are other tools that can help us determine these numbers.
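One way to turn benchmark data into requests and limits is sketched below. The sizing policy (request at the 95th percentile of observed usage, limit at the observed peak times a headroom factor) is an assumption chosen for illustration, not a rule stated in the interview, and `nvidia.com/gpu` assumes the NVIDIA device plugin is installed. Kubernetes requires GPU requests and limits to be equal when both are set, so the sketch uses the same value for both.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def suggest_resources(cpu_millicores: list[float], memory_mib: list[float],
                      gpus: int, headroom: float = 1.2) -> dict:
    """Build a container `resources` section from benchmark samples.

    Requests follow typical usage (p95); limits follow the observed peak
    plus headroom. Both policies are illustrative assumptions.
    """
    cpu_request = math.ceil(percentile(cpu_millicores, 95))
    mem_request = math.ceil(percentile(memory_mib, 95))
    cpu_limit = math.ceil(max(cpu_millicores) * headroom)
    mem_limit = math.ceil(max(memory_mib) * headroom)
    gpu = str(gpus)  # GPUs cannot be overcommitted: request == limit
    return {
        "requests": {"cpu": f"{cpu_request}m", "memory": f"{mem_request}Mi",
                     "nvidia.com/gpu": gpu},
        "limits": {"cpu": f"{cpu_limit}m", "memory": f"{mem_limit}Mi",
                   "nvidia.com/gpu": gpu},
    }
```

The returned dictionary has the same shape as the `resources` block of a container spec, so it can be dumped to YAML or compared against what is currently deployed.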
We can monitor the health of a Python-based machine learning application on Kubernetes and trigger alerts based on custom metrics. We can do this with Prometheus, an open-source monitoring tool. We will set up monitoring with Prometheus to collect metric data such as CPU, memory, and other performance metrics, and we can then send alerts based on those metrics. In fact, we can also configure auto-scaling with respect to Prometheus metric data, that is, we can configure HPA on Prometheus metrics, but that's a separate topic. To monitor our application, we will configure Prometheus to scrape metric data from all Python instances running on Kubernetes. We can use the Node Exporter to collect data from the nodes. Based on this metric data, we can set alerts. We can also integrate Grafana for visualization of those metrics, and set alerting at the Grafana level with thresholds visualized on panels and charts. There are also custom exporters that help us expose custom metrics. For example, if we want to know how many models are in process or in the queue, or how much CPU and memory an ML model run takes, we can use custom exporters to get this information. We can configure these exporters, export the data to Prometheus, and then visualize it on charts. I think this approach will work without any issues.
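To make the custom exporter idea concrete, here is a minimal standard-library sketch that renders gauges in the Prometheus text exposition format and serves them on `/metrics`. In practice the official `prometheus_client` package is the usual choice; this version avoids it only to stay self-contained. The metric name `ml_jobs_in_queue`, the `GAUGES` dictionary, and the port are hypothetical.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical gauges the application would update: name -> (help text, value).
GAUGES: dict[str, tuple[str, float]] = {
    "ml_jobs_in_queue": ("Inference jobs waiting in the queue", 0.0),
}


def render_metrics(gauges: dict[str, tuple[str, float]]) -> str:
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, (help_text, value) in sorted(gauges.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the current gauge values on /metrics for Prometheus to scrape."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(GAUGES).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever serving metrics; run this in a background thread."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

An alert rule in Prometheus or Grafana could then fire when `ml_jobs_in_queue` stays above a chosen threshold for several minutes.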
How would you leverage Kubernetes features to scale up Python-based machine learning inference workloads efficiently?
As I said, there are two levels of auto-scaling in Kubernetes. One is pod-level scaling, which is horizontal pod autoscaling or vertical pod autoscaling, and the other is cluster autoscaling. The parameters that decide when to auto-scale matter. That is why, once we have implemented Prometheus and have custom metric data, we can integrate it with HPA and make it scale on some standard metric or some custom metric. Once this is configured, the pods will scale on their own, based on the conditions we put in the scaling configuration. They will scale up to a limit, because every node can hold only a certain number of pods. Once that node threshold is reached, we need another level of auto-scaling, and that is the node autoscaler or cluster autoscaler. With cluster autoscaling, we have two options. Either we can go with the open-source Cluster Autoscaler, or we can go with something more sophisticated, like Karpenter. Karpenter is a tool developed by the AWS team itself, and it manages auto-scaling in a very different way. Instead of working with node groups, it interacts directly with the EC2 APIs to scale. Along with auto-scaling, it also helps with cost optimization, because it automatically consolidates nodes after a certain period of time, which is defined in the configuration file. So two levels of auto-scaling have to be configured: pod autoscaling (horizontal or vertical) and cluster autoscaling. With that, I think we would be able to achieve a decent level of auto-scaling.
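The scaling decision that HPA makes can be written down directly. The Kubernetes documentation gives the rule as desired replicas = ceil(current replicas x current metric / target metric), with no change when the ratio is within a tolerance (0.1 by default). The sketch below reproduces that rule for intuition only; the function name and the default bounds are assumptions, and the real controller also handles details such as missing metrics and stabilization windows.

```python
import math


def desired_replicas(current_replicas: int, current_metric: float,
                     target_metric: float, min_replicas: int = 1,
                     max_replicas: int = 10, tolerance: float = 0.1) -> int:
    """Approximate the HPA replica calculation for a single metric."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: do not scale
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, 4 pods averaging 200 queued requests each against a target of 100 would be scaled to 8 pods; once the pods no longer fit on the existing nodes, the cluster-level autoscaler takes over.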
Identify the Dockerfile issue that could potentially break the build and fix it by leveraging caching layers (Python application).
First of all, in this Dockerfile the requirements.txt file is not copied into the image. Without copying it, we are running the command pip install -r requirements.txt. This itself will break the Docker build, because requirements.txt will not exist inside the image at that point. This is the first problem that I see with this Dockerfile. I think the best solution is to first copy the requirements.txt file to the working directory in the image. After that, we should run pip install -r requirements.txt. Once pip install has run and installed all the Python modules, we should copy the rest of the code and build the application. Because the dependency layer only changes when requirements.txt changes, it stays cached when only the code changes. This is how the Dockerfile should be built for a Python application.
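The ordering described here can be checked mechanically. The sketch below holds a corrected Dockerfile as a string and verifies that the requirements file is copied before `pip install -r` runs. The original Dockerfile from the interview is not available, so the base image, `app.py`, and the conventional file name `requirements.txt` are all assumptions, and the checker is a deliberately simple line-based heuristic rather than a full Dockerfile parser.

```python
# A corrected layout: dependencies first (cached layer), application code last.
FIXED_DOCKERFILE = """\
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install -r requirements.txt
COPY . .
CMD ["python", "app.py"]
"""


def requirements_copied_before_install(dockerfile: str,
                                       req_file: str = "requirements.txt") -> bool:
    """Return True if req_file is copied before `pip install -r req_file` runs."""
    copied = False
    for raw in dockerfile.splitlines():
        line = raw.strip()
        upper = line.upper()
        if upper.startswith(("COPY ", "ADD ")):
            sources = line.split()[1:-1]  # everything between the keyword and the destination
            if req_file in sources or "." in sources:
                copied = True
        elif upper.startswith("RUN ") and f"-r {req_file}" in line:
            return copied
    return True  # no `pip install -r` step found, so nothing to break
```

Copying the whole source tree before the install step would also pass this check, but it would invalidate the dependency layer on every code change, which is why the corrected layout copies only `requirements.txt` first.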
Outline the process of converting stateless machine learning APIs in Python to stateful services in Kubernetes for complex processing needs.
For converting any stateless service to a stateful one that has to be deployed inside a Kubernetes cluster, we would obviously have to make use of StatefulSets. StatefulSets give every pod a unique, ordered identity based on its ordinal number, and they attach a persistent volume to every pod. These features help you retain the state of a pod even if it crashes. So StatefulSets have to be considered and configured for deploying these stateless services as stateful services inside the Kubernetes environment. Apart from that, anything further depends on the overall use case. As the description is very limited, I am unable to think of any other solution at the moment, to be very honest. But I think StatefulSets have to be considered and should be utilized for this purpose. They can be extended with Services, or maybe an Ingress, to expose those Python services. Under the hood, though, StatefulSets would have to be utilized for this purpose.
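On the application side, a service running under a StatefulSet can use its stable pod name (of the form `<statefulset-name>-<ordinal>`, which is also the pod's hostname) to find its own state on the attached volume. The sketch below shows one possible way to do that; the file naming scheme, the helper names, and the mount path passed as `data_dir` are assumptions for this example.

```python
import json
import re
from pathlib import Path


def pod_ordinal(hostname: str) -> int:
    """Extract the ordinal from a StatefulSet pod name such as "ml-api-2"."""
    match = re.fullmatch(r".+-(\d+)", hostname)
    if match is None:
        raise ValueError(f"{hostname!r} does not look like a StatefulSet pod name")
    return int(match.group(1))


def state_file(data_dir: str, hostname: str) -> Path:
    """Path of this pod's state file on its persistent volume mount."""
    return Path(data_dir) / f"state-{pod_ordinal(hostname)}.json"


def save_state(path: Path, state: dict) -> None:
    """Write state via a temporary file so a crash never leaves a partial file."""
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)


def load_state(path: Path) -> dict:
    """Load the previous state, or start empty on first boot."""
    return json.loads(path.read_text()) if path.exists() else {}
```

At startup the service could call `load_state(state_file("/data", socket.gethostname()))` (with `import socket`) so that a restarted pod resumes from where its predecessor with the same ordinal stopped.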
Design a fault-tolerant, high-availability solution for a Kubernetes cluster that serves Python-based machine learning models for critical, real-time applications.
Okay, so for creating a fault-tolerant, high-availability solution for a Kubernetes cluster that serves Python-based machine learning models for critical real-time applications, I think we would first have to use GPU-based instances if the machine learning models require them. The nodes should also be spread across multiple availability zones. After selecting the right kind of nodes, we should create the Kubernetes cluster in such a way that the nodes are scattered across different availability zones, so that even if a node in one availability zone is down or is experiencing some issue, the traffic can be served from other availability zones. Likewise, when we are designing and deploying the pods, we should also consider pod topology spread constraints. We should specify a spread constraint so that we have at least one pod of the model on each node, so that even if one pod crashes, the traffic can be served from other pods in other availability zones. Then we should adopt a decent auto-scaling mechanism to ensure that if a pod is bombarded with a lot of requests and is unable to serve the traffic, the traffic can be served from other pods. Apart from this, I think we can also implement probes. We have different kinds of probes: a startup probe, a liveness probe, and a readiness probe. These probes can be used to ensure that traffic is not sent to a pod until it is completely up and ready to receive it. So liveness and readiness probes help us route traffic to pods only when they are healthy and ready to accept it.
Also, we'll be able to know when pods are unhealthy, and based on that, the traffic can be routed to other healthy pods that are part of the cluster. So I think this is how we can design fault tolerance and high availability. There are a lot of other things that come into play, depending on how the entire application is designed and what other components are involved in the overall setup. But this is where we can start, and then we can brainstorm further to create an extensive solution.
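The probe behaviour described above needs matching endpoints in the Python service. A minimal standard-library sketch follows: the liveness path answers as soon as the process is up, while the readiness path returns 503 until the model has been loaded, so Kubernetes holds traffic back until then. The paths `/healthz` and `/ready`, the `ModelState` class, and the port are hypothetical choices for this example.

```python
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class ModelState:
    """Thread-safe flag that flips once the model is loaded."""

    def __init__(self) -> None:
        self._ready = threading.Event()

    def mark_ready(self) -> None:
        self._ready.set()

    def is_ready(self) -> bool:
        return self._ready.is_set()


def probe_status(path: str, state: ModelState) -> int:
    """Return the HTTP status code for a probe request."""
    if path == "/healthz":  # liveness: the process is up and answering
        return 200
    if path == "/ready":    # readiness: only after the model is loaded
        return 200 if state.is_ready() else 503
    return 404


STATE = ModelState()


class ProbeHandler(BaseHTTPRequestHandler):
    """Answer liveness and readiness probes with an empty body."""

    def do_GET(self):
        self.send_response(probe_status(self.path, STATE))
        self.send_header("Content-Length", "0")
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Block forever serving probes; run this in a background thread."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

The service would call `STATE.mark_ready()` after the model finishes loading; the pod spec's liveness and readiness probes would then point at these two paths.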
Explain the role of MLflow in simplifying the management of Python machine learning model life cycles within a Kubernetes-based platform.
How can Spark boost data processing for large-scale machine learning in a cloud environment like AWS?
I think Spark is capable of processing large amounts of data. Since Python machine learning models have to process large amounts of data, we can integrate Spark so that we can use its high processing power to process that data, and then route it to the models. I'm not very sure about this, but this is what is coming to my mind as of now.