Vetted Talent

Divyansh Srivastav

Passionate DevOps Engineer with 10 years of experience in providing and maintaining cloud infrastructure, formulating and implementing cloud solutions, and setting up CI/CD pipelines.

Role
DevOps Manager
Years of Experience
10 years

Skillsets

Cloud Infrastructure - 9 Years
Shell - 10 Years
Bash Shell Scripting
CI/CD Pipeline
Amazon Web Services
Serverless Framework
Linux system administration & performance tuning
Apache/nginx/caddy web server
Jenkins - 8 Years
Shell Scripting - 10 Years
SQL - 10 Years
Infrastructure as Code - 8 Years
Security - 5 Years
GCP - 4 Years
Git
Architecting & implementing CI/CD pipelines
System Administration - 10 Years
CI/CD Pipelines - 10 Years
Kubernetes
Ansible - 5 Years
DevOps - 10 Years
Cloud
Terraform - 6 Years
AWS - 10 Years
Apache
nginx
Azure - 8 Years
Git - 10 Years
Kubernetes - 7 Years
DevOps - 9 Years
Bash - 10 Years
Docker - 8 Years
Grafana
Nagios
Prometheus
CI/CD - 8 Years
ELK Stack
IAC - 7 Years
Azure DevOps - 7 Years

Vetted For

9 Skills

Roles & Skills
Results
Details

Senior Software Engineer - MLAI Screening
79%

Skills assessed: Kubeflow, Seldon, Spark, AWS, Docker, Kubernetes, Machine Learning, Problem Solving Attitude, Python
Score: 71/90

Professional Summary

10 Years

Apr, 2024 - Nov, 2024 7 months
Senior Technical Leader - DevOps
Espire Infolabs
Aug, 2023 - Mar, 2024 7 months
DevOps Lead Engineer
AccuKnox
Apr, 2021 - Aug, 2023 2 yr 4 months
DevOps Technical Lead
Celestial Systems Inc.
Sep, 2017 - Jan, 2018 4 months
System Engineer
ValueFirst
Jan, 2018 - Nov, 2018 10 months
DevOps Engineer
one.com
Dec, 2018 - Apr, 2021 2 yr 4 months
DevOps Engineer
Celestial Systems Inc.
Nov, 2016 - Aug, 2017 9 months
Linux System Administrator
CHI Networks
Aug, 2014 - Nov, 2016 2 yr 3 months
Senior Analyst
HCL Technologies Ltd.

Applications & Tools Known

Amazon Web Services
Azure
Kubernetes
Docker
CI/CD Tools
Terraform
Ansible
Bash
Prometheus
Grafana
ELK Stack
Linux
Apache
Nginx
Git
Jenkins
GitHub Actions
ArgoCD
Travis CI
Serverless Framework

Work History

10 Years

Senior Technical Leader - DevOps

Espire Infolabs

Apr, 2024 - Nov, 2024 7 months

Architecting and implementing Azure Cloud Infrastructure to enforce standards, manage compliance, and ensure Azure Well-Architected Framework adherence. Developing, optimizing, and maintaining Terraform code to provision and manage scalable Azure infrastructure using AKS, Azure Database, and other Azure services.

DevOps Lead Engineer

AccuKnox

Aug, 2023 - Mar, 2024 7 months

Led and managed the DevOps Team, actively participating in sprint planning sessions. Translated product requirements into DevOps solutions, ensuring alignment with business objectives.

DevOps Technical Lead

Celestial Systems Inc.

Apr, 2021 - Aug, 2023 2 yr 4 months

DevOps Engineer

Celestial Systems Inc.

Dec, 2018 - Apr, 2021 2 yr 4 months

Understood customers' DevOps requirements and designed complete DevOps workflows from the first commit to production.

DevOps Engineer

one.com

Jan, 2018 - Nov, 2018 10 months

Maintenance of IT infrastructure of web hosting. Configuration management. Planning and setup of CI/CD pipeline.

System Engineer

ValueFirst

Sep, 2017 - Jan, 2018 4 months

Linux server administration and maintenance in production. Managed the Amazon Web Services cloud platform.

Linux System Administrator

CHI Networks

Nov, 2016 - Aug, 2017 9 months

Linux servers Administration and Performance Tuning. Deployment of LAMP and web hosting platform.

Senior Analyst

HCL Technologies Ltd.

Aug, 2014 - Nov, 2016 2 yr 3 months

Learned about the ITIL process - Incident and change management. Resolved tickets and performed changes within the stipulated time.

Achievements

Implemented GitOps in multiple projects
Provisioning and managing Kubernetes clusters
Automation of Infrastructure provisioning via Terraform
Configuration management through Ansible
Linux System Administration & performance tuning
Translated product requirements into DevOps solutions, ensuring alignment with business objectives.
Led and managed the DevOps Team, actively participating in sprint planning sessions.
Conducted interviews to hire multiple DevOps positions, contributing to team expansion and talent acquisition initiatives.

Major Projects

1 Project

Multi-tenant SaaS application architecture

Architected a multi-tenant design for a SaaS application in conformance with the Azure Well-Architected Framework, using Azure Kubernetes Service.

Education

Bachelor of Technology in Instrumentation and Control Engineering
SRM University (2014)

Certifications

HashiCorp Certified: Terraform Associate (003), certificate number: b9850132-3657-45f0-945c-415279080f65
Certified Kubernetes Administrator (CKA), certificate number: lf-q22q5yjxxh
Red Hat Certified Engineer (RHCE), certificate number: 160-039-657

AI-interview Questions & Answers

Okay, could you help me understand more about your background and completion?

Sure. My name is Divyansh Srivastav. I have a decade of experience in the DevOps and cloud space. In these 10 years, I have worked with a variety of companies. I have worked with product-based companies, like ValueFirst Digital Media and my current organization, AccuKnox. I have worked with one of the leading web hosting companies, one.com. I've also worked with classical service-based companies, such as HCL. By working in these companies, I've gained good exposure, understanding, and experience working with DevOps tools, frameworks, and technology as a whole.

I have also been in management positions in my last two organizations. I led the DevOps team in my last organization, and I'm also leading the team in my current organization. I built the DevOps team from two to 22 in my last organization, and I'm working hard in my current organization to build the DevOps framework and processes.

I am proficient in designing the entire DevOps infrastructure from the first commit to production. I am proficient in managing Kubernetes clusters and creating infrastructure architecture on cloud for various business use cases. I can also set up observability for microservices and infrastructure if needed. So, in all, my experience spans the technical as well as the managerial aspects, and I can handle both, in case there is a need to lead the team or to mentor junior DevOps engineers. So that's all about me. My detailed skill set is mentioned in my resume, which you can go through to get a better understanding of it. Thanks.

Propose a logging strategy for a Python machine learning application on Kubernetes that balances detail with storage considerations.

I think a logging strategy for any application in general is very crucial. For a Python-based application, I think we should go for a centralized logging system. We should have a single server or log aggregator that receives the logs from all application instances. We can use the ELK stack. We can have an ELK server and install an agent on the instances where the Python application runs. From there, the logs would be shipped to the centralized server, where they would be processed through Logstash. Then the logs could be queried through Elasticsearch and viewed in Kibana. For shipping the logs, we have different agents. We can make use of Beats or Fluentd, depending on which one is more efficient for our use case. In most cases, Beats works well, so I think we can go with that.

For storage considerations, it's better to offload all the logs to one place, which is our centralized server. We can rotate the logs after a month or after 45 days, depending on the log retention policy that is in place. We can offload the logs to an S3 bucket to reduce the overall EBS storage cost in case we are on AWS, or the disk storage cost in case we are on Azure. So the strategy is very simple: ship the logs from the application instances to a centralized server, and then rotate the logs after a certain interval to an archive solution that costs less from the storage perspective.
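On the application side, one way to keep detail while limiting volume is to emit compact, structured log lines and to sample low-severity records before they ever reach the log shipper. The sketch below is only an illustration using the Python standard library; the class names `JsonFormatter` and `SampleLowSeverity`, the helper `build_logger`, and the `keep_ratio` parameter are hypothetical and do not come from the interview.

```python
import json
import logging
import random
from typing import Optional, TextIO


class JsonFormatter(logging.Formatter):
    """Render each record as one compact JSON line that a log shipper can parse."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "msg": record.getMessage(),
        }
        return json.dumps(payload, separators=(",", ":"))


class SampleLowSeverity(logging.Filter):
    """Keep every WARNING and above, but only a fraction of lower-severity records."""

    def __init__(self, keep_ratio: float, rng: Optional[random.Random] = None) -> None:
        super().__init__()
        self.keep_ratio = keep_ratio
        self.rng = rng or random.Random()

    def filter(self, record: logging.LogRecord) -> bool:
        if record.levelno >= logging.WARNING:
            return True
        return self.rng.random() < self.keep_ratio


def build_logger(name: str, stream: TextIO, keep_ratio: float) -> logging.Logger:
    """Create a logger that writes sampled JSON lines to the given stream."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG is dropped outright to save storage
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    handler.addFilter(SampleLowSeverity(keep_ratio))
    logger.handlers = [handler]
    return logger
```

In a container, the stream would typically be `sys.stdout`, so that the node-level agent (Beats or Fluentd in the answer above) can collect the lines. A `keep_ratio` of, say, 0.1 is an assumed starting point that would need tuning against the retention policy.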

Develop a strategy to implement A/B testing of new Python machine learning models in a Kubernetes environment, ensuring minimal impact on production.

I think it's a very good question, and I would say that we can make use of advanced deployment methodologies, like canary deployment. Right? And we can do canary deployment in combination with GitOps. So, if I had to implement this solution, I would do it using the GitOps methodology, with Argo CD as the tool, and then implement the canary-based deployment. In a canary-based deployment, only a certain percentage of the traffic is exposed to the new version, while the rest continues to be served by the existing version. And then, if you find that the new version is working well, you gradually increase the percentage of the traffic that is exposed to the new version. Right? This can be done with Argo CD, which has inbuilt support for canary deployments, wherein you can specify what percentage of traffic you would like to expose to the new version and what percentage would stay on the previous version. For A/B testing, I think going ahead with canary deployment makes sense, as per
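The percentage-based split described above can be illustrated with a small, deterministic router. This is a sketch of the idea only, not how any particular tool implements it: in a real cluster the split is normally done by an ingress controller, a service mesh, or a rollout controller (in the Argo ecosystem, weighted canary steps are usually associated with Argo Rollouts). The function names `bucket_for` and `choose_version` are hypothetical.

```python
import hashlib


def bucket_for(request_key: str, buckets: int = 100) -> int:
    """Map a stable key (for example a user ID) to a bucket in [0, buckets)."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % buckets


def choose_version(request_key: str, canary_percent: int) -> str:
    """Send roughly canary_percent of keys to the new model, the rest to the old one.

    Hashing the key keeps each caller on the same version between requests,
    which matters when comparing model quality in an A/B test.
    """
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return "canary" if bucket_for(request_key) < canary_percent else "stable"
```

Because a key's bucket never changes, raising `canary_percent` from 10 to 50 only moves additional callers onto the new version; nobody already on the canary is moved back.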

Can you optimize resource allocation in a Kubernetes cluster running heavy Python-based machine learning workloads without overprovisioning?

See, the very first step for machine learning models would be to analyze the requirements. Right? That means benchmarking the application to see how much CPU or GPU that particular application needs. Once we have the benchmarking data, suppose we are working with TensorFlow, and suppose it needs 1 GPU to process 1 task. Then we know that if we have to go with multi-processing, it would need at least 2 or 3 GPUs at a time. Based on that data, we can put the resource requests into the configuration for those ML model pods. That will help us allocate the resources without overprovisioning. We can also put some limits. Right? That is, the pod should not be allowed to go beyond 3 or 4 GPUs. In my opinion, benchmarking the application is very important: knowing the memory, CPU, and GPU requirements of the ML models. Then, based on that, we can configure the resource requests and the limits. For benchmarking, I think we can make use of tools like Locust, and there are other tools that can help us determine that.
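As a rough illustration of turning benchmark data into requests and limits, the sketch below derives a Kubernetes `resources` block from usage samples. The policy it uses (request at the 90th percentile, limit at the observed peak plus 20% headroom) is an assumption chosen for the example, not something stated in the answer, and the helper names are hypothetical. One detail it does rely on from Kubernetes itself: for extended resources such as `nvidia.com/gpu`, the request and the limit must be equal.

```python
import math
from typing import Dict, List


def percentile(samples: List[float], pct: int) -> float:
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_resources(
    cpu_millicores: List[float],
    memory_mib: List[float],
    gpus: int,
    headroom_percent: int = 120,
) -> Dict[str, Dict[str, str]]:
    """Build a container `resources` block from benchmark samples.

    Assumed policy: request = p90 of observed usage, limit = peak * headroom.
    """
    cpu_request = math.ceil(percentile(cpu_millicores, 90))
    mem_request = math.ceil(percentile(memory_mib, 90))
    cpu_limit = math.ceil(max(cpu_millicores) * headroom_percent / 100)
    mem_limit = math.ceil(max(memory_mib) * headroom_percent / 100)
    return {
        "requests": {
            "cpu": f"{cpu_request}m",
            "memory": f"{mem_request}Mi",
            "nvidia.com/gpu": str(gpus),
        },
        "limits": {
            "cpu": f"{cpu_limit}m",
            "memory": f"{mem_limit}Mi",
            "nvidia.com/gpu": str(gpus),  # must match the request for GPUs
        },
    }
```

The returned dictionary can be placed under `spec.containers[].resources` in a pod template. The samples themselves would come from whatever benchmark run is used, for example a load test driven by Locust.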

We can monitor the health of a Python-based machine learning application on Kubernetes and trigger alerts based on custom metrics. We can do this with Prometheus, an open-source monitoring tool. We will set up monitoring with Prometheus to collect metric data, such as CPU, memory, and other performance metrics. We can then send alerts based on Prometheus metrics. In fact, we can also configure auto-scaling with respect to Prometheus metric data, but that's a separate topic: we can configure HPA with respect to Prometheus metric data. To monitor our application, we will configure Prometheus to scrape metric data from all Python instances running on Kubernetes. We can use the Node Exporter to collect data from the nodes. Based on this metric data, we can set alerts. We can also integrate Grafana for visualization of those metrics, and we can set alerting at the Grafana level, with thresholds on panels and charts. There are also custom exporters that help us configure custom metrics. For example, if we want to know how many models are in process or in the queue, or how much CPU and memory an ML model run takes, we can use custom exporters to get this information. We can configure these exporters, export the data to Prometheus, and then visualize it on charts. I think this approach will work without any issues.

How would you leverage Kubernetes features to scale up Python-based machine learning inference workloads efficiently?

So, as I said, there are two levels of auto scaling in Kubernetes. One is pod-level: horizontal pod autoscaling or vertical pod autoscaling. And then we have cluster autoscaling. Right? But the parameters that decide when to auto scale matter. That is why, once we have implemented Prometheus and have custom metric data, we can integrate it with HPA, and we can make it auto scale on some standard metric or some custom metric. Once this is configured, the pods will scale on their own, based on the conditions we put in the scaling configuration. They will scale up to a limit, because every node can hold only a certain number of pods. Once the node threshold is reached, we need another level of auto scaling, and that is the node autoscaler, or cluster autoscaler. Now, with cluster autoscaling, we have two options. Either we can go with the open-source Cluster Autoscaler, or we can go with something more sophisticated, like Karpenter. Karpenter is a tool developed by the AWS team itself. It manages auto scaling in a very different way. Instead of working with node groups, it interacts directly with the EC2 APIs to scale. Along with auto scaling, it also helps with cost optimization, because it consolidates nodes automatically after a certain period of time, which is defined in the configuration file. So, two levels of auto scaling have to be configured: horizontal or vertical pod autoscaling, and then cluster autoscaling. With that, I think we would be able to achieve a decent level of auto scaling.
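The pod-level scaling decision mentioned above follows a simple documented rule: the Horizontal Pod Autoscaler computes the desired replica count as `ceil(currentReplicas * currentMetricValue / desiredMetricValue)` and skips scaling when the ratio is within a tolerance of 1.0 (0.1 by default). The sketch below reproduces only that core calculation; the function name and the default bounds are hypothetical, and the real controller also handles stabilization windows, missing metrics, and pods that are not yet ready.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
    tolerance: float = 0.1,
) -> int:
    """Core HPA calculation, clamped to the configured replica bounds."""
    if target_metric <= 0:
        raise ValueError("target_metric must be positive")
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to the target: leave the replica count alone.
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))
```

For example, four pods averaging 200 queued requests each against a target of 100 would be scaled to eight. When the result hits `max_replicas` or the nodes are full, the cluster-level autoscaler described in the answer takes over.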

Identify the issue in the Dockerfile that could potentially break the build, leveraging caching layers, for a Python application.

See, in this, first of all, requirement.txt is not copied to the image. Without copying it, we are running the command pip install -r requirement.txt. This itself will break the Docker build, because requirement.txt will not be present inside the image. So this is the first problem that I see with this Dockerfile. I think the best solution would be to first copy the requirement.txt file to the image. Once we copy requirement.txt to the working directory, we should run pip install -r requirement.txt. Once pip install has run and installed all the Python modules, we should copy the rest of the code and build the application. This is how the Dockerfile should be built for a Python application.
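The ordering recommended above also helps with layer caching, which can be shown with a deliberately simplified model. In the sketch below, each layer's cache key depends on its parent's key, the instruction text, and the content of any files the instruction copies. Docker's real cache rules are more involved, the Dockerfile from the interview is not available, and the base image `python:3.11-slim`, the file `app.py`, and the helper names are assumptions made for the illustration.

```python
import hashlib
from typing import Dict, List, Tuple

# (instruction text, files whose content the instruction copies into the image)
Step = Tuple[str, List[str]]

ALL_FILES = ["app.py", "requirement.txt"]

RECOMMENDED_ORDER: List[Step] = [
    ("FROM python:3.11-slim", []),
    ("WORKDIR /app", []),
    ("COPY requirement.txt .", ["requirement.txt"]),
    ("RUN pip install -r requirement.txt", []),
    ("COPY . .", ALL_FILES),
]

COPY_EVERYTHING_FIRST: List[Step] = [
    ("FROM python:3.11-slim", []),
    ("WORKDIR /app", []),
    ("COPY . .", ALL_FILES),
    ("RUN pip install -r requirement.txt", []),
]


def layer_keys(steps: List[Step], files: Dict[str, str]) -> List[str]:
    """Compute a simplified cache key per layer: parent + instruction + inputs."""
    keys: List[str] = []
    parent = ""
    for instruction, inputs in steps:
        h = hashlib.sha256()
        h.update(parent.encode("utf-8"))
        h.update(instruction.encode("utf-8"))
        for path in sorted(inputs):
            h.update(path.encode("utf-8"))
            h.update(files[path].encode("utf-8"))
        parent = h.hexdigest()
        keys.append(parent)
    return keys


def reused_layers(before: List[str], after: List[str]) -> int:
    """Count leading layers whose keys are unchanged between two builds."""
    count = 0
    for old, new in zip(before, after):
        if old != new:
            break
        count += 1
    return count
```

Under this model, a change to `app.py` alone leaves the `pip install` layer cached in the recommended order, while copying everything first invalidates it on every code change.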

Outline the process of converting stateless machine learning APIs in Python to stateful services in Kubernetes for complex processing needs.

See, for converting any stateless service to a stateful one that has to be deployed inside the Kubernetes cluster, obviously, we would have to make use of StatefulSets. StatefulSets give each pod a stable, unique identity, in order of their numbering, and attach a persistent volume to every pod. Those features of the StatefulSet help you retain the state of the pod even if it crashes. So StatefulSets have to be configured for deploying these stateless services as stateful services inside the Kubernetes environment. Apart from that, I think anything else depends on the overall use case. As the description is very limited, I am unable to think of any other solution at the moment, to be very honest. But, yes, I think StatefulSets should be utilized for this purpose. They can be extended with the use of services, or maybe ingress, to expose those Python services. But under the hood, I think StatefulSets would have to be utilized.
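A minimal sketch of what such a StatefulSet might look like, built as a plain Python dictionary so it can be dumped to JSON and applied with `kubectl`. The service name, image, mount path, and storage size are placeholders, and the helper names are hypothetical. The naming rules shown are standard Kubernetes behavior: pods are named `<statefulset>-<ordinal>`, and each pod's claim is named `<claim template>-<statefulset>-<ordinal>`.

```python
from typing import Any, Dict, List


def statefulset_manifest(
    name: str, image: str, replicas: int, storage: str, port: int = 8080
) -> Dict[str, Any]:
    """Build an apps/v1 StatefulSet with one persistent volume claim per pod."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "StatefulSet",
        "metadata": {"name": name},
        "spec": {
            "serviceName": name,  # headless Service that gives pods stable DNS names
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "containers": [
                        {
                            "name": name,
                            "image": image,
                            "ports": [{"containerPort": port}],
                            "volumeMounts": [
                                {"name": "state", "mountPath": "/var/lib/model-state"}
                            ],
                        }
                    ]
                },
            },
            "volumeClaimTemplates": [
                {
                    "metadata": {"name": "state"},
                    "spec": {
                        "accessModes": ["ReadWriteOnce"],
                        "resources": {"requests": {"storage": storage}},
                    },
                }
            ],
        },
    }


def pod_names(name: str, replicas: int) -> List[str]:
    """Stable pod identities assigned by the StatefulSet controller."""
    return [f"{name}-{i}" for i in range(replicas)]


def claim_names(name: str, replicas: int, template: str = "state") -> List[str]:
    """Persistent volume claims that follow each pod across restarts."""
    return [f"{template}-{name}-{i}" for i in range(replicas)]
```

Because a restarted pod `ml-api-1` is reattached to the claim `state-ml-api-1`, whatever the Python service wrote under the mount path survives the crash, which is the state-retention property described in the answer.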

Design a fault-tolerant, high-availability solution for a Kubernetes cluster that serves Python-based machine learning models for critical, real-time applications.

Okay, so for creating a fault-tolerant, high-availability solution for a Kubernetes cluster that serves Python-based machine learning models for critical real-time applications, I think we would have to use, first of all, GPU-based instances if the machine learning models require them. Also, the nodes should be spread across multiple availability zones. So, after selecting the right kind of nodes, we should create the Kubernetes cluster in such a way that the nodes are scattered across different availability zones, so that even if a node in one availability zone is down or is experiencing some issue, the traffic can be served from other availability zones.

Likewise, when we are deploying the pods, we should also consider pod topology spread constraints. Right? We should specify a spread constraint so that we have at least one pod of the model on each node, so that even if one pod crashes, the traffic can be served from other pods in other availability zones. Then we should adopt a decent auto-scaling mechanism to ensure that if a pod is bombarded with a lot of requests and is unable to serve the traffic, the traffic can be served from other pods.

Apart from this, I think we can also implement probes. We have different kinds of probes: a startup probe, a liveness probe, and a readiness probe. These probes can be utilized to ensure that traffic is not sent to a pod unless it is completely up and ready to receive it. So, liveness and readiness probes can help us route traffic to the pods only when they are healthy and ready to accept it. Also, we'll be able to know when pods are unhealthy, and based on that, the traffic can be routed to other healthy pods in the cluster.

So, I think this is how we can design fault tolerance and high availability. And, yes, there are a lot of other things that come into play, depending on how the entire application is designed and what other components are involved in the overall setup. But this is where we can start, and then we can brainstorm further to create an extensive solution.
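The spread constraint and the three probes mentioned above could be combined in a pod spec along the following lines. This is a sketch under assumptions: the health endpoint `/healthz`, the port, and the probe timings are placeholders, and the helper names are hypothetical. The field names and the zone label `topology.kubernetes.io/zone` are the standard Kubernetes ones.

```python
from typing import Any, Dict


def ha_pod_spec(
    app: str, image: str, port: int = 8080, health_path: str = "/healthz"
) -> Dict[str, Any]:
    """Pod spec that spreads replicas across zones and gates traffic with probes."""
    probe = {"httpGet": {"path": health_path, "port": port}}
    return {
        "topologySpreadConstraints": [
            {
                "maxSkew": 1,  # zones may differ by at most one pod
                "topologyKey": "topology.kubernetes.io/zone",
                "whenUnsatisfiable": "DoNotSchedule",
                "labelSelector": {"matchLabels": {"app": app}},
            }
        ],
        "containers": [
            {
                "name": app,
                "image": image,
                "ports": [{"containerPort": port}],
                # Gives a slow model load up to 5 s * 30 = 150 s before other probes start.
                "startupProbe": {**probe, "periodSeconds": 5, "failureThreshold": 30},
                # Restarts the container if it stops responding.
                "livenessProbe": {**probe, "periodSeconds": 10, "failureThreshold": 3},
                # Removes the pod from Service endpoints until it can serve requests.
                "readinessProbe": {**probe, "periodSeconds": 5, "failureThreshold": 2},
            }
        ],
    }


def zone_skew(pods_per_zone: Dict[str, int]) -> int:
    """Difference between the fullest and emptiest zone, as compared against maxSkew."""
    counts = list(pods_per_zone.values())
    return max(counts) - min(counts)
```

With `maxSkew` set to 1, a placement such as two pods in one zone and one in each of two others is allowed, while three pods in one zone and none in another is not.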

The role of MLflow in simplifying the management of Python machine learning model life cycles within a Kubernetes-based platform.

Spark can boost data processing for large-scale machine learning in a cloud environment like AWS. I think Spark is capable of processing large amounts of data. And as Python machine learning models have to process large amounts of data, we can integrate Spark, so that we can utilize its high processing power to process that data. Then the data can be routed to the models. I'm not very sure about this, but this is what is coming to my mind as of now.
