
Abhimanyu Prajapati

Vetted Talent

Experienced DevOps Engineer with 6 years of expertise in architecting, automating, and optimizing large-scale, mission-critical deployments. Proficient in driving end-to-end DevOps processes, including advanced configuration management and CI/CD pipelines, to enhance system reliability, scalability, and performance.

  • Role

    Senior DevOps Engineer

  • Years of Experience

    6 years

Skillsets

  • Version Control
  • CI/CD
  • Kubernetes
  • Linux
  • AWS
  • Monitoring
  • Deployment
  • Google Cloud
  • MLOps
  • Terraform
  • Shell Scripting
  • Python
  • Automation
  • Security Compliance
  • Scripting
  • Logging
  • Infrastructure as Code
  • Databases
  • Container Orchestration
  • Configuration Management
  • Collaboration Tools
  • Cloud Platforms
  • Build Tools

Vetted For

15 Skills
  • Role: Senior Software Engineer, DevOps (AI Screening)
  • Result: 77%
  • Skills assessed: Infrastructure as Code, Terraform, AWS, Azure, Docker, Kubernetes, Embedded Linux, Python, AWS (SageMaker), GCP Vertex, Google Cloud, Kubeflow, ML architectures and lifecycle, Pulumi, Seldon
  • Score: 69/90

Professional Summary

6 Years
  • Feb, 2025 - Present (7 months)

    Senior DevOps Engineer

    Sanas.ai
  • Dec, 2023 - Feb, 2025 (1 yr 2 months)

    Senior DevOps Engineer

    ZoomCar
  • Jan, 2022 - Nov, 2023 (1 yr 10 months)

    Senior DevOps Engineer

    Observe.ai
  • Jul, 2021 - Dec, 2021 (5 months)

    DevOps Engineer

    Sailpoint Technologies
  • Feb, 2019 - Jul, 2021 (2 yr 5 months)

    DevOps Engineer

    TOTHENEW

Applications & Tools Known

  • Harness
  • AWS
  • SageMaker
  • Terraform
  • Jenkins
  • Okta
  • GitHub
  • Bitbucket
  • ArgoCD
  • Chef
  • Ansible
  • Docker
  • Kubernetes
  • AWS ECS
  • DynamoDB
  • Elasticsearch
  • Logstash
  • Kibana
  • Loggly
  • Grafana
  • Prometheus
  • CloudWatch
  • ELK Stack
  • Amazon EKS
  • GKE
  • New Relic
  • Fluentd
  • MySQL
  • MongoDB
  • Bash
  • Python
  • Slack
  • Jira
  • GitOps
  • Terragrunt
  • Helm
  • Kustomize
  • GCP
  • Azure

Work History

6 Years

Senior DevOps Engineer

Sanas.ai
Feb, 2025 - Present (7 months)
    Set up high-availability RKE2 cluster, migrated ML training to on-prem infrastructure, integrated scalable ML pipelines, implemented CI/CD pipelines, and enabled secure rollouts via Spinnaker.

Senior DevOps Engineer

ZoomCar
Dec, 2023 - Feb, 2025 (1 yr 2 months)
    Optimized application monitoring, reduced compute costs using AWS Graviton, migrated EKS clusters, and supported tool adoption.

Senior DevOps Engineer

Observe.ai
Jan, 2022 - Nov, 2023 (1 yr 10 months)
    Automated deployment lifecycle, achieved compliance standards, configured SSO for AWS users, and implemented autoscaling solutions in EKS.

DevOps Engineer

Sailpoint Technologies
Jul, 2021 - Dec, 2021 (5 months)
    Developed Terraform modules, managed Kubernetes infrastructure, and led CI/CD processes using Jenkins and ArgoCD.

DevOps Engineer

TOTHENEW
Feb, 2019 - Jul, 2021 (2 yr 5 months)
    Automated infrastructure provisioning, led cross-region cloud migration, implemented disaster recovery strategies, and managed cost optimization techniques.

Achievements

  • Implemented multi-container deployment for ML-based models
  • Created an automated pipeline in Harness
  • Automation of SageMaker pipeline with MLOps
  • Infrastructure hardening for compliance certificates
  • Set up SSO for AWS users
  • Implemented Signoz for APM monitoring
  • Event-based auto-scaling using KEDA in EKS
  • Graviton Instances setup for cost reduction
  • Karpenter setup for node autoscaling
  • Led a cost-saving initiative
  • Spearheaded the implementation of Signoz for APM monitoring, delivering detailed performance insights and strategic improvements, resulting in a 30% reduction in New Relic costs
  • Strategically configured Graviton Instances in AWS, achieving a 30% reduction in compute costs through decisive planning
  • Directed a cost-saving initiative that slashed operational expenses by 50%, ensuring high-quality outcomes from inception to execution
  • Led a comprehensive AWS EKS migration from version 1.21 to 1.28, orchestrating a seamless transition and enhancing cluster performance significantly, successfully eliminating AWS extended support costs while introducing enhanced new features
  • Evaluated Builder.ai, Cast.ai, and Redis Enterprise, playing a pivotal role in the selection and integration of these tools
  • Acted as a key SME for the cloud team, providing support and contributing to critical initiatives
  • Managed Helm charts and utilized Kustomize for scalable provisioning of EKS clusters, ensuring consistency and performance
  • Automated infrastructure provisioning and configuration management using Chef, streamlining operations and enhancing reliability
  • Led the migration of production infrastructure to a cross-region cloud platform, improving redundancy and efficiency, and implemented Disaster Recovery strategies to reduce Recovery Time Objective (RTO) and Recovery Point Objective (RPO), ensuring business continuity
  • Orchestrated the automation of the multi-cluster deployment lifecycle through Harness, enhancing efficiency and reliability, and cutting deployment time by 50%
  • Achieved SOC-2, PCI-DSS, and ISO compliance by implementing access controls, encryption, continuous monitoring, and anomaly detection, reducing vulnerabilities by 60%
  • Engineered the configuration of SSO for AWS users via Okta, resulting in a 70% reduction in support tickets and a 50% improvement in user management efficiency
  • Implemented KEDA and Karpenter for autoscaling in EKS, optimizing resource usage and reducing costs by 30%
  • Automated SageMaker pipelines, boosting operational efficiency by 50% and decreasing processing time by 40% using MLOps best practices
  • Spearheaded the upgrade of the AWS EKS cluster from version 1.23 to 1.28 and adopted Istio for blue-green deployments, reducing deployment time by 80%.
  • Orchestrated a migration from AWS to GCP, cutting operational costs by 30%.
  • Implemented Signoz for APM, slashing New Relic costs by 50%.
  • Transitioned to microservices architecture and Graviton processors, cutting operational expenses by 50%.
  • Led the implementation of KEDA and Karpenter, reducing operational costs by 30%.
  • Automated SageMaker pipelines, decreasing processing time by 40% and boosting operational efficiency by 50%.
  • Achieved compliance, reducing vulnerabilities by 60%.
  • Optimized SSO configuration for AWS users via Okta, reducing support tickets by 70%.

Major Projects

3 Projects

Enhanced Performance Monitoring

Dec, 2023 - Present (1 yr 9 months)
    Spearheaded the implementation of Signoz for APM monitoring, delivering detailed performance insights and strategic improvements, resulting in a 30% reduction in New Relic costs.

Cost Optimization with Graviton Instances

Dec, 2023 - Present (1 yr 9 months)
    Strategically configured Graviton Instances in AWS, achieving a 30% reduction in compute costs through decisive planning.

AWS EKS Migration

Dec, 2023 - Present (1 yr 9 months)
    Led a comprehensive AWS EKS migration from version 1.21 to 1.28, orchestrating a seamless transition and enhancing cluster performance significantly.

Education

  • BTech in Computer Science with specialisation in Cloud Computing and Virtualization Technology

    University of Petroleum and Energy Studies, Dehradun (2019)

Interests

  • Travelling
  • Cricket
  • Biking

AI Interview Questions & Answers

My name is Abhimanyu, and I work as a Senior DevOps Engineer at Zoomcar, with six years of experience. I started my journey at TO THE NEW, where I worked on a project for Nykaa, an e-commerce website. There I worked with AWS, Terraform, and the ELK stack, ran disaster recovery exercises, used Grafana with InfluxDB and Telegraf for monitoring, and handled configuration management with Chef. All of our services ran on ECS and CI/CD was done with Jenkins, so in those initial years I owned the infrastructure, CI/CD, monitoring, logging, and configuration management for that project. Later I moved to SailPoint as an infrastructure engineer, working mainly on Kubernetes and Terraform. From there I moved to Observe.ai, where I worked with Loggly and Harness and built automated multi-cluster, multi-account pipelines that promote releases from dev to QA to production with no manual changes along the way. I also worked extensively on Kubernetes scaling and implemented Karpenter and KEDA. Then I moved to Zoomcar, where I have been working on an AWS-to-GCP migration, upgraded EKS from 1.23 to 1.29, and ran cost-saving exercises. I implemented Signoz for APM metrics and distributed tracing: New Relic was not cost effective, so we rolled out Signoz, which is open source and built on OpenTelemetry, in our prod and non-prod environments, and that saved us a lot of money. Overall I have worked mostly on AWS and GCP, with a few projects on Azure using Azure DevOps and AKS, mostly around Kubernetes.

EC2-based applications always pose deployment challenges, and the best approach is a rolling update. The basic architecture is a load balancer in front of a target group, with EC2 instances running in an Auto Scaling Group that is managed through a launch template. To deploy new code, you update the launch template. A tool called Packer helps here: you run the Packer script from your Jenkins (or other CI/CD) instance, Packer runs all the build commands on a temporary EC2 instance and bakes an AMI from it. You then set that AMI on the launch template as a new version and activate it, and the Auto Scaling Group picks it up. For the rolling update itself, say three instances are running: you can configure it so that three new instances come up first, traffic is shifted to them only after they are healthy, and only then are the old three terminated, rather than replacing them one-for-one. That way, if issues or bugs appear, you can switch straight back to the previous instances because they have not been terminated yet. Rollback is always the tricky part with EC2 instances, but overall this strategy is optimal for an EC2 rolling upgrade; blue-green deployment is another option, but that was not what the question asked. To sum up: a launch template, an Auto Scaling Group, and a target group holding the target instances; you deploy by updating the launch template, new instances come up behind the target group in a rolling fashion, and the deployment completes with minimal downtime.
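
A minimal sketch of how the rolling update described above could be automated with boto3, assuming the AMI ID comes out of the Packer build; the ASG, launch template, and AMI identifiers are hypothetical placeholders:

```python
import boto3

# Hypothetical names; in practice these come from the Packer build output and your infra config.
ASG_NAME = "web-asg"
LAUNCH_TEMPLATE_ID = "lt-0123456789abcdef0"
NEW_AMI_ID = "ami-0abc1234def567890"  # AMI produced by the Packer build

ec2 = boto3.client("ec2")
autoscaling = boto3.client("autoscaling")

# Create a new launch template version pointing at the freshly baked AMI.
version = ec2.create_launch_template_version(
    LaunchTemplateId=LAUNCH_TEMPLATE_ID,
    SourceVersion="$Latest",
    LaunchTemplateData={"ImageId": NEW_AMI_ID},
)["LaunchTemplateVersion"]["VersionNumber"]

# Point the ASG at the new version, then start a rolling instance refresh that keeps
# most of the capacity healthy while old instances are replaced.
autoscaling.update_auto_scaling_group(
    AutoScalingGroupName=ASG_NAME,
    LaunchTemplate={"LaunchTemplateId": LAUNCH_TEMPLATE_ID, "Version": str(version)},
)
refresh_id = autoscaling.start_instance_refresh(
    AutoScalingGroupName=ASG_NAME,
    Strategy="Rolling",
    Preferences={"MinHealthyPercentage": 90},
)["InstanceRefreshId"]
print(f"Started instance refresh {refresh_id}")
```

An instance refresh with a high MinHealthyPercentage keeps the old instances serving until their replacements pass health checks, which mirrors the "bring the new three up before terminating the old three" approach described above.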

Terraform versus AWS CDK: AWS CDK is native to AWS, while Terraform is cloud agnostic, so you can use Terraform with different clouds, for example GCP. In terms of network provisioning, because AWS CDK is tightly coupled with the AWS environment, you can integrate directly with your VPC and reference resources such as EC2 instances that are already running in it. Terraform, on the other hand, manages a state file: if you create a VPC, that state is stored in an S3 bucket (or locally, wherever you keep it), and you reference the VPC ID, subnet IDs, and security groups from that state wherever you need them, for example when creating a load balancer or EC2 instances, so the data is fetched dynamically. Terraform also has modules, and you can reuse a module to spin up many VPCs or networks. CDK, the Cloud Development Kit, is easier if you are just starting with AWS: you can define your VPC, security groups, and the rest of your environment in familiar code. But if you want to build a large or very complex environment, Terraform helps more than CDK.
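
For comparison, a minimal CDK-in-Python sketch of the VPC provisioning mentioned above, assuming CDK v2; the stack and construct names are illustrative:

```python
import aws_cdk as cdk
from aws_cdk import aws_ec2 as ec2
from constructs import Construct


class NetworkStack(cdk.Stack):
    """Illustrative network stack: one VPC spread across two availability zones."""

    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
        # CDK generates the subnets, route tables, and gateways behind this single construct.
        self.vpc = ec2.Vpc(self, "AppVpc", max_azs=2)


app = cdk.App()
NetworkStack(app, "NetworkStack")
app.synth()
```

The equivalent Terraform setup would keep the VPC, subnet, and security group IDs in its remote state and expose them to other configurations through outputs or data sources.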

Design a workflow using Docker, Python, and AWS services to provide a consistent, repeatable environment for both development and production. There are two or three services that can do this, depending on the use case. Say we have a Python service that has to run as Docker containers on AWS. The first option is Lambda: Lambda now supports container images, so we can run the function from a Docker image. Using GitHub Actions or another CI tool such as Jenkins, we create a pipeline: the Python code lives in GitHub with a Dockerfile (for example a WSGI application), the pipeline builds the Docker image and pushes it to ECR, and Lambda runs that image directly from ECR. That works well if the workload is a script or a single task. The second option is to use GitHub Actions or Jenkins to SSH into an instance, pull the same image, start a new container, and stop the old one; the workflow is duplicated for prod and non-prod, with only the GitHub branch changing (a deployment branch and a production branch). The third option is ECS: there is a service and a task definition, so you update the task definition with the new image and update the service, and ECS rolls out the new containers in the cluster. The flow is the same for prod and non-prod: build the image, push it to ECR, update the task definition, and deploy it to the ECS cluster.
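
A hedged sketch of the Lambda-container variant of that workflow, assuming the CI job has already built and pushed the image to ECR; the function name and image URI are hypothetical:

```python
import boto3

# Hypothetical values; the image tag would normally come from the CI build (e.g. the git SHA).
FUNCTION_NAME = "py-batch-task"
IMAGE_URI = "123456789012.dkr.ecr.us-east-1.amazonaws.com/py-batch-task:abc1234"

lambda_client = boto3.client("lambda")

# Point the container-image-based Lambda at the image that was just pushed to ECR,
# then wait until the update finishes before the pipeline continues.
lambda_client.update_function_code(FunctionName=FUNCTION_NAME, ImageUri=IMAGE_URI)
lambda_client.get_waiter("function_updated").wait(FunctionName=FUNCTION_NAME)
print("Lambda now running image", IMAGE_URI)
```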

State in Terraform can be managed in multiple ways. You can manage it locally, but that is not very secure: it can be corrupted or overwritten. So we keep the state file in S3. With versioning enabled on the bucket we keep multiple versions of the same state file, so if the latest one gets corrupted we can roll back to an older version. We also add DynamoDB locking, so no two people can write to the same state file at once: if my teammate and I both try to create an EC2 instance against the same state file, the write operation is locked and only one apply can update it at a time. On top of that we can enable replication on the bucket and put policies around it, for example running Terraform only from a particular instance and allowing only that instance's role to access the S3 bucket; if instead every user runs Terraform from their own machine, the locking handles concurrent access. Across multiple environments, say prod, development, and QA, we can use workspaces: each workspace has its own state file path, and you switch to the QA workspace when working on QA and the development workspace when working on development. More segregation is also beneficial. For example, if you create the VPC, EC2 instances, load balancer, and target group in one state file, it becomes very large and there is a much bigger chance of corruption or of missing an unintended change, because somewhere something is connected to something else; say a load balancer is attached to a target group, and changing one path quietly changes another resource, which is easy to overlook. If you split the state by resource area, VPC, EC2, load balancers, the target area is smaller, a change to the VPC is clearly visible in its own plan, and mistakes are much easier to catch.
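
As a sketch, the S3 bucket and DynamoDB lock table backing this setup could be bootstrapped with boto3 as below; bucket and table names are hypothetical, and the backend "s3" block itself still lives in the Terraform configuration:

```python
import boto3

# Hypothetical names for the remote-state backend resources.
STATE_BUCKET = "acme-terraform-state"
LOCK_TABLE = "terraform-locks"

s3 = boto3.client("s3", region_name="us-east-1")
dynamodb = boto3.client("dynamodb", region_name="us-east-1")

# Versioned S3 bucket so older copies of the state can be recovered if the latest is corrupted.
# In regions other than us-east-1, create_bucket also needs a LocationConstraint.
s3.create_bucket(Bucket=STATE_BUCKET)
s3.put_bucket_versioning(
    Bucket=STATE_BUCKET,
    VersioningConfiguration={"Status": "Enabled"},
)
s3.put_public_access_block(
    Bucket=STATE_BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# DynamoDB table keyed on "LockID", which is what Terraform's S3 backend uses for state locking.
dynamodb.create_table(
    TableName=LOCK_TABLE,
    AttributeDefinitions=[{"AttributeName": "LockID", "AttributeType": "S"}],
    KeySchema=[{"AttributeName": "LockID", "KeyType": "HASH"}],
    BillingMode="PAY_PER_REQUEST",
)
```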

Implement a zero-downtime deployment strategy for Kubernetes. Take the example of a Python service, with Jenkins as the CI tool and two GitHub repositories: one for the code and one for the YAML manifests, where we keep the deployments, services, secrets, config maps, scaling configuration, and the Helm chart for the service. The first step in the Jenkins pipeline is to check out the code and build a Docker image from the Dockerfile in the repository, then push the image to ECR. The Jenkins server has access to the Kubernetes cluster, and since the service is packaged as a Helm chart, deployment is a Helm command: the pipeline checks out the chart repository and runs helm upgrade --install, passing the new image tag as a parameter so the image is updated directly from the command. On the Kubernetes side the chart creates the deployment, the pods, the service, and the ingress; the ingress controller provisions a load balancer and maps the external endpoint, say xyz.com, to the service. During a deployment the service and ingress are not touched, only the deployment's image changes, and Helm rolls it out as a rolling update. Say four pods are running: a new pod comes up, and only once it has passed its health checks (the startup and readiness probes I configure) does an old pod go down and traffic shift to the new one; the same happens for the remaining pods. So the rollout does not take much time, and the deployment has zero downtime.
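
A minimal sketch of that deploy stage as it might be scripted from Jenkins; the release name, chart path, namespace, and image tag are hypothetical:

```python
import subprocess

# Hypothetical chart/repo layout; the image tag would come from the CI build stage.
RELEASE = "python-service"
CHART_PATH = "./charts/python-service"
NAMESPACE = "production"
IMAGE_TAG = "abc1234"

# "helm upgrade --install" performs a rolling update of the Deployment; --atomic rolls the
# release back automatically if the new pods never pass their readiness/startup probes.
subprocess.run(
    [
        "helm", "upgrade", "--install", RELEASE, CHART_PATH,
        "--namespace", NAMESPACE,
        "--set", f"image.tag={IMAGE_TAG}",
        "--atomic",
        "--wait",
        "--timeout", "10m",
    ],
    check=True,
)
```

The --atomic and --wait flags keep the rollout zero-downtime even on a bad release, because Helm only considers the upgrade done once the new pods are ready, and reverts if they never get there.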

I think the subprocess.run call is the issue. There is a try/except around it that catches the failure and raises "Docker build failed" for the image:latest build. The problem is how the command is passed to subprocess.run: it needs to be a comma-separated list of arguments rather than a single string, and that syntax is what is causing the issue; apart from that, the rest of the snippet looks okay.
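
The snippet under review is not reproduced here, but a corrected call along the lines discussed above might look like this; the image tag and error message are placeholders:

```python
import subprocess

# The command is passed as a list of separate arguments (or as one string with shell=True).
try:
    subprocess.run(
        ["docker", "build", "-t", "my-image:latest", "."],
        check=True,            # raise CalledProcessError on a non-zero exit code
        capture_output=True,
        text=True,
    )
except subprocess.CalledProcessError as exc:
    # Surface the build log so the failure is debuggable, then re-raise.
    print(exc.stderr)
    raise RuntimeError("Docker build failed for my-image:latest") from exc
```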

We have an ml-model-service Deployment with three replicas, matchLabels in the selector with the corresponding label on the pod template, and in the spec a container named model with its image and a container port of 80. What crucial detail is missing? First, this is only the Deployment: a Service also needs to be defined to expose the ML model, mapping the service port (and node port if needed) to the container port. Within the Deployment itself, resource requests and limits are missing, probes are missing, and volume mounts could be added as well.
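
A sketch of adding the missing pieces with the official kubernetes Python client, assuming a Deployment labelled app: ml-model-service with a container named model on port 80; the probe path and resource figures are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

# Service to expose the ML model, selecting the Deployment's pods on port 80.
svc = client.V1Service(
    metadata=client.V1ObjectMeta(name="ml-model-service"),
    spec=client.V1ServiceSpec(
        selector={"app": "ml-model-service"},
        ports=[client.V1ServicePort(port=80, target_port=80)],
    ),
)
client.CoreV1Api().create_namespaced_service(namespace="default", body=svc)

# Patch the Deployment's container with resource requests/limits and a readiness probe.
patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "model",
                        "resources": {
                            "requests": {"cpu": "500m", "memory": "1Gi"},
                            "limits": {"cpu": "1", "memory": "2Gi"},
                        },
                        "readinessProbe": {
                            "httpGet": {"path": "/healthz", "port": 80},
                            "initialDelaySeconds": 5,
                            "periodSeconds": 10,
                        },
                    }
                ]
            }
        }
    }
}
client.AppsV1Api().patch_namespaced_deployment(
    name="ml-model-service", namespace="default", body=patch
)
```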

How would you design a system to auto-scale containerized machine learning workloads in a hybrid cloud setup? Assume the workloads run on Kubernetes. There are two kinds of autoscaling in a Kubernetes environment, node autoscaling and pod autoscaling, and we can use Karpenter for nodes and KEDA for pods. For machine learning workloads KEDA is particularly useful because it ships many scalers: you can scale on CPU and memory as the basics, but also on the number of requests hitting a particular API, on Prometheus or New Relic metrics, or on AWS metrics. For example, in AWS I can define scaling on the request count on the load balancer in front of the ML target group: if the load goes from 100 to 200 requests, the pods scale up automatically. For node autoscaling I use Karpenter, with node templates for the instance types the ML application needs, say G5 GPU instances; whenever pending pods need an extra node, Karpenter spins it up automatically. The same approach works in a hybrid setup where you manage the control plane in your own data center and the worker nodes run in AWS; managing the cluster through EKS makes things easier, but either way these two tools make the scaling straightforward.
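
A sketch of the pod-scaling half of this design: a KEDA ScaledObject scaling a hypothetical ml-inference Deployment on a Prometheus request-rate query, created through the Kubernetes Python client; the names, namespace, query, and threshold are illustrative:

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical ScaledObject: scale the "ml-inference" Deployment on request rate
# reported by Prometheus, between 2 and 20 replicas.
scaled_object = {
    "apiVersion": "keda.sh/v1alpha1",
    "kind": "ScaledObject",
    "metadata": {"name": "ml-inference-scaler", "namespace": "ml"},
    "spec": {
        "scaleTargetRef": {"name": "ml-inference"},
        "minReplicaCount": 2,
        "maxReplicaCount": 20,
        "triggers": [
            {
                "type": "prometheus",
                "metadata": {
                    "serverAddress": "http://prometheus.monitoring:9090",
                    "query": 'sum(rate(http_requests_total{app="ml-inference"}[2m]))',
                    "threshold": "100",
                },
            }
        ],
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="keda.sh",
    version="v1alpha1",
    namespace="ml",
    plural="scaledobjects",
    body=scaled_object,
)
```

Karpenter would cover the node half with a provisioner or NodePool restricted to the GPU instance families the workload needs.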

Discuss your experience with setting up distributed model inference on platforms like SageMaker or a Kubernetes-based solution. I have set up ML models on SageMaker, so I will go through that process. We start with a data source, say RDS, Redshift, or S3, and SageMaker Studio with SageMaker notebooks. Code artifacts are stored in GitHub, the environment (container images) is managed through ECR, and model artifacts are stored in S3. A SageMaker pipeline handles preprocessing of the jobs, with AWS Lambda steps and training on EMR, and an EventBridge rule triggers the pipeline once a new model artifact lands in S3. The pipeline registers the model in the SageMaker model registry, and for serving we put an API Gateway in front of a Lambda that talks to the endpoint deployed from the model registry. So end to end: S3, SageMaker Studio, code artifacts from GitHub, the environment managed in ECR, model artifacts in S3, a SageMaker pipeline for processing, the model registry, and then Lambda, the endpoints, and API Gateway on top. I have also done this on Kubernetes, where it is the same pattern as any service: a multi-pod deployment where containers handle different functions, fronted by a service.
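
A minimal boto3 sketch of the registry-to-endpoint step in that flow, the kind of thing the deployment Lambda would run; the model package group, role, endpoint name, and instance type are hypothetical:

```python
import time
import boto3

# Hypothetical names; the model package group is populated by the SageMaker pipeline.
GROUP = "churn-model"
ROLE_ARN = "arn:aws:iam::123456789012:role/sagemaker-execution"
ENDPOINT = "churn-model-endpoint"

sm = boto3.client("sagemaker")

# Pick the newest approved package from the model registry.
package_arn = sm.list_model_packages(
    ModelPackageGroupName=GROUP,
    ModelApprovalStatus="Approved",
    SortBy="CreationTime",
    SortOrder="Descending",
    MaxResults=1,
)["ModelPackageSummaryList"][0]["ModelPackageArn"]

# Create a model and endpoint config from that package, then stand up the endpoint.
model_name = f"{GROUP}-{int(time.time())}"
sm.create_model(
    ModelName=model_name,
    PrimaryContainer={"ModelPackageName": package_arn},
    ExecutionRoleArn=ROLE_ARN,
)
sm.create_endpoint_config(
    EndpointConfigName=model_name,
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": model_name,
        "InstanceType": "ml.m5.large",
        "InitialInstanceCount": 1,
    }],
)
sm.create_endpoint(EndpointName=ENDPOINT, EndpointConfigName=model_name)
```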

There are a lot of methodologies for GDPR and SOC 2. For SOC 2 and GDPR specifically, we keep a separate account to which developers get no write access, only read access; data that really matters, such as databases or S3 buckets holding customer data, is not accessible to anyone except a few administrators, for example the DBA who holds the admin credentials and manages them. Access to services is managed through IAM; in GCP it is managed through G Suite, where access is granted per email address and per team group, developers, DevOps, platform, and so on, so only the team that needs a particular resource gets access to it. For example, if developers need to check pod logs and pod metrics in the Kubernetes cluster, they get read access to pods and logs, but not to environment variables or secrets; secrets live in a secrets manager and access again goes through groups. You also need to retain logs: CloudTrail keeps only about 90 days of event history by default, so you retain at least six months of data so regular audits are possible. IAM users should have MFA enabled. On the application side, most workloads and databases should be private and kept in private subnets; only the load balancers are public, and all traffic comes through a single entry point, a load balancer or CDN. If an S3 bucket has to be public, only traffic from the CDN or load balancer, or from allow-listed third-party IPs, should be permitted. Services should use dedicated service users rather than personal users for database access, and EC2 instances handling public or personal data should never be public. Finally, integrate tools like SonarQube to check for vulnerabilities in the code, and scan your Docker images for vulnerabilities as well.
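
A small boto3 sketch of two of the controls mentioned above, flagging IAM users without MFA and enforcing S3 public-access blocks; the allow-listed bucket name is hypothetical, and a real audit job would paginate the listings:

```python
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

# Flag IAM users that do not have an MFA device attached.
for user in iam.list_users()["Users"]:
    if not iam.list_mfa_devices(UserName=user["UserName"])["MFADevices"]:
        print(f"MFA missing for IAM user: {user['UserName']}")

# Enforce "block public access" on every bucket except the ones intentionally served via a CDN.
ALLOWED_PUBLIC = {"public-assets-behind-cdn"}  # hypothetical allow-list
for bucket in s3.list_buckets()["Buckets"]:
    name = bucket["Name"]
    if name in ALLOWED_PUBLIC:
        continue
    s3.put_public_access_block(
        Bucket=name,
        PublicAccessBlockConfiguration={
            "BlockPublicAcls": True,
            "IgnorePublicAcls": True,
            "BlockPublicPolicy": True,
            "RestrictPublicBuckets": True,
        },
    )
```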