
Rishabh Khandelwal

Vetted Talent
Software engineer with 5+ years of experience, skilled in the tools and technologies used for modern agile development. Has worked as an SRE across DevOps, HPC, cloud computing, data protection, and CI/CD workflows, and is always willing to learn more.
  • Role

    Software Engineer

  • Years of Experience

    5.5 years

Skillsets

  • AWS
  • Grafana
  • Hyper-V
  • Jenkins
  • Livy
  • MLflow
  • OpenShift
  • Oracle VirtualBox
  • Prometheus
  • Rancher
  • Terraform
  • Zabbix
  • Kubeflow
  • GitLab CI/CD
  • Azure
  • Elasticsearch
  • ELK Stack
  • GCP
  • Kasten K10
  • Kibana
  • Linux
  • Logstash
  • RHEL
  • Spark
  • Ubuntu
  • Veeam
  • Docker - 5 Years
  • Kubernetes - 5 Years
  • Python - 5 Years
  • GitHub - 3 Years
  • GitLab
  • VMware vSphere
  • Windows Server
  • ArgoCD
  • AWS CloudFormation
  • Bash
  • C
  • C++
  • Git
  • GitHub Actions

Vetted For

8 Skills
  • DevOps Engineer (Remote) - AI Screening
  • Result: 44% (Score: 40/90)
  • Skills assessed: AWS Certified DevOps Engineer, Certified Kubernetes Administrator, financial applications, RabbitMQ, Terraform, AWS, GCP, Kubernetes

Professional Summary

5.5 Years
  • Jun 2021 - Present (4 yr 6 months)

    Software Engineer

    SanData System
  • Sep 2020 - Apr 2021 (7 months)

    Jr. DevOps Engineer / System Administrator

    SLK Techlabs

Applications & Tools Known

  • Git
  • Python
  • Docker
  • Kubernetes
  • AWS (Amazon Web Services)
  • Google Cloud Platform (GCP)
  • Azure
  • Azure Active Directory
  • Terraform
  • Jenkins
  • Helm
  • Spinnaker
  • Zabbix
  • Ansible
  • Veeam
  • GitHub
  • Rancher
  • OpenStack
  • Ubuntu
  • CentOS
  • Windows
  • Tomcat
  • Nginx
  • ArgoCD
  • Hyper-V
  • VMware ESXi
  • vSAN
  • ELK Stack
  • Prometheus
  • Grafana

Work History

5.5 Years

Software Engineer

SanData System
Jun 2021 - Present (4 yr 6 months)
  • Automated deployment and lifecycle management (LCM) operations of Azure Local (formerly Microsoft Azure Stack HCI) for Arc-enabled VMs, AKS clusters, SQL-MI, and AVD, using Python with REST APIs; reduced manual setup time and improved consistency (see the sketch after this list).
  • Ran build-to-build tests in HPE EZAI Essentials for continuous improvement.
  • Worked with developers on system troubleshooting and error-log analysis to test new features in MLOps tools, reducing QA effort and speeding up builds.
  • Designed a POC for a hybrid HPC cloud-bursting solution, enabling seamless migration of high-performance computing workloads from on-premises data centers to AWS and GCP while ensuring private connectivity and storage redundancy.
  • Architected and deployed scalable data storage solutions using WEKA and Scality clusters, facilitating efficient data hydration between on-premises infrastructure and AWS cloud storage.
  • Designed a POC solution for HPE ProLiant DL380 Gen10 servers in Google Anthos.
  • Provided technical support for the on-premises VMware vSphere infrastructure across production and development environments, optimizing performance and resource allocation.
  • Analyzed and tested functionality in Hitachi Virtual Storage as a Service (vSTaaS) and generated test reports based on the outcomes.
  • Provided quality assurance (QA) and functionality-testing support for the development of Hitachi Kubernetes Service (HKS), identifying and resolving bugs to improve product stability.
  • Designed and tested a comprehensive Data Protection as a Service (DPaaS) offering using Veeam and Kasten K10 for Kubernetes, ensuring robust backup and disaster recovery for critical workloads.
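
    A minimal sketch of this REST-driven pattern, assuming a pre-acquired Azure AD bearer token; the resource path, api-version, and payload fields are illustrative placeholders, not the exact ones used:

        import time

        import requests

        TOKEN = "<bearer-token>"  # Azure AD token, acquired elsewhere
        BASE = "https://management.azure.com"
        VM_ID = ("/subscriptions/<sub-id>/resourceGroups/<rg>"
                 "/providers/Microsoft.AzureStackHCI/virtualMachineInstances/<vm>")
        API = {"api-version": "2023-09-01"}  # assumed version
        HEADERS = {"Authorization": f"Bearer {TOKEN}"}

        # PUT is idempotent: rerunning converges the VM to the desired spec.
        body = {"properties": {"hardwareProfile": {"memoryMB": 8192, "processors": 4}}}
        requests.put(f"{BASE}{VM_ID}", params=API, headers=HEADERS, json=body,
                     timeout=60).raise_for_status()

        # Poll until provisioning settles instead of assuming instant success.
        while True:
            vm = requests.get(f"{BASE}{VM_ID}", params=API, headers=HEADERS,
                              timeout=60).json()
            state = vm.get("properties", {}).get("provisioningState")
            if state in ("Succeeded", "Failed"):
                break
            time.sleep(15)
        print(f"provisioning finished: {state}")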

Jr. DevOps Engineer / System Administrator

SLK Techlabs
Sep 2020 - Apr 2021 (7 months)
  • Streamlined software delivery by managing daily product deployments to QA and production servers using CI/CD principles, improving release frequency by 40%.
  • Led the migration of a monolithic Docker application to a microservices architecture on Kubernetes across on-premises and AWS infrastructure, enhancing scalability and system resilience.
  • Established a centralized server-monitoring system providing real-time metrics and alerting for over 100 servers, ensuring 99.9% uptime for the production application.
  • Deployed and configured an intrusion prevention/detection system (IPS/IDS) to monitor network traffic and generate security reports, improving the security posture of live production servers.

Achievements

  • Google Cloud Skills Boost: https://www.cloudskillsboost.google/public_profiles/d6ceb27c-6740-4d47-a965-046efe7b0804

Major Projects

5 Projects

Automated Workload Provisioning and Lifecycle Management Operations

    Created Python scripts to automate workload provisioning and lifecycle management operations via REST API for Azure Local workloads: virtual machines, AKS clusters, SQL-MI, and AVD session hosts.

Hybrid Cloud Bursting for High-Performance Computing (HPC)

    Architected a managed-services solution for bursting HPC virtual machine workloads from on-premises environments to AWS/GCP during peak demand, achieving zero downtime for critical business operations.
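
    A simplified sketch of the AWS-side bursting trigger, assuming the job queue depth is measured elsewhere; the AMI, subnet, and instance type are hypothetical placeholders:

        import boto3

        ec2 = boto3.client("ec2", region_name="us-east-1")

        def burst_to_aws(pending_jobs: int, jobs_per_node: int = 4) -> list[str]:
            """Launch just enough EC2 compute nodes to absorb pending HPC jobs."""
            nodes_needed = -(-pending_jobs // jobs_per_node)  # ceiling division
            if nodes_needed == 0:
                return []
            resp = ec2.run_instances(
                ImageId="ami-0123456789abcdef0",      # placeholder HPC node image
                InstanceType="c5.9xlarge",
                MinCount=nodes_needed,
                MaxCount=nodes_needed,
                SubnetId="subnet-0123456789abcdef0",  # private subnet (VPN/Direct Connect)
                TagSpecifications=[{"ResourceType": "instance",
                                    "Tags": [{"Key": "role", "Value": "hpc-burst"}]}],
            )
            return [i["InstanceId"] for i in resp["Instances"]]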

Data Protection as a Service (DPaaS)

    Designed architecture and POC for physical, virtual and multi-cloud infrastructure enterprise data for backups and Disaster Recovery. Tested comprehensive offering utilizing Veeam and Kasten K10 for Kubernetes, ensuring robust backup and disaster recovery for critical workloads.

Centralized Monitoring and Alerting System

    Implemented a central metrics monitoring and alerting system using Zabbix for physical and virtual Linux and Windows servers, continuously monitoring application performance and resource utilization.
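
    A small sketch of how such a system can be queried with the pyzabbix client; the server URL and credentials are placeholders:

        from pyzabbix import ZabbixAPI

        # Placeholder URL/credentials; real values would come from the environment.
        zapi = ZabbixAPI("https://zabbix.example.com")
        zapi.login("api-user", "api-password")

        # Fetch currently active problem triggers, most severe first.
        triggers = zapi.trigger.get(
            filter={"value": 1},
            sortfield="priority",
            sortorder="DESC",
            selectHosts=["host"],
            output=["description", "priority"],
        )
        for t in triggers:
            host = t["hosts"][0]["host"] if t.get("hosts") else "?"
            print(f"[severity {t['priority']}] {host}: {t['description']}")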

MLOps Pipeline for Automated Model Training

    Created an end-to-end MLOps pipeline using Jenkins and Git to automate the training, validation, and deployment of a CNN machine learning model, ensuring consistent and reproducible results.
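
    One way such a pipeline can gate deployment is a validation script that Jenkins runs as a stage; this sketch substitutes a small scikit-learn model for the real CNN job, and the 0.90 accuracy threshold is an illustrative assumption:

        import sys

        from sklearn.datasets import make_classification
        from sklearn.linear_model import LogisticRegression
        from sklearn.model_selection import train_test_split

        # Fixed seeds keep the run reproducible across builds.
        X, y = make_classification(n_samples=2000, random_state=42)
        X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=42)

        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        accuracy = model.score(X_te, y_te)
        print(f"validation accuracy: {accuracy:.3f}")

        # Jenkins marks the build failed on a non-zero exit code, so a model
        # below threshold never reaches the deployment stage.
        sys.exit(0 if accuracy >= 0.90 else 1)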

Education

  • Bachelor of Engineering, Electronics and Electrical Engineering

    M.B.M. Engineering College (2020)

Certifications

  • AWS - Amazon Web Services (Dec 2021)
  • CKAD - CNCF (Jan 2022)
  • CKA - CNCF (Sep 2021)
  • AWS Cloud Practitioner
  • Microsoft AZ-900
  • KCNA
  • Expertise in Docker
  • OpenShift Applications (DO101)
  • Aviatrix Multi-Cloud Associate
Interests

  • Badminton
  • Games
  • Watching Movies

AI-Interview Questions & Answers

    For stateful services, we use a persistent storage system and define a storage class in Kubernetes, which persists all the data and maintains consistency. The basic requirement is defining a storage class, setting up persistent volumes, and binding persistent volume claims for all the services that require persistent data.
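
    As a sketch, the claim side of that setup with the official Kubernetes Python client, assuming a pre-existing storage class (the name "fast-ssd" and the size are placeholders):

        from kubernetes import client, config

        config.load_kube_config()  # or load_incluster_config() inside a pod

        # PVC bound to an assumed pre-existing StorageClass named "fast-ssd".
        pvc = client.V1PersistentVolumeClaim(
            metadata=client.V1ObjectMeta(name="app-data"),
            spec=client.V1PersistentVolumeClaimSpec(
                access_modes=["ReadWriteOnce"],
                storage_class_name="fast-ssd",
                resources=client.V1ResourceRequirements(requests={"storage": "10Gi"}),
            ),
        )
        client.CoreV1Api().create_namespaced_persistent_volume_claim("default", pvc)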

    While migrating from ECS to GKE, we first ensure that the stateful application is always up. The consideration is that the virtual machines running in ECS cannot all be stopped at once during the migration, since that would cause sudden downtime for the application. Instead, we gradually decrease the number of compute nodes on the ECS side while increasing the number of nodes on the GKE side, so that as the application winds down in ECS, the same pods and the same application are starting up on the GKE side, and there is no downtime during the migration. A sketch of one step of that shift follows below.
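
    One step of that gradual shift, scaling GKE up before ECS down; the cluster, service, and deployment names are hypothetical, and a production version would wait for new pods to become ready between steps:

        import boto3
        from kubernetes import client, config

        ecs = boto3.client("ecs", region_name="us-east-1")
        config.load_kube_config(context="gke-target")  # assumed kubeconfig context
        apps = client.AppsV1Api()

        def shift_one_step(ecs_count: int, gke_count: int) -> None:
            """Scale GKE up before ECS down so total capacity never dips."""
            apps.patch_namespaced_deployment_scale(
                name="app", namespace="default",
                body={"spec": {"replicas": gke_count + 1}},
            )
            ecs.update_service(cluster="legacy-cluster", service="app",
                               desiredCount=ecs_count - 1)

        # Walk 10 ECS tasks down to 0 while GKE walks up from 0 to 10.
        for step in range(10):
            shift_one_step(ecs_count=10 - step, gke_count=step)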

    Container resource limits: we define the container's resource limits for memory and CPU. Using sidecar containers, we can continuously monitor the application's resource usage, which tells us what the container currently consumes. Based on that data, we can decide what resource limits to set for the container, so that we get optimal performance relative to cost, which can lead to cost savings as well.

    For message processing in a RabbitMQ queue, we can ensure that data comes in on one side only, and that while it is being consumed, the data is properly accessible to the customer. For multiple users, different degrees of parallelism can be set up across consumers so that latency is reduced, as in the consumer sketch below.
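
    A consumer sketch with pika showing per-consumer prefetch, the usual lever for this kind of parallelism; the broker host and queue name are placeholders:

        import pika

        conn = pika.BlockingConnection(pika.ConnectionParameters("rabbitmq.example.com"))
        channel = conn.channel()
        channel.queue_declare(queue="orders", durable=True)

        # Prefetch caps unacked messages per consumer, so running several copies
        # of this process spreads the queue across consumers and cuts latency.
        channel.basic_qos(prefetch_count=10)

        def handle(ch, method, properties, body):
            print(f"processing {body!r}")
            ch.basic_ack(delivery_tag=method.delivery_tag)  # ack only after success

        channel.basic_consume(queue="orders", on_message_callback=handle)
        channel.start_consuming()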

    For setting up autoscaling in GCP, we can collect metrics by monitoring CPU and RAM utilization, and implement load-balancing methods according to that utilization. CPU, RAM, and disk storage are the three basic resources from which metrics can be obtained, and the scaling can be adjusted accordingly.

    Here, the second task in the playbook, the engine installation, can be a potential point of failure, which leads to the failure of the complete playbook run.

    In this container, the memory is limited to 512 megabytes and the CPU is limited to 2 CPUs. Once the pod's utilization goes above these limits, it can lead to a potential failure of the pod, because if the container requires resources beyond these limits, it will fail. These limits are sketched below.
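
    The limits being described, expressed with the Kubernetes Python client; the image name and request values are placeholders:

        from kubernetes import client

        # Requests below the limits give the scheduler headroom; exceeding the
        # memory limit OOM-kills the container, while CPU overage is throttled.
        container = client.V1Container(
            name="app",
            image="app:latest",  # placeholder image
            resources=client.V1ResourceRequirements(
                requests={"memory": "256Mi", "cpu": "500m"},
                limits={"memory": "512Mi", "cpu": "2"},
            ),
        )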

    For inter-service communication in Istio, the first thing to consider is that all the services need to be of the ClusterIP service type; there is no need to use NodePort or any other load-balancer service type when using Istio. The second thing would be to set up NGINX with a single ingress endpoint, so that traffic enters through only one endpoint.