Shyam Sundar

Vetted Talent
Engineering Manager - SRE with a demonstrated history of working in middleware and Linux administration. Leads architecture design and implements multiple enterprise platforms and solutions with a strong focus on security.
  • Role

    Principal Cloud Architect - SRE

  • Years of Experience

    15.1 years

Skillsets

  • Kubernetes - 3 Years
  • Docker - 5 Years
  • AWS Services
  • CI/CD Pipelines
  • Cloud Observability
  • GCP Services
  • Infrastructure as Code
  • Monitoring & Logging
  • Security
  • Version Control

Vetted For

9 Skills
  • Senior Site Reliability Engineer (Remote, Vimeo) - AI Screening: 68%
  • Skills assessed: Go, DevOps, Distributed web architecture, AWS, Embedded Linux, PHP, Python, Ruby, System Design
  • Score: 61/90

Professional Summary

15.1 Years
  • Jul, 2024 - Present (1 yr 2 months)

    Principal Cloud Architect - SRE

    Nebula Tech Solutions
  • Jun, 2022 - Jul, 2024 (2 yr 1 month)

    Cloud Solutions Architect

    BlueAlly InfoTech India
  • Jul, 2019 - Jun, 2022 (2 yr 11 months)

    Lead DevOps Engineer

    Think & Learn (Byju's Group)
  • Apr, 2017 - Jul, 2019 (2 yr 3 months)

    Senior DevOps Engineer

    Cognizant Technology Solutions
  • Nov, 2011 - Apr, 2017 (5 yr 5 months)

    Linux Administrator

    Tata Consultancy Services

Applications & Tools Known

  • GitHub
  • Jenkins
  • Packer
  • Nomad
  • Vault
  • Consul
  • Humio
  • Graylog
  • Prometheus
  • Grafana
  • Datadog
  • Confluent
  • Splunk
  • Elasticsearch
  • Vercel
  • Netlify
  • Cloudflare
  • GoDaddy
  • Terraform
  • AWS (Amazon Web Services)
  • GCP
  • Kubernetes

Work History

15.1 Years

Principal Cloud Architect - SRE

Nebula Tech Solutions
Jul, 2024 - Present (1 yr 2 months)
    Provides technology leadership for infrastructure, security, scalability, and monitoring. Hands-on experience with many AWS services related to infrastructure and security automation.

Cloud Solutions Architect

BlueAlly InfoTech India
Jun, 2022 - Jul, 2024 (2 yr 1 month)
    Implemented Terraform, Kubernetes, and Vault for secure and scalable infrastructure management. Orchestrated distributed systems using Nomad and Consul.

Lead DevOps Engineer

Think & Learn (Byju's Group)
Jul, 2019 - Jun, 2022 (2 yr 11 months)
    Managed GitHub organisation, CI/CD pipelines, and Kafka Cloud services. Led monolith-to-microservices migration for scalable applications.

Senior DevOps Engineer

Cognizant Technology Solutions
Apr, 2017 - Jul, 2019 (2 yr 3 months)
    Managed AWS infrastructure for insurance client, including VPCs, EC2 instances, alarms, databases, and Docker containers.

Linux Administrator

Tata Consultancy Services
Nov, 2011 - Apr, 2017 (5 yr 5 months)
    Performed software upgrades, managed Oracle WebLogic and IBM stack, and maintained 24/7 Linux infrastructure.

Major Projects

3 Projects

Github Terraform Integrations

BlueAlly
Feb, 2024 - Present (1 yr 7 months)

    The GitHub provider is used to interact with GitHub resources.

    The provider allows you to manage your GitHub organization's members and teams easily.

    The GitHub provider offers multiple ways to authenticate with the GitHub API.

    The setup was to automate GitHub administration completely through the provider, without requiring UI access.

Payment Management System

Byjus
Jan, 2022 - Jun, 2022 (5 months)
    1. The goal of the project is to create a complete payment system with all the basic functionality.
    2. Managed Docker containers effectively for application deployment and orchestration in a cloud environment.
    3. Designed middleware solutions to facilitate seamless communication between different components within the infrastructure.

Order Management System

Byjus
Jan, 2021 - Aug, 2021 (7 months)
    1. The OMS application tracks the details of orders placed by various customers.
    2. Orchestrated CI/CD pipelines with Docker, Kubernetes and Terraform to automate deployment processes.
    3. Implemented Kafka for real-time data streaming and processing to enhance system performance.
    4. Automated infrastructure provisioning using Terraform configurations to streamline deployment workflows.

Education

  • Bachelor of Technology - Information Technology

    Rajalakshmi Engineering College (2011)

AI Interview Questions & Answers

Could you please help me to understand more about your background by giving a brief introduction of yourself? Hi, I'm Shyam Sundar. I have around 12 years of experience in IT. I started my career at TCS as a Linux administrator and then, after gaining considerable experience there, moved to Cognizant, where I got the opportunity to work on the AWS cloud. Initially we used a lot of the AWS-provided cloud services, and we were only able to streamline them towards the end of my tenure. Then I moved to BYJU'S in 2019, where I was the first DevOps person to join the backend team. I was assigned to an application-monitoring role for a B2B-facing backend application. My responsibility was not only to support and work on the technical side but also to build the team, design the architecture, and provide the designs for the microservices architecture as well, so it was effectively a combination of several roles. In the end I gained great experience in terms of technical exposure: we started with Elastic Beanstalk, then moved to EC2, then dockerized the applications into a microservices architecture, and there were further requirements to move them to Kubernetes. Those were the major tools I worked with on the infra side. On the automation side, I have experience with Terraform and related configuration-management stacks. For CI/CD tools I have worked with Jenkins and GitHub Actions; for monitoring I have worked with Datadog, Prometheus, Grafana, AppDynamics, etc.; and for log management I have experience with Elasticsearch, Splunk, CloudWatch, etc. I then moved to my current organization, BlueAlly, in 2022; it has been exactly two years since I joined. Here again it is a combination of a lead role and an individual-contributor role. I was the first person to join and had the responsibility of building a five-member team; currently around five people report to me directly. On the technical side I work mostly with the HashiCorp products: Nomad, Consul, Terraform, Vault, etc. I work for a US-based insurance client, and most of my experience there is on the security front and on log management systems. To summarize, I have vast experience both in leading a team and as an individual contributor on the technical side. Thanks.

How would you securely manage secrets when deploying Python applications on Kubernetes? The approach I designed on the AWS platform was that we never keep secrets in the repository or at the container level. We use the Secrets Manager service (with values stored in the parameter store), and the secrets get baked in at build and deployment time as the containers are deployed into the Kubernetes clusters. The CI/CD tool has access to the Secrets Manager service where the secrets are stored, and it also has a provision to update them, for example when a password needs to be changed. During the CI/CD process it pulls the secrets and bakes them into the containers as part of the deployment, and the result goes into the Kubernetes cluster. The secrets stored in Secrets Manager are fully encrypted, and AWS uses its own algorithms to keep them safe. Deploying Python applications follows the same approach: in Python you have a secrets file that holds only the key names, not the actual secret values, and those keys are matched against the values held in Secrets Manager. This is the system I designed specifically for Python applications running on Kubernetes.
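
A minimal sketch of the pattern described above, using boto3 against AWS Secrets Manager; the secret name and keys are illustrative assumptions, not taken from the profile:

```python
# Hedged sketch: pull a secret from AWS Secrets Manager inside a CI/CD step and
# expose its key/value pairs as environment variables, so neither the repository
# nor the container image ever holds plain-text secret values.
import json
import os

import boto3


def fetch_secret(secret_id: str, region: str = "us-east-1") -> dict:
    """Return the decrypted key/value pairs stored under one secret."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])


if __name__ == "__main__":
    # "myapp/prod/db" is a hypothetical secret path used only for illustration.
    for key, value in fetch_secret("myapp/prod/db").items():
        os.environ[key] = value  # the deploy step would pass these on to the pod
```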

What would be the strategy for ensuring consistent system performance while utilizing spot instances in AWS? Spot instances are a great AWS feature in terms of cost and budgeting, but my advice would be to limit them to the lower environments such as QA and staging. It is not advisable to run spot instances on production systems, because of how they work: a spot instance may be terminated at any time as per AWS's requirements. They do give prior notification that an instance is going to be terminated, and in practice they do not reclaim the hardware very frequently, but the risk remains and it would impact the reliability of the production systems. So my first strategy would be to run spot instances only in the lower environments and never in production. On top of that, even in the lower environments, I would run only low-priority apps on the spot instances, so that the productivity of developers and QA engineers is not impacted if the instances go down for the reasons mentioned above. Beyond that, I would try to leverage them to the maximum extent so that we get the best results from the spot-instance usage.
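
One common way to put that advice into practice is an Auto Scaling group with a mixed instances policy, keeping a small on-demand baseline and filling the rest with spot capacity for the lower environments. A minimal, hedged boto3 sketch; the group name, launch template, subnets, and instance types are assumptions:

```python
# Hedged sketch: an Auto Scaling group for a QA/staging environment that keeps
# one on-demand instance as a baseline and serves the rest from spot capacity.
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="qa-workers",           # hypothetical name
    MinSize=1,
    MaxSize=6,
    VPCZoneIdentifier="subnet-aaa,subnet-bbb",   # hypothetical subnets
    MixedInstancesPolicy={
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": "qa-worker-template",  # hypothetical template
                "Version": "$Latest",
            },
            "Overrides": [{"InstanceType": "m5.large"}, {"InstanceType": "m5a.large"}],
        },
        "InstancesDistribution": {
            "OnDemandBaseCapacity": 1,                  # always keep one stable node
            "OnDemandPercentageAboveBaseCapacity": 0,   # everything above it on spot
            "SpotAllocationStrategy": "capacity-optimized",
        },
    },
)
```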

What strategy would you implement to ensure zero downtime deployments for a distributed Python application? We have several approaches to ensure zero-downtime deployments for a distributed Python application; in fact it is common to all web apps, whether written in Python, Node.js, Golang, or anything else. The standard approach is this: assume we have an ECS or EKS cluster, whichever is feasible for the environment. For now, assume an ECS cluster running our Python application, say a backend API. Once code is pushed to the master branch, the pipeline builds the image, pushes it to ECR, and brings up a new service version mapped to the latest ECR image. The older version of the service keeps serving from the older image and is only taken down once the new service has been stable for around 10 to 15 minutes; there is an option to configure that on the AWS console. This ensures the application stays up and running: all new requests are served by the new service, while the old service finishes the requests that were already in flight before the new service came up, stays on standby for a few minutes, and then goes away once the new service is confirmed healthy. This is a standard practice, and the same procedure can be implemented on a Kubernetes cluster as well. I would go with this approach to ensure zero downtime during deployments.
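
A minimal sketch of how that keep-the-old-tasks-serving-until-the-new-ones-are-healthy behaviour can be expressed on ECS with boto3; the cluster and service names are illustrative assumptions:

```python
# Hedged sketch: trigger an ECS rolling deployment that never drops below the
# current healthy capacity while the new task revision comes up.
import boto3

ecs = boto3.client("ecs", region_name="us-east-1")

ecs.update_service(
    cluster="backend-cluster",         # hypothetical cluster name
    service="python-api",              # hypothetical service name
    forceNewDeployment=True,           # roll tasks onto the newest image
    deploymentConfiguration={
        "minimumHealthyPercent": 100,  # keep the old tasks serving during rollout
        "maximumPercent": 200,         # allow new tasks to start alongside them
    },
)
```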

How would you leverage AWS Lambda to reduce operational overhead for a Python application? AWS Lambda, as we know, is a serverless platform where you can run your Python code; it only spins up when needed, and you pay only for the time the function is actually used. That is the basic overview. To reduce the operational overhead of a Python web application, the first thing is to front it with API Gateway, and through API Gateway map the A records and CNAMEs in Route 53. API Gateway then has complete control over the application: which IP addresses requests come from, requests per second, and many other criteria we can monitor. If the organization consuming the APIs calls them from a known IP, we can take their NAT gateway's IP and whitelist it in the gateway as well. To be very specific about reducing operational overhead: suppose we have a batch job written in Python that runs every hour. With Lambda we can simply schedule it, so we pay only for the time it actually runs. For a Python web application running on Lambda, we can also control when the APIs are exposed, for example blocking requests that come in outside business hours. Moreover, everything runs in a completely serverless, AWS-managed environment. That would be my approach to leveraging AWS Lambda to reduce operational overhead for a standard web application.
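
A minimal sketch of the kind of small Python job described above, packaged as a Lambda handler so it runs (and bills) only when invoked, for example by an hourly EventBridge schedule or an API Gateway route; the handler body is a placeholder:

```python
# Hedged sketch: a Lambda handler for an hourly Python batch job or a small API.
import json
import logging

logger = logging.getLogger()
logger.setLevel(logging.INFO)


def lambda_handler(event, context):
    """Entry point invoked by API Gateway or an EventBridge schedule rule."""
    logger.info("received event: %s", json.dumps(event))
    # ... the actual hourly housekeeping / request handling would go here ...
    return {"statusCode": 200, "body": json.dumps({"status": "ok"})}
```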

How will you resolve a split-brain scenario in a distributed database system? Split-brain scenarios are very common in large-scale databases, particularly SQL databases. When the databases are configured under a relational database service, a split-brain scenario will impact the overall performance of the database. There are several approaches to resolve this. My first one would be tuning the queries as per the requirement: not writing complicated queries that run for several minutes or hours, but breaking them down into several sub-queries and executing those. Next, and this would be the role of the database administrator, we can identify the long-running queries and get them changed. We can also create indexes if needed; if one particular query is taking a huge amount of time to run, we can have the required index created. With this we would be able to solve several other issues as well, including the split-brain scenarios.

In this Java code, determine the error and propose a solution without changing the method signatures. Looking at the public class definition and how the function is being called: we are doing a join operation, waiting for the threads to finish, but there is no wait between the two; basically we are applying two joins, t1 and t2. I think it has something to do with the new thread creations. Maybe we can enable a timeout between the join operations so that we can get it fixed.

Given this Python function that is supposed to calculate the factorial of a number n, it might not return the correct results (if n is 0 it returns 1, else it recurses with a hard-coded 5). We are computing the factorial with a default value of 5 hard-coded into it; that is the problem, if I'm not wrong. Basically, we should compute the result from the input that is provided; we are not supposed to hard-code the value 5 as the input by default. I think that is the actual problem in the given Python function.
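
A minimal sketch of the fix implied by the answer, assuming the bug is a hard-coded 5 in the recursive call:

```python
# Hedged sketch: the recursion should always use the caller-supplied n,
# never a fixed literal value.
def factorial(n: int) -> int:
    """Recursive factorial of a non-negative integer n."""
    if n == 0:
        return 1
    return n * factorial(n - 1)  # recurse on n - 1, not on a hard-coded 5


assert factorial(5) == 120
assert factorial(0) == 1
```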

Can you detail a rollout strategy for a Python application in a multi-region AWS setup, ensuring high availability and fault tolerance? For a standard Python application this is a very common practice, but to explain the standard approach: the first element is having a DR setup, and the DR setup also gets deployed whenever we deploy the code on the primary infrastructure. In addition, we run regular drills between the DR setup and the primary system to confirm the switchover actually works when there is a problem. Beyond that, the actual solution is a multi-region AWS setup: let's assume we have EKS clusters running in the Mumbai region and a few more clusters running in, say, Virginia. In this case we have two clusters, both serving the same requirement. For load balancing we would have a common entry point, but we need to manage the A records accordingly and do the segregation in Route 53 itself. We could use third-party tooling such as Consul from HashiCorp, but the standard approach would be Route 53: when requests come into Route 53 and are not getting through the primary load balancer or cluster, it should automatically route them to the secondary setup. For the rollout strategy across multiple regions, my assumption is that the Kubernetes cluster runs in one region and, say, the Lambda functions run in a different region. In that case we obviously cannot handle it with a single VPC; we can have two VPCs interconnected over a private network via VPC peering or a transit gateway, whichever option we have. Through this we can achieve the rollout strategy without downtime, with high availability, and with systems strong enough to give the utmost fault tolerance.
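
A minimal sketch of the Route 53 failover routing described above, using boto3; the hosted zone ID, domain name, endpoints, and health check ID are illustrative assumptions:

```python
# Hedged sketch: a PRIMARY/SECONDARY failover record pair so Route 53 shifts
# traffic to the second region when the primary health check fails.
import boto3

route53 = boto3.client("route53")


def upsert_failover_record(role, target, health_check_id=None):
    record = {
        "Name": "api.example.com",               # hypothetical domain
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": f"api-{role.lower()}",
        "Failover": role,                        # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": target}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0EXAMPLE",                # hypothetical hosted zone
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )


upsert_failover_record("PRIMARY", "alb-region-a.example.com", "hc-1234")
upsert_failover_record("SECONDARY", "alb-region-b.example.com")
```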

What is your method of implementing a secure CI/CD pipeline for Python applications in Google Cloud? My method would still be Jenkins. I would install Jenkins on one of the compute instances on GCP and, through Jenkins, establish a private connection into the VPC; obviously the compute instance runs in the same VPC. For the CI/CD pipeline, Jenkins is the major contributor here. Whether the Python applications run on Lambda, on normal compute or on-prem-style installations, on ECS, or on EKS depends on the setup; my assumption here is that they run on a compute instance, with the packages needed to keep the Python application deployed on that machine. So there are two compute instances: one where we deploy the web application, and the second running Jenkins, and both interact over a private network for CI/CD. What happens is that the Jenkins machine SSHes into the compute instance that has the web application deployed and then runs a series of operations: a git clone or git pull to fetch the code, then a set of build and deployment commands, and once the packages are ready it restarts the process. Once the process is restarted, it automatically starts serving from the new code. I think this is the safest approach. To be very specific about secrets, I already explained this in one of the previous questions: I would store the secrets in an encrypted format on the GCP cloud itself and bake them in during the run-time deployment as part of CI. I would also go with self-service systems: I don't want the ops engineers to have to initiate the build whenever it is required, even for production. We would make sure the build is initiated as soon as the code gets merged to the master branch. This is how I would prefer to build a standard CI/CD pipeline for Python applications on Google Cloud Platform.
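
One way to keep the secrets encrypted on GCP itself, as mentioned above, is Secret Manager. A minimal sketch a Jenkins deploy step could call; the project and secret IDs are illustrative assumptions:

```python
# Hedged sketch: read a secret from GCP Secret Manager at deploy time instead of
# keeping it in the repository or on the Jenkins machine.
from google.cloud import secretmanager


def access_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Return the decoded payload of one secret version."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("utf-8")


if __name__ == "__main__":
    db_password = access_secret("my-gcp-project", "app-db-password")  # hypothetical IDs
```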

How would you leverage elastic capabilities of the cloud to handle unexpected high workload on Python applications? When we speak about elastic capabilities, AWS has a lot of them; "elastic" here essentially means auto scaling, and auto scaling comes by default with the services, for example the Elastic Load Balancer, Elastic Kubernetes Service, and Elastic Container Service. We just need to enable a few features. Given the option to choose between the load balancer types (Network Load Balancer, Elastic/Application Load Balancer, and Classic Load Balancer), I would go with the elastic load balancer, because a Network Load Balancer does not suit all applications and is used only rarely, for high-scale network-heavy workloads. With the elastic load balancer I can easily auto scale the load balancing layer, and on top of that I have provisions to control features such as requests per second and can assign a separate security group to control the inbound and outbound rules. On the log management side, AWS has its own Elasticsearch service, but among the tools trending in the market I would go with Elasticsearch itself: it can be deployed on any cloud as per our preference and has its own pros and cons for handling logs, running queries, and fetching results. On EKS and ECS, setting things up through Terraform is pretty much straightforward; if we are only going to use AWS services, we can go with CloudFormation templates instead of Terraform, which is more secure in the sense that we are not making use of any third-party tools, and that mitigates some risk. These are the services I have been working with so far for elastic capabilities to handle unexpected workload. On the auto scaling part, for both the load balancer and ECS/EKS we have many scenarios: on some occasions we need to scale vertically, on others horizontally, and in some cases we can even schedule it. For example, if we have a sale on our website tomorrow and high load is expected from 10 AM to 6 PM, we can scale the system for that time frame alone, just to reduce cost. There are also scenarios with unexpected workload at unexpected times, even during business hours; for those we can set a rule that if the request count goes beyond a limit, the system scales itself before the new workload arrives, which makes sure the system does not go down and keeps serving requests till the end. Through these approaches we can leverage the elastic cloud capabilities very easily, particularly on the AWS cloud.
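
A minimal sketch of the two scaling scenarios mentioned above (request-driven target tracking plus a pre-scheduled sale window) for an ECS service, using boto3; the cluster, service, target-group label, thresholds, and schedule are illustrative assumptions:

```python
# Hedged sketch: target-tracking auto scaling on requests per target, plus a
# scheduled action that pre-scales the service for an announced sale window.
import boto3

aas = boto3.client("application-autoscaling", region_name="us-east-1")
resource_id = "service/web-cluster/web-api"        # hypothetical cluster/service

aas.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    MinCapacity=2,
    MaxCapacity=20,
)

aas.put_scaling_policy(
    PolicyName="requests-per-target",
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 500.0,                       # requests per target, assumed
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ALBRequestCountPerTarget",
            "ResourceLabel": "app/web-alb/1234/targetgroup/web-tg/5678",  # hypothetical
        },
    },
)

aas.put_scheduled_action(
    ServiceNamespace="ecs",
    ResourceId=resource_id,
    ScalableDimension="ecs:service:DesiredCount",
    ScheduledActionName="sale-window",
    Schedule="cron(30 4 20 8 ? 2025)",              # illustrative one-off time (UTC)
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 40},
)
```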