
Shyam Sundar

Vetted Talent
Engineering Manager - SRE with a demonstrated history in middleware and Linux administration. Leads architecture design and implements multiple enterprise platforms and solutions with a strong focus on security.
  • Role

    Principal Cloud Architect - SRE

  • Years of Experience

    15.1 years

Skillsets

  • Kubernetes - 3 Years
  • Docker - 5 Years
  • AWS Services
  • CI/CD Pipelines
  • Cloud Observability
  • GCP Services
  • Infrastructure as Code
  • Monitoring & Logging
  • Security
  • Version Control

Vetted For

9 Skills
  • Role: Senior Site Reliability Engineer (Remote, Vimeo) - AI Screening
  • Result: 68%
  • Skills assessed: Go, DevOps, Distributed web architecture, AWS, Embedded Linux, PHP, Python, Ruby, System Design
  • Score: 61/90

Professional Summary

15.1 Years
  • Jul, 2024 - Present (1 yr 10 months)

    Principal Cloud Architect - SRE

    Nebula Tech Solutions
  • Jun, 2022 - Jul, 2024 (2 yr 1 month)

    Cloud Solutions Architect

    BlueAlly InfoTech India
  • Jul, 2019 - Jun, 2022 (2 yr 11 months)

    Lead DevOps Engineer

    Think & Learn (Byju's Group)
  • Apr, 2017 - Jul, 2019 (2 yr 3 months)

    Senior DevOps Engineer

    Cognizant Technology Solutions
  • Nov, 2011 - Apr, 2017 (5 yr 5 months)

    Linux Administrator

    Tata Consultancy Services

Applications & Tools Known

  • Github
  • Jenkins
  • Packer
  • Nomad
  • Vault
  • Consul
  • Humio
  • Graylog
  • Prometheus
  • Grafana
  • Datadog
  • Confluent
  • Splunk
  • Elasticsearch
  • Vercel
  • Netlify
  • CloudFlare
  • GoDaddy
  • Terraform
  • AWS (Amazon Web Services)
  • GCP
  • Kubernetes

Work History

15.1 Years

Principal Cloud Architect - SRE

Nebula Tech Solutions
Jul, 2024 - Present (1 yr 10 months)
    Provide technology leadership for infrastructure, security, scalability, and monitoring. Hands-on experience with a wide range of AWS services related to infrastructure and security automation.

Cloud Solutions Architect

BlueAlly InfoTech India
Jun, 2022 - Jul, 2024 (2 yr 1 month)
    Implemented Terraform, Kubernetes, and Vault for secure and scalable infrastructure management. Orchestrated distributed systems using Nomad and Consul.

Lead DevOps Engineer

Think & Learn (Byju's Group)
Jul, 2019 - Jun, 2022 (2 yr 11 months)
    Managed GitHub organisation, CI/CD pipelines, and Kafka Cloud services. Led monolith-to-microservices migration for scalable applications.

Senior DevOps Engineer

Cognizant Technology Solutions
Apr, 2017 - Jul, 2019 (2 yr 3 months)
    Managed AWS infrastructure for insurance client, including VPCs, EC2 instances, alarms, databases, and Docker containers.

Linux Administrator

Tata Consultancy Services
Nov, 2011 - Apr, 2017 (5 yr 5 months)
    Performed software upgrades, managed Oracle WebLogic and IBM stack, and maintained 24/7 Linux infrastructure.

Major Projects

3 Projects

Github Terraform Integrations

BlueAlly
Feb, 2024 - Present (2 yr 3 months)

    The GitHub provider is used to interact with GitHub resources.

    The provider allows you to manage your GitHub organization's members and teams easily.

    The GitHub provider offers multiple ways to authenticate with the GitHub API.

    The setup was to automate the complete set of GitHub actions, without requiring UI access.

Payment Management System

Byjus
Jan, 2022 - Jun, 2022 (5 months)
    1. The goal of the project is to create a complete payment system with all the basic functionality.
    2. Managed Docker containers effectively for application deployment and orchestration in a cloud environment.
    3. Designed middleware solutions to facilitate seamless communication between different components within the infrastructure.

Order Management System

Byjus
Jan, 2021 - Aug, 2021 (7 months)
    1. The OMS application tracks the details of orders placed by various customers.
    2. Orchestrated CI/CD pipelines with Docker, Kubernetes and Terraform to automate deployment processes.
    3. Implemented Kafka for real-time data streaming and processing to enhance system performance.
    4. Automated infrastructure provisioning using Terraform configurations to streamline deployment workflows.

Education

  • Bachelor of Technology - Information Technology

    Rajalakshmi Engineering College (2011)

AI-interview Questions & Answers

I'm Shyam Sundar. I have around 12 years of experience in IT. I initially started my career at TCS as a Linux administrator and then moved to Cognizant after gaining considerable experience in that space, where I got the opportunity to work on the AWS cloud. Initially I got a lot of experience using many of the AWS-provided cloud services, and we were able to streamline things by the end of my tenure. Then I moved to BYJU'S in 2019, where I was the first person to join the backend team. I was deployed to an application, a B2B-facing, backend application. As the first DevOps person to join, my responsibility was not only to support and work on the technical side but also to build a strong team and design the architecture, including providing designs for the microservices architecture. So there were a lot of responsibilities, and it was essentially a combination of several roles. In the end, I got great experience in terms of technical exposure. We initially started with Elastic Beanstalk, then moved to EC2, then dockerized the applications for a microservices architecture, and later we had requirements to move them to Kubernetes. Those were the major tools I worked with on the infra side. On automation, I have experience with Terraform, the AWS Cloud Development Kit, etcetera. For CI/CD tools, I have worked on Jenkins and GitHub Actions. For monitoring, I have worked with Datadog, Prometheus, Grafana, AppDynamics, etcetera. For log management, I have experience working with Elasticsearch, Splunk, CloudWatch, etcetera. So that's the overview. Then I moved to my current organization, BlueAlly, in 2022; it has been exactly two years since I joined. Here again it is a combination of both a lead and an individual contributor role. I was the first person to join and had the responsibility to build a five-member team; currently around five people report to me directly. On the technical side, I work mostly with HashiCorp products such as Nomad, Vault, Terraform, etcetera, and I work for an insurance client based out of the US. On top of that, most of my experience there is on the security front and on the log management systems. I think that's the overview. Just to summarize, I have broad experience both leading a team and contributing individually, with a strong focus on the technical part as well. Thanks.

How would you securely manage secrets when deploying Python applications on Kubernetes? Okay. So the approach I designed was, on the AWS platform, we never keep the secrets in the repository or on the container-level services. We use a service named Secrets Manager, where we store the secrets (much like the parameter store), and the secrets get baked in at build and deployment time into the Kubernetes clusters through the container deployments. How it works is that the CI/CD tool has access to the Secrets Manager service where the secrets are stored. It also has the provision to update them, in case you want to rotate or change a password or something like that. During the runtime of the CI/CD process, it pulls the secrets and bakes them into the containers during the deployment process itself, and then the deployment goes into the Kubernetes cluster. The secrets stored in Secrets Manager are completely encrypted, and AWS uses its own encryption to keep them safe. Deploying Python applications follows a similar approach: in Python you have a secrets file which does not contain the actual secrets, only the key names, and those keys are matched against the values available in Secrets Manager. This is the system I designed specifically for Python applications running on Kubernetes.
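A minimal Python sketch of the deploy-time pattern described above, assuming boto3 access from the CI/CD runner; the secret name and region are hypothetical, and only key names would live in the repository:

```python
import json
import boto3

def fetch_secret(secret_id: str, region: str = "us-east-1") -> dict:
    """Pull a secret from AWS Secrets Manager and return it as a dict."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_id)
    return json.loads(response["SecretString"])

if __name__ == "__main__":
    # Hypothetical secret name; only the key names live in the repo,
    # the values stay in Secrets Manager.
    creds = fetch_secret("myapp/database")
    # In a real pipeline these values would be injected as Kubernetes
    # Secret manifests or container env vars during the deploy step.
    print({key: "***" for key in creds})  # never log the actual values
```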

What would be the strategy for ensuring consistent system performance while utilizing spot instances in AWS? So when we speak about spot instances, they are a great AWS feature in terms of cost and budgeting. But my advice would be to limit spot instances to the lower environments, like QA or staging; it's not advisable to run them on production systems, because we know how spot capacity actually works: spot instances may be terminated at any time as per AWS's capacity requirements. AWS does provide prior notification when an instance is going to be terminated, but the notice is very short, even though they don't reclaim the hardware very frequently. So my first strategy would be to keep spot instances in the lower environments and not use them on production systems, where they would impact reliability. On top of that, even in the lower environments, I would run only low-priority apps on these spot instances, so that the productivity of developers and QA engineers is not impacted if the instances go down for the expected reasons. Beyond that, I would try to utilize them to the maximum extent so that we get the most value out of the spot instance usage.
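As a small illustration of the termination notice mentioned above (not part of the original answer), here is a hedged Python sketch that polls the EC2 instance metadata service for the spot interruption notice so low-priority work can be drained gracefully; it assumes IMDSv1 is enabled, since IMDSv2 would additionally require a session token:

```python
import time
import urllib.error
import urllib.request

# EC2 instance metadata endpoint for spot interruption notices.
# Returns 404 until AWS schedules the instance for reclamation.
SPOT_ACTION_URL = "http://169.254.169.254/latest/meta-data/spot/instance-action"

def interruption_pending() -> bool:
    try:
        with urllib.request.urlopen(SPOT_ACTION_URL, timeout=1) as resp:
            return resp.status == 200
    except urllib.error.HTTPError:
        return False        # 404: no interruption scheduled
    except urllib.error.URLError:
        return False        # metadata service unreachable (e.g. not on EC2)

if __name__ == "__main__":
    while not interruption_pending():
        time.sleep(5)       # keep doing low-priority work
    print("Spot interruption notice received; draining work and checkpointing.")
```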

What strategy would you implement to ensure zero-downtime deployments for a distributed Python application? Okay, so we have several approaches to ensure zero-downtime deployments for a distributed Python application. It's not specific to Python; it's common to all web apps, regardless of the language used. The standard approach is to have a cluster running, such as an ECS or EKS cluster, where our Python application runs. Once code is pushed to the master branch, the pipeline builds the image, pushes it to ECR, and then brings up a new service mapped to the latest version of the ECR image. However, the older version of the service still serves requests for 10 to 15 minutes before being decommissioned. This ensures the application remains up and running, with new requests served by the new service while the old service handles pending requests. This process is essentially a rolling update, where the new service is brought online while the old service remains available for a short period. Once the new service is stable, the old service is decommissioned, ensuring zero-downtime deployments. This is a standard practice that can be implemented on a Kubernetes cluster as well, and it is the approach I would recommend to ensure zero downtime during deployments.
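A minimal Python sketch of the health gate implied by this rolling-update approach; the endpoint names are hypothetical, and in practice ECS or Kubernetes performs these readiness checks itself:

```python
import time
import urllib.request

def wait_until_healthy(url: str, attempts: int = 30, delay: float = 10.0) -> bool:
    """Poll the new service's health endpoint before retiring the old one."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass                      # not ready yet
        time.sleep(delay)
    return False

if __name__ == "__main__":
    # Hypothetical internal endpoints for the new and old task sets.
    if wait_until_healthy("http://new-service.internal/healthz"):
        print("New version healthy; draining old service for ~10-15 minutes.")
    else:
        print("New version never became healthy; keeping old service and rolling back.")
```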

How would you leverage AWS Lambda to reduce operational overhead for a Python application? Okay. So AWS Lambda, as we know, is a serverless platform where you can run your Python code. The application only comes up when it's needed, and we pay only for the time for which the code actually runs. That's the basic overview of AWS Lambda. To reduce the operational overhead of a Python web application, the first thing we can do is front it with API Gateway, and through API Gateway we have an option to map A-record names in Route 53. API Gateway then has complete control over the application: it can control which IP addresses requests come from, requests per second, and a lot of other criteria we can monitor. The organization consuming the APIs may hit them from a common IP, so we could whitelist their NAT gateway's IP in our gateway as well; that's one way to leverage API Gateway. To be very specific about reducing operational overhead: let's assume we have a job written in Python that runs every hour. When we move it to AWS Lambda, we have the option to schedule it, so we pay only for the time it actually runs. For a Python web application running on Lambda, we can also ensure the incoming API hits are only exposed during business hours and stop requests coming in during off-business hours. Moreover, it all runs in a completely serverless, AWS-managed environment. That would be my approach to leveraging AWS Lambda to reduce operational overhead for a standard web application.
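A minimal sketch of the kind of Python Lambda handler described above, behind an API Gateway proxy integration; the route and field names are illustrative:

```python
import json

def lambda_handler(event, context):
    """Minimal AWS Lambda handler behind an API Gateway proxy integration."""
    # API Gateway passes the HTTP request details in `event`.
    name = (event.get("queryStringParameters") or {}).get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"hello, {name}"}),
    }
```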

When we speak about a split-brain scenario, it is very common in large-scale databases, particularly in SQL databases. When databases are configured under a managed relational database service, split-brain scenarios can impact the overall performance of the database. To resolve this, there are several approaches. My first one is tuning the queries as per the requirement: we should not write complicated queries that run for several minutes or hours; instead, we can break them down into several subqueries and then execute them. As a database administrator, we can also identify long-running queries and get them changed, and we can create an index if needed. If a particular query is taking a huge amount of time to run, we can create the required index and watch how things improve. With this, we will be able to solve several related issues, including split-brain scenarios.

The given snippet is a public class, and the error is something related to the class definitions and how the function is being called. What we are doing is a join operation: we are trying to wait for the threads to finish before continuing, but there is still no wait between them, and we are applying two joins, t1 and t2. I think it has something to do with how the new threads are created. Maybe we can have a timeout enabled between the join operations so that we can get it fixed.
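The original snippet is not reproduced here; as an illustration only, this is a minimal Python sketch of two worker threads joined with timeouts, the fix suggested in the answer:

```python
import threading
import time

def worker(name: str, seconds: float) -> None:
    time.sleep(seconds)
    print(f"{name} finished")

if __name__ == "__main__":
    t1 = threading.Thread(target=worker, args=("t1", 1.0))
    t2 = threading.Thread(target=worker, args=("t2", 2.0))
    t1.start()
    t2.start()
    # join() blocks until each thread finishes; a timeout keeps the
    # main thread from hanging forever if a worker never returns.
    t1.join(timeout=5.0)
    t2.join(timeout=5.0)
    print("both joins returned (or timed out)")
```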

Given a function that is supposed to calculate the factorial of a number in Python, it might not return the correct results. The factorial of 0 is returned as 1; otherwise the recursive branch runs. We are actually printing the factorial of a default value of 5, and that is the problem, if I'm not wrong: we should be fetching the result for the input provided, not supplying a default value of 5. I think that is the actual problem in the given Python function.
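Since the original function is not reproduced here, this is a minimal Python sketch of a recursive factorial that takes its input from the caller instead of hard-coding a default of 5:

```python
def factorial(n: int) -> int:
    """Recursive factorial; 0! is defined as 1."""
    if n < 0:
        raise ValueError("factorial is undefined for negative numbers")
    if n == 0:
        return 1
    return n * factorial(n - 1)

if __name__ == "__main__":
    # Take the value from the caller instead of hard-coding 5.
    print(factorial(5))   # 120
```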

Can you detail a rollout strategy for a Python application in a multi-region AWS setup, ensuring high availability and fault tolerance? Okay. So for standard Python applications, rollout strategies are common practice, particularly for multi-region AWS setups aimed at high availability. To be specific, the first approach I would go with is a canary setup: the canary is deployed whenever we deploy code on the actual infrastructure, and we run several drills between the canary setup and the live system to understand whether there is a problem. That is one part of the approach. The actual solution is a multi-region AWS setup where we have EKS clusters running in one region and additional clusters running in another region. In this case we have two clusters, but both serve the same requirement. For load balancing we have a common load balancer, but we still need to manage the A record accordingly and do some routing segregation in Route 53 itself. Maybe we could use a third-party tool like Consul, which is provided by HashiCorp, but the standard approach would be to go with Route 53: when requests come to Route 53 and they are not going through the primary load balancer or cluster, it should automatically route them to the secondary setup. That is what I would prefer here. For the rollout strategy across multi-region AWS setups, I don't know whether my assumption is correct, but what I assume is that Kubernetes clusters run in one region and Lambda functions may run in a different region. In that case we obviously will not be able to handle it with a single VPC; we can have two VPCs, both interconnected over a private network with VPC peering or a transit gateway, whichever option we have. Through this we would be able to achieve the rollout strategy without downtime, with high availability, and the systems would be strong enough to offer fault tolerance as well.
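A hedged boto3 sketch of the Route 53 failover routing described above; the hosted zone ID, record name, health check ID, and load balancer DNS names are all illustrative:

```python
from typing import Optional

import boto3

route53 = boto3.client("route53")

def upsert_failover_record(identifier: str, role: str, alb_dns: str,
                           health_check_id: Optional[str] = None) -> None:
    """UPSERT a PRIMARY or SECONDARY failover record for app.example.com."""
    record = {
        "Name": "app.example.com",
        "Type": "CNAME",
        "TTL": 60,
        "SetIdentifier": identifier,
        "Failover": role,                       # "PRIMARY" or "SECONDARY"
        "ResourceRecords": [{"Value": alb_dns}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    route53.change_resource_record_sets(
        HostedZoneId="Z0000000000EXAMPLE",      # illustrative zone ID
        ChangeBatch={"Changes": [{"Action": "UPSERT", "ResourceRecordSet": record}]},
    )

# Traffic prefers the primary region while its health check passes,
# then fails over to the secondary region automatically.
upsert_failover_record("us-east-1", "PRIMARY", "alb-use1.example.amazonaws.com",
                       health_check_id="hc-primary-example")
upsert_failover_record("eu-west-1", "SECONDARY", "alb-euw1.example.amazonaws.com")
```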

What is your method of implementing a secure CI/CD pipeline for Python applications in Google Cloud? Okay, so my method is to still go with Jenkins. I would have Jenkins installed on one of the Compute Engine instances on GCP, and through Jenkins I would get connected to the private network; that establishes a private connection into the VPC where the compute engine runs. For the CI/CD pipeline, Jenkins is the major contributor here. I don't know whether the Python applications run on a serverless function, a normal on-prem installation, or a managed container service, so it depends. My assumption is that the application runs on a compute instance with the packages required to keep the Python application deployed. So there are two computes: on one we deploy the web application, and the second runs Jenkins, and both interact over the private network for CI/CD. What happens is that the Jenkins machine SSHes into the instance that has the web application deployed and then does a series of operations: a git clone or git pull to fetch the code, followed by a set of build and deployment commands, and once the packages are ready it restarts the process. Once the process is restarted, it automatically starts serving from the new code. I think this is the safest approach. To be very specific about secrets, I have already explained this in one of the previous questions: I would go with baking in the secrets, storing them in an encrypted format on the GCP cloud itself, and making them available at deployment runtime as part of the CI/CD pipeline itself. This is how I would prefer it, but I would also go with self-serving systems: I don't want ops engineers to have to initiate the build whenever it is required, even for production. We would just make sure the build is initiated as soon as the code gets merged to the master branch. That is how I would prefer the standard CI/CD pipeline for Python applications on Google Cloud Platform.
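To illustrate the "secrets stored encrypted on GCP and pulled at deploy time" part of this answer, here is a minimal Python sketch using Google Secret Manager; the project and secret IDs are hypothetical:

```python
from google.cloud import secretmanager  # pip install google-cloud-secret-manager

def access_secret(project_id: str, secret_id: str, version: str = "latest") -> str:
    """Fetch a secret value from Google Secret Manager at deploy time."""
    client = secretmanager.SecretManagerServiceClient()
    name = f"projects/{project_id}/secrets/{secret_id}/versions/{version}"
    response = client.access_secret_version(request={"name": name})
    return response.payload.data.decode("UTF-8")

if __name__ == "__main__":
    # Hypothetical project and secret IDs; the pipeline would export the
    # value as an environment variable before restarting the application.
    db_password = access_secret("my-gcp-project", "app-db-password")
    print("secret fetched (value not printed)")
```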

Okay. How would you leverage the elastic capabilities of the cloud to handle unexpected high workload on Python applications? So, when we speak about elastic capabilities, AWS has a lot of its own elastic features, which are essentially auto scaling. Auto scaling comes by default with many services, for example the Elastic Load Balancer, Elasticsearch, the Kubernetes service, and the Elastic Container Service; basically auto scaling comes with them, and we just need to enable a few features. Among the elastic capabilities, I would go with the Elastic Load Balancer, because a Network Load Balancer may not suit all applications; it is used fairly rarely, mostly for high-scale, network-heavy applications. The Elastic Load Balancer lets me easily auto scale behind the load balancer, and on top of that I have a lot of provisions to control features like requests per second, and I can assign a separate security group to control the inbound and outbound rules. Next, regarding Elasticsearch: AWS has its own Elasticsearch service, but when we speak about the tools trending in the market, I would go with Elasticsearch for log management. It can be deployed on any cloud as per our preference, and it has its own pros and cons when it comes to handling logs, running queries, and fetching results. On EKS and ECS it is pretty straightforward to set things up with Terraform; we could also use CloudFormation templates rather than Terraform, which is arguably more secure since we are not using any third-party tools, and that mitigates some risk. These are the services I have been working with for elastic capabilities to handle unexpected workload. On the auto scaling part, for both the load balancer and ECS/EKS we have a lot of scenarios: on some occasions we may need to scale vertically, on others horizontally, and sometimes we can even schedule it. For example, if we have a sale on our website tomorrow and expect high load from 10 AM to 6 PM, we can scale the system for that time frame to reduce cost. There are a few more scenarios, like an unexpected workload at an unexpected time even during business hours; for that we can set up a rule so that if the request count goes beyond a limit, the system scales itself before the new workload comes in, which makes sure the system does not go down and keeps serving requests to the end. Through these approaches we would be able to leverage the elastic cloud capabilities very easily, particularly on AWS.
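A hedged boto3 sketch of the scheduled "sale window" scaling described above, using Application Auto Scaling for an ECS service; the cluster, service, capacities, and cron times are illustrative:

```python
import boto3

autoscaling = boto3.client("application-autoscaling")

RESOURCE_ID = "service/prod-cluster/web-app"          # hypothetical ECS service
DIMENSION = "ecs:service:DesiredCount"

# Make the ECS service's desired count a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="ecs",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    MinCapacity=2,
    MaxCapacity=20,
)

# Scale out ahead of the expected 10 AM - 6 PM sale traffic.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="sale-scale-out",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule="cron(30 9 * * ? *)",                     # 09:30 UTC daily
    ScalableTargetAction={"MinCapacity": 10, "MaxCapacity": 20},
)

# Scale back in after business hours.
autoscaling.put_scheduled_action(
    ServiceNamespace="ecs",
    ScheduledActionName="sale-scale-in",
    ResourceId=RESOURCE_ID,
    ScalableDimension=DIMENSION,
    Schedule="cron(30 18 * * ? *)",                    # 18:30 UTC daily
    ScalableTargetAction={"MinCapacity": 2, "MaxCapacity": 20},
)
```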