Vetted Talent

Sudeep Gupta

I am working as a Senior Infrastructure Engineer - Data Platform at Farfetch, where I work with technologies such as Kubernetes, Azure Cloud with Terraform, the Prometheus observability stack, ArgoCD, and more. My greatest accomplishment at Farfetch has been the development of a state-of-the-art in-house Airflow platform that rivals the managed services offered by Google and AWS. I also lead SRE operations for various aspects of the Data Platform, which includes providing L2 and L3 support, RCA investigations, and owning post-mortems as well as follow-up development. As a high-impact engineer, my contributions at Farfetch include infrastructure optimisations that have saved ~1 million USD in the last 18 months, and I am also an open-source contributor to Apache Airflow.

I have had an interesting career spanning the entire data stack - analytics, ETL, and infrastructure; I have seen it all. With extensive experience in both product and consulting setups, I have worked on some of the most critical and challenging projects, delivering immense value to my stakeholders.

  • Role

    Lead Site Reliability Engineer

  • Years of Experience

    15.25 years

  • Professional Portfolio

    View here

Skillsets

  • Scrum
  • Ranger
  • Agile
  • Avro
  • Azure
  • Databricks
  • GitLab
  • GitOps
  • Go
  • Helm
  • Presto
  • Kafka
  • ArgoCD
  • Airflow
  • CI/CD
  • DevOps
  • PagerDuty
  • REST
  • DeadMansSnitch
  • PrestoDB
  • Distributed crawlers
  • Hadoop - 6 Years
  • Python - 10 Years
  • Kubernetes - 6 Years
  • Prometheus - 4 Years
  • Terraform - 4 Years
  • Grafana - 4 Years
  • Docker - 6 Years
  • Ansible
  • Hive
  • MongoDB
  • MySQL
  • Neo4j
  • Scala
  • Spark

Vetted For

15 Skills
  • Roles & Skills
  • Results
  • Details
  • Staff Software Engineer (SRE) - AI Screening
  • 61%
  • Skills assessed: Ansible, ArgoCD, Buildkite, Chef, CircleCI, Puppet, Spinnaker, DevOps, SRE, Terraform, AWS, Docker, Jenkins, Kubernetes, System Design
  • Score: 61/100

Professional Summary

15.25 Years
  • Mar, 2024 - Present (2 yr 2 months)

    Lead Site Reliability Engineer, Cloud Infrastructure Platform

    Avalara
  • Apr, 2020 - Mar, 2024 (3 yr 11 months)

    Senior Infrastructure Engineer

    FARFETCH
  • Apr, 2020 - Apr, 2022 (2 yr)

    Infrastructure Engineer

    FARFETCH
  • Jul, 2017 - Apr, 2020 (2 yr 9 months)

    Associate

    BlackRock
  • Oct, 2016 - Jul, 2017 (9 months)

    Senior Data Engineer

    Fractal
  • Jan, 2016 - Oct, 2016 (9 months)

    Data Engineer

    Fractal
  • Jan, 2015 - Jan, 2016 (1 yr)

    Data Scientist and Technical Lead

    Stealth Mode Start-up

Applications & Tools Known

  • Python
  • Argo CD
  • Terragrunt
  • Kubernetes
  • Azure
  • Apache Airflow
  • Terraform
  • Grafana
  • Prometheus
  • Google Cloud
  • Docker
Work History

15.25 Years

Lead Site Reliability Engineer, Cloud Infrastructure Platform

Avalara
Mar, 2024 - Present (2 yr 2 months)
    Led the design and development of internal developer platforms and automation tools, overseeing technical delivery, engineering practices, and scalable solutions that improve developer productivity and operational efficiency across product and engineering teams.

  • Architected and developed a metrics-driven SRE compliance platform (Go, Kafka, GitLab, Prometheus) that replaced manual release governance with continuous policy evaluation, reducing deployment lead time by 80% while improving release stability at scale.
  • Designed and built a Go-based configuration templating engine and validation system for Kubernetes and multi-environment deployments, reducing misconfiguration incidents by 40% and improving deployment hygiene.
  • Designed AI-driven operational tooling integrating Prometheus metrics, logs, and deployment signals to accelerate root-cause analysis for services deployed on the platform.
  • Partnered with Platform, Product, and Engineering leadership to align infrastructure and reliability initiatives with organizational delivery and uptime goals.
  • Led a globally distributed team of 4 engineers and increased team velocity by 30% through Agile coaching and continuous feedback loops.

Senior Infrastructure Engineer

FARFETCH
Apr, 2020 - Mar, 2024 (3 yr 11 months)

Infrastructure Engineer

FARFETCH
Apr, 2020 - Apr, 2022 (2 yr)
    Built and scaled centralized cloud infrastructure and internal platforms supporting application, analytics, and MLOps workloads, with a focus on reliability, observability, and cost-efficient infrastructure automation for global engineering teams.

  • Architected and deployed an Airflow Platform-as-a-Service (Terraform, ArgoCD, Helm) with custom RBAC, a secrets backend, centralized logging, and observability; led the migration from Google Cloud Composer, reducing platform costs by 70% while improving operational control and reliability.
  • Designed and implemented a highly available Prometheus observability stack integrated with PagerDuty and DeadMansSnitch, achieving 99.99% platform uptime and saving $500K annually by retiring Azure Container Insights.
  • Introduced spot-instance orchestration across Kubernetes and Databricks workloads, optimized GPU and compute utilization, and reduced annual infrastructure spend by $100K+.
  • Implemented governance and cost observability frameworks (Databricks Overwatch) to provide automated insights into platform inefficiencies and resource usage patterns.

Associate

BlackRock
Jul, 2017 - Apr, 2020 (2 yr 9 months)
    Transitioned from hands-on data engineering to building and modernizing large-scale data infrastructure, orchestration platforms, and cloud-native systems supporting analytics and MLOps workloads.

  • Architected a modular Data Fabric for Equity Research to automate ingestion and storage of multi-source structured and semi-structured datasets, enabling standardized signal-generation workflows and reusable compute/analytics layers across research teams (GCP, Python, Flask, MongoDB, Ansible).
  • Led the migration of on-prem mortgage asset modeling infrastructure to GCP and Airflow (Composer), modernizing legacy batch pipelines, reducing runtime from 48 hours to 10 hours, and significantly improving research iteration cycles.
  • Designed and implemented a scalable Data Lake platform for low-latency interactive analytics, establishing governance and data organization patterns (Medallion-style layering) to prevent a data swamp and support large-scale analytical workloads (Hadoop, Spark, Presto).
  • Engineered performance-critical internal tooling (FTPSync) for distributed file system synchronization across HDFS, NFS, and object storage, reducing algorithmic complexity and lowering memory footprint by 20%.

Senior Data Engineer

Fractal
Oct, 2016 - Jul, 2017 (9 months)

Data Engineer

Fractal
Jan, 2016 - Oct, 2016 (9 months)
    Worked on large-scale Big Data and Advanced Analytics systems for strategic enterprise and public-sector clients, focusing on distributed data pipelines, performance optimization, and scalable data infrastructure for analytics-driven decision making.

  • Developed distributed financial fraud detection pipelines using Spark, Neo4j, and Python to identify fraud rings and shell entities; optimized the pipeline architecture to reduce runtime from 12+ hours to under 2 hours on large graph-based datasets.
  • Designed and implemented a Hadoop-based Data Lake and ETL frameworks (Hive, Spark, Ranger, Avro) integrating structured and semi-structured data sources, enabling scalable analytics and warehousing on self-hosted Hortonworks clusters.
  • Engineered production-grade data processing workflows for high-volume analytical workloads, improving reliability, data consistency, and execution efficiency across client environments.

Data Scientist and Technical Lead

Stealth Mode Start-up
Jan, 2015 - Jan, 2016 (1 yr)

Achievements

  • Built a self-hosted Airflow platform that rivals the managed services on AWS and GCP.
  • Saved $1M+ through infrastructure optimisations over 18 months.
  • Open-source contributions to Apache Airflow.
  • Well-versed in Data and MLOps.
  • Kaggle Rank 8617
  • Ranked in 98th percentile in GATE 2011
  • Ranked 1681 in 7th National Cyber Olympiad
  • Ranked in 93rd percentile in National IT Aptitude Test

Major Projects

4 Projects

Automating SRE Compliance

    Led the development of an event-driven platform to automate compliance, integrated with GitLab and Kafka for real-time project evaluations.

Context-Aware Template Rendering Engine

    Scalable engine in Go to provide infrastructure-aware configuration generation for deployment templates.
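A minimal sketch of what such an infrastructure-aware templating engine can look like in Go, using the standard text/template package; the DeployContext fields and the template string are illustrative assumptions, not the actual engine's schema:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// DeployContext carries infrastructure-aware values injected into templates.
// Field names are hypothetical, for illustration only.
type DeployContext struct {
	Environment string
	Region      string
	Replicas    int
}

// Render fills a deployment template with context-specific values.
func Render(tmpl string, ctx DeployContext) (string, error) {
	t, err := template.New("deploy").Parse(tmpl)
	if err != nil {
		return "", err
	}
	var buf bytes.Buffer
	if err := t.Execute(&buf, ctx); err != nil {
		return "", err
	}
	return buf.String(), nil
}

func main() {
	out, err := Render(
		"app-{{.Environment}}-{{.Region}}-r{{.Replicas}}",
		DeployContext{Environment: "prod", Region: "eu-west", Replicas: 3},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // app-prod-eu-west-r3
}
```

A real engine would add validation of the rendered output before it reaches a cluster; the sketch only shows the context-injection step.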

Airflow Platform

    Developed modularized Airflow Platform with integrated security, remote logging, automated deployment, and observability.

Low Latency/Interactive Analytics Data Lake Platform

    Designed a data lake for low-latency analytics and model computation, preventing data swamps.

Education

  • Master of Technology in Computer Science with Major in Data Engineering

    IIIT-Delhi (2014)
  • Bachelor of Technology in Computer Science

    GGSIPU (2012)

Interests

  • Watching Movies
  • Photography
  • Walking
  • Reading
  • Cooking
  • Writing

AI-interview Questions & Answers

    Hi, this is Sudeep, and I am working as a Senior Infrastructure Engineer at Farfetch, specifically in the Data Platform team. As part of that team, I am the lead engineer for the design, development, and deployment of the Airflow platform, which is where I have made my biggest contribution. We built the Airflow platform at Farfetch on Azure Cloud because no managed service was available there at the time, and our users had use cases that were not supported by AWS or GCP. So, in many ways, the Airflow platform I developed at Farfetch is better than the managed services on AWS, GCP, and even Astronomer, the original creators of Airflow. Apart from that, I work on monitoring and observability through the Prometheus stack, infrastructure optimisations, and Terraform on Azure Cloud; that summarises the gist of it. I am also chiefly responsible for SRE operations for the Data Platform: we have various components and applications on the platform for which I lead incident management and post-mortems, and take charge of the follow-up development. To summarise my value impact in monetary terms at Farfetch: in infrastructure optimisations alone, I have saved more than $1,000,000 in the last 12 to 18 months, so I would summarise my work as that of a high-impact engineer. I am curious, a fast learner, and I like working in a team: I can contribute hands-on as an individual contributor and lead a team technically as well.

    Okay, containerizing an existing Golang application for consistent development and deployment. Let's begin with the basics. Say there is a build pipeline which compiles this Golang application, and say we are doing that in Docker itself, so we would need a Golang container to begin with. We compile our application in that container and copy the compiled binary into a second container, which can be, say, an Alpine container, because once the Go binary is created it can run in almost any environment. We copy it into that deployment container, and post that, we can deploy our container. With regards to the deployment system: we can use semantic tagging for the Docker images that need to be deployed, and a continuous deployment solution like Argo CD. Argo CD has a specific component, the Image Updater, which can continuously monitor your image repository for updated Docker image tags; as soon as it detects a new version, it can get deployed. With regards to releases, I would rather put in extensive test cases and test suites, so applications are released automatically if they pass the suite. If something needs to go out gradually, we can use canary deployments, or blue/green deployments using Istio: set up a rule which lets you do A/B testing, and if it looks good, release the application manually by updating the image tag.
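The two-stage build described above can be sketched as a Dockerfile; the Go version, module path, and binary name are assumptions for illustration:

```dockerfile
# Build stage: compile the Go binary in a full Go toolchain image.
FROM golang:1.22 AS build
WORKDIR /src
COPY . .
# Static build so the binary runs on a minimal base image.
RUN CGO_ENABLED=0 GOOS=linux go build -o /out/app ./cmd/app

# Deploy stage: copy only the compiled binary into a small Alpine image.
FROM alpine:3.19
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["app"]
```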

    Okay, so this actually depends on what kind of setup you want. You can use Terraform Cloud, a managed service, which will help you deploy in a collaborative team setting. But essentially a better solution, and it is not really difficult, is Terraform with Atlantis automation. So what does Atlantis do? Essentially, it integrates with your GitHub (or your Git host) and gives you a kind of chat-ops workflow: as soon as you create your changes in the Terraform repository in GitHub and raise a PR, the Atlantis integration will plan the changes and post that plan, and the infrastructure changes in it, as a Git comment, so it is visible to everyone. Long story short, to have infrastructure for team collaboration with Terraform, you deploy Atlantis on your Kubernetes cluster, and for the Terraform backend you use cloud-based storage: on AWS you can use S3, on Azure you can use Blob Storage containers, and on GCP something similar (Cloud Storage). So: Terraform, plus your GitHub integration with Atlantis, and that's it. Integrate your Terraform code repository with Atlantis and back it with a cloud-based storage solution.
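For reference, a remote state backend of the kind mentioned above (Azure Blob Storage) is configured like this in Terraform; the resource group, account, and container names are placeholders:

```hcl
terraform {
  backend "azurerm" {
    resource_group_name  = "tf-state-rg"     # placeholder names
    storage_account_name = "tfstateaccount"
    container_name       = "tfstate"
    key                  = "platform.terraform.tfstate"
  }
}
```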

    Okay, so one of the changes that I would suggest in this Dockerfile: CGO_ENABLED=0 and GOOS=linux seem to be environment variables, which we should set in a separate instruction, because these look to be static; that should make the build process go faster, I guess, through layer caching. Apart from that, I have not actually built a Go binary myself, so this is what I can recommend off the top of my mind from looking at this.

    Okay, so a critical service deployed on Kubernetes is not self-recovering. The first thing I would do is list all components which are part of this service: there could be network services, application pods, and so on. Check the health status for all of these. Say there is a pod which is having failures: check the logs of that pod to see what the problem is. But even before going into the logs, check for very basic things: is the pod not getting scheduled onto a node? Is there disk, CPU, or memory pressure on the node the pod is being scheduled to? Because this is something that often happens. For instance, a pod belonging to this service could be crash-looping on a memory limit: sometimes resources are specified so that the memory request is, say, 4 GB and the limit is 8 GB, but maybe the pod requires 10 GB, so it works until the 8 GB is reached and then crashes because it cannot request more memory, and the cycle repeats, so it is a kind of crash loop. That is one scenario. The other could be networking problems: the pod could be connecting to some service outside of the pod, for instance trying to connect to a database, or trying to fetch a secret, and that secret may no longer be available.
    So these are the things I would start looking into. If there are Grafana dashboards, I would also check those: often the application dashboards contain a lot of information on the indicators we are monitoring, the SLIs we are looking for. That is how I would approach it, step by step. Say I check the logs and they do not point to something obvious: the next step would be to check GitHub for existing issues with that error message, because sometimes the error can also be upstream. And, of course, always check the runbooks for the critical service: a critical service is expected to have a runbook. If the runbook does not contain anything, then it is probably time to wake up the owners of the application: hey, we are experiencing some kind of error and we need to start mitigating this incident. Step one should be mitigation; the root-cause analysis comes later.
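The triage steps above map onto a few standard kubectl commands; the namespace, label, and pod name below are placeholders:

```shell
# Surface scheduling problems, crash loops, and recent events for the service's pods.
kubectl get pods -n prod -l app=critical-svc -o wide
kubectl describe pod <pod-name> -n prod     # look for OOMKilled, FailedScheduling
kubectl logs <pod-name> -n prod --previous  # logs from the last crashed container
kubectl top node                            # CPU/memory pressure on nodes
kubectl get events -n prod --sort-by=.lastTimestamp
```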

    So the latency could be due to several reasons. There could be some kind of traffic spike, or increased data coming in that needs to be processed; there could be increased CPU usage which is impacting it; or there could be increased network latency. Network latency cannot really be fixed specifically unless it correlates with increased load or something like that. A few things which can be done: first, increase the number of replicas for that specific application. The second could be to check CPU usage and increase the CPU and memory resources for this application. Third, if the application is indeed stateless, restart it in a rolling manner, so you can try to recover from some of the issues; maybe the application got itself into a trap or something. Finally, I would also check the Grafana dashboards for any indicators which are spiking, and the logs, to see if there is something being output.
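The first and third mitigations can be applied with standard kubectl commands; the deployment name is a placeholder:

```shell
# Scale out the stateless deployment, then perform a rolling restart.
kubectl scale deployment web-api --replicas=6
kubectl rollout restart deployment web-api
kubectl rollout status deployment web-api
```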

    Yeah, the instance type and name. I think there is a problem with the lifecycle block, create_before_destroy; this is something which will have an impact. Okay, hang on... yeah, exactly: create_before_destroy. Say you want to change the instance type, for instance. I think the problem is with the lifecycle block, but I am not exactly able to figure out what the problem might be. create_before_destroy is, I think, going to create a race condition of some type, which is going to impact continuous delivery, because it will get into a race condition with the machine type and the AMI tag. That is all I can guess right now: there is a problem with the lifecycle block itself, but I am not able to come up with the exact cause.
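For context, the lifecycle argument under discussion looks like this in Terraform; the resource and variables are illustrative. With create_before_destroy, Terraform creates the replacement resource before destroying the old one, so anything that must be unique during the overlap (for example, a fixed name) can collide:

```hcl
resource "aws_instance" "app" {
  ami           = var.ami_id        # illustrative variables
  instance_type = var.instance_type

  tags = {
    Name = "app-server"  # a fixed, unique name can collide while both the
  }                      # old and the replacement instance exist together

  lifecycle {
    create_before_destroy = true
  }
}
```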

    Okay, so how would I identify possible reasons for the increased latency? Some of the things I would check would be the call pattern: is the call indeed taking longer than expected? What does the service do, is it being called every 3 hours, with some kind of periodicity? First, I would investigate whether the batch size of the data has been increasing at the external service level, because if that is the case, then we probably need to increase the frequency of our API calls. Secondly, we could add some kind of queuing and caching mechanism: for instance, if the data is the same, we can cache the data at our end in some kind of middleware layer. Thirdly, we could have an async call process: rather than waiting for the external API to return the data and completely blocking the user thread, we can have an async calling process that calls the external service and returns without waiting for an answer, so the user does not see the increased waiting time. So, two or three things we can do: increase the calling frequency; change the call from blocking to non-blocking by using async call methods, which will help us reduce the latency; and, third, add caching mechanisms. Yes.
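The blocking-to-non-blocking change can be sketched in Go with a goroutine and a channel; the function names and the simulated external call are assumptions for illustration:

```go
package main

import (
	"fmt"
	"time"
)

// fetchExternal simulates a slow external API call.
func fetchExternal(id int) string {
	time.Sleep(50 * time.Millisecond) // stand-in for network latency
	return fmt.Sprintf("result-%d", id)
}

// fetchAsync fires the call in a goroutine and returns a channel immediately,
// so the caller is not blocked while the external service responds.
func fetchAsync(id int) <-chan string {
	out := make(chan string, 1)
	go func() {
		out <- fetchExternal(id)
	}()
	return out
}

func main() {
	ch := fetchAsync(42)            // returns immediately
	fmt.Println("caller not blocked") // caller continues doing other work
	fmt.Println(<-ch)               // result-42, collected when needed
}
```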

    Okay, how to benchmark and improve the performance of stateful applications deployed on Kubernetes? One of the things with persistent storage management in stateful applications is that when the pod is killed, the claim on the persistent volume is not released; since the claim is retained, when the pod gets rescheduled it can reattach that volume, but that is also why it is difficult to upscale and downscale stateful applications dynamically. So that is the challenge. One of the things we need to see is when it makes sense to shard a specific stateful application, because once you upscale or shard it, it is not inherently easy to change that sharding factor later. Always use expandable volumes, that is something I would recommend, because expandable volumes mean that the storage capacity of your volume is not a limiting factor. The most experience I have had with stateful applications is managing Prometheus, where we store a lot of metrics, and there you can shard on the time series: you can pre-allocate specific shards to only hold metrics belonging to a given period, so basically shard based on timestamp. So, try to benchmark whether the performance drag is because of more data, in which case you need an expandable volume; or because it is consistently going out of memory, since it cannot load all the data into the in-memory system; or because there is genuinely increased CPU latency. Based on that, you can shard your persistent storage accordingly, and that will help you.
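Expandable volumes, as recommended above, are enabled at the StorageClass level via allowVolumeExpansion; the provisioner shown is the Azure Disk CSI driver as one example, and the class name is a placeholder:

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: expandable-ssd            # placeholder name
provisioner: disk.csi.azure.com  # example CSI driver; varies by cloud
allowVolumeExpansion: true       # lets PVCs be resized in place
parameters:
  skuName: Premium_LRS
```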

    Deployment patterns to mitigate drift. One thing that comes in extremely handy is that Terraform is basically your infrastructure as code, and what that means is that any and all operations on your infrastructure should go through code. Now, your configuration drifts could be due to various reasons. For instance, you updated your Terraform provider and some default in your configuration changed which was not present in that provider earlier, say some kind of log enabled by default. Something as simple as that can create a potential drift when you are upgrading your Terraform providers. These are easy fixes, because they are expected when they happen. The other cause could be that somebody operated on your infrastructure through the browser portal or CLI commands: for instance, somebody added another node pool in Kubernetes from the browser or the CLI shell and did not put in the Terraform equivalent. This is exactly what needs to be prevented. To mitigate the risk, create IAM policies or IAM roles in your subscription for your cloud provider, and these roles could be admin, operator, and viewer. The admin can perform everything; it is the admin or root user. The operator can operate in the event of an incident, so that they are not stuck when they need to fire some CLI commands, or when even your Terraform is down in the case of a major catastrophe. And the viewer is somebody who can just view things and make no changes. That is how I would roll out the IAM roles. Post that, the idea should be that you run a scheduled drift-check job, essentially a periodic terraform plan.
    So this job runs periodically, and what it does is check for drift in your infrastructure based on that periodicity. It can be part of your sprint, or a daily task, to check the output of this drift-check job: it will tell you about the drift, and you can then backtrack to what the cause of that drift was and fix it as well. Because, currently, drift catching is part of our weekly rota: we all go on user support every week on a rota basis, so it is the responsibility of whoever is on call, or on support rotation, to check for and fix those drifts. And in case it points to a major underlying problem, it is raised as a sprint task: hey, there are drifts being introduced in our infrastructure because of so-and-so reason, and this is something we need to fix, or maybe some action we need to stop doing. Right? That is how I would have it.
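A scheduled drift check of the kind described can lean on terraform plan's -detailed-exitcode flag, which exits with code 2 when the plan finds changes (drift); the notification step is a placeholder comment:

```shell
#!/bin/sh
# Periodic drift check: exit code 2 from `terraform plan` means drift was found.
terraform init -input=false
terraform plan -detailed-exitcode -input=false -lock=false > plan.log 2>&1
status=$?
if [ "$status" -eq 2 ]; then
  echo "Drift detected; see plan.log"  # e.g. page the on-call rota here
elif [ "$status" -ne 0 ]; then
  echo "Plan failed; see plan.log"
fi
```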

    What factors? So, some of the factors that I would be evaluating are how the users are building those Docker containers themselves. For instance, commands such as apt-get update, or pip install without any cache option, leave metadata behind: the apt-get update lists and the pip caches get stored in the image. Secondly, if users are installing and compiling a lot of libraries, they are often not deleting the dev packages afterwards: while you are compiling something you might need a dev package, but once you have compiled it, you do not need that package any more, so you need to clean up after that step. Thirdly, evaluate whether users are using multi-stage builds to build their Docker images: if they are not, that is something that comes in really very handy to decrease the size of the image. Fourthly, it is also a nice practice to use Alpine base images, because these are really lightweight and they allow you to keep minimum components in your Docker images and deploy a very small image. So, yeah, I think that is all I have.
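The cleanup practices above look roughly like this in a Dockerfile; the package names and requirements file are illustrative:

```dockerfile
# Combine install and cleanup in one layer so the apt metadata never persists.
RUN apt-get update \
    && apt-get install -y --no-install-recommends build-essential \
    && rm -rf /var/lib/apt/lists/*

# Skip pip's download cache so wheels are not stored in the image layer.
RUN pip install --no-cache-dir -r requirements.txt
```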

    I haven't actually worked on Golang as much. My level with Golang is that I can edit an existing codebase to do some feature delivery, but actually being a Golang developer myself is something that I still need to learn.