Vetted Talent

Sudeep Gupta

I am working as a Senior Infrastructure Engineer - Data Platform at Farfetch. I work on various technologies such as Kubernetes, Azure Cloud with Terraform, the Prometheus observability stack, ArgoCD and more. My greatest accomplishment at Farfetch is the development of a state-of-the-art in-house Airflow Platform which rivals the managed services offered by Google and AWS. I also lead SRE operations for various aspects of the Data Platform, which includes providing L2 & L3 support, RCA investigations, and owning post-mortems as well as follow-up development. As a high-impact engineer, my contributions at Farfetch have included various infrastructure optimisations which have saved ~1 million USD in the last 18 months, and I am also an open source contributor with contributions to Airflow.

I have had an interesting career spanning the entire Data stack - Analytics, ETL, and Infrastructure; I have seen it all. With extensive experience in both Product and Consulting setups, I have worked on some of the most critical and challenging projects, delivering immense value to my stakeholders.

  • Role

    Lead Site Reliability Engineer

  • Years of Experience

    15.2 years

Skillsets

  • Scrum
  • Ranger
  • Agile
  • Avro
  • Azure
  • Databricks
  • GitLab
  • GitOps
  • Go
  • Helm
  • Presto
  • Kafka
  • ArgoCD
  • Airflow
  • CI/CD
  • DevOps
  • PagerDuty
  • REST
  • Deadmanssnitch
  • Distributed crawlers
  • Hadoop - 6 Years
  • Python - 10 Years
  • Kubernetes - 6 Years
  • Prometheus - 4 Years
  • Terraform - 4 Years
  • Grafana - 4 Years
  • Docker - 6 Years
  • Ansible
  • Hive
  • MongoDB
  • MySQL
  • Neo4j
  • Scala
  • Spark

Vetted For

15 Skills
  • Staff, Software Engineer (SRE) - AI Screening
  • Skills assessed: Ansible, ArgoCD, BuildKite, Chef, CircleCI, Puppet, Spinnaker, DevOps, SRE, Terraform, AWS, Docker, Jenkins, Kubernetes, System Design
  • Score: 61/100

Professional Summary

15.2 Years
  • Mar, 2024 - Present (2 yr 3 months)

    Lead Site Reliability Engineer

    Avalara
  • Lead Site Reliability Engineer Cloud Infrastructure Platform

    Avalara
  • Apr, 2020 - Mar, 2024 (3 yr 11 months)

    Senior Infrastructure Engineer

    FARFETCH
  • Oct, 2016 - Jul, 2017 (9 months)

    Senior Data Engineer

    Fractal
  • Jul, 2017 - Apr, 2020 (2 yr 9 months)

    Associate

    BlackRock
  • Apr, 2020 - Apr, 2022 (2 yr)

    Infrastructure Engineer

    FARFETCH
  • Jan, 2016 - Oct, 2016 (9 months)

    Data Engineer

    Fractal
  • Jan, 2015 - Jan, 2016 (1 yr)

    Data Scientist and Technical Lead

    Stealth Mode Start-up

Applications & Tools Known

  • Python
  • Argo CD
  • Terragrunt
  • Kubernetes
  • Azure
  • Apache Airflow
  • Terraform
  • Grafana
  • Prometheus
  • Google Cloud
  • Docker

Work History

15.2 Years

Lead Site Reliability Engineer

Avalara
Mar, 2024 - Present (2 yr 3 months)

Lead Site Reliability Engineer Cloud Infrastructure Platform

Avalara
  • Lead the design and development of the internal developer platform and automation tools, overseeing technical delivery, engineering practices, and scalable solutions that improve developer productivity and operational efficiency across product and engineering teams.
  • Architected and developed a metrics-driven SRE compliance platform (Go, Kafka, GitLab, Prometheus) that replaced manual release governance with continuous policy evaluation, reducing deployment lead time by 80% while improving release stability at scale.
  • Designed and built a Go-based configuration templating engine and validation system for Kubernetes and multi-environment deployments, reducing misconfiguration incidents by 40% and improving deployment hygiene.
  • Designed AI-driven operational tooling integrating Prometheus metrics, logs, and deployment signals to accelerate root-cause analysis for services deployed on the platform.
  • Partnered with Platform, Product, and Engineering Leadership to align infrastructure and reliability initiatives with organizational delivery and uptime goals.
  • Lead a globally distributed team of 4 engineers, increasing team velocity by 30% through Agile coaching and continuous feedback loops.

Senior Infrastructure Engineer

FARFETCH
Apr, 2020 - Mar, 2024 (3 yr 11 months)

Infrastructure Engineer

FARFETCH
Apr, 2020 - Apr, 2022 (2 yr)
  • Built and scaled centralized cloud infrastructure and internal platforms supporting application, analytics, and MLOps workloads, with a focus on reliability, observability, and cost-efficient infrastructure automation for global engineering teams.
  • Architected and deployed an Airflow Platform-as-a-Service (Terraform, ArgoCD, Helm) with custom RBAC, secrets backend, centralized logging, and observability; led migration from Google Cloud Composer, reducing platform costs by 70% while improving operational control and reliability.
  • Designed and implemented a highly available Prometheus observability stack integrated with PagerDuty and DeadMansSnitch, achieving 99.99% platform uptime and saving $500K annually by retiring Azure Container Insights.
  • Introduced spot-instance orchestration across Kubernetes and Databricks workloads, optimized GPU and compute utilization, and reduced annual infrastructure spend by $100K+.
  • Implemented governance and cost observability frameworks (Databricks Overwatch) to provide automated insights into platform inefficiencies and resource usage patterns.

Associate

BlackRock
Jul, 2017 - Apr, 2020 (2 yr 9 months)
  • Transitioned from hands-on data engineering to building and modernizing large-scale data infrastructure, orchestration platforms, and cloud-native systems supporting analytics and MLOps workloads.
  • Architected a modular Data Fabric for Equity Research to automate ingestion and storage of multi-source structured and semi-structured datasets, enabling standardized signal generation workflows and reusable compute/analytics layers across research teams (GCP, Python, Flask, MongoDB, Ansible).
  • Led migration of on-prem mortgage asset modeling infrastructure to GCP and Airflow (Composer), modernizing legacy batch pipelines, reducing runtime from 48 hours to 10 hours, and significantly improving research iteration cycles.
  • Designed and implemented a scalable Data Lake platform for low-latency interactive analytics, establishing governance and data organization patterns (Medallion-style layering) to prevent data swamp and support large-scale analytical workloads (Hadoop, Spark, Presto).
  • Engineered performance-critical internal tooling (FTPSync) for distributed file system synchronization across HDFS, NFS, and object storage, reducing algorithmic complexity from O(n) to O(n) and lowering memory footprint by 20%.

Senior Data Engineer

Fractal
Oct, 2016 - Jul, 2017 (9 months)

Data Engineer

Fractal
Jan, 2016 - Oct, 2016 (9 months)
  • Worked on large-scale Big Data and Advanced Analytics systems for strategic enterprise and public-sector clients, focusing on distributed data pipelines, performance optimization, and scalable data infrastructure for analytics-driven decision making.
  • Developed distributed financial fraud detection pipelines using Spark, Neo4j, and Python to identify fraud rings and shell entities; optimized pipeline architecture to reduce runtime from 12+ hours to under 2 hours for large graph-based datasets.
  • Designed and implemented Hadoop-based Data Lake and ETL frameworks (Hive, Spark, Ranger, Avro) integrating structured and semi-structured data sources, enabling scalable analytics and warehousing on self-hosted Hortonworks clusters.
  • Engineered production-grade data processing workflows for high-volume analytical workloads, improving reliability, data consistency, and execution efficiency across client environments.

Data Scientist and Technical Lead

Stealth Mode Start-up
Jan, 2015 - Jan, 2016 (1 yr)

Achievements

  • Built a self-hosted Airflow Platform service which is better than managed services on AWS and GCP.
  • Saved 1+ million USD in 18 months through infrastructure optimisations.
  • Open source contributions to Airflow
  • Well versed in Data and MLOps
  • Kaggle Rank 8617
  • Ranked in 98th percentile in GATE 2011
  • Ranked 1681 in 7th National Cyber Olympiad
  • Ranked in 93rd percentile in National IT Aptitude Test

Major Projects

4 Projects

Automating SRE Compliance

    Led the development of an event-driven platform to automate compliance, integrated with GitLab and Kafka for real-time project evaluations.

Context-Aware Template Rendering Engine

    Scalable engine in Go to provide infrastructure-aware configuration generation for deployment templates.

Airflow Platform

    Developed modularized Airflow Platform with integrated security, remote logging, automated deployment, and observability.

Low Latency/Interactive Analytics Data Lake Platform

    Designed a data lake for low-latency analytics and model computation, preventing data swamps.

Education

  • Master of Technology in Computer Science with Major in Data Engineering

    IIIT-Delhi (2014)
  • Bachelor of Technology in Computer Science

    GGSIPU (2012)

Interests

  • Watching Movies
  • Photography
  • Walking
  • Reading
  • Cooking
  • Writing

AI-interview Questions & Answers

    Hi, I'm Sudeep, and I work as a senior infrastructure engineer at Farfetch, specifically in the data platform team. As part of that team, I'm a lead engineer designing, developing, and deploying the Airflow platform, which is where I've made my biggest contribution in terms of development. The Airflow platform at Farfetch is something we developed on Azure Cloud because there were no managed services available at the time, and the features our users wanted covered use cases that were not supported on AWS and GCP. So, all in all, the Airflow platform I developed at Farfetch is in a lot of ways better than the managed services on AWS, GCP, and even Astronomer, one of the main commercial backers of Airflow. Apart from that, I work a lot on monitoring and observability through the Prometheus stack, infrastructure optimizations, Terraform, and Azure Cloud. That summarizes the gist of it. I'm also chiefly responsible for SRE operations for the data platform. We have various components and applications on the platform for which I lead the postmortems, incident management, and post-mortem follow-up development. To summarize my impact in terms of money, in infrastructure optimizations alone I have saved more than $1,000,000 in the last 12 to 18 months. So I'm a high-impact engineer; that's how I would summarize my work. I'm curious, I'm a fast learner, and I like working in a team while contributing hands-on as an individual contributor and taking a technical lead role in the team.

    Containerizing an existing Golang application for consistent development and deployment. Let's begin with the basics. Let's say there is a build pipeline which compiles this Golang application in a Docker environment, so we need a Golang Docker image to begin with. We compile our application in that container and copy the compiled binary into a second container, which can be an Alpine container, because once compiled, a statically linked Go binary can run in a minimal environment. That second container is the deployment container, and after that we can deploy it. With regard to the deployment system, we can use semantic version tags for the Docker images that eventually need to be deployed, and a continuous deployment solution like Argo CD. Argo CD has a specific component, the Image Updater, which continuously monitors your image repository for updated Docker image tags, so as soon as it detects a new version, that version can get deployed. With regard to releases, I would say let's put in extensive testing with test cases and test suites. This lets you release your applications automatically if they pass the test suite. But if the release needs to go out gradually, we can use canary or blue-green deployments using Istio. We can set up a routing rule which lets you do A/B testing, and if that looks good, you can manually release the application by updating the image tag.
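
    A minimal sketch of the tag-selection step in the answer above: given the tags in an image repository, pick the newest semantic version and decide whether it should replace the deployed one. This is not Argo CD Image Updater's implementation; the function names and the plain MAJOR.MINOR.PATCH tag format are assumptions made for illustration.

```python
import re

# Assumption: release tags look like "1.10.0" or "v1.10.0"; anything else is ignored.
SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse_semver(tag):
    """Return (major, minor, patch) as ints, or None for non-release tags."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def latest_semver_tag(tags):
    """Return the highest semantic-version tag, ignoring tags like 'latest'."""
    releases = [tag for tag in tags if parse_semver(tag) is not None]
    return max(releases, key=parse_semver) if releases else None


def tag_to_deploy(deployed_tag, available_tags):
    """Return the newer tag to roll out, or None if nothing newer exists."""
    newest = latest_semver_tag(available_tags)
    if newest is None:
        return None
    current = parse_semver(deployed_tag)
    if current is not None and parse_semver(newest) <= current:
        return None
    return newest
```

    Comparing parsed integer tuples rather than raw strings matters here: as strings, "1.9.3" sorts after "1.10.0", which would pin the deployment to an old release.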

    Okay, so this depends on what kind of setup you want. You can use Terraform Cloud: it's a managed service, and it will help you deploy in a collaborative team setting. But a better solution, and one that is not really difficult, is to use Terraform with Atlantis automation. What Atlantis does is integrate with GitHub, or your Git hosting in general, and give you a kind of chatbot operation. As soon as you raise a PR with a change in the Terraform repository, the Atlantis integration plans the changes and posts that plan, with the infrastructure changes in it, as a comment on the PR so that it is visible to everyone. So, long story short, for team collaboration with Terraform you can deploy Atlantis on your Kubernetes cluster and host your Terraform backend on whatever cloud-based storage you have. For AWS, you can use S3. For Azure, you can use an Azure Blob Storage container. And for GCP, again, something similar. So that's your Terraform, plus your GitHub integration, plus Atlantis. That's how you would do it: integrate your Terraform code repository with Atlantis and keep the state in a cloud-based storage backend.

    Okay, so one of the changes I would suggest in this Dockerfile concerns CGO_ENABLED=0 and GOOS=linux. These seem to be environment variables, which we should set in a separate instruction rather than inline in the build command, so that the build goes faster, I guess, because these are static values. Apart from that, I have not actually built a Go binary myself, so this is what I can recommend off the top of my head. But, yeah, that's what I see looking at this.

    So a critical service deployed on Kubernetes is not self-recovering. The first thing I would do, whatever this app is, is list all the components that are part of the service. For instance, there could be network services, application pods, and so on. Check the health status of all of them. Let's say there is a pod that is failing. Check the pod's logs to see what the problem is. But even before going into the logs, check for very basic things: Is the pod not getting scheduled onto a node? Is there disk, CPU, or memory pressure on the node the pod is deployed on? This happens often. For instance, the pod belonging to the service could be crash looping on a memory issue. Say the memory request is 4 GB and the limit is 8 GB, but the pod actually needs 10 GB. It will work until it reaches 8 GB and then crash, because it cannot get more memory than its limit, and it will keep restarting and crashing in a loop with no end. That is one scenario. Another is network problems. The pod could be connecting to something outside itself, for example a database, or trying to fetch a secret that is no longer available. These are the things I would start looking into. If there are Grafana dashboards, I would check those too; application dashboards often contain a lot of information about the indicators we monitor, the SLIs. That is how I would approach it step by step. Now let's say I find something in the logs. If it does not point to something obvious, the next step is to search GitHub for existing issues with that error message, because sometimes the error is upstream. And, of course, always check the runbook for the critical service; a critical service is expected to have one. If the runbook does not contain anything, then it's probably time to wake up the owners of the application and say, hey, we are experiencing some kind of error and we need to start mitigating this incident. Step one should be mitigation; the root cause analysis comes later.
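
    The "basic checks before the logs" can be sketched as a small triage function over a pod's status. This assumes the pod is available as a dict in the shape returned by `kubectl get pod <name> -o json`; the function name `triage_pod` and the wording of the returned hints are hypothetical.

```python
def triage_pod(pod):
    """Return a short diagnosis for one pod dict (kubectl -o json shape)."""
    status = pod.get("status", {})

    # 1. Was the pod scheduled at all? An unschedulable pod has the
    #    PodScheduled condition set to "False".
    for cond in status.get("conditions", []):
        if cond.get("type") == "PodScheduled" and cond.get("status") == "False":
            detail = cond.get("message") or cond.get("reason") or "unknown reason"
            return f"unschedulable: {detail}"

    # 2. Did a container get killed for exceeding its memory limit,
    #    or is it stuck in a restart loop?
    for cs in status.get("containerStatuses", []):
        name = cs.get("name", "?")
        last = cs.get("lastState", {}).get("terminated", {})
        if last.get("reason") == "OOMKilled":
            return f"{name}: OOMKilled, compare the memory limit with real usage"
        waiting = cs.get("state", {}).get("waiting", {})
        if waiting.get("reason") == "CrashLoopBackOff":
            return f"{name}: crash looping, read the previous container's logs"

    # 3. Nothing obvious at pod level: move on to dependencies and dashboards.
    return "no pod-level fault found, check dependencies, secrets, and dashboards"
```

    The order mirrors the answer: scheduling first, then memory kills and crash loops, and only then the wider investigation of network dependencies and dashboards.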

    So the latency could be due to several reasons. There could be a traffic spike or more incoming data that needs to be processed. There could be increased CPU usage, or there could be increased network latency. Network latency cannot really be fixed on its own unless it correlates with increased load or something similar. So a few things can be done. One is to increase the number of replicas for that specific application. Another is to check the CPU and memory usage against the resources allocated to the application. A third is to restart the application in a rolling manner, to try to recover from some of the issues; maybe the application ended up in a bad state. Finally, I would also check the Grafana dashboards for any indicators that are spiking, and the logs to see if something is being output.

    Yeah, my instance type name. So I think there's a problem with the lifecycle block, the create_before_destroy. Okay, hang on. Yeah, exactly, create_before_destroy. This will have an impact when, let's say, you want to change the instance type. No, I think the problem is with the lifecycle block, but I'm not able to figure out exactly what it might be. The create_before_destroy is going to create a race condition of some type, which is going to impact continuous delivery, because it will race with the machine type, the AMI, and the AMI tag. That's all I can guess right now. There's a problem with the lifecycle block itself, but I'm not able to come up with the exact cause.

    So how would you identify possible reasons for the increased latency? First, check the call pattern: is the call indeed taking longer than expected? What does the service do? Is it being called every 3 hours, say, with some kind of periodicity? Then investigate whether the batch size of the data has been growing at the external service, because if that's the case, we probably need to increase the frequency of our API calls. Secondly, you can move to an async call process. Rather than waiting for the external API to return the data and completely blocking the user thread, you make the call asynchronously so that it returns without waiting for the answer, and the next user doesn't see increased waiting time. So these are two things you can do: increase the calling frequency, and change the call from blocking to non-blocking by using async call methods. That will help reduce the latency. A third option is some kind of queuing and caching mechanism: if the data is the same, you can cache it at your end, in some kind of middleware layer.
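
    A minimal sketch of the non-blocking call plus caching layer described above, using Python's asyncio. `fetch_external` is a hypothetical stand-in for the slow external API, and `CachedClient` is an assumed name for the middleware layer; a real version would also need cache expiry and error handling.

```python
import asyncio


async def fetch_external(key):
    """Hypothetical stand-in for the slow external API call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return f"payload-for-{key}"


class CachedClient:
    """Middleware layer: one in-flight or completed fetch per key."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._tasks = {}  # key -> asyncio.Task holding the (future) result
        self.calls = 0    # number of real external calls made

    async def get(self, key):
        task = self._tasks.get(key)
        if task is None:
            # First request for this key: start the call without blocking
            # other requests, and remember the task so repeats can share it.
            self.calls += 1
            task = asyncio.create_task(self._fetch(key))
            self._tasks[key] = task
        # Note: a failed fetch stays cached here; evict on error in real code.
        return await task


async def handle_requests(client, keys):
    """Serve several user requests concurrently instead of one by one."""
    return await asyncio.gather(*(client.get(key) for key in keys))
```

    Caching the task rather than the finished value means two users asking for the same key at the same moment share one external call instead of issuing two.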

    How to benchmark and improve the performance of stateful applications deployed on Kubernetes? One thing about persistent storage for stateful applications is that when the pod is killed, the claim on its persistent volume is not released. The idea is that, since the claim is kept, the pod can get the same volume back when it is recreated or scaled back up. This is also why it's difficult to upscale and downscale stateful applications dynamically. That is the challenge. One thing we need to see is whether it makes sense to shard a specific stateful application, because once you upscale or shard it, it's not easy to change the sharding factor. I would always recommend using expandable volumes, so that the storage size of your volume is not a limiting factor. Secondly, we need to figure out the cause of the performance drag. Take managing Prometheus, where we store a lot of metrics. You can shard on the time series, so specific shards only hold metrics belonging to a given period; basically, shard based on timestamp. We should figure out whether the drag is because of more data, in which case we need an expandable volume; or because the application keeps running out of memory, since it cannot load all the data in memory; or because there is genuinely increased CPU load or latency. Based on this, we can shard our persistent storage, and that will help.
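
    The "shard based on timestamp" idea can be sketched as a function that maps each sample's timestamp to a fixed time window. This only illustrates the routing logic; it is not how Prometheus stores data internally, and the shard naming scheme and `shard_hours` parameter are assumptions.

```python
from datetime import datetime, timezone


def _shard_name(window_start_seconds):
    """Format a window's start time (epoch seconds, UTC) as a shard name."""
    start = datetime.fromtimestamp(window_start_seconds, tz=timezone.utc)
    return start.strftime("shard-%Y%m%dT%H")


def shard_for(ts, shard_hours=24):
    """Return the shard holding a timezone-aware timestamp."""
    window = shard_hours * 3600
    seconds = int(ts.timestamp())
    return _shard_name(seconds - seconds % window)


def shards_for_range(start, end, shard_hours=24):
    """Return every shard a query over [start, end] has to touch."""
    window = shard_hours * 3600
    first = int(start.timestamp()) // window
    last = int(end.timestamp()) // window
    return [_shard_name(index * window) for index in range(first, last + 1)]
```

    Because a shard only ever holds one time window, old shards stop growing, which keeps per-shard memory bounded, while queries fan out only to the windows they cover.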

    Deployment patterns. To mitigate drift, one thing that comes in extremely handy is that Terraform is your infrastructure as code, which means any and all operations on your infrastructure should go through code. Now, drift can happen for various reasons. For instance, you updated your Terraform provider and something changed in your configuration that was not present in that provider earlier; let's say some kind of logging was enabled by default. Something as simple as this can create drift when you're upgrading your Terraform providers. These are easy fixes because they are expected when they happen. The other case is that somebody operated on your infrastructure through the browser portal or CLI commands. For instance, somebody added another node pool in Kubernetes from the browser or the CLI shell and did not put in the Terraform equivalent. This is exactly what needs to be prevented. To mitigate the risk, create IAM policies or IAM roles in your subscription with your cloud provider, and these roles could be admin, operator, and viewer. The admin can perform everything; it's an admin or root user. The second is the operator, who can act in the event of an incident, firing CLI commands or whatever is needed, without getting stuck. Everything else relies on Terraform, even in the case of a major catastrophe. The third is the viewer, somebody who can just view things and make no changes. That's how I would roll out the IAM roles. After that, the idea is to run a job, call it Terraform status, which runs periodically and checks your infrastructure for drift. Checking the output of this Terraform status job can be part of your sprint or your daily tasks. The job tells you about the drift, so you can backtrack to its cause and fix it. Currently, we catch drift as part of a weekly job. We all go on user support on a weekly rota, so whoever is on call or on support rotation is responsible for checking for drift and fixing it. And if there's a major problem, they raise it as a sprint task: hey, drift is being introduced in our infrastructure for such-and-such reason, and this is something we need to fix, or maybe some action we need to stop doing. That's how I would have it.
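
    A minimal sketch of the periodic drift check. The answer calls the job "Terraform status"; this sketch assumes it is implemented with `terraform plan -detailed-exitcode`, which exits with 0 when there are no changes, 1 on error, and 2 when the plan contains changes. The function names and the injectable `run` parameter are hypothetical.

```python
import subprocess


def classify_plan_exit(code):
    """Map the exit code of `terraform plan -detailed-exitcode` to a status."""
    if code == 0:
        return "in-sync"  # real infrastructure matches the code
    if code == 2:
        return "drift"    # the plan contains changes
    return "error"        # the plan itself failed


def check_drift(workdir, run=subprocess.run):
    """Run a plan in workdir and return (status, plan output).

    `run` is injectable so the check can be tested without Terraform installed.
    """
    result = run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=workdir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode), result.stdout
```

    Run from a scheduler against a branch that is already fully applied, a "drift" status means something changed outside Terraform (or a provider upgrade changed defaults), and the captured plan output is what the person on rotation would review.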

    What factors? Some of the factors I would evaluate are how users build their Docker images in the first place. For instance, a lot of commands, such as apt-get update or pip install run without disabling the cache, leave metadata behind, and the apt lists and pip caches get stored in the image. Secondly, users who install and compile a lot of libraries often do not delete their lib packages. While you are compiling something you might need a lib package, but once it is compiled you don't need it anymore, so you need to clean up afterwards. Thirdly, evaluate whether you have multi-stage builds. If users are not using multi-stage builds for their Docker images, that's something that comes in really handy to decrease the size of the image. It's also a nice practice to use Alpine images, because they are really lightweight and let you keep minimum components in your Docker images and deploy a very small image. So, yeah, I think that's all I have.

    I haven't actually worked with Golang much. My contribution with Golang is that I can edit an existing codebase to deliver a feature, but I haven't worked with it as a developer myself; that is something I still need to learn.