
Amlan Sekhar Das

Seasoned Site Reliability Engineering leader with over 12 years of experience driving scalable
distributed systems, cloud migrations, infrastructure security, and enterprise reliability
transformations leveraging open-source platforms. Trusted by executive stakeholders to steer
strategic availability, cost optimization, and engineering productivity in high-growth tech
environments. Proven ability to lead cross-functional teams and embed a reliability culture aligned
with business goals.

  • Role

    Engineering Manager, Infrastructure (SRE, DevOps)

  • Years of Experience

    13 years

Skillsets

  • Infrastructure
  • SRE
  • SIEM
  • Security
  • Root cause detection
  • Platformisation
  • Open-source
  • Observability
  • Multi-cloud
  • Microservices
  • Kubernetes
  • Kafka
  • AI
  • Incident Management
  • GCP
  • DevOps
  • Cost optimization
  • Compliance
  • Cloud
  • CI/CD
  • AWS
  • Automation

Professional Summary

13 Years
  • Jul, 2021 - Present (4 yr 8 months)

    Engineering Manager, Infrastructure (SRE, DevOps, Platformisation, Infra Security)

    Meesho
  • Nov, 2019 - Jul, 2021 (1 yr 8 months)

    Team Lead, Cloud Performance

    Oracle

Work History

13 Years

Engineering Manager, Infrastructure (SRE, DevOps, Platformisation, Infra Security)

Meesho
Jul, 2021 - Present (4 yr 8 months)
    • Sole owner of infrastructure management spanning compute and network layers, handling 10 million RPS with optimized server costs and sustained high availability.
    • Lead sales capacity planning, readiness, and delivery with zero uptime deviations.
    • Manage compute infrastructure of 400,000+ vCPUs across 8 Kubernetes production clusters with 18,000 nodes and 6,000 standalone compute instances, maintaining sub-30 ms latency.
    • Led a record-breaking AWS to GCP cloud migration completed in 2.5 months within a strict error budget of 0.5% and with zero downtime.
    • Established and governed an uptime framework achieving 99.9% overall uptime and 99.95% availability across platform layers.
    • Conduct quarterly leadership reviews to evaluate and mitigate tech debt.
    • Consistently achieved system-driven MTTR targets for critical incidents over 8 consecutive quarters: Code Red < 5 mins, S0 < 30 mins, S1 < 1.2 hours.
    • Scaled and transformed the SRE organization into a 26-member platformization team across seniority levels, driving platform stability, server cost optimization, and developer productivity with open-source based tooling.
    • Architected a centralized observability platform processing ~0.7 million data points every 15 seconds at ~1% of server cost, evaluating 12,000+ alerts every 5 seconds with AI-driven alerting and achieving 99.995% monthly uptime.
    • Led centralized on-call operations, resolving 1,500+ incidents per month with an average TTR under 0.7 hours and optimizing bandwidth across 26 engineers.
    • Delivered three internal Go-based platforms for application management, alert and incident management, and on-call management, leveraging AI for automated root cause detection and self-healing.
    • Drove platform server cost optimizations yielding multimillion-dollar savings without compromising reliability.
    • Directed the multi-CDN ecosystem architecture, including vendor leadership (Akamai, Cloudflare), migrations, and governance, ensuring resilient global content delivery.
    • Spearheaded infrastructure security initiatives, implementing zero-trust architecture and comprehensive SIEM coverage built on open-source security tools, ensuring IPO readiness and regulatory compliance.
    • Pioneer AI-driven root cause detection and autonomous infrastructure self-healing initiatives, tightly integrated with open-source monitoring ecosystems.
    • Lead multi-cloud hybrid architecture design integrating data center and cloud environments, focusing on cost reduction, governance, stability, reliability, and advanced observability.
    • Designed and operated multi-region Kubernetes workloads handling 3 million RPS over inter-cluster communication, using a cost-effective single-zone strategy that delivers 99.95% uptime.
    • Serve as direct point of contact for Google Cloud, Akamai, PagerDuty, Coralogix, and Cloudflare; lead production incident root cause analyses with vendors, enabling rapid escalation and resolution.
    • Drive incident management culture through blameless RCA reviews and continuous reflection that feed organizational learning.

Team Lead, Cloud Performance

Oracle
Nov, 2019 - Jul, 2021 (1 yr 8 months)
    Directed migration of healthcare and life science analytics products to OCI, driving performance standards, multi-tenant Kubernetes orchestration, and regulatory audit readiness using open-source technologies.

Education

  • B.Tech, Electronics and Instrumentation

    SRM University (2013)