Amlan Sekhar Das

Seasoned Site Reliability Engineering leader with over 12 years of experience driving scalable distributed systems, cloud migrations, infrastructure security, and enterprise reliability transformations leveraging open-source platforms. Trusted by executive stakeholders to steer strategic availability, cost optimization, and engineering productivity in high-growth tech environments. Proven ability to lead cross-functional teams and embed reliability culture aligned with business goals.

Role
Engineering Manager, Infrastructure (SRE, DevOps)
Years of Experience
13 years

Skillsets

Infrastructure
SRE
SIEM
Security
Root cause detection
Platformisation
Open-source
Observability
Multi-cloud
Microservices
Kubernetes
Kafka
AI
Incident Management
GCP
DevOps
Cost optimization
Compliance
Cloud
CI/CD
AWS
Automation

Professional Summary

13 Years

Jul, 2021 - Present (4 yr 10 months)
Engineering Manager, Infrastructure (SRE, DevOps, Platformisation, Infra security)
Meesho
Nov, 2019 - Jul, 2021 (1 yr 8 months)
Team Lead, Cloud Performance
Oracle

Work History

13 Years

Engineering Manager, Infrastructure (SRE, DevOps, Platformisation, Infra security)

Meesho

Jul, 2021 - Present (4 yr 10 months)

Sole owner of infrastructure management spanning compute and network layers, handling 10 million RPS with optimized server costs and sustained high availability.
Lead sales capacity planning, readiness, and delivery with zero uptime deviations.
Manage 400,000+ vCPUs of compute across 8 production Kubernetes clusters with 18,000 nodes and 6,000 standalone compute instances, maintaining sub-30 ms latency.
Led a record-breaking AWS-to-GCP cloud migration, completed in 2.5 months within a strict 0.5% error budget and with zero downtime.
Established and governed an uptime framework achieving 99.9% overall uptime and 99.95% availability across platform layers.
Conduct quarterly leadership reviews to evaluate and mitigate tech debt.
Consistently achieve system-driven MTTR targets for critical incidents over 8 consecutive quarters: Code Red < 5 mins, S0 < 30 mins, S1 < 1.2 hours.
Scaled and transformed the SRE organization into a 26-member platformization team across seniority levels, driving platform stability, server cost optimization, and developer productivity with open-source tooling.
Architected a centralized observability platform processing ~0.7 million data points every 15 seconds at ~1% of server cost, managing 12,000+ alerts every 5 seconds with AI-driven alerting and achieving 99.995% monthly uptime.
Led centralized on-call operations, resolving 1,500+ monthly incidents with an average TTR under 0.7 hours and optimizing bandwidth across 26 engineers.
Delivered three internal Go-based platforms for application management, alert and incident management, and on-call management, leveraging AI for automated root cause detection and self-healing.
Drove platform server cost optimizations yielding multimillion-dollar savings without compromising reliability.
Directed multi-CDN ecosystem architecture with vendor leadership (Akamai, Cloudflare), migrations, and governance, ensuring resilient global content delivery.
Spearheaded infrastructure security initiatives, implementing zero-trust architecture and comprehensive SIEM coverage built on open-source security tools, ensuring IPO readiness and regulatory compliance.
Pioneer AI-driven root cause detection and autonomous infrastructure self-healing initiatives, tightly integrated with open-source monitoring ecosystems.
Lead multi-cloud hybrid architecture design integrating data center and cloud environments, focusing on cost reduction, governance, stability, reliability, and advanced observability.
Designed and operated multi-region Kubernetes workloads achieving 3 million RPS over inter-cluster communication, with a cost-effective single-zone strategy delivering 99.95% uptime.
Serve as direct point of contact for Google Cloud, Akamai, PagerDuty, Coralogix, and Cloudflare; lead production incident root cause analyses with vendors, enabling rapid escalation and resolution.
Drive incident management culture through blameless RCA reviews and continuous reflection that feed organizational learning.

Team Lead, Cloud Performance

Oracle

Nov, 2019 - Jul, 2021 (1 yr 8 months)

Directed migration of healthcare and life science analytics products to OCI, driving performance standards, multi-tenant Kubernetes orchestration, and regulatory audit readiness using open-source technologies.

Education

B.Tech, Electronics and Instrumentation
SRM University (2013)