Manager, Site Reliability Engineering (Platform & Tooling)
Okta
Nov 2024 - May 2026 · 1 yr 6 months
Led a 24-person SRE org (16 SREs, 8 SWEs) responsible for Infrastructure Delivery, CTAP (Cloud Tooling Automation Pipelines), Kubernetes Platform (EKS/GKE), and Observability. Built, mentored, and managed the team, fostered an AI & Automation First culture, partnered with recruiting and People Ops to hire and retain top-tier talent, and aligned platform capabilities with reliability, security, and delivery velocity. Championed AI-driven reliability: deployed ML-based predictive stability models, engineered self-healing runbooks and LLM-augmented incident triage, and commanded 24x7 incident response. Defined SLIs, SLOs, and error budget policies for 22 critical services, enforced Production Readiness Reviews, and integrated Spinnaker and ArgoCD orchestration. Directed cloud spend optimization through ML-based resource right-sizing engines, achieving $310K+ in annualized savings, and embedded security best practices into the delivery pipeline.