Manager, Site Reliability Engineering (Platform & Tooling)
Okta, Nov 2024 - Present (1 yr 4 months)
Oversee the SRE organization focused on the Okta Cloud platform, managing Infra Delivery, Cloud Tooling Automations and Pipelines (CTAP), Kubernetes, and Observability. Built, mentored, and led a high-performing team of SREs and Software Engineers; partnered directly with recruiting to hire top-tier talent and fostered a culture of continuous learning and an "Automation First" ethos.

Spearheaded the architecture and strategic rollout of the Internal Developer Platform (IDP), adopting a "Platform-as-Product" mindset to reduce cognitive load and accelerate developer velocity across the engineering organization. Integrated CI/CD orchestration (Spinnaker, ArgoCD) and self-service IaC to streamline the path to production.

Shifted core infrastructure stability work from reactive to proactive SRE practices: architected a predictive, self-healing cloud platform that sustains 99.99% availability for critical production systems, and replaced reactive toil with code-driven fleet management. Instituted Continuous Stability practices driven by AI forecasting, predicting service degradation 30 minutes in advance and cutting Mean Time To Resolution (MTTR) by 85% (from 120 minutes to 18 minutes).

Directed cloud spend optimization and resource efficiency initiatives aligned with business metrics. Achieved $310K+ in annualized savings ($180K infrastructure + $130K licensing) within the first 90 days by implementing ML-based right-sizing engines and negotiating strategic vendor agreements.

Identified and mitigated bottlenecks in the development flow by deploying an ML-driven change risk assessment engine and expanding automated regression coverage from 45% to 85%, resulting in an 18% uplift in cloud release velocity and a 60% reduction in production defects.

Fortified the platform by embedding security best practices into the delivery pipeline. Deployed real-time anomaly detection to identify novel threats and automated vulnerability scanning to ensure secure, compliant releases without slowing delivery.

Collaborate with cross-functional stakeholders to balance platform capabilities against the competing constraints of reliability, security, and delivery speed; actively govern key metrics including RPO, RTO, cloud spend, and vulnerability posture.