Site Reliability Engineer (SRE)
Yellow.ai
Jul 2023 - Present (2 yr 7 months)
Orchestrated multi-tenant environments across on-premises and SaaS deployments, managing high-availability Kubernetes (EKS) and Docker Swarm clusters to ensure 99.9% service reliability.
Automated EKS lifecycle management, performing rolling upgrades of control planes, worker nodes, and add-ons to maintain security compliance and performance.
Optimized cloud spend by migrating to Karpenter for just-in-time provisioning and Spot instance orchestration, reducing infrastructure costs by 30%.
Streamlined application delivery by developing modular Helm charts, enabling reusable, standardized deployments across diverse client environments.
Architected a centralized logging and monitoring stack using Fluentd, OpenSearch, Prometheus, and Grafana, reducing Mean Time to Resolution (MTTR) by 50% through proactive alerting.
Hardened platform security by implementing HashiCorp Vault for dynamic secret injection and automated credential rotation via the Vault Agent Injector.
Standardized Infrastructure as Code (IaC) using Terraform to automate resource provisioning, eliminating manual configuration drift and reducing deployment lead times.
Developed Python and Bash automation suites to eliminate toil, reducing manual operational errors and increasing overall team velocity.
Owned 24/7 on-call rotations for mission-critical production environments, leading incident response and Root Cause Analysis (RCA) and coordinating cross-functional teams to resolve production outages, maintain 99.9% uptime, and prevent recurrence.
Designed and executed Disaster Recovery (DR) strategies using Velero and custom backup workflows for both on-premises and SaaS deployments.
Managed end-to-end quarterly release cycles, including high-availability database setups and automated data seeding to simplify client onboarding.
Served as Technical Lead for client escalations and mentored junior engineers, institutionalizing DevOps best practices and accelerating team onboarding.