Manager, Site Reliability Engineering
KHUMBU SYSTEMS
Jan 2019 - Dec 2022 (3 yr 11 months)
Oversaw the operation of the production environment, including monitoring and troubleshooting, review and scheduling of planned changes, and outage management. Served as information security & network subject matter expert; provided advisory and consulting services as needed for various projects in the organization. Authored configuration management procedures and playbooks for resolving incidents.

CI/CD:
o Eliminated manual release cycles by automating the build and deployment process with Azure Pipelines and Gradle, reducing release time by ~40%.
o Automated the entire deployment of infrastructure across several environments by implementing CI/CD pipelines.
o Implemented static code analysis and security scans via SonarCloud in CI pipelines.

IaC:
o Transformed existing manual infrastructure into composable, reusable Terraform modules, along with continuous testing of IaC code.
o , , Serverless, DynamoDB, Route 53 and other AWS services to AWS SAM and AWS CloudFormation.

Observability:
o Defined and captured metrics such as latency, traffic and errors via AWS CloudFormation logs, exported to Datadog and AWS ELK.
o Implemented chaos engineering to proactively detect potential failure points and identify bottlenecks.

Continuous monitoring:
o Implemented monitoring for all mission-critical production resources by defining SLI and SLO metrics.
o Modernised the on-call system by migrating to PagerDuty driven by AWS alarms, and by defining rules, communication methods and response plans.

Compliance:
o Owned product-related security compliance initiatives such as SOC 2, ISO 27001 and PCI compliance, as well as annual assessments with external audit firms.
o Developed and implemented continuous compliance in AWS via a pipeline using CloudFormation Guard and Terraform-Compliance frameworks.

Reduced AWS billing by 30% by identifying ways to utilise resources more efficiently.
Architected scaling to handle up to a 15X increase in requests with 99.99% availability.
Modernised disaster recovery by implementing AWS central backup and multi-site active-active models for mission-critical services.