Spearheaded chaos engineering initiatives for NPCI mission-critical applications (IMPS, PSO, CBDC, NTS, IRCS, EFRM, UPI), achieving 99.99% uptime by proactively identifying and mitigating failure points.
Collaborated with stakeholders to design architecture-driven experiments and probes, building hypothesis-based test plans from deep system-architecture analysis to uncover bottlenecks and fault-prone components.
Used SLIs and SLOs to quantify system resilience, analysing MTTD, MTTR, and MTPOD to drive measurable improvements.
Designed and executed failure-mode experiments covering network latency, packet loss, partitioning, quorum loss, rack awareness, and service crashes, plus advanced scenarios such as fragmentation, SYN flooding, and IO freeze using BYOC.
Participated in incident retrospectives to analyse root causes, then designed chaos experiments to validate production fixes and prevent recurrences.
Enforced ChaosGuard guardrails for safe, controlled production tests.
Deployed Prometheus, Grafana, Alertmanager, and OTel tracing; created actionable dashboards and success metrics for gamedays and RCA cycles.
Built automated health checks integrating logs, APIs, resource metrics, and Kafka topics, improving transaction visibility during experiments.
Developed SRE-grade automation using Shell, Python, and Ansible, improving operational consistency and reducing manual overhead.
Added k6/JMeter load tests to support performance validation, capacity planning, and tuning of auto-scaling thresholds.
Migrated applications using Velero and MinIO and established a disaster recovery site for Harness SMP to ensure business continuity.
Provisioned and automated infrastructure using Terraform and Ansible, including reusable modules, multi-environment setups, and IaC-driven deployments; deployed Kubernetes clusters with Ansible for consistent, repeatable provisioning.
Orchestrated safe production chaos experiments by enabling ChaosGuard guardrails, leading CAB approvals and stakeholder alignment, and scheduling controlled off-peak runs with a limited blast radius. Organized gamedays to validate fixes and train teams on failure scenarios.
Built CI/CD pipelines (Harness/Jenkins) with automated tests, security scans, and rollout strategies (blue/green, canary).