Sandhya Chunchu

I would like to work with an organization where my technical skills will be utilized and further enhanced, helping me grow professionally and personally while contributing to organizational development. I am hardworking and diligent, work well under pressure, and learn quickly. My technical experience has also helped me build a strong foundation of soft skills, which are vital for a professional. I grasp new technical concepts quickly and apply them effectively.

Role
Senior Associate SRE & Chaos Engineer
Years of Experience
3.7 years

Skillsets

Reliability
MinIO
Monoliths
Networking
Observability
OpenStack
Orchestration
Platform reliability
Prometheus
Python
Redis
Load balancing
Resilience
Restore
Scripting
SFTP
Shell
SMP
Terraform
Velero
VMware
Failure Analysis
Automation
Backup
Chaos engineering
CI/CD
Containerization
Distributed Systems
DNS
Docker
ELK
Ansible
Grafana
Harness
Incident Analysis
JMeter
k6
Kafka
KeyDB
Kubernetes
Litmus

Professional Summary

3.7 Years

Jul, 2022 - Present (3 yr 9 months)
Senior Associate SRE
NPCI

Work History

3.7 Years

Senior Associate SRE

NPCI

Jul, 2022 - Present (3 yr 9 months)

Spearheaded chaos engineering initiatives for NPCI mission-critical applications (IMPS, PSO, CBDC, NTS, IRCS, EFRM, UPI), achieving 99.99% uptime by proactively identifying and mitigating failure points.
Collaborated with stakeholders to design architecture-driven experiments and probes, developing hypothesis-based test plans through deep system-architecture analysis to uncover bottlenecks and fault-prone components.
Leveraged SLIs and SLOs to quantify system resilience, analysing MTTD, MTTR and MTPOD to drive measurable improvements.
Designed and executed failure mode experiments covering latency, loss, partitioning, quorum loss, rack awareness, and service crashes, as well as advanced scenarios such as fragmentation, SYN flooding, and IO freeze using BYOC.
Participated in incident retrospectives to analyse root causes, and designed chaos experiments to validate production fixes and prevent recurrences.
Orchestrated safe, controlled production chaos tests by enforcing ChaosGuard guardrails, leading CAB approvals and stakeholder alignment, and scheduling non-peak experiments with a limited blast radius.
Organized gamedays to validate fixes and train teams on failure scenarios.
Deployed Prometheus, Grafana, Alertmanager, and OTel tracing; created actionable dashboards and success metrics for gamedays and RCA cycles.
Built automated health checks integrating logs, APIs, resource metrics, and Kafka topics, improving transaction visibility during experiments.
Developed SRE-grade automation using Shell, Python, and Ansible, improving operational consistency and reducing manual overhead.
Added k6/JMeter load tests to support performance validation, capacity planning, and auto-scaling thresholds.
Migrated applications using Velero and MinIO, establishing a Disaster Recovery site for Harness SMP to ensure business continuity.
Provisioned and automated infrastructure using Terraform and Ansible, including reusable modules, multi-environment setups, and IaC-driven deployments, and deployed Kubernetes clusters with Ansible for consistent, repeatable provisioning.
Built CI/CD pipelines (Harness/Jenkins) with automated tests, security scans, and rollout strategies (blue/green, canary).

Education

B.E in Electronics and Communication Engineering
MVSREC Hyderabad (2022)
