Sandhya Chunchu

I would like to work with an organization where my technical skills will be utilized and further enhanced, helping me grow professionally and personally while contributing to organizational development. I am hardworking and diligent, able to work under pressure, and eager to learn. My technical experience has helped me build a strong foundation of soft skills, which are vital for a professional. I grasp new technical concepts quickly and apply them effectively.

  • Role

    Senior Associate SRE & Chaos Engineer

  • Years of Experience

    3.7 years

Skillsets

  • Reliability
  • MinIO
  • Monoliths
  • Networking
  • Observability
  • OpenStack
  • Orchestration
  • Platform reliability
  • Prometheus
  • Python
  • Redis
  • Load balancing
  • Resilience
  • Restore
  • Scripting
  • SFTP
  • Shell
  • SMP
  • Terraform
  • Velero
  • VMware
  • Failure Analysis
  • Automation
  • Backup
  • Chaos engineering
  • CI/CD
  • Containerization
  • Distributed Systems
  • DNS
  • Docker
  • ELK
  • Ansible
  • Grafana
  • Harness
  • Incident Analysis
  • JMeter
  • K6
  • Kafka
  • KeyDB
  • Kubernetes
  • Litmus

Professional Summary

3.7 Years
  • Jul 2022 - Present (3 yr 9 months)

    Senior Associate SRE

    NPCI

Work History

3.7 Years

Senior Associate SRE

NPCI
Jul 2022 - Present (3 yr 9 months)

  • Spearheaded chaos engineering initiatives for NPCI mission-critical applications (IMPS, PSO, CBDC, NTS, IRCS, EFRM, UPI), achieving 99.99% uptime by proactively identifying and mitigating failure points.
  • Collaborated with stakeholders to design architecture-driven experiments and probes, developing hypothesis-based test plans through deep system-architecture analysis to uncover bottlenecks and fault-prone components.
  • Leveraged SLIs and SLOs to quantify system resilience, conducting analyses of MTTD, MTTR and MTPOD to drive measurable improvements.
  • Designed and executed failure mode experiments including latency, loss, partitioning, quorum loss, rack awareness, service crashes, and advanced scenarios like fragmentation, SYN flooding, and IO freeze using BYOC.
  • Participated in incident retrospectives to analyse root causes and designed chaos experiments to validate production fixes and prevent recurrences.
  • Enforced ChaosGuard guardrails for safe, controlled production tests.
  • Deployed Prometheus, Grafana, Alertmanager, and OTel tracing; created actionable dashboards and success metrics for gamedays and RCA cycles.
  • Built automated health checks integrating logs, APIs, resource metrics, and Kafka topics, improving transaction visibility during experiments.
  • Developed SRE-grade automation using Shell, Python, and Ansible, improving operational consistency and reducing manual overhead.
  • Added k6/JMeter load tests to support performance validation, capacity planning, and auto-scaling thresholds.
  • Migrated applications using Velero and MinIO, establishing a Disaster Recovery site for Harness SMP to ensure business continuity.
  • Provisioned and automated infrastructure using Terraform and Ansible, including reusable modules, multi-environment setups, IaC-driven deployments, and deploying Kubernetes clusters using Ansible for consistent, repeatable provisioning.
  • Orchestrated safe production chaos by enabling ChaosGuard guardrails, leading CAB approvals and stakeholder alignment, and scheduling controlled non-peak experiments with a limited blast radius.
  • Organized gamedays to validate fixes and train teams on failure scenarios.
  • Built CI/CD pipelines (Harness/Jenkins) with automated tests, security scans, and rollout strategies (blue/green, canary).

Education

  • B.E. in Electronics and Communication Engineering

    MVSREC Hyderabad (2022)