Principal Engineer & Site Reliability Engineering Manager
Sportskeeda, Oct 2021 - Apr 2024 (2 yr 6 months)
Platform Scale: Infrastructure supporting 40M+ pageviews/day, ~6.5K backend RPS, and 300K+ concurrent users across teams.
Migration Platform & Tooling: Built a reusable migration platform enabling 6+ zero-downtime migrations (Redis, MongoDB, MySQL, Elasticsearch) across 4 teams. Created a shadow dual-write, gradual-cutover pattern with automated consistency checks and rollback capabilities. Results: Migration incident rate 40%→5%, zero data loss across all migrations. Redis Enterprise: 50% cost reduction ($50K/year), sub-10ms p99 latency. MongoDB 3.0→4.4 upgrade saved $30K/year. Technical: Dual-write synchronization, percentage-based cutover, SLO-based gates, real-time monitoring.
CDN Platform & Edge Infrastructure: Migrated from CloudFront to Cloudflare and solved an A/B test caching inefficiency (100 cache variants→2-3). Built Cloudflare Workers to normalize A/B segments at the edge, later adopted by the frontend team for other use cases. Configured WAF rules and bot management to protect against DDoS attacks and malicious traffic. Results: 97% cache footprint reduction, 82%→97% hit rate, $40K/year savings. Cost optimization: Identified an AWS egress cost increase and optimized the cache strategy, reducing origin requests by 90%.
Platform Caching Infrastructure: Designed multi-layer caching (CDN→Redis→App LRU→DB) adopted by 3 backend teams, with a 60% database load reduction. Built a reusable golang-lru wrapper (10K items, 30s TTL) with monitoring dashboards and integration patterns. Performance: p99 latency 180ms→25ms during peak traffic, infrastructure costs down $30K/year.
Developer Platform & Testing Infrastructure: Built a production-like testing platform by extending the A/B framework (cookie-based routing to isolated environments). Results: $55K/year savings vs traditional staging, 40% production bug reduction, parallel testing for 3 teams.
Infrastructure Platform & Reliability: Observability: Migrated from Datadog to a self-hosted stack while maintaining distributed tracing. Compute: AWS Graviton migration (40 instances, 20% cost reduction, $4.7K/year) with compatibility playbooks. Database: MySQL read replicas with analytics isolation; MongoDB 14-node scaling based on actual load. SRE: Established an SLO framework (99.9%+ uptime) and blameless postmortems, with a 30% MTTR reduction.
Real-Time Platform (WebSocket Infrastructure): Built WebSocket infrastructure from scratch for the Cricrocket App, delivering the fastest live cricket score updates. Designed a scalable architecture: connection management, message broadcasting, horizontal scaling across instances. Results: Supported 20K+ concurrent WebSocket connections in production, stress tested to 50K+ concurrent users.