Site Reliability Engineer
the COOL Co | Jul 2024 - Feb 2026 | 1 yr 7 months
Tuned Linux kernel and OS-level parameters to improve resource utilization, increasing application performance by 25% under production traffic.
Deployed a K3s cluster across 50+ dedicated Leaseweb servers, improving compute utilization by 30%.
Built an observability stack (Prometheus, Grafana, Alertmanager) integrated with Slack, PagerDuty, and email; defined SLIs/SLOs for latency (<200ms p95) and availability (99.9%), with corresponding error budgets.
Reduced mean time to detection (MTTD) by 40% and mean time to recovery (MTTR) by 35% by tying alerting to SLIs/SLOs and standardizing runbooks.
Designed a high-availability PostgreSQL cluster with ZFS-backed storage, streaming replication, and pg_auto_failover; achieved zero data loss and recovery in under 60s during failover drills.
Optimized Kafka brokers to sustain 500MB/s throughput; tuned partitioning, replication factor, and JVM GC settings to keep produce and consume latencies under 30ms.
Tuned an Apache Druid ingestion pipeline to handle 500K events/sec with query latency consistently under 300ms.
Automated rolling updates for PostgreSQL, HAProxy, and K3s clusters via Ansible, reducing operator toil by 70%.