Designed and operated production ML platforms on AWS EKS, supporting the end-to-end ML lifecycle from training to deployment. Built and optimized Docker images and deployed services with Helm charts for consistent, reproducible releases. Implemented CI/CD pipelines with Jenkins, GitHub Actions, and Argo CD, automating build, test, and deployment workflows. Managed infrastructure as code with Terraform across multiple AWS services (EKS, EC2, VPC, IAM, S3, ECR, RDS, Lambda). Deployed and served ML models with a focus on scalability and low latency. Implemented model versioning and experiment tracking with MLflow and DVC. Set up observability with Prometheus, Grafana, Loki, and CloudWatch, enabling proactive monitoring and alerting. Reduced cluster costs through resource right-sizing, horizontal pod autoscaling (HPA), and node utilization strategies. Collaborated with data science and platform teams to productionize models with high reliability and security.