Senior Staff Software Engineer
Druva · Jan 2021 - Jan 2024 · 3 yr
Optimized the data backup and recovery architecture for a SaaS platform handling 10M+ API requests/day; reduced latency by 35% and improved throughput by 40% using the Amazon S3 Multipart API with dynamic chunking, deduplication, and parallel uploads.

Ingested API logs, system metrics, and incident reports from SharePoint, OneDrive, and Teams via AWS SQS and Lambda. Ensured data completeness and quality with schema validation and automated error-handling pipelines. Cleaned and normalized unstructured logs and metrics, and handled missing data and outliers with imputation and anomaly-aware filters. Engineered features such as error frequencies, response-time distributions, and dependency-graph embeddings to feed ML model development.

Built unsupervised pattern-detection pipelines using DBSCAN clustering and LSTM models, with the LSTMs trained by backpropagation and gradient descent. Fine-tuned transformer LLMs on historical incident summaries and logs to suggest resolutions. Tuned hyperparameters (learning rate, batch size) and trained with the Adam optimizer to minimize prediction loss.

Assessed anomaly detectors with precision, recall, and ROC-AUC. Validated LLM recommendations with human-in-the-loop A/B testing and confidence-score thresholds. Implemented guardrails for LLMs and traditional ML models, including output sanitization, rate limits, and fail-safe fallbacks.

Deployed real-time inference endpoints via FastAPI and AWS Step Functions, achieving <200 ms response times and 99.9% uptime. Integrated MLOps pipelines for continuous drift monitoring, data-quality alerts, and automated retraining. Enabled self-healing workflows in which feedback loops capture analyst overrides, trigger automatic retraining, and refine labels via weak supervision and active learning. Detected silent failures and regressions early, reducing incident spikes by 50%.
Resolution Recommendation Engine: Delivers top-N remediation suggestions via Qdrant vector search; auto-resolves 1K+ incidents/month, cutting MTTR by 60%. NLP-driven summaries translate cryptic logs into human-readable insights for L1/L2 teams. Preemptively flags high-risk backup tasks; continuous learning improved accuracy by 20% over six months.