Built production-grade multilingual (English + Spanish) real-time ASR/STT systems using NVIDIA Parakeet-TDT-0.6B-v3, Parakeet-CTC-0.6B-ES, Parakeet-RNNT-1.1B, Nemotron Speech Streaming, Whisper Large-v3 Turbo, Google STT v2, and Azure Speech Services supporting live microphone and audio file streaming with partial and final transcript generation. Architected WebSocket ASR servers with FastAPI + Uvicorn featuring chunk-wise PCM streaming, silence-based finalization, manual flush triggers, and JSON partial/final transcript events; tuned VAD systems (WebRTC, Adaptive Energy, Silero) with silence thresholds (200ms-900ms) and chunk durations (10/20/30ms) for near real-time delivery without premature cutoffs. Conducted structured ASR model benchmarking across Parakeet-TDT/CTC-ES/RNNT, Riva, Whisper, Google STT, and Azure STT evaluating TTFT, TTFB, WER, first/average latency, context-switching score, and final transcript quality; generated detailed Excel-based evaluation reports with per-file model analysis. Containerized GPU-accelerated ASR inference services using Docker, CUDA 12.x (nvidia/cuda runtime), NIM, Whisper, Google STT, NIM deployments, and optimized Dockerfiles; resolved critical issues including NeMo import crashes, torch/CUDA mismatches, segfaults, PyAudio/ffmpeg build failures, and pydantic conflicts for high-availability production deployment. Deployed and operated real-time ASR services on GCP (Compute Engine, Artifact Registry, Cloud Run) and AWS EC2 managing VM provisioning, Docker registry pushes, port/firewall rules, SSH tunneling, and WebSocket/gRPC remote endpoint validation and health monitoring. Evaluated and compared NVIDIA Triton Inference Server, NIM, Riva, and vLLM for speech AI model serving assessing GPU utilization, streaming endpoint behavior, supported model discovery, and deployment strategies to determine optimal production scalability and inference performance.