May 14, 2025 · 8 min read

Serving 50K Inference Requests a Day on Kubernetes

Lessons from architecting production ML serving on Azure Kubernetes Service (AKS) and hitting sub-200ms latency without burning through GPU budget.

MLOps · Kubernetes

The hard part of serving machine learning models in production is rarely a single model endpoint. It is the operational system around it: autoscaling, batching, cold starts, observability, and cost control.

For this platform, Kubernetes handled isolation and scheduling while FastAPI provided a predictable request layer. The biggest win came from treating model servers as measurable services instead of opaque GPU boxes.
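To make "measurable services" concrete, here is a minimal sketch of per-request latency tracking using only the Python standard library. It is not the platform's actual instrumentation: the bucket bounds, the `LatencyHistogram` class, and the `timed` decorator are all hypothetical names chosen for illustration, and a real deployment would more likely export these numbers through a Prometheus client library.

```python
import time
from bisect import bisect_left
from functools import wraps

# Hypothetical bucket upper bounds in seconds; 0.2 marks the 200 ms target.
BUCKETS = (0.025, 0.05, 0.1, 0.2, 0.5, 1.0, float("inf"))


class LatencyHistogram:
    """Fixed-bucket latency histogram (illustrative, not a Prometheus client)."""

    def __init__(self, buckets=BUCKETS):
        self.buckets = tuple(buckets)
        self.counts = [0] * len(self.buckets)
        self.total = 0

    def observe(self, seconds):
        # Find the first bucket whose upper bound is >= the observed value.
        idx = bisect_left(self.buckets, seconds)
        self.counts[idx] += 1
        self.total += 1

    def fraction_within(self, bound):
        """Share of observations at or below `bound` (must be a bucket bound)."""
        if self.total == 0:
            return 0.0
        idx = self.buckets.index(bound)
        return sum(self.counts[: idx + 1]) / self.total


def timed(histogram):
    """Decorator that records the wall-clock duration of each call."""
    def decorator(fn):
        @wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                # Record the duration even if the handler raises.
                histogram.observe(time.perf_counter() - start)
        return wrapper
    return decorator
```

Wrapping a request handler with `@timed(histogram)` and then reading `histogram.fraction_within(0.2)` gives the share of requests that finished within 200 ms, which is the kind of signal a dashboard or alert can be built on.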

The resulting stack handled more than 50K inference requests per day with sub-200ms latency by combining right-sized GPU workers, request queueing, and Prometheus-driven scaling signals.
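The queueing and scaling ideas can be sketched together in a few lines. This is an illustration under stated assumptions, not the platform's code: `batch_worker`, `infer`, `demo`, the batch size, the wait window, and the replica limits are all hypothetical. The `desired_replicas` function follows the rule documented for the Kubernetes Horizontal Pod Autoscaler, `ceil(currentReplicas * currentMetric / targetMetric)`, where the metric might be something like queue depth per replica scraped by Prometheus.

```python
import asyncio
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=8):
    """HPA-style rule, clamped to hypothetical min/max replica limits."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


async def batch_worker(queue, predict_batch, max_batch=8, max_wait_s=0.01):
    """Drain the queue into small batches so one model call serves several requests."""
    loop = asyncio.get_running_loop()
    while True:
        # Block until at least one request arrives, then open a short window.
        batch = [await queue.get()]
        deadline = loop.time() + max_wait_s
        while len(batch) < max_batch:
            remaining = deadline - loop.time()
            if remaining <= 0:
                break
            try:
                batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
            except asyncio.TimeoutError:
                break
        # A production worker would also catch errors here and fail the futures.
        outputs = predict_batch([payload for payload, _ in batch])
        for (_, future), output in zip(batch, outputs):
            future.set_result(output)


async def infer(queue, payload):
    """Enqueue one request and wait for its result."""
    future = asyncio.get_running_loop().create_future()
    await queue.put((payload, future))  # waits when the bounded queue is full
    return await future


async def demo():
    queue = asyncio.Queue(maxsize=64)  # bounded queue provides backpressure
    # Stand-in for a real model: doubles each input.
    worker = asyncio.create_task(batch_worker(queue, lambda xs: [x * 2 for x in xs]))
    results = await asyncio.gather(*(infer(queue, i) for i in range(5)))
    worker.cancel()
    return results
```

Running `asyncio.run(demo())` returns the doubled inputs in request order. With 2 replicas, an observed metric of 30, and a target of 10, `desired_replicas(2, 30, 10)` asks for 6 replicas; the clamp keeps a noisy signal from scaling past the GPU budget.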