Model Serving

The infrastructure and systems that host trained ML models and handle inference requests in production, optimizing for latency, throughput, and cost.

Model serving is the bridge between a trained model and user-facing features. It handles receiving requests, running inference, returning results, and managing the operational concerns of production systems: scaling, load balancing, batching, caching, and failover.

For teams using LLM APIs (OpenAI, Anthropic), model serving is largely handled by the provider. Your engineering focus shifts to API management: request routing between models based on task complexity, response caching for common queries, rate limit management, and fallback chains when primary models are unavailable. A typical production setup routes 70-80% of requests to cheaper models, escalating only complex cases to premium models.
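The routing, caching, and fallback pattern described above can be sketched in a few lines. This is a hypothetical illustration, not a provider SDK: the model names, the `is_complex` heuristic, and `call_model` are all placeholders standing in for real API calls and a real complexity classifier.

```python
import hashlib

CHEAP_MODEL = "small-model"      # placeholder name, not a real model ID
PREMIUM_MODEL = "large-model"    # placeholder name, not a real model ID

cache = {}  # response cache for common queries

def is_complex(prompt: str) -> bool:
    # Naive stand-in for a task-complexity check; real routers use
    # classifiers, token counts, or per-feature configuration
    return len(prompt) > 500 or "step by step" in prompt.lower()

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real HTTP call to a provider API
    return f"[{model}] response to: {prompt[:30]}"

def serve(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:                 # cache hit: skip the API entirely
        return cache[key]
    # Route most traffic to the cheap model, escalate complex cases
    model = PREMIUM_MODEL if is_complex(prompt) else CHEAP_MODEL
    try:
        result = call_model(model, prompt)
    except Exception:
        # Fallback chain: retry on the other model if the primary fails
        fallback = CHEAP_MODEL if model == PREMIUM_MODEL else PREMIUM_MODEL
        result = call_model(fallback, prompt)
    cache[key] = result
    return result
```

In a real deployment the cache would live in a shared store such as Redis, and rate-limit handling would sit between `serve` and the provider call.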

For teams running self-hosted models (fine-tuned models, embedding models, custom classifiers), serving infrastructure matters more. Solutions like vLLM, TGI, and BentoML handle GPU utilization, request batching, and scaling. The key optimization is batching: processing multiple requests together on the GPU dramatically improves throughput and reduces per-request cost, at the expense of slightly higher latency for individual requests.
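The batching trade-off can be made concrete with a minimal sketch: collect requests until the batch is full or a short wait deadline passes, then run them through the model in one call. `run_batch` is a placeholder for a real batched GPU forward pass; serving frameworks like vLLM implement far more sophisticated versions of this loop (e.g. continuous batching) internally.

```python
import time
from queue import Queue, Empty

MAX_BATCH = 8       # largest batch one GPU call will process (illustrative)
MAX_WAIT_S = 0.01   # latency budget: how long a request may wait for peers

def run_batch(prompts):
    # Stand-in for a single batched GPU inference call; the whole point
    # is that one call over N prompts is much cheaper than N calls
    return [f"output for {p}" for p in prompts]

def serve_loop(request_queue: Queue, max_batches: int):
    results = []
    for _ in range(max_batches):
        batch = []
        deadline = time.monotonic() + MAX_WAIT_S
        # Fill the batch until it is full or the wait deadline expires
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=timeout))
            except Empty:
                break
        if batch:
            results.extend(run_batch(batch))  # amortized cost per request
    return results
```

The deadline is the knob behind the latency/throughput trade-off: a longer `MAX_WAIT_S` yields fuller batches and higher throughput, while each individual request waits slightly longer.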
