Model Serving

The infrastructure and systems that host trained ML models and handle inference requests in production, optimizing for latency, throughput, and cost.

Model serving is the bridge between a trained model and user-facing features. It handles receiving requests, running inference, returning results, and managing the operational concerns of production systems: scaling, load balancing, batching, caching, and failover.

For teams using LLM APIs (OpenAI, Anthropic), model serving is largely handled by the provider. Your engineering focus shifts to API management: request routing between models based on task complexity, response caching for common queries, rate limit management, and fallback chains when primary models are unavailable. A typical production setup routes 70-80% of requests to cheaper models, escalating only complex cases to premium models.
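The routing, caching, and fallback pattern described above can be sketched in a few lines. This is a hypothetical illustration, not a provider SDK: the model names, the `is_complex` heuristic, and `call_model` are all placeholders standing in for real API calls and a real complexity classifier.

```python
import hashlib

CHEAP_MODEL = "small-model"      # placeholder name, not a real model ID
PREMIUM_MODEL = "large-model"    # placeholder name, not a real model ID

cache = {}  # response cache for common queries

def is_complex(prompt: str) -> bool:
    # Naive stand-in for a task-complexity check; real routers use
    # classifiers, token counts, or per-feature configuration
    return len(prompt) > 500 or "step by step" in prompt.lower()

def call_model(model: str, prompt: str) -> str:
    # Stand-in for a real HTTP call to a provider API
    return f"[{model}] response to: {prompt[:30]}"

def serve(prompt: str) -> str:
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in cache:                 # cache hit: skip the API entirely
        return cache[key]
    # Route most traffic to the cheap model, escalate complex cases
    model = PREMIUM_MODEL if is_complex(prompt) else CHEAP_MODEL
    try:
        result = call_model(model, prompt)
    except Exception:
        # Fallback chain: retry on the other model if the primary fails
        fallback = CHEAP_MODEL if model == PREMIUM_MODEL else PREMIUM_MODEL
        result = call_model(fallback, prompt)
    cache[key] = result
    return result
```

In a real deployment the cache would live in a shared store such as Redis, and rate-limit handling would sit between `serve` and the provider call.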

For teams running self-hosted models (fine-tuned models, embedding models, custom classifiers), serving infrastructure matters more. Solutions like vLLM, TGI, and BentoML handle GPU utilization, request batching, and scaling. The key optimization is batching: processing multiple requests together on the GPU dramatically improves throughput and reduces per-request cost, at the expense of slightly higher latency for individual requests.
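The batching trade-off can be made concrete with a minimal sketch: collect requests until the batch is full or a short wait deadline passes, then run them through the model in one call. `run_batch` is a placeholder for a real batched GPU forward pass; serving frameworks like vLLM implement far more sophisticated versions of this loop (e.g. continuous batching) internally.

```python
import time
from queue import Queue, Empty

MAX_BATCH = 8       # largest batch one GPU call will process (illustrative)
MAX_WAIT_S = 0.01   # latency budget: how long a request may wait for peers

def run_batch(prompts):
    # Stand-in for a single batched GPU inference call; the whole point
    # is that one call over N prompts is much cheaper than N calls
    return [f"output for {p}" for p in prompts]

def serve_loop(request_queue: Queue, max_batches: int):
    results = []
    for _ in range(max_batches):
        batch = []
        deadline = time.monotonic() + MAX_WAIT_S
        # Fill the batch until it is full or the wait deadline expires
        while len(batch) < MAX_BATCH:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=timeout))
            except Empty:
                break
        if batch:
            results.extend(run_batch(batch))  # amortized cost per request
    return results
```

The deadline is the knob behind the latency/throughput trade-off: a longer `MAX_WAIT_S` yields fuller batches and higher throughput, while each individual request waits slightly longer.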
