Inference
The process of running a trained AI model on new inputs to generate predictions or outputs, as opposed to training, where the model learns from data. Inference is what happens every time a user interacts with an AI feature.
Inference is the production phase of AI: the model receives an input (a user query, an image, a data point), processes it through its learned weights, and produces an output (a response, a classification, a recommendation). While training happens once or periodically, inference happens millions of times per day in production systems.
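The shape of this production phase can be sketched with a toy model: the weights below stand in for what training produced, and each call runs a new input through them without changing them. The weights and inputs are assumed values for illustration only.

```python
import math

# Hypothetical "learned" weights from a prior training run (assumed values).
WEIGHTS = [0.8, -1.2, 0.5]
BIAS = 0.1

def predict(features):
    """One inference call: run a new input through fixed, learned weights."""
    # Weighted sum of inputs plus bias -- the model's forward pass.
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    # Sigmoid squashes the raw score into a probability-like output.
    return 1 / (1 + math.exp(-z))

# Training happened once, elsewhere; inference happens on every request.
score = predict([1.0, 0.3, 2.0])
print(round(score, 3))  # → 0.823
```

Real models differ only in scale: an LLM's forward pass involves billions of weights, but the structure (fixed weights in, new input through, output out) is the same.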
The economics of inference dominate AI product costs. Training a model is a one-time (or periodic) expense, but inference costs scale linearly with usage. For LLMs, inference costs include compute for processing input tokens, generating output tokens, and the memory required to hold model weights. Optimizing inference through caching, batching, quantization, model routing, and smaller models is critical for sustainable unit economics.
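A back-of-envelope model makes the linear scaling concrete. The per-token prices below are illustrative assumptions, not any provider's real rates; the point is that caching (one of the optimizations above) reduces the slope of the cost line, not just a one-off expense.

```python
# Hypothetical per-token prices (illustrative only, not real provider rates).
INPUT_PRICE_PER_1K = 0.0005   # dollars per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0015  # dollars per 1K output tokens

def request_cost(input_tokens, output_tokens):
    """Cost of one inference call: input-token plus output-token charges."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

def monthly_cost(requests, input_tokens, output_tokens, cache_hit_rate=0.0):
    """Inference cost scales linearly with usage; caching trims the slope."""
    billable = requests * (1 - cache_hit_rate)  # cache hits cost ~nothing
    return billable * request_cost(input_tokens, output_tokens)

base = monthly_cost(1_000_000, 800, 300)
cached = monthly_cost(1_000_000, 800, 300, cache_hit_rate=0.3)
print(f"${base:,.0f} -> ${cached:,.0f} with 30% cache hits")  # $850 -> $595
```

The same structure extends to the other levers: quantization and smaller models lower `request_cost`, while routing sends cheap queries to cheap models so the average per-request cost drops.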
For growth teams, inference is where AI meets the user. Inference latency directly impacts user experience (users expect sub-second responses), inference costs determine your margin per AI interaction, and inference reliability determines your uptime. The key production concerns are latency (how fast each response returns), throughput (how many concurrent requests you can serve), cost (price per prediction), and availability (what happens when the model or API is down).
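Two of those concerns, latency and availability, can be handled in a thin wrapper around the model call. This is a minimal sketch: `call_model` is a hypothetical stand-in for a real model or API client, and a production system would enforce the timeout client-side rather than merely measuring it.

```python
import time

def call_model(prompt):
    """Hypothetical stand-in for a real model or API call."""
    return f"response to: {prompt}"

def infer_with_fallback(prompt, latency_budget_s=1.0,
                        fallback="Sorry, please try again."):
    """Wrap inference with latency measurement and graceful degradation."""
    start = time.monotonic()
    try:
        result = call_model(prompt)
    except Exception:
        # Availability: serve a canned fallback instead of failing the request.
        return fallback, None
    latency = time.monotonic() - start
    if latency > latency_budget_s:
        # Latency budget exceeded: a real system would log and alert here.
        pass
    return result, latency

text, latency = infer_with_fallback("What plan am I on?")
```

The design choice worth noting is that the fallback path returns something usable to the user; availability failures should degrade the experience, not break it.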
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
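The retrieve-then-inject loop can be shown end to end with a deliberately naive sketch: word-overlap retrieval over a toy document list stands in for the embedding search a real RAG system would use, and the documents and prompt wording are invented for illustration.

```python
# Toy document store; a real system would use embeddings + a vector database.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Premium plans include priority support.",
    "Passwords must be at least 12 characters.",
]

def retrieve(query, docs, k=1):
    """Naive retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query):
    """Inject the retrieved context so the LLM answers from it, not memory."""
    context = "\n".join(retrieve(query, DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

Swapping the word-overlap ranker for embedding similarity turns this sketch into the standard RAG pipeline; the prompt-assembly step stays the same.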
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
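"Capturing semantic meaning" cashes out as geometry: related items end up pointing in similar directions, which cosine similarity measures. The 3-dimensional vectors below are toy assumptions (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings (assumed values) standing in for real model output.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.3]

# Semantically related items score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```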
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
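The query side of a vector database reduces to "find the k stored vectors most similar to this one." A brute-force sketch over a toy in-memory index shows the interface; real vector databases get their speed from approximate nearest-neighbor structures (such as HNSW graphs) rather than scanning every vector as this does. All names and vectors here are assumptions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy in-memory "index": id -> embedding (assumed values).
INDEX = {
    "doc_a": [0.9, 0.1, 0.2],
    "doc_b": [0.1, 0.9, 0.3],
    "doc_c": [0.5, 0.5, 0.5],
}

def query(vector, k=2):
    """Brute-force top-k similarity search over the whole index."""
    return heapq.nlargest(k, INDEX, key=lambda doc_id: cosine(vector, INDEX[doc_id]))

print(query([0.85, 0.15, 0.25]))  # → ['doc_a', 'doc_c']
```

Brute force is O(n) per query, which is why indexes trading a little recall for sub-millisecond lookups are the defining feature of these systems.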
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.
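The "iterating" part is easiest to see side by side: a vague first draft versus a revision that pins down role, output format, and allowed values. Both prompts are invented examples of the pattern, not prescribed templates.

```python
# Iteration 1: underspecified -- output format and scope are left to the model.
V1 = "Summarize this support ticket: {ticket}"

# Iteration 2: pins down role, length, format, and an enumerated label set,
# which makes the output reliable enough to parse downstream.
V2 = (
    "You are a support triage assistant.\n"
    "Summarize the ticket below in one sentence, then on a second line "
    "classify its urgency as LOW, MEDIUM, or HIGH.\n\n"
    "Ticket: {ticket}"
)

ticket = "The checkout page crashes every time I enter a coupon code."
print(V2.format(ticket=ticket))
```

The revision constrains the model instead of trusting it, which is the recurring move in prompt engineering.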