Inference
The process of running a trained AI model on new inputs to generate predictions or outputs, as opposed to training, where the model learns from data. Inference is what happens every time a user interacts with an AI feature.
Inference is the production phase of AI: the model receives an input (a user query, an image, a data point), processes it through its learned weights, and produces an output (a response, a classification, a recommendation). While training happens once or periodically, inference happens millions of times per day in production systems.
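The shape of this production phase can be sketched with a toy model: the weights below stand in for what training produced, and each call runs a new input through them without changing them. The weights and inputs are assumed values for illustration only.

```python
import math

# Hypothetical "learned" weights from a prior training run (assumed values).
WEIGHTS = [0.8, -1.2, 0.5]
BIAS = 0.1

def predict(features):
    """One inference call: run a new input through fixed, learned weights."""
    # Weighted sum of inputs plus bias -- the model's forward pass.
    z = sum(w * x for w, x in zip(WEIGHTS, features)) + BIAS
    # Sigmoid squashes the raw score into a probability-like output.
    return 1 / (1 + math.exp(-z))

# Training happened once, elsewhere; inference happens on every request.
score = predict([1.0, 0.3, 2.0])
print(round(score, 3))  # → 0.823
```

Real models differ only in scale: an LLM's forward pass involves billions of weights, but the structure (fixed weights in, new input through, output out) is the same.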
The economics of inference dominate AI product costs. Training a model is a one-time (or periodic) expense, but inference costs scale linearly with usage. For LLMs, inference costs include compute for processing input tokens, generating output tokens, and the memory required to hold model weights. Optimizing inference through caching, batching, quantization, model routing, and smaller models is critical for sustainable unit economics.
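A back-of-envelope model makes the linear scaling concrete. The per-token prices below are illustrative assumptions, not any provider's real rates; the point is that caching (one of the optimizations above) reduces the slope of the cost line, not just a one-off expense.

```python
# Hypothetical per-token prices (illustrative only, not real provider rates).
INPUT_PRICE_PER_1K = 0.0005   # dollars per 1K input tokens
OUTPUT_PRICE_PER_1K = 0.0015  # dollars per 1K output tokens

def request_cost(input_tokens, output_tokens):
    """Cost of one inference call: input-token plus output-token charges."""
    return (input_tokens / 1000) * INPUT_PRICE_PER_1K \
         + (output_tokens / 1000) * OUTPUT_PRICE_PER_1K

def monthly_cost(requests, input_tokens, output_tokens, cache_hit_rate=0.0):
    """Inference cost scales linearly with usage; caching trims the slope."""
    billable = requests * (1 - cache_hit_rate)  # cache hits cost ~nothing
    return billable * request_cost(input_tokens, output_tokens)

base = monthly_cost(1_000_000, 800, 300)
cached = monthly_cost(1_000_000, 800, 300, cache_hit_rate=0.3)
print(f"${base:,.0f} -> ${cached:,.0f} with 30% cache hits")  # $850 -> $595
```

The same structure extends to the other levers: quantization and smaller models lower `request_cost`, while routing sends cheap queries to cheap models so the average per-request cost drops.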
For growth teams, inference is where AI meets the user. Inference latency directly impacts user experience (users expect sub-second responses), inference costs determine your margin per AI interaction, and inference reliability determines your uptime. The key production concerns are latency (how fast each response returns), throughput (how many concurrent requests you can serve), cost (price per prediction), and availability (what happens when the model or API is down).
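Two of those concerns, latency and availability, can be handled in a thin wrapper around the model call. This is a minimal sketch: `call_model` is a hypothetical stand-in for a real model or API client, and a production system would enforce the timeout client-side rather than merely measuring it.

```python
import time

def call_model(prompt):
    """Hypothetical stand-in for a real model or API call."""
    return f"response to: {prompt}"

def infer_with_fallback(prompt, latency_budget_s=1.0,
                        fallback="Sorry, please try again."):
    """Wrap inference with latency measurement and graceful degradation."""
    start = time.monotonic()
    try:
        result = call_model(prompt)
    except Exception:
        # Availability: serve a canned fallback instead of failing the request.
        return fallback, None
    latency = time.monotonic() - start
    if latency > latency_budget_s:
        # Latency budget exceeded: a real system would log and alert here.
        pass
    return result, latency

text, latency = infer_with_fallback("What plan am I on?")
```

The design choice worth noting is that the fallback path returns something usable to the user; availability failures should degrade the experience, not break it.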
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
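The retrieve-then-inject loop can be shown end to end with a deliberately naive sketch: word-overlap retrieval over a toy document list stands in for the embedding search a real RAG system would use, and the documents and prompt wording are invented for illustration.

```python
# Toy document store; a real system would use embeddings + a vector database.
DOCS = [
    "Refunds are processed within 5 business days.",
    "Premium plans include priority support.",
    "Passwords must be at least 12 characters.",
]

def retrieve(query, docs, k=1):
    """Naive retrieval: rank documents by word overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(docs,
                    key=lambda d: len(q_words & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_prompt(query):
    """Inject the retrieved context so the LLM answers from it, not memory."""
    context = "\n".join(retrieve(query, DOCS))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

print(build_prompt("How long do refunds take?"))
```

Swapping the word-overlap ranker for embedding similarity turns this sketch into the standard RAG pipeline; the prompt-assembly step stays the same.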
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
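"Capturing semantic meaning" cashes out as geometry: related items end up pointing in similar directions, which cosine similarity measures. The 3-dimensional vectors below are toy assumptions (real embeddings have hundreds or thousands of dimensions).

```python
import math

def cosine_similarity(a, b):
    """Similarity of two embedding vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-d embeddings (assumed values) standing in for real model output.
cat = [0.9, 0.1, 0.2]
kitten = [0.85, 0.15, 0.25]
invoice = [0.1, 0.9, 0.3]

# Semantically related items score higher than unrelated ones.
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))  # True
```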
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings with sub-millisecond similarity search.
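The query side of a vector database reduces to "find the k stored vectors most similar to this one." A brute-force sketch over a toy in-memory index shows the interface; real vector databases get their speed from approximate nearest-neighbor structures (such as HNSW graphs) rather than scanning every vector as this does. All names and vectors here are assumptions.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy in-memory "index": id -> embedding (assumed values).
INDEX = {
    "doc_a": [0.9, 0.1, 0.2],
    "doc_b": [0.1, 0.9, 0.3],
    "doc_c": [0.5, 0.5, 0.5],
}

def query(vector, k=2):
    """Brute-force top-k similarity search over the whole index."""
    return heapq.nlargest(k, INDEX, key=lambda doc_id: cosine(vector, INDEX[doc_id]))

print(query([0.85, 0.15, 0.25]))  # → ['doc_a', 'doc_c']
```

Brute force is O(n) per query, which is why indexes trading a little recall for sub-millisecond lookups are the defining feature of these systems.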
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.
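The "iterating" part is easiest to see side by side: a vague first draft versus a revision that pins down role, output format, and allowed values. Both prompts are invented examples of the pattern, not prescribed templates.

```python
# Iteration 1: underspecified -- output format and scope are left to the model.
V1 = "Summarize this support ticket: {ticket}"

# Iteration 2: pins down role, length, format, and an enumerated label set,
# which makes the output reliable enough to parse downstream.
V2 = (
    "You are a support triage assistant.\n"
    "Summarize the ticket below in one sentence, then on a second line "
    "classify its urgency as LOW, MEDIUM, or HIGH.\n\n"
    "Ticket: {ticket}"
)

ticket = "The checkout page crashes every time I enter a coupon code."
print(V2.format(ticket=ticket))
```

The revision constrains the model instead of trusting it, which is the recurring move in prompt engineering.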