Knowledge Distillation
A model compression technique where a smaller student model is trained to mimic the outputs of a larger teacher model, preserving most of the teacher's performance at a fraction of the compute cost.
Knowledge distillation transfers the "knowledge" encoded in a large, expensive model into a smaller, cheaper one. The student model is trained to match the teacher model's output probabilities (soft labels), often alongside the original hard labels; the soft labels carry richer information than hard labels alone. A teacher that outputs "90% cat, 8% lynx, 2% dog" teaches the student about inter-class relationships that a simple "cat" label does not convey.
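One common way to combine soft and hard labels is a blended loss: a KL-divergence term that pushes the student's (temperature-softened) distribution toward the teacher's, plus a standard cross-entropy term on the true labels. The sketch below, in NumPy for clarity, assumes a classification setting; the `temperature` and `alpha` knobs are conventional hyperparameters, not fixed values.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    """Blend soft-label KL loss (to the teacher) with hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    log_p_student = np.log(softmax(student_logits, temperature))
    # KL(teacher || student), averaged over the batch; the T^2 factor keeps
    # the soft term's gradient magnitude comparable as temperature grows.
    soft = np.mean(np.sum(p_teacher * (np.log(p_teacher) - log_p_student),
                          axis=-1)) * temperature ** 2
    # Ordinary cross-entropy against the ground-truth hard labels.
    log_p_hard = np.log(softmax(student_logits))
    hard = -np.mean(log_p_hard[np.arange(len(labels)), labels])
    return alpha * soft + (1 - alpha) * hard
```

Raising the temperature spreads probability mass onto the "wrong" classes (the 8% lynx, 2% dog signal above), which is exactly the inter-class information the student is meant to absorb.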
The technique enables significant cost savings in production. A distilled model might retain 95% of the teacher's accuracy while being 10x smaller and faster. This is especially valuable for deployment on edge devices, mobile applications, and high-volume inference where every millisecond and dollar matters.
For AI product teams, distillation is a practical strategy for reducing inference costs. You can use a powerful LLM like GPT-4 or Claude to generate high-quality outputs for your specific task, then use those outputs as training data for a smaller, cheaper model. This "LLM-to-small-model" distillation pipeline is increasingly common: use the expensive model to bootstrap quality, then distill to a cost-effective model for production scale.
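The LLM-to-small-model pipeline above can be sketched as a data-collection step: query the teacher model for each task prompt and save the (prompt, completion) pairs as supervised fine-tuning data for the smaller model. In this sketch, `teacher_complete` is a hypothetical stand-in for a real API call (to GPT-4, Claude, etc.), and the JSONL field names are an assumed format, not any provider's exact schema.

```python
import json

def teacher_complete(prompt):
    # Hypothetical placeholder for a call to an expensive teacher LLM's API.
    return f"[high-quality answer to: {prompt}]"

def build_distillation_set(prompts, out_path="distill_train.jsonl"):
    """Collect teacher outputs as supervised pairs for fine-tuning a small model."""
    with open(out_path, "w") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": teacher_complete(prompt)}
            f.write(json.dumps(record) + "\n")
    return out_path
```

The resulting JSONL file can then feed a standard fine-tuning job for the production model; quality filtering of the teacher's outputs before training is usually worth the extra step.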
Related Terms
RAG (Retrieval-Augmented Generation)
A technique that grounds LLM responses in external data by retrieving relevant documents at query time and injecting them into the prompt context.
Embeddings
Dense vector representations of text, images, or other data that capture semantic meaning in a high-dimensional space, enabling similarity search and clustering.
Vector Database
A specialized database optimized for storing, indexing, and querying high-dimensional vector embeddings, typically using approximate nearest-neighbor search to keep similarity queries fast at scale.
LLM (Large Language Model)
A neural network trained on massive text corpora that can generate, understand, and transform natural language for tasks like summarization, classification, and conversation.
Fine-Tuning
The process of further training a pre-trained LLM on a domain-specific dataset to specialize its behavior, style, or knowledge for a particular task.
Prompt Engineering
The practice of designing and iterating on LLM input instructions to reliably produce desired outputs for a specific task.