
Knowledge Distillation

A model compression technique where a smaller student model is trained to mimic the outputs of a larger teacher model, preserving most of the teacher's performance at a fraction of the compute cost.

Knowledge distillation transfers the "knowledge" encoded in a large, expensive model into a smaller, cheaper one. The student is trained on the teacher model's output probabilities (soft labels), often combined with the original hard labels, because the soft labels carry richer information than hard labels alone. A teacher that outputs "90% cat, 8% lynx, 2% dog" teaches the student about inter-class relationships that a bare "cat" label does not convey.
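The soft-label objective can be sketched as a KL divergence between temperature-softened teacher and student distributions, following Hinton et al.'s formulation. This is a minimal NumPy sketch, not a production training loop; the logits and the temperature value are illustrative assumptions.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperature yields softer distributions."""
    z = np.asarray(logits, dtype=float) / temperature
    z -= z.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from softened teacher to softened student distribution,
    scaled by T^2 so its gradient magnitude matches a hard-label loss."""
    p = softmax(teacher_logits, temperature)  # soft targets from the teacher
    q = softmax(student_logits, temperature)  # student's current predictions
    kl = np.sum(p * (np.log(p) - np.log(q)))
    return temperature ** 2 * kl

# Illustrative logits over three classes: cat, lynx, dog.
# The teacher is confident in "cat" but spreads some mass to the related "lynx".
teacher_logits = [4.5, 2.0, 0.6]
student_logits = [3.0, 1.0, 1.0]
loss = distillation_loss(teacher_logits, student_logits)
```

In practice this soft-label term is usually added to the ordinary cross-entropy on hard labels, with a weight balancing the two; the loss goes to zero only when the student's softened distribution matches the teacher's.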

The technique enables significant cost savings in production. A distilled model might retain 95% of the teacher's accuracy while being 10x smaller and faster. This is especially valuable for deployment on edge devices, mobile applications, and high-volume inference where every millisecond and dollar matters.

For AI product teams, distillation is a practical strategy for reducing inference costs. You can use a powerful LLM like GPT-4 or Claude to generate high-quality outputs for your specific task, then use those outputs as training data for a smaller, cheaper model. This "LLM-to-small-model" distillation pipeline is increasingly common: use the expensive model to bootstrap quality, then distill to a cost-effective model for production scale.
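The bootstrap-then-distill pipeline can be sketched in a few lines: call the large model on your task's prompts, then save the (prompt, completion) pairs as fine-tuning data for the smaller model. The `teacher_generate` stub and the JSONL record shape below are illustrative assumptions; a real pipeline would call the provider's API and match the fine-tuning format your training stack expects.

```python
import json

def teacher_generate(prompt):
    """Stand-in for a call to a large hosted model (e.g. GPT-4 or Claude).
    Hypothetical placeholder: the real version would call the provider's API."""
    return f"High-quality answer to: {prompt}"

def build_distillation_dataset(prompts, path="distill_train.jsonl"):
    """Collect (prompt, teacher completion) pairs as JSONL training data
    for fine-tuning a smaller, cheaper student model."""
    with open(path, "w") as f:
        for prompt in prompts:
            record = {"prompt": prompt, "completion": teacher_generate(prompt)}
            f.write(json.dumps(record) + "\n")
    return path

# Illustrative task prompts; in practice these come from your production traffic.
prompts = ["Summarize the ticket below.", "Classify this support email."]
dataset_path = build_distillation_dataset(prompts)
```

The resulting JSONL file then feeds a standard fine-tuning job for the student model, which serves production traffic at a fraction of the teacher's per-request cost.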

Related Terms