Distillation

Also known as: knowledge distillation, model distillation, teacher-student training

Training a smaller 'student' model to mimic a larger 'teacher' model, getting most of the teacher's quality at a fraction of the size and cost.

What it means

In distillation you take a big, expensive, smart "teacher" model (say Claude Opus 4.5 or GPT-5) and use its outputs to train a much smaller "student" model. The student either learns to match the teacher's exact next-token probability distributions (classical distillation) or simply trains on millions of teacher-generated example responses (response distillation, the dominant flavor in 2026). The student ends up punching far above its parameter count because it's compressing a smarter model's behavior rather than learning from scratch.

This is how the modern small-model tier exists. Claude Haiku, GPT-5 mini, Gemini Flash, Llama 4 8B Instruct: all of them are distilled, in some way, from larger sibling models. They run cheaper and faster than the flagship while capturing 70-90% of the quality on most tasks. For high-volume applications (chat assistants, classification, summarization at scale), distilled small models are usually the right answer over calling the big model directly.

Distillation also has a quietly transformative second-order effect: it lets a lab amortize a $100M pre-training run across a whole product family. You pre-train one giant base, post-train it into the flagship, then distill that flagship into mini, nano, and edge variants. The same investment ships five products. It also creates legal questions when the teacher belongs to someone else; DeepSeek and others have been credibly accused of distilling from closed frontier APIs, which is one of the live terms-of-service battlegrounds in 2026.
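To make the classical flavor concrete, here is a minimal sketch of the standard Hinton-style distillation loss in PyTorch. It assumes you already have aligned teacher and student logits for a batch of token positions; the function name, temperature, and mixing weight are illustrative choices, not any lab's actual recipe.

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Both logit tensors are (num_tokens, vocab_size); labels is (num_tokens,).
    # Soften both distributions so the teacher's "dark knowledge" (the relative
    # probabilities it assigns to wrong-but-plausible tokens) survives transfer.
    soft_student = F.log_softmax(student_logits / temperature, dim=-1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # KL divergence pulls the student's next-token distribution toward the
    # teacher's; the T^2 factor keeps gradient scale comparable across T.
    soft_loss = F.kl_div(soft_student, soft_teacher,
                         reduction="batchmean") * temperature ** 2
    # Ordinary cross-entropy against the ground-truth next tokens.
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1 - alpha) * hard_loss
```

Response distillation drops the KL term entirely: the student never sees teacher logits, only teacher-written text, so training reduces to ordinary cross-entropy (supervised fine-tuning) on those completions. That constraint is exactly why it dominates when the teacher sits behind an API that returns text but not probabilities.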

Example

Claude Haiku 4.5 is roughly 10x smaller than Claude Opus 4.5 but matches Opus 3 quality on most benchmarks — the result of distillation from the Opus-class teacher plus heavy post-training. Same story for GPT-5 mini and Gemini 3 Flash.

Why it matters

Distillation is the reason 'small model' no longer means 'dumb model'. Most production traffic in 2026 hits distilled models, not flagships, because they're 10-50x cheaper for nearly the same quality on bounded tasks. Understanding which model is distilled from which (and how aggressively) explains a lot of the price/performance map.
