Synthetic data

Also known as: model-generated data, AI-generated training data, self-distilled data

Training data generated by AI models rather than collected from humans — increasingly the dominant data source in modern post-training.

What it means

Synthetic data is anything in the training set that wasn't originally written by a person: model-generated question-answer pairs, model-rewritten documents, model-graded examples, fully fabricated conversations, code completions sampled from a teacher model. By 2026 it's not unusual for post-training datasets to be 70-95% synthetic, with humans involved mostly in writing prompts, designing pipelines, and spot-checking quality.

The reason is simple economics. Humans are slow and expensive at producing aligned, high-quality training examples; models are fast and cheap. If GPT-5 can generate ten thousand strong reasoning traces overnight for a few hundred dollars, human contractors cannot compete on volume. This is how we got the explosion in math and code performance from 2024 onward: most of those gains came from training on vast pools of synthetic, model-graded reasoning traces.

The chicken-and-egg question is real. If models learn from data that other models generated, you can get model collapse, where quality degrades over generations as errors and biases compound. The current consensus is that synthetic data is fine, even excellent, when (a) the teacher is genuinely better than the student at the task, (b) outputs are filtered or graded by a reliable signal, and (c) some grounded human or real-world data is mixed in. Pure self-generation in a closed loop is where things go wrong.
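
To make conditions (a)-(c) concrete, here is a minimal, self-contained Python sketch of such a pipeline on a toy arithmetic task. Everything in it is illustrative: the noisy `teacher_generate`, the `grade` checker, and the roughly 10% human mix are assumptions for demonstration, not any lab's actual pipeline.

```python
import random

random.seed(0)

def make_prompt() -> tuple[str, int]:
    """Toy task: two-digit addition. Returns (question, reference answer)."""
    a, b = random.randint(10, 99), random.randint(10, 99)
    return f"What is {a} + {b}?", a + b

def teacher_generate(question: str) -> str:
    """Stand-in for sampling a stronger teacher model (condition a).
    Here: an arithmetic solver that is deliberately wrong ~10% of the time."""
    a, b = [int(tok) for tok in question.rstrip("?").split() if tok.isdigit()]
    answer = a + b
    if random.random() < 0.1:            # simulate occasional teacher errors
        answer += random.choice([-1, 1])
    return str(answer)

def grade(candidate: str, reference: int) -> bool:
    """Reliable automated checker (condition b): verify before keeping."""
    return candidate.strip() == str(reference)

# Generate many candidates, keep only the graded-correct ones.
synthetic = []
for _ in range(1000):
    question, reference = make_prompt()
    output = teacher_generate(question)
    if grade(output, reference):
        synthetic.append({"prompt": question, "response": output, "source": "synthetic"})

# Condition (c): mix grounded human-written examples back in so the loop
# never becomes fully self-referential.
human = [{"prompt": "What is 12 + 34?", "response": "46", "source": "human"}]
dataset = synthetic + human * (len(synthetic) // 10)   # ~10% human data
random.shuffle(dataset)

print(f"kept {len(synthetic)} of 1000 synthetic; "
      f"mixed in {len(dataset) - len(synthetic)} human examples")
```

The key design point the sketch encodes is that the filter sits between generation and training: unverified teacher outputs never enter the dataset, and the human slice keeps the distribution anchored to real data.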

Example

When DeepSeek-R1 was post-trained for reasoning, the team generated millions of chain-of-thought traces with the model itself, kept the ones that arrived at correct answers (graded by automated checkers), and trained on those. Almost no human-written reasoning was involved.
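
In code, that keep-if-correct filter follows the general rejection-sampling pattern sketched below. This is a hedged illustration, not DeepSeek's actual pipeline: the \boxed{...} answer convention and every function name here are assumptions.

```python
import re

def extract_final_answer(trace: str) -> str | None:
    """Pull the final answer out of a reasoning trace, e.g. '... \\boxed{42}'.
    The \\boxed{} convention is an assumed answer format for illustration."""
    match = re.search(r"\\boxed\{([^}]*)\}", trace)
    return match.group(1).strip() if match else None

def keep_correct(traces: list[str], reference: str) -> list[str]:
    """Automated checker: keep only traces whose final answer matches."""
    return [t for t in traces if extract_final_answer(t) == reference]

# Toy usage: two sampled traces for one problem, one right and one wrong.
traces = [
    "Add the units digits first ... so the answer is \\boxed{4}",
    "Doubling and adjusting ... gives \\boxed{5}",
]
print(keep_correct(traces, "4"))   # only the \boxed{4} trace survives
```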

Why it matters

Synthetic data has quietly become the foundation of modern post-training. Almost every quality jump in reasoning, coding, and tool use since 2024 was synthetic-data-driven. It's also the heart of ongoing copyright and provenance debates — once a model has been trained on data generated by another model, untangling who 'owns' what is essentially impossible.
