Pre-training
Also known as: base model training, foundation training
The first and most expensive training phase, where a model learns general language and world knowledge from trillions of tokens of text.
What it means
Pre-training is where the raw model is born. You take a randomly initialized neural network and feed it trillions of tokens of text — books, code, web pages, papers, transcripts — and have it predict the next token over and over until it has internalized grammar, facts, reasoning patterns, and the statistical shape of human language. Everything the model "knows" about the world traces back to this phase.
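To make the objective concrete, here is a minimal sketch of a single pre-training step, written in PyTorch (an assumption; the entry names no framework). Everything is a toy placeholder: each sequence is shifted by one position and the network is trained with cross-entropy to predict every next token. A real run repeats exactly this loop over trillions of real tokens with a vastly larger causal transformer sharded across thousands of GPUs.

import torch
import torch.nn as nn

class TinyCausalLM(nn.Module):
    # Stand-in for a frontier transformer: embeddings -> causal attention -> logits.
    def __init__(self, vocab_size=32_000, d_model=256, n_layers=2, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, ids):
        # Causal mask: position t may only attend to positions <= t.
        mask = nn.Transformer.generate_square_subsequent_mask(ids.size(1))
        return self.lm_head(self.blocks(self.embed(ids), mask=mask))

model = TinyCausalLM()
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)

tokens = torch.randint(0, 32_000, (8, 128))      # fake batch: 8 sequences of 128 token ids
inputs, targets = tokens[:, :-1], tokens[:, 1:]  # predict token t+1 from tokens up to t

logits = model(inputs)                           # (batch, seq_len - 1, vocab_size)
loss = nn.functional.cross_entropy(logits.reshape(-1, logits.size(-1)),
                                   targets.reshape(-1))
loss.backward()
optimizer.step()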
It is also the most expensive single thing AI labs do. For models in the GPT-5, Claude Opus 4.5, or Gemini 3 Ultra class, a frontier pre-training run costs $100M+ in compute alone, occupies tens of thousands of GPUs for months, and consumes enough electricity to power a small city. This is why only a handful of labs in the world pre-train frontier models from scratch; most teams start from someone else's pre-trained checkpoint.
Pre-training produces a "base model": fluent but raw. It will happily continue any text in any direction, with no concept of being a helpful assistant, no refusals, no "user vs assistant" turns. To turn that base model into something like ChatGPT or Claude, you need post-training (SFT, RLHF/DPO, safety tuning). Pre-training sets the ceiling on what the model can know; post-training shapes how it behaves.
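The gap is easy to see by giving the same question to a base checkpoint and its post-trained sibling. A minimal sketch using the Hugging Face transformers API; the Llama 3.1 checkpoint names are illustrative stand-ins, not checkpoints prescribed by this entry.

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "What is the capital of France?"

# Illustrative pair; any base/instruct siblings show the same contrast.
for name in ["meta-llama/Llama-3.1-8B",            # base model: raw continuation
             "meta-llama/Llama-3.1-8B-Instruct"]:  # post-trained assistant
    tokenizer = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=40)
    print(name, "->", tokenizer.decode(output[0], skip_special_tokens=True))

The base model tends to treat the question as text to continue, for instance by emitting more quiz-style questions, while the instruct checkpoint answers it directly. In real use you would also wrap the instruct prompt with tokenizer.apply_chat_template so the model sees the user/assistant turn structure it was post-trained on.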
Example
Meta's Llama 3 base models were pre-trained on roughly 15 trillion tokens of mixed web, code, and book data before any instruction tuning. Those base checkpoints are what every Llama 3 fine-tune in the open ecosystem starts from.
Why it matters
Pre-training determines a model's raw knowledge cutoff, language coverage, and reasoning ceiling. Fine-tuning is a poor way to teach a model facts that were absent from its pre-training data; it mostly shapes behavior. When people talk about a model being "smart" versus "well-behaved", smartness mostly comes from this stage.