How large language models work
The mental model that fixes most prompting confusion — prediction, training, inference, why hallucinations happen, why prompt phrasing matters so much. For the operator who wants to understand the mechanism.
A large language model (LLM) is a statistical pattern-recognizer trained on text. It predicts the next token in a sequence based on what came before. Everything else — chat, code generation, summarization, reasoning — is built on top of that one mechanism.
That sentence sounds reductive. It isn't. The mental model "it's predicting the next word" is the one that explains almost every behavior you'll see: why models hallucinate, why prompt phrasing matters, why context windows are finite, why the same model can be brilliant and stupid in adjacent prompts.
This guide is for the operator who wants to understand the mechanism — not to build a model from scratch, but to stop being surprised by what one does.
What "large" means
LLMs are large in three different senses:
- Parameters (the model's internal weights): GPT-4-class models are reported to have hundreds of billions to over a trillion, though labs rarely publish exact counts. Smaller useful models start in the low billions.
- Training data (the text the model has seen): typically trillions of tokens, scraped and filtered from the public web, books, code repositories, and licensed sources.
- Compute (what it cost to train): training the largest models is estimated to cost $50M–$200M+ in compute. This is why there are only a handful of frontier labs.
Each axis matters. A model with more parameters but worse training data isn't necessarily smarter. The interplay is what determines quality.
The training process, in one paragraph
The model is shown an enormous amount of text. For each chunk of text, it's asked to predict what comes next. When it predicts wrong, its internal weights are nudged in a direction that would have produced the right answer. Repeated trillions of times across the training data, this turns the model into something that, given any sequence of text, can produce a plausible continuation. That's the entire training objective.
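Here is that loop as a minimal sketch. It is a toy, not how frontier models are built: the corpus is made up, each word counts as one token, and a table of scores (`logits`, a hypothetical name) stands in for the neural network. The mechanics are the same in spirit: predict, compare with what actually came next, nudge the weights.

```python
import math

# Toy corpus; each whitespace-separated word stands in for a token.
corpus = "the cat sat on the mat . the cat ate the fish .".split()
vocab = sorted(set(corpus))
index = {word: i for i, word in enumerate(vocab)}

# "Weights": one score per (previous token, next token) pair, all starting at 0.
logits = [[0.0] * len(vocab) for _ in vocab]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train(epochs=200, learning_rate=0.1):
    """Predict each next token, then nudge the scores toward what actually followed."""
    for _ in range(epochs):
        for prev, nxt in zip(corpus, corpus[1:]):
            row = logits[index[prev]]
            probs = softmax(row)
            for j in range(len(vocab)):
                target = 1.0 if j == index[nxt] else 0.0
                # Gradient of the cross-entropy loss with respect to each score.
                row[j] -= learning_rate * (probs[j] - target)


def next_token_probs(prev):
    """Probability of every vocabulary token following `prev`."""
    return dict(zip(vocab, softmax(logits[index[prev]])))


before = next_token_probs("the")["cat"]
train()
after = next_token_probs("the")["cat"]
print(f"P(cat | the) before training: {before:.3f}, after: {after:.3f}")
```

In this corpus "the" is followed by "cat" in two of four cases, so the trained probability should settle near one half. Nothing in the code stores a rule about cats. The scores were only nudged, repeatedly, toward what the text actually did.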
After this base training, frontier labs do additional rounds:
- Supervised fine-tuning: humans demonstrate good responses to instructions, and the model is trained to imitate them
- Reinforcement learning from human feedback (RLHF): humans rank multiple model responses, and the model is trained to produce responses humans would prefer
- Safety training: the model is trained to refuse certain categories of requests and to flag uncertainty
This combined process is what turns a base model (which can autocomplete anything) into a chat assistant (which follows instructions, has a personality, and refuses to help you commit a crime).
How inference works (what happens when you send a prompt)
You type something. The model sees your text as a sequence of tokens, scores every token that could come next, and one is selected as output. Then it sees your text plus that one token, and outputs the next. Then it sees your text plus those two tokens, and outputs the third. And so on, until it produces a stop signal or hits a length limit.
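The loop is short enough to sketch. In this illustrative version, `TOY_MODEL` and `predict_next` are hypothetical stand-ins: a lookup table plays the role of the trained network, and it only looks at the last token, where a real model uses the whole sequence.

```python
# Hypothetical stand-in for a trained model: maps the last token to the next one.
TOY_MODEL = {
    "the": "cat",
    "cat": "sat",
    "sat": "down",
    "down": "<stop>",
}


def predict_next(tokens):
    """Return the next token given everything generated so far."""
    return TOY_MODEL.get(tokens[-1], "<stop>")


def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):   # length limit
        token = predict_next(tokens)  # the model sees the prompt plus its own output so far
        if token == "<stop>":         # stop signal
            break
        tokens.append(token)          # the new token becomes part of the next input
    return tokens


print(generate(["the"]))                    # ['the', 'cat', 'sat', 'down']
print(generate(["the"], max_new_tokens=2))  # ['the', 'cat', 'sat']
```

The first call ends on the stop signal, the second on the length limit. Either way, each token is produced by one more pass over everything that came before it.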
The model has no memory of its own, either between sessions or between messages. Every conversation starts fresh. What looks like memory within a conversation is the entire chat history being re-sent with every message. That is also why long conversations cost more (more tokens) and eventually degrade (the context window has limits, covered separately in Context windows explained).
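A rough sketch of why re-sending the history gets expensive. The assumptions here are deliberately crude and labeled as such: `count_tokens` is a hypothetical word-count stand-in for a real tokenizer, and every reply is assumed to be a fixed 50 tokens.

```python
def count_tokens(text):
    """Crude stand-in for a tokenizer: one token per whitespace-separated word."""
    return len(text.split())


def input_tokens_per_turn(user_messages, reply_tokens=50):
    """Input tokens sent on each turn when the full history is re-sent every time."""
    history = 0    # tokens accumulated from earlier turns
    per_turn = []
    for message in user_messages:
        history += count_tokens(message)
        per_turn.append(history)   # the whole history is the input for this turn
        history += reply_tokens    # the model's reply joins the history
    return per_turn


turns = ["please summarize this report for me"] * 4
sent = input_tokens_per_turn(turns)
print(sent)       # [6, 62, 118, 174]
print(sum(sent))  # 360
```

Four identical six-word messages add up to 360 input tokens, not 24, because each turn carries everything before it.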
Why models hallucinate
Hallucination isn't a bug. It's the same mechanism that produces useful output, applied to questions where the model lacks the data.
The model is always predicting "what's the most plausible continuation." When it knows the answer (because the answer appeared often in training), the most plausible continuation is the correct one. When it doesn't know — because the question is obscure, or recent, or about a private topic — the most plausible continuation is something that sounds like the kind of answer you'd expect, even if no such answer exists.
The model has no built-in way to say "I don't know." It can be trained to say so, and modern frontier models do this much more reliably than older ones, but the underlying mechanism is still "predict the most plausible next token." Confident-sounding wrong answers are a feature of how the model works, not a glitch you can prompt away.
This is why verification matters. See How to verify AI output before you trust it for the practical checklist.
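To make the hallucination mechanism concrete, here is a small sketch with made-up numbers. The candidate years and their scores are invented for illustration, and `pick` always takes the top-scoring option, where real systems usually sample. The point is only that the selection step returns an answer whether the scores are decisive or nearly flat.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pick(candidates, scores):
    """Return the most probable candidate and its probability."""
    probs = softmax(scores)
    best = max(range(len(candidates)), key=lambda i: probs[i])
    return candidates[best], probs[best]


years = ["1987", "1991", "1994", "2003"]

# Well-represented fact: one candidate scores far above the rest.
known = pick(years, [9.0, 1.0, 1.0, 1.0])
# Obscure question: the scores are nearly flat, yet a year is still returned.
unknown = pick(years, [1.1, 1.0, 1.0, 1.0])

print(known)
print(unknown)
```

Both calls return "1987". In the first the model is close to certain, in the second it is barely above a coin toss between four options, and the output text looks identical. The uncertainty lives in the probabilities, which the reader of the answer never sees.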
Why prompt phrasing matters so much
If the model is predicting the most plausible continuation, then the prompt is everything that comes before. Changing the prompt changes what continuations are plausible.
A prompt that begins "You are a senior engineer reviewing code for security issues..." biases the model toward outputs that look like senior engineering security reviews. A prompt that just says "review this code" biases toward generic code-review language. Same model, very different output, because the conditioning shifts what's most plausible next.
This isn't magic. It's the same pattern-matching, applied to the surrounding context you provided.
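A toy illustration of conditioning, under loud assumptions: `EXAMPLES` is a made-up handful of prompt and response-opening pairs standing in for training text, and `continuation_probs` estimates probabilities by counting exact matches. A real model generalizes across wording instead of matching strings, but the dependence on the prompt is the same idea.

```python
from collections import Counter

# Made-up (prompt, opening words of the response) pairs standing in for training text.
EXAMPLES = [
    ("senior engineer security review", "injection risk"),
    ("senior engineer security review", "injection risk"),
    ("senior engineer security review", "missing validation"),
    ("review this code", "looks fine"),
    ("review this code", "looks fine"),
    ("review this code", "injection risk"),
]


def continuation_probs(prompt):
    """Estimate P(continuation | prompt) by counting matching examples."""
    counts = Counter(cont for p, cont in EXAMPLES if p == prompt)
    total = sum(counts.values())
    return {cont: n / total for cont, n in counts.items()}


print(continuation_probs("senior engineer security review"))
print(continuation_probs("review this code"))
```

With the security framing, "looks fine" is not among the continuations at all. With the generic prompt it is the most likely one. Nothing about the data changed, only what it was conditioned on.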
What models can and can't do well
Some things models are reliably good at:
- Anything that requires summarizing, restructuring, or transforming text
- Generating text that matches a pattern (code, emails, marketing copy, fiction)
- Answering factual questions where the answer is well-represented in training data
- Reasoning through multi-step problems within their context window
- Translating between languages
- Explaining concepts at different levels of complexity
Some things they're unreliable at:
- Anything requiring up-to-date information beyond their training cutoff (without tools)
- Arithmetic, especially with large numbers (they're language models, not calculators)
- Counting (for example, how many words or letters a passage contains)
- Adherence to strict format rules without examples
- Tasks requiring true randomness (their "random" is biased toward training patterns)
- Knowledge of private or proprietary data they were never trained on
The specifics shift with every model generation. The broad pattern of what's reliable and what isn't changes far more slowly.
Why two models give different answers to the same question
Different training data, different fine-tuning, different alignment processes. A question that has a clear ground-truth answer should produce similar (though not identical) responses across frontier models. A question that's ambiguous, or that depends on judgment, will surface differences in how each model was tuned.
Different models also have different personalities. Some are more cautious, some more direct. Some hedge more, some commit more. This is downstream of choices labs made during training. See Which AI should I use? for the practical comparison.
What "reasoning" means in modern models
Newer frontier models can be configured to spend extra compute "thinking" before answering. What's happening under the hood: the model is generating internal reasoning tokens, which are often hidden from the user or shown only as a summary, then generating the final answer conditioned on its own reasoning.
This works because LLMs are better at solving multi-step problems when they have space to work through them step by step — the same way a human is better at math when allowed to write out intermediate steps rather than answering in their head.
Extended thinking helps on complex reasoning, code generation, and analysis. It doesn't help on simple lookups, and it costs more (more tokens generated). Treat it as a knob to turn for hard problems, not a default setting.
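The cost side of that knob is simple arithmetic. This sketch assumes thinking tokens are billed at the output rate, and the prices are hypothetical placeholders per million tokens, not any provider's actual rates.

```python
def request_cost(input_tokens, output_tokens, thinking_tokens=0,
                 input_price=3.0, output_price=15.0):
    """Cost in dollars. Prices are hypothetical, in dollars per million tokens."""
    # Assumption: reasoning tokens are billed like any other generated token.
    billed_output = output_tokens + thinking_tokens
    return (input_tokens * input_price + billed_output * output_price) / 1_000_000


plain = request_cost(1000, 500)
with_thinking = request_cost(1000, 500, thinking_tokens=4000)
print(f"{plain:.4f}")          # 0.0105
print(f"{with_thinking:.4f}")  # 0.0705
```

Same prompt, same visible answer length, and the request costs several times more once 4,000 reasoning tokens are added. That is worth paying on a hard problem and wasted on a lookup.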
What's coming next
The frontier is moving in three directions simultaneously:
- Longer context windows: from 8K → 200K → 1M+ tokens, allowing larger documents and longer conversations to fit
- Better reasoning: models that handle multi-step problems more reliably, especially with extended thinking
- Tool use and agents: models that don't just answer — they take actions, call APIs, read and write files, run code
The mechanism stays the same. The capability surface keeps expanding.
What to read next
- What is a token in AI? The unit that controls cost and output — the atom underneath everything above
- Context windows explained: what they limit and what they don't — why your long conversation eventually breaks
- How to verify AI output before you trust it — the workflow that makes hallucination manageable
- What makes a prompt work — the practical follow-up