Transformers and attention: the architecture under every modern AI model
The intuition behind attention (without the math), why transformers scaled when RNNs did not, encoder vs decoder, why context length is bounded, and what comes after pure attention (Mamba, MoE, hybrids).
A transformer is a neural network architecture that learns by figuring out which parts of an input matter for which parts of an output. The mechanism it uses is called attention. Every major LLM since 2018 — GPT, Claude, Gemini, Llama, DeepSeek — is a transformer. So is most modern vision and audio AI. Understanding the architecture explains a lot about why these models work the way they do.
This guide is for the operator who wants the intuition without the math.
What problem the transformer solved
Before transformers, the dominant architecture for sequence data (text, time series, audio) was the recurrent neural network (RNN), often as an LSTM variant. RNNs processed text one word at a time, carrying forward a hidden state.
Two problems with that:
- Long-range dependencies were weak. By the time the RNN got to word 100, its "memory" of word 1 had faded. Important context kept slipping away.
- Training couldn't parallelize. Each word had to be processed in sequence. On GPUs (which excel at parallel work), this left most of the hardware idle.
The transformer, introduced in the 2017 paper "Attention Is All You Need," fixed both. It processes all words in parallel, and it uses attention to let any word in the sequence directly look at any other word, no matter how far apart.
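To make the two RNN problems concrete, here is a minimal sketch of a toy scalar recurrence. It is not any real RNN or LSTM implementation: the weights `w_h`, `w_x` and the `decay` factor are made-up values chosen for illustration. The loop shows why steps cannot run in parallel (each step needs the previous hidden state), and the second function shows how the first input's contribution shrinks with distance in a simple linear recurrence.

```python
import math


def rnn_final_state(inputs, w_h=0.5, w_x=1.0):
    """Toy scalar RNN. w_h and w_x are illustrative, not learned."""
    h = 0.0
    for x in inputs:
        # Step t cannot start until step t-1 has produced h,
        # so this loop is inherently sequential.
        h = math.tanh(w_h * h + w_x * x)
    return h


def first_input_influence(steps, decay=0.9):
    """In the linear recurrence h_t = decay * h_{t-1} + x_t,
    the contribution of x_1 to h_t is decay ** (t - 1)."""
    return decay ** (steps - 1)


if __name__ == "__main__":
    for t in (1, 10, 100):
        print(f"step {t:>3}: influence of word 1 = {first_input_influence(t):.6f}")
```

With these assumed values, the influence of the first input is already tiny by step 100, which is the "fading memory" described above. Attention avoids this by letting position 100 read position 1 directly.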
What attention does, intuitively
For each word in the input, the model asks: "to understand this word, which other words should I look at, and how much should I weigh each one?"
Take the sentence: "The animal didn't cross the street because it was too tired."
For the word "it," attention learns that "animal" matters a lot (it's what "it" refers to), "street" matters a little (also a possible referent, but less likely from context), and "the" matters almost not at all (no semantic content).
This weighting is learned during training. Nobody told the model what "it" means. Through exposure to billions of examples, the model learned which patterns of attention produce useful predictions.
How the math works (the short version)
For every word in the sequence, the model computes three vectors:
- Query (Q): "what am I looking for?"
- Key (K): "what do I represent?"
- Value (V): "what information do I carry?"
To compute attention from word A to word B: take A's Query and compare it to B's Key (via dot product) to get a relevance score. Do this for every pair of words, then normalize each word's scores with a softmax so they become weights that sum to 1. Each word's new representation is the weighted sum of all the Values, so it is now informed by everything else in the sequence.
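The steps above can be sketched in a few lines of plain Python. This is a minimal, single-head illustration and not a production implementation. It follows the scaled dot-product form from the original paper (dot product, divide by the square root of the key dimension, softmax, weighted sum of Values). The toy `Q`, `K`, and `V` matrices are hand-picked for the example; in a real model they come from learned projections of the token embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Compare this word's Query with every word's Key.
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
            for k in keys
        ]
        weights = softmax(scores)
        # New representation: weighted sum of all Values.
        out = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy inputs: three "words", two dimensions each.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
outputs, weights = attention(Q, K, V)
```

Each row of `weights` says how much one word attends to every word in the sequence, and each row of `outputs` is that word's updated representation.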
In practice, the model does this many times in parallel ("multi-head attention") with different learned Q/K/V projections, so different "heads" can attend to different relationships (syntactic, semantic, positional, etc.).
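A rough sketch of the multi-head idea follows. One loud assumption: real models give each head its own learned Q/K/V projection matrices and apply a learned output projection after concatenation. To stay short, this sketch instead hands each head a slice of the token vector and skips the output projection, so it illustrates only the "several attention computations in parallel, then concatenate" structure.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_rows, k_rows, v_rows):
    """Single-head scaled dot-product attention."""
    d = len(k_rows[0])
    out = []
    for q in q_rows:
        w = softmax(
            [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in k_rows]
        )
        out.append(
            [sum(wi * v[j] for wi, v in zip(w, v_rows))
             for j in range(len(v_rows[0]))]
        )
    return out


def multi_head_attention(x, num_heads):
    """x is a list of token vectors.

    Assumption: slicing stands in for learned per-head projections.
    """
    d_model = len(x[0])
    if d_model % num_heads != 0:
        raise ValueError("d_model must be divisible by num_heads")
    d_head = d_model // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * d_head, (h + 1) * d_head
        sliced = [tok[lo:hi] for tok in x]
        # Each head runs attention independently on its own view.
        head_outputs.append(attend(sliced, sliced, sliced))
    # Concatenate the heads' outputs for each token.
    return [sum((head[t] for head in head_outputs), []) for t in range(len(x))]


# Hypothetical input: three tokens, four dimensions, two heads.
tokens = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
mixed = multi_head_attention(tokens, num_heads=2)
```

Because each head sees a different view of the tokens, each can end up with a different attention pattern over the same sequence.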
You don't need the mathematical detail to use these models. The mental model is: every word can look at every other word, weighted by relevance learned from training data.
Why transformers scale
Three properties that turned out to matter:
1. Parallelism
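Because every word can be processed in parallel (each computing its attention over all others), training saturates GPUs efficiently. This is how labs train models on trillions of tokens. Generation is a partial exception: the prompt is processed in parallel, but output tokens are still produced one at a time.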
2. Scaling laws
Researchers found empirically that transformer performance follows predictable scaling laws: bigger models trained on more data with more compute keep improving in predictable ways. There's no clean cliff where you stop benefiting from more scale. This is a large part of why frontier-model training budgets have grown by orders of magnitude in recent years.
3. Transfer learning
A transformer trained on one task (predicting the next word in any text) turned out to learn representations useful for almost any other text task. This is the foundation of modern AI: train one giant model on everything, then fine-tune it for specific applications.
Why "context length" exists as a constraint
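Attention has a quadratic cost in sequence length, because every token is compared with every other token. Doubling the context window roughly quadruples the compute the attention step requires, and in a straightforward implementation the memory as well. This is the technical reason context windows are finite, and why a 1M-token window is much harder than a 100K window.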
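The arithmetic is easy to check. The sketch below counts only the query-key comparisons for one attention head in one layer, which is a simplification: real cost also depends on the number of layers and heads, the vector size, and implementation tricks. The function name is made up for this example.

```python
def attention_score_count(seq_len):
    """Query-key pairs scored by full self-attention (one head, one layer)."""
    return seq_len * seq_len


if __name__ == "__main__":
    for n in (1_000, 2_000, 100_000, 1_000_000):
        print(f"{n:>9,} tokens -> {attention_score_count(n):,} pairwise scores")
```

Going from 1,000 to 2,000 tokens multiplies the count by 4, and going from 100K to 1M tokens multiplies it by 100, even though the window is only 10 times longer.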
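Newer approaches (sparse attention, linear attention variants, state space models) try to break this quadratic cost. Some work in production, but the basic transformer attention is still the dominant pattern. Mixture of Experts, often mentioned alongside these, addresses a different cost: how many parameters are active per token, not how attention scales with length.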
Encoder, decoder, and the variants you've heard of
The original transformer had two parts:
- Encoder: reads the input, produces representations
- Decoder: generates output one token at a time, attending to both the encoder output and what it's generated so far
Different model families use different combinations:
- Encoder-only (BERT, RoBERTa): great for classification, semantic search, embeddings — anything where you need to understand text but not generate it
- Decoder-only (GPT, Claude, Llama, Gemini): great for generation — chat, completion, code
- Encoder-decoder (T5, the original transformer): good for translation, summarization, structured transformation
Modern frontier chat models are decoder-only. They generate by attending to all prior tokens (the prompt + everything generated so far) and predicting the next one.
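Two pieces of that description can be sketched in code: the causal rule (a token only attends to itself and earlier tokens) and the generation loop (predict one token, append it, repeat). This is an illustrative sketch, not how any particular model is implemented. `predict_next` is a hypothetical stand-in for the whole model, and the raw `scores` are assumed to be given.

```python
import math


def causal_attention_weights(scores):
    """scores[i][j] is the raw relevance of token j to token i.

    Returns weights where token i attends only to positions j <= i;
    later positions get exactly zero weight.
    """
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]  # hide "future" tokens
        m = max(visible)
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


def generate(prompt, predict_next, max_new_tokens):
    """Autoregressive loop: each step sees the prompt plus all prior output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(predict_next(tokens))
    return tokens


if __name__ == "__main__":
    uniform = [[0.0] * 3 for _ in range(3)]
    for row in causal_attention_weights(uniform):
        print(row)
    # Toy "model": the next token is the last token plus one.
    print(generate([1, 2], lambda toks: toks[-1] + 1, max_new_tokens=3))
```

The loop is why generation is sequential even though training is parallel: each new token depends on the ones generated before it.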
What attention can't easily do
Pure attention is set-based — it doesn't know the order of input tokens unless you tell it. Real transformers add positional encoding (extra information that says "this word is at position 5") so order isn't lost.
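As one concrete example, the original transformer paper used fixed sinusoidal positional encodings: each position gets a vector of sines and cosines at different frequencies, which is added to the token's embedding. The sketch below follows that formula. Treat it as an illustration only, since many current models use learned or rotary position schemes instead; the helper name `add_position` and the toy embedding are made up for this example.

```python
import math


def positional_encoding(position, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd ones."""
    pe = []
    for i in range(d_model):
        angle = position / (10000 ** ((2 * (i // 2)) / d_model))
        pe.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return pe


def add_position(token_vectors):
    """Add each position's encoding to the token vector at that position."""
    d_model = len(token_vectors[0])
    return [
        [x + p for x, p in zip(vec, positional_encoding(pos, d_model))]
        for pos, vec in enumerate(token_vectors)
    ]


# Hypothetical embedding for the same word appearing twice in a row.
same_word = [0.2, 0.4, 0.6, 0.8]
with_position = add_position([same_word, same_word])
```

After this step the two copies of the same word have different vectors, so attention can tell the first occurrence from the second.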
This is a small detail but it explains why some old prompt-engineering tricks (like rearranging the order of examples) work: changing position changes how attention combines them.
Attention also struggles with:
- Very long-range dependencies in practice: even when the architecture allows it, the model often pays less attention to material in the middle of a long context (the "lost in the middle" problem covered in the context windows guide)
- Strict arithmetic and counting: nothing in the attention mechanism is doing exact computation
- Generalizing far outside the training distribution: attention learns patterns; if you ask about something not represented in training, it pattern-matches to something else
What's coming after pure transformers
Active research directions:
- State space models and related linear recurrent architectures (Mamba, RWKV): alternatives that scale linearly with sequence length rather than quadratically. They have shown promise on long-context tasks.
- Mixture of Experts (MoE): already used in many frontier models. Different parts of the model activate for different inputs, allowing total parameter count to grow without a proportional increase in inference cost.
- Hybrid architectures: combining attention with other mechanisms (recurrent, convolutional) for specific tradeoffs.
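The Mixture of Experts idea from the list above can be sketched as top-k routing: a router scores every expert for a given token, only the best k experts run, and their outputs are blended by normalized score. This is a simplified illustration under stated assumptions. Real routers are learned layers and real experts are feed-forward networks; here the router scores are passed in directly and the experts are made-up scalar functions.

```python
import math


def top_k_route(router_scores, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their scores."""
    top = sorted(
        range(len(router_scores)),
        key=lambda i: router_scores[i],
        reverse=True,
    )[:k]
    m = max(router_scores[i] for i in top)
    exps = {i: math.exp(router_scores[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_scores, k=2):
    """Run only the selected experts and blend their outputs."""
    gates = top_k_route(router_scores, k)
    return sum(gate * experts[i](x) for i, gate in gates.items())


# Hypothetical experts: four tiny functions standing in for expert networks.
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: 4 * x]
scores = [0.1, 2.0, 0.3, 2.0]  # assumed router output for one token
result = moe_layer(1.0, experts, scores, k=2)
```

With four experts and `k=2`, only half of the expert parameters do any work for this token, which is how total parameter count can grow without inference cost growing in proportion.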
For practitioners: the transformer + attention pattern is likely to remain the dominant architecture for years. So far, the newer variants look like evolutions, not replacements.
Why this matters when you're using AI
You don't need to know the architecture to use it. But understanding attention explains a few things you'll see in practice:
- Long inputs degrade attention: a fact in the middle of 100K tokens is more likely to be missed than the same fact at the start or end
- Order of examples matters: in few-shot prompts, examples at the end often carry more weight
- Format consistency helps: when your prompt format matches patterns the model has seen in training, attention "finds" the relevant pattern faster
- Repetition can backfire: stating the same thing 5 times can mean the model averages across them rather than treating each as a strong signal
These aren't quirks. They're how attention works.
What to read next
- How large language models work — the higher-level mechanism
- Embeddings explained: how AI represents meaning as numbers — what attention produces under the hood
- Context windows explained — the practical limits of attention
Next in this pillar
Fine-tuning vs RAG vs prompting: which one fits your problem