Attention
Also known as: self-attention, multi-head attention, scaled dot-product attention
The mechanism that lets a model decide which parts of its input matter most for each token it produces. Every Transformer is built around it.
What it means
Attention is how a Transformer figures out which tokens to "look at" when processing each position. For every token, the model produces three vectors: a query, a key, and a value. The query gets compared (dot product) against every other token's key to produce a score, the scores get scaled down by the square root of the key dimension (that's the "scaled" in scaled dot-product attention) and softmaxed into weights, and the output is a weighted sum of all the value vectors. In English: each token asks "who in this sequence is relevant to me?" and pulls in their information accordingly.
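Here is a minimal sketch of that computation in NumPy, using toy shapes and random vectors just to make the matrix shapes concrete (a real model would get Q, K, V from learned projections of the token embeddings):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Single-head attention over one sequence.

    Q, K, V: (seq_len, d_k) arrays -- the query, key, and value
    vectors for every token, stacked into matrices.
    """
    d_k = Q.shape[-1]
    # Compare every query against every key: (seq_len, seq_len) scores,
    # scaled by sqrt(d_k) to keep the softmax from saturating.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax each row into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Each output token is a weighted sum of all the value vectors.
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((5, 8)) for _ in range(3))  # toy 5-token sequence
out = scaled_dot_product_attention(Q, K, V)  # shape (5, 8)
```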
There are two flavors that show up everywhere. Self-attention is when a sequence attends to itself — used inside every Transformer block to mix information across positions. Cross-attention is when one sequence attends to another — used in encoder-decoder models so the decoder can look at the encoder's output. Decoder-only models like GPT and Claude only use self-attention (with a causal mask so tokens can't peek at the future).
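The causal mask is just arithmetic: set every score that points at a future position to negative infinity before the softmax, so those weights come out as exactly zero. A minimal sketch:

```python
import numpy as np

seq_len = 5
scores = np.zeros((seq_len, seq_len))  # stand-in for Q @ K.T / sqrt(d_k)
# Causal mask: position i may only attend to positions 0..i.
future = np.triu(np.ones((seq_len, seq_len), dtype=bool), k=1)
scores[future] = -np.inf  # exp(-inf) = 0, so future tokens get zero weight
```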
Multi-head attention runs many attention computations in parallel with different learned projections, then concatenates the results. The intuition is that different "heads" can specialize: one might track syntax, another long-range coreference, another local n-grams. Large modern models typically use 32 to 128 heads. Variants cut inference cost by sharing keys and values across heads: Multi-Query Attention (MQA) uses a single key/value head for all query heads, while Grouped-Query Attention (GQA) shares one key/value head per group of query heads. Llama 3, Mistral, and most 2025-era models use GQA.
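A sketch of the multi-head version, assuming a single input sequence x and illustrative projection matrices Wq, Wk, Wv, Wo (learned in a real model; GQA and MQA would simply project fewer key/value heads here and share them across the query heads):

```python
import numpy as np

def multi_head_attention(x, n_heads, Wq, Wk, Wv, Wo):
    """x: (seq_len, d_model); Wq/Wk/Wv/Wo: (d_model, d_model) projections."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads

    # Project, then split the feature dimension into n_heads heads:
    # (seq_len, d_model) -> (n_heads, seq_len, d_head).
    def split(W):
        return (x @ W).reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)

    Q, K, V = split(Wq), split(Wk), split(Wv)
    # Run attention independently in every head.
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    heads = weights @ V  # (n_heads, seq_len, d_head)
    # Concatenate the heads back together and mix with the output projection.
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo
```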
Attention's main weakness is that its cost scales quadratically with sequence length: a 100k-token context takes roughly 100x more attention compute than a 10k context. That's why long-context models lean on sliding window or sparse attention to skip most of the score matrix, and on FlashAttention, which still computes exact attention but restructures it to run much faster with far less memory.
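The quadratic growth falls directly out of the score matrix, which has one entry per query-key pair:

```python
# One attention score per (query, key) pair: seq_len^2 entries per head per layer.
for seq_len in (10_000, 100_000):
    print(f"{seq_len:>6} tokens -> {seq_len**2:.0e} scores")
# 10x the context length means 100x the scores (1e+08 vs 1e+10).
```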
Example
When you ask Claude "What did the CEO say in the third paragraph?", attention is what lets the model jump straight to the third paragraph instead of relying on the order it read things in.
Why it matters
Attention is the one mechanism behind nearly every capability of modern LLMs. Long context, in-context learning, instruction following — they're all consequences of attention being good at finding relevant information across a sequence.