Transformer
Also known as: Transformer architecture, Transformer network
The neural network architecture behind every modern LLM. Introduced in 2017, it processes sequences using self-attention instead of recurrence.
What it means
The Transformer is the architecture that runs the entire LLM era. GPT, Claude, Gemini, Llama, DeepSeek, Mistral — all of them are Transformers under the hood. It was introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google, and the title was the actual thesis: you don't need recurrence or convolutions, just attention.
The core trick is self-attention. For each token in the input, the model computes how much it should "look at" every other token, then mixes their representations weighted by those scores. This happens in parallel for the whole sequence at every layer, which is why Transformers train so much faster than the RNNs and LSTMs they replaced — RNNs process tokens one at a time, Transformers eat the whole sequence at once on a GPU.
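To make the mixing concrete, here is a minimal sketch of single-head self-attention in NumPy. The variable names, shapes, and random toy inputs are illustrative only; real models use many attention heads, learned projection matrices, and (in decoder-only LLMs) a causal mask so each token attends only to earlier positions.

```python
import numpy as np

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention over a sequence of token vectors.

    x:          (seq_len, d_model) token representations
    Wq, Wk, Wv: (d_model, d_head) learned projection matrices
    """
    Q = x @ Wq   # queries: what each token is looking for
    K = x @ Wk   # keys: what each token offers to others
    V = x @ Wv   # values: the content that actually gets mixed

    scores = Q @ K.T / np.sqrt(Q.shape[-1])         # (seq_len, seq_len) similarity scores
    # Decoder-only LLMs also apply a causal mask here so a token cannot
    # look at later positions; omitted for brevity.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: each row sums to 1
    return weights @ V                               # each output is a weighted mix of all values

# Toy usage: 4 tokens, 8-dimensional embeddings, one 8-dimensional head
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, Wq, Wk, Wv).shape)  # (4, 8)
```

Note that nothing in the function is sequential: every row of the score matrix is computed in one matrix multiply, which is exactly the parallelism that RNNs lack.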
A modern Transformer is just this same block — attention, then a feed-forward network, with residual connections and layer norm — stacked dozens to hundreds of times. GPT-3 had 96 layers. Frontier models in 2026 have hundreds. Bigger model = more layers, wider hidden dimensions, more attention heads. The architecture itself has barely changed since 2017; what changed is scale, training data, and tweaks like RoPE positional encoding, RMSNorm, and grouped-query attention.
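The stacking itself is simple enough to sketch. The snippet below is a toy, assumption-laden version of a pre-norm decoder block (the ordering most modern LLMs use): the dimensions, parameter layout, and single attention head are illustrative choices, and real models add multi-head attention, learned norm gains, positional encodings, and an embedding plus output head around the stack.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each token vector to zero mean, unit variance (learned gains omitted)
    mu, var = x.mean(-1, keepdims=True), x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(x, Wq, Wk, Wv):
    # Single-head self-attention, as in the sketch above
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

def ffn(x, W1, W2):
    # Position-wise feed-forward network: expand, apply ReLU, project back
    return np.maximum(x @ W1, 0) @ W2

def block(x, p):
    # One Transformer block: pre-norm attention and FFN, each with a residual add
    x = x + attention(layer_norm(x), p["Wq"], p["Wk"], p["Wv"])
    x = x + ffn(layer_norm(x), p["W1"], p["W2"])
    return x

def transformer(x, params):
    # A model is just this block repeated N times (N = "number of layers")
    for p in params:
        x = block(x, p)
    return layer_norm(x)

# Toy stack: 4 layers, 8-dim model width, 32-dim FFN hidden size
rng = np.random.default_rng(0)
d, h = 8, 32
params = [{"Wq": rng.normal(size=(d, d)), "Wk": rng.normal(size=(d, d)),
           "Wv": rng.normal(size=(d, d)), "W1": rng.normal(size=(d, h)),
           "W2": rng.normal(size=(h, d))} for _ in range(4)]
out = transformer(rng.normal(size=(5, d)), params)
print(out.shape)  # (5, 8): same shape in and out, layer after layer
```

Scaling the model up mostly means changing the numbers in this sketch, not the structure: more blocks in the loop, a larger d, wider FFNs, and more heads.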
Transformers replaced RNNs because they parallelize, scale gracefully to billions of parameters, and capture long-range dependencies cleanly through attention rather than squeezing everything through a single hidden state.
Example
GPT-4, Claude 4.7, Gemini 2.5, Llama 4, and DeepSeek-V3 are all Transformers. The differences are in scale, training, and routing — not the fundamental architecture.
Why it matters
Understanding that almost every model you interact with is the same basic architecture, just scaled differently, demystifies a lot of AI hype. Progress since 2020 has come mostly from more data, more compute, and better training — not from a new architecture replacing the Transformer.