Decoder-only model
Also known as: causal language model, autoregressive Transformer, GPT-style model
A Transformer that just predicts the next token, autoregressively. The architecture used by GPT, Claude, Llama, Gemini, and basically every modern chat model.
What it means
A decoder-only model is a Transformer that does one thing: given a sequence of tokens, predict the next one. It applies a causal attention mask so each token can only attend to tokens before it, never after. You generate by sampling the next token, appending it to the sequence, and repeating. That's it — no separate encoder, no cross-attention, no special "input vs output" split.
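The two mechanics above (the causal mask, and the sample-append-repeat loop) can be sketched in a few lines. This is a toy illustration, not a real model: `toy_next_token` is a made-up stand-in for a forward pass, and the token IDs are arbitrary.

```python
import numpy as np

def causal_mask(seq_len):
    # Lower-triangular boolean mask: position i may attend to
    # positions 0..i, never to anything after it.
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def toy_next_token(tokens):
    # Hypothetical stand-in for a real forward pass: a real model
    # would return a probability distribution over the vocabulary.
    return (tokens[-1] + 1) % 10

def generate(prompt, n_new):
    tokens = list(prompt)
    for _ in range(n_new):
        nxt = toy_next_token(tokens)  # predict the next token...
        tokens.append(nxt)            # ...append it, and repeat
    return tokens

print(causal_mask(4).astype(int))  # row i is 1 only up to column i
print(generate([3, 4], 3))         # -> [3, 4, 5, 6, 7]
```

The loop is the whole inference story: there is no separate "decode the input" phase, just one sequence that grows one token at a time.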
This is the GPT-family architecture, and it has effectively won. GPT-2/3/4, Claude, Llama, Mistral, DeepSeek, Qwen, Gemini — all decoder-only. The reasons it won are mostly practical: decoder-only models train cleanly on raw internet text (just predict the next token, no need to construct input/output pairs), they handle arbitrary tasks via in-context learning ("translate this:" works as well as a dedicated translation model), and they're simpler to scale.
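The first of those practical points is worth making concrete: next-token training needs no labeled input/output pairs, because any token stream is its own supervision, shifted by one position. A minimal sketch with made-up token IDs:

```python
# Hypothetical token IDs from a run of raw text.
tokens = [17, 5, 42, 8, 99]

# Inputs and targets are the same stream, offset by one:
# the model at each position predicts the token that follows it.
inputs, targets = tokens[:-1], tokens[1:]
pairs = list(zip(inputs, targets))
print(pairs)  # -> [(17, 5), (5, 42), (42, 8), (8, 99)]
```

Every document on the internet yields training examples this way, which is why decoder-only models scale so cleanly on uncurated text.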
The contrast is with encoder-decoder models like T5 or the original BART, which have a separate encoder that reads the input and a decoder that generates output, connected by cross-attention. Encoder-decoder is a natural fit for tasks with a clean input-to-output structure (translation, summarization), but decoder-only models matched or beat them once they got big enough, and they're far more flexible for chat.
The "decoder-only" name is a historical artifact — the original 2017 Transformer had both an encoder and a decoder, and GPT-style models kept just the decoder half. In 2026 it's just "the default Transformer" and almost everyone calls these "LLMs" without qualifying further.
Example
When ChatGPT generates a response one token at a time, that's a decoder-only model in action — each token depends only on what came before it, and the model never "looks ahead."
Why it matters
The decoder-only architecture is the reason a single model can do translation, coding, math, summarization, and chat without specialization. Understanding that everything is just next-token prediction explains a lot of LLM behavior — including why they're so good at completion and so weird about counting.