Architecture

Encoder-decoder

Also known as: seq2seq Transformer, sequence-to-sequence model

The original Transformer architecture: an encoder that reads the input and a decoder that generates the output, connected by cross-attention. Still used for translation and some summarization, mostly extinct for chat.

What it means

Encoder-decoder is the architecture from the original 2017 "Attention Is All You Need" paper. The encoder is a stack of Transformer blocks that processes the input bidirectionally — every token can see every other token. The decoder is another stack that generates the output one token at a time, with two kinds of attention: causal self-attention over what it has generated so far, and cross-attention that reaches into the encoder's output to pull in source information.

This architecture dominated machine translation and structured sequence-to-sequence tasks. T5, BART, mT5, and the original Transformer for English-to-German translation are all encoder-decoder. It's a clean fit when there's a real input-output split: the input gets encoded once into a dense representation, and the decoder repeatedly attends to that representation while generating. For translation, that's exactly the shape of the problem.

So why aren't chat models encoder-decoder? A few reasons. Decoder-only models scale more cleanly on raw text (no need for paired input-output examples). They handle multi-turn chat naturally — the whole conversation is just one growing sequence. And once decoder-only models got big enough, they matched encoder-decoder performance on translation and summarization too, while being more flexible for everything else. By 2022 the field had largely moved on.

Encoder-decoder still shows up in production for narrow workloads. Google Translate uses a custom encoder-decoder. Many specialized summarization, OCR (like TrOCR), and speech-to-text (like Whisper) systems are encoder-decoder. But for general-purpose AI assistants, decoder-only won, and the term "LLM" almost always means decoder-only now.
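To make the two attention paths concrete, here is a minimal PyTorch sketch of a single decoder block. This is an illustration, not any model's actual implementation: the class name and dimensions are made up, and real implementations add dropout, positional information, and often a pre-norm layout.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One Transformer decoder block: causal self-attention over the
    generated prefix, then cross-attention into the encoder's output.
    Simplified sketch (no dropout, post-norm layout)."""

    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, tgt, memory):
        # Causal mask: True marks positions a token may NOT attend to,
        # so position i only sees positions <= i.
        T = tgt.size(1)
        causal = torch.triu(
            torch.ones(T, T, dtype=torch.bool, device=tgt.device), diagonal=1
        )

        # 1) Causal self-attention over what has been generated so far.
        x, _ = self.self_attn(tgt, tgt, tgt, attn_mask=causal)
        tgt = self.norm1(tgt + x)

        # 2) Cross-attention: queries come from the decoder, keys/values
        #    from the encoder output ("memory"). No causal mask here —
        #    the decoder may look at the entire encoded input.
        x, _ = self.cross_attn(tgt, memory, memory)
        tgt = self.norm2(tgt + x)

        # 3) Position-wise feed-forward network.
        return self.norm3(tgt + self.ffn(tgt))

# Usage: a batch of 2, a 10-token encoded input, a 7-token decoded prefix.
memory = torch.randn(2, 10, 512)   # encoder output, computed once
prefix = torch.randn(2, 7, 512)    # decoder states for tokens so far
out = DecoderBlock()(prefix, memory)
```

Stacking N of these blocks (plus embeddings and an output projection) gives the decoder half. A decoder-only model is the same block with step 2 deleted, which is exactly why the two architectures read so similarly in papers.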

Example

Whisper, OpenAI's speech-to-text model, is encoder-decoder: the encoder processes the audio, the decoder generates the transcript token by token while attending to the encoded audio.
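A minimal usage sketch with the open-source `openai-whisper` package; the checkpoint name and audio path below are placeholders, not recommendations:

```python
# pip install openai-whisper
import whisper

# "base" is one of the published checkpoint sizes.
model = whisper.load_model("base")

# transcribe() runs the full encoder-decoder loop: the encoder embeds the
# audio once, then the decoder generates transcript tokens one at a time,
# cross-attending to that encoded audio at every step.
result = model.transcribe("speech.mp3")  # placeholder path
print(result["text"])
```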

Why it matters

Knowing the encoder-decoder vs. decoder-only split helps you read papers and understand why certain models work the way they do. It also explains why specialized translation and OCR systems often beat general LLMs on those tasks: they use an architecture purpose-built for input-to-output mapping.

Related terms