MMLU (Massive Multitask Language Understanding)
Also known as: Massive Multitask Language Understanding, Hendrycks MMLU
A 57-subject multiple-choice benchmark covering everything from US history to abstract algebra. The most-cited general knowledge test of the LLM era.
What it means
MMLU was introduced in 2020 by Hendrycks et al. as a way to measure broad knowledge across high school, college, and professional-level subjects. Topics range from elementary math to international law to clinical medicine. Each question is multiple-choice with four options, so random guessing scores 25%. Human specialists top out around 90% on questions in their own field, while cross-subject expert averages sit near 80%.
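To make the format and scoring concrete, here is a minimal sketch of how an MMLU-style item is typically rendered and graded. The toy question, the prompt template, and the extract_choice helper are illustrative assumptions, not the official dataset or evaluation harness.

```python
import random
import re

# One MMLU-style item: a question stem plus four options, graded by letter match.
# This toy question and the prompt layout are illustrative, not from the dataset.
QUESTION = {
    "stem": "Which law relates the voltage across a resistor to the current through it?",
    "options": ["Ohm's law", "Faraday's law", "Coulomb's law", "Ampere's law"],
    "answer": "A",
}

def format_prompt(item: dict) -> str:
    """Render a question in the usual A/B/C/D multiple-choice layout."""
    lines = [item["stem"]]
    lines += [f"{letter}. {opt}" for letter, opt in zip("ABCD", item["options"])]
    lines.append("Answer:")
    return "\n".join(lines)

def extract_choice(completion: str) -> str | None:
    """Pull the first standalone A-D letter out of a model's completion."""
    match = re.search(r"\b([ABCD])\b", completion)
    return match.group(1) if match else None

print(format_prompt(QUESTION))
print(extract_choice(" A. Ohm's law"))  # -> 'A'

# Grading is plain accuracy; random guessing over 4 options converges on 25%
# (the same arithmetic gives 10% for 10-option MMLU-Pro).
golds = ["A"] * 10_000
guesses = [random.choice("ABCD") for _ in golds]
print(sum(p == g for p, g in zip(guesses, golds)) / len(golds))  # ~0.25
```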
For years, MMLU tracked progress cleanly. GPT-3 in 2020 scored ~44%, GPT-3.5 in 2022 was ~70%, GPT-4 in 2023 hit 86%, and Claude 3 Opus pushed past 88% in 2024. By 2025-2026, every frontier model — Claude Opus 4.7, GPT-5, Gemini 3 Ultra, DeepSeek V4 — scores in the 88-92% range. The benchmark is functionally saturated; differences inside that band are mostly noise from question ambiguity, not real capability gaps.
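Why is the 88-92% band mostly noise? Sampling error alone can't explain it, as the back-of-envelope sketch below shows; the residual spread has to come from ambiguous or mislabeled questions. The test-set size is the commonly cited figure, and the 3% flaw rate is a loud placeholder assumption, not a measured value.

```python
import math

N = 14_042   # commonly cited size of the MMLU test split
acc = 0.90   # a typical frontier score

# Pure sampling noise: binomial standard error on a 90% accuracy estimate.
se = math.sqrt(acc * (1 - acc) / N)
print(f"95% CI from sampling alone: +/- {1.96 * se:.2%}")  # ~ +/- 0.50%

# Assumed (illustrative) fraction of ambiguous or mislabeled questions.
# Dataset audits have flagged flawed items; treat this rate as a placeholder.
flawed = 0.03

# On flawed items the keyed answer is arbitrary, so two equally capable models
# can disagree with the key on different subsets of them, smearing scores
# across a band on the order of the flaw rate itself.
print(f"Score band attributable to flawed items: up to ~{flawed:.0%}")
```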
MMLU still gets cited because it's familiar and the dataset is fixed. But labs have moved to harder successors: MMLU-Pro (10 options instead of 4, harder questions, ~75% frontier ceiling in 2026), GPQA Diamond (PhD-level science, ~70% frontier), and Humanity's Last Exam (curated to stump most human experts, ~30% frontier). If a 2026 launch leads with MMLU numbers, that's a marketing tell: they're showing the easy benchmark because the harder ones don't look as good.
Example
A 2026 model card showing MMLU 91% / MMLU-Pro 76% / GPQA Diamond 68% places the model in the frontier tier. The MMLU number alone tells you nothing; every serious model clears 88%.
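As a reading aid, here is a hypothetical helper that applies that rule: drop scores on benchmarks whose frontier ceiling is already near the top of the scale, and compare only the ones with headroom. The ceiling figures are the rough 2026 numbers quoted above, and the function name and saturation threshold are made up for illustration.

```python
# Rough frontier ceilings quoted in this entry (2026 ballpark figures).
FRONTIER_CEILING = {
    "MMLU": 0.92,
    "MMLU-Pro": 0.75,
    "GPQA Diamond": 0.70,
    "Humanity's Last Exam": 0.30,
}

def informative_scores(reported: dict[str, float], saturation: float = 0.85) -> dict[str, float]:
    """Keep only scores on benchmarks that still separate frontier models;
    once typical frontier scores approach the maximum, rankings inside the
    remaining band carry little signal."""
    return {
        name: score
        for name, score in reported.items()
        if FRONTIER_CEILING.get(name, 0.0) < saturation
    }

# The model card from the example above:
card = {"MMLU": 0.91, "MMLU-Pro": 0.76, "GPQA Diamond": 0.68}
print(informative_scores(card))
# {'MMLU-Pro': 0.76, 'GPQA Diamond': 0.68}
```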
Why it matters
MMLU is the canonical example of benchmark saturation. Knowing why it's no longer useful trains you to read benchmark claims skeptically — always ask which benchmark, what version, and where the frontier ceiling currently sits.