
Benchmark

Also known as: eval suite, leaderboard

A standardized test of model capability — same questions, same scoring, run across many models so you can rank them.

What it means

Benchmarks are how the field measures progress. MMLU tests general knowledge, HumanEval and SWE-bench test code, GPQA tests grad-level science reasoning, MATH tests math, ARC-AGI tests novel reasoning, MMMU tests multimodal understanding. Each benchmark has a fixed dataset and scoring rubric, so a Claude vs GPT comparison on MMLU is at least apples-to-apples.

The dirty secret is that benchmarks correlate poorly with real-world utility once they mature. Gaming is rampant — benchmark questions leak into training data (intentionally or accidentally), labs train explicitly on benchmark distributions, and test sets get memorized. By the time a benchmark hits 90%+ across frontier models, it's often saturated and stops discriminating. MMLU was the king from 2020-2023; in 2026 it's basically retired because everyone scores 88-92%.

Newer benchmarks try harder: SWE-bench Verified uses real GitHub issues with passing tests, ARC-AGI-2 deliberately uses puzzles humans solve easily but models can't memorize, and Humanity's Last Exam aggregates hard expert-written questions across dozens of fields. But every benchmark gets attacked eventually. The healthier framing: use public benchmarks as a smell test for capability tier, then build private evals on tasks you actually care about.
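
To make "same questions, same scoring" concrete, here is a minimal sketch of a benchmark harness — not any official implementation. The ask_model() function is a hypothetical placeholder for whatever API client you use, and the two-item dataset and exact-match grader are purely illustrative.

```python
# Minimal benchmark harness: fixed questions, fixed scoring, ranked results.

def ask_model(model: str, question: str) -> str:
    """Placeholder: call the model's API and return its answer as text."""
    raise NotImplementedError

# The fixed dataset: every model sees exactly the same items.
DATASET = [
    {"question": "What is 17 * 24?", "answer": "408"},
    {"question": "Which planet is largest?", "answer": "Jupiter"},
]

def exact_match(prediction: str, reference: str) -> bool:
    """The scoring rubric, applied identically to every model."""
    return prediction.strip().lower() == reference.strip().lower()

def run_benchmark(models: list[str]) -> list[tuple[str, float]]:
    scores = []
    for model in models:
        correct = sum(
            exact_match(ask_model(model, item["question"]), item["answer"])
            for item in DATASET
        )
        scores.append((model, correct / len(DATASET)))
    # Rank by accuracy; this ordering is what a leaderboard publishes.
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

The ranking that falls out of run_benchmark() is the easy part; everything contested about benchmarks (contamination, saturation, gaming) happens upstream of this loop, in how the dataset was built and what the models saw during training.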

Example

When DeepSeek R1 launched with a score of 79% on AIME 2024 (a math olympiad benchmark), the field took it seriously: that score is in the same range as o1 and o3-mini, suggesting the open-weight model had real reasoning rather than just benchmark memorization. Replication on private eval sets later confirmed it.

Why it matters

Benchmarks are the marketing currency of AI. Every model launch claims "state of the art on X." Learn which benchmarks still discriminate (SWE-bench Verified, ARC-AGI-2, GPQA Diamond), which are saturated (MMLU, HellaSwag), and how to read between the lines. Better yet, build your own evals.
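
As a starting point for a private eval, the sketch below pairs each of your own prompts with a task-specific pass/fail check. The SQL and summary tasks, the grade_* functions, and the thresholds are invented examples, assuming you would swap in checks that mirror your real workload.

```python
# Sketch of a private eval: your own tasks, your own pass/fail graders.
import re

def grade_sql(output: str) -> bool:
    # Pass if the generated query filters on the column we care about.
    return bool(re.search(r"WHERE\s+status\s*=\s*'active'", output, re.I))

def grade_summary(output: str) -> bool:
    # Pass if the summary stays short and mentions the required topic.
    return len(output.split()) <= 80 and "refund policy" in output.lower()

PRIVATE_EVAL = [
    ("Write SQL for all active users in the users table.", grade_sql),
    ("Summarize our refund policy doc in under 80 words.", grade_summary),
]

def score(generate) -> float:
    """generate: callable(prompt) -> str, wrapping whichever model is under test."""
    passed = sum(grader(generate(prompt)) for prompt, grader in PRIVATE_EVAL)
    return passed / len(PRIVATE_EVAL)
```

Unlike a public benchmark, nothing here can leak into training data you don't control, and a score that drops after a model swap points directly at a task you actually ship.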
