Eval / evaluation
Also known as: evaluation suite, private benchmark, LLM-as-judge
The broader practice of measuring model output quality on tasks you actually care about — usually a custom test suite specific to your app.
What it means
An "eval" (engineer slang for evaluation) is your private benchmark. You collect 50-500 examples of real tasks your app handles, define what "good" looks like (often via a rubric scored by another LLM, sometimes by humans), and run your candidate models against the set. The output is a number per model per task, so you can decide whether GPT-5.1 is actually better than Claude Sonnet 4.6 on your customer support flow.
Custom evals beat public benchmarks for production work for one reason: they measure your distribution. MMLU tells you a model knows random facts; your eval tells you whether the model handles your customers' weirdly-phrased refund requests. Frameworks like OpenAI Evals, Promptfoo, Braintrust, LangSmith, and the UK AI Safety Institute's Inspect make this accessible; most teams can stand up a basic eval suite in a day.
The natural evolution is the LLM-as-judge eval: instead of writing deterministic graders, you have GPT-5 score each output against a rubric. This is faster than human grading and, with calibration, often correlates well with human judgment. Watch for systematic biases: judge models prefer their own family's outputs, prefer longer answers, and miss subtle failures. Cross-check with human spot-checks on a 5-10% sample. The teams shipping good AI products in 2026 all run eval pipelines as part of CI.
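A hedged sketch of judge scoring, again assuming an OpenAI-compatible client: the rubric wording, the judge model name, the JSON shape, and the 10% spot-check rate are assumptions, and the length guardrail is one cheap way to counter the verbosity bias noted above.

```python
# LLM-as-judge sketch: a judge model scores an output against a rubric
# and a random sample is flagged for human spot-checks.
import json
import random
from openai import OpenAI

client = OpenAI()

RUBRIC = """Score the ANSWER to the TASK from 1 (unusable) to 5 (ship it).
Judge correctness and tone only; do not reward length.
Reply with JSON: {"score": <int>, "reason": "<one sentence>"}"""

def judge(task: str, answer: str, judge_model: str = "gpt-5") -> dict:
    resp = client.chat.completions.create(
        model=judge_model,
        response_format={"type": "json_object"},  # ask for parseable JSON
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"TASK:\n{task}\n\nANSWER:\n{answer}"},
        ],
    )
    verdict = json.loads(resp.choices[0].message.content)
    # Counter known judge biases cheaply: a 10% random sample always goes
    # to humans, and unusually long answers are flagged regardless of score.
    verdict["needs_human"] = random.random() < 0.10 or len(answer) > 4000
    return verdict
```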
Example
A code-review startup runs 200 real PR diffs through Claude, GPT-5, and Gemini 3 every Monday. Each model's output is scored against ground-truth review comments by a judge model, and 20 outputs are spot-checked by humans. This eval, not MMLU, decides which model receives routed customer traffic that week. A sketch of that weekly decision follows.
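The aggregation step might look like the sketch below. The data shapes, the queue_for_human_review() stub, and the mean-score decision rule are assumptions for illustration, not the startup's actual pipeline.

```python
# Weekly routing sketch: aggregate judge scores per model, spot-check a
# fixed sample of the winner's outputs, then route traffic to the winner.
import random
from collections import defaultdict

def queue_for_human_review(sample: list[dict]) -> None:
    # Stub: in production this would file spot-check tasks for reviewers.
    print(f"queued {len(sample)} outputs for human spot-check")

def pick_weekly_model(results: list[dict], spot_check_n: int = 20) -> str:
    # results: [{"model": ..., "diff_id": ..., "score": ...}, ...]
    by_model = defaultdict(list)
    for r in results:
        by_model[r["model"]].append(r)

    # Mean judge score per model decides the winner.
    means = {m: sum(r["score"] for r in rs) / len(rs)
             for m, rs in by_model.items()}
    winner = max(means, key=means.get)

    # Humans spot-check a fixed sample before the routing change goes live.
    sample = random.sample(by_model[winner],
                           k=min(spot_check_n, len(by_model[winner])))
    queue_for_human_review(sample)
    return winner
```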
Why it matters
If you ship LLM-powered features and don't have evals, you can't tell when a model upgrade silently breaks things. Without an eval, every prompt change, every model swap, and every fine-tune is flying blind. This is the single highest-ROI engineering investment for AI products.