How to evaluate whether an LLM feature is working (without fooling yourself)

Why "looks good" is not evaluation. Building a small eval set (20 cases beats 200), the four grading methods (programmatic, reference, LLM-as-judge, human), what to measure, and how to spot production drift.

9 min read·Updated May 27, 2026

Most LLM features ship with no evaluation beyond "I tried it a few times and it looked good." Then they degrade silently when the model is updated, when the prompt is edited, or when users start sending inputs the team never tested. Evaluation — having a structured way to know whether the system is producing what you want — is what separates AI features that improve over time from ones that get worse.

This guide is for the operator, PM, or engineer responsible for an AI feature in production. Not a research paper on eval methodology — the practical version.

Why "looks good" isn't evaluation

The model is fluent. Its outputs almost always look good. That fluency hides three classes of failure that vibes-based testing misses:

  1. Confident wrong answers: the output reads well and is incorrect
  2. Drift over time: the model is updated, your prompt change has unintended effects, your data distribution shifts — and "looks good" doesn't catch any of it
  3. Failure on edge cases: the cases you didn't think to test are exactly the ones that break in production

The fix: a small, persistent set of inputs you check the system against, with explicit criteria for what counts as a pass.

The eval set: the foundation of every other technique

An eval set is a fixed list of test cases. For each: the input, what good output looks like, and (ideally) how to check programmatically whether the output meets the bar.

A small eval set (20–100 cases) covering your actual use cases beats a large one full of synthetic data.

To build one:

  1. Sample real user inputs if you have them; they are the best evidence of what users send in practice
  2. Cover the obvious categories — the happy path, the most common edge cases, the types of input you've seen break before
  3. Cover the categories you don't want to see, too — inputs the system should refuse, inputs that should escalate to a human
  4. Date each case — when added, when last reviewed; eval sets go stale

Store the eval set in code (a JSON or YAML file in the repo). Make running it a single command.
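As a rough sketch of what such a file and its loader could look like: the field names (`id`, `input`, `expected`, `check`, `added`, `last_reviewed`), the example cases, and the 90-day staleness threshold are all illustrative assumptions, not a standard format. The JSON is inlined here so the example is self-contained; in a real repo it would live in its own file.

```python
import json
from dataclasses import dataclass
from datetime import date

# Hypothetical eval-set content; in practice this would be a JSON file in the repo.
RAW_CASES = """
[
  {"id": "refund-happy-path", "input": "How do I get a refund?",
   "expected": "refund", "check": "contains",
   "added": "2026-01-10", "last_reviewed": "2026-04-02"},
  {"id": "legal-advice-refusal", "input": "Can I sue my landlord?",
   "expected": "escalate", "check": "exact",
   "added": "2026-02-01", "last_reviewed": "2026-02-01"}
]
"""


@dataclass(frozen=True)
class EvalCase:
    id: str
    input: str
    expected: str
    check: str           # how to grade: "exact" or "contains" (assumed names)
    added: date
    last_reviewed: date


def load_cases(raw: str) -> list[EvalCase]:
    """Parse the JSON text into typed cases, converting ISO dates."""
    return [
        EvalCase(
            id=item["id"],
            input=item["input"],
            expected=item["expected"],
            check=item["check"],
            added=date.fromisoformat(item["added"]),
            last_reviewed=date.fromisoformat(item["last_reviewed"]),
        )
        for item in json.loads(raw)
    ]


def stale_cases(cases: list[EvalCase], today: date, max_age_days: int = 90) -> list[EvalCase]:
    """Return cases whose last review is older than max_age_days."""
    return [c for c in cases if (today - c.last_reviewed).days > max_age_days]


cases = load_cases(RAW_CASES)
```

Keeping the dates as real `date` values makes it cheap to list cases that are overdue for review, for example with `stale_cases(cases, today=date.today())`.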

The four ways to grade output

Once you have inputs, you need to grade outputs. Roughly from cheapest and most repeatable to slowest and most expensive:

1. Programmatic checks (best when you can use them)

For tasks with verifiable answers: regex matches, JSON-schema validation, exact-match for classification, executing generated code, calling APIs. When this works, it's the gold standard — fast, cheap, repeatable.

Use for: structured extraction, classification, code generation, anything with a clear right answer.
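A few of these checks, sketched with the standard library only. The function names are made up for illustration, and `check_json_fields` is a hand-rolled stand-in for real JSON-schema validation (a dedicated library such as `jsonschema` would normally do that job).

```python
import json
import re


def check_exact_label(output: str, expected: str) -> bool:
    """Exact match for classification, ignoring case and surrounding whitespace."""
    return output.strip().lower() == expected.strip().lower()


def check_regex(output: str, pattern: str) -> bool:
    """Pass if the pattern occurs anywhere in the output."""
    return re.search(pattern, output) is not None


def check_json_fields(output: str, required: dict[str, type]) -> bool:
    """Parse the output as JSON and verify required keys exist with the expected types."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    # A missing key yields None, which fails the isinstance check below.
    return all(isinstance(data.get(key), typ) for key, typ in required.items())
```

Each check returns a plain boolean, so results from different check types can be aggregated into one pass rate.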

2. Reference comparison

For each test case, store an example of acceptable output. Compare new output to the reference (similarity score, BLEU, ROUGE for text; or a strict diff for structured output).

Use for: summarization, translation, structured response shaping.

Limit: requires a reference per case, and "matches the reference" isn't the same as "is correct" for open-ended tasks.
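A minimal sketch of a reference check using Python's `difflib`. This is plain character-level similarity, used here as a simple stand-in for metrics like BLEU or ROUGE, and the 0.8 threshold is an assumption you would tune on your own cases.

```python
from difflib import SequenceMatcher


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace so trivial differences don't count."""
    return " ".join(text.lower().split())


def similarity(output: str, reference: str) -> float:
    """Similarity ratio in [0, 1] based on matching character runs."""
    return SequenceMatcher(None, _normalize(output), _normalize(reference)).ratio()


def passes_reference(output: str, reference: str, threshold: float = 0.8) -> bool:
    """Pass when the output is at least `threshold` similar to the stored reference."""
    return similarity(output, reference) >= threshold
```

A score like this rewards surface overlap, so a correct answer phrased differently can fail and a wrong answer phrased similarly can pass, which is the limit noted above.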

3. LLM-as-judge

Use a second LLM (often a different, more capable one) to grade the first model's output against a rubric. Cheaper than humans, faster, scales.

Use for: open-ended generation where there's no exact right answer (tone, helpfulness, format adherence, factual grounding against retrieved sources).

Limit: judge models have their own biases. Calibrate against human grades on a sample before trusting at scale.
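One way to do that calibration, sketched for the simplest case where judge and humans both give pass/fail grades on the same sample. It computes raw agreement and Cohen's kappa, which discounts the agreement expected by chance. The function names are illustrative, and what counts as "good enough" agreement is a judgment call for your team.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of cases where the judge and the human gave the same grade."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty grade lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Agreement corrected for chance, for binary pass/fail grades."""
    n = len(judge)
    observed = agreement_rate(judge, human)
    judge_pass = sum(judge) / n
    human_pass = sum(human) / n
    # Probability both say "pass" by chance plus probability both say "fail".
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    if expected == 1.0:
        # Kappa is undefined when both graders used one identical label throughout;
        # this sketch treats that as full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)
```

For example, a judge that agrees with humans on 8 of 10 cases, where each passes 6 of the 10, has 0.8 raw agreement but a kappa of roughly 0.58.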

4. Human evaluation

Have humans rate outputs against a rubric. Gold standard for nuanced judgment, but slow and expensive.

Use for: validating the LLM-as-judge calibration, periodic quality audits, evaluating new features before launch.

Most production eval pipelines use a mix: programmatic where possible, LLM-as-judge for the rest, human sampling to keep both honest.

What to measure

Pick metrics that match what users care about:

  • Correctness: does the output give the right answer?
  • Factuality / grounding: are the claims supported by source material (for RAG, by retrieved chunks)?
  • Format adherence: does the output match the required structure (JSON shape, length limits, no banned words)?
  • Refusal accuracy: does the system refuse what it should refuse, and not refuse what it shouldn't?
  • Safety: does it produce content within your safety policies?
  • Latency: P50, P95, P99 — averages hide tail problems
  • Cost per request: trending up or down?

For each, define a target. Not "high correctness" — "≥85% correctness on the eval set, with no regression vs. last release."
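A small illustration of why averages hide tail problems, plus the target check as code. The latency numbers are invented, the percentile uses the nearest-rank definition (one of several in common use), and `meets_target` simply encodes the example target above.

```python
import math
from statistics import mean


def percentile(values: list[float], p: int) -> float:
    """Nearest-rank percentile: the smallest value with at least p% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < p <= 100:
        raise ValueError("p must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def meets_target(pass_rate: float, previous_pass_rate: float, floor: float = 0.85) -> bool:
    """At least `floor` correctness, with no regression against the last release."""
    return pass_rate >= floor and pass_rate >= previous_pass_rate


# Invented latencies in milliseconds: 18 fast requests and 2 slow ones.
latencies_ms = [120] * 18 + [900, 4000]

summary = {
    "mean": mean(latencies_ms),
    "p50": percentile(latencies_ms, 50),
    "p95": percentile(latencies_ms, 95),
    "p99": percentile(latencies_ms, 99),
}
```

Here the mean is 353 ms and the median is 120 ms, while P95 and P99 are 900 ms and 4000 ms: the two slow requests are invisible in the median and blurred in the mean.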

The before/after test for every change

Whenever you change the system (new prompt, new model, new RAG chunks), run the eval set against both the old and the new version. Compare:

  • Cases that pass in both
  • Cases that pass in old but fail in new (regressions — the change broke something)
  • Cases that fail in old but pass in new (improvements — what we hoped for)
  • Cases that fail in both (still broken — known issues)

A change that improves 5 cases and regresses 3 is not unambiguously a win. Often you find that "improving" the prompt for new use cases breaks something you didn't realize you needed.
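The four buckets can be computed directly from two result maps. This sketch assumes each run produces a dictionary from case ID to pass/fail; the function and bucket names are illustrative.

```python
def compare_runs(old: dict[str, bool], new: dict[str, bool]) -> dict[str, list[str]]:
    """Sort case IDs into the four before/after buckets.

    Only cases present in both runs are compared; cases added or removed
    between runs are ignored here.
    """
    bucket_for = {
        (True, True): "pass_both",
        (True, False): "regressions",
        (False, True): "improvements",
        (False, False): "fail_both",
    }
    buckets: dict[str, list[str]] = {name: [] for name in bucket_for.values()}
    for case_id in sorted(old.keys() & new.keys()):
        buckets[bucket_for[(old[case_id], new[case_id])]].append(case_id)
    return buckets
```

Listing the regressed case IDs by name, rather than reporting only a net pass rate, is what makes a "5 improved, 3 regressed" change visible for review.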

How to spot drift in production

The eval set tells you about behavior on a fixed input set. It doesn't tell you what's happening in production. For that, monitor:

  • Live error rate: what fraction of requests return malformed output, fall back to defaults, or are flagged by downstream checks
  • User-side signals: thumbs-down ratings, complaint volume, re-asks of similar questions
  • Cost trend: if cost per request is climbing without a clear cause, something is generating more tokens than expected (a loop, a broken truncation, longer context)
  • Refusal rate: should be roughly stable; sudden spikes mean either the input distribution changed or the model's refusal calibration moved

When drift is detected, expand the eval set with examples of the new failure mode, fix the root cause, and add a regression test so the same drift can be caught faster next time.
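A deliberately simple sketch of a drift check for a rate signal such as refusals or malformed outputs: compare the rate in a recent window against a known baseline. The 5-point tolerance and the 50-sample minimum are placeholder assumptions; a real monitor would likely use your existing alerting stack and thresholds tuned to your traffic.

```python
def rate(flags: list[bool]) -> float:
    """Fraction of requests in the window where the flag is set."""
    return sum(flags) / len(flags)


def drifted(
    baseline_rate: float,
    recent_flags: list[bool],
    abs_tolerance: float = 0.05,
    min_samples: int = 50,
) -> bool:
    """Report drift when the recent rate moves more than abs_tolerance from baseline.

    Windows smaller than min_samples are never flagged, to avoid alerting on noise.
    """
    if len(recent_flags) < min_samples:
        return False
    return abs(rate(recent_flags) - baseline_rate) > abs_tolerance
```

The check is two-sided on purpose: a refusal rate that suddenly drops can signal a problem just as much as one that spikes.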

Evaluation when the answer is open-ended

For tasks with no single right answer (creative writing, brainstorming, open-ended chat), traditional accuracy metrics don't apply. Common approaches:

  • Rubric-based scoring: define 3–5 dimensions (helpfulness, accuracy, tone, format) and have an LLM or human score 1–5 on each. Trend the scores over time.
  • A/B testing: ship two variants to a fraction of users, measure downstream signals (conversion, retention, task completion). Slow, but the closest thing to ground truth.
  • Pairwise preference: show humans (or an LLM judge) two outputs side-by-side and ask which is better. More reliable than absolute scoring.

For open-ended use cases, accept that a single-number "quality" score doesn't exist. Track multiple dimensions and watch how they interact.
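A sketch of pairwise preference scoring. The `judge` argument is a hypothetical callable (an LLM call or a human-rating lookup) that sees two outputs and answers "first" or "second"; the order is randomized as a precaution against a grader favoring whichever output it sees first. The labels "A", "B", and "tie" are assumptions of this example.

```python
import random
from collections import Counter
from typing import Callable


def judge_pair(
    judge: Callable[[str, str], str],
    out_a: str,
    out_b: str,
    rng: random.Random,
) -> str:
    """Show the two outputs in random order and map the verdict back to 'A' or 'B'."""
    swapped = rng.random() < 0.5
    first, second = (out_b, out_a) if swapped else (out_a, out_b)
    verdict = judge(first, second)  # hypothetical judge returns "first" or "second"
    if verdict == "first":
        return "B" if swapped else "A"
    return "A" if swapped else "B"


def win_rate(preferences: list[str]) -> float:
    """Share of decided (non-tie) comparisons won by variant B."""
    counts = Counter(preferences)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts["B"] / decided


def longer_wins(first: str, second: str) -> str:
    """Toy judge for demonstration only: prefers the longer output."""
    return "first" if len(first) > len(second) else "second"
```

With the toy judge, `judge_pair(longer_wins, "short", "a much longer answer", random.Random(0))` returns "B" whichever order is drawn, which is the property the order mapping is meant to guarantee.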

The minimal eval setup for a new feature

If you're shipping an LLM feature today and don't have evals yet, the minimum:

  1. 20 test cases covering the obvious uses and the 3–5 most likely failure modes
  2. A grading function — programmatic where possible, LLM-as-judge for the rest
  3. A run-evals command that produces a pass/fail report
  4. A run on every prompt change — make it part of the PR template
  5. Production logging of inputs, outputs, and errors — even if you're not analyzing yet, you'll want the history

Add complexity (more cases, more metrics, more sophistication) only when you have a specific gap that's biting you.
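Steps 1 to 3 can fit in one small script. In this sketch `fake_model` stands in for your real LLM call, the three cases and the substring check are placeholders, and all names are hypothetical.

```python
from typing import Callable

# Placeholder cases; a real set would be loaded from a file in the repo.
CASES = [
    {"id": "refund", "input": "How do I get a refund?", "expected": "refund"},
    {"id": "hours", "input": "When are you open?", "expected": "9am"},
    {"id": "escalate", "input": "I want to speak to a lawyer", "expected": "human agent"},
]


def fake_model(prompt: str) -> str:
    """Stand-in for the real model call, returning canned answers."""
    canned = {
        "How do I get a refund?": "You can request a refund from the Orders page.",
        "When are you open?": "We are open on weekdays.",
    }
    return canned.get(prompt, "Let me connect you with a human agent.")


def run_evals(model: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Run every case and grade it with a case-insensitive substring check."""
    results = {}
    for case in cases:
        output = model(case["input"])
        results[case["id"]] = case["expected"].lower() in output.lower()
    return results


def report(results: dict[str, bool]) -> str:
    """Render one PASS/FAIL line per case plus a summary line."""
    lines = [f"{'PASS' if ok else 'FAIL'}  {case_id}" for case_id, ok in results.items()]
    lines.append(f"{sum(results.values())}/{len(results)} passed")
    return "\n".join(lines)


def main() -> int:
    """Print the report and return a non-zero code if any case failed."""
    results = run_evals(fake_model, CASES)
    print(report(results))
    return 0 if all(results.values()) else 1
```

Calling `sys.exit(main())` from the script entry point turns this into a single command whose exit code can fail a CI job or a PR check.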

Why most teams skip this and regret it

Evaluation feels like overhead. It produces no user-visible feature. The first time you build the harness, it slows you down.

But within a quarter of shipping an unmeasured AI feature, most teams hit one of:

  • The model is updated and quality silently changes (better or worse, with no way to tell)
  • The prompt is edited and a known-good case breaks (without a regression test, you find out in production)
  • A customer complaint surfaces a failure pattern you didn't know existed
  • You can't tell whether a "quality improvement" effort moved any number

A small eval set catches all four of those, and the cost is about one afternoon to build plus 30 minutes per change.
