Eval-Driven Development

Foundations

Evals vs. tests vs. benchmarks: what's the difference?

"Our model scores 90%" can mean four completely different things, and people use the words interchangeably as if it didn't matter. A metric, a test, a benchmark, and an eval are distinct objects with distinct jobs. Confusing them is one of the most common ways teams talk themselves into shipping something that isn't actually good — so it's worth getting the vocabulary straight before you bet a release on it.

Four definitions

Start with the smallest unit and build up. These definitions come from the codex's Part I taxonomy, which is the shared vocabulary the rest of eval-driven development rests on.

  1. A metric is one measurement. Accuracy, F1, exact-match, latency, pass-rate, BLEU — a single number describing one property of one run. A metric is an ingredient, not a verdict. On its own it tells you nothing about whether the number is the right number to care about.
  2. A test is a deterministic assertion. The classic dev-sense test: given this input, the code returns exactly this output, and the check is binary and repeatable. The total equals 42 or it doesn't. The JSON parses or it doesn't. There is no sampling and no judgment — the assertion is the ground truth.
  3. A benchmark is a standardized dataset plus metric plus protocol, built for cross-model comparison. MMLU, HELM, SWE-bench, HumanEval. The point of a benchmark is to put many different systems on the same yardstick so you can rank them. It is general by design and not specific to your application.
  4. An eval is your application-specific check. The minimal eval is a dataset (your inputs), a success criterion (your definition of done), and a grader (code, an LLM-as-judge, or a human). It runs against your prompts, your retrieval, your tools, in your stack. Its result is an estimate with uncertainty, not a single anointed number — an eval is a runnable experiment, not a vibe.

The relationships nest. A benchmark uses metrics and is often built out of many tests. An eval uses metrics and may include tests as its cheapest grading layer. But the jobs are different: a benchmark answers "which model is generally better?", while an eval answers "is my system good enough to ship?" Those are not the same question, and the difference is where most of the trouble starts.
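A minimal sketch of how the pieces nest, assuming a toy system under test. Everything named here (`toy_system`, `exact_match_rate`, the `Eval` class, the three-item dataset, the 0.9 threshold) is hypothetical and made up for illustration; none of it comes from the codex.

```python
from dataclasses import dataclass
from typing import Callable


def toy_system(question: str) -> str:
    """Hypothetical stand-in for your real prompts + retrieval + tools."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(question, "I don't know")


# A test: one deterministic assertion about one unit of behavior.
def test_addition() -> None:
    assert toy_system("2 + 2") == "4"


# A metric: one measurement of one run, with no opinion on what counts as good.
def exact_match_rate(outputs: list[str], expected: list[str]) -> float:
    hits = sum(o == e for o, e in zip(outputs, expected))
    return hits / len(expected)


# An eval: dataset + success criterion + grader, run against your system.
@dataclass
class Eval:
    dataset: list[tuple[str, str]]       # (input, expected) pairs
    grader: Callable[[str, str], bool]   # here a code grader; could be a judge or a human
    threshold: float                     # the success criterion (assumed value)

    def run(self, system: Callable[[str], str]) -> tuple[float, bool]:
        results = [self.grader(system(q), want) for q, want in self.dataset]
        pass_rate = sum(results) / len(results)
        return pass_rate, pass_rate >= self.threshold


my_eval = Eval(
    dataset=[("2 + 2", "4"), ("capital of France", "Paris"), ("capital of Peru", "Lima")],
    grader=lambda got, want: got == want,
    threshold=0.9,
)
pass_rate, good_enough = my_eval.run(toy_system)
```

The test passes and the metric is just a number. Only the eval combines a dataset, a grader, and a bar to say whether this toy system clears your definition of done (here it does not, since it misses one of three items).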

Side by side

|  | Metric | Test | Benchmark | Eval |
| --- | --- | --- | --- | --- |
| What it is | One measurement | A deterministic assertion | Standardized dataset + metric + protocol | Dataset + success criterion + grader |
| Scope | A single property | One unit of behavior | General capability, many models | Your application, end to end |
| Answers | "How much of X?" | "Right or wrong?" | "Which model is generally better?" | "Is my system good enough to ship?" |
| Result shape | A number | Binary, exact-match | A ranked score (often one number) | An estimate with error bars |
| Grader | n/a (it is the number) | Code assertion | Fixed harness | Code, LLM-judge, or human, chosen per criterion |
| Owner | Whoever reports it | You | A lab or community | You |
| Is it your spec? | No | Part of it | No | Yes |
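To make "an estimate with error bars" concrete, here is a small sketch that turns a raw pass count into an interval. It uses the Wilson score interval, which is one standard choice for a binomial proportion and not something the codex prescribes; the function name and the 45-of-50 figures are illustrative assumptions.

```python
import math


def wilson_interval(passes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a pass rate; z=1.96 gives roughly 95% coverage."""
    if n == 0:
        raise ValueError("need at least one graded example")
    p = passes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical run: 45 of 50 examples passed.
low, high = wilson_interval(passes=45, n=50)
print(f"pass rate {45 / 50:.2f}, interval [{low:.2f}, {high:.2f}]")
```

With only 50 examples, a 90% pass rate is compatible with a true rate anywhere from roughly 0.79 to 0.96, which is why a single headline number undersells the uncertainty.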

The trap: shipping on a benchmark score

The single most expensive mix-up is treating a benchmark number as if it were your eval. It feels reasonable — the model that tops the leaderboard should be the best choice for your app, right? — and it is wrong often enough to be dangerous. Two failure modes, both documented at length in the codex's Part VIII, explain why.

Construct validity. A benchmark named for a capability frequently doesn't measure that capability. A review of hundreds of LLM benchmarks found pervasive gaps between the construct in the title ("safety", "robustness", "reasoning") and what the score actually rewards. And no finite, fixed test set can stand in for the open-ended behavior your product needs — a closed slice of inputs is a closed spec. A high benchmark score answers a question that is adjacent to yours, not identical to it.

The leaderboard illusion. Public benchmarks are optimization targets, and "when a measure becomes a target, it ceases to be a good measure." Contamination is the default rather than the exception — surveys find that essentially all public target-answer benchmarks leak into pretraining data, so a chunk of any headline score can be memorization rather than reasoning. Leaderboards can be gamed through private multi-variant testing and best-of-N publication, and even small amounts of leaderboard-distribution data can buy large relative gains. The codex catalogs cases where de-contaminating a coding benchmark cut a system's "skill" by most of its apparent value, and where the field's flagship agentic-coding benchmark was effectively retired as a frontier signal once its gains were traced back to training-time exposure.
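One common heuristic for spotting contamination is verbatim n-gram overlap between an eval item and text the model may have seen. The sketch below assumes you have such text on hand, which is often not the case for pretraining data; the function names, the whitespace tokenization, and the choice of `n` are all illustrative assumptions rather than a method taken from the codex.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercased, whitespace-tokenized n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_item: str, corpus_docs: list[str], n: int = 8) -> float:
    """Fraction of the item's n-grams that appear verbatim in any corpus document."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0  # item shorter than n tokens: nothing to compare
    corpus_grams: set[tuple[str, ...]] = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(item_grams & corpus_grams) / len(item_grams)


# Hypothetical corpus snippet and two eval items.
suspect_corpus = ["quiz: what is the capital of france answer paris"]
leaked = overlap_fraction("What is the capital of France", suspect_corpus, n=3)
fresh = overlap_fraction("Name the largest city in Peru", suspect_corpus, n=3)
```

A high overlap is a reason to distrust the item, not proof of memorization, and a low overlap does not rule out paraphrased leakage.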

Why this produces false confidence

A benchmark tells you a model is generally strong. It does not tell you that your prompts, your retrieved context, and your tools combine into a system your users will accept. The number is real; the inference from it to your product is the broken part. Treat a benchmark as a hint about which models to evaluate, never as evidence that your app works.

What an eval is for your app — and why it's the spec

An eval is the only one of the four that is built from your reality. You assemble a dataset of real or representative inputs, you write down the criteria that define a good response for your users, and you pick a grader for each criterion: a cheap code assertion where the answer is verifiable, a validated LLM-judge where the quality is subjective, a human where neither will do. That bundle — data, criteria, grader — is a runnable, executable description of what "done" means for your application.
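The "grader per criterion" idea can be sketched as a mapping from criterion name to grading function. The criteria below are hypothetical, and `stub_tone_judge` is only a placeholder so the example runs: a real LLM-judge would call a model and would need to be validated against human labels first. Human review is left out of the sketch.

```python
import json
from typing import Callable

Grader = Callable[[str], bool]


def parses_as_json(output: str) -> bool:
    """Code grader: verifiable, deterministic, cheap."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def stub_tone_judge(output: str) -> bool:
    """Placeholder for a validated LLM-judge (assumption: a keyword check
    stands in for the model call purely so this sketch is runnable)."""
    return "sorry" not in output.lower()


# One grader per criterion, chosen by how verifiable the criterion is.
criteria: dict[str, Grader] = {
    "valid_json": parses_as_json,
    "confident_tone": stub_tone_judge,
}


def grade(output: str) -> dict[str, bool]:
    return {name: grader(output) for name, grader in criteria.items()}


report = grade('{"answer": "Paris"}')
```

Keeping the results per criterion, rather than collapsing them into one score, makes it easier to see which part of "done" a given output misses.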

That is exactly what a specification is. The reason the eval, and not the benchmark or the lone metric, becomes the spec under eval-driven development (EDD) is that it is the only artifact that is both yours and runnable. A benchmark is runnable but not yours. A written PRD is yours but not runnable, so it drifts from the code. An eval is yours and runnable, so the AI, or the human, can iterate against it until it passes, and "does it pass the evals?" replaces "does it look right?"

One honest caveat that keeps the eval from becoming a benchmark in disguise: you don't write the whole eval up front. The criteria are discovered by grading real outputs and doing error analysis — write evaluators for the failures you actually observe, not the ones you imagine — so the spec and the eval co-evolve. There's more on that loop in eval-driven development vs. TDD and in the practical mechanics of writing evals for an AI coding agent.
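A rough sketch of the error-analysis step, assuming you have already read a batch of real outputs and hand-labeled each failure. The labels and records below are invented for illustration.

```python
from collections import Counter

# Hypothetical hand-labeled failures from reviewing real outputs.
observed_failures = [
    {"id": 1, "label": "wrong_date_format"},
    {"id": 2, "label": "hallucinated_citation"},
    {"id": 3, "label": "wrong_date_format"},
    {"id": 4, "label": "wrong_date_format"},
    {"id": 5, "label": "too_verbose"},
]

# Tally failure modes so the most frequent one gets an evaluator first.
counts = Counter(f["label"] for f in observed_failures)
for label, count in counts.most_common():
    print(f"{count:>2}  {label}")

top_label, top_count = counts.most_common(1)[0]
```

The tally points at the evaluator worth writing next (here, a date-format check), which is how the spec grows out of observed failures instead of imagined ones.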

Which one do I reach for?

  1. Reach for a test when the thing you care about is verifiable and deterministic — the total matches, the schema validates, the migration runs. Tests are the first and cheapest layer; catch the obvious cases here before you pay for anything fancier.
  2. Reach for a metric when you need to quantify one property of a run — but never report it alone. Pair it with the criterion that says what counts as good, and remember a metric is an ingredient in an eval, not a verdict by itself.
  3. Reach for a benchmark when you're choosing between models and want a directional, cross-model hint about general capability. Read it skeptically, assume contamination, and confirm on your own held-out data before adopting anything on its leaderboard standing alone.
  4. Reach for an eval when you need to know whether your system is good enough to ship, gate in CI, or guard against regressions. This is the one that becomes your spec. Everything else feeds into it.
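As a sketch of what "gate in CI" and "guard against regressions" might look like, the function below checks an eval's pass rate against an absolute bar and against the last accepted baseline. The name `gate`, the 0.90 bar, and the 0.02 tolerance are assumptions to tune for your own application.

```python
def gate(
    pass_rate: float,
    baseline: float,
    min_pass_rate: float = 0.90,  # assumed ship bar
    max_drop: float = 0.02,       # assumed tolerated regression
) -> list[str]:
    """Return reasons to block the release; an empty list means the gate passes."""
    problems = []
    if pass_rate < min_pass_rate:
        problems.append(f"pass rate {pass_rate:.2f} is below the bar {min_pass_rate:.2f}")
    if baseline - pass_rate > max_drop:
        problems.append(f"regression: {baseline:.2f} -> {pass_rate:.2f}")
    return problems


# Hypothetical runs: one clean, one that clears the bar but regresses.
clean = gate(pass_rate=0.93, baseline=0.94)
regressed = gate(pass_rate=0.91, baseline=0.95)
```

In a CI job you would fail the build whenever the returned list is non-empty. Since a pass rate is an estimate, a stricter variant could compare the lower end of a confidence interval to the bar instead of the point value.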

The short version: a metric is a number, a test is an assertion, a benchmark ranks models, and an eval is your spec. Keep them separate in your head and a lot of false confidence quietly disappears.

Grounded in the EDD codex — esp. Part I (the metric / test / benchmark / eval taxonomy and pass@k) and Part VIII (construct validity, contamination, the leaderboard illusion, and why a green suite can still ship a broken product).