Eval-Driven Development

What is eval-driven development?

Eval-driven development is the practice of using evals — not vibes — as the executable spec and the guardrail for AI-assisted software.

By Brenn Hill · Updated June 2026

You define evaluations — automated checks, often assertion-based or LLM-graded — before and as you build with AI, and the AI iterates until the evals pass. The evals become the executable specification and the guardrail. "Does it pass the evals?" replaces "does it look right?"

It is the AI-era successor to test-driven development. Where TDD uses deterministic unit tests to drive deterministic code, EDD uses evals — which can grade the non-deterministic output of an LLM and the behavior of an agent — to drive AI-assisted and AI-agent development. The shape is the same as TDD: write the check, then build until it passes. What changes is what a "check" can be.

The bottleneck moved

For most of software history, writing code was the slow part. AI changed the ratio: an agent can produce a large change in seconds, and another right after it. When generation is cheap and fast, the binding constraint stops being "can we write it?" and becomes "can we tell whether what it wrote is any good?" Answering that question is the verifier's job, and having a verifier is what separates a loop from a vibe.

Eval-driven development is the discipline of building that verifier deliberately. The generator (the model, the prompt, the tools) gets the attention; the verifier is what makes the output trustworthy enough to keep. And there is a real asymmetry working in your favor: verifying a solution is generally easier than producing it, which is exactly why a modest eval suite can govern a much more capable generator.

Eval vs test vs benchmark

A metric is one measurement (accuracy, exact match). A benchmark is a standardized dataset + metric + protocol for comparing models (MMLU, SWE-bench). A test is a deterministic assertion. An eval is your application-specific check (a dataset, a success criterion, and a grader) that says whether your system does the job. A high benchmark score does not mean your product passes its evals. EDD is about evals.

What counts as an eval

An eval is a runnable experiment, not a single number. It has three parts:

A dataset — the inputs you care about (drawn from real usage and real failures, not imagined cases).
A success criterion — what "good" means for each input.
A grader — how you decide pass or fail. Graders come in three flavors, in order of preference where they apply:

Code / execution-based. Run the output and check it — the unit test passes, the function returns the right value, the page renders. Fastest, most reliable, hardest to game. For code, the tests are the spec.
LLM-as-judge. Where no deterministic check exists (tone, helpfulness, faithfulness to a source), a strong model can grade against a rubric and reach roughly human-level agreement — if you validate it against human labels, control for length and position bias, and reserve it for subjective quality rather than verifiable correctness.
Human review. The ground truth the other two are calibrated against — expensive, so you spend it on building and checking the graders, not on every run.
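The three parts above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `Case`, `Result`, `run_eval`, `exact_match_grader`, and `toy_system` are hypothetical, and `toy_system` is a stand-in for whatever model or agent is under test.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Case:
    input: str     # one dataset entry: what the system under test receives
    expected: str  # the success criterion for this input


@dataclass(frozen=True)
class Result:
    case: Case
    output: str
    passed: bool


def exact_match_grader(output: str, case: Case) -> bool:
    """Code-based grader: pass only if the normalized output equals the expected value."""
    return output.strip().lower() == case.expected.strip().lower()


def run_eval(
    system: Callable[[str], str],
    dataset: list[Case],
    grader: Callable[[str, Case], bool],
) -> list[Result]:
    """Run every case through the system and grade each output."""
    results = []
    for case in dataset:
        output = system(case.input)
        results.append(Result(case, output, grader(output, case)))
    return results


def pass_rate(results: list[Result]) -> float:
    """Fraction of cases that passed."""
    return sum(r.passed for r in results) / len(results)


def toy_system(text: str) -> str:
    """Hypothetical stand-in for the model or agent being evaluated."""
    return text.upper()


# Illustrative cases only; a real dataset would come from real usage and failures.
DATASET = [Case("paris", "PARIS"), Case("rome", "Rome"), Case("oslo", "bergen")]

if __name__ == "__main__":
    for r in run_eval(toy_system, DATASET, exact_match_grader):
        print(r.case.input, "->", r.output, "PASS" if r.passed else "FAIL")
```

Keeping the grader as a plain function makes it easy to swap a code-based check for an LLM judge later without touching the dataset or the runner.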

The loop

EDD in practice is a flywheel, and it does not start by writing evals for failures you imagine. It starts by looking at your data:

Error analysis first. Read real outputs. Categorize how they fail. The eval suite should emerge from observed failure modes — write evaluators for errors you discover, not errors you imagine.
Turn each failure into a cheap assertion. A regex that blocks a leaked ID, a check that the JSON parses, a unit test that must pass. Catch the obvious 80% deterministically before paying for a judge.
Add LLM-judge evals for what code can't grade — and validate the judge against your own labels until it agrees.
Run the suite as a gate in CI. Block the change when scores drop. Keep regression evals near 100% pass; let capability evals start low as bets on what's now possible.
Close the loop with production. Offline evals make you fast; online evals on real traffic catch drift and novel failures. Yesterday's production failure is tomorrow's golden-set case.
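Steps 2 and 4 of the loop above might look like the following sketch. The ID pattern (`cust_` followed by eight hex digits), the function names, and the baseline value are all assumptions made for illustration; substitute the failure modes you actually observed.

```python
import json
import re

# Assumed format of an internal ID that must never appear in output (hypothetical).
LEAKED_ID = re.compile(r"\bcust_[0-9a-f]{8}\b")


def no_leaked_id(output: str) -> bool:
    """Cheap assertion: the output contains no internal customer ID."""
    return LEAKED_ID.search(output) is None


def parses_as_json(output: str) -> bool:
    """Cheap assertion: the output is syntactically valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


ASSERTIONS = [no_leaked_id, parses_as_json]


def failed_checks(output: str) -> list[str]:
    """Names of the deterministic checks this output fails (empty list means clean)."""
    return [check.__name__ for check in ASSERTIONS if not check(output)]


def gate(regression_pass_rate: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True if the change may merge: the regression score has not dropped."""
    return regression_pass_rate >= baseline - tolerance


if __name__ == "__main__":
    print(failed_checks('{"status": "ok"}'))
    print(failed_checks("customer cust_0a1b2c3d said hi"))
    print(gate(regression_pass_rate=0.98, baseline=1.0))
```

In a CI job, the script would typically exit non-zero when `gate` returns `False` so that the pipeline blocks the change. Capability evals would be reported alongside but kept out of the gate until they stabilize.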

A small worked example

Suppose an agent files expense reports from receipt images. A first eval set, built from a dozen real receipts, might gate on:

Check	Grader	Why
Total matches the receipt	Code (exact numeric)	Verifiable; never delegate to a judge.
Date is a valid date ≤ today	Code (assertion)	Cheap, deterministic.
Category is reasonable for the merchant	LLM-judge (rubric)	Subjective; validate against human labels.
No hallucinated line items	LLM-judge (grounded)	Every line must trace to the image.
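The two code-graded rows in the table could be written roughly as below. This is a sketch under stated assumptions: amounts arrive as decimal strings, dates arrive in ISO format (YYYY-MM-DD), and the function names `total_matches` and `date_is_valid` are hypothetical. The two LLM-judge rows are deliberately left out, since they need a validated judge rather than an assertion.

```python
from datetime import date
from decimal import Decimal, InvalidOperation


def total_matches(reported: str, receipt_total: str) -> bool:
    """Exact numeric check. Decimal avoids float rounding, so "42.50" equals "42.5"."""
    try:
        return Decimal(reported) == Decimal(receipt_total)
    except InvalidOperation:
        # Unparseable amounts count as a failure, not as an error to skip.
        return False


def date_is_valid(iso_date: str, today: date) -> bool:
    """The date must parse as a real calendar date and must not be in the future."""
    try:
        parsed = date.fromisoformat(iso_date)
    except ValueError:
        return False
    return parsed <= today


if __name__ == "__main__":
    today = date.today()
    print(total_matches("42.50", "42.5"))
    print(date_is_valid("2026-02-30", today))
```

Passing `today` in as a parameter, rather than reading the clock inside the check, keeps the grader deterministic when the suite is re-run later.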

The agent iterates until the suite is green. Crucially, you measure reliability, not a flattering peak. If attempts are independent, an agent that succeeds 70% of the time looks like ~97% if you report "passes at least once in three tries" (pass@k with k = 3, which is 1 − 0.3³) but only ~34% if you ask whether it passes all three (pass^k, which is 0.7³). For something you ship, pass^k is the honest number.
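The arithmetic behind those two numbers is short enough to check directly. The sketch below assumes each attempt is independent and succeeds with the same probability `p`; real agents may violate that assumption, in which case both figures should be estimated from repeated runs instead. The function names are illustrative.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_all_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (written pass^k above)."""
    return p ** k


if __name__ == "__main__":
    p, k = 0.7, 3
    print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.973
    print(f"pass^{k} = {pass_all_k(p, k):.3f}")  # 0.343
```

The gap widens as `k` grows: the optimistic number climbs toward 1 while the strict one decays toward 0, which is why the choice of metric matters more for long-running agents.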

From vibe check to eval suite

Teams climb a predictable ladder: subjective spot-checks ("looks good") → deterministic code checks → a single LLM judge → a calibrated, statistically-read suite with human sampling and regression gates. Each rung scales further than the last; the vibe check is the only one that doesn't scale at all.

What EDD is not

Not "write every eval before any code." The spec is discovered by grading real outputs — you can't fully specify the criteria until you've seen how the system fails. Evals and the spec co-evolve.
Not a green-dashboard guarantee. Every eval is a Goodhart target. Contamination, saturation, gaming, and style-biased judges all mean an all-green suite can still ship a broken product. Evals are necessary, not sufficient — pair them with held-out tests and real-world feedback.
Not a pile of generic 1–5 scores. Metric sprawl and off-the-shelf "helpfulness" judges create false confidence. Prefer binary pass/fail with a written critique and product-specific failure modes.
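One possible shape for a binary verdict with a written critique, together with a basic check of how often a judge agrees with human labels, is sketched below. The `Verdict` fields, the failure-mode strings, and the function names are hypothetical. Raw agreement is the simplest possible statistic; a chance-corrected measure may be more appropriate when most cases pass.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    passed: bool        # binary outcome instead of a 1-5 score
    critique: str       # written reason, so a disagreement can be inspected
    failure_mode: str   # product-specific category, or "" when the case passed


def agreement_rate(judge: list[Verdict], human: list[bool]) -> float:
    """Fraction of cases where the judge's pass/fail matches the human label."""
    if not judge or len(judge) != len(human):
        raise ValueError("need one human label per judge verdict")
    return sum(v.passed == h for v, h in zip(judge, human)) / len(judge)


def disagreements(judge: list[Verdict], human: list[bool]) -> list[tuple[int, str]]:
    """Index and critique for each case where judge and human differ."""
    return [(i, v.critique) for i, (v, h) in enumerate(zip(judge, human)) if v.passed != h]


if __name__ == "__main__":
    # Illustrative data only, not real results.
    judge = [
        Verdict(True, "Category fits the merchant.", ""),
        Verdict(False, "Line item not visible on receipt.", "hallucinated_item"),
        Verdict(True, "Category fits the merchant.", ""),
        Verdict(True, "Looks plausible.", ""),
    ]
    human = [True, False, True, False]
    print(agreement_rate(judge, human))
    print(disagreements(judge, human))
```

Reading the critiques for the disagreeing cases is usually where the rubric gets fixed, which is the point of keeping the critique next to the verdict.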

The one-line version

Eval-driven development is using evals, not vibes, as the spec and the guardrail for AI-assisted software. It is how you let an AI agent change a codebase without breaking it: the evals are what make code safely modifiable by AI.

Want the evidence behind all of this? The research codex assembles 130+ annotated, cited sources across eight parts — from the statistics of evals to coding-agent benchmarks, LLM-as-judge biases, and the ways evals mislead. Start with EDD vs TDD or how to write evals for a coding agent.

This page is the canonical definition. The concept is free to use, adopt, and cite. Corrections and worked examples are welcome via GitHub.