Eval-Driven Development
What is eval-driven development?
Eval-driven development is the practice of using evals — not vibes — as the executable spec and the guardrail for AI-assisted software.
You define evaluations (automated checks, often assertion-based or LLM-graded) before you build with AI and keep refining them as you build, and the AI iterates until the evals pass. The evals become the executable specification and the guardrail. "Does it pass the evals?" replaces "does it look right?"
It is the AI-era successor to test-driven development. Where TDD uses deterministic unit tests to drive deterministic code, EDD uses evals — which can grade the non-deterministic output of an LLM and the behavior of an agent — to drive AI-assisted and AI-agent development. The shape is the same as TDD: write the check, then build until it passes. What changes is what a "check" can be.
The bottleneck moved
For most of software history, writing code was the slow part. AI changed the ratio: an agent can produce a large change in seconds, and another right after it. When generation is cheap and fast, the binding constraint stops being "can we write it?" and becomes "can we tell whether what it wrote is any good?" Answering that question is the verifier's job, and the verifier is what separates a loop from a vibe.
Eval-driven development is the discipline of building that verifier deliberately. The generator (the model, the prompt, the tools) gets the attention; the verifier is what makes the output trustworthy enough to keep. And there is a real asymmetry working in your favor: verifying a solution is generally easier than producing it, which is exactly why a modest eval suite can govern a much more capable generator.
What counts as an eval
An eval is a runnable experiment, not a single number. It has three parts:
- A dataset — the inputs you care about (drawn from real usage and real failures, not imagined cases).
- A success criterion — what "good" means for each input.
- A grader — how you decide pass or fail. Graders come in three flavors, in order of preference where they apply:
  - Code / execution-based. Run the output and check it: the unit test passes, the function returns the right value, the page renders. Fastest, most reliable, hardest to game. For code, the tests are the spec.
  - LLM-as-judge. Where no deterministic check exists (tone, helpfulness, faithfulness to a source), a strong model can grade against a rubric and agree with human graders about as often as humans agree with each other, provided you validate it against human labels, control for length and position bias, and reserve it for subjective quality rather than verifiable correctness.
  - Human review. The ground truth the other two are calibrated against. It is expensive, so you spend it on building and checking the graders, not on every run.
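As a rough illustration of the three parts, here is a minimal sketch of an eval with a code-based grader. Everything in it is hypothetical: `EvalCase`, `run_eval`, and the toy `slugify` function (standing in for AI-written code under test) are names invented for this example, not part of any library.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class EvalCase:
    input: str                     # one item from the dataset
    criterion: str                 # what "good" means for this input, in words
    grader: Callable[[str], bool]  # code-based grader: output -> pass/fail


def slugify(title: str) -> str:
    """Hypothetical system under test; stands in for model- or agent-written code."""
    return "-".join(title.lower().split())


def run_eval(system: Callable[[str], str], cases: List[EvalCase]) -> List[Tuple[str, bool]]:
    """Run every case through the system and record pass/fail per input."""
    return [(case.input, case.grader(system(case.input))) for case in cases]


# Assumed dataset: in practice these inputs would come from real usage and failures.
CASES = [
    EvalCase("Hello World", "lowercase words joined by hyphens",
             lambda out: out == "hello-world"),
    EvalCase("  Extra   spaces ", "no leading, trailing, or doubled hyphens",
             lambda out: out == "extra-spaces"),
]

results = run_eval(slugify, CASES)
pass_rate = sum(passed for _, passed in results) / len(results)
```

Keeping the criterion as plain text next to the grader is a deliberate choice in this sketch: when a case fails, the reader sees what "good" was supposed to mean without reverse-engineering the check.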
The loop
EDD in practice is a flywheel, and it does not start by writing evals for failures you imagine. It starts by looking at your data:
- Error analysis first. Read real outputs. Categorize how they fail. The eval suite should emerge from observed failure modes — write evaluators for errors you discover, not errors you imagine.
- Turn each failure into a cheap assertion. A regex that blocks a leaked ID, a check that the JSON parses, a unit test that must pass. Catch the obvious 80% deterministically before paying for a judge.
- Add LLM-judge evals for what code can't grade — and validate the judge against your own labels until it agrees.
- Run the suite as a gate in CI. Block the change when scores drop. Keep regression evals near 100% pass; let capability evals start low as bets on what's now possible.
- Close the loop with production. Offline evals make you fast; online evals on real traffic catch drift and novel failures. Yesterday's production failure is tomorrow's golden-set case.
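The cheap-assertion and CI-gate steps above might look something like the following sketch. The `cust_` ID format, the function names, and the baseline comparison are all assumptions made for illustration; a real suite would use the failure modes found in its own error analysis.

```python
import json
import re
from typing import Callable, List

# Hypothetical internal ID format; substitute whatever actually leaked in your data.
LEAKED_ID = re.compile(r"\bcust_[0-9a-f]{8}\b")


def no_leaked_id(output: str) -> bool:
    """Pass when the output contains no internal customer ID."""
    return LEAKED_ID.search(output) is None


def parses_as_json(output: str) -> bool:
    """Pass when the output is syntactically valid JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError:
        return False
    return True


ASSERTIONS: List[Callable[[str], bool]] = [no_leaked_id, parses_as_json]


def pass_rate(outputs: List[str]) -> float:
    """Fraction of outputs that satisfy every assertion."""
    passed = sum(all(check(o) for check in ASSERTIONS) for o in outputs)
    return passed / len(outputs)


def gate(current: float, baseline: float, tolerance: float = 0.0) -> bool:
    """Return True when the change may merge, i.e. the score has not dropped."""
    return current >= baseline - tolerance
```

In a CI job, one plausible wiring is to exit non-zero when `gate(...)` returns `False`, for example `sys.exit(0 if gate(rate, baseline) else 1)`, so the pipeline blocks the change.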
A small worked example
Suppose an agent files expense reports from receipt images. A first eval set, built from a dozen real receipts, might gate on:
| Check | Grader | Why |
|---|---|---|
| Total matches the receipt | Code (exact numeric) | Verifiable; never delegate to a judge. |
| Date is a valid date ≤ today | Code (assertion) | Cheap, deterministic. |
| Category is reasonable for the merchant | LLM-judge (rubric) | Subjective; validate against human labels. |
| No hallucinated line items | LLM-judge (grounded) | Every line must trace to the image. |
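The two code-graded rows of the table could be sketched as below. This assumes the agent emits the total as a decimal string and the date in ISO `YYYY-MM-DD` form; both the field formats and the function names are hypothetical.

```python
from datetime import date
from decimal import Decimal, InvalidOperation
from typing import Optional


def total_matches(extracted: str, expected: str) -> bool:
    """Exact numeric comparison; Decimal avoids float rounding on currency."""
    try:
        return Decimal(extracted) == Decimal(expected)
    except InvalidOperation:
        return False  # not a number at all


def date_is_valid(extracted: str, today: Optional[date] = None) -> bool:
    """Pass when the string is a real calendar date that is not in the future."""
    today = today or date.today()
    try:
        parsed = date.fromisoformat(extracted)
    except ValueError:
        return False  # malformed or impossible date, e.g. 2024-02-30
    return parsed <= today
```

Passing `today` explicitly keeps the check deterministic in tests, which matters if the same receipts are replayed as regression cases months later.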
The agent iterates until the suite is green. Crucially, you measure reliability, not a flattering peak. Assuming independent attempts, an agent that succeeds 70% of the time looks like ~97% if you report "passes at least once in three tries" (pass@k: 1 - 0.3^3 ≈ 0.973) but only ~34% if you ask whether it passes all three (pass^k: 0.7^3 ≈ 0.343). For something you ship, pass^k is the honest number.
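The arithmetic behind those two numbers can be checked in a few lines. This sketch assumes each attempt succeeds independently with the same probability `p`, which real agents only approximate; the function names are chosen for this example.

```python
def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


print(f"pass@3 = {pass_at_k(0.7, 3):.1%}")   # pass@3 = 97.3%
print(f"pass^3 = {pass_hat_k(0.7, 3):.1%}")  # pass^3 = 34.3%
```

The gap widens as `k` grows: pass@k approaches 100% while pass^k decays toward zero, which is why the choice of metric matters more for multi-step or repeated tasks.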
From vibe check to eval suite
Teams climb a predictable ladder: subjective spot-checks ("looks good") → deterministic code checks → a single LLM judge → a calibrated suite, read with statistical care and backed by human sampling and regression gates. Each rung scales further than the last; the vibe check is the only one that doesn't scale at all.
What EDD is not
- Not "write every eval before any code." The spec is discovered by grading real outputs — you can't fully specify the criteria until you've seen how the system fails. Evals and the spec co-evolve.
- Not a green-dashboard guarantee. Every eval is a Goodhart target. Contamination, saturation, gaming, and style-biased judges all mean an all-green suite can still ship a broken product. Evals are necessary, not sufficient — pair them with held-out tests and real-world feedback.
- Not a pile of generic 1–5 scores. Metric sprawl and off-the-shelf "helpfulness" judges create false confidence. Prefer binary pass/fail with a written critique and product-specific failure modes.
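One way to picture "binary pass/fail with a written critique" is as a small record per graded output, checked against human labels before the judge is trusted. The sketch below is illustrative only: the `Verdict` shape, the example labels, and the choice of Cohen's kappa as the agreement statistic are assumptions, and no model is actually called.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Verdict:
    passed: bool                 # binary outcome, not a 1-5 score
    critique: str                # written reason a human can audit
    failure_mode: Optional[str]  # product-specific label; None when passed


def agreement(judge: List[bool], human: List[bool]) -> float:
    """Fraction of items where the judge and the human give the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge: List[bool], human: List[bool]) -> float:
    """Agreement corrected for chance, for two raters and binary labels."""
    p_o = agreement(judge, human)
    p_j = sum(judge) / len(judge)  # judge's pass rate
    p_h = sum(human) / len(human)  # human's pass rate
    p_e = p_j * p_h + (1 - p_j) * (1 - p_h)  # agreement expected by chance
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use a single label")
    return (p_o - p_e) / (1 - p_e)


# Made-up labels for eight outputs, purely to exercise the functions.
judge_labels = [True, True, False, False, True, False, True, True]
human_labels = [True, True, False, True, True, False, False, True]

example = Verdict(passed=False,
                  critique="Category 'Travel' assigned to an office-supply store.",
                  failure_mode="wrong_category")
```

Raw agreement can look healthy when almost everything passes, so a chance-corrected figure is a useful second reading; which threshold counts as "agrees" is a judgment the team has to make for its own product.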
Want the evidence behind all of this? The research codex assembles 130+ annotated, cited sources across eight parts — from the statistics of evals to coding-agent benchmarks, LLM-as-judge biases, and the ways evals mislead. Start with EDD vs TDD or how to write evals for a coding agent.
This page is the canonical definition. The concept is free to use, adopt, and cite. Corrections and worked examples are welcome via GitHub.