Comparison
Eval-driven development vs. test-driven development
Eval-driven development borrows TDD's rhythm — write the check first, then build until it passes — and applies it to systems where the output is non-deterministic. The rhythm carries over. What changes is what a "check" can be, and how you read its result.
The shared shape
Test-driven development, as Kent Beck framed it, is a tight loop: write a failing test, make it pass with the simplest change, refactor. The test is an executable specification written before the code. EDD keeps that spine. You define an eval — a dataset, a success criterion, and a grader — and the AI iterates until the eval passes. "Evals are the new unit tests" is the slogan, and it is half right.
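The three parts of an eval named above can be sketched as a small data structure. This is a minimal illustration, not a reference implementation: `Case`, `Eval`, `exact_match`, and `toy_system` are hypothetical names, the 0.6 threshold is arbitrary, and the toy system stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Case:
    prompt: str
    expected: str


@dataclass
class Eval:
    dataset: list[Case]                  # inputs plus reference answers
    grader: Callable[[str, Case], bool]  # scores one output against one case
    threshold: float                     # success criterion: minimum pass rate

    def run(self, system: Callable[[str], str]) -> tuple[float, bool]:
        """Grade every case and compare the pass rate to the threshold."""
        passed = sum(self.grader(system(case.prompt), case) for case in self.dataset)
        rate = passed / len(self.dataset)
        return rate, rate >= self.threshold


def exact_match(output: str, case: Case) -> bool:
    """The simplest code-based grader: strict string equality."""
    return output.strip() == case.expected


def toy_system(prompt: str) -> str:
    """Stand-in for a model call (assumption: a real system would call an LLM)."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


demo_eval = Eval(
    dataset=[Case("2+2", "4"), Case("capital of France", "Paris"), Case("3*3", "9")],
    grader=exact_match,
    threshold=0.6,
)
rate, ok = demo_eval.run(toy_system)
print(f"pass rate = {rate:.2f}, passes eval = {ok}")
```

Keeping the grader as a plain callable means the same structure works whether the grader is an assertion, an LLM judge, or a queue for human review.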
Several things genuinely carry over from TDD:
- Spec-first discipline. The check encodes intent before implementation, so you are building toward a definition of done rather than a vibe.
- Fast feedback. An automated check you can run constantly is what makes iteration cheap — the whole point of both practices.
- Regression safety. Once a behavior is captured, the suite stops it from silently breaking later. In EDD these are "regression evals," kept near 100% pass.
- CI gates. Both run in the pipeline and block a change that drops below threshold.
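As a rough sketch of the CI gate described in the last bullet, the check below compares each suite's pass rate with a floor and reports which suites fell short. The suite names and threshold values are illustrative assumptions, not figures from the source; the only grounded detail is that regression evals are held near 100%.

```python
def ci_gate(pass_rates: dict[str, float], thresholds: dict[str, float]) -> list[str]:
    """Return the names of suites whose pass rate fell below its threshold.

    A suite with no reported pass rate counts as 0.0, so a missing run blocks
    the change instead of slipping through.
    """
    return [
        name
        for name, floor in thresholds.items()
        if pass_rates.get(name, 0.0) < floor
    ]


# Hypothetical thresholds: regression evals held near 100%, capability evals lower.
thresholds = {"regression": 0.99, "capability": 0.70}
pass_rates = {"regression": 1.00, "capability": 0.64}

failures = ci_gate(pass_rates, thresholds)
exit_code = 1 if failures else 0  # in a pipeline, pass this to sys.exit()
print(f"blocked by: {failures}, exit code {exit_code}")
```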
Where the analogy breaks
TDD was designed for deterministic software graded by exact match: the function returns 4 or it doesn't. LLM output is probabilistic, and a single response can be simultaneously accurate but too long, or well-formatted but incomplete. That difference cascades:
| | Test-driven development | Eval-driven development |
|---|---|---|
| System under test | Deterministic code | Non-deterministic model / agent behavior |
| Result | Binary pass/fail, exact match | Often graded across multiple dimensions; a statistical estimate |
| Graders | Code assertions | Code plus LLM-as-judge plus human review |
| Determinism of the check | The test itself is deterministic | An LLM-judge grader is itself non-deterministic and biased — it must be validated |
| Reading the result | Green = done | Run many times; report reliability (pass^k) and error bars, not one run |
| When the spec is written | Mostly up front | Discovered by grading real outputs ("criteria drift") |
| Flakiness | A bug to eliminate | Inherent — managed with sampling and thresholds, not eliminated |
Three differences that matter most
1. The grader can be a model, and that model is fallible. Where no deterministic check exists (tone, helpfulness, faithfulness), you grade with an LLM judge. A strong judge can reach roughly human-level agreement with human raters, but it carries position, verbosity, and self-preference bias, and it can land near random on objectively verifiable correctness. In TDD the test is the ground truth; in EDD you often have to validate the grader itself against human labels before you can trust a green run.
2. A pass is statistical, not absolute. The same input can pass on one run and fail on the next. So an eval result is an estimate: run it multiple times, report a range, and distinguish capability (does it pass at least once in k attempts: pass@k) from reliability (does it pass on all k attempts: pass^k). An agent that passes 70% of individual runs scores ~97% at pass@3 but only ~34% at pass^3. TDD never had to make that distinction.
3. You can't write all the evals first. The strict "test-first" move doesn't fully transfer. The practitioners who popularized evals are explicit about it: write evaluators for the errors you discover, not the errors you imagine. You need to grade real outputs to learn what your criteria even are — so the eval suite is grown from error analysis, not authored up front. The spec and the evals co-evolve.
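The arithmetic in point 2 can be checked directly. This sketch assumes each attempt is independent and has the same per-run pass probability `p`, which is a simplification. The function names are hypothetical, and the standard-error helper is one simple way to attach a range to an observed pass rate, not a method the source prescribes.

```python
import math


def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent attempts passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Chance that all k independent attempts pass (written pass^k)."""
    return p ** k


def standard_error(passes: int, runs: int) -> float:
    """Binomial standard error of an observed pass rate."""
    rate = passes / runs
    return math.sqrt(rate * (1.0 - rate) / runs)


p = 0.70
capability = pass_at_k(p, 3)
reliability = pass_hat_k(p, 3)
print(f"pass@3 = {capability:.3f}")   # prints "pass@3 = 0.973"
print(f"pass^3 = {reliability:.3f}")  # prints "pass^3 = 0.343"

se = standard_error(7, 10)
print(f"7 of 10 runs passed: 0.70 +/- {se:.2f}")
```

With only ten runs the standard error is roughly 0.14, which is why a single run, or even a handful, says little about reliability.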
So do evals replace unit tests? No.
Deterministic tests are not obsolete under EDD — they are the first and cheapest layer of the eval stack. Use a code assertion wherever the thing you care about is verifiable (the total matches, the JSON parses, the migration runs), and catch the obvious 80% before you ever pay for a judge. The pattern that grades AI-written code at scale — running real test suites where the bug-fix tests must pass and the regression tests must stay green — is just unit testing applied to a patch. Evals extend testing to the things tests can't express: behavior, quality, and grounding.
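The patch-grading pattern mentioned above reduces to two conditions on a test run. The sketch below assumes you already have a mapping from test name to pass/fail after applying the patch; `grade_patch` and the test names are hypothetical, and running the suite itself is out of scope here.

```python
def grade_patch(
    results: dict[str, bool],
    bug_fix_tests: list[str],
    regression_tests: list[str],
) -> bool:
    """Accept a patch only if it fixes the bug and breaks nothing else.

    `results` maps test name -> passed, measured after the patch is applied.
    A test missing from `results` is treated as a failure.
    """
    bug_fixed = all(results.get(name, False) for name in bug_fix_tests)
    nothing_broken = all(results.get(name, False) for name in regression_tests)
    return bug_fixed and nothing_broken


# Hypothetical test names for illustration.
bug_fix_tests = ["test_total_matches_line_items"]
regression_tests = ["test_json_parses", "test_migration_runs"]

good_patch = {
    "test_total_matches_line_items": True,
    "test_json_parses": True,
    "test_migration_runs": True,
}
bad_patch = {**good_patch, "test_migration_runs": False}

print(grade_patch(good_patch, bug_fix_tests, regression_tests))  # prints "True"
print(grade_patch(bad_patch, bug_fix_tests, regression_tests))   # prints "False"
```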
The rule of thumb:
- Use a test when the answer is verifiable and deterministic.
- Use a code-based eval when you can execute and check, even if the input space is large.
- Reach for an LLM-judge only for subjective quality you can't grade with code — and validate it first.
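The last rule says to validate the judge first. One common way to do that, offered here as an assumption and not as the source's prescribed method, is to compare the judge's verdicts with human labels on the same outputs and report both raw agreement and Cohen's kappa, which corrects for agreement expected by chance. The labels below are made up for illustration.

```python
def agreement(judge: list[bool], human: list[bool]) -> float:
    """Fraction of items where the judge and the human gave the same verdict."""
    matches = sum(j == h for j, h in zip(judge, human))
    return matches / len(human)


def cohens_kappa(judge: list[bool], human: list[bool]) -> float:
    """Chance-corrected agreement for two raters giving pass/fail labels."""
    n = len(human)
    observed = agreement(judge, human)
    judge_pass = sum(judge) / n
    human_pass = sum(human) / n
    # Agreement expected if both raters labeled at random with these base rates.
    expected = judge_pass * human_pass + (1 - judge_pass) * (1 - human_pass)
    if expected == 1.0:  # both raters gave one identical label to every item
        return 1.0
    return (observed - expected) / (1 - expected)


# Illustrative labels only: True means "acceptable output".
human = [True, True, True, False, False, True, False, True, True, False]
judge = [True, True, False, False, False, True, True, True, True, False]

raw = agreement(judge, human)
kappa = cohens_kappa(judge, human)
print(f"agreement = {raw:.2f}, kappa = {kappa:.2f}")  # prints "agreement = 0.80, kappa = 0.58"
```

Raw agreement alone can flatter a judge when most outputs pass, so reporting a chance-corrected number alongside it gives a more honest picture.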
Bottom line
Eval-driven development is TDD's successor for the AI era, not a replacement for testing. Keep the discipline — check first, build to pass, gate in CI, guard against regressions — and add three things the probabilistic world demands: graders that can be models, results read as statistics, and a spec you discover by looking at real failures. Tests tell you the code is right; evals tell you the behavior is good enough to keep.
Grounded in the EDD codex — esp. Part VI (the practice and the TDD analogy), Part I (eval statistics, pass@k), Part II (LLM-as-judge bias), and Part III (execution-based grading of code).