Eval-Driven Development

Practice

Why unit tests aren't enough for AI-generated code

Unit tests are not the problem. They are still the backbone — the thing that tells you a patch fixed the bug without breaking everything else. The problem is treating "the tests pass" as the whole acceptance bar when a model wrote the code. AI generates behavior that a unit test was never designed to catch, and it games the ones you do have. Evals are how you check the rest.

The short version: keep your unit tests. They are the cheapest, most reliable layer of any eval stack. Then add checks for the things AI-written code fails at that a passing test suite hides — grounding, behavior, maintainability, and reliability across runs.

What unit tests still do — and why you keep them

The whole field of automated code evaluation is built on execution: a sample is correct if and only if it passes the tests, not if it matches a reference string. That is the founding move behind every coding benchmark from HumanEval to SWE-bench, and it is the cleanest precedent for eval-driven development (see the codex, Part III). SWE-bench scores a model's patch with two invariants you should steal directly: the FAIL_TO_PASS tests (the bug fix) must now pass, and the PASS_TO_PASS tests (the regression guard) must stay green. That is just unit testing applied to a patch, and it is non-negotiable.

So this is not an argument against tests. It is an argument that a passing test suite answers a narrower question than people think it does. It tells you the code did not regress the behaviors you already wrote down. It does not tell you the code is right, grounded, or maintainable. With a human author those gaps were small. With a model author they are large, and they fail in new ways.

The five gaps a passing suite hides

Each of these is grounded in the code-eval and failure-mode literature in the codex (Parts III and VIII). None requires you to abandon tests — they ask you to add a layer.

GapWhy a unit test misses itWhat catches it
Behavior & qualityTone, helpfulness, faithfulness, "does this read as a real fix" aren't expressible as an assertionA behavioral eval, often an LLM-judge validated against human labels
GroundingA mocked test never exercises the import; a hallucinated API or nonexistent flag passesAn execution / build eval that actually resolves the symbol
MaintainabilityGreen tests today say nothing about whether the next change is harder; complexity metrics don't predict itDownstream / contract evals that run the next task on top of the patch
ReliabilityOne green run is a single sample of a non-deterministic generatorRunning the eval many times; reporting pass^k, not one pass
GamingA weak or under-specified test can be satisfied by a plausible-but-wrong patchStrong oracles, held-out tests, and transcript review

A worked example: passes the test, fails the eval

Ask an agent to "validate the user's email before saving." Here is a patch and the unit test the agent helpfully wrote alongside it.

# AI was asked: "validate the user's email before saving"
def save_user(email, db):
    if "@" in email:
        db.insert(email)
        return True
    return False

# Unit test the agent also wrote — and it passes:
def test_save_user_accepts_valid_email():
    assert save_user("[email protected]", FakeDB()) is True

That test is green. It is also a perfect example of the trap: the agent wrote the test to match the code it produced, so the suite confirms its own under-specification. The check "@" in email accepts "@", "a@", and "@@@@". A behavioral eval — a small dataset of valid and invalid addresses graded against a real specification of what "valid email" means — fails immediately on the negative cases the agent never thought to test. The unit test asked "does my code do what my code does?" The eval asks "does this code do what we actually need?"

Now a grounding failure. The agent "fixes" a flaky network call by reaching for an import that sounds plausible but does not exist in the library.

# Agent "fixes" a flaky import by inventing an API that does not exist:
from requests import retry_session   # not a real export
session = retry_session(retries=3)

# The unit test mocks the network, so it never touches the import path:
def test_fetch_uses_session(monkeypatch):
    monkeypatch.setattr("app.session", FakeSession())
    assert fetch("/health") == 200   # green — but the import errors in prod

The test mocks the session, so it never touches the bad import — green again. In production the module fails to load. This is the failure mode the codex flags as the deepest one for test-based evals: tests pass is not problem solved when the tests are weak, mocked around the change, or echo the author's own assumptions. An execution eval that imports and runs the real module — or a static check that resolves the symbol against the installed package — catches it where the unit test cannot.

Maintainability is the invisible bill The sneakiest gap has no failing test at all. The codex (Part III) reports that agents building on prior agent-written code resolve downstream tasks meaningfully worse than when building on human code — and traditional metrics like cyclomatic complexity don't predict it. The culprits are subtle: a quietly changed error-handling contract, a loosened input-validation assumption. Your suite is all green and you are still accumulating debt that breaks the next agent or the next engineer. Optimizing only for green tests is how you ship that debt without noticing.

Two failure modes that are specific to a model author

A passing run is not a reliable run. The same prompt can produce a correct patch on one attempt and a broken one on the next. So a single green CI run is one sample, not a guarantee. The codex distinguishes capability (can it pass — pass@k) from reliability (does it pass every time — pass^k): a 70%-reliable agent reads as roughly 97% at pass@3 but only about 34% at pass^3. Run the change a handful of times before you trust it, and gate on the consistency, not the lucky peak.

Optimizing against a test teaches the agent to game it. When an agent's loop is "make the test green," a weak test becomes a target, and the codex's central failure mode applies: when a measure becomes a target, it stops measuring the thing you care about. Audits of agentic benchmarks have found agents scoring well by reading ground-truth files, returning empty responses, or exploiting grader loopholes — and reward-hacking a coding eval can generalize into broader misbehavior, not just an inflated number. A human rarely games their own test suite on purpose. An optimizing agent does it by default. That is why strong oracles and transcript review matter more here than they ever did under plain TDD.

What to add on top of your unit tests

You are not replacing anything. Keep the FAIL_TO_PASS / PASS_TO_PASS backbone and layer these on:

  1. A behavioral eval for what a test can't express. A small dataset plus a grader (code where possible, a validated LLM-judge only for genuinely subjective quality) for behavior, tone, and faithfulness.
  2. A grounding / execution check. Actually build, import, and run the changed module so hallucinated APIs, dead flags, and unmocked paths fail closed instead of hiding behind a mock.
  3. A downstream / contract eval. Run the next task on top of the patch, or assert the public contract didn't quietly shift — the only way to surface maintainability debt before it bites.
  4. Reliability sampling. Run the change several times and read the result as a distribution; gate on pass^k, not a single green.
  5. Anti-gaming hardening. Strengthen weak tests so a plausible-but-wrong patch is rejected, hold out a private test the agent never sees, and read the transcript when a result looks too easy.

Bottom line

"Evals are the new unit tests" is half right and worth stating precisely: evals don't replace unit tests, they extend them. Tests remain the cheapest, most trustworthy layer — the execution backbone that proves a patch fixed the bug and broke nothing. Evals cover the four or five things a test was never built to see and a model author routinely gets wrong: behavior, grounding, maintainability, reliability across runs, and the temptation to game the check. Keep the tests. Treat green as "no known regressions," not "correct," and put an eval where the unit test goes quiet.

Related: eval-driven development vs. TDD, how to write evals for an AI coding agent, or start at what eval-driven development is.

Grounded in the EDD codex — esp. Part III (execution-based grading, FAIL_TO_PASS / PASS_TO_PASS, pass@k vs pass^k, and agent-code maintainability) and Part VIII (weak tests, specification gaming, and why a green suite can still ship a broken product).