Eval-driven development FAQ
Short answers to the questions that come up most. For the long version, see the definition, the articles, or the codex.
What is eval-driven development?
Eval-driven development (EDD) is the practice of using evals — automated checks, assertion-based or LLM-graded — as the executable specification and the guardrail for AI-assisted and AI-agent software. You define the evals and the AI iterates until they pass. "Does it pass the evals?" replaces "does it look right?"
How is an eval different from a test?
A test is a deterministic assertion: the function returns 4 or it does not. An eval is a runnable experiment for non-deterministic output — a dataset, a success criterion, and a grader — whose result is a statistical estimate, not a single pass/fail. Tests are the cheapest kind of eval; evals extend testing to behaviour and quality that exact-match cannot express.
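To make the contrast concrete, here is a minimal sketch in Python. The `flaky_model` function is a hypothetical stand-in for a non-deterministic model call, and the 80% success rate, the one-case dataset, and the exact-match grader are illustrative assumptions rather than anything prescribed here. The standard error also assumes the trials are independent.

```python
import math
import random


def add(a, b):
    return a + b


# A test: one deterministic assertion that either holds or does not.
assert add(2, 2) == 4


def flaky_model(prompt, rng):
    """Hypothetical stand-in for a model call: right about 80% of the time."""
    return "4" if rng.random() < 0.8 else "four-ish"


def exact_match(output, expected):
    """Grader: the success criterion expressed as code."""
    return output.strip() == expected


def run_eval(dataset, grader, trials, seed=0):
    """Run every case `trials` times and return (pass_rate, standard_error)."""
    rng = random.Random(seed)
    results = [
        grader(flaky_model(case["prompt"], rng), case["expected"])
        for case in dataset
        for _ in range(trials)
    ]
    n = len(results)
    rate = sum(results) / n
    # Binomial standard error, assuming independent trials.
    stderr = math.sqrt(rate * (1 - rate) / n)
    return rate, stderr


dataset = [{"prompt": "What is 2 + 2?", "expected": "4"}]
rate, stderr = run_eval(dataset, exact_match, trials=200)
print(f"pass rate {rate:.2f} +/- {stderr:.2f}")
```

The test yields a single verdict. The eval yields an estimate with an uncertainty attached, which is why it is read statistically.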
Do I still need unit tests if I have evals?
Yes. Deterministic tests are the first and cheapest layer of an eval stack — use them wherever the answer is verifiable. Evals add coverage for the things tests cannot express: behaviour, quality, grounding, and agent trajectories. They layer; they do not replace each other.
What is the difference between an eval and a benchmark?
A benchmark is a standardized dataset, metric, and protocol for comparing models (MMLU, SWE-bench). An eval is your application-specific check on your data, prompts, and tools. A high benchmark score is not the same as your product passing its evals.
Is LLM-as-judge reliable?
A strong judge can reach roughly human-level agreement on subjective quality, but it carries position, verbosity, and self-preference biases and performs poorly on objectively verifiable correctness. Use it only for what code cannot grade. Before trusting it, validate it against human labels, randomize answer order, control for length, and pin temperature. With those controls in place, it is a useful, scalable grader.
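Two of these controls can be sketched in a few lines of Python. The `biased_judge` function is a hypothetical stand-in for a real judge call, deliberately given a length and position bias for illustration. The sketch asks the judge in both orders instead of randomizing a single order, which is a stricter variant of the same idea. It does not correct the length bias; that needs its own control.

```python
def biased_judge(first, second):
    """Hypothetical stand-in for an LLM judge; returns 'first' or 'second'.

    It prefers the longer answer and, on a length tie, the first position.
    """
    return "first" if len(first) >= len(second) else "second"


def debiased_preference(judge, a, b):
    """Ask in both orders and count a win only when the verdicts agree."""
    forward = judge(a, b)   # a shown in first position
    backward = judge(b, a)  # b shown in first position
    if forward == "first" and backward == "second":
        return "a"
    if forward == "second" and backward == "first":
        return "b"
    return "tie"  # verdict flipped with position, so it is not trusted


def agreement(judge_labels, human_labels):
    """Fraction of cases where the judge matches the human label."""
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Equal-length answers: the verdict depends only on position, so it is a tie.
print(debiased_preference(biased_judge, "abc", "xyz"))
# Validate the judge against a small set of human labels.
print(agreement(["a", "b", "tie"], ["a", "b", "b"]))
```

A low agreement score on the human-labelled set is a signal to revise the judge prompt or rubric before relying on its grades.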
How many eval cases do I need to start?
Begin with 20 to 50 cases drawn from real failures — your bug tracker, support queue, and production traces. The point is coverage of real failure modes discovered through error analysis, not a large imagined matrix.
Should I write all the evals before the feature, like TDD?
Keep the rhythm of writing the check first, but do not over-apply it. The practitioners who popularized evals are explicit: write evaluators for the errors you discover, not the ones you imagine. You learn your real criteria by grading actual outputs ("criteria drift"), so the suite grows from error analysis.
What is pass@k versus pass^k?
pass@k is the probability that at least one of k attempts succeeds; it measures capability (can it ever?). pass^k is the probability that all k attempts succeed; it measures reliability (does it every time?). Assuming independent attempts, an agent that succeeds 70% of the time looks like about 97% at pass@3 (1 - 0.3^3) but only about 34% at pass^3 (0.7^3). Ship on reliability.
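The arithmetic behind those two figures can be checked directly. This is a small sketch that assumes each attempt succeeds independently with the same probability `p`. The function names are illustrative and do not come from any particular library.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p, k):
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


print(round(pass_at_k(0.7, 3), 3))   # 0.973
print(round(pass_hat_k(0.7, 3), 3))  # 0.343
```

The gap widens as k grows: pass@k climbs toward 1 while pass^k decays toward 0, so the same agent can look strong on one metric and weak on the other.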
How do evals fit into CI/CD?
Run the eval suite as a build gate on every change and every model upgrade. Keep regression evals (behaviours that already work) near 100% pass and block the build if they break; let capability evals (harder bets) start low and track the trend without blocking.
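One way to express that gating rule is a small script the CI job runs after the evals finish. This is a sketch under assumptions: the 0.99 threshold is an illustrative reading of "near 100%", and the result lists stand in for whatever per-case pass/fail output your eval runner produces.

```python
def pass_rate(results):
    """Fraction of passing cases; an empty suite counts as 0.0 so it cannot pass silently."""
    return sum(results) / len(results) if results else 0.0


def gate(regression_results, capability_results, regression_threshold=0.99):
    """Block on regression evals; only report the capability trend."""
    regression = pass_rate(regression_results)
    capability = pass_rate(capability_results)
    return {
        "regression_pass_rate": regression,
        "capability_pass_rate": capability,  # tracked over time, never blocks
        "block_build": regression < regression_threshold,
    }


def exit_code(report):
    """Non-zero exit code fails the CI step."""
    return 1 if report["block_build"] else 0


# Illustrative data: 48 of 50 regression cases pass, 3 of 10 capability cases pass.
report = gate([True] * 48 + [False] * 2, [True] * 3 + [False] * 7)
print(report)
print("exit code:", exit_code(report))
```

In a real pipeline the script would end by passing `exit_code(report)` to `sys.exit`, and the capability pass rate would be logged somewhere the trend can be charted.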
What are regression evals?
Regression evals are evals for behaviours that already work, kept near 100% pass to catch silent backsliding when you change a prompt, upgrade a model, or update a dependency. Run them on every change and on a schedule, because providers update models under you.
Can a passing eval suite still ship a broken product?
Yes. Static evals are a finite, closed spec, and every eval is a Goodhart target — vulnerable to contamination, saturation, gaming, and style-biased judges. Treat green as "no known regressions," not "correct," and pair evals with held-out tests and real-world feedback.
How does EDD relate to TDD and BDD?
All three write the check first and build to pass it. TDD puts the check in code (deterministic units); BDD makes it readable, behavioural, and shared (specification by example); EDD extends the family to non-deterministic AI output by adding model and human graders and reading results statistically. EDD is spiritually closest to BDD.
What tools should I use for evals?
There is no single tool. Most teams pair a lightweight CI eval framework (such as Promptfoo, DeepEval, or Inspect AI) with an observability platform (such as Langfuse, Braintrust, or Phoenix). Favour portable, open-source building blocks for the spec layer, and watch for lock-in and licensing. See the tools page for a vendor-neutral survey.
How do evals make a codebase safe for AI to modify?
A change is acceptable only if it passes the evals, so a strong eval suite is the executable contract that lets an agent modify code without breaking it: comprehensive tests that must pass and stay green, behavioural evals for what tests cannot express, and autonomy gated on reliability rather than a single lucky run.