Question 1

What is eval-driven development?

Accepted Answer

Eval-driven development (EDD) is the practice of using evals — automated checks, assertion-based or LLM-graded — as the executable specification and the guardrail for AI-assisted and AI-agent software. You define the evals and the AI iterates until they pass. "Does it pass the evals?" replaces "does it look right?"

Question 2

How is an eval different from a test?

Accepted Answer

A test is a deterministic assertion: the function returns 4 or it does not. An eval is a runnable experiment for non-deterministic output — a dataset, a success criterion, and a grader — whose result is a statistical estimate, not a single pass/fail. Tests are the cheapest kind of eval; evals extend testing to behaviour and quality that exact-match cannot express.
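
The contrast can be sketched in a few lines of Python. This is a minimal illustration, not a real harness: `fake_model`, `grader`, and `run_eval` are hypothetical names, and the "model" is a seeded random stub standing in for a non-deterministic LLM call.

```python
import random


def add(a, b):
    return a + b


# A test: one deterministic assertion, one pass/fail answer.
assert add(2, 2) == 4


def fake_model(prompt, rng):
    """Hypothetical stand-in for a model call; right about 80% of the time."""
    return "Paris" if rng.random() < 0.8 else "Lyon"


def grader(output, expected):
    """Success criterion: the expected answer appears in the output."""
    return expected.lower() in output.lower()


def run_eval(dataset, trials, seed=0):
    """Run every case `trials` times; return (pass rate, standard error)."""
    rng = random.Random(seed)
    outcomes = [
        grader(fake_model(prompt, rng), expected)
        for prompt, expected in dataset
        for _ in range(trials)
    ]
    rate = sum(outcomes) / len(outcomes)
    # Standard error of a binomial proportion: the result is an estimate.
    stderr = (rate * (1 - rate) / len(outcomes)) ** 0.5
    return rate, stderr


dataset = [("What is the capital of France?", "Paris")]
rate, stderr = run_eval(dataset, trials=500)
print(f"pass rate {rate:.2f} +/- {stderr:.2f}")
```

The test yields a single verdict; the eval yields a pass rate with an uncertainty attached, which is why eval results are read statistically.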

Question 3

Do I still need unit tests if I have evals?

Accepted Answer

Yes. Deterministic tests are the first and cheapest layer of an eval stack — use them wherever the answer is verifiable. Evals add coverage for the things tests cannot express: behaviour, quality, grounding, and agent trajectories. They layer; they do not replace each other.

Question 4

What is the difference between an eval and a benchmark?

Accepted Answer

A benchmark is a standardized dataset, metric, and protocol for comparing models (MMLU, SWE-bench). An eval is your application-specific check on your data, prompts, and tools. A high benchmark score is not the same as your product passing its evals.

Question 5

Is LLM-as-judge reliable?

Accepted Answer

A strong judge can reach roughly human-level agreement on subjective quality, but it carries position, verbosity, and self-preference biases and performs poorly on objectively verifiable correctness. Use it only for what code cannot grade, and put controls around it: validate it against human labels, randomize answer order, control for length, and pin the temperature. With those controls in place it is a useful, scalable grader.
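
Two of those controls, order randomization and validation against human labels, can be sketched as follows. Everything here is hypothetical: `stub_judge` is a deliberately biased stand-in that always prefers the longer answer, where a real judge would call an LLM at a pinned temperature, and the answer pairs and human labels are made up for illustration.

```python
import random


def stub_judge(first, second):
    """Hypothetical judge with a verbosity bias: prefers the longer answer."""
    return "first" if len(first) >= len(second) else "second"


def judge_randomized(judge, answer_a, answer_b, rng):
    """Show the pair in random order, then map the verdict back to 'a' or 'b'."""
    if rng.random() < 0.5:
        return "a" if judge(answer_a, answer_b) == "first" else "b"
    return "b" if judge(answer_b, answer_a) == "first" else "a"


def agreement_rate(judge_labels, human_labels):
    """Fraction of cases where the judge matches the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


# Made-up pairs; humans preferred answer "a" in both cases.
pairs = [
    ("4", "The answer could arguably be 5, depending on context."),
    ("Restart the service, then clear the cache.", "Reboot."),
]
human_labels = ["a", "a"]

rng = random.Random(0)
judge_labels = [judge_randomized(stub_judge, a, b, rng) for a, b in pairs]
print(agreement_rate(judge_labels, human_labels))
```

Randomizing order removes the position effect, but the stub still picks the wordier answer in the first pair. Only the comparison with human labels exposes that, which is why validation comes before trusting the judge at scale.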

Question 6

How many eval cases do I need to start?

Accepted Answer

Begin with 20 to 50 cases drawn from real failures — your bug tracker, support queue, and production traces. The point is coverage of real failure modes discovered through error analysis, not a large imagined matrix.

Question 7

Should I write all the evals before the feature, like TDD?

Accepted Answer

Keep the rhythm of writing the check first, but do not over-apply it. The practitioners who popularized evals are explicit: write evaluators for the errors you discover, not the ones you imagine. Your real criteria emerge only as you grade actual outputs (a phenomenon known as "criteria drift"), so the suite grows out of error analysis rather than up-front design.

Question 8

What is pass@k versus pass^k?

Accepted Answer

pass@k is the probability that at least one of k attempts succeeds; it measures capability (can it ever?). pass^k is the probability that all k attempts succeed; it measures reliability (does it every time?). Assuming independent attempts, an agent that succeeds 70% of the time per attempt scores about 97% at pass@3 (1 - 0.3^3) but only about 34% at pass^3 (0.7^3). Ship on reliability.
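
The arithmetic can be checked with a short sketch. It assumes every attempt is independent and shares the same per-attempt success probability `p`; the function names are illustrative, not taken from any library.

```python
def pass_at_k(p, k):
    """Probability that at least one of k independent attempts succeeds."""
    return 1 - (1 - p) ** k


def pass_hat_k(p, k):
    """Probability that all k independent attempts succeed (pass^k)."""
    return p ** k


p, k = 0.7, 3
print(f"pass@{k} = {pass_at_k(p, k):.3f}")   # 0.973
print(f"pass^{k} = {pass_hat_k(p, k):.3f}")  # 0.343
```

As k grows, the two numbers move in opposite directions: pass@k climbs toward 1 while pass^k decays toward 0, so the same agent looks better or worse depending on which one is reported.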

Question 9

How do evals fit into CI/CD?

Accepted Answer

Run the eval suite as a build gate on every change and every model upgrade. Keep regression evals (behaviours that already work) at a pass rate near 100% and block the build if they break; let capability evals (harder bets) start low, and track their trend without blocking.
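
One way to express that split is a small gate function. This is a sketch under assumptions: the 0.99 threshold, the suite names, and the shape of the results dictionary are all hypothetical, and a real pipeline would read the results from whatever the eval harness emits.

```python
REGRESSION_MIN_PASS_RATE = 0.99  # assumed threshold; tune per project


def gate(suites):
    """Return an exit code: 1 if any regression suite is below the bar."""
    exit_code = 0
    for name, suite in sorted(suites.items()):
        rate = suite["pass_rate"]
        if suite["kind"] == "regression" and rate < REGRESSION_MIN_PASS_RATE:
            print(f"BLOCK {name}: {rate:.1%} is below the regression bar")
            exit_code = 1
        elif suite["kind"] == "capability":
            # Capability evals are reported for trend tracking, never blocking.
            print(f"TRACK {name}: {rate:.1%}")
    return exit_code


# Hypothetical results for illustration only.
suites = {
    "refund_policy_answers": {"kind": "regression", "pass_rate": 1.0},
    "multi_step_planning": {"kind": "capability", "pass_rate": 0.42},
}
exit_code = gate(suites)  # in a CI job: sys.exit(gate(suites))
```

Returning an exit code rather than raising keeps the function easy to test and lets the CI job decide how to fail the build.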

Question 10

What are regression evals?

Accepted Answer

Regression evals are evals for behaviours that already work, kept at a pass rate near 100% to catch silent backsliding when you change a prompt, upgrade a model, or update a dependency. Run them on every change and on a schedule, because providers can update hosted models underneath you.

Question 11

Can a passing eval suite still ship a broken product?

Accepted Answer

Yes. Static evals are a finite, closed spec, and every eval is a Goodhart target — vulnerable to contamination, saturation, gaming, and style-biased judges. Treat green as "no known regressions," not "correct," and pair evals with held-out tests and real-world feedback.

Question 12

How does EDD relate to TDD and BDD?

Accepted Answer

All three write the check first and build to pass it. TDD puts the check in code (deterministic units); BDD makes it readable, behavioural, and shared (specification by example); EDD extends the family to non-deterministic AI output by adding model and human graders and reading results statistically. EDD is spiritually closest to BDD.

Question 13

What tools should I use for evals?

Accepted Answer

There is no single tool. Most teams pair a lightweight CI eval framework (such as Promptfoo, DeepEval, or Inspect AI) with an observability platform (such as Langfuse, Braintrust, or Phoenix). Favour portable, open-source building blocks for the spec layer, and watch for lock-in and licensing. See the tools page for a vendor-neutral survey.

Question 14

How do evals make a codebase safe for AI to modify?

Accepted Answer

A change is acceptable only if it passes the evals, so a strong eval suite is the executable contract that lets an agent modify code without breaking it: comprehensive tests that must pass and stay green, behavioural evals for what tests cannot express, and autonomy gated on reliability rather than a single lucky run.

Eval-driven development FAQ