Eval-driven development glossary
The core vocabulary of EDD, in plain language. For where each idea comes from and the evidence behind it, see the codex.
- Eval
- A runnable experiment that checks an AI system: a dataset of inputs, a success criterion, and a grader. Unlike a test, it grades non-deterministic output and its result is a statistical estimate, not a single pass/fail.
- Eval-driven development (EDD)
- Using evals as the executable spec and guardrail for AI-assisted and agent software: define the evals, then have the AI iterate until they pass.
- Grader
- The component that decides pass or fail for an eval case. Three kinds, cheapest first: code/execution, LLM-as-judge, and human review.
- Golden set
- The curated dataset of inputs (and expected behaviour) an eval suite runs against. Best grown from real failures and production traces, not imagined cases.
- LLM-as-judge
- Using a language model to grade output against a rubric. Useful for subjective quality, but biased (position, verbosity, self-preference) and unreliable on verifiable correctness — so it must be validated against human labels.
- pass@k
- The probability that at least one of k attempts succeeds. Measures capability: whether a system can produce a correct result at all, given k tries.
- pass^k
- The probability that all k attempts succeed. Measures reliability — whether a system does the task every time. The honest number for anything you ship.
- Regression eval
- An eval for behaviour that already works, kept near 100% pass to catch silent backsliding from a prompt change, model upgrade, or dependency update.
- Capability eval
- An eval for a behaviour the system cannot reliably perform yet. It deliberately starts at a low pass rate and is tracked as a bet on what is becoming possible, rather than blocking the build.
- Error analysis
- Reading real outputs and categorizing how they fail, until no new failure type appears. The highest-leverage activity in EDD — the eval set is built from what you find.
- Criteria drift
- The catch-22 that you need criteria to grade outputs, but grading is what reveals the real criteria. It means eval rubrics are discovered and iterated, not fully written up front.
- Benchmark
- A standardized dataset, metric, and protocol for comparing models (for example MMLU or SWE-bench). Distinct from an eval, which is specific to your application.
- Contamination
- When benchmark or eval data leaks into a model’s training data, inflating scores through memorization rather than genuine capability. A reason to prefer fresh, private, or held-out evals.
- Goodharting (eval gaming)
- From "when a measure becomes a target, it ceases to be a good measure": once an eval is optimized against, it can be satisfied without the underlying quality — via contamination, reward hacking, or style-over-substance.
- RAG triad
- Three reference-free axes for evaluating retrieval-augmented generation: context relevance (the retrieved context fits the question), groundedness/faithfulness (claims trace to the context), and answer relevance (the answer addresses the question).
- Trajectory evaluation
- Grading the steps an agent takes (tool calls, order, state changes), as opposed to only its final outcome. Use it when the process matters; otherwise grade the outcome and allow partial credit.
- Online eval
- Scoring a sample of live production traffic, as opposed to offline evals against a fixed set. Catches drift and novel failures; its findings feed back into the golden set.
- Eval harness
- The machinery that runs an eval suite: provides inputs and tools, executes cases (often repeatedly and in parallel), records every step, grades, and aggregates. Frequently the biggest source of misleading results.
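To make the difference between pass@k and pass^k concrete, here is a minimal Python sketch. It assumes you ran one eval case n times and saw c successes, and it uses the combinatorial estimators commonly applied to these metrics (draw k of the n recorded attempts without replacement). The function names `pass_at_k` and `pass_hat_k` are illustrative, not taken from the codex or any particular library.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k attempts succeeds).

    n: attempts recorded for the case, c: how many succeeded, k: attempts drawn.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # 1 minus the chance that all k drawn attempts are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k attempts succeed), i.e. pass^k."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # Chance that all k drawn attempts come from the c successes.
    return comb(c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical case: 10 recorded attempts, 8 of them passed.
    n, c, k = 10, 8, 3
    print(f"pass@{k} = {pass_at_k(n, c, k):.3f}")   # 1.000
    print(f"pass^{k} = {pass_hat_k(n, c, k):.3f}")  # 0.467
```

In this made-up case the system passes 8 of 10 attempts, so pass@3 is 1.0 while pass^3 is about 0.47. The same run looks solved on the capability number and unreliable on the reliability number.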
Definitions distilled from the EDD codex.