Eval-Driven Development

Articles

Plain, practitioner pieces on building AI-assisted software measurably — the spec and the guardrail, not the vibe check.

Definition

What is eval-driven development?

The canonical definition: evals as the executable spec and guardrail for AI-assisted software.

Comparison

Eval-driven development vs. test-driven development

What carries over from TDD, what breaks, and how evals and tests work together.

Comparison

Eval-driven development vs. TDD and BDD

Where EDD sits in the driven-development lineage — and why it is closest to BDD.

Comparison

Evals vs. tests vs. benchmarks: what’s the difference?

Four different things, often conflated — and which one is actually your spec.

Concept

Why unit tests aren’t enough for AI-generated code

Tests stay essential, but AI code needs behavior, grounding, and maintainability checked too.

How-to

How to write evals for an AI coding agent

From your first failure log to a CI-gating eval suite.

How-to

Eval-driven development with Claude Code, Cursor, and Copilot

Make an eval suite the gate for agent changes, whichever assistant you use.

How-to

How to build an eval harness for an LLM app

Datasets from real traffic, layered graders, CI gating, online evals, statistics.

How-to

LLM-as-judge evals: when and how (and when not)

When to grade with a model, how to make the judge trustworthy, and when not to use one at all.

How-to

Writing grading rubrics for agent behavior

Score anchors, binary criteria, outcome vs. trajectory, and validating the rubric.

How-to

Regression evals: catching AI-agent drift

Model upgrades and prompt tweaks shift behavior silently. Catch it before users do.

How-to

How to use evals to make a codebase safe for AI to modify

Evals are the guardrail that lets an agent change your code without breaking it.

Artifact

An eval-driven development maturity model

A five-level scorecard, from vibe checks to a calibrated, online eval suite.

Reference

The eval-driven development codex

130+ annotated, cited sources across eight parts — the research behind the practice.

Tools & resources

Free kit

The EDD kit

Copy-paste checklist, starter eval suite, and LLM-as-judge rubric. Free to download.

Interactive

Maturity scorecard

Five questions to find your EDD level — and your next step.

Guide

The eval tooling landscape

A vendor-neutral survey of eval tools and how to choose.

Reference

FAQ

Short answers to the questions that come up most.

Reference

Glossary

Plain-language definitions of the core EDD terms.

Forthcoming

Case study · soon

Evals for a support agent: a worked teardown

How-to · soon

What evals cost to run — and how to keep it cheap