Regression evals: catching AI-agent drift
Your agent worked yesterday. The same task fails today, and nothing in your code changed. That gap is drift: the behavior of an AI system shifts underneath you, quietly, between the moment it passed review and the moment a user hits the regression. Regression evals are the standing guard that catches it first.
Where drift comes from
A deterministic service only changes when you change it. An agent has at least four other surfaces that can move without a single line of your code being touched:
- Provider model upgrades. The endpoint you call gets a new checkpoint, a routing change, or a quiet deprecation. Behavior shifts and there is no commit in your history to point at.
- Prompt and tooling changes. A reworded system prompt, a new tool, a tightened argument schema — small edits that ripple through a multi-step trajectory, because mistakes in agents propagate and compound.
- RAG corpus changes. The knowledge base is re-indexed, re-chunked, or simply grows. Retrieval quality moves, and answers that used to be grounded stop being grounded.
- Dependency updates. A bumped SDK, a changed default temperature, a new tokenizer, a library the agent shells out to. The model is identical; the environment around it is not.
Only one of these four, a prompt or tooling change, reliably shows up in a diff. The other three are exactly why "it passed when I merged it" is not a durable guarantee, and why you need a check that runs on a clock as well as on a commit.
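One way to make these surfaces visible is to record a fingerprint of each with every eval run, so a red run can be compared against the last green one. The sketch below is illustrative only: `RunFingerprint`, `fingerprint`, and `moved_surfaces` are hypothetical names, and it assumes you can obtain a model identifier, the prompt text, a corpus manifest, and a lockfile as strings.

```python
import hashlib
from dataclasses import dataclass, asdict


def _digest(text: str) -> str:
    """Short, stable content hash for a prompt, corpus manifest, or lockfile."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]


@dataclass(frozen=True)
class RunFingerprint:
    model_id: str      # identifier reported by the provider, if it exposes one
    prompt_hash: str   # system prompt plus tool schemas
    corpus_hash: str   # manifest of the RAG index (document ids and versions)
    deps_hash: str     # resolved dependency lockfile


def fingerprint(model_id: str, prompt: str, corpus_manifest: str, lockfile: str) -> RunFingerprint:
    """Build the fingerprint stored alongside one eval run."""
    return RunFingerprint(model_id, _digest(prompt), _digest(corpus_manifest), _digest(lockfile))


def moved_surfaces(previous: RunFingerprint, current: RunFingerprint) -> list[str]:
    """Names of the fingerprint fields that differ between two runs."""
    before, after = asdict(previous), asdict(current)
    return [name for name in before if before[name] != after[name]]


# Hypothetical values, for illustration only.
yesterday = fingerprint("model-a", "You are a helpful agent.", "doc1@v1", "sdk==1.0")
today = fingerprint("model-b", "You are a helpful agent.", "doc1@v1", "sdk==1.0")
print(moved_surfaces(yesterday, today))  # ['model_id']
```

An empty list on a red run is itself a signal: it points at a surface the fingerprint cannot see, such as a provider change behind an unchanged model name, or at the harness.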
Regression evals vs capability evals
An eval suite for an agent really wants to be two suites with opposite target scores. The distinction comes straight from frontier-lab practice and it is the backbone of this whole article.
| Regression evals | Capability evals |
|---|---|
| Behaviors that already work | Behaviors the agent can't do yet |
| Target: near 100% pass — keep it there | Target: starts low, a bet on the next few months |
| Any drop is a defect to block | A rise is progress to celebrate |
| Runs on every change and on a schedule | Runs to track the climb over time |
Regression evals "should maintain nearly 100% pass rate" so that backsliding is loud. Capability evals deliberately start low — they are bets on what the model will be able to do soon, and you watch the number climb. Same harness, opposite expectations. Confuse the two and you either tolerate regressions or panic over a capability eval that was always meant to be red.
The scheduled run is the part teams skip and regret. A pull-request gate only fires when you change something. A nightly run against a fixed dataset is what catches the silent provider update — the case where your code is frozen and the behavior moved anyway.
Building the golden set: every failure becomes a permanent case
A regression suite is only as good as its memory. The discipline is simple and unforgiving: every real failure becomes a permanent case in the golden set, so the exact thing that broke can never silently regress again. Bug tracker, support queue, production traces, the agent's own bad transcripts — each one becomes a task with a starting state, an instruction, and a grader made of tests. The dataset grows alongside the agent. A good start is twenty to fifty tasks drawn from real failures, and it only compounds from there.
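A golden-set case as described above (starting state, instruction, grader tests) can be sketched as a small record appended to a growing list. This is a minimal sketch under assumptions: `GoldenCase`, `append_case`, and `to_jsonl` are hypothetical names, and JSONL is just one convenient storage format.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class GoldenCase:
    case_id: str
    source: str            # where the failure came from, e.g. "support-queue"
    starting_state: dict   # fixture the agent starts from
    instruction: str       # what the agent is asked to do
    grader_tests: list = field(default_factory=list)  # test ids that decide pass/fail


def append_case(golden_set: list, case: GoldenCase) -> list:
    """Return a new golden set with the case added; existing ids are never overwritten."""
    if any(existing.case_id == case.case_id for existing in golden_set):
        raise ValueError(f"case {case.case_id!r} already exists")
    return golden_set + [case]


def to_jsonl(golden_set: list) -> str:
    """Serialize the golden set as one JSON object per line."""
    return "\n".join(json.dumps(asdict(case), sort_keys=True) for case in golden_set)


# Hypothetical case built from a production failure.
case = GoldenCase(
    case_id="refund-double-charge",
    source="production-trace",
    starting_state={"order": "A1", "charged": 2},
    instruction="Refund the duplicate charge on order A1.",
    grader_tests=["test_single_charge_remains"],
)
golden = append_case([], case)
print(to_jsonl(golden))
```

Rejecting duplicate ids keeps the set append-only, which matches the rule that a captured failure stays captured.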
Grade the outcome, not the path. For a coding agent the tests are the spec, so a task passes when its fail_to_pass tests pass and its pass_to_pass tests stay green. For stateful agents, compare the final world-state to an annotated goal state rather than reading the transcript; this sidesteps brittle text matching. Resist asserting an exact sequence of tool calls: agents routinely find valid approaches you didn't anticipate, and path-matching produces evals that fail on good work.
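The two outcome graders described here can be sketched in a few lines. The function names and the dictionary shapes are assumptions made for illustration: test results are taken to be a mapping from test id to pass/fail, and world-state is taken to be a flat dictionary.

```python
def grade_coding_task(results: dict, fail_to_pass: list, pass_to_pass: list) -> bool:
    """results maps test id -> True if the test passed after the agent's change.

    A test missing from results counts as a failure.
    """
    fixed = all(results.get(test, False) for test in fail_to_pass)
    unbroken = all(results.get(test, False) for test in pass_to_pass)
    return fixed and unbroken


def grade_final_state(final_state: dict, goal_state: dict) -> bool:
    """Pass when every annotated goal field is present and equal.

    Fields in final_state that the goal does not mention are ignored,
    so the agent is free to take any path that ends in the right place.
    """
    return all(
        key in final_state and final_state[key] == value
        for key, value in goal_state.items()
    )


# Hypothetical examples.
print(grade_coding_task({"test_bug": True, "test_old": True}, ["test_bug"], ["test_old"]))  # True
print(grade_final_state({"status": "refunded", "notes": "any"}, {"status": "refunded"}))   # True
```

Neither grader looks at the transcript or the tool-call sequence, which is the point: only the end state decides the verdict.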
The detail that makes regression evals trustworthy is reliability. Agents are non-deterministic, so a single green run is weak evidence. Run each task several times and report pass^k — the probability it passes every time — not pass@k, the probability it passes at least once. A 70%-reliable agent reads as roughly 97% at pass@3 but only about 34% at pass^3. For a regression gate, pass^k is the honest number: it tells you whether the behavior will hold, not merely whether it can. On τ-bench, pass^8 fell below 25% in retail even for capable models — a gap a single run would have hidden completely.
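The arithmetic behind those two figures is short enough to check directly. The sketch below computes both metrics from a per-trial success probability, and adds one common way to estimate pass^k from observed trials: the fraction of size-k subsets of the trials that are all successes. The function names are illustrative, not from any particular library.

```python
from math import comb


def pass_at_k(p: float, k: int) -> float:
    """Probability that at least one of k independent trials passes."""
    return 1.0 - (1.0 - p) ** k


def pass_hat_k(p: float, k: int) -> float:
    """Probability that all k independent trials pass (pass^k)."""
    return p ** k


def estimate_pass_hat_k(successes: int, trials: int, k: int) -> float:
    """Estimate pass^k from observed trials as C(successes, k) / C(trials, k)."""
    if not 0 < k <= trials:
        raise ValueError("k must be between 1 and the number of trials")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return comb(successes, k) / comb(trials, k)


print(round(pass_at_k(0.7, 3), 3))   # 0.973
print(round(pass_hat_k(0.7, 3), 3))  # 0.343
print(round(estimate_pass_hat_k(7, 10, 3), 3))  # 0.292
```

The estimate from 7 successes in 10 trials comes out below 0.7 cubed because it samples without replacement from a finite set of observed trials.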
Online drift monitoring: the other half of the loop
Offline regression evals are a closed, finite spec. They catch the failures you have already seen. They cannot, by construction, see a failure mode you have not yet captured. That is what online evaluation is for. Sample a slice of real production traffic — commonly around 5 to 10% — and score it asynchronously with code checks and validated judges, watching for drift, novel inputs, and silent provider model updates that your fixed dataset might not exercise.
The two layers are complementary, not redundant: use offline to go fast, use online to be right. The loop closes when a production failure becomes tomorrow's golden-set case — the online layer discovers the regression, the offline layer makes sure it never comes back. Tracing is the substrate that makes this possible; you can only score the execution you captured, so instrument the retrieval step and the generation step separately and attach scores to the spans.
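Sampling a fixed slice of production traffic can be done deterministically by hashing the trace id, so the same trace is always in or out of the sample regardless of which worker sees it. This is a sketch under assumptions: `in_sample` is a hypothetical helper, trace ids are assumed to be strings, and the 5% default simply mirrors the low end of the range mentioned above.

```python
import hashlib


def in_sample(trace_id: str, rate: float = 0.05) -> bool:
    """Deterministically select roughly `rate` of traces by hashing the trace id."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate


# Hypothetical trace ids: count how many fall into a 5% sample.
selected = sum(in_sample(f"trace-{i}") for i in range(10_000))
print(selected)
```

Because selection depends only on the id, a trace that was scored once can be re-scored later with an updated judge and compared like for like.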
What to gate, and what to alert
Not every signal deserves the power to block a merge. The rule of thumb:
- Gate on regression evals. They are near-100% by definition, so a drop is a clear, deterministic defect. Block the merge; on the scheduled run, page someone.
- Alert on capability evals. A red capability eval is the expected state, not a failure. Track the trend; never let it block a deploy.
- Gate only what code can grade cleanly. Deterministic checks and validated graders gate. An unvalidated LLM judge does not earn gate authority until it agrees with your own labels.
- Alert on online drift, then triage. Sampled-production scores are noisy proxies; a dip opens an investigation and feeds the golden set — it does not auto-block, because the input distribution is always moving.
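The rule of thumb above reduces to a small decision function. This is one possible encoding, not a prescribed policy: `Signal` and `decide` are hypothetical names, and the thresholds in the usage lines are placeholders.

```python
from dataclasses import dataclass


@dataclass
class Signal:
    kind: str                      # "regression", "capability", or "online"
    score: float                   # pass rate or sampled score in [0, 1]
    threshold: float               # level below which the signal counts as a drop
    grader_validated: bool = True  # False for an LLM judge not yet checked against your labels


def decide(signal: Signal) -> str:
    """Return "block", "alert", or "ok" for one eval signal."""
    if signal.score >= signal.threshold:
        return "ok"
    # Only a regression drop measured by a trusted grader earns gate authority.
    if signal.kind == "regression" and signal.grader_validated:
        return "block"
    # Capability dips, online drift, and unvalidated judges open an investigation.
    return "alert"


print(decide(Signal("regression", 0.94, 0.98)))                          # block
print(decide(Signal("capability", 0.20, 0.50)))                          # alert
print(decide(Signal("regression", 0.94, 0.98, grader_validated=False)))  # alert
print(decide(Signal("online", 0.99, 0.90)))                              # ok
```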
Here is a minimal CI-plus-schedule config: a gated regression job that runs on every pull request and nightly, and an alert-only capability job alongside it.
    # .github/workflows/agent-evals.yml
    name: agent-evals
    on:
      pull_request:            # every prompt, tool, or dependency change
      schedule:
        - cron: "0 7 * * *"    # daily: catch silent provider model updates
    jobs:
      regression:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run the golden set
            # --trials 5: report pass^5, not a lucky run
            # --gate: block the merge / page on schedule
            run: |
              edd run --suite regression \
                --trials 5 \
                --gate "pass_rate>=0.98"
      capability:
        runs-on: ubuntu-latest
        steps:
          - uses: actions/checkout@v4
          - name: Run the stretch set (alert-only)
            run: edd run --suite capability --trials 5 --no-gate

One caution that costs teams the most time: the harness is the most common failure point, not the model. Read the transcripts when a regression eval flips red. A grader bug can drop a score as easily as real drift can: in one documented case a capable model scored 42% on a benchmark until the grading bugs were fixed, then jumped to 95%. Before you blame a provider for drift, confirm your own grader did not move.
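A cheap way to confirm the grader did not move is to keep a handful of canary outputs with known verdicts and run the grader over them before trusting a red result. The sketch below is an assumption about how that might look: `check_grader` and both example graders are hypothetical.

```python
def check_grader(grader, canaries):
    """Return the (output, expected) canaries on which the grader disagrees.

    canaries is a list of (output, expected_verdict) pairs with known answers.
    """
    return [(output, expected) for output, expected in canaries if grader(output) != expected]


def strict_grader(output: str) -> bool:
    # Buggy on purpose: rejects a correct answer that has trailing whitespace.
    return output == "42"


def tolerant_grader(output: str) -> bool:
    return output.strip() == "42"


CANARIES = [("42", True), ("42\n", True), ("41", False)]

print(check_grader(strict_grader, CANARIES))    # [('42\n', True)]
print(check_grader(tolerant_grader, CANARIES))  # []
```

A non-empty result means the drop may come from the harness rather than the agent, so the grader gets fixed first and the suite is rerun.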
- Two suites: regression (target ~100%, gate) and capability (starts low, alert-only).
- Run on every change and on a schedule, to catch silent provider updates.
- Every real failure becomes a permanent golden-set case so it can't regress again.
- Grade the outcome (tests, or final-state comparison); avoid exact tool-call matching.
- Report pass^k across multiple trials, not a single lucky run.
- Add online evals on ~5–10% of traffic; close the loop into the golden set.
- Gate deterministic checks; alert on judges and online drift until validated.
- When an eval flips red, read the transcript — rule out a grader bug before blaming drift.
Where this fits
Regression evals are the standing guarantee that an agent which worked yesterday still works today, across every surface that can move without your involvement. They are also the property that lets an agent change your codebase without quietly breaking it. If you are building this from scratch, start with the harness, then split it into these two suites.
See also: how to write evals for an AI coding agent for the harness this builds on, using evals to make a codebase safe for AI to modify for the safety payoff, and the overview of eval-driven development for how the pieces connect. For keeping a human in the loop on what the agent is allowed to do unattended, looprails.dev covers the oversight side.
Grounded in the EDD codex — Part VI (regression vs capability evals, CI and scheduled gates, golden sets, reading transcripts), Part IV (agent trajectories, outcome grading, pass^k reliability), and Part V (online evals on sampled traffic, tracing, the offline→online loop).