LLM-as-judge evals: when and how (and when not)
An LLM judge is a model you point at another model's output and ask "is this good?" It is the only practical way to grade qualities no assertion can express — tone, helpfulness, faithfulness. It is also fallible, biased, and gameable. Used in the right place and validated, it is a real guardrail. Used as a default, it quietly lies to you.
When a judge is the right tool — and when it isn't
Reach for an LLM judge only when the thing you care about is subjective quality you cannot grade with code. If a deterministic check exists — the JSON parses, the total matches, the test suite passes — use it. A code grader is faster, cheaper, reproducible, and can't be talked into a higher score.
The boundary is sharp, and the codex draws it with hard evidence. On tasks with verifiable ground truth — knowledge, reasoning, math, code — LLM judges perform close to random when asked to pick the correct answer. They are not good at deciding which of two answers is right. They are good at approximating which of two answers a human would prefer. Those are different jobs. So:
- Code grader for anything verifiable: format, math, "does it compile / pass the tests."
- LLM judge only for open-ended quality — was the summary faithful, the tone right, the explanation helpful.
Even when a judge fits, it is the last resort in the stack, not the first. Vendor guidance consistently orders grading methods with code-based first, human review second, and LLM-based last: flexible, but something to test for reliability before you scale it. The judge earns its place only where rules genuinely can't capture the nuance.
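To make the "code grader first" rule concrete, here is a minimal sketch of a deterministic grader for the two checks named above: the JSON parses, and the total matches. The function name and the invoice-style schema (`items`, `amount`) are hypothetical, chosen only for illustration.

```python
import json
import math


def grade_invoice_json(raw_output: str, expected_total: float) -> tuple[bool, str]:
    """Deterministic grader: output must be valid JSON whose items sum to the expected total.

    The schema ("items" holding objects with an "amount") is a made-up example.
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"

    if not isinstance(data, dict):
        return False, "top-level value is not an object"
    items = data.get("items")
    if not isinstance(items, list):
        return False, "missing 'items' list"

    try:
        total = sum(float(item["amount"]) for item in items)
    except (KeyError, TypeError, ValueError):
        return False, "an item lacks a numeric 'amount'"

    # Compare with a small absolute tolerance to absorb float rounding.
    if not math.isclose(total, expected_total, abs_tol=0.005):
        return False, f"total {total:.2f} != expected {expected_total:.2f}"
    return True, "ok"
```

A grader like this returns the same verdict on every run and explains each failure, which is why it should sit in front of any judge for checks of this kind.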
How to make the judge trustworthy
A judge is a prompt, and most of its quality lives in the rubric. The patterns that move it from coin-flip to credible recur across vendor playbooks and the research:
- Write a rubric with concrete anchors, not "rate 1–5." Show, don't tell: describe exactly what passes and what fails. Vague scales produce vague, drifting scores.
- Reason before you score (CoT). Make the judge lay out its reasoning first, then emit a verdict. This is the single most repeated lever for lifting agreement with humans; the G-Eval pattern — reason from the rubric, then fill in the score — is the reference design.
- Give a reference answer when you have one. Reference-based grading dramatically outperforms reference-free; judge reliability drops noticeably without a reference in the prompt. A good answer to compare against turns a vibe check into a comparison.
- Prefer binary pass/fail over a Likert scale. A 1–5 number is hard to anchor and easy to fudge; pass/fail with a written critique is more stable and forces a crisp criterion.
- Pin the temperature and version the prompt. The judge is itself non-deterministic. Hold its settings fixed so a "pass" is reproducible and a score change means the output changed, not the weather.
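The levers above can be sketched in a few lines, assuming a judge that replies with free-text reasoning followed by a `VERDICT:` line. Everything here is illustrative: `JudgeConfig`, `build_prompt`, and `parse_verdict` are hypothetical names, the model identifier is a placeholder, and the actual model call is deliberately left out.

```python
import hashlib
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JudgeConfig:
    """Settings that must stay fixed for a verdict to be reproducible."""
    model: str                 # hypothetical identifier; pin an exact version
    rubric: str                # the anchored PASS/FAIL rubric text
    temperature: float = 0.0   # pinned so reruns are as stable as the API allows

    @property
    def prompt_version(self) -> str:
        # Short content hash of the rubric: any edit yields a new version to log
        # next to each score. (Only the rubric is hashed in this sketch.)
        return hashlib.sha256(self.rubric.encode("utf-8")).hexdigest()[:12]


def build_prompt(cfg: JudgeConfig, answer: str, reference: Optional[str] = None) -> str:
    """Rubric first, then the reference when one exists, then the answer to grade."""
    parts = [cfg.rubric]
    if reference is not None:
        parts.append("REFERENCE ANSWER:\n" + reference)
    parts.append("ANSWER TO GRADE:\n" + answer)
    parts.append("Reason step by step first. End with 'VERDICT: PASS' or 'VERDICT: FAIL'.")
    return "\n\n".join(parts)


_VERDICT = re.compile(r"VERDICT:\s*(PASS|FAIL)\b")


def parse_verdict(judge_output: str) -> Optional[bool]:
    """Return True for PASS, False for FAIL, None if no verdict is present.

    The last match wins, so a verdict mentioned inside the reasoning is ignored.
    """
    matches = _VERDICT.findall(judge_output)
    if not matches:
        return None
    return matches[-1] == "PASS"
```

Logging `prompt_version` alongside every score is one way to tell "the output changed" apart from "the rubric changed" when a number moves. Returning `None` for a missing verdict keeps a malformed judge reply from being silently counted as a pass or a fail.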
A small worked rubric for a faithfulness judge — anchors, reason-before-score, one binary verdict:
FAITHFULNESS — does the answer stay grounded in the provided source?
First, reason step by step: list each claim in the answer, then check
whether the source supports it. Only after reasoning, output a verdict.
PASS — every claim is supported by the source; no invented facts,
numbers, or entities; "I don't know" when the source is silent.
FAIL — any claim contradicts the source, or adds a fact the source
does not contain (a hallucination), however fluent it sounds.
Ignore tone, length, and writing style. Judge only grounding.
Output: reasoning (3-6 sentences), then VERDICT: PASS or FAIL.
The biases, and how to blunt them
A strong judge can reach roughly the level of agreement humans have with each other on open-ended preference — the headline that makes judges viable. That headline hides systematic, exploitable biases. Three are first-class, and each has a mitigation:
| Bias | What it does | Mitigation |
|---|---|---|
| Position | Favors an answer by where it sits in the prompt. Severe enough to flip verdicts purely by reordering — the same two answers can swap winners. | Score in both orders and average (swap-and-average). Never trust a single-order pairwise verdict. |
| Verbosity / length | Rewards longer answers for being longer, so a generator can inflate its score by padding. | Control for length — instruct against it, cap it, or statistically debias the preference so length stops paying off. |
| Self-preference | Over-rewards text that "sounds like" the judge — familiar, low-perplexity output, including from its own model family. | Don't grade your own family blindly; where you can, judge with a different model than the one that generated. |
A related design choice: pairwise (A-vs-B) grading is more stable than absolute scoring but amplifies bias and is more easily gamed by spurious features; absolute scoring is noisier but more robust to that kind of gaming. Pick the protocol to fit the task rather than defaulting to one.
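The position-bias mitigation from the table, swap-and-average, can be sketched as follows. The judge signature is an assumption: a callable that returns the probability, in [0, 1], that the first answer shown is the better one. The `margin` threshold for declaring a tie is likewise an illustrative choice.

```python
from typing import Callable

# Hypothetical judge signature: (prompt, first_answer, second_answer) -> probability
# in [0, 1] that the FIRST answer shown is the better one.
PairwiseJudge = Callable[[str, str, str], float]


def swap_and_average(judge: PairwiseJudge, prompt: str, a: str, b: str,
                     margin: float = 0.1) -> tuple[str, float]:
    """Score the pair in both orders and average the preference for `a`."""
    p_a_shown_first = judge(prompt, a, b)          # a sits in slot 1
    p_a_shown_second = 1.0 - judge(prompt, b, a)   # a sits in slot 2
    p_a = (p_a_shown_first + p_a_shown_second) / 2.0

    # A preference that does not survive reordering lands near 0.5: call it a tie.
    if p_a > 0.5 + margin:
        return "A", p_a
    if p_a < 0.5 - margin:
        return "B", p_a
    return "tie", p_a
```

With a judge that always favors whatever sits in the first slot, the two orderings cancel and the result is a tie; with a judge that tracks content, both orderings agree and the winner stands. The cost is two judge calls per comparison.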
Validate the judge before you trust it
A judge is a measuring instrument, and an uncalibrated instrument is worse than none — it gives you confident numbers that are wrong. Who validates the validators? You do, against human labels, before the judge gates anything.
- Measure precision and recall, not raw agreement. When passes and fails are imbalanced, raw accuracy flatters a judge that just guesses the majority class. Use Cohen's kappa (chance-corrected) over raw percent concordance.
- Iterate the prompt until it agrees with a human expert. The practitioner pattern is "critique shadowing": one principal domain expert makes binary pass/fail calls with written critiques; those become few-shot material for the judge prompt, refined until human–judge agreement is high — often in only a few passes.
- Expect criteria drift. You need criteria to grade, but grading is what reveals the criteria. Rubrics are discovered iteratively against real outputs, not specified perfectly up front; this is the same co-evolving spec at the heart of eval-driven development (EDD). LLM-suggested criteria often don't match human preference until you validate them.
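As a sketch of the validation step, the function below compares judge verdicts to human labels and reports accuracy, precision, recall, and Cohen's kappa, computed as (observed agreement minus chance agreement) divided by (1 minus chance agreement). The function name and the choice of FAIL as the default positive class are assumptions of this example.

```python
from typing import Sequence


def agreement_report(human: Sequence[bool], judge: Sequence[bool],
                     positive: bool = False) -> dict[str, float]:
    """Compare judge verdicts to human labels (True = PASS, False = FAIL).

    `positive` is the class precision/recall are computed for. Defaulting to FAIL
    (the class a gate needs to catch) is an assumption of this sketch.
    """
    if len(human) != len(judge) or len(human) == 0:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(human)

    tp = sum(h == positive and j == positive for h, j in zip(human, judge))
    fp = sum(h != positive and j == positive for h, j in zip(human, judge))
    fn = sum(h == positive and j != positive for h, j in zip(human, judge))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0

    # Cohen's kappa: observed agreement corrected for agreement expected by chance.
    p_observed = sum(h == j for h, j in zip(human, judge)) / n
    h_pass, j_pass = sum(human) / n, sum(judge) / n
    p_chance = h_pass * j_pass + (1 - h_pass) * (1 - j_pass)
    # If both raters use a single class, kappa is undefined; report 0.0 here.
    kappa = (p_observed - p_chance) / (1 - p_chance) if p_chance < 1 else 0.0

    return {"accuracy": p_observed, "precision": precision,
            "recall": recall, "kappa": kappa}
```

On 90 human passes and 10 human fails, a judge that always says PASS scores 0.90 accuracy, yet its kappa is 0 and its recall on FAIL is 0. That gap is the reason to report chance-corrected agreement instead of raw agreement.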
A judge is an attack surface
Anything that gates production is a target. Short universal adversarial suffixes can be appended to an output to push a judge toward maximum scores regardless of actual quality, and such attacks can transfer across judge models. Against these suffix attacks, absolute scoring is more vulnerable than comparative scoring. The practical takeaways: don't let a single judge be the only gate in an adversarial or high-stakes setting, watch for outputs that game the grader rather than the task, and keep a human or a deterministic check in the loop where the stakes justify it. A judge is a useful guardrail, not a tamper-proof one.
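One way to keep a judge from being the sole gate is to layer it behind deterministic checks and escalate to a human when judges disagree. This is a minimal sketch under stated assumptions: `layered_gate`, `GateResult`, and the boolean judge wrappers are hypothetical, and the escalation policy is one reasonable choice among several. It is not a defense against adversarial suffixes in itself.

```python
from dataclasses import dataclass, field
from typing import Callable

Check = Callable[[str], bool]   # deterministic check: output -> ok?
Judge = Callable[[str], bool]   # hypothetical judge wrapper: output -> PASS?


@dataclass
class GateResult:
    passed: bool
    needs_human: bool
    reasons: list[str] = field(default_factory=list)


def layered_gate(output: str, checks: dict[str, Check], judges: list[Judge],
                 high_stakes: bool = False) -> GateResult:
    """Deterministic checks run first; judges are consulted only if they all pass."""
    if not judges:
        raise ValueError("at least one judge is required")

    failed = [name for name, check in checks.items() if not check(output)]
    if failed:
        return GateResult(False, False, [f"check failed: {name}" for name in failed])

    verdicts = [judge(output) for judge in judges]
    if all(verdicts):
        # Unanimous PASS; high-stakes outputs still get a human look.
        return GateResult(True, high_stakes, ["all judges passed"])
    if any(verdicts):
        # Judges disagree: possibly a sign that one of them was gamed.
        return GateResult(False, True, ["judges disagree"])
    return GateResult(False, False, ["all judges failed"])
```

Using judges from different model families in `judges` means an attack has to fool all of them at once, though the transfer noted above means that is not guaranteed protection.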
When to use a judge — and when not
| Verdict | Situation |
|---|---|
| Use a judge | Grading subjective quality with no deterministic check — faithfulness, tone, helpfulness, completeness. |
| Use a judge | You have a rubric with anchors, reason-before-score, and a reference answer to compare against. |
| Use a judge | You've validated it against human labels and re-validate periodically for drift. |
| Don't use a judge | The answer is objectively verifiable (math, code, format, exact match) — use a test or code grader. |
| Don't use a judge | You'd grade an output with a model from its own family and call it neutral. |
| Don't use a judge | It's a generic, unvalidated 1–5 metric you wired up without looking at any data. |
| Don't use a judge | It is the sole gate in an adversarial setting where outputs can be crafted to fool it. |
Bottom line
The LLM judge is the part of the eval stack you reach for last and trust least without proof. Use it only where quality is genuinely subjective; build it with a clear anchored rubric, reason-before-score, a reference answer, and a pinned temperature; blunt position, verbosity, and self-preference bias; validate it against human labels with precision/recall and kappa before it gates anything; and remember it can be gamed. Done that way, a judge grades the things tests can't — and stays honest enough to be worth grading with.
Next: how to write evals for an AI coding agent, how to build an eval harness for an LLM app, or the underlying evidence in the codex.
Grounded in the EDD codex — esp. Part II (LLM-as-judge: bias, rubric design, CoT grading, validation, robustness) and the cross-cutting synthesis on when a judge is the right grader.