promptdojo_

When assert runs out of road, send in another model

Lesson 01 ranked four eval patterns by strictness: exact match, contains, shape, judge. The first three are mechanical — string compare, regex, schema validate. The fourth is another model reading your model's output and grading it. That's LLM-as-judge.

You reach for it when the answer is correct in many forms ("explain photosynthesis to a 5-year-old," "translate this email," "draft a support reply") and string equality is useless. The 2023 paper that made this respectable, Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena (Zheng et al., NeurIPS 2023, arxiv.org/abs/2306.05685), showed that GPT-4-as-judge agreed with human raters on over 80% of comparisons — the same level humans agree with each other. That's the green light the field needed.

It's also the pattern that ships the most subtle bugs.

Two shapes you'll see in the wild

| Shape | Question asked | Output |
| --- | --- | --- |
| Pairwise | "Given outputs A and B, which is better?" | A / B / tie |
| Rubric | "Score this output against this rubric." | pass / fail (+ critique) |

Pairwise is what Chatbot Arena uses. Relatively cheap (you're not generating a number, just a vote) and more consistent than absolute ratings — humans are better at "this one over that one" than "is this a 7 or an 8." Cost: O(n²) comparisons and no absolute quality floor.
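To feel the O(n²) cost, count it. A throwaway sketch, nothing more:

```python
# Head-to-head judging over n candidate outputs means n*(n-1)/2 judge calls,
# before any order-swap doubling (that's step 7's mitigation).
from itertools import combinations

candidates = [f"output_{i}" for i in range(5)]
pairs = list(combinations(candidates, 2))
print(len(pairs))  # 10 for 5 candidates; 4950 for 100; 499500 for 1000
```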

Rubric is what you want for CI and regression evals. One number per case, plug into pytest, fail the build if the rate drops. Hamel Husain's blunt rule from his eval methodology guide: use binary pass/fail, not 1–5 Likert. Likert scores drift, the difference between a 3 and a 4 is undefined, and averaging them produces false precision. Binary forces you to write the rubric until it's unambiguous. (hamel.dev/blog/posts/llm-judge/)
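To make the "plug into pytest" part concrete, here's a minimal sketch of the aggregation side. The judge below is a stand-in, not the real rubric judge (that's what this lesson builds); the point is that binary verdicts sum cleanly into a pass rate you can gate a build on.

```python
# Sketch: binary pass/fail verdicts gating a build. judge() is a placeholder
# for the LLM rubric judge built later; only the aggregation matters here.
CASES = [
    ("explain photosynthesis to a 5-year-old", "Plants use sunlight to make their own food."),
    ("draft a support reply", "Hi Sam, sorry about the trouble. Here's how to fix it."),
]

def judge(question: str, answer: str) -> bool:
    """Stand-in: the real judge sends question + answer + rubric to a model."""
    return bool(answer.strip())  # placeholder logic only

def test_pass_rate_holds():
    verdicts = [judge(q, a) for q, a in CASES]
    pass_rate = sum(verdicts) / len(verdicts)
    assert pass_rate >= 0.9, f"judge pass rate dropped to {pass_rate:.0%}"
```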

What everyone calls it

| Framework | Primitive |
| --- | --- |
| LangSmith | evaluators: LLMEvaluator, custom rubric configs |
| OpenAI Evals | graders: score_model / label_model |
| Inspect AI (UK AISI) | scorers: model_graded_qa(), model_graded_fact() |
| Promptfoo | assert: llm-rubric, assert: factuality |
| DeepEval | GEval (Liu et al., G-Eval paper, arxiv.org/abs/2303.16634) |
| Anthropic SDK | No native primitive — use tools=[...] to force structured verdict (the "tool-use trick") |

Same primitive everywhere: send the input + output + rubric, get a structured verdict back. You'll write the framework-free version in step 8.
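As a preview of that framework-free version, here's one possible shape using the Anthropic SDK's tool-use trick from the table above. Treat it as a sketch, not the step-8 build: the model name and tool schema are placeholders I'm assuming, and the load-bearing part is tool_choice forcing a structured verdict instead of free text you'd have to parse.

```python
# Sketch only: a rubric judge that forces a structured verdict via tool use.
# Assumes the anthropic package and ANTHROPIC_API_KEY; the model name is a
# placeholder, not a recommendation.
import anthropic

client = anthropic.Anthropic()

VERDICT_TOOL = {
    "name": "record_verdict",
    "description": "Record the grading verdict for one model output.",
    "input_schema": {
        "type": "object",
        "properties": {
            "critique": {"type": "string", "description": "One-paragraph justification."},
            "verdict": {"type": "string", "enum": ["pass", "fail"]},
        },
        "required": ["critique", "verdict"],
    },
}

def judge_rubric(question: str, answer: str, rubric: str) -> dict:
    """Send input + output + rubric, get back {'critique': ..., 'verdict': ...}."""
    resp = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder judge model
        max_tokens=512,
        tools=[VERDICT_TOOL],
        tool_choice={"type": "tool", "name": "record_verdict"},  # force the structured verdict
        messages=[{
            "role": "user",
            "content": (
                f"Rubric:\n{rubric}\n\nQuestion:\n{question}\n\nAnswer:\n{answer}\n\n"
                "Grade the answer strictly against the rubric. "
                "Write the critique first, then the verdict."
            ),
        }],
    )
    return next(block.input for block in resp.content if block.type == "tool_use")
```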

The trap that defines this lesson

LLM-as-judge has four documented biases — position, length, self-preference, style. They're not edge cases; they're the default. A judge will pick the first output you show it more often than chance. It will pick the longer output regardless of whether length matters. It will rate outputs from its own model family higher. Step 3 names them with citations and step 7 fixes the biggest one in code.

What you'll build

A judge_rubric(question, answer, rubric) that returns a binary verdict + critique. Then a fix for the Likert-score trap. Then a fix for position bias (the "run both orders, count consistent wins" mitigation from Zheng et al.). Then a real pairwise judge that detects its own bias and votes tie when it disagrees with itself.
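The position-bias mitigation is small enough to sketch now. The pairwise judge is left as a callable you supply (any function that returns "A", "B", or "tie" for a given ordering); the logic is the Zheng et al. move, judge both orders and keep a win only when the judge agrees with itself.

```python
# Sketch of the order-swap mitigation: run the judge with A first, then with B
# first, and only count a win when both orders agree; disagreement becomes a tie.
from typing import Callable, Literal

Verdict = Literal["A", "B", "tie"]
# Hypothetical judge signature: (question, first_answer, second_answer) -> Verdict
PairwiseJudge = Callable[[str, str, str], Verdict]

def debiased_compare(judge: PairwiseJudge, question: str, a: str, b: str) -> Verdict:
    forward = judge(question, a, b)    # A shown in the first position
    backward = judge(question, b, a)   # B shown in the first position
    remapped = {"A": "B", "B": "A", "tie": "tie"}[backward]  # map back to original labels
    return forward if forward == remapped else "tie"  # self-disagreement = position bias
```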

read, then continue.