LLM-as-judge — when the judge is another model (step 9/9) · eval-driven ai development

Checkpoint

One last thing before we move on. This has the same surface as a write step, but the lesson doesn't complete until your solution passes.

Final drill. Build a pairwise judge that detects its own position bias and returns "tie" when it disagrees with itself across the two presentation orders.

Write pairwise(question, output_a, output_b) that:

Calls fake_judge(question, first, second) twice:
- Once with first=output_a, second=output_b — read result as "the output named output_a sits in slot A this round."
- Once with first=output_b, second=output_a — output_b is now in slot A.
Each call returns "A" or "B" (the slot that won).
Translate slot wins back to which physical output won that round.
If the SAME physical output won both rounds, return that output's label ("a" or "b").
If the rounds disagreed, return "tie".

Then run a multi-case suite. Expected output:

case 1: a wins  (consistent)
case 2: tie     (position-biased — judge disagrees with itself)
case 3: b wins  (consistent)
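
One way the drill could be approached is sketched below. The lesson's real fake_judge is not shown here, so the version in this sketch is a hypothetical stand-in: it prefers the longer output, except for questions listed in BIASED, where it always picks slot A regardless of content. The CASES data, the BIASED set, and the run_suite helper are likewise assumptions made for illustration. Only the pairwise signature and the return values come from the task description.

```python
# Hypothetical stand-in for the lesson's fake_judge (the real one is not shown).
# It prefers the longer output, except for questions in BIASED, where it
# always picks slot A no matter what sits there (pure position bias).
BIASED = {"q2"}


def fake_judge(question, first, second):
    """Return "A" if the output in slot A (first) wins, else "B"."""
    if question in BIASED:
        return "A"
    return "A" if len(first) >= len(second) else "B"


def pairwise(question, output_a, output_b):
    """Return "a", "b", or "tie" after judging in both orders."""
    # Round 1: output_a sits in slot A, output_b in slot B.
    round1 = "a" if fake_judge(question, output_a, output_b) == "A" else "b"
    # Round 2: order swapped, so slot A now holds output_b.
    round2 = "b" if fake_judge(question, output_b, output_a) == "A" else "a"
    # Same physical output won both rounds -> consistent verdict.
    return round1 if round1 == round2 else "tie"


# Illustrative cases (assumed data, not from the lesson).
CASES = [
    ("q1", "a long, detailed answer", "short"),
    ("q2", "answer one", "answer two"),
    ("q3", "short", "a long, detailed answer"),
]


def run_suite(cases):
    """Format one report line per case, matching the expected output."""
    lines = []
    for i, (question, a, b) in enumerate(cases, start=1):
        verdict = pairwise(question, a, b)
        if verdict == "tie":
            lines.append(
                f"case {i}: tie     (position-biased — judge disagrees with itself)"
            )
        else:
            lines.append(f"case {i}: {verdict} wins  (consistent)")
    return lines


if __name__ == "__main__":
    print("\n".join(run_suite(CASES)))
```

The key step is the translation from slot to physical output: a slot "A" win means output_a in the first round but output_b in the second, so the two raw results cannot be compared directly. With this stand-in judge, two outputs of equal length also come back as a tie, because the judge falls back to slot A in both orders.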
