LLM-as-judge — when the judge is another model — step 6 of 9
The judge returns both a 1-5 numeric score AND a binary passed
flag (some teams emit both during a transition). The code gates on
the score with a hardcoded threshold, score >= 4, but the
rubric was designed for binary pass/fail and the 1-5 axis is
uncalibrated. As Hamel Husain puts it, averaging Likert scores
produces false precision.
Two cases come in: one where the model judged 4/5 with passed: False
(the answer is technically wrong but well-written — style bias),
and one where the model judged 3/5 with passed: True (the answer
is correct but plain).
Fix the gate so it reads the binary verdict instead of the numeric score.
Expected output:
case 1: ship=False
case 2: ship=True
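A minimal sketch of the fix, assuming the judge verdict arrives as a dict with "score" and "passed" keys (the field names and the should_ship helper are illustrative, not a fixed API):

```python
# Hypothetical judge verdicts for the two cases above.
case1 = {"score": 4, "passed": False}  # well-written but wrong (style bias)
case2 = {"score": 3, "passed": True}   # correct but plain

def should_ship(verdict: dict) -> bool:
    # Gate on the binary verdict the rubric was designed for,
    # not the uncalibrated 1-5 score.
    # Buggy version: return verdict["score"] >= 4
    return verdict["passed"]

for i, case in enumerate([case1, case2], start=1):
    print(f"case {i}: ship={should_ship(case)}")
# → case 1: ship=False
# → case 2: ship=True
```

Note that under the buggy score >= 4 gate, case 1 would ship (4/5 clears the threshold despite passed: False) and case 2 would not, which is exactly backwards from the expected output.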