LLM-as-judge — when the judge is another model — step 6 of 9
The judge returns both a 1-5 numeric score AND a binary passed
flag (some teams emit both during a transition). The code gates on
the score with a hardcoded threshold, score >= 4, but the
rubric was designed for binary pass/fail and the 1-5 axis is
uncalibrated. As Hamel Husain puts it, averaging Likert scores
produces false precision.
Two cases come in: one where the model judged 4/5 with passed: False
(the answer is technically wrong but well-written — style bias),
and one where the model judged 3/5 with passed: True (the answer
is correct but plain).
Fix the gate so it reads the binary verdict instead of the numeric score.
Expected output:
case 1: ship=False
case 2: ship=True
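A minimal sketch of the fix, assuming the judge verdict arrives as a dict with "score" and "passed" keys (the field names and the should_ship helper are illustrative, not a fixed API):

```python
# Hypothetical judge verdicts for the two cases above.
case1 = {"score": 4, "passed": False}  # well-written but wrong (style bias)
case2 = {"score": 3, "passed": True}   # correct but plain

def should_ship(verdict: dict) -> bool:
    # Gate on the binary verdict the rubric was designed for,
    # not the uncalibrated 1-5 score.
    # Buggy version: return verdict["score"] >= 4
    return verdict["passed"]

for i, case in enumerate([case1, case2], start=1):
    print(f"case {i}: ship={should_ship(case)}")
# → case 1: ship=False
# → case 2: ship=True
```

Note that under the buggy score >= 4 gate, case 1 would ship (4/5 clears the threshold despite passed: False) and case 2 would not, which is exactly backwards from the expected output.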