promptdojo_

The judge returns both a 1-5 numeric score and a binary passed flag (some teams emit both during a transition). The code gates on the score with a hardcoded threshold, score >= 4, but the rubric was designed for binary pass/fail and the 1-5 axis is uncalibrated. As Hamel Husain puts it: averaging Likert scores produces false precision.

Two cases come in: one where the model judged 4/5 with passed: False (the answer is wrong but well written, which is style bias inflating the score), and one where the model judged 3/5 with passed: True (the answer is correct but plain). A score >= 4 gate gets both wrong: it ships the first and blocks the second.

Fix the gate so it reads the binary verdict instead of the numeric score.

Expected output:

case 1: ship=False
case 2: ship=True
The break is on line 8 — but read the whole snippet first.
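
The exercise's own snippet is not reproduced here, so the following is only a minimal sketch of the idea, not that snippet. It assumes each judge verdict is a dict with `score` and `passed` keys; the names `cases`, `should_ship_by_score`, and `should_ship` are hypothetical and chosen for illustration.

```python
# Hypothetical judge outputs matching the two cases described above.
# The dict shape ("score", "passed") is an assumption, not a confirmed schema.
cases = [
    {"score": 4, "passed": False},  # well written but wrong
    {"score": 3, "passed": True},   # correct but plain
]


def should_ship_by_score(verdict, threshold=4):
    """The gate as described: thresholds the uncalibrated 1-5 score."""
    return verdict["score"] >= threshold


def should_ship(verdict):
    """Gate on the binary verdict only; the numeric score is ignored."""
    return verdict["passed"] is True


for i, verdict in enumerate(cases, start=1):
    print(f"case {i}: ship={should_ship(verdict)}")
```

Comparing against `True` with `is` means a missing or non-boolean `passed` value fails closed rather than shipping by accident, which is one reasonable choice for a release gate.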
