LLM-as-judge — when the judge is another model (step 8/9) · eval-driven ai development

Build a rubric-style eval suite. Write run_judge_suite(cases) that:

Takes a list of dicts, each shaped {"question": str, "answer": str, "rubric": str}.
For each case, calls judge_rubric(question, answer, rubric). The judge returns {"passed": bool, "critique": str}.
Counts how many cases pass.
Returns a dict {"total": <count>, "passed": <count of passes>, "failed": <count of fails>, "pass_rate": <0.0-1.0 rounded to 2 places>}.

The script will run a 4-case suite. Expected output:

total=4 passed=2 failed=2 pass_rate=0.5
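
One way the steps above could be sketched is shown below. The real judge_rubric is assumed to be supplied by the exercise environment and to call a judge model; the version here is a hypothetical stand-in that passes a case when the rubric text appears in the answer, so the sketch runs offline. The four sample cases and the choice to report a pass_rate of 0.0 for an empty suite are also assumptions, not part of the exercise.

```python
def judge_rubric(question, answer, rubric):
    # Hypothetical stand-in for the real judge model (assumption):
    # a case passes when the rubric text appears in the answer.
    passed = rubric.lower() in answer.lower()
    critique = "meets rubric" if passed else f"answer does not mention '{rubric}'"
    return {"passed": passed, "critique": critique}


def run_judge_suite(cases):
    # Judge every case and count the passing verdicts.
    total = len(cases)
    passed = 0
    for case in cases:
        verdict = judge_rubric(case["question"], case["answer"], case["rubric"])
        if verdict["passed"]:
            passed += 1
    # Assumption: an empty suite reports 0.0 rather than dividing by zero.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": pass_rate,
    }


# Illustrative cases (made up for this sketch): two should pass, two should fail.
cases = [
    {"question": "What is the capital of France?",
     "answer": "The capital of France is Paris.",
     "rubric": "Paris"},
    {"question": "What is 2 + 2?",
     "answer": "2 + 2 equals 5.",
     "rubric": "4"},
    {"question": "Which planet is known as the Red Planet?",
     "answer": "Mars is known as the Red Planet.",
     "rubric": "Mars"},
    {"question": "At what Celsius temperature does water boil at sea level?",
     "answer": "Water boils at 90 degrees.",
     "rubric": "100"},
]

summary = run_judge_suite(cases)
print(
    f"total={summary['total']} passed={summary['passed']} "
    f"failed={summary['failed']} pass_rate={summary['pass_rate']}"
)
```

Deriving failed as total minus passed keeps the two counts consistent by construction, and the critique field is ignored here because the summary only needs the pass/fail verdict.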

