LLM-as-judge — when the judge is another model — step 8 of 9
Build a rubric-style eval suite. Write run_judge_suite(cases) that:
- Takes a list of dicts, each shaped {"question": str, "answer": str, "rubric": str}.
- For each case, calls judge_rubric(question, answer, rubric). The judge returns {"passed": bool, "critique": str}.
- Counts how many cases pass.
- Returns a dict {"total": <count>, "passed": <count of passes>, "failed": <count of fails>, "pass_rate": <0.0-1.0 rounded to 2 places>}.
The script will run a 4-case suite. Expected output:
total=4 passed=2 failed=2 pass_rate=0.5
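Here is a minimal sketch of one way to satisfy the spec. In the exercise, judge_rubric is provided for you; the stub below is a hypothetical stand-in (it just checks for a non-empty answer) so the sketch runs on its own, and the two sample cases are illustrative, not the 4-case suite the script uses.

```python
def judge_rubric(question, answer, rubric):
    # Hypothetical stand-in for the course-provided judge: the real one
    # asks an LLM to grade the answer against the rubric. This stub only
    # checks that the answer is non-empty so the sketch runs locally.
    ok = bool(answer.strip())
    return {"passed": ok, "critique": "" if ok else "Answer is empty."}


def run_judge_suite(cases):
    # Run the judge over every case and tally passes vs. fails.
    passed = sum(
        1 for case in cases
        if judge_rubric(case["question"], case["answer"], case["rubric"])["passed"]
    )
    total = len(cases)
    failed = total - passed
    # Guard against an empty suite to avoid dividing by zero.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"total": total, "passed": passed, "failed": failed, "pass_rate": pass_rate}


suite = run_judge_suite([
    {"question": "2 + 2?", "answer": "4", "rubric": "Must state 4."},
    {"question": "Capital of France?", "answer": "", "rubric": "Must say Paris."},
])
print(f"total={suite['total']} passed={suite['passed']} "
      f"failed={suite['failed']} pass_rate={suite['pass_rate']}")
# -> total=2 passed=1 failed=1 pass_rate=0.5
```

Note that pass_rate is rounded to two places, matching the spec, and the summary line is printed in the same total=... passed=... format as the expected output.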