Build a rubric-style eval suite. Write run_judge_suite(cases) that:
- Takes a list of dicts, each shaped {"question": str, "answer": str, "rubric": str}.
- For each case, calls judge_rubric(question, answer, rubric). The judge returns {"passed": bool, "critique": str}.
- Counts how many cases pass.
- Returns a dict {"total": <count>, "passed": <count of passes>, "failed": <count of fails>, "pass_rate": <0.0-1.0 rounded to 2 places>}.
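One way the requirements above could be met is sketched here. The body of judge_rubric is a hypothetical stand-in (a case-insensitive substring check), since the task only specifies its signature and return shape; a real judge would likely call a model. SAMPLE_CASES and the choice to report a pass_rate of 0.0 for an empty suite are likewise assumptions, not part of the task.

```python
from typing import Dict, List


def judge_rubric(question: str, answer: str, rubric: str) -> dict:
    """Hypothetical stub judge: passes when the rubric text appears in the answer.

    Only the signature and the {"passed", "critique"} return shape come from
    the task; the matching logic is an assumption for illustration.
    """
    passed = rubric.lower() in answer.lower()
    critique = "rubric text found" if passed else "rubric text missing"
    return {"passed": passed, "critique": critique}


def run_judge_suite(cases: List[Dict[str, str]]) -> dict:
    """Run every case through the judge and summarize the results."""
    total = len(cases)
    passed = 0
    for case in cases:
        verdict = judge_rubric(case["question"], case["answer"], case["rubric"])
        if verdict["passed"]:
            passed += 1
    failed = total - passed
    # Assumption: an empty suite reports 0.0 instead of dividing by zero.
    pass_rate = round(passed / total, 2) if total else 0.0
    return {"total": total, "passed": passed, "failed": failed, "pass_rate": pass_rate}


# Hypothetical 4-case suite: the first two answers satisfy the stub judge.
SAMPLE_CASES = [
    {"question": "What is 2 + 2?", "answer": "4", "rubric": "4"},
    {"question": "Capital of France?", "answer": "Paris is the capital.", "rubric": "paris"},
    {"question": "What color is the sky?", "answer": "Green", "rubric": "blue"},
    {"question": "Name a prime number.", "answer": "Nine", "rubric": "7"},
]

result = run_judge_suite(SAMPLE_CASES)
```

Deriving failed as total minus passed keeps the two counts consistent by construction, so they always sum to total.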
The script will run a 4-case suite. Expected output:
total=4 passed=2 failed=2 pass_rate=0.5
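The expected line can be produced from the result dict with a small formatter. This is a sketch only: format_summary is a hypothetical helper name, and the input dict is written out by hand here to match the 4-case result described above.

```python
def format_summary(result: dict) -> str:
    """Render the summary dict as space-separated key=value pairs."""
    keys = ("total", "passed", "failed", "pass_rate")
    return " ".join(f"{key}={result[key]}" for key in keys)


# Hand-written result matching the 4-case suite (2 passes, 2 failures).
summary = format_summary({"total": 4, "passed": 2, "failed": 2, "pass_rate": 0.5})
print(summary)  # total=4 passed=2 failed=2 pass_rate=0.5
```

Because pass_rate is a float, Python prints 0.5 and not 0.50; rounding to 2 places limits the precision but does not pad trailing zeros.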