Add evals and traces — measure the agent, don't trust it — step 8 of 9
Write run_eval_suite(cases) that:
- Takes a list of dicts, each shaped {"question": str, "expected": str, "max_turns": int}.
- For each case:
  - Calls run_agent(question) (provided), which returns {"answer": str, "trace": list}.
  - Asserts THREE things, combined with AND into a single boolean:
    a) the expected substring is in answer,
    b) len(trace) is <= max_turns,
    c) trace[-1]["stop_reason"] == "end_turn".
  - Counts passes (all three asserts true) and fails.
- Returns {"total": <count>, "passed": <count>, "failed": <count>, "pass_rate": <0.0-1.0 rounded to 2 places>}.

Three cases are provided. Expected output:
total=3 passed=2 failed=1 pass_rate=0.67
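The harness above can be sketched as follows. Note that run_agent is provided by the exercise, not shown here; the stub version below is a hypothetical stand-in (canned answers, fixed two-step trace) so the harness can be run on its own.

```python
def run_agent(question):
    # Hypothetical stand-in for the provided run_agent: returns a canned
    # answer and a short trace ending in a clean "end_turn" stop.
    canned = {
        "capital of France": "The capital of France is Paris.",
        "2 + 2": "2 + 2 equals 4.",
    }
    trace = [
        {"role": "assistant", "stop_reason": "tool_use"},
        {"role": "assistant", "stop_reason": "end_turn"},
    ]
    return {"answer": canned.get(question, "I don't know."), "trace": trace}

def run_eval_suite(cases):
    passed = 0
    for case in cases:
        result = run_agent(case["question"])
        answer, trace = result["answer"], result["trace"]
        ok = (
            case["expected"] in answer                   # a) expected substring
            and len(trace) <= case["max_turns"]          # b) within turn budget
            and trace[-1]["stop_reason"] == "end_turn"   # c) clean stop
        )
        passed += ok  # bool counts as 0 or 1
    total = len(cases)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Illustrative cases (the exercise supplies its own three):
cases = [
    {"question": "capital of France", "expected": "Paris", "max_turns": 3},
    {"question": "2 + 2", "expected": "4", "max_turns": 3},
    {"question": "capital of France", "expected": "Lyon", "max_turns": 3},
]
print(run_eval_suite(cases))
# → {'total': 3, 'passed': 2, 'failed': 1, 'pass_rate': 0.67}
```

With these stand-in cases the third one fails (no "Lyon" in the answer), giving the same 2-of-3 shape as the expected output above.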