Add evals and traces — measure the agent, don't trust it — step 8 of 9
Write run_eval_suite(cases) that:
- Takes a list of dicts, each shaped {"question": str, "expected": str, "max_turns": int}.
- For each case:
  - Calls run_agent(question) (provided), which returns {"answer": str, "trace": list}.
  - Asserts THREE things, combined with AND into a single boolean:
    a) the expected substring is in answer,
    b) len(trace) is <= max_turns,
    c) trace[-1]["stop_reason"] == "end_turn".
  - Counts passes (all three asserts true) and fails.
- Returns {"total": <count>, "passed": <count>, "failed": <count>, "pass_rate": <0.0-1.0 rounded to 2 places>}.

Three cases are provided. Expected output:
total=3 passed=2 failed=1 pass_rate=0.67
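The harness above can be sketched as follows. Note that run_agent is provided by the exercise, not shown here; the stub version below is a hypothetical stand-in (canned answers, fixed two-step trace) so the harness can be run on its own.

```python
def run_agent(question):
    # Hypothetical stand-in for the provided run_agent: returns a canned
    # answer and a short trace ending in a clean "end_turn" stop.
    canned = {
        "capital of France": "The capital of France is Paris.",
        "2 + 2": "2 + 2 equals 4.",
    }
    trace = [
        {"role": "assistant", "stop_reason": "tool_use"},
        {"role": "assistant", "stop_reason": "end_turn"},
    ]
    return {"answer": canned.get(question, "I don't know."), "trace": trace}

def run_eval_suite(cases):
    passed = 0
    for case in cases:
        result = run_agent(case["question"])
        answer, trace = result["answer"], result["trace"]
        ok = (
            case["expected"] in answer                   # a) expected substring
            and len(trace) <= case["max_turns"]          # b) within turn budget
            and trace[-1]["stop_reason"] == "end_turn"   # c) clean stop
        )
        passed += ok  # bool counts as 0 or 1
    total = len(cases)
    return {
        "total": total,
        "passed": passed,
        "failed": total - passed,
        "pass_rate": round(passed / total, 2) if total else 0.0,
    }

# Illustrative cases (the exercise supplies its own three):
cases = [
    {"question": "capital of France", "expected": "Paris", "max_turns": 3},
    {"question": "2 + 2", "expected": "4", "max_turns": 3},
    {"question": "capital of France", "expected": "Lyon", "max_turns": 3},
]
print(run_eval_suite(cases))
# → {'total': 3, 'passed': 2, 'failed': 1, 'pass_rate': 0.67}
```

With these stand-in cases the third one fails (no "Lyon" in the answer), giving the same 2-of-3 shape as the expected output above.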