Add evals and traces — measure the agent, don't trust it (step 9/9) · capstone

Checkpoint

One last thing before we move on. Same surface as a write step — but the lesson doesn't complete until this passes.

Final drill. Build the fully instrumented agent + eval suite. Write run_agent(question) that:

Loops up to 5 times calling fake_create(question, turn_index) which returns a dict with stop_reason, text, and a nested usage dict containing input_tokens and output_tokens (matching the SDK shape established in lesson 02).
Builds a trace list — one fresh dict per turn, with keys turn (1-based), stop_reason, tokens (input + output).
On end_turn: returns {"answer": <text>, "trace": <trace>}.
On cap: returns {"answer": "capped", "trace": <trace>}.

Then write run_eval_suite(cases) that:

For each case {"question", "expected", "max_turns"}, runs the agent.
All-three-true: expected in answer, len(trace) <= max_turns, trace[-1]["stop_reason"] == "end_turn".
Returns {"total", "passed", "failed", "pass_rate" (rounded 2)}.

Two cases run for you. Expected output:

total=2 passed=1 failed=1 pass_rate=0.5

⌘↵ runs the editor.read, then continue.

Checkpoint

One last thing before we move on. Same surface as a write step — but the lesson doesn't complete until this passes.

Final drill. Build the fully instrumented agent + eval suite. Write run_agent(question) that:

Loops up to 5 times calling fake_create(question, turn_index) which returns a dict with stop_reason, text, and a nested usage dict containing input_tokens and output_tokens (matching the SDK shape established in lesson 02).
Builds a trace list — one fresh dict per turn, with keys turn (1-based), stop_reason, tokens (input + output).
On end_turn: returns {"answer": <text>, "trace": <trace>}.
On cap: returns {"answer": "capped", "trace": <trace>}.

Then write run_eval_suite(cases) that:

For each case {"question", "expected", "max_turns"}, runs the agent.
All-three-true: expected in answer, len(trace) <= max_turns, trace[-1]["stop_reason"] == "end_turn".
Returns {"total", "passed", "failed", "pass_rate" (rounded 2)}.

Two cases run for you. Expected output:

total=2 passed=1 failed=1 pass_rate=0.5

this step needs the editor

on desktop today; in the app (coming soon). save your spot and we'll bring you back here when you're ready.

open this same url on a laptop to keep going today.