Add evals and traces — measure the agent, don't trust it — step 9 of 9
Checkpoint
One last thing before we move on. Same surface as a write step — but the lesson doesn't complete until this passes.
Final drill. Build the fully instrumented agent + eval suite.
Write run_agent(question) that:
- Loops up to 5 times calling
fake_create(question, turn_index)which returns a dict withstop_reason,text, and a nestedusagedict containinginput_tokensandoutput_tokens(matching the SDK shape established in lesson 02). - Builds a
tracelist — one fresh dict per turn, with keysturn(1-based),stop_reason,tokens(input + output). - On
end_turn: returns{"answer": <text>, "trace": <trace>}. - On cap: returns
{"answer": "capped", "trace": <trace>}.
Then write run_eval_suite(cases) that:
- For each case
{"question", "expected", "max_turns"}, runs the agent. - All-three-true:
expectedinanswer,len(trace) <= max_turns,trace[-1]["stop_reason"] == "end_turn". - Returns
{"total", "passed", "failed", "pass_rate" (rounded 2)}.
Two cases run for you. Expected output:
total=2 passed=1 failed=1 pass_rate=0.5
⌘↵ runs the editor.read, then continue.
Checkpoint
One last thing before we move on. Same surface as a write step — but the lesson doesn't complete until this passes.
Final drill. Build the fully instrumented agent + eval suite.
Write run_agent(question) that:
- Loops up to 5 times calling
fake_create(question, turn_index)which returns a dict withstop_reason,text, and a nestedusagedict containinginput_tokensandoutput_tokens(matching the SDK shape established in lesson 02). - Builds a
tracelist — one fresh dict per turn, with keysturn(1-based),stop_reason,tokens(input + output). - On
end_turn: returns{"answer": <text>, "trace": <trace>}. - On cap: returns
{"answer": "capped", "trace": <trace>}.
Then write run_eval_suite(cases) that:
- For each case
{"question", "expected", "max_turns"}, runs the agent. - All-three-true:
expectedinanswer,len(trace) <= max_turns,trace[-1]["stop_reason"] == "end_turn". - Returns
{"total", "passed", "failed", "pass_rate" (rounded 2)}.
Two cases run for you. Expected output:
total=2 passed=1 failed=1 pass_rate=0.5
this step needs the editor
on desktop today; in the app (coming soon). save your spot and we'll bring you back here when you're ready.