Add evals and traces — measure the agent, don't trust it — step 1 of 9
An agent without measurement is a bet, not a feature
Lesson 03 made the agent reject bad inputs. This lesson makes the agent measurable. Two pieces, both load-bearing in production:
1. Trace — record everything, every turn
A trace is the structured record of every turn the agent took: which tool was called, what it returned, what stop_reason came back, how many tokens flowed in and out. Chapter 20 explained the shape of traces. This lesson wires one into your agent.
The minimum viable trace, per turn:
{
    "turn": 1,
    "stop_reason": "tool_use",
    "tools_called": ["search"],
    "input_tokens": 412,
    "output_tokens": 38,
    "validation_errors": 0
}
That's enough to answer the four questions you'll be asked when something breaks: how many turns did it take, what stopped it, what tools ran, what did it cost. Production tracers (LangSmith, Helicone, Phoenix) capture more — span trees, full inputs/outputs, prompt diffs across runs — but the shape above is what every single one of them encodes.
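Here is a minimal sketch of how that per-turn record can be captured, assuming an Anthropic-style Messages API response (stop_reason, usage.input_tokens, content blocks); call_model and run_tool are stand-ins for whatever you built in the earlier lessons, not functions this lesson prescribes.

    def run_agent_with_trace(question: str, max_turns: int = 5) -> dict:
        """Run the agent loop, appending one trace entry per turn."""
        messages = [{"role": "user", "content": question}]
        trace = []

        for turn in range(1, max_turns + 1):
            response = call_model(messages)  # your model call from lesson 03
            tool_calls = [b for b in response.content if b.type == "tool_use"]

            trace.append({
                "turn": turn,
                "stop_reason": response.stop_reason,
                "tools_called": [b.name for b in tool_calls],
                "input_tokens": response.usage.input_tokens,
                "output_tokens": response.usage.output_tokens,
                "validation_errors": 0,  # incremented by your lesson-03 input guards
            })

            if response.stop_reason != "tool_use":
                break
            messages.append({"role": "assistant", "content": response.content})
            # run_tool is assumed to return a tool_result block for each call
            messages.append({"role": "user", "content": [run_tool(b) for b in tool_calls]})

        answer = "".join(b.text for b in response.content if b.type == "text")
        return {"answer": answer, "trace": trace}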
2. Eval suite — assert on outputs, automate the loop
Chapter 21 ranked four eval patterns. The agent doesn't change which pattern you pick — the agent's output is just a string and the suite runs the same way. What changes: the cases. An eval case for an agent is (question, expected_property), where expected_property might be:
- exact — the answer string equals a known reference
- contains — the answer contains a required substring ("Paris", "$199.99", "2024-04-15")
- schema — the answer matches a JSON schema (when the agent returns structured output)
- judge — an LLM-as-judge (lesson 21-02) verifies semantic correctness
For an agent capstone, the practical default is contains for fact-recall cases plus an LLM-as-judge for "did it sound like a human did this." Those two patterns cover 90% of capstone needs.
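A sketch of how those expected_property checks might be dispatched. The helper name check_case and the llm_judge function are illustrative, not from the lesson, and jsonschema is just one option for the schema check.

    import json

    import jsonschema  # any JSON Schema validator works here

    def check_case(answer: str, kind: str, expected) -> bool:
        """Check one expected_property against the agent's answer string."""
        if kind == "exact":
            return answer.strip() == expected
        if kind == "contains":
            return expected in answer
        if kind == "schema":
            try:
                jsonschema.validate(json.loads(answer), expected)
                return True
            except (json.JSONDecodeError, jsonschema.ValidationError):
                return False
        if kind == "judge":
            return llm_judge(answer, expected)  # LLM-as-judge rubric from lesson 21-02
        raise ValueError(f"unknown eval kind: {kind}")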
What to assert, and where
The eval suite calls the agent with a question, gets back the result + trace, then asserts properties of both:
result = run_agent("capital of France?")
assert "Paris" in result["answer"] # output check
assert len(result["trace"]) <= 3 # turn cap
assert result["trace"][-1]["stop_reason"] == "end_turn" # didn't error out
Three assertions, three different layers. Output correctness, turn budget, exit cleanliness. Real production suites have 50–200 cases; the principle is the same.
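One way run_eval_suite can tie those three layers together. The case format (dicts with question, contains, and an optional max_turns) and the example questions are assumptions for illustration, not a format the lesson fixes.

    def run_eval_suite(cases: list[dict]) -> None:
        """Run the agent on every case and assert output + trace properties."""
        for case in cases:
            result = run_agent_with_trace(case["question"])
            assert case["contains"] in result["answer"], f"wrong answer: {case['question']}"
            assert len(result["trace"]) <= case.get("max_turns", 3), f"turn budget blown: {case['question']}"
            assert result["trace"][-1]["stop_reason"] == "end_turn", f"dirty exit: {case['question']}"
        print(f"{len(cases)} cases passed")

    run_eval_suite([
        {"question": "capital of France?", "contains": "Paris"},
        {"question": "price of the Pro plan?", "contains": "$199.99", "max_turns": 2},
    ])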
What you'll build
A run_agent_with_trace(question) that emits a trace per turn, then run_eval_suite(cases) that runs the agent on each case and asserts both output and trace properties. Then the bug everyone hits: writing an eval that just re-asserts what the agent did (tautological), instead of checking against an independent ground truth.
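To make the tautology concrete, here is the shape of the bug next to the fix; the variable names are illustrative.

    # Tautological: the expected value is derived from the agent's own output,
    # so the assertion can never fail no matter how wrong the answer is.
    result = run_agent_with_trace("capital of France?")
    assert result["answer"] in result["answer"]   # always passes

    # Independent ground truth: the expected value is fixed before the agent runs,
    # written by a human or pulled from a reference dataset.
    assert "Paris" in result["answer"]            # can actually fail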