Read the trace, not the chat — find the broken turn before reading the user's complaint — step 1 of 9
When the model lies, the trace tells you why
A user complains. The chat shows a wrong answer. Three things will happen if you start by reading the chat:
- You'll re-read the same wrong sentence five times looking for the bug.
- You'll think the model "hallucinated" and tweet about it.
- You'll ship a regex post-filter that hides the symptom.
None of those finds the actual bug. The trace does. Always read the trace first. The chat is the symptom; the trace is the evidence.
This lesson teaches the methodology: find the broken turn from the trace, classify the failure class, fix UPSTREAM of where it surfaced, then add an eval that would have caught it. Chapter 20 gave you trace literacy — what each field means. Chapter 21 gave you eval discipline — what to assert. This lesson wires them into a single triage flow.
What's in a trace, restated
From the capstone you already know the shape:
```json
{
  "turn": 1,
  "stop_reason": "tool_use",
  "tools_called": ["search"],
  "tokens": 412,
  "validation_errors": 0
}
```
Five fields. Read them in this order when triaging:
- `validation_errors`: any non-zero value? That turn shows the model called a tool with bad args. Likely the source.
- `stop_reason`: anything other than `tool_use`/`end_turn`? `max_tokens` truncates output mid-thought; `pause_turn` means a hosted tool ran out of time; `refusal` is a safety stop.
- `tools_called`: what tools ran? Re-running the same tool 3+ times in a row is a tool-loop bug — the tool result didn't actually help.
- `tokens`: spike on one turn? The model received a giant prompt — chunking issue, or a tool returned a 50K-character response.
Most bugs surface in the first or third check. The trace is sorted; the first turn that violates an assertion is usually where the bug started.
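The triage order can be sketched as a single scan over the trace. This is a minimal sketch, assuming a trace is a list of turn dicts shaped like the example above; the 3-repeat and 5x-median thresholds are illustrative assumptions, not fixed rules.

```python
def triage_checks(trace):
    """Scan a trace (list of turn dicts) and return (turn, finding)
    pairs in the triage order described above."""
    findings = []
    # 1. validation_errors: any non-zero value flags bad tool args.
    for t in trace:
        if t.get("validation_errors", 0):
            findings.append((t["turn"], "validation_errors: bad tool args"))
    # 2. stop_reason: anything outside the normal pair is suspicious.
    last = trace[-1]
    if last["stop_reason"] not in ("tool_use", "end_turn"):
        findings.append((last["turn"], f"abnormal stop: {last['stop_reason']}"))
    # 3. tools_called: the same tool 3+ turns in a row is a tool loop.
    streak, prev = 0, None
    for t in trace:
        cur = tuple(t.get("tools_called", []))
        streak = streak + 1 if cur and cur == prev else 1
        prev = cur
        if streak >= 3:
            findings.append((t["turn"], f"tool loop: {list(cur)}"))
            break
    # 4. tokens: a spike well above the median suggests a giant prompt
    #    or an oversized tool result (5x is an illustrative threshold).
    toks = sorted(t["tokens"] for t in trace)
    median = toks[len(toks) // 2]
    for t in trace:
        if t["tokens"] > 5 * median:
            findings.append((t["turn"], f"token spike: {t['tokens']} tokens"))
    return findings
```

Running this on a trace gives you the findings in triage order, so the first entry is usually the turn to stare at.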
Where the bug actually lives
Production AI bugs sort into four classes. The trace tells you which:
| Class | Symptom in the trace | Actual fix |
|---|---|---|
| Retrieval | tool returned wrong/missing data | fix retrieval, not the prompt |
| Prompt | model interpreted instructions wrong | fix the prompt, add a few-shot |
| True hallucination | no tool ran, no retrieval; model made it up | constrain output (schema, citations) |
| Downstream mangling | trace looks fine; output is wrong | fix the JSON parse / regex / display layer |
Applying a fix from one class to a bug from another is the most common waste of debugging time. The fix lives at the layer the trace points to, not at the layer where the user sees the symptom.
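The table's decision logic can be sketched as a small classifier. The three boolean inputs are hypothetical flags you fill in after reading the trace — they are not trace fields — and the branch order follows the table above.

```python
def classify_failure(tool_ran, tool_data_ok, trace_output_ok):
    """Map trace evidence to one of the four failure classes.
    Inputs are judgments made while reading the trace, not raw fields."""
    if trace_output_ok:
        return "downstream mangling"   # trace looks fine; output is wrong
    if not tool_ran:
        return "true hallucination"    # no tool, no retrieval; made it up
    if not tool_data_ok:
        return "retrieval"             # tool returned wrong/missing data
    return "prompt"                    # data was fine; interpretation wasn't
```

The order matters: check the trace-vs-output mismatch first, because a mangling bug can masquerade as any of the other three.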
What "fix upstream of where it surfaced" means
A user sees a JSON parse error. The natural reflex: catch the parse error, retry, ship. The trace says: the model returned valid JSON on turn 1, and your post-processor mangled it on the way to the display layer.
The fix isn't "catch and retry." The fix is "stop mangling the JSON." Otherwise next month a different bug will produce the same symptom and you'll add a second retry, then a third.
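Here is an illustrative reproduction of that failure mode. The regex post-processor is a hypothetical bad fix invented for this sketch; the model's output was already valid JSON, and stripping characters from it is what creates the parse error the user sees.

```python
import json
import re

# Valid JSON straight from the model, with correctly escaped quotes.
model_output = '{"answer": "use \\"quotes\\" carefully"}'

# Symptom-level "fix": a post-processor that strips backslashes
# before parsing. This mangles the escapes the model emitted correctly.
mangled = re.sub(r"\\", "", model_output)
try:
    json.loads(mangled)
    parsed_after_mangling = True
except json.JSONDecodeError:
    parsed_after_mangling = False  # the parse error the user reported

# Upstream fix: stop mangling. The original output parses as-is.
parsed = json.loads(model_output)
```

The trace would show a clean turn 1; only the display layer sees garbage. That points the fix at the post-processor, not at the model.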
What you'll build
A `find_suspect_turn(trace)` that scans the trace and returns the first turn that violates an assertion (non-zero validation errors, non-`end_turn` final stop, etc.). Then a `summarize_trace(trace)` that produces a one-line summary suitable for filing in a bug ticket. Then the discipline drill: given a trace plus a user complaint, classify the failure class.
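As a starting point, here is a minimal sketch of those two helpers, assuming the five-field turn shape from this lesson. Which assertions to check, and the summary wording, are illustrative choices.

```python
def find_suspect_turn(trace):
    """Return the first turn dict that violates an assertion, or None."""
    for i, t in enumerate(trace):
        if t.get("validation_errors", 0):
            return t                      # bad tool args
        is_final = i == len(trace) - 1
        if is_final and t["stop_reason"] != "end_turn":
            return t                      # truncated, paused, or refused
    return None

def summarize_trace(trace):
    """One-line summary suitable for pasting into a bug ticket."""
    suspect = find_suspect_turn(trace)
    total = sum(t["tokens"] for t in trace)
    if suspect is None:
        return f"{len(trace)} turns, {total} tokens, no assertion violations"
    return (f"{len(trace)} turns, {total} tokens; suspect turn "
            f"{suspect['turn']}: stop={suspect['stop_reason']}, "
            f"validation_errors={suspect.get('validation_errors', 0)}")
```

Returning the whole turn dict (rather than just its number) keeps the evidence attached to the finding when you paste it into a ticket.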