Check before you trust
Chapter zero says “check the work.” This chapter turns that habit into evals.
A model can produce a sentence that is polished, confident, and wrong. That is not a rare edge case. It is a routine failure mode of language models.
So the question is not:
Did the answer sound good?
The question is:
What would prove this output did the job?
For a meeting-notes tool, checks might be:
- every action item has an owner
- deadlines are copied from the notes, not invented
- uncertainty is flagged instead of hidden
- the follow-up email does not promise work nobody agreed to do
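The first two checks in that list can be sketched in code. This is a minimal illustration, not a finished evaluator: the `ActionItem` shape and the `check_action_items` function are hypothetical names, and "copied from the notes" is approximated by a crude case-insensitive substring match.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    # Hypothetical shape for one extracted action item.
    task: str
    owner: Optional[str]      # None when the tool named no owner
    deadline: Optional[str]   # deadline text as the tool reported it


def check_action_items(notes: str, items: list[ActionItem]) -> list[str]:
    """Return readable failures; an empty list means both checks passed."""
    failures = []
    for item in items:
        # Check 1: every action item has an owner.
        if not item.owner or not item.owner.strip():
            failures.append(f"no owner: {item.task}")
        # Check 2: a deadline counts as copied only if its text is in the notes.
        if item.deadline and item.deadline.lower() not in notes.lower():
            failures.append(f"deadline not in notes: {item.task} ({item.deadline})")
    return failures


notes = "Priya will send the budget by Friday. Someone should update the wiki."
items = [
    ActionItem("send the budget", "Priya", "Friday"),
    ActionItem("update the wiki", None, "next Tuesday"),
]
failures = check_action_items(notes, items)
for failure in failures:
    print(failure)
```

The last two checks in the list (flagged uncertainty, no promises nobody agreed to) depend on meaning rather than string matching, so this sketch leaves them out. They would likely need a human reviewer or a separate grading step.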
For a JSON extraction tool, checks might be:
- the output parses
- required fields exist
- numbers are in range
- unknown fields are rejected
- examples that failed before stay fixed
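As a rough sketch, the five checks above might look like this using only the standard library. The schema (`vendor`, `total`, `currency`), the allowed range, and the function names are assumptions made up for illustration; a real tool would substitute its own fields and limits.

```python
import json

# Hypothetical schema for illustration only.
REQUIRED = {"vendor", "total"}
ALLOWED = REQUIRED | {"currency"}
TOTAL_RANGE = (0, 1_000_000)


def check_extraction(raw: str) -> list[str]:
    """Return readable failures; an empty list means every check passed."""
    try:
        data = json.loads(raw)                      # the output parses
    except json.JSONDecodeError as exc:
        return [f"does not parse: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top level is not an object"]
    failures = []
    for name in sorted(REQUIRED - set(data)):       # required fields exist
        failures.append(f"missing field: {name}")
    for name in sorted(set(data) - ALLOWED):        # unknown fields are rejected
        failures.append(f"unknown field: {name}")
    if "total" in data:                             # numbers are in range
        total = data["total"]
        if isinstance(total, bool) or not isinstance(total, (int, float)):
            failures.append("total is not a number")
        elif not TOTAL_RANGE[0] <= total <= TOTAL_RANGE[1]:
            failures.append("total out of range")
    return failures


def run_regressions(extract, cases):
    """Re-run inputs that failed before; return the names that fail again."""
    return [name for name, text in cases if check_extraction(extract(text))]


def fake_extract(text: str) -> str:
    # Stand-in for the real model call, which is outside this sketch.
    return json.dumps({"vendor": "Acme", "total": 120.5})


cases = [("invoice-with-negative-total", "Acme invoice, total 120.50")]
print(check_extraction('{"vendor": "Acme", "total": -5, "notes": "x"}'))
print(run_regressions(fake_extract, cases))
```

Returning a list of failures instead of a single pass/fail flag is a design choice: when a case breaks, the message says which check caught it.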
That is the core of eval-driven AI development. You are not trying to make AI feel reliable. You are building a gate that catches unreliable output before it reaches a user.
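One way to picture the gate is as a function that runs every check and releases the output only when none of them fail. The names here (`gate`, `GateError`, and the two toy checks) are hypothetical, and the checks are deliberately trivial; the point is the shape, not the rules.

```python
from typing import Callable

# A check takes the output text and returns a list of failure messages.
Check = Callable[[str], list[str]]


class GateError(Exception):
    """Raised when an output fails one or more checks."""


def gate(output: str, checks: list[Check]) -> str:
    """Return the output unchanged if every check passes, otherwise raise."""
    failures = [msg for check in checks for msg in check(output)]
    if failures:
        raise GateError("; ".join(failures))
    return output


def not_empty(output: str) -> list[str]:
    return [] if output.strip() else ["output is empty"]


def no_placeholder(output: str) -> list[str]:
    return ["contains TODO placeholder"] if "TODO" in output else []


checks = [not_empty, no_placeholder]
print(gate("Send the report by Friday.", checks))
try:
    gate("TODO: fill in summary", checks)
except GateError as err:
    print(f"blocked: {err}")
```

Raising an exception is only one option. A caller could instead retry the model, fall back to a safe default, or route the output to a person.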
A tool without checks is a demo. A tool with checks can become a system.