promptdojo_

The pilot's checklist test for every line in your prompt

The load-bearing sentence in Osmani's Agent Harness Engineering, lightly paraphrased to capture the spirit (his exact wording in the post: "every line in a good AGENTS.md should be traceable back to a specific thing that went wrong"):

"Every line in a good system prompt should trace back to a specific, historical failure."

This sentence is doing the work of an entire engineering principle. Read it twice.

The metaphor underneath it is the pilot's checklist. Every item on the pre-takeoff checklist for a Boeing 737 traces to a specific crash, a specific near-miss, or a specific incident the FAA investigated. Nothing on the checklist is theoretical. Nothing is "we thought it might be a good idea." Every line earned its place because not having that line cost lives.

The harness-engineering parallel is exact. Lines in AGENTS.md / CLAUDE.md / .cursorrules / skill files are checklist items. Each one is a guardrail against a recurring failure. If you can't say which failure a line is guarding against, the line shouldn't be there.

The two-sentence test

For every line in your system-prompt-equivalent, you should be able to answer two questions in under five seconds:

  1. What failure did this rule earn its place against? "Agent commented out a failing test on 2026-03-15." "Agent ran git push --force to main on 2026-04-02." "Agent kept using inline styles after we adopted Tailwind on 2026-02-20."
  2. What would have to be true for this rule to become unnecessary? "The model is post-trained to never silently disable tests." "The harness has a pre-tool hook that blocks force-push to protected branches." "Tailwind enforcement is built into the model's project-context layer."

If you can answer both, the rule earned its place AND has explicit removal criteria. That's a ratchet entry.

If you can't answer #1, it's cargo cult. Delete it.

If you can't answer #2, it's debt. Tag it for re-evaluation.
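The two-sentence test can be made mechanical. Here's a minimal sketch, assuming a hypothetical convention (not a standard) where each rule in your AGENTS.md is followed by indented `failure:` and `retire-when:` annotation lines; the audit then sorts rules into the three verdicts above.

```python
# Hypothetical convention, assumed for this sketch: each "- " rule line
# may be followed by indented "failure:" and "retire-when:" annotations.
RULES = """\
- Never comment out a failing test.
  failure: agent disabled CartTest in PR #842 (2026-03-15)
  retire-when: model is post-trained to never silently disable tests
- Be careful with destructive commands.
- Always use Tailwind utility classes, never inline styles.
  failure: agent kept emitting inline styles after Tailwind adoption (2026-02-20)
"""

def classify(meta):
    if "failure" in meta and "retire" in meta:
        return "ratchet"     # earned its place AND has removal criteria
    if "failure" not in meta:
        return "cargo cult"  # can't answer question 1: delete it
    return "debt"            # can't answer question 2: tag for re-evaluation

def audit(text):
    verdicts, rule, meta = [], None, set()
    for line in text.splitlines():
        if line.startswith("- "):
            if rule is not None:
                verdicts.append((rule, classify(meta)))
            rule, meta = line[2:], set()
        elif line.strip().startswith("failure:"):
            meta.add("failure")
        elif line.strip().startswith("retire-when:"):
            meta.add("retire")
    if rule is not None:
        verdicts.append((rule, classify(meta)))
    return verdicts

for rule, verdict in audit(RULES):
    print(f"{verdict:10}  {rule}")
```

The annotation names are illustrative; the point is that both answers live next to the rule, so the audit is a grep away instead of an archaeology project.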

Why cargo-cult rules cost you

Every line in a system prompt is a tax. Three taxes, specifically:

  • Token cost. Even with prompt caching, system prompts spend context budget. A 1,500-line AGENTS.md eats tokens you'd rather spend on the task.
  • Attention cost. The model has to weigh every rule on every turn. The more rules, the more diluted each rule's weight becomes. HumanLayer's published number: their team keeps CLAUDE.md under 60 lines on purpose.
  • Conflict cost. Two rules can subtly contradict. When they do, the model picks one (often the wrong one). Fewer rules → fewer conflicts.
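The first two taxes can be ballparked before commit. A rough sketch, assuming a crude 4-characters-per-token heuristic (not a real tokenizer) and a line cap echoing the HumanLayer 60-line target:

```python
# Rough pre-commit check on harness-file size. The chars_per_token ratio
# is a crude heuristic, not a real tokenizer; MAX_LINES echoes the
# HumanLayer 60-line target, not a universal constant.
MAX_LINES = 60

def harness_tax(text, chars_per_token=4):
    lines = [l for l in text.splitlines() if l.strip()]
    return {
        "lines": len(lines),
        "approx_tokens": len(text) // chars_per_token,
        "over_budget": len(lines) > MAX_LINES,
    }

report = harness_tax("- Never run git push --force to main.\n" * 80)
print(report)
```

Run this in CI and a bloating harness file fails loudly instead of silently taxing every turn.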

A cargo-culted AGENTS.md is worse than no AGENTS.md. It pretends to encode guardrails but actually trades attention for nothing.

How harness engineers write rules

Two patterns the experienced practitioners use:

  1. Date-stamp every rule. # 2026-03-15 — never comment out a failing test; came up when agent disabled CartTest in PR #842. The date and the citation make the trace unambiguous. When you re-audit the file six months later, you have the breadcrumb.
  2. Name the artifact, not the abstraction. Bad: "Be careful with destructive commands." Good: "Never run git push --force, rm -rf, DROP TABLE, or gh pr merge --admin without explicit user confirmation." The model can't act on abstractions; it can act on specific strings.

These two patterns alone separate ratcheted rules from cargo-culted ones.
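The date stamps from pattern 1 also make re-audits scriptable. A sketch, assuming the `# YYYY-MM-DD — rule; citation` format shown above and a six-month re-audit window (both assumptions, not a standard):

```python
import re
from datetime import date, timedelta

# Assumes the date-stamp convention from pattern 1:
#   # YYYY-MM-DD — <rule>; <citation>
# Flags entries older than the window for re-evaluation, so stale rules
# surface instead of accumulating silently.
ENTRY = re.compile(r"#\s*(\d{4})-(\d{2})-(\d{2})\s*[—-]\s*(.+)")

def stale_rules(text, today, window_days=180):
    flagged = []
    for line in text.splitlines():
        m = ENTRY.match(line.strip())
        if not m:
            continue
        y, mo, d, rule = m.groups()
        if (today - date(int(y), int(mo), int(d))) > timedelta(days=window_days):
            flagged.append(rule)
    return flagged

rules = """\
# 2026-03-15 — never comment out a failing test; agent disabled CartTest in PR #842
# 2025-07-01 — never run git push --force to main
"""
# Only the 2025 entry is past the window.
print(stale_rules(rules, today=date(2026, 4, 1)))
```

The window length is a judgment call; the mechanism matters more than the number.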

The two-direction discipline

The ratchet metaphor implies one-way motion. In practice, the discipline has two directions:

  • Forward (add): observed failure → new rule → date-stamped → committed.
  • Backward (retire): stale rule → model now handles it natively (you can demonstrate this with an eval) → removed, with the removal also logged.

Forward is the easier direction; almost everyone gets there eventually. Backward is the habit most teams skip. They end up with a CLAUDE.md that grew from 30 lines to 300 over a year, half of those rules now redundant, the model's attention diluted.

Osmani is explicit about this in the post: "Harnesses don't shrink, they move." Lines get retired as new failure modes open up. The total volume of harness work doesn't shrink. But the specific lines should move — yesterday's pre-tool hook is today's model-native behavior is tomorrow's removed line.
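A pre-tool hook like the one mentioned above can be sketched in a few lines. This is a hypothetical harness hook, not any particular product's API: the harness calls check() before executing a shell command the agent proposes, and the blocklist names concrete strings, per pattern 2.

```python
import shlex

# Hypothetical pre-tool hook: the harness calls check() before running
# any shell command the agent proposes. Blocklist entries name concrete
# artifacts (specific token sequences), not abstractions.
BLOCKED = [
    ["git", "push", "--force"],
    ["rm", "-rf"],
    ["gh", "pr", "merge", "--admin"],
]

def check(command):
    tokens = shlex.split(command)
    for pattern in BLOCKED:
        # Block if the pattern's tokens all appear, in order, in the command.
        it = iter(tokens)
        if all(tok in it for tok in pattern):
            return (False, f"blocked: {' '.join(pattern)} requires explicit user confirmation")
    return (True, "ok")

print(check("git push --force origin main"))
print(check("git push origin main"))
```

When a hook like this (or post-training) reliably catches a failure mode, the corresponding AGENTS.md line has met its removal criteria: retire it and log the retirement.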

What you're about to drill

The next four steps are a tight loop on this exact discipline:

  1. A multiple-choice question that makes you pick which of four AGENTS.md lines actually earned its place.
  2. The "working backwards" reading — start with behavior, derive the rule.
  3. The ratchet-log reading — what an actual log entry looks like.
  4. A second multiple-choice question on which line to retire because the model now handles it natively.

Then two write drills: one that appends a new ratchet entry, one that audits which entries to remove. The checkpoint stitches them together.

The thing to keep in mind throughout: the pilot's checklist is short on purpose. Yours should be too.

read, then continue.