Evaluator-optimizer — write a draft, let a judge critique it, revise — step 3 of 9
The judge's output is a contract, not a vibe
Broken evaluator-optimizers in the wild almost always share the same bug: parsing the judge's verdict from free text. The code looks something like:
verdict = judge(draft)
if "good" in verdict.lower() or "pass" in verdict.lower():
    return draft
This breaks the first time the judge says "this is not good" —
the substring match sees "good", returns happily, and ships the
broken draft.
Three hardening levels, in production order:
Level 1 — XML tags (Anthropic cookbook style)
Force the judge to wrap its verdict in a fixed token:
<evaluation>PASS</evaluation>
or
<evaluation>NEEDS_IMPROVEMENT</evaluation>
<feedback>...</feedback>
You parse by extracting the contents of <evaluation>. Robust enough
for prototypes; the cookbook ships it.
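A minimal parser for this format, as a sketch (the helper name and the default-to-retry behavior are my own choices, not from the cookbook):

import re

def parse_xml_verdict(text):
    """Pull the verdict out of the judge's tagged output; anomalies mean retry."""
    match = re.search(r"<evaluation>(.*?)</evaluation>", text, re.DOTALL)
    status = match.group(1).strip() if match else "NEEDS_IMPROVEMENT"
    fb = re.search(r"<feedback>(.*?)</feedback>", text, re.DOTALL)
    feedback = fb.group(1).strip() if fb else ""
    return status, feedback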
Level 2 — Structured output (preferred today)
Use the model's structured-output mode and a JSON schema:
{
  "type": "object",
  "properties": {
    "status": {"type": "string", "enum": ["PASS", "REJECT"]},
    "feedback": {"type": "string"},
    "score": {"type": "integer", "minimum": 1, "maximum": 10}
  },
  "required": ["status", "feedback"]
}
The model literally cannot return "PaSs" or "good" — the schema
forbids it. Anthropic does this via tools=[...] (the tool-use
trick). OpenAI does it via response_format={"type": "json_schema"}.
Vercel AI SDK does it via generateObject + Zod.
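A sketch of the Anthropic variant, assuming the official anthropic SDK; the model name and tool name are placeholders to swap for your own:

import anthropic

client = anthropic.Anthropic()

VERDICT_SCHEMA = {
    "type": "object",
    "properties": {
        "status": {"type": "string", "enum": ["PASS", "REJECT"]},
        "feedback": {"type": "string"},
        "score": {"type": "integer", "minimum": 1, "maximum": 10},
    },
    "required": ["status", "feedback"],
}

def judge(draft):
    # Forcing the tool call makes the API return schema-conforming JSON
    # as the tool's input instead of free text.
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use a current model
        max_tokens=1024,
        tools=[{
            "name": "submit_verdict",
            "description": "Record the evaluation verdict for the draft.",
            "input_schema": VERDICT_SCHEMA,
        }],
        tool_choice={"type": "tool", "name": "submit_verdict"},
        messages=[{"role": "user", "content": f"Evaluate this draft for defects:\n\n{draft}"}],
    )
    block = next(b for b in response.content if b.type == "tool_use")
    return block.input  # dict matching VERDICT_SCHEMA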
Level 3 — Validate-then-default
Even with structured output, one more belt-and-suspenders line:
status = verdict.get("status", "REJECT")
if status not in {"PASS", "REJECT"}:
    status = "REJECT"  # default to "loop again" — never to PASS
The default on anomaly is always REJECT, never PASS, because the failure mode of "model returned garbage and we shipped" is worse than "model returned garbage and we did one more revision."
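Folded together, levels 2 and 3 make a small defensive wrapper (safe_verdict is a hypothetical helper name):

def safe_verdict(raw):
    """Normalize whatever the judge returned; any anomaly degrades to a retry."""
    if not isinstance(raw, dict):
        return {"status": "REJECT", "feedback": "Judge returned no usable verdict."}
    status = raw.get("status", "REJECT")
    if status not in {"PASS", "REJECT"}:
        status = "REJECT"  # never default to PASS
    return {"status": status, "feedback": raw.get("feedback", "")}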
Why a separate judge prompt matters
The other half of the contract: the judge cannot share a system prompt with the generator. Self-preference bias is documented (arxiv.org/abs/2410.21819): when the same prompt-personality both writes and judges, you get a 90%+ pass rate on iteration 1 regardless of quality. The judge needs to be told to hunt for specific defects, not asked for an opinion. A different model entirely is even better when you can afford it.
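One way to phrase the defect-hunting instruction (illustrative wording, not a canonical prompt):

JUDGE_SYSTEM_PROMPT = """You are a strict reviewer. You did not write this draft.
Hunt for specific defects: factual errors, unsupported claims, requirements
from the task that the draft misses, and structural problems. Quote the
offending passage for each defect. Return PASS only if you find none.
Do not offer general impressions or praise."""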
Three things production loops always have
def refine(task):  # wrapped in a function so the early return works
    feedback = None
    draft = None
    for _ in range(MAX_ITERATIONS):      # 1. cap
        draft = generator(task, feedback=feedback)
        verdict = judge(draft)           # 2. structured verdict
        if verdict["status"] == "PASS":
            return draft
        feedback = verdict["feedback"]   # 3. feedback flows back
    return draft  # bail with the latest draft, never raise
Three load-bearing pieces: MAX_ITERATIONS, a structured verdict, and feedback threaded back into the generator. Miss one and the loop fails in its own way: without the cap it runs forever, without the structured verdict it ships garbage, without the feedback it never improves.
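For completeness, a hedged generator stub in the shape the loop assumes, reusing the client from the Level 2 sketch; the revision-prompt wording is illustrative:

MAX_ITERATIONS = 3  # a typical budget: one draft plus two revisions

def generator(task, feedback=None):
    # On revision passes, the judge's feedback is threaded into the prompt.
    user = task if feedback is None else (
        f"{task}\n\nYour previous draft was rejected. Revise it to address:\n{feedback}"
    )
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: use a current model
        max_tokens=2048,
        messages=[{"role": "user", "content": user}],
    )
    return response.content[0].text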