promptdojo_

Schemas, Pydantic, and validation — making the model return real data — step 1 of 9

Free-form text is not data

The first time you ask Claude to "extract the user's email from this support ticket," it returns exactly what you wanted: [email protected]. The second time, it returns Sure! The email is [email protected]. The third time, The email address you're looking for is [email protected].

Three different shapes. Your downstream code expected a string. Now it has to either regex-extract the email out of natural language every time or fail.

This is why every production AI feature — without exception — uses structured output: you tell the model exactly what JSON shape to return, and you validate the response when it comes back.

The pattern AI ships every time:

import anthropic
from pydantic import BaseModel

class Ticket(BaseModel):
    email: str
    severity: int          # 1-5
    summary: str

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    messages=[{"role": "user", "content": "Return only a JSON object with keys email, severity (1-5), and summary. Extract from: <ticket text>"}],
)

raw = response.content[0].text
ticket = Ticket.model_validate_json(raw)   # parse + validate in one call
print(ticket.email)

Three pieces. The schema (the BaseModel class), the prompt (asking for JSON), and the validation step (model_validate_json). All three matter. Skip the schema and you're back to regex-on-natural-language. Skip the validation and you'll find out about the model's hallucinated field at 3am from a NoneType error in production.
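What the validation step actually catches, with a made-up bad payload (not a real model response) and the Ticket model from above:

from pydantic import ValidationError

bad = '{"email": "user@example.com", "severity": "high"}'   # made-up payload: wrong type, missing summary

try:
    Ticket.model_validate_json(bad)
except ValidationError as err:
    print(err)   # names both problems at once: severity is not an int, summary is missing

One exception at the boundary, with field names in the message, instead of a NoneType error three calls deeper.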

Browser note: Pydantic loads in Pyodide via micropip.install("pydantic"), but for fast feedback we use plain dict validation in these drills. Same logic, spelled out, so you can read and write the pattern by hand. Switching to BaseModel later is a two-line change.
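As a reference, here is a minimal sketch of that hand-rolled dict validation; the function name and error messages are our own, but the checks mirror the Ticket model above:

import json

def validate_ticket(payload: dict) -> dict:
    # Same checks the Ticket model performs, spelled out by hand.
    if not isinstance(payload.get("email"), str) or "@" not in payload["email"]:
        raise ValueError("email must be a string containing '@'")
    if not isinstance(payload.get("severity"), int) or not 1 <= payload["severity"] <= 5:
        raise ValueError("severity must be an int from 1 to 5")
    if not isinstance(payload.get("summary"), str):
        raise ValueError("summary must be a string")
    return payload

ticket = validate_ticket(json.loads(raw))   # raw: the model's text response, as in the example above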

Where AI specifically gets this wrong

  • Trusting the model on the first try. Models lie. They drop required fields, return strings where you wanted ints, and invent enum values you never defined. Validate every response.
  • Forgetting response_format / tool use. On OpenAI, the modern canonical way is Structured Outputs: response_format={"type": "json_schema", "json_schema": {...}} — the API guarantees the response conforms to your schema. The older {"type": "json_object"} mode only guarantees valid JSON, not your shape. On Anthropic you typically use a tool definition (or the newer output_format parameter). Without one, the model wraps its JSON in prose; a sketch of the tool-definition route follows this list.
  • Catching ValidationError too broadly. When Pydantic rejects a response, you usually want to retry, feeding the error message back to the model — not silently fall through. A sketch of that retry loop also follows this list.
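For the Anthropic side of the second point, a minimal sketch of forcing structured output through a tool definition. The tool name record_ticket and its description are our own choices; client and Ticket come from the example above:

response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=512,
    tools=[{
        "name": "record_ticket",
        "description": "Record the fields extracted from the support ticket.",
        "input_schema": {
            "type": "object",
            "properties": {
                "email": {"type": "string"},
                "severity": {"type": "integer", "minimum": 1, "maximum": 5},
                "summary": {"type": "string"},
            },
            "required": ["email", "severity", "summary"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_ticket"},   # force the model to call this tool
    messages=[{"role": "user", "content": "Extract from: <ticket text>"}],
)

tool_block = next(b for b in response.content if b.type == "tool_use")
ticket = Ticket.model_validate(tool_block.input)   # .input is already a dict, so no JSON parsing step

Because the answer arrives as tool arguments rather than prose, there is no JSON-in-a-sentence to strip out. You still validate: the schema constrains shape, not sense.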
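And for the last point, a sketch of the retry loop, again assuming the client and Ticket model from the first example; the function name and attempt count are illustrative:

from pydantic import ValidationError

def extract_ticket(ticket_text: str, max_attempts: int = 3) -> Ticket:
    messages = [{
        "role": "user",
        "content": f"Return only a JSON object with keys email, severity (1-5), and summary. Extract from: {ticket_text}",
    }]
    for _ in range(max_attempts):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=512,
            messages=messages,
        )
        raw = response.content[0].text
        try:
            return Ticket.model_validate_json(raw)
        except ValidationError as err:
            # Send the model its own output plus the exact validation errors, then try again.
            messages.append({"role": "assistant", "content": raw})
            messages.append({"role": "user", "content": f"That JSON failed validation:\n{err}\nReturn corrected JSON only."})
    raise RuntimeError("no valid Ticket after retries")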

Run the editor. We extract a name and validate the shape by hand.

read, then continue.