promptdojo_

Four breakpoints, variable last

Anthropic's prompt caching API lets you place up to 4 cache breakpoints per request. Each breakpoint says "everything up to this point is stable; cache it as a prefix." The model API hashes the prefix and reuses it on subsequent calls.

The canonical layout for an agent:

Breakpoint | What's before it   | Why cache here
1          | System prompt      | Almost never changes. Highest hit rate.
2          | Tool definitions   | Stable across deploys; ~500–2000 tokens.
3          | Few-shot examples  | Stable for a session or a release.
4          | (often unused)     | Optional: multi-turn conversation history.

Then the variable content (the user question) goes AFTER the last breakpoint. The cache fires on the stable prefix; the variable suffix is billed at full input price on every call.
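Concretely, with the Anthropic Python SDK the layout looks like this. A minimal sketch: SYSTEM_PROMPT and TOOL_DEFINITIONS are placeholder names for your own material, the model name is just an example, and LONG_FEW_SHOT_EXAMPLES and user_question match the snippets further down.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",              # example model name
    max_tokens=1024,
    system=[
        {"type": "text", "text": SYSTEM_PROMPT,
         "cache_control": {"type": "ephemeral"}},  # stable layer: system prompt
    ],
    tools=[
        *TOOL_DEFINITIONS[:-1],
        {**TOOL_DEFINITIONS[-1],
         "cache_control": {"type": "ephemeral"}},  # stable layer: tools (marker on the last one)
    ],
    messages=[{"role": "user", "content": [
        {"type": "text", "text": LONG_FEW_SHOT_EXAMPLES,
         "cache_control": {"type": "ephemeral"}},  # stable layer: few-shot examples
        {"type": "text", "text": user_question},   # variable: sits after every breakpoint
    ]}],
)

Three breakpoints used, one spare, variable content last.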

Why 4 and not unlimited

Each breakpoint costs API overhead. 4 is enough to cache the four big stable layers in a typical agent (system, tools, examples, history). More than 4 buys you nothing because most prompts don't have 5 distinct stable layers.

A common waste: putting cache_control on every text block. The API silently uses only the last 4. The early ones are no-ops.
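If you want to catch that before it ships, a small guard helps. This is a hypothetical helper, not part of any SDK: it just counts the blocks carrying a cache_control marker across system, tools, and messages.

def count_cache_breakpoints(system, tools, messages):
    """Count content blocks that carry a cache_control marker."""
    blocks = list(system) + list(tools)
    for message in messages:
        content = message.get("content", [])
        if isinstance(content, list):
            blocks.extend(content)
    return sum(1 for block in blocks
               if isinstance(block, dict) and "cache_control" in block)

Run it on the request payload in a test; anything above 4 is wasted intent.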

The TTL economics

Two TTL options, very different prices:

TTL          | Write cost  | Read cost  | Use when
5m (default) | 1.25× input | 0.1× input | Sub-5-minute sessions; high call volume per session
1h           | 2× input    | 0.1× input | Long-running agents; sessions over 5 min

The 1-hour cache costs 60% more to write but doesn't expire mid-conversation. For an agent loop that takes 10 minutes per session, the 5-minute cache fires once at the start, expires mid-session, and you pay full input price for the second half. Use 1h.
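A back-of-envelope comparison makes that concrete. The numbers below are hypothetical: a 5,000-token stable prefix, 20 calls over a 10-minute session, and one gap longer than 5 minutes, so the short TTL forces one mid-session rewrite. The multipliers are the ones from the table, relative to the base input price.

def prefix_cost(prefix_tokens, calls, write_mult, read_mult, cache_writes):
    # cache_writes: how many times the prefix has to be (re)written during the session
    writes = cache_writes * prefix_tokens * write_mult
    reads = (calls - cache_writes) * prefix_tokens * read_mult
    return writes + reads

five_min = prefix_cost(5_000, 20, write_mult=1.25, read_mult=0.1, cache_writes=2)  # expires mid-session
one_hour = prefix_cost(5_000, 20, write_mult=2.0,  read_mult=0.1, cache_writes=1)  # survives the session
print(five_min, one_hour)  # 21500.0 vs 19500.0 token-equivalents of base input price

The longer the session and the more expirations the short TTL hits, the wider that gap gets.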

The default-default trap: prior to March 2026 the default TTL was 1 hour. Anthropic changed it to 5 minutes. Code that omits ttl now silently falls back to 5m and overspends on long sessions. Always set ttl explicitly.
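Setting it explicitly is one line on the breakpoint block. A sketch, assuming the ttl field accepted by the extended-TTL form of cache_control; check the current docs for whether it still needs a beta header.

few_shot_block = {
    "type": "text",
    "text": LONG_FEW_SHOT_EXAMPLES,
    # explicit TTL instead of whatever the default happens to be this quarter
    "cache_control": {"type": "ephemeral", "ttl": "1h"},
}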

The cache-busting bug, illustrated

# WRONG — variable input first, breaks cache prefix.
messages = [{"role": "user", "content": [
    {"type": "text", "text": user_question},                # varies → no prefix shared
    {"type": "text", "text": LONG_FEW_SHOT_EXAMPLES,
     "cache_control": {"type": "ephemeral"}},               # too late; never cached
]}]

# RIGHT — stable first, variable last.
messages = [{"role": "user", "content": [
    {"type": "text", "text": LONG_FEW_SHOT_EXAMPLES,
     "cache_control": {"type": "ephemeral"}},               # stable; cached as prefix
    {"type": "text", "text": user_question},                # varies; full price (correctly)
]}]

This is the bug. Once you see it, you spot it immediately. Before you've seen it, you'll ship it.

What we're NOT covering

OpenAI's automatic caching kicks in on prompts >1024 tokens with no opt-in. The trade-off: less control, no per-block breakpoints. For most use cases it Just Works; for fine-grained control you'd use Anthropic. Vercel's AI SDK passes through to whichever provider you're targeting. Google Gemini has its own cached_content API with similar shape. The principle (stable prefix, variable suffix) is universal.

read, then continue.