promptdojo_

RAG vs long context vs fine-tune — step 1 of 9

The decision that's killed more AI startups than any model swap

This is a theory lesson. There is barely any code. That is on purpose. The previous four lessons of this chapter taught you how to build RAG: chunking, embeddings, retrieval, ranking. This lesson zooms out and asks a different question: should you be building RAG at all?

Because every team building an AI product right now is, knowingly or not, picking one of three forks. The fork they pick determines their cost curve, their freshness story, their latency budget, and whether they're still in business in 18 months. The fork is invisible in the demo. It shows up at scale.

The three forks

There are three big buckets for getting knowledge into a model at inference time: RAG (retrieve at query), long-context (stuff the prompt), and fine-tune (bake into weights). Practitioners argue about whether this is the full taxonomy or whether you should also count tool-calling-into-live-systems, agentic multi-hop retrieval, and knowledge-graph traversal as separate categories. They have a point. For most product decisions, though, the three big buckets are the useful unit of analysis. We'll teach to those, with a note in the rubric on when the cleaner trichotomy falls apart.

Every product is some combination of these, but the architecture pivots on one:

  1. Pull-from-source (RAG). Keep the knowledge in a vector store (or a SQL table, or a search index). At query time, retrieve the relevant chunks and stuff them into the prompt. The model never sees the corpus during training. It just sees today's slice.
  2. Stuff-the-context (long-context). Put the entire corpus in the prompt. Gemini's 2M-token context window made this a real option for whole-codebase and whole-contract use cases. With prompt caching, the marginal cost per query collapses.
  3. Bake-into-the-model (fine-tune). Train the model on your corpus so the knowledge lives in the weights. The model doesn't need to be told anything at inference time — it already "knows." Style, format, and domain language all transfer.

These look like alternatives. They're really tradeoffs along five axes, and the right answer is almost always "use the fork whose tradeoffs match the product."
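
To make the three shapes concrete, here is a minimal sketch in Python. The retrieve, load_corpus, and llm helpers are hypothetical placeholders for whatever vector store, document loader, and model client you actually use; none of this is a specific vendor's API.

    def retrieve(query, k=5):
        # Placeholder: swap in your vector store or search index.
        return ["<relevant chunk 1>", "<relevant chunk 2>"]

    def load_corpus():
        # Placeholder: the whole manual / codebase / contract set, as one string.
        return "<entire corpus>"

    def llm(prompt, model="base"):
        # Placeholder: swap in your model client of choice.
        return f"[{model}] answer to: {prompt[:40]}..."

    def answer_rag(query):
        # Fork 1: pull-from-source. Only today's relevant slice enters the prompt.
        context = "\n".join(retrieve(query, k=5))
        return llm(f"Context:\n{context}\n\nQuestion: {query}")

    def answer_long_context(query):
        # Fork 2: stuff-the-context. The whole corpus rides in every prompt;
        # prompt caching is what makes repeat queries affordable.
        return llm(f"Corpus:\n{load_corpus()}\n\nQuestion: {query}")

    def answer_fine_tuned(query):
        # Fork 3: bake-into-the-model. Nothing retrieved, nothing stuffed;
        # the knowledge already lives in the weights.
        return llm(query, model="my-fine-tuned-model")

Same question in, three very different freshness, governance, and cost profiles out. Telling them apart is what the next section is for.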

The five tradeoff axes

Every product evaluation reduces to these five questions:

  1. Freshness. How often does the corpus change? Hourly? Quarterly? Never? RAG dominates when freshness is hours-to-days. Long-context dominates when the corpus is stable for the session. Fine-tune is hopeless when freshness matters at all — the knowledge in the weights is frozen at training time.
  2. Governance. Can you point at the specific source the model used? RAG gives you citations for free (you know which chunks went in). Long-context gives you the whole document as the citation. Fine-tune gives you nothing — the model can't tell you which training example produced an output.
  3. Cost per call. RAG's cost is retrieval + a small prompt. Long-context's cost is the full prompt every call, unless the prompt cache hits. Fine-tune's cost is a small inference call, but the training cost amortizes only if you call the model a lot.
  4. Latency budget. RAG adds a retrieval hop (50-500ms). Long-context adds prompt-processing time (linear in token count, but cached on Anthropic's models if you're careful). Fine-tune is the lowest-latency option per call.
  5. Capability gap. Is the model bad at your domain language, your formatting conventions, your house style? RAG can't fix that — it can only give the model the right facts. Fine-tune is the only fork that fixes capability gaps in the model itself.
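
One way to keep those five questions in front of you is as a small record you fill out per product. This is only a sketch: the field names and types are mine, not a standard, and the later write step refines them.

    from dataclasses import dataclass

    @dataclass
    class ProductSpec:
        # One field per axis. Deliberately coarse: the point is to force an
        # answer to each of the five questions before picking a fork.
        corpus_change_interval_days: float  # freshness: how often the corpus changes
        needs_citations: bool               # governance: must you point at the source used?
        queries_per_day: int                # cost: the volume that amortizes training or caching
        latency_budget_ms: int              # latency: room for a retrieval hop or a big prompt
        capability_gap: bool                # is the base model bad at your domain or house style?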

Why the wrong fork kills startups

The graveyard is full of teams that picked the wrong fork. A few shapes of failure that keep repeating:

  • Fine-tuned on facts. The legal-tech that fine-tuned GPT-4 on case law in 2024 got crushed when GPT-5.5 with a 1M context window shipped in 2026. The fine-tune was a freshness liability — every new case meant retraining. Long-context (or RAG) would have aged better.
  • RAG'd a stable corpus. A company built a retrieval pipeline for a 200-page product manual that hasn't changed in three years. Six engineers, twelve months, a vector store. The whole manual fits in a 200K-token context window and would cost less per query with prompt caching (see the back-of-envelope after this list). They built infra for a freshness problem they didn't have.
  • Long-contexted a hot corpus. A startup tried to keep a daily-updating customer database in the prompt. Cache misses on every update, cost went vertical, latency tanked. Should have been RAG.
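
The second failure is worth a quick back-of-envelope, because the numbers are the whole argument. The prices below are assumed placeholders, not any provider's actual rates; substitute your own before drawing conclusions.

    MANUAL_TOKENS = 200_000            # ~200-page manual, rough token estimate
    PRICE_PER_M_INPUT = 3.00           # ASSUMED dollars per 1M uncached input tokens
    CACHED_FRACTION = 0.10             # ASSUMED: a cache hit costs ~10% of uncached input

    uncached = MANUAL_TOKENS / 1_000_000 * PRICE_PER_M_INPUT
    cached = uncached * CACHED_FRACTION

    print(f"full prompt, no cache:  ${uncached:.2f} per query")   # $0.60
    print(f"full prompt, cache hit: ${cached:.2f} per query")     # $0.06

At pennies per query under those assumptions, the vector store, the chunking pipeline, and the six engineers have to justify themselves on something other than cost.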

Jason Liu's "RAG is dead" thread on X (December 2024) was provocative because it pointed at one real failure mode (overbuilt RAG where long-context would do) and got read as "stop using RAG at all." The useful takeaway is neither "RAG is dead" nor "RAG everywhere." It's: pick the fork that matches the corpus.

What this lesson is

The next five reading steps walk through each fork with named examples, then synthesize a five-axis rubric you can apply to any product. The two MC steps test whether you can pick the right fork for a scenario. The write and checkpoint steps codify the rubric into a pick_fork(product_spec) function you can paste into your own planning docs.
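
To set expectations, here is a minimal sketch of what such a function might look like, built on the ProductSpec record from earlier. The thresholds are illustrative placeholders, not the finished rubric from those later steps.

    def pick_fork(spec: ProductSpec) -> str:
        # Freshness first: a corpus that changes within days rules out baking
        # knowledge into weights and favors retrieval.
        if spec.corpus_change_interval_days <= 7:
            return "rag"
        # Capability gap: domain language or house style the base model can't
        # produce is the one thing only fine-tuning fixes -- but it costs you
        # citations, so governance can veto it.
        if spec.capability_gap and not spec.needs_citations:
            return "fine-tune"
        # Stable corpus with real query volume: stuff the prompt and let
        # caching amortize it (assuming the corpus fits the context window).
        if spec.queries_per_day >= 100:
            return "long-context"
        # Low volume, slow-moving corpus: retrieval keeps per-call cost small
        # without betting on cache hit rates.
        return "rag"

    # Example: a support bot over docs and tickets that change daily.
    support_bot = ProductSpec(
        corpus_change_interval_days=1,
        needs_citations=True,
        queries_per_day=5_000,
        latency_budget_ms=800,
        capability_gap=False,
    )
    print(pick_fork(support_bot))  # -> rag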

By the end you should be able to read any AI product launch, predict which fork they picked, and judge whether that fork matches the shape of their corpus. That intuition is worth more than any specific RAG library you'll learn.

read, then continue.