Retrieval that finds the right thing — top-k, thresholds, and the rerank step everyone skips — step 3 of 9
The threshold is the single most important number in your pipeline
Run the editor. Naive top-5 returns all five chunks, including the clearly-noise chunk_56 at 0.12. The thresholded version filters first, returning only the three relevant chunks even though you asked for up to 5.
That's the property you want in production. Top-k caps the count; the threshold sets a floor on the score. Use both.
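A minimal sketch of the combined filter, assuming scored is a list of (chunk_id, cosine_score) pairs from your vector store; the names and the THRESHOLD value here are illustrative, not any specific library's API:

THRESHOLD = 0.55  # calibrated per model; see the next section
TOP_K = 5

def retrieve(scored, threshold=THRESHOLD, k=TOP_K):
    # Filter first: drop everything below the quality floor.
    survivors = [(c, s) for c, s in scored if s >= threshold]
    # Then cap: at most k results, highest scores first.
    survivors.sort(key=lambda x: x[1], reverse=True)
    return survivors[:k]

With the example above, retrieve() hands back three chunks even though k is 5: chunk_56 at 0.12 never clears the bar.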
The cutoff is per-model
The right threshold depends on the embedding model's similarity distribution. Here's the calibration you'll actually do:
- Take 50 query+chunk pairs you KNOW are relevant. Compute cosine.
- Take 50 query+chunk pairs you KNOW are irrelevant. Compute cosine.
- The threshold is the value that separates them best.
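A sketch of that calibration, assuming embed() wraps whatever embedding model you're evaluating (a hypothetical hook, not a real API) and each pair set is a list of (query, chunk_text) tuples. It sweeps the observed scores as candidate cutoffs and keeps the one that classifies the most pairs correctly:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def calibrate(relevant_pairs, irrelevant_pairs, embed):
    # Score every labeled pair with the same model you'll serve in production.
    rel = [cosine(embed(q), embed(c)) for q, c in relevant_pairs]
    irr = [cosine(embed(q), embed(c)) for q, c in irrelevant_pairs]
    # Try each observed score as a cutoff; keep the best separator.
    return max(
        rel + irr,
        key=lambda t: sum(s >= t for s in rel) + sum(s < t for s in irr),
    )

Run it once per embedding model you evaluate; the number it returns is not portable across models.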
For most modern embedding models, the right cutoff lands somewhere between 0.4 and 0.7. Anything below 0.3 is essentially surface-level overlap: the model matched grammar and phrasing, not meaning. Anything above 0.9 means "near-duplicate," which is usually not what you want for retrieval breadth.
What "score is below threshold" means in practice
When the threshold filters every candidate (no result clears the bar), you have three options:
- Return empty. The model gets no retrieved context. Your prompt should handle this case explicitly: "If you don't have enough information, say so." This is the right move.
- Return the best you have anyway. The model will hallucinate from garbage. This is the bad move.
- Fall back to a different retriever (keyword search, BM25, a broader corpus). This is the expensive-but-right move.
Option 1 is the default in production RAG systems. The "I don't know" response feels worse than a confident wrong answer, but it's far less expensive; the confident wrong answer is the catastrophe.
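Here's one way to wire options 1 and 3 together, as a sketch; bm25_search stands in for whatever keyword fallback you have (a hypothetical hook, not a real API):

def retrieve_with_fallback(query, scored, threshold, k, bm25_search=None):
    hits = [(c, s) for c, s in scored if s >= threshold]
    if hits:
        return sorted(hits, key=lambda x: x[1], reverse=True)[:k]
    if bm25_search is not None:
        # Option 3: broaden to keyword retrieval before giving up.
        return bm25_search(query, k)
    # Option 1: return nothing, so the prompt's "say you don't know"
    # instruction actually fires.
    return []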
The threshold's siblings: dedupe and stable sort
A threshold catches noise. Dedupe catches redundancy. Stable sort catches order-flicker. You need all three before you ship.
Dedupe pattern:
seen = set()
deduped = []
# Score-sorted first, so the first occurrence of a chunk is its best-scoring copy.
for c, s in sorted(scored, key=lambda x: x[1], reverse=True):
    if c not in seen:
        seen.add(c)
        deduped.append((c, s))
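For instance, with made-up scores where the same chunk comes back twice:

scored = [("chunk_12", 0.81), ("chunk_07", 0.74), ("chunk_12", 0.79)]
# After the loop: deduped == [("chunk_12", 0.81), ("chunk_07", 0.74)]
# The lower-scoring duplicate never makes it through.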
Stable sort is automatic in Python — sorted is stable, so ties preserve insertion order. That means if you sort by score and the top-3 has two 0.87s, they come out in the order they were inserted. Predictable. The reason this matters: when you re-run the same query twice and the top-3 reshuffles between calls, the LLM's answer shifts too. Stable sort kills that flicker.
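A quick demonstration of the tie behavior (chunk ids and scores invented for illustration):

scored = [("chunk_09", 0.87), ("chunk_03", 0.91), ("chunk_14", 0.87)]
ranked = sorted(scored, key=lambda x: x[1], reverse=True)
# chunk_09 stays ahead of chunk_14: equal keys keep their original order,
# and reverse=True does not break stability.
assert ranked == [("chunk_03", 0.91), ("chunk_09", 0.87), ("chunk_14", 0.87)]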