Retrieval that finds the right thing — top-k, thresholds, and the rerank step everyone skips — step 3 of 9
The threshold is the single most important number in your pipeline
Run the editor. Naive top-5 returns all five chunks, including the clearly-noise chunk_56 at 0.12. The thresholded version filters first, returning only the three relevant chunks even though you asked for up to 5.
That's the property you want in production. Top-k caps the count; the threshold sets a floor on the score. Use both.
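A minimal sketch of the combined filter, assuming scored is a list of (chunk_id, cosine_score) pairs from your vector store; the names and the THRESHOLD value here are illustrative, not any specific library's API:

THRESHOLD = 0.55  # calibrated per model; see the next section
TOP_K = 5

def retrieve(scored, threshold=THRESHOLD, k=TOP_K):
    # Filter first: drop everything below the quality floor.
    survivors = [(c, s) for c, s in scored if s >= threshold]
    # Then cap: at most k results, highest scores first.
    survivors.sort(key=lambda x: x[1], reverse=True)
    return survivors[:k]

With the example above, retrieve() hands back three chunks even though k is 5: chunk_56 at 0.12 never clears the bar.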
The cutoff is per-model
The right threshold depends on the embedding model's similarity distribution. Here's the calibration you'll actually do:
- Take 50 query+chunk pairs you KNOW are relevant. Compute cosine.
- Take 50 query+chunk pairs you KNOW are irrelevant. Compute cosine.
- The threshold is the value that separates them best.
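A sketch of that calibration, assuming embed() wraps whatever embedding model you're evaluating (a hypothetical hook, not a real API) and each pair set is a list of (query, chunk_text) tuples. It sweeps the observed scores as candidate cutoffs and keeps the one that classifies the most pairs correctly:

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def calibrate(relevant_pairs, irrelevant_pairs, embed):
    # Score every labeled pair with the same model you'll serve in production.
    rel = [cosine(embed(q), embed(c)) for q, c in relevant_pairs]
    irr = [cosine(embed(q), embed(c)) for q, c in irrelevant_pairs]
    # Try each observed score as a cutoff; keep the best separator.
    return max(
        rel + irr,
        key=lambda t: sum(s >= t for s in rel) + sum(s < t for s in irr),
    )

Run it once per embedding model you evaluate; the number it returns is not portable across models.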
For most modern embedding models, the right cutoff lands somewhere between 0.4 and 0.7. Anything below 0.3 is essentially surface-level overlap: the model matched grammar and phrasing, not meaning. Anything above 0.9 means "near-duplicate," which is usually not what you want for retrieval breadth.
What "score is below threshold" means in practice
When the threshold filters every candidate (no result clears the bar), you have three options:
- Return empty. The model gets no retrieved context. Your prompt should handle this case explicitly: "If you don't have enough information, say so." This is the right move.
- Return the best you have anyway. The model will hallucinate from garbage. This is the bad move.
- Fall back to a different retriever (keyword search, BM25, a broader corpus). This is the expensive-but-right move.
Option 1 is the default in production RAG systems. The "I don't know" response feels worse than a confident wrong answer, but it's far less expensive; the confident wrong answer is the catastrophe.
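Here's one way to wire options 1 and 3 together, as a sketch; bm25_search stands in for whatever keyword fallback you have (a hypothetical hook, not a real API):

def retrieve_with_fallback(query, scored, threshold, k, bm25_search=None):
    hits = [(c, s) for c, s in scored if s >= threshold]
    if hits:
        return sorted(hits, key=lambda x: x[1], reverse=True)[:k]
    if bm25_search is not None:
        # Option 3: broaden to keyword retrieval before giving up.
        return bm25_search(query, k)
    # Option 1: return nothing, so the prompt's "say you don't know"
    # instruction actually fires.
    return []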
The threshold's siblings: dedupe and stable sort
A threshold catches noise. Dedupe catches redundancy. Stable sort catches order-flicker. You need all three before you ship.
Dedupe pattern:
seen = set()
deduped = []
# Score-sorted first, so the first occurrence of a chunk is its best-scoring copy.
for c, s in sorted(scored, key=lambda x: x[1], reverse=True):
    if c not in seen:
        seen.add(c)
        deduped.append((c, s))
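For instance, with made-up scores where the same chunk comes back twice:

scored = [("chunk_12", 0.81), ("chunk_07", 0.74), ("chunk_12", 0.79)]
# After the loop: deduped == [("chunk_12", 0.81), ("chunk_07", 0.74)]
# The lower-scoring duplicate never makes it through.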
Stable sort is automatic in Python — sorted is stable, so ties preserve insertion order. That means if you sort by score and the top-3 has two 0.87s, they come out in the order they were inserted. Predictable. The reason this matters: when you re-run the same query twice and the top-3 reshuffles between calls, the LLM's answer shifts too. Stable sort kills that flicker.
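A quick demonstration of the tie behavior (chunk ids and scores invented for illustration):

scored = [("chunk_09", 0.87), ("chunk_03", 0.91), ("chunk_14", 0.87)]
ranked = sorted(scored, key=lambda x: x[1], reverse=True)
# chunk_09 stays ahead of chunk_14: equal keys keep their original order,
# and reverse=True does not break stability.
assert ranked == [("chunk_03", 0.91), ("chunk_09", 0.87), ("chunk_14", 0.87)]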