The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.

The fix most teams reach for is "raise the similarity threshold." It helps a little and then stops helping. Real solutions require rethinking retrieval as a classification problem whose output space includes a null answer, not a ranking problem whose output is always a list.

Why top-k is the wrong abstraction

Vector search is a ranking primitive. You give it a query embedding and it returns the k nearest neighbors in cosine space, sorted. The primitive is neutral about whether those neighbors are relevant. A query about "tax treatment of stock options in Germany" against a corpus of cooking recipes will still return five recipes, ranked by whichever ones happen to share the most incidental vocabulary with the query vector. Distance in embedding space does not equal semantic relevance; it equals "this was the closest thing we had."
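
To make the point concrete, here is a toy sketch of the ranking primitive. The corpus, query, and embedding dimensions are all invented for illustration; the only thing that matters is the shape of the contract: k results come back no matter how bad the best match is.

```python
# Toy demonstration that nearest-neighbor search always returns k results,
# relevant or not. Embeddings here are random vectors, not real model output.
import numpy as np

rng = np.random.default_rng(0)

# Pretend corpus of 100 "recipe" embeddings and one off-topic query.
corpus = rng.normal(size=(100, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

def top_k(query, corpus, k=5):
    """Return indices and cosine scores of the k nearest neighbors.
    Note the contract: k results come back even if every score is terrible."""
    scores = corpus @ query
    idx = np.argsort(scores)[::-1][:k]
    return idx, scores[idx]

idx, scores = top_k(query, corpus)
```

The function has no way to say "nothing matched": `len(idx)` is always 5, and the scores are merely the least-bad cosines available.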

The problem compounds at the prompt-assembly step. Standard RAG prompts are written in the imperative: "use the following context to answer the question." There is no slot in the template for "none of this context is relevant, so admit you don't know." The LLM receives five irrelevant passages with a directive to use them, and it does. Researchers call this the abstention paradox: retrieval, which was supposed to reduce hallucination by grounding outputs in real documents, instead increases hallucination when the documents are off-topic, because adding any context raises the model's unwarranted confidence.

A 2025 survey of abstention in LLMs frames this as a calibration failure. Models don't know what their retrieved context means; they simply treat it as true. Because the upstream pipeline never signals "this retrieval is low-confidence," weak and strong retrievals arrive at the model looking identical, and it handles them identically. The generation ends up grounded in text that happened to rank highest, not text that is actually responsive.

Similarity thresholds are necessary but not sufficient

The first reflex for most teams is adding a cosine similarity cutoff: if the top result scores below 0.7, return "no answer." This is the right direction and the wrong implementation. Three problems surface almost immediately.

First, the 0.7 number is a folk default, not a calibrated value. Common threshold bands look like: 0.9+ for near-exact matches, 0.7–0.8 for relevant-but-imperfect, 0.5–0.6 for loosely related, 0.3–0.4 for weak connections. But those bands are model-specific and corpus-specific. Switch from text-embedding-3-small to a multilingual model and the whole distribution shifts. Index a corpus of highly similar technical documents and every score compresses toward 0.9, making 0.7 meaningless. The right threshold for your system can only be discovered empirically by holding out queries, labeling them, and seeing where the signal-to-noise ratio inflects.
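
The empirical calibration can be sketched in a few lines: collect the top retrieval score for each held-out query, label whether the result was actually relevant, then sweep candidate cutoffs and keep the one that maximizes F1 (or whatever precision/recall trade-off your product needs). The labeled pairs below are invented for illustration.

```python
# Sketch of empirical threshold calibration: sweep candidate cutoffs over
# labeled (top_score, is_relevant) pairs from held-out queries and pick
# the cutoff that maximizes F1. The labels here are made up.
def calibrate_threshold(scored_labels, candidates=None):
    """scored_labels: list of (top_score, is_relevant) from held-out queries."""
    if candidates is None:
        candidates = sorted({s for s, _ in scored_labels})
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        tp = sum(1 for s, rel in scored_labels if s >= t and rel)
        fp = sum(1 for s, rel in scored_labels if s >= t and not rel)
        fn = sum(1 for s, rel in scored_labels if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

labeled = [(0.91, True), (0.84, True), (0.79, True), (0.72, False),
           (0.68, False), (0.81, True), (0.55, False), (0.63, False)]
t, f1 = calibrate_threshold(labeled)
```

On this toy data the inflection lands at 0.79, not 0.7; on your corpus and embedding model it will land somewhere else, which is the point.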

Second, a global threshold is wrong even within a single system because different query types have different score distributions. A specific factual query ("what's Jane's email?") should hit precision-tier matches at 0.75+, while an exploratory query ("what do we know about pricing strategy?") legitimately retrieves documents at 0.5–0.6 that are genuinely useful. Holding both to 0.7 either drops useful exploratory results or passes bad specific-query results. Per-query-class calibration — treating threshold as a function of query intent — typically captures 2–3x the precision improvement of a global cutoff.
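
A minimal sketch of per-query-class thresholds, assuming an intent classifier sits in front of the cutoff. The keyword heuristic and the numbers below are placeholders; each band should come from calibrating on labeled queries of that class, not from this example.

```python
# Per-query-class thresholds: classify the query's intent first, then
# apply a cutoff calibrated for that class. Both the intent heuristic
# and the threshold values are illustrative placeholders.
THRESHOLDS = {"factual": 0.75, "exploratory": 0.55}

def classify_intent(query: str) -> str:
    """Toy stand-in for a real intent classifier (often a small model)."""
    factual_cues = ("what's", "what is", "when", "who", "how many")
    return "factual" if query.lower().startswith(factual_cues) else "exploratory"

def passes(query: str, top_score: float) -> bool:
    """Accept the retrieval only if it clears its class-specific bar."""
    return top_score >= THRESHOLDS[classify_intent(query)]
```

A 0.6 score now fails "what's Jane's email?" but passes "what do we know about pricing strategy?", which is exactly the behavior a single global cutoff cannot express.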

Third, similarity is a proxy for relevance, not a measurement of it. Two sentences can have high cosine similarity because they share vocabulary and syntactic structure while discussing different things. A tax question about stock options will cosine-match a document about corporate compensation even if the document never addresses taxation. The bi-encoder embedding is optimized for retrieval speed, not for relevance judgment. Raising the threshold past a certain point starts dropping genuinely useful matches before it filters out the confidently-irrelevant ones.

The two-stage pattern: retrieve wide, classify narrow

The production pattern that actually works treats retrieval and relevance as separate problems. The first stage casts a wide net — retrieve 50 or 100 candidates from vector search with no strict threshold. The second stage runs a cross-encoder reranker over the candidates, scoring each (query, document) pair with a model that was trained specifically to judge relevance, not to produce fast similarity.

Cross-encoders are slower than bi-encoders by design: they concatenate query and document and run a full transformer pass over the pair, which is what lets them catch the subtle mismatches bi-encoders miss. A typical production recipe pulls 150 candidates from vector search, reranks the top 30 with a cross-encoder, and sends the top 3–8 to the LLM. The cross-encoder scores are calibrated and sit on a different scale than cosine similarity — a reranker score of 0.5 means "moderately relevant," not "half-matched," and the score distribution is much more stable across query types.

Now the threshold decision moves to the reranker output, where it actually means something. And it gets you a new capability for free: if no document passes the reranker threshold, the pipeline can route to abstention instead of handing the LLM the best of a bad lot. The reranker is acting as a relevance classifier with a null answer. That's the architectural shift — the pipeline now has a legitimate "no" output.
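
The whole two-stage pipeline with a null output fits in a short function. Here `vector_search` and `cross_encoder_score` are placeholders for your own index and reranker (e.g. a sentence-transformers CrossEncoder); the k values and cutoff are illustrative, not recommendations.

```python
# Two-stage pattern: retrieve wide with no strict threshold, rerank a
# manageable subset with a cross-encoder, and route to abstention if
# nothing clears the relevance bar. The callables are placeholders.
ABSTAIN = None  # the pipeline's legitimate "no" output

def answer_or_abstain(query, vector_search, cross_encoder_score,
                      wide_k=150, rerank_k=30, final_k=5, min_score=0.5):
    # Stage 1: cast a wide net from the vector index.
    candidates = vector_search(query, k=wide_k)
    # Stage 2: score (query, doc) pairs with the cross-encoder and sort.
    scored = sorted(((cross_encoder_score(query, doc), doc)
                     for doc in candidates[:rerank_k]), reverse=True)
    # Keep only documents that clear the calibrated reranker threshold.
    passing = [doc for score, doc in scored[:final_k] if score >= min_score]
    # No survivors means abstain, not "best of a bad lot."
    return passing if passing else ABSTAIN
```

The caller can now branch on `ABSTAIN` to a refusal prompt path instead of stuffing irrelevant chunks into the generation prompt.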

You can push this further with a dedicated retrieval-worthiness classifier sitting in front of the retrieval step. Before even querying the index, a lightweight classifier decides whether the query is in-domain for this corpus. Out-of-domain queries skip retrieval entirely and return a clear "I don't cover that topic" response. This matters because for truly out-of-domain queries, any retrieval result will be irrelevant, and rerankers are not infinitely reliable at rejecting them all. Making retrieval itself conditional on a domain-match signal reduces the pressure on downstream filters.

Calibrating abstention without over-refusing

The failure mode on the other side is over-refusal: your system starts saying "I don't have that" for questions it could answer, destroying the usefulness that retrieval was supposed to provide. A 2025 study on whether retrieval-augmented models know when they don't know found that models trained aggressively for refusal often refuse questions they could have answered correctly from their internal knowledge alone — the retrieval layer signaled "bad context," and the model took that as "refuse," even when refusal was wrong.

The production lesson is that abstention needs to combine multiple signals, not rely on any single threshold. Useful inputs include:

  • Retrieval signal: highest reranker score and its distance from the next-highest. A cluster of high scores suggests a well-supported answer; a single borderline score with nothing near it suggests a weak match dressed up as a strong one.
  • Sufficient-context signal: a classifier — often just a small LLM with a structured prompt — that reads the retrieved context and judges whether it contains enough information to answer the question. Google's research showed this can be trained to 93% accuracy on a binary sufficient/insufficient classification.
  • Model self-confidence: sampling-based confidence (does the model give the same answer five times?) or logprob-based confidence. This is noisy on its own but useful as a cross-check against the retrieval signal.
  • Query-type prior: factual questions should abstain more aggressively than exploratory ones, because being wrong on "what's the deadline?" is worse than being vague on "what are we thinking about pricing?"

Combining sufficient-context scoring with model self-confidence produces a selective generation framework that, in the Google study, improved the accuracy-coverage trade-off by roughly 10 percentage points over confidence-only methods. The point isn't the exact number — it's that orthogonal signals stack, and any single one used alone will either over-refuse or under-refuse.
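
One way to sketch the stacking, assuming each signal has already been normalized to a score: require a quorum of orthogonal signals rather than any single threshold. Every cutoff below is an invented placeholder; in practice each would be calibrated on a labeled eval set.

```python
# Stacking orthogonal abstention signals: retrieval strength, a
# sufficient-context classifier, and model self-consistency, with a
# query-type prior on the retrieval bar. All cutoffs are illustrative.
def should_abstain(top_rerank_score, sufficient_ctx_prob,
                   self_consistency, query_type):
    """Return True if the pipeline should route to the abstention path."""
    # Query-type prior: factual queries get a stricter retrieval bar.
    bar = 0.75 if query_type == "factual" else 0.55
    retrieval_ok = top_rerank_score >= bar
    context_ok = sufficient_ctx_prob >= 0.5   # sufficient-context classifier
    model_ok = self_consistency >= 0.6        # e.g. 3 of 5 samples agree
    # Answer only if at least two of the three orthogonal signals pass.
    return sum([retrieval_ok, context_ok, model_ok]) < 2
```

The quorum rule is one of many possible combiners; a small logistic model over the same features is a common next step once you have labeled abstention data.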

Making "I don't have that" a first-class response

The deepest version of the retrieval emptiness problem is not technical but architectural: most RAG systems are designed in a way that makes abstention structurally unlikely. The prompt template assumes retrieval worked. The UX assumes a confident answer is incoming. The evaluation harness measures response quality on questions the corpus covers, not refusal quality on questions it doesn't. Every layer is biased toward "produce output."

Teams that successfully ship trustworthy RAG invert this. They treat the "I don't have that information" response as a product feature with its own UX, its own prompt path, its own evaluation set, and its own observability. In practice this means:

  • An explicit abstention path in the prompt-assembly code, triggered by the reranker and sufficient-context signals, that swaps in a different system prompt asking the model to state what it would need to answer and to suggest where the user might look.
  • An eval set of known-unanswerable queries — questions the corpus genuinely doesn't cover — scored on refusal rate and refusal quality. Without this set, you cannot tell whether raising a threshold improved trust or just shipped more "I don't know" for questions you should have answered.
  • Separate dashboards for refusal rate, hallucination rate on insufficient-context queries, and coverage loss. These three metrics trade against each other, and you can't optimize them as one number.
  • A feedback channel where users can flag "you refused something you should have answered." Over-refusal is usually silent — users just stop asking — so you have to actively surface it.
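
The unanswerable-query eval from the list above reduces to a simple metric loop. Here `rag` is a placeholder for your pipeline, and the refusal check is a naive substring match standing in for a proper refusal classifier.

```python
# Scoring a known-unanswerable eval set: what fraction of queries the
# corpus genuinely cannot answer does the system actually refuse?
# The marker-matching refusal check is a deliberately naive stand-in.
REFUSAL_MARKERS = ("don't have that information", "don't cover that topic")

def refusal_rate(rag, unanswerable_queries):
    refusals = sum(
        1 for q in unanswerable_queries
        if any(m in rag(q).lower() for m in REFUSAL_MARKERS)
    )
    return refusals / len(unanswerable_queries)
```

Tracked alongside coverage loss on answerable queries, this is the number that tells you whether a threshold change bought trust or just shipped more unnecessary refusals.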

The most trustworthy thing a RAG system can say is that it doesn't know. The tooling to earn that trust doesn't come from better retrieval — it comes from treating abstention as an engineered capability. A RAG that always answers will always hallucinate the tails of its distribution. A RAG that knows its limits is the one users can actually rely on.
