The Retrieval Emptiness Problem: Why Your RAG Refuses to Say 'I Don't Know'

· 10 min read
Tian Pan
Software Engineer

Ask a production RAG system a question your corpus cannot answer and watch what happens. It rarely says "I don't have that information." Instead, it retrieves the five highest-ranked chunks — which, having nothing better to match, are the five least-bad chunks of unrelated content — and hands them to the model with a prompt that reads something like "answer the user's question using the context below." The model, trained to be helpful and now holding text that sort of resembles the topic, produces a confident answer. The answer is wrong in a way that's architecturally invisible: the retrieval succeeded, the generation succeeded, every span was grounded in a retrieved document, and the user walked away misled.

This is the retrieval emptiness problem. It isn't a bug in any single layer. It's the emergent behavior of a pipeline that treats "top-k" as a contract and never asks whether the top-k is any good. Research published at ICLR 2025 on "sufficient context" quantified the effect: when Gemma receives sufficient context, its hallucination rate on factual QA is around 10%. When it receives insufficient context — retrieved documents that don't actually contain the answer — that rate jumps to 66%. Adding retrieved documents to an under-specified query makes the model more confidently wrong, not less.

The fix most teams reach for is "raise the similarity threshold." It helps a little and then stops helping. Real solutions require rethinking retrieval as a classification problem whose output space includes a null answer, not a ranking problem whose output is always a list.

Why top-k is the wrong abstraction

Vector search is a ranking primitive. You give it a query embedding and it returns the k nearest neighbors in cosine space, sorted. The primitive is neutral about whether those neighbors are relevant. A query about "tax treatment of stock options in Germany" against a corpus of cooking recipes will still return five recipes, ranked by whichever ones happen to share the most incidental vocabulary with the query vector. Distance in embedding space does not equal semantic relevance; it equals "this was the closest thing we had."
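
To make the point concrete, here is a toy sketch of the ranking primitive. The corpus, query, and embedding dimensions are all invented for illustration; the only thing that matters is the shape of the contract: k results come back no matter how bad the best match is.

```python
# Toy demonstration that nearest-neighbor search always returns k results,
# relevant or not. Embeddings here are random vectors, not real model output.
import numpy as np

rng = np.random.default_rng(0)

# Pretend corpus of 100 "recipe" embeddings and one off-topic query.
corpus = rng.normal(size=(100, 64))
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = rng.normal(size=64)
query /= np.linalg.norm(query)

def top_k(query, corpus, k=5):
    """Return indices and cosine scores of the k nearest neighbors.
    Note the contract: k results come back even if every score is terrible."""
    scores = corpus @ query
    idx = np.argsort(scores)[::-1][:k]
    return idx, scores[idx]

idx, scores = top_k(query, corpus)
```

The function has no way to say "nothing matched": `len(idx)` is always 5, and the scores are merely the least-bad cosines available.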

The problem compounds at the prompt-assembly step. Standard RAG prompts are written in the imperative: "use the following context to answer the question." There is no slot in the template for "none of this context is relevant, so admit you don't know." The LLM receives five irrelevant passages with a directive to use them, and it does. Researchers call this the abstention paradox: retrieval, which was supposed to reduce hallucination by grounding outputs in real documents, instead increases hallucination when the documents are off-topic, because adding any context raises the model's unwarranted confidence.

A 2025 survey of abstention in LLMs frames this as a calibration failure. Models don't know what their retrieved context means; they simply treat it as true. Because the upstream pipeline never signals "this retrieval is low-confidence," weak and strong retrievals arrive at the model looking identical, and it handles them identically. The generation ends up grounded in text that happened to rank highest, not text that is actually responsive.

Similarity thresholds are necessary but not sufficient

The first reflex for most teams is adding a cosine similarity cutoff: if the top result scores below 0.7, return "no answer." This is the right direction and the wrong implementation. Three problems surface almost immediately.

First, the 0.7 number is a folk default, not a calibrated value. Common threshold bands look like: 0.9+ for near-exact matches, 0.7–0.8 for relevant-but-imperfect, 0.5–0.6 for loosely related, 0.3–0.4 for weak connections. But those bands are model-specific and corpus-specific. Switch from text-embedding-3-small to a multilingual model and the whole distribution shifts. Index a corpus of highly similar technical documents and every score compresses toward 0.9, making 0.7 meaningless. The right threshold for your system can only be discovered empirically by holding out queries, labeling them, and seeing where the signal-to-noise ratio inflects.
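
The empirical calibration can be sketched in a few lines: collect the top retrieval score for each held-out query, label whether the result was actually relevant, then sweep candidate cutoffs and keep the one that maximizes F1 (or whatever precision/recall trade-off your product needs). The labeled pairs below are invented for illustration.

```python
# Sketch of empirical threshold calibration: sweep candidate cutoffs over
# labeled (top_score, is_relevant) pairs from held-out queries and pick
# the cutoff that maximizes F1. The labels here are made up.
def calibrate_threshold(scored_labels, candidates=None):
    """scored_labels: list of (top_score, is_relevant) from held-out queries."""
    if candidates is None:
        candidates = sorted({s for s, _ in scored_labels})
    best_t, best_f1 = 0.0, -1.0
    for t in candidates:
        tp = sum(1 for s, rel in scored_labels if s >= t and rel)
        fp = sum(1 for s, rel in scored_labels if s >= t and not rel)
        fn = sum(1 for s, rel in scored_labels if s < t and rel)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1

labeled = [(0.91, True), (0.84, True), (0.79, True), (0.72, False),
           (0.68, False), (0.81, True), (0.55, False), (0.63, False)]
t, f1 = calibrate_threshold(labeled)
```

On this toy data the inflection lands at 0.79, not 0.7; on your corpus and embedding model it will land somewhere else, which is the point.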

Second, a global threshold is wrong even within a single system because different query types have different score distributions. A specific factual query ("what's Jane's email?") should hit precision-tier matches at 0.75+, while an exploratory query ("what do we know about pricing strategy?") legitimately retrieves documents at 0.5–0.6 that are genuinely useful. Holding both to 0.7 either drops useful exploratory results or passes bad specific-query results. Per-query-class calibration — treating threshold as a function of query intent — typically captures 2–3x the precision improvement of a global cutoff.
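
A minimal sketch of per-query-class thresholds, assuming an intent classifier sits in front of the cutoff. The keyword heuristic and the numbers below are placeholders; each band should come from calibrating on labeled queries of that class, not from this example.

```python
# Per-query-class thresholds: classify the query's intent first, then
# apply a cutoff calibrated for that class. Both the intent heuristic
# and the threshold values are illustrative placeholders.
THRESHOLDS = {"factual": 0.75, "exploratory": 0.55}

def classify_intent(query: str) -> str:
    """Toy stand-in for a real intent classifier (often a small model)."""
    factual_cues = ("what's", "what is", "when", "who", "how many")
    return "factual" if query.lower().startswith(factual_cues) else "exploratory"

def passes(query: str, top_score: float) -> bool:
    """Accept the retrieval only if it clears its class-specific bar."""
    return top_score >= THRESHOLDS[classify_intent(query)]
```

A 0.6 score now fails "what's Jane's email?" but passes "what do we know about pricing strategy?", which is exactly the behavior a single global cutoff cannot express.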

Third, similarity is a proxy for relevance, not a measurement of it. Two sentences can have high cosine similarity because they share vocabulary and syntactic structure while discussing different things. A tax question about stock options will cosine-match a document about corporate compensation even if the document never addresses taxation. The bi-encoder embedding is optimized for retrieval speed, not for relevance judgment. Raising the threshold past a certain point starts dropping genuinely useful matches before it filters out the confidently-irrelevant ones.

The two-stage pattern: retrieve wide, classify narrow

The production pattern that actually works treats retrieval and relevance as separate problems. The first stage casts a wide net — retrieve 50 or 100 candidates from vector search with no strict threshold. The second stage runs a cross-encoder reranker over the candidates, scoring each (query, document) pair with a model that was trained specifically to judge relevance, not to produce fast similarity.

Cross-encoders are slower than bi-encoders by design: they concatenate query and document and run a full transformer pass over the pair, which is what lets them catch the subtle mismatches bi-encoders miss. A typical production recipe pulls 150 candidates from vector search, reranks the top 30 with a cross-encoder, and sends the top 3–8 to the LLM. The cross-encoder scores are calibrated and sit on a different scale than cosine similarity — a reranker score of 0.5 means "moderately relevant," not "half-matched," and the score distribution is much more stable across query types.

Now the threshold decision moves to the reranker output, where it actually means something. And it gets you a new capability for free: if no document passes the reranker threshold, the pipeline can route to abstention instead of handing the LLM the best of a bad lot. The reranker is acting as a relevance classifier with a null answer. That's the architectural shift — the pipeline now has a legitimate "no" output.
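
The whole two-stage pipeline with a null output fits in a short function. Here `vector_search` and `cross_encoder_score` are placeholders for your own index and reranker (e.g. a sentence-transformers CrossEncoder); the k values and cutoff are illustrative, not recommendations.

```python
# Two-stage pattern: retrieve wide with no strict threshold, rerank a
# manageable subset with a cross-encoder, and route to abstention if
# nothing clears the relevance bar. The callables are placeholders.
ABSTAIN = None  # the pipeline's legitimate "no" output

def answer_or_abstain(query, vector_search, cross_encoder_score,
                      wide_k=150, rerank_k=30, final_k=5, min_score=0.5):
    # Stage 1: cast a wide net from the vector index.
    candidates = vector_search(query, k=wide_k)
    # Stage 2: score (query, doc) pairs with the cross-encoder and sort.
    scored = sorted(((cross_encoder_score(query, doc), doc)
                     for doc in candidates[:rerank_k]), reverse=True)
    # Keep only documents that clear the calibrated reranker threshold.
    passing = [doc for score, doc in scored[:final_k] if score >= min_score]
    # No survivors means abstain, not "best of a bad lot."
    return passing if passing else ABSTAIN
```

The caller can now branch on `ABSTAIN` to a refusal prompt path instead of stuffing irrelevant chunks into the generation prompt.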

You can push this further with a dedicated retrieval-worthiness classifier sitting in front of the retrieval step. Before even querying the index, a lightweight classifier decides whether the query is in-domain for this corpus. Out-of-domain queries skip retrieval entirely and return a clear "I don't cover that topic" response. This matters because for truly out-of-domain queries, any retrieval result will be irrelevant, and rerankers are not infinitely reliable at rejecting them all. Making retrieval itself conditional on a domain-match signal reduces the pressure on downstream filters.

Calibrating abstention without over-refusing

The failure mode on the other side is over-refusal: your system starts saying "I don't have that" for questions it could answer, destroying the usefulness that retrieval was supposed to provide. A 2025 study on whether retrieval-augmented models know when they don't know found that models trained aggressively for refusal often refuse questions they could have answered correctly from their internal knowledge alone — the retrieval layer signaled "bad context," and the model took that as "refuse," even when refusal was wrong.

The production lesson is that abstention needs to combine multiple signals, not rely on any single threshold. Useful inputs include:

  • Retrieval signal: highest reranker score and its distance from the next-highest. A cluster of high scores suggests a well-supported answer; a single borderline score with nothing near it suggests a weak match dressed up as a strong one.
  • Sufficient-context signal: a classifier — often just a small LLM with a structured prompt — that reads the retrieved context and judges whether it contains enough information to answer the question. Google's research showed this can be trained to 93% accuracy on a binary sufficient/insufficient classification.
  • Model self-confidence: sampling-based confidence (does the model give the same answer five times?) or logprob-based confidence. This is noisy on its own but useful as a cross-check against the retrieval signal.
  • Query-type prior: factual questions should abstain more aggressively than exploratory ones, because being wrong on "what's the deadline?" is worse than being vague on "what are we thinking about pricing?"

Combining sufficient-context scoring with model self-confidence produces a selective generation framework that, in the Google study, improved the accuracy-coverage trade-off by roughly 10 percentage points over confidence-only methods. The point isn't the exact number — it's that orthogonal signals stack, and any single one used alone will either over-refuse or under-refuse.
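
One way to sketch the stacking, assuming each signal has already been normalized to a score: require a quorum of orthogonal signals rather than any single threshold. Every cutoff below is an invented placeholder; in practice each would be calibrated on a labeled eval set.

```python
# Stacking orthogonal abstention signals: retrieval strength, a
# sufficient-context classifier, and model self-consistency, with a
# query-type prior on the retrieval bar. All cutoffs are illustrative.
def should_abstain(top_rerank_score, sufficient_ctx_prob,
                   self_consistency, query_type):
    """Return True if the pipeline should route to the abstention path."""
    # Query-type prior: factual queries get a stricter retrieval bar.
    bar = 0.75 if query_type == "factual" else 0.55
    retrieval_ok = top_rerank_score >= bar
    context_ok = sufficient_ctx_prob >= 0.5   # sufficient-context classifier
    model_ok = self_consistency >= 0.6        # e.g. 3 of 5 samples agree
    # Answer only if at least two of the three orthogonal signals pass.
    return sum([retrieval_ok, context_ok, model_ok]) < 2
```

The quorum rule is one of many possible combiners; a small logistic model over the same features is a common next step once you have labeled abstention data.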

Making "I don't have that" a first-class response

The deepest version of the retrieval emptiness problem is not technical but architectural: most RAG systems are designed in a way that makes abstention structurally unlikely. The prompt template assumes retrieval worked. The UX assumes a confident answer is incoming. The evaluation harness measures response quality on questions the corpus covers, not refusal quality on questions it doesn't. Every layer is biased toward "produce output."

Teams that successfully ship trustworthy RAG invert this. They treat the "I don't have that information" response as a product feature with its own UX, its own prompt path, its own evaluation set, and its own observability. In practice this means:

  • An explicit abstention path in the prompt-assembly code, triggered by the reranker and sufficient-context signals, that swaps in a different system prompt asking the model to state what it would need to answer and to suggest where the user might look.
  • An eval set of known-unanswerable queries — questions the corpus genuinely doesn't cover — scored on refusal rate and refusal quality. Without this set, you cannot tell whether raising a threshold improved trust or just shipped more "I don't know" for questions you should have answered.
  • Separate dashboards for refusal rate, hallucination rate on insufficient-context queries, and coverage loss. These three metrics trade against each other, and you can't optimize them as one number.
  • A feedback channel where users can flag "you refused something you should have answered." Over-refusal is usually silent — users just stop asking — so you have to actively surface it.
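
The unanswerable-query eval from the list above reduces to a simple metric loop. Here `rag` is a placeholder for your pipeline, and the refusal check is a naive substring match standing in for a proper refusal classifier.

```python
# Scoring a known-unanswerable eval set: what fraction of queries the
# corpus genuinely cannot answer does the system actually refuse?
# The marker-matching refusal check is a deliberately naive stand-in.
REFUSAL_MARKERS = ("don't have that information", "don't cover that topic")

def refusal_rate(rag, unanswerable_queries):
    refusals = sum(
        1 for q in unanswerable_queries
        if any(m in rag(q).lower() for m in REFUSAL_MARKERS)
    )
    return refusals / len(unanswerable_queries)
```

Tracked alongside coverage loss on answerable queries, this is the number that tells you whether a threshold change bought trust or just shipped more unnecessary refusals.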

The most trustworthy thing a RAG system can say is that it doesn't know. The tooling to earn that trust doesn't come from better retrieval — it comes from treating abstention as an engineered capability. A RAG that always answers will always hallucinate the tails of its distribution. A RAG that knows its limits is the one users can actually rely on.
