
No Results Is Not Absence: Why Agents Treat Retrieval Failure as Proof

Tian Pan
Software Engineer

The most dangerous sentence in an agent transcript is not a hallucination. It is five calm words: "I could not find it." The agent sounds epistemically humble. It sounds like due diligence. It sounds, to any downstream reader or caller, exactly like a fact. And yet the statement carries no information about whether the thing exists. It only carries information about what happened when a specific tool, invoked with a specific query, consulted a specific index that the agent happened to have access to at that moment.

Between those two readings, a fact about the world and a report on a single tool call, lies a production incident waiting to happen. A support agent tells a customer "we have no record of your order" because replication lag delayed the write to the read replica by ninety seconds. A coding agent declares "there are no tests for this module" because it searched a directory that did not contain the test folder. A compliance agent replies "no prior violations on file" because the audit index had not ingested last week's report. In each case the agent's output is grammatically a negation, but epistemically it is a shrug that has been re-typed as a claim.

This is not a model quality problem. Even a perfectly calibrated LLM cannot know what it did not see. The fault sits upstream at the boundary where retrieval tools hand their results to the reasoning loop, and it is almost always a contract design bug: the tool returned "empty" when it should have returned something richer, and the agent collapsed that ambiguity into a confident negation because that is what natural language likes to do with empty sets.

The Four Causes That Collapse Into One Output

When an agent says "not found," at least four structurally different upstream states can produce the phrase, and by the time the words reach the user they are indistinguishable:

  • Wrong tool for the scope. The agent reached for search_current_tickets when the customer's question was about a ticket closed last quarter and archived to cold storage. The tool did its job. It searched where it was told to search. The archive was simply not in scope.
  • Malformed query. The agent translated "the thing I ordered on Tuesday" into a string search for "the thing" in a system that expects an order ID or a SKU. The database returned zero rows. It was telling the truth about the query it was given.
  • Stale or misindexed corpus. The record exists in the system of record but has not propagated to the retrieval index. Survey work on production RAG systems suggests more than half of post-launch retrieval failures trace back to freshness gaps of this kind — the index is a lagging snapshot of a moving target, and the gap is rarely instrumented.
  • Permission-filtered away from this principal. The record exists, the index knows it exists, but the query ran under an identity that is not allowed to see it. Azure AI Search's security filter model and Auth0's writing on agent access control both make this explicit: filtered retrieval is working correctly when it hides rows, but the hiding is invisible to the agent.

All four produce the same string. The agent has no way to distinguish "I looked everywhere, the world contains no such record" from "something I called returned []." Philosophers have a name for the move from the second to the first: the argument from ignorance. Absence of evidence is evidence of absence only under a strong prior that one has looked in the right places with the right instruments. Tool contracts in most agent frameworks do not encode that prior, and the model has no way to reconstruct it from the wire format.
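
The collapse is easy to see in miniature. Here are four hypothetical upstream states behind one indistinguishable payload; the comments are visible to us, but the agent sees only the right-hand side:

```python
# Four different realities, one identical wire payload.
wrong_scope  = {"results": []}  # archive shard was never in the searched index
bad_query    = {"results": []}  # free text sent to a field expecting an order ID
stale_index  = {"results": []}  # record exists, the write has not propagated yet
acl_filtered = {"results": []}  # record exists, this principal cannot see it

assert wrong_scope == bad_query == stale_index == acl_filtered
```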

Why the Model Compresses Ambiguity Into a Claim

An LLM is a fluency engine. When it reads a tool result of {"results": []}, it is being handed a sparse, ambiguous, structurally impoverished observation and asked to render it into prose that will be consumed by a human. The most natural English rendering of an empty list is a negation. "There are no results" is a better sentence, stylistically, than "a system call returned an empty array and I cannot warrant what that implies." The model is not lying. It is doing exactly what it was trained to do: produce plausible text.

This is where the epistemic asymmetry shows up. A positive retrieval result is self-certifying — the record is right there, the agent can quote its fields, the reader can audit it. A negative retrieval result is a claim about the complement of everything searched, which is not observable. The agent is being asked to warrant something it cannot witness, and the linguistic interface makes it cheap to do so implicitly.

Research on hallucination taxonomy treats this as a distinct failure class: the agent does not invent a record that does not exist (the classic hallucination), it instead invents certainty about non-existence because the tool contract gave it no vocabulary for uncertainty. Some practitioners call it a fluency collapse — structurally meaningful ambiguity is flattened into a grammatically confident output because the model has no dedicated token for "I'm not sure the search was exhaustive."

Redesigning the Tool Return Contract

The fix is not at the prompt layer. No amount of "be careful about negatives" instruction will survive long context and tool chaining. The fix is in the return shape of the tool itself. A retrieval tool should emit enough provenance that the agent could, if it wanted to, distinguish the four cases above without guessing. That looks roughly like the following fields, with a concrete return shape sketched in code after the list:

  • Scope disclosure. The tool states what corpus or shard it searched, with a timestamp. Not "orders" but "orders_index, snapshot 2026-04-23T14:02Z, covering last 90 days." The agent now has the information it needs to notice when the user asked about something outside that window.
  • Query echo. The tool echoes the query it actually executed, after any server-side rewriting or normalization. The agent can now reason about whether its query was well-formed for the index's schema, rather than trusting that the text it sent was what ran.
  • Freshness contract. The tool declares its indexing lag — either an SLO ("up to 60 seconds behind") or the exact lag for this request. An agent reading "60 seconds behind" has the information it needs to mention the caveat when the user's question is about a just-now event.
  • Permission-filter acknowledgement. When rows were omitted due to access control, the tool should say so, even if it cannot say which rows. A flag like filtered_by_policy: true avoids the leakage risk of enumerating hidden records while still warning the caller that an empty result may not mean an empty corpus.
  • Exhaustiveness flag. Did the tool scan the full candidate set, or did it return early after a top-k cutoff with no match among the top k? These are very different epistemic situations, and paginated retrieval conflates them by default.
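
A sketch of what such a contract could look like on the wire (the field names are illustrative, not a standard):

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalResult:
    """A possible return shape carrying the five provenance fields above.

    The names are illustrative; the point is that an empty `results`
    list never travels alone.
    """
    results: list = field(default_factory=list)
    # Scope disclosure: what was searched, as of when, covering what window.
    corpus: str = ""                     # e.g. "orders_index"
    snapshot_at: str = ""                # e.g. "2026-04-23T14:02:00Z"
    coverage: str = ""                   # e.g. "last 90 days"
    # Query echo: the query that actually ran, after server-side rewriting.
    executed_query: str = ""
    # Freshness contract: how far behind the system of record this index may be.
    max_staleness_seconds: int = 0
    # Permission-filter acknowledgement: rows were hidden, without saying which.
    filtered_by_policy: bool = False
    # Exhaustiveness flag: full scan, or an early return after a top-k cutoff.
    exhaustive: bool = False
```

An empty `results` with `filtered_by_policy=True` and `exhaustive=False` now reads very differently from an empty `results` after a full, unfiltered scan of a fresh index, which is exactly the distinction the agent needs.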

A tool return that carries these fields turns "no results" from a claim into a bundle of observations. Whether the agent uses them correctly is a second problem, but at least the agent now can.

Requiring Positive Existence Proofs Before Emitting Negations

Once the tool layer is richer, the agent layer gets an invariant worth enforcing: any user-visible negation must be backed by a positive existence proof that the search was authoritative. That invariant sounds abstract, but it cashes out as concrete guardrails in the agent loop.

For a support agent answering "do I have a refund pending," the loop should refuse to emit "no refund on file" unless the retrieval tool reports that it searched the authoritative ledger (not a cached mirror), within the user's tenant, under a principal that can see refund rows, and the index lag is below the threshold where a just-submitted refund could still be invisible. If any of those conditions is unmet, the only honest output is a hedge: "I can see the last 24 hours of refund activity and there's nothing in that window — anything older I'd need to check a separate system for." That is longer, it is less crisp, and it is the correct answer to the question actually being asked.

Practitioners building internal agents sometimes implement this as a small wrapper in the tool dispatch layer: before any tool result can be narrated as a negation, the wrapper checks that the result's scope and freshness metadata meet a policy for the user's question type. If they do not, the wrapper rewrites the empty result into an uncertainty token the agent is trained to surface. This is not sophisticated engineering. It is just refusing to let fluency smear over epistemics.
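
A minimal sketch of that wrapper, reusing the hypothetical RetrievalResult shape above; the staleness threshold and the uncertainty token are assumptions, not any framework's API:

```python
# Dispatch-layer guard: an empty result may be narrated as a negation only
# if its provenance clears the policy; otherwise it is rewritten into an
# explicit uncertainty token the agent is expected to surface verbatim.
UNCERTAIN = {
    "status": "inconclusive_search",
    "note": "empty result, but scope/freshness do not warrant a negation",
}


def guard_negation(result: RetrievalResult, max_staleness: int = 60):
    if result.results:                      # positive results pass through untouched
        return result
    authoritative = (
        result.exhaustive                   # full scan, not a top-k early return
        and not result.filtered_by_policy   # nothing hidden from this principal
        and result.corpus != ""             # scope was actually disclosed
        and result.max_staleness_seconds <= max_staleness
    )
    return result if authoritative else UNCERTAIN
```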

The Incident Pattern Nobody Files Under the Right Category

The reason this failure mode is chronically under-instrumented is that it does not look like a bug in the usual sense. Nothing errors. No tool returns a 500. The agent emits clean, polite, well-formed English. The customer receives an answer. Three weeks later someone notices that returning-customer retention dropped two points, or a VIP escalation reveals that four unrelated users were told "no record" for records that plainly existed, or a legal review finds that a compliance agent confidently greenlit a transaction because the relevant sanction flag had not yet propagated.

Post-mortems for incidents in this class tend to file under "hallucination" or "knowledge base freshness" or "access control misconfiguration" — one root cause per incident, each filed separately, each assigned to a different team. The pattern underneath — agent treated a null retrieval as an authoritative negation — is the same across all of them and rarely gets a dedicated track. Teams that have started naming it explicitly find that a remarkable number of their "miscellaneous" support failures collapse into one tractable problem with one architectural fix.

Two telemetry signals help surface it. The first is a rate of tool results of the form "empty + no scope metadata" paired with agent outputs containing an authoritative negation — a trivially computable ratio if tool returns are structured, and a leading indicator that the contract has not been tightened. The second is a sampled human review of agent transcripts specifically looking for sentences of the form "no X exists" or "there is no record" and asking the reviewer whether the search underlying that claim was in fact exhaustive, scoped correctly, and timely. That review uncovers an embarrassing number of confident negations that turn out to be polite shrugs.
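
The first signal is cheap to compute once tool returns are structured. A sketch, assuming a hypothetical log schema that pairs each tool result with the agent text that narrated it:

```python
import re

# Crude negation detector over agent outputs; the pattern is an assumption
# and would need tuning against real transcripts.
NEGATION = re.compile(r"\b(no record|no results?|does not exist|no such)\b", re.I)


def unwarranted_negation_rate(turns: list[dict]) -> float:
    """Among turns where an empty tool result was narrated as a negation,
    the fraction whose tool result carried no scope metadata at all."""
    flagged = total = 0
    for turn in turns:
        tool, answer = turn["tool_result"], turn["agent_output"]
        if not tool.get("results") and NEGATION.search(answer):
            total += 1
            if not tool.get("corpus"):      # empty AND no scope disclosure
                flagged += 1
    return flagged / total if total else 0.0
```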

The Honest Default for Open-World Retrieval

Production retrieval is an open world. The universe of things an agent might be asked about is vastly larger than the indexes it has access to, the indexes are noisy samples of the systems of record, the systems of record are themselves incomplete, and the identity the agent is acting on behalf of can see only a slice of any of the above. Under those conditions, the honest default is that a null retrieval is weak evidence of non-existence, not strong evidence. Strong evidence requires a claim about scope, freshness, and authority that the tool must affirmatively provide.

The fix is tedious and mostly unglamorous: rewriting tool return shapes, teaching the agent to consume scope and freshness metadata, adding a dispatch-layer guard that refuses to narrate negations when the evidence is thin, training reviewers to audit negations as carefully as they audit positive claims. It does not look like the AI engineering that shows up in conference talks. But the alternative — shipping agents whose fluency is allowed to convert silence into testimony — is the substrate on which a large class of quiet, confidence-laundering production incidents is currently being built. The agents sound certain. They are not. The fix starts at the tool contract, and it starts by refusing to let "no results" ever mean "no such thing."
