
The 'What Changed' Query Is the RAG Question Your Index Can't Answer

10 min read
Tian Pan
Software Engineer

A user asks your assistant, "what changed about our refund policy this quarter?" The system returns a confident, well-formatted summary of the current refund policy. The user nods, closes the chat, and acts on information that has nothing to do with the question they asked. Nothing in your eval suite caught this. Nothing in your faithfulness metric flagged it. The retrieval looked perfect — it returned highly relevant chunks. The synthesis looked perfect — it cited every chunk it used. The only problem is that the question was about change, and your index has no concept of change.

This is the failure mode that vector-similarity retrieval cannot fix by tuning. Two versions of the same document have nearly-identical embeddings — that is what good embeddings do, they collapse semantically equivalent text into the same neighborhood. So when you ask "what changed," the retriever returns one of the versions, the LLM summarizes that version, and the answer is silently a hallucination of nothing-changed. The user cannot tell. Your eval set probably cannot tell either, because your eval set is built around "what is X" questions, not "what's different about X now."

The shape of the problem is bigger than a missing feature. Delta queries — "what changed," "how is Q3 different from Q2," "what's new since I last looked," "did we update the SLA after the incident" — are a meaningful fraction of real production traffic on any RAG system that sits on top of evolving documents. Engineering teams ship release notes. Legal teams compare contract versions. Customer success teams need to know which policies were updated this month. None of these queries are well-served by an index built around "find the most similar chunk." They need an index that knows what came before, what came after, and what is the same chunk seen at two different points in time.

Why Vector Similarity Cannot Detect Change

The intuition that breaks first-time builders is the assumption that "more retrieval" will eventually surface both versions. It will not. A well-tuned embedding model maps "the refund window is 30 days" and "the refund window is 14 days" to vectors that are very close to each other — they share most of their tokens, share the same syntactic pattern, share the same domain. Cosine similarity puts them next to each other in the embedding space. The retriever has no axis along which to prefer one over the other for a query that does not specify a time. It picks one and serves it. The LLM downstream has exactly one document to read, and the document tells a coherent story about refunds, and the answer it generates is coherent and wrong.
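To see this concretely, here is a minimal sketch, assuming the sentence-transformers library and an illustrative model name, that embeds both versions of the refund line alongside the delta query:

```python
# Minimal sketch: two versions of the same policy line land almost on top of
# each other in embedding space. The model name is illustrative; any
# general-purpose sentence embedder shows the same effect.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
v1 = "The refund window is 30 days from the date of purchase."
v2 = "The refund window is 14 days from the date of purchase."
query = "What changed about our refund policy this quarter?"

emb = model.encode([v1, v2, query], normalize_embeddings=True)
print("similarity(v1, v2):   ", float(np.dot(emb[0], emb[1])))  # very high
print("similarity(query, v1):", float(np.dot(emb[2], emb[0])))
print("similarity(query, v2):", float(np.dot(emb[2], emb[1])))
# The query scores nearly the same against both versions: the retriever has
# no signal for which one answers a question about change.
```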

This is a structural limitation, not a tuning problem. You cannot fix it by raising k, lowering k, switching embedding models, adding a re-ranker, or rewriting the prompt. None of those interventions teach the retriever the difference between "the version of this document last quarter" and "the version of this document this quarter." Both versions occupy the same semantic neighborhood. Both versions look like good answers to the surface-level query. The retriever is doing exactly the job it was designed to do — and that job does not include time.

VersionRAG, published in late 2025, measured this directly. On a benchmark of version-sensitive questions over evolving technical documentation, plain RAG scored 58 to 64 percent accuracy — barely above random for binary questions. The accuracy ceiling was not a tuning gap. It was the same structural ceiling: the system could not reliably distinguish a question about "the doc as of now" from a question about "how the doc changed." A version-aware variant that explicitly modeled document evolution as a hierarchical graph hit 90 percent on the same benchmark, with 97 percent fewer indexing tokens. The lift came from the architecture, not from a better embedder.

The Hallucination Nobody Notices

Content hallucinations — the LLM inventing a fact that contradicts the source — are caught by faithfulness checks. Every modern RAG eval framework grades the output against the retrieved context and flags claims that the context does not support. This is the standard hallucination defense, and it works.

Delta hallucinations slip through that defense entirely. The LLM is being faithful to the context it was given. The context is the current version of the document. The summary of that version is accurate. The hallucination is at the level of what the user asked, not what the document says — the model answered "tell me about X" when the question was "tell me how X changed." The faithfulness metric scores this as a clean pass, because the metric was designed to compare claims against retrieved chunks, not to compare the framing of the answer against the framing of the question.

Detecting this requires a different category of eval — one that includes delta queries with ground-truth diffs and grades the answer on whether it produced a comparison rather than a description. Most teams do not have this category in their eval suite. They have "factual" queries, "summarization" queries, maybe "multi-hop" queries. They do not have "delta" queries as a first-class slice. So the failure mode never registers in the dashboard, never burns the error budget, and ships to production silently. The first time anyone notices is when a customer escalates because they made a decision based on outdated information that the assistant confidently presented as current.

What an Index That Knows About Change Looks Like

The architecture that fixes this has four components, and each of them is missing from the default RAG starter kit.

First, a temporal index that retains versioned documents with diff metadata. When a document is updated, you do not throw away the old embedding — you keep it, tag it with a version number and a validity window, and store a content-addressable hash of each chunk so the system knows which chunks actually changed. LiveVectorLake's design uses SHA-256 over semantic chunks for this; the practical effect is that a "minor edit to one paragraph" preserves the embeddings of every other chunk and only re-indexes the one that actually moved. This is not just storage hygiene — it is the substrate that delta queries need.
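A minimal sketch of what such a record and its update path could look like, assuming a hypothetical store with `latest` and `add` operations and an `embed` function (this is the shape of the idea, not LiveVectorLake's actual API):

```python
# Sketch of a temporal index entry. The point is the metadata: version,
# validity window, and a content hash per chunk so unchanged chunks keep
# their existing embeddings when a new document version arrives.
import hashlib
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChunkRecord:
    chunk_id: str              # stable across versions (e.g. section anchor)
    doc_id: str
    version: int
    content_hash: str          # SHA-256 of the chunk text
    text: str
    embedding: list[float]
    valid_from: datetime
    valid_to: datetime | None  # None = current version

def index_new_version(store, doc_id, version, chunks, now, embed):
    """Re-embed only the chunks whose content hash actually changed."""
    for chunk_id, text in chunks.items():
        h = hashlib.sha256(text.encode("utf-8")).hexdigest()
        prev = store.latest(doc_id, chunk_id)   # hypothetical lookup
        if prev and prev.content_hash == h:
            continue                            # unchanged: keep old embedding
        if prev:
            prev.valid_to = now                 # close the old validity window
        store.add(ChunkRecord(chunk_id, doc_id, version, h, text,
                              embed(text), valid_from=now, valid_to=None))
```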

Second, a query intent classifier that routes delta queries to a delta-specific retrieval pipeline. Plain similarity search is the wrong default for "what changed." The classifier — which can be a small model or even a regex over query templates — detects intent markers like "what changed," "how is X different from Y," "what's new in," "since last quarter," and routes those queries to a different retrieval path that fetches the before and after of the relevant document, not just one similar chunk. VersionRAG describes this routing layer explicitly; the architectures that omit it are stuck in the 58 percent regime.
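A deliberately small version of that router, using regex patterns as the intent detector (the patterns are illustrative, not exhaustive):

```python
# Regex-based intent routing. In production this could be a small classifier
# model; the patterns below only cover a handful of delta markers.
import re

DELTA_PATTERNS = [
    r"\bwhat('s| has| have)? changed\b",
    r"\bhow is .+ different (from|than)\b",
    r"\bwhat('s| is) new (in|since)\b",
    r"\bsince (last|the last) (week|month|quarter|year)\b",
    r"\bdid we (update|change|revise)\b",
]

def route(query: str) -> str:
    q = query.lower()
    if any(re.search(p, q) for p in DELTA_PATTERNS):
        return "delta"       # fetch before/after versions of the matched document
    return "similarity"      # default top-k vector search

print(route("What changed about our refund policy this quarter?"))  # delta
print(route("What is our refund policy?"))                          # similarity
```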

Third, a chunking strategy that preserves cross-version chunk identity. When you re-chunk a document on a new version, you need a way to say "this chunk in v2 corresponds to that chunk in v1." Without this, even if you retrieve both versions, the LLM cannot align them — it gets two unrelated-looking blobs of text and has no scaffolding to compare. The fix is structural chunking (by section heading, by clause number, by paragraph anchor) rather than purely length-based chunking, plus a stable chunk identifier that survives edits to the chunk's contents.
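Here is one way to sketch that, splitting on markdown-style headings and deriving a chunk ID from the heading text; real documents with nested or duplicate headings need a fuller heading path, but the principle is the same:

```python
# Structural chunking: the chunk_id comes from the section heading, so the
# same section in v1 and v2 maps to the same identifier even when the text
# under the heading is edited.
import re

def structural_chunks(markdown_doc: str) -> dict[str, str]:
    chunks, current_id, buf = {}, "preamble", []
    for line in markdown_doc.splitlines():
        m = re.match(r"^#{1,6}\s+(.*)", line)
        if m:
            if buf:
                chunks[current_id] = "\n".join(buf).strip()
            current_id = m.group(1).strip().lower().replace(" ", "-")
            buf = []
        else:
            buf.append(line)
    if buf:
        chunks[current_id] = "\n".join(buf).strip()
    return chunks  # chunk_id -> text; IDs survive edits to the text beneath them
```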

Fourth, a synthesis prompt designed to compare rather than summarize. The default RAG prompt — "given these chunks, answer the question" — is wrong for delta queries. The right prompt for a delta query gives the model a labeled before-version and after-version and asks it to produce a structured diff: what was added, what was removed, what was rephrased without changing meaning, what was changed in substance. This is a different generation task with a different output schema, and treating it as a variant of the default summarization prompt produces vague mush.
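An illustrative delta-synthesis prompt along those lines (the labels and placeholder names are assumptions, not a prescribed schema):

```python
# The model gets a labeled before/after pair and is asked for a structured
# diff rather than a summary.
DELTA_PROMPT = """You are comparing two versions of the same document section.

## BEFORE (version {v_old}, valid until {old_end})
{before_text}

## AFTER (version {v_new}, current)
{after_text}

Answer the user's question: "{question}"

Respond with exactly these sections:
- Added: points present in AFTER but not in BEFORE
- Removed: points present in BEFORE but not in AFTER
- Changed in substance: points whose meaning changed, quoting both versions
- Rephrased only: wording changes with no change in meaning

If nothing changed, say so explicitly. Do not invent differences.
"""
```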

The Eval Slice You Are Probably Missing

A delta-query eval is concrete. You build a corpus of versioned documents — release notes are an easy starting point, but contracts, policies, technical specs, and config files all work. For each version pair, you write a small set of delta queries with ground-truth diffs that name the actual changes. Then you grade two things: did the system retrieve the right pair of versions, and did the synthesis produce a comparison that mentions the actual changes without inventing changes that did not happen.
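A sketch of what an eval item and the retrieval half of the grade could look like, assuming a simple (doc_id, version) representation of what the pipeline fetched:

```python
# One delta eval item plus the retrieval check: did the system fetch both
# versions named in the ground truth? The schema here is an assumption.
from dataclasses import dataclass

@dataclass
class DeltaEvalItem:
    query: str                       # e.g. "what changed in the refund policy?"
    doc_id: str
    old_version: int
    new_version: int
    ground_truth_changes: list[str]  # human-written list of actual changes

def grade_retrieval(item: DeltaEvalItem, retrieved: list[tuple[str, int]]) -> bool:
    """retrieved is a list of (doc_id, version) pairs the pipeline fetched."""
    got = set(retrieved)
    return (item.doc_id, item.old_version) in got and \
           (item.doc_id, item.new_version) in got
```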

The grading rubric for delta synthesis has to penalize three failure modes that the standard faithfulness metric does not catch. The first is the silent-current-version answer — the system summarizes the latest version as if the question were a content question. The second is the silent-stale-version answer — the system summarizes the prior version, often because the embedding for the older version happens to score slightly higher on the query. The third is the fabricated-change answer — the system invents a difference between versions that the diff does not contain, usually because the LLM defaults to producing a contrastive structure when the prompt frames the question as "what changed."
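One way to encode that rubric is as an LLM-judge prompt over the ground-truth diff, reusing the DeltaEvalItem from the sketch above; `call_judge` stands in for whatever model client you use:

```python
# Judge labels mirror the three failure modes described above, plus a pass.
JUDGE_PROMPT = """Ground-truth changes between version {v_old} and version {v_new}:
{ground_truth_changes}

Assistant's answer to a "what changed" question:
{answer}

Label the answer with exactly one of:
- PASS: framed as a comparison, and its claimed changes match the ground truth
- SILENT_CURRENT: describes the new version without comparing
- SILENT_STALE: describes the old version without comparing
- FABRICATED_CHANGE: claims a difference not in the ground truth
Return only the label."""

def grade_synthesis(item, answer, call_judge):
    return call_judge(JUDGE_PROMPT.format(
        v_old=item.old_version,
        v_new=item.new_version,
        ground_truth_changes="\n".join(f"- {c}" for c in item.ground_truth_changes),
        answer=answer,
    ))
```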

Running this eval against your existing RAG system is uncomfortable. The numbers will be much lower than your other slices. That is the point. The gap between your "what is X" performance and your "what changed about X" performance is the gap your users are already living with.

The Bigger Pattern: Retrieval as Similarity Is One Mode

Delta queries are the most visible case of a deeper issue: vector similarity retrieval is one mode of a richer retrieval problem space, and most production RAG systems are stuck operating in only that mode. The other modes are real and answer common questions:

  • Trend queries — "how has our pricing evolved over the last two years" — need temporally-ordered retrieval, not similarity-ranked retrieval.
  • Counterfactual queries — "what would have happened if we had used the old policy on this case" — need retrieval that can pull a non-current version of a document conditioned on a hypothetical.
  • Comparative queries — "how does our SLA differ from the industry standard" — need retrieval that pulls from two distinct corpora and aligns them, not retrieval that mixes them.
  • Provenance queries — "where did this number first appear in the docs" — need retrieval that walks document history, not retrieval that ranks by current relevance.

Each of these has the same shape as the delta query problem. The default similarity-ranked retriever returns confident, well-formatted, plausible-looking answers that fail in a category-specific way the faithfulness metric does not catch. Each of these failure modes is invisible until you build an eval slice for it.

What to Do This Week

If you are running a RAG system on documents that change — and almost every business RAG system is — there is a triage you can do this week before any architecture work. Sample your production traffic, classify queries by intent, and count what fraction are delta-shaped. If it is meaningful, build a small delta eval slice and run it against your current system. Then look at the result without flinching.
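If you already have something like the regex router sketched earlier, the counting step is a few lines (`load_queries` is a stand-in for however you export query logs):

```python
# Quick triage: run the delta patterns over a sample of production queries
# and report the delta-shaped fraction.
def delta_fraction(queries: list[str]) -> float:
    hits = sum(1 for q in queries if route(q) == "delta")  # route() from the router sketch
    return hits / max(len(queries), 1)

# sample = load_queries(days=7)   # hypothetical log export
# print(f"{delta_fraction(sample):.1%} of sampled queries are delta-shaped")
```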

The teams that ship the next generation of useful RAG products are the ones who treat retrieval as a problem space with multiple modes, not a single similarity search dressed up with re-rankers. The team that keeps treating "find the closest chunk" as the universal primitive is shipping a system whose silent failure rate scales with the rate at which the underlying documents change. In a business where documents change all the time, that rate is the rate at which the assistant is wrong about the questions users actually ask.
