The RAG Read-After-Write Race: When Your Vector Index Cites a Document That No Longer Exists

· 10 min read
Tian Pan
Software Engineer

A user asks your assistant a question at 14:32:07. Your retriever fires at 14:32:08 and pulls back five chunks from the policy handbook. The model thinks for a few seconds, drafts a response, and at 14:32:12 streams back an answer that confidently cites section 4.3 — the section that an admin deleted at 14:32:10 because it was wrong. The user reads an authoritative quotation from a document that no longer exists, complete with a clickable link that returns 404.

Nothing in your stack errored. The retriever returned a valid hit. The model produced fluent, grounded prose. The citation pointed at a real chunk ID that was real when the retrieval happened. And yet the answer is, by every reasonable definition, a hallucination — not because the model made something up, but because the world changed underneath the pipeline between the moment it looked and the moment it spoke.

This is the RAG read-after-write race, and most production pipelines have no defense against it.

The Consistency Model You Never Designed

Every RAG pipeline silently assumes a consistency model — read-your-writes for the user who just edited a document, monotonic reads across the steps of a single answer, snapshot isolation between retrieval and generation. None of these assumptions are enforced. They're inherited by accident from the fact that, in the dev environment, nothing changes during the four seconds it takes to answer.

Production is not the dev environment. Documents are edited, deleted, re-permissioned, and superseded constantly. The vector index is a cache of embeddings computed from those documents at indexing time, and like every cache, it has a coherence problem the moment its source can change without notifying it. The HNSW graph sitting behind your retriever doesn't know that the chunk it just returned was deleted from Postgres ninety seconds ago. The reranker that scored those chunks doesn't know that one of them has been edited to say the opposite of what it used to. The model that generates the citation doesn't know any of this either, because the only signal it gets is the chunk text — there is no consistency token, no version, no read timestamp, nothing that would let it say "this is stale."

If you've ever worked on a database, you recognize this immediately. Your RAG pipeline is a distributed read across two systems (the vector index and the source of truth) with no isolation guarantees and no protocol for detecting that the systems disagree. The reason it usually works is the same reason eventually-consistent caches usually work: the window for inconsistency is small, and most reads don't land in it. But "usually" is the worst SLO target a security-relevant or compliance-relevant system can have.

The Three Failure Modes Nobody Calls Race Conditions

The race shows up in three distinct shapes, and they get classified as different bugs by people who don't realize they share a root cause.

The deleted-chunk citation. A document is removed from the source corpus while a request is in flight. The vector index hasn't been informed yet — most production setups use lazy deletion or batch index updates because rebuilding HNSW on every delete is expensive — so the retriever happily returns the chunk. The user sees a quote from content that the company has explicitly chosen to remove. In a regulated context (a retracted medical guideline, a recalled financial document, a removed policy), this is not a "stale cache" problem; this is your AI putting words into the company's mouth that the company has formally repudiated. The standard incident-review response is to blame the model for hallucinating, which buries the actual bug in the index plumbing.

The edited-chunk inversion. Worse than deletion: the document still exists, but its content has been edited between retrieval and generation. The chunk in the model's context says "we offer a 30-day refund window." The chunk on disk now says "we do not offer refunds." The model writes a polite, well-cited answer promising the 30-day refund. Your customer-support team learns about the change six weeks later when refund requests start citing the AI's promise. There is no detectable hallucination here; the model is faithfully grounding its answer in retrieved context that was, at the moment of retrieval, accurate. The bug is that the context is no longer accurate at the moment of generation, and your pipeline has no concept of "moment."

The stale-permission leak. This is the one that keeps security architects up at night. A user's access to a document was revoked at 09:00. The permission change won't propagate into the vector index's metadata until the next sync at 09:15. Between 09:00 and 09:15, every search that user runs can return chunks from documents they no longer have access to. The retriever filters by metadata that is itself stale, and the filter passes. There is no logged access denial, because the access check used the cached permissions, which still said yes. As one practitioner put it, permissions APIs from third-party data sources generally don't provide webhooks, so re-indexing permissions is always a polling-interval problem — and the polling interval is your exposure window.

These three failures look unrelated in a Jira backlog. They're the same race condition firing in different parts of the pipeline.

The Architecture Options

Treating retrieval as a real database read — demanding the isolation guarantees you'd never forgo in a banking system but blithely forgo in a chatbot — opens up a small menu of architectural responses, ordered by how much you trust your source of truth to be a fast key-value lookup.

Snapshot isolation at retrieval time. Pin every retrieval to a logical timestamp. The retriever returns not just chunks but versioned chunks — chunk_id plus the revision the embedding was computed from. The model is told: "this answer is grounded as of T0." If you cite, you cite the version. If the document is edited later, the citation still resolves to the version the model actually saw. This costs you a versioning scheme on the storage layer (every chunk needs an immutable identifier per revision) and a UI affordance that says "this answer is based on the document as of <timestamp>." It's cheap to implement if your chunks already live in something like a CDC-fed event log; it's painful if your "source of truth" is a Confluence space that has no history API.
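A minimal sketch of what the versioned-chunk contract might look like, in Python. VersionedChunk, GroundedAnswer, and the index.search client are all hypothetical names, not any particular library's API; the only real requirements are that every chunk carries an immutable revision alongside its stable chunk_id, and that every answer carries the T0 it was pinned to:

```python
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class VersionedChunk:
    chunk_id: str    # stable identifier for the chunk
    revision: str    # immutable per-edit identifier, e.g. a content hash or CDC offset
    text: str        # the exact text the embedding was computed from
    source_doc: str  # the document this chunk belongs to

@dataclass(frozen=True)
class GroundedAnswer:
    text: str
    citations: list[VersionedChunk]
    grounded_as_of: datetime  # T0: the logical timestamp the retrieval was pinned to

def retrieve_snapshot(query: str, index) -> tuple[list[VersionedChunk], datetime]:
    """Pin a retrieval to a logical timestamp and return versioned chunks."""
    t0 = datetime.now(timezone.utc)
    hits = index.search(query, top_k=5)  # hypothetical vector-index client
    chunks = [VersionedChunk(h.chunk_id, h.revision, h.text, h.source_doc)
              for h in hits]
    return chunks, t0
```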

Revalidate before cite. Right before the model emits a citation — or right before the response leaves the response buffer — re-fetch the chunk by ID and confirm three things: the chunk still exists, its content hash matches what was retrieved, and the user still has access. If any check fails, drop the citation and either re-prompt the model with the surviving context or surface a "this answer is partially stale, regenerating" message. This is more expensive (one extra round trip per citation) but it's the only option that gives you read-your-writes semantics for users who just edited the underlying source. It's also the only option that closes the stale-permission leak, because the access check happens at output time, not at retrieval time.
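Here's a sketch of the three checks, assuming the VersionedChunk shape above. source_store.get and source_store.user_can_read are hypothetical stand-ins for whatever fast lookup your source of truth exposes:

```python
import hashlib

def revalidate_citation(chunk: VersionedChunk, source_store, user) -> bool:
    """Re-check one retrieved chunk against the source of truth at output time."""
    record = source_store.get(chunk.chunk_id)  # hypothetical source-of-truth lookup
    if record is None:                         # check 1: the chunk still exists
        return False
    current = hashlib.sha256(record.content.encode()).hexdigest()
    retrieved = hashlib.sha256(chunk.text.encode()).hexdigest()
    if current != retrieved:                   # check 2: content hash still matches
        return False
    return source_store.user_can_read(user, chunk.chunk_id)  # check 3: access now

def finalize_answer(answer: GroundedAnswer, source_store, user):
    """Drop stale citations right before the response leaves the buffer."""
    surviving = [c for c in answer.citations
                 if revalidate_citation(c, source_store, user)]
    if len(surviving) < len(answer.citations):
        # Either re-prompt the model with the surviving context, or surface a
        # "this answer is partially stale, regenerating" message instead.
        return None, surviving
    return answer, surviving
```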

Transactional retrieval. For high-stakes domains — medical guidance, legal advice, financial advice, compliance reporting — go further: fail the request if the source data changed during the in-flight window. This is the equivalent of optimistic concurrency control on a database row. The user's retry happens against fresh state. You'll lose some throughput; you'll gain the ability to say "no answer was ever generated against stale data." For domains where a wrong answer cited from a deleted document is a regulatory event, that's a trade you make.
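Building on the retrieve_snapshot sketch above, the optimistic-concurrency version might look like this; generate is your LLM call, and the revision comparison plays the role of a row-version check:

```python
class StaleReadError(Exception):
    """The source changed during the in-flight window; the caller retries."""

def answer_transactionally(query: str, index, source_store, generate) -> GroundedAnswer:
    """Optimistic-concurrency sketch: abort if any cited revision has moved."""
    chunks, t0 = retrieve_snapshot(query, index)
    draft = GroundedAnswer(generate(query, chunks), chunks, t0)
    for chunk in draft.citations:
        current = source_store.get(chunk.chunk_id)  # hypothetical lookup
        if current is None or current.revision != chunk.revision:
            # The moral equivalent of a failed compare-and-swap on a row version:
            # no answer is ever released against stale state.
            raise StaleReadError(f"{chunk.chunk_id} changed after {t0.isoformat()}")
    return draft
```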

Streaming re-indexing. None of the above is a substitute for narrowing the inconsistency window in the first place. The 2026 generation of streaming RAG architectures uses Change Data Capture against the source-of-truth database, with sub-second embedding lag (the delay between a document update and the new embedding being indexed). The race is still there, but the window shrinks from minutes to seconds, which collapses the probability mass of any given request landing in it. CDC isn't a consistency model; it's a way to make your eventually-consistent system's "eventually" arrive sooner.
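The indexer side of that is a small consumer loop. In this sketch, change_stream, embed, index, and lag_gauge are all assumptions: a Debezium-style change feed, an embedding function, a vector-index client, and a metrics gauge, respectively:

```python
import time

def run_streaming_indexer(change_stream, embed, index, lag_gauge):
    """Consume CDC events from the source of truth and keep the index coherent."""
    for event in change_stream:  # assumed fields: op, chunk_id, revision, content, commit_ts
        if event.op == "delete":
            index.delete(event.chunk_id)  # retracted content leaves the index immediately
        else:  # insert or update
            index.upsert(event.chunk_id, embed(event.content),
                         metadata={"revision": event.revision})
        # Embedding lag: how far the index trails the source commit. Treat it
        # as an SLO with an alert, not a graph nobody looks at.
        lag_gauge.set(time.time() - event.commit_ts)
```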

The Eval That Catches It

The reason this bug is everywhere is that nobody evals for it. The standard RAG eval suite measures retrieval precision, generation faithfulness, and citation accuracy — all against a static corpus that doesn't move during the eval run. None of those metrics catches a race condition, because there is no race in the test harness.

The eval that does catch it is structurally different. You construct test cases where the corpus mutates between retrieve and generate. The setup is straightforward: capture a real query, fork the source document, and inject the mutation (delete the chunk, edit it to assert the opposite, or revoke the test user's access) on a timer that fires after the retriever returns and before the model finishes streaming. Then check the answer.
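A minimal harness for a single test case might look like this; pipeline.answer is a hypothetical end-to-end retrieve-and-generate call, and mutation(corpus) is whichever of the three mutations the test case injects:

```python
import threading

def run_race_eval(query: str, pipeline, corpus, mutation, delay_s: float = 1.0):
    """Inject a corpus mutation while a single request is in flight."""
    # mutation(corpus) deletes the target chunk, edits it to assert the
    # opposite, or revokes the test user's access; delay_s is tuned to land
    # after the retriever returns and before the model finishes streaming.
    timer = threading.Timer(delay_s, mutation, args=(corpus,))
    timer.start()
    answer = pipeline.answer(query)  # end-to-end retrieve + generate
    timer.join()                     # ensure the mutation actually fired
    return answer
```

A fixed delay is the crude version; a more deterministic harness injects the mutation from a hook the pipeline exposes between retrieval and generation, so every run is guaranteed to land in the window.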

The metrics that fall out of this eval are not the metrics in your current dashboard. Stale citation rate — the fraction of answers that cite a chunk that no longer exists at output time. Inverted citation rate — the fraction of answers that cite a chunk whose current content contradicts what the model used. Permission leak rate — the fraction of answers that cite a chunk the requesting user could not currently retrieve. None of these will appear in any RAG benchmark you can buy, because benchmarks are built on frozen corpora and the bug only exists in motion.
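Scoring a batch of eval answers against the mutated corpus is then straightforward. One caveat: a content mismatch is a conservative proxy for inversion; deciding whether the current content actually contradicts what the model saw would take an extra NLI or LLM check on top of this sketch:

```python
def score_race_metrics(answers, source_store, user) -> dict[str, float]:
    """Aggregate the three race metrics over a batch of eval answers."""
    stale = inverted = leaked = 0
    for answer in answers:
        records = [(c, source_store.get(c.chunk_id)) for c in answer.citations]
        if any(rec is None for _, rec in records):
            stale += 1     # cites a chunk that no longer exists at output time
        if any(rec is not None and rec.content != c.text for c, rec in records):
            inverted += 1  # current content differs from what the model grounded on
        if any(not source_store.user_can_read(user, c.chunk_id) for c, _ in records):
            leaked += 1    # cites a chunk this user could not currently retrieve
    n = max(len(answers), 1)
    return {"stale_citation_rate": stale / n,
            "inverted_citation_rate": inverted / n,
            "permission_leak_rate": leaked / n}
```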

If your team's first reaction to "let's measure stale citation rate" is "but we don't have a way to mutate the corpus during a request," that's the signal that your eval harness was designed under the assumption that source-of-truth changes are someone else's problem. They're not. They're the problem.

The Architectural Realization

A vector index is a cache. Every cache needs a coherence protocol. The fact that your cache holds embeddings of natural-language text instead of cached SQL rows doesn't change the engineering — it just makes the inconsistency invisible to the people downstream, because the failure mode is "model wrote a fluent paragraph about a thing that isn't true anymore" rather than "page 500."

The teams that ship reliable RAG in 2026 have internalized this. They version their chunks. They surface the version in the citation. They re-validate before output for high-stakes paths. They run mutate-during-request evals as part of CI, not as a post-incident exercise. They treat the embedding lag as an SLO with an alert, not a graph nobody looks at. And they have a deletion protocol that propagates to the index synchronously for content that has been retracted, even when it costs them latency.

The teams that don't will keep filing tickets that say "the model hallucinated a deleted document" and pushing prompt-engineering fixes that can't possibly work, because the bug isn't in the prompt. The bug is that the world is concurrent and the pipeline pretends it isn't.
