Retrieval Cascade Failure: How Document Deletion Poisons Your RAG Pipeline
A user asks your support bot when the refund window closes. The bot answers "60 days" with cheerful confidence and a citation. The policy page that says "60 days" was deleted from the CMS three months ago. The new policy is 14 days. Nobody on your team knows the bot is wrong until a customer escalates.
This is a retrieval cascade failure: the document is gone from the source of truth, but its embedding is still in the index, still ranking high on cosine similarity, still feeding the model a ghost. RAG pipelines treat embedding indexes as caches of source content, but most teams build the cache without building the invalidation. Inserts get all the engineering attention. Deletes get a TODO comment.
The failure is hard to spot because the system looks healthy. Retrieval still returns top-k results. The model still cites a source, even if that URL now returns a 404 or a redirect. From the outside, the only signal is that answers are subtly wrong on questions about content that has churned. By the time a human catches it, the bad answer has been served to thousands of users.
Why Vector Indexes Don't Forget
The first surprise for most teams: deleting a row from a vector database is rarely a hard delete. Graph-based indexes like HNSW — which power Weaviate, Qdrant, pgvector's HNSW mode, and most other production systems — can't actually remove a node without breaking the graph's connectivity guarantees. So they don't. They mark the node with a tombstone and keep it in the graph.
The tombstone is checked at query time and the marked node is filtered from results. That works as long as the tombstone is checked. It also works as long as the cleanup process runs before the index degrades. Weaviate's default cleanupIntervalSeconds is 300, and there's a documented window where a deleted object remains searchable until the cleanup batch completes. For a low-churn corpus, that window is harmless. For a fast-moving knowledge base — product docs, ticket histories, news feeds — that window is where ghost retrievals live.
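If you run Weaviate over a fast-moving corpus, the interval is tunable per collection. A minimal sketch, assuming the v4 Python client; the collection name and properties are illustrative, and parameter names can shift between client versions, so check the config reference linked at the end:

```python
import weaviate
from weaviate.classes.config import Configure, DataType, Property

client = weaviate.connect_to_local()

# Default cleanupIntervalSeconds is 300; until the cleanup batch runs,
# a deleted object can still be returned by search.
client.collections.create(
    name="DocChunk",
    vectorizer_config=Configure.Vectorizer.none(),
    vector_index_config=Configure.VectorIndex.hnsw(
        cleanup_interval_seconds=60,  # assumption: trade cleanup CPU for a smaller ghost window
    ),
    properties=[
        Property(name="doc_id", data_type=DataType.TEXT),
        Property(name="chunk_id", data_type=DataType.TEXT),
        Property(name="text", data_type=DataType.TEXT),
    ],
)

client.close()
```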
When cleanup does run, it has its own pathology. Removing a node from an HNSW graph severs both incoming and outgoing edges. Other nodes lose neighbors. If the deleted node was on the only path to a region of the graph, that region becomes orphaned — unreachable during search even though its data is intact. Repair operations exist (re-search for replacement neighbors, re-link the graph), but they're expensive and they're not run on every delete. Heavy churn without periodic full rebuilds will silently shrink your effective recall.
The HNSW research literature classifies these strategies into three buckets: logical deletion (tombstones, fastest, leaves graph degradation behind), physical deletion (cuts edges, creates orphans), and full rebuild (correct, but reindexing a billion-vector store is not a Tuesday-afternoon job). Production systems use lazy deletion by default and hope you'll trigger rebuilds on a schedule.
The Cascade: One Document, Many Chunks, Many Failures
A single source document doesn't produce a single embedding. It produces N chunks, each with its own vector. Modern chunking strategies — sliding windows, semantic boundaries, overlap — can multiply that N further. A deletion in the source corresponds to a fan-out of vector deletions in the index, and every chunk you miss is a ghost.
This is where the cascade gets expensive. The naive ingestion pipeline tracks documents, but the index stores chunks. If your document → chunk mapping isn't durable — for example, if you regenerate chunk IDs on every ingest, or chunk boundaries shift slightly between runs — you can't reliably enumerate "all chunks belonging to document X" at deletion time. You delete the ones you can find. The orphans that drifted out of your tracking table stay in the index forever.
Edits make this worse than deletes. When a document is updated, the correct behavior is to delete its old chunks and insert new ones. If the old chunks aren't deterministically identifiable, you end up with both versions live in the index. A retrieval might return the obsolete chunk, the current chunk, or a mix — and the model has no signal which is canonical. Your "updated" knowledge base is now an unsorted pile of every version of every document.
The fix is unglamorous: assign a stable document ID and a stable chunk ID at ingestion time, store both as metadata on every vector, and persist the document → chunk mapping in a sidecar table you control. When a document is deleted in the source, you query the sidecar for its chunk IDs, then issue deletes against the vector store by those exact IDs. Skip the sidecar and you're guessing.
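A minimal sketch of that bookkeeping, using SQLite as the sidecar; the vector_store object and its upsert and delete_by_ids methods are stand-ins for whatever delete-by-ID API your vector database exposes:

```python
import hashlib
import sqlite3

db = sqlite3.connect("chunk_map.db")
db.execute(
    "CREATE TABLE IF NOT EXISTS doc_chunks ("
    "doc_id TEXT, chunk_id TEXT, PRIMARY KEY (doc_id, chunk_id))"
)

def chunk_id_for(doc_id: str, index: int) -> str:
    # Deterministic: the same document and chunk position always yield the same ID,
    # so a later delete or re-ingest can target exactly these vectors.
    return hashlib.sha256(f"{doc_id}:{index}".encode()).hexdigest()

def delete_document(doc_id: str, vector_store) -> None:
    # The sidecar table, not the vector store, is the authority on which chunk IDs exist.
    rows = db.execute(
        "SELECT chunk_id FROM doc_chunks WHERE doc_id = ?", (doc_id,)
    ).fetchall()
    if rows:
        vector_store.delete_by_ids([cid for (cid,) in rows])  # hypothetical delete-by-ID call
    db.execute("DELETE FROM doc_chunks WHERE doc_id = ?", (doc_id,))
    db.commit()

def ingest(doc_id: str, chunks: list[str], vector_store) -> None:
    # Edits are delete-then-insert: old chunks go first, so two versions
    # of the same document are never live in the index at once.
    delete_document(doc_id, vector_store)
    ids = [chunk_id_for(doc_id, i) for i in range(len(chunks))]
    vector_store.upsert(ids=ids, texts=chunks, doc_id=doc_id)  # hypothetical upsert call
    db.executemany(
        "INSERT OR IGNORE INTO doc_chunks VALUES (?, ?)",
        [(doc_id, cid) for cid in ids],
    )
    db.commit()
```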
CDC, Not Cron, for Source-Index Sync
Many teams sync their vector index with a nightly batch job. Read all documents, hash them, compare to the previous run, push deletes and updates. This works at small scale and breaks predictably as the corpus grows. The job takes longer than the gap between runs. Failures partway through leave the index in an inconsistent state. Documents deleted between runs serve ghost retrievals for up to 24 hours.
Change data capture is the production answer. Instead of polling the source, you subscribe to its change log — Postgres logical replication, MongoDB change streams, Kafka topics from upstream services — and react to delete events as they happen. Tools like Milvus-CDC and pgvector's logical decoding integrations exist precisely for this pattern. The latency to invalidate a deleted vector drops from "next batch run" to "seconds."
CDC also gives you something cron can't: an authoritative ordering. If a document is created, updated, and deleted within a minute, the change stream sees those events in order. The vector index ends up with no chunks for that document, which is correct. A cron job that hashes content might miss the entire lifecycle and leave stale chunks behind because the document existed between snapshots.
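Whatever the transport (logical replication slot, change stream, Kafka topic), the consumer ends up looking roughly like the sketch below. The event shape, chunk_document, and ack are placeholders for your stack; ingest and delete_document are the sidecar-aware helpers from the sketch above:

```python
def sync_from_change_feed(events, vector_store) -> None:
    # Apply change events in source order: create, then update, then delete for the
    # same document nets out to "no chunks in the index", which is correct.
    for event in events:  # e.g. {"op": "delete", "doc_id": "...", "content": "..."}
        if event["op"] in ("insert", "update"):
            chunks = chunk_document(event["content"])      # your chunking strategy
            ingest(event["doc_id"], chunks, vector_store)  # delete-then-insert
        elif event["op"] == "delete":
            delete_document(event["doc_id"], vector_store)
        ack(event)  # advance the offset / LSN only after the index change has landed
```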
The catch with CDC is that you now own a streaming pipeline with all that implies — exactly-once semantics, dead-letter queues, replay tooling, schema migrations on the change events themselves. For high-stakes corpora (compliance docs, support knowledge bases, anything where wrong answers create legal or safety exposure), the operational overhead is worth it. For a personal-project chatbot, nightly batch is fine. Choose deliberately, not by default.
Retrieval-Time Defenses
Even with a perfect ingestion pipeline, you want a backstop at query time. Two patterns are worth wiring in from day one.
Existence check against source of truth. Every retrieved chunk carries a document ID. Before passing the top-k chunks to the model, verify each ID still resolves in the source system. This is one extra lookup per chunk — usually a hash-map join against a freshness table you maintain — and it catches anything the ingestion pipeline missed. The cost is one batched round trip per query; the benefit is that no ghost reaches the model. If the freshness table is small enough to hold in memory, the lookup is effectively free.
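A sketch of that backstop, assuming a live_doc_ids set refreshed from the source system (held in memory for a small corpus, behind a fast key-value lookup otherwise):

```python
def filter_ghosts(retrieved_chunks: list[dict], live_doc_ids: set[str]) -> list[dict]:
    # Drop any chunk whose parent document no longer exists in the source of truth.
    # Each chunk is assumed to carry its doc_id in metadata.
    return [c for c in retrieved_chunks if c["metadata"]["doc_id"] in live_doc_ids]

# Usage: context = filter_ghosts(top_k_results, live_doc_ids)
```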
Last-verified timestamps in metadata. Store an ingested_at and a last_verified_at on every chunk. The retrieval layer filters chunks whose last_verified_at is older than a configurable threshold. A background job re-verifies recently-retrieved chunks (those getting traffic) more aggressively. This gives you graceful degradation: if your ingestion pipeline breaks for a weekend, retrieval starts excluding stale chunks instead of confidently serving them. The tradeoff is that you're now exposing a cliff — a chunk that was perfectly fine 24 hours ago suddenly stops being retrieved — so make the threshold per-document-type, not global.
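The timestamp filter is similar, with the threshold keyed by document type rather than set globally. The field names follow the metadata suggested above; the per-type budgets are illustrative, and timestamps are assumed to be timezone-aware ISO 8601 strings:

```python
from datetime import datetime, timedelta, timezone

# Illustrative per-type staleness budgets; tune these per corpus.
MAX_AGE = {
    "policy": timedelta(days=7),
    "product_docs": timedelta(days=30),
    "archive": timedelta(days=365),
}
DEFAULT_MAX_AGE = timedelta(days=30)

def filter_stale(retrieved_chunks: list[dict]) -> list[dict]:
    # Exclude chunks whose last_verified_at has aged past the budget for their type.
    now = datetime.now(timezone.utc)
    fresh = []
    for chunk in retrieved_chunks:
        meta = chunk["metadata"]
        verified = datetime.fromisoformat(meta["last_verified_at"])  # timezone-aware ISO 8601
        if now - verified <= MAX_AGE.get(meta.get("doc_type"), DEFAULT_MAX_AGE):
            fresh.append(chunk)
    return fresh
```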
A third pattern, decay-weighted scoring, deserves a mention but is easier to get wrong. The idea is to multiply cosine similarity by a freshness decay factor so older chunks rank lower. Done well, this surfaces fresh content without hiding old-but-still-valid content. Done poorly, it creates a permanent recency bias where documents that were updated last Tuesday always beat documents that were correct on day one and never needed to change. If you ship this, ship it with offline evals on a fixed query set, or you won't know when you've broken retrieval quality in the name of freshness.
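For reference, the decay itself is a one-liner on top of the similarity score; the half-life is the knob that decides how hard age is punished, and it is exactly the parameter to validate against a fixed offline query set:

```python
import math
from datetime import datetime, timezone

def decayed_score(cosine_sim: float, last_verified_at: str, half_life_days: float = 90.0) -> float:
    # Exponential freshness decay: a chunk's score halves every half_life_days.
    # Too short a half-life and correct-but-stable documents never surface again.
    verified = datetime.fromisoformat(last_verified_at)  # timezone-aware ISO 8601
    age_days = (datetime.now(timezone.utc) - verified).total_seconds() / 86400
    return cosine_sim * math.exp(-math.log(2) * age_days / half_life_days)
```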
What to Do Monday
If you're running a RAG system in production and you've never audited deletion behavior, here's the smallest useful first step: pick a sample of recently-deleted source documents (say, 50 from the last week), and for each one, check whether any of its chunks still appear in your top-50 retrieval results for plausible queries. If the answer is "yes, often," you have a ghost-embedding problem and you have it now.
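A throwaway script is enough for the audit. Here recently_deleted_doc_ids, sample_queries_for, and retrieve are placeholders for your deletion log, your query sampling, and your retrieval call:

```python
def audit_ghosts(recently_deleted_doc_ids: list[str], sample_queries_for, retrieve) -> None:
    # For each document deleted from the source, check whether any of its chunks
    # still appear in top-50 retrieval for queries that should have hit it.
    ghost_docs = 0
    for doc_id in recently_deleted_doc_ids:
        for query in sample_queries_for(doc_id):
            results = retrieve(query, k=50)
            if any(r["metadata"]["doc_id"] == doc_id for r in results):
                ghost_docs += 1
                print(f"ghost: deleted doc {doc_id} still retrieved for {query!r}")
                break
    print(f"{ghost_docs}/{len(recently_deleted_doc_ids)} deleted documents still retrievable")
```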
The full fix is multi-layered: stable document and chunk IDs, a sidecar mapping table, CDC-based sync, retrieval-time existence checks, and a periodic full-rebuild schedule that absorbs the graph degradation tombstones leave behind. Each layer is doable in a sprint. Together they take a quarter. The teams that ship them stop seeing "the bot confidently cites deleted policies" tickets, which is not a metric anyone tracks but is the one that matters.
Treat your vector index as a derived view of the source corpus, not as the corpus itself. Caches that don't invalidate aren't caches — they're slow, expensive sources of truth that disagree with the real one. RAG pipelines drift quietly because the failure mode is always a confident wrong answer, never a loud error. Build invalidation with the same care you build retrieval, or accept that "good answers most of the time" is the ceiling of the system you've shipped.
Sources
- https://docs.weaviate.io/weaviate/config-refs/indexing/vector-index
- https://www.pinecone.io/blog/hnsw-not-enough/
- https://arxiv.org/pdf/2407.07871
- https://openreview.net/pdf?id=lnaC19Pd30
- https://medium.com/@vasanthancomrads/incremental-indexing-strategies-for-large-rag-systems-e3e5a9e2ced7
- https://www.simplevector.io/blog/why-reindexing-embeddings-is-a-lie/
- https://glenrhodes.com/data-freshness-rot-as-the-silent-failure-mode-in-production-rag-systems-and-treating-document-shelf-life-as-a-first-class-reliability-concern/
- https://optyxstack.com/rag-reliability/metadata-filters-in-rag-why-good-documents-disappear-before-retrieval-starts
- https://zilliz.com/glossary/change-data-capture-(cdc)
- https://particula.tech/blog/update-rag-knowledge-without-rebuilding
