Poisoned at the Source: RAG Corpus Decay and Data Governance for Vector Stores
Your RAG system was working fine at launch. Three months later it's confidently wrong about a third of user queries — and your traces show nothing broken. The retriever is fetching documents. The model is generating responses. The pipeline looks healthy. The problem is invisible: every vector in your store still has a similarity score, but half of them are pointing to facts that no longer exist.
This is corpus decay. It doesn't throw errors. It doesn't trigger alerts. It accumulates quietly in the background, and by the time you notice it through user complaints or quality degradation, your vector store has become a liability.
Vector databases are infrastructure. Engineers who build production RAG systems eventually learn this the hard way. You cannot treat your retrieval corpus as a static artifact that you load once and forget. It ages. It contradicts itself. It gets poisoned. And unlike your application database — where stale rows cause obvious, traceable failures — stale embeddings produce confident, plausible-sounding hallucinations that pass all your existing metrics.
Why Staleness Is Structurally Invisible
The core problem with RAG corpus decay is that vector similarity measures topical relevance, not temporal relevance. A document from 18 months ago that discusses the same concept as a recent document will score equally well — or better, because it may be more exhaustively written.
When you index a document, you're converting it into a position in high-dimensional space. That position encodes meaning, not age. There is no gravitational decay pulling outdated documents away from query centroids. A policy document superseded six months ago occupies exactly the same region as its replacement. When a user asks about that policy, both documents surface. The model synthesizes them. The answer is somewhere between current and outdated — and without provenance metadata in your prompt, neither the model nor the user can tell which facts are live.
The second invisibility factor is confidence. LLMs are trained to produce fluent, coherent responses. They don't express uncertainty proportional to how old their retrieved context is. A retrieval hit from a three-year-old product spec generates the same tone as a hit from yesterday's update. Your monitoring dashboards measure retrieval relevance scores and answer fluency — neither of those signals degrades when the corpus ages.
This creates what practitioners call a "silent failure loop": the system continues functioning, metrics stay stable, and quality steadily erodes in ways that only become visible through user behavior (rising edit rates, more retries, eventual abandonment) or high-stakes failures.
The Three Categories of Corpus Contamination
Thinking about corpus health as a single problem leads to single-threaded solutions. In practice, RAG corpora degrade through three distinct mechanisms that require different mitigations.
Temporal decay is the most common. Content that was accurate when ingested becomes wrong due to external change: product features get updated, regulations change, people change roles, APIs get deprecated. The vector store doesn't know any of this happened. Documents stay indexed indefinitely unless you explicitly remove them. Law firms experience this acutely — case decisions released in the morning affect advice that afternoon, but a static corpus forces lawyers to work with whatever snapshot was last indexed. This isn't an edge case; it's the default behavior of every RAG system that doesn't explicitly address freshness.
Internal conflict is more subtle. As corpora grow, you accumulate multiple documents that make contradictory claims about the same facts. Version 1.2 of your product docs says one thing; version 1.4 says the opposite; both are indexed. The model retrieves both, detects contradiction, and either hedges ("this may have changed") or arbitrarily picks one. Neither outcome is good. Conflict accumulates especially fast when you ingest from multiple sources with inconsistent update cycles — internal wikis, external documentation, exported CRM data, Slack exports — none of which are synchronized with each other.
Adversarial injection is the least discussed but increasingly real threat. Because retrievers rank by vector similarity, an attacker who can insert content into your corpus can craft documents whose embeddings are optimized to intercept specific query types. These "vector magnets" — text that reads as legitimate to humans but is positioned in embedding space to capture security- or finance-related queries — redirect users toward attacker-controlled content. The attack doesn't require access to your model weights or prompts; it only requires the ability to get a document ingested. For corpora that accept user-contributed content, third-party integrations, or web crawls, this surface area is significant.
TTL Design: Freshness as a First-Class Schema Property
The foundational fix for temporal decay is assigning explicit expiration metadata to every document chunk at ingest time. This means treating your vector store like a cache, not a database.
Every chunk in your index should carry at minimum:
- created_at: ingestion timestamp
- expires_at: expected staleness threshold based on content type
- source_version: a hash or version identifier from the upstream document
- last_verified_at: when the content was last confirmed still current
The expiration policy should be calibrated to content volatility. Product documentation might carry a 90-day TTL before requiring re-verification. News articles might be set to 7 days. Reference data (glossary terms, stable concepts) might be set to a year. The key is that no document should have an infinite TTL — every chunk should have an explicit answer to "when does this expire?"
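This stamping can be done at ingest time with a small policy table. A minimal sketch, assuming the content types and TTL values from the paragraph above (the `ChunkMetadata` shape and `TTL_BY_TYPE` mapping are illustrative, not a standard schema):

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Illustrative TTLs per content type; tune these to your corpus's volatility.
TTL_BY_TYPE = {
    "product_docs": timedelta(days=90),
    "news": timedelta(days=7),
    "reference": timedelta(days=365),
}

@dataclass
class ChunkMetadata:
    created_at: datetime
    expires_at: datetime
    source_version: str
    last_verified_at: datetime

def make_metadata(content_type: str, source_version: str) -> ChunkMetadata:
    """Stamp a chunk at ingest; unknown types get the shortest TTL as a safe default."""
    now = datetime.now(timezone.utc)
    ttl = TTL_BY_TYPE.get(content_type, min(TTL_BY_TYPE.values()))
    return ChunkMetadata(created_at=now, expires_at=now + ttl,
                         source_version=source_version, last_verified_at=now)
```

Defaulting unknown content types to the shortest TTL errs toward re-verification rather than silent staleness, which matches the "no infinite TTL" rule.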
During retrieval, your pipeline should actively filter on freshness in addition to semantic similarity. A blended ranking approach that weights both dimensions prevents high-scoring but outdated documents from dominating results:
final_score = (semantic_score × 0.7) + (freshness_boost × 0.3)
The freshness boost is a decaying function of document age relative to its TTL. Documents past their TTL threshold should be either excluded from retrieval entirely or flagged with explicit uncertainty in the prompt context.
Expired documents shouldn't silently disappear — you want an audit trail. A separate archive index or tombstone record lets you track what was removed and when, which is essential for debugging unexpected quality regressions.
Ingest-Time Conflict Detection and Deduplication
Waiting until retrieval to handle duplicates and conflicts is too late. By the time you're serving queries, contradictory content is already in your index and will surface with high probability on relevant queries.
Conflict detection has to happen at ingest. A robust ingest pipeline applies checks at multiple levels:
At the document level, compute a content hash (SHA-256 of the raw text) before any processing. If an identical hash already exists in your index, skip reinsertion. This eliminates exact duplicates — the most common source of retrieval noise.
At the chunk level, run an approximate nearest-neighbor search on each new embedding before inserting it. If a new chunk has cosine similarity above a threshold (typically 0.95+) with an existing chunk from a different source document, surface it for review rather than silently inserting. This catches semantic duplicates: different documents making identical claims, which multiplies their effective retrieval weight and biases the model toward that claim regardless of whether it's accurate.
For conflict detection — true contradictions, not just duplicates — you need entity resolution. When a new document makes a claim about a named entity (a product version, a policy, a specific configuration value), your pipeline should check whether existing indexed documents make conflicting claims about the same entity. This is harder and requires some structured metadata extraction, but it catches the most dangerous category of corpus contamination.
Stream processing frameworks like Kafka Streams or Apache Flink work well here because they can maintain state across the ingest stream, enabling cross-document conflict detection without loading your entire corpus into memory on each ingestion job.
Access Control at the Retrieval Layer
In multi-tenant deployments, or any system where different users should see different subsets of your corpus, access control must be enforced at retrieval time — not just at the application layer.
The failure mode is common: teams build RBAC at the API layer ("user X can call this endpoint") without realizing that the retrieval pipeline itself doesn't enforce those boundaries. A query that hits the vector store will pull documents from across the entire index unless you explicitly filter by access metadata. If your corpus contains documents from multiple clients, business units, or sensitivity levels, a misconfigured filter can leak sensitive content into any user's retrieved context.
The correct pattern embeds access metadata into document chunks at ingest time — tenant ID, security classification, allowed roles — and applies that metadata as a mandatory pre-filter on every retrieval call. The vector similarity search only considers documents within the user's access scope.
The technical tradeoff is between pre-filtering and post-filtering. Pre-filtering narrows the search space before ANN lookup, which is faster for corpora with high access fragmentation (many users, each seeing a small slice). Post-filtering runs the full ANN search and then drops unauthorized results, which performs better when most users can see most documents. For mixed corpora, a hybrid approach — namespace-level pre-filtering combined with metadata post-filtering — typically provides the best balance.
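The mandatory pre-filter pattern looks roughly like this. The `vector_store.query` call and the `$in` filter operator are hypothetical stand-ins (the operator syntax here resembles Pinecone's; adapt to your store's filter language):

```python
def retrieve(vector_store, query_embedding, user, top_k: int = 10):
    """Every retrieval call carries a non-optional access filter built
    from the caller's identity -- never from request parameters."""
    access_filter = {
        "tenant_id": user.tenant_id,           # hard tenant boundary
        "allowed_roles": {"$in": user.roles},  # role-based slice within the tenant
    }
    return vector_store.query(
        vector=query_embedding,
        top_k=top_k,
        filter=access_filter,  # enforced inside the store, not in app code after the fact
    )
```

The key design point is that the filter is constructed server-side from the authenticated user object, so no code path can reach the vector store without an access scope attached.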
Recent research on dynamic RBAC partitioning (HONEYBEE) shows that intelligent partitioning strategies can achieve up to 13x lower query latency versus naive row-level security, which matters at the retrieval scales typical of enterprise deployments.
Corpus Monitoring: What to Actually Measure
Traditional RAG evaluation frameworks focus on retrieval relevance, faithfulness, and answer quality. These metrics all degrade when your corpus rots — but they don't tell you why quality degraded, and they don't give you leading indicators before failure reaches users.
Corpus health monitoring requires its own metric layer:
Staleness distribution: A histogram of document ages in your active retrieval index. You should know the median, 95th percentile, and maximum age of indexed content. This tells you whether your TTL policies are actually executing and whether documents are being refreshed on schedule.
Conflict density: The rate at which newly ingested documents semantically conflict with existing indexed content. Rising conflict density signals either corpus contamination or a source that's publishing corrections faster than your pipeline is processing them.
Retrieval provenance coverage: What percentage of retrieved chunks carry complete source_version and last_verified_at metadata. Any chunk without this metadata is effectively invisible to freshness-based filtering.
Embedding drift: When you upgrade your embedding model or retrain on new data, the geometric meaning of positions in your vector space changes. Documents embedded with the old model no longer map correctly relative to queries encoded with the new one. Track cosine distance between embeddings of the same document across model versions — stable systems show less than 0.005 drift; values above 0.05 indicate that re-indexing is necessary.
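The first and last of these metrics are cheap to compute from metadata you already have. A sketch of both, assuming chunk ages are available in days and that you keep embeddings of a sample of reference documents across model versions:

```python
import numpy as np

def staleness_report(ages_days) -> dict:
    """Median, 95th percentile, and max age (in days) of chunks in the active index."""
    a = np.asarray(ages_days, dtype=float)
    return {"median": float(np.median(a)),
            "p95": float(np.percentile(a, 95)),
            "max": float(a.max())}

def embedding_drift(old_vec, new_vec) -> float:
    """Cosine distance between embeddings of the same document under two model versions."""
    old = np.asarray(old_vec, dtype=float)
    new = np.asarray(new_vec, dtype=float)
    cos = float(old @ new / (np.linalg.norm(old) * np.linalg.norm(new)))
    return 1.0 - cos
```

Run the drift check over a fixed sample of documents whenever the embedding model changes; averaging `embedding_drift` across the sample gives the corpus-level number to compare against the thresholds above.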
Platforms like Arize AI, Langfuse, and Evidently AI provide RAG-specific observability that can surface these metrics alongside traditional retrieval quality signals.
Treating RAG as a Data Engineering Problem
The framing that RAG is primarily an AI problem — a question of which model to use, which retrieval algorithm to tune, which prompt template to write — leads to the decay patterns described above. The retrieval layer is fundamentally a data system. It has the same lifecycle challenges as any production database: schema evolution, data quality, access control, freshness guarantees, conflict resolution.
The teams that build durable RAG systems have converged on a set of disciplines that would look familiar to any data engineer:
A data catalog for the corpus: lineage tracking that records source, ingestion timestamp, version, and transformation history for every document. When quality regresses, you need to trace which documents contributed to the bad answers.
A data contract between ingestion and retrieval: explicit schemas for what metadata every chunk must carry before it can be indexed, with validation that rejects ingestion of chunks missing required fields.
A refresh pipeline that treats corpus updates like migrations — incremental, versioned, with rollback capability. Not a one-off script that runs when someone remembers to trigger it.
A quality gate at ingest that runs conflict detection, deduplication, and freshness validation before any document enters the index. Garbage in is still garbage out regardless of how capable the model is.
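The data contract piece of that gate can be as simple as a required-field check that returns violations rather than raising, so the pipeline can log why a chunk was rejected. A minimal sketch (the field set mirrors the TTL schema earlier; `tenant_id` is assumed for multi-tenant stores):

```python
REQUIRED_FIELDS = {"created_at", "expires_at", "source_version",
                   "last_verified_at", "tenant_id"}

def validate_chunk(metadata: dict) -> list:
    """Return the list of contract violations; empty means the chunk may be indexed."""
    return sorted(f"missing field: {f}" for f in REQUIRED_FIELDS - metadata.keys())

def passes_quality_gate(metadata: dict) -> bool:
    """Reject any chunk missing required metadata before it reaches the index."""
    return not validate_chunk(metadata)
```

In practice this check sits alongside the dedup and conflict stages, and rejected chunks go to a dead-letter queue for repair rather than being dropped silently.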
The large share of production RAG failures that trace to data quality issues isn't a model problem or an infrastructure problem. It's a consequence of treating the corpus as a secondary concern. Fix the governance layer and the model has something trustworthy to work with.
The practical takeaway: audit your vector store today. Pull the age distribution of your indexed documents. Find the oldest chunk in your active retrieval index. If you don't have the metadata to answer that question, that's the first thing to fix — not your retriever, not your prompt template, not your model.
- https://ragaboutit.com/the-silent-failure-loop-why-real-time-knowledge-graphs-are-your-only-defense-against-production-rag-degradation/
- https://glenrhodes.com/data-freshness-rot-as-the-silent-failure-mode-in-production-rag-systems-and-treating-document-shelf-life-as-a-first-class-reliability-concern-4/
- https://enterprise-knowledge.com/data-governance-for-retrieval-augmented-generation-rag/
- https://medium.com/@connect.hashblock/vector-db-hygiene-10-habits-for-ttl-drift-safe-rebuilds-fc1dc9876992
- https://zilliz.com/blog/data-deduplication-at-trillion-scale-solve-the-biggest-bottleneck-of-llm-training
- https://deconvoluteai.com/blog/attack-surfaces-rag
- https://labs.snyk.io/resources/ragpoison-prompt-injection/
- https://arxiv.org/html/2603.25164v1
- https://www.pinecone.io/learn/rag-access-control/
- https://arxiv.org/abs/2505.01538
- https://ragaboutit.com/the-knowledge-decay-problem-how-to-build-rag-systems-that-stay-fresh-at-scale/
- https://shelf.io/blog/continuously-monitor-your-rag-system-to-neutralize-data-decay/
- https://dev.to/dowhatmatters/embedding-drift-the-quiet-killer-of-retrieval-quality-in-rag-systems-4l5m
- https://www.evidentlyai.com/blog/embedding-drift-detection
- https://dev.to/aws-builders/rag-is-a-data-engineering-problem-disguised-as-ai-39b2
