Provenance Debt in AI Knowledge Bases: When Your RAG System Learns From Itself
Your RAG system is probably indexing its own outputs. You just don't know it yet.
It starts innocuously: someone adds a quarterly summary document to the knowledge base. That summary was written by the same LLM that queries the knowledge base. Six months later, a developer adds AI-generated release notes. Then auto-generated support FAQs. Then a synthesized onboarding guide. None of these documents are labeled as AI-generated. To the retrieval system, they look identical to human-written primary sources. Now when your model retrieves context to answer a question, a significant portion of that context is the compressed, possibly distorted output of a prior model run — and your accuracy metrics are still green.
This is provenance debt: the accumulation of AI-generated content in retrieval corpora without source markers, creating a feedback loop where each generation of model outputs becomes raw material for the next.
Why It Doesn't Show Up on Your Dashboard
The insidious part of provenance debt is that it fails silently. Standard RAG evaluation measures things like retrieval accuracy, answer relevance, and faithfulness to retrieved context. None of these metrics detect the composition of your corpus.
Research on retrieval collapse makes this concrete: when AI-generated content makes up 50% of the documents in a corpus, over 68% of BM25 retrieval results already come from AI-generated sources. At 67% contamination, more than 80% of retrieved results are synthetic. But measured retrieval accuracy stays flat. Your dashboard shows a healthy system. The system is not healthy.
The deeper problem is that AI-generated content is often structurally superior at retrieval time. It's dense, well-formatted, keyword-rich, and written at the right abstraction level for your typical queries. It beats messy human-written source material in relevance rankings. So contamination doesn't just grow — it actively displaces primary sources in the top results.
The Compounding Error Mechanism
What makes provenance debt dangerous isn't just that your corpus contains AI-generated content. It's that errors in that content compound across retrieval cycles.
Consider this sequence: a model summarizes three source documents about a technical topic. The summary is 90% accurate — a reasonable result. That summary gets added to the knowledge base. Six months later, a new query retrieves that summary as one of its top sources. The model generates an answer that's 90% faithful to the summary. Now you have 90% × 90% = 81% fidelity to the original sources, and you've lost the ability to trace back the error. Add another generation and you're at 73%. The chain continues.
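If you want to see how quickly this decays, the arithmetic fits in a few lines. This is a minimal sketch using the hypothetical 90% per-step fidelity from the example above; real per-step fidelity varies, but the multiplicative structure is the point.

```python
# Fidelity to the original primary sources decays multiplicatively:
# each retrieval-generation cycle is only ~90% faithful to its input.
per_step_fidelity = 0.90  # hypothetical, from the example above

for generation in range(1, 6):
    fidelity = per_step_fidelity ** generation
    print(f"generation {generation}: {fidelity:.0%} fidelity to primary sources")

# generation 1: 90%, generation 2: 81%, generation 3: 73%, and falling
```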
This isn't hypothetical. Model collapse research demonstrates exactly this mechanism at training scale: when models train on predecessor-generated text, there's a consistent decrease in lexical, syntactic, and semantic diversity across successive iterations. Early-stage collapse is particularly hard to detect because aggregate performance metrics can improve even as the model loses fidelity on minority data and tail distributions.
The RAG version of this is retrieval-time model collapse: your system has narrowed its effective knowledge to a filtered, compressed reflection of what previous model runs considered important enough to say.
Where Provenance Debt Accumulates
It's worth being specific about which document types are highest risk, because the remediation strategies differ.
AI-generated summaries are the most common source. When teams add executive summaries, quarterly rollups, or topic overviews to their knowledge base, those documents carry no marker distinguishing them from primary research. They're retrieved with equal authority.
Auto-generated documentation is the second major vector. Release notes generated by an LLM, API reference summaries, and onboarding guides written by an AI writing tool all land in the same retrieval bucket as human-authored specs.
Support and FAQ content is particularly risky because it often contains model-generated hedges and qualifications that get stripped out in summarization. "This may vary based on your configuration" becomes "this works as follows" in the next generation's synthesis.
Synthesized analyses — competitive research, trend summaries, technical evaluations written with LLM assistance — carry the additional risk that the original analyst's uncertainty doesn't transfer. The model generates confident prose about uncertain conclusions.
Retrieval Policies That Prevent Feedback Loops
The fix isn't to ban AI-generated content from your knowledge base. It's to treat source class as a first-class attribute of every document, and to write retrieval policies that account for it.
Source-class tagging at ingestion time. Every document should carry a source_class metadata field: human_primary, human_summary, ai_generated, ai_assisted, third_party. This field gets stamped at ingestion — not inferred later. Once content is in the corpus without provenance metadata, recovering that information is expensive and incomplete. Tagging at write time costs almost nothing.
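A minimal sketch of that stamping, assuming a Python ingestion path. The enum values mirror the classes listed above; `index_document` is a stand-in for whatever store client you actually use.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum

class SourceClass(str, Enum):
    HUMAN_PRIMARY = "human_primary"
    HUMAN_SUMMARY = "human_summary"
    AI_GENERATED = "ai_generated"
    AI_ASSISTED = "ai_assisted"
    THIRD_PARTY = "third_party"

@dataclass
class Document:
    doc_id: str
    text: str
    source_class: SourceClass  # stamped at write time, never inferred later
    ingested_at: str = field(
        default_factory=lambda: datetime.now(timezone.utc).isoformat()
    )

def ingest(doc_id: str, text: str, source_class: SourceClass) -> Document:
    """Refuse to index anything that arrives without an explicit source class."""
    doc = Document(doc_id=doc_id, text=text, source_class=source_class)
    # index_document(doc)  # hypothetical: your store's write call goes here
    return doc
```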
Generation-time provenance annotation. When your system writes content back to a knowledge base or document store, it should include a provenance block: the generation timestamp, the model version, the source document IDs that were retrieved to produce it, and a flag marking it as synthetic. This is the RAG equivalent of git blame: you can reconstruct the chain of provenance at query time.
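One possible shape for that block, with illustrative field names; nothing here depends on a particular document store.

```python
from datetime import datetime, timezone

def make_provenance(model_version: str, retrieved_doc_ids: list[str]) -> dict:
    """Provenance block attached to anything the system writes back."""
    return {
        "synthetic": True,                  # hard flag: model-generated text
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "model_version": model_version,
        "derived_from": retrieved_doc_ids,  # walk this chain to reach primaries
    }

# Example: annotate a generated summary with the docs retrieved to produce it.
provenance = make_provenance("model-2025-01", ["spec-042", "runbook-007"])
```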
Retrieval policies that weight by source class. You can implement this at the reranking stage without changing your retrieval architecture. After the initial retrieval pass, apply a source-class penalty to AI-generated documents when human-authored sources are available on the same topic. Don't exclude synthetic content entirely — it may genuinely be the best available answer for some queries. But break the automatic preference that well-formatted synthetic documents get.
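A sketch of that rerank pass, assuming each hit carries the `source_class` tag from ingestion and a retrieval score. The 0.8 penalty multiplier is an assumption; tune it against your own evals.

```python
HUMAN_CLASSES = {"human_primary", "human_summary"}
AI_PENALTY = 0.8  # assumed value; calibrate per corpus

def rerank(hits: list[dict]) -> list[dict]:
    """Down-weight ai_generated hits only when human-authored hits exist,
    so synthetic content can still win when it's the best you have."""
    has_human = any(h["source_class"] in HUMAN_CLASSES for h in hits)
    for h in hits:
        penalized = has_human and h["source_class"] == "ai_generated"
        h["adjusted_score"] = h["score"] * (AI_PENALTY if penalized else 1.0)
    return sorted(hits, key=lambda h: h["adjusted_score"], reverse=True)
```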
Content fingerprinting for near-duplicate detection. AI-generated documents tend to be near-duplicates of each other when they're all synthesizing the same source material. A similarity check at ingestion time catches the case where you have ten AI-generated summaries of the same three papers — keeping one and discarding the rest improves retrieval diversity without losing information.
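A sketch of that ingestion-time gate using embedding cosine similarity. The embedding call itself is out of scope here, and the 0.95 threshold is an assumption to calibrate on your own corpus.

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.95  # assumed; near-duplicates score very high

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def is_near_duplicate(new_vec: np.ndarray, existing_vecs: list[np.ndarray]) -> bool:
    """True if the incoming document is effectively a restatement of one
    already indexed; callers keep one copy and discard the rest."""
    return any(cosine(new_vec, v) >= SIMILARITY_THRESHOLD for v in existing_vecs)
```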
The Verification Bottleneck
There's a real tension here: the reason teams generate AI content and add it to knowledge bases is that human-authored primary sources don't scale. You have more queries than you have high-quality documentation. Synthetic content fills the gap.
The model collapse literature points toward an approach that resolves this tension: accumulating synthetic data alongside original real data, rather than allowing synthetic content to gradually displace it. Teams that have kept their human-authored primary sources as a stable backbone — and added AI-generated content as supplementary rather than authoritative — avoid the worst feedback loops.
In practice, this means a two-tier corpus structure. Tier 1 is your authoritative layer: human-authored documents, official specs, primary sources. These are indexed separately, tagged with high authority, and never purged to make room for synthetic alternatives. Tier 2 is your enrichment layer: summaries, FAQs, generated analyses. These are indexed with explicit source-class tags and lower retrieval weight when Tier 1 results are available.
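Expressed as retrieval configuration, the two tiers might look like this. Names and weights are illustrative, not prescriptive.

```python
# Illustrative two-tier corpus configuration mirroring the structure above.
CORPUS_TIERS = {
    "tier_1_authoritative": {
        "source_classes": ["human_primary", "third_party"],
        "retrieval_weight": 1.0,
        "purgeable": False,  # never displaced to make room for synthetic docs
    },
    "tier_2_enrichment": {
        "source_classes": ["human_summary", "ai_assisted", "ai_generated"],
        "retrieval_weight": 0.7,  # assumed; down-weighted when tier 1 answers exist
        "purgeable": True,
    },
}
```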
This structure lets you get the density and coverage benefits of AI-generated content without creating the feedback loop that provenance debt produces.
Detecting Existing Contamination
If you've never audited your knowledge base for provenance debt, the place to start is a corpus composition audit rather than an accuracy evaluation.
Sample 500 documents from your retrieval corpus. For each document, try to answer: what is the human-authored primary source this was derived from? If you can't answer that for more than 30% of your documents, you have significant provenance debt.
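If your documents carry any provenance metadata at all, the first pass can be mechanized. This sketch assumes a hypothetical `derived_from_primary_source` field; anything it can't resolve falls to manual review, and the 30% threshold is the one above.

```python
import random

def audit_provenance(documents: list[dict], sample_size: int = 500) -> float:
    """Fraction of a sampled corpus with no identifiable primary source."""
    sample = random.sample(documents, min(sample_size, len(documents)))
    untraceable = sum(
        1 for d in sample if not d.get("derived_from_primary_source")
    )
    ratio = untraceable / len(sample)
    if ratio > 0.30:
        print(f"{ratio:.0%} of sampled docs are untraceable: significant provenance debt")
    return ratio
```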
The next signal is retrieval diversity. For your 50 most common query patterns, look at the source distribution across the top 10 retrieved chunks. If more than 60% of retrieved chunks come from documents added in the last 12 months — and your primary source material is older — that's evidence that synthetic content has displaced primary sources in your retrieval rankings.
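A sketch of that check, with `search` standing in for your retrieval call and the thresholds taken from the heuristic above. It assumes each chunk carries the ISO `ingested_at` timestamp stamped at ingestion.

```python
from datetime import datetime, timedelta, timezone

def recency_skew(queries: list[str], search) -> float:
    """Fraction of top-10 retrieved chunks ingested in the last 12 months."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=365)
    recent = total = 0
    for q in queries:  # e.g. your 50 most common query patterns
        for chunk in search(q, top_k=10):
            total += 1
            if datetime.fromisoformat(chunk["ingested_at"]) > cutoff:
                recent += 1
    return recent / total if total else 0.0  # > 0.60 with older primaries = displacement
```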
A harder-to-detect signal: linguistic homogeneity in retrieved context. AI-generated content clusters in embedding space because it uses consistent vocabulary and sentence structures. If your retrieved chunks show unusually high semantic similarity to each other, you may be retrieving variations of the same synthetic voice rather than diverse primary perspectives.
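The simplest way to quantify that is mean pairwise cosine similarity across the retrieved chunks' embeddings. What counts as "unusually high" is corpus-specific, so calibrate a baseline on queries you know retrieve diverse sources.

```python
import numpy as np

def mean_pairwise_similarity(vecs: list[np.ndarray]) -> float:
    """Mean cosine similarity over all pairs of retrieved-chunk embeddings."""
    normed = [v / np.linalg.norm(v) for v in vecs]
    sims = [
        float(np.dot(normed[i], normed[j]))
        for i in range(len(normed))
        for j in range(i + 1, len(normed))
    ]
    return sum(sims) / len(sims) if sims else 0.0
```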
What Gets Harder to Fix Over Time
Provenance debt has one property that distinguishes it from most technical debt: it gets harder to remediate as your corpus grows, because the synthetic documents have been the basis for more synthetic documents.
If you discover that a widely referenced summary document in your knowledge base was AI-generated and contains errors, you can't just delete it. You need to audit every document that was generated with that summary as context, and every document generated from those documents. The dependency graph is rarely tracked, which means the realistic response is to rebuild the affected section of your corpus from primary sources.
This is the argument for treating provenance tracking as infrastructure rather than metadata cleanup. The cost of stamping every document with its source class and generation provenance at ingestion time is trivial. The cost of reconstructing that information after the fact — or worse, discovering you cannot — is not.
Build the provenance layer first. Let the AI-generated content be clearly labeled as such from the moment it enters the system. Then your retrieval policies can use that label to make informed decisions, rather than treating every document as equivalent to a primary source just because it ended up in the same index.
- https://www.nature.com/articles/s41586-024-07566-y
- https://arxiv.org/abs/2404.01413
- https://arxiv.org/html/2602.16136v1
- https://arxiv.org/html/2401.05856v1
- https://www.anthropic.com/news/contextual-retrieval
- https://www.regal.ai/blog/rag-hygiene
- https://atlan.com/know/llm-knowledge-base-data-quality/
- https://www.lakera.ai/blog/training-data-poisoning
- https://arxiv.org/html/2506.00054v1
- https://www.kapa.ai/blog/rag-gone-wrong-the-7-most-common-mistakes-and-how-to-avoid-them
