Provenance Debt in AI Knowledge Bases: When Your RAG System Learns From Itself
Your RAG system is probably indexing its own outputs. You just don't know it yet.
It starts innocuously: someone adds a quarterly summary document to the knowledge base. That summary was written by the same LLM that queries the knowledge base. Six months later, a developer adds AI-generated release notes. Then auto-generated support FAQs. Then a synthesized onboarding guide. None of these documents are labeled as AI-generated. To the retrieval system, they look identical to human-written primary sources. Now when your model retrieves context to answer a question, a significant portion of that context is the compressed, possibly-distorted output of a prior model run — and your accuracy metrics are still green.
This is provenance debt: the accumulation of AI-generated content in retrieval corpora without source markers, creating a feedback loop where each generation of model outputs becomes raw material for the next.
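The feedback loop is easy to simulate. The sketch below is a toy model, not a real RAG stack: `retrieve` stands in for vector search and `model_answer` for LLM generation, and the `provenance` field and "quarterly summary" cadence are illustrative assumptions. The point it demonstrates is structural: if each generation's output is ingested without a label, the AI-derived share of the corpus only grows.

```python
import random

random.seed(0)

# Start from a purely human-written corpus.
corpus = [{"text": f"human doc {i}", "provenance": "human"} for i in range(20)]

def retrieve(corpus, k=5):
    # Stand-in for vector search: sample k documents as "context".
    return random.sample(corpus, k)

def model_answer(context):
    # Stand-in for LLM generation: a lossy compression of the retrieved context.
    return "summary of: " + ", ".join(d["text"][:12] for d in context)

# Each quarter, an AI-generated summary is added back with no source marker,
# so future retrievals treat it exactly like a human-written primary source.
for quarter in range(12):
    ctx = retrieve(corpus)
    corpus.append({"text": model_answer(ctx), "provenance": "unlabeled-ai"})

ai_docs = sum(1 for d in corpus if d["provenance"] != "human")
print(f"{ai_docs}/{len(corpus)} corpus docs are AI-derived")
# → 12/32 corpus docs are AI-derived
```

In the toy model we can count AI-derived documents only because the simulation tracks `provenance` for us; the debt in a real system is precisely that this field was never recorded, so the equivalent audit is impossible after the fact.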
