The Indexing Policy Committee Nobody Convened: RAG Corpus Governance Beyond the One-Time Migration
Two years ago, a team pointed their retrieval index at the wiki, the Zendesk export, and a snapshot of the public docs. Last week, that same index returned a deprecated runbook that told an SRE to restart a service that no longer exists. The runbook had been deprecated for eighteen months. Nobody owned its retirement, so nobody retired it. The agent confidently cited it. The model wasn't wrong; the corpus was.
This is the failure mode that doesn't show up in retrieval evals: the corpus is treated as a one-time engineering decision when it's actually an ongoing governance problem. The team that scoped the initial ingestion is long gone. The legal review that should have flagged the customer-confidential PDFs never happened, because nobody told legal there was a pipeline. The "freshness strategy" is a Slack message from someone who left in Q3. The retrieval index has become a shared inbox for every document anyone ever scraped, and the bar for inclusion has drifted to "whatever was easy to ingest."
The architectural realization that doesn't land in most postmortems: the retrieval index is a product surface with editorial standards, not a search backend. Treat it like the latter and you will ship a system that confabulates from contradictory sources, leaks across tenants, surfaces content legal would have prohibited if asked, and quietly degrades for years before anyone notices.
The fix isn't a better embedding model. It's an indexing-policy committee — and the four axes they need to be arguing about every quarter.
The Legal Axis: Nobody Told Legal About Your RAG Pipeline
The first time the corpus comes up in a regulatory conversation should not be after the incident. But it usually is. RAG sits across legal, information governance, and IT, and gets built inside AI teams that operate outside any of those control frameworks. The corpus inherits the access posture of whoever scraped it, not of the user querying it.
The concrete failure modes are well-documented. Cross-tenant leakage is the loudest: in pen-test exercises against multi-tenant RAG deployments, a query from User A retrieves documents that belong to User B, because vector similarity was the only filter applied. One published red-team study reported cross-tenant retrieval succeeding on every query attempted — twenty out of twenty — with no technical sophistication required. The vector store was never designed with per-document access control as a first-class feature, and the orchestration layer trusted the ranker.
Less loud but more common: regulated content the agent must not surface ends up indexed because nobody classified the source on ingestion. PII in a CRM export, customer-confidential decks pulled from a shared drive, draft policies that aren't yet authoritative — all of it gets chunked and embedded with no provenance metadata strong enough to filter on. By the time someone notices the agent quoting a confidential roadmap to the wrong audience, the chunks are already in the index and you don't have a clean lineage to retire them.
The discipline that closes this gap is per-source-class policy with named owners. Every source class — public docs, internal wiki, support tickets, Slack export, CRM, code comments — gets a classification on ingestion: who is allowed to retrieve it, what tenant scope it belongs to, what regulatory category it falls under. The classification rides with every chunk as metadata, and retrieval-time filters are enforced at the database layer, not the application layer. If your only enforcement point is "the prompt only includes results the user is supposed to see," you have already lost; the prompt is downstream of the leak.
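A minimal sketch of that enforcement point, assuming Postgres with the pgvector extension; the table and column names (chunks, tenant_id, classification, expiry) are illustrative, not a prescribed schema:

```python
import psycopg  # psycopg 3

def retrieve(conn, query_embedding: list[float], tenant_id: str,
             user_clearances: list[str], k: int = 8):
    """Rank by similarity only among chunks this user is allowed to see."""
    # pgvector accepts the textual form '[0.1,0.2,...]'; cast it server-side.
    vec = "[" + ",".join(str(x) for x in query_embedding) + "]"
    sql = """
        SELECT chunk_id, source_doc_id, text
        FROM chunks
        WHERE tenant_id = %s                      -- tenant scope
          AND classification = ANY(%s)            -- regulatory category
          AND (expiry IS NULL OR expiry > now())  -- retired chunks never rank
        ORDER BY embedding <=> %s::vector         -- cosine distance
        LIMIT %s
    """
    with conn.cursor() as cur:
        cur.execute(sql, (tenant_id, user_clearances, vec, k))
        return cur.fetchall()
```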
The Freshness Axis: Each Source Class Has Its Own Decay Rate
A common mistake is to talk about "RAG freshness" as if it were a single number. It isn't. Each content class has a different decay rate, and treating them uniformly produces predictable failures.
Yesterday's product spec is gold. Last quarter's incident retro is poison if it now contradicts current behavior. A two-year-old architectural overview might still be the best document on the system; a two-week-old support ticket might already be wrong because the bug was fixed. The right re-index cadence for "current product surface" is daily; for "company history" it might be never. There is no global freshness threshold that satisfies both.
The pattern that works: define a refresh cadence per source class, monitor maximum staleness per class, and alert when it exceeds the class-specific threshold. For high-velocity sources — anything backed by an authoring tool that ships changes hourly — you need a streaming or near-real-time pipeline, not a nightly batch. For low-velocity sources, batch is fine but the deletion path matters more than the refresh cadence.
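What the monitoring half might look like, as a sketch; the class names and budgets below are invented for illustration, and the newest-ingest timestamps would come from the ingestion metadata on the chunks themselves:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical per-class staleness budgets; None means "never goes stale".
STALENESS_BUDGETS: dict[str, timedelta | None] = {
    "product_docs":    timedelta(hours=24),  # current product surface
    "support_tickets": timedelta(days=7),
    "company_history": None,
}

def staleness_alerts(newest_ingest_by_class: dict[str, datetime]) -> list[str]:
    """Return one alert per class whose newest ingest exceeds its budget."""
    now = datetime.now(timezone.utc)
    alerts = []
    for cls, budget in STALENESS_BUDGETS.items():
        if budget is None:
            continue
        newest = newest_ingest_by_class.get(cls)
        if newest is None or now - newest > budget:
            alerts.append(f"{cls}: max staleness exceeded (newest={newest})")
    return alerts
```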
Deletion is where most teams get stuck. Many vector indexes don't support clean incremental updates, so teams end up running staging-and-swap patterns: insert new vectors with a temporary flag, atomically promote them, then sweep the old ones. That works, but only if the system actually has a notion of "old." If chunks were ingested without a source_doc_id and a source_version, you cannot identify which vectors belong to a deprecated document. You will end up rebuilding the index, which means your "deletion" is on the rebuild cadence, which is usually quarterly, which is why deprecated runbooks live for two years.
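In code, the staging-and-swap flow looks roughly like this; store is a stand-in for whatever vector database you run, with filter-based insert, update, and delete operations assumed rather than drawn from any particular product's API:

```python
def reindex_document(store, doc_id: str, new_version: str,
                     new_chunks: list[dict]):
    # 1. Stage: new vectors enter flagged, invisible to retrieval
    #    (retrieval filters on status == "live").
    for chunk in new_chunks:
        store.insert(chunk, metadata={"source_doc_id": doc_id,
                                      "source_version": new_version,
                                      "status": "staged"})
    # 2. Promote atomically: old live chunks retire, staged chunks go live.
    with store.transaction():
        store.update(where={"source_doc_id": doc_id, "status": "live"},
                     set_metadata={"status": "retired"})
        store.update(where={"source_doc_id": doc_id, "status": "staged"},
                     set_metadata={"status": "live"})
    # 3. Sweep: retired vectors are identifiable by source_doc_id,
    #    so deletion is a filter, not an index rebuild.
    store.delete(where={"source_doc_id": doc_id, "status": "retired"})
```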
The minimum metadata for a survivable corpus: source document ID, source version, ingestion timestamp, source class, owner, expiry. The expiry is the unglamorous one and the one most teams skip. It's the only field that turns deletion from a manual archeology project into a scheduled job.
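As a sketch, those six fields as chunk metadata; the names are illustrative:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ChunkMetadata:
    source_doc_id: str        # which document this chunk came from
    source_version: str       # which revision of that document
    ingested_at: datetime     # when it entered the index
    source_class: str         # e.g. "public_docs", "slack_export"
    owner: str                # team accountable for this source class
    expiry: datetime | None   # None only for genuinely evergreen classes;
                              # populated, it makes retirement a nightly sweep
```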
The Authorship-Trust Axis: A Slack Thread Is Not a Runbook
When the corpus contains both a published runbook and the Slack thread that argued about whether the runbook was right, treating them as equally trustworthy sources silently teaches the agent to confabulate. The retrieval ranker doesn't know which one represents the team's considered view. If the Slack thread is more recent, it might rank higher. If the runbook contradicts it, the agent now has two answers and will pick whichever embedding is closer to the query.
Source quality is not a query-time relevance problem. It is an ingestion-time editorial decision. Authoritative sources need to be marked authoritative. Provisional sources — Slack, drafts, hallway-conversation summaries — need to be marked provisional. The retrieval layer should be capable of preferring authoritative sources when both are available and falling back to provisional only when nothing authoritative exists. Without that hierarchy, corpus disagreements become hallucinations from the user's perspective, even though the model technically grounded its answer in retrieved text.
The practical mechanism is an authority tier on every chunk: canonical, reference, provisional, historical. Canonical is what an editor signed off on. Reference is community-maintained but not editorially controlled. Provisional is anything that hasn't been reviewed. Historical is kept for context but should not be returned unless the query is explicitly about prior state. The ranker uses the tier as a hard filter or a strong prior, depending on the use case.
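A sketch of the strong-prior variant; the weights are invented, and a hard-filter deployment would instead drop anything below a minimum tier:

```python
# Illustrative weights: canonical dominates, historical is excluded
# unless the query explicitly asks about prior state.
TIER_WEIGHTS = {"canonical": 1.0, "reference": 0.75,
                "provisional": 0.4, "historical": 0.0}

def rerank(candidates, query_about_prior_state: bool = False):
    """candidates: iterable of (chunk, similarity, tier) triples."""
    scored = []
    for chunk, similarity, tier in candidates:
        if tier == "historical" and not query_about_prior_state:
            continue
        scored.append((similarity * TIER_WEIGHTS[tier], chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored]
```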
This sounds heavy, and the response is usually "we don't have the editorial bandwidth for this." That's fine — but then you also don't have the bandwidth to debug why the agent is contradicting itself, and you absolutely don't have the bandwidth to explain it to a customer when it ships a wrong answer. The cost of editorial discipline is paid up front; the cost of skipping it is paid forever, in compounding remediation.
The Ownership Axis: The Corpus Is Unowned, So the Bar Drifts
The deepest failure is structural. No single team owns "the corpus." The AI team owns the retrieval pipeline. The docs team owns published documentation. Support owns the ticket archive. Legal owns compliance. Security owns access policy. Each one has a partial view, and none of them owns the union of decisions that determine what the agent actually sees.
What happens in the absence of ownership is predictable: the bar for inclusion drifts toward "whatever was easy to scrape." A ten-line connector to Notion gets merged because someone needed it for a demo. A Google Drive ingestion runs because a stakeholder asked. Six months later, nobody remembers who approved either, and the inclusion criterion has become "did anyone object loudly enough."
The fix is not subtle. Name an owner. Convene a real committee — product, docs, legal, security, the AI team — and treat indexing decisions like IAM grants. Every source class has a stated owner, an inclusion bar, a refresh cadence, a retention policy, and a deletion path. The committee meets quarterly to review what's in the index, the same way a security team reviews what's in production access. Sources without owners get retired by default.
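One way to make those grants machine-checkable is a registry the ingestion pipeline consults; every name and value below is illustrative:

```python
SOURCE_REGISTRY = {
    "public_docs": {
        "owner": "docs-team",
        "inclusion_bar": "published, versioned, editorially reviewed",
        "refresh_cadence": "daily",
        "retention": "current major version only",
        "deletion_path": "staged swap on version bump",
    },
    "slack_export": {
        "owner": None,  # unowned: retired by default at next review
        "inclusion_bar": "none stated",
        "refresh_cadence": "ad hoc",
        "retention": "unbounded",
        "deletion_path": "none",
    },
}

def may_ingest(source_class: str) -> bool:
    """Ingestion refuses any source class without a named owner."""
    entry = SOURCE_REGISTRY.get(source_class)
    return entry is not None and entry["owner"] is not None
```

The useful property is the default: a connector merged for a demo fails may_ingest() until someone puts their name on it.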
This is administratively boring and politically uncomfortable, which is why nobody volunteers to do it. But the alternative — the indexing-policy committee that nobody convened — is the one that pays out in failures. Treat the retrieval index like an editorial product, and the editorial cost shows up on someone's roadmap. Treat it like a search backend, and the cost shows up as a postmortem.
What Lands When Governance Lands
A corpus that is actually governed has visible properties. Provenance metadata on every chunk, so a regression can be traced to the document that taught the agent wrong. A per-source-class policy with a named owner, an inclusion bar, a refresh cadence, and a deletion path. Retrieval-time filters enforced at the database layer, with cross-tenant retrieval tested as a routine red-team exercise rather than a one-off audit. An authority tier that lets the ranker prefer canonical sources over provisional ones. A quarterly review where the committee reads the inclusion list and asks, for each source: do we still believe this belongs in the index?
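Making the cross-tenant check routine can be as mundane as a CI test. This sketch assumes the database-filtered retrieve() from the legal-axis section and a hypothetical convention of tenant-prefixed document IDs:

```python
def test_no_cross_tenant_retrieval(conn, tenant_b_probe_embeddings):
    """Queries issued as tenant_a must never surface tenant_b documents."""
    for embedding in tenant_b_probe_embeddings:
        results = retrieve(conn, embedding, tenant_id="tenant_a",
                           user_clearances=["public", "internal"])
        for chunk_id, source_doc_id, text in results:
            # Hypothetical convention: doc IDs carry a tenant prefix.
            assert not source_doc_id.startswith("tenant_b/"), (
                f"cross-tenant leak: tenant_a query retrieved {source_doc_id}"
            )
```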
None of this requires a new vector database. None of it requires a smarter ranker or a bigger embedding model. It requires the discipline to admit that the question "what goes into the retrieval index" is not an engineering question with a one-time answer. It is an editorial question, asked continuously, by people whose job it is to ask it.
The systems that get this right will look slower and more bureaucratic than the ones that don't — until something goes wrong, at which point the difference between "we can identify which document caused this and retire it in an hour" and "we'll have to rebuild the index next quarter and hope" becomes the difference between a fixable bug and a structural problem. The retrieval index is a product surface. Either you decide what's on it, or your scrapers decide for you.
- https://www.informationweek.com/data-management/nobody-told-legal-about-your-rag-pipeline-why-that-s-a-problem
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/secure-multitenant-rag
- https://aws.amazon.com/blogs/machine-learning/multi-tenant-rag-implementation-with-amazon-bedrock-and-amazon-opensearch-service-for-saas-using-jwt/
- https://www.daxa.ai/blogs/secure-retrieval-augmented-generation-rag-in-enterprise-environments
- https://articles.chatnexus.io/knowledge-base/content-audit-for-rag-systems-evaluating-your-know/
- https://www.regal.ai/blog/rag-hygiene
- https://ragaboutit.com/the-knowledge-decay-problem-how-to-build-rag-systems-that-stay-fresh-at-scale/
- https://particula.tech/blog/update-rag-knowledge-without-rebuilding
- https://nstarxinc.com/blog/the-next-frontier-of-rag-how-enterprise-knowledge-systems-will-evolve-2026-2030/
- https://www.deasylabs.com/post/using-metadata-in-retrieval-augmented-generation
- https://www.we45.com/post/rag-systems-are-leaking-sensitive-data
- https://www.kiteworks.com/cybersecurity-risk-management/prevent-data-leakage-rag-pipelines/
