The Wiki Edit That Was Mid-Flight When Your RAG Pipeline Read It
A tech writer on your platform team is moving a paragraph. Not metaphorically — literally cutting a section from the onboarding page, pasting it into the runbook, deleting a stub draft on a third page, and rewording a deprecated warning on a fourth. The whole edit takes her about eleven minutes. Your RAG ingest job runs every fifteen. It happens to fire at minute six.
For the next fifteen minutes, your retrieval index contains a state of the wiki that did not exist at any single moment in her mind. The onboarding page still has the section. The runbook still doesn't. The stub draft is captured halfway through being deleted, with a placeholder sentence she never intended to publish. The old deprecated warning is still indexed. When an engineer asks the agent "how do we handle credential rotation in this service," the model retrieves contradictory chunks from the same source and confidently synthesizes whichever was ranked higher. The answer is wrong in a shape no one wrote.
This is a failure mode most teams ship without noticing: the source-of-truth is transactional, the ingest is a poll, and the gap between them is where dirty reads live.
The Shape of the Bug Nobody Catches
The standard RAG architecture is so clean it lulls you. A wiki on one side, a vector database on the other, a scheduled job in the middle that walks the corpus, chunks each page, embeds the chunks, and upserts them into the index. Every fifteen minutes. Every hour. Every night. Pick your latency budget. Ship.
The lie hidden in that diagram is that the wiki is a static document set. It isn't. A mid-size company can accumulate 200,000 to 500,000 pages across Confluence or Notion, with editors writing concurrently. At almost any moment of the working day, some page is being touched by someone. The ingest job is a reader without a consistency contract, scanning a database that has multiple writers and no read snapshot.
In transactional database terms, this is a dirty read. The ingest sees rows mid-transaction. The transaction in this case is not a single SQL statement — it is a human's mental edit that spans pages, takes minutes, and has no commit marker the ingest can subscribe to. The pipeline captures the world in the middle of a thought.
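The mid-thought capture can be reproduced in a few lines. What follows is a toy simulation, not a real wiki client: the page names, the SECTION text, and the idea that each step is exactly one page save are all assumptions made for illustration. It applies a cut-then-paste edit one save at a time and takes a snapshot after the first save, the way a poll would.

```python
SECTION = "Rotate credentials every 30 days."  # hypothetical paragraph being moved


def initial_wiki():
    """Toy wiki: page name -> list of paragraphs (all names are made up)."""
    return {
        "onboarding": ["Welcome to the team.", SECTION],
        "runbook": ["Restart steps."],
        "stub": ["Placeholder sentence."],
    }


def cut_from_onboarding(wiki):
    wiki["onboarding"].remove(SECTION)


def paste_into_runbook(wiki):
    wiki["runbook"].append(SECTION)


def delete_stub(wiki):
    del wiki["stub"]


# One logical edit in the writer's head, three separate page saves in the wiki.
EDIT_STEPS = [cut_from_onboarding, paste_into_runbook, delete_stub]


def snapshot_after(n_saves):
    """State a poll would capture if it fired after the first n_saves saves."""
    wiki = initial_wiki()
    for step in EDIT_STEPS[:n_saves]:
        step(wiki)
    return wiki


def pages_containing(wiki, paragraph):
    """Names of the pages that currently hold the given paragraph."""
    return sorted(page for page, paras in wiki.items() if paragraph in paras)


before = snapshot_after(0)
mid_flight = snapshot_after(1)
after = snapshot_after(len(EDIT_STEPS))

print(pages_containing(before, SECTION))      # ['onboarding']
print(pages_containing(mid_flight, SECTION))  # []
print(pages_containing(after, SECTION))       # ['runbook']
```

In the mid-flight snapshot the credential-rotation paragraph exists on no page at all, and the stub is still present. That state is neither the writer's "before" nor her "after", yet it is what the index would serve until the next pass.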
The downstream symptom is the part teams notice, if they notice anything. The agent gives a confidently wrong answer that no one on the team would have written. There is no commit, no diff, no audit log that contains the sentence the model just produced — because the sentence is a synthesis across page revisions that never coexisted. By the time a user flags it, the next ingest has already overwritten the inconsistent state, and the bug is unreproducible.
Why the Deleted Draft Keeps Coming Back
The mid-flight read has a sibling failure that is easier to reproduce, harder to fix, and accounts for a surprising fraction of "the agent told me something that isn't even in the wiki anymore" tickets: the resurrected draft.
A writer starts a new page, types two paragraphs, walks away to lunch, comes back, decides the framing is wrong, deletes the whole thing. Total lifespan: forty minutes. Your ingest happened to run at minute twenty-five. The two paragraphs are now in your vector index. The page they came from no longer exists on the wiki side.
On the next ingest pass, the page is gone — but "gone" is information the ingest has to actively detect and act on. If the pipeline iterates over what's currently in the wiki and upserts what it finds, it has no signal that something disappeared. The orphan chunks sit in the vector store forever. They retrieve. They get cited. The writer's abandoned draft becomes, in a real and observable sense, a permanent part of the knowledge base, retrievable in agent responses for as long as the index lives.
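A minimal sketch of the orphan problem and one possible repair, assuming the index can be modeled as a dictionary from chunk ID to a record that remembers its source page. The function names, the "page#n" chunk ID scheme, and the paragraph-level chunking are hypothetical; a real vector store would expose its own delete-by-metadata call.

```python
def split_paragraphs(text):
    """Chunk a page on blank lines (a stand-in for a real chunker)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def upsert_only_ingest(wiki, index):
    """Naive pass: write chunks for every page that exists right now."""
    for page_id, text in wiki.items():
        for n, para in enumerate(split_paragraphs(text)):
            index[f"{page_id}#{n}"] = {"page_id": page_id, "text": para}


def reconcile_deletes(wiki, index):
    """Remove chunks whose source page is no longer in the wiki."""
    orphans = sorted(cid for cid, rec in index.items() if rec["page_id"] not in wiki)
    for chunk_id in orphans:
        del index[chunk_id]
    return orphans


wiki = {"runbook": "Restart the service.", "draft": "First take.\n\nSecond take."}
index = {}
upsert_only_ingest(wiki, index)   # the poll catches the draft while it exists

del wiki["draft"]                 # the writer deletes the page
upsert_only_ingest(wiki, index)   # the next poll finds nothing to upsert for it
orphans_left = sorted(c for c in index if c.startswith("draft#"))

removed = reconcile_deletes(wiki, index)
print(orphans_left)   # ['draft#0', 'draft#1']
print(removed)        # ['draft#0', 'draft#1']
print(sorted(index))  # ['runbook#0']
```

The reconcile step only handles whole-page deletion. A page that shrinks from five chunks to two would still leave three stale chunks behind unless the ingest also clears a page's old chunks before writing the new ones.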
This is the failure mode that breaks the trust between the wiki and the agent permanently. Users learn that things they delete don't actually go away. Once that lesson lands, the wiki stops being a place people put high-confidence drafts. The cost of writing anything goes up because deletion is no longer a primitive that works.
Polling Is the Problem, Not the Schedule
The first instinct when you notice these bugs is to crank the ingest frequency. Every fifteen minutes becomes every five minutes becomes every minute. This compresses the dirty-read window without closing it. A one-minute window is still wide enough to catch the editor whose paragraph happened to land inside it. You also pay more in re-embedding cost (re-embedding a corpus that hasn't materially changed is one of the largest avoidable bills in production RAG), and you still have not fixed the architectural mismatch.
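The arithmetic behind "compresses without closing" can be sketched under one simplifying assumption: polls fire on a fixed interval whose offset is uniformly random relative to the start of the edit. The function names are illustrative, and the eleven-minute edit is the one from the opening example.

```python
def p_poll_lands_mid_edit(edit_minutes, interval_minutes):
    """Chance that at least one poll fires while the edit is in progress,
    assuming a fixed interval with a uniformly random offset."""
    return min(1.0, edit_minutes / interval_minutes)


def snapshot_lifetime(interval_minutes):
    """Each captured snapshot is served until the next poll replaces it."""
    return interval_minutes


for interval in (15, 5, 1):
    p = p_poll_lands_mid_edit(11, interval)
    print(f"every {interval:>2} min: P(mid-edit poll) = {p:.2f}, "
          f"each snapshot lives {snapshot_lifetime(interval)} min")
```

Under that assumption a fifteen-minute poll catches the eleven-minute edit mid-flight about 73% of the time, while a five-minute or one-minute poll catches it every time. Faster polling shortens how long each inconsistent snapshot is served, but it makes capturing one more likely, not less.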
The architectural fix is to stop polling. Modern wiki platforms emit edit events. Notion has webhooks. Confluence has event listeners. Both can push a change notification to your pipeline shortly after the edit lands, usually within seconds, though delivery delay varies by platform and is not guaranteed. The ingest becomes event-driven: a CDC subscriber that receives "page X revision Y was just saved" and processes that single change, rather than a job that walks the world every N minutes.
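A minimal sketch of what processing a single change might look like. The PageEvent shape is hypothetical and is not the payload of Notion's or Confluence's actual webhooks, which differ per platform. The index is again modeled as a plain dictionary, and the embedding step is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageEvent:
    """Hypothetical event shape; real webhook payloads differ per platform."""
    page_id: str
    revision: int
    kind: str        # "saved" or "deleted"
    text: str = ""


def split_paragraphs(text):
    """Chunk a page on blank lines (a stand-in for a real chunker)."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def handle_event(event, index):
    """Apply one page change to the index (dict: chunk_id -> record)."""
    # Drop every existing chunk of this page first, so a shrunken or
    # deleted page leaves nothing behind.
    stale = [cid for cid, rec in index.items() if rec["page_id"] == event.page_id]
    for chunk_id in stale:
        del index[chunk_id]

    # A save re-creates the page's chunks; a delete stops after the cleanup.
    if event.kind == "saved":
        for n, para in enumerate(split_paragraphs(event.text)):
            index[f"{event.page_id}#{n}"] = {
                "page_id": event.page_id,
                "revision": event.revision,
                "text": para,
            }


index = {}
handle_event(PageEvent("draft", 1, "saved", "First take.\n\nSecond take."), index)
handle_event(PageEvent("draft", 2, "saved", "Only take."), index)
handle_event(PageEvent("draft", 3, "deleted"), index)
print(index)  # {}
```

Because the handler touches only the chunks of the page named in the event, work scales with the size of the change rather than the size of the corpus, and a delete event removes the page's chunks explicitly instead of relying on their absence being noticed.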
CDC-driven ingest is the architectural rewrite that closes the dirty-read window. Latency drops from minutes to seconds. Cost drops too — you re-embed only what changed, not the corpus. But more importantly, the model of the system finally matches reality: writers commit, the ingest reacts, and the pipeline stops trying to read a moving target as a static snapshot.
The cost of this rewrite is real. CDC requires a message broker, an event consumer, idempotent processing, replay semantics for missed events, and dead-letter handling. The first time it goes wrong (the event stream backs up and you find out the index has been stale for six hours), you will miss the simplicity of the cron job. But the cron job was lying to you the whole time. The CDC pipeline tells you when it fails, which is the property you actually need.
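Some of those moving parts can be sketched in-process, without a broker. The class below is a hypothetical illustration, not a production consumer: the event dictionary keys (page_id, revision, saved_at), the retry count, and the health check are all assumptions. It shows revision-based idempotency, a dead-letter list for events that keep failing, and a lag figure that makes staleness visible.

```python
import time


class IdempotentConsumer:
    """Toy event consumer: dedupes by revision, retries, then dead-letters."""

    def __init__(self, apply_change, max_attempts=3):
        self.apply_change = apply_change   # callable that updates the index
        self.max_attempts = max_attempts
        self.last_revision = {}            # page_id -> highest revision applied
        self.dead_letters = []             # (event, error text) kept for replay
        self.last_lag_seconds = None       # save-to-index delay of last success

    def consume(self, event, now=None):
        now = time.time() if now is None else now
        page_id, revision = event["page_id"], event["revision"]

        # Redelivered or out-of-order events must not overwrite newer state.
        if revision <= self.last_revision.get(page_id, -1):
            return "skipped"

        last_error = None
        for _ in range(self.max_attempts):
            try:
                self.apply_change(event)
            except Exception as exc:       # sketch only; narrow this in real code
                last_error = exc
                continue
            self.last_revision[page_id] = revision
            self.last_lag_seconds = now - event["saved_at"]
            return "applied"

        # Every attempt failed: park the event instead of silently dropping it.
        self.dead_letters.append((event, repr(last_error)))
        return "dead-lettered"

    def is_unhealthy(self, max_lag_seconds):
        """True when events have been dead-lettered or the last one arrived late."""
        lagging = (self.last_lag_seconds or 0) > max_lag_seconds
        return bool(self.dead_letters) or lagging
```

The lag figure here only updates when an event is actually processed, so a completely stalled stream would look healthy to this check. Catching that case needs a signal from outside the consumer, such as a broker-side backlog metric or a periodic heartbeat event.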
