The Vector Index Has a Staleness SLO Nobody Set
A user asks your agent what the current price tier is for an enterprise plan. The agent retrieves a chunk, reads it, and answers: "$2,000 per month." Confident, sourced, formatted nicely. The problem is that pricing changed four days ago. The number the agent quoted was true last week. The chunk it retrieved was embedded before the change, and the index has not caught up.
Nobody decided this would happen. There was no design review where someone said "the agent may answer from data up to four days old." There is just a re-indexing job that runs nightly, or weekly, and a content team that edits the source whenever they feel like it, and a gap between those two clocks that nobody measures. That gap is a service level objective. It exists whether or not you wrote it down. The only question is whether you set it on purpose or inherited it by accident.
This is the quiet failure mode of retrieval-augmented systems. Bad retrieval that returns an irrelevant chunk is visible — the answer is obviously wrong, someone files a bug. Stale retrieval returns a relevant chunk that happens to be out of date. The answer looks right. It is well-formed, on-topic, and cites a real document. It is just wrong about time. And time is the one dimension your evals almost never test, because your eval set was frozen on a Tuesday and the world was not.
"The Index Updates Nightly" Is an SLO Stated By Accident
Walk into any team running a production RAG pipeline and ask what their freshness guarantee is. You will usually get a description of a mechanism, not a number. "We re-index nightly." "There's a cron job." "It picks up changes on the next crawl." Those are implementation details that have been quietly promoted to promises.
"We re-index nightly" actually means: a document edited at 2 PM will not be retrievable in its new form until the next run, and a document edited just after a run waits a full day. So the worst-case lag is around 24 hours, and the average lag is around twelve. That is the real SLO. It has a distribution. It has a tail. But because nobody phrased it as a commitment, nobody owns the tail, nobody alerts when a run fails, and nobody can tell you what happens when the nightly job silently skips a batch.
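The arithmetic behind a nightly schedule can be checked with a few lines of Python. This is a minimal sketch under stated assumptions: the job runs once a day at midnight, finishes instantly, and never fails, and edits arrive evenly through the day. `lag_hours` and `PERIOD_HOURS` are illustrative names, not part of any real pipeline.

```python
from statistics import mean

PERIOD_HOURS = 24.0  # assumption: exactly one run per day


def lag_hours(edit_hour: float, run_hour: float = 0.0) -> float:
    """Hours from an edit until the next run picks it up.

    Hours are measured since midnight. An edit landing exactly at run time
    is assumed to miss that run and wait a full period.
    """
    wait = (run_hour - edit_hour) % PERIOD_HOURS
    return wait if wait > 0 else PERIOD_HOURS


# One hypothetical edit per minute, spread evenly across the day.
lags = [lag_hours(minute / 60) for minute in range(24 * 60)]

print(f"2 PM edit waits {lag_hours(14.0):.0f} h")
print(f"worst case: {max(lags):.0f} h, mean: {mean(lags):.1f} h")
```

Real lag is worse than this idealized figure, because the run itself takes time and a failed or skipped run adds a whole extra period to every pending edit.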
The replication-lag analogy is exact, and worth taking seriously. Every database engineer knows that a read replica lags the primary, that the lag is a measured quantity with a p50 and a p99, that it spikes under write load, and that you alert on it. Replication lag is a first-class operational metric with dashboards and pagers. Embedding-index lag is the same physical phenomenon — a downstream copy trailing a source of truth — but it is treated as invisible. The vector index is a replica. It just happens to be a replica that most teams never instrumented.
The difference in treatment is not technical. It is cultural. Database replication lag got monitored because outages taught everyone it mattered. Index lag has not had its outage yet, or rather it has, but the outage looked like "the agent gave a slightly wrong answer" and got written off as a model hallucination instead of a pipeline problem.
Freshness Is a Contract, and Contracts Have Numbers
If you want to fix this, the first move is not technical. It is to write the number down. Retrieval freshness should be an explicit contract between the team that owns the index and everyone who consumes it, and a contract that says "eventually" is not a contract.
A usable freshness contract has a few parts. Maximum lag: the worst-case time between a source change and that change being retrievable, stated as a percentile, because the average is comforting and the tail is what burns you. Scope: which content the guarantee covers, since you almost certainly cannot afford the same freshness for a ten-million-document archive as for the pricing page. Measurement: how lag is observed in production, not estimated from the cron schedule. Owner: a named team that gets paged when the contract is violated.
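One way to make those four parts concrete is to write the contract as data and check it against observed lag. The sketch below is hypothetical throughout: the `FreshnessContract` fields, the team name, and the sample numbers are illustrative assumptions, and the percentile uses a simple nearest-rank rule rather than any particular monitoring system's definition.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class FreshnessContract:
    scope: str          # which content the guarantee covers
    percentile: int     # e.g. 99 means the limit applies at p99
    max_lag: timedelta  # allowed lag at that percentile
    measurement: str    # how lag is observed in production
    owner: str          # team paged on violation


def observed_lag(contract: FreshnessContract, lags: list[timedelta]) -> timedelta:
    """Nearest-rank percentile of measured source-change-to-retrievable lags."""
    if not lags:
        raise ValueError("no lag samples observed")
    ordered = sorted(lags)
    # Integer ceiling of percentile * n / 100, clamped to at least rank 1.
    rank = max(1, -(-contract.percentile * len(ordered) // 100))
    return ordered[rank - 1]


def is_violated(contract: FreshnessContract, lags: list[timedelta]) -> bool:
    return observed_lag(contract, lags) > contract.max_lag


# Hypothetical contract for a small, high-stakes slice of content.
pricing = FreshnessContract(
    scope="pricing pages",
    percentile=99,
    max_lag=timedelta(minutes=15),
    measurement="source change time vs. time the new chunk became retrievable",
    owner="search-platform",
)

# 98 fast updates and 2 slow ones: the median looks fine, the tail does not.
samples = [timedelta(minutes=5)] * 98 + [timedelta(hours=3)] * 2
print(observed_lag(pricing, samples), is_violated(pricing, samples))
```

Keeping scope on the contract itself lets a large archive and the pricing page carry different numbers without sharing one vague promise.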
The metric that matters most is what some practitioners call the stale retrieval rate: the fraction of live queries that return at least one chunk whose embedding was computed before the source document's most recent update. This is the number that translates directly into user harm. A p99 lag of two hours sounds abstract; "3% of answers this week were built on a chunk that was already out of date" does not. It is also measurable without guessing: store each chunk's embedding timestamp alongside the vector, compare it at query time to the source document's current last-modified timestamp, and count the queries where any retrieved chunk was embedded before that timestamp.
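The stale retrieval rate described above can be computed roughly as follows. This is a sketch, not a reference implementation: it assumes each retrieved chunk carries a document ID and an embedding timestamp, and that the source system can report each document's current last-modified time. `RetrievedChunk`, `source_modified_at`, and the dates are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class RetrievedChunk:
    doc_id: str
    embedded_at: datetime  # when this chunk's embedding was computed


def is_stale(chunk: RetrievedChunk, source_modified_at: dict[str, datetime]) -> bool:
    """A chunk is stale if its source changed after the chunk was embedded."""
    return chunk.embedded_at < source_modified_at[chunk.doc_id]


def stale_retrieval_rate(
    queries: list[list[RetrievedChunk]],
    source_modified_at: dict[str, datetime],
) -> float:
    """Fraction of queries that retrieved at least one stale chunk."""
    if not queries:
        return 0.0
    stale = sum(
        any(is_stale(chunk, source_modified_at) for chunk in chunks)
        for chunks in queries
    )
    return stale / len(queries)


def utc(day: int) -> datetime:
    return datetime(2025, 1, day, tzinfo=timezone.utc)


# Current last-modified times, as reported by the (hypothetical) source system.
source_modified_at = {"pricing": utc(6), "faq": utc(1)}

pricing_chunk = RetrievedChunk("pricing", embedded_at=utc(2))  # embedded before the edit
faq_chunk = RetrievedChunk("faq", embedded_at=utc(3))          # embedded after the edit

queries = [[pricing_chunk, faq_chunk], [faq_chunk], [faq_chunk], [faq_chunk]]
rate = stale_retrieval_rate(queries, source_modified_at)
print(f"stale retrieval rate: {rate:.0%}")
```

Because `is_stale` works on a single chunk, the same check can tag an individual response as stale at answer time, not just feed the aggregate rate.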
Notice what this reframing does. It moves freshness from "a property of the pipeline" to "a property of each answer." Once you can attribute staleness to individual responses, you can put it in dashboards, regress against it, and argue about it in incident reviews. An SLO you cannot measure per-request is just a slogan.
Surface Data Age — to the Agent and to the User
