
The Vector Index Whose Source Updates Never Reached the Embeddings

· 10 min read
Tian Pan
Software Engineer

A support engineer pings the on-call channel. A customer pasted a sentence the assistant retrieved last week, and the policy team replied: we don't say that anymore. They haven't said it for four months. The document in the CMS reads correctly. The embedded chunk in the vector index still reads the old way, with a confident similarity score, surfaced to the model on every relevant query. Nobody changed the retrieval code. Nobody changed the model. The source of truth changed, and the index never heard about it.

This is the failure mode of an ingestion pipeline that was designed for creates and grew into a system that also handles updates without anyone designing for updates. The "embed on create" job ran the day each document was first written. The CMS shipped an edit endpoint a quarter later, owned by a different team, who plumbed it into search and into the public-facing renderer and into the changelog feed — every consumer except the one that was a derived dataset hiding behind a different name. Months pass. The corpus drifts. Retrieval starts answering questions about a world the company has formally left behind, and the only signal is a confused customer.

The Create Path Was Loud. The Update Path Was Silent.

Look at the org chart and the failure is obvious in retrospect. The team that built the vector index also built the ingestion job and called it done. The team that owns the CMS ships features against a different roadmap. When edits were added to the product, the edit event fanned out to the consumers that team knew about: the rendered page, the API response, the analytics counter. The vector index was not on the list because it was not part of their mental model of the CMS — it was an AI thing, run by an AI team, against a snapshot the AI team had taken once.

Each team is correct from inside their own scope. The CMS team is shipping edits to all known subscribers. The retrieval team has a working ingestion job. Nobody is sitting in the seam, and the seam is where the freshness contract had to live. The most common shape of this bug is not a clever distributed-systems problem; it is two backlogs that never produced the ticket "wire CMS update events into the embedding worker."

The on-create-only ingestion job survives because it works for the first weeks of every new document. Retrieval looks great in the demo. The eval suite was built against the same snapshot the index was built against, so the eval rubric never disagrees with what retrieval surfaces. The drift between source and index accumulates underneath a green dashboard.

A Vector Index Is a Materialized View

The right mental model for this layer is the one borrowed from databases. A vector index is a materialized view over the source corpus. Every materialized view has two questions: how is it built, and how is it invalidated. Teams that build vector indexes routinely answer the first question — chunker, embedder, write to the store — and assume the second answers itself because nothing in the AI tooling reminded them to ask.

A materialized view that is never invalidated is a cache with no eviction policy. It serves the version of the world it had on its happiest day. Every database engineer has internalized that you do not ship one of those into a production read path, but the same engineer reading the same retrieval architecture diagram does not see the cache, because the diagram says "vector index" and the cache is implicit.

The framing matters because it changes the work. If you treat the index as a materialized view, you stop reasoning about it as a one-time ingestion problem and start reasoning about it as a continuously derived dataset whose contract with the source is the freshness guarantee. Continuously derived datasets need a change feed, a fan-out, a version pin, and a reconciliation path. They need ownership of the gap between source and view, not ownership of the build job.

The Patterns That Close the Loop

Four patterns, in increasing order of investment, recover the invariant.

A change-data-capture feed from the source to the embedding worker. Whatever the corpus is (the CMS, the ticketing system, the knowledge base), the source emits an event for every create, update, and delete. The embedding worker subscribes. On update, it re-chunks the affected document, regenerates the embeddings, and overwrites the existing vectors. On delete, it tombstones the vectors so retrieval cannot surface them. This is the only pattern that closes the loop in real time; everything else is a fallback for when it fails. If the source system does not natively emit change events, you build CDC against its database: a logical replication slot, a trigger-based outbox table, or a Debezium-style connector. The shape of the work is the same one your data team already does for the warehouse.
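A minimal sketch of the worker side of that feed, under stated assumptions: `ChangeEvent`, `InMemoryIndex`, `chunk`, and `handle_event` are hypothetical names invented for illustration, the index is a plain dictionary standing in for a real vector store, and deletes are hard deletes rather than tombstones to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


@dataclass
class ChangeEvent:
    """Hypothetical event shape; a real change feed defines its own schema."""
    op: str                      # "create", "update", or "delete"
    doc_id: str
    body: Optional[str] = None   # absent on deletes


class InMemoryIndex:
    """Stand-in for a vector store, keyed by (doc_id, chunk_no)."""

    def __init__(self) -> None:
        self.vectors: Dict[Tuple[str, int], Vector] = {}

    def delete_document(self, doc_id: str) -> None:
        for key in [k for k in self.vectors if k[0] == doc_id]:
            del self.vectors[key]

    def upsert(self, doc_id: str, chunk_no: int, vector: Vector) -> None:
        self.vectors[(doc_id, chunk_no)] = vector


def chunk(text: str, size: int = 40) -> List[str]:
    """Naive fixed-width chunker, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def handle_event(event: ChangeEvent, index: InMemoryIndex,
                 embed: Callable[[str], Vector]) -> None:
    if event.op not in ("create", "update", "delete"):
        raise ValueError(f"unknown op: {event.op!r}")
    # Clear every existing chunk first, so an edit that shortens a document
    # cannot leave orphaned vectors from the old version behind.
    index.delete_document(event.doc_id)
    if event.op != "delete":
        for n, piece in enumerate(chunk(event.body or "")):
            index.upsert(event.doc_id, n, embed(piece))
```

Removing a document's old chunks before writing the new ones is a deliberate choice here: overwriting chunk by chunk would leave the tail of a longer previous version in place after a shortening edit.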

A per-document version pinned next to the vector. Store the source's last-modified timestamp and a content hash alongside the embedding. The question "is this chunk still consistent with the source?" then becomes a cheap comparison of pins, and it does not have to run on the retrieval hot path: a nightly job can make it, a dashboard can, a staleness probe can. Without the version pin, the index is opaque. You cannot tell a fresh chunk from a stale one by looking at it, and any answer you give about freshness is a guess based on when you remember running ingestion.
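One way the pin could look, as a sketch rather than a prescribed schema: `VersionPin`, `pin_for`, and `is_stale` are hypothetical names, and SHA-256 over the UTF-8 body is an assumed choice of content hash.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class VersionPin:
    """Metadata stored next to each embedding (hypothetical field names)."""
    source_modified_at: datetime
    content_sha256: str


def pin_for(body: str, modified_at: datetime) -> VersionPin:
    digest = hashlib.sha256(body.encode("utf-8")).hexdigest()
    return VersionPin(modified_at, digest)


def is_stale(pinned: VersionPin, current_body: str) -> bool:
    # Compare content, not timestamps: a touch that changes nothing
    # should not force a re-embed.
    current = hashlib.sha256(current_body.encode("utf-8")).hexdigest()
    return current != pinned.content_sha256


pin = pin_for("Refunds within 30 days.",
              datetime(2024, 1, 1, tzinfo=timezone.utc))
```

The hash decides whether a chunk is stale; the timestamp is kept so that lag between source and index can be measured later.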

A freshness SLO measured against the source's last-modified clock. Pick the metric that maps to user-visible harm. The median age of indexed content is not interesting; the lag between source update and index update is. Express it as "99% of source updates are reflected in the index within N minutes." Page on it. The metric is only meaningful if you have the version pin, because you need a per-document timestamp on both sides to compute the lag.
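A sketch of how that SLO might be computed from per-document timestamps. The function names, the tuple layout, and the 0.99 default are assumptions for illustration; the point is that an update which never reached the index still counts, with its lag measured up to `now`.

```python
from datetime import datetime, timedelta
from typing import List, Optional, Tuple

# (source_updated_at, indexed_at); indexed_at is None if the update
# has not reached the index yet.
Update = Tuple[datetime, Optional[datetime]]


def fraction_within(updates: List[Update], budget: timedelta,
                    now: datetime) -> float:
    """Share of source updates whose index lag is within the budget."""
    if not updates:
        return 1.0
    on_time = 0
    for updated_at, indexed_at in updates:
        # A missing indexed_at accrues lag until now, so a silently
        # dropped update eventually shows up as a violation.
        end = now if indexed_at is None else indexed_at
        if end - updated_at <= budget:
            on_time += 1
    return on_time / len(updates)


def slo_met(updates: List[Update], budget: timedelta, now: datetime,
            target: float = 0.99) -> bool:
    return fraction_within(updates, budget, now) >= target
```

Without the `None` case the metric would only ever measure updates that succeeded, which is the population least likely to be a problem.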

A reconciliation job that diffs source against index. Even with CDC, events get dropped, workers crash mid-batch, deletes get lost, schema migrations slip a column. The reconciliation job is the periodic audit that catches what the streaming path missed. Walk the source corpus, compare each document's last-modified timestamp and hash to the version pinned in the index, queue the divergent set into the same embedding worker the CDC path uses. The reconciliation job is also the backfill mechanism for any source whose history predates the CDC connector — you do not need a separate one-shot ingestion job once reconciliation exists, because reconciliation against an empty index is the same code path as reconciliation against a stale one.
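The diff at the heart of such a job can be sketched in a few lines, assuming both sides can be reduced to a mapping from document ID to content hash. `Plan` and `reconcile` are hypothetical names, and a real job would page through the corpus rather than hold it all in memory.

```python
from typing import Dict, List, NamedTuple


class Plan(NamedTuple):
    to_embed: List[str]    # missing from the index, or hash differs
    to_delete: List[str]   # in the index but gone from the source


def reconcile(source_hashes: Dict[str, str],
              index_hashes: Dict[str, str]) -> Plan:
    to_embed = sorted(
        doc_id for doc_id, digest in source_hashes.items()
        if index_hashes.get(doc_id) != digest
    )
    to_delete = sorted(
        doc_id for doc_id in index_hashes if doc_id not in source_hashes
    )
    return Plan(to_embed, to_delete)


source = {"a": "h1", "b": "h2-new", "c": "h3"}
index = {"a": "h1", "b": "h2-old", "gone": "h9"}
plan = reconcile(source, index)
backfill = reconcile(source, {})
```

Here `plan` queues `b` and `c` for embedding and `gone` for deletion, while `backfill` queues every document: reconciling against an empty index is the same code path as reconciling against a stale one.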

Detection That Doesn't Wait for a Customer

If the first signal that retrieval is stale is a customer pasting a stale quote back to support, the system has no observability for its most consequential failure mode. The detection patterns that work do not require user reports.

Freshness as a first-class metric, alongside retrieval latency and answer quality, surfaces drift before it accumulates. A dashboard that includes a stale-document count cannot stay green while the index drifts. Alert thresholds should reflect the corpus: a legal handbook that gets edited weekly can tolerate hours of lag; a pricing page that drives quotes cannot.

Embedding drift detection on the distribution of newly indexed content catches a different class of failure — when the source corpus is changing but the index is not, the distribution of vectors stops moving. A monitor that compares this week's new-vector distribution against last week's will fire when ingestion silently stops, even if no individual document raises a flag.

End-to-end probes are the cheap version of an eval that the team will actually maintain. Pick a small set of documents that get edited regularly. After each edit, query the retrieval layer with a question whose answer depends on the new wording. If the index returns the old wording, page. This is a synthetic canary, not an eval rubric, and it catches the staleness class of failure even when the eval suite is itself stale against the current product.
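A canary of this kind might reduce to a single check, sketched below. `probe_is_fresh` is a hypothetical name, `retrieve` stands in for whatever function returns retrieved chunk texts for a query, and the substring match is an assumed simplification.

```python
from typing import Callable, List


def probe_is_fresh(retrieve: Callable[[str], List[str]], question: str,
                   new_phrase: str, old_phrase: str) -> bool:
    """True only if retrieval surfaces the new wording and not the old."""
    chunks = retrieve(question)
    has_new = any(new_phrase in text for text in chunks)
    has_old = any(old_phrase in text for text in chunks)
    return has_new and not has_old


# Fake retrievers standing in for a stale and a fresh index.
def stale_index(question: str) -> List[str]:
    return ["Refunds are accepted within 30 days."]


def fresh_index(question: str) -> List[str]:
    return ["Refunds are accepted within 14 days."]
```

Checking for the old phrase as well as the new one matters: an index that holds both versions of a chunk is still serving stale content.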

Why "We'll Just Reindex Periodically" Fails

The fallback most teams reach for first is a nightly or weekly batch reindex of the entire corpus. It is reassuring to schedule and easy to ship, and it produces an outage-shaped failure mode that is easier to reason about than continuous drift. It also has problems that get worse with scale.

A full reindex costs in proportion to corpus size, not change rate. A million-document corpus where a hundred documents change per day pays a million-document price every night to catch a hundred. The cost is real: embeddings are not free, and the embedding budget can be one of the larger line items in a production RAG system. Streaming architectures pay in proportion to the daily change volume, which is typically orders of magnitude smaller than the corpus size (four, in this example).
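The arithmetic behind that comparison, using the paragraph's own numbers. The sketch counts documents embedded per day and ignores chunks per document and per-token pricing, which would scale both sides equally; `daily_embedding_work` is a hypothetical name.

```python
def daily_embedding_work(corpus_docs: int, changed_docs_per_day: int) -> dict:
    """Documents embedded per day under each strategy."""
    if changed_docs_per_day <= 0:
        raise ValueError("changed_docs_per_day must be positive")
    return {
        "full_reindex": corpus_docs,        # every document, every night
        "streaming": changed_docs_per_day,  # only what actually changed
        "ratio": corpus_docs / changed_docs_per_day,
    }


costs = daily_embedding_work(corpus_docs=1_000_000, changed_docs_per_day=100)
```

For this corpus the nightly batch does 10,000 times the embedding work of the streaming path to pick up the same hundred edits.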

A batched reindex has a partial-update failure mode that streaming does not. If the batch crashes halfway through, the index is in a state where half the documents are at version N and half are at version N plus one, with no visible seam. The retrieval layer cannot tell which half it is querying. The safe pattern for batched reindexing is an alias swap — build a new index entirely, validate against a benchmark query set, atomically swap, retain the old index for rollback — and it adds enough operational weight that most teams ship the unsafe version.
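The alias swap can be sketched as follows. `AliasedIndexes` and its methods are hypothetical, the indexes are plain dictionaries, and the "atomic" swap here is a single assignment in one process; a real vector store would need its own alias or pointer operation to give the same guarantee.

```python
from typing import Callable, Dict, Optional

Index = Dict[str, str]  # stand-in: doc_id -> indexed text


class AliasedIndexes:
    """Readers only ever see the index the live alias points at."""

    def __init__(self) -> None:
        self.indexes: Dict[str, Index] = {}
        self.live: Optional[str] = None
        self.previous: Optional[str] = None

    def publish(self, name: str, candidate: Index,
                validate: Callable[[Index], bool]) -> bool:
        # Validate the fully built candidate before readers can see it.
        if not validate(candidate):
            return False  # live alias untouched
        self.indexes[name] = candidate
        # Swap the alias and retain the old index for rollback.
        self.previous, self.live = self.live, name
        return True

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous index retained")
        self.live, self.previous = self.previous, None

    def query(self, doc_id: str) -> Optional[str]:
        if self.live is None:
            raise RuntimeError("no live index")
        return self.indexes[self.live].get(doc_id)
```

Because the candidate is built off to the side, a crash halfway through the batch leaves the live index at a single consistent version.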

A nightly cadence also bakes in a window where the index is wrong by definition. If a customer-visible document changes at 9 a.m. and the reindex runs at 2 a.m., retrieval is wrong for seventeen hours by design. That window is invisible to internal users because it falls inside business hours and looks like the index "working." It is not a freshness contract; it is a freshness rejection.
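The size of that window is easy to compute for any edit time. A small sketch, assuming a reindex that starts on the hour and ignoring how long the reindex itself takes; `hours_until_next_reindex` is a hypothetical helper.

```python
from datetime import datetime, timedelta


def hours_until_next_reindex(edited_at: datetime, reindex_hour: int) -> float:
    """Hours an edit waits before the next scheduled reindex starts."""
    run = edited_at.replace(hour=reindex_hour, minute=0,
                            second=0, microsecond=0)
    if run <= edited_at:
        run += timedelta(days=1)  # today's run has already happened
    return (run - edited_at).total_seconds() / 3600


window = hours_until_next_reindex(datetime(2024, 3, 4, 9, 0), reindex_hour=2)
```

For the 9 a.m. edit and 2 a.m. reindex above, `window` is 17.0 hours.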

The Architectural Realization

The bug is not that the ingestion job was wrong. The ingestion job did exactly what it was built to do: embed each document at creation time. The bug is the assumption that creation was the only event the index needed to know about. The source emitted updates; the consumer registered for creates; the gap is the bug.

Once the team treats the vector index as a materialized view rather than an ingestion target, the conversation changes. The work is not "build a better ingestion job"; it is "own the invariant between source and view." That ownership has artifacts: a change feed, a version pin, a freshness SLO, a reconciliation diff. It has alerts and on-call rotations. It has a contract you can point at when the next team adds a new event type to the CMS: the version pin will start showing divergence, the SLO will start firing, and somebody will get paged, which is more than anyone got during four months of silent drift.

The teams that ship reliable retrieval are the ones who stop describing the embedding store as a database they wrote to and start describing it as a cache they have to keep coherent with the system of record. The vocabulary forces the right questions: what is the source of truth, what is the invalidation event, what is the staleness SLO, what is the reconciliation path. None of those questions get asked by a team that thinks they are running an ingestion job. All of them get answered by a team that knows they are running a derived dataset. The difference is the customer who never has to paste a stale quote back to support.
