Embedding Migrations Are the New Schema Migrations

· 12 min read
Tian Pan
Software Engineer

The first time most teams swap an embedding model in production, they treat it as a batch job. Re-run the embedder, build a new index, swap the alias, deploy. Latency stays normal. Error rates stay zero. Every query returns results. And retrieval quality silently regresses for weeks before anyone notices, because the symptom is "users complain the answers feel off," not a red dashboard.

This is not a deployment problem. It is a schema migration that the team has decided to run blind. The old embedding space and the new one are different reference frames; the cosine geometry that used to mean "these two paragraphs are about the same topic" no longer means that with the same numerical confidence. Documents and queries that used to cluster together drift apart non-uniformly. Re-rankers trained on the old distribution start firing on examples that no longer match what they learned. The eval suite that scores green on pointwise relevance misses all of it, because no individual document moved very far while the entire graph rotated.

Treat the swap like a database migration and almost everything that goes wrong becomes preventable. Treat it like a batch job and the regressions arrive on a schedule that nobody owns.

Why "Drop-In Replacement" Is Almost Always a Lie

Vendor announcements describe new embedding models as drop-in replacements for the previous version. Sometimes the dimensionality matches. Sometimes the API signature is identical. The implication is that the old vectors and the new vectors live in the same world, and you can mix queries from one model against documents from the other.

This is wrong in a way that breaks production.

Two embedding spaces with identical dimensionality are not interchangeable. The cosine similarity between a query vector and a document vector is only meaningful when both vectors come from the same model. The math runs to completion regardless, but the reference frame is different. Practitioners who instrument this carefully report cosine scores on relevant pairs collapsing from 0.85+ to the mid-0.6s when models are mixed, and queries that used to retrieve employment-law documents start surfacing real-estate documents that happen to share surface vocabulary. The system keeps responding. The system keeps being wrong.
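A toy illustration of the reference-frame problem, using only the standard library. The two "models" here are hypothetical stand-ins: each applies its own fixed signed permutation of coordinates to the same latent vector, so each space is internally consistent but the two do not line up. Real embedding models share more structure than this, so the collapse in this sketch is far more drastic than the drop practitioners report; the point is only that matching dimensionality guarantees nothing.

```python
import math
import random


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def make_toy_model(dim, seed):
    """Hypothetical embedder: a fixed signed permutation of the latent vector.

    A signed permutation preserves every inner product inside its own space,
    which is why same-model similarity survives and cross-model does not.
    """
    rng = random.Random(seed)
    perm = list(range(dim))
    rng.shuffle(perm)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]

    def embed(latent):
        return [signs[i] * latent[perm[i]] for i in range(dim)]

    return embed


rng = random.Random(0)
dim = 64
doc_latent = [rng.gauss(0, 1) for _ in range(dim)]
# A query that means nearly the same thing: the document plus small noise.
query_latent = [x + 0.1 * rng.gauss(0, 1) for x in doc_latent]

old_model = make_toy_model(dim, seed=1)
new_model = make_toy_model(dim, seed=2)

same_old = cosine(old_model(query_latent), old_model(doc_latent))
same_new = cosine(new_model(query_latent), new_model(doc_latent))
# New-model query against an old-model document: same shape, wrong frame.
mixed = cosine(new_model(query_latent), old_model(doc_latent))

print(f"old/old: {same_old:.3f}  new/new: {same_new:.3f}  mixed: {mixed:.3f}")
```

Both same-model scores stay high and agree with each other, while the mixed score carries no information about relevance. Nothing raises an error at any point, which is the failure mode described above.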

"Drop-in replacement" should be read as "the schema column type is the same." It says nothing about the semantics of the values stored in that column. A migration from one model to another is a migration of the data itself, not just the column header.

The Eval Suite That Doesn't Catch It

Most teams that run RAG have some eval suite. It usually scores top-k precision or NDCG against a labeled set of query-document pairs. Run it before the migration, run it after, compare the deltas. If the numbers are flat, ship.

The trap: pointwise relevance evals are designed to catch cases where one document moves far in the new space. Embedding migrations don't do that. They rotate the entire graph, often subtly, often non-uniformly across content types. Every document moves a little. The single most relevant document for a query is usually still in the top-k after the swap. But the top-3 may have a new entrant that wasn't there before, and the re-ranker is now operating on a slightly different candidate set than the one it was trained for. Pointwise scores barely move. Downstream answer quality moves a lot.

What catches this is a metric that scores the structural relationship between the two indexes, not the absolute relevance of either. The simplest version is top-k overlap: for a representative query set, what fraction of the documents that the old index retrieved in the top-k are still in the top-k for the new index? Call it neighborhood stability. A migration where 95% of top-10 results are preserved is a different operation from one where only 60% are. The former is a routine swap. The latter is a re-architecture, and any re-ranker, prompt template, or downstream component that was tuned against the old neighborhoods needs a re-tuning plan before traffic moves.
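A minimal sketch of the top-k overlap computation, assuming retrieval results are already available as ranked lists of document ids keyed by query. The function names and the result-map shape are illustrative choices, not taken from any particular library.

```python
def top_k_overlap(old_ids, new_ids, k):
    """Fraction of the old index's top-k that is still in the new index's top-k."""
    old_top = set(old_ids[:k])
    new_top = set(new_ids[:k])
    if not old_top:
        return 0.0
    return len(old_top & new_top) / len(old_top)


def neighborhood_stability(old_results, new_results, k=10):
    """Mean top-k overlap over the queries present in both result maps.

    Returns (mean, per_query) so outlier queries can be inspected, not just
    averaged away.
    """
    shared = old_results.keys() & new_results.keys()
    if not shared:
        raise ValueError("no queries in common between the two result maps")
    per_query = {
        q: top_k_overlap(old_results[q], new_results[q], k) for q in shared
    }
    mean = sum(per_query.values()) / len(per_query)
    return mean, per_query


# Ranked document ids per query, old index vs. new index (made-up data).
old_results = {"q1": ["a", "b", "c", "d"], "q2": ["e", "f", "g", "h"]}
new_results = {"q1": ["a", "c", "b", "x"], "q2": ["e", "y", "z", "f"]}

mean, per_query = neighborhood_stability(old_results, new_results, k=3)
print(f"mean top-3 overlap: {mean:.2f}")
print(per_query)
```

In this made-up data, q1 keeps its whole top-3 (only the order changes) while q2 keeps one of three. Note that this version ignores rank order inside the top-k; a rank-sensitive variant would be a reasonable extension if the re-ranker is sensitive to position.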

The eval discipline that survives an embedding migration looks like this: pointwise relevance for absolute quality, neighborhood stability for structural drift, and end-to-end answer quality on a held-out task suite for the user-visible regression. Two of those three almost certainly need to be added before the migration starts; if you only have the first one, you are flying blind in exactly the dimensions that matter.
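One way to keep all three signals in front of whoever approves the cutover is a small gate that reports each one separately. This is a sketch under assumptions: the metric names, the `EvalSnapshot` type, and both thresholds are placeholders to be set per system, not recommended values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalSnapshot:
    ndcg_at_10: float      # pointwise relevance on the labeled query set
    answer_quality: float  # end-to-end score on the held-out task suite


def migration_verdict(old, new, stability, *, max_drop=0.01, min_stability=0.95):
    """Return ("proceed" | "hold", reasons) from the three eval signals.

    max_drop and min_stability are placeholder thresholds (assumptions).
    """
    reasons = []
    if old.ndcg_at_10 - new.ndcg_at_10 > max_drop:
        reasons.append("pointwise relevance regressed")
    if stability < min_stability:
        reasons.append("neighborhoods shifted; downstream components need re-tuning")
    if old.answer_quality - new.answer_quality > max_drop:
        reasons.append("end-to-end answer quality regressed")
    return ("hold", reasons) if reasons else ("proceed", [])


# Flat pointwise score, but structural drift and a worse end-to-end result.
old_eval = EvalSnapshot(ndcg_at_10=0.80, answer_quality=0.70)
new_eval = EvalSnapshot(ndcg_at_10=0.80, answer_quality=0.62)
verdict, reasons = migration_verdict(old_eval, new_eval, stability=0.60)
print(verdict, reasons)
```

The example is the scenario from the previous section: the pointwise number is flat, so a gate built only on it would ship, while the other two signals both say hold.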

Dual Write, Then Dual Read, Then Switch

Database migrations have a well-understood sequence: add the new column, dual-write to both, backfill the historical rows, dual-read with verification, cut over, drop the old. Embedding migrations should look the same.

A workable pattern, in order:

  • Stand up a parallel index populated by the new model. The dimensionality may match the old one; treat it as a separate index regardless, with its own alias, its own monitoring, and an explicit version tag on every vector. Co-locating old and new vectors in the same index just to save storage is the operational equivalent of mixing two coordinate systems in one geometry library — eventually something queries the wrong one.
  • Dual-write all new and updated documents to both indexes from the moment the parallel index exists. The longer the dual-write window, the smaller the backfill backlog when you are ready to cut over.
  • Backfill historical documents in priority order, hottest content first. Retrieval traffic is usually heavily skewed: the most-queried 10% of documents typically account for the majority of it. Prioritizing those means the new index becomes useful for shadow traffic long before the backfill is complete.
  • Shadow queries against the parallel index while live traffic still hits the old one. Compare top-k overlap, candidate-set re-ranker scores, and end-to-end answer quality on a sampled stream. This is where neighborhood-stability metrics earn their keep — they tell you whether the new index is structurally similar enough to roll out unchanged or whether downstream components need re-tuning first.
  • Cohort the rollout by query intent, not by user id. Some intents migrate cleanly (factoid lookup against canonical documents). Others regress hard (multi-hop questions whose retrieval depends on tight neighborhood structure). A naive percent-of-users rollout averages these together and hides which content categories need work; an intent-cohorted rollout exposes them.
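The steps above can be sketched in a few functions. Everything here is a hypothetical stand-in: `VectorIndex` is an in-memory placeholder for a real vector store, the embedders are dummy callables, and the intent labels are assumed to come from some upstream classifier.

```python
from dataclasses import dataclass, field


@dataclass
class VectorIndex:
    """In-memory stand-in for one index, pinned to a single model version."""
    model_version: str
    vectors: dict = field(default_factory=dict)

    def upsert(self, doc_id, vector, model_version):
        # The explicit version tag is what stops cross-model writes.
        if model_version != self.model_version:
            raise ValueError(
                f"index holds {self.model_version} vectors, got {model_version}"
            )
        self.vectors[doc_id] = vector


def dual_write(doc_id, text, indexes, embedders):
    """Embed with every model and write each vector to its own index."""
    for version, index in indexes.items():
        index.upsert(doc_id, embedders[version](text), version)


def backfill_order(query_counts):
    """Document ids sorted hottest first, by historical retrieval count."""
    return sorted(query_counts, key=query_counts.get, reverse=True)


def overlap_by_intent(samples, k=10):
    """Mean top-k overlap per intent from (intent, old_ids, new_ids) samples."""
    buckets = {}
    for intent, old_ids, new_ids in samples:
        old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
        score = len(old_top & new_top) / len(old_top) if old_top else 0.0
        buckets.setdefault(intent, []).append(score)
    return {intent: sum(s) / len(s) for intent, s in buckets.items()}


def use_new_index(intent, migrated_intents):
    """Intent-cohorted routing: only intents that passed shadow eval move."""
    return intent in migrated_intents


# Dummy embedders; real ones would call the old and new embedding models.
embedders = {"v1": lambda t: [float(len(t))], "v2": lambda t: [float(len(t)), 1.0]}
indexes = {"v1": VectorIndex("v1"), "v2": VectorIndex("v2")}
dual_write("doc-1", "hello", indexes, embedders)

order = backfill_order({"doc-a": 5, "doc-b": 50, "doc-c": 1})

shadow = [
    ("factoid", ["a", "b"], ["a", "b"]),
    ("multi_hop", ["a", "b"], ["a", "x"]),
]
by_intent = overlap_by_intent(shadow, k=2)
migrated = {i for i, score in by_intent.items() if score >= 0.95}  # placeholder cutoff
print(order, by_intent, use_new_index("multi_hop", migrated))
```

Keeping the version check inside `upsert` means a misconfigured writer fails loudly instead of silently mixing coordinate systems. The per-intent breakdown is the piece a percent-of-users rollout would average away: in the made-up shadow sample, factoid queries are stable while multi-hop queries are not.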