
Embedding Migrations Are the New Schema Migrations

12 min read
Tian Pan
Software Engineer

The first time most teams swap an embedding model in production, they treat it as a batch job. Re-run the embedder, build a new index, swap the alias, deploy. Latency stays normal. Error rates stay zero. Every query returns results. And retrieval quality silently regresses for weeks before anyone notices, because the symptom is "users complain the answers feel off," not a red dashboard.

This is not a deployment problem. It is a schema migration that the team has decided to run blind. The old embedding space and the new one are different reference frames; the cosine geometry that used to mean "these two paragraphs are about the same topic" no longer means that with the same numerical confidence. Documents and queries that used to cluster together drift apart non-uniformly. Re-rankers trained on the old distribution start firing on examples that no longer match what they learned. The eval suite that scores green on pointwise relevance misses all of it, because no individual document moved very far while the entire graph rotated.

Treat the swap like a database migration and almost everything that goes wrong becomes preventable. Treat it like a batch job and the regressions arrive on a schedule that nobody owns.

Why "Drop-In Replacement" Is Almost Always a Lie

Vendor announcements describe new embedding models as drop-in replacements for the previous version. Sometimes the dimensionality matches. Sometimes the API signature is identical. The implication is that the old vectors and the new vectors live in the same world, and you can mix queries from one model against documents from the other.

This is wrong in a way that breaks production.

Two embedding spaces with identical dimensionality are not interchangeable. The cosine similarity between a query vector and a document vector is only meaningful when both vectors come from the same model — the math runs to completion regardless, but the reference frame is different. Practitioners who instrument this carefully see cosine scores collapse from 0.85+ on relevant pairs down to the mid-0.6s when models are mixed, and queries that used to retrieve employment-law documents start surfacing real-estate documents that happen to share surface vocabulary. The system keeps responding. The system keeps being wrong.

"Drop-in replacement" should be read as "the schema column type is the same." It says nothing about the semantics of the values stored in that column. A migration from one model to another is a migration of the data itself, not just the column header.

The Eval Suite That Doesn't Catch It

Most teams that run RAG have some eval suite. It usually scores top-k precision or NDCG against a labeled set of query-document pairs. Run it before the migration, run it after, compare the deltas. If the numbers are flat, ship.

The trap: pointwise relevance evals are designed to catch cases where one document moves far in the new space. Embedding migrations don't do that. They rotate the entire graph, often subtly, often non-uniformly across content types. Every document moves a little. The single most relevant document for a query is usually still in the top-k after the swap. But the top-3 may have a new entrant that wasn't there before, and the re-ranker is now operating on a slightly different candidate set than the one it was trained for. Pointwise scores barely move. Downstream answer quality moves a lot.

What catches this is a metric that scores the structural relationship between the two indexes, not the absolute relevance of either. The simplest version is top-k overlap: for a representative query set, what fraction of the documents that the old index retrieved in the top-k are still in the top-k for the new index? Call it neighborhood stability. A migration where 95% of top-10 results are preserved is a different operation from one where only 60% are. The former is a routine swap. The latter is a re-architecture, and any re-ranker, prompt template, or downstream component that was tuned against the old neighborhoods needs a re-tuning plan before traffic moves.
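The metric is simple enough to sketch. Assuming two retrieval functions that each return an ordered list of document ids for a query — one backed by the old index, one by the new — neighborhood stability is just the top-k overlap averaged over a representative query set:

```python
from typing import Callable, Iterable, List

def neighborhood_stability(
    queries: Iterable[str],
    search_old: Callable[[str, int], List[str]],  # old index: query, k -> doc ids
    search_new: Callable[[str, int], List[str]],  # new index: query, k -> doc ids
    k: int = 10,
) -> float:
    # Fraction of the old index's top-k that survives in the new index's top-k,
    # averaged across the query set.
    overlaps = []
    for q in queries:
        old_ids = set(search_old(q, k))
        new_ids = set(search_new(q, k))
        if old_ids:
            overlaps.append(len(old_ids & new_ids) / len(old_ids))
    return sum(overlaps) / len(overlaps) if overlaps else 0.0
```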

The eval discipline that survives an embedding migration looks like this: pointwise relevance for absolute quality, neighborhood stability for structural drift, and end-to-end answer quality on a held-out task suite for the user-visible regression. Two of those three almost certainly need to be added before the migration starts; if you only have the first one, you are flying blind in exactly the dimensions that matter.

Dual Write, Then Dual Read, Then Switch

Database migrations have a well-understood sequence: add the new column, dual-write to both, backfill the historical rows, dual-read with verification, cut over, drop the old. Embedding migrations should look the same.

A workable pattern, in order:

  • Stand up a parallel index populated by the new model. The dimensionality may match the old one; treat it as a separate index regardless, with its own alias, its own monitoring, and an explicit version tag on every vector. Co-locating old and new vectors in the same index just to save storage is the operational equivalent of mixing two coordinate systems in one geometry library — eventually something queries the wrong one.
  • Dual-write all new and updated documents to both indexes from the moment the parallel index exists (the write path and the rollback flag flip are sketched in code after this list). The longer the dual-write window, the smaller the backfill backlog when you are ready to cut over.
  • Backfill historical documents in priority order, hottest content first. The most-queried 10% of documents typically account for the majority of retrieval traffic; prioritizing those means the new index becomes useful for shadow traffic long before the backfill is complete.
  • Shadow queries against the parallel index while live traffic still hits the old one. Compare top-k overlap, candidate-set re-ranker scores, and end-to-end answer quality on a sampled stream. This is where neighborhood-stability metrics earn their keep — they tell you whether the new index is structurally similar enough to roll out unchanged or whether downstream components need re-tuning first.
  • Cohort the rollout by query intent, not by user id. Some intents migrate cleanly (factoid lookup against canonical documents). Others regress hard (multi-hop questions whose retrieval depends on tight neighborhood structure). A naive percent-of-users rollout averages these together and hides which content categories need work; an intent-cohorted rollout exposes them.
  • Keep both indexes live until you have shipped a rollback at least once in staging. The operational point of dual indexes is not the cutover; it is the rollback. If you cannot demonstrate that a flag flip moves traffic back to the old index in under a minute, you do not have a migration plan, you have a one-way deploy.
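In code, the dual-write and the flag flip are small; the discipline is in keeping both paths alive for the whole window. A sketch with an in-memory index standing in for whatever vector store and flag system you actually run — every name here is a placeholder:

```python
from typing import Callable, Dict, List
import numpy as np

class Index:
    """Stand-in for a real vector store client; replace with your own."""
    def __init__(self) -> None:
        self._vecs: Dict[str, np.ndarray] = {}
    def upsert(self, doc_id: str, vec: np.ndarray) -> None:
        self._vecs[doc_id] = vec / np.linalg.norm(vec)
    def search(self, qvec: np.ndarray, k: int) -> List[str]:
        q = qvec / np.linalg.norm(qvec)
        ranked = sorted(self._vecs.items(), key=lambda kv: -float(kv[1] @ q))
        return [doc_id for doc_id, _ in ranked[:k]]

old_index, new_index = Index(), Index()
flags = {"retrieval.active_index": "embeddings_v1"}  # flip to cut over or roll back

def write_document(doc_id: str, text: str,
                   embed_v1: Callable[[str], np.ndarray],
                   embed_v2: Callable[[str], np.ndarray]) -> None:
    # Dual-write: every new or updated document lands in both indexes.
    old_index.upsert(doc_id, embed_v1(text))
    new_index.upsert(doc_id, embed_v2(text))

def retrieve(query: str, k: int,
             embed_v1: Callable[[str], np.ndarray],
             embed_v2: Callable[[str], np.ndarray]) -> List[str]:
    # Cutover and rollback are the same operation: one flag flip, no redeploy.
    # Each index is only ever queried with vectors from its own model.
    if flags["retrieval.active_index"] == "embeddings_v2":
        return new_index.search(embed_v2(query), k)
    return old_index.search(embed_v1(query), k)
```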

The window during which both indexes exist is expensive. That cost is the price of the migration being safe. The teams that try to skip it because storage is doubled are the same teams that re-run the entire migration two weeks later when the regression finally surfaces.

The Re-Ranker Dependency Most Teams Discover Mid-Migration

Cross-encoder re-rankers are trained on candidate sets produced by a specific embedding model. The training data implicitly encodes the distribution of false positives that model produces — the kinds of near-misses that make it through retrieval and need to be filtered out. Swap the embedding model and that distribution changes. Some old false positives no longer appear; new ones do. The re-ranker is now solving a problem it wasn't trained on.

In the lucky case, this shows up as a small dip in re-ranker precision. In the unlucky case, the new candidate sets are different enough that the re-ranker's relative ordering becomes unreliable on entire content categories — typically the categories where the embedding migration moved the most, which are the ones the re-ranker was doing the most work on.

The discipline is to treat re-ranker re-training as a planned dependency of the embedding swap, not a discovered firefight. That means: collect labeled candidate sets from the new index during the dual-write window, retrain or fine-tune the re-ranker against those, and only flip traffic once the re-ranker has been validated on candidates from the new model. If the re-ranker is third-party and not retrainable, the migration plan needs a different mitigation — usually expanding the candidate set width to compensate for ordering noise, or accepting a tunable quality regression on specific intents.
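Collecting that training data is mostly logging. A sketch of what to capture from shadow traffic against the new index — the record layout and where labels come from (human review, click signals, an LLM judge) are assumptions, not a prescribed pipeline:

```python
import json
from dataclasses import dataclass, asdict
from typing import List, Optional

@dataclass
class CandidateRecord:
    query: str
    doc_id: str
    rank_in_new_index: int
    label: Optional[int] = None  # filled in later by whatever labeling process you use

def log_shadow_candidates(query: str, new_index_doc_ids: List[str],
                          out_path: str = "reranker_train_candidates.jsonl") -> None:
    # Append one record per candidate retrieved by the new index during shadowing,
    # so re-ranker retraining data exists before the cutover, not after.
    with open(out_path, "a") as f:
        for rank, doc_id in enumerate(new_index_doc_ids):
            f.write(json.dumps(asdict(CandidateRecord(query, doc_id, rank))) + "\n")
```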

The Cost Frame That Surprises Teams

Re-embedding a corpus is not free, but the line item that surprises teams is rarely the inference cost. At current public pricing, re-embedding a billion tokens with a small embedding model is in the tens of dollars; with a large one it is in the low hundreds. That is real money but it is rarely the constraint.

The constraints that bite are different:

  • Index storage doubles for the migration window. A billion 1024-dimensional float32 vectors are roughly four terabytes before index overhead (10⁹ vectors × 1,024 dimensions × 4 bytes ≈ 4.1 TB); two indexes means eight. Most managed vector databases price storage as a primary line item, and the dual-write window may be the first time the team's budget owner sees what the production index actually costs.
  • GPU throughput is the wall on self-hosted embedders. A single mid-range GPU produces embeddings at low thousands of tokens per second on a seven-billion-parameter model; a billion-document corpus on a single machine is measured in days, not hours. Re-embedding fleets need provisioning weeks ahead of the cutover, not the morning of.
  • Re-ranker re-training has its own labeling and compute cost that does not fit on the embedding line in the budget. Teams that scope the migration as "re-embed the corpus" without scoping it as "re-embed the corpus and retrain the re-ranker and re-validate the eval suite" come back asking for budget halfway through.
  • Migration windows have a quality cost too. During dual-read, queries are doing more work; during cohorted rollouts, the system is operating in a lower-confidence regime. That cost is real; it is paid in user-visible retrieval quality during the window. Pricing it as zero is what produces pressure to skip the dual-write entirely.

The headline number to plan around is not the cost of the new embeddings. It is the cost of the entire window during which two production-grade indexes are running.
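A parameterized back-of-the-envelope model keeps that framing honest. Every price below is an input you fill in from your own vendors — nothing here asserts a rate — and in practice the storage-times-window term is the one that tends to surprise:

```python
def migration_window_cost(
    corpus_tokens: float,               # total tokens to re-embed
    embed_price_per_m_tokens: float,    # $/1M tokens, from your embedding vendor
    index_size_tb: float,               # size of ONE production index, in TB
    storage_price_per_tb_month: float,  # $/TB-month, from your vector database
    window_months: float,               # how long both indexes stay live
    reranker_retrain_cost: float = 0.0, # labeling + compute, scoped separately
) -> float:
    """Rough cost of the migration window; all prices are caller-supplied."""
    reembed = corpus_tokens / 1e6 * embed_price_per_m_tokens
    extra_storage = index_size_tb * storage_price_per_tb_month * window_months
    return reembed + extra_storage + reranker_retrain_cost
```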

Who Owns the User-Visible Regression

Most embedding migrations span an organizational fault line. The platform team owns the index, the embedding pipeline, the dual-write tooling, and the cutover ceremony. The product team owns the eval suite, the user research that reads "answers feel worse this week," and the on-call when something breaks. The migration is executed by the platform team and judged by the product team, and neither team owns the metric that ties them together.

The failure mode is predictable: the platform team ships a clean migration with green operational metrics, the product team starts seeing a slow degradation in user feedback, and several weeks later someone finally connects the dots and the rollback either works (because the old index is still live, in which case good) or doesn't (because the old index was decommissioned to free up storage, in which case the team eats a multi-week regression).

The fix is organizational, not technical: a single owner for "retrieval quality during migration windows," with the authority to halt or roll back the migration based on user-visible signals, not just operational ones. The eval suite they own needs to include neighborhood-stability and end-to-end answer quality, not just pointwise relevance. The on-call rotation they run needs to include the platform team during the window, so the people who can flip the flag are paged at the same time as the people who notice the regression.

The Architectural Realization

An embedding model is not a feature extractor that one team owns. It is a contract between an indexer, a retriever, and a re-ranker — and increasingly a generator that conditions on retrieved context. Migrations of contracts need versioning, dual-write windows, validation gates, and rollback paths the same way a database migration does. The fact that the contract is implicit (encoded in floating-point geometry rather than column types) does not make it less of a contract; it makes it harder to discover when it is broken.

Teams that internalize this stop running embedding swaps as platform tickets and start running them as cross-team migrations with named owners, scheduled checkpoints, and explicit acceptance criteria. The migration takes longer and costs more in the window. It also stops producing the silent quality regression that nobody can attribute and everybody pays for.

The next time someone proposes "just re-embed and swap," the right response is not "what's the timeline." It is: "show me the dual-write plan, the neighborhood-stability target, the re-ranker re-training schedule, and the rollback runbook." If those four things don't exist yet, the migration isn't ready — and treating it like a batch job is the failure mode the discipline exists to prevent.
