
Embedding Model Rotation Is a Database Migration, Not a Deploy

11 min read
Tian Pan
Software Engineer

Somewhere in a staging channel, an engineer writes "bumping the embedder to v3, new model scored +4 on MTEB, merging after the smoke test." Two days later support tickets start trickling in about search results that feel "weirdly off." A week later retrieval precision is down fourteen points, cosine scores have collapsed from 0.85 into the 0.65 range, and nobody can explain why — because the deploy looked identical to the last five model bumps. It wasn't a deploy. It was a database migration wearing a deploy's costume.

Embedding model rotation is the most misfiled change type in AI infrastructure. It lands in your system through the same channels as a prompt tweak or a generation-model pin update — a config file, a PR, a CI check — so it gets the governance of a config change. But under the hood, a new embedder does not produce a better version of your old vectors. It produces vectors that live in a different coordinate system entirely, where cosine similarity across the two manifolds is a category error. The correct mental model is not "rev the dependency." It is "swap the primary key encoding on a fifty-million-row table while serving reads."

Teams who treat it as a deploy discover this mid-cutover, usually from the user side first. Teams who treat it as a migration build a shadow index, run dual queries, measure agreement before flipping the alias, and keep the old index warm for a week in case rollback is needed. The difference between these two teams is not sophistication. It is whether someone correctly named the change category in the first sprint planning where it came up.

Why the Manifolds Don't Line Up

Every embedding model defines a high-dimensional space whose geometry reflects how the model was trained — the objective, the data mix, the tokenizer, the projection head. Two models that both claim to "embed English text into 1024 dimensions" produce vectors that are not merely different values of the same quantity. They are measurements in different units, in spaces with different topologies, where the axes mean different things and the notion of "near" is defined by different neighbors.

This is why swapping models and comparing a fresh query embedding against the old stored vectors fails silently. Nothing in the request path errors. The vector arithmetic runs. The database returns k results. The results are just subtly, structurally wrong: semantically adjacent documents stop ranking first, and documents that share surface tokens with the query start outranking documents that share meaning. Your cosine scores don't go to zero — they collapse from 0.85 into a mushy 0.6 band where everything looks roughly similar and nothing looks right. The failure is invisible to every observability dashboard that doesn't already measure retrieval quality, and most don't.
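To make the category error concrete, here is a toy sketch. It assumes the sentence-transformers package; the two MiniLM variants are stand-ins for the old and new embedders and happen to share a dimensionality, so the arithmetic runs without complaint even though the cross-model number is meaningless:

```python
# Toy illustration of the cross-model category error. The two models are
# stand-ins for "old embedder" and "new embedder"; both output 384 dims,
# so cosine similarity computes fine across them, it just means nothing.
from sentence_transformers import SentenceTransformer
from numpy import dot
from numpy.linalg import norm

old_model = SentenceTransformer("all-MiniLM-L6-v2")   # "embedding" (old)
new_model = SentenceTransformer("all-MiniLM-L12-v2")  # "embedding_v2" (new)

doc = "How do I rotate my API keys without downtime?"
query = "zero-downtime credential rotation"

doc_old = old_model.encode(doc)    # stored vector, old space
q_old = old_model.encode(query)    # what the stored index actually expects
q_new = new_model.encode(query)    # fresh query vector, new space

cos = lambda a, b: float(dot(a, b) / (norm(a) * norm(b)))
print("same-space similarity :", cos(doc_old, q_old))  # meaningful
print("cross-space similarity:", cos(doc_old, q_new))  # runs fine, means nothing
```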

The practical consequence: you cannot do an in-place upgrade. You cannot gradually migrate vector-by-vector while keeping the index queryable, because any query that hits a mixture of old and new vectors is returning results from two different similarity functions blended together. The migration is necessarily a full re-embed plus a cutover. The only question is how disciplined you are about the cutover.

The Migration Playbook

Borrow the discipline directly from database schema migrations. The pattern has four phases, and each one has a failure mode if you skip it.

Phase 1: Shadow index. Create a new vector column, collection, or namespace — depending on your vector store — and run the new embedder over your entire corpus in the background, writing into the shadow. Weaviate supports this via collection aliases and coexisting named vectors. Pinecone and Qdrant support it through multiple indexes or collections you can alias. Postgres with pgvector supports it by adding a second embedding_v2 column and building its index with CREATE INDEX CONCURRENTLY. The shadow must be populated from the same source of truth as the live index, not from the live index itself — embeddings are not round-trippable, so you can't "translate" old vectors into the new space. You have to re-run the embedder on the source text.
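A minimal sketch of the pgvector flavor, assuming psycopg 3 and an embed_v3() wrapper around the new model; the table, column, and dimension names are illustrative:

```python
# Phase 1 sketch for Postgres + pgvector: add a shadow column, build its
# index concurrently, and backfill by re-running the new embedder on the
# source text (never by transforming the old vectors).
import psycopg

DDL = "ALTER TABLE documents ADD COLUMN IF NOT EXISTS embedding_v2 vector(1024)"
# CREATE INDEX CONCURRENTLY must run on an autocommit connection,
# outside any transaction block.
IDX = ("CREATE INDEX CONCURRENTLY IF NOT EXISTS documents_embedding_v2_hnsw "
       "ON documents USING hnsw (embedding_v2 vector_cosine_ops)")

def to_pgvector(vec):
    """Render a float sequence as a pgvector literal."""
    return "[" + ",".join(f"{x:.6f}" for x in vec) + "]"

def backfill(conn: psycopg.Connection, embed_v3, batch: int = 500):
    """Re-embed rows that don't yet have a shadow vector, in batches."""
    while True:
        rows = conn.execute(
            "SELECT id, body FROM documents WHERE embedding_v2 IS NULL LIMIT %s",
            (batch,),
        ).fetchall()
        if not rows:
            return
        vectors = embed_v3([body for _, body in rows])
        with conn.transaction():
            for (doc_id, _), vec in zip(rows, vectors):
                conn.execute(
                    "UPDATE documents SET embedding_v2 = %s::vector WHERE id = %s",
                    (to_pgvector(vec), doc_id),
                )
```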

Phase 2: Dual-read with agreement metrics. Before flipping any user traffic, stand up an offline or shadow-traffic path that sends each query to both the old and new indexes and logs the top-k overlap. The standard bar is a golden-query set with labeled relevance — a few hundred queries is enough — and a target overlap somewhere in the 60%–80% range between old and new top-5. One practitioner who published a full migration writeup measured 82% overlap and used that as the go signal. Overlap below your threshold is a red flag: it means either the new model disagrees meaningfully with the old one (in which case you need retrieval evals, not just benchmark scores, to decide if that disagreement is an improvement) or your chunking and preprocessing drifted during the re-embed. Either way, don't cut over.
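A sketch of the agreement check, assuming search_old() and search_new() return ranked document ids from the two indexes and golden_queries is your labeled set; the 0.8 bar is a placeholder for whatever threshold you pick:

```python
# Dual-read agreement check over a golden-query set.
def topk_overlap(old_ids, new_ids, k=5):
    """Fraction of the old top-k that survives into the new top-k."""
    return len(set(old_ids[:k]) & set(new_ids[:k])) / k

def agreement(golden_queries, search_old, search_new, k=5):
    scores = [topk_overlap(search_old(q, k), search_new(q, k), k)
              for q in golden_queries]
    return sum(scores) / len(scores)

mean_overlap = agreement(golden_queries, search_old, search_new)
if mean_overlap < 0.8:
    raise SystemExit(f"hold the cutover: mean top-5 overlap is {mean_overlap:.0%}")
```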

Phase 3: Staged cutover. Ramp traffic from 5% to 25% to 100% over days, not minutes. Watch click-through rate, downstream answer quality, and any product-level retrieval metric you trust. The reason to go slow is not that the new index might be missing data — it is that retrieval quality regressions do not trigger error alerts. They show up as a slow decline in user satisfaction that you only notice after enough sessions accumulate. A gradual ramp gives you the statistical power to catch a regression before it is 100% of your traffic.
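A sketch of the ramp mechanics: deterministic per-user bucketing so the same user stays on the same index for the whole window, with a single percentage knob to move from 5 to 25 to 100. The search_old and search_new wrappers are assumed:

```python
# Staged-cutover sketch: hash users into 100 buckets and route a fixed
# percentage of them to the new index, so cohorts are stable across days.
import hashlib

ROLLOUT_PERCENT = 5  # bump to 25, then 100, over days, not minutes

def use_new_index(user_id: str) -> bool:
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < ROLLOUT_PERCENT

def retrieve(query: str, user_id: str):
    if use_new_index(user_id):
        return search_new(query)   # assumed wrapper over the v3 shadow index
    return search_old(query)       # assumed wrapper over the live v2 index
```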

Phase 4: Rollback plan, kept warm. This is the step that always gets cut for time and never should be. The old index must remain live and queryable for at least a week after cutover, ideally longer. Rollback is a feature-flag flip from embedding_v2 back to embedding, not a restore-from-backup operation. If you have already dropped the old column to save storage, you have converted a ninety-second rollback into a multi-day re-embed of your entire corpus — while users are complaining. The whole point of shadow indexing is that rollback is free; throwing away the old index undoes that property.
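Continuing the pgvector sketch from Phase 1, rollback can be as small as one setting; the setting name here is illustrative:

```python
# Rollback as a config flip, not a restore. Setting the column back to
# "embedding" is the entire rollback, provided the old column (and its
# index) was never dropped.
ACTIVE_EMBEDDING_COLUMN = "embedding_v2"   # flip back to "embedding" to roll back

SEARCH_SQL = f"""
    SELECT id, body
    FROM documents
    ORDER BY {ACTIVE_EMBEDDING_COLUMN} <=> %s::vector
    LIMIT %s
"""
```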

The Operational Tax You Didn't Budget For

Teams that correctly identify this as a migration still underestimate the bill, because the costs do not show up in the PR. They show up on the invoice and in the latency dashboard.

Re-embedding burst cost. Embedding an entire corpus in one go is the single largest embedding-API bill most teams ever see. For 50 thousand documents the cost is rounding error and the wall-clock time is hours. For 50 million documents, plan for a weekend of pipeline runtime, several thousand dollars in API fees, and enough rate-limit headroom that you don't starve your live traffic of embedding throughput during the backfill. If you are self-hosting the embedder, the GPU-hours for a re-embed dwarf the steady-state serving cost — sometimes by an order of magnitude.
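The arithmetic is worth doing explicitly before the backfill starts. Every number in this sketch is an assumption to replace with your own corpus stats and your provider's pricing:

```python
# Back-of-envelope for the re-embed bill; all inputs are assumptions.
docs = 50_000_000
tokens_per_doc = 500                 # assumed average chunk size
price_per_million_tokens = 0.10      # assumed embedding API price, USD
throughput_tokens_per_sec = 500_000  # assumed sustained rate under your rate limits

total_tokens = docs * tokens_per_doc
cost_usd = total_tokens / 1_000_000 * price_per_million_tokens
hours = total_tokens / throughput_tokens_per_sec / 3600

print(f"~${cost_usd:,.0f} in API fees, ~{hours:.0f} hours of pipeline runtime")
```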

Index warm-up. HNSW and other graph-based indexes are not queryable at production latency the moment the vectors land. The graph needs to be constructed, the navigable small-world links need to form, and the caches need to warm up. A freshly populated index on cold hardware can have p99 latencies three to five times higher than the same index after an hour of real traffic. If you cut over to a cold shadow index the moment re-embedding finishes, you will eat a latency regression even if the retrieval quality is perfect. Burn in the shadow with replayed traffic before the cutover.
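A sketch of the burn-in, assuming read_recent_queries() streams logged production queries and search_new() hits the shadow index:

```python
# Burn in the cold shadow index with replayed production traffic before
# cutover, so cutover day doesn't also eat the cold-cache latency penalty.
import time

def warm_shadow_index(read_recent_queries, search_new, budget_seconds=3600):
    deadline = time.monotonic() + budget_seconds
    for query in read_recent_queries():
        search_new(query)              # exercises graph entry points and caches
        if time.monotonic() > deadline:
            break
```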

Dual-read cost during the migration window. While the shadow index is live but not authoritative, you are paying for two sets of infrastructure. Storage roughly doubles — plan for 2x to 3x of your baseline for the duration. Query cost doubles during the dual-read phase, because every golden query fans out to both indexes. Embedding cost for incoming documents triples during the period where you are still writing to the old index, writing to the new shadow, and running validation embeddings. None of these are shockingly expensive on their own, but they all land in the same billing cycle, and the finance conversation goes better if someone flagged it in advance.

Backfill window vs. freshness. While you re-embed the corpus, new documents are still arriving. You have two choices: pause writes to the old index during the backfill (untenable for most live systems), or dual-write new documents to both indexes throughout the migration. Dual-writing is correct but requires the embedding pipeline to invoke both models for every new document, which means every ingest path has to know about the migration. Teams who forget this end up with a shadow index that is missing the last week of content when they flip the alias, and the first thing users search for is something recent.
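A dual-write ingest sketch; embed_v2, embed_v3, and the two upsert helpers are assumed wrappers around the old model, the new model, and the two indexes:

```python
# Dual-write during the backfill window: every new document is embedded
# with both models, so the shadow isn't missing recent content at cutover.
def ingest(doc_id: str, body: str):
    vec_old = embed_v2(body)
    vec_new = embed_v3(body)
    upsert_old_index(doc_id, vec_old)   # still authoritative
    upsert_new_index(doc_id, vec_new)   # keeps the shadow fresh
```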

Signals You Are Doing It As a Deploy, Not a Migration

The fastest diagnostic is to ask the team: if the new model is bad, how long does it take to roll back? If the answer is a number — minutes, an hour — you are running a migration. If the answer is "we'd have to re-run the old embedder," you are running a deploy, and the system will punish you for the category error eventually.

A few other signals:

  • The PR that "upgrades the embedder" is a one-line config change with no accompanying infrastructure change. There is no new column, no new index, no new alias, no new feature flag.
  • The rollout plan is "merge on Monday, watch the dashboards." There is no traffic ramp, no golden-query baseline, no dual-read window.
  • The metric being tracked is MTEB score or a benchmark on the model card. There is no internal retrieval eval on your domain, so you have no way to tell if your corpus actually benefits from the swap.
  • The model version is not recorded as metadata on each vector. When something goes wrong, the team cannot tell which documents are on the old embedder and which are on the new one, because the vectors don't carry their own provenance (a minimal metadata sketch follows this list).
  • "We'll do a staged rollout if anything breaks" — a rollout plan that only exists as a contingency is not a rollout plan, because by the time you know something has broken, enough users have hit the bad index that the regression is already baked in.

None of these are fatal individually. Together, they predict a retrieval incident with the kind of accuracy you wish your embedder had.

The Cultural Fix Is a Naming Fix

The technical playbook is well-understood at this point; the people who have lived through an embedding rotation incident mostly agree on the shadow-index-dual-read-staged-cutover pattern. What is less well-understood is that the playbook only gets invoked if the change is correctly named as a migration from the first moment it enters the planning doc. A change filed as "upgrade to embedding-model-v3" gets the governance of a library bump. A change filed as "migrate vector index from embedding-v2 to embedding-v3" gets the governance of a schema change. The words pick the process.

The forward-looking move is to make the distinction structural rather than cultural. Teams that have done enough of these start building an embedding-rotation runbook into their platform: a migration job that automatically provisions the shadow index, runs the dual-read evaluation against a pinned golden set, generates the agreement report, and refuses to promote the shadow to primary unless the agreement threshold is met and a human signs off. At that point, calling the change a "deploy" is no longer an option, because the deployment system will not let you deploy it.

Until that infrastructure exists at your company, the cheapest improvement is linguistic: the next time someone suggests rotating the embedder, call the ticket a migration. Write the rollback plan in the same document as the re-embed plan. Budget the storage doubling before the PR opens, not after the bill does. The model may be one line of config, but the manifold underneath is the whole database, and the manifold does not care what you call the PR.
