The Embedding Upgrade That Silently Re-Ranks Your Entire Corpus

· 9 min read
Tian Pan
Software Engineer

A new embedding model lands on the leaderboard. It scores higher than the one you shipped eighteen months ago, the API is a one-line change, and the dimensions even match. Someone files a ticket: "upgrade embedding model." It looks like swapping a logging library.

It is not. The embedding model is not a component of your retrieval system — it is the coordinate system your retrieval system lives in. Changing it does not improve your index. It invalidates it. And the cruelest part is that nothing crashes. No exception, no failed health check. Your search just starts returning subtly different results, and "subtly different" in a RAG pipeline means a different document feeds the model, which means a different answer reaches the user.

This is the failure mode that does not show up in code review. The diff is three lines. The blast radius is every document you have ever indexed.

A Vector Is Only Meaningful Inside Its Own Model

Every embedding model learns its own internal representation of meaning during training. The output is a floating-point array, but the array is not a universal description of the text — it is a set of coordinates in a space that this specific model invented.

Dimension 47 in one model might track something close to "texture." Dimension 47 in another might track sentiment, or nothing nameable at all. The two models were never trained to agree on what their axes mean. They cannot, because they never saw each other.

This has a hard consequence: a vector from model v1 and a vector from model v2 are not comparable, even when they describe the exact same sentence, even when both arrays have 1536 dimensions. Cosine similarity between them is a number, and the number is meaningless. The geometry that made similarity correspond to semantic closeness only exists within one model's space.

So when you "upgrade" the model, you are not getting better vectors for your corpus. You are getting vectors that live in a different universe than the millions you already stored. The old index is not degraded. It is simply written in a language the new query vectors do not speak.
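A toy sketch can make this concrete. It assumes, purely for illustration, that "model v2" is just model v1 with its axes shuffled and some signs flipped; real models differ in far messier ways, and every name here (`embed_v1`, `embed_v2`, the random "latent" documents) is hypothetical. Even in this best case, where both models carry identical information, comparing across them falls apart.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


DIM = 256
rng = random.Random(0)

# Toy "model v2": same information as v1, but with shuffled axes and flipped signs.
perm = list(range(DIM))
rng.shuffle(perm)
signs = [rng.choice((-1.0, 1.0)) for _ in range(DIM)]


def embed_v1(latent):
    return list(latent)


def embed_v2(latent):
    return [signs[i] * latent[perm[i]] for i in range(DIM)]


# Two hypothetical documents: doc_b is a lightly perturbed copy of doc_a.
doc_a = [rng.gauss(0, 1) for _ in range(DIM)]
doc_b = [a + 0.3 * rng.gauss(0, 1) for a in doc_a]

within_v1 = cosine(embed_v1(doc_a), embed_v1(doc_b))  # both vectors from v1
within_v2 = cosine(embed_v2(doc_a), embed_v2(doc_b))  # both vectors from v2
cross = cosine(embed_v1(doc_a), embed_v2(doc_a))      # same text, two models

print(f"within v1: {within_v1:.3f}")
print(f"within v2: {within_v2:.3f}")
print(f"cross v1/v2, same text: {cross:.3f}")
```

Inside either model the near-duplicate pair scores high, and the two within-model scores agree. The cross-model score for the very same text lands near zero, because the axes no longer line up.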

Unless a provider explicitly documents that two models share an embedding space, the only correct move is a full re-embed: run every document through the new model and rebuild the index from scratch. There is no incremental upgrade. There is no "embed new documents with v2 and let the old ones age out." That last idea is where teams quietly destroy their retrieval quality.

The Half-Migrated Index Is Worse Than Either Model

Picture the tempting shortcut. The new model is live. New documents get embedded with v2. Old documents keep their v1 vectors. One index, mixed contents, no downtime. It feels pragmatic.

What you have built is an index where similarity is computed across two incompatible coordinate systems. A query embedded with v2 will score every document — but the scores against v1 documents are noise and the scores against v2 documents are signal. The ranking interleaves them. Some results are real neighbors; some are accidents of two unrelated geometries overlapping.

The behavior this produces is genuinely hard to debug. The same query returns excellent results for recently indexed documents and poor results for older ones. Precision and recall both sag, but not uniformly — they sag as a function of when a document was last embedded, which is a variable nobody on the team is graphing. Your observability dashboard shows "retrieval quality down 12%," and that aggregate number hides the fact that one cohort of documents went from good to unusable while another stayed fine.

A team will spend a week tuning chunk sizes and reranker weights before someone thinks to ask which model embedded which row. The mixed index does not announce itself. It just makes your system behave like it has two personalities, and gives you no field to split the metrics on.

If you take one rule from this: a single index, or a single collection, must contain vectors from exactly one model version. Mixing is not a tradeoff with a downside. It is a correctness bug.
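One way to turn that rule into something the code enforces is a write-time guard. This is a minimal in-memory sketch, not any vector database's actual API: `VectorIndex`, `MixedModelError`, and the model identifiers are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field


class MixedModelError(ValueError):
    """Raised when a vector from the wrong model is written to an index."""


@dataclass
class VectorIndex:
    model_id: str          # the one model version this index accepts
    dim: int
    rows: dict = field(default_factory=dict)

    def upsert(self, doc_id, vector, model_id):
        # Reject the write outright instead of silently mixing geometries.
        if model_id != self.model_id:
            raise MixedModelError(
                f"index holds {self.model_id!r} vectors, got {model_id!r}"
            )
        if len(vector) != self.dim:
            raise ValueError(f"expected {self.dim} dimensions, got {len(vector)}")
        self.rows[doc_id] = list(vector)


index = VectorIndex(model_id="embed-v1", dim=3)
index.upsert("doc-1", [0.1, 0.2, 0.3], model_id="embed-v1")   # accepted

try:
    index.upsert("doc-2", [0.4, 0.5, 0.6], model_id="embed-v2")
except MixedModelError as err:
    print("rejected:", err)
```

Note that the dimension check alone would not have caught the second write. Matching dimensions is exactly the case where a mixed index slips through.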

A Higher Benchmark Score Is Not a Promise About Your Corpus

The reason the upgrade ticket gets approved is a number on a leaderboard. The new model beats the old one on MTEB, so it must retrieve better. This inference is weaker than it looks.

General-purpose embedding benchmarks measure average performance across a broad mix of public datasets. Your corpus is not that mix. It has its own vocabulary, its own document style, its own distribution of query phrasing. Research into domain-specific retrieval has found that a model's score on a general benchmark can fail to correlate with its score on a specialized domain — finance, legal, medical, and internal-jargon-heavy corpora routinely reorder the leaderboard once you measure on them directly.

So it is entirely possible — and frequently observed — for a model that benchmarks higher to retrieve worse on your specific data. The new model may have been tuned in ways that help general web text and hurt the dense, acronym-laden documents your support team writes. You will not know until you measure on your own corpus, with your own queries.

This is why the upgrade decision needs a golden dataset before it needs an API key. Assemble a few hundred of your most common and most critical queries, each paired with the documents that should be retrieved. Run that set against both models. Compare. If the new model does not win on your data, the leaderboard does not get a vote. The benchmark told you the model is good at being a general model. It did not tell you it is good at being your model.
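The comparison itself is not much code. Here is a minimal sketch using recall@k as the metric, which is one reasonable choice among several. The queries, document IDs, and the two canned result sets are made up for illustration; in practice `search_fn` would embed the query with the model under test and hit an index built with that same model.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant documents that appear in the top k results."""
    if not relevant:
        raise ValueError("each golden query needs at least one relevant document")
    hits = len(set(retrieved[:k]) & set(relevant))
    return hits / len(relevant)


def evaluate(search_fn, golden, k):
    """Mean recall@k over the golden set. search_fn maps a query to ranked doc IDs."""
    scores = [recall_at_k(search_fn(query), relevant, k)
              for query, relevant in golden.items()]
    return sum(scores) / len(scores)


# Hypothetical golden set: query -> documents that should come back.
golden = {
    "reset 2fa token": {"kb-101", "kb-207"},
    "sso login loop": {"kb-333"},
}

# Canned rankings standing in for real searches against each model's index.
old_model_results = {
    "reset 2fa token": ["kb-101", "kb-900", "kb-207"],
    "sso login loop": ["kb-333", "kb-101"],
}
new_model_results = {
    "reset 2fa token": ["kb-900", "kb-101", "kb-555"],
    "sso login loop": ["kb-012", "kb-044", "kb-333"],
}

old_score = evaluate(old_model_results.__getitem__, golden, k=2)
new_score = evaluate(new_model_results.__getitem__, golden, k=2)
print(f"old model recall@2: {old_score:.2f}")
print(f"new model recall@2: {new_score:.2f}")
```

In this invented example the "new" model loses on the golden set, which is the scenario the leaderboard cannot warn you about.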

The Migration Is a Schema Change, Not a Config Change

Once you accept that an upgrade means re-embedding the whole corpus, the project stops looking like a config tweak and starts looking like a database migration — because that is exactly what it is. The embedding model is a schema decision. Treat the cutover with the same discipline you would give a column type change on a production table.

The pattern that works is the shadow index. Stand up a second index built with the new model, alongside the live one. Keep both populated: a dual-write path sends every new and updated document to both indexes during the migration window, so the shadow does not fall stale while you backfill the history.
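The dual-write path can be sketched in a few lines. The `Index` class and the toy embedders are hypothetical stand-ins, and the failure policy shown is one assumption among several possible: a shadow failure must not break the user's write, so the document ID is recorded for the backfill to pick up later.

```python
class Index:
    """Toy in-memory index that owns its embedder, so callers cannot mix models."""

    def __init__(self, model_id, embed):
        self.model_id = model_id
        self.embed = embed
        self.rows = {}

    def upsert(self, doc_id, text):
        self.rows[doc_id] = self.embed(text)


def dual_write(doc_id, text, live, shadow, missed):
    # The live index is the source of truth: if this fails, the caller sees it.
    live.upsert(doc_id, text)
    try:
        shadow.upsert(doc_id, text)
    except Exception:
        # Never fail production traffic because the shadow is unhealthy.
        # Remember the ID so the backfill job re-embeds it later.
        missed.add(doc_id)


# Toy embedders with different output shapes, standing in for two real models.
live = Index("embed-v1", lambda text: [float(len(text))])
shadow = Index("embed-v2", lambda text: [float(len(text)), float(text.count(" "))])
missed = set()

dual_write("doc-1", "reset your password", live, shadow, missed)
```

Each index embeds with its own model, so there is no code path where a v1 vector can land in the v2 index.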

Then backfill — re-embed the existing corpus into the shadow index in the background, at a rate your API quota and budget tolerate. Throughout, production keeps serving from the old index. Nothing user-facing has changed yet.

The cutover is where most of the safety lives, and the rule is: never flip 100% of traffic at once. Both representations exist simultaneously, so use that. Route 5% of queries to the shadow index, watch your retrieval metrics and downstream answer quality, then ramp. A feature flag wraps the switch so rollback is one toggle, not a redeploy. If the shadow index disappoints, you discard it and you have lost compute, not customers.
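A percentage rollout like this is commonly implemented with a stable hash. The sketch below assumes routing is keyed on some stable identifier such as a user or session ID, so a given user consistently sees one index; `choose_index`, `shadow_percent`, and `shadow_enabled` are hypothetical names, and in a real system the last two would come from your feature-flag service.

```python
import hashlib


def bucket(key: str) -> int:
    """Map a key to a stable bucket in [0, 100)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def choose_index(key: str, shadow_percent: int, shadow_enabled: bool = True) -> str:
    # The kill switch wins over the percentage: rollback is one toggle.
    if not shadow_enabled:
        return "live"
    return "shadow" if bucket(key) < shadow_percent else "live"


users = [f"user-{i}" for i in range(10_000)]
at_5 = {u for u in users if choose_index(u, 5) == "shadow"}
at_25 = {u for u in users if choose_index(u, 25) == "shadow"}

print(f"{len(at_5) / len(users):.1%} of users on shadow at the 5% setting")
print("5% cohort stays on shadow at 25%:", at_5 <= at_25)
```

Because the bucket depends only on the key, ramping from 5% to 25% adds users to the shadow cohort without moving anyone back and forth between indexes.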

This sequence — shadow index, dual-write, background backfill, gradual flag-gated cutover, instant rollback — is not exotic. It is the standard zero-downtime migration playbook, applied to vectors. The mistake is not that the playbook is hard. The mistake is not realizing a vector migration needs one.

Budget the Backfill, Not Just the Token Price

Teams that do scope the re-embed often scope only the obvious line item: the embedding API bill. That number is reassuringly small. Re-embedding a million documents of a few hundred tokens each costs single-digit dollars on a current small embedding model. The token price is not the cost.

The real costs sit around it. There is the storage of running two full indexes in parallel for the length of the migration — at tens or hundreds of millions of vectors, the duplicated index can cost more per month than the entire re-embedding run cost once. There is the rebuild time: large approximate-nearest-neighbor indexes take hours to construct, and that window has to be planned. There is the engineering time to build the dual-write path, the backfill job, the comparison harness, and the flag plumbing — and if you build all of that under pressure after retrieval quality has already regressed, you are paying for it at the worst possible exchange rate.
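The gap between the two line items shows up in back-of-envelope arithmetic. The figures below are illustrative assumptions, not quotes: 300 tokens per document, a price of $0.02 per million tokens, and 1536-dimensional float32 vectors. Check your provider's current pricing and your index's actual storage format before relying on any of them.

```python
def embedding_cost_usd(num_docs, avg_tokens_per_doc, usd_per_million_tokens):
    """One-time API cost of embedding the corpus."""
    total_tokens = num_docs * avg_tokens_per_doc
    return total_tokens / 1_000_000 * usd_per_million_tokens


def raw_vector_bytes(num_vectors, dim, bytes_per_value=4):
    """Raw vector payload only (float32 by default), before any index overhead."""
    return num_vectors * dim * bytes_per_value


# Assumed inputs: 1M documents, 300 tokens each, $0.02 per million tokens.
api_cost = embedding_cost_usd(1_000_000, 300, 0.02)

# Assumed inputs: a 100M-vector shadow index at 1536 dimensions.
shadow_gb = raw_vector_bytes(100_000_000, 1536) / 1e9

print(f"one-time embedding bill: ${api_cost:.2f}")
print(f"extra raw vector storage for the shadow index: {shadow_gb:.1f} GB")
```

The storage figure covers only the raw vectors. Graph links, replicas, and metadata sit on top of it, and unlike the API bill it recurs for every month the two indexes run side by side.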

And there is rate-limiting. Re-embedding a large corpus can mean millions of API calls, even with batching. You will hit throughput ceilings, you will need retry and checkpoint logic so a failed run resumes instead of restarting, and on a genuinely large corpus the backfill is a multi-day job, not an afternoon. Provider-side improvements help at the margin: some newer model families share an embedding space across their own tiers, so you can move within a provider's lineup without a full re-embed. But any cross-provider move, or any jump across a major version, still means re-embedding everything.
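A resumable backfill loop might look roughly like this. It is a sketch under stated assumptions: `RateLimited` is a hypothetical exception standing in for whatever your provider's client raises, `embed_batch` and `write_batch` are callables you supply, `doc_ids` must come back in the same order on every run, and writes must be idempotent upserts because a crash between the write and the checkpoint replays one batch.

```python
import json
import os
import time


class RateLimited(Exception):
    """Hypothetical error a provider client raises on a throttled request."""


def backfill(doc_ids, embed_batch, write_batch, checkpoint_path,
             batch_size=100, max_retries=5, base_delay=1.0, sleep=time.sleep):
    # Resume from the last committed offset, if a previous run left one behind.
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            start = json.load(f)["next_offset"]

    for offset in range(start, len(doc_ids), batch_size):
        batch = doc_ids[offset:offset + batch_size]

        # Retry throttled requests with exponential backoff, then give up loudly.
        for attempt in range(max_retries):
            try:
                vectors = embed_batch(batch)
                break
            except RateLimited:
                if attempt == max_retries - 1:
                    raise
                sleep(base_delay * 2 ** attempt)

        write_batch(batch, vectors)

        # Commit progress atomically: write a temp file, then rename over the old one.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump({"next_offset": offset + len(batch)}, f)
        os.replace(tmp_path, checkpoint_path)
```

The checkpoint is written only after the batch is safely in the shadow index, so an interrupted run repeats at most one batch rather than the whole corpus.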

Treat the Embedding Model as a Pinned Dependency

The throughline is that the embedding model deserves to be treated like a schema, because functionally it is one. Three habits make that real.

Pin and record the version. Every index, and ideally every stored vector, should carry the model identifier and version that produced it. When something looks wrong, the first question — "which model embedded this?" — should be answerable from metadata, not from archaeology. This single field is what makes a mixed-index bug visible instead of invisible.
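Once each stored row carries that field, splitting a retrieval metric by it is trivial. The sketch below assumes each evaluation result records the `model_id` of the document that should have been retrieved and whether it was actually returned (`hit`); the field names and the sample data are hypothetical.

```python
from collections import defaultdict


def hit_rate_by_model(results):
    """Group evaluation results by the model that embedded the target document."""
    totals = defaultdict(lambda: [0, 0])   # model_id -> [hits, count]
    for row in results:
        entry = totals[row["model_id"]]
        entry[0] += 1 if row["hit"] else 0
        entry[1] += 1
    return {model: hits / count for model, (hits, count) in totals.items()}


# Made-up evaluation results from a half-migrated index queried with v2.
results = [
    {"doc_id": "kb-101", "model_id": "embed-v1", "hit": True},
    {"doc_id": "kb-102", "model_id": "embed-v1", "hit": False},
    {"doc_id": "kb-103", "model_id": "embed-v1", "hit": False},
    {"doc_id": "kb-104", "model_id": "embed-v1", "hit": False},
    {"doc_id": "kb-201", "model_id": "embed-v2", "hit": True},
    {"doc_id": "kb-202", "model_id": "embed-v2", "hit": True},
    {"doc_id": "kb-203", "model_id": "embed-v2", "hit": True},
    {"doc_id": "kb-204", "model_id": "embed-v2", "hit": False},
]

overall = sum(r["hit"] for r in results) / len(results)
per_model = hit_rate_by_model(results)

print(f"overall hit rate: {overall:.2f}")
for model, rate in sorted(per_model.items()):
    print(f"{model}: {rate:.2f}")
```

The overall number looks like a mild regression. The per-model split shows one cohort doing fine and the other nearly unusable, which is the signature of a mixed index.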

Keep a corpus-specific evaluation set. The golden query/document set is not just for the migration decision. It is your standing regression test for retrieval, and it is the only instrument that tells you the truth about a candidate model, because it measures your data instead of someone's average.

Plan the migration before you want one. The shadow index, dual-write, backfill, and flagged cutover should be a known runbook, not an improvisation. When the genuinely better model arrives — and it will — the team that already treats the embedding model as a versioned dependency runs a planned migration. The team that treats it as a swappable component ships a three-line diff and spends the next week explaining to users why the answers got worse.

The upgrade is not the risk. The silent re-ranking is. A model swap with no plan does not fail loudly — it just quietly rewrites which document your system trusts, one query at a time, until someone notices the answers drifted. Make the embedding model a thing you version, measure, and migrate on purpose, and that silence stops being dangerous.
