The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production
Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.
The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.
This is the playbook for migrating foundation models safely: the dual-write shadow period, the behavioral drift detection that actually works, the embedding compatibility problem, and the organizational coordination that determines whether the swap takes two weeks or two months.
Why "Just Run the Eval Suite" Is Necessary but Not Sufficient
The instinct is reasonable: you have an eval suite, the new model scores well on it, ship it. But eval suites encode the failure modes you already know about. Model migrations surface the ones you don't.
The fundamental problem is distributional. Your eval set is a curated sample — typically hundreds to low thousands of examples selected to cover known edge cases. Production traffic is a continuous, shifting distribution of inputs that includes combinations your eval authors never imagined. A model that scores 94% on your eval set can score 78% on the long tail of real traffic because the behavioral differences cluster in exactly the inputs you didn't test.
Three categories of drift consistently escape static evals:
Format and structure drift. OpenAI models tend to prefer markdown-heavy prompts with sectional delimiters, emphasis, and lists. Anthropic models respond better to XML tags for delineating input structure. A prompt that produces clean JSON from GPT-4o might produce JSON wrapped in markdown code fences from Claude, or vice versa. Your parsing code handles one format. The eval suite tests one format. Production breaks on the other.
Refusal pattern changes. Every model family has a different refusal boundary. A query that one model handles as a straightforward factual response, another might refuse or hedge with excessive caveats. These differences are invisible in standard accuracy evals because refusal isn't wrong — it's a different kind of right that happens to break your user experience.
Reasoning path divergence. Two models can arrive at the same final answer through different intermediate reasoning. When your system depends on chain-of-thought outputs — for logging, for downstream tool selection, for user-facing explanations — the answer being correct doesn't mean the behavior is equivalent.
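The format drift case is the easiest of the three to make concrete. The sketch below shows one way a parsing layer could tolerate a response wrapped in a single markdown code fence while still failing loudly on anything else. The function name `parse_model_json` and the regular expression are illustrative assumptions, not part of any provider SDK.

```python
import json
import re

# Matches a response that is exactly one fenced block (for example ```json ... ```)
# and captures the body. Hypothetical helper pattern, not a provider API.
_FENCE_RE = re.compile(r"^```[a-zA-Z0-9_-]*\s*\n(.*?)\n?```\s*$", re.DOTALL)


def parse_model_json(raw: str):
    """Parse JSON from a model response, tolerating one wrapping code fence.

    Raises json.JSONDecodeError if the payload is still not valid JSON, so the
    caller can count it as a conformance failure instead of guessing.
    """
    text = raw.strip()
    match = _FENCE_RE.match(text)
    if match:
        # Drop the fence lines and keep only the body.
        text = match.group(1).strip()
    return json.loads(text)
```

Calling `parse_model_json('{"a": 1}')` and `parse_model_json('```json\n{"a": 1}\n```')` both yield the same dictionary. A tolerant parser like this hides the drift from users, so it is worth counting how often the fence-stripping branch fires during a migration.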
The eval suite is your first gate, not your last. As a rough rule of thumb, it catches the 60% of regressions that are obvious. The remaining 40% require shadow testing against live traffic.
The Shadow Period: Dual-Write Architecture for Safe Migration
Shadow testing is the practice of running production requests through both your current and candidate models simultaneously, logging the candidate's responses without showing them to users. It's the single most effective technique for catching behavioral drift before it reaches customers.
The architecture is straightforward: your API gateway or routing layer duplicates each incoming request, sending it to both models in parallel. The production model's response goes to the user. The candidate model's response goes to a comparison pipeline. You capture response content, latency, token count, and any structured output alongside metadata about the request.
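As a minimal sketch of that duplication step, assuming an asyncio-based routing layer: `production` and `candidate` stand in for whatever async client calls you actually make, and `log` is a hypothetical sink for the comparison pipeline. None of these names come from a specific gateway or SDK.

```python
import asyncio
import time
from dataclasses import dataclass
from typing import Awaitable, Callable, Optional

# Assumed shape of a model call: takes the request text, returns response text.
ModelCall = Callable[[str], Awaitable[str]]


@dataclass
class ShadowRecord:
    request: str
    production_output: str
    candidate_output: Optional[str]
    candidate_error: Optional[str]
    production_latency_s: float
    candidate_latency_s: Optional[float]


async def _timed(call: ModelCall, request: str):
    """Run one model call and return (output, elapsed seconds)."""
    start = time.perf_counter()
    output = await call(request)
    return output, time.perf_counter() - start


async def handle_with_shadow(request: str, production: ModelCall,
                             candidate: ModelCall, log: Callable):
    """Return the production response; log the candidate response on the side."""
    # Start both calls at once so they run in parallel.
    prod_task = asyncio.create_task(_timed(production, request))
    cand_task = asyncio.create_task(_timed(candidate, request))

    # Only the production call is on the user's critical path.
    prod_output, prod_latency = await prod_task

    async def record_candidate():
        try:
            cand_output, cand_latency = await cand_task
            log(ShadowRecord(request, prod_output, cand_output, None,
                             prod_latency, cand_latency))
        except Exception as exc:
            # A candidate failure is data for the comparison, never a user error.
            log(ShadowRecord(request, prod_output, None, repr(exc),
                             prod_latency, None))

    # Return the background task so the caller can keep a reference to it.
    background = asyncio.create_task(record_candidate())
    return prod_output, background
```

The caller sends `prod_output` to the user immediately and keeps `background` alive until it finishes. This sketch only measures total latency and does not handle a failing production call; time-to-first-token would need a streaming client.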
A few operational realities that teams discover the hard way:
Shadow testing doubles your API spend. For the duration of the shadow period, you're paying for two inference calls per request. Set budget alerts and plan for a shadow period of one to two weeks — long enough to capture a full business cycle of traffic patterns, short enough to not hemorrhage money.
Latency comparison requires careful measurement. The candidate model runs off the user-facing path, often under different concurrency, rate limits, and streaming behavior, so its latency numbers in shadow mode may differ from what it would show in production. Measure time-to-first-token and total generation time separately, and be skeptical of latency improvements that only appear in shadow mode.
Automated comparison is essential but imperfect. You need a comparison pipeline that evaluates semantic similarity, format conformance, and task-specific correctness between the two outputs. LLM-as-judge works for semantic comparison. Deterministic checks work for structured output conformance. Neither catches everything — budget for human review of a random sample, especially for high-stakes outputs.
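The deterministic half of such a pipeline can be sketched with the standard library alone. Everything here is an assumption for illustration: the function names, the `required_keys` notion of a schema, and the use of `difflib` as a cheap lexical signal. The lexical ratio is not a substitute for semantic comparison; an LLM-as-judge or embedding-based score would slot in alongside it.

```python
import difflib
import json
import random


def compare_pair(production_output: str, candidate_output: str, required_keys=()):
    """Run deterministic checks on one pair of outputs."""

    def conforms(text: str) -> bool:
        # Conformance here means: valid JSON object containing all required keys.
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            return False
        return isinstance(payload, dict) and all(k in payload for k in required_keys)

    return {
        "production_conforms": conforms(production_output),
        "candidate_conforms": conforms(candidate_output),
        # Character-level ratio in [0, 1]; a lexical stand-in, NOT semantic similarity.
        "lexical_similarity": difflib.SequenceMatcher(
            None, production_output, candidate_output
        ).ratio(),
    }


def sample_for_human_review(records, k, seed=0):
    """Draw a reproducible random sample of paired records for manual review."""
    records = list(records)
    return random.Random(seed).sample(records, min(k, len(records)))
```

Pairs where the production output conforms and the candidate output does not are usually the first ones worth reading by hand.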
The shadow period produces a comparison dataset: thousands of real production inputs with paired outputs from both models. This dataset is more valuable than any benchmark. It tells you exactly where the new model diverges on your actual workload.
What to Measure During Shadow
Aggregate metrics hide the important signals. Break your analysis down by:
- Input category. If you have intent classification, measure divergence per intent. Regressions cluster in specific task types, not uniformly.
- Output length distribution. A model that's 30% more verbose costs more and may degrade user experience even if accuracy is identical.
- Structured output conformance rate. If you use JSON mode or function calling, measure schema validation pass rates separately. A 2% drop in conformance rate compounds at scale: at 10,000 requests per day, for example, it means roughly 200 additional failed requests per day.
- Error and refusal rates. Track how often each model refuses, hedges, or produces error responses. A new model that refuses 5% of queries the old model handled is a regression even if every non-refused response is better.
- Tail latency. p50 latency might improve while p99 gets worse. For user-facing applications, p99 is the number that determines whether your SLA holds.
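The breakdown above can be computed per model with a small aggregation over the shadow logs. This is a sketch under assumed field names (`category`, `refused`, `schema_valid`, `output_tokens`, `latency_s`); your logging schema will differ. The percentile uses the nearest-rank definition, one of several in common use.

```python
import math
from collections import defaultdict


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty sequence; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize_by_category(records):
    """Group one model's shadow records by input category and summarize each group."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["category"]].append(record)

    summary = {}
    for category, rows in grouped.items():
        n = len(rows)
        latencies = [r["latency_s"] for r in rows]
        summary[category] = {
            "count": n,
            "refusal_rate": sum(r["refused"] for r in rows) / n,
            "conformance_rate": sum(r["schema_valid"] for r in rows) / n,
            "mean_output_tokens": sum(r["output_tokens"] for r in rows) / n,
            "p50_latency_s": percentile(latencies, 50),
            "p99_latency_s": percentile(latencies, 99),
        }
    return summary
```

Running it once on the production model's records and once on the candidate's gives two tables that can be compared category by category, which is where clustered regressions become visible.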
Embedding Model Migration: The Reindexing Problem
When the model you're migrating is an embedding model — or when your new LLM requires a different embedding model for RAG — the migration complexity increases by an order of magnitude. Embeddings are not interchangeable across models. A vector produced by text-embedding-ada-002 is meaningless in an index built for text-embedding-3-large. The dimensionality may differ. The semantic space is fundamentally different. You cannot mix old and new vectors in the same index.
This means reindexing your entire document corpus, which creates three problems:
The zero-downtime requirement. Your RAG system can't go offline for the hours or days it takes to re-embed millions of documents. The standard pattern is blue-green indexing: build the new index alongside the old one, then switch the query path atomically once the new index is complete and validated. This requires enough infrastructure to run two indexes simultaneously.
Validation before cutover. A new embedding model changes retrieval behavior. Documents that ranked highly for a given query under the old model may rank differently under the new one. Before switching, run your retrieval eval suite against the new index and spot-check queries where the ranking order changed significantly. Retrieval changes cascade into generation changes — a 5% shift in what gets retrieved can produce a 15% shift in what gets generated.
The versioning trap. If you store embeddings in multiple places — a primary vector database, a cache layer, a feature store — all of them need to be updated atomically. Partial migration, where some queries hit old embeddings and others hit new ones, produces inconsistent behavior that's extremely difficult to debug. Version-tag your embeddings and enforce that the query path only reads from a single version at a time.
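One way to picture version tagging combined with a blue-green switch is the in-memory sketch below. `VersionedIndex` and `IndexRouter` are hypothetical classes standing in for whatever alias or pointer mechanism your vector store provides; they are not a real database API. The point is the two guards: writes and queries are rejected unless their version tag matches, and cutover is a single reference swap that only happens after a completeness check.

```python
from dataclasses import dataclass, field


@dataclass
class VersionedIndex:
    embedding_version: str
    dimension: int
    vectors: dict = field(default_factory=dict)  # doc_id -> vector

    def add(self, doc_id, vector, embedding_version):
        # Refuse vectors produced by a different embedding model.
        if embedding_version != self.embedding_version:
            raise ValueError(
                f"index is {self.embedding_version}, vector is {embedding_version}"
            )
        if len(vector) != self.dimension:
            raise ValueError(f"expected {self.dimension} dims, got {len(vector)}")
        self.vectors[doc_id] = list(vector)


class IndexRouter:
    """Holds the live index and swaps it in one assignment at cutover."""

    def __init__(self, live: VersionedIndex):
        self._live = live

    @property
    def live_version(self) -> str:
        return self._live.embedding_version

    def index_for_query(self, query_vector, embedding_version) -> VersionedIndex:
        live = self._live  # read once so a swap mid-call cannot mix versions
        if embedding_version != live.embedding_version:
            raise ValueError("query was embedded with a non-live model version")
        if len(query_vector) != live.dimension:
            raise ValueError("query vector dimension does not match live index")
        return live

    def cut_over(self, candidate: VersionedIndex, expected_doc_ids):
        # Validate completeness before switching the query path.
        missing = set(expected_doc_ids) - set(candidate.vectors)
        if missing:
            raise RuntimeError(f"{len(missing)} documents missing from new index")
        self._live = candidate
```

Caches and feature stores need the same tag check on their read paths; the router only protects the index it owns.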
Teams that plan for a one-day embedding migration typically finish in one to two weeks. The re-embedding itself is the comparatively fast part. The validation and coordination are slow.
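Part of that validation can be automated. As a sketch of the spot-check step before cutover, the functions below compare the top-k results each index returns for the same queries and flag the queries whose results changed most. The names, the Jaccard overlap measure, and the 0.5 threshold are illustrative assumptions; a rank-aware measure may suit your workload better.

```python
def top_k_overlap(old_ranking, new_ranking, k=10):
    """Jaccard overlap of the top-k document ids from two rankings (0.0 to 1.0)."""
    old_top, new_top = set(old_ranking[:k]), set(new_ranking[:k])
    if not old_top and not new_top:
        return 1.0  # both empty: nothing changed
    return len(old_top & new_top) / len(old_top | new_top)


def queries_to_spot_check(old_results, new_results, k=10, threshold=0.5):
    """Return (query, overlap) pairs below the threshold, most changed first.

    old_results and new_results map each query string to a ranked list of
    document ids from the old and new index respectively.
    """
    flagged = []
    for query, old_ranking in old_results.items():
        overlap = top_k_overlap(old_ranking, new_results.get(query, []), k)
        if overlap < threshold:
            flagged.append((query, overlap))
    return sorted(flagged, key=lambda item: item[1])
```

Low overlap is not automatically a regression, since the new model may simply retrieve better documents. It is a pointer to where a human should look.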
Prompt Compatibility: The Migration Tax Nobody Budgets For
