The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production
Every team that has been running LLM-powered features for more than six months has faced the same moment: a better model drops, the current provider raises prices, or the model you depend on gets deprecated with 90 days' notice. You need to swap the foundation model underneath a running production system. Most teams treat this as a configuration change — update the model ID, re-run the eval suite, ship it. Then they spend the next two weeks firefighting regressions that the evals never caught.
The model migration problem is fundamentally different from traditional software upgrades. When you swap a database version, the query semantics are preserved. When you swap a foundation model, everything changes: output distributions shift, edge-case behaviors diverge, and downstream systems that learned to depend on specific model quirks silently break. The failure modes are distributional, not binary, which means they hide in the long tail where your eval suite has the least coverage.
Here is the migration playbook that accounts for these realities — from the shadow testing period that catches behavioral drift before users do, through the embedding compatibility layer that prevents your vector indexes from becoming useless overnight, to the rollback strategy that actually works when the failure modes are subtle.
Why "Just Run the Eval Suite" Is Necessary but Not Sufficient
The standard approach to model migration goes something like this: point your eval suite at the new model, compare scores, and if the numbers look good, ship it. This approach catches roughly 60-70% of real-world regressions. The remaining 30-40% hide in places your eval suite was never designed to look.
The core problem is distributional mismatch. Your eval suite is a curated sample of inputs that somebody thought to include. Production traffic is a messy, long-tailed distribution that includes inputs nobody anticipated. A model swap can score identically on your eval set while producing subtly different outputs on the tail of your real traffic — different JSON formatting conventions, changed refusal boundaries, altered confidence calibration, or shifted tone that breaks downstream parsers and user expectations.
Three specific failure categories that evals routinely miss:
- Format drift: The new model wraps JSON in markdown code blocks when the old one returned raw JSON. Your parser handles both, but the extra tokens blow your latency budget on high-volume endpoints.
- Refusal boundary shift: The new model refuses 3% of queries that the old model handled, or — worse — handles queries the old model correctly refused. Neither direction shows up as an accuracy regression in evals because the eval set doesn't include enough boundary cases.
- Calibration change: The new model expresses higher confidence in uncertain answers. Downstream systems that threshold on stated confidence start making different routing decisions, cascading into behavior changes that look like bugs in unrelated features.
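Format drift in particular is cheap to defend against in code. A minimal sketch of a fence-tolerant JSON parser, handling both raw JSON and markdown-wrapped JSON per the first bullet:

```python
import json
import re

def parse_model_json(raw: str) -> dict:
    """Parse JSON from a model response, tolerating markdown code fences.

    A new model may wrap JSON in ```json ... ``` fences even when the old
    one returned raw JSON; handling both keeps the parser stable across swaps.
    """
    text = raw.strip()
    # Strip a surrounding markdown code fence if one is present.
    fence = re.match(r"^```(?:json)?\s*\n(.*?)\n```$", text, re.DOTALL)
    if fence:
        text = fence.group(1)
    return json.loads(text)
```

Note that tolerance only fixes correctness; the extra fence tokens still count against your latency and cost budgets, which is why the drift is worth catching in shadow testing rather than papering over forever.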
Evals are your pre-flight checklist. They tell you whether the new model is plausibly ready for production. They do not tell you whether production is ready for the new model.
The Shadow Period: Dual-Running Models on Live Traffic
The gap between eval performance and production behavior closes when you test on actual production traffic. Shadow testing — routing live requests to both the current and candidate models, serving only the current model's response to users — gives you a realistic picture of how the new model behaves on your real input distribution.
Architecture: Implement traffic duplication at the application layer rather than the load balancer. You need to capture the full request context (including conversation history, system prompts, and tool definitions) that shapes model behavior, not just the raw HTTP payload. The shadow model receives identical inputs and its outputs are logged for comparison but never surfaced to users.
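A minimal sketch of that application-layer duplication, with `primary`, `shadow`, and `log` as hypothetical callables standing in for your real model clients and comparison logger:

```python
import threading
from typing import Callable

def with_shadow(primary: Callable[[dict], str],
                shadow: Callable[[dict], str],
                log: Callable[[dict, str], None]) -> Callable[[dict], str]:
    """Wrap a model call so every request is also sent to a shadow model.

    Only the primary model's response is ever returned to the caller; the
    shadow call runs on a background thread and its output goes to the
    comparison log. All names here are stand-ins, not a real client API.
    """
    def call(request: dict) -> str:
        def run_shadow():
            try:
                log(request, shadow(request))
            except Exception:
                pass  # shadow failures must never affect production
        # Daemon thread: shadow latency and errors stay off the user path.
        threading.Thread(target=run_shadow, daemon=True).start()
        return primary(request)  # only the current model's answer is served
    return call
```

The real version would pass the full request context (conversation history, system prompt, tool definitions) in `request`, since that context shapes model behavior just as much as the user message does.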
Duration: Run the shadow period for at least one full business cycle. If your product has weekly usage patterns (lighter weekends, heavier Monday mornings), a single midweek day of shadow traffic misses the input distribution shifts that come with different usage contexts. For most B2B applications, two weeks is the minimum that captures enough tail-end traffic to be meaningful.
What to measure during the shadow period:
- Semantic equivalence rate: What percentage of response pairs are semantically equivalent despite surface-level differences? This is your primary health metric. Use an LLM judge with specific rubrics rather than string similarity.
- Disagreement analysis: For the cases where the models diverge, classify the disagreements. Is the new model better, worse, or just different? "Different" is not free — it means retraining user expectations and updating downstream integrations.
- Latency distribution: Compare p50, p95, and p99 latencies. A model that is 20% faster at p50 but 40% slower at p99 will improve dashboards while degrading the experience for your most complex queries.
- Token economics: Measure actual token usage on production traffic. A model that produces more concise outputs saves money; one that generates longer chain-of-thought reasoning before answering costs more than the per-token price difference suggests.
- Error and refusal rates: Track refusals, malformed outputs, and timeout rates separately. A 0.5% increase in refusal rate on a high-traffic endpoint means hundreds of failed user interactions per day.
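Two of the measurements above reduce to small helpers over the shadow logs. A sketch, assuming per-request latencies and response texts have been captured; the refusal markers are placeholder assumptions, not a production classifier:

```python
from statistics import quantiles

def latency_percentiles(latencies_ms: list[float]) -> dict:
    """p50/p95/p99 from per-request latencies in milliseconds."""
    cuts = quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

def refusal_rate(responses: list[str],
                 markers: tuple[str, ...] = ("I can't", "I cannot")) -> float:
    """Crude refusal counter. The string markers here are illustrative;
    a production system should classify refusals with a tuned judge."""
    refused = sum(any(m in r for m in markers) for r in responses)
    return refused / len(responses)
```

Run these per endpoint, for both models, over the same time window; the interesting signal is the delta between old and new, not either number in isolation.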
Cost consideration: Shadow testing doubles your LLM API costs for the duration of the test. Budget for this explicitly. If cost is prohibitive on 100% of traffic, sample — but sample stratified by request type, not randomly. Random sampling under-represents rare but important request patterns.
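A sketch of stratified sampling over logged requests, assuming each request record carries a "type" field (that schema is an assumption about your logging):

```python
import random
from collections import defaultdict

def stratified_sample(requests: list[dict], rate: float,
                      seed: int = 0) -> list[dict]:
    """Sample `rate` of shadow traffic per request type rather than globally,
    so rare but important request patterns are still represented."""
    by_type = defaultdict(list)
    for req in requests:
        by_type[req["type"]].append(req)
    rng = random.Random(seed)
    sample = []
    for reqs in by_type.values():
        # Keep at least one request per stratum, even for tiny strata.
        k = max(1, round(len(reqs) * rate))
        sample.extend(rng.sample(reqs, k))
    return sample
```

The `max(1, ...)` floor is the whole point: under random sampling, a request type that is 0.1% of traffic would often be missed entirely at a 10% sample rate.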
Embedding Compatibility: The Migration Everyone Forgets
When your model swap includes the embedding model — or when a provider updates embedding model weights behind a version bump — every vector in your database becomes silently incompatible with new queries. Old embeddings and new query vectors exist in different geometric spaces. Cosine similarity scores between them are meaningless, but they still return numbers that look plausible, so the failure is silent: retrieval quality degrades without any error signals.
The dual-index strategy is the safest path for production embedding migrations:
- Add a parallel column for the new embeddings alongside the existing ones. Build the new index concurrently so production reads are uninterrupted.
- Batch re-embed your corpus asynchronously. Process in batches of 50-100 documents with rate limiting to stay within API quotas. Track progress so you can resume if the job fails partway through.
- Validate overlap before switching. Run your most common queries against both indexes and measure top-K result overlap. An 80%+ overlap in top-10 results indicates the new model preserves your retrieval semantics. Below 70%, investigate whether the new model is genuinely better or just different.
- Feature-flag the switch. A single environment variable controls which index serves production queries. This gives you instant rollback — no redeployment, no re-indexing.
- Clean up after stabilization. Drop the old column and rename the new index only after a stability period (one week minimum). Until then, keep both indexes warm.
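The overlap validation step above can be sketched as a pair of helpers over the document-ID lists returned by each index:

```python
def topk_overlap(old_results: list[str], new_results: list[str],
                 k: int = 10) -> float:
    """Fraction of top-k document IDs shared between old and new indexes
    for a single query."""
    old_k, new_k = set(old_results[:k]), set(new_results[:k])
    return len(old_k & new_k) / k

def mean_overlap(query_results: list[tuple[list[str], list[str]]],
                 k: int = 10) -> float:
    """Average top-k overlap across a set of (old, new) result pairs.
    This is the go/no-go number for the 80%/70% thresholds in the text."""
    pairs = [topk_overlap(old, new, k) for old, new in query_results]
    return sum(pairs) / len(pairs)
```

Run `mean_overlap` over your most common production queries; when it lands in the ambiguous 70-80% band, inspect the disagreeing queries by hand before deciding whether the new model is better or merely different.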
The adapter shortcut: If full re-indexing is impractical (corpus exceeds 100M vectors or re-embedding costs are prohibitive), a lightweight linear transformation can map new query embeddings into the old vector space. Research on drift-adapters shows this preserves 90-95% of retrieval quality while completely avoiding re-indexing. The tradeoff is permanent technical debt: you are carrying a compatibility shim that every future migration must account for.
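One way to fit such an adapter, assuming numpy and a sample of documents embedded under both models with rows aligned pairwise; this is a generic least-squares sketch, not the exact method from the cited work:

```python
import numpy as np

def fit_drift_adapter(new_embs: np.ndarray, old_embs: np.ndarray) -> np.ndarray:
    """Fit a linear map W so new-model embeddings project into the old
    vector space: old ~= new @ W. Fitted by least squares on a sample of
    texts embedded under both models (rows assumed aligned pairwise)."""
    W, *_ = np.linalg.lstsq(new_embs, old_embs, rcond=None)
    return W

def adapt_query(query_emb: np.ndarray, W: np.ndarray) -> np.ndarray:
    """Map a new-model query embedding into the old index's space."""
    return query_emb @ W
```

The adapter is applied only at query time, so the stored index is untouched; the cost is that every future migration now has to reason about the shim as well as the models.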
Version everything: Store the embedding model version alongside each vector. When you query, ensure the query embedding was produced by the same model version as the stored vectors. This sounds obvious, but the most common embedding migration bug is a mismatch between query and document embedding models that produces degraded — not broken — retrieval.
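A minimal version guard, assuming vectors are stored with their embedding model version (the record shape here is illustrative):

```python
from dataclasses import dataclass

@dataclass
class StoredVector:
    doc_id: str
    embedding: list[float]
    model_version: str  # e.g. an "<model-name>@<date>" tag of your choosing

def check_version(query_model_version: str,
                  vectors: list[StoredVector]) -> None:
    """Refuse to compare a query embedding against vectors produced by a
    different embedding model version. Mismatches degrade retrieval
    silently, so failing loudly here is the safer default."""
    mismatched = {v.model_version for v in vectors} - {query_model_version}
    if mismatched:
        raise ValueError(
            f"query embedded with {query_model_version!r} but index "
            f"contains vectors from {sorted(mismatched)}"
        )
```

During the dual-index window this check is what guarantees new queries hit the new index and old queries hit the old one, rather than crossing spaces.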
The Rollback Strategy That Actually Works
Model rollbacks in LLM systems are harder than they look because the failure signal is often delayed and ambiguous. A user who gets a slightly worse response does not file a bug report — they just trust your product a little less. By the time the degradation shows up in aggregate metrics, it has been affecting users for days.
Design your rollback before you start the migration:
- Keep the old model endpoint warm for the entire migration period. "Warm" means actively serving shadow traffic, not just theoretically available. A cold-started LLM endpoint has different latency characteristics than a warm one, and your rollback latency matters when you are reverting under pressure.
- Define rollback triggers in advance. Specific thresholds on specific metrics, decided before the migration starts, not in the heat of an incident. Examples: refusal rate exceeds baseline by more than 0.5%, p99 latency exceeds baseline by more than 30%, semantic equivalence drops below 90%.
- Automate the rollback mechanism. A feature flag that routes traffic back to the old model, flipped by a single API call or config change. If your rollback requires a deployment, it is too slow for the subtle-regression case where you discover the problem at 2 AM on day three.
- Account for state contamination. If the new model has been writing to conversation histories, knowledge bases, or other persistent stores, rolling back the model does not roll back that state. Decide in advance whether contaminated sessions need special handling.
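The pre-agreed triggers can be encoded as a single check that monitoring runs on every metrics window. The threshold values below are the illustrative ones from the examples above and should be tuned per system:

```python
def should_roll_back(baseline: dict, current: dict) -> list[str]:
    """Evaluate rollback triggers against a baseline metrics snapshot.
    Returns the list of breached triggers; non-empty means flip the flag.
    Metric names and thresholds here are illustrative."""
    breaches = []
    if current["refusal_rate"] > baseline["refusal_rate"] + 0.005:
        breaches.append("refusal rate > baseline + 0.5%")
    if current["p99_ms"] > baseline["p99_ms"] * 1.30:
        breaches.append("p99 latency > baseline + 30%")
    if current["semantic_equivalence"] < 0.90:
        breaches.append("semantic equivalence < 90%")
    return breaches
```

The point of encoding the triggers is that the 2 AM decision becomes mechanical: if the list is non-empty, you roll back, and the debate about whether the regression is "real" happens afterward.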
Staged rollout as an alternative to full cutover: Rather than switching 100% of traffic at once, route progressively — 5%, then 25%, then 50%, then 100% — with a stabilization period and metric review at each stage. This limits blast radius and gives you earlier signal on real-user impact. The cost is a longer migration window and the operational complexity of serving from two models simultaneously.
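Deterministic percentage routing is a small amount of code. A sketch using hash bucketing, so each user stays on the same model as the rollout percentage grows from 5 to 25 to 50 to 100:

```python
import hashlib

def routed_to_new_model(user_id: str, rollout_pct: int) -> bool:
    """Hash the user ID into one of 100 stable buckets and route buckets
    below the rollout percentage to the new model. Because the bucket is
    deterministic, raising rollout_pct only ever adds users to the new
    model; nobody flip-flops between models across requests."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

Hashing on user ID rather than per-request randomness also keeps each user's conversation history internally consistent, which matters for the state-contamination concern above.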
The Prompt Compatibility Layer
Foundation models do not respond identically to the same prompt. A prompt optimized for GPT-4 produces different results on Claude, and a prompt tuned for Claude 3 may underperform on Claude 4. Model migration requires prompt migration.
Audit your prompt inventory before starting. Every system prompt, few-shot example set, and output format instruction needs review against the new model. The prompts that need the most attention are the ones with the most specific output format requirements — JSON schemas, structured extraction templates, and tool-use instructions.
Run prompt-level A/B tests during the shadow period. For each major prompt, compare outputs between the old prompt on the old model versus the adapted prompt on the new model. This isolates whether regressions are caused by the model change or the prompt change.
Common prompt adaptation patterns:
- Instruction specificity: Newer models often need less explicit formatting instruction but more explicit constraint statements. If your old prompt spends 200 tokens explaining JSON formatting, the new model might need only 50 tokens there but 100 tokens on boundary conditions.
- Few-shot example count: Model upgrades frequently reduce the number of few-shot examples needed. Test with fewer examples — not just because it saves tokens, but because unnecessary examples can actually degrade output quality on more capable models by over-constraining the output space.
- System prompt restructuring: Different models have different sensitivities to instruction ordering within the system prompt. What worked as a mid-prompt instruction on the old model might need to be a final instruction on the new one.
The Migration Timeline in Practice
A realistic model migration for a production system with meaningful traffic looks like this:
Week 1: Preparation. Audit prompt inventory, set up dual-model infrastructure, define rollback triggers and success metrics, begin embedding re-indexing if applicable.
Week 2-3: Shadow period. Run both models on production traffic. Analyze disagreements daily. Adapt prompts based on findings. Fix format incompatibilities.
Week 4: Staged rollout. Route 5% of production traffic to the new model. Monitor real user metrics — not just model quality metrics, but product metrics like task completion rate and session length.
Week 5: Scale-up. Increase to 25%, then 50%. Each increase includes a 48-hour stabilization window.
Week 6: Full cutover and stabilization. Route 100% to the new model. Keep the old model warm for one additional week. Monitor for delayed-onset regressions (metrics that look fine in the first few days but degrade as the new model encounters rarer input patterns).
Week 7: Cleanup. Decommission the old model endpoint. Drop legacy embedding indexes. Update documentation and runbooks.
Six to seven weeks feels slow for what is nominally a configuration change. But the teams that skip steps here are the ones that end up doing emergency rollbacks at 2 AM — which takes even longer when you account for the incident response, root cause analysis, and trust repair.
What This Means for Your Architecture
The real lesson of model migration is architectural: your system should be built to assume the model will change. If swapping the model requires touching dozens of files, updating hardcoded model names, re-indexing without a dual-index capability, or redeploying to roll back, your architecture is coupling you to a specific model in ways that will cost you every time the landscape shifts.
Abstract the model behind an interface. Version your embeddings. Feature-flag your model routing. Build the shadow-testing infrastructure once and keep it warm. These are not migration-specific investments — they are the operational foundations that make your LLM system maintainable as the ecosystem moves faster than any single team can react to.
The model you are running today will not be the model you are running in six months. Build accordingly.
Sources
- https://www.codeant.ai/blogs/llm-shadow-traffic-ab-testing
- https://dev.to/humzakt/zero-downtime-embedding-migration-switching-from-text-embedding-004-to-text-embedding-3-large-in-1292
- https://arxiv.org/html/2603.03111v1
- https://byaiteam.com/blog/2025/12/30/llm-model-drift-detect-prevent-and-mitigate-failures/
- https://wallaroo.ai/ai-production-experiments-the-art-of-a-b-testing-and-shadow-deployments/
- https://medium.com/data-science-collective/different-embedding-models-different-spaces-the-hidden-cost-of-model-upgrades-899db24ad233
- https://sparkco.ai/blog/mastering-embedding-versioning-best-practices-future-trends
- https://insightfinder.com/blog/hidden-cost-llm-drift-detection/
