The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production
Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.
The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.
This is the playbook for migrating foundation models safely: the dual-write shadow period, the behavioral drift detection that actually works, the embedding compatibility problem, and the organizational coordination that determines whether the swap takes two weeks or two months.
Why "Just Run the Eval Suite" Is Necessary but Not Sufficient
The instinct is reasonable: you have an eval suite, the new model scores well on it, ship it. But eval suites encode the failure modes you already know about. Model migrations surface the ones you don't.
The fundamental problem is distributional. Your eval set is a curated sample — typically hundreds to low thousands of examples selected to cover known edge cases. Production traffic is a continuous, shifting distribution of inputs that includes combinations your eval authors never imagined. A model that scores 94% on your eval set can score 78% on the long tail of real traffic because the behavioral differences cluster in exactly the inputs you didn't test.
Three categories of drift consistently escape static evals:
Format and structure drift. OpenAI models tend to prefer markdown-heavy prompts with sectional delimiters, emphasis, and lists. Anthropic models respond better to XML tags for delineating input structure. A prompt that produces clean JSON from GPT-4o might produce JSON wrapped in markdown code fences from Claude, or vice versa. Your parsing code handles one format. The eval suite tests one format. Production breaks on the other.
Refusal pattern changes. Every model family has a different refusal boundary. A query that one model handles as a straightforward factual response, another might refuse or hedge with excessive caveats. These differences are invisible in standard accuracy evals because refusal isn't wrong — it's a different kind of right that happens to break your user experience.
Reasoning path divergence. Two models can arrive at the same final answer through different intermediate reasoning. When your system depends on chain-of-thought outputs — for logging, for downstream tool selection, for user-facing explanations — the answer being correct doesn't mean the behavior is equivalent.
The eval suite is your first gate, not your last. It catches the 60% of regressions that are obvious. The remaining 40% require shadow testing against live traffic.
The Shadow Period: Dual-Write Architecture for Safe Migration
Shadow testing is the practice of running production requests through both your current and candidate models simultaneously, logging the candidate's responses without showing them to users. It's the single most effective technique for catching behavioral drift before it reaches customers.
The architecture is straightforward: your API gateway or routing layer duplicates each incoming request, sending it to both models in parallel. The production model's response goes to the user. The candidate model's response goes to a comparison pipeline. You capture response content, latency, token count, and any structured output alongside metadata about the request.
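The duplication logic can be sketched in a few lines of asyncio. This is a minimal illustration, not a production gateway: `call_model` is a hypothetical stub standing in for your provider SDK calls, and the in-memory `SHADOW_LOG` stands in for a real comparison pipeline. The key property is that the user only ever waits on the production call, and a shadow failure can never surface as a production error.

```python
import asyncio
import time

# Hypothetical stub standing in for a real provider SDK call.
async def call_model(model: str, prompt: str) -> dict:
    start = time.monotonic()
    # ... provider API call would go here ...
    text = f"[{model}] response to: {prompt}"
    return {"model": model, "text": text, "latency_s": time.monotonic() - start}

SHADOW_LOG: list[dict] = []  # stand-in for the comparison pipeline

async def handle_request(prompt: str) -> dict:
    """Serve from the production model; shadow the candidate without
    blocking the user's response."""
    prod_task = asyncio.create_task(call_model("prod-model", prompt))
    cand_task = asyncio.create_task(call_model("candidate-model", prompt))

    prod = await prod_task  # the user waits only for production

    async def log_candidate() -> None:
        try:
            cand = await cand_task
            SHADOW_LOG.append({"prompt": prompt, "prod": prod, "cand": cand})
        except Exception:
            pass  # a shadow failure must never affect production

    asyncio.create_task(log_candidate())  # fire-and-forget
    return prod
```

Note that both tasks are created before the production response is awaited, so the two calls run concurrently rather than doubling the user-facing latency.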
A few operational realities that teams discover the hard way:
Shadow testing doubles your API spend. For the duration of the shadow period, you're paying for two inference calls per request. Set budget alerts and plan for a shadow period of one to two weeks — long enough to capture a full business cycle of traffic patterns, short enough to not hemorrhage money.
Latency comparison requires careful measurement. The candidate call isn't on the user's critical path, so it may be queued, batched, or rate-limited differently in shadow mode than it would be in production. Measure time-to-first-token and total generation time separately, and be skeptical of latency improvements that only appear in shadow mode.
Automated comparison is essential but imperfect. You need a comparison pipeline that evaluates semantic similarity, format conformance, and task-specific correctness between the two outputs. LLM-as-judge works for semantic comparison. Deterministic checks work for structured output conformance. Neither catches everything — budget for human review of a random sample, especially for high-stakes outputs.
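A deterministic slice of that comparison pipeline can be sketched as below. The schema check is real logic you can ship; the token-overlap function is a crude stand-in for the semantic comparison, where a production pipeline would use an LLM judge or embedding cosine similarity instead. The fence-stripping step reflects the format-drift failure described earlier: one model may wrap JSON in markdown code fences.

```python
import json
import re

def check_json_conformance(output: str, required_keys: set[str]) -> bool:
    """Deterministic check: strip markdown fences if present, parse the
    JSON, and verify the required keys exist."""
    cleaned = re.sub(r"^```(?:json)?\s*|\s*```$", "", output.strip())
    try:
        data = json.loads(cleaned)
    except json.JSONDecodeError:
        return False
    return isinstance(data, dict) and required_keys <= data.keys()

def token_overlap(a: str, b: str) -> float:
    """Crude similarity stand-in (Jaccard over lowercased tokens). A real
    pipeline would use an LLM judge or embedding similarity here."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def compare(prod: str, cand: str, required_keys: set[str]) -> dict:
    """One comparison record for a single shadowed request."""
    return {
        "prod_conformant": check_json_conformance(prod, required_keys),
        "cand_conformant": check_json_conformance(cand, required_keys),
        "similarity": token_overlap(prod, cand),
    }
```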
The shadow period produces a comparison dataset: thousands of real production inputs with paired outputs from both models. This dataset is more valuable than any benchmark. It tells you exactly where the new model diverges on your actual workload.
What to Measure During Shadow
Aggregate metrics hide the important signals. Break your analysis down by:
- Input category. If you have intent classification, measure divergence per intent. Regressions cluster in specific task types, not uniformly.
- Output length distribution. A model that's 30% more verbose costs more and may degrade user experience even if accuracy is identical.
- Structured output conformance rate. If you use JSON mode or function calling, measure schema validation pass rates separately. A 2% drop in conformance rate at scale means hundreds of failed requests per day.
- Error and refusal rates. Track how often each model refuses, hedges, or produces error responses. A new model that refuses 5% of queries the old model handled is a regression even if every non-refused response is better.
- Tail latency. p50 latency might improve while p99 gets worse. For user-facing applications, p99 is the number that determines whether your SLA holds.
Embedding Model Migration: The Reindexing Problem
When the model you're migrating is an embedding model — or when your new LLM requires a different embedding model for RAG — the migration complexity increases by an order of magnitude. Embeddings are not interchangeable across models. A vector produced by text-embedding-ada-002 is meaningless in an index built for text-embedding-3-large. The dimensionality may differ. The semantic space is fundamentally different. You cannot mix old and new vectors in the same index.
This means reindexing your entire document corpus, which creates three problems:
The zero-downtime requirement. Your RAG system can't go offline for the hours or days it takes to re-embed millions of documents. The standard pattern is blue-green indexing: build the new index alongside the old one, then switch the query path atomically once the new index is complete and validated. This requires enough infrastructure to run two indexes simultaneously.
Validation before cutover. A new embedding model changes retrieval behavior. Documents that ranked highly for a given query under the old model may rank differently under the new one. Before switching, run your retrieval eval suite against the new index and spot-check queries where the ranking order changed significantly. Retrieval changes cascade into generation changes — a 5% shift in what gets retrieved can produce a 15% shift in what gets generated.
The versioning trap. If you store embeddings in multiple places — a primary vector database, a cache layer, a feature store — all of them need to be updated atomically. Partial migration, where some queries hit old embeddings and others hit new ones, produces inconsistent behavior that's extremely difficult to debug. Version-tag your embeddings and enforce that the query path only reads from a single version at a time.
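The single-version constraint can be enforced structurally rather than by convention. A minimal sketch of a version-tagged query router, assuming index handles expose a `search` method (the class and method names here are illustrative, not a real vector-database API):

```python
class EmbeddingIndexRouter:
    """Blue-green index routing: every query reads from exactly one
    registered index version, and cutover is a single assignment."""

    def __init__(self) -> None:
        self.indexes: dict[str, object] = {}  # version tag -> index handle
        self.active_version: str | None = None

    def register(self, version: str, index) -> None:
        """Register a fully built and validated index under a version tag."""
        self.indexes[version] = index

    def cutover(self, version: str) -> None:
        """Atomically switch the query path to a registered version."""
        if version not in self.indexes:
            raise ValueError(f"index version {version!r} not registered")
        self.active_version = version

    def query(self, vector, top_k: int = 5):
        """All reads go through the active version -- never a mix."""
        if self.active_version is None:
            raise RuntimeError("no active index version")
        return self.indexes[self.active_version].search(vector, top_k)
```

Because `query` can only ever dereference one version tag, the partial-migration failure mode (some queries hitting old embeddings, some hitting new) is ruled out by construction, and rollback is just another `cutover` call to the old tag.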
Teams that plan for a one-day embedding migration typically finish in one to two weeks. The reindexing is fast. The validation and coordination are slow.
Prompt Compatibility: The Migration Tax Nobody Budgets For
The most time-consuming part of a model migration is neither the infrastructure work nor the testing. It's prompt adaptation.
Every production system accumulates prompt-level assumptions about model behavior. You've tuned your system prompt over months — adjusting wording, adding constraints, removing instructions the model started ignoring. These adjustments encode implicit knowledge about how a specific model interprets instructions. When you swap models, that implicit knowledge becomes implicit bugs.
Common prompt compatibility failures:
- Few-shot examples that worked for one model confuse another. The examples may establish a pattern that model A follows and model B interprets differently, producing outputs that match the format but miss the intent.
- Instruction ordering sensitivity. Some models weight instructions at the beginning of the system prompt more heavily. Others weight the end. A prompt that works by placing the most critical constraint last may fail when the new model has a different attention pattern.
- Negative instructions that stop working. "Don't include disclaimers" might be respected by one model family and ignored by another. These are the regressions users notice immediately.
- Temperature and sampling sensitivity. The same temperature value produces different diversity levels across models. A temperature of 0.7 on one model might be equivalent to 0.4 on another.
The practical approach is to treat prompt migration as a separate workstream from infrastructure migration. Assign someone to systematically test each prompt against the new model, starting with the highest-traffic prompts. Use your shadow testing comparison dataset to identify which prompts have the largest behavioral divergence, and prioritize those for manual tuning.
Budget two to five days of prompt engineering per major prompt in your system. For complex multi-turn systems with dozens of prompts, this alone can stretch a migration to weeks.
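The prioritization step lends itself to a simple impact score. A sketch, assuming you have per-prompt divergence rates from the shadow comparison dataset and per-prompt traffic shares (both inputs are illustrative):

```python
def prioritize_for_tuning(divergence: dict[str, float],
                          traffic: dict[str, float]) -> list[str]:
    """Rank prompt IDs for manual adaptation: weight each prompt's shadow
    divergence rate by its traffic share, highest expected impact first."""
    return sorted(
        divergence,
        key=lambda p: divergence[p] * traffic.get(p, 0.0),
        reverse=True,
    )
```

The point of the weighting is that a 30% divergence on a rarely used prompt can matter less than a 5% divergence on the prompt that handles most of your traffic; ordering by raw divergence alone would misallocate the two-to-five days per prompt.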
The Rollback Plan That Actually Works
Every migration plan includes "we can roll back." Few specify what rollback means operationally. A proper rollback plan answers three questions:
What triggers a rollback? Define specific, measurable criteria before the migration begins. Not "if things go wrong" but "if structured output conformance drops below 97%" or "if p99 latency exceeds 800ms for 15 minutes." Ambiguous rollback criteria lead to two failure modes: rolling back too late because nobody wanted to make the call, or rolling back too early because someone panicked over normal variance.
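Codifying the criteria makes "should we roll back?" a lookup instead of a debate. A minimal sketch (the metric names and thresholds mirror the examples above and are illustrative; a real implementation would also track how long each threshold has been breached, e.g. the 15-minute window):

```python
# Declared before the migration begins, reviewed by stakeholders.
ROLLBACK_CRITERIA = [
    # (metric name, predicate that returns True when the criterion trips)
    ("json_conformance_rate", lambda v: v < 0.97),
    ("p99_latency_ms",        lambda v: v > 800),
    ("refusal_rate_delta",    lambda v: v > 0.05),
]

def should_roll_back(metrics: dict[str, float]) -> list[str]:
    """Return the names of tripped criteria; non-empty means roll back.
    Sustained-duration logic (e.g. 'for 15 minutes') would wrap this."""
    return [
        name for name, tripped in ROLLBACK_CRITERIA
        if name in metrics and tripped(metrics[name])
    ]
```

Because the criteria are data, they can be logged alongside the decision, which removes the "nobody wanted to make the call" failure mode: the system made the call, and a human only decides whether to override it.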
How fast can you roll back? If switching back requires a config change and a deployment, you can roll back in minutes. If it requires reverting an embedding index, you need the old index still running. If prompts were modified for the new model, rolling back means reverting prompts too. Keep the old model's infrastructure warm and the old prompts in version control with a clear revert path.
What state do you lose? If the new model has been processing requests for hours, some user sessions may contain responses generated by it. Rolling back mid-session can create inconsistencies — a conversation where the tone, format, or capabilities suddenly change. For multi-turn systems, consider draining active sessions before rolling back rather than switching mid-conversation.
The safest rollback architecture is a feature flag that controls model routing at the request level. This lets you roll back instantly, roll forward gradually (1% → 10% → 50% → 100%), and maintain both models in a deployable state throughout the migration window.
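Request-level routing with gradual ramp is usually implemented as a deterministic hash bucket. A minimal sketch (model names are placeholders):

```python
import hashlib

def route_model(user_id: str, rollout_pct: float,
                old_model: str = "old-model",
                new_model: str = "new-model") -> str:
    """Deterministic percentage rollout: hash the user id into a bucket
    in [0, 100). The same user always gets the same model at a given
    percentage, and raising rollout_pct only ever moves users from old
    to new, never back and forth between requests."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = (int(digest, 16) % 10000) / 100  # 0.00 .. 99.99
    return new_model if bucket < rollout_pct else old_model
```

Hashing on user id rather than picking randomly per request matters for multi-turn systems: it keeps a given user's whole session on one model, avoiding the mid-conversation tone and capability shifts described in the rollback section. Rolling back is just setting `rollout_pct` to 0.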
The Canary Rollout: From Shadow to Production
Once shadow testing validates the new model and prompts are adapted, the migration follows a canary pattern:
Phase 1: Internal traffic only. Route internal users, dogfood accounts, and synthetic test traffic to the new model. This catches integration issues — serialization bugs, timeout configuration mismatches, logging format changes — without user impact.
Phase 2: Low-risk traffic segment. Route 1-5% of production traffic to the new model, selecting a segment where errors are least costly. For a customer support bot, this might be informational queries rather than account modification flows. Monitor your rollback criteria continuously.
Phase 3: Gradual ramp. Increase to 25%, then 50%, then 100% over days, not hours. At each stage, compare aggregate metrics between the old and new model cohorts. Watch for signals that only appear at scale: rate limiting on the new provider, latency degradation under load, or cost overruns from unexpectedly verbose responses.
Phase 4: Decommission. Once the new model reaches 100% and has been stable for at least one business cycle, decommission the old model's infrastructure. Not before. The old model is your rollback path, and premature decommissioning is how "we can always roll back" becomes "we could have rolled back."
The Organizational Coordination Problem
Technical migration is a solved problem. Organizational migration is where timelines blow up.
A model migration touches every team that depends on model behavior: product teams that designed UX around specific output formats, data teams that built analytics on response patterns, compliance teams that approved specific model behaviors, and customer support teams that know the current model's quirks.
The coordination checklist that prevents surprises:
- Notify downstream consumers before shadow testing begins. If other services parse your model's output, they need to validate against the new model's format before you switch.
- Update documentation and runbooks. Error codes, response formats, and behavioral expectations change. On-call engineers diagnosing production issues at 3 AM need accurate documentation.
- Align on the migration window. Don't migrate during high-traffic periods, product launches, or when key engineers are on vacation. This sounds obvious. It's violated constantly.
- Communicate the rollback criteria to stakeholders. When a product manager sees a temporary quality dip during canary rollout, they need to know whether it's within expected parameters or a reason to roll back. Define this before the migration, not during it.
The Migration Timeline That Actually Holds
Most teams estimate model migrations at one to two weeks. Most migrations take four to eight weeks. Here's a realistic timeline:
- Week 1: Eval suite baseline on new model. Identify obvious regressions. Begin prompt adaptation for highest-traffic prompts.
- Week 2: Deploy shadow infrastructure. Begin dual-write shadow testing. Continue prompt work.
- Week 3: Analyze shadow results. Complete prompt adaptation. Fix identified regressions.
- Week 4: Begin canary rollout (internal → 1% → 5%). Monitor rollback criteria.
- Week 5-6: Ramp to 25-100%. Address long-tail issues that only appear at scale.
- Week 7-8: Stabilization period. Decommission old infrastructure. Update documentation.
For systems with embedding model dependencies, add two to three weeks for reindexing and retrieval validation.
The teams that finish faster are the ones that have migrated before. The playbook gets faster with practice — not because the technical work shrinks, but because the organizational muscle memory develops. Your second migration will take half the time of your first. Build the infrastructure for repeatability: feature flags for model routing, automated comparison pipelines, versioned prompt repositories. The next model worth migrating to is already in training.
