
The Model Migration Playbook: How to Swap Foundation Models Without a Feature Freeze

· 11 min read
Tian Pan
Software Engineer

Every production LLM system will face a model migration. The provider releases a new version. Your costs need to drop. A competitor offers better latency. Regulatory requirements demand a different vendor. The question is never if you'll swap models — it's whether you'll do it safely or learn the hard way that "just run the eval suite" leaves a crater-sized gap between staging confidence and production reality.

Most teams treat model migration like a library upgrade: swap the dependency, run the tests, ship it. This works for deterministic software. It fails catastrophically for probabilistic systems where the same input can produce semantically different outputs across model versions, and where your prompt was implicitly tuned to the behavioral quirks of the model you're replacing.

Why Model Migrations Are Different From Software Upgrades

When you upgrade a database driver, the contract is explicit: same query, same results. LLM migrations break this assumption in three ways that compound on each other.

Prompt-model co-adaptation. Your prompts evolved alongside your current model. Every time an engineer tweaked a system prompt to fix an edge case, they were implicitly encoding that model's interpretation style into the prompt. A prompt that reliably produces structured JSON from GPT-4 may produce markdown-wrapped JSON from Claude, or add unsolicited commentary from Gemini. These aren't bugs in the new model — they're the residue of optimization for the old one.

Behavioral surface area. A model isn't just its accuracy on your eval set. It's how it handles ambiguity, how aggressively it refuses edge cases, how it formats outputs when the schema isn't perfectly specified, how it responds to adversarial inputs. Two models that score identically on a benchmark can behave completely differently on the 15% of production traffic that doesn't look like your eval distribution.

Downstream coupling. Your post-processing code, your guardrails, your output parsers — they all learned to expect specific patterns from your current model. A model that wraps JSON in triple backticks, or that spells out numbers instead of using digits, or that adds a preamble before the structured response, breaks downstream systems in ways that unit tests never anticipated.
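A small defensive-parsing layer can absorb some of this coupling. Here's a minimal sketch of a tolerant JSON extractor that accepts bare JSON, fenced code blocks, or a preamble before the object — the kinds of wrapping a new model may introduce that a strict `json.loads()` call would reject. The function name and exact tolerances are illustrative, not a prescription:

```python
import json
import re


def extract_json(text: str):
    """Tolerantly pull a JSON object out of a model response.

    Handles three wrappings a new model might introduce:
    bare JSON, ```json fenced blocks, and a natural-language
    preamble before the first brace.
    """
    # Prefer the contents of a fenced block if one is present.
    fenced = re.search(r"```(?:json)?\s*(.*?)```", text, re.DOTALL)
    if fenced:
        text = fenced.group(1)
    # Trim any preamble/postamble around the outermost object.
    start = text.find("{")
    end = text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    return json.loads(text[start:end + 1])
```

This won't catch every format drift, but it turns an entire class of "new model wraps output differently" breakages into non-events.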

The Shadow Period: Dual-Write Before You Cut Over

The single most important technique in model migration is the shadow deployment period. Run the new model on live production traffic without serving its responses to users. Log everything. Compare everything. Only cut over when the delta is understood and acceptable.

Here's how to structure it:

Phase 1: Parallel execution (1-2 weeks). Route a copy of every production request to the new model. Store both responses with a correlation ID. Users only see responses from the existing model. This costs you extra inference spend, but it's the cheapest insurance you'll buy.
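The dual-write itself can be very thin. A sketch, with hypothetical `call_current_model` / `call_candidate_model` stand-ins for your real provider SDKs and an in-memory list standing in for your logging pipeline:

```python
import uuid

# Hypothetical stand-ins for your real provider clients.
def call_current_model(prompt: str) -> str:
    return f"current:{prompt}"

def call_candidate_model(prompt: str) -> str:
    return f"candidate:{prompt}"

shadow_log: list[dict] = []  # in production: your logging/analytics sink

def handle_request(prompt: str) -> str:
    """Serve the current model; shadow the candidate without affecting users."""
    correlation_id = str(uuid.uuid4())
    served = call_current_model(prompt)
    try:
        # In production this should be async fire-and-forget so shadow
        # latency and shadow errors never touch the user-facing path.
        shadowed = call_candidate_model(prompt)
    except Exception as exc:
        shadowed = f"<shadow_error: {exc}>"
    shadow_log.append({
        "correlation_id": correlation_id,
        "prompt": prompt,
        "current": served,
        "candidate": shadowed,
    })
    return served  # users only ever see the current model's output
```

The correlation ID is what makes Phase 2 possible: every comparison joins on it.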

Phase 2: Automated comparison. Build comparison pipelines that go beyond string matching. You need semantic similarity scores between old and new outputs, structural conformance checks (does the new model's JSON parse correctly?), and behavioral classification (did the new model refuse a request the old model handled, or vice versa?).
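A comparison record per request pair might look like the sketch below. The refusal markers and the lexical-overlap similarity are deliberately crude placeholders — in practice you'd use an embedding-based or LLM-judge similarity score — but the shape of the pipeline is the point:

```python
import json

# Crude heuristic markers; extend for your models' actual refusal phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")

def parses_as_json(text: str) -> bool:
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

def is_refusal(text: str) -> bool:
    return text.lower().startswith(REFUSAL_MARKERS)

def token_overlap(a: str, b: str) -> float:
    """Jaccard overlap of tokens -- a cheap proxy; swap in an
    embedding-based semantic similarity score in practice."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def compare(old: str, new: str) -> dict:
    """One comparison record per correlated request pair."""
    return {
        "old_parses": parses_as_json(old),
        "new_parses": parses_as_json(new),
        "refusal_flip": is_refusal(old) != is_refusal(new),
        "similarity": token_overlap(old, new),
    }
```

Aggregate these records daily; the trends matter more than any single pair.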

Phase 3: Discrepancy triage. Not all differences are regressions. The new model might produce better answers on some inputs. The goal isn't zero difference — it's zero surprising difference. Classify discrepancies into:

  • Improvements: new model is demonstrably better
  • Neutral variants: semantically equivalent, differently worded
  • Regressions: new model fails where old model succeeded
  • Novel behaviors: new model does something neither expected nor covered by evals

The novel behaviors category is the one that bites hardest. These are the cases your eval suite never anticipated because the old model never triggered them.
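The triage logic reduces to a small decision function once you have correctness judgments (from a human or LLM judge) and anomaly detector hits (schema violations, refusal flips, and anything else your evals never modeled). A sketch under those assumptions:

```python
def triage(old_correct: bool, new_correct: bool, anomaly_flags: list[str]) -> str:
    """Bucket one shadow discrepancy into the four triage categories.

    old_correct / new_correct come from your grader (human or LLM judge);
    anomaly_flags are detector hits -- schema violations, refusal flips,
    anything no eval case anticipated.
    """
    if anomaly_flags:
        return "novel"  # inspect by hand before anything else
    if new_correct and not old_correct:
        return "improvement"
    if old_correct and not new_correct:
        return "regression"
    return "neutral"
```

Routing "novel" ahead of the correctness comparison is deliberate: an output can be technically correct and still be a behavior you've never seen before.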

Prompt Translation Is Engineering, Not Find-and-Replace

When teams migrate between model families — say, from OpenAI to Anthropic, or from a proprietary model to an open-source one — the prompt layer requires genuine re-engineering. Each model family has different preferences for how instructions are structured, how context is delimited, and how output format is specified.

OpenAI models respond well to markdown-formatted prompts with section delimiters. Claude prefers XML tags for separating instructions from context. Gemini emphasizes explicit constraint placement. These aren't cosmetic differences — they affect how reliably the model follows complex multi-step instructions.

The practical migration pattern:

  1. Extract the intent from each prompt. What is this prompt actually trying to achieve? Strip away the model-specific formatting and write down the core requirements.
  2. Re-express for the target model. Use the target model's preferred structuring conventions. This often means rewriting, not translating.
  3. Test on your hardest cases first. Don't start with the happy path. Start with the prompts that required the most iteration to get right on the old model — those are the ones most likely to break.
  4. Preserve the escape hatches. If your old prompt had specific instructions for handling edge cases ("if the user asks about X, respond with Y"), verify these still trigger correctly. Edge case handling is where model-specific tuning concentrates.

A common mistake is building an abstraction layer that "normalizes" prompts across providers. This sounds elegant but produces mediocre results everywhere. The best prompts are model-specific. Accept the maintenance cost of per-model prompt variants in exchange for per-model reliability.
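Concretely, "per-model prompt variants" can be as simple as a builder per model family behind one dispatch point. The model names and formatting conventions below are illustrative — the structure is what matters:

```python
def build_prompt_openai(instructions: str, context: str) -> str:
    # Markdown sections: a convention that tends to work well for OpenAI models.
    return f"## Instructions\n{instructions}\n\n## Context\n{context}"

def build_prompt_claude(instructions: str, context: str) -> str:
    # XML tags: Anthropic's documented convention for delimiting context.
    return (
        f"<instructions>\n{instructions}\n</instructions>\n\n"
        f"<context>\n{context}\n</context>"
    )

# Illustrative model identifiers; use your deployment's actual names.
PROMPT_BUILDERS = {
    "gpt-4o": build_prompt_openai,
    "claude-sonnet": build_prompt_claude,
}

def render_prompt(model: str, instructions: str, context: str) -> str:
    """Same intent, model-specific expression -- the swap surface is explicit."""
    return PROMPT_BUILDERS[model](instructions, context)
```

The intent (step 1 of the pattern above) lives in the arguments; the model-specific expression lives in the builders. Adding a model means adding a builder, not rewriting call sites.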

Embedding Migration: The Reindexing Problem Nobody Budgets For

If your system uses embeddings — for RAG, semantic search, or classification — a model migration creates a second, often larger migration: your entire vector index becomes incompatible. New embedding models produce vectors in a different semantic space. You cannot mix old and new embeddings in the same index and expect meaningful similarity scores.

The dual-index strategy. Create the new index alongside the old one. Re-embed your corpus with the new model while the old index continues serving production traffic. This requires enough compute to re-embed everything and enough storage to hold both indexes simultaneously.

Incremental vs. full reindex. For small corpora (under a few million documents), full reindexing is simpler and safer. For large-scale systems, incremental reindexing with change-data-capture from your source-of-truth database lets you prioritize high-traffic documents first and backfill the rest.
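The prioritization step of an incremental reindex is just a sort over whatever traffic signal you have. A minimal sketch, assuming you can get per-document query counts from your analytics:

```python
def reindex_order(doc_traffic: dict[str, int]) -> list[str]:
    """Order documents for incremental re-embedding: high-traffic first,
    so the new index covers the head of the query distribution early
    and the long tail backfills afterwards."""
    return sorted(doc_traffic, key=doc_traffic.get, reverse=True)
```

With change-data-capture feeding `doc_traffic` updates, new and edited documents simply enter the queue at their observed traffic level.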

The validation gap. How do you know the new embeddings are actually better? Measure retrieval overlap: for a representative sample of queries, compare the top-K results from both indexes. Research suggests 80-85% overlap in top results is typical when migrating between competitive embedding models. The 15-20% that differs is where you need human evaluation to determine which index is actually right.
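The overlap metric itself is simple to compute. A sketch, assuming you can run the same query against both indexes and get ranked document IDs back:

```python
def overlap_at_k(old_results: list[str], new_results: list[str], k: int = 10) -> float:
    """Fraction of the old index's top-k results that also appear in the
    new index's top-k for the same query. Aggregate over a query sample;
    the queries below the overlap threshold go to human review."""
    old_top, new_top = set(old_results[:k]), set(new_results[:k])
    if not old_top:
        return 1.0  # vacuously overlapping: old index returned nothing
    return len(old_top & new_top) / len(old_top)
```

Note this measures *agreement*, not *quality* — a low score flags queries for human evaluation, it doesn't tell you which index won.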

Cutover and rollback. Use an alias or routing layer in front of your vector store so the cutover is a configuration change, not a code deployment. Keep the old index warm for at least two weeks after cutover. If retrieval quality degrades on real traffic, you can switch back in seconds rather than hours.

Store embedding model metadata — model name, version, dimensions, and training date — alongside every index. When you're debugging retrieval quality six months from now, you'll want to know exactly which model produced which vectors.
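Both points — the alias layer and the per-index metadata — fit in one small routing structure. The index names, models, and fields here are illustrative:

```python
# Per-index metadata: which embedding model produced which vectors.
INDEXES = {
    "products_v1": {"model": "embed-old", "dimensions": 768, "built": "2024-01-10"},
    "products_v2": {"model": "embed-new", "dimensions": 1024, "built": "2024-06-02"},
}

# Cutover is a one-line alias flip, not a code deployment;
# rollback is the same flip back while the old index stays warm.
ALIASES = {"products": "products_v2"}

def resolve_index(alias: str) -> tuple[str, dict]:
    """Application code queries by alias; the physical index is indirection."""
    name = ALIASES[alias]
    return name, INDEXES[name]
```

Six months from now, `resolve_index("products")[1]["model"]` answers "which model produced these vectors" without archaeology. Managed vector stores and Elasticsearch-style systems offer native alias mechanisms that serve the same purpose.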

Why "Just Run the Eval Suite" Is Necessary But Insufficient

Your eval suite tests the scenarios you anticipated. Model migrations break on the scenarios you didn't. Here's why evals alone create false confidence:

Distribution mismatch. Eval sets are curated. Production traffic is messy. The eval set overrepresents the cases you thought to test and underrepresents the long tail of weird inputs that make up 20-30% of real traffic. A model that scores 95% on your evals might handle that long tail completely differently than its predecessor.

Behavioral drift in multi-turn sessions. Research on model switching in multi-turn conversations shows that even a single-turn handoff between models — where the new model continues a conversation started by the old model — can swing performance by ±4 F1 points. If your system has any session continuity, you need to test not just single-turn accuracy but multi-turn coherence across the model boundary.

Changed refusal patterns. Every model has a different refusal surface. The new model might refuse requests the old one handled, or handle requests the old one refused. Neither your eval suite nor your users will immediately surface all of these differences. Refusal changes often appear as increased "the AI couldn't help me" complaints that take weeks to aggregate into a visible pattern.

Output format instability. Even when using structured output modes, different models have different failure modes at the edges. One model might occasionally nest an extra layer of JSON. Another might encode special characters differently. A third might truncate long outputs at a different threshold. These are the kinds of failures that pass evals (because evals use clean inputs) but break production parsers (because production inputs are dirty).

The Migration Playbook: A Phased Approach

Here's the sequence that works. Each phase has a clear gate before proceeding to the next.

Week 1-2: Prompt translation and unit validation. Re-engineer your prompts for the new model. Test against your existing eval suite. Gate: eval scores within 2% of the old model on all critical metrics.

Week 2-3: Shadow deployment. Run the new model in shadow mode on production traffic. Log all responses. Gate: automated comparison shows regression rate below your threshold (typically 3-5% for non-critical paths, <1% for critical paths).

Week 3-4: Canary rollout. Route 5% of production traffic to the new model. Monitor user-facing metrics: task completion rate, error rate, session length, explicit feedback scores. Gate: no statistically significant degradation on any user-facing metric.

Week 4-5: Progressive rollout. Increase traffic to 25%, then 50%, then 100% over the course of the week. Continue monitoring. At each step, the rollback path is a traffic routing change, not a code deployment.
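The routing change behind the canary and progressive phases is typically deterministic hash bucketing, so a given user stays on the same model as the percentage climbs. A sketch:

```python
import hashlib

def routes_to_new_model(user_id: str, rollout_pct: int) -> bool:
    """Deterministic percentage rollout: hash the user into one of 100
    buckets. The same user always lands in the same bucket, so raising
    rollout_pct (5 -> 25 -> 50 -> 100) only ever adds users to the new
    model -- nobody flip-flops between models mid-session."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return bucket < rollout_pct
```

Because `rollout_pct` is configuration, both ramp-up and rollback are a config push, matching the "traffic routing change, not a code deployment" requirement above.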

Week 6+: Cleanup. Decommission the old model. Remove dual-write infrastructure. Archive shadow comparison data (you'll want it for the next migration). Update your prompt documentation to reflect the new model's conventions.

The entire process takes 4-6 weeks for a straightforward migration. If you're also migrating embedding models, add 2-3 weeks for reindexing and retrieval validation. If you're changing providers entirely (not just model versions), add another week for API integration, authentication, and error handling changes.

Building for the Next Migration

The teams that handle model migrations well aren't the ones with the best eval suites — they're the ones who designed for migration from the start.

Model-agnostic interfaces. Wrap your LLM calls behind an abstraction that separates prompt construction from model invocation. Not to make prompts "portable" (they shouldn't be), but to make the swap surface obvious: you can see exactly which prompts need re-engineering when you change models.

Behavioral baselines. Continuously log a sample of production inputs and outputs. These become your comparison corpus for the next migration. You don't need to evaluate them — just store them. When migration time comes, you replay the inputs through the new model and compare.
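Keeping that sample uniform and bounded is a textbook fit for reservoir sampling. A minimal sketch (the record shape is up to you; a fixed seed keeps the sketch reproducible):

```python
import random

class BaselineSampler:
    """Maintain a fixed-size uniform sample of production traffic
    (reservoir sampling) to replay through the next candidate model."""

    def __init__(self, capacity: int, seed: int = 0):
        self.capacity = capacity
        self.seen = 0
        self.reservoir: list[dict] = []
        self._rng = random.Random(seed)

    def observe(self, record: dict) -> None:
        """O(1) per request; every record seen so far has equal
        probability of being in the reservoir."""
        self.seen += 1
        if len(self.reservoir) < self.capacity:
            self.reservoir.append(record)
        else:
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.reservoir[j] = record
```

At migration time, `reservoir` is your replay corpus: feed each stored input to the new model and diff against the stored output.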

Versioned prompt registries. Track which prompt version is paired with which model version. When something breaks, you need to know whether the prompt changed, the model changed, or both. This sounds obvious but most teams version their code and treat their prompts as configuration that lives in environment variables.
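The registry doesn't need to be elaborate — the essential property is that a prompt version and a model version are looked up as a pair. A sketch with illustrative task names, versions, and templates:

```python
# Each entry pairs a prompt version with the model it was tuned for.
PROMPT_REGISTRY = {
    ("summarize", "v3"): {
        "model": "gpt-4o",
        "template": "## Instructions\nSummarize the document below.",
    },
    ("summarize", "v4"): {
        "model": "claude-sonnet",
        "template": "<instructions>Summarize the document below.</instructions>",
    },
}

ACTIVE = {"summarize": "v4"}  # pinned in version control, not an env var

def active_prompt(task: str) -> dict:
    """Resolve the active (prompt version, model) pair for a task."""
    version = ACTIVE[task]
    entry = PROMPT_REGISTRY[(task, version)]
    return {"task": task, "version": version, **entry}
```

When something breaks, the diff between registry entries tells you immediately whether the prompt changed, the model changed, or both.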

Migration runbooks. Document each migration while it's fresh. What broke? What did the eval suite miss? How long did each phase actually take? These runbooks compound in value — your third migration will go twice as fast as your first because you'll know where the bodies are buried.

The uncomfortable truth about model migration is that it's a recurring tax, not a one-time project. Foundation models update every few months. Provider pricing changes quarterly. New capabilities emerge that your product team wants to leverage. The organizations that treat migration as an operational capability — like database migrations or API versioning — rather than a special project are the ones that can actually move fast when the landscape shifts.

Your next model swap is coming. The question is whether you'll have a playbook or a postmortem.
