2 posts tagged with "shadow-testing"

The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production

· 13 min read
Tian Pan
Software Engineer

Every team that has shipped an LLM-powered product has faced the same moment: a new foundation model drops with better benchmarks, lower costs, or both — and someone asks, "Can we just swap it in?" The answer is always yes in staging and frequently catastrophic in production.

The gap between "runs on the new model" and "behaves correctly on the new model" is where production incidents live. Model migrations fail not because the new model is worse, but because the migration process assumes behavioral equivalence where none exists. Prompt formatting conventions differ between providers. System prompt interpretation varies across model families. Edge cases that the old model handled gracefully — through learned quirks you never documented — surface as regressions that your eval suite wasn't designed to catch.

The Model Migration Playbook: How to Swap Foundation Models Without Breaking Production

· 12 min read
Tian Pan
Software Engineer

Every team that has been running LLM-powered features for more than six months has faced the same moment: a better model drops, the current provider raises prices, or the model you depend on gets deprecated with 90 days' notice. You need to swap the foundation model underneath a running production system. Most teams treat this as a configuration change — update the model ID, re-run the eval suite, ship it. Then they spend the next two weeks firefighting regressions that the evals never caught.

The model migration problem is fundamentally different from traditional software upgrades. When you swap a database version, the query semantics are preserved. When you swap a foundation model, everything changes: output distributions shift, edge-case behaviors diverge, and downstream systems that learned to depend on specific model quirks silently break. The failure modes are distributional, not binary, which means they hide in the long tail where your eval suite has the least coverage.
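Catching distributional drift before cutover is what shadow testing is for: mirror production traffic to the candidate model, keep serving the incumbent's response, and log where the two diverge. Below is a minimal, hedged sketch of that pattern; the `call_model` stub, model IDs, and similarity threshold are illustrative placeholders, not a real provider API — in practice you would substitute your provider's client and a task-appropriate comparison metric.

```python
import difflib

def call_model(model_id: str, prompt: str) -> str:
    # Stub standing in for a real provider call; swap in your client here.
    # Canned outputs illustrate how two models can answer the same prompt
    # with different formatting conventions.
    canned = {
        "old-model": "The invoice total is $42.00.",
        "new-model": "Total: $42.00",
    }
    return canned[model_id]

def shadow_compare(prompt: str,
                   primary: str = "old-model",
                   shadow: str = "new-model",
                   threshold: float = 0.8):
    """Serve the primary model's response; flag the shadow when it diverges.

    The user only ever sees `primary_out`; the shadow call exists purely
    to surface behavioral differences on real traffic before cutover.
    """
    primary_out = call_model(primary, prompt)
    shadow_out = call_model(shadow, prompt)
    # A crude text-similarity proxy; real systems often compare parsed
    # fields, embeddings, or downstream task outcomes instead.
    similarity = difflib.SequenceMatcher(None, primary_out, shadow_out).ratio()
    report = {"similarity": round(similarity, 2),
              "divergent": similarity < threshold}
    return primary_out, report

response, report = shadow_compare("Summarize this invoice.")
```

Because the shadow result never reaches users, the threshold can start strict and be loosened as you learn which divergences are cosmetic (formatting) and which are substantive (different totals, dropped fields).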