
Quarterly Model Migration: Make It a Calendar Event, Not a Fire Drill

11 min read
Tian Pan
Software Engineer

The deprecation email arrives on a Tuesday afternoon. The model your billing pipeline has depended on for fourteen months is now on a sixty-day timer. The prompt was tuned by an engineer who left in March. The eval suite hasn't been re-baselined since launch. The customer-success team is asking why "the AI feels different" on two enterprise accounts. Nobody put this on the roadmap, and nobody will own it cleanly, because in your org's mental model this is a one-off project — even though it is the fourth one this year.

Every team running an AI feature in production runs into the same realization within eighteen months: the foundation-model provider is operating on a deprecation cadence that the team did not plan for, and the team's migration response keeps being a reactive scramble triggered by a notification email. The fix is not a better playbook for the next migration — there are already plenty of those, and your team has probably written one. The fix is to stop treating migration as a project and start treating it as a recurring operational primitive. Put it on the calendar.

The Cadence the Providers Have Already Imposed on You

Read the deprecation policies of any of the major model providers and a pattern emerges. Anthropic gives at least sixty days of notice before a publicly released model is retired. OpenAI declares replacement models roughly ninety to one hundred twenty days before retirement, then runs the deprecation window from there. Microsoft Foundry holds models in a "Deprecated" state for at least ninety days before they hit "Retired." xAI announced its May 2026 retirement of older Grok model slugs with several months of lead time and an automatic redirect to a successor model on cutover day.

The numbers vary, but the rhythm is the same: somewhere between two and four months of notice, fired at the provider's discretion, on a schedule the provider does not coordinate with your release calendar. That cadence has been steady enough now — across providers and across years — that it is no longer a surprise. Look at any recent deprecation feed: Claude 3 Haiku deprecated February 2026, shut down August 2026. Claude 3.5 Haiku deprecated January, shut down July. Multiple OpenAI snapshots deprecated and removed during the same window. GitHub Models swept a batch of Anthropic and OpenAI versions out in a single February changelog entry. If your team ships an AI feature, you will face two to four of these per year, predictably.

That is the shape of a quarterly operational rhythm. It is not the shape of a project. A team that keeps responding to it as a project is, in practice, paying the migration tax three or four times a year and re-deriving the migration patterns each time from scratch. The org learns nothing across iterations because each migration is treated as a unique incident. The engineers doing the work are not building durable infrastructure; they are building the same temporary scaffolding repeatedly.

What a Quarterly Migration Drill Looks Like

A migration drill is a calendar event on the engineering schedule, owned by a named DRI (directly responsible individual), with a defined scope and a defined exit condition. It runs every quarter whether or not a deprecation is pending. The point of the drill is not to migrate — most quarters there is nothing to migrate to. The point is to keep the migration apparatus warm.

A typical drill looks like this: pick a candidate model from a different family than the current production model. Run the regression suite against the candidate. Run a behavioral diff between the current model and the candidate on a recent slice of production traffic. Inspect the diff for the categories you care about — refusals, formatting drift, tool-call shape, latency-percentile shift, judge-score delta. File the results in a versioned report. Update the migration runbook with anything you learned. Done.
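
As a concrete sketch, here is what the behavioral-diff step can look like in Python. Everything in it is illustrative: `complete` stands in for whatever wrapper you already have around your provider SDK, the refusal heuristic is deliberately crude, and the model names are placeholders.

```python
# quarterly_drill.py -- the behavioral-diff step of the quarterly drill.
# A minimal sketch; every name and heuristic here is illustrative.
import statistics
from typing import Callable, TypedDict

class Completion(TypedDict):
    text: str
    latency_s: float

# A deliberately crude stand-in for real refusal detection.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "as an ai")

def looks_like_refusal(text: str) -> bool:
    lowered = text.lower()
    return any(marker in lowered for marker in REFUSAL_MARKERS)

def behavioral_diff(
    traffic: list[str],                          # recent production inputs
    complete: Callable[[str, str], Completion],  # your (model, prompt) wrapper
    prod_model: str,
    candidate_model: str,
) -> dict:
    """Run both models over the same traffic slice and summarize the diff."""
    rows = []
    for prompt in traffic:
        prod = complete(prod_model, prompt)
        cand = complete(candidate_model, prompt)
        rows.append({
            "refusal_flipped": looks_like_refusal(prod["text"])
                               != looks_like_refusal(cand["text"]),
            "latency_delta_s": cand["latency_s"] - prod["latency_s"],
        })
    # This summary is what gets filed in the versioned drill report.
    return {
        "n": len(rows),
        "refusal_flips": sum(r["refusal_flipped"] for r in rows),
        "p50_latency_delta_s": statistics.median(
            r["latency_delta_s"] for r in rows
        ),
    }
```

Whatever the real implementation looks like, the useful property is that the same summary fields get produced every quarter, so drill reports stay comparable across time.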

The drill is not a green-light decision about whether to migrate. It is an exercise that produces three artifacts the team did not have before: a recent measurement of how close the production model is to its likely replacement, a runbook that has been touched within the last ninety days by someone who still works at the company, and a candidate-model evaluation that can be revived in two days rather than two weeks when the deprecation email actually lands. The drill compounds over quarters. By the third one, the team's migration response is muscle memory.

The DRI question matters more than the technical content. Migrations fail organizationally before they fail technically. When the deprecation email arrives and the implicit owner is "the prompt person will figure it out," what actually happens is that nobody schedules the meeting until the deadline is six weeks away, the prompt person discovers the regression suite needs work, and the timeline collapses. Naming a quarterly migration DRI — rotating across senior engineers — fixes the ownership ambiguity before the email forces the conversation under time pressure.

Eval Re-anchoring Is a Routine, Not a Project

The hardest part of a migration is not the prompt rewrite or the deployment cutover. It is answering the question every stakeholder will ask within twenty-four hours of the migration going live: did quality drop? If your only answer is a hallway opinion from the engineer who looked at twenty samples, the answer is going to be re-litigated for weeks, the rollback decision will be made on vibes, and you will lose half the team's morale to the second-guessing.

The answer the team needs is a number. Specifically, the eval-on-traffic score for the regression suite, run on the new model against the same prompts, the same inputs, and the same judge configuration that produced the pre-migration baseline. The pre-migration score is X. The post-migration score is Y. The delta is Z. The conversation now has a basis.
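
One way to make the number concrete is sketched below: a paired per-case comparison, which works because the suite runs the same cases through both models. The function name is mine, and the rough confidence band is an addition beyond the plain X, Y, Z; drop it if a single delta is all the conversation needs.

```python
# score_delta.py -- turn "did quality drop?" into one number with a
# rough uncertainty band. Inputs are per-case judge scores from the same
# suite run against the old and new model, in the same case order.
import math

def score_delta(baseline: list[float], candidate: list[float]) -> dict:
    """Paired comparison: same cases, same judge configuration, two models."""
    assert len(baseline) == len(candidate), "re-anchor first: case lists must match"
    n = len(baseline)
    diffs = [c - b for b, c in zip(baseline, candidate)]
    mean = sum(diffs) / n
    # Standard error of the paired mean -- a rough band, not a formal test.
    se = math.sqrt(sum((d - mean) ** 2 for d in diffs) / (n - 1)) / math.sqrt(n)
    return {
        "x_baseline": sum(baseline) / n,    # pre-migration score
        "y_candidate": sum(candidate) / n,  # post-migration score
        "z_delta": mean,                    # the number the meeting needs
        "delta_band_95": (mean - 1.96 * se, mean + 1.96 * se),
    }
```

The band is there to stop a noisy two-point delta on thirty cases from being read as a regression.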

The trap is that the eval suite drifts between migrations. Cases get added, judges get retuned, prompts evolve, and the baseline that was meaningful eight months ago no longer cleanly compares to a fresh run. Re-anchoring the eval suite — running it against the current production model immediately before the migration so the baseline is fresh — is the practice that makes the post-migration number trustworthy. Build it into the runbook. Run it on every drill, even when no migration is pending, so the team has a recent baseline at any moment.
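
The freshness requirement is easy to enforce mechanically. A sketch, assuming baselines are filed as JSON with a timezone-aware `run_at` timestamp; the path and record format are hypothetical.

```python
# re_anchor.py -- refuse to trust a stale baseline. The ninety-day window
# mirrors the drill cadence; the file location and schema are illustrative.
import datetime as dt
import json
import pathlib

BASELINE_PATH = pathlib.Path("evals/baseline.json")  # hypothetical location
MAX_BASELINE_AGE = dt.timedelta(days=90)             # one drill cadence

def load_fresh_baseline() -> dict:
    """Refuse to compare against a baseline older than the last drill."""
    record = json.loads(BASELINE_PATH.read_text())
    run_at = dt.datetime.fromisoformat(record["run_at"])  # stored with tz info
    age = dt.datetime.now(dt.timezone.utc) - run_at
    if age > MAX_BASELINE_AGE:
        raise RuntimeError(
            f"baseline from {run_at:%Y-%m-%d} is {age.days} days old; "
            "re-run the suite against the production model before comparing"
        )
    return record
```

Wire this into the suite runner and a stale baseline becomes an error at comparison time instead of a discovery during the post-migration argument.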

A subtler issue: the judge model itself can change. If your eval judge is a model that is also under deprecation pressure — and increasingly judges are foundation models — the judge becomes a moving part of the measurement instrument. Pin the judge model snapshot explicitly. Track the judge's own migration on a separate cadence. If the judge changes and the target model changes in the same quarter, the score delta becomes uninterpretable.
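
In practice this means the judge gets its own pin in the eval configuration, and a comparison is rejected when more than one pin has moved. A sketch of the idea, not any particular framework's schema:

```python
# Every moving part of the measurement instrument, pinned explicitly.
# All names and versions here are illustrative placeholders.
PINS = {
    "target_model": "prod-model-2025-06",   # migrates on the drill cadence
    "judge_model": "judge-model-2025-01",   # pinned snapshot, own cadence
    "judge_prompt_version": "v7",           # retuning the judge re-baselines
    "suite_version": "2026-02",             # adding cases re-baselines too
}

def interpretable_delta(old_pins: dict, new_pins: dict) -> bool:
    """A score delta means something only if at most one pin changed."""
    changed = [k for k in old_pins if old_pins[k] != new_pins[k]]
    if len(changed) > 1:
        raise ValueError(f"uninterpretable delta: {changed} changed together")
    return bool(changed)
```

The point of the exception is social as much as technical: it forces the judge migration and the target migration into different quarters.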

The Prompt-Portability Audit

Prompts tuned against one model family are not portable to another by default. They contain implicit assumptions about system-prompt placement, role-name conventions, tool-call formatting, stop-sequence behavior, and the model's tolerance for instructions of various shapes. A prompt that was iteratively refined over six months against one model often degrades by several eval points when the same prompt is run against a different family, even when published benchmarks suggest the two models are comparable.
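
An audit can start as a static pass over the prompt before any model is called. The sketch below flags a few of the family-specific assumptions listed above; the patterns are examples rather than an exhaustive lint, and the behavioral diff from the drill remains the real test.

```python
# portability_audit.py -- flag family-specific assumptions baked into a
# prompt before they become migration surprises. Patterns are illustrative.
import re

PORTABILITY_CHECKS = [
    ("role names written into the prompt body",
     re.compile(r"(?im)^(system|assistant|user)\s*:")),
    ("model-specific control tags",
     re.compile(r"</?(thinking|tool_call|function_call)\b")),
    ("hardcoded stop-sequence instructions",
     re.compile(r"(?i)stop\s+(generating|when you see)")),
]

def audit_prompt(prompt: str) -> list[str]:
    """Return the names of the checks this prompt trips."""
    return [name for name, pattern in PORTABILITY_CHECKS
            if pattern.search(prompt)]
```

Anything the audit flags becomes a line item for the next drill's behavioral diff, not an automatic rewrite.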
