Skip to main content

The Model Reached End of Life and Took Your Prompt With It

· 10 min read
Tian Pan
Software Engineer

A deprecation notice looks harmless. It arrives as a calm paragraph in a changelog or an email: this model snapshot will be removed from the API on a date a few months out, here is the recommended replacement, thank you for building with us. The implied work is a one-line change — swap the model string, redeploy, done.

That framing is wrong, and it is wrong in an expensive way. The model string is the smallest thing you are losing. The thing that actually leaves with the old model is the prompt you spent six months tuning — every edge-case patch, every reordered instruction, every "respond only with valid JSON, do not wrap it in markdown" you added because that specific model did that specific annoying thing. None of that was portable. It was fitted, in the statistical sense, to one model's behavior. The replacement is not bug-for-bug compatible, so the fit no longer holds.

A model end-of-life is a migration project. Treat it as a config change and you will discover the difference in production, on the new model, with real traffic.

A deprecation notice starts a clock, not a task

The notices are real and they are frequent. OpenAI scheduled chatgpt-4o-latest and several GPT-4-class snapshots for removal from the API in February 2026, with developers notified roughly three months ahead. Azure publishes rolling retirement tables for every model it hosts. DeepSeek announced that legacy V4 model names would stop resolving after a hard cutoff date. Across frontier providers, the first half of 2025 alone saw a dozen-plus major model releases, and each release quietly shortens the runway for whatever you are currently running.

The pattern underneath all of it: a usable production model has a shelf life of roughly twelve to eighteen months. That is not a worst case. That is the median. If your system has been in production for two years, you have already lived through at least one forced migration, whether you noticed it as a project or absorbed it as a series of small fires.

The deprecation notice gives you a date. It does not give you a plan, a test, or an estimate. What it actually starts is a clock, and the clock is running against work you have not scoped yet: re-validating every prompt, re-checking every tool-calling path, and re-confirming behavior on the slice of traffic that never showed up in your demos. The teams that get hurt are the ones who read the notice, filed it as "swap the string before the date," and only sized the real work when the date got close.

Your prompt is a fossil of one model's quirks

Here is the uncomfortable part. A prompt that has been in production for six months is not a clean specification of what you want. It is an archaeological record of one model's failure modes.

Every line that starts with "do not," "always," or "remember to" was added in response to something a specific model did. The model padded its JSON with prose, so you added a line forbidding prose. The model under-used a tool, so you added an emphatic instruction to use it. The model was too eager to refuse, so you softened the safety framing. The model misread an ambiguous field, so you reordered the context so the important part came last. Each patch was a local fix for a local quirk. Stacked together over six months, they form a prompt that is precisely shaped to one model's interpretation style — and to nothing else.

This is why prompt portability is mostly a fiction until you test it. The same instruction text, handed to a different model, lands differently. A prompt that reliably produces bare JSON on one model produces markdown-fenced JSON on another, or correct JSON plus an unsolicited friendly sentence. Instruction ordering that one model treats as priority, another treats as suggestion. The research on this is blunt: in few-shot settings, formatting changes alone — whitespace, separators, the order of examples — have been shown to swing accuracy by dozens of points. A single added "please" can move output quality. Your six months of tuning encoded all of that, invisibly, into a prompt you now think of as "the spec."

The trap is that the prompt still reads like a clean specification. Nothing in the text says "this clause exists because the March model snapshot double-escaped quotes." So when you migrate, you carry the whole fossil forward, and the patches that were load-bearing for the old model become dead weight — or active liabilities — on the new one. The new model never had the quirk your clause was fixing, and now your clause is fighting a behavior that does not exist.

The eval set is the only thing that survives a swap

If the prompt does not survive a model swap intact, what does? One thing: a re-runnable eval set.

An eval set is the single artifact that is defined in terms of what you want, not how a particular model behaves. "Given this support ticket, the response must correctly identify the refund policy and must not promise a refund outside the window" is true regardless of which model is behind it. That assertion outlives every model. The prompt that satisfies it does not.

This reframes what an eval set is for. Most teams build evals as a quality gate for prompt changes — edit the prompt, run the evals, ship if green. That is useful, but it undersells the asset. The eval set is also your migration insurance. It is the thing that lets you take a candidate replacement model, run it against a few hundred concrete cases that encode your real requirements, and get a number instead of a vibe. Without it, "is the new model good enough?" is answered by one engineer poking at it for an afternoon and declaring it fine.

A migration eval set has to be wider than a prompt-iteration eval set, because a model swap changes more surface area than a prompt edit does. Prompt iteration mostly affects the cases you were already thinking about. A model swap changes refusal behavior, output formatting under-specified schemas, tool-call frequency, latency distribution, and how the model handles inputs that look nothing like your happy path. Two models that score identically on a public benchmark can behave completely differently on the 15% of production traffic that is messy, adversarial, or just weird. So the migration eval set needs cases harvested from real production transcripts — especially the failures — not just the demo script.

The honest sequencing during a migration is: run the existing prompts against the new model first. Many prompts survive a generational jump fine; you do not want to rewrite what is not broken. Let the eval set tell you which prompts regressed, and rewrite only those. Rewriting a prompt you did not need to touch is how you introduce new bugs while fixing none. The eval set is what turns "rewrite everything, anxiously" into "rewrite these four, with evidence."

If you take one operational habit from this: the day you ship an LLM feature, you owe it a re-runnable eval set, because that set is the only thing that will still be standing the next time a model gets retired.

Straddling two models is not free

Migrations are not instant cutovers, and the period in between is where the unbudgeted cost lives.

A responsible migration runs the new model in shadow or canary mode before full cutover — mirror a slice of production traffic to the candidate, compare outputs, or route a small percentage of real traffic and watch the metrics. This is correct practice. It is also, for the duration, paying for inference twice. Shadow traffic is pure overhead: you serve the request on the old model and pay again to score the new one. Canary is cheaper but slower to build confidence. Either way, the migration has a cost line that does not exist in steady state, and it scales with how long you straddle.

The straddle cost is not only dollars. While two models are live, every prompt change has to be made and tested twice. Every incident triage starts with "which model served this request?" Every new feature either waits for the migration to finish or gets built against both targets. Engineers context-switch between two behavior models. Teams consistently underestimate this because the dual-running period feels like a temporary state that does not deserve real planning — and then it stretches, because the new model has one stubborn regression that takes a month to resolve, and the straddle quietly becomes the status quo.

The lesson is not to skip the straddle. It is to time-box it. Decide before you start how long the dual-running period is allowed to last and what the exit criteria are — eval pass rate, regression count, cost ceiling. An open-ended straddle is the most expensive way to run a migration, because you pay double for inference and double for engineering attention with no forcing function to stop.

Budget the treadmill as a fixed cost, not a surprise

The deepest mistake is treating each migration as a one-off event. It is not an event. It is a recurring operating condition.

If usable models turn over every twelve to eighteen months, then a production LLM system has a migration project on a permanent schedule, the same way it has dependency upgrades, certificate rotations, and security patches. Nobody is surprised by a TLS certificate expiring; it is on a calendar, it has an owner, and the renewal is routine. Model deprecation deserves the same treatment and rarely gets it, because it still feels like news every time.

Putting the treadmill in the budget means a few concrete things. It means the eval set is maintained continuously, not reconstructed in a panic when a notice arrives — a stale eval set is worth very little the day you need it most. It means there is a named owner for "model currency," the way there is an owner for dependency hygiene. It means each quarter's capacity plan reserves room for migration work, so it competes with features on the roadmap honestly instead of ambushing it. And it means the cost model for the whole system includes a periodic migration line item, amortized, rather than a recurring "unexpected" overrun every twelve months.

The framing that helps: the model is a dependency with a known, short support window. You would not build on a library that EOLs every year and act shocked annually. The model is exactly that library. The provider even sends you the calendar.

The takeaway

A model deprecation notice is not a config change waiting to happen. It is a migration project with a deadline you did not set. The model string swaps in a second; the prompt you fitted to that model's quirks does not survive, and the only thing that does is a re-runnable eval set defined in terms of what you actually need.

So treat it as standing infrastructure. Keep the eval set current and broad enough to catch behavior changes, not just accuracy changes. Run the old prompts against the new model first and rewrite only what the evals flag. Time-box the dual-running period with explicit exit criteria. And put the next migration in the budget now, because the provider has almost certainly already scheduled it — you just have not read that changelog entry yet.

References:Let's stay in touch and Follow me for more thoughts and updates