The Deprecation Date That Moved While It Sat in Your Backlog
The deprecation notice arrived on a Tuesday with a sunset date six months out. Your platform team logged it in the dependency tracker with a "Q3 cutover" label and a yellow severity. It joined two other migrations already in the queue. Three weeks later, the provider amended the date inside the same URL, no diff, no inbox notification, just a quietly updated paragraph that pulled the sunset sixty days earlier into the middle of your code freeze.
The lifecycle page you treated as a planning document was always a contract clock. The only thing that changed is which team's calendar it controlled — and the team that owns it is not yours.
The pattern is depressingly consistent across providers. A model your highest-value workload depends on gets a deprecation banner with a date in the future. You file it. You triage it against the other work. You give it the seriousness that the original date implied. Then the date moves, and the version that moves it is the only version that exists, because the page is the source of truth and the page just got rewritten in place.
The Deferred Bucket Is a Bet on the Vendor's Memory
Every engineering org has a triage bucket called "deferred," or "Q3," or "after the migration," that is functionally equivalent to "we trust this won't bite us before we get to it." That bucket works for internal work, where the deadline is a thing your team controls and can renegotiate. It breaks for external deadlines, where the only renegotiation channel is a vendor account manager who has no incentive to give you slack.
A model deprecation notice belongs in a different bucket. It is closer to a regulatory deadline than a feature request. The vendor has already committed to the sunset internally — there is a roadmap that says the GPU capacity backing your model is being repurposed, there is a successor model whose adoption metrics depend on starving the predecessor, there is a finance line item that depends on shutting down inference for an older architecture by a specific quarter. The published date is the soft end of a range the vendor has already narrowed internally. Treating it as a planning hint is a category error.
The fix is to route deprecation notices through whatever pipe your team uses for hard external deadlines. If your team has a runbook for SOC 2 audit prep, deprecation notices should land in that same runbook. The cutover gate should be named the day the deprecation lands, not the week before the date you optimistically assumed it would stay.
The Page Is the System of Record, and It Edits in Place
Software engineers have an instinctive expectation that important documents are versioned. The lifecycle pages most providers publish do not honor that expectation. There is one URL. The content at that URL is mutable. There is no public changelog of edits to the deprecation table. There is no commit history. There is, in many cases, no email when the sunset date changes — the email when the deprecation is first announced is treated as sufficient, and amendments are silent.
This is a remarkable architectural detail to anchor production decisions on. The team that depends on a date does not own the page where the date lives. The team that owns the page treats it as marketing-adjacent documentation that they can revise for clarity. The two teams have wildly different priors about how stable that field is, and the gap between those priors is where production incidents are born.
The mitigation is mechanical: poll the page on a schedule, diff the content, and alert when any field in the deprecation table changes. The polling cadence should be days, not weeks — daily is reasonable, hourly is paranoid but defensible if the workload depends on a model whose deprecation would be expensive to absorb. The diff target should be the entire table, not just the model rows you currently consume, because a row's appearance and disappearance both matter. When the alert fires, the action is not "discuss in the next sync." The action is to open the runbook and re-evaluate the cutover gate against the new date.
Eval Scores on the Canonical Set Do Not Authorize the Flip
The accelerated date forces the cutover plan to compress. The version that assumed two weeks of A/B traffic collapses to a weekend of dual-write and a Monday-morning flip. The team needs a green light, and the artifact that produces green lights is the eval suite. The eval suite returns a number above threshold. The flip is authorized.
The eval suite is measuring a fixed distribution. It was constructed against the traffic patterns and intent mix that existed when the labels were written, weeks or months ago. Production traffic is a live distribution that shifts continuously, and the part of it that breaks customer trust is rarely the median — it is the tail, the customer segments whose queries cluster in regions of the input space that the eval set under-samples.
The two failure modes here are independent and cumulative. The eval set is a snapshot, so it lags the distribution. The eval set is also a curated representation, so it under-represents the segments that the labelers found tedious, ambiguous, or hard to score. A model that scores close on the eval set can produce a tripled error rate on a specific customer segment whose queries the eval set treated as outliers. The signal that catches this is not the eval score. The signal is production error metrics segmented by customer cohort, watched on the day of the cutover and the week after.
The patch is to augment the eval set with stratified samples from each customer segment the workload serves, weighted toward segments whose queries are atypical. A pre-cutover dry run that routes one percent of production traffic through the replacement model from the day the deprecation lands gives the eval set six months of real-distribution feedback before the flip becomes mandatory. The team that runs this dry run finds out about the long-tail regression in the first week, not the sixth day after the flip.
The Customer Segment Whose Error Rate Tripled
There is a specific failure pattern worth naming because it has happened more than once. The cutover ships on Monday. By Tuesday, top-line error metrics look fine. By Friday, they still look fine. On the following Wednesday, a single customer's support volume crosses a threshold and pages the support oncall, who escalates to engineering, who finds that the customer's specific workflow has been failing at three times the previous rate since the flip.
The customer's traffic was a small enough fraction of total volume that its tripled error rate did not move the aggregate metric. The aggregate metric was the one being watched. The cohort breakdown existed in the dashboard but was not on the alert path. The six days between the flip and the page were six days during which the customer's users hit a degraded experience without anyone on the team knowing.
The mitigation here is to wire cutover dashboards to fire on per-cohort regressions, not just aggregate ones. The threshold should be tighter for cohorts representing meaningful contract value and tighter still for cohorts whose churn signal is harder to recover than acquisition cost. The first week after a model swap is the window when these alerts should be most sensitive, because that is when the highest-impact regressions surface, and the team's attention is already mobilized.
The Lifecycle Page Is a Vendor System of Record
Step back from the specific incident, and the architectural realization is uncomfortable. A production system you operate depends on a date stored in a system of record your team does not own, owned by a vendor whose incentive is to optimize their own roadmap and whose process for amending the date does not include notifying you. The team that does not poll the page on a schedule has built a release calendar that depends on the vendor remembering to notify them. The vendor will not remember every time.
This is the same shape as every vendor dependency that bit teams in earlier eras of software — the SaaS API whose breaking changes shipped behind a "minor version bump," the CDN whose IP ranges changed without an updated allowlist, the third-party SDK whose deprecation policy was a single sentence in a release note. The mature pattern in those domains is the same pattern that applies here: poll the source of truth, diff it on every poll, alert on every change, and treat the alert as a signal to re-evaluate the plan rather than as noise to dismiss.
A team that internalizes this stops thinking of the deprecation page as a planning document and starts thinking of it as a control plane operated by a vendor. The control plane sends signals at times the vendor chooses. The team's job is to receive those signals on the timescale the signals are actually sent, not on the timescale the team's quarterly planning cycle would prefer.
Build the Cutover Gate the Day the Deprecation Is Filed
The single highest-leverage change is structural. The moment a deprecation notice lands for a model your workload depends on, three things happen on the same day: a runbook entry is created that names the cutover gate, a fallback-traffic dry run is configured to route a small fraction of production through the replacement model, and a polling job is added to watch the lifecycle page for any field changes on that row. None of these wait for the next sprint. None of these get triaged into the deferred bucket.
The runbook entry is the artifact that survives the migration backlog reshuffles. It names the specific eval cohorts, the specific dashboards, the specific cutover window, and the specific rollback procedure. It is owned by a specific person, not a team, because the deferred bucket fills with team-owned tickets that no individual is accountable for. The polling job catches the silent amendment. The dry run accumulates the long-tail signal the canonical eval set will not produce.
The cost of all of this is small. The cost of not doing it is six days of degraded experience for a customer segment whose tripled error rate did not move the top-line metric. The provider's lifecycle page will continue to update in place. The vendor will continue to amend dates without notification. The only variable the team controls is whether the change registers the day it happens or the day support volume forces the page.
- https://platform.claude.com/docs/en/about-claude/model-deprecations
- https://www.anthropic.com/research/deprecation-commitments
- https://developers.openai.com/api/docs/deprecations
- https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/model-retirements
- https://openai.com/index/retiring-gpt-4o-and-older-models/
- https://vertesiahq.com/blog/your-model-has-been-retired-now-what
- https://news.ycombinator.com/item?id=37866550
