
The 12-Month AI Feature Cliff: Why Your Production Models Decay on a Calendar Nobody Marked

11 min read
Tian Pan
Software Engineer

A feature ships at 92% pass rate. The launch deck celebrates it. Twelve months later the same feature is at 78% — no incident report, no failed deploy, no single change to point at, just a slow erosion that nobody owned watching for. The team blames "hallucinations" or "user behavior shift," picks a junior engineer to investigate, and sets a quarterly OKR to "improve quality." The OKR misses. The feature ships an apologetic dialog telling users the AI sometimes makes mistakes. Six months after that, it's deprecated and replaced with a new version that ships at 91% pass rate, and the cycle starts again.

This isn't bad luck. It's the second clock that AI features run on, the one that nobody marks on the release calendar at launch. Conventional software has feature decay too — dependency drift, codebase rot, the slow accumulation of half-applied refactors — but those decay on a clock the engineering org already understands and budgets for. AI features have all of that, plus a parallel set of decay sources that conventional amortization assumptions don't model: model deprecations, vendor weight rotations, distribution shift in user inputs, prompt patches that compound, judge calibration drift, and the quiet aging of an eval set that no longer represents what production traffic looks like.

The architectural realization that has to land — before the next AI feature ships, not after — is that AI features have a non-zero baseline maintenance cost. The feature isn't done when it launches. It's enrolled in a maintenance schedule it can't escape, and the team that didn't budget for that schedule is going to discover it the hard way.

The Five Clocks Running Underneath Your Launch Date

When a conventional service ships at 99.95% reliability, the only way for it to drop to 99.7% next quarter is for somebody to push code, change config, or have a hardware failure. The path from "good" to "degraded" is mediated by an event the team can replay. AI features don't have that property. There are at least five clocks running underneath the launch, and any of them can be the cause of a regression a year later — usually multiple of them, compounding.

Clock one: model deprecation. OpenAI, Anthropic, and Google all now publish deprecation calendars that retire models on a 6–12 month cadence. GPT-4o, GPT-4.1, GPT-4.1 mini, and o4-mini all hit deprecation in 2026. Claude 3.5 Haiku is deprecated as of January 5, 2026, with shutdown on July 5. Gemini 2.5 Flash and 2.5 Pro hit deprecation in June 2026. If your feature shipped against any of those models, the wall is on the calendar whether or not your team noticed. The migration from one model version to its successor is rarely a one-line change, because the prompt was negotiated against the specific quirks of the old model and the successor reads it differently.

Clock two: silent weight rotation. Pinning a model name is not the same as pinning model behavior. Vendors update weights behind aliases, and "gpt-4o-2024-08-06" is more stable than "gpt-4o" but still not a guarantee that the model serving your traffic today is byte-identical to the model your eval signed off on six months ago. Some providers offer immutable revision pins; many don't. The team that assumed the alias was a contract is going to discover at audit time that the contract was a polite suggestion.
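A cheap tripwire, sketched below against the OpenAI Python SDK (the pattern ports to any vendor): pin the dated snapshot, and log the model name each response actually reports, so a moved alias shows up in your audit trail instead of your pass rate.

```python
import logging

from openai import OpenAI

log = logging.getLogger(__name__)

PINNED_MODEL = "gpt-4o-2024-08-06"  # dated snapshot, not the floating "gpt-4o" alias

client = OpenAI()
resp = client.chat.completions.create(
    model=PINNED_MODEL,
    messages=[{"role": "user", "content": "ping"}],
)

# The response reports which model actually served the request. Logging it on
# every call builds an audit trail; alerting on a mismatch tells you the alias
# moved under you. This still pins the *name* only: whether the weights behind
# a dated snapshot are immutable is a vendor guarantee, not an API property.
if resp.model != PINNED_MODEL:
    log.warning("model drift: pinned %s, served %s", PINNED_MODEL, resp.model)
```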

Clock three: input distribution shift. "Prompt drift" is a misnomer for the failure mode where the prompt hasn't changed but the production input distribution has. Users learn what your AI feature is good at and start asking harder questions; the long tail of unusual inputs grows as adoption broadens; new use cases emerge that the prompt template was never tested against. The eval set that signed off on launch was a snapshot of input distribution at month zero. Twelve months in, the eval set's coverage of what production actually sees is shrinking on a clock the team doesn't watch.
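You can put a number on that shrinking coverage without a labeling budget. A minimal sketch, assuming hypothetical load_eval_inputs and sample_prod_inputs helpers: compute a population stability index over a cheap scalar proxy of the inputs and alarm when it crosses the conventional threshold.

```python
import numpy as np

def psi(expected: np.ndarray, observed: np.ndarray, bins: int = 10) -> float:
    """Population stability index between two samples of a scalar feature.
    Rule of thumb: < 0.1 stable, 0.1-0.25 drifting, > 0.25 shifted."""
    edges = np.histogram_bin_edges(expected, bins=bins)
    e_pct = np.clip(np.histogram(expected, bins=edges)[0] / len(expected), 1e-6, None)
    o_pct = np.clip(np.histogram(observed, bins=edges)[0] / len(observed), 1e-6, None)
    return float(np.sum((o_pct - e_pct) * np.log(o_pct / e_pct)))

# Assumed helpers: the inputs your eval set was built from, and a fresh
# production sample. Input length is the cheapest proxy; embedding-based
# distances catch shifts that length alone misses.
launch = np.array([len(p.split()) for p in load_eval_inputs()])
prod = np.array([len(p.split()) for p in sample_prod_inputs(5000)])

if psi(launch, prod) > 0.25:
    print("eval set no longer representative of production; schedule a re-sample")
```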

Clock four: prompt patch accumulation. Every "this one customer reported X" fix lands in the system prompt as a sentence, sometimes a paragraph. Each individual patch is justified. The aggregate is a prompt that grew from 800 tokens to 2,400 over a year, where instructions in the middle get crowded out by newer ones, and the original behavioral intent of the prompt is now mediated by a stack of overrides that the team cannot reason about as a whole. Recent research on "prompt debt" — the technical-debt sibling of prompt drift — finds that prompt configuration is the dominant source of self-admitted technical debt in LLM projects, with instruction-based prompts particularly vulnerable.
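A tripwire that costs nothing to run in CI, sketched here with the tiktoken tokenizer and an assumed budget: fail the build when the system prompt outgrows its ceiling, so retiring an old patch becomes the price of landing a new one.

```python
import sys
from pathlib import Path

import tiktoken

TOKEN_BUDGET = 1200  # assumed ceiling; set one at launch and defend it in review

def check_prompt_budget(path: str) -> int:
    enc = tiktoken.get_encoding("o200k_base")  # tokenizer family for recent OpenAI models
    n = len(enc.encode(Path(path).read_text()))
    if n > TOKEN_BUDGET:
        sys.exit(f"system prompt is {n} tokens against a budget of {TOKEN_BUDGET}: "
                 "retire or consolidate patches before adding more")
    return n
```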

Clock five: judge and rubric calibration. If you're using an LLM-as-judge to grade your own output, the judge is itself running on a model that is on one of the four clocks above. A judge that gave you 92% pass rate at launch is now grading with a different model version, against a rubric written against a different production distribution, and the score it reports today and the score it reported a year ago are not directly comparable. The scoreboard is drifting under the eval, and the team that didn't track judge version alongside model version doesn't know it.
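One way to make "not directly comparable" concrete is to refuse to record a pass rate without the versions that produced it. A sketch; the field names are illustrative:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class EvalRun:
    """One row per eval run. A pass rate is only comparable to another
    pass rate when every version field below matches."""
    run_date: date
    feature: str
    model_version: str     # e.g. "gpt-4o-2024-08-06"
    prompt_version: str    # e.g. the git SHA of the prompt file
    judge_model: str       # the judge is on the same deprecation clocks
    rubric_version: str
    eval_set_version: str  # which quarterly snapshot of production traffic
    pass_rate: float
```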

The Diagnostic Problem That Makes the Cliff Cliff-Shaped

The reason "the cliff" is the right metaphor — and not "gentle slope" — is that all five clocks tick quietly until they don't. Each one individually moves the needle by a few points; the user experience absorbs that without complaint, support tickets are framed as "the AI made a mistake" rather than "the AI got worse," and the dashboard's pass-rate line dips a percent or two per quarter. Then a model deprecation forces a migration on a fixed deadline, the migration exposes the prompt debt that has accumulated, the eval set turns out not to represent current traffic so the team can't actually validate the migration, and the feature falls off a cliff because three quarters of cumulative drift get audited at once.

The diagnostic problem is that when the team finally looks, none of the clocks individually is the cause. The cause is that nobody owned watching them, so the regression has six contributing factors and no clean root cause. The post-mortem reads "we should have monitored this," and the action item is "add a dashboard," which the team builds, looks at for three weeks, then stops looking at because the next quarter's roadmap has new feature work on it. The cycle restarts.

The Maintenance Cadence That Has To Be On the Calendar Before Launch

The discipline that has to land — and "discipline" is the right word, because the failure mode is structural rather than technical — is that AI features ship with a maintenance cadence on the calendar from day one, owned by the same team that owns the feature, with the cost of that cadence priced into the feature's TCO before the launch decision is made.

Quarterly re-eval against fresh production traffic. The eval set is not a static artifact; it's a living dataset that pulls a sampled, labeled cohort from current production traffic each quarter. The labeling cost is real and recurring — somewhere between $5K and $30K per feature per quarter depending on label complexity — and it has to be a budget line, not an unfunded ask. The teams that get this right run the new eval against the current production model and the current production prompt, and they decompose the result into "model contribution," "prompt contribution," and "distribution contribution" so the regression can be attributed.
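The decomposition is a small grid, not a framework. A sketch, assuming a run_eval(model, prompt, eval_set) harness and that the launch model is still callable (one more reason to do this ahead of the deprecation deadline):

```python
from itertools import product

MODELS = ["model_at_launch", "model_now"]
PROMPTS = ["prompt_at_launch", "prompt_now"]
EVAL_SETS = ["eval_q0", "eval_now"]

# run_eval is an assumed harness returning a pass rate for one configuration.
scores = {combo: run_eval(*combo) for combo in product(MODELS, PROMPTS, EVAL_SETS)}

launch = scores[("model_at_launch", "prompt_at_launch", "eval_q0")]
now = scores[("model_now", "prompt_now", "eval_now")]

# One-factor-at-a-time deltas from the launch configuration. They won't sum
# exactly to (now - launch); a large residual means the factors compound.
model_delta = scores[("model_now", "prompt_at_launch", "eval_q0")] - launch
prompt_delta = scores[("model_at_launch", "prompt_now", "eval_q0")] - launch
dist_delta = scores[("model_at_launch", "prompt_at_launch", "eval_now")] - launch
residual = (now - launch) - (model_delta + prompt_delta + dist_delta)
```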

Monthly prompt audit. Every prompt patch landed in the last month gets reviewed as a unit, not as individual diffs. Patches that have been superseded by a model upgrade get retired. Patches that contradict each other get reconciled. The audit produces a prompt-changelog entry that names the current behavioral intent of the prompt and what's enforcing it where, so the next on-call who has to debug a regression can read the prompt as a coherent document rather than a sediment of historical fixes.
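What one audit's changelog entry might carry, as a structured record rather than prose buried in a wiki. The field names are illustrative; the point is that it states intent, not diff history:

```python
changelog_entry = {
    "date": "2026-03-01",
    "prompt_version": "a1b2c3d",  # hypothetical git SHA of the prompt file
    "token_count": 1140,
    "behavioral_intent": (
        "Answer billing questions from the linked account only; refuse to "
        "speculate about unreleased pricing; escalate refund requests to a human."
    ),
    "patches_retired": [
        "2025-09 list-formatting workaround, superseded by the model upgrade",
    ],
    "patches_reconciled": [
        "merged two contradictory tone instructions into one sentence",
    ],
}
```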

Model-pin review on the vendor's deprecation cadence. Every vendor model your feature depends on has a deprecation calendar; the team's release calendar should mirror it. The model-pin review happens before the deprecation deadline, not at it, and the migration target is selected after a calibration eval against the candidate successor — not at the moment the API starts returning 4xx. Teams that wait until the deprecation deadline are doing migration under time pressure with no eval headroom; teams that lead the deprecation by 60+ days have the headroom to ship the migration as a normal release.
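The mirror can be as simple as a hand-maintained registry and a scheduled check. A sketch; the Claude shutdown date comes from the vendor calendar cited above, and the GPT entry is a placeholder:

```python
from datetime import date, timedelta

LEAD_TIME = timedelta(days=60)  # open the migration window at least 60 days out

# Hand-maintained mirror of each vendor's published deprecation calendar.
DEPRECATIONS = {
    "claude-3-5-haiku": date(2026, 7, 5),
    "gpt-4o-2024-08-06": date(2026, 11, 1),  # placeholder, not authoritative
}

def pins_in_window(pinned: list[str], today: date | None = None) -> list[str]:
    """Return the pinned models whose migration window has already opened."""
    today = today or date.today()
    return [m for m in pinned
            if m in DEPRECATIONS and today >= DEPRECATIONS[m] - LEAD_TIME]

# Wire into CI or a weekly cron: a failing check is the calendar reminder the
# team otherwise gets from a 4xx in production.
overdue = pins_in_window(["gpt-4o-2024-08-06"])
if overdue:
    raise SystemExit(f"migration window open for: {overdue}")
```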

Judge version pinning and recalibration. If you grade with an LLM-as-judge, the judge model and rubric version are versioned alongside the production model and prompt. When the judge model is itself deprecated, the recalibration is a project: re-grade a held-out set with both the old and new judge to establish the correspondence between scores, and announce the new score baseline rather than letting the dashboard quietly shift.
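A sketch of the correspondence step, assuming hypothetical load_holdout and grade helpers, where grade returns per-item rubric scores in [0, 1]:

```python
import numpy as np

items = load_holdout()  # assumed helper: held-out set graded by both judges
old_scores = np.array(grade(items, judge="judge-v1"))  # assumed helper
new_scores = np.array(grade(items, judge="judge-v2"))

# Fit a linear correspondence old ≈ a * new + b so historical thresholds can
# be translated instead of silently broken.
a, b = np.polyfit(new_scores, old_scores, deg=1)
r = np.corrcoef(new_scores, old_scores)[0, 1]

print(f"old ≈ {a:.2f} * new + {b:.2f} (r = {r:.2f})")
print(f"launch-era 92% pass bar maps to {(0.92 - b) / a:.2%} under the new judge")

# If r is low, the judges disagree on ordering, not just scale: that calls for
# a rubric rewrite and an announced re-baseline, not a linear correction.
```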

A baseline maintenance budget allocated at launch. The hardest part of this discipline is not technical; it's that AI features have a recurring cost line that conventional features don't, and the team that prices the feature like a conventional feature is underpricing it. A reasonable rule of thumb for production-grade AI features in 2026 is 15–25% of the original build cost per year as ongoing maintenance (on a $400K build, that's $60K–$100K a year), growing if the feature is high-stakes or compliance-relevant. The PM who shipped the feature owns that line. If they can't get it funded, the feature ships with a known decay trajectory and a known sunset date — which is itself a defensible decision, but only if it's named.

The Org Failure Mode That Drives the Cliff

The technical disciplines above are not difficult to write down. They're difficult to fund. The org failure mode that drives the 12-month cliff is structural: AI features get launch budget, not maintenance budget, and the engineering org that scaled by shipping AI features fast doesn't have a slot in its operating model for the engineer whose job is to keep an already-shipped feature from decaying.

The PM is incentivized on launches, not retained quality of last year's launches. The eval engineer is shared across features and is the bottleneck the previous quarter's blog post warned about. The ML platform team owns the infrastructure but not the per-feature evals. The feature engineering team owns the feature but is loaded with new feature work and has no separate slot for the maintenance work. So the maintenance work falls to whoever shows up — and at twelve months it's the on-call who got paged because a customer escalated, and they're doing a year of compounded drift triage in a one-week firefight.

The fix is not to write a runbook. The fix is to put a named owner on the maintenance cadence and to count the maintenance hours against the team's capacity at planning time, exactly the way ops headcount counts against the SRE team's capacity. AI features that don't have an owner for the maintenance cadence will fall off the cliff; the question is only when.

What Changes If You Don't Pay This Cost

Some teams will read this and decide, reasonably, that the maintenance cost isn't worth it for every feature. That's a defensible position, and the architectural realization underneath this whole post is that the cost is real and the choice has to be made deliberately. Three options the team can pick from, all of them sustainable if named:

Option one: full maintenance cadence. The feature is mission-critical, the cost of decay is high, and the team budgets the 15–25% annualized maintenance line and staffs the named owner. The feature gets a 5-year horizon and a defended pass rate.

Option two: explicit sunset. The feature is useful but not load-bearing. The team ships it with a 12–18 month sunset on the calendar, no ongoing maintenance after launch, and a planned replacement or removal at sunset. Users and stakeholders are told the horizon at launch. The feature decays gracefully into retirement.

Option three: declared decay. The feature stays in production, but the dashboard shows the decaying pass rate honestly and the product surfaces "AI may be wrong" affordances calibrated to the current pass rate, not the launch pass rate. Decay is a known operating condition, not a hidden one.

The wrong option — the option most teams default to — is option four: ship and forget, then surprise. That's the path that produces the cliff, the unattributable regression, the firefight migration, and the apologetic blog post twelve months later about how AI is hard.

It is hard. But it's hard on a clock the calendar already shows you, if you're willing to mark it before the launch instead of after the regression.
