The AI Feature Sunset Playbook Nobody Writes
Every AI org has a graveyard. Not of services — those get a runbook, a deprecation banner, a 30-day migration window, and a slot on the platform team's quarterly roadmap. The graveyard is of features: the smart-summary beta that never graduated, the auto-categorizer that two enterprise customers actually built workflows around, the agentic flow that demoed beautifully and shipped behind a flag that nobody flipped off. The endpoint is easy to deprecate. The four other things attached to it — the prompt, the judge, the regression set, and the incident memory — are what actually take a quarter, and nobody on the team has written the playbook because nobody has been promoted for retiring something.
This is the gap. Most of the public discourse on "model deprecation" is about vendor-side retirements: GPT-4o leaves on a date, Assistants API beta sunsets on August 26, DALL-E 3 retires on May 12, and your platform team has a notification period to migrate. That problem has playbooks because vendors publish dates, because the migration is forced, and because the work fits in a sprint. The internal version — when you decide a feature you built didn't graduate, and you have to actually take it out — has none of those forcing functions. The deprecation date is whatever you say it is. The migration path is whatever you build. And the artifacts you have to retire are not a single endpoint but a tangled stack of model-adjacent assets that your monitoring barely knows exist.
What follows is the playbook I wish someone had handed me the first time I had to take an AI feature out of production. It assumes the easy part — sunsetting the user-facing surface — is the part you already know how to do. It focuses on the four things that don't get decommissioned by default and that keep generating cost, noise, and quiet risk for quarters after the feature is supposed to be gone.
The Five Things You're Actually Sunsetting
Before anything else, name the assets. A traditional service deprecation has one or two: the endpoint, maybe the database table. An AI feature has at least five, and they have different owners, different decay rates, and different decommissioning processes.
The user-facing surface. The button, the UI element, the API route that the customer touches. This is the one product managers and the comms team think about. It is also the easiest part. You add a deprecation header, you put up a banner, you set a date, you respect the date. Done.
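For an HTTP surface, the mechanics of that easy part fit in a dozen lines. A minimal sketch using FastAPI middleware; the route prefix, sunset date, and docs link are hypothetical:

```python
from fastapi import FastAPI, Request

app = FastAPI()

# Hypothetical values: the route prefix and the sunset date are illustrative.
DEPRECATED_PREFIX = "/v1/smart-summary"
SUNSET_DATE = "Tue, 30 Sep 2025 00:00:00 GMT"

@app.middleware("http")
async def announce_sunset(request: Request, call_next):
    response = await call_next(request)
    if request.url.path.startswith(DEPRECATED_PREFIX):
        # Deprecation and Sunset (RFC 8594) headers let API clients discover the
        # retirement date programmatically instead of relying on the banner.
        response.headers["Deprecation"] = "true"
        response.headers["Sunset"] = SUNSET_DATE
        response.headers["Link"] = '</docs/sunsets/smart-summary>; rel="sunset"'
    return response
```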
The prompt and its version history. Every prompt that ever shipped to production for this feature now lives somewhere — a prompt registry, a config file, a feature flag payload, an environment variable in a Helm chart, an inline string in a Python module someone forgot to refactor. Mature registries support a "deprecated" label that hides versions from new SDK pulls while preserving history for audit, which is the right pattern. But if you don't have a registry — and most teams running shipped AI features for less than two years don't — your prompt version history is genuinely distributed across three repos, two config systems, and one Slack thread.
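The pattern the mature registries implement is worth sketching, because it is also what you rebuild by hand if you have no registry. This is a hypothetical in-house registry, not any vendor's SDK: deprecated versions disappear from live reads but stay queryable for audit.

```python
from dataclasses import dataclass, field

@dataclass
class PromptVersion:
    prompt_id: str
    version: int
    text: str
    labels: set[str] = field(default_factory=set)  # e.g. {"production"} or {"deprecated"}

class PromptRegistry:
    """Hypothetical registry showing the deprecated-label pattern."""

    def __init__(self) -> None:
        self._versions: list[PromptVersion] = []

    def publish(self, prompt_id: str, text: str) -> PromptVersion:
        for v in self.history(prompt_id):
            v.labels.discard("production")  # only the newest version serves traffic
        version = PromptVersion(prompt_id, len(self.history(prompt_id)) + 1, text, {"production"})
        self._versions.append(version)
        return version

    def get(self, prompt_id: str, label: str = "production") -> PromptVersion:
        """What application code calls: deprecated versions are never served."""
        for v in self._versions:
            if v.prompt_id == prompt_id and label in v.labels and "deprecated" not in v.labels:
                return v
        raise LookupError(f"no live version of {prompt_id!r} with label {label!r}")

    def history(self, prompt_id: str) -> list[PromptVersion]:
        """What an audit calls: every version that ever existed, deprecated or not."""
        return [v for v in self._versions if v.prompt_id == prompt_id]

    def deprecate(self, prompt_id: str) -> None:
        """Hide every version of the prompt from new pulls without deleting anything."""
        for v in self._versions:
            if v.prompt_id == prompt_id:
                v.labels.discard("production")
                v.labels.add("deprecated")
```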
The judge. If the feature had any kind of LLM-as-judge eval scaffolding, you built at least one custom judge with its own rubric, its own calibration log, and quietly its own prompt that drifts on its own schedule. The judge is the asset most likely to be orphaned because it doesn't ship to users, doesn't appear in any dashboard the product team watches, and only the engineer who built it knows it exists. It will keep running on a cron forever unless you turn it off.
The regression set. The eval cases that were added every time the feature broke. Each case has provenance: it guards against a specific failure mode that surfaced in a specific incident. If you delete the eval set wholesale, you lose the institutional memory of why those failure modes mattered. If you leave it, it runs forever against a model that no longer powers the feature, generating noise that future engineers will misread.
The incident memory. The Slack threads, the postmortems, the prompt-edit reasoning, the "we tuned this against the case where the user gives a date in DD/MM/YY format and the model assumed US locale" decisions that exist only as commit messages and tribal knowledge. This is what most teams lose silently when the feature engineer leaves.
If your sunset plan does not name where each of these five assets lives and who owns retiring it, your sunset plan is a comms plan, not a decommissioning plan.
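The inventory itself can be a twenty-line artifact attached to the sunset ticket. A sketch with hypothetical locations and owners:

```python
from dataclasses import dataclass

@dataclass
class SunsetAsset:
    name: str         # which of the five assets this is
    location: str     # repo path, registry entry, cron job, wiki page, Slack channel...
    owner: str        # the person accountable for retiring it
    disposition: str  # delete, archive, migrate, or write-down

# Illustrative entries; the locations and owners below are placeholders, not a prescription.
SUNSET_INVENTORY = [
    SunsetAsset("user-facing surface", "api-gateway/routes/summary.py", "product eng", "delete"),
    SunsetAsset("prompt + version history", "prompt registry: smart-summary/*", "ai platform", "archive"),
    SunsetAsset("judge", "evals/judges/summary_judge (nightly cron)", "ai platform", "delete, archive rubric"),
    SunsetAsset("regression set", "evals/regression/summary/", "ai platform", "migrate"),
    SunsetAsset("incident memory", "postmortems, #proj-summary Slack, commit messages", "feature engineer", "write-down"),
]
```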
The Customer-Workflow Problem (Even for Betas)
The premise of a beta is that customers shouldn't depend on it. The reality is that any AI feature with non-trivial output quality gets bolted into a customer workflow within weeks, because the model is doing something the customer was previously paying a human to do, and that human is now reassigned. By the time you decide the beta didn't graduate, three of your top fifty customers have it in a Zapier flow, one has it scripted against your private API, and your enterprise team has quietly added it to a renewal pitch deck that closed two weeks ago.
The polite vendor-side version of this problem comes with a 90-day notification minimum, an explicit replacement model, and a migration guide. The harder internal version is that you may not have a replacement. The feature didn't graduate because the success criteria weren't met; pointing customers to "the new version" assumes there is a new version, which there often is not.
The honest playbook is this: before you announce the sunset internally, identify which customer workflows depend on the feature, what the next-best option is for each (and "do it manually again" is sometimes the answer), and what notice period is genuinely required for each tier. Enterprise customers with the feature in a renewal commitment need a different sunset cadence than self-serve users on a free tier. Treating them identically is what generates the support escalation that drags the sunset across two quarters.
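The dependency pass is mostly a usage-log query plus a tier policy. A sketch, assuming a simple per-customer log schema and illustrative notice periods:

```python
from collections import defaultdict
from datetime import timedelta

# Assumed log schema: one record per customer per week with a call count and a tier.
# The notice periods below are illustrative policy, not a recommendation.
NOTICE_BY_TIER = {
    "enterprise": timedelta(days=90),
    "pro": timedelta(days=60),
    "free": timedelta(days=30),
}

def dependent_customers(usage_events, min_calls_per_week=5):
    """Return customers whose steady usage suggests the feature is in a workflow."""
    calls: dict[str, int] = defaultdict(int)
    tiers: dict[str, str] = {}
    for event in usage_events:
        calls[event["customer_id"]] += event["calls_last_week"]
        tiers[event["customer_id"]] = event["tier"]
    return [
        {
            "customer_id": cid,
            "tier": tiers[cid],
            "calls_last_week": count,
            "minimum_notice": NOTICE_BY_TIER[tiers[cid]],
        }
        for cid, count in calls.items()
        if count >= min_calls_per_week  # steady usage is a proxy for a workflow dependency
    ]
```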
The asymmetry to internalize: every other deprecation discipline in the org (DB migrations, API versioning, mobile app forced upgrades) has a process for this. AI features didn't ship through those processes because they were "just a beta," so the muscle isn't there. Build it now or pay the support cost later.
Retiring the Prompt Without Erasing the Lineage
The instinct on prompt cleanup is to grep for the prompt's identifier across the codebase, delete the references, and call it done. This is the wrong half of the work.
The right work is to mark the prompt deprecated in whatever registry holds it, preserve every version that ever served production traffic, and keep the version history queryable for at least one full audit cycle — typically twelve months, longer if the feature touched anything regulated. The reason is not nostalgia. The reason is that when, eighteen months from now, a customer escalates a billing dispute or a compliance review surfaces a question about how a specific output was generated, the prompt version that produced that output needs to be reconstructible. If you deleted it, you cannot answer the question, which means you cannot defend the output.
What you do delete: the ghost references. Every config file that lists the prompt as a candidate variant. Every feature flag that gates traffic to the deprecated version. Every CI test that asserts the prompt still parses. Every cron job that re-runs the prompt against a daily eval set to catch drift. These are the orphaned configs that compound flag debt. The feature-flag literature is blunt about this: orphaned flags keep accumulating until somebody audits them out, and the AI version of the problem is worse because each "flag" carries a 4KB prompt payload that nobody dares touch.
Concrete checklist for the prompt retirement, in order:
- Mark every prompt version associated with the feature as deprecated in the registry. Set the production label to null so the SDK throws on attempted use rather than silently falling back.
- Search the codebase for hardcoded references to the prompt ID and replace them with a deprecation error, not a silent fallback. A silent fallback is how the prompt survives for another six months in a code path nobody knew existed.
- Remove the feature flag and the config entries that gate traffic to the prompt. Add a CI check that fails if the prompt ID reappears in any new code change (a minimal sketch of such a check follows this list).
- Move the prompt's commit history, intent documentation, and incident references to a long-lived archive location, not the registry's working set. The registry should reflect what is live; the archive should reflect what was.
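The CI gate from the third step can be as blunt as a repository scan. A minimal sketch; the retired prompt IDs and the allowlisted archive path are hypothetical:

```python
#!/usr/bin/env python3
"""CI gate: fail the build if a retired prompt ID reappears in the codebase."""
import pathlib
import sys

RETIRED_PROMPT_IDS = ("smart-summary-v3", "smart-summary-v4")
ALLOWED_PREFIXES = ("docs/archive/",)  # the long-lived archive may still mention the IDs

def main() -> int:
    this_file = pathlib.Path(__file__).resolve()
    offenders = []
    for path in pathlib.Path(".").rglob("*"):
        # Skip directories, this script itself, the archive, and git internals.
        if not path.is_file() or path.resolve() == this_file:
            continue
        if str(path).startswith(ALLOWED_PREFIXES) or ".git" in path.parts:
            continue
        text = path.read_text(errors="ignore")
        for prompt_id in RETIRED_PROMPT_IDS:
            if prompt_id in text:
                offenders.append(f"{path}: references retired prompt {prompt_id!r}")
    for line in offenders:
        print(line, file=sys.stderr)
    return 1 if offenders else 0

if __name__ == "__main__":
    sys.exit(main())
```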
The Judge and the Regression Set: Different Decisions
These two assets look similar — both are eval-side, both run on a schedule, both are owned by the AI engineering team. But they decommission differently and conflating them is a mistake.
The judge should usually be retired with the feature. A judge that was tuned against the feature's specific failure modes is unlikely to generalize to a successor feature without re-calibration, and a re-calibration is essentially a new judge. Leaving the old judge running generates noise: it grades outputs from a model the feature no longer uses, against a rubric tuned to a prompt that no longer ships, and the alerts it fires (or stops firing) tell you nothing actionable. Turn it off. Archive the rubric, the calibration log, and the prompt. If a successor feature emerges, the next engineer will build a judge from scratch anyway — the rubric is reference material, not reusable infrastructure.
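Archiving the judge is one short script if its assets live on disk. A sketch with hypothetical paths; the cron or scheduler entry that triggered it still has to be removed separately:

```python
import json
import pathlib
import shutil
from datetime import datetime, timezone

JUDGE_DIR = pathlib.Path("evals/judges/summary_judge")          # hypothetical working copy
ARCHIVE_DIR = pathlib.Path("docs/archive/judges/summary_judge")  # hypothetical archive

def archive_and_retire_judge(reason: str) -> None:
    ARCHIVE_DIR.mkdir(parents=True, exist_ok=True)
    # Preserve the three things worth keeping: rubric, calibration log, judge prompt.
    for name in ("rubric.md", "calibration_log.jsonl", "judge_prompt.txt"):
        src = JUDGE_DIR / name
        if src.exists():
            shutil.copy2(src, ARCHIVE_DIR / name)
    # Record why the judge was retired so the archive explains itself later.
    (ARCHIVE_DIR / "RETIRED.json").write_text(json.dumps({
        "retired_at": datetime.now(timezone.utc).isoformat(),
        "reason": reason,
    }, indent=2))
    # Delete the working copy so nothing can quietly keep grading against it.
    shutil.rmtree(JUDGE_DIR)
```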
The regression set is different. The individual cases in a regression set are unit tests for failure modes — and failure modes outlive features. A test case added because the model hallucinated dates in DD/MM/YY locales is valuable to any future feature that touches date parsing, not just the one it was added under. The right move is to migrate the regression set, not retire it: re-tag the cases with the failure mode they guard against (rather than the feature they were added under), move them to a shared regression-set namespace, and ensure they keep running against the models the rest of your features depend on.
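A sketch of that migration, assuming eval cases are JSON files with a tags field; the directory layout, tag names, and case-to-failure-mode map are illustrative:

```python
import json
import pathlib

OLD_FEATURE_DIR = pathlib.Path("evals/regression/smart_summary")  # hypothetical
SHARED_DIR = pathlib.Path("evals/regression/shared")              # hypothetical

# Mapping from case ID to the failure mode it guards against. Building this map
# is the real work: it forces someone to recover why each case exists.
FAILURE_MODE_BY_CASE = {
    "case_0042": "date-parsing/ambiguous-locale",
    "case_0107": "hallucinated-citation",
}

def migrate_case(path: pathlib.Path) -> None:
    case = json.loads(path.read_text())
    failure_mode = FAILURE_MODE_BY_CASE.get(case["id"])
    if failure_mode is None:
        return  # feature-specific case: leave it for the archive pass, don't migrate
    case.setdefault("tags", [])
    case["tags"] = [t for t in case["tags"] if t != "feature:smart-summary"]
    case["tags"].append(f"failure-mode:{failure_mode}")
    case["provenance"] = case.get("provenance", "added under smart-summary beta")
    SHARED_DIR.mkdir(parents=True, exist_ok=True)
    (SHARED_DIR / path.name).write_text(json.dumps(case, indent=2))
    path.unlink()  # remove from the old, feature-scoped namespace

if __name__ == "__main__":
    for case_file in sorted(OLD_FEATURE_DIR.glob("*.json")):
        migrate_case(case_file)
```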
This re-tagging is the part teams skip. Without it, regression cases tagged "from the [old feature] eval set" get garbage-collected six months later because nobody is sure they're still relevant, and the institutional memory of why they exist evaporates. With it, the regression set becomes one of the few artifacts of the failed feature that actually compounds value forward.
Preserving Incident Memory When the Engineer Leaves
The last asset is the one most teams genuinely lose. The engineer who built the feature knows things that aren't written down: which prompt edits were tried and didn't work, which judge calibration was thrown off by which prompt change, which model migration broke which downstream behavior and how it was patched. When that engineer rotates off — or leaves entirely — and the feature gets sunset shortly after, that knowledge goes with them.
The defensive move is to write it down before the sunset, not during. A sunset retrospective that is filed three weeks before the deprecation date should cover, at minimum: what the feature was supposed to do, what it actually did, why it didn't graduate, what was learned about the failure modes, which of those failure modes are general (and should outlive the feature) versus specific (and can be archived), and what the engineer would tell their successor if they were inheriting the work.
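One possible shape for that document, written so the sunset process can gate on it rather than trust that a doc exists somewhere; the field names simply mirror the questions above:

```python
from dataclasses import dataclass, field

@dataclass
class SunsetRetrospective:
    feature: str
    intended_behavior: str          # what the feature was supposed to do
    actual_behavior: str            # what it actually did
    why_it_did_not_graduate: str
    failure_modes_general: list[str] = field(default_factory=list)   # should outlive the feature
    failure_modes_specific: list[str] = field(default_factory=list)  # can be archived with it
    notes_for_a_successor: str = ""

    def is_complete(self) -> bool:
        """Let the sunset checklist verify the retrospective was actually written."""
        required = (self.intended_behavior, self.actual_behavior,
                    self.why_it_did_not_graduate, self.notes_for_a_successor)
        return all(required)
```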
This document is not a postmortem. Postmortems are about incidents; this is about features. Most orgs have no template for it because most orgs have not yet had to retire enough AI features to develop the muscle. The template doesn't need to be elaborate — a single page is enough — but it needs to exist and it needs to be a required artifact of the sunset, not a nice-to-have.
The leadership realization that has to land here: AI feature sunsets are not a sub-class of service deprecation that happens to use a different SDK. They have their own asset inventory, their own customer-dependency dynamics, their own preservation problem. The org that treats them with the generic deprecation process will keep losing the regression cases that catch the next bug, the judge calibration that survives the next model migration, and the engineer whose work was never visible because the artifact was a 30-line system-prompt diff rather than a 5000-line service rewrite.
What "Done" Actually Looks Like
A sunset is done when all of the following are true, not when the endpoint returns 410:
- The user-facing surface is removed, with appropriate notice given to each customer tier based on actual dependency, not assumed dependency.
- Every prompt version that served production traffic is preserved, marked deprecated, unreachable from any live code path, and queryable for audit.
- The judge is turned off, its rubric and calibration log archived as reference material, and no automation is still grading anything against it.
- The regression set is split: feature-specific cases archived with provenance, general failure-mode cases re-tagged and migrated into a shared regression-set namespace that survives the feature.
- A sunset retrospective exists, owned by the feature engineer, covering what was learned and what should outlive the feature.
- Every feature flag, config entry, and CI check related to the feature is removed, with a CI gate preventing them from being re-added.
If you cannot point to a specific artifact or commit for each of these, the sunset is not done. It is in a state that will quietly generate cost, noise, and risk for the next several quarters until someone is forced to come back and finish the job — usually under worse conditions, with less context, and with the original engineer gone.
The case for actually writing this playbook — naming the assets, defining the decommissioning steps, requiring the retrospective — is not that any single feature sunset will be catastrophic if you skip it. It's that AI orgs ship enough features now that the cumulative cost of skipping it is large, and the marginal cost of doing it right the first time is small. The first sunset you do this way will feel like overkill. The fifth one will feel like the only sane way to operate. By the tenth, the muscle is built, and the engineers who joined after the playbook existed will assume this is just how AI features end. That last sentence is the goal.
- https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/model-lifecycle-retirement
- https://docs.databricks.com/aws/en/machine-learning/retired-models-policy
- https://platform.openai.com/docs/deprecations
- https://community.openai.com/t/assistants-api-beta-deprecation-august-26-2026-sunset/1354666
- https://help.webflow.com/hc/en-us/articles/36046081578515-Feature-sunsets-deprecations
- https://langfuse.com/docs/prompt-management/features/prompt-version-control
- https://www.braintrust.dev/articles/what-is-prompt-versioning
- https://mlflow.org/docs/latest/genai/prompt-registry/
- https://launchdarkly.com/blog/prompt-versioning-and-management/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
