AI Feature Decommissioning Forensics: What Dead Features Teach That Successful Ones Cannot
Here's an uncomfortable pattern: the AI feature your team is about to launch next quarter already died at your company two years ago. It shipped under a different name, with a different prompt, solving a vaguely different problem, and it got quietly decommissioned after six months of flat adoption. Nobody wrote it up. Nobody connected the dots. The leading indicators that would have saved this cycle were sitting in dashboards that got archived along with the feature.
Most engineering orgs are elaborate machines for remembering successes. Launches get retrospectives, blog posts, internal celebrations. The features that got killed — the ones with 12% weekly active users despite a polished demo, the ones whose unit economics inverted when token costs compounded across a longer-than-expected tool chain, the ones users learned to trust, lost trust in, and then routed around — generate almost no institutional memory. And the failure patterns embedded in those deaths are exactly the ones your planning process has no way to price in.
This is not a cultural problem. It's an observability problem applied to the wrong layer. Teams instrument the model. They instrument the prompt. They rarely instrument the decision-making that led to shipping the feature in the first place, and they almost never go back and audit those decisions against what actually happened. Decommissioning forensics is the discipline of systematically studying dead AI features the way SREs study incidents: with a template, a database, and the assumption that the next failure rhymes with a previous one.
Why launch retros don't catch the real failure modes
The launch retrospective assumes the hard part is shipping. For traditional software, this is mostly right — once the feature is in production, the code does what the code does, and the interesting failure modes surface within weeks. For AI features, the hard part is the opposite: the feature ships easily, the demo works, early metrics look fine, and then, six months in, a slow-motion failure unfolds that retrospectives aimed at the launch are structurally incapable of catching.
Three failure classes drive most AI feature deaths, and all three are invisible at launch:
- Trust erosion. The feature works 85% of the time, which is great in isolation but catastrophic if the 15% shows up on the tasks users care about most. Users develop a workaround, then a habit of bypassing the feature entirely. By the time churn data reflects this, the feature has been broken for months.
- Cost compounding. A single-call feature that costs $0.003 per request looks profitable at launch. When product-led growth drives usage into multi-turn conversations, agentic tool chains, or retrieval over larger corpora, per-session cost quietly climbs 10–30x. Gross margin inverts before finance catches it because the cost accounting maps tokens to infrastructure, not tokens to features.
- Adoption collapse after novelty. First-month usage runs hot because the feature is new and internally promoted. Week-12 usage tells the real story, and by then the team has moved on.
None of these show up in a launch-week retrospective. They also don't show up in standard product analytics dashboards, which typically trend-line DAU and feature-open rates — metrics that conceal the failure modes above rather than surface them. You have to go back and study features that completed the full lifecycle, including death, to see the patterns clearly.
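The cost-compounding failure mode above is easy to sketch with back-of-envelope arithmetic. All numbers here are illustrative, not drawn from any particular product: the point is that a per-call price that looks trivial at launch multiplies through turns and tool calls as usage matures.

```python
# Back-of-envelope sketch of cost compounding (all numbers illustrative).
# A feature priced per single call can see per-session cost climb 10-30x
# once sessions become multi-turn and each turn fans out into tool calls.

def cost_per_session(turns, tool_calls_per_turn, tokens_per_call,
                     price_per_1k_tokens):
    """Rough per-session inference cost, ignoring caching and retries."""
    calls = turns * (1 + tool_calls_per_turn)  # model call + tool calls per turn
    return calls * tokens_per_call / 1000 * price_per_1k_tokens

launch = cost_per_session(turns=1, tool_calls_per_turn=0,
                          tokens_per_call=1500, price_per_1k_tokens=0.002)
mature = cost_per_session(turns=8, tool_calls_per_turn=2,
                          tokens_per_call=1500, price_per_1k_tokens=0.002)

print(f"launch: ${launch:.4f}/session, mature: ${mature:.4f}/session, "
      f"growth: {mature / launch:.0f}x")  # launch $0.0030, mature $0.0720, 24x
```

Note that nothing about the model got worse here; the 24x came entirely from session shape, which is why finance dashboards keyed to per-request cost miss it.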
The post-mortem template that makes dead features productive
A generic launch retrospective doesn't work here because the interesting questions are specific to AI features. The template I've seen work is structured around five forensic questions, each backed by a specific artifact the team has to produce.
1. What was the feature's falsifiable thesis? Not its user story — its thesis. Something like "users will accept AI-generated summaries for 70%+ of documents without edits, which justifies the inference cost at current token prices." If you can't reconstruct a falsifiable thesis from the original docs, that's finding #1: the feature was built on vibes and shipped on momentum. Almost every killed AI feature I've seen retroactively flunks this test, and the teams that survived the experience adopted pre-launch thesis documents that made future kills faster.
2. Which leading indicator would have told us sooner? This is the highest-leverage question in the entire template. For every killed feature, identify the metric that, in hindsight, diverged from expectations first — and by how many weeks it preceded the eventual decision. If trust erosion started showing up in edit-to-accept ratios at week 4 but the kill decision didn't happen until month 7, you have both a validated leading indicator and a measure of your org's decision latency. Over time, this builds a catalog.
3. Where did our evals lie to us? Eval scores rarely predict production success, but the interesting question isn't whether they lied — it's how. Was the eval set too clean? Did it miss the long-tail queries that dominated real usage? Did it test single-turn performance when production was multi-turn? The answer goes into your eval hygiene playbook, not just into the feature's obituary.
4. What was the true unit economics at peak usage? Rebuild the cost math using actual production token counts, not planning-time assumptions. Include the overhead nobody priced in: retries, tool chain expansion, cache misses at cold deployments, monitoring and eval-on-every-request costs. The delta between planned and actual cost per successful interaction is almost always the single most surprising number in the post-mortem.
5. What organizational signals did we ignore? The team knew. Someone always knew. Find the Slack message, the skeptical engineer, the product manager who flagged the adoption curve at week 6, and ask what structural reason prevented that signal from routing to a decision. This is usually where the most transferable lessons live, because the structural reason almost always persists into the next feature.
The template produces a document with a predictable shape: thesis, divergence moment, cost surprise, eval blindspot, ignored signal. Filed consistently, these documents become a searchable corpus that future planning can draw from.
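One way to keep the filings consistent is to give the five answers a fixed schema. The sketch below is a minimal, hypothetical record shape — the field names and the example values are illustrative, not a prescribed standard — but it shows how decision latency and cost surprise fall out of the document for free once the shape is fixed.

```python
from dataclasses import dataclass

@dataclass
class FeaturePostMortem:
    """One filed decommissioning write-up, mirroring the five forensic
    questions. Field names are illustrative, not a prescribed schema."""
    feature: str
    thesis: str                  # Q1: the falsifiable thesis ("" if none existed)
    first_divergent_metric: str  # Q2: the leading indicator that moved first
    signal_week: int             #     week the indicator diverged
    decision_week: int           #     week the kill decision was made
    eval_blindspot: str          # Q3: how the evals lied
    cost_planned: float          # Q4: planned cost per successful interaction
    cost_actual: float           # Q4: actual cost at peak usage
    ignored_signal: str          # Q5: the signal that never routed to a decision

    @property
    def decision_latency_weeks(self) -> int:
        return self.decision_week - self.signal_week

    @property
    def cost_surprise(self) -> float:
        return self.cost_actual / self.cost_planned

pm = FeaturePostMortem(
    feature="auto-summarize",
    thesis="Users accept AI summaries unedited for >=70% of documents",
    first_divergent_metric="edit_to_accept_ratio",
    signal_week=4, decision_week=30,
    eval_blindspot="eval set was single-turn; production was multi-turn",
    cost_planned=0.01, cost_actual=0.11,
    ignored_signal="PM flagged flat week-6 adoption in Slack",
)
print(pm.decision_latency_weeks, pm.cost_surprise)  # 26 weeks, 11x
```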
Leading indicators that actually predict death
Across the kills I've seen and studied, a small set of indicators keeps appearing. They're not novel individually — experienced AI product teams know most of them — but they're rarely tracked in a centralized way, rarely defined with crisp thresholds, and almost never used to trigger pre-defined actions. Collecting them in one place is half the value.
- Edit-to-accept ratio. For any AI output the user can modify, the ratio of edited to accepted-as-is outputs. A ratio that rises over time, especially among experienced users, is the single most reliable predictor of trust erosion.
- Time-to-override. How quickly users switch from AI-generated output back to manual input after seeing it. A falling time-to-override means users are learning not to trust the feature.
- Feature bypass rate. The fraction of eligible sessions that silently route around the AI feature. This is the hardest metric to instrument because it requires counting actions the user didn't take — but it's also the one most correlated with actual death.
- Re-query rate. For retrieval and search features, the rate at which users reformulate immediately after an AI response. A sustained increase is a user-side eval suite for your feature, running for free.
- Session depth decay. For multi-turn features, the trend in conversation length over cohort tenure. New users explore; existing users hit the floor of utility. Growing decay means the feature isn't earning its way into habitual use.
- Silent cost growth per completed task. Not per-request cost — cost per actually useful outcome. When retries, fallback escalations, and cascading tool calls pad the cost of a completed interaction without changing the success rate, you're watching the unit economics invert in real time.
Each of these should have a pre-defined threshold that, when crossed, triggers a review — not an automatic kill, but a forced conversation. The value isn't the kill; it's removing the organizational latency between evidence and decision. Most dead AI features had the evidence long before they had the decision.
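A threshold sheet like this can be small enough to fit in one file. The trigger values below are placeholders — every team has to calibrate its own — but the mechanism is the point: a crossed threshold returns a name, and that name forces a conversation.

```python
# Hypothetical threshold sheet: each indicator gets a crisp, pre-agreed
# trigger. Crossing one forces a review conversation, not an automatic kill.
# All trigger values are illustrative placeholders.

THRESHOLDS = {
    "edit_to_accept_ratio":    lambda v: v > 0.6,   # edits outweigh accepts
    "feature_bypass_rate":     lambda v: v > 0.4,   # sessions routing around
    "requery_rate":            lambda v: v > 0.3,   # immediate reformulation
    "cost_per_completed_task": lambda v: v > 0.25,  # dollars per useful outcome
}

def indicators_in_breach(weekly_metrics: dict) -> list[str]:
    """Return indicators whose latest weekly value crossed its trigger."""
    return [name for name, breached in THRESHOLDS.items()
            if name in weekly_metrics and breached(weekly_metrics[name])]

week_12 = {"edit_to_accept_ratio": 0.71, "feature_bypass_rate": 0.22,
           "requery_rate": 0.35, "cost_per_completed_task": 0.09}
print(indicators_in_breach(week_12))  # ['edit_to_accept_ratio', 'requery_rate']
```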
The sunk-cost failure mode specific to AI
Traditional sunk-cost bias is bad enough. With AI features, it has a particular variant that makes kills harder: the feature almost works. There's always a version you haven't tried — a different base model, a cleaner prompt, a bigger context window, a fine-tune, a new retrieval strategy — and the team can credibly argue that a month of iteration will turn the corner. Sometimes that's right. Often it isn't, and the iteration budget gets renewed indefinitely because no single iteration is obviously wasteful.
The cure is a commitment device defined in the pre-mortem: a specific metric that, if not hit by a specific date, triggers decommissioning regardless of how the team feels about the latest experiment. This has to be decided before launch, written down, and owned by someone with standing to enforce it. Without the pre-commitment, every iteration looks reasonable in isolation and the feature drags out for 6–12 months beyond its natural death.
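The commitment device can be literally four fields. This is a sketch under the assumptions above — metric, target, deadline, owner are whatever the pre-mortem fixed — and the verdict function is deliberately mechanical so that "how the team feels about the latest experiment" has no input to it.

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class KillCriterion:
    """Pre-launch commitment device: metric, target, deadline, owner.
    Written down before launch; evaluated mechanically, not by mood."""
    metric: str
    must_reach: float
    by_date: date
    owner: str

    def verdict(self, observed: float, today: date) -> str:
        if observed >= self.must_reach:
            return "continue"
        if today < self.by_date:
            return "iterate"       # deadline not yet reached; keep experimenting
        return "decommission"      # target missed on the agreed date

# Illustrative values only.
crit = KillCriterion(metric="weekly_retained_users", must_reach=0.25,
                     by_date=date(2025, 6, 1), owner="director-of-product")
print(crit.verdict(observed=0.18, today=date(2025, 6, 2)))  # decommission
```

Making the record frozen is a small but deliberate choice: the criterion agreed at launch should not be quietly editable seven months later.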
The forensic discipline reinforces this loop. When killed features get rigorously studied and the iteration-past-threshold pattern keeps appearing in write-ups, future planning meetings naturally start asking, "Is this going to be another one of those?" That question, asked early and with evidence, is worth more than any individual eval improvement.
The leading-indicator catalog as organizational asset
After three or four forensic write-ups have been filed consistently, something unusual starts to happen: the catalog of leading indicators becomes an organizational asset that compounds. New feature proposals can be scored against it. Thresholds get sharper as more kills contribute data points. The catalog starts catching problems at the proposal stage — "this looks like Feature X, which died from retrieval staleness at month 4" — which is where the actual ROI of forensics lives.
This is the part most teams never reach because they don't file consistently. A single killed feature yields one lesson. A corpus of ten yields patterns. The corpus is what separates teams that learn from those that relearn.
Practical guidance for building this: put every AI feature decommissioning into a standardized doc filed in a single, searchable location (a dedicated repo path, a Notion database, a Confluence space — the tool matters less than the consistency). Require the five questions be answered in every write-up. Tag each doc with the leading indicators that flagged the problem and how many weeks of decision latency occurred between signal and action. Review the corpus as a group at least twice a year.
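Once the tags are consistent, the corpus can answer aggregate questions that no single write-up can. A minimal sketch, with an invented four-feature corpus, of the query "which indicator flags earliest, and how slowly do we act on it?":

```python
# Illustrative corpus query over filed write-ups. Feature names, indicators,
# and latencies are invented for the example.
from statistics import median

# Each record: (feature, first divergent indicator, decision latency in weeks)
corpus = [
    ("auto-summarize", "edit_to_accept_ratio", 26),
    ("smart-reply",    "feature_bypass_rate",  14),
    ("doc-qa",         "requery_rate",         19),
    ("smart-triage",   "edit_to_accept_ratio", 9),
]

by_indicator: dict[str, list[int]] = {}
for _feature, indicator, latency in corpus:
    by_indicator.setdefault(indicator, []).append(latency)

for indicator, latencies in sorted(by_indicator.items()):
    print(f"{indicator}: {len(latencies)} kills, "
          f"median decision latency {median(latencies)} weeks")
```

Even this toy query surfaces the kind of fact the twice-yearly review should dwell on: the same indicator flagging two separate kills, with double-digit weeks of latency both times.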
The teams that do this stop shipping the same dead feature with a different prompt. The teams that don't keep filing the same obituary under new names, wondering each time why the demo was so compelling and the reality so thin.
An actionable takeaway
If you're reading this and your team has killed at least one AI feature in the last 18 months, here is the smallest useful thing you can do today: write up that kill using the five questions above. One document. Two hours of work. Include the specific metric that diverged first, and how long it took the organization to act on it. Circulate it. Then, in your next planning cycle, ask whether any proposed feature shares the failure signature of the one you just documented.
That single exercise is worth more than most of the eval infrastructure your team will build this year. It will almost certainly change at least one shipping decision. And it will start a corpus that, filed patiently, will eventually catch the feature that's about to silently cost you six months of engineering and the trust of the users who tried it.
Dead features are the most honest teachers AI engineering has. The only requirement is a willingness to listen to them.
