
The AI Feature You Should Not Have Shipped: A Task-Shape Checklist

10 min read
Tian Pan
Software Engineer

The demo always works. That is the most expensive sentence in AI product development. The product manager sees the model handle the happy path, the engineer ships the obvious version of the feature, and six weeks later the support queue is full of complaints that the metric did not predict. Nothing in the model regressed. Nothing in the prompt got worse. The feature was simply not the shape the model could do well, and the team did not have a way to say so before the work began.

A meaningful fraction of shipped AI features fail this way — not because the model is bad, but because the task is wrong. The output the product needs is deterministic and the engine is stochastic. The user's tolerance for the tail is one bad answer per thousand and the model's failure distribution is heavier than that. The latency budget the unit economics require is half of what the model can deliver at any tier you can afford. The ground truth required to evaluate quality does not exist and cannot be cheaply created. None of these are model problems. They are task-shape problems, and they should have been screened before the first prompt was written.

The reason this keeps happening is structural. Most product processes have a forward path for "yes, build this" and an implicit no-path of "deprioritized, will revisit." There is no explicit pathway for "this is a real user problem, but it is not model-shaped, and here is where it goes instead." Without that pathway, every idea gets routed through the AI team's backlog by default, and the team accumulates a portfolio of features that work in demos and fail in production. The fix is not better prompts. The fix is a pre-build checklist that rejects task-shape mismatches before they enter the queue, and a place to put the rejected ideas so they do not get smuggled back in next quarter under a new name.

The five axes that decide task-shape fit

Treat any candidate feature as a vector along five dimensions. Score each one before estimating engineering effort. If two or more dimensions fail, the feature is not model-shaped and the team should redirect rather than build.

Output determinism. Some user-facing surfaces require the same input to produce the same output across runs and across users — order confirmations, legal disclosures, dosage calculations, balance displays. LLM inference is non-deterministic at the cloud-API level even with temperature pinned, because batch composition and silent infrastructure changes shift the sampling. If the surface requires deterministic outputs, the answer is rules with optional model assistance behind a deterministic checkpoint, not the model on the critical path. The demo will not catch this — temperature-zero on a single example is not the same property as deterministic across millions of calls.
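
To make the "deterministic checkpoint" shape concrete, here is a minimal sketch, assuming an order-confirmation surface: the figure the user sees is computed by rules, and the model, if used at all, only drafts the surrounding prose. The Order fields and function names are illustrative, not from any particular codebase.

```python
from dataclasses import dataclass

@dataclass
class Order:
    item_total_cents: int
    shipping_cents: int
    tax_rate: float  # e.g. 0.0825

def confirmation_total_cents(order: Order) -> int:
    """Deterministic checkpoint: the same order always yields the same total,
    because the number is computed by rules, never sampled from a model."""
    tax_cents = round(order.item_total_cents * order.tax_rate)
    return order.item_total_cents + order.shipping_cents + tax_cents

def confirmation_message(order: Order, llm_summary: str | None = None) -> str:
    """Optional model assistance: the model may draft the surrounding prose,
    but it cannot produce or alter the figure on the deterministic surface."""
    prose = llm_summary or "Thanks for your order."
    return f"{prose} Your total is ${confirmation_total_cents(order) / 100:.2f}."
```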

Tail-risk tolerance. Every model has a failure distribution. The question is not "what does it do on average" but "what does the worst one percent of outputs look like, and what does the user lose when they hit it?" An AI-generated reading list that recommends nonexistent books is a recoverable embarrassment. An airline chatbot that promises a refund the airline does not honor is a tribunal ruling. A pricing assistant that quotes a number nobody can fulfill is revenue loss with a long tail of customer service work. The tolerance you can defend is bounded by the cost of one bad output times the rate at which they occur — and the cost is never the average case.
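
The arithmetic behind that bound is simple and worth doing before anything is built. A back-of-the-envelope sketch, with every number below invented for illustration:

```python
# Hypothetical figures: none of these come from a real product.
bad_output_rate = 0.005          # 1 in 200 outputs lands in the bad tail
cost_per_bad_output = 40.00      # dollars per occurrence: support time, refunds, make-goods
monthly_calls = 200_000

expected_monthly_tail_cost = bad_output_rate * cost_per_bad_output * monthly_calls
print(f"${expected_monthly_tail_cost:,.0f}/month")  # $40,000/month
```

If that number is larger than the value the feature creates, no prompt improvement rescues it.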

Latency budget the unit economics will pay for. Voice agents need responses inside a 200ms turn-taking window humans are sensitive to. Coding assistants need completions inside the typing rhythm. Search reranking needs to fit inside the page load budget. Each of these has a model that can technically do the job and a model the unit economics can sustain, and they are often not the same model. If the cheapest model that meets quality blows the latency budget at peak, the feature does not ship — it needs a routing strategy, an aggressive cache, or a different shape. Discovering this after build is expensive.
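
A sketch of that check, run before any build estimate. The model tiers, latency numbers, quality scores, and prices below are all made up; the point is the shape of the comparison, not the values:

```python
# (name, p95 latency at peak in ms, offline eval quality, $ per 1k calls) -- all hypothetical
TIERS = [
    ("small",   180, 0.78, 0.40),
    ("medium",  450, 0.86, 2.00),
    ("large",  1200, 0.91, 9.00),
]

def cheapest_viable(quality_floor: float, latency_budget_ms: int):
    """Return the cheapest tier that clears both bars, or None if nothing does."""
    viable = [t for t in TIERS if t[2] >= quality_floor and t[1] <= latency_budget_ms]
    return min(viable, key=lambda t: t[3]) if viable else None

# A voice surface with a 200 ms budget and a 0.85 quality floor:
print(cheapest_viable(quality_floor=0.85, latency_budget_ms=200))  # None -> redirect, cache, or reshape
```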

Eval feasibility. The team has to be able to grade outputs. This is harder than it sounds when the task is open-ended — summaries, recommendations, creative outputs, judgment-laden assistant responses. If there is no way to construct a graded dataset that reflects production distribution, the team is shipping with vibes for evaluation. Ground truth is the constraint that gets quietly waived when a product is hot, and it is the one that comes back as a regulatory request the team cannot answer. The checklist should ask: who writes the eval, when, and against which distribution. If no one has a credible answer, the feature is not eval-feasible yet.

Regulatory exposure. Some surfaces sit inside an audit perimeter that the model crosses. Healthcare advice, financial recommendations, legal-adjacent assistance, anything affecting protected classes — these have a documentation burden and a liability exposure independent of how good the model is. A feature that would be fine in a consumer chat product is not fine in the same form on a regulated surface, and the diligence cost is borne by the team that ships, not the team that proposed.

What the checklist actually does in the room

The checklist is not a long document. It is a one-page vector with a score per axis and a conversation about the worst score. Its job is to make the discussion specific. "Will users tolerate hallucinations here?" becomes "what is the rate, what is the cost per occurrence, what is the budget we are willing to spend, and how do we measure compliance with it?" "Can we evaluate this?" becomes "who is the labeler, what is the agreement rate, and what is the production distribution we are sampling from?"

Most teams do not need a formal scoring rubric — what they need is a forcing function that surfaces the worst-case axis before the feature enters the build queue. Two failing axes is a redirect. One borderline axis is a build with explicit mitigations and a kill criterion documented up front. Zero failing axes is a normal build.
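
Even without a formal rubric, it helps to write the vector down in a form the room cannot argue with. A minimal sketch, assuming a three-point score per axis; treating a single failing axis the same as a borderline one is an illustrative choice, not a rule:

```python
from dataclasses import dataclass

FAIL, BORDERLINE, PASS = 0, 1, 2  # assumed three-point scale per axis

@dataclass
class TaskShapeScore:
    output_determinism: int
    tail_risk_tolerance: int
    latency_budget: int
    eval_feasibility: int
    regulatory_exposure: int

    def decision(self) -> str:
        scores = [
            self.output_determinism, self.tail_risk_tolerance,
            self.latency_budget, self.eval_feasibility, self.regulatory_exposure,
        ]
        if scores.count(FAIL) >= 2:
            return "redirect"
        if scores.count(FAIL) == 1 or BORDERLINE in scores:
            return "build with explicit mitigations and a kill criterion"
        return "normal build"

print(TaskShapeScore(PASS, BORDERLINE, PASS, FAIL, PASS).decision())
# build with explicit mitigations and a kill criterion
```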

The forcing function works because it converts implicit assumptions into named risks. The product manager who said "users will tolerate occasional errors" now has to put a number on it. The engineer who said "we can probably hit the latency target" now has to commit to a budget. The legal partner who said "this is fine" now has to identify the audit surface. None of this slows down a healthy build. All of it stops a bad build before the sunk cost makes it impossible to kill.

The "not-AI-shaped" pathway

The single most underrated piece of AI product process is a documented, visible place to put rejected feature ideas. Without it, the discussion ends in "no, that's not a good fit," the idea goes back into the Slack thread, and three months later the same idea returns from a different stakeholder with a different framing. The team relitigates the rejection from scratch.

A "not-AI-shaped" pathway has three parts. First, a destination — a wiki page, a board column, a prioritized backlog of "redirected" ideas with the axis that failed and the alternative shape suggested. Second, an alternative shape catalog — common redirects like "this is a rules engine with model fallback," "this is a search problem with model-augmented ranking," "this is a workflow tool that can use model assistance behind a human review," "this is not actually a user need, it is a metric that someone wants to move." Third, a re-entry criterion — what would have to change for this idea to be model-shaped later, so the team is not maintaining a graveyard of ideas with no path back.

This sounds like overhead. It pays for itself the first time a stakeholder revives a rejected idea and the team can point at the specific axis that failed and the specific change that would make it shippable. The conversation is no longer about taste; it is about evidence.

The postmortem you should be writing for shipped features

Some features will ship despite failing the checklist, and some will fail in production despite passing. Both cases deserve a postmortem. The standard incident postmortem is shaped around outages — what broke, when did we notice, how did we recover. A task-shape postmortem asks a different set of questions.

Which axis failed in production that the checklist did not catch? The most common pattern is a tail-risk tolerance score that looked acceptable in expectation and turned out to be unacceptable in distribution — the rate of bad outputs was a percent the team could live with, but the cost per occurrence was higher than the team had assumed. The postmortem captures that the team underweighted tail cost, and updates the scoring rubric accordingly.

What did the eval not catch? Often the eval graded against a curated dataset that did not reflect production distribution. The fix is not retraining the model. The fix is replacing the static eval with a production-cohort-sampled eval and treating that as a release gate. This change usually surfaces other features whose evals are similarly stale.
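
What a production-cohort-sampled eval looks like depends on the logging pipeline. The sketch below assumes each logged request carries a "segment" field (locale, plan tier, intent class) and stratifies the sample across it, so the graded set tracks the traffic users actually send:

```python
import random
from collections import defaultdict

def sample_eval_cohort(production_logs, per_segment=50, seed=0):
    """Stratified sample of recent production requests for human grading.
    Re-drawn each release so the eval set cannot drift away from production."""
    rng = random.Random(seed)
    by_segment = defaultdict(list)
    for record in production_logs:
        by_segment[record["segment"]].append(record)
    cohort = []
    for records in by_segment.values():
        cohort.extend(rng.sample(records, min(per_segment, len(records))))
    return cohort
```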

What was the cost of building versus the cost of redirecting? This is the hard question. Teams rarely document the redirected alternatives that were available at planning time, so when a shipped feature underperforms, the comparison is "the feature we shipped" versus "nothing." That framing makes the build look better than it was. A good postmortem reconstructs the alternative path and asks whether the team would make the same decision again with what they know now.

The postmortems compound. After a few of them, the checklist gets sharper, the rejected-pathway destination gets richer, and the org develops the pattern recognition that lets new ideas get scored faster. The team that ships its third task-shape postmortem moves faster than the team that has shipped none, because the second team is still relitigating rejections from scratch each quarter.

The senior engineer's job is saying no

The highest-leverage thing senior AI engineers do in 2026 is not writing better prompts. It is recognizing when a proposed feature is not model-shaped and redirecting it before the team commits. This is unintuitive because the seniority signal in the broader engineering org is throughput — features shipped, lines reviewed, systems maintained. In AI engineering, the seniority signal is portfolio quality — what fraction of shipped features still work the way the demo did, and how many bad features the team avoided building.

The cultural problem is that "no, that is not a good AI feature" reads as obstruction in an org that has bought the AI-everywhere narrative. The way through it is not louder noes. The way through it is a checklist anyone can run, a destination for rejected ideas, and a postmortem culture that gives the noes data to point at. After enough cycles, the no comes from the process rather than from the engineer, and the engineer goes back to building the features that should be built.

The team that ships every AI idea that crosses its desk is not faster than the team that screens them. It is slower, because it is paying the cost of building, supporting, and quietly retiring features that should never have started. The screening is the speedup. The checklist is the leverage. The senior engineer's job is to make the screening boring enough that it happens by default.
