The Demo-to-Production Cliff: Why a 90%-Accurate Agent Ships at 0%

Tian Pan
Software Engineer

There is a specific kind of meeting that happens about six weeks after an impressive agent demo. The prototype booked the trip, refactored the module, reconciled the invoices — live, on the first try, in front of stakeholders. Everyone agreed it was ready. Then someone pulled the production numbers, and the agent that "worked" was generating a support ticket every forty completed tasks, a refund every few hundred, and a quiet trail of half-finished states nobody could explain. The project did not get killed. It got stuck. It is still stuck.

This is the demo-to-production cliff, and it is the single most reliable way for an agent project to fail. The cliff is not caused by a bad model or a sloppy team. It is caused by a measurement mistake: treating a 90% success rate as 90% of the way to shipping. It is not. A 90%-accurate agent is a triumphant demo and, for most real workflows, an unshippable product. The MIT NANDA report that made headlines in 2025 — 95% of enterprise GenAI pilots delivering no measurable P&L impact — is this cliff, counted at scale.

The reason the gap feels so large is that the missing 10% is not what you think it is. Intuitively, "90% success" sounds like "works great, minor polish needed." But the missing fraction is not evenly distributed annoyance. It is a tail — a concentrated set of confidently wrong actions, silent partial completions, and unrecoverable states. Each entry in that tail is not a rough edge. It is a ticket, a refund, a compliance flag, or a churned user. You did not ship a product that works 90% of the time. You shipped a product that fails expensively 10% of the time, and the demo simply never sampled the tail.

The Math Nobody Does Before the Demo

Start with the arithmetic, because it is the part that gets skipped. Agents do not perform one action. They perform chains of them: read context, call a tool, interpret the result, decide, call another tool. Reliability across a chain is multiplicative. If each step independently succeeds 90% of the time, a five-step task succeeds at 0.9⁵ — about 59%. A ten-step task succeeds at 0.9¹⁰ — roughly 35%.

Push the per-step number to a demo-grade 95% and the picture is still grim: a ten-step workflow lands near 60%, and a twenty-step process succeeds only about 36% of the time. Nearly two out of three of those twenty-step runs fail before completion. This is why people describe agent reliability as "compounding": you are not adding error, you are multiplying survival probability, and multiplying numbers below one drives the product toward zero faster than anyone's intuition expects.
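The arithmetic above is easy to reproduce. Here is a minimal sketch, assuming every step succeeds independently with the same probability (real steps are often correlated, so treat the output as a rough guide, not a forecast). The function names `chain_success` and `clears_tolerance` are illustrative, not from any library.

```python
def chain_success(per_step: float, steps: int) -> float:
    """End-to-end success probability when all steps must succeed independently."""
    if not 0.0 <= per_step <= 1.0:
        raise ValueError("per_step must be a probability in [0, 1]")
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return per_step ** steps


def clears_tolerance(per_step: float, steps: int, min_success: float) -> bool:
    """True if the chain's end-to-end success meets the minimum you can tolerate."""
    return chain_success(per_step, steps) >= min_success


if __name__ == "__main__":
    # Print a small table of per-step rate vs. chain length.
    for per_step in (0.90, 0.95, 0.99):
        for steps in (5, 10, 20):
            rate = chain_success(per_step, steps)
            print(f"per-step {per_step:.0%}, {steps:>2} steps: {rate:.0%} end to end")
```

Plugging in your own realistic step count is the point: `clears_tolerance(0.95, 20, 0.9)` is false, which is the sign that the gap is about chain length, not polish.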

The demo hides this for a structural reason. A demo is one trajectory. It is the happy path, often rehearsed, almost always short. Production is the full distribution of trajectories — long ones, weird ones, ones where step three returns an empty list and the agent has never seen an empty list. The demo samples the center of the distribution. The cliff lives in the tail. You cannot see a tail by drawing one sample from the middle.

So the first discipline is honesty: before you greenlight anything, write down the realistic step count for your actual workload and raise your per-step number to the corresponding power. If the answer is not comfortably above your tolerance for failure, you do not have a polish problem. You have a product that does not exist yet, and no amount of prompt tuning closes a gap that is fundamentally about chain length.

Raising Accuracy Is the Wrong Goal

The instinctive response to a 35% end-to-end success rate is "make the model better." This is the trap. Per-step accuracy has sharply diminishing returns against a multiplicative penalty. Going from 90% to 95% per step lifts end-to-end success on a ten-step task from about 35% to about 60%. Going from 95% to 99% is a far harder engineering and cost problem, and it still leaves a ten-step task failing about one time in ten. You cannot multiply your way to reliability when the exponent is large. There is no per-step accuracy that makes a long enough chain safe.
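Running the same arithmetic in reverse shows why accuracy alone is a losing lever. The sketch below, again assuming independent steps with a uniform success rate, computes the per-step accuracy a chain would need to hit a target end-to-end rate. `required_per_step` is an illustrative name.

```python
def required_per_step(target_success: float, steps: int) -> float:
    """Per-step success rate needed for `steps` independent steps to reach the target."""
    if not 0.0 < target_success <= 1.0:
        raise ValueError("target_success must be in (0, 1]")
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Solve p ** steps == target_success for p.
    return target_success ** (1.0 / steps)


if __name__ == "__main__":
    for steps in (5, 10, 20, 50):
        needed = required_per_step(0.90, steps)
        print(f"{steps:>2} steps at 90% end to end needs {needed:.2%} per step")
```

The required per-step rate creeps toward 100% as the chain grows, which is the quantitative form of "you cannot multiply your way to reliability."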

The reframe that actually ships products is this: stop trying to eliminate failure and start making failure cheap. A failure is expensive when it is undetectable, irreversible, unbounded in blast radius, and has no path back to a human. A failure is cheap when the system notices it immediately, the action it took can be undone, the damage it could have done was capped in advance, and a person gets a clean handoff instead of a mess.

Notice what changed. "Raise accuracy" is a property of the model. "Make failure cheap" is a property of the system around the model — and it is something your team fully controls. An agent at 90% per-step accuracy with cheap failures is shippable. An agent at 98% with expensive failures is not. Production readiness for agents is a property of the failure handling, not the success rate.

Four levers make failure cheap, and they are all design decisions, not training runs:

  • Detectability. The worst failures are the silent ones — the agent reports success and the work is half-done. Build verification steps that confirm the outcome, not just the absence of an exception. An agent that knows it failed is recoverable. One that thinks it succeeded is a time bomb.
  • Reversibility. Classify every tool the agent can call as reversible or irreversible. Reading the wrong record is recoverable. Deleting it, sending the email, charging the card, publishing externally — those are not. Route irreversible actions through staging, soft-deletes, draft states, or confirmation so that a wrong call is an inconvenience instead of an incident.
  • Bounded blast radius. Cap the worst case before it happens. Spending limits, rate limits, row-count limits, scoped credentials that only touch what this task needs. The agent should be structurally incapable of a catastrophe, not merely instructed to avoid one.
  • Graceful handoff. When the agent is uncertain or stuck, the correct move is to stop and hand a human a clear, complete picture of what it did and why it paused — not to guess, and not to dump a cryptic stack trace. A clean handoff turns a failure into a routine assist. A bad one turns it into an investigation.
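The four levers can live in one thin wrapper around tool calls. The following is a sketch under stated assumptions, not a reference design: `Tool`, `Handoff`, `HandoffRequired`, and `GuardedExecutor` are hypothetical names, "cost" is an abstract budget unit, and the caller supplies the `verify` check for each call.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, NoReturn


@dataclass
class Tool:
    name: str
    run: Callable[[dict], Any]
    reversible: bool      # False for deletes, sends, charges, external publishes
    cost: float = 0.0     # budget units consumed per call (hypothetical unit)


@dataclass
class Handoff:
    """The picture a human receives when the agent stops instead of guessing."""
    reason: str
    pending_action: str
    history: List[str]


class HandoffRequired(Exception):
    def __init__(self, handoff: Handoff) -> None:
        super().__init__(handoff.reason)
        self.handoff = handoff


@dataclass
class GuardedExecutor:
    tools: Dict[str, Tool]
    budget: float                       # hard cap on total cost for this task
    spent: float = 0.0
    history: List[str] = field(default_factory=list)

    def call(self, name: str, args: dict,
             verify: Callable[[Any], bool], approved: bool = False) -> Any:
        tool = self.tools[name]
        # Reversibility: irreversible actions wait for explicit confirmation.
        if not tool.reversible and not approved:
            self._stop(f"{name} is irreversible and needs confirmation", name)
        # Bounded blast radius: refuse any call that would exceed the cap.
        if self.spent + tool.cost > self.budget:
            self._stop(f"{name} would exceed the task budget", name)
        result = tool.run(args)
        self.spent += tool.cost
        # Detectability: confirm the outcome, not just the absence of an exception.
        if not verify(result):
            self.history.append(f"{name}: ran, verification failed")
            self._stop(f"{name} ran but failed verification", name)
        self.history.append(f"{name}: ok")
        return result

    def _stop(self, reason: str, action: str) -> NoReturn:
        # Graceful handoff: stop and pass along what was done and why we paused.
        raise HandoffRequired(Handoff(reason, action, list(self.history)))
```

Note the ordering: the reversibility and budget checks run before the tool does, while verification necessarily runs after. That is why irreversible tools are gated up front and not left to the verifier.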

Evaluate the Tail, Not the Mean

If the cliff lives in the failure tail, your evaluation has to measure the tail, and most agent evals measure the mean. "85% task success" is a mean. It tells you the size of the 85%. It tells you nothing about the shape of the 15%, which is the only part that determines whether you can ship.

Two agents can both score 85% and be completely different products. Agent A's 15% of failures are all "asked a clarifying question and stopped." Agent B's 15% are all "confidently took a wrong irreversible action." Same headline number. One is shippable today; the other is a lawsuit. A mean cannot distinguish them. Your eval has to.

That means scoring failures, not just counting them. For every failed trajectory, classify it: Did the agent know it failed? Was the action reversible? Did it stop cleanly or thrash? Was the blast radius bounded? Then track the distribution of those answers over time. The metric that predicts whether you can ship is not "success rate went up." It is "the fraction of failures that are silent-and-irreversible went down."
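Scoring failures instead of counting them might look like the sketch below. The four boolean fields mirror the four questions above. Treating a failure as "cheap" only when all four answers are favorable is an assumption of this sketch, and `FailedRun`, `is_cheap`, and `tail_report` are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class FailedRun:
    detected: bool          # did the agent know it failed?
    reversible: bool        # could the action be undone?
    stopped_cleanly: bool   # did it stop, or did it thrash?
    bounded: bool           # was the blast radius capped?


def is_cheap(run: FailedRun) -> bool:
    """Assumption: a failure is cheap only if all four answers are favorable."""
    return run.detected and run.reversible and run.stopped_cleanly and run.bounded


def is_silent_and_irreversible(run: FailedRun) -> bool:
    return (not run.detected) and (not run.reversible)


def tail_report(failures: Iterable[FailedRun]) -> Dict[str, Optional[float]]:
    """Summarize the shape of the failure tail, not just its size."""
    runs = list(failures)
    if not runs:
        return {"failures": 0, "cheap_fraction": None,
                "silent_irreversible_fraction": None}
    total = len(runs)
    return {
        "failures": total,
        "cheap_fraction": sum(is_cheap(r) for r in runs) / total,
        "silent_irreversible_fraction":
            sum(is_silent_and_irreversible(r) for r in runs) / total,
    }


if __name__ == "__main__":
    # Two agents with the same 15 failures per 100 tasks and very different tails.
    agent_a = [FailedRun(True, True, True, True)] * 15
    agent_b = [FailedRun(False, False, False, False)] * 15
    print("A:", tail_report(agent_a))
    print("B:", tail_report(agent_b))
```

Both hypothetical agents report the same headline success rate, but the report separates them immediately. Tracking `silent_irreversible_fraction` across releases gives you the trend that matters.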

This also reframes what a good eval set is. A demo-grade eval set is the happy path with more rows. A production-grade eval set is deliberately adversarial: empty results, malformed inputs, ambiguous instructions, tools that time out, multi-step tasks long enough to expose compounding. You are not trying to make the agent look good. You are trying to find the cliff in a test environment instead of in the incident channel.
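One inexpensive way to start building such a set is to inject faults into the tools the agent depends on. The sketch below assumes a hypothetical agent interface of the form `agent(tool, query)` and a tool that normally returns a list. It covers only three of the failure shapes listed above (empty results, malformed payloads, timeouts).

```python
from typing import Any, Callable, Dict, Iterator, Tuple

ToolFn = Callable[[str], Any]


def fault_variants(tool: ToolFn) -> Iterator[Tuple[str, ToolFn]]:
    """Yield the real tool plus deliberately hostile stand-ins for it."""
    def empty(_query: str) -> list:
        return []                              # valid shape, no data

    def malformed(_query: str) -> dict:
        return {"unexpected": None}            # wrong shape entirely

    def timeout(_query: str) -> Any:
        raise TimeoutError("injected timeout")

    yield "baseline", tool
    yield "empty_result", empty
    yield "malformed_payload", malformed
    yield "timeout", timeout


def run_suite(agent: Callable[[ToolFn, str], Any],
              tool: ToolFn, query: str) -> Dict[str, Any]:
    """Run the agent once per variant and record the outcome or the crash type."""
    outcomes: Dict[str, Any] = {}
    for label, variant in fault_variants(tool):
        try:
            outcomes[label] = agent(variant, query)
        except Exception as exc:  # an unhandled crash is itself a finding
            outcomes[label] = f"crashed: {type(exc).__name__}"
    return outcomes


if __name__ == "__main__":
    def search(query: str) -> list:
        return [f"result for {query}"]

    def naive_agent(tool: ToolFn, query: str) -> str:
        return tool(query)[0]                  # assumes a non-empty list

    for label, outcome in run_suite(naive_agent, search, "invoice 42").items():
        print(f"{label}: {outcome}")
```

A naive agent passes the baseline and crashes on every variant, which is exactly the kind of cliff you want to find here instead of in the incident channel.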

The Org Failure Mode

The cliff is not only technical. It has a predictable organizational shape, and naming it is half the defense.

It goes like this. Leadership sees the demo. The demo's implicit success rate — one-for-one, it looked perfect — becomes the number in everyone's head. A GA date gets set against that number. Engineering knows the real distribution is uglier but lacks a crisp way to say so, because "it's not reliable enough" sounds like sandbagging next to a demo that just worked. The agent ships. The cliff arrives on schedule, in the incident channel, and the post-mortem concludes the team "should have tested more."

The fix is to change what gets demoed. Stop demoing the happy path. Demo the failure handling. Show the agent hitting an ambiguous instruction and stopping to ask. Show it attempting an irreversible action and getting caught by a guardrail. Show the handoff a human receives when the agent gives up. A team that can demo its failure modes confidently understands its system. A team that can only demo success has not met its own tail yet.

And replace the single success-rate number with two: the end-to-end success rate at realistic step count, and the fraction of failures that are cheap. The second number is the one that gates GA. An agent does not become a product when it succeeds often enough. It becomes a product when its failures stop being expensive.

Crossing the Cliff

The teams that get agents into production in 2026 are not the ones with the best models. Everyone has access to roughly the same frontier. They are the teams that internalized that the demo measured the wrong thing — and rebuilt their work around the failure tail instead of the success mean.

Concretely, that means: do the compounding math before you commit to a date. Treat "raise accuracy" as a secondary lever and "make failure cheap" as the primary one. Classify every tool by reversibility and cap every blast radius before launch. Build an eval set that hunts the tail. And demo your failure handling, not your happy path, so the number in leadership's head is the real one.

The demo-to-production cliff is not a sign that agents do not work. It is a sign that "works" was measured as a mean when it should have been measured as a tail. A 90% agent is not 90% of a product. But a 90% agent whose remaining 10% is detectable, reversible, bounded, and handed off cleanly — that is a product you can ship today.
