When 'Can the Agent Do X?' Becomes a Ship Commitment
An engineer spends an afternoon poking at a question: can the agent reconcile a customer's invoice against their contract terms? They wire up a quick prompt, run it on five real invoices, and three come back correct. The other two are wrong in ways they don't fully characterize — they close the laptop and move on. In standup the next morning they say "yeah, invoice reconciliation basically works." A PM in the room writes it down. Two weeks later it's a line item on the Q3 roadmap. A month after that, a sales rep promises it to an enterprise account in a renewal call.
Nobody lied. Nobody made a bad decision in isolation. But the team is now contractually committed to a behavior whose eval set does not exist, whose failure modes were never written down, and whose reliability budget was set by a director who saw a demo and interpreted it as a contract. This is one of the most common ways AI features acquire scope: not through a planning meeting, but through a capability probe that nobody ever explicitly promoted.
The industry has a name for the downstream symptom: "POC purgatory," the state where, by commonly cited estimates, 70 to 80 percent of AI initiatives stall between a working sandbox and a shippable product. But purgatory is the wrong metaphor, because it implies the projects are stuck. They aren't stuck. They're moving. They were committed before anyone checked whether they were ready, and now the team is trying to retrofit reliability onto a promise.
How a probe mutates into a spec
Exploration is healthy. Engineers should poke at the edges of what a model can do, and "let's see if the agent can handle this" is exactly the right instinct. The problem is not the probe. The problem is that a probe and a commitment look identical from the outside, and there is no moment in most teams' workflow where one is explicitly converted into the other.
Watch the mutation happen. It moves through three handoffs, and each one is lossy in a predictable direction.
Engineer to standup. The engineer who ran five trials and saw three succeed has a precise mental model: "60% on a tiny unrepresentative sample, two failures I haven't characterized." What comes out of their mouth is "it basically works." That is not dishonesty; it's compression. "Basically works" is a reasonable summary of "I was surprised it did this well." But the uncertainty, the sample size, and the unexamined failures do not survive the compression. They were the most important parts of the finding, and they are the first things dropped.
Standup to roadmap. The PM hears "it basically works" and does their job, which is to turn capability into plan. "Works" plus "customers want this" equals "ship in Q3." The PM is not being reckless; they are operating on the information they were given. The information they were given had the error bars stripped off two sentences earlier.
Roadmap to customer. A sales rep sees the roadmap, or sees a demo built to support the roadmap, and does their job, which is to convert plan into revenue. Now there is an external party with a contractual expectation. The probe has traveled from "I ran it five times" to "we owe this to an account" without ever passing through a review where someone asked: what is the actual success rate, and on what?
At no point did anyone overstep. Each handoff was a person doing their role correctly with the input they received. The defect is structural — there is no gate in the pipeline, so the probe just flows downhill until it hits a customer.
The thing you measured is not the thing you promised
Even setting aside the handoff losses, there is a deeper mismatch. A capability probe answers "is this possible?" A ship commitment answers "is this reliable?" These are different questions, and recent reliability research is blunt about it: capability and reliability are independent axes. A highly capable system can be wildly unreliable, and a modest one can be dependable inside a narrow envelope. Measuring one tells you almost nothing about the other.
The probe optimizes for the best case. The engineer picks interesting inputs, watches the run, retries when something looks off, and reads the output with a charitable eye. They are answering "can this ever work?" — and the honest answer, demonstrated, is yes.
Production asks the opposite question: "what happens on the inputs I didn't pick, when nobody is watching, on the first try?" The five hand-chosen invoices were not a sample of production traffic. They were a sample of invoices the engineer found worth trying. The agent's behavior on a malformed PDF, a contract in a second language, or an invoice with a line item that doesn't exist in the contract — none of that was measured, because measuring it was not the point of the probe.
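The gap between "can this ever work?" and "does it work on the first try?" can be put in rough numbers. The sketch below assumes each attempt succeeds independently with the same probability, which is a simplification, and the function names are illustrative rather than taken from any library. It compares what a probe with retries tends to see against what production sees.

```python
def at_least_one_success(p: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds.

    This is roughly the probe's view: retry until something works.
    """
    return 1.0 - (1.0 - p) ** attempts


def all_succeed(p: float, requests: int) -> float:
    """Chance that `requests` independent first-try runs all succeed.

    This is closer to the production view: no retries, nobody watching.
    """
    return p ** requests


if __name__ == "__main__":
    p = 0.6  # per-attempt success rate, borrowed from the 3-of-5 probe
    print(f"probe with 3 retries sees success: {at_least_one_success(p, 3):.1%}")
    print(f"single first-try success:          {p:.1%}")
    print(f"3 consecutive requests all right:  {all_succeed(p, 3):.1%}")
```

Under these assumptions the same underlying 60% looks like about 94% to someone who retries up to three times, and about 22% to a customer who needs three invoices in a row to come back right.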
Reported goal-completion rates for current agents reconciling structured business records on realistic enterprise data sit below 55%. A probe that hits 60% on five curated examples is not evidence against that number. It is consistent with it: with five trials, a single invoice moves the observed rate by 20 percentage points. The probe and the production reality can both be true at once, because they are measuring different distributions.
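How little "3 of 5" pins down can be made concrete with a confidence interval. The sketch below uses the Wilson score interval, a standard choice for small binomial samples; picking it here is an assumption of this example, not something the probe itself prescribes. It also treats the five invoices as a random sample, which they were not, so the real uncertainty is wider still.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is ~95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")

    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    return max(0.0, center - half_width), min(1.0, center + half_width)


if __name__ == "__main__":
    low, high = wilson_interval(3, 5)
    print(f"3 of 5 -> plausible success rate between {low:.0%} and {high:.0%}")
```

For three successes in five trials the interval runs from roughly 23% to 88%. A sub-55% production rate sits comfortably inside it, and so does a rate most teams would happily ship.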
Make the capability test an artifact, not a sentence
The fix starts by refusing to let a capability finding exist only as a spoken sentence in standup. A probe should produce a small, structured artifact — call it a capability test — that is deliberately not a feature commitment and says so on its face.
It does not need to be heavy. Five fields, written down:
- Setup. What exactly was run — the prompt, the model, the tools, the input set. "Five invoices from the Acme account, May 2026."
- Observed success rate. The real number on the real sample, with the sample size attached. "3 of 5." Not "basically works."
- Failure modes. What the failures looked like, named. "Misread multi-currency totals; hallucinated a contract clause that wasn't present." If you didn't characterize them, that itself is a finding: "2 failures, root cause not investigated."
- Distribution gap. How the test inputs differ from production traffic. "Hand-picked clean PDFs; production includes scans and non-English contracts, untested."
- Non-commitment posture. An explicit line: "This is a feasibility probe. It is not an estimate of production reliability and is not a basis for a roadmap commitment."
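One way to keep those five fields from collapsing back into a sentence is to give them a fixed shape. The sketch below is a minimal, hypothetical record type: the class name, field names, and validation rules are assumptions made for illustration, not an established schema. It refuses to record a success rate without a sample size, and refuses to record failures without at least a note about them.

```python
from dataclasses import dataclass
from typing import List

# Default wording for the explicit non-commitment line.
NON_COMMITMENT = (
    "This is a feasibility probe. It is not an estimate of production "
    "reliability and is not a basis for a roadmap commitment."
)


@dataclass(frozen=True)
class CapabilityTest:
    """Hypothetical record of one capability probe (illustrative only)."""

    setup: str                # prompt, model, tools, input set
    successes: int            # observed successes...
    trials: int               # ...always paired with the sample size
    failure_modes: List[str]  # named failures, or "not investigated"
    distribution_gap: str     # how the inputs differ from production
    non_commitment: str = NON_COMMITMENT

    def __post_init__(self) -> None:
        if self.trials <= 0 or not 0 <= self.successes <= self.trials:
            raise ValueError("need 0 <= successes <= trials and trials > 0")
        if self.successes < self.trials and not self.failure_modes:
            raise ValueError(
                "failures occurred: name them or record 'not investigated'"
            )

    def summary(self) -> str:
        """One-line form that keeps the sample size and the disclaimer."""
        return f"{self.successes} of {self.trials} on: {self.setup}. {self.non_commitment}"


if __name__ == "__main__":
    probe = CapabilityTest(
        setup="Five invoices from the Acme account, May 2026",
        successes=3,
        trials=5,
        failure_modes=["2 failures, root cause not investigated"],
        distribution_gap=(
            "Hand-picked clean PDFs; production includes scans and "
            "non-English contracts, untested"
        ),
    )
    print(probe.summary())
```

The design choice that matters is that `summary()` is the only short form available, so whatever gets pasted into a standup note or a roadmap doc still carries "3 of 5" and the non-commitment line with it.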
