
The 95% Reliability Illusion: Why Your 10-Step Agent Fails 40% of the Time

12 min read
Tian Pan
Software Engineer

There is a moment in almost every agent project review that ends the conversation. Someone draws a small chart: end-to-end task success rate on the y-axis, number of tool-using steps on the x-axis. The line slopes down hard. The room goes quiet because everyone in it had been arguing about prompts, models, and retrieval strategies — and the chart is saying that none of those debates matter as much as the simple fact that the chain has too many links.

The math is one of the oldest results in reliability engineering, ported into a domain that pretends it is new. If every step in a pipeline succeeds independently with probability p, then n steps in series succeed with probability p to the n. Plug in numbers that sound healthy on a status report: 95% per-step reliability, ten steps, end-to-end success rate of 60%. Twenty steps gets you 36%. Thirty steps gets you 21%. The agent that "works 95% of the time" is the same agent that fails on a third of real user requests, because real user requests are not single steps.
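
The arithmetic is worth running yourself; a quick Python sketch (nothing project-specific, just the formula) reproduces those numbers:

```python
# End-to-end success when each of n steps independently succeeds with probability p.
def end_to_end_success(p_step: float, n_steps: int) -> float:
    return p_step ** n_steps

for n in (1, 10, 20, 30):
    print(f"{n:>2} steps at 95% per step -> {end_to_end_success(0.95, n):.0%} end to end")
# prints roughly: 95%, 60%, 36%, 21%
```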

This is the 95% reliability illusion. Every individual link looks fine in a unit test. The end-to-end trace looks broken. The prompts are not the problem. The chain is the problem.

The Math You Should Have Done on Day One

The compounding effect is not new and it is not subtle. Distributed systems engineers internalize it the first time they wire two services together: a backend that is "five nines" reliable drops to four nines the moment you call it twice in sequence. Push it to fifty calls — not unusual in a modern agent that does retrieval, planning, multi-tool use, and post-processing — and a per-call failure rate of one percent compounds into roughly forty percent end-to-end failure. The numbers are uncomfortable specifically because the per-component number sounds responsible.

Agent systems make this worse in two ways that distributed systems do not have to deal with.

First, "success" is not boolean. A traditional service either returns a 200 or it does not. An LLM call returns a string that may be subtly wrong: the right shape, the right tone, even the right facts on the surface, but with a hallucinated identifier or a misordered argument list. Downstream steps consume that string as if it were correct, the planner reasons confidently about it, and the error propagates through the rest of the chain in a form that no boolean reliability calculation captures. The effective per-step failure rate is higher than your evals say it is, because your evals scored the outputs that looked fine.

Second, errors correlate. If your retriever returns a stale document, every subsequent step that quotes from it inherits the same staleness. Independence is a fiction the math assumes for tractability. Real agent failures cluster around bad context windows, ambiguous user inputs, and tool catalogs the planner does not understand — meaning the actual end-to-end distribution is bimodal: most traces succeed, and the failing ones fail at multiple steps in the same trace. Treating each step's failure as independent of every other will not give you back an accurate end-to-end estimate.

Recent measurement work has confirmed something practitioners suspected: agents that behave consistently across reruns achieve 80–92% accuracy, while inconsistent agents land in the 25–60% range, and divergence shows up early — often by step two. Path length itself is a reliability signal. The longer the chain, the more chances for the trajectory to fork toward a path the eval set never saw.

Where You Put the Checks Decides Everything

Once a team accepts that some steps need verification, the next question is which ones. Verifying every step is expensive — recent measurements put the latency cost of fully-verified workflows at up to 28.9× the unverified version, with monetary cost as high as 53.2× — and verifying nothing leaves you with the original problem. The placement decision is the architecture, not a tuning knob.

Three placement patterns dominate, each addressing a different failure mode; a minimal sketch of the first and third follows the list:

  • Front-load verification when failures cascade. If step one is "parse the user's intent into a structured task," and the rest of the agent reasons against that task, an error there poisons every subsequent step. The cost of catching it at step ten is the cost of throwing away nine steps of work plus the cost of confusing the user. Put a strong verifier — a separate model call, a schema check, a clarification turn back to the user — at the front, where the blast radius of a bad output is largest.

  • End-load verification when failures are detectable post-hoc. If the agent is composing an email, generating code, or producing a database query, the final artifact is the natural verification surface: lint it, dry-run it, type-check it, send it to a critic model. The intermediate steps are noisier and harder to score in isolation. Spending verification budget on the output you are about to commit catches more real bugs than spending it on the planning trace.

  • Mid-load verification at the highest-blast-radius steps. Some steps are not the first or the last but they touch external state irreversibly: the tool call that sends the email, the API call that charges the card, the write that updates a customer record. Those steps deserve checks proportional to the harm of getting them wrong, not proportional to their position in the chain. A verifier that asks "is this the action the user actually requested" before any side-effecting tool call is one of the highest-leverage checks an agent can have.
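
Below is a minimal sketch of the front-loaded and mid-loaded patterns. Every name in it (the Task shape, the verifier callable, the field names) is a hypothetical stand-in rather than a real API; the point is where the checks sit, not what they call.

```python
from dataclasses import dataclass

@dataclass
class Task:
    action: str      # e.g. "update_record"
    target_id: str   # the record the user actually named

def front_load_check(raw_intent: dict) -> Task:
    """Schema check at step one, where a bad parse would poison every later step."""
    for field in ("action", "target_id"):
        if not raw_intent.get(field):
            # Cheap to fail here: nothing downstream has run yet.
            raise ValueError(f"intent missing required field: {field}")
    return Task(action=raw_intent["action"], target_id=raw_intent["target_id"])

def guard_side_effect(task: Task, user_request: str, verifier) -> None:
    """Mid-chain check immediately before the irreversible write."""
    question = (
        "Is this the action the user actually requested?\n"
        f"Request: {user_request}\nAction: {task}"
    )
    if not verifier(question):
        raise RuntimeError("verifier rejected side-effecting call; roll back or ask the user")
```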

The teams that get this right rarely run uniform verification across every step. They look at their trace data, identify the two or three steps where failures actually originate or actually hurt, and concentrate verification budget there. The teams that get it wrong either verify everything (and ship the 29× latency, 53× cost regression) or verify nothing (and ship the 40% failure rate).

Redundancy Patterns That Bend the Curve

Verification catches errors after they happen. Redundancy reduces the rate at which they happen in the first place. The two are not substitutes — they compose — and the redundancy patterns that work in agent systems are mostly imports from other reliability disciplines that the agent literature is rediscovering.

Parallel sampling with majority vote. Run the same step three times at non-zero temperature and take the most common answer. For tasks with discrete answers (which class is this, which tool to call, which branch to take), this collapses many of the per-step failures that come from sampling noise rather than capability gaps. Self-consistency is the formal name; the empirical lift on multi-step reasoning is well-documented and often exceeds the gain from a model upgrade.
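
A sketch of what that looks like for a discrete decision, with sample_step standing in for whatever model call your stack exposes (an assumption, not a real client):

```python
from collections import Counter

def majority_vote(sample_step, prompt: str, k: int = 3) -> str:
    """Sample the same step k times at non-zero temperature and keep the mode."""
    answers = [sample_step(prompt, temperature=0.7) for _ in range(k)]
    winner, count = Counter(answers).most_common(1)[0]
    if count == 1:
        # Total disagreement is itself a signal: treat it as a failed step
        # rather than silently shipping one of k uncorrelated guesses.
        raise RuntimeError("no consensus across samples; route to recovery")
    return winner
```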

Verifier agents at high-leverage steps. A second model — frequently a smaller one — that scores the primary agent's output against an explicit rubric. The verifier does not need to be smarter than the generator; it needs to be different enough that their failure modes do not perfectly correlate. Stepwise evaluation by a verifier outperforms holistic end-of-chain scoring by up to 15% relative AUC-ROC for error detection, because errors caught at step three are cheaper to recover from than errors caught at step ten.
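
A sketch of what the stepwise check can look like, with call_verifier_model and the rubric text as hypothetical placeholders rather than any particular framework:

```python
RUBRIC = (
    "Score the candidate output from 0 to 10 against the step's goal. "
    "Penalize invented identifiers, missing arguments, and answers that ignore "
    "the retrieved context. Reply with only the integer score."
)

def verify_step(call_verifier_model, goal: str, output: str, threshold: int = 7) -> bool:
    reply = call_verifier_model(f"{RUBRIC}\n\nGoal: {goal}\n\nCandidate: {output}")
    try:
        return int(reply.strip()) >= threshold
    except ValueError:
        # An unparseable score counts as a failed verification, not a pass.
        return False
```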

Self-consistency probes inside the chain. Periodically ask the agent to restate its current understanding of the task, then compare against the original brief. Drift between the two is a leading indicator that the chain is going off the rails before it produces an obviously bad output. This is the agentic version of a heartbeat: cheap, frequent, low-information per probe, but high-value when the divergence finally appears.
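
One cheap way to implement the probe, assuming a call_model helper and settling for crude word overlap as the drift measure (any similarity check you trust works the same way):

```python
def restatement_drift(call_model, original_brief: str, transcript: str) -> float:
    """Ask the agent to restate its task, then measure how far it has drifted."""
    restated = call_model(
        "In one sentence, restate the task you are currently working on.\n" + transcript
    )
    brief_words = set(original_brief.lower().split())
    restated_words = set(restated.lower().split())
    overlap = len(brief_words & restated_words) / max(len(brief_words), 1)
    return 1.0 - overlap  # higher means further from the original brief
```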

Recovery branches, not retries. A naive retry loop that reruns the failing step against the same context will produce the same wrong answer, because the model is not stateful and the input did not change. Effective recovery means changing something — reframing the prompt, asking a clarifier, rolling back to a prior checkpoint, escalating to a stronger model. "Retry on failure" without any of those changes is closer to a no-op than an intervention.
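
A sketch of recovery-as-change; every callable here (step, ask_clarifier, escalate) is a stand-in for whatever your stack exposes:

```python
def recover(step, prompt: str, ask_clarifier, escalate):
    """Each branch changes something before rerunning; none is a blind replay."""
    branches = [
        lambda: step(prompt),                              # original attempt
        lambda: step("Think step by step.\n" + prompt),    # reframe the prompt
        lambda: step(prompt + "\n" + ask_clarifier()),     # add a clarifying turn
        lambda: escalate(prompt),                          # stronger model, last resort
    ]
    for attempt in branches:
        result = attempt()
        if result is not None:
            return result
    raise RuntimeError("all recovery branches exhausted; hand off to a human")
```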

None of these patterns is exotic. The reason they show up unevenly in production is that each adds latency, each adds cost, and each requires the team to have decided in advance which steps deserve the budget. The teams that ship with redundancy baked into the high-leverage steps — and only those — outperform the teams that uniformly add it everywhere or skip it entirely.

The Cost-Reliability Frontier

Adding more checks is not free, and the curve flattens fast. The first verifier on a noisy step often improves end-to-end success by ten or twenty points; the third verifier on the same step might add half a point. The eleventh check costs more than the failure it prevents. The cost-reliability frontier is real and it is shaped like every other engineering tradeoff.

The honest framing is to compute the expected cost of a failure (refund, bad PR, manual cleanup, churn) and the marginal cost of an additional check, and stop adding checks when the second number exceeds the first. Most teams never do this calculation explicitly. They keep adding verification because every new check feels prudent until the latency budget for a single agent run blows past five seconds and the product team starts asking why the assistant feels slow.
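
The back-of-envelope version of that stopping rule fits in a few lines; the numbers below are invented purely for illustration:

```python
def marginal_value_of_check(fail_before: float, fail_after: float,
                            cost_per_failure: float, cost_per_check: float) -> float:
    """Expected savings per run from one additional check, minus what it costs."""
    return (fail_before - fail_after) * cost_per_failure - cost_per_check

# A check that cuts a step's failure rate from 5% to 2%, where each failure
# costs $40 in cleanup and the check costs $0.30 per run:
print(marginal_value_of_check(0.05, 0.02, cost_per_failure=40.0, cost_per_check=0.30))
# 0.03 * 40 - 0.30 ≈ +0.90 per run: keep the check
```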

Two heuristics help in the absence of a full cost model. The first is to verify steps whose failures are invisible before they are expensive. A wrong tool call to a CRM update is silent until the customer notices the wrong record three weeks later — that step warrants verification even if the dollar cost of the bad write is small, because the discoverability lag amplifies the cost. The second is to verify steps where the recovery is cheap and skip steps where the recovery is expensive anyway. If catching the error means redoing the whole chain, the verifier is buying you very little.

The frontier shifts as model capability improves. Steps that needed two-of-three majority voting last year may need none this year because the per-step reliability climbed enough on its own. Reliability budgets should be revisited every time the underlying model changes, and most teams do not — they leave the verification scaffolding from a year ago in place and pay for it forever.

The Architecture Review Where Someone Finally Graphs It

The intervention that does the most work in practice is the simplest one. Pull a sample of recent agent traces. For each trace, record two numbers: how many tool-using steps it took, and whether it ultimately succeeded. Plot success rate by step count. The curve will not be flat. It will slope down sharply somewhere between five and fifteen steps, depending on the agent.
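
The grouping itself is a few lines over whatever trace format you already have; the (n_steps, succeeded) pair shape here is an assumption:

```python
from collections import defaultdict

def success_by_step_count(traces: list[tuple[int, bool]]) -> dict[int, float]:
    """Bucket traces by step count and compute the success rate per bucket."""
    totals, wins = defaultdict(int), defaultdict(int)
    for n_steps, succeeded in traces:
        totals[n_steps] += 1
        wins[n_steps] += int(succeeded)
    return {n: wins[n] / totals[n] for n in sorted(totals)}

# Plot with whatever is handy, e.g.:
#   import matplotlib.pyplot as plt
#   rates = success_by_step_count(traces)
#   plt.plot(list(rates), list(rates.values()), marker="o")
```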

The conversation that follows the chart is where real reliability work begins. Three questions tend to drive the most change.

Can this chain be shorter? Most agent designs accumulate steps because every new feature added a new tool, and the planner now has to traverse all of them in order. Removing tools, merging steps, or letting one tool do the work of two is the highest-impact reliability intervention available. A 12-step chain at 95% per-step is 54%; an 8-step chain at the same per-step is 66%. No verification or model upgrade buys you twelve points the way deleting four steps does.

Which step's failures are doing the most damage? Across the failing traces, which step is most often the one where things first went wrong? Concentrating verification, redundancy, and prompt work on that step usually moves the end-to-end number more than spreading the same effort across the chain.

What is the floor we are willing to commit to? Pretending the agent succeeds 95% of the time is more dangerous than admitting it succeeds 60% of the time. The product surface, the support process, the customer expectation, and the rollback path all need to be designed against the real number. Teams that publish optimistic single-step reliability metrics internally end up surprised by their support volume; teams that publish honest end-to-end numbers end up shipping the right safety nets.

What Reliable Agents Actually Look Like

The agents that hold up in production are not the ones with the cleverest prompts or the largest models. They are the ones with the shortest chains that still do useful work, the most thoughtful verification placement at the steps that actually fail, and a team that has internalized the multiplicative math well enough to push back when a new feature proposes adding a step. They graph success rate by step count regularly, not as a one-time artifact. They treat path length as the architectural variable it is.

The 95% reliability illusion is solved by refusing to be fooled by it. The math is simple, the levers are well known, and the discipline is mostly a matter of looking at the right number at the right moment. Pull the trace data, draw the chart, and have the conversation about the chain — before the chain has the conversation with your users.
