The 95% Reliability Illusion: Why Your 10-Step Agent Fails 40% of the Time
There is a moment in almost every agent project review that ends the conversation. Someone draws a small chart: end-to-end task success rate on the y-axis, number of tool-using steps on the x-axis. The line slopes down hard. The room goes quiet because everyone in it had been arguing about prompts, models, and retrieval strategies, and the chart is saying that none of those debates matter as much as the simple fact that the chain has too many links.
The math is one of the oldest results in reliability engineering, ported into a domain that pretends it is new. If every step in a pipeline succeeds independently with probability p, then n steps in series succeed with probability p^n. Plug in numbers that sound healthy on a status report: at 95% per-step reliability, ten steps give an end-to-end success rate of about 60%. Twenty steps gets you about 36%. Thirty steps gets you about 21%. The agent that "works 95% of the time" is the same agent that fails on roughly four in ten ten-step requests, because real user requests are not single steps.
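The arithmetic is easy to check for yourself. Below is a minimal sketch, assuming every step succeeds independently with the same probability; the function name chain_success is made up for illustration and does not come from any library.

```python
def chain_success(p: float, n: int) -> float:
    """Probability that n independent steps in series all succeed.

    Assumes every step has the same success probability p and that
    failures are independent, as in the paragraph above.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return p ** n


if __name__ == "__main__":
    # Per-step reliability of 95% across chains of increasing length.
    for steps in (1, 10, 20, 30):
        print(f"{steps:>2} steps: {chain_success(0.95, steps):.0%} end-to-end")
```

Under these assumptions the loop prints 95%, 60%, 36%, and 21%. Real steps are rarely perfectly independent or equally reliable, so treat the result as a rough estimate rather than a measurement.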
