The Compound Accuracy Problem: Why Your 95% Accurate Agent Fails 40% of the Time
Your agent performs beautifully in isolation. You've benchmarked each step. You've measured per-step accuracy at 95%. You demo the system to stakeholders and it looks great. Then you ship it, and users report that it fails almost half the time.
The failure isn't a bug in any individual component. It's the math.
A 10-step pipeline where every step succeeds 95% of the time achieves an overall success rate of 0.95^10 = 59.9%. Not 95%. Not even 80%. Just 60%. And that's the optimistic case — it assumes errors are independent, recoverable, and visible. In real production systems, none of those assumptions hold.
This is the compound accuracy problem, and understanding it changes how you architect agent systems from the ground up.
The Arithmetic That Should Scare You
The formula is straightforward: if each of n sequential steps has accuracy p, your end-to-end success rate is p^n. What's not intuitive is how fast this collapses:
| Per-step accuracy | 5 steps | 10 steps | 20 steps |
|---|---|---|---|
| 99% | 95.1% | 90.4% | 81.8% |
| 95% | 77.4% | 59.9% | 35.8% |
| 90% | 59.0% | 34.9% | 12.2% |
| 85% | 44.4% | 19.7% | 3.9% |
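The table can be reproduced in a few lines. This is the optimistic independence model from the text; `end_to_end` is just the formula, named here for illustration:

```python
# End-to-end success of an n-step pipeline where each step succeeds
# independently with probability p (the optimistic case from the text).
def end_to_end(p: float, n: int) -> float:
    return p ** n

for p in (0.99, 0.95, 0.90, 0.85):
    row = ", ".join(f"{end_to_end(p, n):.1%}" for n in (5, 10, 20))
    print(f"{p:.0%} per step -> {row}")
```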
At 90% per-step accuracy — a number that sounds high — a 10-step pipeline succeeds only 35% of the time. You'd reject that product. But teams routinely ship pipelines with these characteristics because they measure step accuracy in isolation and assume the whole system inherits it.
This isn't a new discovery. Reliability engineering has known about this since at least the 1960s, under the name Lusser's Law: the reliability of a chain equals the product of the reliabilities of its links. It was formalized after rocket failures where every component tested fine but the integrated system kept exploding. We're rediscovering it with LLMs.
Empirical research confirms the math. Studies on multi-step question answering show accuracy dropping from 78% on 2-step problems to 31% on 8-step problems. WebShop transaction tasks fail 68% of the time across multi-step chains. In long-horizon software engineering benchmarks, frontier models succeed on only 15–24% of tasks that require modifying multiple files across a large codebase — even though the same models ace individual subtasks.
Why the Problem Is Worse Than the Formula Suggests
The formula p^n assumes three things that are almost never true in production: errors are independent, errors are visible, and errors are recoverable.
Errors are correlated. LLMs trained on similar data fail in similar ways. If you run three instances of the same model on the same input hoping that majority voting will catch errors, you may find all three instances confidently agreeing on the wrong answer. They share the same hallucinations, the same blind spots, the same training distribution. A bug that triggers one triggers all. This makes consensus voting far less valuable than it appears — you're not getting independence, you're getting correlated confidence.
Errors are sticky. An agent producing a plausible-looking but wrong output in step 2 doesn't cause step 3 to fail loudly. Step 3 consumes the wrong output, produces another plausible-looking but wrong output, and the error compounds silently. By step 7 you're far from where you should be, but no exception has been thrown. This "failure stickiness" is the production failure mode that bites teams: the system appears to run successfully while delivering garbage results.
Errors aren't uniform across steps. Research on agentic workflow reliability shows that terminal steps (the last steps in a pipeline) and initial steps are disproportionately responsible for failures. Initial failures cascade — everything downstream builds on a broken foundation. Terminal failures are the hardest to catch because the pipeline has already committed resources to get there. Intermediate steps matter less in isolation but amplify earlier errors.
The actual effective per-step accuracy for real enterprise tasks is often 60–80%, not 95%. Plug those numbers into the formula and a five-step pipeline succeeds between 8% and 33% of the time. That's not a reliability problem. That's a product that doesn't work.
The Architectural Responses That Actually Help
None of the following mitigations are silver bullets. Each operates on a different part of the failure curve.
Task decomposition with error isolation
The most important architectural decision isn't which model to use — it's how you decompose tasks. When a long task is encoded as a single agent prompt, a failure anywhere requires restarting everything. When each logical step is its own independent unit with explicit inputs and outputs, a failure at step 4 means you retry step 4, not the whole pipeline.
Aggressive decomposition also creates better checkpointing opportunities and makes failures visible. A monolithic agent that fails at step 7 of an internal chain tells you "it failed." A decomposed pipeline tells you "step 4 received correct input and produced incorrect output — here's the diff." That's debuggable.
The key insight: decompose at the granularity where you can write clear input/output schemas, not at the granularity where the subtasks feel like natural human task boundaries. LLM tasks need to be broken smaller than humans would expect.
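One way this decomposition can look in code. Everything here is a sketch under stated assumptions: `Step`, the dict-based schema, and the retry policy are illustrative, not a prescribed framework:

```python
# Sketch: each pipeline step is an independent unit with an explicit
# input/output contract, so a failure at step k retries step k alone
# instead of restarting the whole pipeline.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Step:
    name: str
    run: Callable[[dict], dict]        # explicit input -> explicit output
    validate: Callable[[dict], bool]   # schema/constraint check on output
    max_retries: int = 2

def run_pipeline(steps: list[Step], payload: dict) -> dict:
    for step in steps:
        for _attempt in range(step.max_retries + 1):
            out = step.run(payload)
            if step.validate(out):
                payload = out          # checkpoint: last known-good output
                break
        else:
            # failure is visible and localized: we know which step broke
            # and what the last good state was
            raise RuntimeError(f"{step.name} failed after retries; "
                               f"last good keys: {sorted(payload)}")
    return payload
```

Because each step declares what it consumes and produces, a failure report can say "step 4 received valid input and produced invalid output" rather than "it failed."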
Confidence-gated early exit
Not all outputs deserve equal confidence. An output where the model is genuinely uncertain should not proceed to the next step — it should either retry, flag for human review, or invoke a fallback path.
The challenge is that raw model confidence scores are poorly calibrated. High logit probability doesn't mean the answer is correct. You need calibrated confidence: does the model's stated confidence actually predict correctness on held-out data? In most production systems, this calibration work is skipped, so confidence scores are nearly useless as gates.
Better alternatives: output consistency checks (run the same step twice with different prompts and compare), constraint validation (does the output satisfy the domain constraints it should?), and behavioral invariant testing (does the output pass sanity checks that should always be true). These are more reliable than raw model confidence.
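A minimal sketch of such a gate, assuming consistency plus constraint checks rather than raw confidence. `call_model` is a stand-in for a real LLM call; the function names are illustrative:

```python
# Confidence gating without model logits: run the step with two prompt
# phrasings and require agreement, then apply domain constraints.
def gate(call_model, prompts: list[str], constraints) -> tuple[bool, str]:
    answers = [call_model(p) for p in prompts]
    if len(set(answers)) != 1:          # consistency check failed
        return False, "inconsistent"
    answer = answers[0]
    for check in constraints:
        if not check(answer):           # domain constraint violated
            return False, f"constraint failed: {check.__name__}"
    return True, answer
```

An output that fails the gate is held back for retry, repair, or human review instead of flowing into the next step.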
When a step fails these checks, the right response isn't always to retry. Classifying errors helps:
- Transient errors (parsing failures, timeout): retry with backoff
- Repairable errors (constraint violation with fixable output): surface the error to the model and ask it to correct
- Permanent errors (semantically impossible request): escalate or abort early
Retrying repairable errors by injecting the error message back into the context — the Try-Rewrite-Retry pattern — recovers a significant fraction of failures at low cost.
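The classification above can be expressed as a small control loop. The error classes and the `step` callable are illustrative; a production version would add backoff and logging:

```python
# Sketch of the Try-Rewrite-Retry pattern: classify the failure, retry
# transient errors, feed repairable errors back to the model, and
# escalate permanent ones immediately.
class TransientError(Exception): pass    # e.g. parse failure, timeout
class RepairableError(Exception): pass   # fixable constraint violation
class PermanentError(Exception): pass    # semantically impossible request

def try_rewrite_retry(step, prompt: str, max_attempts: int = 3) -> str:
    context = prompt
    for _ in range(max_attempts):
        try:
            return step(context)
        except TransientError:
            continue                     # retry as-is (add backoff in prod)
        except RepairableError as e:
            # inject the error message so the model can self-correct
            context = f"{prompt}\n\nPrevious attempt failed: {e}. Fix it."
        except PermanentError:
            raise                        # escalate or abort early
    raise RuntimeError("exhausted retries")
```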
Strategic verification placement
Adding a verification step after every step in a long pipeline is expensive. Research on verification-guided agentic workflows shows that not all verification steps deliver equal value.
The highest-value verification points are:
- After the first step (catching early failures before they cascade)
- At branching points (where an error takes the pipeline down a completely wrong path)
- Before irreversible actions (sending emails, modifying records, making payments)
- Before the final output synthesis step
Verification in the middle of a linear chain often just adds latency without improving outcome reliability, because a verification step that catches an error still requires retrying from the error point — and if the middle steps are the error-resistant parts of the pipeline, you're spending resources on the wrong nodes.
One practical optimization: speculative execution. Continue the next step while verification runs asynchronously. If verification fails, roll back. If it passes, you've amortized the latency. This works when steps are cheap to roll back, which is another reason to design steps to be reversible by default.
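A rough sketch of speculative execution with a thread pool. `verify` and `next_step` are stand-ins, and the "rollback" here is best-effort — real rollback depends on steps being designed reversible, as noted above:

```python
# Speculative execution: start the next step while verification of the
# current output runs concurrently; keep the speculative result only if
# verification passes.
from concurrent.futures import ThreadPoolExecutor

def speculative(output, verify, next_step):
    with ThreadPoolExecutor(max_workers=2) as pool:
        check = pool.submit(verify, output)      # async verification
        spec = pool.submit(next_step, output)    # speculate on next step
        if check.result():
            return spec.result()                 # latency amortized
        spec.cancel()  # best-effort: a running task is discarded, not undone
        raise ValueError("verification failed; speculative work discarded")
```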
Human checkpoints at the right places
Human-in-the-loop review works, but only if checkpoints are placed based on risk, not cadence. A checkpoint every N steps is arbitrary. A checkpoint before any irreversible action, before any step with output variance above a threshold, or when accumulated confidence drops below a floor — those are checkpoints that catch actual failures.
The mechanical requirement: the checkpoint must serialize the entire agent state (all context, all intermediate outputs, the planned next action) in a form that a human reviewer can actually understand in under two minutes. If the state dump requires reading 40k tokens of agent scratchpad to evaluate, reviewers will approve blindly. That's not oversight — it's theater.
Design the state representation for human comprehension, not for technical completeness. A summary of what the agent has done, what it's about to do, and why it's uncertain is more useful for a reviewer than a raw context dump.
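One way to encode that separation, assuming a checkpoint record whose fields are illustrative: the full state is serialized for replay, but the reviewer only sees a short summary:

```python
# Sketch of a reviewer-facing checkpoint: full state is kept for
# resumption, but the human sees what was done, what comes next, and
# why the agent is uncertain -- not a raw context dump.
from dataclasses import dataclass, field

@dataclass
class Checkpoint:
    done: list[str]        # completed steps, one line each
    next_action: str       # the single action awaiting approval
    uncertainty: str       # why this checkpoint was escalated
    raw_state: dict = field(default_factory=dict)  # kept, never shown

    def for_reviewer(self) -> str:
        return "\n".join([
            f"Done: {'; '.join(self.done)}",
            f"Next: {self.next_action}",
            f"Uncertain because: {self.uncertainty}",
        ])
```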
Reducing correlated failures through diversity
If you need multi-agent consensus to improve reliability, the diversity of the agents matters more than their count. Three instances of the same model voting rarely outperform a single instance, because their errors are correlated. A more reliable approach:
- Use different prompt formulations for the same underlying task
- Use models from different providers or training runs
- Use an independent verification model (different base model, different prompt) rather than majority vote among clones
For higher-stakes applications, architectural diversity — different reasoning approaches (chain-of-thought vs. direct answer vs. structured decomposition) — can expose correlated failure modes that homogeneous ensembles miss.
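The contrast between clone voting and independent verification can be sketched as follows. The model callables are stubs; in production they would be calls to different providers or differently-prompted models:

```python
# Clone voting vs. an independent verifier. k copies of the same model
# share failure modes: when the model is wrong, the majority is wrong
# with the same confidence.
from collections import Counter

def clone_vote(model, prompt, k: int = 3):
    return Counter(model(prompt) for _ in range(k)).most_common(1)[0][0]

def verified_answer(answer_model, verify_model, prompt):
    # an independent model checks the answer instead of re-deriving it;
    # returns None when the verifier rejects, forcing escalation
    answer = answer_model(prompt)
    return answer if verify_model(prompt, answer) else None
```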
What the Success Rate Math Tells You About Pipeline Length
The compounding formula gives you a design constraint: given a target end-to-end reliability, what per-step accuracy do you need, and is that achievable?
For a 5-step pipeline targeting 90% overall success, you need 98% per-step accuracy. For a 10-step pipeline targeting 90% overall success, you need 98.9% per-step accuracy. Both of those numbers are at the edge of what production LLM systems can achieve on well-defined tasks, and impossible on ambiguous ones.
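Inverting the compounding formula gives the requirement directly: solving p^n = target for p yields p = target^(1/n):

```python
# Per-step accuracy needed to hit a target end-to-end success rate
# over n sequential steps: solve p ** n = target for p.
def required_step_accuracy(target: float, n: int) -> float:
    return target ** (1 / n)

print(f"{required_step_accuracy(0.90, 5):.4f}")   # 5 steps  -> ~0.979
print(f"{required_step_accuracy(0.90, 10):.4f}")  # 10 steps -> ~0.990
```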
This means that beyond 5–7 steps, you cannot hit 90% reliability through step accuracy alone. You must add error recovery mechanisms. The pipeline structure must assume failures will happen and have a path for handling them — retry logic, fallback strategies, human escalation — not just assume that steps will succeed.
Empirically, long-horizon pipelines (10+ steps) in production research settings succeed only 15–30% of the time without explicit error recovery architecture. The systems that reach 80%+ success on complex tasks all share common features: aggressive decomposition, explicit checkpointing, error classification and retry, and human escalation for high-uncertainty decisions.
The Error Recovery Inversion
Here's the counterintuitive result that changes how you should prioritize reliability engineering: reducing failure stickiness (how much an error propagates before recovery) often matters more than improving per-step accuracy.
The arithmetic: with per-step accuracy p and a recovery mechanism that catches and corrects a fraction r of errors before they propagate, effective per-step success becomes p + (1 − p)·r. A 10-step pipeline at 90% accuracy where three-quarters of errors are recovered runs at 97.5% effective per-step success and completes about 78% of the time; a 96% accurate pipeline with no recovery completes only 66% of the time. This is because recovery stops the compounding. An error in step 2 that's caught and corrected doesn't infect steps 3–10. An error in step 2 that propagates silently costs you the entire pipeline.
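This can be checked with a toy model. The assumption, stated plainly: a caught error is corrected before it propagates, so effective per-step success is p + (1 − p)·r, where r is the fraction of errors recovered:

```python
# Toy model of error recovery: r is the fraction of per-step errors
# caught and corrected before they propagate, raising effective
# per-step success from p to p + (1 - p) * r.
def pipeline_success(p: float, n: int, r: float = 0.0) -> float:
    effective = p + (1 - p) * r
    return effective ** n

# 10 steps: modest accuracy plus recovery vs. higher accuracy alone
with_recovery = pipeline_success(0.90, 10, r=0.75)   # ~0.78
accuracy_only = pipeline_success(0.96, 10)           # ~0.66
```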
The implication for engineering prioritization: before spending effort to push per-step accuracy from 93% to 96%, check whether you have any error recovery at all. Instrumenting for failure detection, adding retry logic on repairable errors, and implementing a basic confidence threshold for human review will likely move overall pipeline reliability more than the same engineering effort spent on prompt optimization.
This also means that per-step accuracy is the wrong primary metric for complex pipelines. The right metric is end-to-end task completion rate, measured across a representative sample of realistic inputs — not curated test cases.
Where to Start
If you're building or maintaining a multi-step agent system, a short diagnostic:
- What is your current end-to-end task completion rate on realistic inputs? If you don't know, that's the first problem.
- How many sequential steps does your pipeline have? Apply the formula to your measured per-step accuracy. Is the predicted end-to-end rate consistent with what you're seeing?
- When a step fails, does the pipeline fail loudly (exception, empty output) or silently (plausible-looking wrong output)? Silent failures are the dangerous ones.
- Do you have any error recovery? Can you retry individual steps without restarting the whole pipeline?
- Where are your checkpoints? Are they before high-risk actions, or arbitrary?
The compound accuracy problem doesn't have a single fix. It's a constraint that shapes architecture from the start: how tasks are decomposed, how outputs are validated, how errors are classified and recovered, and where human judgment is inserted. Teams that treat it as a deployment concern rather than a design constraint will keep rediscovering it in production.
The 60% success rate isn't the model's fault. It's arithmetic.
- https://arxiv.org/html/2511.09030v1
- https://arxiv.org/html/2509.16941v1
- https://arxiv.org/pdf/2511.00330
- https://arxiv.org/html/2604.04269
- https://aclanthology.org/2025.acl-long.302.pdf
- https://www.oreilly.com/radar/the-hidden-cost-of-agentic-failure/
- https://chrispedder.substack.com/p/when-agents-fail-compounding-errors
- https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/
- https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/
- https://wand.ai/blog/compounding-error-effect-in-large-language-models-a-growing-challenge
