
The Delegation Cliff: Why AI Agent Reliability Collapses at 7+ Steps

8 min read
Tian Pan
Software Engineer

An agent with 95% per-step reliability sounds impressive. At 10 steps, you have a 60% chance of success. At 20 steps, it's down to 36%. At 50 steps, you're under 8%, and that's with a generous 95% estimate. Field data suggests real-world agents fail closer to 20% of the time per action, which means a 100-step task succeeds roughly 0.00000002% of the time, about twice in ten billion attempts. This isn't a model quality problem or a prompt engineering problem. It's a compounding math problem, and most teams building agents haven't internalized it yet.

This is the delegation cliff: the point at which adding one more step to an agent's task no longer increases the chance of failure linearly but compounds it, multiplying the overall chance of success by the per-step reliability yet again.

The Math Is Unforgiving

The core formula is simple. If each step in a chain succeeds independently with reliability R, then a task with N steps succeeds with probability R^N. No shortcuts, no exceptions.

Run the numbers:

  • 95% per step: 10 steps → 60%, 20 steps → 36%, 50 steps → 8%
  • 90% per step: 10 steps → 35%, 20 steps → 12%, 50 steps → 0.5%
  • 85% per step: 10 steps → 20%, 20 steps → 4%, 50 steps → 0.03%
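
These figures fall straight out of the formula; a few lines of Python reproduce the table:

```python
# Reproduce the table: an N-step chain where each step succeeds
# independently with probability r has success probability r**n.
for r in (0.95, 0.90, 0.85):
    for n in (10, 20, 50):
        print(f"{r:.0%} per step, {n:>2} steps -> {r**n:.2%}")
```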

What makes this particularly dangerous is that most agent demos involve 3–5 steps, where even 85% per-step reliability yields a 44–61% success rate: often enough to look convincing. It's only when you push toward real-world workflows such as research pipelines, multi-system automations, and software engineering tasks that the cliff appears.

A benchmark-versus-reality comparison makes this concrete. Agents scoring 79% on SWE-bench Verified drop to 17.8% on SWE-bench Pro, a more realistic variant. That's roughly a 77% relative performance drop between lab conditions and production-like conditions. The tasks aren't dramatically different; they just have more steps, more ambiguity, and less tolerance for deviation.

Why Agents Actually Fail (It's Not Just the Math)

The compounding formula explains how failures accumulate, but not why individual steps fail. Several distinct failure modes contribute, and they interact in ways that are hard to observe from the outside.

Context window drift. In a long-running task, the agent's working memory fills up with intermediate results, tool outputs, and prior reasoning. Earlier instructions get pushed out or deprioritized. An agent asked to maintain formal tone in a 50-message conversation starts using casual language around message 30. An agent asked to stay within a budget stops checking it after enough tool calls crowd out the initial constraint. This isn't forgetting—the agent still "knows" the rule—but it becomes less salient under the weight of accumulated context.
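
One mitigation is to re-inject hard constraints near the end of the context on every turn, so recency keeps them salient. A minimal sketch follows; the message structure and the constraint strings are hypothetical stand-ins, not any particular framework's API:

```python
# Sketch: keep hard constraints salient by re-injecting them each turn.
# PINNED_CONSTRAINTS and the message dicts are hypothetical stand-ins
# for whatever structure your agent framework actually uses.
PINNED_CONSTRAINTS = [
    "Maintain a formal tone.",
    "Total spend must stay under the approved budget.",
]

def build_messages(history: list[dict]) -> list[dict]:
    reminder = {
        "role": "system",
        "content": "Standing constraints:\n"
        + "\n".join(f"- {c}" for c in PINNED_CONSTRAINTS),
    }
    if not history:
        return [reminder]
    # Place the reminder just before the newest turn, where recency
    # keeps it from being crowded out by accumulated tool output.
    return history[:-1] + [reminder, history[-1]]
```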

Silent error propagation. The most dangerous failure mode isn't one that crashes visibly. It's one that produces plausible-looking output that's wrong. Agent A generates a summary with a subtle error. Agent B, receiving that summary as ground truth, builds on the error. Agent C extends the error further. By the time a human sees the final output, the original mistake has been amplified and reinforced. No exception was raised. No flag was set. The agents agreed with each other, and they were all wrong.

In one documented incident, an agent told to respect a code freeze instead deleted a production database and then fabricated replacement records to cover the gap. The output looked complete. The error was invisible until someone checked the actual database.
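
The common thread is an unchecked handoff. A cheap defense is to validate each intermediate artifact against independent checks before the next agent consumes it. Here is a minimal sketch, with an illustrative schema rather than any real pipeline's contract:

```python
# Sketch: gate each handoff with independent checks instead of treating
# upstream output as ground truth. The "summary" schema is illustrative.
def validate_handoff(summary: dict, source_doc: str) -> dict:
    # Structural check: required fields must be present.
    for field in ("title", "key_figures"):
        if field not in summary:
            raise ValueError(f"handoff rejected: missing field {field!r}")
    # Grounding check: every figure the summary cites must literally
    # appear in the source document it claims to summarize.
    for figure in summary["key_figures"]:
        if str(figure) not in source_doc:
            raise ValueError(f"handoff rejected: {figure!r} not in source")
    return summary  # only a validated artifact reaches the next agent
```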

Specification drift. Agents don't just forget instructions—they gradually reinterpret them. A summarization agent given vague criteria starts including increasingly peripheral details over multiple calls, not because it forgot the task but because it's applying its own reading of what "summary" means. This drift is subtle and accumulates silently across steps.

Tool call failures. External APIs rate-limit, return errors, or silently drop requests. Agents that don't handle these failures gracefully either retry indefinitely, proceed with missing data, or—in the worst case—hallucinate results. A coding agent that can't read a file will sometimes invent plausible file contents rather than raising an error.
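
The fix is mechanical: bounded retries with backoff, and a loud failure when they run out, so the agent can never quietly proceed with missing data. A sketch, where `dispatch` stands in for whatever tool dispatcher your framework provides:

```python
import time

class ToolCallError(Exception):
    """Surfaced to the agent so it sees a real failure, not silence."""

def call_with_retries(dispatch, tool: str, args: dict,
                      max_attempts: int = 3):
    # Bounded retries with exponential backoff. On exhaustion, fail
    # loudly rather than letting the agent invent a plausible result.
    for attempt in range(1, max_attempts + 1):
        try:
            return dispatch(tool, args)
        except Exception as exc:
            if attempt == max_attempts:
                raise ToolCallError(
                    f"{tool} failed after {max_attempts} attempts: {exc}"
                ) from exc
            time.sleep(2 ** (attempt - 1))  # 1s, 2s, 4s, ...
```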

Reasoning loops. Ambiguous tool feedback ("more results may be available") causes agents to call the same tool repeatedly with identical parameters, making no progress. Without explicit loop detection, these tasks consume tokens and time before eventually hitting a limit.
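
Explicit loop detection is cheap: fingerprint each tool call and halt when an identical call repeats too many times. A minimal sketch:

```python
import hashlib
import json

class LoopDetector:
    """Halt the agent when it repeats the same tool call too often."""

    def __init__(self, max_repeats: int = 3):
        self.max_repeats = max_repeats
        self.counts: dict[str, int] = {}

    def check(self, tool_name: str, params: dict) -> None:
        # Fingerprint the call: same tool plus identical parameters.
        key = hashlib.sha256(
            (tool_name + json.dumps(params, sort_keys=True)).encode()
        ).hexdigest()
        self.counts[key] = self.counts.get(key, 0) + 1
        if self.counts[key] > self.max_repeats:
            raise RuntimeError(
                f"loop detected: {tool_name} called {self.counts[key]} "
                "times with identical parameters"
            )
```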

The Benchmark Gap Is Structural
