Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure
The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.
Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.
This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.
The Arithmetic of Retry Amplification
Start with the bare arithmetic. If each of N sequential tool calls has probability p of failing and failures are independent, the probability that a run completes without a single failure is (1 - p)^N. At p = 0.02 and N = 10, that is 81.7%. Flip it: an 18.3% per-session failure rate sits on top of a 2% per-tool error rate, purely from multiplying out the steps. No retries yet. The per-run failure rate is roughly 1 - (1 - p)^N ≈ N·p when p is small — ten tool calls with a 2% error rate give you a 20% agent failure rate before anything else goes wrong.
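The arithmetic fits in a few lines. This sketch simply restates the formula above with the same numbers, so you can rerun it against your own per-tool error rate and step count:

```python
def per_run_failure_rate(p_tool_error: float, n_calls: int) -> float:
    """Probability that at least one of n_calls independent tool calls fails."""
    return 1 - (1 - p_tool_error) ** n_calls

p, n = 0.02, 10                                 # 2% per-tool error rate, 10 sequential calls
print(f"{per_run_failure_rate(p, n):.1%}")      # 18.3% per-run failure rate
print(f"{n * p:.1%}")                           # 20.0%, the N*p approximation
```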
Now add retries. Naive wrappers look like this: the SDK retries 3 times, the gateway retries 3 times, the agent loop retries 3 times. Operators call this "defense in depth." The system calls it 27 attempts per logical request, because the layers multiply geometrically. When the downstream is degraded, every caller suddenly contributes 27× the baseline load at the precise moment the dependency can least afford it. This is the classic retry storm, and agent loops happen to be its most enthusiastic participant because the fan-out per user action is larger and the stack is taller than the equivalent web app.
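The 27 comes from nothing more exotic than multiplying per-layer attempt counts. A minimal sketch, assuming each layer makes three attempts before giving up and the downstream is hard-down so every attempt fails:

```python
from math import prod

# attempts each layer makes per call it receives, worst case
attempts_per_layer = {"sdk": 3, "gateway": 3, "agent_loop": 3}

# attempts that reach the downstream for one logical request: the counts
# multiply, because each outer layer's retry re-enters every layer below it
worst_case = prod(attempts_per_layer.values())
print(worst_case)   # 27 attempts, i.e. 27x baseline load during the outage
```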
The December 2025 empirical study of microservice retry behavior gave the pattern a clean benchmark. Exponential backoff without jitter produced p99 latency of 2,600 ms and a 17% error rate, largely attributable to retry amplification. Adding jitter dropped p99 to 1,400 ms and errors to 6%. Bounded retries behind a circuit breaker produced p99 of 1,100 ms and a 3% error rate. These numbers generalize: unbounded retries do not restore service, they delay recovery and hide the blast radius in your own bill.
Why Agent Loops Amplify Harder Than Request-Response
Three structural differences turn agent retry behavior from annoying to actively dangerous.
First, fan-out per user action is higher. A REST endpoint fires one downstream call per request on average. An agent plan often fires ten or twenty, and at each step the language model gets to decide whether to try the same tool again because "maybe the result was wrong." This is not a retry in the classic sense — it is the model making the same request because the tool output was ambiguous. Your observability layer will not tag it as a retry. Your rate limiter will not dedupe it. Your cost dashboard will show it as "normal planning activity."
Second, each attempt costs more than a network round-trip. Every retry in an agent loop consumes tokens on both the input (the entire conversation history replays through the LLM on the next turn) and the output (the plan the model generates after seeing the error). A ReAct-style agent retrying a failed tool three times easily spends 10,000 additional input tokens — the full conversation re-billed three times — before concluding the tool is broken. In one analysis, 90.8% of retries in a 200-task benchmark were wasted on non-retryable errors such as calling tools that do not exist. The agent kept trying. The bill kept climbing. The downstream kept degrading.
Third, context accumulation interacts badly with retries. When a tool fails, the error message enters the conversation history and stays there. The next turn not only pays for replaying the failure but also for the model's longer plan, its now-defensive prompt, and its tendency to over-explain why the next attempt will be different. A session that runs 2× as many turns often costs 3–4× as much, because later turns carry heavier context. Retry amplification in agents is not just a request-volume problem; it is a token-volume problem that compounds against you twice.
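A rough way to see the compounding, assuming a session where every turn replays the full history; the token counts are purely illustrative:

```python
def session_input_tokens(n_turns: int, output_per_turn: int = 500,
                         base_context: int = 1_000) -> int:
    """Total input tokens billed when every turn replays the accumulated history."""
    total, history = 0, base_context
    for _ in range(n_turns):
        total += history              # this turn's input is everything so far
        history += output_per_turn    # the turn's output joins the history
    return total

short, long = session_input_tokens(10), session_input_tokens(20)
print(short, long, f"{long / short:.1f}x")   # doubling the turns costs ~3.5x the input tokens
```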
The Self-DoS Pattern: When Your Agent Attacks Your Infrastructure
The textbook cascading failure pattern goes like this. A downstream tool — say, the search index — starts returning timeouts. The agent interprets the timeout as a transient failure and retries. The retries race against the agent's per-step deadline, which was generous enough that three attempts fit inside it. The downstream is now receiving 3× its normal load at the moment it can handle the least. Latency grows. The agent's deadline starts firing. Users hit the retry button in the UI, which spawns new agent runs with fresh retry budgets. Within three minutes, the search tier is saturating; within seven, even unrelated workflows that share worker pools start degrading. The initial trigger resolved at minute four, but by minute nine the system is still climbing because the retry traffic is now sustaining itself.
This is a metastable failure. The input that caused it is gone, and the system will not recover on its own because its own load is keeping it in the degraded state. The only way out is to stop retrying, which is exactly the behavior your retry logic was designed to prevent.
A second variant is specific to agent systems: the agent DoSing its own quota. A single user request fans out into dozens of concurrent tool calls that all draw from the same per-user token bucket. The first few hit the limit. The gateway returns 429. The agent retries. The retries also hit the limit. The agent now spends the remainder of the session fighting itself for quota, none of the work completes, and the user's entire token allowance is consumed on self-inflicted recovery. No adversary was needed.
The retry-amplification failure mode and the self-DoS failure mode are the same math applied in two directions — one against shared downstream capacity, the other against per-tenant quota — and both require the same fix.
The Four-Layer Retry Budget
Borrowing from Google SRE's production retry policy, the discipline that actually works has four layers that agree with each other instead of compounding.
Layer 1: per-request cap. Never retry a single operation more than 2–3 times, total, across the entire stack. This is a hard rule, not a per-layer rule. If the SDK retries 3 times, the gateway retries 0, and the agent loop retries 0 — that is the system's budget. The worst pattern in the wild is three layers each thinking they own the retry contract. Write the 2–3 down, pick the layer that will enforce it, and set the rest to zero. Request metadata should carry the attempt counter so every layer can see the budget has been spent.
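A minimal sketch of the single-owner budget, assuming a dict-shaped request envelope and a `send` callable; the metadata field names are illustrative, not any particular SDK's:

```python
MAX_ATTEMPTS = 3   # the whole stack's budget; every other layer retries zero times

def call_with_budget(send, request: dict) -> dict:
    """The one layer that owns retries. The attempt counter rides in request
    metadata so layers below can see the budget and must not add to it."""
    response = {}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        request.setdefault("metadata", {})["attempt"] = attempt
        request["metadata"]["max_attempts"] = MAX_ATTEMPTS
        response = send(request)
        if not response.get("retryable", False):
            break
    return response
```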
Layer 2: per-client retry budget. Track the ratio of retries to successful requests at the client. Google's production rule is that a request will only be retried if this ratio is below 10% — meaning retries can contribute at most 10% additional load during healthy operation. If the downstream is broken, the ratio blows past 10% within seconds and the client stops retrying altogether. This is a feedback loop: the sicker the dependency, the fewer retries hit it, which is the opposite of naive retry behavior.
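A sketch of the client-side budget, assuming a sliding one-minute window; the 10% ratio is the rule quoted above, the rest of the shape is illustrative:

```python
import collections
import time

class RetryBudget:
    """Allow a retry only while retries stay below 10% of recent request volume."""

    def __init__(self, ratio: float = 0.10, window_s: float = 60.0):
        self.ratio, self.window_s = ratio, window_s
        self.requests = collections.deque()   # timestamps of all requests
        self.retries = collections.deque()    # timestamps of retries only

    def _trim(self, dq, now: float) -> None:
        while dq and now - dq[0] > self.window_s:
            dq.popleft()

    def record_request(self) -> None:
        self.requests.append(time.monotonic())

    def allow_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        # the sicker the dependency, the faster this ratio blows past 10%,
        # and the client stops adding load instead of amplifying it
        if len(self.retries) + 1 > self.ratio * max(len(self.requests), 1):
            return False
        self.retries.append(now)
        return True
```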
Layer 3: exponential backoff with jitter. Use capped exponential backoff plus full jitter (a random delay between 0 and the computed exponential) or decorrelated jitter (the next delay drawn at random between the base and 3 × the previous delay). AWS data shows this reduces retry storms by 60–80% compared to deterministic backoff. The key word is jitter: without it, a synchronized failure creates a synchronized retry, which creates a second failure at the exact same wall-clock moment across every client.
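Both jitter variants are one-liners. This sketch mirrors the definitions above; the base and cap values are placeholders, not recommendations:

```python
import random

def full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Sleep a random amount between 0 and the capped exponential delay."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def decorrelated_jitter(previous_delay: float, base: float = 0.1, cap: float = 10.0) -> float:
    """Next delay drawn uniformly between the base and 3x the previous delay."""
    return min(cap, random.uniform(base, previous_delay * 3))
```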
Layer 4: circuit breaker. When the error rate on a downstream crosses a threshold, the breaker opens and subsequent calls fail fast without hitting the network at all. After a cooldown, the breaker half-opens, allows a probe through, and either closes on success or re-opens on failure. This is the only one of the four layers that actually reduces load on a struggling downstream; the other three limit the amount you add. A system with the first three layers and no circuit breaker still sends traffic, just at a slower and bounded rate. A circuit breaker sends zero during the outage, which is the thing the downstream actually needs.
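A minimal breaker, simplified so that any call arriving during the half-open window counts as the probe; the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold, self.cooldown_s = failure_threshold, cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                       # closed: pass traffic
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                       # half-open: let a probe through
        return False                                          # open: fail fast, zero downstream load

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures, self.opened_at = 0, None   # probe succeeded: close
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()                 # open, or re-open after a failed probe
```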
Each layer exists to handle a failure mode the others cannot. Per-request cap stops a single call from looping. Per-client budget stops a single client from being the DoS. Jittered backoff stops synchronized storms. Circuit breaker stops traffic entirely during outages. A retry policy that ships only one of the four is a policy that will surprise its operators.
Retry Budget for Agents Specifically: Tokens, Not Just Requests
The four layers above come from request-response systems. Agents need a fifth layer: a token-denominated retry budget. Requests are not the scarce resource that bounds user experience in an agent session; tokens are. A retry policy that stops after "3 attempts" but allows each attempt to consume 15,000 input tokens has spent a mid-tier user's daily budget on recovery from a single broken tool.
Concretely, this means:
- Classify errors before retrying at all. The 90% wasted-retry finding from ReAct benchmarks is almost entirely non-retryable errors — missing tools, auth failures, malformed arguments — being treated as retryable. Bake the classification into the tool-result schema (e.g. `retryable: false` in the error envelope) and have the agent loop honor it without consulting the model. Zero retries for non-retryable errors, regardless of how confident the plan looks.
- Budget retries in tokens, not attempts. Sum the input and output tokens spent on failed attempts within a session. When the sum exceeds a per-session ceiling (a realistic number is 10% of the session's target token budget), stop retrying and surface the failure to the user. The ceiling prevents a single flaky tool from burning the entire session envelope. A sketch of this gate follows the list.
- Make retry decisions visible to the planner. Most agents retry without telling the model. This means the model keeps making the same plan and getting the same error. Instead, surface retry exhaustion as a structured signal in the conversation ("search tool failed 3×, budget exhausted, choose an alternative") so the planner can actually route around the failure rather than retry blindly into the wall. The model is allowed to plan; it is not allowed to retry.
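A sketch of that gate, assuming the error envelope carries the `retryable` flag described above and the session object tracks its own token accounting; every field name here is illustrative:

```python
RETRY_TOKEN_SHARE = 0.10   # ceiling as a share of the session's target token budget

def should_retry(error: dict, session: dict) -> bool:
    """Retry gate that spends tokens, not attempts."""
    if not error.get("retryable", False):
        return False                                  # non-retryable: zero retries, ever
    ceiling = RETRY_TOKEN_SHARE * session["target_token_budget"]
    if session["failed_attempt_tokens"] >= ceiling:
        # budget exhausted: tell the planner instead of retrying blindly
        session["messages"].append({
            "role": "tool",
            "content": f"{error['tool']} failed, retry budget exhausted; choose an alternative",
        })
        return False
    return True
```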
Idempotency: The Retry That Shouldn't Have Been Retried
The final sharp edge is side effects. The classic duplicate-action bug looks like this: the agent calls send_email, the tool succeeds, the response times out on the return path, the agent retries, two emails go out. In a web app, the browser sees a duplicate submission and the user notices. In an agent system, the model reads the second success and narrates confidently that the email was sent. The duplicate is silent.
Three defenses matter:
- Idempotency keys on every tool with side effects. Hash `(workflow_id, tool_name, normalized_args)` into a key. The tool server keeps a ledger. A repeat call with the same key returns the stored result instead of re-executing. Stripe's API documentation is the canonical reference — the pattern is boring and battle-tested, and there is no reason agent tool servers should not adopt it directly. A sketch of the ledger follows this list.
- Treat timeouts as ambiguous, not failed. An operation that timed out may or may not have completed. Retrying a non-idempotent operation after a timeout is how duplicate side effects happen. Either make the tool idempotent or mark timeouts as non-retryable on write paths.
- Gate destructive operations behind a confirmation layer. For truly irreversible actions — financial transactions, production deploys, customer-facing emails — human approval on the retry path prevents the silent duplicate. This is the only layer that survives when every other defense fails.
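A sketch of the ledger pattern on the tool-server side, with an in-memory dict standing in for durable storage; the key derivation follows the hash described in the first bullet:

```python
import hashlib
import json

LEDGER: dict[str, dict] = {}   # stand-in for the tool server's durable result store

def idempotency_key(workflow_id: str, tool_name: str, args: dict) -> str:
    """Key over (workflow_id, tool_name, normalized_args)."""
    normalized = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{workflow_id}:{tool_name}:{normalized}".encode()).hexdigest()

def execute_once(workflow_id: str, tool_name: str, args: dict, run) -> dict:
    """Execute the side-effecting tool at most once per key; a retried call
    with the same key gets the stored result instead of a second execution."""
    key = idempotency_key(workflow_id, tool_name, args)
    if key not in LEDGER:
        LEDGER[key] = run(args)
    return LEDGER[key]
```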
What to Measure
If retry amplification is not on the monitoring stack, it is happening and you cannot see it. The minimum instruments:
- Retry ratio per downstream. Retries divided by total requests, windowed at one minute. If this goes above 10% sustained, something is wrong with either the downstream or your retry policy.
- Retry token spend per session. Tokens consumed on attempts that produced an error result. Express as a percentage of total session tokens. Anything above 15% is a leading indicator of a dependency degrading.
- Breaker open rate. Fraction of time each circuit breaker is open, per downstream. This is the ground truth for "how often is this thing actually down to users." A breaker that never opens is probably mis-tuned.
- Self-DoS events. Count of sessions where a single user's retries consumed more than 50% of their per-tenant token budget. One of these is a bug; many are a systemic problem with the retry loop.
None of these require new infrastructure. All four are calculable from the tracing and billing telemetry you already have if you aggregate it against tool-call outcomes. The harder part is picking the thresholds that trigger a page, and the only honest answer is that you tune them against your own SLOs.
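As one way to cut the first two instruments from existing telemetry, here is a sketch that assumes each tool-call trace record carries `downstream`, `is_retry`, `tokens`, and `error` fields; the field names are a guess at your schema, not a standard:

```python
def retry_instruments(tool_calls: list[dict]) -> dict:
    """Retry ratio per downstream, plus the share of session tokens spent on
    attempts that produced an error result."""
    per_downstream: dict[str, dict] = {}
    failed_tokens = total_tokens = 0
    for call in tool_calls:
        d = per_downstream.setdefault(call["downstream"], {"requests": 0, "retries": 0})
        d["requests"] += 1
        d["retries"] += int(call["is_retry"])
        total_tokens += call["tokens"]
        if call["error"]:
            failed_tokens += call["tokens"]
    return {
        "retry_ratio": {k: v["retries"] / v["requests"] for k, v in per_downstream.items()},
        "failed_attempt_token_share": failed_tokens / max(total_tokens, 1),
    }
```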
The Takeaway
Retry amplification is not a bug in any one layer; it is the emergent behavior of layers that do not know about each other. The fix is not to remove retries — transient failures are real, and the user-visible experience of a correctly retried call is much better than the raw downstream success rate suggests. The fix is to stop treating retries as free and to account for them in the same way you account for any other finite resource: a budget, an enforcement layer, a metric, and a circuit breaker for when the budget itself becomes the attack surface.
The agent loop turns every mistake in that accounting into a multiplier. A 2% tool error becomes a 20% session failure when you multiply steps. A 20% session failure becomes a self-DoS when you multiply retries. A self-DoS becomes a metastable outage when the retries sustain the load that caused the retries. The math is unkind to systems that ship retry logic by reflex. The discipline that beats the math is boring: cap the attempts, budget the tokens, jitter the backoff, break the circuit, and make the planner see what the infrastructure sees. Build those five things and the spreadsheet number and the incident number will agree.
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/
- https://sre.google/sre-book/handling-overload/
- https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
- https://temporal.io/blog/error-handling-in-distributed-systems
- https://arxiv.org/html/2512.16959v1
- https://fast.io/resources/ai-agent-retry-patterns/
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://dev.to/willvelida/preventing-cascading-failures-in-ai-agents-p3c
- https://www.agentpatterns.tech/en/failures/cascading-failures
- https://agentsarcade.com/blog/error-handling-agentic-systems-retries-rollbacks-graceful-failure
- https://medium.com/@kaushalsinh73/7-patterns-that-make-agent-retries-idempotent-not-duplicative-f57bc70018b7
- https://www.infoworld.com/article/4138748/finops-for-agents-loop-limits-tool-call-caps-and-the-new-unit-economics-of-agentic-saas.html
