Retry Amplification: How a 2% Tool Error Rate Becomes a 20% Agent Failure
The spreadsheet on the oncall doc said the search tool had a 2% error rate. The incident review said the agent platform had a 20% failure rate during the three-hour window. Nobody disagreed with either number. The search team was not at fault. The platform team did not ship a bug. The gap between the two numbers is the whole story, and it is a story about arithmetic, not engineering incompetence.
Retry logic is one of the most borrowed and least adapted patterns in agent systems. Teams copy tenacity decorators from their REST client, stack them at the SDK, the gateway, and the agent loop, and ship. Each layer is individually reasonable. The composition is a siege weapon pointed at the flakiest dependency in the fleet, and it fires hardest at the exact moment that dependency needs the load to drop.
This post is about how that math works, why agent loops amplify it harder than request-response systems, and the retry discipline that keeps transient blips from becoming correlated outages with your own logo on them.
The Arithmetic of Retry Amplification
Start with the bare arithmetic. If each of N sequential tool calls has probability p of failing and failures are independent, the probability that a run completes without a single failure is (1 - p)^N. At p = 0.02 and N = 10, that is 81.7%. Flip it: an 18.3% per-session failure rate sits on top of a 2% per-tool error rate, purely from multiplying out the steps. No retries yet. The per-run failure rate is roughly 1 - (1 - p)^N ≈ N·p when p is small — ten tool calls with a 2% error rate give you a 20% agent failure rate before anything else goes wrong.
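The arithmetic fits in a few lines. This sketch simply restates the formula above with the same numbers, so you can rerun it against your own per-tool error rate and step count:

```python
def per_run_failure_rate(p_tool_error: float, n_calls: int) -> float:
    """Probability that at least one of n_calls independent tool calls fails."""
    return 1 - (1 - p_tool_error) ** n_calls

p, n = 0.02, 10                                 # 2% per-tool error rate, 10 sequential calls
print(f"{per_run_failure_rate(p, n):.1%}")      # 18.3% per-run failure rate
print(f"{n * p:.1%}")                           # 20.0%, the N*p approximation
```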
Now add retries. Naive wrappers look like this: the SDK retries 3 times, the gateway retries 3 times, the agent loop retries 3 times. Operators call this "defense in depth." The system calls it 27 attempts per logical request, because the layers multiply geometrically. When the downstream is degraded, every caller suddenly contributes 27× the baseline load at the precise moment the dependency can least afford it. This is the classic retry storm, and agent loops happen to be its most enthusiastic participant because the fan-out per user action is larger and the stack is taller than the equivalent web app.
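The 27 comes from nothing more exotic than multiplying per-layer attempt counts. A minimal sketch, assuming each layer makes three attempts before giving up and the downstream is hard-down so every attempt fails:

```python
from math import prod

# attempts each layer makes per call it receives, worst case
attempts_per_layer = {"sdk": 3, "gateway": 3, "agent_loop": 3}

# attempts that reach the downstream for one logical request: the counts
# multiply, because each outer layer's retry re-enters every layer below it
worst_case = prod(attempts_per_layer.values())
print(worst_case)   # 27 attempts, i.e. 27x baseline load during the outage
```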
The December 2025 empirical study of microservice retry behavior gave the pattern a clean benchmark. Exponential backoff without jitter produced p99 latency of 2,600 ms and a 17% error rate, largely attributable to retry amplification. Adding jitter dropped p99 to 1,400 ms and errors to 6%. Bounded retries behind a circuit breaker produced p99 of 1,100 ms and a 3% error rate. These numbers generalize: unbounded retries do not restore service, they delay recovery and hide the blast radius in your own bill.
Why Agent Loops Amplify Harder Than Request-Response
Three structural differences turn agent retry behavior from annoying to actively dangerous.
First, fan-out per user action is higher. A REST endpoint fires one downstream call per request on average. An agent plan often fires ten or twenty, and at each step the language model gets to decide whether to try the same tool again because "maybe the result was wrong." This is not a retry in the classic sense — it is the model making the same request because the tool output was ambiguous. Your observability layer will not tag it as a retry. Your rate limiter will not dedupe it. Your cost dashboard will show it as "normal planning activity."
Second, each attempt costs more than a network round-trip. Every retry in an agent loop consumes tokens on both the input (the entire conversation history replays through the LLM on the next turn) and the output (the plan the model generates after seeing the error). A ReAct-style agent retrying a failed tool three times easily spends 10,000 additional input tokens — the full conversation re-billed three times — before concluding the tool is broken. In one analysis, 90.8% of retries in a 200-task benchmark were wasted on non-retryable errors such as calling tools that do not exist. The agent kept trying. The bill kept climbing. The downstream kept degrading.
Third, context accumulation interacts badly with retries. When a tool fails, the error message enters the conversation history and stays there. The next turn not only pays for replaying the failure but also for the model's longer plan, its now-defensive prompt, and its tendency to over-explain why the next attempt will be different. A session that runs 2× as many turns often costs 3–4× as much, because later turns carry heavier context. Retry amplification in agents is not just a request-volume problem; it is a token-volume problem that compounds against you twice.
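A rough way to see the compounding, assuming a session where every turn replays the full history; the token counts are purely illustrative:

```python
def session_input_tokens(n_turns: int, output_per_turn: int = 500,
                         base_context: int = 1_000) -> int:
    """Total input tokens billed when every turn replays the accumulated history."""
    total, history = 0, base_context
    for _ in range(n_turns):
        total += history              # this turn's input is everything so far
        history += output_per_turn    # the turn's output joins the history
    return total

short, long = session_input_tokens(10), session_input_tokens(20)
print(short, long, f"{long / short:.1f}x")   # doubling the turns costs ~3.5x the input tokens
```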
The Self-DoS Pattern: When Your Agent Attacks Your Infrastructure
The textbook cascading failure pattern goes like this. A downstream tool — say, the search index — starts returning timeouts. The agent interprets the timeout as a transient failure and retries. The retries race against the agent's per-step deadline, which was generous enough that three attempts fit inside it. The downstream is now receiving 3× its normal load at the moment it can handle the least. Latency grows. The agent's deadline starts firing. Users hit the retry button in the UI, which spawns new agent runs with fresh retry budgets. Within three minutes, the search tier is saturating; within seven, even unrelated workflows that share worker pools start degrading. The initial trigger resolved at minute four, but by minute nine the system is still climbing because the retry traffic is now sustaining itself.
This is a metastable failure. The input that caused it is gone, and the system will not recover on its own because its own load is keeping it in the degraded state. The only way out is to stop retrying, which is exactly the behavior your retry logic was designed to prevent.
A second variant is specific to agent systems: the agent DoSing its own quota. A single user request fans out into dozens of concurrent tool calls that all draw from the same per-user token bucket. The first few hit the limit. The gateway returns 429. The agent retries. The retries also hit the limit. The agent now spends the remainder of the session fighting itself for quota, none of the work completes, and the user's entire token allowance is consumed on self-inflicted recovery. No adversary was needed.
The retry-amplification failure mode and the self-DoS failure mode are the same math applied in two directions — one against shared downstream capacity, the other against per-tenant quota — and both require the same fix.
The Four-Layer Retry Budget
Borrowing from Google SRE's production retry policy, the discipline that actually works has four layers that agree with each other instead of compounding.
Layer 1: per-request cap. Never retry a single operation more than 2–3 times, total, across the entire stack. This is a hard rule, not a per-layer rule. If the SDK retries 3 times, the gateway retries 0, and the agent loop retries 0 — that is the system's budget. The worst pattern in the wild is three layers each thinking they own the retry contract. Write the 2–3 down, pick the layer that will enforce it, and set the rest to zero. Request metadata should carry the attempt counter so every layer can see the budget has been spent.
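A minimal sketch of the single-owner budget, assuming a dict-shaped request envelope and a `send` callable; the metadata field names are illustrative, not any particular SDK's:

```python
MAX_ATTEMPTS = 3   # the whole stack's budget; every other layer retries zero times

def call_with_budget(send, request: dict) -> dict:
    """The one layer that owns retries. The attempt counter rides in request
    metadata so layers below can see the budget and must not add to it."""
    response = {}
    for attempt in range(1, MAX_ATTEMPTS + 1):
        request.setdefault("metadata", {})["attempt"] = attempt
        request["metadata"]["max_attempts"] = MAX_ATTEMPTS
        response = send(request)
        if not response.get("retryable", False):
            break
    return response
```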
Layer 2: per-client retry budget. Track the ratio of retries to successful requests at the client. Google's production rule is that a request will only be retried if this ratio is below 10% — meaning retries can contribute at most 10% additional load during healthy operation. If the downstream is broken, the ratio blows past 10% within seconds and the client stops retrying altogether. This is a feedback loop: the sicker the dependency, the fewer retries hit it, which is the opposite of naive retry behavior.
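A sketch of the client-side budget, assuming a sliding one-minute window; the 10% ratio is the rule quoted above, the rest of the shape is illustrative:

```python
import collections
import time

class RetryBudget:
    """Allow a retry only while retries stay below 10% of recent request volume."""

    def __init__(self, ratio: float = 0.10, window_s: float = 60.0):
        self.ratio, self.window_s = ratio, window_s
        self.requests = collections.deque()   # timestamps of all requests
        self.retries = collections.deque()    # timestamps of retries only

    def _trim(self, dq, now: float) -> None:
        while dq and now - dq[0] > self.window_s:
            dq.popleft()

    def record_request(self) -> None:
        self.requests.append(time.monotonic())

    def allow_retry(self) -> bool:
        now = time.monotonic()
        self._trim(self.requests, now)
        self._trim(self.retries, now)
        # the sicker the dependency, the faster this ratio blows past 10%,
        # and the client stops adding load instead of amplifying it
        if len(self.retries) + 1 > self.ratio * max(len(self.requests), 1):
            return False
        self.retries.append(now)
        return True
```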
Layer 3: exponential backoff with jitter. Use capped exponential backoff plus full jitter (a random delay between 0 and the computed exponential) or decorrelated jitter (the next delay drawn at random between the base and 3 × the previous delay). AWS data shows this reduces retry storms by 60–80% compared to deterministic backoff. The key word is jitter: without it, a synchronized failure creates a synchronized retry, which creates a second failure at the exact same wall-clock moment across every client.
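Both jitter variants are one-liners. This sketch mirrors the definitions above; the base and cap values are placeholders, not recommendations:

```python
import random

def full_jitter(attempt: int, base: float = 0.1, cap: float = 10.0) -> float:
    """Sleep a random amount between 0 and the capped exponential delay."""
    return random.uniform(0, min(cap, base * 2 ** attempt))

def decorrelated_jitter(previous_delay: float, base: float = 0.1, cap: float = 10.0) -> float:
    """Next delay drawn uniformly between the base and 3x the previous delay."""
    return min(cap, random.uniform(base, previous_delay * 3))
```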
Layer 4: circuit breaker. When the error rate on a downstream crosses a threshold, the breaker opens and subsequent calls fail fast without hitting the network at all. After a cooldown, the breaker half-opens, allows a probe through, and either closes on success or re-opens on failure. This is the only one of the four layers that actually reduces load on a struggling downstream; the other three limit the amount you add. A system with the first three layers and no circuit breaker still sends traffic, just at a slower and bounded rate. A circuit breaker sends zero during the outage, which is the thing the downstream actually needs.
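A minimal breaker, simplified so that any call arriving during the half-open window counts as the probe; the thresholds are illustrative:

```python
import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5, cooldown_s: float = 30.0):
        self.failure_threshold, self.cooldown_s = failure_threshold, cooldown_s
        self.consecutive_failures = 0
        self.opened_at = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True                                       # closed: pass traffic
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            return True                                       # half-open: let a probe through
        return False                                          # open: fail fast, zero downstream load

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures, self.opened_at = 0, None   # probe succeeded: close
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.opened_at = time.monotonic()                 # open, or re-open after a failed probe
```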
Each layer exists to handle a failure mode the others cannot. Per-request cap stops a single call from looping. Per-client budget stops a single client from being the DoS. Jittered backoff stops synchronized storms. Circuit breaker stops traffic entirely during outages. A retry policy that ships only one of the four is a policy that will surprise its operators.
Retry Budget for Agents Specifically: Tokens, Not Just Requests
The four layers above come from request-response systems. Agents need a fifth layer: a token-denominated retry budget. Requests are not the scarce resource that bounds user experience in an agent session; tokens are. A retry policy that stops after "3 attempts" but allows each attempt to consume 15,000 input tokens has spent a mid-tier user's daily budget on recovery from a single broken tool.
Concretely, this means:
- Classify errors before retrying at all. The 90% wasted-retry finding from ReAct benchmarks is almost entirely non-retryable errors — missing tools, auth failures, malformed arguments — being treated as retryable. Bake the classification into the tool-result schema (e.g. `retryable: false` in the error envelope) and have the agent loop honor it without consulting the model. Zero retries for non-retryable errors, regardless of how confident the plan looks.
- Budget retries in tokens, not attempts. Sum the input and output tokens spent on failed attempts within a session. When the sum exceeds a per-session ceiling (a realistic number is 10% of the session's target token budget), stop retrying and surface the failure to the user. The ceiling prevents a single flaky tool from burning the entire session envelope. A sketch of this gate follows the list.
- Make retry decisions visible to the planner. Most agents retry without telling the model. This means the model keeps making the same plan and getting the same error. Instead, surface retry exhaustion as a structured signal in the conversation ("search tool failed 3×, budget exhausted, choose an alternative") so the planner can actually route around the failure rather than retry blindly into the wall. The model is allowed to plan; it is not allowed to retry.
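A sketch of that gate, assuming the error envelope carries the `retryable` flag described above and the session object tracks its own token accounting; every field name here is illustrative:

```python
RETRY_TOKEN_SHARE = 0.10   # ceiling as a share of the session's target token budget

def should_retry(error: dict, session: dict) -> bool:
    """Retry gate that spends tokens, not attempts."""
    if not error.get("retryable", False):
        return False                                  # non-retryable: zero retries, ever
    ceiling = RETRY_TOKEN_SHARE * session["target_token_budget"]
    if session["failed_attempt_tokens"] >= ceiling:
        # budget exhausted: tell the planner instead of retrying blindly
        session["messages"].append({
            "role": "tool",
            "content": f"{error['tool']} failed, retry budget exhausted; choose an alternative",
        })
        return False
    return True
```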
Idempotency: The Retry That Shouldn't Have Been Retried
The final sharp edge is side effects. The classic duplicate-action bug looks like this: the agent calls send_email, the tool succeeds, the response times out on the return path, the agent retries, two emails go out. In a web app, the browser sees a duplicate submission and the user notices. In an agent system, the model reads the second success and narrates confidently that the email was sent. The duplicate is silent.
Three defenses matter:
- Idempotency keys on every tool with side effects. Hash `(workflow_id, tool_name, normalized_args)` into a key. The tool server keeps a ledger. A repeat call with the same key returns the stored result instead of re-executing. Stripe's API documentation is the canonical reference — the pattern is boring and battle-tested, and there is no reason agent tool servers should not adopt it directly. A sketch of the ledger follows this list.
- Treat timeouts as ambiguous, not failed. An operation that timed out may or may not have completed. Retrying a non-idempotent operation after a timeout is how duplicate side effects happen. Either make the tool idempotent or mark timeouts as non-retryable on write paths.
- Gate destructive operations behind a confirmation layer. For truly irreversible actions — financial transactions, production deploys, customer-facing emails — human approval on the retry path prevents the silent duplicate. This is the only layer that survives when every other defense fails.
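A sketch of the ledger pattern on the tool-server side, with an in-memory dict standing in for durable storage; the key derivation follows the hash described in the first bullet:

```python
import hashlib
import json

LEDGER: dict[str, dict] = {}   # stand-in for the tool server's durable result store

def idempotency_key(workflow_id: str, tool_name: str, args: dict) -> str:
    """Key over (workflow_id, tool_name, normalized_args)."""
    normalized = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{workflow_id}:{tool_name}:{normalized}".encode()).hexdigest()

def execute_once(workflow_id: str, tool_name: str, args: dict, run) -> dict:
    """Execute the side-effecting tool at most once per key; a retried call
    with the same key gets the stored result instead of a second execution."""
    key = idempotency_key(workflow_id, tool_name, args)
    if key not in LEDGER:
        LEDGER[key] = run(args)
    return LEDGER[key]
```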
What to Measure
If retry amplification is not on the monitoring stack, it is happening and you cannot see it. The minimum instruments:
- Retry ratio per downstream. Retries divided by total requests, windowed at one minute. If this goes above 10% sustained, something is wrong with either the downstream or your retry policy.
- Retry token spend per session. Tokens consumed on attempts that produced an error result. Express as a percentage of total session tokens. Anything above 15% is a leading indicator of a dependency degrading.
- Breaker open rate. Fraction of time each circuit breaker is open, per downstream. This is the ground truth for "how often is this thing actually down to users." A breaker that never opens is probably mis-tuned.
- Self-DoS events. Count of sessions where a single user's retries consumed more than 50% of their per-tenant token budget. One of these is a bug; many are a systemic problem with the retry loop.
None of these require new infrastructure. All four are calculable from the tracing and billing telemetry you already have if you aggregate it against tool-call outcomes. The harder part is picking the thresholds that trigger a page, and the only honest answer is that you tune them against your own SLOs.
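As one way to cut the first two instruments from existing telemetry, here is a sketch that assumes each tool-call trace record carries `downstream`, `is_retry`, `tokens`, and `error` fields; the field names are a guess at your schema, not a standard:

```python
def retry_instruments(tool_calls: list[dict]) -> dict:
    """Retry ratio per downstream, plus the share of session tokens spent on
    attempts that produced an error result."""
    per_downstream: dict[str, dict] = {}
    failed_tokens = total_tokens = 0
    for call in tool_calls:
        d = per_downstream.setdefault(call["downstream"], {"requests": 0, "retries": 0})
        d["requests"] += 1
        d["retries"] += int(call["is_retry"])
        total_tokens += call["tokens"]
        if call["error"]:
            failed_tokens += call["tokens"]
    return {
        "retry_ratio": {k: v["retries"] / v["requests"] for k, v in per_downstream.items()},
        "failed_attempt_token_share": failed_tokens / max(total_tokens, 1),
    }
```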
The Takeaway
Retry amplification is not a bug in any one layer; it is the emergent behavior of layers that do not know about each other. The fix is not to remove retries — transient failures are real, and the user-visible experience of a correctly retried call is much better than the raw downstream success rate suggests. The fix is to stop treating retries as free and to account for them in the same way you account for any other finite resource: a budget, an enforcement layer, a metric, and a circuit breaker for when the budget itself becomes the attack surface.
The agent loop turns every mistake in that accounting into a multiplier. A 2% tool error becomes a 20% session failure when you multiply steps. A 20% session failure becomes a self-DoS when you multiply retries. A self-DoS becomes a metastable outage when the retries sustain the load that caused the retries. The math is unkind to systems that ship retry logic by reflex. The discipline that beats the math is boring: cap the attempts, budget the tokens, jitter the backoff, break the circuit, and make the planner see what the infrastructure sees. Build those five things and the spreadsheet number and the incident number will agree.
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/
- https://aws.amazon.com/builders-library/making-retries-safe-with-idempotent-APIs/
- https://sre.google/sre-book/handling-overload/
- https://learn.microsoft.com/en-us/azure/architecture/patterns/circuit-breaker
- https://temporal.io/blog/error-handling-in-distributed-systems
- https://arxiv.org/html/2512.16959v1
- https://fast.io/resources/ai-agent-retry-patterns/
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://dev.to/willvelida/preventing-cascading-failures-in-ai-agents-p3c
- https://www.agentpatterns.tech/en/failures/cascading-failures
- https://agentsarcade.com/blog/error-handling-agentic-systems-retries-rollbacks-graceful-failure
- https://medium.com/@kaushalsinh73/7-patterns-that-make-agent-retries-idempotent-not-duplicative-f57bc70018b7
- https://www.infoworld.com/article/4138748/finops-for-agents-loop-limits-tool-call-caps-and-the-new-unit-economics-of-agentic-saas.html
