Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill
Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying; each retry replayed the full prompt history, and the bill had been ramping for weeks.
The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.
Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.
The compounding is worse than it looks
The naive model: a 3-step agent, 20% failure rate per step, retrying each step until it succeeds. Expected calls per step is 1/(1-0.2) = 1.25. Three steps means 3 × 1.25 = 3.75 calls versus a baseline of 3 — a 25% overhead. Annoying, not alarming.
That model is wrong for almost every production agent. Here's why.
Most agent frameworks — LangChain, LlamaIndex, the OpenAI and Anthropic tool-use SDKs — retry by replaying the conversation history, not by retrying the single failed step in isolation. When step 3 fails and retries, you resend the system prompt plus the outputs of steps 1 and 2 plus the failed attempt's error. The token cost of that retry is not one step's worth of tokens; it's the cumulative context-window-so-far.
Recompute under context replay. On a 3-step agent at a 20% retry rate, expected tokens climb to roughly 1.7–1.9x baseline. On a 5-step agent at the same rate, token overhead climbs to 2.2–2.5x. The overhead grows roughly quadratically with step count, because every retry resends the cumulative context rather than one step's worth of tokens — and that's the part practitioners miss.
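A toy model makes the replay effect concrete. The sketch below is an illustration under stated assumptions, not a measurement of any framework: every attempt resends the full context so far, and a failed attempt plus its error text stays in the history, inflating every later call. All the token sizes are illustrative defaults; the exact multiplier is very sensitive to error-payload size and retry depth.

```python
def expected_tokens(steps, p_fail, system=1000, step_out=500, fail_out=2000):
    """Expected token cost of an agent run with retry-until-success and
    full-history replay, as a multiple of the failure-free baseline."""
    exp_attempts = 1 / (1 - p_fail)      # retry-until-success, geometric
    exp_failures = exp_attempts - 1      # expected failed attempts per step
    total = baseline = 0.0
    ctx, base_ctx = float(system), float(system)
    for _ in range(steps):
        total += exp_attempts * ctx      # every attempt replays the context
        baseline += base_ctx             # failure-free run for comparison
        ctx += step_out + exp_failures * fail_out  # failures bloat history
        base_ctx += step_out
    return total / baseline

print(f"3 steps @ 20%: {expected_tokens(3, 0.2):.2f}x baseline")
print(f"5 steps @ 20%: {expected_tokens(5, 0.2):.2f}x baseline")
```

With these defaults the 3-step case lands around 1.7x — and raising the error-payload size or step count pushes it higher, which is the quadratic effect in miniature.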
Layer on retries at more than one level — SDK retries inside a tool, middleware retries wrapping the tool, agent-level retries wrapping the whole loop — and you hit multiplicative chains that Google's SRE book warned about a decade ago. Three retries per layer across five services is the canonical "retry storm" example: a single user request produces 3^5 = 243 backend calls in the worst case. LLM pipelines are not immune; they are worse, because each call is more expensive.
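The worst-case arithmetic behind that example is a one-liner:

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Worst-case request amplification when each of `layers` nested
    retry layers independently makes `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers

print(worst_case_calls(3, 5))  # 243 backend calls from one user request
```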
Real bills from real incidents
The retry storm isn't theoretical. Anthropic's own claude-code repository carries a July 2025 issue describing a single user session that consumed 1.67 billion tokens in five hours, with a peak of 224 requests per second and an estimated bill between $16,000 and $50,000. The post-mortem identified four overlapping root causes, one of which was explicit: 253 "usage limit" errors that did not stop the loop. The retry logic kept going after the provider said stop.
A separate October 2025 issue in the same repo logged 108.8 million tokens on a single day — about $64 at Sonnet pricing — because an autocompaction bug caused the same file to be re-read in a loop. Different root cause, same shape: retries without a budget.
At the individual-developer scale, a well-traveled Substack post documented 8 months of Claude Code usage adding up to 10 billion tokens, roughly $15,000 at current Sonnet pricing, mostly from chained-call patterns the user never directly observed.
Smaller, quieter failures are more common. A recent benchmark of ReAct-style agents found that across 200 tasks, 90.8% of retries — 466 of 513 attempts — were wasted on hallucinated or nonexistent tool names. The agent invented a tool that didn't exist, the tool call failed, the agent retried with a slightly different hallucination, and so on. Retries weren't the fix; they were the failure mode wearing a helpful disguise.
The Google SRE retry budget, translated for LLMs
The SRE book's solution to retry storms is elegantly simple. Cap retries at the per-request level (never more than N attempts). Then, at the client level, track retries / total_requests over a rolling window, and refuse further retries once that ratio exceeds 10%. Without the second cap, load grows to 3x baseline under partial outage. With it, load grows to 1.1x. The budget treats retries as a limited resource that the system earns through a baseline of successful requests.
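In code, the client-level half of the pattern looks roughly like this — a sketch of the SRE-book idea, not any particular library's API:

```python
import time
from collections import deque

class RetryBudget:
    """Allow a retry only while retries / total_requests over a rolling
    window stays at or under `ratio` (the SRE book suggests 10%)."""

    def __init__(self, ratio=0.1, window_s=60.0):
        self.ratio, self.window_s = ratio, window_s
        self.events = deque()  # (timestamp, is_retry)

    def _prune(self, now):
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def record_request(self, now=None):
        """Call on every first attempt; successes earn retry budget."""
        self.events.append((now if now is not None else time.monotonic(), False))

    def try_retry(self, now=None):
        """Consult before retrying; returns False when budget is spent."""
        now = now if now is not None else time.monotonic()
        self._prune(now)
        total = len(self.events) or 1  # no baseline traffic -> no budget
        retries = sum(1 for _, is_retry in self.events if is_retry)
        if (retries + 1) / total > self.ratio:
            return False               # fail fast instead of piling on
        self.events.append((now, True))
        return True
```

After 100 baseline requests in the window, this allows roughly 10 retries and then starts refusing — which is exactly the shape that keeps partial-outage load near 1.1x instead of 3x.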
Translating this to LLM agents requires two adjustments.
First, change the unit from requests to tokens. Retries in LLM land don't cost one slot of capacity; they cost an amount of capacity proportional to prompt size. The budget should be denominated in retry-tokens per user per rolling window, not retry-attempts. A session that retried twice on a 2,000-token context is not equivalent to a session that retried twice on a 50,000-token context.
Second, enforce the budget inside the loop, not outside. Most retry middleware sits at the HTTP layer, well below where the agent logic lives. The agent loop itself — tool-call retries, JSON-parse retries, validation retries — is where the budget needs to be checked, because that's where the replay happens. A generic HTTP retry policy doesn't see the cumulative context cost of a 5-step retry with history replay.
The retry budget ledger pattern
A working pattern for agent systems: maintain a per-session ledger that tracks, in real time, the cumulative token cost of every retry attempt in that session. The ledger exposes three quantities — retry-tokens, budget remaining, and reason-for-retry — and every retry-capable code path consults it before firing.
Concrete rules that production teams converge on:
- Per-session cap: total retry tokens must not exceed a fixed percentage of the session's baseline token budget (10–15% works in practice)
- Per-step cap: no more than N attempts at a single step, regardless of session budget
- Per-tenant cap: hard dollar ceiling per tenant per day, independent of session math, with an alert at 80%
- Non-retryable error list: content-filter rejections, invalid-auth, and hard schema violations should short-circuit rather than retry, because the retry will deterministically fail
A clean exit strategy matters as much as the budget itself. When budget is exhausted, the agent needs a graceful degradation path: fall back to a smaller model, return a partial answer with a confidence flag, hand off to human review, or return a bounded "I don't know." The one thing it must not do is keep retrying silently while the ledger ignores it.
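Put together, the ledger can be as small as the sketch below. Every name here is illustrative (no framework ships this class); the point is that all three caps and the non-retryable list are checked in one place, before any retry fires.

```python
class RetryLedger:
    # These errors fail deterministically; retrying only burns budget.
    NON_RETRYABLE = {"content_filter", "invalid_auth", "schema_violation"}

    def __init__(self, session_budget_tokens, retry_fraction=0.15, per_step_cap=3):
        self.budget = int(session_budget_tokens * retry_fraction)
        self.spent = 0                    # retry-tokens consumed so far
        self.per_step_cap = per_step_cap  # max retries at any single step
        self.attempts = {}                # step_id -> retry count
        self.log = []                     # (step_id, reason, tokens) audit trail

    def allow_retry(self, step_id, reason, prompt_tokens):
        """Consult every cap BEFORE firing the retry; record it if allowed."""
        if reason in self.NON_RETRYABLE:
            return False  # short-circuit: the retry will fail again
        if self.attempts.get(step_id, 0) >= self.per_step_cap:
            return False  # per-step attempt cap hit
        if self.spent + prompt_tokens > self.budget:
            return False  # session retry-token budget exhausted -> degrade
        self.attempts[step_id] = self.attempts.get(step_id, 0) + 1
        self.spent += prompt_tokens
        self.log.append((step_id, reason, prompt_tokens))
        return True
```

Note the unit: `prompt_tokens` is the full replayed-context size of the retry, so a late-step retry debits far more budget than an early one — which is the whole point of denominating in tokens.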
Catching retry explosions in CI
Retries are hard to test because they only fire under conditions that don't exist in a happy-path eval. The answer is deliberate failure injection: drop tool calls, return malformed JSON, surface fake 429s, and run the agent against this chaos harness. Then assert on the budget.
Useful eval assertions to gate in CI:
- tokens_per_task < p95_budget — flags retry-driven cost blowups
- retries_per_session < N — flags retry loops
- steps_before_first_tool_call < M — flags hallucinated-tool retries
- variance(steps_per_task) < σ_max — retry-healthy agents have low variance, retry-pathological ones have long tails
The variance assertion is the most important and the least intuitive. Clean agents take a tight distribution of steps to solve similar tasks. Retry-prone agents have fat tails — most tasks finish fast, a handful take 5x the steps, and the average looks fine. Catching retry pathology requires looking at the 95th percentile or the distribution, not the mean.
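A CI gate over the distribution can be a short function like this — the threshold values are illustrative, and real runs should tune them against a healthy baseline:

```python
import statistics

def gate_retry_health(steps_per_task, tokens_per_task,
                      max_p95_tokens=20_000, max_stdev_steps=2.0):
    """Assert on the tail of the eval distribution, not the mean,
    because retry pathology hides in fat tails. Returns failure strings."""
    tokens = sorted(tokens_per_task)
    p95 = tokens[int(0.95 * (len(tokens) - 1))]  # 95th-percentile cost
    stdev = statistics.stdev(steps_per_task)
    failures = []
    if p95 > max_p95_tokens:
        failures.append(f"p95 tokens {p95} > {max_p95_tokens}")
    if stdev > max_stdev_steps:
        failures.append(f"step stdev {stdev:.1f} > {max_stdev_steps}")
    return failures  # empty list means the gate passes
```

A healthy agent (steps tightly clustered around 3–4) passes; an agent whose mean looks identical but whose worst two tasks took 15–20 steps fails both checks.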
Failure-injection frameworks like agent-chaos make this practical. The pattern is the same as chaos engineering for microservices: deliberately inject the failures production will eventually hand you, and gate the PR on how gracefully the system degrades.
What to change on Monday
If your agent is in production and you haven't audited retries, the order of operations is roughly:
- Measure first. Instrument retries as a first-class metric — per session, per tool, per step, with tokens attached. If you can't answer "what fraction of my monthly spend is retries?" in one query, you can't manage the problem.
- Find the hallucinated-tool rate. Most of the retry budget leak isn't rate limiting; it's the agent inventing tool names. Tight tool descriptions and a small hard-coded tool allowlist fix most of this immediately.
- Add the budget. Even a crude per-session token cap with a hard stop will prevent runaway sessions. Refinement can come later.
- Wire up failure injection in eval. Two or three chaos cases covering tool failures, malformed JSON, and 429s is enough to catch most retry-explosion regressions before they ship.
- Kill redundant retry layers. Retries nested inside retries nested inside retries are nearly always wrong. Pick one layer to own retry logic and disable it everywhere else.
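The failure-injection step above can start as a wrapper around each tool — a minimal sketch in the spirit of chaos harnesses like agent-chaos, not that project's actual API; the failure types and rates are assumptions to tune per suite:

```python
import random

def chaos_wrap(tool_fn, *, drop_rate=0.1, garble_rate=0.1,
               rate_limit_rate=0.1, rng=None):
    """Wrap a tool callable with injected failures for eval runs:
    dropped calls, malformed JSON, and fake 429s."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < drop_rate:
            raise TimeoutError("chaos: dropped tool call")
        if roll < drop_rate + garble_rate:
            return '{"truncated": '  # chaos: malformed JSON payload
        if roll < drop_rate + garble_rate + rate_limit_rate:
            raise RuntimeError("chaos: 429 Too Many Requests")
        return tool_fn(*args, **kwargs)

    return wrapped
```

Run the eval suite against chaos-wrapped tools and gate the PR on the budget assertions above: the agent should degrade gracefully, not retry-spiral.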
The uncomfortable truth is that most teams running LLM agents are running without any explicit retry budget at all — the SDK defaults from a year ago, unchanged, multiplied across three layers of abstraction no one owns. That is not a configuration; that is a promise to pay whatever the provider charges this month. Retry budgets are the piece of infrastructure that turns that open promise back into a bounded cost.
- https://sre.google/sre-book/handling-overload/
- https://sre.google/sre-book/addressing-cascading-failures/
- https://github.com/anthropics/claude-code/issues/4095
- https://github.com/anthropics/claude-code/issues/9579
- https://buildtolaunch.substack.com/p/claude-code-token-optimization
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://cookbook.openai.com/examples/how_to_handle_rate_limits
- https://platform.openai.com/docs/guides/rate-limits
- https://docs.langchain.com/oss/python/langchain/middleware/built-in
- https://reference.langchain.com/python/langchain/agents/middleware/tool_retry
- https://python.useinstructor.com/concepts/retrying/
- https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/
- https://github.com/deepankarm/agent-chaos
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://eugeneyan.com/writing/evals/
