Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill
Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying; each retry replayed the full prompt history, and the bill had been ramping for weeks.
The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.
Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.
The compounding is worse than it looks
The naive model: a 3-step agent, 20% failure rate per step, retrying each step until it succeeds. Expected calls per step is 1/(1-0.2) = 1.25. Three steps means 3 × 1.25 = 3.75 calls versus a baseline of 3 — a 25% overhead. Annoying, not alarming.
That model is wrong for almost every production agent. Here's why.
Most agent frameworks — LangChain, LlamaIndex, the OpenAI and Anthropic tool-use SDKs — retry by replaying the conversation history, not by retrying the single failed step in isolation. When step 3 fails and retries, you resend the system prompt plus the outputs of steps 1 and 2 plus the failed attempt's error. The token cost of that retry is not one step's worth of tokens; it's the cumulative context-window-so-far.
Recompute under context replay. On a 3-step agent at a 20% retry rate, expected tokens climb to roughly 1.7–1.9x baseline. On a 5-step agent at the same rate, token overhead climbs to 2.2–2.5x. The overhead grows roughly quadratically with step count, because every retry resends the cumulative context rather than one step's worth of tokens — and that's the part practitioners miss.
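A toy model makes the replay effect concrete. The sketch below is an illustration under stated assumptions, not a measurement of any framework: every attempt resends the full context so far, and a failed attempt plus its error text stays in the history, inflating every later call. All the token sizes are illustrative defaults; the exact multiplier is very sensitive to error-payload size and retry depth.

```python
def expected_tokens(steps, p_fail, system=1000, step_out=500, fail_out=2000):
    """Expected token cost of an agent run with retry-until-success and
    full-history replay, as a multiple of the failure-free baseline."""
    exp_attempts = 1 / (1 - p_fail)      # retry-until-success, geometric
    exp_failures = exp_attempts - 1      # expected failed attempts per step
    total = baseline = 0.0
    ctx, base_ctx = float(system), float(system)
    for _ in range(steps):
        total += exp_attempts * ctx      # every attempt replays the context
        baseline += base_ctx             # failure-free run for comparison
        ctx += step_out + exp_failures * fail_out  # failures bloat history
        base_ctx += step_out
    return total / baseline

print(f"3 steps @ 20%: {expected_tokens(3, 0.2):.2f}x baseline")
print(f"5 steps @ 20%: {expected_tokens(5, 0.2):.2f}x baseline")
```

With these defaults the 3-step case lands around 1.7x — and raising the error-payload size or step count pushes it higher, which is the quadratic effect in miniature.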
Layer on retries at more than one level — SDK retries inside a tool, middleware retries wrapping the tool, agent-level retries wrapping the whole loop — and you hit multiplicative chains that Google's SRE book warned about a decade ago. Three retries per layer across five services is the canonical "retry storm" example: a single user request produces 3^5 = 243 backend calls in the worst case. LLM pipelines are not immune; they are worse, because each call is more expensive.
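The worst-case arithmetic behind that example is a one-liner:

```python
def worst_case_calls(attempts_per_layer: int, layers: int) -> int:
    """Worst-case request amplification when each of `layers` nested
    retry layers independently makes `attempts_per_layer` attempts."""
    return attempts_per_layer ** layers

print(worst_case_calls(3, 5))  # 243 backend calls from one user request
```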
Real bills from real incidents
The retry storm isn't theoretical. Anthropic's own claude-code repository carries a July 2025 issue describing a single user session that consumed 1.67 billion tokens in five hours, with a peak of 224 requests per second and an estimated bill between $16,000 and $50,000. The post-mortem identified four overlapping root causes, one of which was explicit: 253 "usage limit" errors that did not stop the loop. The retry logic kept going after the provider said stop.
A separate October 2025 issue in the same repo logged 108.8 million tokens on a single day — about $64 at Sonnet pricing — because an autocompaction bug caused the same file to be re-read in a loop. Different root cause, same shape: retries without a budget.
At the individual-developer scale, a well-traveled Substack post documented 8 months of Claude Code usage adding up to 10 billion tokens, roughly $15,000 at current Sonnet pricing, mostly from chained-call patterns the user never directly observed.
Smaller, quieter failures are more common. A recent benchmark of ReAct-style agents found that across 200 tasks, 90.8% of retries — 466 of 513 attempts — were wasted on hallucinated or nonexistent tool names. The agent invented a tool that didn't exist, the tool call failed, the agent retried with a slightly different hallucination, and so on. Retries weren't the fix; they were the failure mode wearing a helpful disguise.
The Google SRE retry budget, translated for LLMs
The SRE book's solution to retry storms is elegantly simple. Cap retries at the per-request level (never more than N attempts). Then, at the client level, track retries / total_requests over a rolling window, and refuse further retries once that ratio exceeds 10%. Without the second cap, load grows to 3x baseline under partial outage. With it, load grows to 1.1x. The budget treats retries as a limited resource that the system earns through a baseline of successful requests.
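In code, the client-level half of the pattern looks roughly like this — a sketch of the SRE-book idea, not any particular library's API:

```python
import time
from collections import deque

class RetryBudget:
    """Allow a retry only while retries / total_requests over a rolling
    window stays at or under `ratio` (the SRE book suggests 10%)."""

    def __init__(self, ratio=0.1, window_s=60.0):
        self.ratio, self.window_s = ratio, window_s
        self.events = deque()  # (timestamp, is_retry)

    def _prune(self, now):
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()

    def record_request(self, now=None):
        """Call on every first attempt; successes earn retry budget."""
        self.events.append((now if now is not None else time.monotonic(), False))

    def try_retry(self, now=None):
        """Consult before retrying; returns False when budget is spent."""
        now = now if now is not None else time.monotonic()
        self._prune(now)
        total = len(self.events) or 1  # no baseline traffic -> no budget
        retries = sum(1 for _, is_retry in self.events if is_retry)
        if (retries + 1) / total > self.ratio:
            return False               # fail fast instead of piling on
        self.events.append((now, True))
        return True
```

After 100 baseline requests in the window, this allows roughly 10 retries and then starts refusing — which is exactly the shape that keeps partial-outage load near 1.1x instead of 3x.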
Translating this to LLM agents requires two adjustments.
First, change the unit from requests to tokens. Retries in LLM land don't cost one slot of capacity; they cost an amount of capacity proportional to prompt size. The budget should be denominated in retry-tokens per user per rolling window, not retry-attempts. A session that retried twice on a 2,000-token context is not equivalent to a session that retried twice on a 50,000-token context.
Second, enforce the budget inside the loop, not outside. Most retry middleware sits at the HTTP layer, well below where the agent logic lives. The agent loop itself — tool-call retries, JSON-parse retries, validation retries — is where the budget needs to be checked, because that's where the replay happens. A generic HTTP retry policy doesn't see the cumulative context cost of a 5-step retry with history replay.
The retry budget ledger pattern
A working pattern for agent systems: maintain a per-session ledger that tracks, in real time, the cumulative token cost of every retry attempt in that session. The ledger exposes three quantities — retry-tokens, budget remaining, and reason-for-retry — and every retry-capable code path consults it before firing.
Concrete rules that production teams converge on:
- Per-session cap: total retry tokens must not exceed a fixed percentage of the session's baseline token budget (10–15% works in practice)
- Per-step cap: no more than N attempts at a single step, regardless of session budget
- Per-tenant cap: hard dollar ceiling per tenant per day, independent of session math, with an alert at 80%
- Non-retryable error list: content-filter rejections, invalid-auth, and hard schema violations should short-circuit rather than retry, because the retry will deterministically fail
A clean exit strategy matters as much as the budget itself. When budget is exhausted, the agent needs a graceful degradation path: fall back to a smaller model, return a partial answer with a confidence flag, hand off to human review, or return a bounded "I don't know." The one thing it must not do is keep retrying silently while the ledger ignores it.
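Put together, the ledger can be as small as the sketch below. Every name here is illustrative (no framework ships this class); the point is that all three caps and the non-retryable list are checked in one place, before any retry fires.

```python
class RetryLedger:
    # These errors fail deterministically; retrying only burns budget.
    NON_RETRYABLE = {"content_filter", "invalid_auth", "schema_violation"}

    def __init__(self, session_budget_tokens, retry_fraction=0.15, per_step_cap=3):
        self.budget = int(session_budget_tokens * retry_fraction)
        self.spent = 0                    # retry-tokens consumed so far
        self.per_step_cap = per_step_cap  # max retries at any single step
        self.attempts = {}                # step_id -> retry count
        self.log = []                     # (step_id, reason, tokens) audit trail

    def allow_retry(self, step_id, reason, prompt_tokens):
        """Consult every cap BEFORE firing the retry; record it if allowed."""
        if reason in self.NON_RETRYABLE:
            return False  # short-circuit: the retry will fail again
        if self.attempts.get(step_id, 0) >= self.per_step_cap:
            return False  # per-step attempt cap hit
        if self.spent + prompt_tokens > self.budget:
            return False  # session retry-token budget exhausted -> degrade
        self.attempts[step_id] = self.attempts.get(step_id, 0) + 1
        self.spent += prompt_tokens
        self.log.append((step_id, reason, prompt_tokens))
        return True
```

Note the unit: `prompt_tokens` is the full replayed-context size of the retry, so a late-step retry debits far more budget than an early one — which is the whole point of denominating in tokens.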
Catching retry explosions in CI
Retries are hard to test because they only fire under conditions that don't exist in a happy-path eval. The answer is deliberate failure injection: drop tool calls, return malformed JSON, surface fake 429s, and run the agent against this chaos harness. Then assert on the budget.
Useful eval assertions to gate in CI:
- tokens_per_task < p95_budget — flags retry-driven cost blowups
- retries_per_session < N — flags retry loops
- steps_before_first_tool_call < M — flags hallucinated-tool retries
- variance(steps_per_task) < σ_max — retry-healthy agents have low variance, retry-pathological ones have long tails
The variance assertion is the most important and the least intuitive. Clean agents take a tight distribution of steps to solve similar tasks. Retry-prone agents have fat tails — most tasks finish fast, a handful take 5x the steps, and the average looks fine. Catching retry pathology requires looking at the 95th percentile or the distribution, not the mean.
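A CI gate over the distribution can be a short function like this — the threshold values are illustrative, and real runs should tune them against a healthy baseline:

```python
import statistics

def gate_retry_health(steps_per_task, tokens_per_task,
                      max_p95_tokens=20_000, max_stdev_steps=2.0):
    """Assert on the tail of the eval distribution, not the mean,
    because retry pathology hides in fat tails. Returns failure strings."""
    tokens = sorted(tokens_per_task)
    p95 = tokens[int(0.95 * (len(tokens) - 1))]  # 95th-percentile cost
    stdev = statistics.stdev(steps_per_task)
    failures = []
    if p95 > max_p95_tokens:
        failures.append(f"p95 tokens {p95} > {max_p95_tokens}")
    if stdev > max_stdev_steps:
        failures.append(f"step stdev {stdev:.1f} > {max_stdev_steps}")
    return failures  # empty list means the gate passes
```

A healthy agent (steps tightly clustered around 3–4) passes; an agent whose mean looks identical but whose worst two tasks took 15–20 steps fails both checks.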
Failure-injection frameworks like agent-chaos make this practical. The pattern is the same as chaos engineering for microservices: deliberately inject the failures production will eventually hand you, and gate the PR on how gracefully the system degrades.
What to change on Monday
If your agent is in production and you haven't audited retries, the order of operations is roughly:
- Measure first. Instrument retries as a first-class metric — per session, per tool, per step, with tokens attached. If you can't answer "what fraction of my monthly spend is retries?" in one query, you can't manage the problem.
- Find the hallucinated-tool rate. Most of the retry budget leak isn't rate limiting; it's the agent inventing tool names. Tight tool descriptions and a small hard-coded tool allowlist fix most of this immediately.
- Add the budget. Even a crude per-session token cap with a hard stop will prevent runaway sessions. Refinement can come later.
- Wire up failure injection in eval. Two or three chaos cases covering tool failures, malformed JSON, and 429s is enough to catch most retry-explosion regressions before they ship.
- Kill redundant retry layers. Retries nested inside retries nested inside retries are nearly always wrong. Pick one layer to own retry logic and disable it everywhere else.
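The failure-injection step above can start as a wrapper around each tool — a minimal sketch in the spirit of chaos harnesses like agent-chaos, not that project's actual API; the failure types and rates are assumptions to tune per suite:

```python
import random

def chaos_wrap(tool_fn, *, drop_rate=0.1, garble_rate=0.1,
               rate_limit_rate=0.1, rng=None):
    """Wrap a tool callable with injected failures for eval runs:
    dropped calls, malformed JSON, and fake 429s."""
    rng = rng or random.Random()

    def wrapped(*args, **kwargs):
        roll = rng.random()
        if roll < drop_rate:
            raise TimeoutError("chaos: dropped tool call")
        if roll < drop_rate + garble_rate:
            return '{"truncated": '  # chaos: malformed JSON payload
        if roll < drop_rate + garble_rate + rate_limit_rate:
            raise RuntimeError("chaos: 429 Too Many Requests")
        return tool_fn(*args, **kwargs)

    return wrapped
```

Run the eval suite against chaos-wrapped tools and gate the PR on the budget assertions above: the agent should degrade gracefully, not retry-spiral.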
The uncomfortable truth is that most teams running LLM agents are running without any explicit retry budget at all — the SDK defaults from a year ago, unchanged, multiplied across three layers of abstraction no one owns. That is not a configuration; that is a promise to pay whatever the provider charges this month. Retry budgets are the piece of infrastructure that turns that open promise back into a bounded cost.
- https://sre.google/sre-book/handling-overload/
- https://sre.google/sre-book/addressing-cascading-failures/
- https://github.com/anthropics/claude-code/issues/4095
- https://github.com/anthropics/claude-code/issues/9579
- https://buildtolaunch.substack.com/p/claude-code-token-optimization
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://cookbook.openai.com/examples/how_to_handle_rate_limits
- https://platform.openai.com/docs/guides/rate-limits
- https://docs.langchain.com/oss/python/langchain/middleware/built-in
- https://reference.langchain.com/python/langchain/agents/middleware/tool_retry
- https://python.useinstructor.com/concepts/retrying/
- https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/
- https://github.com/deepankarm/agent-chaos
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://hamel.dev/blog/posts/evals-faq/
- https://eugeneyan.com/writing/evals/
