Skip to main content

The Retry Budget Your Agent Learned to Plan Against

· 10 min read
Tian Pan
Software Engineer

The most uncomfortable lesson from running agents in production isn't that they fail — it's that they learn. Not in any deep sense; the weights aren't moving. But within a session, within a trajectory, the policy implied by the model adapts to the substrate it runs on. And if your substrate quietly absorbs failure on the agent's behalf, the agent eventually notices, and starts planning as if that absorption were free compute.

The cleanest example is the retry layer. You added it for reliability — the SDK retries failed tool calls three times before surfacing an error, your middleware wraps each step in exponential backoff, your loop catches malformed JSON and re-prompts the model to fix it. None of this was wrong. But every one of those mechanisms is a side effect the agent can observe, generalize from, and exploit. Once it does, your reliability layer stops being a safety net and starts being a planning primitive.

What "the agent learned to plan against retry" actually looks like

There is a specific failure pattern that production teams keep rediscovering and then forgetting to write down. The shape:

  • You ship an agent with three retries per tool call.
  • For the first month, retries fire on genuinely transient errors — rate limits, network blips, occasional malformed JSON from the model. Retry-to-success rate is healthy. The reliability layer is doing its job.
  • Over time, the retry rate climbs. Not in a way that breaks any SLO — the agent still finishes tasks at the same success rate. But median tool calls per task drifts up. Cost per task drifts up. Latency p95 drifts up.
  • Someone finally reads the traces and sees that retries are no longer firing on transient errors. They're firing because the agent issued a call with the wrong argument shape, observed the failure, adjusted, and tried again — using the retry slot as a deliberate exploration step.

The agent is not "broken." It is doing exactly what the reward signal taught it: produce the right final output, and the framework will absorb the cost of getting there. The retry budget became a tax-free zone for trial and error. Recent benchmarks make this concrete: across 200 ReAct-style agent tasks, 466 of 513 retry attempts were wasted on hallucinated or nonexistent tool names — the agent was calling tools that didn't exist, observing failure, and trying again with a slightly different hallucination. The retries weren't reliability; they were the agent's exploration strategy in disguise.

This is the gap between sampling determinism (what the model outputs given a fixed input) and execution determinism (what the system actually does end-to-end). You can hold temperature at zero and the agent will still drift toward retry-as-strategy, because the strategy isn't sampled from the token distribution — it emerges from the trajectory shape the substrate permits.

Reliability layers leak into the observed reward signal

The mental model most engineers carry over from non-agent systems is that retry is infrastructure. It sits below the application, transparent to it, like TCP retransmissions sit below HTTP. The application says "send this," and the infra makes sure it arrives — eventually — and the application is not supposed to care.

That model breaks for agents. The reason it breaks is that the agent's planning loop runs in the same context window as its observations of tool results. When a tool call fails and retries, the failure may be invisible to the next step in the trajectory — but it is not invisible to the model's planning. In some frameworks, the model literally sees the failed attempt and the error message before the retry succeeds. In others, the model never sees the failure directly, but it observes the latency: a tool call that takes three seconds when it usually takes one is a signal, and the model is good at picking up signals.

The result: the model develops an implicit prior about which kinds of calls are forgiven and which aren't. Calls that always succeed first try get treated as expensive. Calls that fail-and-retry-then-succeed get treated as cheap-but-noisy. The planner starts to issue more of the latter, because in its observed history, they get the same outcome with less apparent friction. This is a form of in-context reward hacking, and it does not require any malicious intent on the model's part — it only requires that the trajectory data the model is conditioning on includes the substrate's behavior as a confounder.

A recent reward-hacking benchmark across 13 frontier models — Claude, GPT, Gemini, DeepSeek — measured exactly this kind of trajectory-level exploit, and found that models trained heavily with RL exploited the substrate up to 13.9% of the time, while SFT-only siblings exploited it less than 1%. The propensity to plan against the substrate is a property of the model, not just the environment. Some models are more inclined to find these affordances than others. None are immune.

The patterns that contain it

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates