Skip to main content

Your Retry Logic Is Teaching the Agent the Wrong Lesson

· 10 min read
Tian Pan
Software Engineer

A tool call fails. Your agent framework retries it three times with exponential backoff. The third attempt goes through. The trace shows a green checkmark. Nobody gets paged, no error counter increments, the user gets their answer. By every dashboard you have, the system worked.

It didn't. The tool failed because the agent passed a malformed argument, and the only reason the third try succeeded is that the agent — sampling differently each time — happened to phrase the call correctly on attempt three. You didn't recover from a transient fault. You ran a slot machine until it paid out, then logged the payout and threw away the two pulls that told you the agent was broken.

This is the quiet way retry logic rots an agent system. Retries were designed for a world where the caller is correct and the network is flaky. Agents invert that assumption: the network is mostly fine, and the caller is the unreliable part. When you point a retry policy built for the first world at the second one, it stops being a recovery mechanism and becomes a way to launder bugs into green checkmarks.

Two failures that look identical and aren't

Every retry policy rests on one classification: is this failure transient or deterministic? Get it right and retries are free reliability. Get it wrong and they are an accelerant.

A transient failure is one where the exact same call, repeated, has a real chance of succeeding. A rate-limit 429, a connection reset, a 503 from an overloaded service, a gateway timeout. The input was valid; the world was momentarily busy. Retrying the identical request is the correct response, and exponential backoff with jitter is the correct shape for it.

A deterministic failure is one where the same call will fail the same way every time. A 401 because the credential is wrong. A 400 because the JSON schema didn't validate. A 404 because the record ID doesn't exist. A semantic failure where the tool returned a syntactically valid result that is wrong for what the agent actually needed. Retrying these is pure waste — three identical calls, three identical failures, three times the latency and cost.

Classic distributed-systems advice draws this line cleanly: retry on 429 and 5xx, never on 401, 403, 400, or context-window overflow. That advice is correct, and it is also not enough for agents — because it assumes the input is fixed. With an agent, the input is generated fresh each time, and that changes everything.

The agent breaks the retry contract

Here is the assumption every traditional retry library makes and never states: the request you retry is byte-for-byte the request that failed. That is what makes the transient/deterministic split work. If the input is constant, then a deterministic failure stays deterministic, so you can safely refuse to retry it.

An agent violates this. When the orchestrator retries an agent step, it doesn't replay a frozen request — it re-invokes the model, which re-samples, which can produce a different tool call. The classification you made on attempt one no longer holds on attempt two, because attempt two isn't the same call.

This produces a failure mode with no equivalent in ordinary RPC: the deterministic failure that intermittently succeeds. The agent has a real bug — it consistently misunderstands a parameter, picks the wrong tool, or hallucinates a field name. But because sampling adds noise, maybe one in four generations stumbles into a correct call by accident. A three-retry policy converts a 75%-broken behavior into something that "passes" most of the time.

Now ask what the agent learned. On a successful run, the agent sees: I called the tool, it worked, my approach was sound. It never sees the two failed attempts — those were swallowed by the retry wrapper below it. The model's context contains only the win. So the next time it faces the same situation, it does the same wrong thing, because nothing ever told it the thing was wrong. Your retry logic didn't just hide the bug from your dashboard. It hid the bug from the one component that could have corrected it.

That is the wrong lesson, and you are the one teaching it.

Why this is worse than a flaky read

You might think: retries have always papered over flakiness, this is just more of the same. It isn't, for two reasons specific to agents.

The first is side effects. A flaky read is safe to retry by definition — it has no consequences. Agent tool calls are increasingly writes: send the email, charge the card, file the ticket, update the record. When the orchestrator retries a multi-step agent run because step two timed out, step one — which already sent the email — runs again. Now the customer has two emails. The retry didn't recover the run; it duplicated a real-world action. The model is nondeterministic, but the side effects it triggers are not allowed to be, and a naive retry is exactly what breaks that rule.

The second is cost and the feedback shape. A retried database read costs a few milliseconds. A retried agent step re-runs model inference, often re-runs every tool call in the step, and burns the full token bill again. Three retries of an expensive step is not a 3x latency hit on something cheap — it is 4x the cost of the most expensive operation in your system, spent specifically on attempts you have already decided not to look at. Worse, when you feed a tool error back into the agent and let it retry, a failed self-correction doesn't just cost tokens — it can make the agent more confident in a wrong path. Practitioners who instrumented this found tool errors didn't become useful feedback; they became fuel, with the agent digging deeper holes and writing more elaborate justifications for why the tool was "wrong."

So agent retries fail in both directions. Retry blindly at the orchestrator level and you duplicate side effects. Hand the error to the agent and let it retry its own reasoning and you can amplify the error instead of fixing it. Neither default is safe. You have to design the middle.

What a retry should actually do

The fix is not "fewer retries." It is making retries honest — about what failed, who should know, and what gets repeated.

Classify before you retry, and classify the agent's call as suspect by default. A network-layer transient — 429, 503, reset, timeout — retry the identical call, that is what backoff is for. A deterministic failure — 4xx, schema-validation error, unknown tool, bad argument — do not retry the identical call, because it will fail identically. The interesting case is the deterministic failure on an agent-generated input: don't silently re-sample hoping for a lucky draw. Either stop, or retry with the error surfaced into the agent's context so the next attempt is an informed correction rather than a fresh roll of the dice.

Separate the decision from the execution. The most robust pattern in production agent systems is to split the LLM's decision (which tool, what arguments) from the execution of the side effect. On retry, you replay the cached decision instead of re-asking the model, and the execution layer enforces idempotency on the actual action. Model nondeterminism becomes irrelevant to retries because you never re-sample on retry. This is also where idempotency keys earn their place: thread a stable key through every state-changing tool call so a replayed "send email" is a no-op the second time.

Make the failure visible to the agent, not just the operator. When a tool returns an error, that error should land in the agent's next reasoning step as explicit context: the tool failed, here is the error, here is what you sent. An agent that can see "your date argument was rejected, expected ISO-8601" can fix it. An agent whose failed attempt was swallowed by a retry wrapper learns nothing and repeats the mistake. Surfacing beats swallowing — but pair it with a hard cap on self-correction attempts, because an agent that can retry its own reasoning indefinitely will, and confidently.

Bound everything and route the leftovers. Cap retries. Cap self-correction loops. Set a timeout budget for the whole run, not just per call. When the budget is spent, the run does not get another retry — it goes to a dead-letter queue for a human to inspect. A failure you can't resolve automatically should become a ticket, not a tighter loop.

The retry rate is a signal — start reading it

The deepest fix is cultural. Most teams treat the retry counter as plumbing — something the framework manages, not something a human looks at. For agents, the retry rate is one of your most honest quality metrics, and you are probably throwing it away.

A run that succeeds on the third attempt is not the same as a run that succeeds on the first, and if your telemetry collapses them into one "success" bucket, you have blinded yourself. White-box monitoring exists precisely to catch problems that retries would otherwise mask: the service reads as "up," the transactions read as "completed," and underneath, the success rate is propped up entirely by retries. The day the noise floor shifts — a model update, a slightly harder input distribution — the retries stop landing and what looked like a healthy 99% collapses, because it was never 99%.

So instrument the thing you have been hiding. Emit attempt count on every agent step. Alert on a rising first-attempt failure rate even while the final success rate looks flat — that gap is the bug forming. Track which tools get retried most; a tool with a high retry rate is usually a tool with a confusing schema the agent keeps getting wrong, and the fix is the schema, not the retry policy. When the retry rate climbs, treat it the way an SRE treats rising latency: as the early warning, not the noise.

The takeaway

Retries built for flaky networks assume the caller is correct. Agents make the caller the unreliable part, and a retry policy that doesn't know the difference will quietly convert your agent's bugs into green checkmarks — laundering deterministic failures through sampling noise until they look transient.

Audit your agent's retry path with three questions. Does a retry repeat a byte-identical call, or does it re-sample the model and hope? Does a failed attempt reach the agent's context as feedback, or get swallowed before the model ever sees it? And does your telemetry tell first-attempt success apart from after-three-tries success — or has it already decided not to look?

A good retry recovers from noise and tells you it happened. A bad one hides a defect and teaches the agent to keep making it. The line between them is whether the failure is allowed to be seen.

References:Let's stay in touch and Follow me for more thoughts and updates