
Replan, Don't Retry: Why Most Agent Errors Aren't Transient

10 min read
Tian Pan
Software Engineer

A calendar-write returns 409 Conflict. The framework's default error handler kicks in: backoff 200ms, retry. Same conflict. Backoff 400ms, retry. Same conflict. Backoff 800ms, retry. By the time the agent gives up and tells the user "I couldn't book the meeting," it has burned three seconds of latency budget proving something the very first response already told it: the slot is taken. The world has not changed. It will not change in 800 milliseconds. Retrying was never going to work, because nothing about this error was transient.
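For concreteness, here is the shape of that default in a minimal sketch (the names are made up, but most frameworks' tool wrappers reduce to something like this):

```python
import time

def call_with_backoff(tool, *args, retries=3, base_delay=0.2):
    """Blind retry-with-backoff: the default most frameworks ship.

    Note what is missing: nothing here asks *why* the call failed, so a
    409 that will never clear gets exactly the same treatment as a
    flaky network timeout.
    """
    delay = base_delay
    for attempt in range(retries + 1):
        try:
            return tool(*args)
        except Exception:
            if attempt == retries:
                raise                  # give up, ~1.4s of sleeps later
            time.sleep(delay)          # 200ms, then 400ms, then 800ms
            delay *= 2
```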

This is the most common error-handling bug in agent systems, and it is hiding in plain sight inside almost every framework that ships today. The retry-with-exponential-backoff pattern was imported wholesale from stateless HTTP clients — where it is exactly correct — into stateful planning loops where it is actively wrong. The right default for a tool error in an agent is not retry. It is replan.

The borrowed reflex

Exponential backoff is one of the most successful patterns in distributed systems. A flaky network drops a packet, a service is briefly overloaded, a dependency restarts — wait a beat, try again, and most of the time the problem clears itself. Decades of HTTP clients, message queues, and database drivers have hammered the pattern into every engineer's reflexes. When the LLM-agent generation of frameworks shipped, they imported the reflex without examining whether the assumption underneath it still held.

It does not. In a stateless HTTP client, the failure is almost always about the transport: the server is up, the request was valid, something between you and it hiccupped. In a stateful agent loop, the failure is almost always about the world: the calendar slot is taken, the payment validation rejected the card, the search returned zero results, the file you tried to read does not exist. These are not transport hiccups. They are the environment telling the planner that an assumption it relied on is wrong.

Retrying a transport failure is sound because the request is the same and the transport is different on the second attempt. Retrying a semantic failure is unsound because the request is the same and the world is also the same. You are running the same deterministic computation against the same state and expecting a different answer, which is the textbook definition of a bug.

A two-axis taxonomy is the unlock

The reason teams keep landing on retry-by-default is that their error model is a single bucket — "the tool returned an error, do something" — and "wait and try again" is the cheapest something to write. The unlock is to split the bucket along two axes that have completely different recovery semantics.

The first axis is transient versus persistent. A transient error will clear if you wait: a 429 rate limit, a 503 from a service that just got paged, a network timeout under load. A persistent error will not: a 409 conflict, a 422 validation rejection, a 404 on a resource that was deleted. Retry is the right move for transient errors and a category mistake for persistent ones.

The second axis is infrastructural versus semantic. An infrastructural error is about the call itself — the wire, the auth, the rate limit, the runtime. A semantic error is about the meaning of the call — the slot was taken, the price changed, the row violates a unique constraint, the customer is on a fraud hold. Infrastructural errors usually want a retry-or-fallback policy; semantic errors almost always want a replan.

These axes are independent. A 503 is transient and infrastructural — retry is correct. A 429 is transient and arguably semantic about your rate-limit budget — retry with backoff and a budget cap. A 409 conflict is persistent and semantic — retrying is futile; the only useful response is to re-observe and replan. A 401 is persistent and infrastructural — retry is futile but a replan is also pointless because no plan from this credential will work; escalate.
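One way to make the taxonomy concrete is to pin it down in code. The sketch below uses HTTP-ish status codes as stand-ins; your tools will have their own error codes, and the point is that the mapping is decided once per tool rather than improvised per call:

```python
from enum import Enum

class ErrorClass(Enum):
    TRANSIENT_INFRA = "transient_infra"          # retry with backoff
    TRANSIENT_SEMANTIC = "transient_semantic"    # retry with backoff, under a budget cap
    PERSISTENT_SEMANTIC = "persistent_semantic"  # re-observe and replan
    PERSISTENT_INFRA = "persistent_infra"        # escalate

# Illustrative mapping from HTTP-ish status codes to quadrants.
CLASSIFICATION = {
    503: ErrorClass.TRANSIENT_INFRA,       # service briefly overloaded
    429: ErrorClass.TRANSIENT_SEMANTIC,    # rate-limit budget exhausted
    409: ErrorClass.PERSISTENT_SEMANTIC,   # slot taken, state conflict
    422: ErrorClass.PERSISTENT_SEMANTIC,   # validation rejected the payload
    404: ErrorClass.PERSISTENT_SEMANTIC,   # resource no longer exists
    401: ErrorClass.PERSISTENT_INFRA,      # no plan from this credential will work
}
```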

The framework default that ships in most stacks today collapses all four quadrants onto the "retry with backoff" cell, which is correct for exactly one of them.

Why this is more than a code-quality complaint

Three failure modes in production motivate the upgrade.

The first is the latency tax of guaranteed-to-fail retries. A retry budget of three with 200ms / 400ms / 800ms backoffs adds a guaranteed 1.4 seconds to any call that returns a persistent semantic error — and that latency lands on the user, who is staring at a spinner during a calendar conflict that the agent could have detected and surfaced in 200 milliseconds. Recent benchmarks measuring ReAct-style agents found that the majority of the retry budget is spent on errors that can never succeed; in one 200-task evaluation, more than 90% of retries were against tools that did not even exist.

The second is correctness regressions from non-idempotent retries. Many tools that look idempotent are not. A payment-create call that timed out may have actually succeeded server-side; the retry creates a duplicate charge. A send-email call that returned a 5xx may have already enqueued the message; the retry double-sends. Frameworks default to retry, application authors default to "tools should be idempotent," and when those assumptions collide the user gets billed twice. Replan, by contrast, forces the agent to re-observe state before acting, which surfaces the duplicate-side-effect risk in the planning step rather than papering over it in the execution step.
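If a write has to be re-attempted at all, the safer shape is to re-observe first. A minimal sketch, assuming a hypothetical payments_api client with list_charges and create_charge methods:

```python
def settle_payment(payments_api, order_id, amount_cents):
    """Re-observe before re-acting after an ambiguous failure.

    `payments_api` is a hypothetical client with `list_charges` and
    `create_charge`; the pattern, not the names, is the point.
    """
    # The first attempt timed out or returned a 5xx: check whether its
    # side effect already landed before issuing the write again.
    for charge in payments_api.list_charges(order_id=order_id):
        if charge.amount == amount_cents and charge.status == "succeeded":
            return charge  # the "failed" call actually went through
    return payments_api.create_charge(order_id=order_id, amount=amount_cents)
```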

The third is eval scores that hide the real problem. Success-rate evals don't distinguish "the agent succeeded after one call" from "the agent succeeded after eight retries and a recovery." A team optimizing eval pass-rate is often unknowingly optimizing for retry-tolerance — the model learns to lean on retries the framework provides for free — and the resulting policy looks good in the suite and feels sluggish and expensive in production. The PALADIN line of work, which trains explicitly on tool-failure recovery trajectories, reports recovery-rate improvements from 32% to nearly 90% by exposing models to failure during training rather than during incident response.

The default policy that actually works

The starting point is a typed-error contract on every tool — not a free-form string, but a structured object the agent can reason about: a class (transient infrastructural, persistent infrastructural, transient semantic, persistent semantic), a code, a human-readable reason, and the observed state that produced it. With that contract in place, the recovery policy becomes mechanical rather than improvisational.
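A minimal version of that contract, sketched as a dataclass that reuses the ErrorClass enum from the classification sketch above (the field names are illustrative; the structure is what matters):

```python
from dataclasses import dataclass, field
from typing import Any

@dataclass
class ToolError(Exception):
    """Structured error every tool raises instead of a free-form string."""
    error_class: ErrorClass                 # which quadrant of the taxonomy
    code: str                               # e.g. "calendar.slot_conflict"
    reason: str                             # human-readable, fed to the replanner
    observed_state: dict[str, Any] = field(default_factory=dict)  # what the tool saw
```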

Persistent semantic errors trigger a replanning step, not a retry. The agent is given the error class, the failed step, the observed state, and the original goal, and asked to produce a new plan that accounts for what just changed. A calendar conflict produces a new plan that proposes a different time, asks the user to choose, or pivots to a different attendee set — not a tighter loop on the same slot.
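What the replanning step receives can be as simple as a structured request built from that contract. A sketch, with the prompt assembly reduced to a dictionary:

```python
def build_replan_request(goal, failed_step, error):
    """Assemble the context the planner needs to produce a *new* plan,
    not a tighter loop on the same call. `error` is the ToolError above.
    """
    return {
        "goal": goal,                            # the original user intent, unchanged
        "failed_step": failed_step,              # the call that hit the wall
        "error_class": error.error_class.value,  # e.g. "persistent_semantic"
        "reason": error.reason,                  # "slot 14:00-15:00 already booked"
        "observed_state": error.observed_state,  # e.g. the attendee's remaining free slots
        "instruction": (
            "Produce a revised plan that accounts for the observed state. "
            "Do not repeat the failed call with the same arguments."
        ),
    }
```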

Transient infrastructural errors trigger a bounded retry with backoff, exactly as in the HTTP-client world. The bound is per-call rather than per-turn so a single flaky tool cannot consume the entire latency budget.

Persistent infrastructural errors — bad credentials, missing tools, malformed schemas — escalate immediately. Retrying or replanning around a 401 is wasted compute; the agent should surface the failure to a higher authority (the user, an operator, a fallback agent with broader privilege) without burning more turns.

The replan budget is then capped per turn, separately from the retry budget per call. A typical configuration: three retries per call, two replans per turn, then escalate. This is the bounded-recovery shape that recent academic work on structured exception handling (SHIELDA and the scheduler-theoretic graph harness papers from the last year) converges on, because it prevents the two pathological loops — infinite retry on a deterministic failure, and infinite replan on a goal that's structurally impossible.
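Put together, the whole policy fits on one screen. This is a sketch under the assumptions above: the ToolError contract, a planner exposing plan and replan, an execute callable that runs a single step, and an escalate hook whose shape is yours to define:

```python
import time

MAX_RETRIES_PER_CALL = 3   # per call, transient errors only
MAX_REPLANS_PER_TURN = 2   # per turn, then escalate

def run_turn(planner, execute, escalate, goal):
    """One agent turn under split budgets: retries scoped to a single
    call, replans scoped to the whole turn, persistent infrastructural
    errors escalated immediately.
    """
    plan = planner.plan(goal)       # assumed: a list of executable steps
    replans = 0
    results, i = [], 0
    while i < len(plan):
        try:
            results.append(run_step(execute, plan[i]))
            i += 1
        except ToolError as err:
            if err.error_class == ErrorClass.PERSISTENT_INFRA:
                return escalate(goal, err)   # bad creds, missing tool: stop burning turns
            if replans >= MAX_REPLANS_PER_TURN:
                return escalate(goal, err)   # bounded recovery: stop thrashing
            replans += 1
            plan = planner.replan(build_replan_request(goal, plan[i], err))
            results, i = [], 0               # execute the new plan from the top
    return results

def run_step(execute, step):
    """Bounded per-call retry with backoff, applied only to transient errors."""
    for attempt in range(MAX_RETRIES_PER_CALL + 1):
        try:
            return execute(step)
        except ToolError as err:
            transient = err.error_class in (
                ErrorClass.TRANSIENT_INFRA, ErrorClass.TRANSIENT_SEMANTIC)
            if not transient or attempt == MAX_RETRIES_PER_CALL:
                raise
            time.sleep(0.2 * 2 ** attempt)   # 200ms, 400ms, 800ms
```

The two budgets fail independently: a call that exhausts its retry budget surfaces as a ToolError that the per-turn loop then treats like any other persistent failure, so a flaky tool degrades into a replan or an escalation instead of an unbounded stall.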

Instrumentation: replan-after-error should be your dominant graceful path

A team that has internalized this design will see something interesting in their telemetry: "replan-after-error" becomes the most common graceful-recovery path, ahead of "retry-and-succeed" by a wide margin. That is the right ratio. Most production failures are the world telling the agent something. The metric that should worry you is the inverse — high retry-then-succeed rates often mean either flaky infrastructure that ought to be fixed at the platform layer, or persistent errors getting misclassified as transient and accidentally working because someone else's idempotency saved you.

Concrete dashboards that fall out of this view: replan rate per turn split by error class, time-to-replan after a semantic failure, retry-budget consumption per tool, and the count of escalations that hit the user. Each of these tells you something a generic success-rate eval cannot. A spike in replan-rate for the calendar tool means the world is contending more often than your plans assumed — maybe a rollout broke your availability heuristics. A spike in retry-budget consumption against the payment tool means the provider is degrading and you should circuit-break before more user calls hit the wall.
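The counters behind those dashboards are not exotic. A sketch using plain in-process counters, so as not to presume any particular metrics stack:

```python
from collections import Counter

class RecoveryMetrics:
    """Minimal in-process counters; swap in whatever metrics client you run."""

    def __init__(self):
        self.replans_by_class = Counter()      # replan rate per turn, split by error class
        self.retries_by_tool = Counter()       # retry-budget consumption per tool
        self.time_to_replan_s = []             # latency from semantic failure to new plan
        self.escalations_to_user = 0           # failures that reached the user

    def record_replan(self, error_class, seconds_since_failure):
        self.replans_by_class[error_class.value] += 1
        self.time_to_replan_s.append(seconds_since_failure)

    def record_retry(self, tool_name):
        self.retries_by_tool[tool_name] += 1

    def record_escalation(self):
        self.escalations_to_user += 1
```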

The eval discipline this requires

Replan-as-default only works if your eval suite tests it. Most agent benchmarks today construct happy-path traces — the world cooperates, every tool returns success, and the agent's job is to chain them. The traces that matter for recovery are the adversarial ones, where the world changes between plan and execution: the slot the agent just confirmed got booked by another caller, the price the agent quoted just moved, the document the agent planned to read got deleted.

A good replan eval injects these state changes mid-trajectory and asserts a property — not "the agent finished the task," because sometimes the right answer is "tell the user the slot is gone," but "the agent re-observed and produced a coherent next plan, did not retry the same call against the same state, did not blow its budget thrashing." This is the property the SHIELDA and PALADIN lines of work are building eval harnesses for, and it is the kind of test your team has to write because no off-the-shelf benchmark currently tests for it on your tools.
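Here is what such a test can look like. The agent, world, and trace objects are illustrative names for whatever your own harness exposes, not an existing API:

```python
def test_replan_after_midtrajectory_conflict(agent, world):
    """Inject a state change between plan and execution, then assert
    properties of the recovery rather than task success.
    """
    goal = "book a 30-minute meeting with Dana this week"
    plan = agent.plan(goal)

    # Adversarial injection: another caller takes the slot the agent chose.
    chosen_slot = plan.steps[0].arguments["slot"]
    world.calendar.book(chosen_slot, owner="someone_else")

    trace = agent.execute(plan, world)

    assert trace.replans >= 1                     # it re-observed and produced a new plan
    assert not trace.repeated_identical_calls     # no retry of the same call against the same state
    assert trace.replans <= MAX_REPLANS_PER_TURN  # it stayed inside its recovery budget
```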

The architectural framing

Once you've made the swap, a deeper realization tends to follow: an agent is not a chain of API calls. It is a control loop. The right response to most signals from the environment is to re-observe and decide what to do next, given the new state. Retry is a special case of that loop where the new state happens to be identical to the old state and the call happens to be idempotent — a narrow special case worth optimizing for transient infrastructure failures, but a special case nonetheless.

The dominant case — the one that should govern your defaults — is that the world has moved, and the agent's plan is now slightly wrong. The cheapest, most graceful response is to notice and adjust. Retry is what you do when you're pretending the world hasn't moved. Replan is what you do when you've accepted that it almost always has.

The frameworks will catch up eventually. Until they do, the move that buys the most reliability per hour of engineering time is to walk through every tool in your registry, classify its likely error modes along the two axes above, and rewire the default for everything in the persistent-semantic quadrant from "retry-with-backoff" to "raise-and-replan." Your latency p95 will improve, your duplicate-side-effect bug count will fall, and your team will stop confusing "the world told us no" with "the network had a hiccup."
