The Agent That Burned an Hour Re-Trying the Same 400 Error
An agent calls a tool. The tool returns 400 Bad Request with a perfectly structured error body: {"error": "missing required field", "field": "email"}. The agent's reasoning chain quotes the error back word for word, says "I need to include the email field," and then emits the next tool call with the exact same payload as before. The loop's retry policy, written years ago for a world of flaky HTTPS connections, treats the 400 as transient and fires it again. Three times. Eight. Fifteen. An hour later, the agent has burned through context, money, and a rate-limit budget without ever sending a request the tool could accept.
The failure looks like a model problem. It is not. The model read the error correctly. The harness around it never gave the correction anywhere to land.
This is the gap between a model that can read an error and an agent that can learn from one. The model has fluent natural-language understanding of "missing field 'email'." The agent has a retry loop that does not distinguish "the network blinked" from "your request is structurally wrong." Those two situations have opposite remedies. One wants the same call again, possibly with backoff. The other wants the same call never again unless something about it changes. Conflating them is how an agent ends up doing the digital equivalent of pressing the elevator button harder.
Two Kinds of Retry the Loop Conflated Into One
The retry primitive most agent frameworks ship with was designed for a different problem. Network calls fail for reasons unrelated to the payload: a TCP reset, a 503 from a temporarily overloaded backend, a 429 because the per-second quota slipped. The right response in all three cases is to wait and retry the identical request: exponential backoff, jitter, capped attempts. The classic shape.
That primitive papers over a deeper distinction. Errors split into two categories that have almost nothing in common operationally:
- Transient errors. The request was fine. The world failed temporarily. Examples: 502, 503, 504, 429 (often), connection reset, DNS hiccup, a timeout inside the provider. The remedy is to send the same bytes again after a delay.
- Deterministic errors. The request was malformed, unauthorized, or pointed at something that does not exist. Examples: 400, 401, 403, 404, 422. The remedy is to change the request. Re-sending identical bytes guarantees the identical failure.
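As a minimal sketch, the split can be written as a lookup on the status code. The names `ErrorClass` and `classify_status` are hypothetical, and the choice to treat unrecognized codes as deterministic is an assumption of this sketch, not something the lists above prescribe.

```python
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"          # same bytes again, after a delay
    DETERMINISTIC = "deterministic"  # the request itself must change


# Transient status codes taken from the first bullet above.
# 429 is listed as "often" transient; it is treated as transient here.
TRANSIENT_STATUSES = frozenset({429, 502, 503, 504})


def classify_status(status: int) -> ErrorClass:
    """Map an HTTP error status to a retry class."""
    if status in TRANSIENT_STATUSES:
        return ErrorClass.TRANSIENT
    # Covers 400, 401, 403, 404, 422 from the second bullet.
    # Assumption: any unrecognized code is also treated as deterministic,
    # so an unknown failure is never blindly re-sent.
    return ErrorClass.DETERMINISTIC
```

Defaulting to deterministic is the conservative choice: the worst outcome is one extra model turn, rather than a run of identical failing calls.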
A retry policy that fires on both classes is not retrying. It is hammering. And the most common production agent shape today does exactly this: a try/except wraps the tool call, the except block sleeps and re-invokes, and the LLM is given the error string to "reason about" on the next turn. The harness never asks: did anything about the request actually change? The model's reasoning trace says it understood the fix, but the harness has no contract that turns understanding into a structural difference in the next call.
The distinction matters because the cost profile is asymmetric. A transient retry that succeeds is cheap insurance. A deterministic retry that cannot succeed costs tokens, latency, money, rate-limit headroom, and — eventually — trust, when the agent gives up and reports failure on a task that was solvable with a one-character payload change.
Why the Model's Eloquence Hides the Bug
The seductive trap is that the LLM looks like it is fixing the problem. Read its reasoning between retries and you will see sentences like "the error indicates I need to include an email field — I will add that to the next call." Then the next call goes out without the email field. The reasoning chain and the tool invocation are two separate channels, and only the invocation actually hits the network. Eloquence in channel one is not evidence of correction in channel two.
This is a specific instance of a general agent failure: the agent narrates what it intends to do, and the harness does not verify that the narration matched the action. In an action-verification loop, every claim the agent makes about its next step is checkable against the next step's payload. In a "trust the reasoning" loop, the narrative drifts and the payload stays put.
Worse, the model is biased toward repeating itself. Once an LLM has emitted a tool call within a context window, the prior call sits in its own input as a high-probability template. Generation is path-dependent: the easiest next token is the one that matches the surrounding pattern. The model that confidently announced "I will add the email field" then samples from a context where the last three tool calls did not include it. The corrective intent is overwritten by the local prior. Without a forcing function, the model regresses toward what it has already done.
The frameworks that hide tool calls behind a "the agent decides when to retry" abstraction make this invisible. The trace shows three retries, the costs add up in the billing dashboard, and the team's takeaway is "the API is flaky." The API is fine. The loop never asked the agent to commit to a change before re-firing.
The Loop Layer Is Where the Distinction Has to Live
Pushing the fix into the prompt — "if you see an error, please make sure to actually change your request" — works in demos and fails in production. Prompting is advisory. Loop policy is enforcing. The two error classes need different code paths, not different instructions to the model.
A loop that has internalized this distinction looks like:
- On transient errors: classify by status code or exception type, apply exponential backoff with jitter, cap attempts (three is a common cap), and re-send the identical payload. No model invocation required for the retry itself.
- On deterministic errors: do not re-invoke the tool with the same payload. Surface the structured error to the model, require a structurally different next call (different field set, different value, different endpoint), and refuse to dispatch a call whose payload hash matches the failed one.
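The transient path can be sketched as follows. `call_with_backoff` and the `send` callable (which returns a status code and a body) are hypothetical names, and the delay values are illustrative assumptions. The point is that only transient statuses are re-sent; anything else is handed back to the caller on the first attempt.

```python
import random
import time
from typing import Callable, Tuple

# Statuses this sketch treats as transient (an assumption).
TRANSIENT = frozenset({429, 502, 503, 504})


def call_with_backoff(
    send: Callable[[], Tuple[int, dict]],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, dict]:
    """Re-send the identical request, but only on transient failures."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, body = send()
        if status not in TRANSIENT:
            # Success or a deterministic error: return immediately.
            # A deterministic error goes to the model, never back to the wire.
            return status, body
        if attempt < max_attempts - 1:
            # Exponential backoff with full jitter.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
    # Attempts exhausted: return the last transient failure.
    return status, body
```

No model call appears anywhere in this function, which matches the first bullet: the retry of a transient failure is purely a harness concern.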
That last clause is the load-bearing one. The harness hashes the tool call (name plus normalized arguments) and refuses to dispatch a hash it has already seen fail in the current turn. The model can produce whatever reasoning it wants. If the resulting payload normalizes to the same hash as a known failure, the harness rejects it and tells the model: "you said you'd change X, but the payload did not change. Try again, and the change must be visible in the arguments." This is the forcing function. The hash-equality check is the one mechanism that cannot be talked around.
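A minimal sketch of that check, assuming tool arguments are JSON-serializable. `call_hash` and `DispatchGuard` are hypothetical names; normalization here is just key-sorted JSON, and a real harness might also trim whitespace or canonicalize numbers.

```python
import hashlib
import json


def call_hash(tool_name: str, args: dict) -> str:
    """Hash the tool name plus key-sorted arguments."""
    canonical = json.dumps(
        {"tool": tool_name, "args": args},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class DispatchGuard:
    """Refuses calls whose hash already failed deterministically this turn."""

    def __init__(self) -> None:
        self._failed = set()

    def record_failure(self, tool_name: str, args: dict) -> None:
        self._failed.add(call_hash(tool_name, args))

    def allows(self, tool_name: str, args: dict) -> bool:
        return call_hash(tool_name, args) not in self._failed

    def reset(self) -> None:
        """Call at the start of each new turn."""
        self._failed.clear()
```

Because keys are sorted before hashing, a model that merely reorders the same arguments does not get past the guard.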
For ambiguous cases — 429, for instance, which can mean either "wait a moment" or "you are over quota for the day" — the loop reads the response headers and the error body. A Retry-After header turns it transient. A "monthly quota exhausted" body turns it deterministic. The cost of correctly classifying ambiguous errors is one switch statement. The cost of mis-classifying them is the hour the agent burns hammering the wrong endpoint.
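The switch statement for the 429 case might look like the sketch below. `classify_429` is a hypothetical name, the body match is a plain substring test on the phrase quoted above, and the fallback for a 429 with neither signal is an assumption of this sketch.

```python
from typing import Mapping


def classify_429(headers: Mapping[str, str], body: str) -> str:
    """Return "transient" or "deterministic" for a 429 response."""
    # HTTP header names are case-insensitive, so normalize before lookup.
    lowered = {name.lower(): value for name, value in headers.items()}

    # A quota-exhausted body wins even if Retry-After is also present:
    # waiting a few seconds will not restore a monthly quota.
    if "quota exhausted" in body.lower():
        return "deterministic"
    if "retry-after" in lowered:
        return "transient"
    # Assumption: with no signal either way, treat it as transient and
    # rely on the capped attempt count to bound the cost.
    return "transient"
```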
Prompt Patterns That Make the Correction Load-Bearing
Even with a strict loop, the model needs to be steered toward producing a different call on the next turn. Two patterns earn their keep:
Pre-commit before re-invoke. Before generating the next tool call, the agent is required to emit a structured statement: "the previous call failed because X. The specific change I am making is Y, in field Z." This goes into the conversation as its own message, separate from the tool invocation. The harness can then check that the next tool call's Z differs from the previous one. The model has been forced to externalize its theory of the fix, which makes the theory testable.
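One way the harness might verify a pre-commit, assuming the statement arrives as a dict with a `field` key naming Z. The function name and the dict shape are hypothetical, and the sketch only compares top-level argument fields.

```python
# Sentinel so that "absent" can be told apart from an explicit None value.
_ABSENT = object()


def precommit_honored(precommit: dict, prev_args: dict, next_args: dict) -> bool:
    """True if the field the model promised to change actually differs."""
    field = precommit.get("field")
    if not field:
        # No field named means there is nothing to verify: reject.
        return False
    return prev_args.get(field, _ABSENT) != next_args.get(field, _ABSENT)
```

For the opening example, a pre-commit of `{"reason": "missing required field", "change": "add email", "field": "email"}` passes only if the next call's `email` differs from the previous call's, which in that case means it is present at all.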
Schema-first reflection. Instead of paraphrasing the error in prose, the agent is given the tool's schema alongside the error and asked: "name the fields in this schema that are missing or wrong in your last call." The model produces a list. The next call must address every item on the list. This converts a vague "I will fix the error" into an itemized checklist the loop can verify against.
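A rough sketch of how the loop could verify the checklist. It assumes the tool schema follows the JSON Schema convention of a top-level `required` list; both function names are hypothetical, and "addressed" is defined narrowly here as "present in the next call with a value different from the last one".

```python
from typing import List

_ABSENT = object()


def missing_required(schema: dict, args: dict) -> List[str]:
    """Required fields (per the schema's "required" list) absent from args."""
    return [name for name in schema.get("required", []) if name not in args]


def unaddressed_fields(flagged: List[str], prev_args: dict, next_args: dict) -> List[str]:
    """Flagged fields that the next call left missing or unchanged."""
    return [
        name
        for name in flagged
        if name not in next_args
        or next_args[name] == prev_args.get(name, _ABSENT)
    ]
```

The harness can run `missing_required` itself and compare the result with the model's list, so a field the model overlooked is caught before dispatch rather than by another 400.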
Both patterns share a structure: take the implicit reasoning the model would have done anyway and make it explicit, structured, and checkable. The model can still hallucinate the fix, but a hallucination that produces a structurally identical payload is now caught at the dispatch boundary.
A complementary pattern at the tool layer: errors should be designed to teach. A 400 that says "bad request" is information-poor. A 400 that says {"error": "missing_required_field", "field": "email", "received_keys": ["name", "phone"], "expected_keys": ["name", "phone", "email"]} makes the corrective action mechanical. Tool authors who write rich error responses save agent designers an enormous amount of work. The error is part of the tool's contract, and a contract that only tells you "no" without explaining "no because" is a contract the agent cannot integrate with.
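On the tool side, producing that richer body takes only a few lines. The sketch below builds the error shape quoted above; `missing_field_error` is a hypothetical helper, and it reports only the first missing key, which is a simplification.

```python
from typing import List, Optional


def missing_field_error(received: dict, expected_keys: List[str]) -> Optional[dict]:
    """Build a teaching error body, or return None if nothing is missing."""
    for key in expected_keys:
        if key not in received:
            return {
                "error": "missing_required_field",
                "field": key,
                "received_keys": list(received),
                "expected_keys": list(expected_keys),
            }
    return None
```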
Instrumentation That Surfaces Stuck Loops
The hardest part of this failure mode is that it can be invisible until the bill arrives. The agent looks busy. Spans light up. Tool calls are flowing. Nothing in a default observability dashboard distinguishes "fifteen tool calls, fifteen useful actions" from "fifteen tool calls, fifteen retries of the same broken payload."
The instrumentation that closes the gap is small and specific:
- Repeat-call rate per tool. For each tool, the percentage of calls within a single turn whose argument hash matches a previous call in the same turn. Healthy tools sit near zero. A tool whose repeat-call rate is climbing is being hammered.
- Same-error retry count. When a tool returns an error code, how many calls in the same turn return that same code with that same payload signature. A value above two is a stuck loop, full stop.
- Loop rate by task. Average iterations to task completion. A rising loop rate, holding task type constant, is a signal that something downstream — a tool, a schema, a model upgrade — is no longer cooperating.
- Cost per resolved task, not per call. Per-call cost dashboards hide stuck loops behind volume. Cost per resolved task makes the failure mode show up as the line graph it actually is.
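Two of these signals can be computed from data most harnesses already log. The sketch below assumes each call in a turn is recorded as a `(tool_name, argument_hash)` pair; the function names and the input shape are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def repeat_call_rate(calls: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Per tool: fraction of calls in one turn repeating an earlier argument hash."""
    seen = set()
    totals = defaultdict(int)
    repeats = defaultdict(int)
    for tool, arg_hash in calls:
        totals[tool] += 1
        if (tool, arg_hash) in seen:
            repeats[tool] += 1
        seen.add((tool, arg_hash))
    return {tool: repeats[tool] / totals[tool] for tool in totals}


def cost_per_resolved_task(total_cost: float, resolved_tasks: int) -> float:
    """Total spend divided by tasks actually resolved."""
    if resolved_tasks == 0:
        # Money spent with nothing resolved: report it as unbounded.
        return float("inf")
    return total_cost / resolved_tasks
```

A turn with three identical calls to one tool yields a repeat-call rate of 2/3 for that tool, since the first call is not a repeat.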
When these signals exist, the question shifts from "did the agent succeed" to "is the system that produces successes healthy." Stuck loops become a first-class operational concern, not a slow bleed nobody notices until a finance review.
Tools Are Contracts, Not Conveniences
The deeper principle hiding inside this failure: a tool's error response is part of its contract. The agent that calls the tool successfully when the input is right and ignores the response when the input is wrong has not integrated with the tool. It has integrated with the happy path. The unhappy path — the structured error, the rate-limit header, the validation failure body — is the part of the API that most tells the agent how to behave next, and it is the part most agent harnesses throw away.
Building agents that respect this contract is mostly not a model problem. It is a loop problem, a prompt problem, and an instrumentation problem, in roughly that order. The model already knows how to read the error. The harness has to make reading translate into doing. The 400 your agent is about to retry for the sixteenth time is not a mystery. It is a contract the agent has not yet honored. Honor it once, in the loop layer, and the hour you were about to burn turns into a single corrective call that succeeds.
