
Dead Letters for Agents: What to Do When No Agent Can Complete a Task

10 min read
Tian Pan
Software Engineer

A team building a multi-agent research tool discovered, on day eleven of a runaway job, that two of their agents had been cross-referencing each other's outputs in a loop the entire time. The bill: $47,000. No human had seen the results. No alarm had fired. The system simply kept running, confident it was making progress, because nothing in the architecture asked the question: what happens when a task genuinely cannot be completed?

Message queues solved this problem decades ago with the dead-letter queue (DLQ). A message that exceeds its delivery retry limit gets routed to a holding area where operators can inspect it, fix the root cause, and replay it when the system is ready. The pattern is simple, battle-tested, and almost entirely missing from production agent systems today.
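For anyone who hasn't implemented one, the core mechanic is small enough to sketch. Here's the classic pattern in a few lines of Python, with illustrative names rather than any particular broker's API:

```python
from dataclasses import dataclass
from queue import Queue

MAX_DELIVERY_ATTEMPTS = 5  # illustrative retry budget, not a standard value

@dataclass
class Message:
    payload: str
    attempts: int = 0

main_queue: Queue = Queue()
dead_letter_queue: Queue = Queue()

def on_delivery_failure(msg: Message) -> None:
    """Route a failed delivery: retry until the budget is spent, then dead-letter."""
    msg.attempts += 1
    if msg.attempts >= MAX_DELIVERY_ATTEMPTS:
        dead_letter_queue.put(msg)  # park for inspection, root-cause fix, and replay
    else:
        main_queue.put(msg)         # transient failure: redeliver
```

One counter, one threshold, one holding area. Everything the pattern knows about a failure is the number of attempts.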

Agent failures are not the same as message delivery failures, though. A queue consumer either processes a message or it doesn't — a binary outcome. An agent can hallucinate confidently, loop without progress, violate a policy it didn't know existed, exhaust its context window mid-task, or cascade a bad output into five downstream agents before anyone notices. These failures carry semantic weight that a retry count cannot capture. Adapting the DLQ pattern for agents means preserving that weight — and routing it somewhere useful.

Why Agents Fail Differently Than Messages

A queue delivery failure has one meaningful attribute: the number of attempts already made. That's enough to decide "give up and dead-letter this." Agent task failures have at least six distinct modes, each requiring a different recovery path:

Confidence collapse. The agent produces output that looks correct but is built on a shaky inference chain. Small deviations in reasoning compound, and by step fifteen the answer is operationally wrong in ways factual accuracy checks won't catch. Unlike a delivery failure, there's no error code — the agent returned 200 OK.

Infinite loops. Two agents cross-referencing each other, a tool that always returns data the agent disagrees with, a planning loop that generates and discards the same subtask repeatedly. Finite iteration limits catch the obvious case, but semantic loops — where the agent phrases the same stuck state differently each time — evade fingerprint-based detection.

Tool permission errors. The agent calls an endpoint it shouldn't have access to, or hits a resource that requires an approval the system never provisioned. These are permanent failures in the current context. Retrying won't help; the fix requires a human to change the agent's authorization scope.

Context overflow. As context windows fill up, models lose fidelity on information buried earlier in the conversation. An agent that was reliably tracking a complex multi-step plan at step three may silently drop a constraint at step twenty. The failure isn't a crash — it's drift.

Policy refusal mid-task. An agent discovers partway through a workflow that completing the next step would require an action the model's safety training won't permit. The task is abandoned at an intermediate state, leaving downstream agents waiting for output that will never come.

Multi-agent cascade. One agent produces a malformed output. The next agent, which trusts its input, processes it and generates a subtly wrong result. By the time the cascade reaches the final step, the original error is invisible. A system with ten agents each running at 95% individual reliability delivers approximately 60% end-to-end reliability — and failures in the early stages corrupt everything downstream.
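One way to make these modes actionable is to encode them as an explicit taxonomy the router can branch on. Here's a minimal sketch in Python; the groupings are my reading of the recovery paths above, not a definitive mapping:

```python
from enum import Enum, auto

class FailureMode(Enum):
    """The six modes above, as a machine-checkable taxonomy (illustrative)."""
    CONFIDENCE_COLLAPSE = auto()
    INFINITE_LOOP = auto()
    TOOL_PERMISSION = auto()
    CONTEXT_OVERFLOW = auto()
    POLICY_REFUSAL = auto()
    MULTI_AGENT_CASCADE = auto()

# Modes where an automatic retry, possibly on a stronger model, has a real chance.
RETRYABLE = {
    FailureMode.CONFIDENCE_COLLAPSE,  # a more capable model may reason through it
    FailureMode.INFINITE_LOOP,        # a fresh context can break the cycle
    FailureMode.CONTEXT_OVERFLOW,     # a larger window or summarization may fit
}

# Modes that are permanent in the current context and need a human.
PERMANENT = {
    FailureMode.TOOL_PERMISSION,      # requires an authorization change
    FailureMode.POLICY_REFUSAL,       # the model will refuse again
    FailureMode.MULTI_AGENT_CASCADE,  # needs root-cause analysis upstream
}
```

The point isn't this particular split; it's that the split exists in code, so routing decisions stop being ad hoc.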

What to Preserve at Failure Time

The classic DLQ preserves the message payload and a timestamp. For agent tasks, that's not nearly enough to diagnose what went wrong or to safely resume the work. A useful agent dead-letter record should capture:

Original intent. The user's goal as expressed before the agent began interpreting it. This is distinct from the current execution state, which may have drifted significantly from what was actually requested.

Complete tool call history. Every tool invoked, with its input parameters and output, in order. This is the execution trace — the equivalent of a stack dump at crash time. Without it, a human reviewer can't understand what the agent already tried or why the failure occurred.

Confidence signals at key steps. If the model surfaces confidence scores or produces reasoning chains, those should be captured at each decision point. A failure that happened at low confidence is different from one that happened at high confidence — the first suggests the task was genuinely ambiguous, the second suggests a subtle model error or bad data.

Failure classification. Was this a transient error (rate limit, timeout) or a permanent one (policy refusal, permission denied)? Is it recoverable with a different model, or does it require human intervention? Classifying the failure at capture time makes routing decisions much faster.

Retry count and prior recovery attempts. If the task has already been escalated once from a cheaper model to a more capable one and still failed, that's critical context. Dead-lettering a task that's already been through two rounds of escalation calls for a different recovery path than dead-lettering on first failure.

Execution state and memory. Any persistent memory the agent wrote, any partial results it produced, the current position in a multi-step plan. If the task is eventually retried or handed to a human, they shouldn't have to start from scratch.
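Put together, those six fields suggest a record shape like the following. This is a sketch, not a schema recommendation, and the field names are illustrative:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any

@dataclass
class ToolCall:
    """One entry in the execution trace."""
    tool_name: str
    arguments: dict[str, Any]
    output: Any
    confidence: float | None = None  # captured if the model surfaces one here

@dataclass
class AgentDeadLetter:
    # The user's goal as stated, before any agent interpretation.
    original_intent: str
    # Full tool call history, in order: the stack dump at crash time.
    tool_calls: list[ToolCall]
    # "transient" or "permanent", plus the specific mode if classified.
    failure_class: str
    failure_mode: str | None = None  # e.g. "tool_permission", "policy_refusal"
    # Prior recovery attempts, so routing knows what was already tried.
    retry_count: int = 0
    prior_escalations: list[str] = field(default_factory=list)  # models tried
    # Execution state: partial results, memory writes, plan position.
    partial_results: dict[str, Any] = field(default_factory=dict)
    plan_position: int | None = None
    captured_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
```

Compared to a classic DLQ entry, almost everything here is new. That's the cost of failures that carry semantic weight.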

The Three Recovery Paths

Once a task lands in the dead-letter queue, it needs to be routed to exactly one of three outcomes.

Auto-retry with escalation. Many transient failures — rate limits, timeouts, context overflow — can be resolved by retrying with a more capable model or with expanded tool access. A task that failed on a lightweight model should be re-attempted with a frontier model before it reaches a human reviewer. This escalation should be bounded: two escalation tiers maximum, with a hard dead-letter if both fail. Unbounded escalation recreates the original problem at higher cost.
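A bounded ladder is easy to get wrong in the direction of "just one more retry," so it's worth making the bound structural. A sketch, assuming a hypothetical `run_agent(task, model=...)` entry point and result shape, neither of which is a real library API:

```python
# Illustrative two-tier ladder; model names are placeholders, not real endpoints.
ESCALATION_TIERS = ["lightweight-model", "frontier-model"]

def attempt_with_escalation(task, run_agent, dead_letter):
    """Retry up the ladder, then hard dead-letter if every tier fails."""
    for model in ESCALATION_TIERS:
        result = run_agent(task, model=model)  # assumed entry point for this sketch
        if result.ok:
            return result
        if result.failure_class == "permanent":
            break  # escalation can't fix a permission or policy failure
        task.retry_count += 1
        task.prior_escalations.append(model)
    dead_letter(task)  # unbounded escalation recreates the problem at higher cost
    return None
```

Note the early break on permanent failures: burning a frontier-model run on a permission error wastes money without changing the outcome.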

Human-review drainage. Permanent failures and high-ambiguity cases require a human. The design of the drainage queue matters enormously here. The reviewer needs to see the original intent, the failure reason in plain language, the tool call history, and a recommended action (retry with different context, modify the task, abandon). Without this structure, the queue becomes a pile of opaque failures that operators dread touching. With it, most failed tasks can be triaged in under two minutes.
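That structure is cheap to generate from the dead-letter record itself. Here's a sketch of a triage view, reusing the `AgentDeadLetter` shape from the earlier sketch; the recommendation heuristic is deliberately crude:

```python
def render_for_review(letter: AgentDeadLetter) -> str:
    """Format a dead-letter record so a reviewer can triage it in minutes."""
    lines = [
        f"INTENT:    {letter.original_intent}",
        f"FAILURE:   {letter.failure_class} ({letter.failure_mode or 'unclassified'})",
        f"TRIED:     {', '.join(letter.prior_escalations) or 'no escalations yet'}",
        "TRACE:",
    ]
    for call in letter.tool_calls:
        lines.append(f"  {call.tool_name}({call.arguments}) -> {call.output}")
    lines.append(f"SUGGESTED: {suggest_action(letter)}")
    return "\n".join(lines)

def suggest_action(letter: AgentDeadLetter) -> str:
    """Purely illustrative routing; a real system would use richer signals."""
    if letter.failure_class == "permanent":
        return "modify task or authorization scope, then replay"
    if letter.retry_count >= 2:
        return "abandon or hand off: escalation budget already spent"
    return "retry with modified context"
```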
