Dead Letters for Agents: What to Do When No Agent Can Complete a Task
A team building a multi-agent research tool discovered, on day eleven of a runaway job, that two of their agents had been cross-referencing each other's outputs in a loop the entire time. The bill: $47,000. No human had seen the results. No alarm had fired. The system simply kept running, confident it was making progress, because nothing in the architecture asked the question: what happens when a task genuinely cannot be completed?
Message queues solved this problem decades ago with the dead-letter queue (DLQ). A message that exceeds its delivery retry limit gets routed to a holding area where operators can inspect it, fix the root cause, and replay it when the system is ready. The pattern is simple, battle-tested, and almost entirely missing from production agent systems today.
