Durable Agents: Why Async Queues Break for Long-Running AI Workflows
An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.
The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped pretending to ignore in 2024.
This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.
