Durable Agents: Why Async Queues Break for Long-Running AI Workflows
An agent that works 95% of the time per step is not a 95% reliable agent. Chain twenty steps together and the end-to-end completion rate drops to 36%. This is the arithmetic most teams discover only after their agent hits production, and it is the reason so many "working" prototypes stall the moment real traffic arrives. The fix is not better prompts or bigger models. It is a boring piece of distributed systems infrastructure most AI teams try to avoid until the third outage forces their hand.
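The compounding is one line to verify:

```python
# Per-step reliability compounds multiplicatively across a chain of steps.
print(0.95 ** 20)  # 0.3584859224085419, i.e. roughly 36% end-to-end
```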
The infrastructure is durable execution — the discipline of making a multi-step workflow survive crashes, restarts, and partial failures without losing its place. It is not a new idea. Temporal, Restate, DBOS, Inngest, and Azure Durable Task have been selling it for years. What is new in 2026 is that every serious agent framework has quietly admitted durable execution is table stakes: LangGraph now ships with a PostgresSaver checkpointer, the OpenAI Agents SDK exposes a resume primitive, and Anthropic's Managed Agents runs on an internal durable substrate. If your agent architecture still rests on a Celery queue and optimism, you are solving in 2026 a problem the rest of the industry stopped ignoring in 2024.
This post is about the architectural seam between a stateless LLM and the stateful workflow engine that has to wrap it. The seam is where reliability lives, and it is where most teams are currently writing bugs.
Why The Async Queue Pattern Breaks
The default reflex when an agent call takes longer than a request-response cycle can tolerate is to push it onto a queue. Redis, SQS, RabbitMQ, Celery — the exact choice barely matters. You enqueue a job, a worker picks it up, calls the LLM, calls some tools, writes the result somewhere, and acks the message. For a single-shot inference this works fine. For a multi-step agent it is a trap waiting to close.
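In code, the naive pattern looks something like this Celery-style sketch; the planner and tool helpers are illustrative stand-ins, not a real agent:

```python
from celery import Celery

app = Celery("agents", broker="redis://localhost:6379/0")

# Illustrative stand-ins for the real planner and tools.
def call_llm(goal, history): ...
def execute_tool(action): ...
def is_done(history): ...
def save_result(job_id, history): ...

@app.task(acks_late=True)  # broker redelivers if the worker dies before acking
def run_agent(job_id: str, goal: str):
    history = []  # lives only in worker memory
    while not is_done(history):
        action = call_llm(goal, history)   # non-deterministic planner
        result = execute_tool(action)      # side effects happen here
        history.append((action, result))
    save_result(job_id, history)  # state becomes durable only at the very end
    # If the pod is killed at step 12 of 20, redelivery restarts this
    # function from step 1 and re-spends every LLM dollar.
```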
The first failure is worker death mid-execution. A research agent that spends eighteen minutes gathering sources and synthesizing findings has burned dozens of dollars in LLM spend by the time the Kubernetes autoscaler decides to reschedule its pod. When the worker dies, the queue redelivers the message — and the agent restarts from step one, spending the same dollars again. You can try to mitigate this with in-memory state, but in-memory is precisely what the worker lost. You can try to mitigate it with a visibility timeout, but you do not actually know how long the agent needs, because the LLM is non-deterministic and the tools might loop.
The second failure is retries that are not safe to retry. If step seven of your agent issued a refund, sent a Slack message, or created a Stripe charge, redelivering the job causes the agent to plan its way to step seven again and do it twice. The queue gave you at-least-once delivery, which is what queues give you. Making the work idempotent is your problem, and with an LLM in the loop "idempotent" does not mean "compute the same value twice" — the LLM might plan differently this time. A retry of a non-deterministic workflow is not a retry; it is a fresh run.
The third failure is the state sprawl that accumulates when you try to patch around the first two. Teams start writing status flags to Postgres after every step, checking them on retry, building ad-hoc resumption logic into each agent. What looks like "we just added a few columns" turns into a home-grown workflow engine with worse semantics than the ones you could have adopted off the shelf. The maintenance tax compounds every time someone adds a new step.
The Seam: Stateless Planner, Stateful Substrate
The clean mental model is to treat the LLM as a stateless oracle that returns the next intended action, and to treat the workflow engine as the durable substrate that actually executes those actions, persists their results, and offers the LLM a consistent view of what has happened so far. The agent loop becomes: ask the LLM what to do, record the decision, execute the action durably, record the result, feed everything back to the LLM, repeat.
This inversion matters because it puts the non-deterministic component where it belongs — as a pure function from observed state to a proposed action — and puts the deterministic machinery around it. The workflow engine handles what deterministic machinery is good at: persistence, retries, timeouts, concurrency, compensation. The LLM handles what LLMs are good at: picking the next move. The seam between them is a small API, not a sprawl of ad-hoc state columns.
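Sketched as Python, the loop reads like this; the engine object and its method names are illustrative placeholders for whatever durable substrate you adopt, not any framework's actual API:

```python
# Illustrative stand-ins: `engine` is whatever durable substrate you adopt
# (Temporal, Restate, DBOS, a LangGraph checkpointer), with invented method names.
def call_llm(goal, history): ...
def execute_tool(action): ...

def durable_agent_loop(engine, workflow_id: str, goal: str):
    # The engine, not this process, knows where the workflow is.
    state = engine.load(workflow_id)  # after a crash, this resumes mid-run
    while not state.finished:
        # Stateless oracle: a pure function from observed state to a proposed action.
        action = call_llm(goal, state.history)
        engine.record_decision(workflow_id, state.step_no, action)
        # Durable execution: the engine runs the action once and dedupes it on replay.
        result = engine.run_step(workflow_id, state.step_no,
                                 lambda: execute_tool(action))
        # Record the result; the next iteration feeds it back to the LLM.
        state = engine.record_result(workflow_id, state.step_no, action, result)
```

A production version would wrap the call_llm invocation in the same run_step primitive, for reasons the idempotency section gets into.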
Two implementation styles dominate. Journal-based replay, used by Temporal and Restate, records every completed step to a log and replays the log on recovery, skipping completed work by returning cached results. Database checkpointing, used by LangGraph and DBOS, persists the full state object after each node. Both get you to the same place — a workflow that resumes exactly where it left off — with different tradeoffs on storage, replay cost, and how much state lives inside the framework versus inside your database.
The distinction that actually matters is whether the framework treats side effects as first-class. Diagrid's critique of pure checkpoint-based systems is pointed: a checkpointer alone does not know that step three called a payment API, so it cannot guarantee that replay will not call it a second time. Durable execution requires every external interaction — every LLM call, every tool call, every database write — to be wrapped in a primitive that the engine knows how to dedupe on replay. Without that, you have resumption theater, not resumption.
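Here is a toy illustration of what "knows how to dedupe on replay" means; a real engine persists the journal to a log or database rather than a dict, but the control flow is the point:

```python
import json

class Journal:
    """Toy journal: completed steps are cached by step id, and replay
    returns the cached result instead of re-running the side effect."""
    def __init__(self):
        self._completed: dict[str, str] = {}  # step_id -> JSON result

    def step(self, step_id: str, fn):
        if step_id in self._completed:
            # Replay path: return the recorded result, do NOT re-run fn.
            return json.loads(self._completed[step_id])
        result = fn()  # first execution: the side effect happens exactly once
        self._completed[step_id] = json.dumps(result)  # persist before moving on
        return result

journal = Journal()
# Step 3 calls a payment API. On crash-and-replay, the journal returns
# the cached charge instead of charging the customer a second time.
charge = journal.step("step-3-charge", lambda: {"charge_id": "ch_123"})
```

Note the write order: a crash in the window between executing the side effect and journaling its result still re-executes on replay, which is why engines pair this primitive with idempotency keys on the external call itself.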
Sagas and Compensation When Rollback Is Not An Option
Once you accept that agent workflows span minutes to days and touch external systems, you lose the ability to use database transactions as your consistency model. You cannot hold a Postgres transaction open across an LLM call that might take two minutes, and you certainly cannot hold one across a tool call that hits a third-party API. The answer the distributed systems community settled on twenty years ago is the saga pattern, and it transfers to agents almost unchanged.
A saga is a sequence of local transactions, each paired with a compensating action that logically undoes it. If step five fails, the engine runs the compensations for steps one through four in reverse. The compensations are not rollbacks in the ACID sense — they are semantic undos. A credit-card charge gets compensated by a refund, not by deleting the row. A sent email gets compensated by an apology email, not by un-sending. The compensation is an application-level decision the engine cannot make for you.
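The mechanism fits in a dozen lines; the business functions below are illustrative:

```python
def run_saga(steps):
    """steps: list of (action, compensation) pairs. On failure, run the
    compensations for every completed step in reverse order."""
    done = []
    try:
        for action, compensate in steps:
            action()            # local transaction commits here
            done.append(compensate)
    except Exception:
        for compensate in reversed(done):
            compensate()        # semantic undo, not an ACID rollback
        raise

# Illustrative business actions.
def charge_card(o): ...
def refund_card(o): ...
def send_welcome(o): ...
def send_apology(o): ...
def reserve_stock(o): ...
def release_stock(o): ...

order = {"id": 42}
run_saga([
    (lambda: charge_card(order),   lambda: refund_card(order)),   # refund, not row-delete
    (lambda: send_welcome(order),  lambda: send_apology(order)),  # apology, not un-send
    (lambda: reserve_stock(order), lambda: release_stock(order)),
])
```

In a durable engine the compensation stack itself must be journaled, otherwise a crash mid-rollback loses track of which undos already ran.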
Sagas become more interesting in agent workflows because the LLM itself is part of the decision graph. A recent paper on SagaLLM makes the point that multi-agent LLM planners need transactional guarantees or their plans quietly drift out of sync with the world they are acting on. Practically, this means every externally-observable action an agent takes — writing to a CRM, calling a webhook, creating a calendar event — should be encoded as a step with a defined compensation, not as a bare tool call. Most agent frameworks today do not enforce this, which is one reason agents that look fine in demos behave horrifyingly when a mid-flow failure forces recovery.
The hardest part of saga design for agents is deciding what is compensable and what is not. Sending a message to a customer is not undoable. Granting a discount is partially undoable. Creating a GitHub issue is trivially undoable. The saga design pressure forces engineers to think about action reversibility up front, which is a forcing function that tends to produce saner agent designs even when durable execution is the nominal motivation.
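One way to apply that pressure is to make reversibility a required field on every action the agent can take, so an unclassified tool cannot ship. A hypothetical registration scheme, not any framework's API:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional

class Reversibility(Enum):
    REVERSIBLE = "reversible"        # e.g. create a GitHub issue -> close it
    PARTIAL = "partial"              # e.g. grant a discount -> revoke the unused part
    IRREVERSIBLE = "irreversible"    # e.g. message a customer -> cannot unsend

@dataclass
class AgentAction:
    name: str
    run: Callable[..., object]
    reversibility: Reversibility
    compensation: Optional[Callable[..., object]] = None

    def __post_init__(self):
        # Anything short of irreversible must declare its semantic undo up front.
        if self.reversibility is not Reversibility.IRREVERSIBLE \
                and self.compensation is None:
            raise ValueError(f"{self.name}: reversible actions need a compensation")
```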
Idempotency Keys That Actually Work With LLMs
Idempotency is the primitive that makes retries safe, and agent workflows stress it in ways most libraries were not built for. The standard HTTP idempotency-key pattern — client generates a UUID, server caches the response for twenty-four hours — works fine when the client is deterministic. When the client is an LLM, two problems emerge.
First, the LLM might generate a different request on retry. If the model decided on retry to call send_email(to=customer, subject="welcome") instead of the last attempt's send_email(to=customer, subject="Welcome"), a content-hash idempotency key will treat them as distinct requests and send two emails. The fix is to derive the idempotency key from the workflow's deterministic context — the workflow ID plus the step number — not from the arguments. That way, whatever the LLM decides to send on retry, the engine recognizes it as a replay of the same logical step and returns the cached result.
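Concretely, the two key derivations look like this; send_email stands in for any side-effecting tool:

```python
import hashlib

def content_hash_key(tool: str, args: dict) -> str:
    # BROKEN with an LLM client: "welcome" vs "Welcome" yields two keys,
    # so a replanned retry sends a second email.
    blob = tool + repr(sorted(args.items()))
    return hashlib.sha256(blob.encode()).hexdigest()

def workflow_step_key(workflow_id: str, step_no: int) -> str:
    # Correct: derived from the deterministic workflow context, so whatever
    # arguments the LLM proposes on retry map to the same logical step.
    return f"{workflow_id}:{step_no}"

# The executor checks the key before running the side effect: if the store
# already holds workflow_step_key(wf, 7), it returns the cached result
# instead of calling send_email again.
```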
Second, the LLM call itself needs idempotency or it will burn tokens every replay. Journal-based engines handle this by caching the LLM response in the journal and returning it on replay, which is correct but has a subtle failure mode: if your prompt template changes between the original run and the replay, the cached response may no longer be coherent with the new prompt. Frameworks that version workflows explicitly — Temporal's workflow versioning, LangGraph's node identity — sidestep this; frameworks that do not will eventually bite you.
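A sketch of one defense: fold an explicit prompt version into each journal entry, so a replay against a changed template fails loudly instead of silently splicing a stale completion into a new prompt. The journal here is a toy dict, and the versioning scheme is an assumption, not Temporal's or LangGraph's actual mechanism:

```python
PROMPT_VERSION = "v3"          # bump whenever the prompt template changes

journal: dict[str, dict] = {}  # toy stand-in for the engine's persistent journal

def call_llm(prompt: str) -> str: ...  # illustrative LLM client

def journaled_llm_call(step_id: str, prompt: str) -> str:
    entry = journal.get(step_id)
    if entry is not None:
        if entry["prompt_version"] != PROMPT_VERSION:
            # The cached completion was produced by an old template;
            # failing loudly beats feeding the model an incoherent transcript.
            raise RuntimeError(f"{step_id}: journaled under "
                               f"{entry['prompt_version']}, code is at {PROMPT_VERSION}")
        return entry["response"]                      # replay: zero tokens spent
    response = call_llm(prompt)                       # first execution
    journal[step_id] = {"prompt_version": PROMPT_VERSION, "response": response}
    return response
```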
The bigger lesson is that idempotency in agent workflows is a property of the workflow engine, not of the individual actions. You cannot bolt it on action-by-action because actions are generated dynamically by a model that might change its mind. The engine has to own the identity of each step.
What To Adopt And When
Reaching for Temporal or Restate on day one for a toy agent is overkill and will slow you down. Reaching for them only after three production incidents is the common path, and by then you have accumulated enough ad-hoc state to make migration painful. The middle path that works in practice is to pick the lightest-weight durable substrate that matches your team's skill profile, and adopt it the moment the agent goes past a single round-trip.
For teams already on Python and LangGraph, the built-in checkpointer with a PostgresSaver is the lowest-friction starting point. It gives you resumption and human-in-the-loop pauses without introducing new infrastructure. The upgrade path when you hit its limits — cross-service transactions, multi-region, very long-running workflows — is to move to Temporal or Restate. Do not try to scale LangGraph's checkpointer into a general workflow engine; that is not what it is.
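The wiring is small. A minimal sketch, assuming the langgraph and langgraph-checkpoint-postgres packages and a trivial one-node graph in place of a real agent:

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END
from langgraph.checkpoint.postgres import PostgresSaver

class State(TypedDict):
    question: str
    answer: str

def answer_node(state: State) -> dict:
    return {"answer": f"(answer to {state['question']})"}  # stand-in for the LLM call

builder = StateGraph(State)
builder.add_node("answer", answer_node)
builder.add_edge(START, "answer")
builder.add_edge("answer", END)

with PostgresSaver.from_conn_string("postgresql://user:pass@localhost/agents") as saver:
    saver.setup()  # creates the checkpoint tables on first run
    graph = builder.compile(checkpointer=saver)
    # thread_id identifies the workflow; every step's state is checkpointed
    # to Postgres under it.
    config = {"configurable": {"thread_id": "order-1234"}}
    graph.invoke({"question": "status of order 1234?"}, config)
```

Re-invoking with the same thread_id picks up from the last persisted checkpoint, which is the whole resumption story at this tier.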
For teams comfortable with explicit workflow-as-code, Temporal is the mature option and has become something close to the default for AI-native companies that operate at scale. Restate offers a lighter footprint and is worth evaluating if your deployment model favors serverless or edge. DBOS is a newer entrant that collapses the workflow engine into your Postgres database, which is appealing for smaller teams that do not want to operate another system. Inngest sits in the queue-native camp and is easier to adopt than Temporal but gives up some of Temporal's replay semantics.
Whichever you pick, the commitment you are making is architectural, not just operational: every externally-visible action gets wrapped in an engine primitive, every step gets a compensation or an explicit "this is irreversible, do not retry" tag, and the LLM stops being the thing that knows where the workflow is in its execution. The LLM picks the next step; the engine remembers that the step happened. Confusing those two responsibilities is the original sin of most production agent outages.
The Takeaway
The question every team building multi-step agents will eventually face is whether to reinvent durable execution by accident or adopt it on purpose. Adopting it on purpose is cheaper, because the alternative — a growing collection of status flags, retry counters, and resumption checkpoints scattered across your database — is just a worse version of the same thing, built by a smaller team with less review. The architectural seam between the stateless planner and the stateful substrate is the most important line in an agent system. Draw it early, defend it, and let the workflow engine own everything on the durable side.
- https://temporal.io/solutions/ai
- https://temporal.io/blog/of-course-you-can-build-dynamic-ai-agents-with-temporal
- https://zylos.ai/research/2026-02-17-durable-execution-ai-agents
- https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents
- https://inference.sh/blog/agent-runtime/durable-execution
- https://docs.langchain.com/oss/python/langgraph/durable-execution
- https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
- https://learn.microsoft.com/en-us/azure/durable-task/sdks/durable-task-for-ai-agents
- https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents
- https://arxiv.org/html/2503.11951v3
- https://cloud.google.com/blog/topics/developers-practitioners/implementing-saga-pattern-workflows
- https://docs.aws.amazon.com/prescriptive-guidance/latest/agentic-ai-patterns/prompt-chaining-saga-patterns.html
- https://render.com/articles/durable-workflow-platforms-ai-agents-llm-workloads
- https://www.zenml.io/blog/the-agent-deployment-gap-why-your-llm-loop-isnt-production-ready-and-what-to-do-about-it
