Async Agent Workflows: Designing for Long-Running Tasks
Most AI agent demos run inside a single HTTP request. The user sends a message, the agent reasons for a few seconds, the response comes back. Clean, simple, comprehensible. Then someone asks the agent to do something that takes eight minutes — run a test suite, draft a report from twenty web pages, process a batch of documents — and the whole architecture silently falls apart.
The 30-second wall is real. Cloud functions time out. Load balancers kill idle connections. Mobile clients go to sleep. None of the standard agent frameworks document what to do when your task outlives the transport layer. Most of them quietly fail.
This post covers the architecture you need to bridge the gap between synchronous HTTP and long-running autonomous work. The patterns are not novel — distributed systems engineers have been solving this for decades — but they are systematically missing from most AI agent stacks.
The Fundamental Mismatch
HTTP was designed for synchronous request-response interactions. A client connects, sends a request, and waits. The server processes the request and sends back a response. The connection closes. The server forgets.
That model works when the work completes in under a second. It stretches to maybe thirty seconds if you're willing to accept poor UX. Beyond that, you're fighting the protocol.
Agent workflows break this model in several ways. An agent doing a multi-step research task isn't just waiting for a single API call — it's running a loop of reasoning steps, each of which might trigger tool calls that themselves take variable time. An agent waiting for a human approval in the middle of a workflow might be idle for hours. An agent processing a batch of documents is doing work proportional to the input size, not the complexity of a single query.
The naive fix is to increase timeouts. Set your load balancer to five minutes and hope for the best. This works until it doesn't — and the failure mode is silent. The client times out, the agent keeps running in the background with no way to retrieve its output, and the user gets an error. If the client retries, you now have two agents running the same task. If the task has side effects — emails sent, records written, APIs called — you have a problem.
The Async Job Pattern
The correct architecture for long-running agent tasks is the same pattern used for any long-running background job: accept the work, return a handle immediately, let the client retrieve results when they're ready.
The flow looks like this:
- Client submits a task. Server validates the request and enqueues the work.
- Server returns a task_id immediately — the HTTP response is 202 Accepted, not 200 OK.
- Agent runs asynchronously in the background. Task state is persisted externally.
- Client polls GET /tasks/{task_id} to check status, or receives a webhook notification when the task completes.
This decouples the submission latency from the execution latency. The client's connection only needs to survive long enough to receive the task ID. Everything after that is async.
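The submit-and-poll flow can be sketched in a few lines. This is a minimal illustration with an in-memory dict standing in for the durable store; the function names and payload shape are hypothetical, not from any particular framework.

```python
import uuid

# In-memory stand-in for a durable task store; a real deployment would
# persist this in a database or queue so it survives process restarts.
TASKS = {}

def submit_task(payload):
    """Validate, enqueue, and return a handle immediately (the 202 path)."""
    task_id = str(uuid.uuid4())
    TASKS[task_id] = {"status": "pending", "payload": payload, "result": None}
    # A background worker would pick this task up asynchronously.
    return {"task_id": task_id, "status_code": 202}

def get_task(task_id):
    """The polling endpoint: GET /tasks/{task_id}."""
    return TASKS.get(task_id, {"status": "not_found"})
```

The key property is that submit_task does no agent work at all — it only records intent and hands back a handle.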
The task state machine is simple: pending → working → completed | failed | cancelled. Once a task reaches a terminal state it becomes immutable. This matters for correctness — you never want a client to observe a regression from completed back to working during a retry.
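One way to enforce that terminal states are immutable is to make the transition table explicit and reject anything it doesn't list — a small sketch, not tied to any framework:

```python
VALID_TRANSITIONS = {
    "pending":   {"working", "cancelled"},
    "working":   {"completed", "failed", "cancelled"},
    # Terminal states allow no further transitions.
    "completed": set(),
    "failed":    set(),
    "cancelled": set(),
}

def transition(current, new):
    """Apply a state change, rejecting any move out of a terminal state."""
    if new not in VALID_TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new}")
    return new
```

A retry that tries to flip a completed task back to working now fails loudly instead of silently corrupting state.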
Idempotency Keys: Why Agents Need Them More Than APIs Do
Standard API design recommends idempotency keys for mutation endpoints. For agents, they're non-negotiable.
Here's why. When a client submits an agent task and the network drops before the 202 response arrives, the client doesn't know whether the task was accepted. The correct behavior is to retry. Without idempotency keys, that retry spawns a second agent running the same task. With side effects, that's a disaster. With expensive LLM calls, it's also wasteful.
The pattern: clients attach an Idempotency-Key header (a UUID they generate) to every task submission. The server stores the mapping from key to task ID. If the same key arrives again within a TTL window (typically 24 hours), the server returns the existing task ID without creating a new one. The client can then poll that task ID and discover the work was already completed.
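The server-side lookup is a few lines. In this sketch the key-to-task index is an in-memory dict; a real implementation would store it durably with the 24-hour TTL mentioned above.

```python
import uuid

# Maps idempotency key -> task_id. A real server stores this in a
# database with a TTL window (e.g. 24 hours), not in process memory.
IDEMPOTENCY_INDEX = {}

def create_task(idempotency_key, payload, tasks):
    """Return the existing task for a repeated key instead of spawning a duplicate."""
    if idempotency_key in IDEMPOTENCY_INDEX:
        return IDEMPOTENCY_INDEX[idempotency_key], False  # not newly created
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "pending", "payload": payload}
    IDEMPOTENCY_INDEX[idempotency_key] = task_id
    return task_id, True
```

A client that retries after a dropped 202 gets the same task_id back and can resume polling as if nothing happened.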
For agent-specific use cases, there's an additional wrinkle. If your agent itself calls tools that have side effects, each tool call should carry an idempotency key derived from the task ID and the step number. This makes individual tool invocations safe to retry without duplication. A common scheme: {task_id}:{step_index} hashed and encoded as a key. If the agent crashes mid-execution and resumes from a checkpoint, retried tool calls at the same step produce the same effect.
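The derived-key scheme is deterministic by construction — the same task and step always yield the same key, so a retried tool call deduplicates naturally. A minimal version of the hashing described above:

```python
import hashlib

def step_idempotency_key(task_id, step_index):
    """Derive a stable per-step key from the {task_id}:{step_index} scheme,
    so a retried tool call at the same step reuses the same key."""
    raw = f"{task_id}:{step_index}".encode()
    return hashlib.sha256(raw).hexdigest()
```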
Polling vs. Webhooks: A Practical Decision Framework
Teams reaching for async patterns typically want to implement webhooks immediately. Webhooks feel efficient — the server pushes notifications when work is done instead of the client burning requests in a polling loop. They are efficient, but they're not reliable, and the failure modes are subtle.
Webhooks get dropped by firewall rules. They fail silently when the receiving endpoint is temporarily down. They arrive out of order when retried. They require the client to run a publicly reachable HTTP server, which is impossible in some environments (mobile clients, scripts running on developer machines, tools behind NAT).
The practical recommendation: treat polling as the source of truth, and treat webhooks as an optimization.
Polling means the client periodically calls GET /tasks/{task_id} until it sees a terminal state. This is boring and reliable. The downside is latency and wasted requests during the polling interval. Use exponential backoff — start at one second, cap at thirty — and you'll barely register in your API metrics.
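The backoff loop described above can be sketched like this. The fetch_status callable and the parameter defaults are illustrative; the sleep function is injectable so the loop is testable without real delays.

```python
import time

def poll_until_done(fetch_status, base=1.0, cap=30.0,
                    max_wait=3600.0, sleep=time.sleep):
    """Poll with exponential backoff: start at one second, cap at thirty."""
    delay, waited = base, 0.0
    while waited < max_wait:
        status = fetch_status()
        if status in {"completed", "failed", "cancelled"}:
            return status
        sleep(delay)
        waited += delay
        delay = min(delay * 2, cap)  # 1s, 2s, 4s, ... capped at 30s
    raise TimeoutError("task did not reach a terminal state in time")
```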
Webhooks reduce that polling overhead for clients that can receive them. When a task completes, the server fires a signed POST to the client's callback URL. If the webhook succeeds, great. If it fails or the client doesn't register one, the polling fallback catches it. The two mechanisms compose correctly: the client polls until the webhook arrives, then stops polling.
Sign your webhooks. Include an event ID for deduplication. Retry with exponential backoff from the server side. Provide a replay endpoint so clients can re-fetch missed events.
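Signing and deduplication can both be handled with an HMAC over the payload plus an embedded event ID. A sketch using Python's standard library — the payload shape is an assumption, not a standard:

```python
import hashlib
import hmac
import json
import uuid

def build_webhook(secret: bytes, event_type: str, data: dict):
    """Build a signed webhook payload with an event ID for deduplication."""
    body = json.dumps({"event_id": str(uuid.uuid4()),
                       "type": event_type, "data": data}, sort_keys=True)
    signature = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return body, signature

def verify_webhook(secret: bytes, body: str, signature: str) -> bool:
    """Receiver side: recompute the HMAC and compare in constant time."""
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)
```

The receiver records seen event_id values to drop duplicates from server-side retries.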
State Persistence and the Checkpoint-Resume Pattern
The most common failure mode in long-running agent workflows isn't LLM errors — it's infrastructure interruptions. Servers restart. Network partitions isolate running jobs. Cloud spot instances get preempted mid-task. If your agent state lives only in the memory of the process running the loop, any of these events means starting over from scratch.
The fix is checkpointing: saving the agent's state to durable storage after each significant step. "State" means whatever the agent needs to resume without redoing work — the step index, the intermediate outputs produced so far, the memory contents, the list of tool calls that have already been made.
A checkpoint-resume workflow looks like this. Before executing each tool call, the agent persists a checkpoint. If the process restarts, it loads the most recent checkpoint and resumes from that step. Tool calls at checkpointed steps are skipped (because they already completed and their outputs are stored). The agent continues from where it left off.
This requires that tool calls be idempotent or that their outputs be stored as part of the checkpoint. For non-idempotent tools, you need to check whether the call already happened before invoking it again. The pattern: store a log of completed tool calls with their outputs in the task record. Before any tool invocation, check if it's already in the log.
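The check-the-log-first discipline looks like this in miniature. The task dict and key format are illustrative; in practice the log lives in the persisted task record alongside the checkpoint.

```python
def run_tool_once(task, step_index, tool_name, invoke):
    """Replay-safe tool execution: consult the completed-call log first."""
    log = task.setdefault("tool_log", {})
    key = f"{step_index}:{tool_name}"
    if key in log:          # already ran before a crash or restart
        return log[key]     # return the stored output; skip the side effect
    output = invoke()       # the side effect happens exactly once
    log[key] = output       # persisted with the checkpoint in a real system
    return output
```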
The checkpoint interval is a trade-off. More frequent checkpoints mean less work lost on failure, but more I/O overhead. For most agent workflows, checkpointing at each tool call boundary is reasonable.
What Most Agent Frameworks Get Wrong
The patterns above are well-understood in distributed systems. What's surprising is how poorly they're implemented in most agent frameworks.
The dominant failure mode is conflating the agent loop with the HTTP request lifecycle. Frameworks like LangChain and basic LlamaIndex configurations run the full agent loop synchronously inside a request handler. This works in demos. It breaks in production with any task that takes more than thirty seconds.
A second failure mode is missing idempotency at the task creation layer. Frameworks that don't support idempotency keys leave clients unable to safely retry failed submissions. Teams discover this when they start seeing duplicate task executions from network-level retries.
The third — and most insidious — failure mode is in-memory state. An agent that stores its working state as Python objects in a running process has no recovery path on crash. When the process restarts, the task disappears. The client keeps polling forever and gets no response. No error, no terminal state, just silence.
If you're building on a framework that doesn't address these, the pragmatic solution is to add a thin coordination layer on top. A simple task table in Postgres (id, idempotency_key, status, checkpoint_data, result) handles all three failure modes. The framework code runs in a worker process that loads tasks from the queue, checkpoints to that table, and updates the status on completion or failure. The HTTP API reads from the same table to serve polling requests.
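A minimal version of that coordination table, sketched with SQLite standing in for Postgres (the column set follows the description above; storing JSON as TEXT is an assumption of the sketch):

```python
import sqlite3

# SQLite as a stand-in for Postgres; same columns, same access pattern.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        id              TEXT PRIMARY KEY,
        idempotency_key TEXT UNIQUE,
        status          TEXT NOT NULL DEFAULT 'pending',
        checkpoint_data TEXT,
        result          TEXT
    )
""")
# The HTTP API inserts the row at submission time...
conn.execute("INSERT INTO tasks (id, idempotency_key) VALUES (?, ?)",
             ("task-1", "key-1"))
# ...the worker checkpoints and updates status as it runs...
conn.execute("UPDATE tasks SET status = 'working', "
             "checkpoint_data = ? WHERE id = ?",
             ('{"step": 2}', "task-1"))
# ...and polling requests read the very same row.
row = conn.execute("SELECT status, checkpoint_data FROM tasks WHERE id = ?",
                   ("task-1",)).fetchone()
```

The UNIQUE constraint on idempotency_key gives you duplicate-submission protection at the database level for free.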
Matching Infrastructure to Task Duration
Not every agent task needs the full async treatment. The overhead of job queues and polling isn't free.
A useful heuristic: synchronous execution for tasks under 10 seconds, async with polling for tasks under an hour, and a dedicated workflow engine (Temporal, Dagster, or similar) for tasks that need to span hours or days with human-in-the-loop steps.
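That heuristic condenses to a small routing function. The thresholds mirror the ones above; the mode names are hypothetical labels, not framework APIs.

```python
def choose_execution_mode(estimated_seconds, needs_human_input=False):
    """Route a task to infrastructure by expected duration."""
    if needs_human_input or estimated_seconds >= 3600:
        return "workflow_engine"   # Temporal/Dagster-class durability
    if estimated_seconds > 10:
        return "async_polling"     # 202 + task_id + polling
    return "synchronous"           # plain request/response
```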
The hour boundary matters because polling-based async works well when clients maintain a session and actively poll. For multi-hour tasks, clients often disconnect and reconnect. You need status visibility across sessions — which means persisting tasks indefinitely until they're explicitly acknowledged, not just until the polling session ends.
For tasks spanning hours or days, durable workflow engines provide what ad-hoc job queues don't: persistent timers, human input steps, compensation logic for rolling back completed steps on failure, and audit logs. The LLM loop becomes one component in a larger workflow definition rather than the entire program.
Closing Thoughts
The gap between an agent that works in a demo and an agent that works in production is mostly an infrastructure problem. The LLM reasoning code is often the easy part. The hard part is what happens when the reasoning takes longer than a connection can stay alive, when the process crashes mid-task, when a client retries a request that already went through.
The patterns for solving this — async task queues, idempotency keys, durable checkpoints, polling with webhook acceleration — are not new. They're the same patterns used in payment processing, report generation, and any other system where work happens asynchronously. Applying them to agent workflows is less about AI and more about taking distributed systems seriously.
The frameworks will catch up. Until they do, building this layer yourself is not a premature optimization. It's the difference between an agent that runs reliably in production and one that works until the first real workload arrives.
