Why Long-Running AI Agents Break in Production (And the Infrastructure to Fix It)
Most AI agent demos work beautifully.
They run in under 30 seconds, hit three tools, and return a clean result. Then someone asks the agent to do something that actually matters — cross-reference a codebase, run a multi-stage data pipeline, process a batch of documents — and the whole thing falls apart in a cascade of timeouts, partial state, and duplicate side effects.
The problem is not the model. It is the infrastructure. Agents that run for minutes or hours face a completely different class of systems problems than agents that finish in seconds, and most teams hit this wall at the worst possible time: after they have already shipped something users depend on.
The Math That Explains Most Agent Failures
There is a number worth keeping in mind when you design an agentic workflow: step count. Not task duration — step count.
A workflow with a 95% success rate per step and 10 steps completes successfully about 60% of the time. At 20 steps, that drops to 36%. At 30 steps, you are down to about 21%. Even if you push per-step reliability to 99%, a 20-step workflow still fails about 18% of the time.
This is not a model quality problem. It is compound probability applied to sequential systems, and it scales brutally. The fix is not to get a better model — it is to design around the math. Shorter, bounded workflows beat open-ended planning. Per-step retries with idempotency beat end-to-end reruns. Checkpoints at stage boundaries mean you recover one step, not restart everything.
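The compounding is easy to verify directly. This sketch assumes independent per-step failures, which real workflows only approximate:

```python
def workflow_success_rate(per_step: float, steps: int) -> float:
    """Probability that every step in a sequential workflow succeeds."""
    return per_step ** steps

# 95% per-step reliability degrades quickly with step count.
print(round(workflow_success_rate(0.95, 10), 2))  # 0.6
print(round(workflow_success_rate(0.95, 20), 2))  # 0.36
print(round(workflow_success_rate(0.95, 30), 2))  # 0.21

# Even at 99% per step, a 20-step workflow fails about 18% of the time.
print(round(1 - workflow_success_rate(0.99, 20), 2))  # 0.18
```

The takeaway is structural: halving step count helps more than marginal gains in per-step reliability.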
A study of 306 production AI practitioners across 26 domains found that 68% constrain their agents to bounded workflows rather than open-ended planning specifically for this reason. Unbounded autonomy sounds impressive in a demo. In production, it means you cannot predict your failure rate.
The 30-Second Synchronous Boundary
Synchronous HTTP works well for AI tasks that finish in under 30 seconds. The client blocks, the agent runs, the response returns. Simple and reliable.
Beyond that boundary, everything breaks. Load balancers time out. Mobile clients drop connections. Serverless functions hit execution limits. The user refreshes the page and triggers a second request. None of this surfaces in a local development environment where the agent runs on localhost with no timeout constraints.
The correct pattern is obvious in hindsight: return a task ID immediately, then let the client either poll or subscribe to a webhook for completion. The agent runs in a background worker with no connection to the originating HTTP request. This is how every reliable long-running job system works — background workers, job queues, async processing.
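A minimal sketch of the pattern. The names are illustrative, and an in-memory dict and a thread stand in for what would be a durable queue, a database, and a real worker pool in production:

```python
import threading
import uuid

# Hypothetical in-memory task store; production needs a durable queue + DB.
TASKS: dict[str, dict] = {}

def submit_task(payload: dict) -> str:
    """Return a task ID immediately; the agent runs in a background worker."""
    task_id = uuid.uuid4().hex
    TASKS[task_id] = {"status": "pending", "result": None}
    threading.Thread(target=_run_agent, args=(task_id, payload), daemon=True).start()
    return task_id

def _run_agent(task_id: str, payload: dict) -> None:
    TASKS[task_id]["status"] = "running"
    # ... agent loop: call LLM, execute tools, persist state after each step ...
    TASKS[task_id].update(status="done", result=f"processed {payload['doc']}")

def poll_task(task_id: str) -> dict:
    """Client polls (or a webhook fires) instead of holding an HTTP connection open."""
    return TASKS[task_id]
```

The HTTP handler calls `submit_task` and returns the ID in well under a second; the client polls `poll_task` on its own schedule, surviving page refreshes and dropped connections.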
What makes AI agents different is everything that has to be managed alongside the job:
- LLM context needs to be serialized and stored, not held in process memory
- Tool call results need to be persisted so retries do not re-execute side effects
- Progress signals need to reach a frontend that has long since disconnected from the originating request
- Dual rate limits — requests per minute and tokens per minute — require the queue to compute "when can the next task safely run" rather than just reacting to 429 errors
Teams that route AI tasks through a standard job queue without accounting for these concerns end up building the accounting layer themselves. Or they hit production incidents and build it reactively.
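Computing "when can the next task safely run" under dual limits is a max over two constraints. A sketch under a simplified fixed 60-second accounting window (real schedulers typically use sliding windows or token buckets), with all names hypothetical:

```python
def next_safe_start(
    now: float,
    window_start: float,
    requests_used: int,
    tokens_used: int,
    est_tokens: int,
    rpm_limit: int,
    tpm_limit: int,
) -> float:
    """Earliest time the next task can start without breaching RPM or TPM.

    Proactive scheduling: the queue computes this instead of reacting to 429s.
    """
    window_end = window_start + 60.0
    # Both limits must hold; whichever is exhausted first delays the task.
    if requests_used + 1 > rpm_limit or tokens_used + est_tokens > tpm_limit:
        return max(now, window_end)  # wait for the window to reset
    return now

# Room left under both limits: run immediately.
print(next_safe_start(10.0, 0.0, 5, 1000, 500, 60, 90000))   # 10.0
# Token budget exhausted even though requests remain: wait for the window.
print(next_safe_start(10.0, 0.0, 5, 89800, 500, 60, 90000))  # 60.0
```

The key design point is that token consumption must be estimated before dispatch, which is why a generic job queue — which only counts jobs — cannot do this without an AI-specific accounting layer.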
Checkpoints Are Not the Same as Durable Execution
Most agent frameworks offer some form of checkpointing. LangGraph's PostgresSaver, for example, persists graph state at each node so that a failed workflow can be resumed. This is genuinely useful and far better than nothing.
But checkpointing and durable execution are not the same thing, and the distinction matters in production.
Checkpointing says: "I saved your state. You take it from here." If a worker crashes, someone or something has to detect the failure, fetch the checkpoint, and manually trigger resumption with the correct thread identifier. If two workers try to resume the same thread simultaneously, you get duplicate execution. The framework provides the persistence; you provide the orchestration.
Durable execution says: "Your workflow will run to completion. Period." The runtime detects failures through heartbeats, automatically replays event history to reconstruct state, and resumes at the exact failed step without re-running completed work. No custom watchdog logic, no distributed coordination to handle manually.
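With checkpointing alone, the failure-detection and exclusive-resume logic is yours to build. A minimal sketch of that gap, with in-memory structures standing in for database-backed heartbeats and leases (single-process only; real coordination needs row locks or lease expiry):

```python
# Hypothetical stores; production would back these with a database.
HEARTBEATS: dict[str, float] = {}   # thread_id -> last heartbeat timestamp
RESUME_LOCKS: set[str] = set()      # thread_ids currently being resumed

STALE_AFTER = 30.0  # seconds without a heartbeat before a worker is presumed dead

def find_stale_threads(now: float) -> list[str]:
    """Failure detection: checkpointing frameworks do not run this watchdog for you."""
    return [t for t, beat in HEARTBEATS.items() if now - beat > STALE_AFTER]

def try_resume(thread_id: str) -> bool:
    """Exclusive resume lock so two workers cannot replay the same thread."""
    if thread_id in RESUME_LOCKS:
        return False  # another worker already claimed this thread
    RESUME_LOCKS.add(thread_id)
    # ... fetch checkpoint for thread_id, re-enqueue from the failed step ...
    return True
```

A durable execution runtime owns both of these pieces — heartbeat monitoring and exclusive replay — which is exactly the custom code teams end up maintaining otherwise.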
Tools like Temporal represent the durable execution model. One team that originally built their production agent system on LangGraph and Redis described the result as "powerful in concept but brittle in practice" — they migrated to Temporal specifically to eliminate the custom retry and recovery logic they had been maintaining. Newer purpose-built platforms like Trigger.dev and Inngest are pursuing a similar model with tighter AI-specific ergonomics.
For most teams building something new, the practical question is: how much orchestration logic are you willing to own? LangGraph with PostgresSaver is a reasonable production choice if you have the engineering capacity to build failure detection and recovery on top. If you want that handled for you, a durable execution runtime is worth the operational overhead.
Idempotency Is a First-Class Concern
The subtle failure mode that breaks checkpointing and retry logic is non-idempotent side effects. When an agent sends an email, writes to a database, or calls a payment API, and then the worker crashes before updating its checkpoint, the next retry will attempt to perform the same action again.
The pattern that prevents this is already well-established in distributed systems: every external write carries an idempotency key tied to the workflow identity and step number. If the same key is submitted twice, the second call is a no-op that returns the original result. Most modern APIs support this natively. The weak point is usually the agent itself, calling APIs without idempotency keys because it was not designed with retry scenarios in mind.
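A sketch of deriving the key from workflow identity and step number. The in-memory cache here is a stand-in for what an API provider stores server-side, and `send_email` is a hypothetical example, not a real client call:

```python
import hashlib

# Stand-in for the provider's server-side view: responses keyed by idempotency key.
SEEN: dict[str, str] = {}

def idempotency_key(workflow_id: str, step: int) -> str:
    """Stable key: the same workflow step always produces the same key,
    so a retry of that step reuses it rather than minting a new one."""
    return hashlib.sha256(f"{workflow_id}:{step}".encode()).hexdigest()

def send_email(workflow_id: str, step: int, body: str) -> str:
    key = idempotency_key(workflow_id, step)
    if key in SEEN:
        return SEEN[key]  # duplicate submission: no-op, original result returned
    result = f"sent:{body}"  # stand-in for the real side effect
    SEEN[key] = result
    return result
```

Because the key is derived rather than random, a crashed-and-retried step cannot accidentally generate a fresh key and re-send the email.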
This has real consequences. A compound failure scenario — an agent invents incorrect information, uses that to call a downstream API, the call fails halfway through, and a retry re-executes the whole sequence — can trigger a chain of duplicate side effects across multiple systems. Idempotency keys at every external-write step prevent the duplicate half even if they cannot prevent the underlying hallucination.
The practical checklist: every external write needs an idempotency key. Every queue should deduplicate at enqueue time. Every retry should pull context from persisted state, not reconstruct it from scratch.
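Deduplication at enqueue time is the same idea applied one layer earlier — a refreshed page or a retried webhook should not enqueue a second copy of the task. A minimal sketch with illustrative names:

```python
from collections import deque

QUEUE: deque = deque()
ENQUEUED: set[str] = set()  # dedup keys for tasks already accepted

def enqueue(dedup_key: str, payload: dict) -> bool:
    """Reject duplicates at enqueue time, e.g. a user double-submitting a task."""
    if dedup_key in ENQUEUED:
        return False  # already in flight; caller gets the existing task
    ENQUEUED.add(dedup_key)
    QUEUE.append(payload)
    return True
```

The dedup key would typically be derived from the request contents or a client-supplied request ID, mirroring the idempotency-key pattern for external writes.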
Human-in-the-Loop Is a Systems Design Decision
Human oversight in long-running agents is typically framed as a UX concern — show a confirmation dialog before something risky happens. In practice, it is a systems design decision with significant architectural implications.
Pausing a multi-step agent workflow to wait for a human response means the worker must release its execution context, store enough state to resume cleanly, and wait indefinitely — possibly for hours or days. This is not how most job queue implementations work. Workers are expected to either complete a job or fail it. "Waiting for input" is a third state that most systems were not designed to handle.
Frameworks that support this natively do it through explicit interrupt mechanisms. LangGraph's interrupt() function pauses the graph at a defined node, saves state, and resumes cleanly when input arrives. Temporal supports it through Signal handlers that workflows can block on indefinitely. The key property is that "waiting for human" is a first-class state in the execution model, not a workaround that involves polling a database in a hot loop.
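The essential property — "waiting for human" as an explicit state rather than a hot polling loop — can be sketched as a small state machine. This is an illustration of the idea, not how LangGraph or Temporal implement it internally:

```python
from enum import Enum

class State(Enum):
    RUNNING = "running"
    WAITING_FOR_HUMAN = "waiting_for_human"
    DONE = "done"

class Workflow:
    """'Waiting for human' is a first-class state the worker can park in."""

    def __init__(self) -> None:
        self.state = State.RUNNING
        self.pending_question: str | None = None

    def interrupt(self, question: str) -> None:
        # The worker persists state and releases its execution context here;
        # nothing runs while the workflow waits, possibly for days.
        self.state = State.WAITING_FOR_HUMAN
        self.pending_question = question

    def resume(self, answer: str) -> None:
        assert self.state is State.WAITING_FOR_HUMAN
        # ... feed `answer` back into the persisted agent context ...
        self.pending_question = None
        self.state = State.RUNNING
```

In a real system the `Workflow` object would be serialized to storage on `interrupt` and rehydrated on `resume`; the point is that the transition is modeled explicitly, not improvised.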
The interrupt points worth designing into a workflow are:
- Confidence threshold violations: when the agent is uncertain about intent, ask before guessing
- Irreversibility boundaries: before deleting, modifying production systems, or triggering payments
- Cost checkpoints: in workflows that generate significant token spend, verify progress at milestones before continuing
- Ambiguous inputs: rather than hallucinating a resolution to an ambiguous instruction, surface the ambiguity
Getting this right requires designing interrupt points into the workflow topology before implementation, not adding them as an afterthought when something goes wrong in production.
The 35-Minute Degradation Problem
There is an empirically observed reliability cliff that researchers have named but not yet fully explained: AI agents degrade in reliability after approximately 35 minutes of continuous operation. The leading hypothesis is context window saturation combined with accumulated reasoning drift — each decision incorporates the noise from all previous decisions, and that noise compounds.
The practical implication is that long-horizon tasks should not be implemented as a single continuous agent run. They should be decomposed into sessions with explicit handoff points.
Anthropic's own approach for multi-session coding tasks is illustrative. An initializer agent sets up foundational state in the first session. Subsequent sessions run a coding agent that reconstructs context from git commit history and a persisted progress document, then makes incremental progress toward the goal. Each session is bounded. Context is reconstructed from durable artifacts, not carried forward from a live context window.
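The session-handoff pattern above can be sketched as a loop that reconstructs context from a persisted progress document rather than a live context window. The names and structure here are illustrative, not Anthropic's implementation:

```python
def run_bounded_session(progress_doc: dict, max_steps: int = 5) -> dict:
    """One bounded session: read durable state, make incremental progress, write it back."""
    done = progress_doc["completed"]
    remaining = [t for t in progress_doc["tasks"] if t not in done]
    for task in remaining[:max_steps]:  # the session is bounded, never open-ended
        done.append(task)  # stand-in for real agent work on `task`
    progress_doc["finished"] = len(done) == len(progress_doc["tasks"])
    return progress_doc  # the handoff artifact for the next session

doc = {"tasks": ["parse", "migrate", "test", "deploy", "verify", "report"],
       "completed": [], "finished": False}
doc = run_bounded_session(doc)  # first session makes bounded progress
doc = run_bounded_session(doc)  # next session picks up from the artifact
print(doc["finished"])          # True
```

Each session starts from the durable artifact, so a crashed or degraded session costs at most one session's work — and the artifact itself is inspectable when debugging.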
This pattern — bounded sessions, external state for continuity, explicit handoff artifacts — transfers to most long-horizon task domains. The agent that ran for two hours is not more capable than two agents that each ran for one hour with a clean handoff between them. It is probably less capable, and significantly harder to debug when something goes wrong.
What Most Teams Get Wrong
The pattern from real production deployments is consistent: teams build a prototype that works in demos, ship it to users, discover that it breaks on long tasks, and then retrofit async infrastructure around an architecture that was not designed for it.
The retrofitting is always more expensive than building it right initially, because the places where synchronous assumptions are baked in tend to be load-bearing — API response contracts, frontend state management, error handling, cost attribution.
The infrastructure that long-running agent tasks actually require is not novel. Job queues, durable execution, idempotency keys, and human-in-the-loop workflows are solved problems in distributed systems. The novelty is in applying them to AI-specific workloads where context management, dual rate limiting, and non-deterministic failure modes add friction. Frameworks and platforms are rapidly standardizing on patterns that handle this friction — but the fundamental decision to design for async from the beginning remains the engineer's.
If you are building an agent that will handle tasks longer than 30 seconds, design for async first. Return a task ID before you run anything. Persist every piece of state you will need to resume. Make every external write idempotent. Define your human interrupt points explicitly. Bound your session lengths. These are not optimizations to add later — they are the foundation that makes everything else reliable.
- https://workos.com/blog/mcp-async-tasks-ai-agent-workflows
- https://www.diagrid.io/blog/checkpoints-are-not-durable-execution-why-langgraph-crewai-google-adk-and-others-fall-short-for-production-agent-workflows
- https://temporal.io/blog/build-resilient-agentic-ai-with-temporal
- https://blog.logrocket.com/ai-agent-task-queues/
- https://www.prodigaltech.com/blog/why-most-ai-agents-fail-in-production
- https://eunomia.dev/blog/2025/05/11/checkpointrestore-systems-evolution-techniques-and-applications-in-ai-agents/
- https://zylos.ai/research/2026-03-04-ai-agent-workflow-checkpointing-resumability
- https://www.permit.io/blog/human-in-the-loop-for-ai-agents-best-practices-frameworks-use-cases-and-demo
- https://zylos.ai/research/2026-01-16-long-running-ai-agents
- https://arxiv.org/html/2512.08769v1
