Timeout-Aware Agent Design: How to Deliver Partial Results Instead of Silent Failure

· 10 min read
Tian Pan
Software Engineer

An agent successfully creates a GitHub issue, opens a Jira ticket, and updates a shared spreadsheet. Then it times out before sending the Slack announcement. The framework records the run as delivered. The user never gets notified. The side effects exist in three systems; the result that matters to the human doesn't.

This is the most common timeout failure mode in production agent systems, and it's almost never the one teams prepare for. Most agent implementations treat a timeout like any other exception: catch it, log it, return an error. The user gets nothing, even though the agent completed three of its four steps. The question isn't whether to set timeouts; every production system needs them. The question is what an agent does when the clock runs out.

The Anatomy of a Deadline Failure

Agent failures at deadline fall into two distinct categories. The first is loud: the agent crashes, throws an exception, and the caller sees an error. Annoying, but recoverable. The second is the one that causes real damage: the agent completes irreversible side effects — writes, notifications, external API calls — and then fails before delivering the result to the user.

The distinction matters because these two failure modes require entirely different mitigations. A loud failure needs retry logic and circuit breakers. A silent partial-completion failure needs transaction discipline: understanding which operations can be rewound and which cannot, and sequencing them accordingly.
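As a minimal sketch of that sequencing idea, the snippet below sorts steps so that reversible work runs before irreversible work, and on a timeout reports what is already committed and what is still pending. The `Step` type, the `reversible` flag, and the `execute` helper are hypothetical names introduced for illustration; they are not part of any framework mentioned here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    name: str
    run: Callable[[], None]
    reversible: bool  # assumption: the author of each step declares this


def execute(steps: List[Step]) -> dict:
    # Stable sort: reversible steps (key False) run before irreversible ones.
    ordered = sorted(steps, key=lambda s: not s.reversible)
    done: List[str] = []
    try:
        for step in ordered:
            step.run()
            done.append(step.name)
    except TimeoutError:
        # Report what already happened so a retry does not repeat it.
        pending = [s.name for s in ordered if s.name not in done]
        return {"status": "timeout", "committed": done, "pending": pending}
    return {"status": "complete", "committed": done, "pending": []}
```

The point of the ordering is that a timeout is more likely to land while only rewindable work has happened, and the report tells the caller exactly which side effects exist if it does not.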

A third problem that teams rarely account for is the initialization tax. A well-resourced agent initializing against memory stores, credential systems, and skill registries can burn 75 seconds of a 300-second budget before doing any productive work. Your timeout looks like 300 seconds but effectively behaves like 225. Timeouts sized from warm test runs, which skip cold start, tend to expire in production far more often than those tests predicted.
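One way to make the initialization tax visible is to start the clock before initialization rather than after it. The `Deadline` class below is a hypothetical helper, sketched under the assumption that a single wall-clock budget covers both startup and productive work; the injectable `clock` parameter exists only so the behavior can be tested without sleeping.

```python
import time
from typing import Callable


class Deadline:
    """Tracks one wall-clock budget shared by startup and productive work."""

    def __init__(self, budget_s: float,
                 clock: Callable[[], float] = time.monotonic):
        self._clock = clock
        # Created before initialization, so startup time counts against it.
        self._expires_at = clock() + budget_s

    def remaining(self) -> float:
        # Seconds left, never negative.
        return max(0.0, self._expires_at - self._clock())

    def expired(self) -> bool:
        return self.remaining() == 0.0
```

With a 300-second budget and 75 seconds spent on startup, `remaining()` reports 225, which is the number the agent should plan its work against.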

Why Agents Fail Completely Instead of Partially

The root cause is architectural. Most agent implementations model the full task as a single atomic unit. Either the task succeeds or it fails; there's no representation of intermediate state. When a timeout fires, the runtime has no checkpoint to return from, no partial schema to populate, and no signal to send the caller beyond "timed out."

This contrasts sharply with how mature distributed systems handle deadlines. A query read through a cursor can time out and still leave the client holding the rows it has already fetched. A streaming API can signal the client to close gracefully after the chunks already sent. A file download can resume from a byte offset. These patterns work because the underlying protocols were designed with partial completion in mind: the schema allows for it, the client expects it, and the timeout becomes a progress gate rather than a hard stop.

Agent loops inherited none of this. The ReAct loop — observe, reason, act, repeat — doesn't have a natural early-exit hook. LangChain's AgentExecutor added max_iterations and max_execution_time parameters, which helped contain runaway behavior. What they didn't add was a structured way to package whatever progress existed at the moment of interruption. The agent stops at iteration 4 of 10, and everything it learned in those four steps evaporates.
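A rough sketch of what an early-exit hook could look like: check the deadline between steps and return whatever has accumulated instead of raising. This is not LangChain's API. `run_loop` and its return shape are assumptions made for illustration, and a real ReAct loop would choose its next action dynamically rather than iterate over a fixed list.

```python
import time
from typing import Any, Callable, Dict, List


def run_loop(actions: List[Callable[[], Any]], budget_s: float,
             clock: Callable[[], float] = time.monotonic) -> Dict[str, Any]:
    deadline = clock() + budget_s
    observations: List[Any] = []
    for action in actions:
        # The check happens between steps; a step in flight is not interrupted.
        if clock() >= deadline:
            break
        observations.append(action())
    finished = len(observations) == len(actions)
    return {
        "status": "complete" if finished else "partial",
        "steps_done": len(observations),
        "steps_planned": len(actions),
        "observations": observations,  # kept even when the loop stops early
    }
```

Because the check only runs between steps, a single slow tool call can still overrun the budget; per-call timeouts would have to be layered on separately.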

Three Patterns That Actually Work

Checkpoint-First Execution

The most robust approach treats every agent step as a durable state transition rather than an ephemeral function call. After every LLM call and every tool return, the system writes a checkpoint: current context, accumulated results, pending actions. When a timeout fires, execution stops cleanly at the most recent checkpoint rather than at an arbitrary mid-step.

Durable execution frameworks like Temporal implement this automatically. Every workflow step is captured in an event history; if the process crashes mid-run, a new worker replays from the event log and resumes from where the previous worker stopped. The agent never re-does completed work, and completed results are never lost.

This pattern has costs. Checkpoint writes add latency to every step, and the event log grows proportionally to execution length. For short-running agents (under 30 seconds), the overhead often isn't worth it. For agents expected to run for minutes, checkpointing is the only way to make partial progress visible to callers.
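For intuition, here is a toy version of the checkpoint-after-every-step idea using a JSON file. It is a sketch under simplifying assumptions (one process, JSON-serializable step results, uniquely named steps) and is not how Temporal works internally; `load_checkpoint`, `save_checkpoint`, and `run_steps` are hypothetical helpers.

```python
import json
import os
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def load_checkpoint(path: Path) -> Dict[str, Any]:
    if path.exists():
        return json.loads(path.read_text())
    return {"results": {}}


def save_checkpoint(path: Path, state: Dict[str, Any]) -> None:
    # Write to a temp file, then swap it in, so a crash mid-write
    # cannot leave a truncated checkpoint behind.
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, path)


def run_steps(steps: List[Tuple[str, Callable[[], Any]]],
              path: Path) -> Dict[str, Any]:
    state = load_checkpoint(path)
    for name, fn in steps:
        if name in state["results"]:
            continue  # finished in an earlier run; do not redo it
        state["results"][name] = fn()
        save_checkpoint(path, state)  # durable before the next step starts
    return state
```

If a run dies partway through, calling `run_steps` again with the same path skips the steps already recorded and continues with the rest. The per-step write is the latency cost described above.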

Structured Partial-Result Schemas

Checkpointing stores progress for recovery. Partial-result schemas communicate that progress to callers. The difference is who the audience is: checkpoints are for the system, partial results are for the user.

A schema designed for partial completion marks fields as optional and includes a status envelope. Rather than returning a complete analysis object or nothing, the agent returns:

```json
{
  "status": "partial",
  "completed_steps": ["competitor_pricing", "market_share"],
  "missing_steps": ["financial_projections"],
  "reason": "timeout",
  "results": { ... }
}
```

The caller — whether a UI, another agent, or an orchestrator — can now make an informed decision about what to show. Eight out of ten competitor prices is immensely more useful than zero. A partial inventory report is better than nothing when a manager is deciding whether to place an emergency order.
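A small sketch of how an agent might assemble that envelope from whatever it has when the clock runs out. The field names follow the example above; the `AgentResult` dataclass, the `build_result` helper, and the `"complete"` status value are assumptions added for illustration.

```python
from dataclasses import asdict, dataclass
from typing import Any, Dict, List, Optional


@dataclass
class AgentResult:
    status: str                 # "partial", or "complete" (assumed value)
    completed_steps: List[str]
    missing_steps: List[str]
    reason: Optional[str]       # why steps are missing, e.g. "timeout"
    results: Dict[str, Any]


def build_result(planned: List[str], results: Dict[str, Any],
                 reason: Optional[str] = None) -> AgentResult:
    completed = [step for step in planned if step in results]
    missing = [step for step in planned if step not in results]
    return AgentResult(
        status="partial" if missing else "complete",
        completed_steps=completed,
        missing_steps=missing,
        reason=reason if missing else None,
        results=dict(results),
    )
```

Calling `asdict(build_result(planned, results, reason="timeout"))` yields a plain dictionary that can be serialized to JSON, and a caller only needs to branch on `status` and `missing_steps` to decide what to display.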
