
Deadline Propagation in Agent Chains: What Happens to Your p95 SLO at Hop Three

10 min read
Tian Pan
Software Engineer

Most engineers building multi-step agent pipelines discover the same problem about two weeks after their first production incident: they set a 5-second timeout on their API gateway, their agent pipeline has four hops, and the system behaves as though there is no timeout at all. The agent at hop three doesn't know the upstream caller gave up three seconds ago. It keeps running, keeps calling tools, keeps generating tokens—and the user is already gone.

This isn't a configuration mistake. It's a structural problem. Latency constraints don't propagate across agent boundaries by default, and none of the major orchestration frameworks make deadline propagation easy. The result is a class of failures that looks like a latency problem but is actually a context-propagation problem.

The Math That Breaks Your SLO Before You Write a Single Line of Code

Start with the simplest case: a pipeline with five sequential agent hops. Each hop has a p95 latency of 1 second. What's the p95 latency of the full pipeline?

The intuitive answer is 5 seconds. The correct answer is worse than that.

If each hop independently meets its p95 target 95% of the time, then all five hops meet their targets simultaneously with probability 0.95^5 ≈ 0.77. That means roughly 1 in 4 requests will exceed your per-hop p95—not because any individual service is slow, but because the probabilities multiply.

The effect compounds at the tail. At p99, a single hop at 99% becomes five hops at (0.99)^5 ≈ 0.95: the reliability you bought as a per-hop p99 degrades to roughly a p95 for the pipeline as a whole. If you have p99 SLO commitments, you're already in trouble before you measure anything.
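
The compounding is easy to verify with a few lines of Python:

```python
# Probability that every hop in an n-hop chain meets its per-hop
# percentile target on the same request, assuming independent hops.
def chain_success(per_hop: float, hops: int) -> float:
    return per_hop ** hops

for hops in (1, 3, 5):
    print(
        f"{hops} hop(s): p95 per hop -> {chain_success(0.95, hops):.3f}, "
        f"p99 per hop -> {chain_success(0.99, hops):.3f}"
    )
# 5 hop(s): p95 per hop -> 0.774, p99 per hop -> 0.951
```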

Real production pipelines make this worse in three ways:

  • Hops are not independent. Shared GPU infrastructure, rate-limit queues, and upstream API dependencies create correlated slowdowns that hit all hops simultaneously.
  • Each hop typically introduces coordination overhead: serialization, context assembly, tool dispatch, and response parsing, none of which shows up in model inference latency.
  • Retries at individual hops consume time that downstream hops can no longer use—and most frameworks implement per-hop retry logic without any awareness of the budget remaining for the full request.

The practical outcome: a pipeline that looks fast in unit tests reveals its true p95 only under load, when all these effects interact.

Why Deadlines Beat Timeouts for Distributed Systems

The vocabulary matters here. A timeout is a duration: "wait no longer than 3 seconds." A deadline is an absolute timestamp: "this request must complete by 14:32:07.500Z." The difference is subtle but critical when a request crosses service boundaries.

Consider a user request entering your system at T=0 with a 5-second SLO. Your orchestrator spends 800ms planning a tool call. If you pass a 5-second timeout to the first agent, that agent has 5 full seconds, but you only have 4.2 seconds remaining. Passing a timeout doesn't account for elapsed time.

If you pass a deadline instead—an absolute timestamp of T+5s—each downstream service can compute exactly how much time remains by subtracting the current time from the deadline. This is what gRPC does. The grpc-timeout header carries a single value that represents "how long remains," recalculated at each hop based on elapsed time. When the deadline expires, every service in the chain can detect it independently and cancel its own work.

This is the correct mental model for agent pipelines. The deadline belongs to the request, not the hop.
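
As a minimal sketch of that mental model, assuming a monotonic clock within one process and hypothetical hop names (a real system would serialize the deadline across the wire):

```python
import time

def remaining(deadline: float) -> float:
    # Every hop derives its own budget from the one request-scoped deadline.
    return deadline - time.monotonic()

def call_hop(name: str, deadline: float, work_seconds: float) -> None:
    budget = remaining(deadline)
    if budget <= 0:
        raise TimeoutError(f"{name}: deadline already passed, refusing to start")
    print(f"{name}: starting with {budget:.2f}s of budget")
    # Pass the *remaining* time as this hop's local timeout, the way
    # gRPC recomputes the grpc-timeout value at each hop.
    time.sleep(min(work_seconds, budget))  # stand-in for the real agent call

def handle_request() -> None:
    deadline = time.monotonic() + 5.0   # set once, at the edge
    time.sleep(0.8)                     # orchestrator spends 800ms planning
    call_hop("agent-1", deadline, 1.0)  # sees ~4.2s, not a fresh 5s
    call_hop("agent-2", deadline, 1.0)  # sees ~3.2s

handle_request()
```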

How Orchestration Frameworks Actually Handle This (Spoiler: Poorly)

LangChain supports per-tool timeouts using asyncio.wait_for(), and LangGraph has background job timeout configuration via environment variables. Neither framework propagates a deadline through the chain. When a tool times out in LangGraph, the timeout applies to that tool invocation in isolation. The chain doesn't know how much total budget remains; it will schedule another tool call with a fresh timeout as though the clock reset.
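
Neither framework stops you from building this yourself, though. Here is a sketch of a deadline-aware wrapper, where run_tool_with_deadline is an invented helper rather than a LangChain API, and each tool's asyncio.wait_for timeout is derived from the shared request deadline:

```python
import asyncio
import time

class DeadlineExceeded(Exception):
    pass

async def run_tool_with_deadline(tool_coro, deadline: float):
    # The timeout is the *remaining* request budget, not a fresh
    # per-tool value, so a slow first tool shrinks the second's window.
    budget = deadline - time.monotonic()
    if budget <= 0:
        raise DeadlineExceeded("budget exhausted before the tool call started")
    try:
        return await asyncio.wait_for(tool_coro, timeout=budget)
    except asyncio.TimeoutError as exc:
        raise DeadlineExceeded("tool call outlived the request deadline") from exc

async def main() -> None:
    deadline = time.monotonic() + 5.0
    await run_tool_with_deadline(asyncio.sleep(1), deadline)  # fake tool call
    await run_tool_with_deadline(asyncio.sleep(1), deadline)  # smaller budget now

asyncio.run(main())
```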

CrewAI exposes a max_execution_time parameter on the Agent class. Internally, it uses future.cancel()—which only cancels tasks that haven't started. An already-running LLM call will complete regardless of what future.cancel() returns, leaving orphaned threads that continue consuming tokens after the agent is "cancelled." Multiple open issues document resource leaks and orphaned threads resulting from this behavior.
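
The underlying semantics are easy to reproduce with the standard library alone; Future.cancel() is documented to fail once a call is executing, which is exactly the orphaned-work behavior described above (fake_llm_call is a stand-in):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_llm_call() -> str:
    time.sleep(2)  # stands in for a blocking LLM request
    return "tokens you still pay for"

with ThreadPoolExecutor(max_workers=1) as pool:
    future = pool.submit(fake_llm_call)
    time.sleep(0.1)          # the call is now executing on the worker
    print(future.cancel())   # False: a running future cannot be cancelled
    print(future.result())   # the call runs to completion anyway
```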

The OpenAI Agents SDK separates the SSE read timeout from the HTTP operation timeout, and a documented bug in the post_writer method means the timeout parameter doesn't propagate to the component that actually maintains the streaming connection. The result is timeout configuration that appears correct but doesn't fire as expected.

Temporal, which several teams use for durable agent orchestration, has a subtler failure mode. When a long-running LLM inference activity doesn't heartbeat within its heartbeat timeout, Temporal marks the activity as timed out and retries it on another worker—while the original activity is still running. You now have two concurrent LLM calls producing the same output, with state mutations happening twice. The fix requires explicitly calling activity.RecordHeartbeat() during inference, which none of the framework wrappers do by default.
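
With Temporal's Python SDK, the equivalent of Go's activity.RecordHeartbeat() is activity.heartbeat(), and the fix looks roughly like the sketch below. The call_llm stub and the 5-second cadence are placeholders; the cadence just needs to stay well inside the configured heartbeat timeout:

```python
import asyncio
from temporalio import activity

async def call_llm(prompt: str) -> str:
    await asyncio.sleep(60)  # stand-in for a long inference call
    return "completion"

@activity.defn
async def run_inference(prompt: str) -> str:
    # Run the long LLM call as a task and heartbeat while it is in
    # flight, so Temporal knows this attempt is alive and doesn't
    # retry it on another worker mid-inference.
    task = asyncio.create_task(call_llm(prompt))
    while True:
        done, _ = await asyncio.wait({task}, timeout=5)
        if done:
            return task.result()
        activity.heartbeat()
```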

The common thread: each framework implements timeouts as a local concern, without any mechanism for a request-scoped deadline to propagate through the full chain and trigger cancellation at every active hop.

The Token Waste Problem Nobody Budgets For

Abandoned requests are expensive. A user who hits the back button or refreshes the page does not cancel the work your agent pipeline is doing on their behalf. The pipeline runs to completion, generates a full response, and then discards the result—but you pay for every token.
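
No framework wires this up for you, but the disconnect signal is usually available at the edge. Here is a sketch using Starlette's request.is_disconnected() to cancel the pipeline task once the client is gone (run_pipeline and the polling interval are illustrative):

```python
import asyncio
from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import JSONResponse
from starlette.routing import Route

async def run_pipeline() -> str:
    await asyncio.sleep(30)  # stand-in for the multi-hop agent pipeline
    return "answer"

async def endpoint(request: Request) -> JSONResponse:
    task = asyncio.create_task(run_pipeline())
    while not task.done():
        if await request.is_disconnected():  # user hit back or refreshed
            task.cancel()                    # stop paying for dead work
            return JSONResponse({"detail": "client gone"}, status_code=499)
        await asyncio.sleep(0.5)
    return JSONResponse({"answer": task.result()})

app = Starlette(routes=[Route("/ask", endpoint)])
```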

One documented production case: 43% of monthly LLM API spend (over $1,200 out of $2,800) was traced to requests that ran to completion for users who had already left. The contributing factors were tool-calling retry loops that kept invoking expensive tools after the frontend had disconnected, and context windows that grew without bound across retries, making each subsequent attempt more expensive than the last.

The retry pattern is particularly dangerous. If a tool call fails, most frameworks retry it. Each retry re-sends the full accumulated context plus the new tool invocation, and in a multi-hop pipeline, retries at hop two consume time that hop three cannot reclaim. The downstream effect is that retries cascade: hop two burns its budget retrying, hop three gets a truncated window, hop three fails and retries in turn, and by the time the pipeline terminates naturally, the user-facing timeout fired seconds earlier.

The math compounds quickly. A five-hop pipeline where each hop retries twice with the default exponential backoff can consume 3× its expected token budget on a single failed request. Multiply by the fraction of requests that time out, and the total cost impact is measurable.
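
A budget-aware retry loop closes the hole: check the request-scoped deadline before every attempt and before every backoff sleep. A sketch, with op, the backoff constants, and the blanket exception handling all standing in for real code:

```python
import time

def retry_within_budget(op, deadline: float, base_backoff: float = 0.5):
    # Retry only while the request-scoped deadline has room for another
    # attempt. A per-hop retry loop with no budget check is how hop two
    # starves hop three.
    attempt = 0
    while True:
        budget = deadline - time.monotonic()
        if budget <= 0:
            raise TimeoutError("request budget exhausted, not retrying")
        try:
            return op(timeout=budget)  # the attempt never outlives the request
        except Exception:
            attempt += 1
            backoff = base_backoff * (2 ** attempt)
            if time.monotonic() + backoff >= deadline:
                raise TimeoutError("no room left for another attempt")
            time.sleep(backoff)
```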

The Architecture That Fixes It

The fix follows directly from everything above: treat the deadline as a property of the request, not of any hop. Mint an absolute deadline at the edge from the user-facing SLO, carry it in the context that travels with the request across every hop, recompute the remaining budget before starting any work, any tool call, and any retry, and cancel cooperatively the moment the budget hits zero. A minimal end-to-end sketch follows.
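
This sketch is illustrative rather than any framework's API: the payload shape, the deadline_epoch field, and the stage names are all invented for the example. Wall-clock time is used because the deadline has to survive serialization across process boundaries:

```python
import json
import time

def make_request(slo_seconds: float) -> dict:
    # The deadline is minted once, at the edge, from the user-facing SLO.
    # Epoch time (not time.monotonic()) so it survives serialization.
    return {"deadline_epoch": time.time() + slo_seconds, "trace": []}

def remaining(payload: dict) -> float:
    return payload["deadline_epoch"] - time.time()

def hop(payload: dict, name: str) -> dict:
    if remaining(payload) <= 0:
        raise TimeoutError(f"{name}: upstream gave up, cancel instead of compute")
    # ... real work goes here, bounded by timeout=remaining(payload) ...
    payload["trace"].append(name)
    return json.loads(json.dumps(payload))  # survives the service boundary

request = make_request(5.0)
for stage in ("planner", "retriever", "synthesizer", "verifier"):
    request = hop(request, stage)
print(request["trace"], f"{remaining(request):.2f}s left")
```

One design note: gRPC sends the remaining duration on the wire rather than an absolute timestamp, which sidesteps clock skew between machines. Either representation works as long as every hop recomputes its budget before doing work.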