Deadline Propagation in Agent Chains: What Happens to Your p95 SLO at Hop Three
Most engineers building multi-step agent pipelines discover the same problem about two weeks after their first production incident: they set a 5-second timeout on their API gateway, their agent pipeline has four hops, and the system behaves as though there is no timeout at all. The agent at hop three doesn't know the upstream caller gave up three seconds ago. It keeps running, keeps calling tools, keeps generating tokens—and the user is already gone.
This isn't a configuration mistake. It's a structural problem. Latency constraints don't propagate across agent boundaries by default, and none of the major orchestration frameworks make deadline propagation easy. The result is a class of failures that looks like latency problems but is actually a context propagation problem.
The Math That Breaks Your SLO Before You Write a Single Line of Code
Start with the simplest case: a pipeline with five sequential agent hops. Each hop has a p95 latency of 1 second. What's the p95 latency of the full pipeline?
The intuitive answer is 5 seconds. The correct answer is worse than that.
If each hop independently meets its p95 target 95% of the time, then all five hops meet their targets simultaneously with probability 0.95^5 ≈ 0.77. That means roughly 1 in 4 requests will exceed your per-hop p95—not because any individual service is slow, but because the probabilities multiply.
The effect compounds at the tail. A single hop that meets its p99 target 99% of the time becomes, over five hops, (0.99)^5 ≈ 0.95: the probability that every hop stays under its per-hop p99 is only 95%, the same odds a single hop gives you at p95. If you have p99 SLO commitments, you're already in trouble before you measure anything.
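The compounding is easy to check numerically. A quick sketch using the figures above:

```python
# Probability that ALL hops meet their per-hop latency target simultaneously.
# With independent hops, the probabilities multiply.

def chain_success(per_hop_quantile: float, hops: int) -> float:
    return per_hop_quantile ** hops

# Five hops, each meeting its per-hop p95 target 95% of the time:
p95_chain = chain_success(0.95, 5)  # ≈ 0.774 → ~23% of requests exceed some per-hop p95
p99_chain = chain_success(0.99, 5)  # ≈ 0.951 → the chain at p99 behaves like one hop at p95

print(f"all five hops under their p95: {p95_chain:.3f}")
print(f"all five hops under their p99: {p99_chain:.3f}")
```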
Real production pipelines make this worse in three ways:
- Hops are not independent. Shared GPU infrastructure, rate-limit queues, and upstream API dependencies create correlated slowdowns that hit all hops simultaneously.
- Each hop typically introduces coordination overhead: serialization, context assembly, tool dispatch, and response parsing that doesn't show up in the model inference latency.
- Retries at individual hops consume time that downstream hops can no longer use—and most frameworks implement per-hop retry logic without any awareness of the budget remaining for the full request.
The practical outcome: a pipeline that looks fast in unit tests reveals its true p95 only under load, when all these effects interact.
Why Deadlines Beat Timeouts for Distributed Systems
The vocabulary matters here. A timeout is a duration: "wait no longer than 3 seconds." A deadline is an absolute timestamp: "this request must complete by 14:32:07.500Z." The difference is subtle but critical when a request crosses service boundaries.
Consider a user request entering your system at T=0 with a 5-second SLO. Your orchestrator spends 800ms planning a tool call. If you pass a 5-second timeout to the first agent, that agent believes it has 5 full seconds—but the request only has 4.2 seconds remaining. Passing a timeout doesn't account for elapsed time.
If you pass a deadline instead—an absolute timestamp of T+5s—each downstream service can compute exactly how much time remains by subtracting the current time from the deadline. This is what gRPC does. The grpc-timeout header carries a single value that represents "how long remains," recalculated at each hop based on elapsed time. When the deadline expires, every service in the chain can detect it independently and cancel its own work.
This is the correct mental model for agent pipelines. The deadline belongs to the request, not the hop.
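A minimal sketch of the difference, with illustrative function names; the 800ms planning step is simulated with a short sleep so the example runs quickly:

```python
import time

SLO_SECONDS = 5.0

def plan_tool_call() -> None:
    """Stand-in for orchestrator planning work (~800ms in the example above)."""
    time.sleep(0.05)  # shortened so the sketch runs quickly

def call_agent(deadline: float) -> float:
    """Downstream hop: derives its own budget from the absolute deadline."""
    remaining = deadline - time.time()
    if remaining <= 0:
        raise TimeoutError("deadline exceeded before the hop started")
    return remaining

def handle_request() -> float:
    # A timeout is a duration and resets at every hop; a deadline is an
    # absolute timestamp owned by the request, so elapsed time is accounted for.
    deadline = time.time() + SLO_SECONDS
    plan_tool_call()
    # The downstream hop sees less than 5s remaining, not a fresh 5s window.
    return call_agent(deadline=deadline)

print(f"budget left at hop 1: {handle_request():.2f}s")
```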
How Orchestration Frameworks Actually Handle This (Spoiler: Poorly)
LangChain supports per-tool timeouts using asyncio.wait_for(), and LangGraph has background job timeout configuration via environment variables. Neither framework propagates a deadline through the chain. When a tool times out in LangGraph, the timeout applies to that tool invocation in isolation. The chain doesn't know how much total budget remains; it will schedule another tool call with a fresh timeout as though the clock reset.
CrewAI exposes a max_execution_time parameter on the Agent class. Internally, it uses future.cancel()—which only cancels tasks that haven't started. An already-running LLM call will complete regardless of what future.cancel() returns, leaving orphaned threads that continue consuming tokens after the agent is "cancelled." Multiple open issues document resource leaks and orphaned threads resulting from this behavior.
The OpenAI Agents SDK separates the SSE read timeout from the HTTP operation timeout, and a documented bug in the post_writer method means the timeout parameter doesn't propagate to the component that actually maintains the streaming connection. The result is timeout configuration that appears correct but doesn't fire as expected.
Temporal, which several teams use for durable agent orchestration, has a subtler failure mode. When a long-running LLM inference activity doesn't heartbeat within its heartbeat timeout, Temporal marks the activity as timed out and retries it on another worker—while the original activity is still running. You now have two concurrent LLM calls producing the same output, with state mutations happening twice. The fix requires explicitly calling activity.RecordHeartbeat() during inference, which none of the framework wrappers do by default.
The common thread: each framework implements timeouts as a local concern, without any mechanism for a request-scoped deadline to propagate through the full chain and trigger cancellation at every active hop.
The Token Waste Problem Nobody Budgets For
Abandoned requests are expensive. A user who hits the back button or refreshes the page does not cancel the work your agent pipeline is doing on their behalf. The pipeline runs to completion, generates a full response, and then discards the result—but you pay for every token.
One documented production case: 43% of monthly LLM API spend—over $2,800—was traced to requests that ran to completion for users who had already left. The contributing factors were tool-calling retry loops that kept invoking expensive tools after the frontend had disconnected, and context windows that grew without bound across retries, making each subsequent attempt more expensive than the last.
The retry pattern is particularly dangerous. If a tool call fails, most frameworks retry it. Each retry re-sends the full accumulated context plus the new tool invocation—and in a multi-hop pipeline, retries at hop two consume time that hop three cannot reclaim. The downstream effect is that retries cascade: hop two burns its budget retrying, hop three gets a truncated window, hop three fails, hop three retries, and by the time the pipeline terminates naturally, the user timeout fired seconds earlier.
The math compounds quickly. A five-hop pipeline where each hop retries twice with the default exponential backoff can consume 3× its expected token budget on a single failed request. Multiply by the fraction of requests that time out, and the total cost impact is measurable.
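A toy model of that accounting, assuming each attempt re-sends the full accumulated context and each failed attempt adds a fixed chunk of output to it:

```python
def tokens_spent(base_context: int, chunk: int, attempts: int) -> int:
    """Total tokens sent when each attempt re-sends the full accumulated context."""
    total = 0
    context = base_context
    for _ in range(attempts):
        total += context
        context += chunk  # the failed attempt's output stays in the context
    return total

one_try = tokens_spent(4000, 1000, 1)       # 4000 tokens
with_retries = tokens_spent(4000, 1000, 3)  # 4000 + 5000 + 6000 = 15000 tokens
print(with_retries / one_try)               # 3.75x the single-attempt cost
```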
The Architecture That Fixes It
The core change is treating deadline as a first-class property of a request context, not a timeout configuration on individual components.
Set the deadline once, at the entry point. Your API gateway or request handler computes an absolute deadline based on your SLO target: deadline = now + budget. Pass this deadline—not a timeout duration—to every downstream call.
Propagate it through every hop. In Go, context.WithDeadline() handles this natively: pass the context to every downstream call, and cancellation flows automatically when the deadline expires. In Python, you propagate the deadline value explicitly and recompute remaining time at each hop with remaining = deadline - time.time(). In HTTP-based agent frameworks, add an X-Request-Deadline header that each agent reads and respects.
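For HTTP hops, propagation can be as small as one header write and one header read. A sketch assuming an epoch-milliseconds encoding for the X-Request-Deadline header (the encoding is a choice, not a standard):

```python
import time

DEADLINE_HEADER = "X-Request-Deadline"  # absolute deadline, epoch milliseconds

def attach_deadline(headers: dict, deadline: float) -> dict:
    """Entry point: stamp the request's absolute deadline onto outgoing headers."""
    headers[DEADLINE_HEADER] = str(int(deadline * 1000))
    return headers

def remaining_budget(headers: dict) -> float:
    """Each hop recomputes its own budget by subtracting now from the deadline."""
    deadline = int(headers[DEADLINE_HEADER]) / 1000.0
    return deadline - time.time()

# The deadline is set once; every downstream hop just subtracts.
headers = attach_deadline({}, time.time() + 5.0)
print(f"remaining: {remaining_budget(headers):.2f}s")
```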
Check remaining time before expensive operations. Before starting a long-running tool call or LLM inference, check whether enough budget remains to plausibly complete it. If remaining < estimated_cost, fail fast with a "deadline exceeded" error rather than starting work you cannot finish. This is the pattern gRPC services implement internally, and it's what prevents the cascade of started-but-doomed work that burns tokens without producing useful output.
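The guard itself is a few lines. Here estimated_cost is a hypothetical per-operation estimate you would maintain from observed latencies:

```python
import time

class DeadlineExceeded(Exception):
    pass

def run_if_budget_allows(deadline: float, estimated_cost: float, operation):
    """Refuse to start work that cannot plausibly finish before the deadline."""
    remaining = deadline - time.time()
    if remaining < estimated_cost:
        # Fail fast instead of starting doomed work that burns tokens.
        raise DeadlineExceeded(
            f"{remaining:.2f}s left, operation needs ~{estimated_cost:.2f}s"
        )
    return operation()

# Example: a tool call estimated at 2s against a nearly exhausted budget.
deadline = time.time() + 0.5
try:
    run_if_budget_allows(deadline, estimated_cost=2.0, operation=lambda: "result")
except DeadlineExceeded as e:
    print(f"skipped: {e}")
```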
Implement structured cancellation. When the deadline fires, cancellation must propagate down, not just stop waiting for results. This means:
- HTTP clients: forward AbortSignal to the underlying fetch call so the connection terminates
- LLM API calls: pass the abort signal to the streaming API; most providers will stop generation mid-token
- Background tool calls: maintain a registry of in-flight operations and cancel them explicitly
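In Python, asyncio supplies the mechanics for the last point: cancelling a task raises CancelledError inside it, which is the hook for real teardown. A minimal sketch of the in-flight registry idea (the registry and helper names are illustrative):

```python
import asyncio

in_flight: set = set()  # registry of active downstream operations

async def tracked(coro):
    """Run a downstream operation, keeping it in the registry while active."""
    task = asyncio.ensure_future(coro)
    in_flight.add(task)
    try:
        return await task
    finally:
        in_flight.discard(task)

async def slow_tool_call():
    try:
        await asyncio.sleep(10)  # stand-in for a long LLM or tool call
        return "done"
    except asyncio.CancelledError:
        # Real teardown goes here: close the HTTP stream, abort generation.
        raise

async def main():
    work = asyncio.ensure_future(tracked(slow_tool_call()))
    await asyncio.sleep(0.01)
    for task in list(in_flight):  # deadline fired: cancel everything in flight
        task.cancel()
    try:
        await work
    except asyncio.CancelledError:
        print("cancelled cleanly")

asyncio.run(main())
```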
Reserve budget for overhead. A useful heuristic from SLO-first development: allocate no more than 80% of your total budget to planned operations, and hold 20% as reserve for coordination overhead, serialization, and error handling. For a 5-second SLO with three agent hops of equal weight, each hop gets roughly 1.3 seconds, not 1.67.
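The reserve heuristic in numbers, for the 5-second, three-hop case:

```python
SLO_S = 5.0
RESERVE_FRACTION = 0.20  # held back for coordination, serialization, error handling
HOPS = 3

usable = SLO_S * (1 - RESERVE_FRACTION)  # 4.0s of plannable budget
per_hop = usable / HOPS
print(f"{per_hop:.2f}s per hop")         # 1.33s, not the naive 5/3 ≈ 1.67s
```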
What a Budget-Aware Pipeline Looks Like
A concrete example with a 4-second SLO and three hops:
Entry: record deadline = now + 4000ms
Hop 1 (planner): budget = min(1500ms, remaining)
→ executes, records elapsed, passes remaining deadline
Hop 2 (retriever): remaining = deadline - now ≈ 2400ms, budget = min(1500ms, remaining)
→ if remaining < 300ms, return deadline_exceeded immediately
→ executes, records elapsed
Hop 3 (generator): remaining = deadline - now ≈ 900ms
→ if remaining < 500ms (minimum viable generation time), return deadline_exceeded
→ constrains max_tokens based on remaining time estimate
At each hop, two things happen: the deadline is checked against a minimum viable execution window, and the remaining budget is passed forward. Hop three doesn't get a fresh 4-second window—it gets whatever the chain didn't spend.
This structure also surfaces real cost data. When hops consistently return deadline_exceeded at the same stage, you know where your budget is going. When a tool call consistently consumes 80% of the hop budget, that's where optimization effort should focus.
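The trace above translates nearly line-for-line into code. A sketch using the same numbers, with stub lambdas standing in for real planner, retriever, and generator calls:

```python
import time

class DeadlineExceeded(Exception):
    pass

def run_hop(name, deadline, cap_s, min_viable_s, work):
    """Check the minimum viable window, then run with min(cap, remaining)."""
    remaining = deadline - time.time()
    if remaining < min_viable_s:
        raise DeadlineExceeded(f"{name}: only {remaining * 1000:.0f}ms left")
    budget = min(cap_s, remaining)
    return work(budget)  # the hop must respect its own budget internally

def pipeline(slo_s=4.0):
    deadline = time.time() + slo_s  # set once, at the entry point
    run_hop("planner",   deadline, 1.5, 0.1, lambda b: "plan")
    run_hop("retriever", deadline, 1.5, 0.3, lambda b: "docs")
    # The generator gets whatever the chain didn't spend, never a fresh
    # window; a real hop would derive max_tokens from this budget.
    return run_hop("generator", deadline, slo_s, 0.5,
                   lambda b: f"answer in {b:.1f}s")

print(pipeline())
```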
The Monitoring You Need to Enforce It
Deadline propagation without observability is incomplete. The signals worth tracking:
- Hop-level latency distribution (not just pipeline total). Per-hop p50/p95/p99 helps you see which stages are budget consumers.
- Deadline-exceeded rate per hop. If hop three returns deadline_exceeded on 15% of requests, the real problem is upstream hops running long—not hop three being slow.
- Token spend on abandoned requests. Correlate request completion status with token usage to quantify the cost of requests that ran after their deadline expired.
- Cancellation propagation success rate. Confirm that when a frontend client disconnects, your backend pipeline actually cancels. Many teams assume this happens; few have verified it.
The absence of this data is why deadline problems typically surface as "costs are higher than expected" or "some users see slow responses" rather than as the structural failure they are.
The Default Is Wrong
Most agent orchestration frameworks default to no deadline propagation. Each hop gets its configured timeout from scratch. The user-facing SLO exists in the API gateway configuration and nowhere else in the stack.
For simple single-hop agent calls, this is probably fine. For pipelines with more than two hops, retries at any layer, or tool calls that consume significant latency budget, the default is a guaranteed path to tail latency blowouts and unexpected token costs.
The fix isn't complicated—it's a discipline problem. Deadline must be treated as a property of the request, not the service. Once that mental model is in place, the implementation follows naturally from whatever distributed context mechanism your stack already uses.
Start with instrumentation: add per-hop latency tracking and correlate it with pipeline outcomes. The data will show you exactly where the budget goes. Then set a deadline at the entry point and propagate it forward. The cascade of started-but-doomed work that characterizes most agent timeout failures will stop—and so will the token costs that come with it.
Sources

- https://grpc.io/docs/guides/deadlines/
- https://engineering.grab.com/context-deadlines-and-how-to-set-them
- https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
- https://www.xgrid.co/resources/temporal-ai-agent-orchestration-failure-patterns/
- https://medium.com/@connect.hashblock/7-langchain-retry-timeout-patterns-for-flaky-tools-a371c3edc1d3
- https://medium.com/@ThinkingLoop/slo-first-development-10-latency-budgets-you-can-keep-6bcdb19e9c95
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://dev.to/buildwithabid/how-i-found-1240month-in-wasted-llm-api-costs-and-built-a-tool-to-find-yours-3041
