Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

May 1, 2026 · 10 min read

Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.

The Math: Why Composition Punishes the Median

Take a five-step agent. Classify intent, retrieve documents, rerank, generate, validate. Each step has a median of 200 ms and a P99 of 2 seconds — a 10x ratio between median and tail, which is conservative for LLM-backed steps. The naive intuition says end-to-end median is 1 second, end-to-end P99 is 10 seconds, end of story.

The naive intuition is wrong in both directions, and the direction it is most wrong is the one that hurts.

For sums of independent latencies, means add and variances add. Standard deviation therefore grows only as the square root of the number of steps, so relative to its mean the total looks more predictable as you add steps, which is the trap. The averaging only works near the mean. With heavy-tailed steps, the tail of the sum is dominated by the tail of a single component. Concretely: if steps are independent, the probability that at least one step in your pipeline lands in its own 1% tail is 1 - 0.99^N. For N=5 that is roughly 5%, meaning the user-visible P95 is pushed up to roughly the per-step P99. For N=10 it is roughly 10%, so the user-visible P90 is pushed up to roughly the per-step P99.
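A quick simulation makes the five-step example concrete. The sketch below assumes each step's latency is independent and log-normal, with parameters chosen so that the per-step median is 200 ms and the per-step P99 is 2 seconds. The log-normal shape, the sample count, and the function names are illustrative assumptions, not measurements from a real system.

```python
import math
import random

STEP_MEDIAN_S = 0.2   # per-step median from the five-step example
STEP_P99_S = 2.0      # per-step P99 from the five-step example
Z_99 = 2.3263         # standard normal quantile at 0.99

# Log-normal parameters that reproduce the stated median and P99 (assumed shape).
MU = math.log(STEP_MEDIAN_S)
SIGMA = math.log(STEP_P99_S / STEP_MEDIAN_S) / Z_99


def percentile(sorted_values, q):
    """Nearest-rank percentile of an already sorted list, q in (0, 100]."""
    rank = max(1, math.ceil(q * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def simulate_pipeline(n_steps=5, n_requests=100_000, seed=0):
    """Return (P50, P99) of the summed latency of n_steps independent steps."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.lognormvariate(MU, SIGMA) for _ in range(n_steps))
        for _ in range(n_requests)
    )
    return percentile(totals, 50), percentile(totals, 99)


e2e_p50, e2e_p99 = simulate_pipeline()
print(f"end-to-end P50 = {e2e_p50:.2f} s (naive guess: 1.00 s)")
print(f"end-to-end P99 = {e2e_p99:.2f} s (naive guess: 10.00 s)")
```

Under these assumptions the simulated median comes out above the naive 1 second and the simulated P99 comes out below the naive 10 seconds, so the naive estimate misses on both sides. Real steps are rarely independent, and correlated slowness (a saturated provider, for instance) would push the tail higher than this sketch shows.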

Translated: as you add steps, the percentile at which "one slow step" happens migrates leftward, eating ever larger portions of your distribution. Jeff Dean and Luiz Barroso's "Tail at Scale" laid out the canonical version of this math for fan-out systems: if each server typically responds in 10 ms but has a 1-second P99, and every user request fans out to 100 such servers in parallel, 63% of user requests take over a second. The same math holds for serial composition with a softer slope, and it never bends toward you.
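The arithmetic behind these percentages is short enough to check directly. This is a minimal sketch that assumes the calls are independent; `prob_any_slow` is a name made up for illustration.

```python
def prob_any_slow(n_calls, per_call_tail=0.01):
    """Probability that at least one of n independent calls lands in its own tail."""
    return 1 - (1 - per_call_tail) ** n_calls


for n in (1, 5, 10, 100):
    print(f"N={n:>3}: {prob_any_slow(n):.1%} of requests hit at least one per-call P99")
# N=  1: 1.0% of requests hit at least one per-call P99
# N=  5: 4.9% of requests hit at least one per-call P99
# N= 10: 9.6% of requests hit at least one per-call P99
# N=100: 63.4% of requests hit at least one per-call P99
```

The N=100 row is the fan-out figure from "Tail at Scale"; the N=5 and N=10 rows are the serial-pipeline cases discussed above.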

This is why teams that watch median latency in production are systematically blindsided. Their dashboard is a measurement of the most reliable user experience. The least reliable user experience — the one that drives churn — is invisible at the percentile they chose.

Why LLM Tails Are Worse Than Database Tails

Tail amplification in classic distributed systems comes from queueing, garbage collection, network blips, and noisy neighbors. Those tails are real but bounded — a P99/P50 ratio of 5-10x is typical. LLM-backed steps in 2026 routinely show 20-50x ratios, and the causes are messier:

Output length variance. A model that emits 50 tokens at the median can emit 2000 tokens when it decides to "be thorough." Generation latency grows roughly linearly with output length, and output length depends on the input in ways you cannot reliably predict before the call. The same prompt class can return in 800 ms or 28 seconds depending on what the model decided to do.
Provider-side queueing. Hosted inference is multi-tenant. When the provider's GPU pool is saturated, you sit in a queue you cannot see. Status-page green does not mean tail-green.
Tool cold starts. If a step calls a sandboxed code execution environment, a serverless function, or a vector database that just scaled down, the cold-start path is 5-30x the warm-start path.
Retry storms inside the SDK. Many provider SDKs retry transparently on 429s and 500s with exponential backoff. The first failed attempt plus its backoff is invisible in your trace unless you wired up the right hooks, but it is very visible in the user's wait.
Streaming TTFT versus completion. Time-to-first-token can be reasonable while time-to-completion is awful. If a downstream step blocks on the full output, you inherit completion latency, not TTFT.
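To see how output length alone can produce that spread, here is a rough linear model of completion latency. The time-to-first-token of 100 ms and the 14 ms per output token are assumed values, picked only because they roughly reproduce the 800 ms and 28 second figures above; measure your own provider before relying on any such constants.

```python
def completion_latency_s(output_tokens, ttft_s=0.1, per_token_s=0.014):
    """Rough model: time to first token plus a fixed cost per generated token.

    ttft_s and per_token_s are illustrative assumptions, not vendor figures.
    """
    return ttft_s + output_tokens * per_token_s


for tokens in (50, 2000):
    print(f"{tokens:>4} output tokens -> {completion_latency_s(tokens):.1f} s to completion")
```

In this model time-to-first-token is identical for both calls, which is why a downstream step that blocks on the full output sees the 35x spread while a TTFT graph shows nothing unusual.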

The practical implication: you cannot estimate end-to-end latency by reading a model card or a vendor benchmark. You have to measure the tail of every step, end to end, on your actual prompt distribution.

What "Latency Budget" Actually Means

A budget is a contract between the system and the user. It says: we will respond by X seconds in 99% of cases, and we will degrade gracefully in the other 1%. The budget is not aspirational. It is the constraint every step must fit inside. Without a budget, "make it faster" is unbounded and unprioritizable.

A useful budget has three parts:

An end-to-end target percentile and value. Pick what the user actually feels. For interactive UX that is usually P95, sometimes P99. Pick a value the product can defend ("95% of answers in under 4 seconds"), not a value the team can hit.
A per-step allocation that fits inside the target. If the target is 4 seconds at P95 and the agent has four steps, each step has roughly 1 second of P95 budget, minus headroom for orchestration overhead. Percentiles do not add exactly, so treat the split as a working allocation and check it against the measured end-to-end P95. The allocation need not be equal: give the generation step more than the validation step.
A hard deadline and a fallback. When a step exceeds its budget, the system does something: a degraded answer, a cached fallback, a "still working" message, a retry with a smaller model. Without a deadline, exceeded budgets propagate as user-visible hangs.
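One way to keep the three parts from living only in a planning doc is to write the budget down as data that code can check. The sketch below is a hypothetical structure, not a library API: the class name, the step names, and the specific allocations are all assumptions chosen to match the 4-second P95 example.

```python
from dataclasses import dataclass, field


@dataclass
class LatencyBudget:
    percentile: float                 # which percentile the target applies to, e.g. 95
    target_s: float                   # end-to-end target at that percentile
    overhead_s: float = 0.0           # headroom reserved for orchestration
    steps: dict = field(default_factory=dict)  # step name -> allocated seconds

    def allocated_s(self):
        return sum(self.steps.values()) + self.overhead_s

    def remaining_s(self):
        return self.target_s - self.allocated_s()

    def validate(self):
        """Raise if the per-step allocations do not fit inside the target."""
        if self.remaining_s() < -1e-9:
            raise ValueError(
                f"over budget by {-self.remaining_s():.2f} s at P{self.percentile:g}"
            )


# Hypothetical unequal split: generation gets the most, classification the least.
budget = LatencyBudget(
    percentile=95,
    target_s=4.0,
    overhead_s=0.2,
    steps={"classify": 0.3, "retrieve": 0.7, "generate": 2.3, "validate": 0.5},
)
budget.validate()
print(f"allocated {budget.allocated_s():.1f} s of {budget.target_s:.1f} s")
```

The same per-step numbers can then be reused as runtime deadlines, so the planning figure and the enforced figure cannot drift apart.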

Most teams have part one and skip parts two and three. They write a target into a planning doc, never decompose it, and never enforce it at runtime. Then they discover that step three is eating 80% of the budget under load, and they have no fallback because the architecture assumes every step succeeds within its expected time.

Design Levers, From Cheapest to Most Expensive

Before you ship hedging or speculative execution, walk down the cheap levers.

Eliminate steps. The fastest step is the one that does not run. Audit the pipeline for orchestration that is there because somebody added it, not because the user sees a difference. Most agents grow a "rerank" step that fires when retrieval is already good enough, a "validate" step that catches a failure mode that does not occur on the production prompt distribution, or a "format" step that could be a single regex. Every step you cut removes its tail from your distribution.

Parallelize independent calls. When two sub-calls do not depend on each other, run them concurrently and turn sum(latencies) into max(latencies). Max of two independent calls is meaningfully better than their sum at the median and at the tail. The discipline is to actually find the independence — a lot of agent code is sequential because someone wrote it that way, not because the data dependency requires it.
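A minimal sketch of the difference, using `asyncio.gather` with sleeps standing in for real sub-calls. The call names and latencies are invented for illustration.

```python
import asyncio
import time


async def fake_call(name, latency_s):
    """Stand-in for an independent sub-call; sleeps instead of doing real I/O."""
    await asyncio.sleep(latency_s)
    return name


async def run_sequential(calls):
    # Total time is the sum of the individual latencies.
    return [await fake_call(name, latency) for name, latency in calls]


async def run_parallel(calls):
    # Total time is roughly the slowest individual latency.
    return await asyncio.gather(*(fake_call(name, latency) for name, latency in calls))


def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start


CALLS = [("retrieve_docs", 0.10), ("fetch_profile", 0.06)]
_, sequential_s = timed(run_sequential(CALLS))
_, parallel_s = timed(run_parallel(CALLS))
print(f"sequential: {sequential_s:.2f} s, parallel: {parallel_s:.2f} s")
```

`asyncio.gather` returns results in the order the calls were passed in, so downstream code does not need to change when a sequential pair becomes a concurrent one.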

Stream and start downstream consumers early. If step two can begin parsing while step one is still emitting tokens, you hide step one's tail behind step two's compute. Structured streaming (JSON schema, partial parsing) makes this safe. Most agent frameworks default to "wait for full completion before starting next step" and most agents would be faster if they did not.
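As a small illustration of early consumption, the sketch below hands complete records to the next step as soon as they arrive instead of waiting for the full output. It assumes the upstream step emits newline-delimited records, which is a simplification of the structured streaming described above; `fake_token_stream` and `complete_lines` are hypothetical names.

```python
import asyncio


async def fake_token_stream(chunks, delay_s=0.0):
    """Stand-in for a streaming model response that yields text chunks."""
    for chunk in chunks:
        await asyncio.sleep(delay_s)
        yield chunk


async def complete_lines(stream):
    """Yield each newline-terminated record as soon as it is complete."""
    buffer = ""
    async for chunk in stream:
        buffer += chunk
        while "\n" in buffer:
            line, buffer = buffer.split("\n", 1)
            yield line
    if buffer:
        # Flush whatever is left once the stream ends.
        yield buffer


async def demo():
    chunks = ['{"doc": 1}\n{"do', 'c": 2}\n', '{"doc": 3}']
    async for record in complete_lines(fake_token_stream(chunks)):
        # A real pipeline would start the next step's work on this record here.
        print("ready:", record)


asyncio.run(demo())
```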

Use smaller models at the front of the pipeline. A routing or classification step does not need a frontier model. A 3B parameter model returning in 80 ms median is dramatically better than an 80B model returning in 800 ms median, and the tail of the small model is correspondingly tighter. Reserve the big model for the step where its capability actually matters.

Cache aggressively at semantic boundaries. Embeddings, retrieval results, and intermediate decompositions are often re-fetched across requests. A cache hit is a step with zero tail. The investment in a good semantic cache pays for itself in P99, not just in cost.

Hedge the slow step. This is the move from the Tail at Scale paper, and it works. Send the request to one replica, and if it has not responded within the 95th-percentile expected latency, send the same request to a second replica and take whichever answer arrives first. The duplicate request costs you ~5% extra load and clips the tail dramatically. For LLM calls this means sending the same prompt to a second provider or a second region once the hedge delay expires. Most teams resist this on cost grounds; the cost is real, the user experience improvement is also real, and the tradeoff should be a conscious one rather than a default.
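A hedged request can be sketched in a few lines of `asyncio`. This is an illustration under assumptions, not a production client: `hedged_call` and `fake_provider` are hypothetical names, the providers are simulated with sleeps, and an exception from whichever call finishes first is simply propagated.

```python
import asyncio


async def hedged_call(primary, backup, hedge_delay_s):
    """Run primary(); if it is still pending after hedge_delay_s, also run backup().

    primary and backup are zero-argument functions returning coroutines.
    Returns the result of whichever call finishes first.
    """
    first = asyncio.create_task(primary())
    done, _ = await asyncio.wait({first}, timeout=hedge_delay_s)
    if done:
        return first.result()  # fast path: no duplicate request was sent

    second = asyncio.create_task(backup())
    done, pending = await asyncio.wait(
        {first, second}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop waiting on (and paying attention to) the loser
    await asyncio.gather(*pending, return_exceptions=True)
    return next(iter(done)).result()


async def fake_provider(name, latency_s):
    """Stand-in for a model call to a given provider or region."""
    await asyncio.sleep(latency_s)
    return name


async def demo():
    slow_primary = await hedged_call(
        lambda: fake_provider("primary", 0.5),
        lambda: fake_provider("backup", 0.05),
        hedge_delay_s=0.1,
    )
    fast_primary = await hedged_call(
        lambda: fake_provider("primary", 0.02),
        lambda: fake_provider("backup", 0.05),
        hedge_delay_s=0.1,
    )
    return slow_primary, fast_primary


print(asyncio.run(demo()))  # ('backup', 'primary')
```

Setting `hedge_delay_s` to the measured P95 of the step is what keeps the extra load near 5%: the backup only fires for the requests that were already in the slowest twentieth.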

Set per-step deadlines and bail. When a step exceeds its budget, abort it and run the fallback. The fallback might be a cached answer, a smaller model, or a "we couldn't get a great answer, here's a partial one" path. The system that bails gracefully at second four beats the system that hangs until second twelve every time, even though the bailing system has technically "more failures."
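A per-step deadline with a fallback can be as small as a wrapper around `asyncio.wait_for`. The sketch below uses hypothetical names (`with_deadline`, `generate`) and a sleep in place of a real model call; it returns a flag alongside the value so callers and dashboards can tell a degraded answer from a full one.

```python
import asyncio


async def with_deadline(step, deadline_s, fallback):
    """Run step() under a deadline; on timeout, cancel it and use fallback().

    Returns (value, degraded), where degraded is True if the fallback was used.
    """
    try:
        value = await asyncio.wait_for(step(), timeout=deadline_s)
        return value, False
    except asyncio.TimeoutError:
        return fallback(), True


async def generate(latency_s):
    """Stand-in for a generation step with a given latency."""
    await asyncio.sleep(latency_s)
    return "full answer"


async def demo():
    on_time = await with_deadline(lambda: generate(0.01), 0.1, lambda: "cached answer")
    too_slow = await with_deadline(lambda: generate(0.5), 0.05, lambda: "cached answer")
    return on_time, too_slow


print(asyncio.run(demo()))
# (('full answer', False), ('cached answer', True))
```

`asyncio.wait_for` cancels the wrapped coroutine when the timeout fires, so the abandoned step does not keep running in the background after the fallback has been returned.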

The Dashboard You Actually Need

If your tracing setup shows you per-step P50 and an end-to-end average, you are flying blind. The minimum useful dashboard for a multi-step agent shows:

End-to-end P50, P95, P99, and P99.9 on the user-visible request, broken out by route and prompt class. The P99.9 line is the one that catches the abandonment-driving outliers your P99 hides.
Per-step P95 and P99, with each step's contribution to end-to-end as a percentage. This is how you find the step that is eating the budget.
Distribution shape, not just percentiles. A histogram or CDF of end-to-end latency reveals bimodality (cache hits versus misses, small versus large outputs) that percentiles flatten away.
Tail correlation. When end-to-end goes slow, which step went slow? Surface this so the on-call engineer does not have to derive it from raw spans.
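The panels above can be prototyped from raw per-step timings before any dashboard tooling is involved. This sketch assumes each trace is a dict mapping step name to seconds; the `summarize` function, the step names, and the synthetic data are all hypothetical.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile, q in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces, tail_q=99):
    """traces: list of dicts mapping step name -> latency in seconds for one request."""
    totals = [sum(t.values()) for t in traces]
    cutoff = percentile(totals, tail_q)
    tail = [t for t, total in zip(traces, totals) if total >= cutoff]
    tail_total = sum(sum(t.values()) for t in tail)
    steps = sorted({step for t in traces for step in t})

    slowest_in_tail = {step: 0 for step in steps}
    for t in tail:
        slowest_in_tail[max(t, key=t.get)] += 1  # which step went slow?

    return {
        "e2e": {q: percentile(totals, q) for q in (50, 95, 99, 99.9)},
        "per_step_p99": {s: percentile([t.get(s, 0.0) for t in traces], 99) for s in steps},
        "tail_share": {s: sum(t.get(s, 0.0) for t in tail) / tail_total for s in steps},
        "slowest_in_tail": slowest_in_tail,
    }


# Synthetic data: 99 ordinary requests and one where generation ran long.
traces = [{"retrieve": 0.2, "generate": 0.4, "validate": 0.1} for _ in range(99)]
traces.append({"retrieve": 0.2, "generate": 6.0, "validate": 0.1})

report = summarize(traces, tail_q=99.9)
print(report["e2e"])
print(report["slowest_in_tail"])
```

With this synthetic data the end-to-end P99 still looks healthy while P99.9 exposes the 6-second outlier, and the tail breakdown attributes it to the generation step without anyone reading raw spans.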

OpenTelemetry plus a tracing backend gets you most of this. The discipline that does not come from tooling is looking at the right percentile at the right cadence. Median-latency dashboards in agent systems are worse than no dashboard, because they create false confidence.

Budget First, Optimize Second

The teams that ship reliable agents do not start with optimization. They start with a budget — an explicit number for the percentile their users feel — and then design the pipeline to fit inside it. Every step gets an allocation. Every allocation has a deadline. Every deadline has a fallback.

This sequence is the opposite of what most teams do. The default is to build the pipeline first, measure latency at the end, and then panic-optimize when the tail is already painful. By that point the architecture has assumed every step will succeed, the fallbacks do not exist, and the team is reduced to swapping models and hoping. Budgeting up front turns latency from a thing-you-discover into a thing-you-design.

The median is a vanity metric for systems with composition. The tail is the system. Build for it, measure it, and put a deadline on it — or let your users discover it for you.


About Tian Pan
