
Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel

· 10 min read
Tian Pan
Software Engineer

The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.

This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces a P99 that is six or eight times worse than any single step. Users do not experience the median. They experience the worst step in their particular trip.

If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.

The Math: Why Composition Punishes the Median

Take a five-step agent. Classify intent, retrieve documents, rerank, generate, validate. Each step has a median of 200 ms and a P99 of 2 seconds — a 10x ratio between median and tail, which is conservative for LLM-backed steps. The naive intuition says end-to-end median is 1 second, end-to-end P99 is 10 seconds, end of story.

The naive intuition is wrong in both directions, and the direction it is most wrong is the one that hurts.

For sums of independent latencies, means add and variances add. Standard deviation grows only as the square root of the number of steps, so relative to the mean the system looks more predictable as you add steps, which is the trap. The averaging only works near the mean. When per-step latencies are heavy-tailed, the tail of the sum is dominated by the tail of whichever component happens to be slowest. Concretely: the probability that at least one step in your pipeline lands in its own 1% tail is 1 - 0.99^N. For N=5 that is roughly 5%, meaning the user-visible P95 is pushed up to roughly the per-step P99 or worse. For N=10 it is roughly 10%, so the user-visible P90 is pushed up to the per-step P99 or worse.
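A quick simulation makes the composition effect concrete. This is a minimal sketch, not a model of any real system: it assumes each of the five steps is independent and log-normally distributed, with parameters fitted so the per-step median is 200 ms and the per-step P99 is 2 seconds. The distribution choice, the sample count, and the helper names are all assumptions made for illustration.

```python
import math
import random
from statistics import NormalDist

STEPS = 5
MEDIAN_S = 0.2  # per-step median from the five-step example
P99_S = 2.0     # per-step P99 from the five-step example

# Assumption: independent, log-normally distributed step latencies.
# Fit mu and sigma so the median and the P99 match the example.
MU = math.log(MEDIAN_S)
SIGMA = (math.log(P99_S) - MU) / NormalDist().inv_cdf(0.99)


def percentile(sorted_values, q):
    """Simple rank-based percentile on an already sorted list, q in [0, 100]."""
    idx = min(len(sorted_values) - 1, int(len(sorted_values) * q / 100))
    return sorted_values[idx]


def simulate(n_requests=100_000, seed=7):
    """Sum STEPS sampled latencies per request; return end-to-end percentiles."""
    rng = random.Random(seed)
    totals = sorted(
        sum(rng.lognormvariate(MU, SIGMA) for _ in range(STEPS))
        for _ in range(n_requests)
    )
    return {q: percentile(totals, q) for q in (50, 95, 99)}


result = simulate()
for q, value in result.items():
    print(f"end-to-end P{q}: {value:.2f} s")
```

Under these assumptions the end-to-end median should land above the naive 1 second, because per-step means sit above per-step medians in a skewed distribution, and the end-to-end P95 should already sit above the per-step P99 of 2 seconds. Swap in your own measured per-step samples before trusting any of the numbers.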

Translated: as you add steps, the percentile at which "one slow step" happens migrates leftward, eating ever larger portions of your distribution. Jeff Dean and Luiz Barroso's "Tail at Scale" laid out the canonical version of this math for fan-out systems — a server with a 10 ms average and 1-second P99, queried 100 times per request, produces a distribution where 63% of user requests take over a second. The same math holds for serial composition with a softer slope, and it never bends toward you.
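The leftward migration is easy to tabulate. The sketch below assumes calls are independent and uses a hypothetical helper name, `share_hitting_tail`; it computes the share of requests that see at least one call in its per-call P99 region, and the end-to-end percentile at which that region begins.

```python
def share_hitting_tail(n_calls: int, tail_prob: float = 0.01) -> float:
    """Share of requests where at least one of n independent calls
    lands in its own tail (tail_prob=0.01 means the per-call P99)."""
    return 1.0 - (1.0 - tail_prob) ** n_calls


for n in (1, 5, 10, 100):
    share = share_hitting_tail(n)
    # The "one slow call" region starts at this end-to-end percentile.
    starts_at = 100.0 * (1.0 - share)
    print(f"N={n:>3}: {share:.1%} of requests affected, from P{starts_at:.1f} upward")
```

For N=100 the share comes out at about 63%, which matches the fan-out figure quoted from "Tail at Scale"; for N=5 and N=10 it is about 4.9% and 9.6%.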

This is why teams that watch median latency in production are systematically blindsided. Their dashboard is a measurement of the most reliable user experience. The least reliable user experience — the one that drives churn — is invisible at the percentile they chose.

Why LLM Tails Are Worse Than Database Tails

Tail amplification in classic distributed systems comes from queueing, garbage collection, network blips, and noisy neighbors. Those tails are real but bounded — a P99/P50 ratio of 5-10x is typical. LLM-backed steps in 2026 routinely show 20-50x ratios, and the causes are messier:

  • Output length variance. A model that emits 50 tokens at the median can emit 2000 tokens when it decides to "be thorough." Generation latency grows roughly linearly with output length, and output length is a property of the response that you cannot predict from the input alone. The same prompt class can return in 800 ms or 28 seconds depending on what the model decided to do.
  • Provider-side queueing. Hosted inference is multi-tenant. When the provider's GPU pool is saturated, you sit in a queue you cannot see. Status-page green does not mean tail-green.
  • Tool cold starts. If a step calls a sandboxed code execution environment, a serverless function, or a vector database that just scaled down, the cold-start path is 5-30x the warm-start path.
  • Retry storms inside the SDK. Many provider SDKs retry transparently on 429s and 500s with exponential backoff. The first failed attempt plus its backoff is invisible in your trace unless you wired up the right hooks, but it is very visible in the user's wait.
  • Streaming TTFT versus completion. Time-to-first-token can be reasonable while time-to-completion is awful. If a downstream step blocks on the full output, you inherit completion latency, not TTFT.

The practical implication: you cannot estimate end-to-end latency by reading a model card or a vendor benchmark. You have to measure the tail of every step, end to end, on your actual prompt distribution.

What "Latency Budget" Actually Means

A budget is a contract between the system and the user. It says: we will respond within X seconds in 99% of cases, and we will degrade gracefully in the other 1%. The budget is not aspirational. It is the constraint every step must fit inside. Without a budget, "make it faster" is unbounded and unprioritizable.
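In code, a budget like this can be carried through the pipeline as a deadline that every step checks before it starts. The sketch below is one possible shape, not a prescribed design: `LatencyBudget`, `run_pipeline`, the per-step minimum, and the fallback callback are all hypothetical, and it assumes each step function accepts a `timeout` argument.

```python
import time


class LatencyBudget:
    """Hypothetical helper: tracks how much of an end-to-end budget is left."""

    def __init__(self, total_seconds, clock=time.monotonic):
        self._clock = clock
        self._deadline = clock() + total_seconds

    def remaining(self):
        """Seconds left before the deadline, never negative."""
        return max(0.0, self._deadline - self._clock())


def run_pipeline(steps, budget, fallback):
    """Run (name, fn, min_seconds) steps in order. If the remaining budget
    cannot cover the next step, return the fallback (the degraded path)."""
    result = None
    for name, fn, min_seconds in steps:
        if budget.remaining() < min_seconds:
            return fallback(name, result)
        # Assumption: each step honors the timeout it is given.
        result = fn(result, timeout=budget.remaining())
    return result


# Usage with trivial stand-in steps and a 4-second budget.
steps = [
    ("retrieve", lambda prev, timeout: ["doc"], 0.5),
    ("generate", lambda prev, timeout: f"answer from {len(prev)} doc(s)", 1.0),
]
print(run_pipeline(steps, LatencyBudget(4.0), lambda name, partial: "degraded"))
```

The point of the per-step minimum is that a step which cannot plausibly finish in the time left should not start at all; the fallback is where the "degrade gracefully" half of the contract lives.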

A useful budget has three parts:

  1. An end-to-end target percentile and value. Pick what the user actually feels. For interactive UX that is usually P95, sometimes P99. Pick a value the product can defend ("95% of answers in under 4 seconds"), not a value the team can hit.