Latency Budgets for Multi-Step Agents: Why P50 Lies and P99 Is What Users Feel
The dashboard said the agent was fast. P50 sat at 1.2 seconds, the team had a meeting to celebrate, and then the abandonment rate kept climbing. Nobody was looking at the graph the user actually lives on.
This is the reliable failure mode of multi-step agents in production: the median is the metric you can hit, the tail is the metric your users feel, and the gap between the two grows non-linearly with every sub-call you bolt onto the pipeline. A four-step agent where each step is "fast at the median" routinely produces an end-to-end P99 that is six to eight times worse than any single step. Users do not experience the median. They experience the sum of every step in their particular trip, and that sum is dominated by whichever step happened to go slow.
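The tail amplification can be sketched with a small simulation. The numbers here are illustrative assumptions, not measurements: each step is modeled as an independent lognormal latency with a median around 0.3 s and a heavy tail, and four steps run sequentially.

```python
import random

def percentile(samples, p):
    """Nearest-rank percentile of a list of samples."""
    ordered = sorted(samples)
    k = int(p / 100 * (len(ordered) - 1))
    return ordered[k]

random.seed(0)
N = 100_000   # simulated requests
STEPS = 4     # sequential sub-calls per request

# Hypothetical per-step latency: lognormal with median e^-1.2 ~ 0.30 s.
# Real step latencies are an assumption here; plug in your own traces.
step_samples = [
    [random.lognormvariate(-1.2, 1.0) for _ in range(N)]
    for _ in range(STEPS)
]

# End-to-end latency is the sum of the four sequential steps.
total = [sum(step[i] for step in step_samples) for i in range(N)]

single = step_samples[0]
print(f"per-step   P50={percentile(single, 50):.2f}s  P99={percentile(single, 99):.2f}s")
print(f"end-to-end P50={percentile(total, 50):.2f}s  P99={percentile(total, 99):.2f}s")
```

Run it a few times with different seeds: the end-to-end P50 stays unremarkable while the end-to-end P99 lands several multiples above the per-step median, because at the 99th percentile of 100,000 trips, at least one of the four steps almost always drew from its tail.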
If your team optimizes the wrong percentile, you will ship a system that benchmarks well, demos beautifully, and bleeds users in the long tail you never instrumented.
