End-to-End Latency Is Not P99 of Your LLM Call: The Multipliers Nobody Measures in Agentic Systems
Your LLM API call completes in 500ms at P99. Your users are waiting 12 seconds. Both numbers are accurate, and neither is lying to you — they're just measuring completely different things. The gap between them is where most agentic systems silently bleed performance, and most teams never instrument it.
The problem is structural: P99 LLM latency is a single-call metric applied to a multi-step execution model. A ReAct agent making five sequential tool calls, retrying a hallucinated function, assembling a growing context, and generating a 300-token reasoning chain is not one LLM call. It's a distributed workflow where the LLM is just one node, and every other node has its own latency tax.
The Anatomy of an Agent Request
Consider a simple agent task: "Find the three most relevant documents on topic X, summarize each, and return a ranked list." A user submits this expecting a result in a few seconds. What actually happens:
- Prompt construction: the system assembles tool schemas, conversation history, and system prompt (100–500ms depending on context size)
- LLM call 1: model generates a `search` tool call (300–800ms)
- Tool roundtrip: search executes over the network (100–300ms)
- LLM call 2: model receives results, decides to `fetch` document A (300–800ms)
- Tool roundtrip: fetch executes (200–500ms)
- Repeat for documents B and C
- LLM final call: generates a 400-token summary (~300ms prefill + ~15 seconds of decode time)
Even under optimistic conditions, this workflow exceeds 10 seconds. The five LLM calls, at 500ms each, account for roughly 15% of total wall-clock time. Reporting "our model latency is 500ms" in this context is like reporting the database query time while ignoring the ORM, the network, and the HTTP stack.
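To make the arithmetic concrete, here is a back-of-the-envelope budget for this workflow, summing the component ranges listed above (a sketch with illustrative numbers, not measurements from any specific system):

```python
# Rough latency budget for the five-call research workflow above.
# Each entry is (optimistic_ms, pessimistic_ms), using the ranges in the text.
components = {
    "prompt_construction_x5": (5 * 100, 5 * 500),
    "llm_tool_call_steps_x4": (4 * 300, 4 * 800),          # search + fetch A/B/C
    "tool_roundtrips_x4": (100 + 3 * 200, 300 + 3 * 500),  # 1 search + 3 fetches
    "final_call_prefill": (300, 300),
    "final_decode_400_tokens": (400 * 30, 400 * 50),       # 30-50ms per output token
}

lo = sum(v[0] for v in components.values())
hi = sum(v[1] for v in components.values())
print(f"optimistic: {lo / 1000:.1f}s, pessimistic: {hi / 1000:.1f}s")
# optimistic: 14.7s, pessimistic: 27.8s
```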
The Four Multipliers Nobody Measures
1. Sequential Tool Roundtrips
Each tool call is a blocking synchronous operation: generate call → network → external API → parse response → next model step. At 200ms per round trip for a moderately fast external API, five sequential tool calls add one full second of pure waiting — before accounting for the model inference time between each call.
The math scales poorly. Production agents making ten tool calls in sequence accumulate 2–4 seconds of network overhead alone. Most monitoring dashboards never break this out. The latency appears as a delay between LLM calls, and since neither call is slow in isolation, the issue is invisible to anyone watching per-call P99s.
The fix exists: modern LLM APIs support parallel tool calling, where the model returns multiple tool invocations in a single response for concurrent execution. Salesforce research on deep research agents found consistent performance improvement as the number of parallel tool calls increased, with 40–60% wall-clock reduction for multi-step tasks. The tradeoff is that parallel execution requires the model to reason about which tools are independent — not all tool call sequences can be parallelized safely.
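Here is a minimal sketch of that dispatch pattern, assuming an async tool registry; the `TOOLS` dict and the tool-call shape are illustrative, not any particular framework's API:

```python
import asyncio
import time

# Hypothetical async tool; in practice this wraps an HTTP call to an external API.
async def fetch(doc_id: str) -> str:
    await asyncio.sleep(0.2)  # simulate a 200ms network roundtrip
    return f"contents of document {doc_id}"

TOOLS = {"fetch": fetch}

async def execute_tool_calls(calls: list[dict]) -> list[str]:
    # When the model returns several independent tool calls in one response,
    # run them concurrently instead of awaiting each one in sequence.
    return await asyncio.gather(
        *(TOOLS[c["name"]](**c["arguments"]) for c in calls)
    )

async def main():
    calls = [{"name": "fetch", "arguments": {"doc_id": d}} for d in "ABC"]
    start = time.perf_counter()
    await execute_tool_calls(calls)
    print(f"3 calls in {time.perf_counter() - start:.2f}s")  # ~0.2s, not ~0.6s

asyncio.run(main())
```

The caveat above still applies: only calls that are actually independent belong in one `gather` batch; a fetch whose arguments depend on a prior search result must stay sequential.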
2. Retry Cascades from Hallucinated Tool Calls
This is the most pernicious multiplier because it's both large and disguised as a reliability problem. When an agent hallucinates a tool name or constructs an invalid argument schema, the system's retry handler treats it identically to a transient network failure and applies the same exponential backoff logic.
Analysis of production ReAct agent logs found that hallucinated tool calls consumed over 90% of the retry budget in some workloads. Each retry cycle adds another full LLM inference round (300–1500ms) plus backoff delay. Three retry cycles on a single hallucinated call add 3–8 seconds before the agent moves on to a real tool execution.
The deeper problem is instrumentation: teams measure "retry count" as a reliability metric, not a latency contributor. The retry handler fires, the error is caught, the metric increments — but the wall-clock cost never appears in latency dashboards. It's categorized as a failure recovery event, not a performance event.
Separating the error taxonomy helps here. Hallucinated tool calls fail with a different error signature than transient failures (TOOL_NOT_FOUND vs. timeout, connection refused). A circuit breaker that counts hallucination-type errors differently from infrastructure errors can stop applying retry delays that were designed for the wrong failure class.
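A sketch of what that separation looks like in code, with illustrative error classes and thresholds:

```python
import time

class ToolNotFoundError(Exception):
    """Model asked for a tool that doesn't exist: a hallucination, not an outage."""

class TransientToolError(Exception):
    """Timeout, connection refused, 5xx: worth retrying with backoff."""

def run_with_retries(execute, max_infra_retries=3, max_hallucination_retries=1):
    infra = hallucinated = 0
    while True:
        try:
            return execute()  # one generate-call -> execute-tool step
        except ToolNotFoundError:
            # No backoff: the network isn't the problem. Regenerate once with
            # the valid tool list emphasized in the prompt, then fail fast.
            hallucinated += 1
            if hallucinated > max_hallucination_retries:
                raise
        except TransientToolError:
            infra += 1
            if infra > max_infra_retries:
                raise
            time.sleep(0.1 * 2 ** infra)  # exponential backoff for infra faults only
```

Backoff assumes the world will heal if you wait. That assumption holds for a flaky network and fails for a model that invented a function name, so the two classes get separate budgets and no shared sleep.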
3. Prompt Construction Overhead and Context Bloat
Before every LLM call in an agent loop, something has to assemble the prompt: pull conversation history, inject tool schemas, format tool results, apply the system prompt. In a ten-turn conversation with full history and five tool schemas, this context easily reaches 8,000–10,000 tokens.
The latency cost is twofold. First, prompt construction itself takes 100–500ms of CPU time per iteration for parsing, template rendering, and serialization. This is easy to miss because it happens in application code, not in LLM call spans. Second, longer prompts mean longer prefill phases. The prefill phase — processing the entire input context before generating any output — scales with token count. A 10,000-token prefill takes meaningfully longer than a 2,000-token prefill, and this time appears as "time to first token" (TTFT) on the LLM side while its root cause is context growth on the application side.
Multi-agent systems amplify this further. Research on orchestrator-subagent architectures found that token counts multiply 4–15x compared to equivalent single-call implementations without optimization. Verification steps that check subagent outputs consume up to 72% of total tokens in some configurations. None of this shows up as a per-call latency anomaly — individual calls look normal, but their count and their input sizes compound.
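Instrumenting the context layer is cheap. A sketch, assuming a `build_prompt` assembly function and any tokenizer with an `encode` method (tiktoken here); the log fields are illustrative:

```python
import time
import tiktoken  # or any tokenizer with an .encode() method

enc = tiktoken.get_encoding("cl100k_base")

def build_prompt(system_prompt, tool_schemas, history):
    # Stand-in for real assembly: templating, serializing tool results, etc.
    return "\n".join([system_prompt, *tool_schemas, *history])

def build_prompt_instrumented(system_prompt, tool_schemas, history, iteration, log):
    start = time.perf_counter()
    prompt = build_prompt(system_prompt, tool_schemas, history)
    log.append({
        "iteration": iteration,
        "assembly_ms": (time.perf_counter() - start) * 1000,
        "prompt_tokens": len(enc.encode(prompt)),
    })
    return prompt
```

Plotting `prompt_tokens` against `iteration` exposes the growth rate directly; a context that doubles every few turns is a prefill problem that will surface later as rising TTFT.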
4. Decode Time Scaling with Output Length
Prefill and decode are fundamentally different computations, and most latency measurements conflate them. Prefill — reading the input prompt — is highly parallelizable. Decode — generating output tokens one at a time — is sequential and memory-bandwidth-bound.
At 30–50ms per output token, a 500-token reasoning chain adds 15–25 seconds of pure decode time. TTFT metrics look fine (the model started responding quickly) but the tail of the response arrives 20+ seconds later. This matters enormously for agents that use chain-of-thought reasoning, where the model thinks out loud before producing the final tool call or answer.
The practical implication: an agent configuration that produces shorter intermediate reasoning (fewer tokens per step) often delivers better end-to-end latency than one that produces more thorough reasoning, even if per-step accuracy is slightly lower. The tradeoff between output token count and response quality is real and usually uninstrumented.
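The decode budget is simple enough to estimate before shipping a prompt change. A sketch, using the 30–50ms-per-token range cited above (the right TPOT value is whatever your serving stack actually measures):

```python
def decode_time_s(output_tokens: int, tpot_ms: float = 40.0) -> float:
    """Decode time is roughly linear in output length: tokens x time per output token."""
    return output_tokens * tpot_ms / 1000

print(decode_time_s(500))  # 20.0 -- seconds for a 500-token reasoning chain
print(decode_time_s(150))  # 6.0  -- the same step with terser reasoning
```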
How to Measure End-to-End Agent Latency Correctly
The industry has converged on OpenTelemetry as the right substrate, with the gen_ai.* semantic conventions standardized by the OpenTelemetry Generative AI SIG in 2024. The span model maps cleanly to the multipliers above:
Trace: "User Query"
├─ Span: prompt_construction (100-500ms)
├─ Span: agent_loop_iteration[1]
│ ├─ Span: llm_call (prefill + decode separately)
│ └─ Span: tool_call[search] (network + execution)
├─ Span: agent_loop_iteration[2]
│ ├─ Span: llm_call
│ └─ Span: tool_call[fetch]
└─ Span: agent_loop_iteration[3]
└─ Span: llm_call (final generation)
Each span captures model name and version, input and output token counts, TTFT and time per output token (TPOT) separately, tool call signatures, and retry metadata. The key discipline is measuring these four layers independently:
- Agent loop layer: total wall-clock time, iteration count, retry budget consumed
- LLM layer: TTFT (prefill proxy), TPOT (decode rate), output token count
- Tool layer: call latency including network roundtrip, execution time, error type
- Context layer: prompt size per iteration, assembly time, token growth rate
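A minimal sketch of this span structure with the OpenTelemetry Python SDK; the `gen_ai.*` attribute names follow the semantic conventions, while `llm.ttft_ms`, `agent.iteration`, and the `llm`/`tools` callables are illustrative assumptions:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for demonstration; production would export to a collector.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("agent")

def run_agent(query: str, llm, tools, max_iterations: int = 5):
    with tracer.start_as_current_span("user_query"):
        history = [query]
        for i in range(max_iterations):
            with tracer.start_as_current_span("agent_loop_iteration") as it:
                it.set_attribute("agent.iteration", i)

                with tracer.start_as_current_span("prompt_construction"):
                    prompt = "\n".join(history)  # stand-in for real assembly

                with tracer.start_as_current_span("llm_call") as span:
                    resp = llm(prompt)  # your client, returning tokens + timings
                    span.set_attribute("gen_ai.usage.input_tokens", resp["input_tokens"])
                    span.set_attribute("gen_ai.usage.output_tokens", resp["output_tokens"])
                    span.set_attribute("llm.ttft_ms", resp["ttft_ms"])  # prefill proxy

                if not resp["tool_calls"]:
                    return resp["text"]
                for call in resp["tool_calls"]:
                    with tracer.start_as_current_span(f'tool_call[{call["name"]}]'):
                        history.append(tools[call["name"]](**call["arguments"]))
```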
Platforms like Langfuse, OpenLLMetry, and Arize Phoenix speak OpenTelemetry natively and ship auto-instrumentation for common frameworks, so this structure comes without hand-written spans. The instrumentation gap is usually in the tool layer: most off-the-shelf tracing handles LLM calls but drops spans when execution crosses into a custom tool implementation.
What the Numbers Actually Look Like
A five-tool research agent with no latency optimization running against a mid-tier LLM:
| Component | Time |
|---|---|
| Prompt construction (×5 iterations) | 500ms |
| LLM inference (×5 calls, P50) | 2,500ms |
| Tool roundtrips (×5 sequential) | 1,000ms |
| Retry cycle (1 hallucination) | 1,500ms |
| Decode for 400-token final output | 16,000ms |
| Total | ~21 seconds |
The P50 LLM inference is 500ms per call, total 2,500ms. That's 12% of end-to-end time. Reporting P50 LLM latency for this system as a proxy for user experience is functionally misleading.
Applying the highest-ROI optimizations changes the picture substantially:
- Parallel tool calls: cuts tool roundtrip time by 60%, saves ~600ms
- Prompt caching for static context: reduces prefill time by 30–50% and eliminates most prompt assembly overhead
- Shorter reasoning chains (200 tokens instead of 400): saves ~8 seconds of decode time at the table's 40ms-per-token rate
- Retry taxonomy separating hallucinations from infrastructure failures: eliminates most retry latency
The decode time reduction alone — by designing prompts that elicit shorter, more direct reasoning — is typically the single largest lever available, and it's the one most teams never pull because they're not measuring decode time separately from TTFT.
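Applying just the deltas stated above to the 21.5-second budget makes the ranking visible (a sketch; the prefill reduction from caching is omitted because the table doesn't break out prefill's share of per-call inference):

```python
baseline_ms = 500 + 2500 + 1000 + 1500 + 16000  # table above: 21,500ms

savings_ms = {
    "parallel_tool_calls": 600,            # 60% of the 1,000ms roundtrip row
    "prompt_assembly_cached": 400,         # most of the 500ms construction row
    "retry_taxonomy": 1500,                # the hallucination retry cycle
    "shorter_reasoning_200_tokens": 8000,  # 200 fewer tokens at 40ms/token
}

optimized_ms = baseline_ms - sum(savings_ms.values())
print(f"{baseline_ms / 1000:.1f}s -> {optimized_ms / 1000:.1f}s")  # 21.5s -> 11.0s
```

The decode row accounts for well over half of the total reduction, and no model swap was involved.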
The Organizational Problem
Teams that report P99 LLM latency as their primary performance metric are optimizing the wrong thing. The metric is precise, reproducible, and almost entirely irrelevant to what users experience in multi-step agent workflows.
The organizational incentive structure makes this worse. LLM latency is easy to instrument and easy to explain in a dashboard. End-to-end agent latency requires tracing across service boundaries, correlating LLM call spans with tool execution spans, and distinguishing prefill from decode time. It's a distributed systems problem, and it requires distributed systems tooling — which most ML teams don't own.
The teams that have closed this gap — by treating agent latency as an SRE problem rather than an ML problem — find the same thing: the bottleneck is almost never the LLM call itself. It's the orchestration layer, the tool execution pattern, and the context growth that surrounds the LLM calls. Fixing those doesn't require a faster model. It requires instrumentation that makes them visible.
The first step is the same one it's always been in distributed systems: measure what users experience, not what's convenient to measure. P99 LLM latency is a component metric. What you need is end-to-end wall-clock time, broken down by the layers that compose it.
References
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://bentoml.com/llm/inference-optimization/llm-inference-metrics
- https://mlcommons.org/2025/04/llm-inference-v5/
- https://arxiv.org/html/2510.16276v1
- https://arxiv.org/html/2511.14136v1
- https://openreview.net/forum?id=tX3YcbNa5w
- https://arxiv.org/html/2509.23586v1
- https://www.parloa.com/knowledge-hub/agentic-ai-latency/
- https://cresta.com/blog/engineering-for-real-time-voice-agent-latency/
- https://opentelemetry.io/blog/2025/ai-agent-observability/
- https://langfuse.com/integrations/native/opentelemetry
- https://arxiv.org/html/2601.06007v2
- https://arxiv.org/html/2506.14852v2
- https://redis.io/blog/prefill-vs-decode/
- https://sambanova.ai/blog/agentic-inference-needs-hybrid-hardware/
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://xqlin98.github.io/wide-deep-research-agent/
- https://openai.com/index/speeding-up-agentic-workflows-with-websockets/
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
