Where the 30 Seconds Went: Latency Attribution Inside an Agent Step Your APM Can't See
The dashboard says agent.run = 28s at p95. Users say the feature feels broken. The on-call engineer opens the trace, sees a single fat bar with no children worth investigating, and starts guessing. By the time someone has rebuilt enough mental model to know whether the bottleneck is the model, the retriever, or a tool call that nobody added a span to, the incident has aged into a backlog ticket and the user has given up.
This is the failure mode at the heart of agent operations in 2026: classical APM treats an agent step as a black box, and "agent latency" is not a metric — it is the sum of seven metrics that decompose the wall-clock time differently depending on what the agent decided to do that turn. A team that doesn't expose those seven numbers ships a feature whose slowness everyone can feel and nobody can fix.
The seven phases hiding inside one span
When a user-facing dashboard reports agent.step = 12s, the wall-clock time is almost always a stack of these phases, in order:
- Input assembly — building the message list, applying the system prompt, hydrating session state, fetching prior tool outputs. Cheap when warm, surprisingly expensive when the session has rolled over and a cache miss forces a database read on every turn.
- Retrieval — embedding the query, hitting the vector store, optionally re-ranking. Production vector DB queries land in the 50–300ms range per call, and re-ranking can add another 100–500ms on top depending on the model and the candidate count.
- Prefill — the model processing the entire input context to produce its first output token. Prefill is compute-bound and grows with input length; doubling the prompt roughly doubles time-to-first-token.
- Decode — the model generating tokens one at a time, with each step gated by memory bandwidth on the KV cache. This is where reasoning effort and output length burn budget linearly, and where shared GPU batches with other traffic introduce contention nobody on the application team can see.
- Tool dispatch — the harness parsing the model's tool call, validating arguments, and routing to the right handler. Usually small, but easy to bloat when the harness round-trips through a queue.
- Tool execution — the actual external work. Could be a 50ms cache read, could be a 9-second LLM-as-a-tool nested call. The variance here is the single largest source of "why is p99 so much worse than p50."
- Re-prefill — the model processing the tool result back into context to produce the next response token. Often forgotten in mental models because it looks visually like "the model thinking again," but it's a fresh prefill pass over a longer context that grows with every tool round.
A normal trace shows those seven phases collapsed into a single agent.step span, or worse, two of them broken out (llm.call and tool.call) and the other five swept under the parent span's wall-clock interval. The OpenTelemetry GenAI semantic conventions, which moved through the spec process during 2024–2025, define operation names like chat, execute_tool, and invoke_agent, but the names alone don't produce the breakdown: a team that adopts the convention yet emits only the parent span gets the same flat view it had before.
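As a rough sketch of what per-phase emission can look like, here is a Python harness turn instrumented with the OpenTelemetry tracing API. The parent span follows the convention's invoke_agent operation; the child span names (input_assembly, tool_dispatch, and so on) are this article's phase labels rather than anything the convention mandates, and the harness object with build_messages, retrieve, call_model, and dispatch_tool is a stand-in for whatever your stack actually provides.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.harness")

def run_turn(harness, session, user_message):
    """One agent turn with each phase as its own child span.

    `harness` is a placeholder for whatever object actually owns
    build_messages / retrieve / call_model / dispatch_tool in your stack.
    """
    # Parent span, named after the GenAI convention's invoke_agent operation.
    with tracer.start_as_current_span("invoke_agent support-bot"):
        with tracer.start_as_current_span("input_assembly"):
            messages = harness.build_messages(session, user_message)

        with tracer.start_as_current_span("retrieval"):
            messages += harness.retrieve(user_message)

        # Prefill and decode both happen inside this call; the split is
        # recorded as separate span attributes (see the streaming sketch
        # further down).
        with tracer.start_as_current_span("chat"):
            response = harness.call_model(messages)

        for tool_call in response.tool_calls:
            with tracer.start_as_current_span("tool_dispatch"):
                handler, args = harness.dispatch_tool(tool_call)
            with tracer.start_as_current_span(f"execute_tool {handler.name}"):
                messages.append(handler(args))

        if response.tool_calls:
            # Re-prefill: the model reads the (now longer) context again.
            with tracer.start_as_current_span("chat"):
                response = harness.call_model(messages)

        return response
```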
The visible 12 seconds and the invisible 4
The most common misallocation of optimization budget on an agent product is this: the LLM call shows up as the largest visible chunk in the trace, so the team spends two sprints shaving prompt tokens, swapping in a faster model, and tuning streaming buffers. After all of that, p95 moves from 28 to 24 seconds. Users still complain.
The four seconds that nobody saved were spread across the other phases:
- 600ms of input assembly because every turn refetches the user's preference document instead of caching it for the session
- 1.2s of retrieval setup because the embedding model runs on a cold container that warms up after the first request and stays warm only as long as the session is active
- 800ms of tool dispatch because the harness queues tool calls through a Redis stream that's tuned for batch throughput, not interactive latency
- 1.4s of re-prefill on the second turn because the agent dropped a 4k-token tool result into context and the model has to process it before producing the next token
None of those four costs appears in the trace as a single span you'd notice. They show up as small gaps between spans, as work attributed to a generic "middleware" span, or as time that a parent span owns but no child accounts for. If you only see what's instrumented, you optimize what's instrumented, and you leave the four seconds on the floor.
The diagnostic fix is mechanical: every gap between a parent span's start and its first child's start, every gap between any two adjacent children, and every gap between the last child and the parent's end has to be either a named phase or an explicit "unaccounted" span. If a request claims 12 seconds and the spans only sum to 9, the missing 3 seconds is the bug.
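A minimal sketch of that gap check, on a simplified span model (name, start, end in seconds) rather than a real exporter pipeline; any interval of the parent that no child covers comes back as an explicit row instead of vanishing:

```python
from dataclasses import dataclass

@dataclass
class Span:
    name: str
    start: float   # seconds since request start
    end: float

def unaccounted_gaps(parent: Span, children: list[Span], min_gap: float = 0.05):
    """Return [(gap_start, gap_end)] intervals of parent time no child covers.

    min_gap filters out sub-50ms slivers that are just instrumentation noise.
    """
    gaps = []
    cursor = parent.start
    for child in sorted(children, key=lambda s: s.start):
        if child.start - cursor > min_gap:
            gaps.append((cursor, child.start))
        cursor = max(cursor, child.end)
    if parent.end - cursor > min_gap:
        gaps.append((cursor, parent.end))
    return gaps

# A 12 s turn whose named phases only cover 9 s: the missing 3 s shows up as
# explicit gaps instead of silently belonging to the parent.
turn = Span("agent.step", 0.0, 12.0)
phases = [Span("retrieval", 1.0, 2.0), Span("chat", 2.0, 8.0), Span("execute_tool", 9.0, 11.0)]
print(unaccounted_gaps(turn, phases))   # [(0.0, 1.0), (8.0, 9.0), (11.0, 12.0)]
```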
Critical path is not the sum of all spans
This is the second mistake teams make once they finally do break out their phases: they look at a trace with a 9-second tool call and an 8-second LLM call running in parallel, and they ask "which one do I optimize first?" — when the answer is that the 8-second LLM call is the entire wall-clock cost and the 9-second tool call is on a parallel branch that doesn't extend the request.
Critical path tracing, an idea Google formalized in their distributed-latency work and that the broader observability community has been adopting since around 2022, asks one question of every span: if I could make this span instantaneous, would the whole request finish faster? If yes, the span is on the critical path. If no — because something else longer was running in parallel — the span has slack, and optimizing it is wasted effort.
For agents this matters more than for classical RPC fan-outs because the agent harness can issue parallel tool calls when the model emits multiple tool requests in a single turn, and modern frameworks increasingly do this by default. A trace that says "this turn took 14 seconds and these six tool calls each took between 2 and 9 seconds" tells you nothing actionable until you collapse the parallel branches and report the critical path. The on-call needs to see "the 9-second tool call is the critical path; the other five ran in parallel and ended earlier; if you make any of the other five faster, p95 does not move."
The implementation is a per-request post-processing step that walks the span tree, identifies the longest dependent chain from request start to request end, and emits a critical_path_ms and slack_ms attribute on the parent span. After that lands, the dashboard widget for "what to optimize" becomes a sorted list by critical-path contribution rather than total span duration, and the team stops optimizing slack.
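A sketch of that walk on the same simplified model, now with children attached. This is the textbook formulation (take the latest-finishing child, then step backwards from its start), not any particular vendor's implementation; in production the result would be written back as critical_path_ms and slack_ms attributes on the parent span.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    name: str
    start: float
    end: float
    children: list["Span"] = field(default_factory=list)

def critical_path(span: Span) -> list[Span]:
    """Spans whose duration actually determines this span's wall-clock time."""
    path = [span]
    cursor = span.end
    # Walk backwards from the parent's end: the latest-finishing child that
    # ends at or before the cursor is on the critical path; everything still
    # running in parallel after that point is slack.
    for child in sorted(span.children, key=lambda c: c.end, reverse=True):
        if child.end <= cursor:
            path.extend(critical_path(child))
            cursor = child.start
    return path

# The 14 s turn from above: six parallel tool calls, only the 9 s one matters.
turn = Span("invoke_agent", 0.0, 14.0, children=[
    Span("chat plan", 0.0, 2.0),
    Span("execute_tool search", 2.0, 11.0),   # 9 s, on the critical path
    *[Span(f"execute_tool aux-{i}", 2.0, 2.0 + i) for i in range(2, 7)],  # 2-6 s, all slack
    Span("chat respond", 11.0, 14.0),
])
print([s.name for s in critical_path(turn)])
# ['invoke_agent', 'chat respond', 'execute_tool search', 'chat plan']
```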
Decode time is not LLM time
A subtle but expensive confusion lives inside even well-instrumented LLM spans: the span shows total LLM call duration, and engineers reason about it as one number. But the prefill phase and the decode phase have different cost models, different optimization levers, and different sensitivities to load on the inference cluster.
Prefill is compute-bound and scales with input length. If your prefill is slow, the levers are: shorter prompts, prompt caching at the provider (so a repeated system prompt doesn't repay its prefill cost on every turn), shorter retrieved passages, and not padding the context with tool definitions the agent never calls.
Decode is memory-bound and scales with output length. If your decode is slow, the levers are: shorter outputs (often by changing prompt patterns rather than adding "be concise"), structured outputs with a constrained grammar (which can decode faster), reduced reasoning effort budgets on models that expose them, and switching to a faster model for the response generation while keeping the slower model for planning.
A team that only sees llm.call.duration_ms = 8200 will spend a sprint trying to "make the LLM faster" and conclude there's nothing to do. A team that sees llm.prefill_ms = 1800 and llm.decode_ms = 6400 will correctly identify that the output is too long and spend the same sprint changing one section of the prompt to cut average output by 40%, recovering roughly 2.5 seconds. The information needed to make this distinction is cheap to capture: with streaming enabled, the time to the first token approximates prefill and the remainder of the call is decode, and several serving stacks report both directly. It just has to land in separate span attributes instead of being collapsed into one duration.
The OpenTelemetry GenAI conventions define gen_ai.server.time_to_first_token and gen_ai.server.time_per_output_token as the metrics to emit for exactly this split. Teams that haven't migrated to those conventions are paying a debugging tax every incident.
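If you capture the split client-side instead, a sketch looks like the following, assuming a generic stream_completion generator that yields text chunks (a placeholder, not any specific SDK). The attribute names match the ones used above; note that client-side time-to-first-token also includes network and queueing time, so it slightly overstates prefill.

```python
import time
from opentelemetry import trace

tracer = trace.get_tracer("agent.harness")

def timed_llm_call(stream_completion, messages):
    """Record prefill (time to first token) and decode separately on the span."""
    with tracer.start_as_current_span("chat") as span:
        start = time.monotonic()
        first_token_at = None
        chunks = []
        for chunk in stream_completion(messages):   # placeholder streaming call
            if first_token_at is None:
                first_token_at = time.monotonic()
            chunks.append(chunk)
        done = time.monotonic()

        # Time to first token approximates prefill; the rest of the call is decode.
        prefill_ms = (first_token_at - start) * 1000 if first_token_at else 0.0
        decode_ms = (done - (first_token_at or done)) * 1000
        span.set_attribute("llm.prefill_ms", round(prefill_ms, 1))
        span.set_attribute("llm.decode_ms", round(decode_ms, 1))
        return "".join(chunks)
```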
Why this hurts more than it would for a regular service
In a classical service, latency is a UX concern. In an agent, latency is also a cost concern, because every second of wall-clock time during decode is a second the model is actively spending tokens — and on reasoning models, those tokens include the invisible "thinking" tokens that get billed but never reach the user. A 5-second decode that's 80% thinking tokens and 20% output tokens is paying 4 seconds of compute for nothing the user sees, and the only signal that this is happening is the usage accounting in the response, where the billed output tokens exceed the tokens the user actually received.
The cost-per-second math also makes the optimization budget tighter. A traditional web service can spend three engineer-weeks shaving 200ms off a request that ships a million times a day, and the math works because the optimization is permanent and the request is cheap. An agent step costs cents to dollars per invocation, ships at much lower volume, and a "wrong layer" optimization that targets the visible LLM span instead of the hidden retrieval setup is two engineer-weeks that produced no measurable user improvement and no cost reduction. The ratio of investigation cost to fix cost is much worse than for traditional services, which is why the up-front investment in per-phase visibility pays back faster.
There's a security and governance angle too. When the model decides at runtime which tools to call and how many times to call them, a regression in tool selection can quietly double per-request cost without changing any code on the team's side. The first signal of this is usually a budget alert weeks later. A team that emits per-phase timing also gets a per-phase token-cost breakdown for free from the same instrumentation, and "average tool calls per turn" becomes a leading indicator that catches the regression in days, not at the end of the billing cycle.
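Because the harness already wraps every dispatch, turning tool calls per turn into a leading indicator is a few lines with the OpenTelemetry metrics API. The metric name agent.tool_calls_per_turn here is our own choice, not a semantic-convention name.

```python
from opentelemetry import metrics

meter = metrics.get_meter("agent.harness")

# A histogram, so the dashboard can alert on the p95 of calls per turn rather
# than the average that a counter-over-counter ratio would give.
tool_calls_per_turn = meter.create_histogram(
    "agent.tool_calls_per_turn",
    unit="{call}",
    description="Tool calls the model requested in a single agent turn",
)

def record_turn(tool_calls, agent_name="support-bot"):
    tool_calls_per_turn.record(len(tool_calls), attributes={"agent.name": agent_name})
```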
What "good" instrumentation looks like
A useful target for an agent product team is a where_did_the_budget_go report that any product engineer can pull in under five minutes when a user complains a session was slow. It needs to show, for the slow turn:
- Wall-clock duration of the turn
- Critical-path contribution per phase, sorted descending
- The prefill/decode split for each LLM call
- The parallel-vs-serial classification for each tool call
- Any unaccounted-for time as an explicit row, not a hidden gap
- Token cost per phase, so the latency story and the cost story line up
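A sketch of the report's shape, assembled from the attributes the earlier sketches already emit; the row fields mirror the list above, and nothing here is a standard schema.

```python
from dataclasses import dataclass

@dataclass
class PhaseRow:
    phase: str
    critical_path_ms: float        # contribution to wall-clock, after removing slack
    total_ms: float                # raw span duration; > critical_path_ms means parallelism
    tokens: int
    cost_usd: float
    prefill_ms: float | None = None   # only populated for LLM phases
    decode_ms: float | None = None

def where_did_the_budget_go(turn_wall_clock_ms: float, rows: list[PhaseRow]) -> str:
    lines = [f"turn wall clock: {turn_wall_clock_ms:.0f} ms"]
    accounted = 0.0
    for row in sorted(rows, key=lambda r: r.critical_path_ms, reverse=True):
        accounted += row.critical_path_ms
        split = (f" (prefill {row.prefill_ms:.0f} / decode {row.decode_ms:.0f})"
                 if row.prefill_ms is not None and row.decode_ms is not None else "")
        lines.append(
            f"  {row.phase:<22} {row.critical_path_ms:7.0f} ms{split}"
            f"  {row.tokens:>6} tok  ${row.cost_usd:.4f}"
        )
    # Unaccounted time is an explicit row, never a silent gap.
    lines.append(f"  {'unaccounted':<22} {turn_wall_clock_ms - accounted:7.0f} ms")
    return "\n".join(lines)
```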
Building this report on top of OpenTelemetry GenAI semantic conventions takes a few days of harness work, and the payoff is that the next incident triage takes five minutes instead of two days. The teams that haven't built it tend to discover its absence the first time a customer-success conversation requires a real answer to "why was my Tuesday afternoon slow?"
The architectural realization
Agent latency is not a metric. It is the sum of seven metrics, each with a different cost model, a different set of optimization levers, and a different relationship to the critical path. A trace tool that reports a single number for "agent step duration" is the agent equivalent of a load balancer reporting only "request duration" with no breakdown of upstream backend time, network time, or queue time — useful for noticing that something is wrong, useless for fixing it.
The teams shipping agent products that feel fast at p95 have done this decomposition. They emit per-phase spans, they compute critical paths, they separate prefill from decode, they make unaccounted-for time visible, and they pair every latency number with a token-cost number from the same instrumentation. They didn't get faster than the rest of the field by being smarter about prompts. They got faster by being smarter about which seconds were actually slowing down the user, and not optimizing the rest.
- https://queue.acm.org/detail.cfm?id=3526967
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-agent-spans/
- https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests
- https://www.morphllm.com/llm-inference
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://bentoml.com/llm/inference-optimization/llm-inference-metrics
- https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html
- https://arize.com/ai-agents/agent-observability/
- https://www.langchain.com/articles/agent-observability
- https://phoenix.arize.com/
- https://appscale.blog/en/blog/hidden-costs-of-rag-in-production-vector-db-reranking-latency-2026
- https://www.elastic.co/observability-labs/blog/tracing-data-inferred-spans-opentelemetry
