Skip to main content

Where the 30 Seconds Went: Latency Attribution Inside an Agent Step Your APM Can't See

· 11 min read
Tian Pan
Software Engineer

The dashboard says agent.run = 28s at p95. Users say the feature feels broken. The on-call engineer opens the trace, sees a single fat bar with no children worth investigating, and starts guessing. By the time someone has rebuilt enough mental model to know whether the bottleneck is the model, the retriever, or a tool call that nobody added a span to, the incident has aged into a backlog ticket and the user has given up.

This is the failure mode at the heart of agent operations in 2026: classical APM treats an agent step as a black box, and "agent latency" is not a metric — it is the sum of seven metrics that decompose the wall-clock time differently depending on what the agent decided to do that turn. A team that doesn't expose those seven numbers ships a feature whose slowness everyone can feel and nobody can fix.

The seven phases hiding inside one span

When a user-facing dashboard reports agent.step = 12s, the wall-clock time is almost always a stack of these phases, in order:

  1. Input assembly — building the message list, applying the system prompt, hydrating session state, fetching prior tool outputs. Cheap when warm, surprisingly expensive when the session has rolled over and a cache miss forces a database read on every turn.
  2. Retrieval — embedding the query, hitting the vector store, optionally re-ranking. Production vector DB queries land in the 50–300ms range per call, and re-ranking can add another 100–500ms on top depending on the model and the candidate count.
  3. Prefill — the model processing the entire input context to produce its first output token. Prefill is compute-bound and grows with input length; doubling the prompt roughly doubles time-to-first-token.
  4. Decode — the model generating tokens one at a time, with each step gated by memory bandwidth on the KV cache. This is where reasoning effort and output length burn budget linearly, and where shared GPU batches with other traffic introduce contention nobody on the application team can see.
  5. Tool dispatch — the harness parsing the model's tool call, validating arguments, and routing to the right handler. Usually small, but easy to bloat when the harness round-trips through a queue.
  6. Tool execution — the actual external work. Could be a 50ms cache read, could be a 9-second LLM-as-a-tool nested call. The variance here is the single largest source of "why is p99 so much worse than p50."
  7. Re-prefill — the model processing the tool result back into context to produce the next response token. Often forgotten in mental models because it looks visually like "the model thinking again," but it's a fresh prefill pass over a longer context that grows with every tool round.

A normal trace shows steps 1–7 collapsed into a single agent.step span, or worse, two of them broken out (llm.call and tool.call) and the other five swept under a parent span's wall-clock interval. The OpenTelemetry GenAI semantic conventions, which moved through the spec process during 2024–2025, define operation names like chat, execute_tool, and invoke_agent — but adopting the convention is what unlocks the breakdown, and a team that emits only the parent span gets the same flat view they had before.

The visible 12 seconds and the invisible 4

The most common misallocation of optimization budget on an agent product is this: the LLM call shows up as the largest visible chunk in the trace, so the team spends two sprints shaving prompt tokens, swapping in a faster model, and tuning streaming buffers. After all of that, p95 moves from 28 to 24 seconds. Users still complain.

The four seconds that nobody saved were spread across the other phases:

  • 600ms of input assembly because every turn refetches the user's preference document instead of caching it for the session
  • 1.2s of retrieval setup because the embedding model runs on a cold container that warms up after the first request and stays warm only as long as the session is active
  • 800ms of tool dispatch because the harness queues tool calls through a Redis stream that's tuned for batch throughput, not interactive latency
  • 1.4s of re-prefill on the second turn because the agent dropped a 4k-token tool result into context and the model has to process it before producing the next token

None of those four costs appears in the trace as a single span you'd notice. They show up as small gaps between spans, as work attributed to a generic "middleware" span, or as time that a parent span owns but no child accounts for. If you only see what's instrumented, you optimize what's instrumented, and you leave the four seconds on the floor.

The diagnostic fix is mechanical: every gap between a parent span's start and its first child's start, every gap between any two adjacent children, and every gap between the last child and the parent's end has to be either a named phase or an explicit "unaccounted" span. If a request claims 12 seconds and the spans only sum to 9, the missing 3 seconds is the bug.

Critical path is not the sum of all spans

This is the second mistake teams make once they finally do break out their phases: they look at a trace with a 9-second tool call and an 8-second LLM call running in parallel, and they ask "which one do I optimize first?" — when the answer is that the 8-second LLM call is the entire wall-clock cost and the 9-second tool call is on a parallel branch that doesn't extend the request.

Critical path tracing, an idea Google formalized in their distributed-latency work and that the broader observability community has been adopting since around 2022, asks one question of every span: if I could make this span instantaneous, would the whole request finish faster? If yes, the span is on the critical path. If no — because something else longer was running in parallel — the span has slack, and optimizing it is wasted effort.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates