
Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.

The two-regime problem under one duration

Every LLM call is two operations stapled together. Prefill is a forward pass over every prompt token to build the KV cache and emit the first output token. It's a large matrix multiplication, compute-bound, scaling roughly with prompt length times model FLOPs. Decode is the autoregressive loop that emits each subsequent token. Each step reloads the full weight tensor plus the growing KV cache from GPU memory, so decode is memory-bandwidth bound, scaling with output length times KV-cache size.
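To make the asymmetry concrete, here is a back-of-envelope model of the two phases. Every constant in it (model size, bandwidth, FLOPs, KV bytes per token) is a placeholder assumption you would swap for your own hardware and model; the point is only which variable each phase scales with.

```python
# Purely illustrative first-order model; every constant is an assumption,
# not a measurement. Swap in your own model and hardware numbers.
def estimate_phase_latencies(
    prompt_tokens: int,
    output_tokens: int,
    params: float = 8e9,                 # assumed model size (parameters)
    bytes_per_param: float = 2.0,        # bf16 weights
    gpu_flops: float = 1e15,             # assumed sustained FLOP/s
    gpu_mem_bw: float = 3.3e12,          # assumed memory bandwidth, bytes/s
    kv_bytes_per_token: float = 0.13e6,  # assumed KV-cache footprint per token
) -> tuple[float, float]:
    # Prefill: compute-bound. Roughly 2 * params FLOPs per prompt token, one pass.
    prefill_s = 2 * params * prompt_tokens / gpu_flops

    # Decode: memory-bandwidth bound. Each step re-reads the weights plus the
    # (growing) KV cache; use the average cache size over the generation.
    weight_bytes = params * bytes_per_param
    avg_kv_bytes = kv_bytes_per_token * (prompt_tokens + output_tokens / 2)
    decode_s = output_tokens * (weight_bytes + avg_kv_bytes) / gpu_mem_bw

    return prefill_s, decode_s


# 50k-token prompt, 500-token answer: prefill dominates TTFT, decode dominates the rest.
ttft_ish, decode_ish = estimate_phase_latencies(50_000, 500)
```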

These phases respond to different knobs. Increase batch size and prefill gets worse (more tokens to chew through in one iteration) while decode gets better (amortizing the weight-reload cost). Chunk prefills and you improve inter-token latency at the cost of time-to-first-token. Enable speculative decoding and decode speeds up while prefill is untouched. An engine like vLLM exposes these tradeoffs directly, but a single span duration erases them.
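For the self-hosted case, these are the knobs vLLM puts in front of you. A minimal sketch using vLLM's offline LLM API, assuming a recent build; argument names have shifted between releases, so treat it as a starting point rather than a pinned config.

```python
from vllm import LLM, SamplingParams

# Assumed recent vLLM; check the argument names against your installed version.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,   # caps prefill work per step: better ITL, worse TTFT
    max_num_batched_tokens=2048,   # the per-iteration token budget chunking enforces
    max_num_seqs=64,               # batch-size ceiling: bigger helps decode, hurts prefill
)

outputs = llm.generate(
    ["Explain continuous batching in one paragraph."],
    SamplingParams(max_tokens=128),
)
```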

The user-visible proxies are TTFT (time to first token) and ITL (inter-token latency, sometimes called TPOT). OpenTelemetry's current GenAI semantic conventions recognize gen_ai.response.time_to_first_chunk but define nothing for the decode-side distribution. So your dashboard can tell you TTFT regressed; it cannot tell you whether ITL regressed alongside it, which means it cannot distinguish "prompts got longer" from "the batch got crowded." Those require different fixes.
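If you stream, you can recover both numbers client-side and put them on the span yourself. A sketch assuming the OpenAI Python SDK's streaming interface and an already-configured OpenTelemetry tracer; the decode-side attribute name is a custom invention, since the spec has no slot for it.

```python
import time

from openai import OpenAI
from opentelemetry import trace

tracer = trace.get_tracer("llm.client")
client = OpenAI()

def stream_with_phase_timing(prompt: str, model: str = "gpt-4o") -> None:
    with tracer.start_as_current_span(f"chat {model}") as span:
        start = time.monotonic()
        first_at = None
        last_at = None
        chunks = 0

        stream = client.chat.completions.create(
            model=model,
            messages=[{"role": "user", "content": prompt}],
            stream=True,
        )
        for chunk in stream:
            # Skip role-only / empty deltas so we time actual generated content.
            if not chunk.choices or not chunk.choices[0].delta.content:
                continue
            now = time.monotonic()
            first_at = first_at or now
            last_at = now
            chunks += 1

        ttft = (first_at or time.monotonic()) - start
        itl = (last_at - first_at) / max(chunks - 1, 1) if first_at else 0.0

        # The first attribute is the field this post cites from the GenAI spec;
        # the second is a custom name, since the spec has no decode-side slot.
        span.set_attribute("gen_ai.response.time_to_first_chunk", ttft)
        span.set_attribute("llm.decode.inter_token_latency_mean", itl)
```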

The cache-hit ratio the API gives you, but the span hides

Prompt caching is the biggest latency knob in most production systems. Anthropic quotes up to 85% latency reduction on long prompts; OpenAI quotes up to 80%. The wire format tells you exactly what happened: usage.cache_read_input_tokens versus usage.cache_creation_input_tokens on Anthropic, usage.prompt_tokens_details.cached_tokens on OpenAI, cachedContentTokenCount on Gemini. These fields are the single most useful signal for capacity planning, and almost nobody has them on their span.
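Getting those counters onto the span starts with normalizing the three wire formats. A sketch that pulls the fields named above out of each provider's usage payload; the response shapes are simplified to dicts, Anthropic's input_tokens is treated as the uncached remainder, and Gemini's promptTokenCount is assumed to be the total prompt count.

```python
def cache_counters(provider: str, usage: dict) -> tuple[int, int]:
    """Return (cached_input_tokens, total_input_tokens) from a usage payload.

    Field paths follow the ones named above; anything missing is treated as
    zero rather than guessed.
    """
    if provider == "anthropic":
        cached = usage.get("cache_read_input_tokens", 0)
        # input_tokens is the uncached remainder, so total is the sum of all three.
        total = (usage.get("input_tokens", 0)
                 + cached
                 + usage.get("cache_creation_input_tokens", 0))
    elif provider == "openai":
        cached = usage.get("prompt_tokens_details", {}).get("cached_tokens", 0)
        total = usage.get("prompt_tokens", 0)
    elif provider == "gemini":
        cached = usage.get("cachedContentTokenCount", 0)
        total = usage.get("promptTokenCount", 0)
    else:
        cached, total = 0, 0
    return cached, total
```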

The OTel GenAI spec does include slots for the Anthropic-style counters. What it doesn't derive is the ratio — and the ratio is what triggers the alert. When your 50k-token system prompt goes from 99% cache read to 60% cache read across a Tuesday afternoon, your p95 will climb by a second per call and your bill will triple, but your logs and dashboards will show nothing unusual unless you computed the ratio yourself and put it on the span.
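Deriving the ratio and recording it is a few lines; the attribute names here are custom, not from the GenAI spec.

```python
from opentelemetry import trace

def record_cache_ratio(span: trace.Span, cached: int, total: int) -> float:
    # The derived ratio is the alertable signal; the raw counters alone won't page anyone.
    ratio = cached / total if total else 0.0
    span.set_attribute("llm.cache.read_ratio", ratio)
    span.set_attribute("llm.cache.read_tokens", cached)
    span.set_attribute("llm.cache.input_tokens_total", total)
    return ratio
```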

Worse, a cache miss has multiple causes that matter. Did the prompt change? Did the provider evict your block due to pressure from another tenant? Is it a cold start? The span needs a cache_miss event with a reason attribute, because "cache hit ratio dropped 30%" reads completely differently depending on whether your own code changed something or the provider rebalanced capacity.
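The event itself is trivial to emit; the hard part is deciding the reason, which only your client-side state (did the hashed prefix change? how long since the last hit?) can answer. The attribute names below are hypothetical.

```python
def emit_cache_miss(span, prompt_prefix_hash: str, reason: str) -> None:
    # reason is whatever the client can actually determine, e.g.
    # "prompt_changed", "ttl_expired", "cold_start", or "unknown_eviction"
    # when the prefix is identical but the provider still missed.
    span.add_event(
        "cache_miss",
        attributes={
            "llm.cache.prompt_prefix_hash": prompt_prefix_hash,
            "llm.cache.miss_reason": reason,
        },
    )
```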

The batch you're in, and the neighbors you can't see

Every hosted inference endpoint worth using runs continuous batching. vLLM, TGI, SGLang, TensorRT-LLM, and every provider built on top of them interleave your tokens with whoever else arrived in the same iteration. This is how PagedAttention gets 20× throughput versus naive serving. It's also why your p99 latency is a function of your neighbors, not your prompt.

When a large prefill enters the batch, active decodes stall while the engine grinds through it. Anyscale's original write-up on continuous batching named it explicitly: as systems saturate, new requests get injected later, so request latency rises even though throughput looks healthy. BentoML's inference handbook adds the same warning — larger batches improve utilization but elevate tail latency, sometimes dramatically, when prefills crowd out active decodes.

None of this is on your span. There is no gen_ai.server.batch_size, no gen_ai.server.batch_position, no gen_ai.server.queue_wait_ms. The underlying signals exist in the server's internal counters — vLLM exposes them in Prometheus — but providers don't propagate them to you, and the OTel GenAI spec doesn't reserve a slot. So when a customer complains about a spike at 14:07, you can see that your span got slow; you cannot see that the provider's queue depth doubled for six minutes.
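If you self-host, you can close part of this gap yourself by stamping the engine's queue state onto the client span. A sketch against vLLM's Prometheus endpoint: the metric names vllm:num_requests_waiting and vllm:num_requests_running are what recent builds expose (verify against yours), and the gen_ai.server.* attributes are the slots this post argues should exist, not ones the spec defines.

```python
import urllib.request

VLLM_GAUGES = ("vllm:num_requests_waiting", "vllm:num_requests_running")

def scrape_queue_state(metrics_url: str = "http://localhost:8000/metrics") -> dict:
    # Pull the two gauges out of the Prometheus text exposition format.
    gauges = {}
    with urllib.request.urlopen(metrics_url, timeout=2) as resp:
        for line in resp.read().decode().splitlines():
            if line.startswith(VLLM_GAUGES):
                name, value = line.rsplit(" ", 1)
                gauges[name.split("{")[0]] = float(value)
    return gauges

def stamp_batch_context(span, gauges: dict) -> None:
    # These attribute names are not in the OTel GenAI spec today; they are the
    # slots this post argues should be reserved.
    span.set_attribute("gen_ai.server.queue_depth",
                       gauges.get("vllm:num_requests_waiting", -1.0))
    span.set_attribute("gen_ai.server.active_requests",
                       gauges.get("vllm:num_requests_running", -1.0))
```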

The request ID the provider wants, and the trace ID they don't index
