Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency
Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.
The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.
The two-regime problem under one duration
Every LLM call is two operations stapled together. Prefill is a forward pass over every prompt token to build the KV cache and emit the first output token. It's a large matrix multiplication, compute-bound, scaling roughly with prompt length times model FLOPs. Decode is the autoregressive loop that emits each subsequent token. Each step reloads the full weight tensor plus the growing KV cache from GPU memory, so decode is memory-bandwidth bound, scaling with output length times KV-cache size.
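You can see the asymmetry with a back-of-envelope roofline model. The sketch below assumes an 8B dense model on a single H100-class GPU; every constant is an illustrative assumption, not a measurement, and real engines land above these ideal lower bounds:

```python
# Roofline latency model: prefill is compute-bound, decode is
# memory-bandwidth bound. All constants are illustrative assumptions
# (8B dense model, H100-class GPU, batch size 1, ideal utilization).

PARAMS = 8e9                 # model parameters
BYTES_PER_PARAM = 2          # bf16 weights
PEAK_FLOPS = 989e12          # dense bf16 FLOP/s, H100 SXM (assumed)
MEM_BW = 3.35e12             # HBM bandwidth in bytes/s, H100 SXM (assumed)
KV_BYTES_PER_TOKEN = 131_072 # 32 layers x 8 KV heads x 128 dim x 2 (K,V) x 2 bytes

def prefill_seconds(prompt_tokens: int) -> float:
    # ~2 FLOPs per parameter per token for the forward pass
    return (2 * PARAMS * prompt_tokens) / PEAK_FLOPS

def decode_seconds(output_tokens: int, context_tokens: int) -> float:
    # every decode step re-reads the weights plus the current KV cache
    step_bytes = PARAMS * BYTES_PER_PARAM + context_tokens * KV_BYTES_PER_TOKEN
    return output_tokens * step_bytes / MEM_BW

# 50k-token prompt, 500-token answer: ~0.8s of prefill and ~3.4s of
# decode, governed by different hardware limits, fused into one span.
print(f"prefill: {prefill_seconds(50_000):.2f}s")
print(f"decode:  {decode_seconds(500, 50_000):.2f}s")
```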
These phases respond to different knobs. Increase batch size and prefill gets worse (more tokens to chew through in one iteration) while decode gets better (amortizing the weight-reload cost). Chunk prefills and you improve inter-token latency at the cost of time-to-first-token. Enable speculative decoding and decode speeds up while prefill is untouched. An engine like vLLM exposes these tradeoffs directly, but a single span duration erases them.
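If you run the engine yourself, the knobs are explicit. A configuration sketch, assuming recent vLLM argument names; verify them against your version:

```python
from vllm import LLM

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_num_seqs=128,              # batch width: decode throughput vs. prefill stalls
    max_num_batched_tokens=4096,   # per-iteration token budget for chunked prefill
    enable_chunked_prefill=True,   # trade some TTFT for steadier inter-token latency
)
```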
The user-visible proxies are TTFT (time to first token) and ITL (inter-token latency, sometimes called TPOT, time per output token). OpenTelemetry's current GenAI semantic conventions recognize gen_ai.response.time_to_first_chunk and define nothing for the decode-side distribution. So your dashboard can tell you TTFT regressed. It cannot tell you whether ITL regressed alongside it, which means it cannot distinguish "prompts got longer" from "the batch got crowded." Those require different fixes.
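Both proxies are measurable client-side from any streaming response, so you don't have to wait for the spec. A minimal sketch, assuming the OpenAI Python SDK and the OpenTelemetry API; the llm.* attribute names are this article's stand-ins, not spec names:

```python
import time

from openai import OpenAI
from opentelemetry import trace

client = OpenAI()
tracer = trace.get_tracer("llm-client")

def timed_completion(messages):
    with tracer.start_as_current_span("chat gpt-4o") as span:
        start = time.perf_counter()
        arrivals = []  # wall-clock time of each content chunk
        stream = client.chat.completions.create(
            model="gpt-4o", messages=messages, stream=True
        )
        for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                arrivals.append(time.perf_counter())
        if not arrivals:
            return
        span.set_attribute("llm.ttft_ms", (arrivals[0] - start) * 1000)
        span.set_attribute("llm.decode_ms", (arrivals[-1] - arrivals[0]) * 1000)
        itls = sorted(b - a for a, b in zip(arrivals, arrivals[1:]))
        if itls:
            span.set_attribute("llm.itl_p50_ms", itls[len(itls) // 2] * 1000)
            span.set_attribute("llm.itl_p95_ms", itls[int(len(itls) * 0.95)] * 1000)
```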
The cache-hit ratio the API gives you, but the span hides
Prompt caching is the biggest latency knob in most production systems. Anthropic quotes up to 85% latency reduction on long prompts; OpenAI quotes up to 80%. The wire format tells you exactly what happened: usage.cache_read_input_tokens versus usage.cache_creation_input_tokens on Anthropic, usage.prompt_tokens_details.cached_tokens on OpenAI, cachedContentTokenCount on Gemini. These fields are the single most useful signal for capacity planning, and almost nobody has them on their span.
The OTel GenAI spec does include slots for the Anthropic-style counters. What it doesn't derive is the ratio — and the ratio is what triggers the alert. When your 50k-token system prompt goes from 99% cache read to 60% cache read across a Tuesday afternoon, your p95 will climb by a second per call and your bill will triple, but your logs and dashboards will show nothing unusual unless you computed the ratio yourself and put it on the span.
Worse, a cache miss has multiple causes that matter. Did the prompt change? Did the provider evict your block due to pressure from another tenant? Is it a cold start? The span needs a cache_miss event with a reason attribute, because "cache hit ratio dropped 30%" reads completely differently depending on whether your own code changed something or the provider rebalanced capacity.
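Deriving the ratio and classifying the miss takes a dozen lines. A sketch against Anthropic's usage fields; the attribute name, the event name, and the eviction heuristic are proposals, not anything the wire format gives you (300 s is Anthropic's default cache lifetime):

```python
import hashlib
import time

_last_seen: dict[str, float] = {}  # prompt-prefix hash -> last send time
CACHE_TTL_S = 300                  # Anthropic's default 5-minute cache TTL

def record_cache_state(span, cached_prefix: str, usage) -> None:
    read = usage.cache_read_input_tokens or 0
    created = usage.cache_creation_input_tokens or 0
    total_input = (usage.input_tokens or 0) + read + created
    span.set_attribute("gen_ai.cache.hit_ratio",
                       read / total_input if total_input else 0.0)

    key = hashlib.sha256(cached_prefix.encode()).hexdigest()
    now = time.monotonic()
    if created and not read:            # the block was rebuilt: a full miss
        if key not in _last_seen:
            reason = "prompt_changed"   # first time this prefix was sent
        elif now - _last_seen[key] > CACHE_TTL_S:
            reason = "cold"             # we let the TTL lapse ourselves
        else:
            reason = "evicted"          # provider dropped it under pressure
        span.add_event("cache_miss", {"reason": reason})
    _last_seen[key] = now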
The batch you're in, and the neighbors you can't see
Every hosted inference endpoint worth using runs continuous batching. vLLM, TGI, SGLang, TensorRT-LLM, and every provider built on top of them interleave your tokens with whoever else arrived in the same iteration. This is how continuous batching on top of PagedAttention buys its roughly 20× throughput win over naive request-at-a-time serving. It's also why your p99 latency is a function of your neighbors, not your prompt.
When a large prefill enters the batch, active decodes stall while the engine grinds through it. Anyscale's original write-up on continuous batching named it explicitly: as systems saturate, new requests get injected later, so request latency rises even though throughput looks healthy. BentoML's inference handbook adds the same warning — larger batches improve utilization but elevate tail latency, sometimes dramatically, when prefills crowd out active decodes.
None of this is on your span. There is no gen_ai.server.batch_size, no gen_ai.server.batch_position, no gen_ai.server.queue_wait_ms. These attributes exist in the server's internal counters — vLLM exposes them in Prometheus — but providers don't propagate them to you, and the OTel GenAI spec doesn't reserve a slot. So when a customer complains about a spike at 14:07, you can see that your span got slow, you cannot see that the provider's queue depth doubled for six minutes.
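If you run the engine yourself, the counters are sitting in /metrics. A minimal poller against vLLM's Prometheus endpoint; the metric names match recent vLLM releases, but verify them against your version:

```python
import requests

def vllm_queue_state(base_url: str = "http://localhost:8000") -> dict[str, float]:
    # Scrape the Prometheus text format and keep just the batch-state gauges.
    out = {}
    for line in requests.get(f"{base_url}/metrics", timeout=2).text.splitlines():
        if line.startswith(("vllm:num_requests_running", "vllm:num_requests_waiting")):
            name, _, value = line.rpartition(" ")
            out[name] = float(value)
    return out

# A waiting count that stays above zero means new arrivals are queueing
# behind the current batch: exactly the decode-stall regime described above.
print(vllm_queue_state())
```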
The request ID the provider wants, and the trace ID they don't index
You open a support ticket. The provider asks for the x-request-id from the response headers. You have a trace ID. They are not the same thing, and they are not joined anywhere.
OpenAI and Anthropic both return a request ID on every response — OpenAI as x-request-id and SDK field _request_id, Anthropic as request-id and message._request_id. These are the identifiers the provider indexes on internally. Without them, a joint post-mortem is "we saw latency around 14:07 UTC, maybe that helps" — and it does not help.
OTel reserves gen_ai.response.id for the provider's response ID (the id field in the response body, e.g., the chatcmpl-... string). That is a different value from the HTTP request ID, and support engineers usually want the latter. A gen_ai.provider.request_id attribute is a five-line instrumentation change that converts a vague ticket into a useful one. Most teams never ship it.
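Here is the five-line version, assuming the OpenAI Python SDK (which documents _request_id); the gen_ai.provider.request_id attribute name is this article's proposal, not spec:

```python
from openai import OpenAI
from opentelemetry import trace

client = OpenAI()

with trace.get_tracer("llm-client").start_as_current_span("chat") as span:
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "hello"}],
    )
    span.set_attribute("gen_ai.response.id", resp.id)                   # body ID (chatcmpl-...)
    span.set_attribute("gen_ai.provider.request_id", resp._request_id) # HTTP header ID
```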
Speculative decoding variance, and the acceptance rate nobody logs
If your provider uses speculative decoding — and providers increasingly do, often silently — your latency has a new axis of variance. A small draft model proposes γ tokens, the big model verifies them, and accepted tokens ship for free. NVIDIA's published numbers show 2–3× speedup at acceptance rate α≥0.6 and γ≥5. Below that, speedup collapses, sometimes back toward baseline.
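The collapse falls out of the standard speculative-decoding arithmetic: with i.i.d. per-token acceptance probability α and draft length γ, each target-model pass emits (1 − α^(γ+1)) / (1 − α) tokens in expectation. A sketch, ignoring draft-model cost:

```python
def expected_tokens_per_step(a: float, g: int) -> float:
    # Expected tokens emitted per target-model verification pass,
    # under the standard i.i.d.-acceptance model of speculative decoding.
    return (1 - a ** (g + 1)) / (1 - a)

for a in (0.8, 0.6, 0.3):
    print(f"alpha={a}: {expected_tokens_per_step(a, g=5):.2f} tokens per pass")
# alpha=0.8: 3.69   alpha=0.6: 2.38   alpha=0.3: 1.43 -- and the draft
# model's overhead still has to be paid, so low alpha can land below baseline.
```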
Acceptance rate is workload-dependent. Code generation and in-distribution prose hit high rates. Out-of-distribution tasks and adversarial prompts don't. Your traffic mix drifts over a quarter. Your p95 drifts with it. Neither the wire format nor any APM surfaces acceptance rate today, so the drift is invisible until a sales engineer runs a demo that used to feel snappy and doesn't anymore.
Proposed attributes — gen_ai.speculative.accepted_tokens, gen_ai.speculative.proposed_tokens, gen_ai.speculative.acceptance_rate — aren't in the spec. Teams running their own inference can emit them from vLLM's metrics. Teams on hosted APIs are at the mercy of whatever the provider chooses to expose in response headers, which today is usually nothing.
What the tracing surface should look like
A span that would actually let you disambiguate a p99 spike has about thirty attributes, not seven. The minimum useful schema, grouped by purpose:
- Timing split: `time_to_first_token`, `decode_duration`, and an `inter_token_latency` distribution (p50/p95/p99) as separate attributes. TTFT alone is not enough.
- Cache accounting: `cache_read.input_tokens`, `cache_creation.input_tokens`, a computed `cache_hit_ratio`, and a span event on miss with a `reason` (prompt_changed / evicted / cold).
- Batching state (provider-side, propagated via headers if the provider cooperates): `batch_size`, `batch_position`, `queue_wait_ms`, `running_requests`, `waiting_requests`.
- Speculative decoding: `accepted_tokens`, `proposed_tokens`, `acceptance_rate`.
- Provider correlation: `provider.request_id` (the HTTP header, not the response body ID), `provider.region`, `provider.pool_id` so you can detect silent routing changes.
With those, a spike resolves cleanly. TTFT rose and ITL held → a prefill surge, probably a long prompt or a cache miss; check the cache-hit ratio. ITL rose and TTFT held → decode contention, likely a noisy neighbor or batch crowding; check queue_wait_ms. Both rose together → provider capacity event; file a ticket with the provider.request_id. Acceptance rate dropped → your traffic shifted or the provider swapped a draft model. Four hypotheses, four queries, forty minutes of debugging instead of three days.
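That triage table is mechanical enough to ship as code. A hypothetical classifier over windowed deltas; the thresholds are placeholders, and every input assumes the span attributes above actually exist:

```python
def diagnose(d_ttft: float, d_itl: float, d_accept_rate: float,
             threshold: float = 0.25) -> str:
    # Inputs are relative deltas vs. baseline (0.25 = 25% regression).
    if d_accept_rate < -threshold:
        return "speculation: traffic shifted or provider swapped the draft model"
    if d_ttft > threshold and d_itl <= threshold:
        return "prefill surge: check prompt length and cache_hit_ratio"
    if d_itl > threshold and d_ttft <= threshold:
        return "decode contention: check queue_wait_ms and batch state"
    if d_ttft > threshold and d_itl > threshold:
        return "provider capacity event: file a ticket with provider.request_id"
    return "no clear regime change: widen the window"
```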
The capacity-planning gap
The existing observability tools for LLMs — Langfuse, LangSmith, Traceloop, OpenLLMetry, Arize Phoenix, Helicone — are excellent at what they were built for: cost debugging and quality debugging. They'll tell you which prompts are expensive, which outputs failed evals, which tenants are hogging tokens. They will not tell you whether a latency regression is your code, your prompt, the provider's batch, or the provider's infrastructure, because the attributes required to answer that question don't exist in the spec they all inherit from.
Capacity planning — "can my service handle 2× traffic at the current latency SLO?" — is the question this tooling can't answer. The answer depends on prefill/decode ratios across your prompt distribution, cache-hit ratios you don't log, batch-position penalties you can't see, and speculative-decoding acceptance rates nobody reports. The teams that survive the next usage spike are the ones that decided their span was too narrow and built the missing instrumentation themselves. The OTel spec will catch up, probably over the course of 2026. Your production incident won't wait.
- https://opentelemetry.io/docs/specs/semconv/gen-ai/
- https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
- https://github.com/open-telemetry/semantic-conventions/blob/main/docs/gen-ai/gen-ai-spans.md
- https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
- https://docs.vllm.ai/en/stable/configuration/optimization/
- https://www.anyscale.com/blog/continuous-batching-llm-inference
- https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching
- https://platform.openai.com/docs/guides/prompt-caching
- https://openai.com/index/api-prompt-caching/
- https://github.com/traceloop/openllmetry
- https://arize-ai.github.io/openinference/spec/semantic_conventions.html
- https://langfuse.com/docs/opentelemetry/get-started
- https://developer.nvidia.com/blog/an-introduction-to-speculative-decoding-for-reducing-latency-in-ai-inference/
- https://blog.hathora.dev/a-deep-dive-into-llm-inference-latencies/
