The Trace That Stops at the Provider Boundary

June 1, 2026 · 11 min read

Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive stretch of your pipeline, whether you measure it in dollars, in p95, or in user-visible lag, lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.

This post is about what that black box actually contains, what providers leak through the cracks that most clients ignore, and how to synthesize the visibility your trace was supposed to give you in the first place.

The Anatomy of the One-Span LLM Call

The default integration with any major LLM SDK gives you exactly one span: start time when you call the API, end time when the response completes. That span has attributes — model name, token counts, maybe a stop reason — but no internal events. From your trace's perspective, the inside of the call is undefined behavior.

The inside is not undefined. It decomposes into at least four distinct phases, each with its own failure mode and its own dynamics:

Network egress and TLS: the time from your process to the provider's edge. Regional, but not zero. A 50ms TLS handshake on a cold connection is a different problem from 50ms of model time.
Provider-side queue: the time your request sits in line behind other tenants on the same shared inference fleet. This depends on traffic you cannot see, not on anything your client did.
Prefill: the model processing your input tokens to produce internal state for generation. Compute-bound, roughly linear in prompt length, sensitive to whether the provider's KV cache absorbed any of your input.
Decode: generating output tokens one at a time. Memory-bandwidth-bound, roughly linear in output length, sensitive to batch size on the provider's side.

A 3-second llm.call that was 100ms network, 200ms queue, 600ms prefill, and 2.1s decode is a completely different problem from a 3-second call that was 100ms network, 2.4s queue, 100ms prefill, and 400ms decode. The first is a long-output problem. The second is a noisy-neighbor problem. Your trace, by default, tells you neither.

What the Provider Actually Hands You

The reason this is fixable is that providers leak more information than the typical SDK integration surfaces. The signals are there. They are mostly thrown away.

Streaming gives you per-token timestamps for free. Both Anthropic and OpenAI ship streaming responses as Server-Sent Events. If your client is reading the stream, you already know when each chunk arrived. The wall-clock time from request-send to the first content_block_delta is your true TTFT. The deltas between successive chunks are your inter-token latency series (approximately, since a single chunk can carry more than one token). Most integrations concatenate the chunks into a final response and discard the timing. Don't.
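As a minimal sketch of keeping that timing, the wrapper below records an arrival time for every chunk as it passes through. It assumes the stream has already been reduced to an iterable of text chunks; StreamTiming and timed are hypothetical names, not part of any SDK.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List, Optional


@dataclass
class StreamTiming:
    """Arrival times for one streamed response, in seconds on a monotonic clock."""
    request_sent: float
    chunk_arrivals: List[float] = field(default_factory=list)

    @property
    def ttft(self) -> Optional[float]:
        """Seconds from request-send to the first chunk, or None if none arrived."""
        if not self.chunk_arrivals:
            return None
        return self.chunk_arrivals[0] - self.request_sent

    @property
    def inter_chunk_gaps(self) -> List[float]:
        """Seconds between successive chunks (the inter-token latency series)."""
        arrivals = self.chunk_arrivals
        return [later - earlier for earlier, later in zip(arrivals, arrivals[1:])]


def timed(chunks: Iterable[str], timing: StreamTiming,
          clock: Callable[[], float] = time.monotonic) -> Iterator[str]:
    """Yield chunks unchanged while recording when each one arrived."""
    for chunk in chunks:
        timing.chunk_arrivals.append(clock())
        yield chunk
```

Create the StreamTiming immediately before sending the request, wrap the chunk iterator with timed, and consume it as usual. The clock is injectable so the timing logic can be tested without a network call.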

Request IDs are correlation primitives. Anthropic returns a request-id header on every response. The SDKs surface it as a property on the response object. OpenAI returns x-request-id. These IDs are the only way a provider support engineer can find your specific call in their logs. A customer-facing incident where you tell support "the request was slow at 14:23 UTC" is unactionable; one where you say "request ID req_01abc... took 4.1s and we expected under 2s" gets a real answer. Most traces never record this ID. Most error reports never cite it.
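One way to make sure the ID is never lost is to read it straight from the response headers. The sketch below assumes you can get the headers as a plain mapping; the two header names are the ones mentioned above, and provider_request_id is a hypothetical helper.

```python
from typing import Mapping, Optional

# Header names taken from the text: "request-id" (Anthropic) and
# "x-request-id" (OpenAI). HTTP header names are case-insensitive.
REQUEST_ID_HEADERS = ("request-id", "x-request-id")


def provider_request_id(headers: Mapping[str, str]) -> Optional[str]:
    """Return the provider's request ID from response headers, or None if absent."""
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in REQUEST_ID_HEADERS:
        value = lowered.get(name)
        if value:
            return value
    return None
```

Recording the result on the span and in the log line for every call, not only for failures, is what lets you cite the ID later for a request that was merely slow.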

Rate-limit headers are health signals. Both providers return headers like anthropic-ratelimit-requests-remaining, anthropic-ratelimit-tokens-reset, and equivalents on the OpenAI side. Your client probably reads these only when a 429 happens. They are useful before that — a remaining-token budget that drops faster than usual is a leading indicator of a brownout you have not noticed yet.
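A rough sketch of reading those headers on every response rather than only on a 429: parse the remaining budget, then compare two observations to get a drain rate you can chart or alert on. The function names are hypothetical, and the header name is passed in because the exact set of headers varies by provider.

```python
from typing import Mapping, Optional


def remaining_budget(headers: Mapping[str, str], header_name: str) -> Optional[int]:
    """Parse a '...-remaining' rate-limit header into an int, or None if absent or malformed."""
    lowered = {name.lower(): value for name, value in headers.items()}
    raw = lowered.get(header_name.lower())
    if raw is None:
        return None
    try:
        return int(raw)
    except ValueError:
        return None


def drain_per_second(prev_remaining: int, prev_time: float,
                     cur_remaining: int, cur_time: float) -> float:
    """Budget consumed per second between two observations.

    Returns 0.0 when the budget went up (the window reset) or no time elapsed.
    """
    elapsed = cur_time - prev_time
    consumed = prev_remaining - cur_remaining
    if elapsed <= 0 or consumed <= 0:
        return 0.0
    return consumed / elapsed
```

A drain rate well above its usual level for the same traffic volume is the leading indicator described above. What counts as "usual" is something you would have to baseline from your own history.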

Cache status is in the response when caching is in play. When you use prompt caching, the response usage reports how many input tokens were served from the cache and, on some providers, how many were written to it. A cache miss when you expected a hit explains a 600ms TTFT regression more cleanly than any other signal. If your trace does not record cache hit/miss per call, you are flying blind on the largest single lever you have over prefill latency.
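To make a miss visible per call, you can reduce the usage metadata to a single number. The sketch below assumes an Anthropic-style usage object flattened to a dict, with the field names cache_read_input_tokens, cache_creation_input_tokens, and input_tokens, and it assumes input_tokens counts only the uncached portion. Check both assumptions against your provider's response before relying on the result.

```python
from typing import Mapping, Optional


def cache_read_fraction(usage: Mapping[str, Optional[int]]) -> float:
    """Fraction of all input tokens that were served from the prompt cache."""
    # Assumed field names; missing or null fields are treated as zero.
    read = usage.get("cache_read_input_tokens") or 0
    written = usage.get("cache_creation_input_tokens") or 0
    uncached = usage.get("input_tokens") or 0
    total = read + written + uncached
    return read / total if total else 0.0
```

A call you expected to hit the cache should show a fraction close to 1.0. A value of 0.0 with a large written count is the cache miss that explains a TTFT regression.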

The pattern is: providers expose health and timing signals through HTTP headers, streaming events, and response metadata, and the standard SDK return values strip them out. The work is putting them back into the trace.

A Useful Trace of an LLM Call

A trace that actually localizes the cost looks different from the one-span default. The OpenTelemetry GenAI semantic conventions, now in v1.37, define attribute names for most of what you need: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, gen_ai.response.finish_reasons. But the conventions are a shape, not a fill. The interesting work is what you put on the span.

A trace with localizable latency includes:

The span starts when you build the request, not when you call the SDK. Otherwise you cannot see prompt-construction time, which is non-trivial when retrieval results are being formatted into a multi-thousand-token block.
A gen_ai.ttft_ms attribute computed from the time between request-send and the first streamed chunk. This is the single most diagnostic number on the span.
gen_ai.inter_token_latency_ms_p50 and gen_ai.inter_token_latency_ms_p99 attributes computed from inter-chunk deltas. A p99 ITL that suddenly jumps from 40ms to 200ms while p50 stays flat is the signature of provider-side batch contention.
A gen_ai.provider_request_id attribute with the request ID header. Index it in your trace store so you can pivot from a support ticket to the exact span.
gen_ai.cache.input_tokens_read and gen_ai.cache.input_tokens_written attributes, so a cache miss is visible in the trace and not buried in the response payload.
A gen_ai.queue_indicator attribute when the provider exposes one — some gateways return a queue-depth or load hint header. Anthropic and OpenAI vary on this, but if you proxy through an inference gateway like Envoy AI Gateway, you can synthesize one from the gateway's view.
Child spans inside the call when you can synthesize them. A network.connect child span for the TLS handshake (your HTTP client can give you this through its own callbacks). A provider.ttft child span from request-send to first chunk. A provider.decode child span from first chunk to last chunk. Even when the inside of the provider remains opaque, you can decompose your view of it into the phases you actually care about.
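The attributes in the list above can be assembled from values the client already has. The sketch below builds a plain dict using the attribute names from the list; the percentile is a simple nearest-rank calculation, provider.decode_ms is a hypothetical attribute standing in for the provider.decode child span, and llm_span_attributes is an illustrative helper rather than part of any tracing library.

```python
import math
from typing import Dict, List, Optional, Union

AttrValue = Union[str, int, float]


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile; values must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def llm_span_attributes(request_sent: float, chunk_arrivals: List[float],
                        request_id: Optional[str],
                        cache_read: int, cache_written: int) -> Dict[str, AttrValue]:
    """Build span attributes from client-side timings (seconds) and response metadata."""
    attrs: Dict[str, AttrValue] = {
        "gen_ai.cache.input_tokens_read": cache_read,
        "gen_ai.cache.input_tokens_written": cache_written,
    }
    if request_id:
        attrs["gen_ai.provider_request_id"] = request_id
    if chunk_arrivals:
        # Request-send to first chunk, then first chunk to last chunk.
        attrs["gen_ai.ttft_ms"] = (chunk_arrivals[0] - request_sent) * 1000
        attrs["provider.decode_ms"] = (chunk_arrivals[-1] - chunk_arrivals[0]) * 1000
    gaps_ms = [(b - a) * 1000 for a, b in zip(chunk_arrivals, chunk_arrivals[1:])]
    if gaps_ms:
        attrs["gen_ai.inter_token_latency_ms_p50"] = percentile(gaps_ms, 50)
        attrs["gen_ai.inter_token_latency_ms_p99"] = percentile(gaps_ms, 99)
    return attrs
```

With OpenTelemetry, the resulting dict can be attached to the active span through the span's set_attributes method. Keeping the computation separate from the tracing call makes it easy to unit test.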

None of this requires the provider to expose anything they don't already. It is all client-side instrumentation on signals that are already in the stream.

The Side-Channels Every Client Should Have

Beyond per-call instrumentation, there are three patterns that give you visibility into the provider as a system, not just as the source of an individual response.

Synthetic probes. Run a fixed prompt against the provider's edge on a schedule — every 30 seconds is a good starting point. Record TTFT, total latency, and an error indicator. Three properties matter: the prompt is fixed (so the result is comparable over time), the traffic is independent of your real users (so you can distinguish "the provider got slow" from "our users sent harder prompts"), and the regions match your production traffic (so a regional brownout is visible). When a customer reports slowness, the first thing you check is whether your synthetic line for that region also got slow. If yes, escalate to the provider with the synthetic data. If no, the problem is on your side.
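A single probe iteration might look like the sketch below. It assumes a caller-supplied send_fixed_prompt function that sends the fixed prompt and returns an iterable of streamed chunks; that function, ProbeResult, and run_probe are all hypothetical names. Scheduling (for example every 30 seconds, once per region) is left to whatever job runner you already use.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class ProbeResult:
    region: str
    ttft_s: Optional[float]   # None if no chunk arrived
    total_s: float
    error: Optional[str]      # exception class name, or None on success


def run_probe(region: str, send_fixed_prompt: Callable[[], Iterable[str]],
              clock: Callable[[], float] = time.monotonic) -> ProbeResult:
    """Send the fixed prompt once and record TTFT, total latency, and any error."""
    start = clock()
    ttft = None
    error = None
    try:
        for _chunk in send_fixed_prompt():
            if ttft is None:
                ttft = clock() - start
    except Exception as exc:  # a probe records failures instead of raising them
        error = type(exc).__name__
    return ProbeResult(region, ttft, clock() - start, error)
```

Because the prompt never changes, the series of ProbeResult values per region is directly comparable over time, which is what lets you separate provider slowness from a shift in your users' prompts.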

Fan-out shadowing on a sample of requests. For some small fraction of real requests — 1% is plenty — send the same prompt to a second provider in parallel and record the latency of both, discarding the second response. This is not load balancing and it is not failover; it is a comparator. When your primary provider's TTFT is 1.5s and the shadow's is 400ms on the same prompt, you have evidence that is not "compare to historical averages" — it is a live A/B against another vendor on the same input at the same moment. Use this to escalate; do not let it become a feature.
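The sampling decision can be made deterministic so that the same request is always either shadowed or not, which keeps retries and replays consistent. The sketch below hashes a request key into a uniform bucket; should_shadow and ttft_gap_ms are hypothetical helpers, and the key is assumed to be any stable per-request string such as your own trace ID.

```python
import hashlib


def should_shadow(request_key: str, sample_rate: float = 0.01) -> bool:
    """Deterministically select roughly sample_rate of requests for shadowing."""
    digest = hashlib.sha256(request_key.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def ttft_gap_ms(primary_ttft_s: float, shadow_ttft_s: float) -> float:
    """Positive when the primary was slower than the shadow on the same prompt."""
    return (primary_ttft_s - shadow_ttft_s) * 1000
```

For a selected request, send the shadow call in parallel, record both TTFTs on the span, and discard the shadow response. The gap is the evidence you bring to an escalation.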

Provider request IDs in every customer-facing error. When an LLM call fails or times out, the error your user sees, the log line in your structured logs, and the trace span should all carry the provider's request ID. This is the cheapest piece of instrumentation in this entire post and the one most commonly missed. The cost of getting it wrong is that every provider-side investigation starts with you trying to reverse-engineer which of the millions of requests was the one your customer was complaining about.
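One way to keep the three places in sync is to carry the ID on the exception itself. The class below is a hypothetical sketch, not an SDK type: the same object produces the user-facing message and the structured log fields, and the same request_id value can be set on the span.

```python
from typing import Dict, Optional


class LLMCallError(Exception):
    """An LLM call failure that carries the provider's request ID with it."""

    def __init__(self, message: str, provider: str, request_id: Optional[str]):
        super().__init__(message)
        self.provider = provider
        self.request_id = request_id

    def user_message(self) -> str:
        """Text safe to show a customer, including a reference they can quote."""
        reference = self.request_id or "unavailable"
        return f"The assistant could not complete this request. Reference: {reference}"

    def log_fields(self) -> Dict[str, Optional[str]]:
        """Fields for a structured log line."""
        return {
            "error": str(self),
            "provider": self.provider,
            "provider_request_id": self.request_id,
        }
```

When the failure is a client-side timeout and no response headers ever arrived, the ID is genuinely unavailable. Logging that fact explicitly is still better than omitting the field.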

The Eval Discipline: Trace Completeness as a Metric

The instrumentation work is only durable if you measure whether it is decaying. The metric that catches this is trace completeness — the fraction of total request latency that is attributable to a named cause, versus the fraction that is unaccounted.

Compute it per-trace: sum the durations of the leaf spans you have meaningful breakdowns for, divide by the total trace duration, and call the remainder "unaccounted." For a healthy trace, unaccounted should be a small single-digit percentage — handshake overhead, scheduling jitter, instrumentation lag. When it climbs to 30% on a tenant or a route, something is happening you do not have a span for, and the right response is to add the span before the next incident.
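A sketch of that computation follows, with one deliberate deviation from the description above: instead of summing leaf-span durations directly, it takes the union of their time intervals, so that leaf spans running in parallel are not counted twice. Spans are assumed to be reduced to (start, end) pairs in seconds; the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start, end) in seconds


def covered_duration(intervals: List[Interval]) -> float:
    """Total time covered by the union of the intervals (overlaps counted once)."""
    total = 0.0
    current_start = current_end = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
        else:
            current_end = max(current_end, end)
    if current_end is not None:
        total += current_end - current_start
    return total


def trace_completeness(trace: Interval, leaf_spans: List[Interval]) -> float:
    """Fraction of the trace duration covered by at least one leaf span."""
    trace_start, trace_end = trace
    duration = trace_end - trace_start
    if duration <= 0:
        return 0.0
    # Clip spans to the trace window and drop any that end up empty.
    clipped = [(max(s, trace_start), min(e, trace_end)) for s, e in leaf_spans]
    clipped = [(s, e) for s, e in clipped if e > s]
    return covered_duration(clipped) / duration
```

For a 4-second trace whose leaf spans together cover the first 3 seconds, the result is 0.75, so 25% of the request is unaccounted for. Charting one minus this value per route or tenant gives the unaccounted share.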

This metric has the property that it gets worse when your system grows new behavior — a newly added retry loop, a newly added fallback path, a new tool the agent calls — and your instrumentation hasn't caught up. It is a leading indicator of observability rot, and it survives the model upgrades, vendor changes, and code rewrites that would otherwise let your trace quality silently regress.

A complementary practice: pick a "trace of the week" (one real production trace, preferably from a slow request) and read it span by span. If you cannot tell a coherent story about where the time went, the trace failed, regardless of whether the request succeeded.

The Architectural Realization

The shape of this problem is structural, not incidental. The most consequential latency in an AI system lives inside a multi-tenant inference fleet you do not operate, behind an API surface that returns a final string and a token count. The vendor contract gives you the answer; it does not give you the process that produced the answer. The default SDK integration is built around the contract, not around the operational reality, and so it surfaces the answer and discards the process.

The fix is not waiting for vendors to give you more. The fix is treating every byte the wire already gives you — every streamed chunk, every header, every response metadata field — as telemetry, not as plumbing. Per-token timestamps are free; you have to choose to record them. Request IDs are free; you have to choose to propagate them. Cache hit rates are free; you have to choose to surface them as span attributes instead of fields on a response object the SDK throws away.

A team that does this synthesizes visibility into a system they do not own. A team that does not has signed up to be on call for a black box, and the next incident will be the one where the customer asks why and the trace cannot answer.

About Tian Pan
