
The traceparent header your gateway dropped between LLM call and tool execution

Tian Pan
Software Engineer

A user reports that the agent answered correctly but the database update never happened. You open your observability tool, search for the trace ID stamped on the user-facing conversation, and find a clean tree — five LLM calls, four tool decisions, a final response. No errors. Then you search for the tool service that owns the database write, and you find another trace, with the same wall-clock window but a different trace ID, a different root span, and no link back. You search the gateway logs. Three more orphan traces. The agent run that looked like a single coherent interaction in the chat UI fragmented, in your tracing backend, into a forest.

The header that should have stitched it together is traceparent. It is a 55-byte W3C-standard string that every span in a distributed system uses to identify its parent. It is also, in most production LLM agent stacks, dropped at least once between the user's request and the side effect the user actually wanted.
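To make the 55 bytes concrete, here is a minimal sketch of a parser for the version-00 layout: a 2-character version, a 32-character trace ID, a 16-character parent span ID, and 2 characters of flags, all lowercase hex and joined by hyphens. The names parse_traceparent and TraceParent are illustrative, not part of any library, and the sketch deliberately ignores the spec's rules for future versions.

```python
import re
from typing import NamedTuple, Optional

# Version-00 layout: version(2)-trace_id(32)-parent_id(16)-flags(2), lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})"
)


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    flags: str


def parse_traceparent(value: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed (version-00 layout only)."""
    match = _TRACEPARENT_RE.fullmatch(value.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    # The spec forbids version "ff" and all-zero trace or parent IDs.
    if parsed.version == "ff":
        return None
    if set(parsed.trace_id) == {"0"} or set(parsed.parent_id) == {"0"}:
        return None
    return parsed


header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(len(header), parse_traceparent(header))
```

A service that receives this header continues the trace by keeping trace_id and using parent_id as the parent of its own first span. A service that never reads it starts a new trace.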

Most teams discover this the same way: an incident where the agent's behavior is hard to reconstruct because the trace tree they expected to read top-to-bottom is actually four trees they cannot correlate. The fix is not subtle, but the failure mode is structural: it shows up the same way across LiteLLM, Bifrost, custom Node gateways, the OpenAI Responses API, and any tool service that does not explicitly extract the inbound trace context.

Where the chain breaks

The W3C Trace Context spec defines two headers — traceparent and tracestate — that any HTTP-aware service is supposed to forward to participate in a trace. In a synchronous web app, the chain is short and most instrumentation libraries handle it automatically. In an agent loop, the chain is longer than it looks, and most of the hops are not synchronous web calls in the way the spec assumes.

A typical agent turn touches at least four boundaries. The orchestrator sends a chat completion request to the AI gateway. The gateway forwards to the model provider. The model returns tool-call instructions. The orchestrator dispatches those instructions to one or more tool services. Each tool service may itself call other internal services. The response from the tool service goes back to the orchestrator, which feeds it to the next model call. Now do that three or four times in a single user turn.

For the trace to remain coherent across that loop, the traceparent header needs to survive four hops: orchestrator→gateway, gateway→provider, orchestrator→tool, and tool→downstream-service. In practice the header is most often lost at the second and third hops.

The gateway boundary leaks because most AI gateways treat the outbound provider request as a fresh connection, not as a continuation of the inbound trace. Bifrost, for example, extracts inbound W3C context and links its own internal spans correctly — and then issues outbound calls to OpenAI or Anthropic without injecting that context into the new request. LiteLLM by default operates on an allowlist that strips most client headers for security, and traceparent is not on the default forward list. Portkey supports forwarding it, but only when you explicitly configure header forwarding for the route. Custom Node and Python gateways written in-house almost never propagate it because the developer who wrote the forwarding logic thought of Authorization and Content-Type and stopped there.
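For an in-house gateway, the missing step can be sketched in a few lines. The example below assumes a hypothetical allowlist (FORWARDED_HEADERS) and hypothetical helper names. It keeps the inbound trace ID and flags, and swaps in the gateway's own span ID as the parent for the outbound call. A gateway instrumented with OpenTelemetry would normally let the SDK's propagator do this rather than building the string by hand.

```python
import secrets
from typing import Dict, Mapping

# Hypothetical allowlist for an in-house gateway; the entries are illustrative.
FORWARDED_HEADERS = {"authorization", "content-type", "tracestate"}


def new_span_id() -> str:
    """8 random bytes as 16 lowercase hex characters."""
    return secrets.token_hex(8)


def build_outbound_headers(
    inbound: Mapping[str, str], gateway_span_id: str
) -> Dict[str, str]:
    """Forward allowlisted headers and continue the inbound trace, if there is one."""
    lowered = {name.lower(): value for name, value in inbound.items()}
    outbound = {k: v for k, v in lowered.items() if k in FORWARDED_HEADERS}

    inbound_parent = lowered.get("traceparent")
    if inbound_parent is not None:
        parts = inbound_parent.split("-")
        if len(parts) == 4:
            version, trace_id, _caller_span_id, flags = parts
            # Same trace ID, but the gateway's span becomes the parent downstream.
            outbound["traceparent"] = (
                f"{version}-{trace_id}-{gateway_span_id}-{flags}"
            )
    return outbound


inbound = {
    "Authorization": "Bearer example",
    "traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
    "X-Internal-Debug": "1",
}
print(build_outbound_headers(inbound, new_span_id()))
```

The design point is that traceparent is rewritten rather than copied verbatim: the trace ID stays fixed end to end, while the parent ID changes at every hop that creates a span.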

The tool dispatch boundary leaks for a different reason. Tool calls are often dispatched through a message queue, a serverless invocation, or a worker pool. HTTP context propagation is mostly automatic in OpenTelemetry SDKs, but queue propagation is not — every producer has to inject the context into message headers, and every consumer has to extract it before starting its span. Forget either side and the consumer's first span becomes a fresh root.
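The inject-and-extract pair can be illustrated without any tracing library. The sketch below assumes a JSON message envelope with a headers field. That envelope shape and the function names make_message and read_message are assumptions for illustration. With the OpenTelemetry Python API, the same two steps would typically go through the propagation API's inject and extract calls on a carrier dict.

```python
import json
from typing import Any, Dict, Optional, Tuple


def make_message(payload: Dict[str, Any], traceparent: Optional[str]) -> str:
    """Producer side: carry the trace context in message headers, next to the payload."""
    headers: Dict[str, str] = {}
    if traceparent:
        headers["traceparent"] = traceparent
    return json.dumps({"headers": headers, "payload": payload})


def read_message(raw: str) -> Tuple[Dict[str, Any], Optional[str]]:
    """Consumer side: pull the context out before starting the tool's span."""
    envelope = json.loads(raw)
    headers = envelope.get("headers") or {}
    return envelope["payload"], headers.get("traceparent")


raw = make_message(
    {"tool": "update_row", "args": {"id": 7}},
    "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01",
)
payload, parent = read_message(raw)
# If parent is None here, the consumer's first span will start a fresh trace.
print(payload["tool"], parent)
```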

Why the response leg makes it worse

There is a second, subtler failure mode that compounds the first: the response leg. The W3C spec is about request-side propagation. Most server frameworks do not echo traceparent back in responses — the spec does not require them to, and the OpenTelemetry instrumentation libraries do not rely on the response carrying it.

This becomes a problem when an LLM gateway buffers a request, forwards it to the model provider, and the response comes back with a different identity than the orchestrator expected. The model provider's request-id (Anthropic) or x-request-id (OpenAI) is recorded in the response, but there is no trace ID that lets you join the orchestrator's view of the request with the provider's view of the same request. Anthropic and OpenAI do not participate in your trace.

For most teams that is fine — the provider is a black box — but it changes how you have to think about your trace's structure. Your trace should treat the provider call as a leaf span that your gateway closes when the response arrives. If the gateway is producing a parent span at all, it must do so explicitly and finish it explicitly. If the gateway is forwarding a streaming response, the parent span needs to remain open until the stream finishes, not until the initial HTTP exchange completes. Streaming agent responses where the parent span closes on 200 OK and the actual tokens stream for thirty more seconds will produce traces where every tool call inside those thirty seconds appears as a sibling of, not a child of, the chat completion that triggered them.
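The streaming case comes down to where the span is ended. A minimal sketch, using a stand-in Span class rather than a real tracing SDK, wraps the chunk iterator so the span closes only when the stream is exhausted, fails, or is abandoned by the client:

```python
import time
from typing import Iterable, Iterator, Optional


class Span:
    """Tiny stand-in for a tracing span; not an OpenTelemetry class."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.start_time = time.monotonic()
        self.end_time: Optional[float] = None
        self.chunks = 0

    def end(self) -> None:
        if self.end_time is None:  # ending twice is a no-op
            self.end_time = time.monotonic()


def stream_with_span(chunks: Iterable[str], span: Span) -> Iterator[str]:
    """Yield chunks through unchanged; end the span only when streaming stops."""
    try:
        for chunk in chunks:
            span.chunks += 1
            yield chunk
    finally:
        # Runs on normal exhaustion, on an upstream error, and when the
        # consumer closes the generator early (for example a client disconnect).
        span.end()


span = Span("chat_completion")
stream = stream_with_span(["Hel", "lo", "!"], span)
first = next(stream)
print(first, span.end_time is None)  # the span is still open mid-stream
rest = list(stream)
print(rest, span.end_time is not None)
```

The alternative, ending the span in the handler that returns the 200 OK, is what produces the short parent span described above.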

Async tool calls and the queue boundary

The OpenTelemetry GenAI semantic conventions, which have taken shape over the past year, define a clean span vocabulary: create_agent, invoke_agent, invoke_workflow, execute_tool. Each execute_tool span is supposed to carry gen_ai.tool.call.arguments, gen_ai.tool.call.result, duration, retry count, and an error state. Span kind is INTERNAL. The contract is clear.

What the spec cannot enforce is parent-child linkage when the tool runs in a different process. If your orchestrator dispatches execute_tool via SQS or a Kafka topic, the orchestrator has to inject the context into the queue message, and the consumer worker that actually runs the tool has to extract it on the other side. Both ends of that handoff are application code. The OpenTelemetry SDK can help you do it, but it cannot do it for you, because the queue API surface is too varied for blanket auto-instrumentation.

The same applies to thread pools, asyncio tasks, and Lambda invocations. Each of them is a context boundary that the OTel SDK has hooks for, but those hooks have to be wired. If your tool service runs in a worker process spawned by a Celery task launched by a webhook, you have three hops, none of which propagate trace context unless someone wrote the propagator. The result is the classic agent observability anti-pattern: every async tool call starts its span with no usable parent context, so the span becomes an orphan root in your tracing backend, often under a brand-new trace ID.
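The thread pool case is easy to reproduce with the standard library alone. The OpenTelemetry Python SDK keeps its current context in contextvars by default, and on standard CPython builds a ThreadPoolExecutor worker does not see the submitting thread's context variables. The sketch below uses a hypothetical current_traceparent variable as a stand-in for the SDK's context and shows one way to carry it across: copy the context at submit time and run the tool inside the copy.

```python
import contextvars
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, Optional

# Stand-in for the tracing SDK's "current context"; the name is illustrative.
current_traceparent: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "current_traceparent", default=None
)


def run_tool() -> Optional[str]:
    """Pretend tool: reports which trace context it can see."""
    return current_traceparent.get()


def submit_with_context(
    pool: ThreadPoolExecutor, fn: Callable[..., Any], *args: Any
) -> Future:
    """Snapshot the caller's context and run fn inside that snapshot on the worker."""
    ctx = contextvars.copy_context()
    return pool.submit(ctx.run, fn, *args)


current_traceparent.set("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
with ThreadPoolExecutor(max_workers=1) as pool:
    # A plain pool.submit(run_tool) would typically see the default (None) here.
    print(submit_with_context(pool, run_tool).result())
```

asyncio tasks copy the current context when they are created, so they usually inherit it. The pattern above matters for executors, hand-rolled worker threads, and anything that crosses a process boundary, where the context has to be serialized instead.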

A useful debugging query against your trace store is: count distinct root spans per agent.session.id over a five-minute window. A healthy agent run should produce one root span per turn. If you are seeing four or eight root spans per turn, context was lost at least three times in that turn, whether at three different boundaries or at one boundary that several calls crossed, and the agent's behavior cannot be reconstructed from your traces alone.
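The query itself depends on your backend, but its logic is small enough to sketch over exported spans. The field names below (session_id, span_id, parent_span_id) are assumptions about the export format, and filtering to the time window is left out for brevity.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Optional, Set


def root_spans_per_session(
    spans: Iterable[Mapping[str, Optional[str]]]
) -> Dict[str, int]:
    """Count distinct spans with no parent, grouped by session."""
    roots: Dict[str, Set[str]] = defaultdict(set)
    for span in spans:
        if span.get("parent_span_id") is None:
            roots[span["session_id"]].add(span["span_id"])
    return {session: len(ids) for session, ids in roots.items()}


exported = [
    {"session_id": "s1", "span_id": "a1", "parent_span_id": None},   # the turn's root
    {"session_id": "s1", "span_id": "a2", "parent_span_id": "a1"},   # linked LLM call
    {"session_id": "s1", "span_id": "b1", "parent_span_id": None},   # orphaned tool span
    {"session_id": "s2", "span_id": "c1", "parent_span_id": None},
]
print(root_spans_per_session(exported))
```

Any session whose count exceeds its number of turns is a candidate for a dropped header somewhere along the path.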
