
Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.

Why W3C TraceContext Breaks at Agent Boundaries

The W3C Trace Context standard solves a narrow problem: propagating trace identity across a single HTTP request boundary. Every request carries a traceparent header with format version-trace-id-parent-id-trace-flags. Downstream services read this header, create child spans under the parent span ID, and return. Simple, reliable, well-supported.
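A concrete header makes the four fields easy to read off. The trace ID below is the example value from the W3C spec (and the same one that appears later in this post); the parsing snippet is purely illustrative:

```python
# Parse a W3C traceparent header into its four dash-separated fields.
# The example value comes from the W3C Trace Context specification.
header = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"

version, trace_id, parent_id, trace_flags = header.split("-")

print(version)      # "00": the spec version
print(trace_id)     # 32 hex chars identifying the whole trace
print(parent_id)    # 16 hex chars identifying the calling span
print(trace_flags)  # "01" means the trace is sampled
```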

The assumption baked into this model is synchronous, request-scoped communication. One service calls another, gets a response, and the trace is done. Agents violate this assumption in three distinct ways.

First, agents communicate asynchronously. When an orchestrator agent enqueues work for a worker agent via a message queue, there's no HTTP request to carry the traceparent header. The worker agent starts processing a message, creates its own root span, and the causal link is severed. Your tracing backend now shows two separate traces for what was actually one logical operation.

Second, agents call agents across trust boundaries. Model Context Protocol (MCP) servers are the clearest example. When an agent invokes an MCP server to execute a tool, the MCP protocol doesn't automatically receive and propagate traceparent headers from the calling agent. Each MCP server operation appears as an isolated root span unless you manually inject the header at the invocation site.

Third, agent frameworks have opaque internal loops. Frameworks like AutoGen run tool loops internally, making LLM calls and tool invocations without exposing instrumentation hooks. From the outside, you see one top-level span for the entire agent execution. What happened inside—which LLM call took 2 seconds, which tool returned a malformed response, which retry attempt finally succeeded—is invisible.

The practical result: a query to a multi-agent system that should produce one connected trace instead produces three to ten orphaned root spans in Jaeger or Zipkin, with no way to join them back together in the dashboard.

What Orphaned Spans Actually Look Like

The failure has a diagnostic signature. In your tracing backend, look for:

  • Root spans with no parent that appear mid-sequence. If you see a span with parent_span_id: null that started 300ms after the beginning of your user request, something failed to propagate context at that boundary.
  • Trace ID discontinuities. The user request enters your gateway with trace ID 4bf92f3577b34da6a3ce929d0e0e4736. By the time it reaches Agent B, the trace ID has changed. This means Agent B created a new trace root instead of continuing the original.
  • Tool call spans as roots. A tool invocation span should be a child of an LLM span, which is a child of an agent span. When tool call spans appear as roots, the framework's internal context wasn't propagated to the tool layer.
  • Timing gaps with no explaining spans. The orchestrator span ends at T+500ms and the worker span starts at T+550ms, with nothing between them. Those 50ms are a message queue transit time that's now invisible to you.

Each of these failure modes requires a different fix. Knowing which one you have is the first step toward reconnecting your trace.

The Core Fix: Explicit Context Extraction and Injection

The standard pattern for reconnecting orphaned spans is extracting the current trace context before crossing any async or service boundary, serializing it into your message or request payload, and reattaching it on the other side.

In Python with OpenTelemetry, this looks structurally like this: capture the active span context before you enqueue a task, serialize it to a carrier dictionary using the propagator, store that dictionary alongside your message payload, and on the consumer side, extract the context from the carrier before creating any new spans.

This pattern works for message queues, job queues, and any other async handoff where HTTP headers aren't automatically available. The key insight is that traceparent is just a string. It can travel through any medium—message body, database row, Redis key—as long as you put it there on the sender side and read it on the receiver side.

For MCP servers specifically, the fix is injecting traceparent and tracestate into whatever header or metadata mechanism the MCP transport supports. The Red Hat implementation pattern uses a decorator that wraps each MCP server invocation to inject the current span context before the call fires.

Baggage: Propagating Business Context Without Modifying Every Function

W3C Baggage is the underused sibling of TraceContext. While TraceContext propagates trace identity, Baggage propagates arbitrary key-value pairs that follow automatically across all spans within a trace, without you passing them through function parameters.

The practical use case: you want every span—LLM call, tool invocation, vector database query—to carry the user ID and session ID that initiated the original request. Without Baggage, you'd have to thread those values through every function call. With Baggage, you set them once at the request boundary, the values travel with the context through every downstream call, and a baggage span processor can stamp them onto each span automatically.

This matters for multi-agent systems because the orchestrator knows who the user is, but the specialist agents don't—and they shouldn't need to. Setting session ID in Baggage at the entry point means your observability backend can filter all spans for a given user session without any agent needing explicit awareness of the correlation requirement.

One caveat: Baggage values travel in HTTP headers alongside traceparent, which means they're visible to every downstream service. Don't put sensitive data in Baggage. Use it for correlation identifiers, not content.

Async Context Loss in Python: The Specific Failure

Python's asyncio uses context variables (the contextvars module) to store OpenTelemetry's active span. When you create an asyncio task with asyncio.create_task(), the task inherits the current context automatically in Python 3.7+. This sounds like it should work—and it does, in straightforward cases.

It fails when:

  • You create tasks from a background thread that has no active span context
  • You use a thread pool executor and the worker thread doesn't inherit the event loop's context
  • You create tasks inside a callback registered from a different context
  • You're using an asyncio event loop created independently of the main application context

The diagnostic: tasks created in these edge cases produce spans with no parent, appearing as root spans in your tracing backend even though they're logically children of the operation that created them.

The fix pattern is explicit context capture before the task creation point, then explicit context attachment inside the task using token = context.attach(captured_context) followed by context.detach(token) in a finally block. This is verbose but reliable.

For teams using frameworks that abstract asyncio (FastAPI, aiohttp, Celery with async workers), check whether the framework's OTel integration handles this correctly. Most modern versions do, but it's worth verifying by creating a test trace with a deliberate async boundary and checking whether the child span is correctly nested in your backend.

The OpenTelemetry GenAI Conventions Gap

OpenTelemetry's GenAI Semantic Conventions define standard attribute names for LLM spans: gen_ai.request.model, gen_ai.usage.input_tokens, gen_ai.usage.output_tokens, and similar. Datadog adopted these conventions in 2025, signaling that the spec is gaining industry traction.

The gap is in multi-agent hierarchies. The spec defines spans for individual LLM calls and tool invocations, but doesn't yet standardize attributes for orchestrator-to-worker relationships. There's no standard gen_ai.agent.role (orchestrator vs. specialist), no standard way to represent agent delegation depth, no standard cost attribution across the full agent DAG.

This means every team builds custom spans for the orchestration layer. The LLM call spans are interoperable across tools. The agent coordination spans are proprietary, which is exactly the layer you need most when debugging a production failure in a multi-agent pipeline.

The OTel GenAI SIG (launched April 2024) is actively working on multi-agent span semantics. The spec is experimental heading into 2026. If you're building observability infrastructure now, implement the existing GenAI conventions for LLM calls, and add your own structured attributes for agent coordination spans in a way that won't conflict with the eventual standard. Namespacing your custom attributes under app.agent.* rather than gen_ai.* keeps them separable.

Comparing Instrumentation Approaches

The tooling landscape has split into three architectural approaches, each with different tradeoffs at agent service boundaries.

SDK-based instrumentation (Langfuse, LangSmith, Datadog with GenAI support) requires adding the SDK to each agent service and calling instrumentation methods explicitly. This gives you the richest data—you can capture intermediate reasoning steps, confidence scores, custom business metrics—but it means every new agent service needs to be instrumented, and context propagation across boundaries is still your responsibility.

Proxy-based instrumentation (Helicone, Portkey) intercepts at the LLM API boundary. You change your LLM client's base URL to point to the proxy, and the proxy records inputs and outputs automatically. Zero code changes, 50-80ms added latency, and you get visibility into LLM calls. What you don't get: visibility into internal agent logic, tool call spans, inter-agent communication, or anything that happens between LLM calls. For multi-agent systems where the interesting failure modes live in the orchestration layer, proxy-based observability captures the wrong layer.

OpenTelemetry-first tools (Phoenix by Arize, OpenLLMetry by Traceloop, SigNoz with GenAI support) emit standard OTel format and export to any compatible backend. The key advantage: your agent services emit OTel spans, and those spans flow into Jaeger, Tempo, Datadog, Honeycomb, or any other OTel-compatible backend without vendor lock-in. Context propagation still needs to be implemented correctly, but at least you're propagating W3C TraceContext—the standard that every OTel-compatible backend understands.

For multi-agent systems, OTel-first is the lowest-risk choice. You get vendor flexibility, the instrumentation layer is standard, and when the GenAI semantic conventions stabilize you can adopt them incrementally without changing backends.

Sensitive Content in Observability Backends

Multi-agent systems often process user content that shouldn't travel to observability backends: PII, financial data, health information, confidential business documents. This creates a tension with the observability goal of capturing prompts and completions to debug failures.

The pattern that resolves this: store sensitive content in span events, not span attributes. Events are structured log entries attached to a span, filterable independently of the span itself. Your collection pipeline can strip events matching a sensitive: true tag before forwarding to the backend, preserving the span structure and timing while discarding the sensitive payload.

This only works if your collection pipeline actually implements the filtering. Define the filter rule at the collection stage, not at the agent, so the policy is enforced uniformly regardless of which agent service emits the data.

A Practical Starting Point

If you're instrumenting a multi-agent system from scratch, the highest-ROI sequence is:

Start with OpenTelemetry auto-instrumentation for your LLM calls. This gives you visibility into the model layer with minimal code. Add explicit context extraction and injection at every message queue boundary. These are the places where orphaned spans are most likely to appear and hardest to debug otherwise. Instrument agent-to-agent HTTP calls with standard HTTP client instrumentation, which handles TraceContext propagation automatically. Add custom spans for your orchestration logic—task decomposition, agent selection, result aggregation—using your own attribute namespace. Finally, instrument MCP server calls explicitly with header injection.

Don't instrument everything at once. The places where context silently drops are the async and service boundaries. Get those right first, and the rest of the trace will connect naturally.

The goal isn't perfect observability—it's having enough causal context that when a multi-agent workflow produces a wrong answer, you can determine within minutes whether the failure was in retrieval, in a specific LLM call, in agent handoff logic, or in a tool that returned malformed data. A connected trace that answers that question is worth far more than a polished dashboard that shows per-service metrics without showing how those services relate.
