Multimodal Traces: When Modalities Must Share an ID

· 11 min read
Tian Pan
Software Engineer

A user calls your support agent. They talk, the agent listens, the user uploads a screenshot of the error mid-call, the agent reasons over the image and the transcript, and the conversation ends with a follow-up email summarizing the fix. Three days later the user files a complaint: the fix did not work, and the email never arrived. You open your observability stack and find three separate traces in three separate UIs. The voice pipeline shows an ASR trace. The vision pipeline shows a span over the image upload. The LLM call shows a chat trace with a token count and a tool call. Nothing in any of these dashboards tells you they were the same conversation.

This is the postmortem nobody wants to write. Not because the data is missing — every individual modality logged what it was supposed to — but because the join across modalities was never built. Each pipeline grew its own tracing convention from whatever its model vendor shipped by default, and the conversational turn that bound them together exists only in the head of the engineer who designed the agent.

Multimodal is a tracing problem before it is a modeling problem. We have spent two years arguing about which model can simultaneously reason over text and pixels and waveforms, and almost no time arguing about how to debug what happens when one does. The result is a generation of production agents that work beautifully on the happy path and turn into archeology projects the first time anyone needs to reconstruct what actually happened.

The Span Tree That Forgot Its Root

Distributed tracing was built on a clean mental model. A user clicks something. A root span opens. Every downstream call becomes a child span. The parent-child relationships capture the execution hierarchy, and three months later you can pull up the trace and see the whole story.

Multimodal pipelines violate this model in a way most teams have not yet noticed. The voice subsystem opens its own root span at the moment audio starts streaming. The vision subsystem opens its own root span when an image arrives over a separate channel. The LLM call opens its own root span when the orchestrator finally fires a chat completion. Each of these subsystems was instrumented by a different team using whatever defaults its SDK shipped with, and none of them was told that they were all children of a single conversational turn that has no representation anywhere.

You end up with three trace trees that should have been three branches of one tree. The voice trace has rich timing data about ASR latency and turn-taking. The vision trace has the binary image artifact and the encoding pipeline. The LLM trace has the prompt, the response, and the tool calls. The conversation that united them is not a span. It is not even a header. It is a fact that exists only in your code, and nothing in your telemetry pipeline knows that fact is supposed to flow downstream.

The W3C traceparent header was designed for exactly this kind of stitching, but it is specified as an HTTP header. When your modality boundaries are not HTTP boundaries (audio streams over a WebRTC pipe, images arrive over a separate upload channel, and the LLM call is the only step that even has an HTTP request to attach a header to), the standard propagation mechanism quietly fails. The fix is not a new technology. It is a discipline: recognizing that every modality boundary in your pipeline is a place where context can be dropped, and instrumenting each crossing as if it were a network call.
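One way to treat a non-HTTP crossing like a network call is to carry the same traceparent string inside whatever message envelope the channel already has. The sketch below is a minimal illustration, not a vendor API: the `version-traceid-parentid-flags` layout follows the W3C Trace Context format, but `TraceContext`, `wrap_for_channel`, and `unwrap_from_channel` are hypothetical names, and the dict envelope stands in for whatever your data channel or upload metadata actually looks like.

```python
import re
import secrets
from dataclasses import dataclass

# W3C Trace Context, version 00: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass(frozen=True)
class TraceContext:
    trace_id: str   # 32 hex chars, shared by every span in the trace
    span_id: str    # 16 hex chars, the span that is handing work off
    sampled: bool = True

    def to_traceparent(self) -> str:
        flags = "01" if self.sampled else "00"
        return f"00-{self.trace_id}-{self.span_id}-{flags}"

    @staticmethod
    def from_traceparent(value: str) -> "TraceContext":
        match = _TRACEPARENT.fullmatch(value)
        if match is None:
            raise ValueError(f"malformed traceparent: {value!r}")
        trace_id, span_id, flags = match.groups()
        # All-zero IDs are invalid in the Trace Context format.
        if set(trace_id) == {"0"} or set(span_id) == {"0"}:
            raise ValueError("all-zero trace or span id")
        return TraceContext(trace_id, span_id, sampled=bool(int(flags, 16) & 1))

    def child(self) -> "TraceContext":
        """Same trace, fresh span id for the receiving side."""
        return TraceContext(self.trace_id, secrets.token_hex(8), self.sampled)


def new_root() -> TraceContext:
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8))


def wrap_for_channel(payload: bytes, ctx: TraceContext) -> dict:
    """Hypothetical envelope for a non-HTTP hop (data-channel message, upload metadata)."""
    return {"traceparent": ctx.to_traceparent(), "payload": payload}


def unwrap_from_channel(message: dict) -> tuple[bytes, TraceContext]:
    """Receiver side: recover the parent context and open a child under it."""
    parent = TraceContext.from_traceparent(message["traceparent"])
    return message["payload"], parent.child()


root = new_root()
message = wrap_for_channel(b"audio-frame", root)
payload, ctx = unwrap_from_channel(message)
print(ctx.trace_id == root.trace_id)  # the receiver joined the sender's trace
```

The receiving side never opens a root of its own: it parses the parent context out of the envelope and creates a child under it, which is the behavior an HTTP propagator would give you for free.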

The Conversational Turn Is the Real Unit of Work

The hop that matters in a multimodal agent is not the span. It is the turn. A turn is the smallest unit of interaction that has independent meaning to the user: user speaks, agent listens, agent thinks, agent responds. Inside that turn, the agent may transcribe audio, retrieve a document, look at an image, call a tool, and emit a synthesized response. The user does not perceive these as separate events. To the user, the turn either worked or it did not.

Voice observability platforms have started treating the turn as a first-class primitive — logging audio input, transcription hypotheses, prompt execution, tool calls, and synthesized speech as a single correlated record. This is the right abstraction, and it generalizes beyond voice. A multimodal turn is whatever the user did and whatever the agent did in response, threaded under one ID that survives across every modality the turn touched.

The discipline is to mint that ID at the user-action boundary — the moment the turn begins, before any subsystem has a chance to open a span of its own — and to thread it through every modality's intermediate representation as opaque metadata. Audio frames get the turn ID. Image uploads get the turn ID. Tool calls get the turn ID. Cache writes get the turn ID. When the LLM call finally happens, it inherits the turn ID. When the trace UI loads, every span tagged with that ID lines up on the same timeline, in the order the user actually experienced them.
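The minting-and-threading discipline can be sketched in a few lines. This is an illustration under assumptions, not a real tracing SDK: `SpanRecord`, `mint_turn_id`, and `timeline_by_turn` are hypothetical names, the span names and timestamps are made up, and a production system would attach the turn ID as a span attribute in whatever tracing backend it already uses.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanRecord:
    turn_id: str
    modality: str   # e.g. "audio", "vision", "llm", "tool"
    name: str
    start_ms: int   # offset from the start of the turn


def mint_turn_id() -> str:
    """Call once at the user-action boundary, before any subsystem opens a span."""
    return uuid.uuid4().hex


def timeline_by_turn(spans):
    """Group spans from every modality by turn ID, ordered by start time."""
    turns = defaultdict(list)
    for span in spans:
        turns[span.turn_id].append(span)
    return {tid: sorted(group, key=lambda s: s.start_ms) for tid, group in turns.items()}


# Spans arrive out of order and interleaved with another turn's spans.
turn = mint_turn_id()
other = mint_turn_id()
emitted = [
    SpanRecord(turn, "llm", "chat.completion", 2400),
    SpanRecord(turn, "audio", "asr.transcribe", 0),
    SpanRecord(other, "audio", "asr.transcribe", 50),
    SpanRecord(turn, "vision", "image.describe", 1300),
    SpanRecord(turn, "tool", "send_followup_email", 3100),
]

for span in timeline_by_turn(emitted)[turn]:
    print(f"{span.start_ms:>5} ms  {span.modality:<6}  {span.name}")
```

The grouping step is trivial once the ID exists on every record. The hard part is the first line of each `SpanRecord`: getting every subsystem to write the turn ID in the first place.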

This is harder than it sounds because the turn ID has to survive every place where one modality hands data to another. The vision pipeline takes an image and produces a feature vector and a description; both of those have to carry the turn ID forward to the LLM. The ASR pipeline takes audio and produces a transcript; the transcript has to carry the turn ID. The teams that own these pipelines often have no incentive to thread metadata they do not consume, and the agent team that does consume it discovers the gap only when someone asks them to reconstruct a turn from logs.
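One way to make the turn ID survive a handoff without asking each pipeline team to understand it is to wrap intermediate representations in a container that carries the ID alongside the value. The sketch below assumes such a wrapper is acceptable at your pipeline boundaries; `Tagged`, `fake_asr`, `fake_vision`, and `build_prompt` are hypothetical stand-ins, not real library calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Generic, Mapping, TypeVar

T = TypeVar("T")
U = TypeVar("U")


@dataclass(frozen=True)
class Tagged(Generic[T]):
    """An intermediate representation plus the opaque metadata it must carry."""
    value: T
    turn_id: str
    meta: Mapping[str, str] = field(default_factory=dict)

    def map(self, stage: Callable[[T], U]) -> "Tagged[U]":
        # The stage only ever sees the value; the turn ID rides along untouched.
        return Tagged(stage(self.value), self.turn_id, self.meta)


# Hypothetical stand-ins for the real ASR and vision pipelines.
def fake_asr(audio: bytes) -> str:
    return f"<transcript of {len(audio)} bytes>"


def fake_vision(image: bytes) -> dict:
    return {"description": f"<description of {len(image)} bytes>", "features": [0.0, 0.0]}


def build_prompt(transcript: Tagged[str], image: Tagged[dict]) -> Tagged[str]:
    # Refuse to merge inputs from different turns rather than silently mix them.
    if transcript.turn_id != image.turn_id:
        raise ValueError("inputs belong to different turns")
    text = f"User said: {transcript.value}\nImage shows: {image.value['description']}"
    return Tagged(text, transcript.turn_id)


audio = Tagged(b"\x00" * 320, turn_id="turn-123")
image = Tagged(b"\x89PNG", turn_id="turn-123")
prompt = build_prompt(audio.map(fake_asr), image.map(fake_vision))
print(prompt.turn_id)
```

The design choice here is that the pipeline functions stay unaware of the metadata: `fake_asr` and `fake_vision` take and return plain values, so the team that owns them has nothing to thread. The cost moves to the boundary code, which must call `map` instead of calling the stage directly.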

What Did the Model Actually See

The worst kind of multimodal postmortem is the one where the answer is "we don't know." You have the LLM's response. You have the user's complaint. You do not have the image the user uploaded, because it went to the vision pipeline's bucket and was purged on its own retention schedule. You do not have the audio, because it went to the ASR vendor's storage and you never had a copy. You have the transcript the ASR produced, which may or may not be what the audio actually contained.
