6 posts tagged with "distributed-tracing"

The traceparent header your gateway dropped between LLM call and tool execution

June 3, 2026 · 11 min read

Software Engineer

A user reports that the agent answered correctly but the database update never happened. You open your observability tool, search for the trace ID stamped on the user-facing conversation, and find a clean tree — five LLM calls, four tool decisions, a final response. No errors. Then you search for the tool service that owns the database write, and you find another trace, with the same wall-clock window but a different trace ID, a different root span, and no link back. You search the gateway logs. Three more orphan traces. The agent run that looked like a single coherent interaction in the chat UI fragmented, in your tracing backend, into a forest.

The header that should have stitched it together is traceparent. It is a 55-byte W3C-standard string that every span in a distributed system uses to identify its parent. It is also, in most production LLM agent stacks, dropped at least once between the user's request and the side effect the user actually wanted.

The Trace Timeline Whose Timestamps Were Stamped by the Client Clock, Not the Gateway

June 2, 2026 · 9 min read

Tian Pan

Software Engineer

You opened the trace for a slow conversation. The model call started 800 milliseconds before the user pressed send. You blamed the user's laptop, closed the tab, moved on.

That is not one user with a bad clock. That is roughly a third of your traffic, and every debug session that crosses the client boundary is reading a timeline that does not exist. Browser clocks are user-settable, frequently unsynchronized, and occasionally wrong by days. The instrumentation SDK that ships with most observability stacks stamps client spans with whatever the device reports, links them by traceparent ID into a tree with server spans stamped by a synchronized server clock, and hands the result to your on-call engineer as if the two halves were comparable. They are not.

Multimodal Traces: When Modalities Must Share an ID

May 23, 2026 · 11 min read

Tian Pan

Software Engineer

A user called your support agent. They talked, the agent listened, the user uploaded a screenshot of the error mid-call, the agent reasoned over the image and the transcript, and the conversation ended with a follow-up email summarizing the fix. Three days later the user files a complaint: the fix did not work, and the email never arrived. You open your observability stack and you find three separate traces in three separate UIs. The voice pipeline shows you an ASR trace. The vision pipeline shows a span over the image upload. The LLM call shows a chat trace with a token count and a tool call. Nothing in any of these dashboards tells you they were the same conversation.

This is the postmortem nobody wants to write. Not because the data is missing — every individual modality logged what it was supposed to — but because the join across modalities was never built. Each pipeline grew its own tracing convention from whatever its model vendor shipped by default, and the conversational turn that bound them together exists only in the head of the engineer who designed the agent.

The Distributed Trace That Goes Dark at the Agent Handoff

May 17, 2026 · 11 min read

Tian Pan

Software Engineer

You open the trace for a failed run. The span tree is beautiful: the user request, the planner agent's reasoning, three tool calls, token counts, latencies, all of it nested cleanly. Then the planner hands off to a specialist agent — and the trace ends. Not with an error span. It just stops. The next thing you have is a separate, rootless trace from the specialist agent that begins mid-thought, with no parent, no inputs you can see, and no connection to the request that caused it.

The bug lives in that gap. It always does. The handoff is where one agent's assumptions meet another agent's interpretation, and it is the single place your trace cannot follow.

This is not a logging problem. Your agents are probably emitting spans correctly on both sides. The problem is that the trace context — the thread ID that stitches spans into one story — did not survive the jump from caller to callee. Every HTTP client and gRPC stub in your stack propagates that context for free. Your agent handoff does not, because nobody told it to.

The Audit Trail Mismatch: When User, Agent, and Tool Each Have Different Logs

May 9, 2026 · 10 min read

Tian Pan

Software Engineer

A regulator emails you a single question: did this user authorize this transaction? Six hours later, three engineers are in a chat trying to join the chat surface's conversation log to the planner agent's reasoning trace to the tool's API record. The chat log has a turn ID and the user-visible message but no tool call detail. The planner trace has a tool-invocation record with timestamps that drift from the chat log by several hundred milliseconds. The tool's log has the API call with its own correlation ID that appears nowhere in the agent's record. The downstream service's log has yet another ID with no link back. The team eventually reconstructs the answer by joining on user IDs and approximate timestamps, hopes nothing critical is off by a turn, and ships a PDF to legal.

This is the audit trail mismatch. Every layer's owner believes their logs are fine — and individually, they are. The joined view is the artifact that doesn't exist, and nobody owns its absence. The team only finds out it doesn't exist when an incident, a customer escalation, or a regulator forces the join.

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

April 20, 2026 · 12 min read

Tian Pan

Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.

About Tian Pan