Skip to main content

13 posts tagged with "opentelemetry"

View all tags

The Trace Timeline Whose Timestamps Were Stamped by the Client Clock, Not the Gateway

· 9 min read
Tian Pan
Software Engineer

You opened the trace for a slow conversation. The model call started 800 milliseconds before the user pressed send. You blamed the user's laptop, closed the tab, moved on.

That is not one user with a bad clock. That is roughly a third of your traffic, and every debug session that crosses the client boundary is reading a timeline that does not exist. Browser clocks are user-settable, frequently unsynchronized, and occasionally wrong by days. The instrumentation SDK that ships with most observability stacks stamps client spans with whatever the device reports, links them by traceparent ID into a tree with server spans stamped by a synchronized server clock, and hands the result to your on-call engineer as if the two halves were comparable. They are not.

The Trace That Stops at the Provider Boundary

· 11 min read
Tian Pan
Software Engineer

You did the tracing work. Retrieval has a span. Tool calls have spans. The orchestration loop has a span. A trace ID rides through every internal hop on W3C traceparent headers, just like the SRE playbook says. Then the request hits messages.create, the SDK records a single span called llm.call, and the next 2.8 seconds of your pipeline turn into a black rectangle on the flame graph with no internal structure. The 800 milliseconds before the first token shows up: opaque. The 2 seconds of decode after that: opaque. The share of the wall clock that was network, queue wait, prefill, or per-token decode: unknowable from your trace.

When a customer reports "the assistant felt slow today," your dashboard can confirm the slowness. It cannot localize it. The most expensive minute of your pipeline — measured in dollars, in p95, in user-visible lag — lives inside a vendor's data center, and the contract you accepted when you signed up gives you almost no visibility into it. You are on call for a black box.

The Distributed Trace That Goes Dark at the Agent Handoff

· 11 min read
Tian Pan
Software Engineer

You open the trace for a failed run. The span tree is beautiful: the user request, the planner agent's reasoning, three tool calls, token counts, latencies, all of it nested cleanly. Then the planner hands off to a specialist agent — and the trace ends. Not with an error span. It just stops. The next thing you have is a separate, rootless trace from the specialist agent that begins mid-thought, with no parent, no inputs you can see, and no connection to the request that caused it.

The bug lives in that gap. It always does. The handoff is where one agent's assumptions meet another agent's interpretation, and it is the single place your trace cannot follow.

This is not a logging problem. Your agents are probably emitting spans correctly on both sides. The problem is that the trace context — the thread ID that stitches spans into one story — did not survive the jump from caller to callee. Every HTTP client and gRPC stub in your stack propagates that context for free. Your agent handoff does not, because nobody told it to.

Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

· 9 min read
Tian Pan
Software Engineer

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because the previous month's Anthropic invoice came in 2.4x higher than the prior one, and "we didn't ship anything that should have done that." You open the dashboard, look at request volume — flat. Latency p99 — flat. Output tokens per response — flat. Error rate — flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.

The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

· 10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds — green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.

Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What

· 11 min read
Tian Pan
Software Engineer

A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.

You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.

The agent worked. The trace is intact. The hairball is the bug.

Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression

· 10 min read
Tian Pan
Software Engineer

The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.

The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.

A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

· 11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.

Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.

Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

· 11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.