Skip to main content

10 posts tagged with "opentelemetry"

View all tags

Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

· 9 min read
Tian Pan
Software Engineer

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because the previous month's Anthropic invoice came in 2.4x higher than the prior one, and "we didn't ship anything that should have done that." You open the dashboard, look at request volume — flat. Latency p99 — flat. Output tokens per response — flat. Error rate — flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.

The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

· 10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds — green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.

Agent Identifiability: When Your Trace Can't Tell You Which Agent Did What

· 11 min read
Tian Pan
Software Engineer

A user reports the assistant gave them a wrong answer at 9:47 a.m. You open the trace. There are three hundred and forty spans. They are almost all named agent.run, llm.invoke, or tool.call. Some have a parent. Some are siblings. Three of them retried. One of them retried and then was cancelled. None of them tells you whether the bad output came from the planner, the worker, the critic, the reflection pass, or the second retry of the worker after the critic flagged it.

You spend the next hour grepping log lines for a UUID prefix you saw in a screenshot, cross-referencing timestamps against a Slack notification, and reconstructing the agent topology in your head from the indentation pattern in the trace viewer. Eventually you guess that the third worker invocation ran with a model alias that silently flipped to a different snapshot the night before. You cannot prove it from the trace alone.

The agent worked. The trace is intact. The hairball is the bug.

Agent Trace Sampling: When 'Log Everything' Costs $80K and Still Misses the Regression

· 10 min read
Tian Pan
Software Engineer

The bill arrived in March. Eighty-one thousand dollars on traces alone, up from twelve in November. The team had turned on full agent tracing in October on the theory that more visibility was always better. By Q1 the observability line was running ahead of the inference line — and when an actual regression hit production, the trace that contained the failure was buried under twenty million successful spans nobody needed.

The mistake was not the decision to instrument. The mistake was importing a request-tracing mental model into a workload that does not behave like requests.

A typical web request produces a span tree with a handful of children: handler, database call, cache lookup, downstream service. An agent request produces a tree with five LLM calls, three tool invocations, two vector lookups, intermediate scratchpads, and a planner that reconsiders three of those steps. The same sampling policy that worked for the API gateway — head-sample 1%, keep everything else representative — produces a trace store where the median trace is a 200-span monster, the long tail is the only thing that matters, and the rate at which you discover incidents is uncorrelated with the rate at which you spend money.

Your Span Names Are an Undocumented API: Telemetry Contracts Between Agent Teams

· 10 min read
Tian Pan
Software Engineer

The cost spike that paged finance at 3 a.m. was not a cost spike. It was a span rename. Someone on the agent platform team decided that llm.completion.synthesis should really be llm.generate.answer because it read more naturally, opened a small PR, ran their tests, and shipped. Three days later finance's monthly token-spend dashboard showed a 60% drop. Nobody had cut spending. The aggregation rule still grouped by the old name, and the new spans flowed past it into an "other" bucket that the dashboard didn't even render. The bill didn't move. The dashboard did.

This is a class of incident I keep watching teams rediscover. Span names and attribute keys are not labels for humans to read in a trace UI. They are the public schema of an undocumented API, with consumers that the producer team has never met — eval pipelines that filter on them, cost dashboards that group by them, SLO alerts that fire on their durations, FinOps reports that sum their token attributes. A "harmless rename" inside one team is a wire-protocol break for four other teams that never saw the PR.

Your LLM Span Is Lying: What APM Tools Don't Show About Inference Latency

· 8 min read
Tian Pan
Software Engineer

Your LLM call took 2,340 ms. Your APM span says so. That number is the most expensive lie in your observability stack, because four completely different failure modes all render as the same opaque purple bar. A prefill surge on a long prompt. A cold KV-cache on a tenant you haven't hit in an hour. A noisy neighbor in the provider's continuous batch. A silent routing change that parked your traffic in a different region. Same span. Same duration. Same p99 alert. Four different post-mortems.

The distributed-tracing discipline that worked for microservices — one span per network hop, a duration, a few tags — does not survive contact with hosted inference. An LLM call is not one thing. It's a pipeline of phases with radically different scaling characteristics, running on shared hardware whose behavior depends on who else is in the queue. Treating that as a single opaque span is how you end up spending three days debugging "the model got slow" when the model didn't move at all.

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

· 11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.

Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.

Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

· 11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root their parent sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.

Mastering AI Agent Observability: Why Your Dashboards Are Lying to You

· 9 min read
Tian Pan
Software Engineer

Your agent is returning HTTP 200s. Latency is within SLA. Error rates are flat. Everything on the dashboard looks green — and your users are getting confidently wrong answers.

This is the core observability gap in AI systems: the metrics that traditionally signal system health are almost entirely irrelevant to whether your agent is actually doing its job. An agent can fluently hallucinate, skip required tools, use stale retrieval results, or reason itself into logical contradictions — all while your monitoring shows zero anomalies. The standard playbook for service observability doesn't transfer to agentic systems, and teams that don't understand this gap ship agents they can't trust, debug, or improve.