
6 posts tagged with "llm-observability"


Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

· 9 min read
Tian Pan
Software Engineer

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because last month's Anthropic invoice came in 2.4x higher than the one before, and "we didn't ship anything that should have done that." You open the dashboard and look at request volume: flat. Latency p99: flat. Output tokens per response: flat. Error rate: flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.
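A minimal sketch of tracking that number as its own metric, assuming your provider reports a reasoning-token count separately from visible output and bills both at the output rate. `UsageRecord`, its field names, and `reasoning_share` are hypothetical and not taken from any provider SDK.

```python
from dataclasses import dataclass


@dataclass
class UsageRecord:
    """Per-request token usage. Field names are hypothetical."""
    input_tokens: int
    visible_output_tokens: int
    reasoning_tokens: int

    @property
    def billed_output_tokens(self) -> int:
        # Assumption: reasoning tokens are billed at the output rate.
        return self.visible_output_tokens + self.reasoning_tokens


def reasoning_share(records: list[UsageRecord]) -> float:
    """Fraction of billed output tokens that were reasoning tokens."""
    billed = sum(r.billed_output_tokens for r in records)
    if billed == 0:
        return 0.0
    return sum(r.reasoning_tokens for r in records) / billed


# Visible output is flat across both months; only the reasoning share moves.
last_month = [UsageRecord(1000, 400, 100)]
this_month = [UsageRecord(1000, 400, 800)]
print(reasoning_share(last_month), reasoning_share(this_month))
```

A panel built on visible output alone would show both months as identical, which is the blind spot described above.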

The Streamed-Response Trace Schema Gap: Why Your APM Lies About LLM Latency

· 10 min read
Tian Pan
Software Engineer

A pager fires at 02:14: a customer reports that the assistant "freezes mid-sentence" on long answers. You open the trace. The span for the LLM call shows 8.4 seconds: green, within SLO, no error attribute, finish reason stop. The dashboard widget that aggregates p95 latency for that endpoint is sitting at 9.1s, exactly where it has been for a month. By every signal the APM exposes, the request succeeded.

The user saw the first 200 milliseconds look great, watched the next four seconds produce a coherent paragraph, then watched the same three-sentence fragment repeat for the remaining four seconds before the connection ended. The stuck content loop is a real failure, and the trace knows nothing about it — because the trace was designed for a system that finishes when it returns, not for a system whose behavior is the wall of intermediate state it produced along the way.
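One way to make that intermediate state visible is to wrap the stream and write attributes onto the span as chunks arrive. The sketch below is an illustration under stated assumptions: `span` is a plain dict standing in for whatever your tracer exposes, the attribute names are made up, and the loop check is a naive repeated-sentence counter rather than a production detector.

```python
import re
import time
from collections import Counter
from typing import Callable, Iterable, Iterator


def monitor_stream(
    chunks: Iterable[str],
    span: dict,
    max_repeats: int = 3,
    clock: Callable[[], float] = time.monotonic,
) -> Iterator[str]:
    """Pass chunks through unchanged while recording stream-level signals."""
    start = clock()
    counts: Counter = Counter()
    buffer = ""
    span.update(time_to_first_chunk_s=None, chunk_count=0, loop_detected=False)

    def record(sentence: str) -> None:
        key = sentence.strip().lower()
        if key:
            counts[key] += 1
            if counts[key] >= max_repeats:
                span["loop_detected"] = True

    for chunk in chunks:
        if span["time_to_first_chunk_s"] is None:
            span["time_to_first_chunk_s"] = clock() - start
        span["chunk_count"] += 1
        buffer += chunk
        # Split off complete sentences; keep the unfinished tail buffered.
        *complete, buffer = re.split(r"(?<=[.!?])\s+", buffer)
        for sentence in complete:
            record(sentence)
        yield chunk

    record(buffer)  # count whatever was left when the stream ended
    span["duration_s"] = clock() - start


span: dict = {}
stream = ["Hello there. ", "The order ", "is ready. ",
          "The order is ready. ", "The order is ready."]
text = "".join(monitor_stream(stream, span))
print(span["loop_detected"], span["chunk_count"])
```

With attributes like these on the span, a request that finishes inside its SLO while looping can still be queried and alerted on.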

The Monday Morning AI Degradation Your Dashboard Treats As Noise

· 10 min read
Tian Pan
Software Engineer

Pull up your AI feature's latency and quality dashboards and squint. The line is mostly flat with occasional spikes your team has been calling "noise" or "provider weirdness" for months. Now break that same data out by hour-of-day and day-of-week. The noise resolves into a face: every Monday between 9 and 11am Eastern, your p95 latency is 30–60% worse than it is on a Saturday night, your cache hit rate dips 10–20 points, your retry rate doubles, and your token spend per task quietly climbs. The dashboard wasn't lying. It was averaging.

Most teams discover this pattern the way you discover a slow leak: by tracing the cost back from a quarterly bill nobody can explain. The instinct is to call it provider flakiness, file a ticket with the inference vendor, and move on. But the pattern isn't really about your LLM provider. It's about the fact that your AI feature now sits on top of a stack of shared, time-of-day-sensitive systems — the model API, the embedding API, the dependent SaaS tools your agent calls, the customer's own infrastructure on the receiving end of webhooks — and the cyclic load patterns of every one of them compose. You inherited the diurnal curve of an entire dependency chain, and your dashboard is showing you the average of all of them.
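The hour-of-day and day-of-week breakdown can be sketched in a few lines. This assumes latency samples arrive as `(timestamp, seconds)` pairs whose timestamps are already in the timezone you care about; `p95` uses the nearest-rank method, and both function names are illustrative.

```python
import math
from collections import defaultdict
from datetime import datetime


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def p95_by_weekday_hour(
    samples: list[tuple[datetime, float]],
) -> dict[tuple[int, int], float]:
    """Group latencies by (weekday, hour); weekday 0 is Monday."""
    buckets: dict[tuple[int, int], list[float]] = defaultdict(list)
    for ts, latency_s in samples:
        buckets[(ts.weekday(), ts.hour)].append(latency_s)
    return {key: p95(vals) for key, vals in buckets.items()}


samples = [
    (datetime(2024, 1, 8, 9, 5), 1.0),    # Monday morning
    (datetime(2024, 1, 8, 9, 20), 1.2),
    (datetime(2024, 1, 8, 9, 40), 3.0),
    (datetime(2024, 1, 6, 23, 5), 1.0),   # Saturday night
    (datetime(2024, 1, 6, 23, 20), 1.1),
    (datetime(2024, 1, 6, 23, 40), 1.2),
]
by_bucket = p95_by_weekday_hour(samples)
print(by_bucket[(0, 9)], by_bucket[(5, 23)])
```

The same grouping applies to cache hit rate, retry rate, and tokens per task; the point is to compare buckets against each other instead of against a single averaged line.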

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

· 11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
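A minimal sketch of what writing that contract down could look like: every tool needs an explicit maximum age, and every result carries its age so the prompt and the trace can show it. `ToolResultCache` and its method names are hypothetical; only `get_inventory_status` and the example payload come from the scenario above.

```python
import time
from typing import Any, Callable


class ToolResultCache:
    """Cache that refuses tools with no explicit max age and labels result age."""

    def __init__(self, max_age_s: dict[str, float],
                 clock: Callable[[], float] = time.time) -> None:
        self._max_age_s = max_age_s
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[Any, float]] = {}

    def get(self, tool: str, args_key: str,
            fetch: Callable[[], Any]) -> dict[str, Any]:
        # KeyError here means nobody has decided a TTL for this tool yet.
        max_age = self._max_age_s[tool]
        now = self._clock()
        entry = self._entries.get((tool, args_key))
        if entry is None or now - entry[1] > max_age:
            entry = (fetch(), now)
            self._entries[(tool, args_key)] = entry
        value, fetched_at = entry
        # The age travels with the value so staleness is never implicit.
        return {"value": value, "age_s": now - fetched_at, "max_age_s": max_age}


cache = ToolResultCache({"get_inventory_status": 60.0})
result = cache.get(
    "get_inventory_status",
    "sku-123",
    lambda: {"available": 142, "warehouse": "SEA-3"},
)
print(result["value"], result["age_s"])
```

Failing on a missing TTL is a deliberate choice in this sketch: it turns an inherited framework default into a decision someone has to make per tool.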

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

· 12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
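As a rough illustration of logging the tree instead of the leaf, the record below keeps the prompt version, retrieved chunks, tool calls, and model version next to the output. `DecisionTrace` and all of its field names are assumptions made for this sketch, not a standard schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class DecisionTrace:
    """One record per request, holding every input that shaped the output."""
    request_id: str
    user_id: str
    model_version: str
    prompt_version: str
    retrieved_chunk_ids: list[str] = field(default_factory=list)
    tool_calls: list[dict] = field(default_factory=list)
    output: str = ""

    def to_log_line(self) -> str:
        # Sorted keys keep log lines stable and easy to diff.
        return json.dumps(asdict(self), sort_keys=True)


trace = DecisionTrace(
    request_id="req-001",
    user_id="user-42",
    model_version="model-v2",
    prompt_version="policy-advice@7",
)
trace.retrieved_chunk_ids.append("doc-17#chunk-3")
trace.tool_calls.append({"name": "lookup_policy", "args": {"policy_id": "P-9"}})
trace.output = "Your policy covers water damage."
print(trace.to_log_line())
```

With a record like this, the support ticket's timestamp and user ID lead to one line that names the prompt version, the context, and the model that handled the request.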

The Agent Debugging Problem: Why Printf Doesn't Work When Your Code Thinks

· 9 min read
Tian Pan
Software Engineer

Your agent returned a 200 status code. The response was fluent, grammatically perfect, and completely wrong. Welcome to the agent debugging problem — where the system never crashes, never throws an exception, and fails in ways that look indistinguishable from success.

Traditional debugging assumes that bugs manifest as errors. A stack trace points you to the line. A failing assertion tells you what went wrong. But agents don't crash when they make bad decisions. They confidently execute the wrong plan, call the wrong tool with plausible-looking parameters, and deliver a polished answer built on a hallucinated foundation. The bug isn't in your code — it's in your agent's reasoning, and your debugger has no idea what reasoning even looks like.