
4 posts tagged with "llm-observability"


The Monday Morning AI Degradation Your Dashboard Treats As Noise

10 min read
Tian Pan
Software Engineer

Pull up your AI feature's latency and quality dashboards and squint. The line is mostly flat with occasional spikes your team has been calling "noise" or "provider weirdness" for months. Now break that same data out by hour-of-day and day-of-week. The noise resolves into a pattern: every Monday between 9 and 11am Eastern, your p95 latency is 30–60% worse than it is on a Saturday night, your cache hit rate dips 10–20 points, your retry rate doubles, and your token spend per task quietly climbs. The dashboard wasn't lying. It was averaging.
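The breakout itself is mechanical once the request log is structured. A minimal sketch in pandas, assuming a per-request log whose column names (ts, latency_ms, cache_hit, retried, total_tokens) are hypothetical:

```python
import pandas as pd

# Hypothetical request log: one row per LLM call. Column names are assumptions.
df = pd.read_parquet("llm_request_log.parquet")
ts = pd.to_datetime(df["ts"], utc=True).dt.tz_convert("US/Eastern")

# Break the same metrics out by day-of-week and hour-of-day instead of averaging.
grouped = df.groupby([ts.dt.day_name().rename("dow"), ts.dt.hour.rename("hour")])

profile = grouped["latency_ms"].quantile(0.95).to_frame("p95_latency_ms")
profile["cache_hit_rate"] = grouped["cache_hit"].mean()
profile["retry_rate"] = grouped["retried"].mean()
profile["tokens_per_task"] = grouped["total_tokens"].mean()

# The comparison from above: Monday 9-11am ET vs. Saturday night.
print(profile.loc["Monday"].loc[9:11])
print(profile.loc["Saturday"].loc[20:23])
```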

Most teams discover this pattern the way you discover a slow leak: by tracing the cost back from a quarterly bill nobody can explain. The instinct is to call it provider flakiness, file a ticket with the inference vendor, and move on. But the pattern isn't really about your LLM provider. It's about the fact that your AI feature now sits on top of a stack of shared, time-of-day-sensitive systems — the model API, the embedding API, the dependent SaaS tools your agent calls, the customer's own infrastructure on the receiving end of webhooks — and the cyclic load patterns of every one of them compose. You inherited the diurnal curve of an entire dependency chain, and your dashboard is showing you the average of all of them.

Your Tool-Result Cache Is a Stale-Data Contract You Never Wrote

11 min read
Tian Pan
Software Engineer

The trace looks clean. The agent called get_inventory_status, the tool returned {"available": 142, "warehouse": "SEA-3"}, and the model wove that into a confident answer. The customer placed an order. The warehouse said the item had been out of stock since 9 a.m. The cached row was four hours old. Nobody on the team had decided four hours was acceptable — that was just whatever the cache framework defaulted to when the platform team wired up the wrapper.

This is the failure mode that gets misfiled as a hallucination. The model isn't confabulating; it is faithfully reasoning over a stale tool result that nobody bothered to label as stale. The trace logs a clean call and a clean response, the eval set never saw a stale-cache case, and the regression compounds quietly across every customer who hits the same TTL window.
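One way to make that contract explicit is to stamp every cached tool result with its fetch time and a per-tool freshness budget, so the model reasons over "inventory as of 37 seconds ago" instead of a bare value. A minimal sketch; the tool names and TTL values are hypothetical:

```python
import time
from dataclasses import dataclass
from typing import Any, Callable

# Per-tool freshness budgets, decided explicitly instead of inherited from a
# cache framework default. Tool names and values here are hypothetical.
MAX_AGE_SECONDS = {
    "get_inventory_status": 60,      # inventory moves fast
    "get_shipping_rates": 6 * 3600,  # rate tables change rarely
}

@dataclass
class CachedResult:
    value: Any
    fetched_at: float  # unix seconds when the underlying call actually ran

    def age(self) -> float:
        return time.time() - self.fetched_at

_cache: dict = {}

def cached_call(tool: str, key: str, fetch: Callable[[], Any]) -> dict:
    """Return the tool result plus the staleness metadata the model will see."""
    budget = MAX_AGE_SECONDS[tool]  # KeyError by design: every tool must opt in
    hit = _cache.get((tool, key))
    if hit is None or hit.age() > budget:
        hit = CachedResult(value=fetch(), fetched_at=time.time())
        _cache[(tool, key)] = hit
    # Label the result instead of returning a bare value: the prompt can now
    # say "inventory as of 37 seconds ago" rather than implying "right now".
    return {"value": hit.value, "age_seconds": round(hit.age(), 1)}

# Example: the get_inventory_status call from above, with a stubbed fetch.
print(cached_call("get_inventory_status", "SEA-3",
                  lambda: {"available": 142, "warehouse": "SEA-3"}))
```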

The Attribution Gap: How to Trace a User Complaint Back to a Specific Model Decision

12 min read
Tian Pan
Software Engineer

A support ticket arrives: "Your AI gave me completely wrong advice about my insurance policy." You check the logs. You find a timestamp and a user ID. The actual model response is there, printed verbatim. But you have no idea which prompt version produced it, which context chunks were retrieved, whether a tool was called mid-chain, or which of the three model versions you've deployed in the past month actually handled that request. You can read the output. You cannot explain it.

This is the attribution gap — and it's the operational problem most AI teams hit six to eighteen months after they first ship a model-backed feature. The failure isn't in the model or the prompt; it's in the observability infrastructure. Traditional logging captures request-response pairs. LLM pipelines are not request-response pairs. They're decision trees: context retrieval, prompt assembly, optional tool calls, model inference, post-processing, conditional branching. When something goes wrong, you need the full tree, not just the leaf.
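Closing the gap means recording the tree, not just the leaf: one trace ID per request, and a span per stage carrying the identifiers you will need when the ticket arrives (prompt version, model version, retrieved chunk IDs). A minimal sketch with hypothetical stage and field names:

```python
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class Span:
    stage: str    # "retrieval", "prompt_assembly", "tool_call", "inference", ...
    attrs: dict   # the stage-specific identifiers you'll need for attribution
    started_at: float
    ended_at: float

@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    user_id: str = ""
    spans: list = field(default_factory=list)

    def record(self, stage: str, attrs: dict, started_at: float) -> None:
        self.spans.append(Span(stage, attrs, started_at, time.time()))

    def emit(self) -> None:
        # One JSON line per request: the whole decision tree, queryable later.
        print(json.dumps(asdict(self)))

# Per-request usage: every branch the pipeline takes leaves an identifier.
trace = Trace(user_id="u_4921")
t0 = time.time()
trace.record("retrieval", {"chunk_ids": ["doc_88#c3", "doc_12#c1"],
                           "index_version": "2024-05-01"}, t0)
t0 = time.time()
trace.record("prompt_assembly", {"prompt_version": "policy_advice_v7"}, t0)
t0 = time.time()
trace.record("inference", {"model_version": "chat-model-v3",
                           "temperature": 0.2}, t0)
trace.emit()
```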

The Agent Debugging Problem: Why Printf Doesn't Work When Your Code Thinks

9 min read
Tian Pan
Software Engineer

Your agent returned a 200 status code. The response was fluent, grammatically perfect, and completely wrong. Welcome to the agent debugging problem — where the system never crashes, never throws an exception, and fails in ways that look indistinguishable from success.

Traditional debugging assumes that bugs manifest as errors. A stack trace points you to the line. A failing assertion tells you what went wrong. But agents don't crash when they make bad decisions. They confidently execute the wrong plan, call the wrong tool with plausible-looking parameters, and deliver a polished answer built on a hallucinated foundation. The bug isn't in your code — it's in your agent's reasoning, and your debugger has no idea what reasoning even looks like.
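The corollary is that instrumentation has to capture decisions rather than exceptions: which tool the agent chose, with what arguments, and against what stated rationale. A minimal sketch of structured decision logging; the agent loop and field names are hypothetical:

```python
import json
import time
from dataclasses import dataclass

@dataclass
class Decision:
    rationale: str  # the model's stated intent for this step
    tool: str
    args: dict
    result: str

def log_decision(step: int, d: Decision) -> None:
    """Printf for reasoning: one structured line per decision, not per error.

    A wrong-but-plausible tool call never raises an exception on its own, so
    the only way to see it later is to record the choice when it was made."""
    print(json.dumps({
        "ts": time.time(),
        "step": step,
        "rationale": d.rationale,
        "tool": d.tool,
        "args": d.args,
        "result_preview": d.result[:200],
    }))

# Stand-in for an agent loop; in practice these come from the model's
# tool-call outputs.
decisions = [
    Decision("Look up the user's policy before advising", "get_policy",
             {"user_id": "u_4921"}, '{"deductible": 500}'),
    Decision("Quote the deductible back to the user", "send_reply",
             {"text": "Your deductible is $500."}, "ok"),
]
for step, d in enumerate(decisions):
    log_decision(step, d)
```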