Skip to main content

299 posts tagged with "observability"

View all tags

Decision Provenance in Agentic Systems: Audit Trails That Actually Work

· 13 min read
Tian Pan
Software Engineer

An agent running in your production system deletes 10,000 database records. The deletion matches valid business logic — the records were flagged correctly. But three months later, a regulator asks a simple question: who authorized this, and on what basis did the agent decide? You open your logs. You find the SQL statement. You find the timestamp. You find nothing else.

This is the decision provenance problem. You can prove that your agent acted; you cannot prove why, or whether that action was ever sanctioned by a human who understood what they were approving. With autonomous agents now executing workflows that span hours, dozens of tool calls, and decisions with real-world consequences, the gap between "we have logs" and "we have accountability" has become operationally dangerous.

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

· 11 min read
Tian Pan
Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.

Hallucination Is Not a Root Cause: A Debugging Methodology for AI in Production

· 10 min read
Tian Pan
Software Engineer

When a lawyer cited non-existent court cases in a federal filing, the incident was widely reported as "ChatGPT hallucinated." When a consulting firm's government report contained phantom footnotes, the postmortem read "AI fabricated citations." When a healthcare transcription tool inserted violent language into medical notes, the explanation was simply "the model hallucinated." In each case, an expensive failure got a three-word root cause that made remediation impossible.

"The model hallucinated" is the AI equivalent of writing "unknown error" in a stack trace. It describes what happened without telling you why it happened or how to fix it. Every hallucination has a diagnosable cause — usually one of four categories — and each category demands a different engineering response. Teams that understand this distinction ship AI systems that degrade gracefully. Teams that don't keep playing whack-a-mole with prompts.

Why Hallucination Rate Is the Wrong Primary Metric for Production LLM Systems

· 8 min read
Tian Pan
Software Engineer

Your LLM's hallucination rate is 3%. Your users hate it anyway. This isn't a contradiction — it's a symptom of measuring the wrong thing.

Hallucination rate has become the default headline metric for LLM quality because it's easy to explain to stakeholders and straightforward to compute on a benchmark. But in production, it correlates poorly with what users actually care about: did the task get done, was the result trustworthy enough to act on, and did the system save them time?

Invisible Model Drift: How Silent Provider Updates Break Production AI

· 10 min read
Tian Pan
Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.

On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

· 10 min read
Tian Pan
Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot "rollback" a model that a third-party provider updated silently overnight.

SLOs for Non-Deterministic AI Features: Setting Error Budgets When Wrong Is Probabilistic

· 10 min read
Tian Pan
Software Engineer

Your AI feature is "up." Latency is fine. Error rate is 0.2%. The dashboard is green. But over the past two weeks, the summarization quality quietly dropped — outputs are now technically coherent but factually shallow, consistently missing the key detail users care about. Nobody filed a bug. No alert fired. And you won't know until the next quarterly review when retention numbers come in.

This is the failure mode that traditional SLOs are blind to. Availability and latency measure whether your service is responding — not whether it's responding well. For deterministic systems, those two things are nearly equivalent. For LLM features, they can diverge silently for weeks.

SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Four LangChain agents — an Analyzer and a Verifier — passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The AI Ops Dashboard Nobody Builds Until It's Too Late

· 11 min read
Tian Pan
Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The other 80% hides in five failure modes that most teams discover only after users have already noticed.

Tracing the Planning Layer: Why Your Agent Traces Are Missing Half the Story

· 11 min read
Tian Pan
Software Engineer

Your agent called the wrong tool three times before finally succeeding, and your trace dashboard shows you exactly which tools were called, in what order, with full latency breakdowns. What the trace doesn't show you is the part that matters: why the agent thought those tool calls were the right move, what goal it was trying to satisfy, and what assumption it was operating under when it made each wrong decision.

This is the gap at the center of agent observability in 2026. Practitioners have invested heavily in tool-call tracing. The tooling is mature, the OpenTelemetry semantic conventions are established, and the dashboards are beautiful. But agent debugging keeps running into the same wall: you have complete visibility into what the agent did, and zero visibility into why.