348 posts tagged with "observability"

Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It

April 19, 2026 · 11 min read

Software Engineer

Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.

Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.

Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.
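One way to catch the repeated-tool-call loop described above is to fingerprint each call and count repeats inside a sliding window. The sketch below is a minimal illustration under stated assumptions: `ToolCallLoopDetector`, its window size, and its repeat threshold are hypothetical names and defaults, not something the post prescribes.

```python
import hashlib
import json
from collections import Counter, deque


class ToolCallLoopDetector:
    """Flags an agent that keeps issuing the same tool call.

    Hypothetical helper: the window size and repeat threshold are
    illustrative defaults, not values taken from the post.
    """

    def __init__(self, window: int = 20, max_repeats: int = 5):
        self.max_repeats = max_repeats
        self.recent = deque(maxlen=window)  # fingerprints of the last N calls
        self.counts = Counter()             # occurrences of each fingerprint in the window

    @staticmethod
    def fingerprint(tool: str, args: dict) -> str:
        # Canonical JSON so argument order does not change the hash.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, tool: str, args: dict) -> bool:
        """Record one call; return True when it looks like a loop."""
        fp = self.fingerprint(tool, args)
        if len(self.recent) == self.recent.maxlen:
            oldest = self.recent[0]  # about to be evicted by append()
            self.counts[oldest] -= 1
        self.recent.append(fp)
        self.counts[fp] += 1
        return self.counts[fp] >= self.max_repeats
```

In practice the agent runner would call `record()` before dispatching each tool call and pause or alert when it returns `True`. Exact-match fingerprints miss loops where the arguments vary slightly, so this catches only the simplest case.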

Decision Provenance in Agentic Systems: Audit Trails That Actually Work

April 19, 2026 · 13 min read

Tian Pan

Software Engineer

An agent running in your production system deletes 10,000 database records. The deletion matches valid business logic — the records were flagged correctly. But three months later, a regulator asks a simple question: who authorized this, and on what basis did the agent decide? You open your logs. You find the SQL statement. You find the timestamp. You find nothing else.

This is the decision provenance problem. You can prove that your agent acted; you cannot prove why, or whether that action was ever sanctioned by a human who understood what they were approving. With autonomous agents now executing workflows that span hours, dozens of tool calls, and decisions with real-world consequences, the gap between "we have logs" and "we have accountability" has become operationally dangerous.

Distributed Tracing Across Agent Service Boundaries: The Context Propagation Gap

April 19, 2026 · 11 min read

Tian Pan

Software Engineer

Most distributed tracing setups work fine until you add agents. The moment your system has Agent A spawning Agent B across a microservice boundary—Agent B calling a tool server, that tool server fetching from a vector database—the coherent end-to-end view shatters into disconnected fragments. Your tracing backend shows individual operations, but you've lost the causal chain that tells you why something happened, which user request triggered it, and where in the pipeline 800 milliseconds went.

This isn't a monitoring configuration problem. It's a context propagation architecture problem, and it has a specific technical shape that most teams discover the hard way.

Hallucination Is Not a Root Cause: A Debugging Methodology for AI in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

When a lawyer cited non-existent court cases in a federal filing, the incident was widely reported as "ChatGPT hallucinated." When a consulting firm's government report contained phantom footnotes, the postmortem read "AI fabricated citations." When a healthcare transcription tool inserted violent language into medical notes, the explanation was simply "the model hallucinated." In each case, an expensive failure got a three-word root cause that made remediation impossible.

"The model hallucinated" is the AI equivalent of writing "unknown error" in a stack trace. It describes what happened without telling you why it happened or how to fix it. Every hallucination has a diagnosable cause — usually one of four categories — and each category demands a different engineering response. Teams that understand this distinction ship AI systems that degrade gracefully. Teams that don't keep playing whack-a-mole with prompts.

Why Hallucination Rate Is the Wrong Primary Metric for Production LLM Systems

April 19, 2026 · 8 min read

Tian Pan

Software Engineer

Your LLM's hallucination rate is 3%. Your users hate it anyway. This isn't a contradiction — it's a symptom of measuring the wrong thing.

Hallucination rate has become the default headline metric for LLM quality because it's easy to explain to stakeholders and straightforward to compute on a benchmark. But in production, it correlates poorly with what users actually care about: did the task get done, was the result trustworthy enough to act on, and did the system save them time?

Invisible Model Drift: How Silent Provider Updates Break Production AI

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your prompts worked on Monday. On Wednesday, users start complaining that responses feel off — answers are shorter, the JSON parsing downstream is breaking intermittently, the classifier that had been 94% accurate is now hovering around 79%. You haven't deployed anything. The model you're calling still has the same name in your config. But something changed.

This is invisible model drift: the silent, undocumented behavior changes that LLM providers push without announcement. It is one of the least-discussed operational hazards in AI engineering, and it hits teams that have done everything "right" — with evals, with monitoring, with stable prompt engineering. The model just changed underneath them.

On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a probability distribution, not a line of code. And you cannot roll back a model that a third-party provider updated silently overnight.

SLOs for Non-Deterministic AI Features: Setting Error Budgets When Wrong Is Probabilistic

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature is "up." Latency is fine. Error rate is 0.2%. The dashboard is green. But over the past two weeks, the summarization quality quietly dropped — outputs are now technically coherent but factually shallow, consistently missing the key detail users care about. Nobody filed a bug. No alert fired. And you won't know until the next quarterly review when retention numbers come in.

This is the failure mode that traditional SLOs are blind to. Availability and latency measure whether your service is responding — not whether it's responding well. For deterministic systems, those two things are nearly equivalent. For LLM features, they can diverge silently for weeks.
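One way to picture a quality-aware SLO is to spend the error budget on graded outputs rather than on failed requests. The sketch below is an assumption-laden illustration: `QualitySLO`, its `target` field, and the idea of sampling and grading responses are hypothetical, and how outputs get graded (human rubric, eval model, or both) is left open.

```python
from dataclasses import dataclass


@dataclass
class QualitySLO:
    """Error budget over graded outputs instead of HTTP status codes.

    Hypothetical shape: `target` is the fraction of sampled responses that
    must pass a quality check.
    """
    target: float  # e.g. 0.95 means 95% of graded outputs must pass

    def budget_remaining(self, graded: int, failed: int) -> float:
        """Fraction of the error budget still unspent; can go negative."""
        if graded == 0:
            return 1.0  # nothing graded yet, nothing spent
        allowed_failures = (1.0 - self.target) * graded
        if allowed_failures == 0:
            return 1.0 if failed == 0 else float("-inf")
        return 1.0 - failed / allowed_failures
```

With a 95% target, 1,000 graded summaries allow roughly 50 failures. 25 failures leave about half the budget, and 80 failures push the budget negative even while availability and latency stay green.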

SRE for AI Agents: What Actually Breaks at 3am

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

A market research pipeline ran uninterrupted for eleven days. Two LangChain agents, an Analyzer and a Verifier, passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.

The Vanishing Blame Problem in AI Incident Post-Mortems

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

April 18, 2026 · 12 min read

Tian Pan

Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The AI Ops Dashboard Nobody Builds Until It's Too Late

April 18, 2026 · 11 min read

Tian Pan

Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The remaining 80%-plus hides in five failure modes that most teams discover only after users have already noticed.
