348 posts tagged with "observability"

AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.

This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. From the outside, an LLM that silently hallucinates looks exactly like one that's working correctly.

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

Behavioral Signals That Actually Measure User Satisfaction in AI Products

April 20, 2026 · 9 min read

Tian Pan

Software Engineer

Most AI product teams ship a thumbs-up/thumbs-down widget and call it a satisfaction measurement system. They are measuring something — just not satisfaction.

A developer who presses thumbs-down on a Copilot suggestion because the function signature is wrong, and a developer who presses thumbs-down because the suggestion was excellent but not what they needed right now, are generating the same signal. Meanwhile, the developer who quietly regenerated the response four times before giving up generates no explicit signal at all. That absent signal is a better predictor of churn than anything the rating widget captures.

The implicit behavioral record your users leave while using your AI product is richer, more honest, and more actionable than anything they'll type or tap voluntarily. This post covers which signals to collect, why they outperform explicit feedback, and the instrumentation schema that keeps AI-specific telemetry from poisoning your general product analytics.

Data Lineage for AI Systems: Tracking the Path from Source to Response

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.

Why Your LLM Alerting Is Always Two Weeks Late

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180ms p50 latency, $0.04 per request: everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.
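
As a rough illustration of what an output-level metric could look like, the sketch below compares response length and hedging frequency between a baseline window and a current window. The phrase list, function names, and thresholds are all assumptions made for this example, not something the post prescribes.

```python
from statistics import mean

# Hypothetical hedging phrases; a real list would be tuned per product.
HEDGES = ("i'm not sure", "it depends", "i cannot", "as an ai")

def output_stats(responses):
    """Return mean word count and the fraction of responses containing a hedge."""
    lengths = [len(r.split()) for r in responses]
    hedged = [any(h in r.lower() for h in HEDGES) for r in responses]
    return {"mean_words": mean(lengths), "hedge_rate": sum(hedged) / len(responses)}

def drifted(baseline, current, max_rel_drop=0.3, max_hedge_rise=0.2):
    """Flag drift when responses get much shorter or noticeably more hedged.

    The default thresholds are illustrative assumptions, not recommendations.
    """
    rel_drop = 1 - current["mean_words"] / baseline["mean_words"]
    hedge_rise = current["hedge_rate"] - baseline["hedge_rate"]
    return rel_drop > max_rel_drop or hedge_rise > max_hedge_rise
```

Both statistics are computed from response text alone, so a check like this can run next to existing infrastructure dashboards without touching the request path.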

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is a common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is one of the most impactful cost levers in an LLM stack, yet most teams treat it as a one-time setup task rather than a production metric.
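
A minimal sketch of treating cache hit rate as a tracked metric, assuming each logged request records how many prompt tokens were sent and how many were served from cache. The field names `prompt_tokens` and `cached_tokens` and the alert floor are hypothetical; real provider usage payloads name these differently.

```python
def cache_hit_rate(requests):
    """Token-weighted hit rate: cached prompt tokens / all prompt tokens."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    return cached / total if total else 0.0

def should_alert(rate, floor=0.5):
    """Return True when the hit rate falls below an (assumed) acceptable floor."""
    return rate < floor

# Example window: one well-cached request and one mostly uncached request.
window = [
    {"prompt_tokens": 1000, "cached_tokens": 720},
    {"prompt_tokens": 1000, "cached_tokens": 180},
]
rate = cache_hit_rate(window)
```

Weighting by tokens rather than by request count keeps the metric aligned with cost, since a miss on a long prompt costs more than a miss on a short one.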

Your Prompt Is a Liability with No Type System

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

Three words nearly killed a production feature. A team added "please be concise" to a customer-facing prompt during a routine copy improvement pass. Within four hours, structured-output error rates spiked dramatically, downstream parsing broke, and revenue-generating workflows halted. The fix was straightforward — revert the change. The nightmare was that they didn't know which change caused it, because the prompt lived as a hardcoded string constant with no version history, no tests, and no rollback mechanism. The incident was preventable with infrastructure that most teams still haven't built.

Prompts are now the most important and least governed code in your system.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

April 19, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams building AI products spend weeks designing rating widgets: click-to-rate stars and thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate, biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.
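
The signals described above could be derived from an ordinary event log. The sketch below assumes a session is a list of event names in order, and that `accept`, `undo`, and `rephrase` are the names your instrumentation emits; both the schema and the names are hypothetical.

```python
def implicit_signals(events):
    """Summarize one session's ordered event names into implicit signals."""
    accepts = quick_undos = streak = max_streak = 0
    # Pair each event with the one before it (None for the first event).
    for prev, cur in zip([None] + events, events):
        if cur == "accept":
            accepts += 1
        if prev == "accept" and cur == "undo":
            quick_undos += 1  # an accept that was reverted immediately
        streak = streak + 1 if cur == "rephrase" else 0
        max_streak = max(max_streak, streak)
    return {
        "accepts": accepts,
        "quick_undos": quick_undos,
        "max_rephrase_streak": max_streak,
    }

session = ["prompt", "rephrase", "rephrase", "rephrase", "accept", "undo"]
signals = implicit_signals(session)
```

Counting consecutive rephrases separately from total rephrases is a deliberate choice here: a long unbroken streak is the pattern that suggests repeated failed responses.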

Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."

Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.

The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.
