Skip to main content

299 posts tagged with "observability"

View all tags

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

· 10 min read
Tian Pan
Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

Behavioral Signals That Actually Measure User Satisfaction in AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI product teams ship a thumbs-up/thumbs-down widget and call it a satisfaction measurement system. They are measuring something — just not satisfaction.

A developer who presses thumbs-down on a Copilot suggestion because the function signature is wrong, and a developer who presses thumbs-down because the suggestion was excellent but not what they needed right now, are generating the same signal. Meanwhile, the developer who quietly regenerated the response four times before giving up generates no explicit signal at all. That absent signal is a better predictor of churn than anything the rating widget captures.

The implicit behavioral record your users leave while using your AI product is richer, more honest, and more actionable than anything they'll type or tap voluntarily. This post covers which signals to collect, why they outperform explicit feedback, and the instrumentation schema that keeps AI-specific telemetry from poisoning your general product analytics.

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.

Why Your LLM Alerting Is Always Two Weeks Late

· 10 min read
Tian Pan
Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180ms p50 latency, $0.04 per request—everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

· 10 min read
Tian Pan
Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

· 10 min read
Tian Pan
Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.

Your Prompt Is a Liability with No Type System

· 10 min read
Tian Pan
Software Engineer

Three words nearly killed a production feature. A team added "please be concise" to a customer-facing prompt during a routine copy improvement pass. Within four hours, structured-output error rates spiked dramatically, downstream parsing broke, and revenue-generating workflows halted. The fix was straightforward — revert the change. The nightmare was that they didn't know which change caused it, because the prompt lived as a hardcoded string constant with no version history, no tests, and no rollback mechanism. The incident was preventable with infrastructure that most teams still haven't built.

Prompts are now the most important and least governed code in your system.

When Your AI Feature Ages Out: Knowledge Cutoffs and Temporal Grounding in Production

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped in Q3. Evals looked good. Users were happy. Six months later, satisfaction scores have dropped 18 points, but your dashboards still show 99.9% uptime and sub-200ms latency. Nothing looks broken. Nothing is broken — in the traditional sense. The model is responding. The infrastructure is healthy. The feature is just quietly wrong.

This is what temporal decay looks like in production AI systems. It doesn't announce itself with errors. It accumulates as a gap between what the model knows and what the world has become — and by the time your support queue reflects it, the damage has been running for months.

The AI Feature Maintenance Cliff: Why Your AI-Powered Features Age Faster Than You Think

· 9 min read
Tian Pan
Software Engineer

You ship an AI-powered feature, users love it, and then three months later your support inbox fills up with confused complaints. Nothing in your infrastructure changed. The code is identical. But the feature quietly stopped being good.

This is the AI feature maintenance cliff: the moment when accumulated silent degradation becomes a visible failure. Unlike traditional software bugs, which announce themselves with stack traces and failed requests, AI quality erosion returns HTTP 200 with well-formed JSON and completely wrong answers. Your dashboards are green. Your feature is broken.

A cross-institutional study covering 32 datasets across four industries found that 91% of ML models degrade over time without proactive intervention. That's not a tail risk — it's the expected outcome for every AI feature you ship and walk away from.

The Feedback Loop You Never Closed: Turning User Behavior into AI Ground Truth

· 10 min read
Tian Pan
Software Engineer

Most teams building AI products spend weeks designing rating widgets, click-to-rate stars, thumbs-up/thumbs-down buttons. Then they look at the data six months later and find a 2% response rate — biased toward outlier experiences, dominated by people with strong opinions, and almost entirely useless for distinguishing a 7/10 output from a 9/10 one.

Meanwhile, every user session is generating a continuous stream of honest, unambiguous behavioral signals. The user who accepts a code suggestion and moves on is satisfied. The user who presses Ctrl+Z immediately is not. The user who rephrases their question four times in a row is telling you something explicit ratings will never capture: the first three responses failed. These signals exist whether you collect them or not. The question is whether you're closing the loop.

Why 'Fix the Prompt' Is a Root Cause Fallacy: Blameless Postmortems for AI Systems

· 9 min read
Tian Pan
Software Engineer

Your LLM-powered feature starts returning nonsense. The on-call engineer pages the ML team. They look at the output, compare it to what the prompt was supposed to produce, and within the hour the ticket is resolved: "bad prompt — tweaked and redeployed." Incident closed. Postmortem written. Action items: "improve prompt engineering process."

Two weeks later, the same class of failure happens again. Different prompt, different feature — but the same invisible root cause.

The "fix the prompt" reflex is the AI engineering equivalent of blaming the last developer to touch a file. It gives postmortems a clean ending without requiring anyone to understand what actually broke. And unlike traditional software, where this reflex is merely lazy, in AI systems it's structurally dangerous — because non-deterministic systems fail in ways that prompt changes cannot fix.

Dead Reckoning for Long-Running Agents: Knowing Where Your Agent Is Without Stopping It

· 11 min read
Tian Pan
Software Engineer

Before GPS, sailors used dead reckoning: take your last confirmed position, note your speed and heading, and project forward. It works until the accumulated error compounds into something irreversible—a reef you didn't see coming.

Long-running AI agents have exactly this problem. When an agent spends two hours orchestrating API calls, writing documents, and executing multi-step plans, the people running it often have no better visibility than a sailor without instruments. The agent either finishes or it doesn't. The failure mode isn't the crash—it's the silent loop that burns $30 in tokens while appearing to work, or the agent that "successfully" completes the wrong task because its world model drifted an hour into execution.

Production data makes this concrete: agents with undetected loops have been documented repeating the same tool call 58 times before manual intervention. A two-hour runaway at frontier model rates costs $15–40 before anyone notices. And the worst failures aren't the ones that error out—they're the 12–18% of "successful" runs that return plausible-looking wrong answers.