Skip to main content

233 posts tagged with "observability"

View all tags

Sampling Bias in Agent Traces: Why Your Debug Dataset Silently Excludes the Failures You Care About

· 9 min read
Tian Pan
Software Engineer

The debug corpus your team stares at every Monday is not a representative sample of production. It is an actively biased one, and the bias is in exactly the wrong direction. Head-based sampling at 1% retains the median request a hundred times before it keeps a single rare catastrophic trajectory — and most teams discover this only when a failure mode that has been quietly recurring for months finally drives a refund or an outage, and they go looking for examples in the trace store and find none.

This is not an exotic edge case. It is the default behavior of every observability stack that was designed for stateless web services and then pointed at a long-horizon agent. The same sampling math that worked fine for HTTP request tracing systematically erases the trajectories that matter most when each "request" is a thirty-step plan that may invoke a dozen tools, regenerate three subplans, and consume tens of thousands of tokens before something subtle goes wrong on step twenty-seven.

The fix is not "sample more." Sampling more makes the bill explode without changing the bias — you just get more of what you already had too much of. The fix is to change what you sample, keyed on outcomes you can only know after the trajectory finishes. That requires throwing out the head-based defaults and rebuilding the retention layer around tail signals, anomaly weighting, and bounded reservoirs that survive the long tail of agent execution.

Token Spend Is a Security Signal Your SOC Isn't Watching

· 11 min read
Tian Pan
Software Engineer

The fastest-moving breach signal in your stack isn't in your SIEM. It's in a spreadsheet someone in finance opens on the first of the month. When an attacker steals an LLM API key, exploits a prompt injection to exfiltrate data, or rides a compromised tenant session to query an adjacent customer's memory, the footprint shows up first as a token-usage anomaly — long before any DLP rule fires, any auth alert trips, or any endpoint agent notices something weird. Billing sees it. Security doesn't.

That gap is not theoretical. Sysdig's threat research team coined "LLMjacking" after watching attackers rack up five-figure daily bills on stolen cloud credentials, and the category has since matured into an organized criminal industry with $30-per-account marketplaces and documented campaigns pushing victim costs past $100,000 per day. OWASP catalogued a startup that ate a $200,000 bill in 48 hours from a leaked key. A Stanford research group burned $9,200 in 12 hours on a forgotten token in a Jupyter notebook. The common thread in every one of these incidents: the billing graph told the story hours or days before anyone in security noticed.

Time-to-First-Token Is the Latency SLO You Aren't Instrumenting

· 11 min read
Tian Pan
Software Engineer

Pull the last week of production traces and look at your latency dashboard. You almost certainly have p50 and p99 on total request latency. You probably have token throughput. You may even have a tokens-per-second chart, because a provider benchmark talked you into it. What you almost certainly do not have is a per-model, per-route, per-tenant histogram of time to first token — the single number that governs how fast your product feels.

This is not a small oversight. For any streaming interface — chat, code completion, agent sidebars, voice — perceived speed is set by how long the user stares at a blinking cursor before anything appears. Once the first token lands, the user is reading; subsequent tokens compete with their reading speed, not with their patience. Total latency matters for throughput planning and budget. TTFT matters for whether the product feels alive.

The gap between these two numbers is widening. Reasoning models can produce identical total latency to their non-reasoning siblings while pushing TTFT from 400 ms to 30 seconds. A routing change that "keeps latency flat" can silently turn a snappy assistant into a hanging window. If you are not graphing TTFT, you are shipping UX regressions you cannot see.

The AI Audit Trail Is a Product Feature, Not a Compliance Checkbox

· 8 min read
Tian Pan
Software Engineer

McKinsey's 2025 survey found that 75% of business leaders were using generative AI in some form — but nearly half had already experienced a significant negative consequence. That gap is not a model quality problem. It's a trust problem. And the fastest path to closing it is not more evals, better prompts, or a new frontier model. It's showing users exactly what the agent did.

Most engineering teams treat the audit trail as an afterthought — something you wire up for GDPR compliance or SOC 2, then lock in an internal dashboard that only ops reads. That's the wrong frame. When users can see which tool the agent called, what data it retrieved, and which reasoning branch produced the answer, three things happen: adoption goes up, support escalations go down, and model errors surface days earlier than they would from any backend alert.

The AI Feature Lifecycle Decay Problem: How to Catch Degradation Before Users Do

· 10 min read
Tian Pan
Software Engineer

Your AI feature shipped clean. The demo impressed, the launch metrics looked great, and the model benchmarked at 88% accuracy on your test set. Then, about three months later, a customer success manager forwards you a screenshot. The AI recommendation made no sense. You pull the logs, run a quick evaluation, and find accuracy has drifted to 71%. No alert fired. No error was thrown. Infrastructure dashboards showed green the whole time.

This pattern is not a freak occurrence. Research across 32 production datasets found that 91% of ML models degrade over time — and most of the degradation is silent. The systems keep running, the code doesn't change, but the predictions get progressively worse as the real world moves on without the model.

AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs

· 10 min read
Tian Pan
Software Engineer

Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.

This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

· 10 min read
Tian Pan
Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

Behavioral Signals That Actually Measure User Satisfaction in AI Products

· 9 min read
Tian Pan
Software Engineer

Most AI product teams ship a thumbs-up/thumbs-down widget and call it a satisfaction measurement system. They are measuring something — just not satisfaction.

A developer who presses thumbs-down on a Copilot suggestion because the function signature is wrong, and a developer who presses thumbs-down because the suggestion was excellent but not what they needed right now, are generating the same signal. Meanwhile, the developer who quietly regenerated the response four times before giving up generates no explicit signal at all. That absent signal is a better predictor of churn than anything the rating widget captures.

The implicit behavioral record your users leave while using your AI product is richer, more honest, and more actionable than anything they'll type or tap voluntarily. This post covers which signals to collect, why they outperform explicit feedback, and the instrumentation schema that keeps AI-specific telemetry from poisoning your general product analytics.

Data Lineage for AI Systems: Tracking the Path from Source to Response

· 10 min read
Tian Pan
Software Engineer

A user files a support ticket: "Your AI assistant told me the contract renewal deadline was March 15th. It was February 28th. We missed it." You pull up the logs. The response was generated. The model didn't error. Every metric is green. But you have no idea which document it retrieved, what the model read, or whether the date came from the context or was hallucinated entirely.

This is the data lineage gap. And it's not a monitoring problem — it's an architecture problem baked in from the start.

Why Your LLM Alerting Is Always Two Weeks Late

· 10 min read
Tian Pan
Software Engineer

Most teams discover their LLM has been degrading for two weeks by reading a Slack message that starts with "hey, has anyone noticed the AI outputs seem off lately?" By that point the damage is done: users have already formed opinions, support tickets have accumulated, and the business stakeholder who championed the feature is quietly losing confidence in it.

The frustrating part is that your infrastructure was healthy the entire time. HTTP 200s, 180ms p50 latency, $0.04 per request—everything green on the dashboard. The model just got quieter, vaguer, shorter, and more hesitant in ways that infrastructure monitoring cannot see.

This is not a monitoring gap you can close with more Datadog dashboards. It requires a different class of metrics entirely.

Pipeline Attribution in Compound AI Systems: Finding the Weakest Link Before It Finds You

· 10 min read
Tian Pan
Software Engineer

Your retrieval precision went up. Your reranker scores improved. Your generator faithfulness metrics look better than last quarter. And yet your users are complaining that the system is getting worse.

This is one of the more disorienting failure modes in production AI engineering, and it happens more often than teams expect. When you build a compound AI system — one where retrieval feeds a reranker, which feeds a generator, which feeds a validator — you inherit a fundamental attribution problem. End-to-end quality is the only metric that actually matters, but it's the hardest one to act on. You can't fix "the system is worse." You need to fix a specific component. And in a four-stage pipeline, that turns out to be genuinely hard.

Prompt Cache Hit Rate: The Production Metric Your Cost Dashboard Is Missing

· 10 min read
Tian Pan
Software Engineer

The first time your team enables prompt caching, it feels like free money. Within hours, your token cost drops 40–60% and latency shrinks. Engineers celebrate and move on. Three months later, someone notices costs have quietly crept back up. The cache hit rate that started at 72% is now 18%. Nothing was deliberately broken. Nobody noticed.

This is the most common arc in production LLM deployments: caching is enabled once, never monitored, and silently degrades as the codebase evolves. Cache hit rate is the most impactful cost lever in an LLM stack, and most teams treat it as a one-time setup task rather than a production metric.