Skip to main content

233 posts tagged with "observability"

View all tags

SLOs for Non-Deterministic AI Features: Setting Error Budgets When Wrong Is Probabilistic

· 10 min read
Tian Pan
Software Engineer

Your AI feature is "up." Latency is fine. Error rate is 0.2%. The dashboard is green. But over the past two weeks, the summarization quality quietly dropped — outputs are now technically coherent but factually shallow, consistently missing the key detail users care about. Nobody filed a bug. No alert fired. And you won't know until the next quarterly review when retention numbers come in.

This is the failure mode that traditional SLOs are blind to. Availability and latency measure whether your service is responding — not whether it's responding well. For deterministic systems, those two things are nearly equivalent. For LLM features, they can diverge silently for weeks.

SRE for AI Agents: What Actually Breaks at 3am

· 10 min read
Tian Pan
Software Engineer

A market research pipeline ran uninterrupted for eleven days. Four LangChain agents — an Analyzer and a Verifier — passed requests back and forth, made no progress on the original task, and accumulated $47,000 in API charges before anyone noticed. The system never returned an error. No alert fired. The billing dashboard finally caught it, days after the damage was done.

This is not an edge case. It is the canonical AI agent incident. And if you are running agents in production today, your existing SRE runbooks almost certainly do not cover it.

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The AI Ops Dashboard Nobody Builds Until It's Too Late

· 11 min read
Tian Pan
Software Engineer

The most dangerous indicator on your AI system's health dashboard is a green status light next to a 99.9% uptime number. If your first signal of a failing model is a support ticket, you don't have observability — you have vibes.

Traditional APM tools were built for a world where failure is binary: the request succeeded or it didn't. For LLM-powered features, that model breaks down completely. A request can complete in 300ms, return HTTP 200, consume tokens, and produce an answer that is confidently wrong, unhelpful, or quietly degraded from what it produced six weeks ago. None of those failure states trigger your existing alerts.

Research consistently shows that latency and error rate together cover less than 20% of the failure space for LLM-powered features. The other 80% hides in five failure modes that most teams discover only after users have already noticed.

Tracing the Planning Layer: Why Your Agent Traces Are Missing Half the Story

· 11 min read
Tian Pan
Software Engineer

Your agent called the wrong tool three times before finally succeeding, and your trace dashboard shows you exactly which tools were called, in what order, with full latency breakdowns. What the trace doesn't show you is the part that matters: why the agent thought those tool calls were the right move, what goal it was trying to satisfy, and what assumption it was operating under when it made each wrong decision.

This is the gap at the center of agent observability in 2026. Practitioners have invested heavily in tool-call tracing. The tooling is mature, the OpenTelemetry semantic conventions are established, and the dashboards are beautiful. But agent debugging keeps running into the same wall: you have complete visibility into what the agent did, and zero visibility into why.

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, officially surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.

The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

· 11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

AI for SRE Log Analysis: The Tiered Architecture That Actually Works

· 9 min read
Tian Pan
Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a month and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.

The Alignment Tax: Measuring the Real Cost of Shipping Safe AI

· 9 min read
Tian Pan
Software Engineer

Teams building production AI systems tend to discover the alignment tax the same way: someone files a latency complaint, someone else traces it to the moderation pipeline, and suddenly a previously invisible cost line becomes very visible. By that point, the safety layers have been stacked — refusal classifier, output filter, toxicity scorer, human-in-the-loop queue — and nobody measured any of them individually. Unpicking them is painful, expensive, and politically fraught because now it looks like you're arguing against safety.

The better path is to treat safety overhead as a first-class engineering metric from day one. The alignment tax is real, it's measurable, and it compounds. A 150ms guardrail check sounds fine until you chain three of them together in an agentic workflow and wonder why your 95th-percentile latency is at four seconds.

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.