Skip to main content

299 posts tagged with "observability"

View all tags

AI Infrastructure Carbon Accounting: The Sustainability Cost Your Team Hasn't Measured Yet

· 9 min read
Tian Pan
Software Engineer

Every engineering team building on LLMs right now is making infrastructure decisions with a hidden cost they're not measuring. You track tokens. You track latency. You track API spend. But almost nobody tracks the carbon output of the inference workload they're running — and that gap is closing fast, from both the regulatory side and the market side.

AI systems now account for 2.5–3.7% of global greenhouse gas emissions, officially surpassing aviation's 2% contribution, and growing at 15% annually. US data centers running AI-specific servers consumed 53–76 TWh in 2024 alone — enough to power 7.2 million homes for a year. The scale is not hypothetical anymore, and the expectation that engineering teams will have visibility into their contribution is becoming a real organizational pressure.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.

The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

· 11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

AI for SRE Log Analysis: The Tiered Architecture That Actually Works

· 9 min read
Tian Pan
Software Engineer

When teams first wire an LLM into their log pipeline, the demo is impressive. You paste a stack trace, and GPT-4 explains the root cause in plain English. So the natural next step is obvious: automate it. Send all your logs through the model and let it find the problems.

This is how you burn $125,000 a month and page your on-call engineers with hallucinations.

The math is simple and brutal. A mid-size production system generates around one billion log lines per day. At roughly 50 tokens per log entry, that's 50 billion tokens daily. Even at GPT-4o's discounted rate of $2.50 per million input tokens, you're looking at $125,000 per day before accounting for output costs, retries, or inference overhead. Real-time frontier model analysis of streaming logs is not an optimization problem — it's the wrong architecture.

The Alignment Tax: Measuring the Real Cost of Shipping Safe AI

· 9 min read
Tian Pan
Software Engineer

Teams building production AI systems tend to discover the alignment tax the same way: someone files a latency complaint, someone else traces it to the moderation pipeline, and suddenly a previously invisible cost line becomes very visible. By that point, the safety layers have been stacked — refusal classifier, output filter, toxicity scorer, human-in-the-loop queue — and nobody measured any of them individually. Unpicking them is painful, expensive, and politically fraught because now it looks like you're arguing against safety.

The better path is to treat safety overhead as a first-class engineering metric from day one. The alignment tax is real, it's measurable, and it compounds. A 150ms guardrail check sounds fine until you chain three of them together in an agentic workflow and wonder why your 95th-percentile latency is at four seconds.

Behavioral SLAs for AI-Powered APIs: Writing Contracts for Non-Deterministic Outputs

· 10 min read
Tian Pan
Software Engineer

Your payment service has a 99.9% uptime SLA. Requests either succeed or fail with a documented error code. When something breaks, you know exactly what broke.

Now imagine you've shipped a smart invoice-parsing API that wraps an LLM. One Monday morning, your largest customer calls: "Your API returned a valid JSON object, but the total_amount field is off by a factor of ten on invoices with foreign currencies." Your service returned HTTP 200. Your uptime dashboard is green. By every traditional SLA metric, you didn't break anything. But you absolutely broke something — and you have no contractual language to even describe what went wrong.

This is the gap at the center of most AI API deployments today. The contract that governs what your API promises was written for deterministic systems, and LLMs are not deterministic systems.

Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.

Fleet Health for AI Agents: What Single-Agent Observability Gets Wrong at Scale

· 9 min read
Tian Pan
Software Engineer

Most teams figure out single-agent observability well enough. They add tracing, track token counts, hook up alerts on error rates. Then they scale to a hundred concurrent agents and discover their entire monitoring stack is watching the wrong things.

The problems that kill fleets are not the problems that kill individual agents. A single misbehaving agent triggering a recursive reasoning loop can burn through a month's API budget in under an hour. A model provider's silent quality degradation can make every agent in your fleet confidently wrong simultaneously — all while your infrastructure dashboard shows green. These failures don't show up in latency charts or HTTP error rates, because they aren't infrastructure failures. They're semantic ones.

Knowledge Cutoff Is a Silent Production Bug

· 11 min read
Tian Pan
Software Engineer

Most production AI failures are loud. The model returns a 5xx. The schema validation throws. The eval suite catches the regression before it ships. But there is a category of failure that is completely silent — no error, no exception, no alert fires — because the system is working exactly as designed. It is just working with a snapshot of reality from 18 months ago.

Your LLM has a knowledge cutoff. That cutoff is not a documentation footnote. It is a slowly widening gap between what your model believes to be true and what is actually true, and it compounds every day you keep the same model in production. Teams celebrate launch, then watch user trust quietly erode over the next six months as the world moves and the model stays still.

The Multi-Turn Session State Collapse Problem

· 10 min read
Tian Pan
Software Engineer

Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."

Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.

Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16% of the drop is capability loss. The other 23 points are a reliability crisis — the gap between a model's best and worst performance on the same task doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.

On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.