161 posts tagged with "observability"


Distributed Tracing for Agent Pipelines: Why Your APM Tool Is Flying Blind

· 9 min read
Tian Pan
Software Engineer

Your Datadog dashboard is green. Your Jaeger traces look clean. Your P99 latency is within SLA. And your agent pipeline is silently burning $4,000 a day on retry loops that never surface an error.

Traditional APM tools were designed for microservices — deterministic paths, bounded payloads, predictable fan-out. Agent pipelines break every one of those assumptions. The execution path isn't known until runtime. Tool call depth varies wildly. A single "request" might spawn dozens of LLM calls across minutes. And when something goes wrong, the failure mode is usually not an exception — it's a silent retry cascade that inflates cost and latency while returning plausible-looking output.

The result is a generation of engineering teams flying blind, trusting dashboards that measure the wrong things.
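
To make the retry-cascade failure concrete, here is a minimal sketch of the missing instrumentation: record attempt counts and cumulative token cost as span attributes, so a call that "succeeded" after five silent retries is distinguishable from one that succeeded immediately. The client callable, its response fields, and the error type are illustrative placeholders, not any particular SDK.

```python
# Minimal sketch: make silent retries visible on the trace.
# Assumes an OpenTelemetry SDK is configured elsewhere; `call_model`,
# `response.total_tokens`, and TransientModelError are placeholders.
import time
from opentelemetry import trace

tracer = trace.get_tracer("agent.pipeline")

class TransientModelError(Exception):
    """Placeholder for a retryable provider error (429, timeout, ...)."""

def traced_llm_call(call_model, prompt: str, max_retries: int = 5) -> str:
    with tracer.start_as_current_span("llm.call") as span:
        attempts, total_tokens = 0, 0
        while True:
            attempts += 1
            try:
                response = call_model(prompt)
                total_tokens += response.total_tokens
                # The part dashboards never show: a "successful" span
                # that quietly cost several attempts' worth of tokens.
                span.set_attribute("llm.attempts", attempts)
                span.set_attribute("llm.total_tokens", total_tokens)
                return response.text
            except TransientModelError:
                if attempts >= max_retries:
                    span.set_attribute("llm.attempts", attempts)
                    raise
                time.sleep(2 ** attempts)  # backoff, then retry silently
```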

Fleet Health for AI Agents: What Single-Agent Observability Gets Wrong at Scale

· 9 min read
Tian Pan
Software Engineer

Most teams figure out single-agent observability well enough. They add tracing, track token counts, hook up alerts on error rates. Then they scale to a hundred concurrent agents and discover their entire monitoring stack is watching the wrong things.

The problems that kill fleets are not the problems that kill individual agents. A single misbehaving agent triggering a recursive reasoning loop can burn through a month's API budget in under an hour. A model provider's silent quality degradation can make every agent in your fleet confidently wrong simultaneously — all while your infrastructure dashboard shows green. These failures don't show up in latency charts or HTTP error rates, because they aren't infrastructure failures. They're semantic ones.
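
One fleet-level control the teaser implies: a spend circuit breaker that halts any single agent before a recursive loop can drain the budget. The limits and the shape of the accounting below are illustrative assumptions, not a prescription.

```python
# Sketch: a per-agent spend circuit breaker for runaway reasoning loops.
# The hourly limit and the notion of "agent_id" are illustrative.
import time
from collections import defaultdict, deque

class SpendBreaker:
    def __init__(self, hourly_limit_usd: float = 50.0):
        self.hourly_limit = hourly_limit_usd
        self.spend = defaultdict(deque)  # agent_id -> deque of (ts, usd)

    def record(self, agent_id: str, cost_usd: float) -> None:
        self.spend[agent_id].append((time.time(), cost_usd))

    def allow(self, agent_id: str) -> bool:
        # Drop entries older than an hour, then sum what remains.
        window = self.spend[agent_id]
        cutoff = time.time() - 3600
        while window and window[0][0] < cutoff:
            window.popleft()
        hourly = sum(cost for _, cost in window)
        return hourly < self.hourly_limit  # False => halt this agent

breaker = SpendBreaker(hourly_limit_usd=50.0)
breaker.record("agent-42", 0.12)
assert breaker.allow("agent-42")  # still under budget; keep running
```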

Knowledge Cutoff Is a Silent Production Bug

· 11 min read
Tian Pan
Software Engineer

Most production AI failures are loud. The model returns a 5xx. The schema validation throws. The eval suite catches the regression before it ships. But there is a category of failure that is completely silent — no error, no exception, no alert fires — because the system is working exactly as designed. It is just working with a snapshot of reality from 18 months ago.

Your LLM has a knowledge cutoff. That cutoff is not a documentation footnote. It is a slowly widening gap between what your model believes to be true and what is actually true, and it compounds every day you keep the same model in production. Teams celebrate launch, then watch user trust quietly erode over the next six months as the world moves and the model stays still.
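
A cheap way to turn the widening gap into a number, sketched under obvious assumptions: flag requests that reference time past the model's cutoff and emit the flag as a per-request metric. The cutoff date and the year-regex heuristic are illustrative only; real staleness detection needs more than a regex.

```python
# Sketch: flag requests that reference time after the model's knowledge
# cutoff, so staleness becomes a measurable rate instead of a silent gap.
# The cutoff date and the year heuristic are illustrative assumptions.
import re
from datetime import date

MODEL_CUTOFF = date(2024, 4, 1)  # example; set per deployed model

YEAR = re.compile(r"\b(20\d{2})\b")

def references_post_cutoff(prompt: str) -> bool:
    """Heuristic: does the prompt mention a year beyond the cutoff?"""
    return any(int(y) > MODEL_CUTOFF.year for y in YEAR.findall(prompt))

# Emit this as a metric per request; a rising rate means the world has
# moved past the model and answers increasingly rest on stale facts.
print(references_post_cutoff("What changed in the 2025 tax brackets?"))  # True
```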

The Multi-Turn Session State Collapse Problem

· 10 min read
Tian Pan
Software Engineer

Your per-request error rates look clean. Latency is within SLO. The LLM judge is scoring outputs at 87%. And then a user files a support ticket: "I told the bot my account number three times. It just asked me again." A different user: "It agreed to a refund, then two turns later denied the policy existed."

Single-turn failures are visible. The request comes in, the model hallucinates or refuses, your eval catches it, you fix the prompt. The feedback loop is tight. Multi-turn failures work differently: the session starts fine, degrades gradually turn by turn, and your monitoring never fires because each individual response is technically coherent. The problem is the session as a whole — and almost no team instruments for that.

Research across major frontier models (Claude 3.7 Sonnet, GPT-4.1, Gemini 2.5 Pro) shows an average 39% performance drop when moving from single-turn to multi-turn conversations. That number hides the real story: only about 16 percentage points of that drop are capability loss. The other 23 points are a reliability crisis — the gap between a model's best and worst performance on the same task doubles as conversation length grows. You're not just getting worse outputs; you're getting inconsistent ones.
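
Session-level instrumentation does not have to be elaborate. Here is a toy sketch of one signal per-turn evals miss: counting how often the assistant re-asks for a field the user already supplied. The field patterns are hypothetical; a production version would use entity extraction rather than regexes.

```python
# Sketch: a session-level defect counter that per-turn evals miss.
# Detects the "I told the bot my account number three times" failure by
# checking whether the assistant asks for a field the user already gave.
# The field patterns are illustrative assumptions.
import re

FIELDS = {
    "account_number": (
        re.compile(r"\b\d{8,12}\b"),                 # user supplied it
        re.compile(r"account (number|#)", re.I),     # assistant asks for it
    ),
}

def count_reasks(turns: list[tuple[str, str]]) -> int:
    """turns: list of (role, text). Returns how many times the assistant
    re-asked for a field after the user had already provided it."""
    provided, reasks = set(), 0
    for role, text in turns:
        for field, (gave_pat, ask_pat) in FIELDS.items():
            if role == "user" and gave_pat.search(text):
                provided.add(field)
            elif role == "assistant" and field in provided and ask_pat.search(text):
                reasks += 1
    return reasks

session = [
    ("user", "My account number is 483920175."),
    ("assistant", "Thanks! Can you confirm your account number?"),
]
print(count_reasks(session))  # 1 -> a session defect, invisible per turn
```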

On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.
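
A runbook page for "up but misbehaving" can start with something as small as this sketch: a rolling check on output length against a baseline, since a prompt change that unlocks verbosity shows up in the length distribution long before anyone reads a transcript. The window size and alert ratio are illustrative.

```python
# Sketch: alert when response length drifts far from baseline — a common
# signature of a prompt change unlocking verbose behavior. Window size
# and threshold ratio are illustrative assumptions.
from collections import deque

class LengthDriftMonitor:
    def __init__(self, baseline_mean_tokens: float, window: int = 200,
                 ratio_alert: float = 2.0):
        self.baseline = baseline_mean_tokens
        self.recent = deque(maxlen=window)
        self.ratio_alert = ratio_alert

    def observe(self, output_tokens: int) -> bool:
        """Record one response; return True if drift should page someone."""
        self.recent.append(output_tokens)
        if len(self.recent) < self.recent.maxlen:
            return False  # not enough signal yet
        mean = sum(self.recent) / len(self.recent)
        return mean / self.baseline >= self.ratio_alert

monitor = LengthDriftMonitor(baseline_mean_tokens=250)
fired = any(monitor.observe(800) for _ in range(200))  # 3x longer outputs
print(fired)  # True -> latency spike explained by verbosity, not an outage
```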

Prompt Canaries: The Deployment Primitive Your AI Team Is Missing

· 10 min read
Tian Pan
Software Engineer

In April 2025, a system prompt change shipped to one of the world's most-used AI products. Error rates stayed flat. Latency was fine. The deployment dashboards showed green. Within three days, millions of users had noticed something deeply wrong: the model had become relentlessly flattering, agreeing with bad ideas, validating poor reasoning, manufacturing enthusiasm for anything a user said. The rollback announcement came after the incident had already spread across social media, with users posting screenshots as evidence. For a period, Twitter was the production alerting system.

This is what happens when you treat prompt and model changes like config updates rather than behavioral deployments. Teams that have spent years building canary infrastructure for code continue to push AI changes out as a single atomic flip — instantly global, instantly irreversible, with no graduated rollout and no automated rollback signal except user complaints.

Canary deployments for LLM behavior are not a nice-to-have. They are the missing infrastructure layer that separates teams who catch regressions internally from teams who discover them via support tickets.
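
The primitive itself is small. A sketch, with illustrative percentages and version labels: bucket users deterministically so assignment is sticky, serve the canary prompt to a thin slice, and tag every response with the version that produced it, so quality metrics can be compared per version and gate the rollout.

```python
# Sketch: a prompt canary. Route a small, sticky slice of traffic to the
# new prompt version and tag every response with the version that served
# it. Percentages and version labels are illustrative.
import hashlib

CANARY_PERCENT = 5  # start small; widen only when metrics hold

def prompt_version(user_id: str) -> str:
    """Deterministic, sticky assignment: same user, same version."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "prompt_v2" if bucket < CANARY_PERCENT else "prompt_v1"

# Tag every trace/log with prompt_version(user_id); then compare eval
# scores, refusal rates, and length distributions between versions.
# An automated gate halts the rollout the moment v2 underperforms v1.
print(prompt_version("user-1234"))
```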

Agent Fleet Observability: Monitoring 1,000 Concurrent Agent Runs Without Dashboard Blindness

· 12 min read
Tian Pan
Software Engineer

Running a hundred agents in production feels manageable. You have traces, you have dashboards, you know when something breaks. Running a thousand concurrent agent runs is a different problem entirely — not because the agents are more complex, but because the monitoring model you built for ten agents silently stops working long before you notice.

The failure mode is subtle. Everything looks fine. Your span trees are there. Your error rates are low. And then a prompt regression that degraded output quality in 40% of sessions for six hours shows up only because a customer complained — not because your observability stack caught it.

This is the dashboard blindness problem: per-agent tracing works beautifully at small scale and fails quietly at fleet scale. Here is why it happens and what to do instead.
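
What fleet-scale detection can look like, as a deliberately minimal sketch: aggregate session quality scores into windows and alert on an improbable mean shift against baseline. The z-test and thresholds are illustrative; the point is that the 40%-of-sessions regression above is invisible per trace and obvious in the distribution.

```python
# Sketch: fleet-level quality aggregation. Per-agent traces stay useful
# for debugging, but a broad regression only shows up in the distribution.
# A z-test on windowed mean quality is the simplest possible detector;
# all thresholds here are illustrative.
import math

def mean_shift_alert(window_scores: list[float],
                     baseline_mean: float,
                     baseline_std: float,
                     z_threshold: float = 4.0) -> bool:
    """Alert when the window's mean quality score is improbably low."""
    n = len(window_scores)
    if n == 0 or baseline_std == 0:
        return False
    mean = sum(window_scores) / n
    z = (mean - baseline_mean) / (baseline_std / math.sqrt(n))
    return z <= -z_threshold

# Baseline: judge scores ~0.85 +/- 0.10. A window where 40% of sessions
# collapse to ~0.4 pulls the mean to ~0.67 — far outside noise.
window = [0.85] * 60 + [0.40] * 40
print(mean_shift_alert(window, baseline_mean=0.85, baseline_std=0.10))  # True
```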

Your Agent Traces Are Lying: Cardinality, Sampling, and Span Hierarchies for LLM Agents

· 11 min read
Tian Pan
Software Engineer

Your tracing dashboard says the agent made eight calls to serve a user request. In reality, it made forty-seven. Your head-based sampler quietly dropped most of them. The ones you kept are technically correct but causally useless — child spans orphaned from a root span the sampler threw away.

This is not a visualization bug. It is the predictable outcome of pointing distributed tracing infrastructure designed for ten-span HTTP fan-outs at systems that generate hundreds of spans per user turn. Default OpenTelemetry configurations systematically undercount the work agents do, and the teams running those agents usually do not notice until a customer complains about latency the trace viewer says does not exist.
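
One concrete repair for the orphaned-span problem, sketched with the OpenTelemetry Python SDK: make the sampling decision once at the root and let children inherit it via ParentBased, so a trace is kept whole or dropped whole. The 10% ratio is illustrative, and tail-based sampling at a collector remains the fuller answer for keeping every interesting trace.

```python
# Sketch: children inherit the root's sampling decision, so a sampled
# trace keeps all forty-seven spans and a dropped trace drops cleanly.
# The 10% ratio is illustrative.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

sampler = ParentBased(root=TraceIdRatioBased(0.10))
trace.set_tracer_provider(TracerProvider(sampler=sampler))

tracer = trace.get_tracer("agent.pipeline")
with tracer.start_as_current_span("agent.turn"):        # root decides once
    for step in range(3):
        with tracer.start_as_current_span(f"tool.call.{step}"):
            pass  # children follow the root: no orphans, honest counts
```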

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

· 11 min read
Tian Pan
Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time — up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

AI Feature Decommissioning Forensics: What Dead Features Teach That Successful Ones Cannot

· 10 min read
Tian Pan
Software Engineer

Here's an uncomfortable pattern: the AI feature your team is about to launch next quarter already died at your company two years ago. It shipped under a different name, with a different prompt, solving a vaguely different problem, and it got quietly decommissioned after six months of flat adoption. Nobody wrote it up. Nobody connected the dots. The leading indicators that would have saved this cycle were sitting in dashboards that got archived along with the feature.

Most engineering orgs are elaborate machines for remembering successes. Launches get retrospectives, blog posts, internal celebrations. The features that got killed — the ones with 12% weekly active users despite a polished demo, the ones whose unit economics inverted when token costs compounded across a longer-than-expected tool chain, the ones users learned to trust, lost trust in, and then routed around — generate almost no institutional memory. And the failure patterns embedded in those deaths are exactly the ones your planning process has no way to price in.

The AI Incident Severity Taxonomy: When Is a Hallucination a Sev-0?

· 11 min read
Tian Pan
Software Engineer

A legal team's AI-powered research assistant fabricated three case citations and slipped them into a court filing. The citations looked plausible — real courts, real-sounding case names, coherent holdings. Nobody caught them before the brief was submitted. The incident cost the firm an emergency hearing, a public apology, and a bar inquiry.

Was that a sev-0? A sev-2? The answer depends on which framework you use — and traditional severity models will give you the wrong answer almost every time.

Software incident severity classification was built for deterministic systems. A service is either responding or it isn't. A database query either succeeds or throws an error. The failure modes are binary, the blame is traceable to a commit, and the fix is a rollback or a patch. AI systems break all three of those assumptions simultaneously, and organizations that apply traditional severity frameworks to LLM failures end up either panicking over noise or dismissing structural failures as one-off quirks.
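
To make the argument concrete, here is a toy rubric sketch that keys severity on blast radius, reversibility, and harm domain instead of availability. The dimensions and thresholds are illustrative assumptions, not the taxonomy the post develops — but note how it scores the fabricated-citation incident: tiny blast radius, sev-0.

```python
# Sketch: an AI-aware severity rubric. Traditional sev models key on
# availability; for model failures, blast radius, reversibility, and
# harm domain matter more. Dimensions and thresholds are illustrative.
from dataclasses import dataclass

@dataclass
class ModelIncident:
    users_affected: int
    reversible: bool        # can the harm be undone after detection?
    regulated_domain: bool  # legal, medical, or financial output?

def severity(incident: ModelIncident) -> int:
    """0 is worst. A fabricated citation in a court filing scores sev-0:
    one user affected, but irreversible harm in a regulated domain."""
    if incident.regulated_domain and not incident.reversible:
        return 0
    if incident.users_affected > 10_000 and not incident.reversible:
        return 1
    if incident.users_affected > 1_000:
        return 2
    return 3

print(severity(ModelIncident(users_affected=1, reversible=False,
                             regulated_domain=True)))  # 0
```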