31 posts tagged with "incident-response"


AI Incident Response Playbooks: Why Your On-Call Runbook Doesn't Work for LLMs

· 10 min read
Tian Pan
Software Engineer

Your monitoring dashboard shows elevated latency, a small error rate spike, and then nothing. Users are already complaining in Slack. A quarter of your AI feature's responses are hallucinating in ways that look completely valid to your alerting system. By the time you find the cause — a six-word change to a prompt deployed two hours ago — you've had a slow-burn incident that your runbook never anticipated.

This is the defining challenge of operating AI systems in production. The failure modes are real, damaging, and invisible to conventional tooling. An LLM that silently hallucinates looks exactly like an LLM that's working correctly from the outside.

AI Incident Retrospectives: When 'The Model Did It' Is the Root Cause

· 10 min read
Tian Pan
Software Engineer

Your customer support AI told a passenger he could buy a full-fare ticket and claim a retroactive bereavement discount afterward. He trusted it, flew, and filed the claim. The company denied it. A tribunal ruled the company liable for $650 anyway — because there was no distinction in the law between a human employee and a chatbot giving authoritative-sounding advice. The chatbot wasn't crashing. No alerts fired. No p99 latency spiked. The system was "working."

That is the defining characteristic of AI incidents: the application doesn't fail — it succeeds at producing the wrong output, confidently and at scale. And when you sit down to write the post-mortem, the classical toolbox falls apart.

The AI Incident Response Playbook: Diagnosing LLM Degradation in Production

· 13 min read
Tian Pan
Software Engineer

In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.

That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.

This is the incident response playbook your on-call rotation actually needs.

The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

On-Call for Stochastic Systems: Why Your AI Runbook Needs a Rewrite

· 10 min read
Tian Pan
Software Engineer

You get paged at 2 AM. Latency is up, error rates are spiking. You SSH in, pull logs, and—nothing. No stack trace pointing to a bad deploy. No null pointer exception on line 247. Just a stream of model outputs that are subtly, unpredictably wrong in ways that only become obvious when you read 50 of them in a row.

This is what incidents look like in LLM-powered systems. And the traditional alert-triage-fix loop was not built for it.

The standard on-call playbook assumes three things: failures are deterministic (same input, same bad output), root cause is locatable (some code changed, some resource exhausted), and rollback is straightforward (revert the deploy, done). None of these hold for stochastic AI systems. The same prompt produces different outputs. Root cause is usually a shift in a probability distribution, not a line of code. And you cannot "roll back" a model that a third-party provider updated silently overnight.

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The Debugging Regression: How AI-Generated Code Shifts the Incident-Response Cost Curve

· 9 min read
Tian Pan
Software Engineer

In March 2026, a single AI-assisted code change cost one major retailer 6.3 million lost orders and a 99% drop in North American order volume — a six-hour production outage traced to a change deployed without proper review. It wasn't a novel attack. There was no exotic failure mode. The system just did what the AI told it to do, and no one on-call had the mental model to understand why that was wrong until millions of customers had already seen errors.

This is the debugging regression. The productivity gains from AI-generated code are front-loaded and visible on dashboards. The costs are back-loaded and invisible until your alerting wakes you up at 3 AM.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.
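One way to make runaway loops and runaway spend pageable is a per-run guard that counts agent steps and accumulated cost, independent of infrastructure health. The following is a minimal sketch under stated assumptions: the `RunGuard` class, its field names, and the limits are hypothetical and do not come from the post or from any agent framework.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunGuard:
    """Hypothetical per-run guard: tracks spend and step count for one agent run."""

    max_cost_usd: float  # assumed budget for a single run
    max_steps: int       # assumed ceiling on agent steps for a single run
    spent_usd: float = 0.0
    steps: int = 0

    def record(self, cost_usd: float) -> None:
        """Record one agent step and what it cost."""
        self.spent_usd += cost_usd
        self.steps += 1

    def violation(self) -> Optional[str]:
        """Return which limit was exceeded, or None if the run is within limits."""
        if self.spent_usd > self.max_cost_usd:
            return "cost"
        if self.steps > self.max_steps:
            return "steps"
        return None
```

A caller would check `violation()` after every step and stop the run or page someone when it returns a value. Neither signal depends on an error being raised, which is the point: the agents in the example above never crashed.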

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

· 11 min read
Tian Pan
Software Engineer

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.
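The three options named above (multi-provider routing, cached responses, graceful degradation) can be combined into a single fallback chain. The sketch below is an illustration under assumptions, not the post's implementation: `answer_with_fallback`, the `(name, callable)` provider pairs, and the dict-based cache are all hypothetical, and each callable stands in for whatever client code you already have.

```python
def answer_with_fallback(prompt, providers, cache,
                         degraded_message="The assistant is temporarily unavailable."):
    """Try each provider in order, then the cache, then a degraded reply.

    providers: list of (name, callable) pairs; each callable takes a prompt
               and returns text, or raises on failure (hypothetical interface).
    cache:     dict mapping prompt -> last successful answer.
    Returns a (source, text) tuple so callers can log which path served the request.
    """
    for name, call in providers:
        try:
            text = call(prompt)
        except Exception:
            # Any provider failure moves on to the next option in the chain.
            continue
        cache[prompt] = text  # remember the answer for future outages
        return name, text
    if prompt in cache:
        return "cache", cache[prompt]
    return "degraded", degraded_message
```

Returning the source alongside the text makes it possible to alert on the share of traffic served from the cache or the degraded path, rather than discovering it from user reports.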

On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14 AM. You open the runbook and find instructions for checking the database connection pool, verifying the load balancer, and restarting the service. You do all three. Latency stays elevated. The service is not down; it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.
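The verbose-response incident above points at a signal most runbooks do not track: the length of the model's output. A minimal sketch, assuming you already log a token count per response; the function names and the 2.0 threshold are illustrative assumptions, not values from the post.

```python
from statistics import median


def length_drift_ratio(baseline_tokens, recent_tokens):
    """Ratio of the recent median response length to the baseline median."""
    if not baseline_tokens or not recent_tokens:
        raise ValueError("both windows need at least one sample")
    baseline = median(baseline_tokens)
    if baseline == 0:
        raise ValueError("baseline median must be positive")
    return median(recent_tokens) / baseline


def should_page(baseline_tokens, recent_tokens, threshold=2.0):
    """Page when the median response has grown past `threshold` times baseline."""
    return length_drift_ratio(baseline_tokens, recent_tokens) > threshold
```

Medians are used here so that a handful of unusually long responses does not trip the check on its own. With a baseline window from before the prompt change and a recent window from after it, responses that are three times longer would give a ratio near 3.0 and exceed the assumed threshold.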