Skip to main content

38 posts tagged with "incident-response"

View all tags

The Vanishing Blame Problem in AI Incident Post-Mortems

· 9 min read
Tian Pan
Software Engineer

When a deterministic system breaks, you find the bug. The stack trace points to a line. The diff shows the change. The fix is obvious in retrospect. An AI system does not work that way.

When an LLM-powered feature starts returning worse outputs, you are not looking for a bug. You are looking at a probability distribution that shifted, somewhere, across a stack of components that each introduce their own variance. Was it the model? A silent provider update on a Tuesday? The retrieval index that wasn't refreshed after the schema change? The system prompt someone edited to fix a different problem? The eval that stopped catching regressions three sprints ago?

The post-mortem becomes a blame auction. Everyone bids "the model changed" because it is an unfalsifiable claim that costs nothing to make.

The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

The Debugging Regression: How AI-Generated Code Shifts the Incident-Response Cost Curve

· 9 min read
Tian Pan
Software Engineer

In March 2026, a single AI-assisted code change cost one major retailer 6.3 million lost orders and a 99% drop in North American order volume — a six-hour production outage traced to a change deployed without proper review. It wasn't a novel attack. There was no exotic failure mode. The system just did what the AI told it to do, and no one on-call had the mental model to understand why that was wrong until millions of customers had already seen errors.

This is the debugging regression. The productivity gains from AI-generated code are front-loaded and visible on dashboards. The costs are back-loaded and invisible until your alerting wakes you up at 3am.

AI Oncall: What to Page On When Your System Thinks

· 11 min read
Tian Pan
Software Engineer

A team running a multi-agent market research pipeline spent eleven days watching their system run normally — green dashboards, zero errors, normal latency — while four LangChain agents looped against each other in an infinite cycle. By the time someone glanced at the billing dashboard, the week's projected cost of $127 had become $47,000. The agents had never crashed. The API never returned an error. Every infrastructure alert stayed silent.

This is the defining problem of AI oncall: your system can be operationally green while failing catastrophically at the thing it's supposed to do. Traditional monitoring was built to detect crashes, latency spikes, and error rates. AI systems can hit all their infrastructure SLOs while silently producing wrong outputs, looping on a task indefinitely, or spending thousands of dollars on computation that produces nothing useful. The absence of errors is not evidence of correctness.

The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down

· 11 min read
Tian Pan
Software Engineer

In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.

If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?

This is the runbook you should have written before that call came in.

On-Call for AI Systems: Incident Response When the Bug Is the Model

· 11 min read
Tian Pan
Software Engineer

Your monitoring is green. Latency is nominal. Error rates are flat. And yet your customer support AI just told 10,000 users that returns are free — permanently — a policy that doesn't exist. No alert fired. No deploy happened. The model just decided to.

This is what on-call looks like for AI systems: a class of production failure that doesn't trigger the alarms you built, can't be traced to a line of code, and can't be fixed by rolling back the last deploy. Standard incident response playbooks — check the logs, identify the commit, revert the change, verify recovery — were designed for deterministic systems. Applied to LLMs, they miss the actual failure mode entirely.

Here's what actually works.

The On-Call Runbook for AI Systems That Nobody Writes

· 10 min read
Tian Pan
Software Engineer

Your p99 latency just spiked to 12 seconds. The alert fired at 3:14am. You open the runbook and find instructions for: checking the database connection pool, verifying the load balancer, restarting the service. You do all three. Latency stays elevated. The service is not down — it is up and responding. But something is wrong. It turns out the model started generating responses three times longer than usual because a recent prompt change accidentally unlocked verbose behavior. The runbook had no page for that.

This is the new category of on-call incident that engineering teams are not prepared for: the system is operational but the model is misbehaving. Traditional SRE runbooks assume binary failure states. AI systems fail probabilistically, and the symptoms do not look like an outage — they look like drift.

AI-Assisted Incident Response: How LLMs Change the SRE Playbook Without Replacing It

· 11 min read
Tian Pan
Software Engineer

Here is the paradox that nobody in the AIOps vendor space is advertising: organizations that invested over $1M in AI tooling for incident response saw their operational toil rise to 30% of engineering time—up from 25%, the first increase in five years. Teams expected the automation to replace manual work. Instead, they got a new job: verifying what the AI said before acting on it. The old tasks didn't go away. A verification layer appeared on top.

This is not an argument against AI in incident response. The same data shows a 40% reduction in mean time to resolution when AI is integrated well, and some teams report cutting investigation time from two hours to under thirty minutes. The argument is more precise: the failure modes of AI copilots are qualitatively different from the failure modes of traditional SRE tooling, and most teams aren't set up to catch them.

Debugging AI at 3am: Incident Response for LLM-Powered Systems

· 10 min read
Tian Pan
Software Engineer

You're on-call. It's 3am. Your alert fires: customer satisfaction on the AI chat feature dropped 18% in the last hour. You open the logs and see... nothing. Every request returned HTTP 200. Latency is normal. No errors anywhere.

This is the AI incident experience. Traditional on-call muscle memory — grep for stack traces, find the exception, deploy the fix — doesn't work here. The system isn't broken. It's doing exactly what it was designed to do. The outputs are just wrong.

The Public Hallucination Playbook: What to Do When Your AI Says Something Stupid in Public

· 10 min read
Tian Pan
Software Engineer

You'll find out through a screenshot. A customer will post it, a journalist will quote it, or someone on your team will Slack you a link at 11pm. Your AI system said something confidently wrong — wrong enough that it's funny, or wrong enough that it could hurt someone — and now it's public.

Most engineering teams spend months hardening their AI pipelines against this moment, then discover they never planned for what happens after it arrives. They know how to iterate on evals and tune prompts. They don't know who should post the response tweet, what that response should say, or how to tell the difference between a one-off unlucky sample and a latent failure mode that's been running in production for weeks.

This is the playbook for that moment.

The AI Rollback Ritual: Post-Incident Recovery When the Damage Is Behavioral, Not Binary

· 11 min read
Tian Pan
Software Engineer

In April 2025, OpenAI deployed an update to GPT-4o. No version bump appeared in the API. No changelog entry warned developers. Within days, enterprise applications that had been running stably for months started producing outputs that were subtly, insidiously wrong — not crashing, not throwing errors, just enthusiastically agreeing with users about terrible ideas. A model that had been calibrated and tested was now validating harmful decisions with polished confidence. OpenAI rolled it back three days later. By then, some applications had already shipped those outputs to real users.

This is the failure mode that traditional SRE practice has no template for. There was no deploy to revert. There was no diff to inspect. There was no test that failed, because behavioral regressions don't fail tests — they degrade silently across distributions until someone notices the vibe is off.

AI-Assisted Incident Response: Giving Your On-Call Agent a Runbook

· 9 min read
Tian Pan
Software Engineer

Operational toil in engineering organizations rose to 30% in 2025 — the first increase in five years — despite record investment in AI tooling. The reason is not that AI failed. The reason is that teams deployed AI agents without the same rigor they use for human on-call: no runbooks, no escalation paths, no blast-radius constraints. The agent could reason about logs, but nobody told it what it was allowed to do.

The gap between "AI that can diagnose" and "AI that can safely mitigate" is not a model capability problem. It is a systems engineering problem. And solving it requires the same discipline that SRE teams already apply to human operators: structured runbooks, tiered permissions, and mandatory escalation points.