The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction
Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.
This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."
This post is the playbook I wish existed when I first had to respond to a production AI incident.
