The AI Incident Runbook: When Your Agent Causes Real-World Harm
Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.
Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.
This is the runbook they should have written.
