The AI Incident Runbook: When Your Agent Causes Real-World Harm

· 11 min read
Tian Pan
Software Engineer

Your agent just did something it shouldn't have. Maybe it sent emails to the wrong people. Maybe it executed a database write that should have been a read. Maybe it gave medical advice that sent a user to the hospital. You are now in an AI incident — and the playbook you've been using for software outages will not help you.

Traditional incident runbooks are built on a foundational assumption: given the same input, the system produces the same output. That assumption lets you reproduce the failure, bisect toward the cause, and verify the fix. None of that applies to a stochastic system operating on natural language. The same prompt through the same pipeline can produce different results across runs, providers, regions, and time. Documented AI incidents surged 56% from 2023 to 2024, yet most organizations still route these events through software incident processes designed for a fundamentally different class of problem.

This is the runbook they should have written.

Why Your Existing Runbook Will Mislead You

A standard software runbook works by asking: what changed? Find the deployment, the config change, the dependency update. Roll it back. Verify.

AI incidents resist this framing for several reasons.

First, the system may not have changed at all. Your retrieval index drifted because upstream data was updated. Your model provider silently updated the model behind a stable API version. Your context window grew past the threshold where your system prompt gets truncated — not because of a code change, but because a conversation grew longer than usual. The "what changed" question often has no answer you can point to.

Second, the failure may not be reproducible. LLM outputs are sampled from probability distributions. The harmful completion you're investigating may be a low-probability event that occurred once and may never occur again in testing. Running the same prompt returns normal output. Your test suite passes. This is not exculpatory — it means your evaluation methodology needs to change, not that the system is safe.

Third, the blast radius is harder to bound. In a deterministic system, you can enumerate every execution path that touched the bug. In an AI system, every interaction with a misbehaving agent is a unique event. You don't know which users received bad outputs unless you logged every completion — and many teams don't.

Step One: Stop the Bleeding Before You Understand the Cause

In a traditional incident, you might tolerate investigation time before acting because you can bound the ongoing damage. In an AI incident where the system is actively taking actions — sending messages, writing records, making API calls — every minute of continued operation potentially expands the harm.

The first decision is binary: does this system need to come down right now, or can you narrow the blast radius without a full shutdown?

To make that call without a stack trace, reason from signals you do have:

Access scope: What data, systems, and users can this agent touch? A customer-facing chatbot that only reads from a FAQ database has a narrow scope. An agent with write access to production records and outbound communication has a wide one.

Operating velocity: How many operations per minute is the system executing? An agent processing 10 requests per day in a low-stakes workflow gives you time to investigate. An agent handling 10,000 requests per hour cannot wait.

Detection window: How long might this have been happening before you noticed? If monitoring caught this within minutes, the damage is probably contained. If an anomaly that started last week only surfaced today, assume the worst and audit backward.

Blast radius estimation in this context is a product: scope × velocity × detection window. A clinical documentation agent accessing two million patient records and processing a thousand records per day, if undetected for thirty days, has a theoretical blast radius of thirty thousand affected interactions. That calculation tells you whether to take the system down immediately or proceed cautiously.
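The triage math above can be sketched in a few lines. This is a rough estimator, not a precise audit; the field names and the shutdown threshold are illustrative assumptions, and the worked example reproduces the clinical documentation scenario from the text.

```python
from dataclasses import dataclass

@dataclass
class BlastRadiusEstimate:
    """Rough triage inputs; all three values are estimates, not measurements."""
    records_in_scope: int         # access scope: data the agent can touch
    ops_per_day: float            # operating velocity
    detection_window_days: float  # time from onset to detection

    def affected_interactions(self) -> float:
        # Upper bound: assume every operation during the window may be tainted.
        return self.ops_per_day * self.detection_window_days

    def should_shut_down(self, threshold: float = 1_000) -> bool:
        # Illustrative conservative default: a large estimated blast radius
        # means full shutdown rather than a narrowed blast radius.
        return self.affected_interactions() >= threshold

# Worked example from the text: 1,000 records/day, undetected for 30 days.
estimate = BlastRadiusEstimate(
    records_in_scope=2_000_000, ops_per_day=1_000, detection_window_days=30
)
print(estimate.affected_interactions())  # 30000.0
print(estimate.should_shut_down())       # True
```

The point of writing it down is that the decision becomes a number you can argue about during the incident, rather than a gut feeling.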

If in doubt, take it down. The reputational and legal cost of continued harm exceeds the cost of an unnecessary outage by a wide margin in most domains.
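Taking the system down is only fast if the shutdown path exists before the incident. A minimal sketch of one pattern: a kill switch the agent cannot override, checked before every side-effecting action. The `send_email` stub and flag mechanism here are assumptions; real deployments would back the flag with an external store so it works across processes and hosts.

```python
import threading

class KillSwitch:
    """Process-wide stop flag checked before every side-effecting action.
    Sketch only: production systems need the flag in shared storage so the
    switch trips every replica, not just this process."""

    def __init__(self) -> None:
        self._stopped = threading.Event()

    def trip(self) -> None:
        self._stopped.set()

    def guard(self) -> None:
        # Called at the top of every tool the agent can invoke, so stopping
        # does not depend on the agent's cooperation.
        if self._stopped.is_set():
            raise RuntimeError("agent halted by incident kill switch")

switch = KillSwitch()

def send_email(to: str, body: str) -> None:  # hypothetical agent tool
    switch.guard()   # refuse the side effect once the switch is tripped
    ...              # actual send elided

switch.trip()
try:
    send_email("user@example.com", "hello")
except RuntimeError as e:
    print(e)  # agent halted by incident kill switch
```

Placing the check in the tool layer, not the agent loop, is the design choice that matters: it makes the system interruptible without its cooperation.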

Step Two: Preserve Evidence Before It Disappears

The biggest mistake teams make in AI incidents is letting evidence age out before they've captured it. Logs get rotated. Model provider trace data expires. Completions that seemed reproducible stop being reproducible once the underlying model updates.

Within the first fifteen minutes of declaring an incident, freeze the following:

Full prompt and completion logs with timestamps, not just inputs and outputs. The system prompt matters. The conversation history matters. Anything that was in context when the harmful output was generated is potential evidence.

Model version metadata: What model were you calling, at what temperature, with what sampling parameters? If you were calling a provider API with a mutable model alias (like "gpt-4" instead of a specific version string), you may not be able to recover what model version was actually serving requests. This is a critical gap — prefer pinned version strings in production systems.

Tool invocation traces: If the agent called external tools, preserve every call and return value. Which tool was called, in what order, with what arguments, and what was returned? This is often where you find the actual failure — not in the model's reasoning, but in what it was given to reason about.

Identity and delegation chains: Who or what authorized this action? If your agent operates on behalf of users, which user triggered the chain of events? This matters for both technical remediation and legal disclosure.

Retrieval context: If you use RAG, what documents were retrieved and ranked? A harmful output often traces back to what the model was given, not what it invented. Preserve the retrieval inputs, query, and ranked results.

Once captured, this evidence can no longer be destroyed by what happens next. Capture it before you begin remediation, because remediation steps (rollbacks, data fixes, prompt changes) alter the system state you're trying to reconstruct.
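The items above can be frozen as a single structured record per completion. A minimal sketch, assuming an append-only JSONL file as the destination; the field names are illustrative, and a production system would ship records to write-once storage rather than local disk.

```python
import hashlib
import json
from datetime import datetime, timezone

def freeze_evidence(path: str, *, system_prompt: str, messages: list,
                    completion: str, model: str, temperature: float,
                    tool_calls: list, retrieved_docs: list,
                    acting_user: str) -> str:
    """Append one immutable evidence record and return its SHA-256 digest.
    Sketch only; field names are assumptions, not a real schema."""
    record = {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "system_prompt": system_prompt,    # everything that was in context
        "messages": messages,
        "completion": completion,
        "model": model,                    # prefer a pinned version string
        "temperature": temperature,
        "tool_calls": tool_calls,          # every call and return value
        "retrieved_docs": retrieved_docs,  # RAG query and ranked results
        "acting_user": acting_user,        # identity and delegation chain
    }
    line = json.dumps(record, sort_keys=True)
    digest = hashlib.sha256(line.encode()).hexdigest()  # tamper-evidence
    with open(path, "a") as f:
        f.write(line + "\n")
    return digest
```

Hashing each record at capture time gives you a way to prove later that the evidence was not altered during remediation.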

Step Three: Communicate with Honest Uncertainty

The hardest part of AI incident communication is that you usually cannot explain what happened. You cannot point to a line of code. You cannot produce a diff. You know the system behaved incorrectly, but you cannot give a crisp technical explanation of why — at least not in the first hour.

Most teams respond to this uncertainty by saying nothing, or saying something vague enough to be meaningless. Both strategies backfire. Users and regulators interpret silence as concealment. Vague statements like "we're investigating a technical issue" invite speculation that is usually worse than the reality.

The better framework: communicate what you know, what you don't know, and what you're doing about the uncertainty.

Within the first hour: internal coordination. Ensure your team has a single, shared understanding of the situation. Conflicting statements from different team members reaching users before your official response are worse than saying nothing at all.

Within a few hours: user notification if harm has occurred or may have occurred. The message does not need to explain the root cause. It needs to acknowledge what happened from the user's perspective, what actions they should take (if any), and that you're actively working on it. Specificity builds trust even when the news is bad.

What to avoid: claiming the model "hallucinated" as an explanation to users. This phrase has become a socially acceptable way to disclaim responsibility for AI failures, and sophisticated users recognize it as such. It explains nothing and implies the failure was random and unpreventable — which it almost certainly wasn't.

Step Four: Investigate Systematically, Not Intuitively

"The model hallucinated" is never the root cause. It is a description of a symptom. The actual root cause is always in the system you built around the model.

Productive investigation follows layers, not intuitions:

Retrieval layer first: If you use RAG, did the system retrieve relevant, accurate documents? Did the query that went to the retrieval system accurately capture what the user needed? Hallucinations frequently occur when the model is forced to generate from an empty or misleading context.

Prompt layer second: What instructions did the system give the model? Were there conflicting instructions? Were there instructions that, when combined with the specific user input, created an unexpected interaction? Prompt failures are systemic — if it happened once, it will happen again under similar conditions.

Model layer third: Only after ruling out retrieval and prompt failures should you examine the model itself. Was this a known failure mode of the model family? Did you observe the failure at a temperature or context length that is documented to degrade quality? Model-layer failures are real, but they're rarely random — they have structure you can find.

Safety layer fourth: Where were your guardrails, and why didn't they catch this? If you have output filters, what did they score this completion? If you don't have output filters, that is the root cause.

Meta's production root cause analysis platform — running across 300+ teams — finds that genuine root cause analysis requires systematic investigation through each layer, with anomaly detection and dimension analysis at each step. Single-factor explanations like "the model hallucinated" are shortcuts that produce no actionable remediation.
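The layer ordering above can be encoded as a simple triage function that refuses to skip ahead. This is a sketch under stated assumptions: the incident fields (`retrieved_docs`, `conflicting_instructions`, and so on) are hypothetical names for whatever signals your evidence capture actually produces.

```python
def first_failing_layer(incident: dict) -> str:
    """Walk the layers in the order retrieval -> prompt -> model -> safety
    and return the first one that explains the failure.
    Field names are illustrative, not a real schema."""
    # Retrieval layer: the model was forced to generate from an empty
    # or misleading context.
    if not incident.get("retrieved_docs") or incident.get("retrieval_off_topic"):
        return "retrieval"
    # Prompt layer: conflicting or interacting instructions are systemic;
    # if it happened once, it will happen again under similar conditions.
    if incident.get("conflicting_instructions"):
        return "prompt"
    # Model layer: only after ruling out the layers above.
    if incident.get("known_model_failure_mode"):
        return "model"
    # Safety layer: guardrails existed but scored this completion as fine,
    # or didn't exist at all. In the latter case, that is the root cause.
    return "safety"

print(first_failing_layer({"retrieved_docs": []}))  # retrieval
```

Even a toy version like this is useful in a post-mortem review: if the draft explanation jumps straight to "model", the function makes explicit which earlier checks were skipped.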

Step Five: Write the Post-Mortem That Will Actually Help

Traditional post-mortems end with action items: add a test, fix the bug, improve monitoring. AI incident post-mortems require a different structure because the failure mode often can't be unit-tested away.

The document needs to capture what was preserved from the incident and why it was or wasn't sufficient. If you couldn't reconstruct the timeline because you lacked prompt logs, that gap is an action item. If you couldn't determine blast radius because you didn't have per-user completion logs, that's an action item.

The "what went wrong" section should refuse to accept layer-skipping explanations. If the draft says "the model produced a harmful output," that's not complete. The complete version says which retrieval or prompt condition created the context that made the harmful output likely, and why the safety layer didn't catch it.

The remediation should specify what observability you're adding — not to detect the same incident again, but to detect the class of failures that this incident belongs to. AI systems fail in patterns. An agent that sent emails to the wrong people likely did so because of an identity confusion failure. That pattern will recur in different forms unless you add monitoring for identity-related anomalies across the system.

Finally: establish a watch period. Unlike software fixes, AI system remediations don't provide a clean pass/fail signal. After changing a prompt or adding a guardrail, monitor the system's behavior distribution across varied conditions for a meaningful period — not a single test run. Stochastic systems require sustained observation, not point-in-time validation.
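One way to make the watch period concrete is to compare the post-remediation distribution of some behavior metric (a guardrail score, a refusal rate) against a pre-incident baseline, and only accept the fix after enough samples have accumulated. A minimal sketch; the metric, sample threshold, and z-score cutoff are all assumptions to tune per system.

```python
import statistics

def watch_period_ok(baseline: list[float], post_fix: list[float],
                    z_threshold: float = 3.0, min_samples: int = 100) -> bool:
    """Sustained observation, not point-in-time validation: accept the
    remediation only after min_samples post-fix observations, and only if
    their mean stays within z_threshold baseline standard deviations.
    Sketch only; thresholds are illustrative assumptions."""
    if len(post_fix) < min_samples:
        return False  # keep watching; not enough evidence yet
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1e-9  # guard against zero variance
    return abs(statistics.mean(post_fix) - mu) / sigma < z_threshold
```

The `min_samples` gate is the part that encodes the runbook's point: a single passing test run returns `False` here by construction, no matter how good it looks.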

The Underlying Shift in Responsibility

The incidents that matter most are the ones where users trusted your system and were harmed as a result. A chatbot that gave dangerous medical advice. An agent that executed a financial transaction on incorrect reasoning. A document-processing system that silently corrupted records for weeks before anyone noticed.

These aren't edge cases in AI development. They are the central engineering problem. Building a useful AI system and building a safe one are the same project, and the incident runbook is where that becomes concrete.

The operational discipline required — logging everything, bounding access scope, making systems interruptible without their cooperation, communicating honestly under uncertainty — is not glamorous. It doesn't appear in demos. But it is what separates teams that can operate AI in production from teams that can only demo it.

When your agent causes harm, your first priority is limiting that harm. Your second priority is understanding it. Your third priority is building the system that makes the next incident easier to contain and understand. The teams doing this work now are building something that will matter more as AI systems gain more access and more autonomy. Starting with a runbook that takes stochastic failures seriously is the right place to begin.
