
Production AI Incident Response: When Your Agent Goes Wrong at 3am

11 min read
Tian Pan
Software Engineer

An infinite loop in a multi-agent cost-tracking system at a fintech startup ran undetected for eleven days. The cause: Agent A asked Agent B for clarification, Agent B asked Agent A for help interpreting the response, and neither had logic to break the cycle. The $127 weekly bill became $47,000 before a human looked at the invoice.

No errors were thrown. No alarms fired. Latency was normal. The system was running exactly as designed—just running forever.

This is what AI incidents actually look like. They're not stack traces and 500 errors. They're silent behavioral failures, runaway loops, and plausible wrong answers delivered at production scale with full confidence. Your existing incident runbook almost certainly doesn't cover any of them.

Why Your Runbook Isn't Ready

Traditional incident response is built on three assumptions. Failure is binary: a service is up or down, a database is reachable or not, a request either returns 200 or throws an exception. Failure is reproducible: the fix cycle—reproduce the failure, write a test, fix the code, verify the test passes—is deterministic. And failure is contained: the blast radius stops at the service boundary.

LLM systems break all three assumptions.

First, failure is non-binary. An agent can execute every step successfully and still produce the wrong outcome. Air Canada's chatbot returned 200 OK when it invented a refund policy that didn't exist. A legal AI generated citations to real-looking but fabricated court cases. These systems were working correctly from a systems perspective. They were just wrong.

Second, failures don't reproduce reliably. A user reports that your agent misclassified their request and took a destructive action. You rerun the exact same input. The agent behaves correctly. The failure was real—it happened in production with a specific context window state, retrieval result, and token sampling path—but you can't reproduce it on demand.

Third, blast radius crosses service boundaries. When an agent fails, it may have already written to your database, sent emails, called external APIs, or spawned sub-agents. The failure isn't contained within a service; it propagates to every system the agent touched during the failing session. One agent deleting 10,000 emails while ignoring stop commands is not a hypothetical—it's a documented 2025 incident.

The implication: incident response for AI requires a different detection model, a different triage methodology, and a different post-mortem format.

Detection: Stop Watching Error Rates

The monitoring gap in most production LLM systems is structural. Traditional observability watches system performance metrics: error rate, latency, throughput. These metrics are silent when an AI system produces confident wrong answers.

Research on 3 million user reviews across 90 AI apps found that roughly 1.75% of reviews explicitly flagged hallucinations—representing millions of users who experienced failures that threw zero system-level errors. OpenAI's Whisper, widely deployed for medical transcription, produced harmful or concerning hallucinations in nearly 40% of the cases studied. None of this triggered an alert.

The metrics that actually detect LLM quality degradation are behavioral:

Token-length distribution drift. Build a rolling 7-day baseline of output length distributions in 25-token bins. Calculate KL divergence against that baseline daily. A divergence threshold of ≥0.15 maps to user-perceived quality drops in roughly 87% of cases in empirical studies. Unusually short outputs signal truncation or refusals; unusually long outputs often signal verbose hallucination or context confusion.
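The length-drift check above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the 25-token bin size and the 0.15 threshold come from the numbers above, while the epsilon smoothing for bins absent from the baseline is an assumption.

```python
from collections import Counter
import math

def length_histogram(token_counts, bin_size=25):
    """Bucket output lengths into fixed-width bins, normalized to a distribution."""
    bins = Counter(n // bin_size for n in token_counts)
    total = sum(bins.values())
    return {b: c / total for b, c in bins.items()}

def kl_divergence(current, baseline, epsilon=1e-9):
    """KL(current || baseline), with epsilon smoothing for unseen bins."""
    all_bins = set(current) | set(baseline)
    return sum(
        current.get(b, epsilon) * math.log(current.get(b, epsilon) / baseline.get(b, epsilon))
        for b in all_bins
    )

def length_drift_alert(today_counts, baseline_counts, threshold=0.15):
    """Return (divergence, alert) for today's outputs vs. the rolling baseline."""
    divergence = kl_divergence(
        length_histogram(today_counts), length_histogram(baseline_counts)
    )
    return divergence, divergence >= threshold
```

Run this daily against the rolling 7-day baseline; a burst of unusually short (truncated) or long (verbose) outputs pushes the divergence past the threshold.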

Embedding centroid drift. Collect daily output embeddings, compute their centroid, reduce dimensionality, measure cosine similarity against your historical baseline. When similarity drops below 0.82, your model's response distribution has shifted semantically. Embedding drift analysis detects degradation an average of eleven days before the first user complaint reaches your support queue.
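A similar sketch for centroid drift, in plain Python for readability (a real pipeline would use numpy and an actual embedding model, and would apply the dimensionality reduction mentioned above, which is omitted here; the 0.82 threshold is the one cited):

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length embedding vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def centroid_drift_alert(today_embeddings, baseline_centroid, threshold=0.82):
    """Return (similarity, alert) comparing today's centroid to the baseline."""
    sim = cosine_similarity(centroid(today_embeddings), baseline_centroid)
    return sim, sim < threshold
```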

LLM-as-judge scoring. Run a secondary model (smaller and cheaper than your primary) that evaluates a sampled percentage of production responses across dimensions like groundedness, task completion, and factual consistency. Score drops of 0.3 points on any dimension are actionable signals. This is the only reliable way to catch the plausibility-but-wrong failure mode.
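The judge calls themselves depend on your provider, but the aggregation and alerting layer is simple. A sketch, assuming the judge has already returned per-dimension scores in [0, 1] and using the 0.3-point drop from above:

```python
def judge_alerts(current_scores, baseline_scores, alert_drop=0.3):
    """Compare mean judge scores per dimension against a baseline.

    current_scores / baseline_scores map dimension name (e.g. "groundedness")
    to a list of sampled scores. Returns the dimensions whose mean dropped
    by at least alert_drop, with the size of the drop.
    """
    alerts = {}
    for dim, scores in current_scores.items():
        current_mean = sum(scores) / len(scores)
        baseline_mean = sum(baseline_scores[dim]) / len(baseline_scores[dim])
        drop = baseline_mean - current_mean
        if drop >= alert_drop:
            alerts[dim] = round(drop, 3)
    return alerts
```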

Canary prompts. Maintain a fixed set of inputs that run on a schedule and whose correct outputs you know. These are your smoke tests for silent model updates from providers. When a provider updates the underlying model—often without announcement—your canary scores will change before any user-facing metric does.
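A canary harness can be a short loop. In the sketch below, `call_model` and `score_fn` are stand-ins for your model client and your output-comparison function (exact match, embedding similarity, or a judge call); the 0.9 threshold is illustrative:

```python
def run_canaries(canaries, call_model, score_fn, alert_threshold=0.9):
    """Run fixed canary prompts and compare outputs against known-good answers.

    canaries maps prompt -> expected answer. Returns the (prompt, score)
    pairs that fell below the threshold — an empty list means all passed.
    """
    failures = []
    for prompt, expected in canaries.items():
        score = score_fn(call_model(prompt), expected)
        if score < alert_threshold:
            failures.append((prompt, score))
    return failures
```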

Refusal rate fingerprinting. Track what percentage of requests trigger safety layers or refusals. Sudden changes indicate that your provider has updated model behavior. This signal catches safety-layer modifications in 3–5 days, faster than any other detection method.

These five signals together—weighted and combined into a composite health score—achieve far more reliable detection than any combination of traditional infrastructure metrics.
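One way to combine them. The weights and the per-signal normalizations below are assumptions for illustration, not prescriptions; each signal is assumed pre-normalized so that 1.0 means healthy:

```python
def composite_health(signals, weights):
    """Weighted composite of behavioral health signals, each in [0, 1]."""
    total = sum(weights.values())
    return sum(signals[name] * weight for name, weight in weights.items()) / total

# Illustrative weighting of the five signals discussed above.
DEFAULT_WEIGHTS = {
    "length_drift": 0.2,       # 1 - normalized KL divergence
    "embedding_drift": 0.2,    # centroid cosine similarity
    "judge_score": 0.25,       # mean LLM-as-judge score
    "canary_pass_rate": 0.25,  # fraction of canary prompts passing
    "refusal_stability": 0.1,  # 1 - |change in refusal rate|
}
```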

Triage: What Actually Goes Wrong

A 2025 UC Berkeley study analyzed 1,600+ annotated execution traces across seven multi-agent frameworks. The failure taxonomy that emerged is instructive: 43.8% of failures were system design issues (step repetition, unawareness of termination conditions, misspecified tasks), 32% were inter-agent misalignment (reasoning-action mismatch, task derailment), and 23.5% were task verification failures (incorrect verification, premature termination).

The headline finding: 79% of agent failures stem from specification and coordination problems, not model limitations or infrastructure outages. Your model is fine. Your architecture is the problem.

This changes triage priority order. When an agent incident fires, don't start by assuming the LLM is broken. Start by asking:

  • What was the retrieval context? Low retrieval scores, wrong document counts, or missing context in your traces indicates a retrieval failure, not a model failure.
  • What were the tool call parameters? Inspect each tool invocation. A tool being called with identical arguments repeatedly is a loop. A tool being called with parameters that weren't in the original request is a prompt injection or context leakage.
  • What was the token budget? Check finish_reason: length in your traces. A model hitting the context limit mid-reasoning will return a truncated, often incoherent response that looks like hallucination but is actually information loss.
  • What model version ran? Provider model updates often change behavior without breaking API compatibility. Compare the gen_ai.response.model attribute in your spans against what you expected. They can differ.
  • What prompt version ran? If you don't version your prompts, this question is unanswerable—which is its own problem.
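These questions translate directly into a first-pass triage script. The trace shape below is an assumption for illustration; map the keys onto whatever your tracing layer actually emits:

```python
def triage_trace(trace, expected_model, expected_prompt_version, min_retrieval_score=0.5):
    """First-pass triage over one session trace; returns human-readable flags."""
    flags = []

    # 1. Retrieval context: missing or low-scoring documents.
    retrieval = trace.get("retrieval")
    if not retrieval or max(d["score"] for d in retrieval) < min_retrieval_score:
        flags.append("retrieval: low-scoring or missing context")

    # 2. Tool calls: identical (tool, args) pairs repeated -> possible loop.
    calls = [(c["tool"], tuple(sorted(c["args"].items())))
             for c in trace.get("tool_calls", [])]
    if len(calls) != len(set(calls)):
        flags.append("tools: repeated identical call (possible loop)")

    # 3. Token budget: truncation mid-reasoning.
    if any(g.get("finish_reason") == "length" for g in trace.get("generations", [])):
        flags.append("tokens: truncated mid-reasoning (finish_reason=length)")

    # 4. Model version drift.
    if trace.get("response_model") != expected_model:
        flags.append("model: version differs from expected")

    # 5. Prompt version drift.
    if trace.get("prompt_version") != expected_prompt_version:
        flags.append("prompt: version differs from expected")

    return flags
```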

For multi-agent systems specifically, triage the coordination layer before the model layer. Inter-agent conflicts—where one agent's output creates an invalid state for another agent's input—increase 487% in systems exhibiting behavioral drift. When multiple agents seem to be going wrong simultaneously, the problem is usually the message schema between them, not the models themselves.

Containment: The Kill Switch You Need Before You Need It

Containment for AI systems operates at three layers.

Infrastructure containment is the kill switch. It must operate outside the agent's own reasoning path—through orchestration layers, access controls, or infrastructure policy—because an agent in a degraded state cannot be relied on to self-terminate correctly. The OpenClaw incident, where an agent deleted 10,000 emails while ignoring stop commands, happened because the stop mechanism was itself a prompt-level instruction. A sufficiently confused agent will ignore prompts.

Your kill switch needs: global hard stop (revoke tool permissions, halt queued jobs, lock deployment pipelines in under 30 seconds), per-agent circuit breakers (so one runaway agent doesn't exhaust token budgets for all others), and a documented activation path that doesn't require the on-call engineer to know your internal architecture.

Cost containment is non-negotiable and chronically underprioritized. A single Claude Opus agent in a worst-case 100K-loop scenario costs thousands of dollars in a single session. The default cost runaway pattern: Agent A spawns Agent B, neither has loop-detection logic, the bill grows exponentially until a human looks at the invoice days later. Required guardrails: hard token budgets per session, iteration caps, action hash deduplication (detect when the same tool is being called with the same arguments repeatedly), and cost-based circuit breakers that kill a session when spend crosses a threshold.
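The guardrails above can be sketched as a per-session guard object. All thresholds here are illustrative; tune them to your own spend profile:

```python
import hashlib
import json

class SessionGuard:
    """Per-session guardrails: token budget, iteration cap, duplicate-action
    detection (action hash dedup), and a cost-based circuit breaker."""

    def __init__(self, max_tokens=200_000, max_iterations=50,
                 max_spend_usd=25.0, max_repeats=3):
        self.max_tokens = max_tokens
        self.max_iterations = max_iterations
        self.max_spend_usd = max_spend_usd
        self.max_repeats = max_repeats
        self.tokens = 0
        self.iterations = 0
        self.spend = 0.0
        self.action_counts = {}

    def check(self, tool, args, tokens_used, cost_usd):
        """Record one agent step; return a kill reason, or None to continue."""
        self.iterations += 1
        self.tokens += tokens_used
        self.spend += cost_usd

        # Hash the (tool, args) pair to detect repeated identical actions.
        digest = hashlib.sha256(
            json.dumps([tool, args], sort_keys=True).encode()
        ).hexdigest()
        self.action_counts[digest] = self.action_counts.get(digest, 0) + 1

        if self.action_counts[digest] > self.max_repeats:
            return "duplicate action loop"
        if self.iterations > self.max_iterations:
            return "iteration cap exceeded"
        if self.tokens > self.max_tokens:
            return "token budget exceeded"
        if self.spend > self.max_spend_usd:
            return "spend threshold exceeded"
        return None
```

The orchestrator calls `check` before dispatching each tool call and hard-stops the session on any non-None reason, independent of anything the agent itself decides.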

Functional containment is graceful degradation. Define a fallback ladder for your system: full agentic mode → RAG-only mode → keyword search → static responses. When the full agent fails, you need a way to serve something useful while you investigate. Most teams don't think about this until they're in an incident. The time to design the fallback is before the incident, not during it.
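The ladder itself is a few lines once each mode sits behind a common interface. The handlers below are stand-ins for the real subsystems; the last rung should be designed never to fail:

```python
FALLBACK_LADDER = ["agentic", "rag_only", "keyword_search", "static_response"]

def answer_with_fallback(query, handlers, ladder=FALLBACK_LADDER):
    """Walk the degradation ladder until a handler succeeds.

    handlers maps mode name -> callable(query). Each callable raises on
    failure; the first one that succeeds wins, and the mode name is
    returned alongside the answer so degraded responses can be logged.
    """
    for mode in ladder:
        try:
            return mode, handlers[mode](query)
        except Exception:
            continue
    raise RuntimeError("all fallback modes failed")
```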

Root Cause Analysis for Non-Deterministic Systems

The standard RCA process breaks down for LLM systems because the failure may not be reproducible and the evidence may not exist unless you instrumented for it before the incident.

The most important architectural decision you can make for incident response is this: log every production LLM call, with full context, before you need it. Prompt version, model version, retrieval results, tool calls and responses, token counts, finish reasons, session ID, user ID, request ID. If you don't have this data at incident time, you're reconstructing blindly.

Given that data, effective RCA follows a component isolation methodology:

  • Compare the failing session's traces against a baseline of healthy sessions
  • Isolate by dimension: same user, same query type, same model version, same time window
  • Check whether the failure correlates with any deployment event (prompt change, model update, retrieval index refresh, tool schema change)
  • For tool call failures, examine whether the tool's API contract changed without the agent's tool description being updated
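The deployment-correlation step is mechanical once deploy events (prompt changes, model updates, index refreshes, tool schema changes) are logged with timestamps. A sketch, assuming each event records its kind and time; the 24-hour lookback is an arbitrary starting point:

```python
from datetime import datetime, timedelta

def correlated_deploys(first_failure_at, deploy_events, window_hours=24):
    """Return deployment events that landed in the window before the first
    observed failure — the prime suspects for component isolation."""
    window_start = first_failure_at - timedelta(hours=window_hours)
    return [e for e in deploy_events if window_start <= e["at"] <= first_failure_at]
```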

The structural limitation of LLM RCA is that you cannot reliably reproduce the failing execution. Your post-mortem must be grounded in what actually ran in production—the actual traces—not in reproduced behavior. This is why log retention is not optional, and why sampling your logs below 100% creates gaps that will matter exactly when you need them.

Post-Mortem Format That Actually Produces Learning

Most AI post-mortems fail because they use the wrong level of abstraction. "The model hallucinated" is not a root cause. "The model provided incorrect information" is not a root cause. These are descriptions of symptoms.

Actionable AI post-mortems anchor on answerable questions:

  • What retrieval context was provided to the model at the moment of failure? Was it sufficient?
  • What was the prompt version? Had it changed recently?
  • Did the provider update the underlying model between the last-known-good behavior and the incident?
  • Were there token budget pressures that could have caused truncation?
  • What tool calls were made, in what order, and what did each return?
  • Was there a recent change to any tool's API that the agent's description didn't reflect?

The corrective actions that close AI post-mortems are engineering artifacts: new eval coverage for the failing case, a canary prompt that would have caught the regression, a circuit breaker configuration, a retrieval quality check, a model version pin. "Improve our prompts" without a specific test that would detect regression is not a corrective action.

One practice that consistently improves post-mortem quality: attach the full session trace to the post-mortem document. Every LLM call, every tool invocation, every token count. When you review the post-mortem six months later—or when a new engineer is debugging a similar failure—the trace is the only way to understand what actually happened.

Building a Response Culture That Matches the Problem

The technical patterns matter less than the organizational response to the first few incidents. Teams that treat "the AI did something unexpected" as a one-time anomaly rather than a category of failure to invest in will repeat the same incidents indefinitely.

Three practices that change this:

Make behavioral health a first-class SLI. Define service level indicators for output quality—groundedness scores, task completion rates, LLM-as-judge scores—alongside traditional latency and error rate SLIs. When a behavioral SLI degrades, that's an incident, even if no errors fired.

Instrument before you need it. Full session tracing, canary prompts, model version pinning, token budget logging—none of these are useful after an incident if they weren't in place before it. The cost of this instrumentation is low. The cost of investigating a production AI incident without it is high.

Run table-top exercises on your AI failure modes. What happens if your provider silently updates the base model? What happens if a tool API changes and your agent's description is now wrong? What happens if your retrieval index returns stale or adversarially poisoned documents? These failure modes are predictable; the gaps in your incident response process are not obvious until you walk through them.

Production AI systems are not more reliable than the teams that operate them. The teams operating them well have simply accepted that "it works" and "it works correctly" are different properties—and built their incident response process accordingly.
