The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

Why Your Existing Runbook Is Useless Here

Traditional incident response is built around deterministic failure. A service crashes, returns 503, or times out. The blast radius is bounded: you can trace which service failed, which calls it affected, and restore to a known-good state. Recovery is a deployment rollback or a restart.

LLM failures violate every assumption in that model.

Silent confidence: The model returns a plausible, well-formatted response. There's no exception to catch. The system processes it successfully and serves it to the user. From an infrastructure perspective, everything worked.

Non-reproducibility: The same input may produce different outputs across calls. A failing request often can't be reproduced on demand, which makes "reproduce the bug" — the first step in any traditional runbook — actively misleading. You might spend hours chasing a ghost.

Diffuse blast radius: In agentic systems, a single bad prediction cascades. If an agent decided to delete a record or send an email based on a hallucinated fact, the damage has already propagated through every downstream tool the agent touched. You're not rolling back code; you're auditing side effects.

Delayed detection: Research suggests 42% of teams only discover production AI incidents through support tickets or Slack messages — long after the issue started. Dashboards that track request success rates and latency miss quality degradation entirely.

The result: most teams applying traditional runbooks to AI incidents are playing the wrong game. They're looking for error codes in a system that only fails in meaning.

The Four Root Causes (and How to Distinguish Them Fast)

When "outputs feel wrong," your first job is to avoid the most common mistake: blaming the model. In practice, model behavior changes are responsible for a minority of AI incidents. The triage question isn't "did the model break?" — it's "which layer changed?"

There are four distinct layers that can cause output quality to degrade:

Data drift happens when the distribution of inputs shifts away from what the model was designed to handle. A customer support bot trained on English queries starts receiving significant Spanish input. A document classifier encounters a new document format. The model hasn't changed, but the world has. Data drift is detected by monitoring input distributions — language codes, document lengths, entity types, schema patterns — and alerting on statistical divergence. Tools like Evidently AI make this operational.
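Monitoring input distributions can be sketched with nothing more than a categorical distance check. The example below is a minimal, dependency-free illustration (the language codes, counts, and threshold are invented for demonstration); tools like Evidently AI wrap this pattern with richer statistics and dashboards.

```python
from collections import Counter

def tv_distance(baseline: dict, current: dict) -> float:
    """Total variation distance between two categorical distributions (0 to 1)."""
    keys = set(baseline) | set(current)
    return 0.5 * sum(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) for k in keys)

def normalize(counts: Counter) -> dict:
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

# Baseline captured while the system was known-healthy (hypothetical counts).
baseline = normalize(Counter({"en": 9200, "es": 500, "de": 300}))

# Rolling window of language codes detected on live requests.
window = normalize(Counter({"en": 650, "es": 320, "de": 30}))

DRIFT_THRESHOLD = 0.15  # tune against your historical day-to-day variance

# Fire a drift alert when the live window diverges past the threshold.
drifted = tv_distance(baseline, window) > DRIFT_THRESHOLD
```

The same pattern applies to any categorical or bucketed input feature: document lengths, entity types, schema versions.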

Prompt regression is subtler. Every time you modify a prompt — adding examples, tweaking instructions, adjusting tone — you change the implicit supervision signal. The prompt can drift from its original intent through accumulated changes, each of which seemed reasonable in isolation. Prompt regression shows up as consistent directional shifts in output: the model starts truncating responses, becoming overly cautious, or misunderstanding a specific input pattern. Detection requires a regression test suite: a fixed set of inputs with baseline outputs you compare against after every prompt change.
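A regression suite of this kind can be small. The sketch below assumes a fixed baseline of input/output pairs and uses simple string similarity as the comparison; in practice teams often substitute structural checks or an LLM judge. The `run_model` stub and the example pairs are hypothetical stand-ins for your real provider client and captured baselines.

```python
import difflib

# Hypothetical baseline: inputs paired with known-good outputs, captured
# before the prompt change under review.
BASELINE = [
    ("What is your refund window?", "Refunds are accepted within 30 days of purchase."),
    ("Do you ship internationally?", "Yes, we ship to over 40 countries."),
]

def run_model(prompt_template: str, user_input: str) -> str:
    """Stand-in for the real LLM call; swap in your provider client."""
    return BASELINE[0][1] if "refund" in user_input else BASELINE[1][1]

def prompt_regression_report(prompt_template: str, min_similarity: float = 0.9):
    """Return inputs whose new output diverged from the baseline output."""
    failures = []
    for user_input, expected in BASELINE:
        actual = run_model(prompt_template, user_input)
        score = difflib.SequenceMatcher(None, expected, actual).ratio()
        if score < min_similarity:
            failures.append((user_input, score))
    return failures

failures = prompt_regression_report("You are a support agent. {input}")
```

Run the report in CI on every prompt change; a non-empty failure list blocks the deploy.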

Model regression occurs when the model itself changes — and this can happen without any action on your part. LLM API providers ship silent updates. GPT-4o in January is not identical to GPT-4o in April. The behavior shifts without version bumps or changelogs. Model regression manifests as output distribution changes: different default lengths, different refusal rates, different formatting conventions. The best defense is behavioral fingerprinting: a test suite that captures key output characteristics on a fixed prompt set, run continuously in production.
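A behavioral fingerprint can be as simple as a handful of summary statistics over a batch of outputs. This sketch (markers, metrics, and tolerance are all illustrative choices, not a standard) compares a current batch against a stored baseline and reports which characteristics moved.

```python
import re
from statistics import mean

# Illustrative refusal markers; extend with your model's actual phrasing.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def fingerprint(outputs: list) -> dict:
    """Summarize behavioral characteristics of a batch of model outputs."""
    return {
        "mean_length": mean(len(o) for o in outputs),
        "refusal_rate": mean(any(m in o.lower() for m in REFUSAL_MARKERS)
                             for o in outputs),
        "bullet_rate": mean(bool(re.search(r"^\s*[-*]", o, re.M))
                            for o in outputs),
    }

def shifted(baseline: dict, current: dict, tolerance: float = 0.25) -> list:
    """Return the metric names that moved more than `tolerance` (relative)."""
    moved = []
    for key, base in baseline.items():
        denom = max(abs(base), 1e-9)
        if abs(current[key] - base) / denom > tolerance:
            moved.append(key)
    return moved
```

Run `fingerprint` on a fixed prompt set daily; a non-empty `shifted` list on unchanged prompts and infrastructure is your evidence of a silent model update.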

Infrastructure failure is often the actual culprit when teams assume model failure. Retrieval components that return stale or irrelevant context, vector databases with corrupted indexes, tool schemas that no longer match what the model was told to expect, context assembly bugs that silently truncate important information — any of these produces outputs that look like model failure but aren't. Research indicates 42% of production hallucinations stem from retrieval and context assembly failures, not the model itself.

Triage sequence: Check infra first (new deployments, config changes, dependency updates in the last 24 hours). Then check data (input distribution shift). Then check prompts (any prompt changes in the last 48 hours). Save model regression for last — it's real but requires more evidence to confirm.
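The triage order above can be encoded directly so on-call engineers walk the layers in the same sequence every time. The signal names below are hypothetical flags you would fill in from your dashboards and changelogs.

```python
def triage(signals: dict) -> str:
    """Walk the four layers in triage order, cheapest and likeliest first.
    `signals` maps hypothetical check names to booleans gathered on-call."""
    if signals.get("infra_change_24h") or signals.get("retrieval_degraded"):
        return "infrastructure"
    if signals.get("input_distribution_shift"):
        return "data drift"
    if signals.get("prompt_change_48h"):
        return "prompt regression"
    # Model regression is last: real, but it needs fingerprint evidence.
    return "suspected model regression"
```

Even as a checklist rather than code, the point is the ordering: infra, then data, then prompts, then model.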

The Escalation Decision Tree

Once you've identified the root cause, the next question is where to fix it. The decision is more consequential than it seems: escalating to your LLM provider is slow (24-48 hours minimum), expensive in goodwill, and often unnecessary.

Fix locally first. The majority of incidents — across all four root cause categories — can be resolved without provider involvement:

  • Prompt regression → revert the prompt change or apply a targeted patch
  • Data drift → add input validation, normalize edge-case inputs, or add a routing layer that handles the new distribution separately
  • Infrastructure failure → restore the retrieval component, correct the schema, fix the context assembly bug
  • Mild model regression → adjust examples, add explicit constraints, or change temperature

For most teams, a prompt change can be tested in minutes and deployed in under 30 minutes. This is almost always the right first move.

Fall back within the same provider. If local fixes don't resolve it and you need to restore service quickly, switch model tiers. GPT-4o → GPT-4o-mini keeps prompt compatibility high while providing a different behavioral profile. Claude Opus → Claude Sonnet is a similar swap. This buys 3-5 hours of recovery time while you investigate the root cause without user impact.

Route to an alternative provider. If the primary provider is experiencing degraded performance or a partial outage, an automatic fallback to a backup provider is appropriate at a sustained error rate above 5-10%. This requires pre-built prompt compatibility across providers — a worthwhile investment to make before an incident, not during one.
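An automatic fallback of this kind reduces to tracking a windowed error rate and switching routes when it crosses the threshold. This is a minimal sketch — the model names, window size, and 5% threshold are illustrative, and production routers also need health-based recovery back to the primary.

```python
from collections import deque

class FallbackRouter:
    """Route to a backup provider when the primary's recent error rate
    stays above a threshold. All defaults here are illustrative."""

    def __init__(self, primary: str, backup: str,
                 window: int = 200, threshold: float = 0.05):
        self.primary, self.backup = primary, backup
        self.results = deque(maxlen=window)  # True = failed call
        self.threshold = threshold

    def record(self, failed: bool) -> None:
        self.results.append(failed)

    @property
    def error_rate(self) -> float:
        return sum(self.results) / len(self.results) if self.results else 0.0

    def choose(self) -> str:
        # Require a full window before falling back, so a handful of
        # early errors doesn't cause flapping between providers.
        if (len(self.results) == self.results.maxlen
                and self.error_rate > self.threshold):
            return self.backup
        return self.primary
```

The prerequisite the text mentions still holds: the backup route is only useful if your prompts already run acceptably on the backup provider.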

Escalate to the provider. File a support ticket only after exhausting the above tiers. Provide: model version, temperature settings, a minimal reproducible example (or your closest approximation given non-determinism), and the specific behavioral shift you're observing with before/after examples. Provider engineering teams can investigate silent model updates or infrastructure issues on their side. Expect 24-48 hours for non-trivial issues.

The key insight: most escalation decisions are made prematurely. Teams that jump to "contact OpenAI" when they could fix the prompt in 20 minutes waste hours waiting for a response to a problem they already had the tools to solve.

Evals as Production Monitoring

Traditional monitoring tells you the system is running. You need a separate layer to tell you the system is working well. This is where evaluations — evals — become operational infrastructure, not just a testing tool.

The pattern: sample 1-5% of live production traffic and run it through an automated scoring pipeline. An LLM-as-a-judge setup uses a separate model to assess your primary model's outputs against defined criteria — correctness, factual grounding, tone, safety compliance, adherence to format. You define the scorecard before launch. The eval scores become time-series metrics you alert on.

This gives you something powerful: a quality-based alert that fires when output accuracy drops below a threshold, before users start filing tickets. Instead of discovering the incident at 2 AM from a Slack message, you get paged by a metric crossing a line you deliberately drew.
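The sampling-plus-judging loop can be sketched in a few lines. The judge here is a stand-in function returning a 0-1 score; in a real pipeline it is a second LLM call scoring against your scorecard, and the sample rate and alert threshold are values you choose, not standards.

```python
import random

SAMPLE_RATE = 0.02       # score ~2% of live traffic (illustrative)
ALERT_THRESHOLD = 0.85   # alert when mean quality dips below this line

def judge(prompt: str, output: str) -> float:
    """Stand-in for a judge-model call returning a 0-1 quality score;
    in practice this is a second LLM scoring against your scorecard."""
    return 0.0 if "unsure" in output else 1.0

def maybe_score(prompt: str, output: str, scores: list, rng=random.random):
    """Sample a fraction of live requests into the scoring pipeline."""
    if rng() < SAMPLE_RATE:
        scores.append(judge(prompt, output))

def quality_alert(scores: list) -> bool:
    """True when the windowed mean score crosses the alert line."""
    return bool(scores) and sum(scores) / len(scores) < ALERT_THRESHOLD
```

In production, `scores` would feed a time-series backend and the alert would page on a sustained window, not a single sample.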

The operational setup requires three pieces: a sampling hook on live traffic, a judge model scored against a predefined scorecard, and alerting wired to the resulting time-series metrics. Build all three before your next incident — assembling them during one is too late.