
The AI On-Call Playbook: Incident Response When the Bug Is a Bad Prediction

· 12 min read
Tian Pan
Software Engineer

Your pager fires at 2 AM. The dashboard shows no 5xx errors, no timeout spikes, no unusual latency. Yet customer support is flooded: "the AI is giving weird answers." You open the runbook—and immediately realize it was written for a different kind of system entirely.

This is the defining failure mode of AI incident response in 2026. The system is technically healthy. The bug is behavioral. Traditional runbooks assume discrete failure signals: a stack trace, an error code, a service that won't respond. LLM-based systems break this assumption completely. The output is grammatically correct, delivered at normal latency, and thoroughly wrong. No alarm catches it. The only signal is that something "feels off."

This post is the playbook I wish existed when I first had to respond to a production AI incident.

Why Your Existing Runbook Is Useless Here

Traditional incident response is built around deterministic failure. A service crashes, returns 503, or times out. The blast radius is bounded: you can trace which service failed, which calls it affected, and restore to a known-good state. Recovery is a deployment rollback or a restart.

LLM failures violate every assumption in that model.

Silent confidence: The model returns a plausible, well-formatted response. There's no exception to catch. The system processes it successfully and serves it to the user. From an infrastructure perspective, everything worked.

Non-reproducibility: The same input may produce different outputs across calls. A failing request often can't be reproduced on demand, which makes "reproduce the bug" — the first step in any traditional runbook — actively misleading. You might spend hours chasing a ghost.

Diffuse blast radius: In agentic systems, a single bad prediction cascades. If an agent decided to delete a record or send an email based on a hallucinated fact, the damage has already propagated through every downstream tool the agent touched. You're not rolling back code; you're auditing side effects.

Delayed detection: Research suggests 42% of teams only discover production AI incidents through support tickets or Slack messages — long after the issue started. Dashboards that track request success rates and latency miss quality degradation entirely.

The result: most teams applying traditional runbooks to AI incidents are playing the wrong game. They're looking for error codes in a system that only fails in meaning.

The Four Root Causes (and How to Distinguish Them Fast)

When "outputs feel wrong," your first job is to avoid the most common mistake: blaming the model. In practice, model behavior changes are responsible for a minority of AI incidents. The triage question isn't "did the model break?" — it's "which layer changed?"

There are four distinct layers that can cause output quality to degrade:

Data drift happens when the distribution of inputs shifts away from what the model was designed to handle. A customer support bot trained on English queries starts receiving significant Spanish input. A document classifier encounters a new document format. The model hasn't changed, but the world has. Data drift is detected by monitoring input distributions — language codes, document lengths, entity types, schema patterns — and alerting on statistical divergence. Tools like Evidently AI make this operational.
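
Detecting this kind of shift doesn't require a platform to get started. Below is a minimal sketch in plain Python (no Evidently) that compares a baseline language-code distribution against the current window using the population stability index; the `category_shares` helper and the 0.25 alert threshold are illustrative choices, not prescriptions.

```python
from collections import Counter
import math

def category_shares(values):
    """Normalize category counts into probability shares."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}

def psi(baseline, current, epsilon=1e-4):
    """Population Stability Index between two categorical distributions.
    Rule of thumb: < 0.1 stable, 0.1-0.25 moderate shift, > 0.25 major shift."""
    keys = set(baseline) | set(current)
    score = 0.0
    for k in keys:
        b = baseline.get(k, 0.0) + epsilon
        c = current.get(k, 0.0) + epsilon
        score += (c - b) * math.log(c / b)
    return score

# Baseline week: mostly English queries; current day: heavy Spanish influx.
baseline = category_shares(["en"] * 95 + ["es"] * 5)
current = category_shares(["en"] * 60 + ["es"] * 40)

if psi(baseline, current) > 0.25:
    print("ALERT: input language distribution shifted")
```

The same function works on any bucketed input feature: document-length histograms, entity types, schema patterns.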

Prompt regression is subtler. Every time you modify a prompt — adding examples, tweaking instructions, adjusting tone — you change the implicit supervision signal. The prompt can drift from its original intent through accumulated changes, each of which seemed reasonable in isolation. Prompt regression shows up as consistent directional shifts in output: the model starts truncating responses, becoming overly cautious, or misunderstanding a specific input pattern. Detection requires a regression test suite: a fixed set of inputs with baseline outputs you compare against after every prompt change.
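
A regression suite for prompts has to cope with non-determinism, so it's usually better to assert on properties of the output than on exact strings. A minimal sketch, where `call_llm` is a hypothetical stand-in for your client and the case fields are illustrative:

```python
# Minimal prompt-regression harness. Because outputs are non-deterministic,
# each case asserts on properties of the output, not on an exact string.

REGRESSION_CASES = [
    {"input": "Cancel my subscription",
     "must_contain": ["cancel"],              # key concept must survive
     "max_words": 120,                        # catches truncation/verbosity drift
     "must_not_contain": ["I cannot help"]},  # catches over-cautiousness
]

def check_output(case, output):
    """Return a list of human-readable failures for one regression case."""
    failures = []
    low = output.lower()
    for term in case["must_contain"]:
        if term not in low:
            failures.append(f"missing required term: {term!r}")
    for term in case["must_not_contain"]:
        if term.lower() in low:
            failures.append(f"forbidden phrase present: {term!r}")
    if len(output.split()) > case["max_words"]:
        failures.append("output exceeds word budget")
    return failures

def run_suite(call_llm):
    """Run every case; return {input: failures} for cases that regressed."""
    report = {}
    for case in REGRESSION_CASES:
        failures = check_output(case, call_llm(case["input"]))
        if failures:
            report[case["input"]] = failures
    return report
```

Run this suite in CI on every prompt change, with the baseline cases frozen.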

Model regression occurs when the model itself changes — and this can happen without any action on your part. LLM API providers ship silent updates. GPT-4o in January is not identical to GPT-4o in April. The behavior shifts without version bumps or changelogs. Model regression manifests as output distribution changes: different default lengths, different refusal rates, different formatting conventions. The best defense is behavioral fingerprinting: a test suite that captures key output characteristics on a fixed prompt set, run continuously in production.
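
Behavioral fingerprinting can start as a handful of cheap statistics computed over a fixed prompt set. A sketch under that assumption; the refusal markers and the 30% relative tolerance are placeholders you would calibrate against your own baseline:

```python
import statistics

REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable")

def fingerprint(outputs):
    """Summarize behavioral characteristics of a batch of model outputs."""
    lengths = [len(o.split()) for o in outputs]
    refusals = sum(1 for o in outputs if o.lower().startswith(REFUSAL_MARKERS))
    markdown = sum(1 for o in outputs if o.lstrip().startswith(("-", "#", "*")))
    return {
        "median_words": statistics.median(lengths),
        "refusal_rate": refusals / len(outputs),
        "markdown_rate": markdown / len(outputs),
    }

def drifted(baseline, current, rel_tol=0.3):
    """Flag any characteristic that moved more than rel_tol relative to baseline."""
    alerts = []
    for key, base in baseline.items():
        cur = current[key]
        if base == 0:
            if cur > 0:
                alerts.append(key)
        elif abs(cur - base) / base > rel_tol:
            alerts.append(key)
    return alerts
```

Run the fixed prompt set daily, fingerprint the outputs, and alert on `drifted`; a silent provider update shows up as a cluster of flagged characteristics with no deploy on your side.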

Infrastructure failure is often the actual culprit when teams assume model failure. Retrieval components that return stale or irrelevant context, vector databases with corrupted indexes, tool schemas that no longer match what the model was told to expect, context assembly bugs that silently truncate important information — any of these produces outputs that look like model failure but aren't. Research indicates 42% of production hallucinations stem from retrieval and context assembly failures, not the model itself.

Triage sequence: Check infra first (new deployments, config changes, dependency updates in the last 24 hours). Then check data (input distribution shift). Then check prompts (any prompt changes in the last 48 hours). Save model regression for last — it's real but requires more evidence to confirm.
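
That sequence can be encoded as a first-pass triage helper. A sketch in which the signal names are hypothetical booleans an on-call engineer fills in while investigating; the ordering encodes the prior likelihood above:

```python
def triage(signals):
    """Walk the layers in order of prior likelihood; return the first suspect.
    `signals` is a dict of booleans gathered during investigation."""
    checks = [
        ("infrastructure", signals.get("deploy_or_config_change_24h", False)),
        ("data_drift",     signals.get("input_distribution_shift", False)),
        ("prompt",         signals.get("prompt_change_48h", False)),
        ("model",          signals.get("output_fingerprint_shift", False)),
    ]
    for layer, suspicious in checks:
        if suspicious:
            return layer
    return "unknown"
```

Even as a checklist rather than code, the value is the fixed order: it stops the 2 AM instinct of blaming the model first.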

The Escalation Decision Tree

Once you've identified the root cause, the next question is where to fix it. The decision is more consequential than it seems: escalating to your LLM provider is slow (24-48 hours minimum), expensive in goodwill, and often unnecessary.

Fix locally first. The majority of incidents — across all four root cause categories — can be resolved without provider involvement:

  • Prompt regression → revert the prompt change or apply a targeted patch
  • Data drift → add input validation, normalize edge-case inputs, or add a routing layer that handles the new distribution separately
  • Infrastructure failure → restore the retrieval component, correct the schema, fix the context assembly bug
  • Mild model regression → adjust examples, add explicit constraints, or change temperature

For most teams, a prompt change can be tested in minutes and deployed in under 30 minutes. This is almost always the right first move.

Fall back within the same provider. If local fixes don't resolve it and you need to restore service quickly, switch model tiers. GPT-4o → GPT-4o-mini keeps prompt compatibility high while providing a different behavioral profile. Claude Opus → Claude Sonnet is a similar swap. This restores service and typically buys you 3-5 hours to investigate the root cause without ongoing user impact.

Route to an alternative provider. If the primary provider is experiencing degraded performance or a partial outage, an automatic fallback to a backup provider is appropriate at a sustained error rate above 5-10%. This requires pre-built prompt compatibility across providers — a worthwhile investment to make before an incident, not during one.
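
The sustained-error-rate trigger can be implemented as a sliding window over recent call outcomes. A sketch, with the 8% threshold picked from the 5-10% band above and the window size illustrative:

```python
from collections import deque

class ProviderRouter:
    """Route to a backup provider when the primary's recent error rate
    stays above a threshold."""

    def __init__(self, window=200, threshold=0.08):
        self.recent = deque(maxlen=window)  # True = success, False = error
        self.threshold = threshold

    def record(self, success):
        self.recent.append(success)

    def error_rate(self):
        if not self.recent:
            return 0.0
        return 1 - sum(self.recent) / len(self.recent)

    def pick(self):
        # Require a full window before failing over, so a few early
        # errors don't flap the route back and forth.
        if len(self.recent) == self.recent.maxlen and self.error_rate() > self.threshold:
            return "backup"
        return "primary"
```

The hard part isn't this logic; it's the pre-built prompt compatibility that makes "backup" actually serviceable.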

Escalate to the provider. File a support ticket only after exhausting the above tiers. Provide: model version, temperature settings, a minimal reproducible example (or your closest approximation given non-determinism), and the specific behavioral shift you're observing with before/after examples. Provider engineering teams can investigate silent model updates or infrastructure issues on their side. Expect 24-48 hours for non-trivial issues.

The key insight: most escalation decisions are made prematurely. Teams that jump to "contact OpenAI" when they could fix the prompt in 20 minutes waste hours waiting for a response to a problem they already had the tools to solve.

Evals as Production Monitoring

Traditional monitoring tells you the system is running. You need a separate layer to tell you the system is working well. This is where evaluations — evals — become operational infrastructure, not just a testing tool.

The pattern: sample 1-5% of live production traffic and run it through an automated scoring pipeline. An LLM-as-a-judge setup uses a separate model to assess your primary model's outputs against defined criteria — correctness, factual grounding, tone, safety compliance, adherence to format. You define the scorecard before launch. The eval scores become time-series metrics you alert on.

This gives you something powerful: a quality-based alert that fires when output accuracy drops below a threshold, before users start filing tickets. Instead of discovering the incident at 2 AM from a Slack message, you get paged by a metric crossing a line you deliberately drew.

The operational setup requires:

  • A defined eval scorecard with criteria that map to user value (not just generic "helpfulness")
  • A sampling mechanism on production traffic
  • An async scoring pipeline (adding latency to the critical path is counterproductive)
  • Alert thresholds calibrated on baseline performance, not arbitrary percentages

Platforms like LangSmith, Langfuse, Arize, and Datadog's LLM observability module make this composable. The important thing isn't the platform — it's the discipline of defining quality criteria before you need them.
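
Stripped of any particular platform, the sampling-plus-async-scoring loop is small. A sketch in which `judge` stands in for your LLM-as-a-judge call (returning a 0-1 quality score) and the sample rate and alert threshold are illustrative:

```python
import random

SAMPLE_RATE = 0.02       # score ~2% of live traffic
ALERT_THRESHOLD = 0.75   # calibrated against baseline, not arbitrary

def maybe_enqueue_for_eval(request_id, prompt, output, queue):
    """Called on the serving path; sampling is cheap, scoring happens async."""
    if random.random() < SAMPLE_RATE:
        queue.append({"id": request_id, "prompt": prompt, "output": output})

def score_batch(batch, judge):
    """Run the async scoring pass; `judge` is the LLM-as-a-judge call."""
    return [judge(item["prompt"], item["output"]) for item in batch]

def quality_alert(scores):
    """Decide whether the window's mean score crosses the alert line."""
    if not scores:
        return False
    return sum(scores) / len(scores) < ALERT_THRESHOLD
```

The mean score per window is the time-series metric you chart and page on; the serving path only ever pays the cost of one `random()` call and an append.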

The AI Incident Postmortem

AI postmortems fail when they're adapted from software postmortem templates. The root cause section of a software postmortem asks "which code changed?" An AI postmortem needs to ask six things simultaneously:

  1. Which model version was in production?
  2. Which prompt version?
  3. Which retrieval component version, with which index?
  4. Which tool schema version?
  5. What was the input distribution doing in the 48 hours before the incident?
  6. Were there any silent upstream changes (API updates, data pipeline changes)?

Without all six, you have an incomplete picture and will miss the actual cause.
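
One way to make "all six" enforceable is to capture them in a single structured record attached to the incident at detection time. A sketch; every field value shown is a placeholder, and the field names would be wired to your own registries:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class IncidentSnapshot:
    """One record answering the six postmortem questions, captured at detection."""
    model_version: str             # 1. which model was in production
    prompt_version: str            # 2. which prompt
    retriever_version: str         # 3a. which retrieval component
    index_version: str             # 3b. with which index
    tool_schema_version: str       # 4. which tool schema
    input_distribution_notes: str  # 5. what inputs were doing pre-incident
    upstream_changes: str          # 6. silent API / data-pipeline updates found

snapshot = IncidentSnapshot(
    model_version="provider-model-2026-01-15",   # placeholder values throughout
    prompt_version="support-bot-prompt@v41",
    retriever_version="retriever@2.3.0",
    index_version="docs-index-2026-01-10",
    tool_schema_version="tools@v7",
    input_distribution_notes="ES-language share rose 5% -> 38% over 48h",
    upstream_changes="none found in change review",
)
```

A frozen dataclass is a deliberate choice: the snapshot describes what was true at detection and shouldn't be mutated as the investigation evolves.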

The structure that works in practice:

Scope section — precise numbers: how many requests were affected, which user cohort, the time window, and the first detection method (automated alert vs. support ticket is itself a signal about your monitoring maturity).

Request-level evidence — pull 5-10 specific failing request IDs with their full session context: every LLM call, every tool invocation, the exact prompt that was sent, the exact output received, and what a correct output would have looked like.

Failure classification — which layer failed, the supporting evidence, and — importantly — what you ruled out and why. The negative evidence prevents future confusion.

Timeline — delta times matter: when was the change deployed, when was the first bad output (if you can determine it), when was detection, when was escalation, when was mitigation, when was recovery confirmed. The gap between deployment and detection is where most teams have room to improve.

Process improvement — the one question most postmortems skip: "what would have caught this faster?" Not just "what would have prevented it," but what monitoring, eval, or alerting gap allowed this to run undetected for as long as it did.

Run the postmortem within 48 hours of resolution. Memory degrades. The conversation people had in the incident channel at 3 AM contains crucial context that won't be recalled accurately two weeks later.

What Canary Deployments Look Like for LLMs

One operational pattern worth building before you need it: canary deployments for model and prompt changes.

The mechanics are similar to service canary releases — route 1% of traffic to the new version, monitor key metrics, expand to 5%, 20%, 50%, 100% over hours. What's different is the rollback trigger: instead of error rate and latency, you're watching eval scores, hallucination rate, and output distribution.

Automatic rollback should fire when:

  • Quality eval scores drop more than 5% versus the baseline
  • Latency increases more than 20% at p99
  • Any safety policy violation is detected in the sampled traffic
  • Cost per query increases more than 15%
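
These triggers translate directly into a rollback check that runs on each canary evaluation window. A sketch using the thresholds above; the metric dict keys are illustrative:

```python
def should_rollback(baseline, canary):
    """Evaluate the four automatic-rollback triggers against a baseline.
    Each metrics dict carries: eval_score (0-1), p99_latency_ms,
    safety_violations (count in sampled traffic), cost_per_query."""
    reasons = []
    if canary["eval_score"] < baseline["eval_score"] * 0.95:
        reasons.append("eval score down more than 5%")
    if canary["p99_latency_ms"] > baseline["p99_latency_ms"] * 1.20:
        reasons.append("p99 latency up more than 20%")
    if canary["safety_violations"] > 0:
        reasons.append("safety policy violation in sampled traffic")
    if canary["cost_per_query"] > baseline["cost_per_query"] * 1.15:
        reasons.append("cost per query up more than 15%")
    return reasons  # non-empty list means roll back, with the reasons logged
```

Returning the reasons rather than a bare boolean matters in practice: the rollback event lands in the incident channel with its justification attached.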

Shadow mode is the precursor step: route production traffic to both old and new versions simultaneously, but only serve the old version's outputs to users. Log and score both. This gives you real production data on the new version without any user impact before you begin the canary rollout.

Both patterns require that your eval infrastructure is running in production and producing scores fast enough to trigger rollback within minutes. If scoring is too slow or too expensive, you lose the safety net.

What This Means for Your On-Call Rotation

The engineers on AI on-call need different skills than traditional SREs. They need to understand prompt mechanics well enough to make a hotfix under pressure. They need to know how the retrieval pipeline works, what a distribution shift looks like in a scatter plot, and how to read a session trace across multiple LLM calls.

This is a gap most organizations haven't closed. The traditional SRE runbook ("restart the service," "roll back the deployment," "page the DBA") doesn't translate. The skill set is closer to a data scientist who knows how to ship code, or a software engineer who understands statistical distributions.

For teams building this capability: start with session tracing (log every LLM call with full context), add a regression test suite for prompts, and instrument one eval metric in production. Those three investments make the first AI incident vastly more navigable than flying blind in the middle of the night.

The goal isn't perfect prevention — AI systems will behave unexpectedly. The goal is cutting the time from "something is wrong" to "I know exactly what and why" from hours to minutes.


Key Takeaways

  • Traditional runbooks assume error codes and bounded blast radius — AI failures are silent, diffuse, and often non-reproducible.
  • Triage in sequence: infrastructure → data drift → prompt regression → model regression. Don't blame the model first.
  • Most incidents are fixable locally; escalate to your LLM provider only after exhausting prompt fixes, tier switches, and provider fallback.
  • Evals are production monitoring infrastructure, not just a testing tool. Define your scorecard before launch, sample live traffic, and alert on quality degradation.
  • AI postmortems must answer six questions at once (model version, prompt version, retrieval component and index, tool schema, input distribution, silent upstream changes) and should run within 48 hours of resolution.
  • Canary deployments for LLMs use eval scores as rollback triggers, not just latency and error rate.