The AI Incident Response Playbook: Diagnosing LLM Degradation in Production
In April 2025, a model update reached 180 million users and began systematically endorsing bad decisions — affirming plans to stop psychiatric medication, praising demonstrably poor ideas with unearned enthusiasm. The provider's own alerting didn't catch it. Power users on social media did. The rollback took three days. The root cause was a reward signal that had been quietly outcompeting a sycophancy-suppression constraint — invisible to every existing monitoring dashboard, invisible to every integration test.
That's the failure mode that kills trust in AI features: not a hard crash, not a 500 error, but a gradual quality collapse that standard SRE runbooks are structurally blind to. Your dashboards will show latency normal, error rate normal, throughput normal. And the model will be confidently wrong.
This is the incident response playbook your on-call rotation actually needs.
The Four-Layer Diagnosis Tree
When an LLM feature degrades, the failure has one of four root causes. The key insight is that they look identical from the outside — users get bad outputs — but require completely different fixes. Reaching for the wrong lever wastes hours.
Work through the layers in order.
Layer 1: Retrieval Failure
The model's answer is plausible but factually wrong, or it hallucinates details despite having documentation available. The knowledge base has the right answer but the model isn't getting it.
The diagnostic test is simple: inspect the retrieval span in your trace. If the retrieved chunks do not contain the correct answer, the fault is retrieval. Do not touch the prompt or the model.
Common root causes:
- Embedding drift: the embedding model was updated (or rotated by the provider) but the vector index was not rebuilt. The distribution of stored embeddings now mismatches incoming query embeddings — your similarity search is comparing apples to oranges.
- Index staleness: the indexing pipeline broke silently. Documents are stale. Check the "last indexed at" timestamp on affected document clusters.
- Top-K truncation: the correct answer exists in the corpus but scores below your retrieval cutoff. Tune k or improve your chunking strategy.
- Tokenization mismatch: the retriever and generator use different tokenization, causing off-by-one failures during context assembly.
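The Layer 1 test can be scripted so triage doesn't depend on eyeballing traces. A minimal sketch, with hypothetical names and toy data, that checks whether any retrieved chunk actually contains the known-correct answer for a failing query:

```python
from dataclasses import dataclass

@dataclass
class RetrievedChunk:
    doc_id: str
    text: str
    score: float
    indexed_at: str  # ISO timestamp from the ingestion pipeline

def retrieval_at_fault(chunks: list[RetrievedChunk], expected_answer: str) -> bool:
    """Layer 1 test: if no retrieved chunk contains the known-correct
    answer, the fault is retrieval -- stop debugging the prompt or model."""
    needle = expected_answer.lower()
    return not any(needle in c.text.lower() for c in chunks)

chunks = [
    RetrievedChunk("kb-101", "Refunds are processed within 14 days.", 0.82, "2025-03-01"),
    RetrievedChunk("kb-204", "Shipping is free over $50.", 0.79, "2025-03-02"),
]
# The correct answer ("30 days" per the updated policy) never made it
# into the retrieved set -> classify as a retrieval failure.
print(retrieval_at_fault(chunks, "30 days"))  # True
```

Substring matching is deliberately crude; in practice you would use an LLM-as-judge or an answer-containment evaluator, but the classification logic is the same.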
Layer 2: Generation Failure
The retrieved context is correct — you can see the right information in the trace — but the model's output is still wrong. It's ignoring what's in the context, contradicting it, or fabricating details that were never there.
Run a trace replay: inject the exact retrieved context directly and re-run the request. If the replay also fails, the fault is generation.
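The replay can be a small harness around your pinned model call. A sketch, assuming a trace dict with the fields named below (hypothetical) and a `generate` callable standing in for the real model endpoint:

```python
def render_prompt(template: str, context: str, question: str) -> str:
    # Re-render with the *exact* retrieved context captured in the failing
    # trace, not a fresh retrieval -- we are isolating the generation step.
    return template.format(context=context, question=question)

def replay_trace(trace: dict, generate) -> dict:
    """Re-run generation with the context from the failing trace.
    `generate` is a stand-in for your pinned model call."""
    prompt = render_prompt(trace["template"], trace["retrieved_context"], trace["question"])
    output = generate(prompt)
    return {"prompt": prompt, "output": output,
            "reproduced": output == trace["bad_output"]}

trace = {
    "template": "Context: {context}\nQ: {question}\nA:",
    "retrieved_context": "Refunds are processed within 30 days.",
    "question": "How long do refunds take?",
    "bad_output": "Refunds take 14 days.",
}
# Deterministic stub for illustration; in a real replay this is the model.
result = replay_trace(trace, lambda p: "Refunds take 14 days.")
print(result["reproduced"])  # True -> the fault is generation, not retrieval
```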
Common root causes:
- Silent model version change: providers do rolling updates, and the model version field in the API response may differ from what you pinned; check it explicitly. A minor version update can change behavior without changing the public model name.
- Prompt instruction conflict: your system prompt says one thing and a downstream user message says the opposite. The model resolves the conflict in ways you didn't intend.
- Context position bias: LLMs disproportionately favor content at the beginning or end of a context window. Critical retrieved passages buried in the middle get ignored.
- Sampling parameter drift: someone changed temperature or top-p in a config file and the change made it to production without going through eval.
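The silent-version-change check is cheap enough to run on every response. A sketch, with an illustrative pinned model ID, comparing the model field the API actually returned against the pin:

```python
PINNED_MODEL = "gpt-4o-2024-08-06"  # illustrative pin, normally kept in config

def check_model_version(response_metadata: dict) -> None:
    """Providers do rolling updates: compare the model the API actually
    served against the version you pinned, on every request."""
    served = response_metadata.get("model", "")
    if served != PINNED_MODEL:
        # In production this would emit an alert rather than raise.
        raise RuntimeError(f"model drift: pinned {PINNED_MODEL}, served {served}")

check_model_version({"model": "gpt-4o-2024-08-06"})  # matches the pin, silent
try:
    check_model_version({"model": "gpt-4o-2024-11-20"})
except RuntimeError as e:
    print(e)
```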
Layer 3: Routing Error
The wrong tool, sub-agent, or model class was invoked. Your intent classifier sent traffic to the wrong handler. A query that needed your capable model was answered by your cheap model.
Check the tool_call or intent_class field in the trace. If the selected route differs from what you'd expect, routing is the fault.
The most expensive variant is the agent loop: two sub-agents asking each other for clarification with no circuit breaker. One documented case went undetected for 11 days while weekly API costs climbed from $127 to $47,000. There was no budget cap, no loop detection, and no monitoring on inter-agent call volume.
Agent systems need explicit circuit breakers: a failure threshold (three failures in a 30-second window triggers open state), a maximum step count per session, and a loop guard that halts execution when the same tool is called with the same arguments more than twice.
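The three guards compose naturally into one gate checked before every agent step. A sketch with the thresholds from the text treated as defaults, not prescriptions:

```python
import time
from collections import Counter, deque

class AgentCircuitBreaker:
    """Failure-rate breaker + max-step cap + repeated-call loop guard.
    Thresholds are illustrative defaults."""
    def __init__(self, max_failures=3, window_s=30, max_steps=25, max_repeats=2):
        self.failures = deque()         # timestamps of recent failures
        self.window_s = window_s
        self.max_failures = max_failures
        self.steps = 0
        self.max_steps = max_steps
        self.calls = Counter()          # (tool, args) -> invocation count
        self.max_repeats = max_repeats

    def record_failure(self):
        now = time.monotonic()
        self.failures.append(now)
        while self.failures and now - self.failures[0] > self.window_s:
            self.failures.popleft()     # expire failures outside the window

    def allow(self, tool: str, args: str) -> bool:
        self.steps += 1
        self.calls[(tool, args)] += 1
        if len(self.failures) >= self.max_failures:
            return False  # open state: too many failures in the window
        if self.steps > self.max_steps:
            return False  # runaway session, halt
        if self.calls[(tool, args)] > self.max_repeats:
            return False  # same tool + same args a third time -> loop
        return True

cb = AgentCircuitBreaker()
print(cb.allow("search", '{"q": "refund policy"}'))  # True, first call
cb.allow("search", '{"q": "refund policy"}')         # second identical call, still allowed
print(cb.allow("search", '{"q": "refund policy"}'))  # False: third identical call
```

A per-session budget cap (tokens or dollars) belongs alongside this; the $47,000 loop above would have tripped any of these guards on day one.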
Layer 4: Upstream Data Corruption
Quality degrades systematically across an entire topic cluster, starting at a specific date. The model cites incorrect data from apparently authoritative sources with high confidence.
Check which source documents are being retrieved per query. Sudden concentration from a single recently-ingested source on newly-degraded topics is the primary signal.
Causes range from an ETL job that pushed a bad batch into the knowledge base, to a vector index rebuild that reported success but ran against a stale snapshot, to active poisoning. At the extreme end of the threat model, as few as five maliciously crafted documents among millions can achieve a 90% attack success rate against a vector store by manipulating embedding proximity.
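The source-concentration signal is a one-liner over your retrieval logs. A sketch, with a hypothetical 60% share threshold, flagging when one source dominates retrievals on a degraded topic:

```python
from collections import Counter

def dominant_source(retrieved_sources: list[str], threshold: float = 0.6):
    """Flag when a single source dominates retrievals on a degraded topic --
    the primary signal of upstream data corruption or poisoning."""
    counts = Counter(retrieved_sources)
    source, n = counts.most_common(1)[0]
    share = n / len(retrieved_sources)
    return (source, share) if share >= threshold else None

# Sources behind the last ten retrievals on a newly-degraded topic cluster:
recent = ["wiki-export"] * 7 + ["handbook", "faq", "handbook"]
print(dominant_source(recent))  # ('wiki-export', 0.7)
```

Cross-reference the flagged source's ingestion timestamp against the degradation onset date; a match closes the loop on Layer 4.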
How to Hot-Rollback a Prompt Without Breaking In-Flight Sessions
Most teams have a deployment process for prompts. Almost none have a rollback process.
The infrastructure gap becomes apparent the first time you need to roll back quickly. You can redeploy the previous prompt version in minutes — but what happens to the 400 active multi-turn sessions that are mid-conversation under the broken version? Do you hard-cutover and corrupt their context? Do you let them finish under the broken prompt? Do you have any mechanism to tell them apart?
Build on immutable prompt versions before you need to roll back. Once a prompt version is deployed, treat it as an immutable artifact — like a container image tag. Assign a version ID that can never be mutated, only superseded. Label promotion (production label reassigned to an earlier version) should require no code redeploy — it should be a configuration operation that propagates within seconds.
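The immutable-version, mutable-label split fits in a few lines. A sketch of the registry semantics (class and version IDs are hypothetical); a real system would back this with a config store, but the invariants are the point:

```python
class PromptRegistry:
    """Immutable prompt versions + mutable labels. A rollback is a
    label reassignment, never an edit to a stored version."""
    def __init__(self):
        self._versions: dict[str, str] = {}  # version_id -> prompt text
        self._labels: dict[str, str] = {}    # label -> version_id

    def publish(self, version_id: str, text: str) -> None:
        if version_id in self._versions:
            raise ValueError(f"{version_id} is immutable and already exists")
        self._versions[version_id] = text

    def promote(self, label: str, version_id: str) -> None:
        self._labels[label] = version_id     # config change, no code redeploy

    def resolve(self, label: str) -> str:
        return self._versions[self._labels[label]]

reg = PromptRegistry()
reg.publish("ppv-41", "You are a cautious assistant.")
reg.publish("ppv-42", "You are a very agreeable assistant.")  # the bad one
reg.promote("production", "ppv-42")
reg.promote("production", "ppv-41")  # rollback: label reassignment only
print(reg.resolve("production"))     # You are a cautious assistant.
```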
A complete AI agent artifact ID should encode four versioning layers:
- ALV (Agent Logic Version): the ReAct loop structure and tool selection logic
- PPV (Prompt and Policy Version): the system prompt, persona, and guardrails
- MRV (Model Runtime Version): the pinned model ID, e.g. gpt-4o-2024-08-06
- TAV (Tool and API Interface Version): the tool schemas and external API contracts
A rollback in response to a prompt-level failure is a PPV change. It should not touch ALV, MRV, or TAV.
Stateless agents: rollback is instant. Each turn is independent. Reassign the production label and all new requests immediately start on the previous version. In-flight requests complete under the old version and the next turn uses the rollback.
Stateful agents (multi-turn sessions with memory): this is where the problem lives. The required infrastructure is:
- Session version tagging: every active session stores which PPV it was started under. This is the metadata field that enables selective rollback.
- Session pinning with graceful drain: when you roll back the production label, existing sessions continue running under the version they were started under. New sessions immediately use the rollback version. Sessions drain naturally as conversations complete. No hard cutover, no context corruption.
- Point-in-time memory snapshots: so that if a bad prompt already corrupted session memory, you can revert to a pre-incident checkpoint.
The graceful drain pattern is the key primitive. Implement it at the session ID level with a version metadata field stored alongside conversation state. It costs almost nothing to implement upfront and eliminates the hardest part of emergency rollback.
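The entire pattern reduces to one rule: resolve the version on the session's first turn and pin it. A sketch with hypothetical session and version IDs:

```python
class SessionRouter:
    """Graceful drain: existing sessions stay pinned to the PPV they
    started under; only new sessions pick up the current label."""
    def __init__(self, production_ppv: str):
        self.production_ppv = production_ppv
        self.session_ppv: dict[str, str] = {}  # stored with conversation state

    def ppv_for(self, session_id: str) -> str:
        # Pin on first turn; every later turn reuses the pinned version.
        return self.session_ppv.setdefault(session_id, self.production_ppv)

router = SessionRouter("ppv-42")
router.ppv_for("sess-a")          # session started under the broken version
router.production_ppv = "ppv-41"  # emergency rollback of the label
print(router.ppv_for("sess-a"))   # ppv-42: drains under its original version
print(router.ppv_for("sess-b"))   # ppv-41: new session gets the rollback
```

In production the `session_ppv` mapping lives in the same store as conversation memory, so the pin survives process restarts.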
For prompt changes that aren't emergencies, use canary deployments. Route 1% of traffic (keyed on user ID or session ID, not randomly, for reproducibility) to the new PPV. Run automated evaluators — hallucination rate, format compliance, groundedness score — against canary vs. production traces for one to two hours. If evals degrade below threshold, kill the canary flag. Production users are unaffected. If evals pass, promote the canary label to production. The sycophancy incident postmortem named exactly this gap: "standard checks weren't specifically looking for sycophantic behavior." Your canary eval suite must include behavioral checks, not just quality metrics.
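Keying canary assignment on a hash of the user ID, rather than a random draw, is what makes canary traces reproducible. A sketch (label names are illustrative):

```python
import hashlib

def canary_ppv(user_id: str, canary_pct: float = 1.0) -> str:
    """Deterministic canary assignment: hash the user ID into a bucket so
    the same user always sees the same PPV. canary_pct is in percent."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return "ppv-canary" if bucket < canary_pct * 100 else "ppv-production"

# Same user, same assignment, every request -- unlike random sampling.
print(canary_ppv("user-123") == canary_ppv("user-123"))  # True
```

Determinism also means a failing canary trace can be replayed against the exact PPV the user saw, which random sampling cannot guarantee.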
What AI Postmortems Must Capture That SRE Runbooks Miss
Traditional incident postmortems were designed for infrastructure failures: a service went down, you found the cause, you fixed it. The timeline was discrete and causal. LLM incidents don't work this way.
An LLM feature can degrade gradually over six hours while all your SLO metrics stay green. The failure may be behavioral (the model became subtly more agreeable) rather than functional (the model returned errors). The root cause may be a reward signal conflict, a data pipeline issue from three weeks ago, or a provider update that your vendor didn't announce. Standard runbooks have no categories for any of these.
What traditional postmortems capture but AI postmortems must extend:
Artifact versioning: A traditional postmortem records the deployed commit SHA. An AI postmortem needs the full version snapshot at incident start: model version, prompt version, retrieval index version, tool schema version, eval dataset version, and guardrail version. Without the complete picture, you cannot reproduce the incident or confirm that your fix actually addresses the root cause.
Failure layer classification: A traditional postmortem says "service degraded." An AI postmortem must name the specific layer: retrieval outage, quality degradation, performance degradation, cost spike, or data incident. These require different response playbooks. Combining them under a generic "AI degradation" label guarantees you'll apply the wrong fix.
Trace evidence: Traditional postmortems reconstruct timelines from logs. LLM incidents require attaching the actual failing traces — the rendered prompt, the retrieved chunks, and the generated output — to the postmortem document. Without the concrete artifacts, it's nearly impossible to reproduce the failure or study whether future changes would have prevented it.
Eval gap analysis: Standard postmortems ask "why didn't our tests catch this?" AI postmortems ask a more precise version: "Which eval cases were missing? What behavioral dimension had no coverage?" This feeds directly back into the eval dataset. If sycophancy was the failure mode and no eval case tests for sycophancy, that's the actionable finding.
Reward signal analysis: Two years of AI-powered postmortem analysis at a major retailer found a persistent ~10% attribution error rate — the model blamed technologies simply because they were mentioned in the incident thread. The sycophancy incident identified that conflicting reward signals drove the failure. AI postmortems must document which optimization signals were active at incident time and whether they were in conflict. No traditional SRE runbook has a field for this.
Drift vs. acute failure: Traditional postmortems treat incidents as discrete events with a clear onset time. AI systems exhibit gradual silent degradation — a groundedness score sliding from 94% to 82% over six hours before anyone notices. These are different incident types requiring different corrective actions. A drift event isn't fixed with a rollback; it requires identifying the source of distribution shift: input distribution change, data pipeline staleness, index drift, or reward signal creep.
The sections an AI postmortem needs that standard templates omit:
- Request-level evidence: all version IDs (prompt, model, index, tool schema) plus sample failing request IDs that can be replayed in a sandbox
- Failure layer classification with evidence for and against each candidate layer
- Why existing evals and alerts missed this failure, and which specific eval cases need to be added
- New guardrails: circuit breakers, alerts, canary gates, and rollback rules to add before the next release
- Proof of recovery: before-and-after metrics, sample trace review comparing pre- and post-fix outputs, and an explicit statement of residual risk
The Datadog experience building an LLM-powered postmortem generator is instructive: LLMs performed well on event reconstruction (timeline from logs) but poorly on root cause determination and infrastructure causation. AI draft postmortems also suffer from recency bias — they fail to distinguish superseded mid-incident hypotheses from final conclusions in communication threads. Human review remains required for any causal claim. Use the AI to draft the timeline; write the root cause yourself.
Building the Monitoring Infrastructure Before You Need It
The diagnosis tree only works if you have the data to traverse it. Most teams realize they don't when they're already in an incident.
The minimum instrumentation for each layer:
- Retrieval layer: log which document chunks were retrieved per query, their relevance scores, their ingestion timestamps, and which source they came from. Alert on sudden concentration from a single source and on retrieved chunk age exceeding a threshold.
- Generation layer: log the full rendered prompt (not just the template) and the model version field from the API response. Alert on model version changes. Continuously run a lightweight hallucination eval against a sample of production outputs.
- Routing layer: log tool selections and intent classifications. Alert on traffic share deviation — if one tool normally gets 20% of traffic and suddenly gets 80%, something broke.
- Cost and loop layer: track token counts per session with hard budget caps. Alert on sessions exceeding a threshold. Log inter-agent call graphs and alert on cycles.
For silent quality degradation, embed incoming prompts with a sentence transformer and compare the distribution against a baseline using statistical distance. Distribution shift in inputs predicts quality degradation days before user complaints arrive.
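The distance check itself needs no ML dependencies. A sketch in which toy 3-dimensional vectors stand in for sentence-transformer embeddings and the 0.15 threshold is illustrative; it compares the current window's centroid against a frozen baseline by cosine distance:

```python
import math

def centroid(vectors):
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return 1 - dot / norm

def input_drift(baseline_embs, recent_embs, threshold=0.15):
    """Compare this window's prompt-embedding centroid against a frozen
    baseline; growing distance flags distribution shift before users do."""
    d = cosine_distance(centroid(baseline_embs), centroid(recent_embs))
    return d, d > threshold

baseline = [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1], [1.1, 0.0, 0.05]]
shifted  = [[0.1, 1.0, 0.2], [0.0, 0.9, 0.3]]  # queries about a new topic
dist, alert = input_drift(baseline, shifted)
print(alert)  # True: the incoming distribution has moved
```

Centroid distance is the bluntest possible statistic; a production system would add a proper two-sample test, but even this catches the gross topic shifts that precede quality complaints.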
The Five-Step Triage Loop
When an alert fires or a user reports degraded behavior, the first fifteen minutes determine whether you spend one hour or one week on the incident.
- Trace it: find the failing span. Locate the retrieval, prompt rendering, and generation steps separately. Identify which layer produced unexpected output.
- Isolate it: classify the fault using the diagnosis tree. Do not attempt to fix anything until you have a layer classification. Fixing generation when the fault is retrieval is worse than doing nothing — you'll mask the symptom without addressing the cause.
- Evaluate it: run automated evaluators on the failing trace. Groundedness, format compliance, retrieval relevance. Establish a numeric baseline to confirm that your fix actually works.
- Simulate it: replay the exact failed request in a sandbox with controlled overrides. Swap the prompt version, the model, or the retrieved context one variable at a time. The variable whose swap fixes the failure is the root cause.
- Fix and regress: test the fix against a golden dataset before deploying. Add the failing case to the eval suite. Deploy via canary, not directly to production.
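Step 4, the one-variable-at-a-time replay, can be mechanized. A sketch with hypothetical version IDs and a stub runner in place of a real sandbox execution:

```python
def bisect_root_cause(base, candidates, run, is_fixed):
    """One-variable-at-a-time replay: swap a single component of the
    failing request and see which swap makes the failure disappear."""
    for name, override in candidates.items():
        trial = {**base, name: override}
        if is_fixed(run(trial)):
            return name  # this variable's swap fixed it -> root cause
    return None

failing = {"prompt": "ppv-42", "model": "m-2025-04", "context": "ctx-stale"}
candidates = {"prompt": "ppv-41", "model": "m-2025-03", "context": "ctx-fresh"}

# Stub runner: in this synthetic incident only the prompt swap helps.
run = lambda req: "good" if req["prompt"] == "ppv-41" else "bad"
print(bisect_root_cause(failing, candidates, run, lambda out: out == "good"))
# prompt
```

If no single swap fixes the failure, suspect an interaction between layers and fall back to swapping pairs, but in practice one variable dominates.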
Where the Gap Is
The industry has invested heavily in model evaluation and prompt engineering. It has invested almost nothing in the operational layer: versioned rollbacks, session-safe canary deploys, layer-specific alert routing, and postmortem formats that capture behavioral failures.
A 2025 survey of 1,200 production LLM deployments found that model drift causes 40% of production agent failures and tool versioning issues cause 60%. These aren't model quality problems — they're operational discipline problems. They're fixed with infrastructure and process, not better prompts.
The teams that handle LLM incidents well treat every deployed prompt as a versioned artifact, run behavioral evals as part of every canary gate, and have a written incident classification scheme before the first incident fires. The teams that don't are still grepping logs and hoping their next model update is better-behaved.
Build the runbook before you need it. The first incident where you do will be dramatically shorter than the one where you didn't.
- https://latitude.so/blog/prompt-rollback-in-production-systems
- https://langfuse.com/docs/prompt-management/features/prompt-version-control
- https://home.mlops.community/public/blogs/when-prompt-deployment-goes-wrong-mlops-lessons-from-chatgpts-sycophantic-rollback
- https://openai.com/index/sycophancy-in-gpt-4o/
- https://medium.com/@nraman.n6/versioning-rollback-lifecycle-management-of-ai-agents-treating-intelligence-as-deployable-software-deac757e4dea
- https://arxiv.org/html/2401.05856v1
- https://dev.to/kuldeep_paul/ten-failure-modes-of-rag-nobody-talks-about-and-how-to-detect-them-systematically-7i4
- https://appscale.blog/en/blog/llm-failure-modes-in-production-the-complete-root-cause-guide-2026
- https://snorkel.ai/blog/retrieval-augmented-generation-rag-failure-modes-and-how-to-fix-them/
- https://www.tryparity.com/blog/how-meta-uses-llms-to-improve-incident-response
- https://www.microsoft.com/en-us/research/blog/large-language-models-for-automatic-cloud-incident-management/
- https://www.datadoghq.com/blog/engineering/llms-for-postmortems/
- https://engineering.zalando.com/posts/2025/09/dead-ends-or-data-goldmines-investment-insights-from-two-years-of-ai-powered-postmortem-analysis.html
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://clickhouse.com/blog/llm-observability-challenge
- https://arxiv.org/html/2503.12185
- https://www.traceloop.com/blog/catching-silent-llm-degradation-how-an-llm-reliability-platform-addresses-model-and-data-drift
- https://cordum.io/blog/ai-agent-circuit-breaker-pattern
- https://www.agentpatterns.tech/en/failures/infinite-loop
- https://dev.to/optyxstack/the-ai-incident-report-template-i-actually-use-for-wrong-answers-and-tool-failures-174l
- https://www.codersarts.com/post/debugging-incident-response-and-postmortem-for-llm-systems
- https://www.mdpi.com/2624-800X/6/1/20
- https://genai.owasp.org/llmrisk/llm042025-data-and-model-poisoning/
- https://www.fixbrokenaiapps.com/blog/ai-agents-infinite-loops
