The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook
Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.
This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software — check the logs, find the stack trace, roll back the deploy — break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found that operational toil rose for the first time in five years, from 25% to 30%, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.
The Failure Modes Your Runbook Doesn't Cover
Traditional incident response assumes a binary world: the service is up or it's down. Requests succeed or they fail. When something breaks, there's an error code, a stack trace, a deploy to correlate with. AI features shatter every one of these assumptions.
The new incident categories that AI introduces include:
- Silent semantic degradation. The model returns syntactically valid responses that are factually wrong, off-topic, or subtly harmful. Your health checks pass. Your latency SLOs are met. But the AI just told a customer their refund was processed when it wasn't.
- Provider-side silent changes. Your LLM provider updates their model weights or deprecates an endpoint, and your carefully tuned prompts start producing different outputs. OpenAI retired GPT-4o, GPT-4.1, and several other models in February 2026 — teams that hardcoded model names discovered their pipelines breaking with 404s, sometimes days after the change.
- Prompt injection attacks. Malicious content enters your system through user input or retrieval pipelines and hijacks the model's behavior. Unlike SQL injection, there's no WAF rule that reliably catches it. The attack surface is the natural language interface itself.
- Embedding index corruption. A schema change upstream silently degrades your RAG pipeline's retrieval quality. The system keeps responding, but it's now pulling from stale or irrelevant context. This can persist for days before anyone connects a data pipeline change to degraded AI output.
- Stochastic regression. A prompt change that improved average output quality also introduced a long tail of terrible responses for edge cases. Your A/B test showed a statistically significant improvement, but it missed the 2% of queries where the new prompt catastrophically fails.
None of these produce a stack trace. None trigger a traditional alert. And none can be fixed by rolling back a deployment — because the "deployment" might be a model update that happened on someone else's infrastructure.
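Silent semantic degradation in particular calls for a different kind of health check: one that probes meaning, not liveness. Below is a minimal sketch of the idea, with hypothetical probe questions and a crude token-overlap score standing in for a real embavior-embedding or LLM-judge metric; every name here is illustrative, not a real API.

```python
# Sketch of a semantic health check. A liveness probe only proves the
# endpoint answers; this probe also checks that answers to known
# questions still resemble golden references. All names are hypothetical.

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word sets - a crude stand-in
    for an embedding-based or LLM-judged semantic score."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 1.0

# Hypothetical golden question/expected-answer pairs.
GOLDEN_PROBES = [
    ("What is the refund window?",
     "Refunds are available within 30 days of purchase."),
]

def semantic_health_check(ask_model, threshold: float = 0.5) -> dict:
    """Run golden probes through the model; report unhealthy if any
    answer drifts below the similarity threshold, even though
    HTTP-level checks would still pass."""
    failures = []
    for question, expected in GOLDEN_PROBES:
        answer = ask_model(question)
        score = token_overlap(answer, expected)
        if score < threshold:
            failures.append({"question": question, "score": round(score, 2)})
    return {"healthy": not failures, "failures": failures}
```

In practice `ask_model` would be your production inference path, and the scorer would be whatever your evaluation stack already uses; the design point is that this check runs on a schedule next to your liveness probes, not only in CI.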
Why "The AI Is Being Weird" Is Now a Legitimate Incident Report
Every on-call engineer who supports AI features has received this ticket: "The AI is being weird." In a deterministic system, you'd close this as insufficient information. In an AI system, this vague report might be the only signal you get before a silent degradation becomes a customer-facing disaster.
The challenge is triage. When a user reports that the AI is "being weird," you need to distinguish between at least four possibilities:
- Normal variance. LLMs are non-deterministic by design. Some outputs will be surprising but within acceptable bounds.
- Prompt regression. A recent change to system prompts, retrieval context, or guardrails shifted the output distribution in ways that users notice but metrics don't capture.
- Model-level degradation. The underlying model's behavior changed — either through a provider update or through data drift in fine-tuned models.
- Active attack. Someone is exploiting the system through prompt injection or adversarial inputs.
Traditional escalation paths don't work here. You can't hand this to the database team or the infrastructure team. You need someone who understands the full AI stack: the prompt engineering, the retrieval pipeline, the model behavior, and the evaluation framework. Most SRE teams in 2026 still don't have this expertise, and that's the core of the on-call burden shift.
The Verification Tax: More AI, More Toil
The intuition is that AI should reduce on-call burden — automate triage, suggest root causes, draft postmortems. These capabilities are real. But the data tells a different story: organizations that deployed AI features saw operational toil increase, not decrease.
The industry has started calling this the "verification tax." Shipping an AI feature doesn't just add the feature's operational burden — it adds a new layer of monitoring, evaluation, and human verification on top of everything else. A responsible AI deployment requires:
- Continuous evaluation pipelines that run golden test sets against production outputs and alert on drift.
- Human review queues for edge cases that automated evaluation flags as ambiguous.
- Shadow deployments for prompt or model changes, comparing semantic outputs before routing real traffic.
- Behavioral logging that captures not just inputs and outputs but the full chain of tool calls, retrieval results, and intermediate reasoning.
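The behavioral-logging item above is the one teams most often under-build. A sketch of what "the full chain" means in practice, using an invented trace schema (field names and event kinds are assumptions, not a standard):

```python
# Sketch of behavioral logging with a hypothetical trace schema: record
# not just input/output but the retrieval results and tool calls in
# between, so an incident responder can replay what the model saw.
import json
import time
import uuid
from dataclasses import dataclass, field, asdict

@dataclass
class TraceEvent:
    kind: str      # e.g. "retrieval" | "tool_call" | "model_output"
    payload: dict
    ts: float = field(default_factory=time.time)

@dataclass
class RequestTrace:
    user_input: str
    trace_id: str = field(default_factory=lambda: str(uuid.uuid4()))
    events: list = field(default_factory=list)

    def record(self, kind: str, **payload):
        self.events.append(TraceEvent(kind, payload))

    def to_json(self) -> str:
        # asdict() recurses into the nested event dataclasses.
        return json.dumps(asdict(self))

# Example: one request's full behavioral trail.
trace = RequestTrace(user_input="Where is my refund?")
trace.record("retrieval", doc_ids=["kb-42"], scores=[0.91])
trace.record("tool_call", name="lookup_order", args={"order_id": "A1"})
trace.record("model_output", text="Your refund was processed.")
```

The payoff comes during triage: when a user reports a bad answer, the trace shows whether retrieval returned stale context or a tool call returned garbage before blaming the model itself.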
Each of these is operationally expensive. And because 69% of AI-powered decisions still require human verification, you haven't actually removed the human from the loop — you've added AI to the loop and kept the human there too. For a 250-engineer organization, this overhead translates to millions in annual productivity cost.
Teams that manage this well treat AI monitoring as its own discipline, not something bolted onto existing infrastructure monitoring. They invest in dedicated evaluation infrastructure, separate on-call rotations for AI-specific issues, and runbooks that account for non-deterministic behavior.
Rebuilding the Playbook: What Actually Works
Two years of watching teams struggle with AI incident response have surfaced clear patterns about what works.
Version everything, not just code. Every change to prompts, system instructions, retrieval configurations, model versions, and guardrail rules needs to be versioned and deployable with rollback capability. When an AI incident occurs, your first question should be "what changed?" — and you need the tooling to answer that across the entire AI stack, not just the application code.
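As a minimal sketch of the idea (illustrative names, not a real registry product): an append-only history of config snapshots, where rollback is itself a deploy, so "what changed?" is always answerable from the version log.

```python
# Minimal sketch of a prompt/config registry with rollback. Names are
# illustrative. The point: every prompt, model pin, and guardrail
# setting lives in one versioned, append-only history.

class PromptRegistry:
    def __init__(self):
        self._versions = []  # append-only history of config snapshots

    def deploy(self, config: dict) -> int:
        """Record a new immutable config version; return its number."""
        self._versions.append(dict(config))
        return len(self._versions)

    def current(self) -> dict:
        return self._versions[-1]

    def rollback(self) -> dict:
        """Revert by re-deploying the previous snapshot, so the
        rollback itself appears in the history."""
        if len(self._versions) < 2:
            raise RuntimeError("no earlier version to roll back to")
        self._versions.append(dict(self._versions[-2]))
        return self.current()
```

Real implementations usually back this with git or a feature-flag system; what matters is that model pins and guardrail rules go through the same pipeline as prompts, not a side channel.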
Build semantic baselines, not just performance baselines. Traditional monitoring tracks latency, error rate, and throughput. AI monitoring needs to additionally track output quality metrics: hallucination rate, factual accuracy against known-good answers, safety compliance scores, and task completion rates. These metrics need baselines and alerting thresholds, just like your p99 latency.
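Concretely, that can look like an SLO table for quality metrics, checked the same way a p99 latency threshold would be. The baselines and thresholds below are invented for illustration; yours come from your own historical eval data.

```python
# Hedged sketch: quality metrics treated like latency SLOs.
# Baselines/thresholds are illustrative, not recommendations.

SEMANTIC_BASELINES = {
    # metric: (historical baseline, alert threshold); higher is worse
    "hallucination_rate": (0.02, 0.05),
    "safety_violation_rate": (0.001, 0.005),
}

def check_semantic_slos(window_metrics: dict) -> list:
    """Return alert strings for any quality metric past its threshold,
    exactly as a p99 latency check would for performance."""
    alerts = []
    for metric, (baseline, threshold) in SEMANTIC_BASELINES.items():
        value = window_metrics.get(metric, 0.0)
        if value > threshold:
            alerts.append(
                f"{metric}={value:.3f} exceeds {threshold}"
                f" (baseline {baseline})"
            )
    return alerts
```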
Implement the 30-day alert rule. Teams that practiced deleting any alert not acted upon within 30 days reduced their mean time to acknowledge (MTTA) by 40%. This is even more critical for AI systems, where alert fatigue from noisy semantic evaluations can mask genuine degradation. If an alert fires constantly and nobody acts on it, it's not monitoring — it's noise.
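The 30-day rule can be automated as a periodic pruning pass over alert records. The schema below (each alert has a `last_fired` timestamp and a possibly-absent `last_acknowledged`) is an assumption for illustration:

```python
# Sketch of the 30-day rule as a pruning pass over alert records.
# Record schema is hypothetical; adapt to your alerting backend.
from datetime import datetime, timedelta, timezone

def stale_alerts(alerts: list, now: datetime,
                 max_age_days: int = 30) -> list:
    """Alerts that have fired recently but were not acknowledged
    within the window are deletion candidates: noise, not monitoring."""
    cutoff = now - timedelta(days=max_age_days)
    return [
        a["name"] for a in alerts
        if a["last_fired"] >= cutoff
        and (a["last_acknowledged"] is None
             or a["last_acknowledged"] < cutoff)
    ]
```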
Create AI-specific runbooks. Your existing runbooks assume deterministic failure modes. AI incidents need their own decision trees:
- Is this a model-level issue or an application-level issue?
- Can we reproduce the bad output with the same input, or is it stochastic?
- Did the model provider make any recent changes?
- Are our evaluation metrics showing drift, or is this an isolated report?
- Can we mitigate by falling back to a cached/deterministic response?
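The decision tree above can be sketched as a triage helper. Everything here is hypothetical glue: the `replay` callable stands in for your replay tooling, and the boolean inputs stand in for your eval dashboard and provider changelog check.

```python
# The runbook decision tree as a triage sketch. All names are
# hypothetical; callables/booleans stand in for real tooling.

def triage_ai_incident(bad_input, replay, eval_drift_detected: bool,
                       provider_changed: bool, n_replays: int = 5) -> str:
    """Walk the runbook: provider changes first, then reproducibility,
    then eval drift, defaulting to 'isolated report'."""
    # Replay the same input several times to test for stochasticity.
    outputs = {replay(bad_input) for _ in range(n_replays)}
    reproducible = len(outputs) == 1

    if provider_changed:
        return "suspect model-level change: pin or roll back model version"
    if reproducible:
        return "deterministic repro: bisect prompt/retrieval config"
    if eval_drift_detected:
        return "stochastic + drift: widen eval coverage, consider fallback"
    return "isolated report: add as eval test candidate, keep monitoring"
```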
Train your on-call rotation on the AI stack. The most common failure in AI incident response is the handoff gap: the SRE who gets paged understands infrastructure but not prompt engineering, while the ML engineer who understands the model isn't in the on-call rotation. Bridge this by either cross-training SREs on AI fundamentals or creating a dedicated AI on-call rotation that works alongside the infrastructure rotation.
The Provider Dependency Problem
A unique aspect of AI incident response: a significant class of incidents originates outside your infrastructure entirely. When your LLM provider pushes a model update, deprecates an API version, or experiences a partial outage, your AI features degrade in ways you cannot directly fix.
The mitigation strategies are straightforward but rarely implemented until after the first provider-caused incident:
- Pin model versions explicitly and test before upgrading, treating model version changes like database migrations.
- Implement provider fallback through an API gateway that can route to backup providers when your primary hits rate limits, outages, or unexpected behavior changes.
- Monitor provider changelogs as part of your operational practice, not as an afterthought. Subscribe to status pages, API changelogs, and deprecation notices. Treat a model deprecation announcement like a CVE — it needs a response plan.
- Maintain a local evaluation suite that you can run against any model version to detect behavioral regressions before they reach production. This is your canary — if your eval suite shows a quality drop after a provider update, you block the rollout rather than discovering the problem through user reports.
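The first two items together amount to a small routing shim in front of every model call. A hedged sketch follows; the provider names, model pins, and broad exception handling are all illustrative (real code would catch provider-specific rate-limit and deprecation errors):

```python
# Hedged sketch of pinned-version provider fallback behind one call
# site. Provider names and model pins are illustrative.

PROVIDER_CHAIN = [
    {"name": "primary", "model": "gpt-x-2025-08-01"},   # pinned, never "latest"
    {"name": "backup",  "model": "claude-y-2025-07-15"},
]

class AllProvidersFailed(Exception):
    pass

def complete(prompt: str, call_provider) -> str:
    """Try each pinned provider in order. call_provider(cfg, prompt)
    is assumed to raise on rate limits, outages, or 404s from a
    deprecated model name."""
    errors = []
    for cfg in PROVIDER_CHAIN:
        try:
            return call_provider(cfg, prompt)
        except Exception as exc:  # real code: catch specific error types
            errors.append(f'{cfg["name"]}: {exc}')
    raise AllProvidersFailed("; ".join(errors))
```

Routing every call through one shim like this also gives you the natural place to hang behavioral logging and semantic SLO counters, rather than instrumenting each call site separately.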
The Postmortem Gap
Traditional postmortems document what went wrong, why, and how to prevent recurrence. AI incidents expose a gap: the root cause is often "the model produced a bad output," which is simultaneously true and useless.
Effective AI postmortems go deeper. Instead of "the model hallucinated," document:
- Which specific inputs triggered the hallucination
- Whether the retrieval pipeline provided adequate context
- Whether the guardrails should have caught the output
- Whether the evaluation pipeline had coverage for this failure mode
The actionable output of an AI postmortem is almost always one of three things: a new test case in your evaluation suite, a guardrail rule that catches the specific failure pattern, or an architectural change that reduces the blast radius when the model produces unexpected output. If your postmortem doesn't produce one of these, it hasn't actually reduced future risk.
As AI features become table stakes rather than differentiators, incident response maturity for non-deterministic systems will separate the teams that ship reliably from the ones stuck in perpetual firefighting. The playbook needs rewriting — and the teams doing it now will have a compounding advantage over those who wait for their first AI-specific SEV-1 to force the change.
Sources
- https://runframe.io/blog/state-of-incident-management-2025
- https://dev.to/delafosse_olivier_f47ff53/silent-degradation-in-llm-systems-detecting-when-your-ai-quietly-gets-worse-4gdm
- https://spectrum.ieee.org/ai-reliability
- https://www.coalitionforsecureai.org/defending-ai-systems-a-new-framework-for-incident-response-in-the-age-of-intelligent-technology/
- https://www.armosec.io/blog/how-to-detect-prompt-injection-in-production-ai-agent-workloads/
- https://www.splunk.com/en_us/blog/learn/llm-observability.html
- https://kissapi.ai/blog/openai-model-deprecation-migration-guide-2026.html
