The On-Call Burden Shift: How AI Features Break Your Incident Response Playbook
Your monitoring dashboard is green. Latency is normal. Error rates are flat. And your AI feature has been hallucinating customer account numbers for the last six hours.
This is the new normal for on-call engineers at companies shipping AI features. The playbooks that worked for deterministic software (check the logs, find the stack trace, roll back the deploy) break down when "correct execution, wrong answer" is the dominant failure mode. A 2025 industry report found that operational toil rose from 25% to 30%, the first increase in five years, even as organizations poured millions into AI tooling. The tools got smarter, but the incidents got weirder.
The Failure Modes Your Runbook Doesn't Cover
Traditional incident response assumes a binary world: the service is up or it's down. Requests succeed or they fail. When something breaks, there's an error code, a stack trace, a deploy to correlate with. AI features shatter every one of these assumptions.
The new incident categories that AI introduces include:
- Silent semantic degradation. The model returns syntactically valid responses that are factually wrong, off-topic, or subtly harmful. Your health checks pass. Your latency SLOs are met. But the AI just told a customer their refund was processed when it wasn't.
- Provider-side silent changes. Your LLM provider updates their model weights or deprecates an endpoint, and your carefully tuned prompts start producing different outputs. OpenAI retired GPT-4o, GPT-4.1, and several other models in February 2026 — teams that hardcoded model names discovered their pipelines breaking with 404s, sometimes days after the change.
- Prompt injection attacks. Malicious content enters your system through user input or retrieval pipelines and hijacks the model's behavior. Unlike SQL injection, there's no WAF rule that reliably catches it. The attack surface is the natural language interface itself.
- Embedding index corruption. A schema change upstream silently degrades your RAG pipeline's retrieval quality. The system keeps responding, but it's now pulling from stale or irrelevant context. This can persist for days before anyone connects a data pipeline change to degraded AI output.
- Stochastic regression. A prompt change that improved average output quality also introduced a long tail of terrible responses for edge cases. Your A/B test showed a statistically significant improvement, but it missed the 2% of queries where the new prompt catastrophically fails.
None of these produces a stack trace. None triggers a traditional alert. And rolling back a deployment often won't fix them, because the "deployment" might be a model update that happened on someone else's infrastructure.
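The stochastic regression case can be made concrete with a small sketch. It assumes you already have a per-query quality score between 0 and 1 for the old and the new prompt on the same query set. The function name `tail_regressions`, the `floor` threshold, and the sample scores are hypothetical and not taken from any evaluation library. The point is only that the mean can improve while individual queries collapse.

```python
from statistics import mean


def tail_regressions(old_scores, new_scores, floor=0.5):
    """Return IDs of queries that were acceptable before and fall below `floor` now.

    Both arguments map a query ID to a quality score in [0, 1]. How the
    scores are produced (human rating, rubric, judge model) is out of scope.
    """
    return sorted(
        qid
        for qid, new in new_scores.items()
        if qid in old_scores and old_scores[qid] >= floor and new < floor
    )


# Hypothetical scores for five queries, before and after a prompt change.
old = {"q1": 0.70, "q2": 0.72, "q3": 0.68, "q4": 0.71, "q5": 0.75}
new = {"q1": 0.90, "q2": 0.92, "q3": 0.91, "q4": 0.93, "q5": 0.10}

mean_improved = mean(new.values()) > mean(old.values())
broken = tail_regressions(old, new)
print(f"mean improved: {mean_improved}, newly failing queries: {broken}")
```

With these made-up numbers the average goes up while `q5` drops from acceptable to unusable, which is the pattern an aggregate-only A/B comparison would miss.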
Why "The AI Is Being Weird" Is Now a Legitimate Incident Report
Every on-call engineer who supports AI features has received this ticket: "The AI is being weird." In a deterministic system, you'd close this as insufficient information. In an AI system, this vague report might be the only signal you get before a silent degradation becomes a customer-facing disaster.
The challenge is triage. When a user reports that the AI is "being weird," you need to distinguish between at least four possibilities:
- Normal variance. LLMs are non-deterministic by design. Some outputs will be surprising but within acceptable bounds.
- Prompt regression. A recent change to system prompts, retrieval context, or guardrails shifted the output distribution in ways that users notice but metrics don't capture.
- Model-level degradation. The underlying model's behavior changed — either through a provider update or through data drift in fine-tuned models.
- Active attack. Someone is exploiting the system through prompt injection or adversarial inputs.
Traditional escalation paths don't work here. You can't hand this to the database team or the infrastructure team. You need someone who understands the full AI stack: the prompt engineering, the retrieval pipeline, the model behavior, and the evaluation framework. Most SRE teams in 2026 still don't have this expertise, and that's the core of the on-call burden shift.
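One way to make the first triage step explicit is a small decision function. The sketch below is an illustration under stated assumptions: the signal names, the 0.05 threshold, and the ordering (attack first, then your own recent changes, then the model) are hypothetical choices, not an established procedure. It returns the hypothesis to investigate first, not a diagnosis.

```python
from enum import Enum
from typing import NamedTuple


class Cause(Enum):
    ACTIVE_ATTACK = "active attack"
    PROMPT_REGRESSION = "prompt regression"
    MODEL_DEGRADATION = "model-level degradation"
    NORMAL_VARIANCE = "normal variance"


class Signals(NamedTuple):
    injection_pattern_seen: bool   # e.g. flagged by an input scanner (assumed to exist)
    prompt_changed_recently: bool  # prompt, retrieval context, or guardrail change
    eval_score_drop: float         # drop on a fixed evaluation set, 0.0 to 1.0


def first_hypothesis(s: Signals, drop_threshold: float = 0.05) -> Cause:
    """Pick which of the four possibilities to investigate first."""
    if s.injection_pattern_seen:
        # Rule out the most damaging explanation before anything else.
        return Cause.ACTIVE_ATTACK
    if s.prompt_changed_recently:
        # Our own change is the cheapest thing to check, and metrics
        # may not capture it, so do not wait for a score drop.
        return Cause.PROMPT_REGRESSION
    if s.eval_score_drop >= drop_threshold:
        # Quality moved and nothing changed on our side.
        return Cause.MODEL_DEGRADATION
    return Cause.NORMAL_VARIANCE


report = Signals(
    injection_pattern_seen=False,
    prompt_changed_recently=False,
    eval_score_drop=0.12,
)
print(first_hypothesis(report).value)
```

Encoding the order in code is mainly useful because it forces the team to agree on it before the page arrives.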
The Verification Tax: More AI, More Toil
The intuition is that AI should reduce on-call burden — automate triage, suggest root causes, draft postmortems. These capabilities are real. But the data tells a different story: organizations that deployed AI features saw operational toil increase, not decrease.
The industry has started calling this the "verification tax." Shipping an AI feature doesn't just add the feature's operational burden — it adds a new layer of monitoring, evaluation, and human verification on top of everything else. A responsible AI deployment requires:
- Continuous evaluation pipelines that run golden test sets against production outputs and alert on drift.
- Human review queues for edge cases that automated evaluation flags as ambiguous.
- Shadow deployments for prompt or model changes, comparing semantic outputs before routing real traffic.
- Behavioral logging that captures not just inputs and outputs but the full chain of tool calls, retrieval results, and intermediate reasoning.
Each of these is operationally expensive. And because 69% of AI-powered decisions still require human verification, you haven't actually removed the human from the loop — you've added AI to the loop and kept the human there too. For a 250-engineer organization, this overhead translates to millions in annual productivity cost.
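As a rough illustration of the first item, a continuous evaluation check can be sketched in a few lines. Everything named here is an assumption for the example: `GoldenCase`, `pass_rate`, `drifted`, the 0.05 tolerance, and the `fake_model` stand-in that replaces a real model call so the sketch runs offline. Substring matching is a deliberately crude grader. A real pipeline would likely need something more robust.

```python
from typing import Callable, NamedTuple


class GoldenCase(NamedTuple):
    prompt: str
    must_contain: str  # a fact the answer has to state


def pass_rate(generate: Callable[[str], str], cases: list[GoldenCase]) -> float:
    """Fraction of golden cases whose output contains the required fact."""
    if not cases:
        raise ValueError("golden set is empty")
    passed = sum(
        1
        for case in cases
        if case.must_contain.lower() in generate(case.prompt).lower()
    )
    return passed / len(cases)


def drifted(current: float, baseline: float, tolerance: float = 0.05) -> bool:
    """True when the pass rate fell more than `tolerance` below the baseline."""
    return baseline - current > tolerance


# Stand-in for the real model call (hypothetical answers).
def fake_model(prompt: str) -> str:
    answers = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Which plan includes SSO?": "SSO is part of the Pro plan.",
    }
    return answers.get(prompt, "I'm not sure.")


golden = [
    GoldenCase("What is the refund window?", "30 days"),
    GoldenCase("Which plan includes SSO?", "Enterprise"),
]

rate = pass_rate(fake_model, golden)
print(f"pass rate {rate:.2f}, alert: {drifted(rate, baseline=1.0)}")
```

Here one of the two cases fails, so the pass rate falls well below the assumed baseline and the check would raise an alert even though every request returned successfully.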
- https://runframe.io/blog/state-of-incident-management-2025
- https://dev.to/delafosse_olivier_f47ff53/silent-degradation-in-llm-systems-detecting-when-your-ai-quietly-gets-worse-4gdm
- https://spectrum.ieee.org/ai-reliability
- https://www.coalitionforsecureai.org/defending-ai-systems-a-new-framework-for-incident-response-in-the-age-of-intelligent-technology/
- https://www.armosec.io/blog/how-to-detect-prompt-injection-in-production-ai-agent-workloads/
- https://www.splunk.com/en_us/blog/learn/llm-observability.html
- https://kissapi.ai/blog/openai-model-deprecation-migration-guide-2026.html
