Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am
A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.
This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.
The failure axes your existing on-call training never modeled
When a traditional service degrades, the on-call walks a small, well-worn decision tree: deploy → infra → upstream dependency → noisy neighbor → bug. Every branch has a dashboard. Every dashboard has a runbook. AI features force three additional branches that most rotation playbooks haven't named:
Model behavior degraded but model latency green. A vendor pushes a silent point release of the underlying model — refusal behavior changes, structured output formatting drifts, a tool-call argument that worked yesterday gets nullified today. The latency dashboard is fine. The error-rate dashboard is fine if you only count HTTP 5xx. The actual signal is in the eval pass-rate or in a 2× spike in user thumbs-down, neither of which is wired into the alert. As one observability writeup put it, "an LLM feature can degrade gradually while all SLO metrics stay green — the failure may be behavioral rather than functional."
Tool dependency failure masquerading as model failure. The agent's retrieval tool is timing out, so the model is reasoning over an empty context window and generating confidently-wrong answers. The trace shows a clean model call returning a clean response. The on-call sees "AI feature broken" and routes the page upward. The actual fix is a five-minute database-replica failover the platform on-call could have done in their sleep — if they had been told that the trace anatomy includes a tool layer they should check first.
Prompt-layer regression from a feature flag the AI team didn't ship. A growth experiment toggled a personalization variable into the system prompt's context. The model's behavior changed. The AI team didn't ship anything. The growth team didn't realize their flag was upstream of an AI feature. The on-call sees "AI feature degraded after 14:00 UTC" and pages the AI team, who spends ninety minutes diff-ing prompts before someone thinks to check feature flags.
These three shapes account for a meaningful fraction of after-hours AI pages. They're all triagable by a generalist on-call, but only if the rotation has trained that on-call to recognize them. Most rotations haven't.
What "AI literacy" actually means for an on-call engineer
The phrase "AI literacy" gets used as a HR-deck buzzword in ways that tell you nothing. For an on-call rotation, it's a concrete skills list. The on-call who answers an AI-feature page should be able to do five things without paging anyone:
- Read a model trace end-to-end. Identify the prompt, the model and version, the tool calls in order, the tool responses, and the final assistant response. This is not the same skill as reading an HTTP trace. Token boundaries, message roles, and tool-call IDs all matter.
- Distinguish the model layer from the tool layer from the retrieval layer. A failure in one looks superficially identical to a failure in another in any dashboard that hasn't been deliberately broken out by layer.
- Read an eval result. Pass rate, regression delta against the previous prompt version, and which slice failed. If your eval is a black box to the on-call, the eval's signal is wasted at exactly the moment it's most needed.
- Diff a prompt manifest. Find the last commit that touched the prompt, the system instructions, the tool descriptions, or the retriever config. Most AI-feature regressions are upstream of the model and inside a config file.
- Recognize the top five AI-incident shapes for your specific product, with the dashboard URL and the first triage step for each one. Generic AI training doesn't help here; the runbook has to be specific to what your AI features actually do.
That's it. Five skills. Two to four hours of focused training, plus a written runbook the on-call can read while groggy. The org that hasn't done this is paying for the gap in the AI team's pager rate.
The dashboard hygiene investment your alerts are quietly demanding
A page that says "AI assistant degraded" tells the on-call nothing about which layer to investigate. It is the equivalent of paging on "service is broken" — a label, not a signal. The first investment any rotation should make before adding AI features to its scope is to refuse to ship alerts at this granularity.
The minimum viable refactor: every AI-feature alert names the failure layer in its title. "AI assistant: tool layer (search) — 12% error rate." "AI assistant: model layer — refusal rate spike." "AI assistant: retrieval layer — index freshness > 4h." The on-call who reads the alert title knows which dashboard to open and which section of the runbook to follow. This is dashboard hygiene, not dashboard innovation, and yet I have audited rotations at three different orgs in the last year that page on a single composite "AI feature health" gauge with no layer attribution. Every single page from those rotations gets escalated to the AI team because the on-call has no way to do otherwise.
The second investment, slightly more involved: a per-tool health dashboard that the on-call can land on directly from the alert. Tool-call timeouts, tool-call error rates, tool result freshness, and tool dependency status. If your agent has six tools, you need a dashboard that answers "is any of the six tools degraded right now" in under thirty seconds. That dashboard is not optional once the AI feature is in a shared rotation; it is a precondition.
The co-on-call shadow period: cheaper than burnout
- https://www.ai-infra-link.com/ai-augmented-on-call-revolutionizing-incident-response-in-2025/
- https://runframe.io/blog/state-of-incident-management-2025
- https://galileo.ai/blog/agent-failure-modes-guide
- https://arize.com/blog/common-ai-agent-failures/
- https://latitude.so/blog/ai-agent-failure-detection-guide
- https://uptimelabs.io/learn/reduce-on-call-burnout/
- https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks
- https://last9.io/blog/what-is-ai-sre/
- https://mrkaran.dev/posts/pi-sre-mode/
- https://dzone.com/articles/agentic-aiops-human-in-the-loop-workflows
