Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am
A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.
This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes (the model, the prompt, the retrieval pipeline) and three more shapes of failure that don't show up on the dashboards your on-call was trained to read.
The failure axes your existing on-call training never modeled
When a traditional service degrades, the on-call walks a small, well-worn decision tree: deploy → infra → upstream dependency → noisy neighbor → bug. Every branch has a dashboard. Every dashboard has a runbook. AI features force three additional branches that most rotation playbooks haven't named:
Model behavior degraded but model latency green. A vendor pushes a silent point release of the underlying model: refusal behavior changes, structured output formatting drifts, a tool-call argument the model populated reliably yesterday comes back null today. The latency dashboard is fine. The error-rate dashboard is fine if you only count HTTP 5xx. The actual signal is in the eval pass rate or in a 2× spike in user thumbs-down, neither of which is wired into the alert. As one observability writeup put it, "an LLM feature can degrade gradually while all SLO metrics stay green — the failure may be behavioral rather than functional."
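Wiring that missing signal in is not exotic. A minimal sketch follows, assuming a hypothetical `query_metric` helper over whatever metrics store you already run; the metric names and thresholds are illustrative assumptions, not a recommendation:

```python
# A minimal sketch of paging on behavioral signals instead of HTTP 5xx.
# query_metric is a stand-in for your metrics store client; the metric
# names and thresholds below are illustrative assumptions.

def avg(xs: list[float]) -> float:
    return sum(xs) / len(xs) if xs else 0.0

def behavioral_alerts(query_metric) -> list[str]:
    alerts = []

    # Eval pass rate: compare the last hour against a 24h baseline.
    recent = avg(query_metric("ai_assistant.eval_pass_rate", window="1h"))
    baseline = avg(query_metric("ai_assistant.eval_pass_rate", window="24h"))
    if baseline and recent < 0.9 * baseline:
        alerts.append("AI assistant: model layer - eval pass rate down >10% vs baseline")

    # Thumbs-down rate: the 2x spike the latency dashboard never shows.
    recent_td = avg(query_metric("ai_assistant.thumbs_down_rate", window="1h"))
    baseline_td = avg(query_metric("ai_assistant.thumbs_down_rate", window="24h"))
    if baseline_td and recent_td > 2 * baseline_td:
        alerts.append("AI assistant: model layer - thumbs-down rate 2x baseline")

    return alerts
```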
Tool dependency failure masquerading as model failure. The agent's retrieval tool is timing out, so the model is reasoning over an empty context window and generating confidently wrong answers. The trace shows a clean model call returning a clean response. The on-call sees "AI feature broken" and routes the page upward. The actual fix is a five-minute database-replica failover the platform on-call could have done in their sleep, if only they had been told that the trace anatomy includes a tool layer to check first.
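The "check the tool layer first" move is mechanical enough to sketch. The span schema below (layer, name, status, duration_ms) is an assumption about your tracing export, not a standard:

```python
# A sketch of "check the tool layer first" over an exported trace.
# The span fields used here are an assumed schema, not a real one.

def first_degraded_tool(trace: dict, timeout_ms: int = 5000) -> str | None:
    """Return a description of the first failing tool span, or None if clean."""
    for span in trace.get("spans", []):
        if span.get("layer") != "tool":
            continue
        if span.get("status") == "error" or span.get("duration_ms", 0) > timeout_ms:
            return (f"tool '{span['name']}': status={span.get('status')}, "
                    f"{span.get('duration_ms')}ms")
    return None  # tool layer is clean; look at the model or retrieval layer next
```

If this returns a tool name, the page is a dependency incident wearing an AI costume, and the platform runbook for that dependency applies.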
Prompt-layer regression from a feature flag the AI team didn't ship. A growth experiment toggled a personalization variable into the system prompt's context. The model's behavior changed. The AI team didn't ship anything. The growth team didn't realize their flag was upstream of an AI feature. The on-call sees "AI feature degraded after 14:00 UTC" and pages the AI team, who spends ninety minutes diffing prompts before someone thinks to check feature flags.
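A cheap guard against that ninety-minute detour is to check the flag audit log before diffing anything. The sketch below assumes your flag system exposes a change log and that someone maintains an allowlist of flags known to feed the prompt; both are assumptions, and the absence of that allowlist is usually part of the problem:

```python
from datetime import datetime, timedelta

# Hand-maintained allowlist of flags that feed the system prompt.
# Hypothetical names; the point is that this mapping must exist somewhere.
PROMPT_UPSTREAM_FLAGS = {"growth_personalization_v2", "onboarding_tone_experiment"}

def suspicious_flag_changes(flag_changes: list[dict],
                            incident_start: datetime,
                            window_hours: int = 6) -> list[dict]:
    """Flag changes that touched the prompt's inputs shortly before the incident."""
    cutoff = incident_start - timedelta(hours=window_hours)
    return [
        change for change in flag_changes
        if change["flag"] in PROMPT_UPSTREAM_FLAGS
        and cutoff <= change["changed_at"] <= incident_start
    ]
```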
These three shapes account for a meaningful fraction of after-hours AI pages. They're all triageable by a generalist on-call, but only if the rotation has trained that on-call to recognize them. Most rotations haven't.
What "AI literacy" actually means for an on-call engineer
The phrase "AI literacy" gets used as a HR-deck buzzword in ways that tell you nothing. For an on-call rotation, it's a concrete skills list. The on-call who answers an AI-feature page should be able to do five things without paging anyone:
- Read a model trace end-to-end. Identify the prompt, the model and version, the tool calls in order, the tool responses, and the final assistant response. This is not the same skill as reading an HTTP trace. Token boundaries, message roles, and tool-call IDs all matter.
- Distinguish the model layer from the tool layer from the retrieval layer. A failure in one looks superficially identical to a failure in another in any dashboard that hasn't been deliberately broken out by layer.
- Read an eval result. Pass rate, regression delta against the previous prompt version, and which slice failed. If your eval is a black box to the on-call, the eval's signal is wasted at exactly the moment it's most needed.
- Diff a prompt manifest. Find the last commit that touched the prompt, the system instructions, the tool descriptions, or the retriever config. Most AI-feature regressions are upstream of the model and inside a config file (a minimal sketch of this check follows the list).
- Recognize the top five AI-incident shapes for your specific product, with the dashboard URL and the first triage step for each one. Generic AI training doesn't help here; the runbook has to be specific to what your AI features actually do.
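For the fourth skill, the check is usually one git invocation away, as long as prompts, tool descriptions, and retriever config live in version control. A minimal sketch, with illustrative paths:

```python
import subprocess

# Illustrative paths; adjust to wherever your prompt layer actually lives.
PROMPT_LAYER_PATHS = ["prompts/", "tools/descriptions/", "config/retriever.yaml"]

def last_prompt_layer_commit() -> str:
    """Return the most recent commit that touched any prompt-layer file."""
    result = subprocess.run(
        ["git", "log", "-1", "--format=%h %ad %s", "--date=iso", "--"]
        + PROMPT_LAYER_PATHS,
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip() or "no commits have touched the prompt layer"
```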
That's it. Five skills. Two to four hours of focused training, plus a written runbook the on-call can read while groggy. The org that hasn't done this is paying for the gap in the form of the AI team's pager rate.
The dashboard hygiene investment your alerts are quietly demanding
A page that says "AI assistant degraded" tells the on-call nothing about which layer to investigate. It is the equivalent of paging on "service is broken" — a label, not a signal. The first investment any rotation should make before adding AI features to its scope is to refuse to ship alerts at this granularity.
The minimum viable refactor: every AI-feature alert names the failure layer in its title. "AI assistant: tool layer (search) — 12% error rate." "AI assistant: model layer — refusal rate spike." "AI assistant: retrieval layer — index freshness > 4h." The on-call who reads the alert title knows which dashboard to open and which section of the runbook to follow. This is dashboard hygiene, not dashboard innovation, and yet I have audited rotations at three different orgs in the last year that page on a single composite "AI feature health" gauge with no layer attribution. Every single page from those rotations gets escalated to the AI team because the on-call has no way to do otherwise.
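The convention is also easy to enforce mechanically. A sketch of a CI check over your alert titles, assuming you can enumerate them as strings; the layer vocabulary mirrors the examples above:

```python
LAYERS = ("model layer", "tool layer", "retrieval layer", "prompt layer")

def unattributed_alerts(alert_titles: list[str]) -> list[str]:
    """AI-feature alerts whose titles name no failure layer; fail CI if non-empty."""
    return [
        title for title in alert_titles
        if title.startswith("AI assistant:")
        and not any(layer in title for layer in LAYERS)
    ]
```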
The second investment, slightly more involved: a per-tool health dashboard that the on-call can land on directly from the alert. Tool-call timeouts, tool-call error rates, tool result freshness, and tool dependency status. If your agent has six tools, you need a dashboard that answers "is any of the six tools degraded right now" in under thirty seconds. That dashboard is not optional once the AI feature is in a shared rotation; it is a precondition.
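The thirty-second question reduces to one aggregation over recent tool-call records. A sketch, assuming each record carries tool, ok, and latency_ms fields (illustrative names, not a real telemetry schema) and thresholds you would tune per tool:

```python
from collections import defaultdict

def degraded_tools(tool_calls: list[dict],
                   error_threshold: float = 0.05,
                   p95_budget_ms: int = 3000) -> list[tuple]:
    """Return (tool, error_rate, p95_ms) for every tool currently degraded."""
    by_tool: dict[str, list[dict]] = defaultdict(list)
    for call in tool_calls:
        by_tool[call["tool"]].append(call)

    degraded = []
    for tool, calls in sorted(by_tool.items()):
        error_rate = sum(1 for c in calls if not c["ok"]) / len(calls)
        latencies = sorted(c["latency_ms"] for c in calls)
        p95 = latencies[int(0.95 * (len(latencies) - 1))]
        if error_rate > error_threshold or p95 > p95_budget_ms:
            degraded.append((tool, round(error_rate, 3), p95))
    return degraded  # empty list means all tools are healthy
```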
The co-on-call shadow period: cheaper than burnout
The fastest way to transfer the AI team's tacit knowledge into the broader rotation is also the oldest trick in the on-call playbook: shadowing. For the first month an engineer is on the rotation that includes AI features, every AI-feature page that fires also notifies an AI-experienced engineer in a shadow capacity. The shadow doesn't take the page; the shadow watches the on-call work and steps in only if the on-call asks.
The cost is real but bounded. The AI engineer's sleep is interrupted, but they don't have to act — they can mostly observe and go back to sleep. The benefit compounds: every shadowed page produces a more-literate on-call and, often, a runbook update that captures the tacit move the AI engineer would have made. Three or four shadowed pages is usually enough to turn an AI-illiterate on-call into an AI-competent one for the modal failure modes.
Without this, the rotation enters a known anti-pattern: the AI team gets paged at 2am for problems the on-call could have triaged, the AI team's burnout rises, two engineers leave within six months citing on-call burden, the rotation frequency tightens for everyone remaining, and the cycle accelerates. This trajectory is well-documented in SRE literature and it does not require AI features to start; AI features just give it a faster engine.
The runbook structure for the top five AI-incident shapes
Generic on-call training is the wrong granularity. The runbook the on-call actually needs is the specific top-five list for your product. Here is a template I've found works across teams, with a structured sketch after the list:
For each of the top five AI-incident shapes:
- The alert title that fires (use the literal string the on-call will see)
- The single dashboard URL that confirms or rules out the diagnosis
- The first three queries (SQL or trace queries) the on-call should run, with expected output
- The mitigation that does not require the AI team — restart the tool, fail over the index, disable a feature flag, switch to a fallback model
- The escalation criterion — the specific condition under which paging the AI team is the right move
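One way to keep the template honest is to store each entry as data and lint it, so a missing field fails CI instead of failing the on-call at 2am. A sketch, with field values borrowed from the search-index example below and a hypothetical dashboard URL:

```python
# One runbook entry as data; every value here is illustrative.
RUNBOOK_ENTRY = {
    "alert_title": "AI assistant: tool layer (search) - error rate > 5% for 10 min",
    "dashboard_url": "https://dashboards.internal/ai-tools/search",  # hypothetical
    "triage_queries": [
        "SELECT count(*) FROM tool_calls WHERE tool = 'search' AND NOT ok "
        "AND ts > now() - interval '15 minutes'  -- expect a spike if degraded",
        "spans: layer=tool name=search status=error  -- trace query",
        "SELECT replica, lag_seconds FROM search_index_replicas",
    ],
    "mitigation": "Fail over search to the read replica; no AI team required.",
    "escalation_criterion": "Error rate still > 5% ten minutes after failover.",
}

REQUIRED_FIELDS = ("alert_title", "dashboard_url", "triage_queries",
                   "mitigation", "escalation_criterion")

def lint_runbook(entries: list[dict]) -> list[tuple[str, str]]:
    """(entry, missing_field) pairs; a non-empty result means an untriageable shape."""
    return [
        (entry.get("alert_title", "?"), field)
        for entry in entries
        for field in REQUIRED_FIELDS
        if not entry.get(field)
    ]
```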
The discipline is brutal: if you can't write the runbook for an AI-incident shape, the on-call can't triage it, and you're shipping a feature whose burden lands on the AI team's sleep. The act of writing the runbook is what forces the dashboard hygiene, the layer attribution, and the mitigation tooling to exist.
The teams that have done this well end up with runbooks that look surprisingly mundane. "Tool: search index — if error rate > 5% for 10 min, fail over to the read replica using this dashboard button, then verify pass rate recovered, then file an INC ticket for the AI team to investigate root cause during business hours." That last clause is the magic. It distinguishes "the on-call mitigated and went back to sleep" from "the on-call escalated and the AI team is awake." The former is the goal of every AI-feature on-call investment.
What the org failure mode actually costs
A rotation that hasn't been updated for AI features doesn't fail loudly. It fails in three quiet ways that show up in the wrong dashboards.
The AI team's page rate looks higher than the rest of engineering's. Leadership reads this as "AI is hard" or "AI features are immature" and budgets accordingly, which means more headcount on the AI team rather than the cheaper intervention of upskilling the existing rotation. The platform on-call rate looks unremarkable because the pages they would have handled are getting routed past them.
The AI team's retention looks worse than the rest of engineering's. Exit interviews surface on-call burden as a primary factor. The org assumes this is the cost of doing business in AI. The actual cost is that of a rotation that never adapted to a category of failure modes its training hasn't named.
The AI feature's customer-perceived reliability looks worse than the AI team's own dashboards say. The ninety-minute escalation chain from generalist on-call to AI on-call to actual fix is invisible to the SLO dashboard but very visible to customers. The org concludes that AI features are inherently flaky. The actual conclusion is that the rotation's MTTR is being inflated by an unnecessary handoff.
The literacy prerequisite is a policy, not a wish
If you treat AI literacy as something that will happen organically (engineers will pick it up by osmosis, eng-wide brown bags will fill the gap, the AI team's documentation will eventually get read), it will not happen. Every org I have watched try the organic approach has seen the AI team's pager rate quietly grow until somebody senior looks at the on-call dashboard and asks why one team is handling 40% of pages.
The intervention is a written prerequisite in the rotation policy. Before you join the rotation that includes AI features, you complete the AI on-call onboarding module. The module is two to four hours, ends in a paired exercise on a real (anonymized) past incident, and is signed off by an AI-experienced engineer. The first month on rotation is shadowed. The runbook for the top five AI-incident shapes is in the same repo as the platform runbook and is reviewed quarterly.
This is unglamorous infrastructure. It does not appear on a roadmap. It does not show up in a launch deck. It is the kind of work that pays back in retention, in MTTR, and in the AI team's ability to ship the next feature instead of debugging the last one at 2am. The org that names it as work and funds it as work is the org whose AI features get treated like every other production system — which is the only treatment that scales.
The architectural realization is small but consequential: AI features are not exotic. They are production systems with a different failure surface, and the rotation that doesn't update its training to match that surface is making a staffing decision it didn't price. The price comes due in the AI team's burnout rate, and by the time it shows up in the exit interviews, the rotation has been undercharging for a year.
- https://www.ai-infra-link.com/ai-augmented-on-call-revolutionizing-incident-response-in-2025/
- https://runframe.io/blog/state-of-incident-management-2025
- https://galileo.ai/blog/agent-failure-modes-guide
- https://arize.com/blog/common-ai-agent-failures/
- https://latitude.so/blog/ai-agent-failure-detection-guide
- https://uptimelabs.io/learn/reduce-on-call-burnout/
- https://www.elastic.co/observability-labs/blog/sre-troubleshooting-ai-assistant-observability-runbooks
- https://last9.io/blog/what-is-ai-sre/
- https://mrkaran.dev/posts/pi-sre-mode/
- https://dzone.com/articles/agentic-aiops-human-in-the-loop-workflows
