AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts
The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.
This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).
The teams I have seen handle this well treat AI on-call as a distinct discipline from traditional reliability work. They redesign the alert taxonomy, rebuild the rotation incentive structure, and invest in a training curriculum that explicitly unlearns a few deeply ingrained SRE instincts. What follows is a synthesis of what that looks like in practice.
"The Model Felt Like It" Is Not a Root Cause
The operational premise of traditional SRE is that every incident has a root cause reachable through disciplined investigation — a bad config push, a race condition, a dependency degradation. The postmortem culture built around this premise (Five Whys, causal chains, blameless retrospectives) only works when a why actually exists at the bottom of the tree.
Stochastic systems break this premise at the substrate. A single bit-level difference in the first logit calculation can flip token selection, and once the trajectory diverges, the model produces a factual answer one minute and a confident hallucination the next with no change to inputs. When batching, caching, and request scheduling interact with floating-point nondeterminism, even replaying the same prompt at the same temperature on the same model can yield different outputs. The "why" at the bottom of the tree is often just the joint entropy of the sampling process.
The dangerous move is to import this fact into on-call behavior as permission to stop investigating. If "the model felt like it" becomes the accepted terminal node of your incident tree, you have redefined every production bug in the AI path as unanalyzable. Teams that resist this do something subtle: they make stochasticity a category in the taxonomy, not an escape valve. A finding of "cannot distinguish from sampling noise at current N=1" is a valid conclusion, but it triggers a different workflow — usually an automatic request to replay the trace at higher N, run a targeted eval slice, or escalate if the signature keeps appearing. The rotation produces either a real root cause or enough statistical evidence to say the failure is within the expected noise floor. What it never produces is a shrug.
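To make the "replay at higher N" step concrete, here is a minimal sketch of that triage loop. Everything in it is illustrative: `replay_trace`, `passes_eval`, and the specific thresholds are hypothetical stand-ins for whatever replay harness, eval slice, and noise-floor agreement a team already has.

```python
# Illustrative sketch only: replay_trace() and passes_eval() are hypothetical
# callables standing in for a team's own replay harness and eval checks.
from dataclasses import dataclass


@dataclass
class ReplayVerdict:
    failure_rate: float
    conclusion: str  # "regression", "within noise floor", or "escalate"


def triage_stochastic_page(trace, replay_trace, passes_eval,
                           n: int = 20, noise_floor: float = 0.10) -> ReplayVerdict:
    """Replay the paged trace N times and compare the observed failure rate
    against the rate the team has agreed to tolerate as sampling noise."""
    failures = 0
    for _ in range(n):
        output = replay_trace(trace)        # same prompt, params, model version
        if not passes_eval(trace, output):  # targeted eval check for this signature
            failures += 1
    rate = failures / n

    if rate <= noise_floor:
        conclusion = "within noise floor"   # log to the replay corpus, close the page
    elif rate >= 0.5:
        conclusion = "regression"           # reproducible enough to debug for real
    else:
        conclusion = "escalate"             # ambiguous: widen the eval slice
    return ReplayVerdict(failure_rate=rate, conclusion=conclusion)
```

The point of the sketch is the shape of the output, not the numbers: the rotation always ends the investigation with a named verdict rather than a dismissal.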
An Alert Taxonomy Built for Stochastic Systems
Most AI observability stacks bolt LLM metrics onto existing alert infrastructure, which produces a fundamentally confused signal. An engineer gets paged, opens the dashboard, and sees "quality score dropped 8%" alongside "p99 latency up 40ms" and "error rate 0.3%" — three metrics with three completely different reproducibility guarantees, routed through the same pager.
A taxonomy that survives contact with production separates alerts into at least four distinct families, each with its own response protocol.
The first is deterministic infrastructure: the usual timeouts, 5xx rates, dependency health, queue depth. Same instincts as before. If a GPU host is wedged, it is wedged; classical debugging applies.
The second is policy and contract violations: safety classifier hits, JSON schema validation failures, refused tool calls, prompt injection detections. These are deterministic given a fixed trace, even though they happen inside a stochastic system. A single reproduction is meaningful; they belong on the pager at high severity.
The third is quality regressions with statistical signal: a sustained drop in faithfulness score, a shift in output length distribution, a drift in tool-call success rates over thousands of requests. These are real but require population-level analysis, not request-level debugging. They should never wake someone up — they should cut a ticket for tomorrow morning with a pre-attached eval slice. Pages here produce learned helplessness fast.
The fourth is stochastic noise: one weird output in a low-volume code path. These should not be alerts at all. If they leak through, the correct on-call action is to log the trace to a replay corpus and close the page. Training engineers to do this without guilt is harder than it sounds, because the trace often looks investigable — the model really did produce a wrong answer, and the usual instinct is that a wrong answer demands a fix.
The discipline is to reserve "fix" for regressions visible in aggregate. Individual stochastic failures are data points, not bugs. Teams that conflate the two either rewrite their prompt every Tuesday based on a single anecdote (prompt drift through overfitting to the latest noisy sample) or conclude that nothing is ever actionable and stop responding.
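A short sketch of how this taxonomy can live in routing logic rather than in 3 a.m. judgment follows. The family names mirror the four categories above; the `Alert` fields and the destinations (`page`, `ticket`, `replay_corpus`) are assumptions about what a team's alerting pipeline exposes, not any particular product's API.

```python
# Illustrative routing sketch: Alert fields and destination names are assumptions.
from dataclasses import dataclass
from enum import Enum, auto


class Family(Enum):
    DETERMINISTIC_INFRA = auto()   # timeouts, 5xx rates, dependency health
    POLICY_CONTRACT = auto()       # safety hits, schema failures, refused tool calls
    QUALITY_REGRESSION = auto()    # statistical drift visible only in aggregate
    STOCHASTIC_NOISE = auto()      # one weird output on a low-volume path


@dataclass
class Alert:
    family: Family
    sample_size: int               # how many requests back this signal
    trace_id: str | None = None


def route(alert: Alert) -> str:
    """Map each alert family to its response channel."""
    if alert.family is Family.DETERMINISTIC_INFRA:
        return "page"              # classical debugging applies
    if alert.family is Family.POLICY_CONTRACT:
        return "page"              # a single reproduction is meaningful
    if alert.family is Family.QUALITY_REGRESSION:
        # Real, but population-level: a morning ticket with an eval slice attached,
        # unless the sample is too small to distinguish from noise.
        return "ticket" if alert.sample_size >= 1000 else "replay_corpus"
    return "replay_corpus"         # log the trace, close the page without guilt
```

The `sample_size` guard encodes the discipline from the previous paragraph: a quality signal backed by a handful of requests is treated as a data point for the replay corpus, not as a regression to fix.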
Rotation Design That Resists the Dismissal Pattern
The "AI being weird again" dismissal has a social structure, not just a cognitive one. It emerges when one engineer's shrug during a 3 a.m. page becomes another engineer's precedent during the next page. Stopping the pattern requires building the rotation to propagate investigation, not dismissal.
Three design choices help.
