
AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

11 min read
Tian Pan
Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

The teams I have seen handle this well treat AI on-call as a distinct discipline from traditional reliability work. They redesign the alert taxonomy, rebuild the rotation incentive structure, and invest in a training curriculum that explicitly unlearns a few deeply ingrained SRE instincts. What follows is a synthesis of what that looks like in practice.

"The Model Felt Like It" Is Not a Root Cause

The operational premise of traditional SRE is that every incident has a root cause reachable through disciplined investigation — a bad config push, a race condition, a dependency degradation. The postmortem culture built around this premise (Five Whys, causal chains, blameless retrospectives) only works when a why actually exists at the bottom of the tree.

Stochastic systems break this premise at the substrate. A single bit-level difference in the first logit calculation can flip token selection, and once the trajectory diverges, the model produces a factual answer one minute and a confident hallucination the next with no change to inputs. When batching, caching, and request scheduling interact with floating-point nondeterminism, even replaying the same prompt at the same temperature on the same model can yield different outputs. The "why" at the bottom of the tree is often just the joint entropy of the sampling process.

The dangerous move is to import this fact into on-call behavior as permission to stop investigating. If "the model felt like it" becomes the accepted terminal node of your incident tree, you have redefined every production bug in the AI path as unanalyzable. Teams that resist this do something subtle: they make stochasticity a category in the taxonomy, not an escape valve. A finding of "cannot distinguish from sampling noise at current N=1" is a conclusion, but it triggers a different workflow — usually an automatic request to replay the trace at higher N, run a targeted eval slice, or escalate if the signature keeps appearing. The rotation produces either a real root cause or enough statistical evidence to say the failure is within the expected noise floor. What it never produces is a shrug.

An Alert Taxonomy Built for Stochastic Systems

Most AI observability stacks bolt LLM metrics onto existing alert infrastructure, which produces a fundamentally confused signal. An engineer gets paged, opens the dashboard, and sees a "quality score dropped 8%" alongside "p99 latency up 40ms" and "error rate 0.3%" — three metrics with three completely different reproducibility guarantees, routed through the same pager.

A taxonomy that survives contact with production separates alerts into at least four distinct families, each with its own response protocol.

The first is deterministic infrastructure: the usual timeouts, 5xx rates, dependency health, queue depth. Same instincts as before. If a GPU host is wedged, it is wedged; classical debugging applies.

The second is policy and contract violations: safety classifier fires, JSON schema validation failures, refused tool calls, prompt injection detections. These are deterministic given a fixed trace, even though they happen inside a stochastic system. A single reproduction is meaningful; they belong on the pager at high severity.

The third is quality regressions with statistical signal: a sustained drop in faithfulness score, a shift in output length distribution, a drift in tool-call success rates over thousands of requests. These are real but require population-level analysis, not request-level debugging. They should never wake someone up — they should cut a ticket for tomorrow morning with a pre-attached eval slice. Pages here produce learned helplessness fast.

The fourth is stochastic noise: one weird output in a low-volume code path. These should not be alerts at all. If they leak through, the correct on-call action is to log the trace to a replay corpus and close the page. Training engineers to do this without guilt is harder than it sounds, because the trace often looks investigable — the model really did produce a wrong answer, and the usual instinct is that a wrong answer demands a fix.

The discipline is to reserve "fix" for regressions visible in aggregate. Individual stochastic failures are data points, not bugs. Teams that conflate the two either rewrite their prompt every Tuesday based on a single anecdote (prompt drift through overfitting to the latest noisy sample) or conclude that nothing is ever actionable and stop responding.
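The four families and their response protocols can be sketched as a small routing table. This is illustrative, not a real schema: the field names (`source`, `population_n`, `sustained`), the thresholds, and the protocol strings are all assumptions.

```python
from enum import Enum

class AlertFamily(Enum):
    INFRA = "deterministic_infrastructure"
    CONTRACT = "policy_contract_violation"
    QUALITY = "statistical_quality_regression"
    NOISE = "stochastic_noise"

# Hypothetical mapping from family to response protocol.
PROTOCOL = {
    AlertFamily.INFRA: "page",
    AlertFamily.CONTRACT: "page_high_severity",
    AlertFamily.QUALITY: "ticket_with_eval_slice",
    AlertFamily.NOISE: "log_to_replay_corpus_and_close",
}

def classify(alert: dict) -> AlertFamily:
    """Assign an alert to one of the four families (illustrative rules)."""
    if alert.get("source") in {"timeout", "5xx", "dependency", "queue_depth"}:
        return AlertFamily.INFRA
    if alert.get("source") in {"safety_classifier", "schema_validation",
                               "refused_tool_call", "prompt_injection"}:
        return AlertFamily.CONTRACT
    # Quality regressions need population-level signal, not one bad request.
    if alert.get("population_n", 1) >= 1000 and alert.get("sustained"):
        return AlertFamily.QUALITY
    return AlertFamily.NOISE

def route(alert: dict) -> str:
    return PROTOCOL[classify(alert)]
```

The point of the table is that the response protocol is decided by family, never by how alarming an individual trace happens to look.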

Rotation Design That Resists the Dismissal Pattern

The "AI being weird again" dismissal has a social structure, not just a cognitive one. It emerges when one engineer's shrug during a 3 a.m. page becomes another engineer's precedent during the next page. Stopping the pattern requires building the rotation to propagate investigation, not dismissal.

Three design choices help.

Forced second opinions on ambiguous dismissals. When a primary on-call classifies an alert as "stochastic noise, no action," the system logs that classification with trace IDs and requires the secondary to spot-check a random subset on the next shift. This is not adversarial review — it is a way to keep the classification boundary calibrated. If the secondary disagrees often, the primary's threshold is drifting. If they rarely disagree, the taxonomy is working. Either way the team sees the signal instead of each engineer privately accumulating their own bias.
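One minimal way to implement the spot-check, assuming dismissals are logged as records with a `classification` field and the secondary records a `verdict` per review (both names are illustrative):

```python
import random

def spot_check_sample(dismissals, fraction=0.2, seed=None):
    """Random subset of 'stochastic noise' dismissals for the next
    shift's secondary to re-examine."""
    noise = [d for d in dismissals if d["classification"] == "stochastic_noise"]
    if not noise:
        return []
    k = max(1, round(len(noise) * fraction))
    return random.Random(seed).sample(noise, k)

def disagreement_rate(reviews):
    """Fraction of spot-checked dismissals the secondary reclassified.
    A persistently high value means the primary's threshold is drifting."""
    if not reviews:
        return 0.0
    return sum(1 for r in reviews if r["verdict"] != "agree") / len(reviews)
```

Tracking `disagreement_rate` per primary over time is what turns the spot-check from an adversarial audit into a shared calibration signal.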

Shift handoffs that include open behavioral threads, not just incidents. Classical on-call handoffs cover active incidents and known brittle systems. AI-aware handoffs add a third bucket: behavioral observations that are below alert threshold but trending. "Tool X has been hallucinating arguments at about twice last week's rate, I opened an eval slice." These threads are how you catch quality regressions that individually look like noise but aggregate into a real problem. Without explicit handoff, each shift rediscovers them from scratch and dismisses them one at a time.

Rotating the eval custodian. On top of the standard pager rotation, designate an engineer per cycle whose job is to review the replayed corpus from dismissed alerts and promote any that show up repeatedly. This role fixes the structural asymmetry where dismissing an alert is faster than investigating one. If dismissals have no downstream reader, the path of least resistance is to dismiss everything. If a teammate will read your dismissals tomorrow, the bar rises naturally.
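The custodian's promotion pass can be as simple as counting repeated signatures in the dismissed-trace corpus. The `signature` field and the threshold of three are assumptions for illustration:

```python
from collections import Counter

def promote_repeats(replay_corpus, threshold=3):
    """Return dismissed-alert signatures seen at least `threshold` times,
    i.e. candidates to promote into a real eval slice."""
    counts = Counter(trace["signature"] for trace in replay_corpus)
    return sorted(sig for sig, n in counts.items() if n >= threshold)
```

Even a pass this crude changes the incentive: every dismissal now has a downstream reader, and repeated dismissals surface mechanically instead of relying on someone's memory.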

Teams without some version of these controls eventually converge on a bimodal rotation: new engineers over-investigate every stochastic wobble and burn out, while tenured engineers dismiss everything and miss the real regressions. Neither group is wrong individually; the system is missing the feedback loop that would let them calibrate.

The Curriculum: What On-Call Training Has to Unlearn

A classical SRE onboarding checklist assumes a set of intuitions: reproduce the issue, narrow the variables, find the single change, fix it, confirm by re-reproducing, close. Every step of that checklist either partially breaks or needs a different implementation when the system being debugged is probabilistic.

Four unlearning exercises tend to matter most.

Reproduction is statistical, not binary. "I ran it and it worked" is not evidence the bug is gone — it is one sample from a distribution. New on-call engineers should practice replaying traces at N=20 or N=100 before concluding anything, and should learn to read confidence intervals on rates rather than treating single reproductions as diagnostic. This is uncomfortable at first because every prior instinct says a failing test that now passes means the fix worked.
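A sketch of what "reproduction is statistical" means in code: replay the trace N times and report a Wilson score interval on the failure rate instead of a pass/fail verdict. `replay_fn` is a stand-in for whatever re-executes the trace in your stack:

```python
import math

def replay(replay_fn, n=100):
    """Run the trace n times; return (failures, n)."""
    failures = sum(1 for _ in range(n) if not replay_fn())
    return failures, n

def wilson_interval(failures, n, z=1.96):
    """95% Wilson score interval for the true failure rate.
    One clean run (failures=0, n=1) still leaves a very wide interval."""
    p = failures / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)
```

At N=1, a single clean run only bounds the true failure rate below roughly 80%; at N=100, the same observation bounds it below about 4%. That gap is the difference between "I ran it and it worked" and actual evidence.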

Root cause often lives outside the model call. Stochastic systems tempt engineers to blame the model because it is the visible source of variance. In practice, most production AI incidents are still caused by retrieval, tool schemas, prompt versioning, or upstream data changes — the deterministic parts of the pipeline. The model is often behaving correctly on degraded inputs. The curriculum has to drill this: before blaming sampling noise, diff the inputs the model actually saw against inputs it saw yesterday.

Metrics have a noise floor you must learn by feel. Every eval metric has a level of variance that is intrinsic to the sample size and task. Training new on-call engineers to recognize the noise floor for each production metric — "this score wobbles by 2 points every day regardless of what we ship, an 8-point drop is meaningful" — is not documentable in a runbook. It is acquired through deliberate exposure to historical alert streams with hindsight labels attached.
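The statistical half of that feel can at least be bootstrapped in code, assuming a daily history of the metric under a stable system is available. The 3-sigma threshold here is an illustrative choice, not a rule:

```python
import statistics

def noise_floor(history, z=3.0):
    """Rough noise floor: z standard deviations of the metric's
    day-to-day history under a stable system."""
    return z * statistics.stdev(history)

def is_meaningful_drop(history, today, z=3.0):
    """True when today's value sits below the historical baseline
    by more than the noise floor."""
    return (statistics.mean(history) - today) > noise_floor(history, z)
```

On a metric that wobbles by a point or two daily, an 8-point drop clears the floor easily and a 2-point drop does not, which is exactly the judgment the hindsight-labeled alert exercises are meant to instill.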

The fix is a distribution shift, not a change. When a traditional bug is fixed, the failure rate for that path goes to zero. When an AI failure is mitigated, the failure rate goes from some percent to a lower percent. The "zero bugs in production" mental model actively harms on-call judgment here. A better framing is that every live failure mode has a rate, and your job is to know which rates are moving and which direction.
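One way to operationalize "which rates are moving": a two-proportion z-test between the pre-mitigation and post-mitigation cohorts. A minimal sketch, not a substitute for a proper eval pipeline:

```python
import math

def rate_moved(fail_before, n_before, fail_after, n_after, z_crit=1.96):
    """True when the failure-rate difference between two cohorts exceeds
    sampling noise (two-proportion z-test at ~95% confidence)."""
    p1, p2 = fail_before / n_before, fail_after / n_after
    pooled = (fail_before + fail_after) / (n_before + n_after)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_before + 1 / n_after))
    if se == 0.0:
        return False  # both cohorts at 0% or 100%: nothing to distinguish
    return abs(p1 - p2) / se > z_crit
```

A mitigation that takes a failure mode from 8% to 3% over a thousand requests per cohort shows up clearly; 8% to 7.5% does not, and declaring victory on the latter is the "zero bugs in production" instinct in disguise.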

The Postmortem Shape That Fits Probabilistic Systems

Google-style postmortems organized around "what broke, what was the root cause, what will prevent recurrence" have a shape that assumes the bug lived somewhere in the code or config. For AI incidents, the shape that works better centers on versioned state and distributional evidence.

The fields that earn their keep: the exact model, prompt, retrieval index, tool schema, and policy versions at the time of the incident; a link to the offending traces with N=many replays attached; the aggregate metric movement across the affected cohort, not just the one paging request; and an explicit statement of whether the mitigation targets the root cause, reduces the rate, or raises the bar for detection. Postmortems that conflate those three categories — fixes, rate reductions, and detection improvements — tend to overpromise and cause the next incident to look like a repeat when it is actually a different failure at a similar rate.

The other useful field is a stochasticity verdict. A one-line judgment: "reproducible under same versions," "reproducible only under similar prompts," or "within sampling noise for current N." This single field keeps the team honest about what the incident actually told you and prevents the archive from becoming a pile of ambiguous anecdotes cited out of context six months later.
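These fields, verdict included, fit naturally into a structured record. The field names below are illustrative, not a proposed standard:

```python
from dataclasses import dataclass

VERDICTS = {
    "reproducible_same_versions",
    "reproducible_similar_prompts",
    "within_sampling_noise",
}

@dataclass
class AIPostmortem:
    model_version: str
    prompt_version: str
    retrieval_index_version: str
    tool_schema_version: str
    policy_version: str
    trace_ids: list              # offending traces, with N-replay results attached
    replay_n: int                # how many replays the evidence rests on
    cohort_metric_delta: float   # aggregate movement, not the one paging request
    mitigation_type: str         # "fix" | "rate_reduction" | "detection_improvement"
    stochasticity_verdict: str

    def __post_init__(self):
        # Refuse free-text verdicts: "model felt like it" is not a value here.
        if self.stochasticity_verdict not in VERDICTS:
            raise ValueError(f"unknown stochasticity verdict: {self.stochasticity_verdict}")
```

Making the verdict an enumerated field rather than prose is the cheap trick that keeps the archive queryable six months later.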

The Quiet Cultural Shift

The teams that will operate AI systems well over the next few years are going through a quiet cultural shift, and on-call is where it shows up first. The default SRE identity — the person who finds the single root cause, the one who can reproduce anything with enough persistence, the one who never settles for "it's flaky" — runs into a system where reproducibility is probabilistic, root causes are distributions, and "it's flaky" is sometimes the correct scientific answer.

The answer is not to lower the bar. It is to upgrade the toolkit: statistical reproduction instead of binary, distribution shifts instead of changes, rotation incentives that reward calibrated dismissals instead of rewarding shrugs. The teams that invest here stop burning out their tenured engineers on noise and stop missing real regressions in the dismissal pile. The ones that don't end up with the worst of both worlds: a rotation that dismisses real problems and over-indexes on sampling noise, with nobody quite sure which alerts actually matter.

"The model was just being weird again" is a symptom. The work is to build a culture where that sentence is never a terminal answer — always a hypothesis that triggers the next measurement.
