4 posts tagged with "on-call"

On-Call at 3am for an AI Feature That Didn't 500

May 14, 2026 · 12 min read

Software Engineer

The pager goes off at 3:02 AM. You squint at your phone expecting the usual: a database failover, a CDN edge that wandered off, a 500 spike from a service nobody touched in eight months. Instead the alert reads: summarizer.eval-on-traffic.helpfulness rolling-1h: 4.21 → 4.05 (Δ -0.16). No HTTP error. No latency spike. No service is down. Every request the system served in the last hour returned a 200 with a body that parsed cleanly. And yet something is unmistakably worse than it was at midnight, and the rotation expects you to figure out what.

This is the on-call shift the standard runbook wasn't written for. The thing that broke didn't break — it regressed. The error budget you've been tracking for years is denominated in availability and latency, and the failure mode that paged you isn't visible in either. The page is real, the customer impact is real, and your usual diagnostic loop — check the deploy log, check the dependency graph, find the bad release, roll it back — runs into a wall the moment you realize that "the bad release" might be a 30-line system-prompt diff that landed at 4 PM yesterday and looked completely innocuous in code review.

Your On-Call Rotation Needs an AI-Literacy Prerequisite Before It Pages Anyone at 2am

April 28, 2026 · 12 min read

Tian Pan

Software Engineer

A platform engineer with eight years of incident-response experience opens a 2am page that says "AI assistant degraded — error rate 12%." She checks the model latency dashboard: green. She checks the model API status page: green. She checks the deploy log: nothing shipped in the last 72 hours. She does what any competent on-call does next — she pages the AI team. The AI engineer wakes up, opens the trace dashboard the platform engineer didn't know existed, sees that a single retrieval tool has been timing out for the last four hours because a downstream search index lost a replica, and resolves the incident in eleven minutes. The AI engineer goes back to bed at 3:14am. The retrospective the next morning records "AI feature outage, resolved by AI team." Nobody writes down the actual lesson, which is that the on-call engineer could have triaged this in five minutes if she had ever been taught what an AI feature's failure surface looks like.

This is the rotation tax that AI features quietly impose on every engineering org I've worked with in the last two years. The shared on-call rotation that worked beautifully for a stack of stateless services and a few databases breaks down the moment one of those "services" is an LLM-backed feature. The on-call playbook your SRE team built across a decade of post-mortems is calibrated for a world where "something is broken" decomposes into CPU, memory, network, deploys, and dependency timeouts. AI features add three more axes — the model, the prompt, the retrieval pipeline — and four more shapes of failure that don't show up on the dashboards your on-call was trained to read.

The Agent Paged Me at 3 AM: Blast-Radius Policy for Tools That Reach Humans

April 23, 2026 · 12 min read

Tian Pan

Software Engineer

The first time an agent pages your on-call four times in an hour because it's looping on a malformed alert signal, leadership learns something the security team already knew: "tool access" and "ability to create human work" were the same permission, and you granted it without either a safety review or a product-ownership review. Nobody owned the question of who's allowed to interrupt a human at 3 AM, because nobody framed it as a question. It was framed as a Slack integration.

The 2026 agent stack has made this failure mode cheap to reach. Anthropic's MCP servers, OpenAI's Agents SDK, and the whole class of vendor-shipped action tools have collapsed the distance between "the model decided to do a thing" and "a human got woken up." Most teams ship those integrations the same way they ship a database client: scope a token, drop in the SDK, write a system prompt, ship. The blast radius of a database client is a row count. The blast radius of a PagerDuty client is a person's sleep.

AI On-Call Psychology: Rebuilding Operator Intuition for Non-Deterministic Alerts

April 16, 2026 · 11 min read

Tian Pan

Software Engineer

The first time an on-call engineer closes a page with "the model was just being weird again," the team has quietly crossed a line. That phrase does three things at once: it declares the issue un-investigable, it classifies future similar alerts as noise, and it absolves the rotation of documenting what happened. A week later the same signature will fire, someone else will see "already dismissed once," and a real regression will live in production until a customer tweets about it.

This pattern is not laziness. It is the predictable outcome of running standard SRE intuition on a system that no longer behaves deterministically. Classical on-call training teaches engineers to treat identical inputs producing different outputs as a bug in the observability stack — it cannot be a bug in the system, because systems don't do that. LLM-backed systems do exactly that, every request, by design. An on-call rotation built without internalizing this will drift toward either paralysis (every stochastic wobble is a P2) or nihilism (the model is always weird, stop paging me).

About Tian Pan