Skip to main content

The On-Call Rotation Your Agent Platform Forgot to Staff

· 11 min read
Tian Pan
Software Engineer

The AI platform team has four engineers. The internal agent they shipped seven months ago is now answering questions for 200 employees a day. For the first month the founding engineer answered every Slack ping personally — Tuesday at 11pm, Sunday morning, the night of the company offsite. Then she got promoted to staff engineer for the impact she had on adoption, and three weeks later she stopped checking the channel after 6pm because that is what staff engineers do. The on-call rotation that was supposed to replace her was never formalized, because the operating model was always going to be figured out "after the pilot."

The day the agent silently degrades for a quarter of users — a retrieval index that quietly fell behind, or a model version flip that shifted refusal behavior, or a tool whose schema rotated and is now returning empty arrays — the complaints do not land on the platform team's pager. They land in the help desk queue, staffed by people who do not have access to the agent's traces, do not know what a system prompt is, and have been told by IT that the agent is "owned by the AI team." Sixteen hours pass between the first user complaint and the first engineer who looks at a trace. Nobody on the platform team is asleep at the wheel; there is no wheel.

This is the failure mode every AI platform team is one promotion away from. The work to build the agent was budgeted; the work to operate it was not. The team that knows how to debug it has the wrong shift structure to do so sustainably. And the team that owns the customer relationship cannot see the system well enough to triage it. The gap between those two facts is where user trust quietly dies, and it is a staffing problem dressed up as a tooling problem.

Build Teams and Operate Teams Were the Same Team Until They Weren't

The pattern is recognizable from a decade of SRE history, recompressed into eighteen months. In year one of any new platform, the people who built it are the people who operate it, because they are the only people who can. They handle pages on goodwill. They write a Slack triage thread for tracking. They internalize the failure modes faster than they can document them. It looks like a working operating model because the platform is small enough that goodwill scales.

Then the platform grows. Adoption hits a knee. The founding engineers get visible internally and start getting recruited to other surfaces, or promoted into roles that explicitly de-prioritize tactical response. The new hires onboard into the codebase but not into the tacit knowledge of how the system breaks. The Slack triage thread that worked for six users no longer works for two hundred. And nobody on the team has run the calculation: if the founding engineer stops responding, what is the time to first eyes on an incident?

Google's SRE workbook makes the math explicit and has for over a decade. A sustainable 24/7 on-call rotation needs a minimum of eight engineers at a single site, or six per site across two well-separated geographies for follow-the-sun. Fewer than that and the rotation frequency turns toxic — engineers carry the pager often enough that recovery time between shifts collapses and turnover accelerates. The "25% on-call rule" — one in four engineers paging at any moment — exists because that is roughly the largest fraction of attention a healthy team can divert from project work without the project work dying.

An AI platform team of four does not satisfy this math. It cannot. The choice is not between "good on-call" and "great on-call" — it is between "no on-call" and "an on-call structure that explicitly acknowledges the team is too small and either hires, federates, or scopes back the surface it covers."

Staff the Rotation Before the First Incident, Not After

The discipline that fixes this is the same one SRE orgs learned in the 2010s: the operating model ships with the system, not after it. For an internal AI agent moving from pilot to production, that means three concrete things land before the first 50-user cohort goes live, not after the first incident.

An explicit rotation with shadow weeks. Even on a team of four, a documented primary/secondary rotation with one-week shifts is dramatically better than the informal "whoever sees it first" model, because it makes the unfairness of the load visible. If your four-engineer team would have each person primary every other week — a load that fails the Google SRE recommendation by a factor of two — that itself is the artifact you need to take to leadership for headcount. The rotation is a forcing function for the staffing conversation; the absence of one lets the team paper over the gap with goodwill until the goodwill runs out.

Shadow rotations for the first six weeks. Pair every primary shift with a shadow from a partner team — an SRE, a platform engineer from an adjacent group, or a senior support engineer with shell access. The shadow does not have to debug the agent. They have to learn the runbook against a live incident so that month two does not depend on the same four people who shipped it. The half-life of tacit knowledge in a small team is short; the way you extend it is by giving someone outside the team forty-five minutes of supervised incident time.

A paging SLA in writing. Not "we'll get to it." A documented commitment — "P1 within 15 minutes, P2 within 4 business hours, P3 next-business-day" — that the platform team makes to the rest of the org. The number itself matters less than the act of writing it down, because the act of writing it down is the act of staffing for it. A team that cannot honor the SLA it would have to commit to in writing knows, before the first incident, that it is understaffed.

The Help Desk Is Where Silent Degradations Surface First

The escalation tree is the second half of the staffing problem, and the part most platform teams under-invest in. Customer-facing complaints about an internal AI agent route to the help desk by default, because that is where IT support routes for everything else. The help desk has triage scripts for VPN, single sign-on, and laptop hardware. They do not have a triage script for "the agent gave me an answer that contradicts last quarter's policy doc."

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates