The On-Call Problem Is an Engineering Leadership Problem
Let me be direct: if your on-call rotation is burning out your engineers, it’s not a tool problem or a monitoring problem or a staffing problem. It’s a leadership problem. Sustainable on-call is an investment decision, and most teams are underfunding it.
I’ve seen the pattern repeatedly: a team scales up, adds more services, adds more alerts “just in case,” and suddenly the on-call rotation becomes something people dread rather than accept as a normal part of the job. Engineers start taking extra vacation before their on-call week. Attrition increases among your best engineers first — they have options. And you lose the people who knew the system best, which makes on-call worse, which accelerates attrition.
This is a failure mode with a very predictable shape, and it’s entirely preventable.
What Broken On-Call Looks Like
- 3am pages for issues that can wait until morning: if it’s not actively degrading user experience, it should not page someone at 3am
- No runbooks: engineers get paged, spend 45 minutes debugging something that could have been diagnosed in 5 minutes with a clear procedure
- No escalation path: single point of failure, no one to call when you’re stuck
- Alert fatigue: a high page rate trains engineers to assume pages are noise; they start checking Slack from bed to see whether a page is real before getting up to investigate
- Engineers afraid to take vacation: they know whoever is on-call while they’re gone will be miserable, and the reciprocal guilt prevents time off
- No feedback loop: pages go out, issues get resolved, but nobody asks “should this have paged? should we fix this so it doesn’t happen again?”
The Business Cost Nobody Is Calculating
Every burnout-driven attrition event among your senior engineers costs you 6-12 months of lost productivity while you hire and ramp a replacement, plus the institutional knowledge that leaves with them. If you’re paying engineers $200-300k and the fully loaded cost of replacement is 1.5x salary, you’re looking at $300-450k per attrition event.
The toil reduction investment to fix your on-call rotation is almost certainly less than one attrition event. The math should make this obvious, but it often doesn’t get framed this way.
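The comparison is simple enough to put in a back-of-the-envelope calculation. A minimal sketch using the figures from the text; the 1.5x multiplier, the project size, and the weekly fully loaded cost are illustrative assumptions, not benchmarks:

```python
# Back-of-the-envelope comparison: one attrition event vs. a
# toil-reduction project. All figures are illustrative assumptions.

def attrition_cost(salary: float, replacement_multiplier: float = 1.5) -> float:
    """Fully loaded cost of losing and replacing one engineer."""
    return salary * replacement_multiplier

def toil_reduction_cost(engineers: int, weeks: int,
                        weekly_cost_per_engineer: float) -> float:
    """Cost of dedicating engineers to on-call cleanup for some weeks."""
    return engineers * weeks * weekly_cost_per_engineer

# One senior engineer at $250k leaving:
loss = attrition_cost(250_000)            # $375,000

# Two engineers spending one quarter (13 weeks) on toil reduction,
# at a hypothetical $5,000/week fully loaded each:
fix = toil_reduction_cost(2, 13, 5_000)   # $130,000

print(f"attrition event: ${loss:,.0f}, toil-reduction project: ${fix:,.0f}")
```

Even with generous assumptions about the project's size, the fix comes in well under a single departure.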
What Good On-Call Design Looks Like
SLO-based alerting: Only page when actual user impact is occurring. “CPU is at 85%” is not a page — “error rate is above our SLO threshold” is. The shift from cause-based to user-impact-based alerting is the single highest-leverage improvement most teams can make.
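The decision rule above can be sketched in a few lines. This is a hedged illustration, not a real alerting pipeline: the SLO threshold, function name, and window are all assumptions.

```python
# Sketch of user-impact-based paging: page on SLO breach, never on
# resource utilization alone. Threshold is an illustrative assumption.

ERROR_RATE_SLO = 0.001  # SLO: at most 0.1% of requests may fail

def should_page(errors: int, requests: int, cpu_utilization: float) -> bool:
    """Page only when user impact exceeds the SLO.

    cpu_utilization is deliberately ignored: high CPU with a healthy
    error rate is a daytime ticket, not a 3am page.
    """
    if requests == 0:
        return False
    return errors / requests > ERROR_RATE_SLO

# CPU is hot but users are fine: no page.
print(should_page(errors=2, requests=10_000, cpu_utilization=0.85))   # False
# Error rate breaches the SLO: page.
print(should_page(errors=50, requests=10_000, cpu_utilization=0.40))  # True
```

In a production system this check would typically run over a rolling window (or as a multi-window burn-rate alert), but the principle is the same: the pager input is user impact, not machine state.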
Runbooks for every alert: If an alert fires, there should be a runbook. If there’s no runbook, the alert should be downgraded or eliminated until a runbook exists. This sounds simple. It takes discipline to enforce.
Clear escalation paths: Who do you call when you’re stuck? What’s the decision authority for escalating to an incident commander? This should be documented and practiced, not improvised at 2am.
Compensated on-call: Engineers should be compensated for on-call duty and especially for incidents. This varies by company and jurisdiction, but the principle is: if you’re asking someone to be available and responsive outside business hours, that has value and should be recognized.
Psychological safety to push back on bad alerts: Engineers should feel empowered to file a ticket saying “this alert is noise and should be eliminated.” That ticket should be treated as high priority.
The Toil Reduction Imperative
If an alert fires repeatedly for the same reason, you have two choices: fix the underlying problem, or automate the response. Doing neither is only acceptable for 30 days. After that, it’s a management failure.
Track: what percentage of your pages require human judgment versus could be automated? What percentage of human-judgment pages resolve in under 15 minutes with a runbook step? Those are your highest-priority automation opportunities.
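The tracking described above only needs two tags on each page: did it require human judgment, and how long did it take to resolve. A minimal sketch, with the record shape, field names, and 15-minute cutoff all assumed for illustration:

```python
# Hypothetical page log analysis: surface the two percentages the text
# asks for. Field names and sample data are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class Page:
    alert: str
    needed_human_judgment: bool
    resolved_minutes: int

def automation_candidates(pages: list[Page]) -> dict[str, float]:
    """Percent of pages that could be automated outright, and percent
    of judgment pages that resolve quickly via a runbook step."""
    total = len(pages)
    automatable = [p for p in pages if not p.needed_human_judgment]
    quick = [p for p in pages
             if p.needed_human_judgment and p.resolved_minutes < 15]
    return {
        "pct_automatable": 100 * len(automatable) / total,
        "pct_quick_runbook": 100 * len(quick) / total,
    }

pages = [
    Page("disk-full", False, 10),         # cleanup could be automated
    Page("cert-expiry", False, 5),        # renewal could be automated
    Page("error-budget-burn", True, 12),  # quick with a runbook step
    Page("db-failover", True, 90),        # genuinely needs a human
]
print(automation_candidates(pages))
# → {'pct_automatable': 50.0, 'pct_quick_runbook': 25.0}
```

The alerts in the first bucket should stop paging entirely; the second bucket is where runbook steps become scripts.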
Measuring On-Call Health
- Mean time between pages (per engineer): if someone is being paged more than once per on-call shift, your rotation is broken
- % of pages that are actionable: low actionability = high noise = alert fatigue
- Responder acknowledgment time: proxy for on-call dread; slow ack times signal engineers aren’t engaging promptly
- On-call toil hours per week: the time spent on on-call activity that doesn’t improve the system
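All four metrics above fall out of a single page log. A sketch under assumed field names and sample data, to show how little instrumentation this actually requires:

```python
# Sketch: compute the four on-call health metrics from a page log.
# Record shape and sample values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class PageRecord:
    engineer: str
    actionable: bool     # did the page require any action?
    ack_seconds: int     # time to acknowledge
    toil_minutes: int    # manual work that didn't improve the system

def oncall_health(records: list[PageRecord], shifts: int) -> dict[str, float]:
    n = len(records)
    return {
        "pages_per_shift": n / shifts,
        "pct_actionable": 100 * sum(r.actionable for r in records) / n,
        "mean_ack_seconds": sum(r.ack_seconds for r in records) / n,
        "toil_hours": sum(r.toil_minutes for r in records) / 60,
    }

log = [
    PageRecord("alice", True, 120, 45),
    PageRecord("alice", False, 600, 20),
    PageRecord("bob", True, 90, 30),
    PageRecord("bob", False, 900, 15),
]
print(oncall_health(log, shifts=2))
```

Here pages-per-shift is 2 (over the broken threshold named above) and only half the pages were actionable: exactly the picture that should trigger an alert cleanup.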
Review these monthly. Make them visible to the team. Commit publicly to improving them.
On-call should be a manageable part of the job, not the reason engineers leave your company. You get to choose which one it is.