On-Call That Doesn't Destroy Your Engineers' Lives (And Why Most Rotations Are Broken)

The On-Call Problem Is an Engineering Leadership Problem

Let me be direct: if your on-call rotation is burning out your engineers, it’s not a tool problem or a monitoring problem or a staffing problem. It’s a leadership problem. Sustainable on-call is an investment decision, and most teams are underfunding it.

I’ve seen the pattern repeatedly: a team scales up, adds more services, adds more alerts “just in case,” and suddenly the on-call rotation becomes something people dread rather than accept as a normal part of the job. Engineers start taking extra vacation before their on-call week. Attrition increases among your best engineers first — they have options. And you lose the people who knew the system best, which makes on-call worse, which accelerates attrition.

This is a failure mode with a very predictable shape, and it’s entirely preventable.

What Broken On-Call Looks Like

  • 3am pages for issues that can wait until morning: if it’s not actively degrading user experience, it should not page someone at 3am
  • No runbooks: engineers get paged, spend 45 minutes debugging something that could have been diagnosed in 5 minutes with a clear procedure
  • No escalation path: single point of failure, no one to call when you’re stuck
  • Alert fatigue: a high page rate trains engineers to assume pages are noise; they start checking Slack from bed to see if it’s real before getting up to investigate
  • Engineers afraid to take vacation: they know whoever is on-call while they’re gone will be miserable, and the reciprocal guilt prevents time off
  • No feedback loop: pages go out, issues get resolved, but nobody asks “should this have paged? should we fix this so it doesn’t happen again?”

The Business Cost Nobody Is Calculating

Every burnout-driven attrition event involving a senior engineer costs you 6-12 months of lost productivity while you hire and ramp a replacement, plus the institutional knowledge that walks out the door. If you’re paying engineers $200-300k and the fully loaded cost of replacement is 1.5x salary, you’re looking at $300-450k per attrition event.

The toil reduction investment to fix your on-call rotation is almost certainly less than one attrition event. The math should make this obvious, but it often doesn’t get framed this way.
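A back-of-envelope version of that math, using only the figures above (the one-engineer-week estimate for the toil fix is my own illustrative assumption, not a number from any study):

```python
# Back-of-envelope comparison: one attrition event vs. fixing on-call.
# Salary figures and the 1.5x multiplier come from the text above;
# the one-week toil-fix estimate is illustrative.

def replacement_cost(salary: float, multiplier: float = 1.5) -> float:
    """Fully loaded cost of replacing a departed engineer."""
    return salary * multiplier

low = replacement_cost(200_000)   # 300_000
high = replacement_cost(300_000)  # 450_000

# One engineer-week of focused toil-reduction work, generously loaded:
toil_fix_cost = 300_000 / 52      # roughly 5,800 per week

print(f"Attrition event: ${low:,.0f}-${high:,.0f}")
print(f"One week of toil reduction: ~${toil_fix_cost:,.0f}")
```

Even if the fix takes a month instead of a week, it is still an order of magnitude cheaper than losing one senior engineer.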

What Good On-Call Design Looks Like

SLO-based alerting: Only page when actual user impact is occurring. “CPU is at 85%” is not a page — “error rate is above our SLO threshold” is. The shift from symptom-based to user-impact-based alerting is the single highest-leverage improvement most teams can make.
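A minimal sketch of that distinction, assuming a per-window stats record and a 1% error-rate SLO threshold (both invented for illustration):

```python
# Sketch of user-impact-based paging: page only when the observed error
# rate breaches the SLO threshold, never on raw resource symptoms.
# The 1% threshold and the record shape are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class WindowStats:
    requests: int
    errors: int
    cpu_utilization: float  # 0.0-1.0, deliberately unused for paging

SLO_ERROR_RATE = 0.01  # page if more than 1% of requests fail in the window

def should_page(window: WindowStats) -> bool:
    if window.requests == 0:
        return False  # no traffic means no user impact
    error_rate = window.errors / window.requests
    # Note: cpu_utilization never appears in this decision.
    # "CPU is at 85%" is a dashboard signal, not a page.
    return error_rate > SLO_ERROR_RATE

# High CPU but healthy error rate: no page.
assert not should_page(WindowStats(requests=10_000, errors=20, cpu_utilization=0.85))
# Error rate above SLO: page, regardless of resource metrics.
assert should_page(WindowStats(requests=10_000, errors=250, cpu_utilization=0.40))
```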

Runbooks for every alert: If an alert fires, there should be a runbook. If there’s no runbook, the alert should be downgraded or eliminated until a runbook exists. This sounds simple. It takes discipline to enforce.
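One way to make the rule enforceable rather than aspirational is a check over your alert definitions. This sketch assumes each alert is a simple dict with hypothetical `severity` and `runbook_url` fields; the shape of real alert configs will differ by tool:

```python
# Sketch: enforce "no alert without a runbook" as an automated check.
# The alert-definition shape (name/severity/runbook_url) is invented
# for illustration; adapt it to your alerting tool's config format.

def alerts_missing_runbooks(alerts: list[dict]) -> list[str]:
    """Return names of page-severity alerts with no runbook attached.
    Per the rule above, these should be downgraded until one exists."""
    return [
        a["name"] for a in alerts
        if a.get("severity") == "page" and not a.get("runbook_url")
    ]

alerts = [
    {"name": "error_rate_slo", "severity": "page", "runbook_url": "https://wiki/rb/1"},
    {"name": "disk_pressure", "severity": "page", "runbook_url": ""},
    {"name": "cpu_high", "severity": "ticket"},  # tickets don't need one
]
assert alerts_missing_runbooks(alerts) == ["disk_pressure"]
```

Run it in CI against your alerting config and the discipline enforces itself.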

Clear escalation paths: Who do you call when you’re stuck? What’s the decision authority for escalating to an incident commander? This should be documented and practiced, not improvised at 2am.

Compensated on-call: Engineers should be compensated for on-call duty and especially for incidents. This varies by company and jurisdiction, but the principle is: if you’re asking someone to be available and responsive outside business hours, that has value and should be recognized.

Psychological safety to push back on bad alerts: Engineers should feel empowered to file a ticket saying “this alert is noise and should be eliminated.” That ticket should be treated as high priority.

The Toil Reduction Imperative

If an alert fires repeatedly for the same reason, you have two choices: fix the underlying problem, or automate the response. Doing neither is only acceptable for 30 days. After that, it’s a management failure.

Track two numbers: what percentage of your pages require human judgment, and what percentage could be handled automatically? Of the human-judgment pages, what percentage resolve in under 15 minutes by following a runbook step? Those are your highest-priority automation opportunities.
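Those questions can be turned into a query over your page history. The `Page` record fields here are hypothetical stand-ins for whatever your paging tool actually exports:

```python
# Sketch: find automation candidates in a page history. A page that a
# human resolved quickly by following a fixed runbook step is a prime
# candidate for automated remediation. Field names are hypothetical.

from dataclasses import dataclass

@dataclass
class Page:
    needed_human_judgment: bool
    resolved_by_runbook_step: bool
    minutes_to_resolve: int

def automation_candidates(pages: list[Page]) -> list[Page]:
    """Human-judgment pages resolved in under 15 minutes via a runbook
    step: the highest-priority automation opportunities."""
    return [
        p for p in pages
        if p.needed_human_judgment
        and p.resolved_by_runbook_step
        and p.minutes_to_resolve < 15
    ]

history = [
    Page(True, True, 10),   # quick, mechanical fix: automate it
    Page(True, False, 45),  # genuine judgment call: keep the human
    Page(False, True, 5),   # didn't need judgment: already automatable noise
]
assert len(automation_candidates(history)) == 1
```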

Measuring On-Call Health

  • Mean time between pages (per engineer): if someone is being paged more than once per on-call shift, your rotation is broken
  • % of pages that are actionable: low actionability = high noise = alert fatigue
  • Responder acknowledgment time: proxy for on-call dread; slow ack times signal engineers aren’t engaging promptly
  • On-call toil hours per week: the time spent on on-call activity that doesn’t improve the system
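A sketch of computing those four metrics from a simple page log. The `PageRecord` fields are invented for illustration and would need mapping onto your tool’s actual export:

```python
# Sketch: the four on-call health metrics above, computed from a flat
# page log. The PageRecord shape is a hypothetical stand-in for a
# PagerDuty-style export.

from dataclasses import dataclass

@dataclass
class PageRecord:
    engineer: str
    actionable: bool
    seconds_to_ack: int
    toil_minutes: int

def health_report(pages: list[PageRecord], weeks: float) -> dict:
    n = len(pages)
    engineers = {p.engineer for p in pages}
    return {
        "pages_per_engineer_per_week": n / len(engineers) / weeks,
        "pct_actionable": 100 * sum(p.actionable for p in pages) / n,
        "mean_ack_seconds": sum(p.seconds_to_ack for p in pages) / n,
        "toil_hours_per_week": sum(p.toil_minutes for p in pages) / 60 / weeks,
    }

log = [
    PageRecord("ana", True, 120, 30),
    PageRecord("ana", False, 600, 10),
    PageRecord("ben", True, 90, 45),
    PageRecord("ben", False, 900, 5),
]
report = health_report(log, weeks=4)
assert report["pct_actionable"] == 50.0
assert report["pages_per_engineer_per_week"] == 0.5
```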

Review these monthly. Make them visible to the team. Commit publicly to improving them.

On-call should be a manageable part of the job, not the reason engineers leave your company. You get to choose which one it is.

I’ve been the engineer dreading the pager. Happy to share a concrete experience.

There was a three-week stretch two years ago where I was paged at 3am four nights in a row during my on-call shift. Not the same issue — four different alerts, none of them actually requiring immediate action in retrospect. One was a scheduled job that always spiked memory briefly. One was a flapping health check with a known false-positive rate that nobody had fixed. One was a downstream dependency returning 503s for 90 seconds before recovering on its own.

By day four, I was not okay. I was making decisions at work from a cognitive deficit. I started genuinely evaluating other offers. I started counting the days until my rotation ended.

What turned it around was a manager who did something simple: she sat down with me, pulled the pager history for the previous 90 days, and we triaged every alert together. For each alert: “should this page at 3am? is there a runbook? did the last three pages require action?”

We eliminated or downgraded 60% of the alerts in that session. Just that. No new tooling, no process overhaul. A two-hour review meeting.

The rotation after that was genuinely fine. I slept. I didn’t start a job search.

The engineering work required to fix on-call is usually less than a week of focused effort. The blocker is almost always leadership prioritizing it. That two-hour investment my manager made probably retained me for another two years.

Security incident response on-call is a genuinely different beast, and I want to make sure it doesn’t get lost in a general reliability on-call discussion.

The differences that matter:

Triage requirements are different: a reliability incident is usually “is the system working?” A security incident is “what happened, how far did it spread, do we understand the blast radius, and do we need to notify customers or regulators?” That triage can take 4-6 hours even for a relatively contained incident.

The blast radius isn’t always obvious at 2am: reliability incidents usually have a clear blast radius (service is down for these users). Security incidents may initially look small and expand significantly as you investigate. The operating rule for a security incident is: assume it is worse than the initial indicators suggest until proven otherwise.

Regulatory and legal clock starts ticking: GDPR breach notification is 72 hours. HIPAA breach notification has specific timelines. At 2am, you may be starting a clock you don’t realize is running.
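The 72-hour figure is worth encoding so nobody has to do timezone arithmetic at 2am. A trivial sketch (the awareness timestamp is illustrative; GDPR Article 33 counts from when you become aware of the breach):

```python
# Sketch: the regulatory clock as code. GDPR Art. 33 allows 72 hours
# from becoming aware of a personal-data breach. The deadline math is
# the easy part; the hard part is noticing the clock started at 2am.

from datetime import datetime, timedelta, timezone

GDPR_NOTIFICATION_WINDOW = timedelta(hours=72)

def gdpr_deadline(became_aware: datetime) -> datetime:
    """Latest time to notify the supervisory authority."""
    return became_aware + GDPR_NOTIFICATION_WINDOW

# The 2am page that starts the clock (illustrative timestamp):
aware = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)
assert gdpr_deadline(aware) == datetime(2024, 6, 4, 2, 0, tzinfo=timezone.utc)
```

Other regimes (HIPAA, state breach laws) have their own windows; the point is that the deadline should be computed and surfaced by the runbook, not recalled from memory mid-incident.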

What this means for runbooks: security on-call runbooks need to be separate from reliability runbooks and need to include:

  • Initial evidence preservation steps (before you start poking around and altering logs)
  • Legal/compliance notification checklist
  • Customer communication decision tree
  • An action log for recording every step the responder takes (you’ll need this for the post-incident report)

My practical recommendation: security on-call should have a dedicated escalation path to legal/compliance that doesn’t require waking up 5 people to find the right person. That path should be tested at least once a quarter with a tabletop exercise, not just documented and forgotten.

I did a 3-month on-call rehabilitation project at a previous company. Here’s what actually worked.

Month 1: Measure and triage.

We instrumented PagerDuty to export weekly stats: number of pages per person, time-of-day distribution, resolution time, and — most importantly — whether each page resulted in any action or just an acknowledgment and close.

35% of our pages resulted in no action. They were noise. That’s the low-hanging fruit.

We did a triage session with the whole on-call rotation: for every alert that had fired in the past 30 days, we asked “is this worth waking someone up for at 3am?” A lot of alerts had been added years ago for issues that were since fixed, or for thresholds that made sense in an earlier era. We deleted or downgraded them.
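That triage heuristic can be sketched as a pass over recent page history: flag any alert that fired repeatedly and never required action. The `(alert_name, action_taken)` record shape is hypothetical:

```python
# Sketch of the triage pass: group the last 30 days of pages by alert
# and flag alerts whose recent firings were all acknowledged-and-closed
# with no action taken. The input shape is a hypothetical simplification.

from collections import defaultdict

def flag_noise_alerts(pages, min_pages=3):
    """pages: iterable of (alert_name, action_taken: bool) tuples.
    Returns alerts with >= min_pages firings, none requiring action —
    candidates for deletion or downgrade."""
    by_alert = defaultdict(list)
    for name, action_taken in pages:
        by_alert[name].append(action_taken)
    return [
        name for name, actions in by_alert.items()
        if len(actions) >= min_pages and not any(actions)
    ]

history = [
    ("memory_spike_cron", False), ("memory_spike_cron", False),
    ("memory_spike_cron", False),
    ("error_rate_slo", True), ("error_rate_slo", False),
]
assert flag_noise_alerts(history) == ["memory_spike_cron"]
```

The output is exactly the agenda for the kind of two-hour triage session described above: a short list of alerts with the burden of proof on keeping them.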

Month 2: Runbooks.

We made a rule: no alert without a runbook. For every alert that survived the triage, we wrote a runbook. It didn’t have to be perfect — it just had to tell the responder what the alert means, what to check first, and what the remediation options are.

Mean resolution time dropped 40%. Not because engineers got faster at debugging — because they weren’t spending 20 minutes figuring out what an alert even meant before they could start fixing it.

Month 3: Close the loop.

We started a weekly 30-minute “on-call retro” — review every page from the past week, ask what we can fix. Made it a team ritual, not a management review.

By end of month 3, pages per engineer per week had dropped 80%. Attrition on the team that year: zero.