I need to be honest about something that’s been keeping me up at night—and I don’t mean because of on-call alerts.
In the past six months, we’ve lost three senior engineers from our EdTech platform. All three were excellent performers, well-compensated, and working on meaningful problems. And all three cited the same primary reason in their exit interviews: they couldn’t handle the on-call stress anymore.
This isn’t just my company. According to 2026 research, over 60% of platform engineers and SREs report chronic exhaustion, with 30% actively seeking new roles due to unsustainable workloads. We have an industry-wide crisis, and I don’t think we’re talking about it enough.
The root causes we’re seeing
- Alert fatigue: Engineers are getting 15-20 alerts per shift, most of which are noise
- Sleep deprivation: Multiple interruptions per night, making it impossible to get quality rest
- Lack of recovery: Engineers are back to normal work the day after a shift, with no time to recuperate
- Constant context switching: Being on-call for 10+ services simultaneously
- Psychological burden: The anxiety of waiting for an alert never truly goes away
One engineer told me: “I spend my entire on-call week in a state of low-grade anxiety, waiting for my phone to ring. Even when it doesn’t, I’m exhausted.”
Traditional approaches aren’t working
We’ve tried the standard playbook:
- Rotation schedules to distribute the burden
- Comprehensive runbooks for faster resolution
- Escalation policies so incidents reach the right people faster
- Post-mortems to prevent recurring issues
These help at the margins, but they don’t address the fundamental problem: we’re asking humans to be available 24/7 for systems that are growing more complex, not simpler.
What we’re trying now
I’m implementing some changes that feel controversial but that I believe are necessary:
1. Shadow on-call program
Junior engineers observe incident response without bearing primary responsibility. This has two benefits:
- Knowledge transfer happens naturally
- Primary on-call has a second pair of eyes without the pressure of training
Early results: burnout-related attrition has dropped.
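Mechanically, the pairing is trivial. Here's a minimal sketch of the rotation logic with made-up names; in practice the roster comes from your scheduling tool, and the schedule is loaded into the paging system so the shadow receives every page the primary does:

```python
from itertools import cycle

# Made-up rosters; in practice these come from the scheduling tool.
primaries = ["alice", "bala", "chen", "dmitri"]
shadows = ["erin", "farid", "gloria"]

def build_rotation(weeks: int) -> list[dict]:
    """Pair each weekly primary with a shadow observer.

    The shadow sees the same pages as the primary but is never the
    escalation target, so they learn without carrying the pager.
    """
    p, s = cycle(primaries), cycle(shadows)
    return [{"week": w + 1, "primary": next(p), "shadow": next(s)} for w in range(weeks)]

for slot in build_rotation(6):
    print(f"Week {slot['week']}: primary={slot['primary']}, shadow={slot['shadow']}")
```

One small design choice: keeping the two rosters different lengths means the pairs rotate, so each shadow eventually observes different primaries and the knowledge transfer spreads.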
2. Mandatory recovery time
After an on-call shift, engineers get:
- No meetings the day after shift ends
- Flex time to start late or leave early
- An explicit message that recovery is required, not optional
This was surprisingly controversial with some senior leaders who felt we were “coddling” engineers. My response: I’d rather “coddle” them than replace them.
3. Questioning 24/7 human on-call for everything
Here’s the really controversial one: Do all services actually need human on-call 24/7?
We did an analysis and found:
- 40% of our services could tolerate 30 minutes of downtime outside business hours with minimal customer impact
- Another 30% could have automated remediation for common failure modes
- Only 30% truly need immediate human intervention at 3AM
We’re experimenting with tiered on-call: critical services get immediate pages; non-critical services get batched notifications that can wait until morning.
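To make that concrete, here's a stripped-down sketch of the routing logic. The service names and tier table are made up, and the print stubs stand in for whatever paging and chat integrations you actually use:

```python
from datetime import datetime, time

# Illustrative tier assignments; ours live in service metadata, not code.
SERVICE_TIERS = {
    "checkout-api": "critical",    # page a human immediately, 24/7
    "report-generator": "batch",   # queue until morning
    "thumbnail-worker": "batch",
}

MORNING_DIGEST_HOUR = time(9, 0)
batched_alerts: list[dict] = []

def page_on_call(service: str, message: str) -> None:
    # Stub: in practice this calls your paging provider.
    print(f"PAGE: [{service}] {message}")

def route_alert(service: str, message: str, now: datetime) -> None:
    # Unknown services fail safe: treat them as critical and page.
    if SERVICE_TIERS.get(service, "critical") == "critical":
        page_on_call(service, message)
    else:
        # Non-critical alerts queue up and go out as one morning digest,
        # so nobody's sleep is interrupted for them.
        batched_alerts.append({"service": service, "message": message, "at": now})

def flush_morning_digest(now: datetime) -> None:
    if now.time() >= MORNING_DIGEST_HOUR and batched_alerts:
        lines = "\n".join(f"- [{a['service']}] {a['message']}" for a in batched_alerts)
        print(f"Morning digest, {len(batched_alerts)} alerts:\n{lines}")
        batched_alerts.clear()

route_alert("checkout-api", "5xx rate above threshold", datetime(2026, 3, 2, 2, 14))
route_alert("report-generator", "nightly export ran slow", datetime(2026, 3, 2, 2, 20))
flush_morning_digest(datetime(2026, 3, 2, 9, 5))
```

The fail-safe default matters: a service nobody bothered to classify should wake someone up, not silently wait until morning.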
The business case
Some executives push back: “But what about reliability? What about our SLAs?”
Here’s my argument: The cost of engineer turnover exceeds the cost of strategic downtime tolerance.
Replacing a senior engineer costs:
- 6-9 months of recruiter fees, interviewing time, and onboarding before a replacement is fully productive
- Loss of institutional knowledge
- Team morale impact when respected colleagues leave
- Risk during the knowledge gap period
Compare that to: Occasionally having a non-critical service down for 20 minutes at 2AM.
For most B2B SaaS products, customers don’t need 99.99% uptime on every feature. They need it on critical paths. Everything else can be 99.9% or even 99%, with batched notifications for non-critical alerts.
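If it helps to see the arithmetic, here's a back-of-envelope sketch. Every figure in it is an assumption I made up for illustration, not data from our books:

```python
# Back-of-envelope math; all figures below are illustrative assumptions.

MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200

for sla in (0.9999, 0.999, 0.99):
    budget = MINUTES_PER_MONTH * (1 - sla)
    print(f"{sla:.2%} uptime -> ~{budget:,.0f} min/month of allowed downtime")
# 99.99% -> ~4 min; 99.90% -> ~43 min; 99.00% -> ~432 min

# Turnover vs. tolerated downtime, with placeholder numbers:
senior_salary = 180_000                  # assumed annual salary
replacement_cost = senior_salary * 0.75  # assumed: recruiting + ramp-up ~ 75% of salary

incidents_per_year = 24   # assumed: two batched off-hours incidents per month
cost_per_incident = 500   # assumed: 20 min down on a non-critical path

print(f"Losing one senior engineer: ~${replacement_cost:,.0f}")
print(f"A year of tolerated off-hours downtime: ~${incidents_per_year * cost_per_incident:,.0f}")
```

With these placeholders the gap is an order of magnitude. Swap in your own numbers and see where you land.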
What I’m asking the industry
We need to normalize sustainable on-call practices. That means:
- Talking openly about burnout instead of treating it as individual weakness
- Questioning whether every system needs 24/7 human availability
- Building automation and AI to handle routine issues
- Treating on-call recovery as a legitimate business need, not a luxury
- Measuring on-call wellness metrics alongside traditional SLAs (see the sketch after this list)
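On that last point, a wellness dashboard doesn't have to be elaborate. Here's a minimal sketch of the kind of numbers I mean, with made-up page data; in reality you'd pull the pages from your alerting tool's history:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Page:
    engineer: str
    fired_at: datetime

# Made-up pages; real ones come from your alerting tool's API.
pages = [
    Page("alice", datetime(2026, 3, 2, 2, 14)),
    Page("alice", datetime(2026, 3, 2, 3, 40)),
    Page("alice", datetime(2026, 3, 4, 11, 5)),
]

def is_night(dt: datetime) -> bool:
    """Treat pages between 22:00 and 07:00 as sleep-interrupting."""
    return dt.hour >= 22 or dt.hour < 7

def wellness_summary(engineer: str, shift_days: int) -> dict:
    mine = [p for p in pages if p.engineer == engineer]
    night = [p for p in mine if is_night(p.fired_at)]
    return {
        "pages_per_day": round(len(mine) / shift_days, 2),
        "night_pages": len(night),
        "interrupted_nights": len({p.fired_at.date() for p in night}),
    }

print(wellness_summary("alice", shift_days=7))
# {'pages_per_day': 0.43, 'night_pages': 2, 'interrupted_nights': 1}
```

Track interrupted nights per engineer per quarter right next to your SLA dashboard. If that number is climbing, your rotation is eating people, whatever your uptime says.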
Being vulnerable as a leader
I feel responsible for the engineers we lost. I should have seen the signs earlier. I should have pushed back harder on adding services to the on-call rotation without removing others.
As engineering leaders, we need to protect our teams—even from our own ambitions to have perfect uptime. The human cost is real, and it’s unsustainable.
Questions for the community
- How is your company handling on-call burnout?
- Has anyone successfully reduced on-call burden without sacrificing reliability?
- What metrics do you use to track team wellness?
- For those who’ve left jobs due to on-call stress—what could have made you stay?
I’m genuinely looking for ideas here. This feels like one of the biggest challenges facing our industry in 2026.