We're losing engineers to on-call burnout - time to rethink SRE practices

I need to be honest about something that’s been keeping me up at night—and I don’t mean because of on-call alerts.

In the past six months, we’ve lost three senior engineers from our EdTech platform. All three were excellent performers, well-compensated, and working on meaningful problems. And all three cited the same primary reason in their exit interviews: they couldn’t handle the on-call stress anymore.

This isn’t just my company. According to 2026 research, over 60% of platform engineers and SREs report chronic exhaustion, with 30% actively seeking new roles due to unsustainable workloads. We have an industry-wide crisis, and I don’t think we’re talking about it enough.

The root causes we’re seeing

  1. Alert fatigue: Engineers are getting 15-20 alerts per shift, most of which are noise
  2. Sleep deprivation: Multiple interruptions per night, making it impossible to get quality rest
  3. Lack of recovery: Back to normal work the next day with no recovery time
  4. Constant context switching: Being on-call for 10+ services simultaneously
  5. Psychological burden: The anxiety of waiting for an alert never truly goes away

One engineer told me: “I spend my entire on-call week in a state of low-grade anxiety, waiting for my phone to ring. Even when it doesn’t, I’m exhausted.”

Traditional approaches aren’t working

We’ve tried the standard playbook:

  • Rotation schedules to distribute the burden
  • Comprehensive runbooks for faster resolution
  • Escalation policies
  • Post-mortems to prevent recurring issues

These help at the margins, but they don’t address the fundamental problem: we’re asking humans to be available 24/7 for systems that are growing more complex, not simpler.

What we’re trying now

I’m implementing some changes that feel controversial, but I believe are necessary:

1. Shadow on-call program

Junior engineers observe incident response without bearing primary responsibility. This has two benefits:

  • Knowledge transfer happens naturally
  • Primary on-call has a second pair of eyes without the pressure of training

Early results: burnout-related attrition has dropped.

2. Mandatory recovery time

After an on-call shift, engineers get:

  • No meetings the day after shift ends
  • Flex time to start late or leave early
  • An explicit message that recovery is expected, not optional

This was surprisingly controversial with some senior leaders who felt we were “coddling” engineers. My response: I’d rather “coddle” them than replace them.

3. Questioning 24/7 human on-call for everything

Here’s the really controversial one: Do all services actually need human on-call 24/7?

We did an analysis and found:

  • 40% of our services could tolerate 30 minutes of downtime outside business hours with minimal customer impact
  • Another 30% could have automated remediation for common failure modes
  • Only 30% truly need immediate human intervention at 3AM

We’re experimenting with tiered on-call: critical services get immediate pages, non-critical services get batched notifications that can wait until morning.
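To make the tiers concrete, here's a rough sketch of what the routing logic could look like. The service names and the in-memory batching are illustrative assumptions, not our actual implementation:

```python
from dataclasses import dataclass, field
from datetime import datetime

# Illustrative classification -- in reality this would live in a service catalog.
CRITICAL_SERVICES = {"checkout", "auth"}

@dataclass
class TieredRouter:
    batched: list = field(default_factory=list)

    def route(self, service: str, message: str, ts: datetime) -> str:
        if service in CRITICAL_SERVICES:
            # Tier 1: wake a human immediately.
            return f"PAGE {service}: {message}"
        # Everything else waits for the morning digest.
        self.batched.append((ts, service, message))
        return "batched"

    def morning_digest(self) -> str:
        digest = "\n".join(f"{ts:%H:%M} {svc}: {msg}" for ts, svc, msg in self.batched)
        self.batched.clear()
        return digest
```

The important design choice is that "not critical" doesn't mean "dropped"; everything still surfaces, just at a humane hour.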

The business case

Some executives push back: “But what about reliability? What about our SLAs?”

Here’s my argument: The cost of engineer turnover exceeds the cost of strategic downtime tolerance.

Replacing a senior engineer costs:

  • 6-9 months of lost productivity, plus recruiter fees, interviewing time, and onboarding
  • Loss of institutional knowledge
  • Team morale impact when respected colleagues leave
  • Risk during the knowledge gap period

Compare that to: Occasionally having a non-critical service down for 20 minutes at 2AM.

For most B2B SaaS products, customers don’t need 99.99% uptime on every feature. They need it on critical paths. Everything else can be 99.9% or even 99%, with batched notifications for non-critical alerts.

What I’m asking the industry

We need to normalize sustainable on-call practices. That means:

  • Talking openly about burnout instead of treating it as individual weakness
  • Questioning whether every system needs 24/7 human availability
  • Building automation and AI to handle routine issues
  • Treating on-call recovery as a legitimate business need, not a luxury
  • Measuring on-call wellness metrics alongside traditional SLAs

Being vulnerable as a leader

I feel responsible for the engineers we lost. I should have seen the signs earlier. I should have pushed back harder on adding services to on-call rotation without removing others.

As engineering leaders, we need to protect our teams—even from our own ambitions to have perfect uptime. The human cost is real, and it’s unsustainable.

Questions for the community

  • How is your company handling on-call burnout?
  • Has anyone successfully reduced on-call burden without sacrificing reliability?
  • What metrics do you use to track team wellness?
  • For those who’ve left jobs due to on-call stress—what could have made you stay?

I’m genuinely looking for ideas here. This feels like one of the biggest challenges facing our industry in 2026.

Keisha, thank you for being so vulnerable about this. Reading your post, I felt like you were describing my exact situation—except at a Fortune 500 financial services company, the problem is even harder to solve because of regulatory constraints.

We’re facing the same crisis

In our platform engineering team of 40+ engineers, we’re seeing similar patterns:

  • Engineers responsible for 10+ services each, some handling 20+
  • Average 12-15 pages per on-call shift
  • Two senior engineers left in the last quarter, both citing burnout

When I did exit interviews, one engineer told me: “I can’t remember the last time I slept through the night during an on-call week. I wake up at 2AM even when my phone doesn’t ring, just checking to make sure I didn’t miss an alert.”

What we’re implementing

Your initiatives resonate strongly, especially the tiered approach. Here’s what we’re doing:

1. “No meeting” days after on-call

Similar to your mandatory recovery time. The day after an on-call shift ends, that engineer has:

  • No meetings scheduled
  • Flex start time (can come in at noon if they want)
  • Explicit permission to work on low-stakes, creative work instead of critical path items

The pushback from product teams was real: “But we need everyone in the sprint planning meeting!” My response: “You need a burned-out engineer even less.”

2. Tiered on-call by criticality

This is where financial services makes it complicated. Regulatory requirements mean we genuinely can’t tolerate downtime on core banking transactions. But you’re absolutely right—not everything needs 24/7 human response.

Our breakdown:

  • Tier 1 (Critical): Core banking, payments, compliance systems - immediate page, 5-minute response SLA
  • Tier 2 (Important): Customer-facing features, internal tools - batched alerts every 30 minutes, 1-hour response SLA
  • Tier 3 (Monitor only): Analytics, reporting, non-critical services - email alerts, next-business-day response

Moving services from Tier 1 to Tier 2 required extensive discussions with product, legal, and compliance. But we got there by asking: “What’s the actual customer impact of this being down for 30 minutes at 3AM?”

Turns out, for many services, the answer was “minimal.”

3. Establishing SRE wellness standards

You mentioned measuring on-call wellness metrics. We track:

  • Sleep-hour interruptions per engineer per month (target: <3)
  • On-call load distribution variance (should be <20% across team)
  • Time-to-acknowledge during business hours vs sleep hours (if sleep-hour times are worse, people are genuinely exhausted)
  • Post-shift recovery time actually utilized (if people aren’t taking it, we have a culture problem)

The data is brutal but necessary. Last month, one engineer had 11 sleep-hour interruptions while another had 2. That’s a load-balancing failure on my part as a leader.
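For anyone who wants to start tracking these, here's a minimal sketch of the first two metrics, assuming you can export page records as (engineer, local timestamp) pairs. The sleep-hour window and the way we read the <20% spread target are assumptions you'd tune for your own team:

```python
from collections import Counter
from datetime import datetime

SLEEP_START, SLEEP_END = 23, 7  # assumption: 11PM-7AM local counts as sleep hours

def sleep_hour_pages(pages: list[tuple[str, datetime]]) -> Counter:
    """Sleep-hour interruptions per engineer, from (engineer, local timestamp) records."""
    counts = Counter()
    for engineer, ts in pages:
        if ts.hour >= SLEEP_START or ts.hour < SLEEP_END:
            counts[engineer] += 1
    return counts

def load_imbalance(counts: Counter) -> float:
    """One way to operationalize the <20% spread target: (max - min) / mean."""
    values = list(counts.values())
    mean = sum(values) / len(values)
    return (max(values) - min(values)) / mean
```

An 11-vs-2 month like the one above comes out far past any reasonable threshold, which is exactly the point of putting a number on it.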

The executive conversation

You mentioned executives who worry about reliability. Here’s the data point that finally convinced our C-suite:

Cost of replacing a senior SRE: $250K-$350K (recruiting, onboarding, lost productivity, knowledge loss)

Cost of one hour of downtime on non-critical service at 2AM: ~$1,500 (very few users affected)

Even if we have 10 such incidents per year, that’s $15K vs $300K per engineer who leaves. The business case is clear when you put actual numbers on it.
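The back-of-the-envelope math, for anyone who wants to plug in their own numbers:

```python
# Rough figures from above -- swap in your own.
turnover_cost = 300_000              # midpoint of the $250K-$350K senior SRE estimate
downtime_cost_per_incident = 1_500   # one hour, non-critical service, 2AM
incidents_per_year = 10

annual_downtime_cost = incidents_per_year * downtime_cost_per_incident
print(f"${annual_downtime_cost:,} in downtime vs ${turnover_cost:,} per departure")
# -> $15,000 in downtime vs $300,000 per departure
```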

What we’re still struggling with

The hardest part: How do you handle executives who want 99.99% uptime on every feature? I’ve had product leaders say “our competitors have better uptime” without understanding the human cost.

My counter: “Our competitors also have 30% higher turnover in their SRE teams. Is that the model we want?”

A proposal for the industry

I love your call for normalizing sustainable on-call practices. We need industry-wide standards, similar to those we already have for security practices.

What if we, as engineering leaders, created and published:

  • Recommended maximum sleep-hour interruptions per engineer per month
  • Standard recovery time policies
  • Guidelines for service classification (what actually needs 24/7 human response)
  • Metrics for measuring team wellness

If enough companies adopted these, it would create industry pressure against unsustainable practices.

Thank you for this post

Seriously, thank you for talking about this openly. Too often, we treat engineer burnout as an individual problem (“they should be more resilient”) instead of a systemic failure of our practices.

The engineers we lost didn’t lack resilience—they made rational decisions that their health mattered more than our uptime metrics. And they were right.

This resonates so deeply. I’m one of those engineers you’re describing—currently on-call for six microservices, and honestly, I’m updating my resume because I can’t do this anymore.

The reality from an IC perspective

Last month I had four sleep-hour interruptions in a single week. By Friday, I was a zombie. I made a stupid mistake in production because I was too exhausted to think clearly. Then I felt terrible about it, which made the anxiety worse.

The thing nobody talks about: even when you don’t get paged, the stress is there. Every time my phone buzzes—even for a text message—I get a spike of adrenaline thinking it’s PagerDuty. I check my phone compulsively to make sure I haven’t missed an alert.

What would make me stay

You asked: “For those who’ve left jobs due to on-call stress—what could have made you stay?”

I haven’t left yet, but here’s what would change my mind:

1. Better alert classification

Right now, everything is treated as equally urgent. A database connection pool warning gets the same 3AM page as an actual outage. If we had:

  • Critical: Page immediately
  • Important: Slack notification (I’ll see it in the morning)
  • FYI: Email digest

That alone would cut my sleep interruptions by 70%.

2. Psychological safety to speak up

I’m genuinely scared to tell my manager I’m struggling with on-call. Will they think I’m not cut out for this? Will it hurt my performance review? Will I get passed over for promotions?

Reading @vp_eng_keisha’s post gives me hope that leadership actually cares. But I don’t know if my leadership feels the same way.

3. Visible recovery time

You mentioned mandatory recovery time—that’s huge. Right now, after a rough on-call week, I’m expected to show up to sprint planning on Monday morning at 9AM like nothing happened. Nobody acknowledges that I was up half the night Wednesday dealing with an incident.

If leadership explicitly said “take Monday morning off” or “skip this week’s meetings,” it would feel like the company actually values my wellbeing.

4. Seeing progress on reducing alerts

We keep saying we’ll “fix the alert noise problem,” but it never gets prioritized. Every sprint, it gets pushed for new features. After a year of empty promises, I don’t believe it anymore.

If I saw actual, concrete progress—even small wins like “we reduced alerts from service X by 30%”—it would give me hope.

The colleague who left

One of the senior engineers on my team left three months ago. In our 1:1 before he left, he said: “I make good money here, I like the team, but I can’t keep waking up at 3AM twice a week. My spouse is frustrated, I’m exhausted, and for what? So we can promise customers 99.99% uptime that they don’t actually need?”

That conversation haunts me because I’m starting to feel the same way.

A request for leaders

Please don’t treat on-call burnout as individual weakness. The engineers who are struggling aren’t lazy or not resilient enough—we’re operating in an unsustainable system.

When you lose engineers to burnout, don’t just replace them and move on. Fix the system that burned them out, or you’ll just lose the next person too.

One thing I appreciate

I want to end on a positive note: I really appreciate that @vp_eng_keisha is talking about this openly and implementing real changes. The shadow on-call program sounds amazing—junior engineers learn without the pressure, and I’d have backup during incidents.

If more leaders thought like this, I might not be updating my resume.

As VP of Product, I need to share a confession: I’ve been part of the problem.

Product’s role in on-call burnout

For years, I pushed for features without fully understanding the operational burden they created. Every new feature meant:

  • More code to monitor
  • More potential failure modes
  • More alerts for engineers to handle
  • More services in the on-call rotation

I didn’t connect the dots between “ship this feature” and “engineers getting paged at 3AM.”

The wake-up call

Last quarter, our engineering director showed me the data: three engineers left, all citing on-call stress. I asked for details, and he showed me:

  • Feature X that I pushed for: 15 new alerts per week, mostly false positives
  • Service Y that I insisted needed 99.99% uptime: Actually used by <5% of customers, could easily tolerate 99.5%
  • Integration Z: Generating 30% of all on-call pages, but driving <2% of revenue

I felt terrible. My product decisions were directly contributing to engineer burnout.

What I’m changing

1. Product owns understanding operational burden

Now, before we commit to building a feature, I ask engineering:

  • What’s the on-call impact?
  • How many new alerts will this generate?
  • Does this need 24/7 human monitoring?

If a feature significantly increases on-call burden, we either:

  • Build better automation first
  • Design it to be more resilient
  • Accept that it won’t have 24/7 SLAs

2. Customer SLOs ≠ Engineering SLOs

This was a revelation: our customers don’t actually need 99.99% uptime on every feature.

I surveyed our B2B customers and asked: “What’s your tolerance for downtime on [specific features] during off-hours (midnight-6AM)?”

The results shocked me:

  • Analytics dashboards: 80% said “downtime overnight is fine”
  • Reporting tools: 90% said “as long as it works during business hours”
  • Admin features: 95% said “who even uses this at night?”

Only our core transaction processing truly needed 24/7 uptime.

3. Factoring on-call cost into prioritization

Now when we prioritize features, we consider:

  • Revenue/user value (as always)
  • Development time (as always)
  • Operational burden (new: on-call impact, monitoring complexity)

Some features that looked good from a product perspective became much less attractive when we factored in the on-call cost.
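If it helps, here's a toy version of how that scoring can work once operational burden is in the formula. The weights are illustrative assumptions, not a standard model:

```python
def priority_score(user_value: float, dev_weeks: float,
                   expected_alerts_per_week: float, needs_24x7: bool) -> float:
    """Toy prioritization: discount user value by development AND operational cost.
    The 0.5 and 10.0 weights are illustrative assumptions, not a standard formula."""
    oncall_burden = expected_alerts_per_week * 0.5 + (10.0 if needs_24x7 else 0.0)
    return user_value / (dev_weeks + oncall_burden)
```

Two features with identical value and dev time can rank very differently once their expected alert load and 24/7 requirement are on the table, which is the whole point.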

The hard conversations

@vp_eng_keisha asked: “How do you handle executives who want 99.99% uptime?”

As someone who WAS that executive, here’s what changed my mind:

Show the human cost

Numbers are abstract. “99.99% uptime” sounds good. “Three engineers quit because they couldn’t sleep” is concrete and human.

When I saw the exit interview quotes—engineers saying they were anxious, exhausted, and their relationships were suffering—I couldn’t ignore it anymore.

Connect it to business outcomes

Replacing engineers is expensive. Burned-out engineers ship bugs. Low morale hurts velocity.

I calculated: the cost of occasional downtime on non-critical features was way lower than the cost of engineer turnover.

Reframe reliability

Instead of “we need 99.99% uptime on everything,” ask: “which user journeys are actually critical?”

For a B2B SaaS product, maybe only 20-30% of features truly need extreme uptime. The rest can be 99% or 99.5%, which is still pretty good but much easier on on-call engineers.

A commitment to product leaders

If you’re a PM or product leader reading this: please talk to your engineering teams about the on-call burden of features you’re requesting.

Your job is to balance user value with engineering cost—and “cost” includes the human toll on your team, not just development time.

Collaborating better

Now I work with eng to classify features before we build them:

  • Tier 1: Critical user flows, 99.99% uptime, 24/7 on-call
  • Tier 2: Important features, 99.9% uptime, batched alerts
  • Tier 3: Nice-to-have, 99% uptime, email notifications

This helps us make conscious trade-offs instead of defaulting to “everything is critical.”

Thank you @vp_eng_keisha for this post. It’s uncomfortable to realize I’ve been contributing to engineer burnout, but that discomfort is necessary for change.

This thread is so timely—at Uber, we’ve been experimenting with “follow-the-sun” on-call rotations across our global teams, and I think it offers a partial solution to the burnout problem.

The geographic advantage

Global teams have a unique opportunity: on-call doesn’t have to mean getting woken up at 3AM if you distribute coverage across time zones.

Our approach:

  • São Paulo team covers 9AM-5PM local time (which is early morning through midday in SF)
  • San Francisco team covers 9AM-5PM Pacific (which is evening in São Paulo, morning in APAC)
  • No one gets paged during sleep hours

This isn’t perfect—we still need weekend coverage—but it eliminates the worst part of on-call: sleep interruptions.
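If you're curious where your own gaps are, a quick scan over a weekday makes them visible. This models only the two sites listed above; any hours it flags as uncovered are where an additional (e.g. APAC) site would slot in:

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Shift windows per site (local start hour, local end hour), per the rotation above.
SHIFTS = {
    "America/Sao_Paulo": (9, 17),
    "America/Los_Angeles": (7, 17),   # SF starts early for the handoff overlap
}

def covered(utc_dt: datetime) -> bool:
    """True if some site's local working hours include this UTC instant."""
    return any(
        start <= utc_dt.astimezone(ZoneInfo(tz)).hour < end
        for tz, (start, end) in SHIFTS.items()
    )

# Scan one weekday (a Wednesday) for UTC hours when no site is on shift.
day = datetime(2026, 3, 4, tzinfo=ZoneInfo("UTC"))
gaps = [h for h in range(24) if not covered(day.replace(hour=h))]
```

Using real tz database names also means daylight saving transitions are handled for you instead of by error-prone mental math.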

The challenges we’ve encountered

1. Trust and documentation

For follow-the-sun to work, teams need to trust each other and have excellent handoff documentation. If the São Paulo team can’t debug a service because all the knowledge is in SF engineers’ heads, it doesn’t work.

We’ve invested heavily in:

  • Runbooks that actually work (tested regularly)
  • Service ownership clarity (every service has a README with architecture diagrams)
  • Handoff protocols (Slack channels for shift transitions)

2. Overlap hours

We found we need at least 2-3 hours of overlap between shifts for handoffs and critical collaboration. This means:

  • SF team starts a bit early (7AM) to overlap with São Paulo afternoon
  • Brazil team works slightly later to catch SF morning

It’s not ideal from a work-life balance perspective, but it’s better than being on-call 24/7.

3. Cultural differences

Different countries have different expectations around work-life balance and on-call responsibilities. We had to navigate:

  • Brazil has stronger labor laws around off-hours work
  • Some cultures are less comfortable with the escalation pressure of being primary on-call
  • Time zone math is surprisingly hard (daylight saving time complicates everything)

Weekend coverage is still hard

Follow-the-sun helps with weekdays, but weekends are still a challenge. We rotate weekend on-call across the global team, which means everyone has some sleep-hour exposure, just much less frequently.

Not every company can do this

I realize this only works if you:

  • Have engineering teams distributed globally
  • Can build trust and knowledge sharing across sites
  • Have services that aren’t overly localized to one region

For companies with a single-location engineering team, this isn’t an option.

But maybe it should be considered in hiring

If on-call burnout is an industry-wide crisis, maybe strategic geographic distribution of engineering teams is part of the solution?

Hiring in different time zones isn’t just about cost or accessing talent—it’s also about sustainable on-call practices.

The suggestion

If you’re a growing company and facing on-call burnout, consider:

  • Hiring remote engineers in complementary time zones
  • Building engineering hubs in different regions
  • Using distributed teams as a feature, not just a cost optimization

Not saying this solves everything—@vp_eng_keisha’s points about alert reduction, recovery time, and tiered on-call are all still critical. But geographic distribution can eliminate the single biggest source of stress: sleep interruptions.

Question for the group

Has anyone else experimented with follow-the-sun rotations? What worked? What didn’t?