I have a confession to make. For three years, I was the person who created a Zoom bridge for every P1 incident and demanded all hands on deck. I believed the war room was sacred — a space where the urgency of the moment would drive faster resolution. I was wrong about almost all of it.
Six months ago, I started tracking actual time-to-resolution data and correlating it with our incident response format. The results surprised me: incidents with war rooms didn’t resolve faster than incidents handled asynchronously in Slack. In many cases, the war room made things slower because key engineers were spending time explaining context to people in the room instead of debugging.
Breaking Down Incident Categories
Not all incidents are created equal. When I segmented the data, a clear pattern emerged:
1. Single-service outages (70% of our incidents): One service is down or degraded. The engineer who owns that service can diagnose and fix it. In a war room, that engineer is sharing their screen while 8-12 people watch them tail logs and read dashboards. The audience adds nothing — they’re spectating. The engineer would be faster working alone with their full monitor real estate and no pressure to narrate their debugging process.
2. Cross-service cascading failures (20% of incidents): Service A’s failure caused Service B to back up, which caused Service C to time out. This requires coordination between 2-3 teams. But here’s the thing — a structured Slack thread is more efficient than a call for this type of coordination. Engineers can share logs, graphs, and config snippets asynchronously. They can read back through the thread to get context instead of asking “can you repeat what you said 5 minutes ago?” The written record is inherently better for complex, multi-party debugging.
3. True platform-wide emergencies (10% of incidents): Total outage. Multiple systems down simultaneously. Customer data at risk. This is where war rooms genuinely help because real-time coordination is essential — you need to make rapid decisions about failover, communication, and resource allocation. These are rare, and they deserve the war room treatment.
Our Async Incident Response Framework
After analyzing the data, we redesigned our incident response process:
Dedicated incident Slack channel per P1. When a P1 is declared, a bot creates #incident-YYYY-MM-DD-short-description and posts the initial alert details. All communication happens in this channel.
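As a minimal sketch of what the bot does when naming the channel (the function name and slug rules are my illustration, not our actual bot's code — a real bot would follow this with a call to Slack's conversations.create API, which enforces similar constraints):

```python
import re
from datetime import date

def incident_channel_name(short_description: str, declared_on: date) -> str:
    """Build a channel name like 'incident-2024-03-14-checkout-errors'.

    Slack channel names must be lowercase, at most 80 characters, and
    contain only letters, numbers, hyphens, and underscores, so the
    free-text description is slugified before use.
    """
    slug = re.sub(r"[^a-z0-9]+", "-", short_description.lower()).strip("-")
    name = f"incident-{declared_on.isoformat()}-{slug}"
    return name[:80].rstrip("-")
```

For example, `incident_channel_name("Checkout 500 Errors!", date(2024, 3, 14))` yields `incident-2024-03-14-checkout-500-errors`.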
Structured update format every 15 minutes. The incident commander posts an update using a template:
- Status: What’s currently happening
- Impact: Who and what is affected
- Next Steps: What’s being tried or investigated
- ETA: Best estimate for resolution (or “unknown”)
This format means anyone can get up to speed by reading the latest update without interrupting the people doing the actual work.
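The template is simple enough that the bot can enforce it rather than trusting people to remember it. A sketch of what that looks like (the `IncidentUpdate` class is hypothetical; the `*bold*` markers assume Slack's mrkdwn formatting):

```python
from dataclasses import dataclass

@dataclass
class IncidentUpdate:
    """One 15-minute IC update in the four-field template."""
    status: str       # What's currently happening
    impact: str       # Who and what is affected
    next_steps: str   # What's being tried or investigated
    eta: str = "unknown"  # Best estimate for resolution

    def render(self) -> str:
        # Render the template as a single Slack-ready message.
        return (
            f"*Status:* {self.status}\n"
            f"*Impact:* {self.impact}\n"
            f"*Next Steps:* {self.next_steps}\n"
            f"*ETA:* {self.eta}"
        )
```

Defaulting `eta` to "unknown" matters: it keeps ICs from skipping the field when they don't have an estimate, which is itself useful information.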
Rotating incident commander role. One person per on-call rotation is designated IC. Their job is coordination, not debugging. They ensure updates are posted, escalations happen, and the right people are engaged. This separation is crucial — the person coordinating should not be the person debugging.
Clear escalation path from async to synchronous. If the incident commander determines that real-time coordination is needed (cascading failure across 3+ services, customer data at risk, or the async thread is moving too slowly), they can escalate to a war room. But it’s opt-in, not the default.
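The escalation criteria are concrete enough to express as a checklist the IC (or the bot) can run. A sketch, assuming a 30-minute definition of "moving too slowly" — that threshold is my illustrative default, not a number from our runbook:

```python
def should_escalate_to_war_room(
    services_affected: int,
    customer_data_at_risk: bool,
    minutes_since_progress: int,
    stall_threshold_minutes: int = 30,
) -> bool:
    """Return True when the IC's escalation criteria are met.

    Mirrors the opt-in triggers above: cascading failure across 3+
    services, customer data at risk, or an async thread that has
    stalled with no progress for too long.
    """
    if customer_data_at_risk:
        return True
    if services_affected >= 3:
        return True
    return minutes_since_progress >= stall_threshold_minutes
```

Note the default answer is "no war room": a single-service incident with recent progress never trips any branch, which is exactly the opt-in posture described above.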
The “Too Many Cooks” Problem
I’ve seen war rooms with 15 people where the actual debugging was done by 2 engineers sharing their screen while 13 others watched, occasionally offering suggestions that were already being tried. The cost of this is real:
- 13 engineers x 2 hours = 26 engineer-hours of lost productivity per incident
- The debugging engineers are slower because they’re narrating and fielding questions
- People in the war room feel obligated to stay “in case they’re needed,” even when it’s clear they won’t be
- Context switching: those 13 engineers were working on other things before the war room pulled them in, and it takes 20-30 minutes to regain focus after the incident
Most incidents are resolved by 1-2 people who know the system. Adding more people to a war room doesn’t help — it creates coordination overhead and interrupts people who should be working on their own priorities.
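The spectator-cost arithmetic generalizes to a one-line formula: hours in the room plus the refocus penalty afterward. A sketch (the function name is mine; 25 minutes is the midpoint of the 20-30 minute context-switch range):

```python
def spectator_cost_hours(spectators: int, incident_hours: float,
                         refocus_minutes: int = 25) -> float:
    """Engineer-hours lost to people watching rather than debugging.

    Counts time spent in the war room plus the post-incident
    context-switch penalty for each spectator.
    """
    in_room = spectators * incident_hours
    refocus = spectators * refocus_minutes / 60
    return in_room + refocus
```

For the 15-person war room above, `spectator_cost_hours(13, 2.0)` comes to about 31.4 engineer-hours — the 26 hours of watching plus another 5+ hours of refocusing, for a single incident.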
The Exceptions
I want to be clear about when war rooms ARE the right call:
- Customer-facing data loss: The decision-making (not just debugging) is complex. Do we communicate now or wait until we know the scope? Legal, PR, and engineering need to be in the same room.
- Security breaches: Rapid containment decisions — do we block this IP range? Rotate these credentials? Take a service offline? These need real-time discussion.
- Regulatory incidents: Compliance implications require immediate cross-functional coordination that async can’t support.
The Cultural Challenge
Here’s the uncomfortable truth: war rooms feel productive. There’s a visible sense of urgency. Leadership can see people working on the problem. When the CEO asks “what are we doing about the outage?” pointing to a war room full of engineers is a satisfying answer.
But feeling productive and being productive are different things. An async Slack thread where two engineers are quietly and efficiently debugging doesn’t look as impressive, but it resolves the incident just as fast with a fraction of the organizational disruption.
Breaking the war room habit requires building trust. Leadership needs to trust that the async process works. Engineers need to trust that they won’t be blamed for not “showing up” to a war room. It took us three months of deliberate practice and transparent metrics before the culture shifted.
What’s Your Model?
I’m curious about how other teams handle incident response. Have you found war rooms actually improve resolution time, or are they mostly theater that makes everyone feel busy? What does your async incident process look like?