How Do You Measure Success When the Best Incident Response Is the One That Didn’t Happen?
Our site went down at 2 AM last month. The on-call engineer got paged, identified the issue (database connection pool exhausted), implemented a fix, and had us back up in 23 minutes.
I was proud. We’d invested heavily in observability, runbooks, automated rollbacks. That 23-minute recovery was the payoff.
Next morning, our CEO asked: “Why did it go down in the first place?”
Not “great job on the recovery.” Not “23 minutes is impressive.” Just: why did this happen?
The Paradox of Prevention
Here’s what keeps me up at night: Our best reliability and security work prevents incidents. But budget discussions focus on “what fires did you put out?”
Last quarter, my team spent two sprints hardening our authentication system. Zero customer-facing features. Zero visible output on the roadmap. But we prevented what could have been a credential stuffing attack that would have compromised 50,000 user accounts.
How do you put that on a quarterly review? “Here’s a breach that didn’t happen”?
Current Metrics Fall Short
MTTR (Mean Time to Recovery) only measures failures that occurred. It says nothing about the incidents we prevented through better monitoring, chaos engineering, and resilience patterns.
Published studies report that teams using AI-powered incident management reduce MTTR by 17.8% on average. But that’s still measuring after the failure.
What about the architecture review that prevented a single point of failure? The load testing that caught a bottleneck before Black Friday? The security audit that found a vulnerability before attackers did?
Real Example: Invisible Success
Three months ago, we noticed elevated error rates in our API gateway—still under SLA, not triggering alerts. Investigation showed a slow memory leak that would have caused a complete outage in 10 days.
We patched it. No customer impact. No incident. No downtime.
Where does that show up in our metrics? Nowhere. It looks like my team spent two days on “investigation” with zero output.
But consider the counterfactual: an 8-hour outage during business hours, $200K in lost revenue, customer churn, reputational damage.
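In hindsight, a trend check like the sketch below could have flagged the leak without anyone squinting at error rates: fit a line through recent memory readings and extrapolate to the limit. This is a minimal illustration, not our production tooling; the sample data, the 4 GB limit, the sampling interval, and the 14-day threshold are all hypothetical.

```python
def days_until_exhaustion(samples, limit_bytes, interval_hours=6.0):
    """Fit a least-squares line through memory readings and extrapolate.

    samples: memory usage in bytes, oldest first, one reading every
    interval_hours. Returns days until the fitted line crosses
    limit_bytes, or None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples) / n
    # Slope of the least-squares fit: bytes per sampling interval.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope /= sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None  # no upward trend, no leak
    intervals_left = (limit_bytes - samples[-1]) / slope
    return intervals_left * interval_hours / 24.0

# Hypothetical week of 6-hour samples: ~160 MB/day growth, 4 GB limit.
readings = [2.0e9 + i * 4.0e7 for i in range(28)]
days = days_until_exhaustion(readings, limit_bytes=4.0e9)
if days is not None and days < 14:
    print(f"projected memory exhaustion in {days:.1f} days -- investigate")
```

The math is trivial; the value is the artifact it leaves behind. “Projected memory exhaustion in 5.8 days” is evidence of work in a way that “no action needed” never is.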
The Remote Work Context
Distributed teams make this harder. When everyone was co-located, executives saw engineers responding to incidents. Now they see Jira tickets that say “investigated potential issue—no action needed.”
Studies on remote engineering teams emphasize output-based measurement. But how do you measure output when the output is “nothing bad happened”?
What I’m Trying
We started tracking:
- Near-misses: Issues caught before they became incidents (23 last quarter)
- Proactive fixes: Vulnerabilities patched before exploitation (17)
- Chaos engineering results: Failure modes tested and hardened (12 scenarios)
I calculate the estimated cost of each prevented incident and sum them: $2.3M in avoided losses.
But it feels like made-up math. How do I know that memory leak would have cost $200K? Maybe it would have self-healed. Maybe the impact would have been only $50K. I’m guessing.
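One thing I’m experimenting with to make the guessing explicit instead of hidden: attach a probability and a low/high impact range to each near-miss, then report the spread rather than a single total. A rough sketch follows; every probability and dollar figure in it is an illustrative assumption, not data.

```python
from dataclasses import dataclass

@dataclass
class NearMiss:
    name: str
    p_incident: float   # estimated probability it would have become an incident
    impact_low: float   # conservative loss estimate, USD
    impact_high: float  # worst-plausible loss estimate, USD

    def expected_range(self):
        """Probability-weighted low and high loss, keeping uncertainty visible."""
        return (self.p_incident * self.impact_low,
                self.p_incident * self.impact_high)

# Hypothetical entries: every number here is a guess, but a written-down
# guess that can be challenged line by line.
near_misses = [
    NearMiss("API gateway memory leak", p_incident=0.7,
             impact_low=50_000, impact_high=200_000),
    NearMiss("Auth hardening vs. credential stuffing", p_incident=0.2,
             impact_low=100_000, impact_high=1_500_000),
]

low = sum(nm.expected_range()[0] for nm in near_misses)
high = sum(nm.expected_range()[1] for nm in near_misses)
print(f"estimated avoided losses: ${low:,.0f} to ${high:,.0f}")
```

It’s still guessing, but the guesses are itemized: finance can argue with a 0.7 probability on one line instead of with a $2.3M total.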
The Question for This Community
How do you prove the value of preventative engineering in remote teams where “nothing happened” is the success?
What frameworks are you using for reliability and security work measurement?
Is there a better way than “here’s what probably would have gone wrong if we hadn’t been vigilant”?
Because right now, I’m defending our incident response budget by pointing to incidents that didn’t happen, and finance is skeptical.
Luis Rodriguez
Director of Engineering, Financial Services
Building resilient systems and diverse teams