How Do You Measure Success When the Best Incident Response Is the One That Didn’t Happen?
Our site went down at 2 AM last month. The on-call engineer got paged, identified the issue (database connection pool exhausted), implemented a fix, and had us back up in 23 minutes.
I was proud. We’d invested heavily in observability, runbooks, automated rollbacks. That 23-minute recovery was the payoff.
Next morning, our CEO asked: “Why did it go down in the first place?”
Not “great job on the recovery.” Not “23 minutes is impressive.” Just: why did this happen?
The Paradox of Prevention
Here’s what keeps me up at night: Our best reliability and security work prevents incidents. But budget discussions focus on “what fires did you put out?”
Last quarter, my team spent two sprints hardening our authentication system. Zero customer-facing features. Zero visible output on the roadmap. But we prevented what could have been a credential stuffing attack that would have compromised 50,000 user accounts.
How do you put that on a quarterly review? “Here’s a breach that didn’t happen”?
Current Metrics Fall Short
MTTR (Mean Time to Recovery) only measures failures that occurred. It says nothing about the incidents we prevented through better monitoring, chaos engineering, and resilience patterns.
Published studies report that teams using AI-powered incident management reduce MTTR by 17.8% on average. But that’s still measuring after the failure.
What about the architecture review that prevented a single point of failure? The load testing that caught a bottleneck before Black Friday? The security audit that found a vulnerability before attackers did?
Real Example: Invisible Success
Three months ago, we noticed elevated error rates in our API gateway—still under SLA, not triggering alerts. Investigation showed a slow memory leak that would have caused a complete outage in 10 days.
We patched it. No customer impact. No incident. No downtime.
Where does that show up in our metrics? Nowhere. It looks like my team spent two days on “investigation” with zero output.
But consider the counterfactual: an 8-hour outage during business hours, $200K in lost revenue, customer churn, reputational damage.
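In hindsight, a trend check like the sketch below could have flagged the leak without anyone squinting at error rates: fit a line through recent memory readings and extrapolate to the limit. This is a minimal illustration, not our production tooling; the sample data, the 4 GB limit, the sampling interval, and the 14-day threshold are all hypothetical.

```python
def days_until_exhaustion(samples, limit_bytes, interval_hours=6.0):
    """Fit a least-squares line through memory readings and extrapolate.

    samples: memory usage in bytes, oldest first, one reading every
    interval_hours. Returns days until the fitted line crosses
    limit_bytes, or None if usage is flat or shrinking.
    """
    n = len(samples)
    mean_x = (n - 1) / 2.0
    mean_y = sum(samples) / n
    # Slope of the least-squares fit: bytes per sampling interval.
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples))
    slope /= sum((x - mean_x) ** 2 for x in range(n))
    if slope <= 0:
        return None  # no upward trend, no leak
    intervals_left = (limit_bytes - samples[-1]) / slope
    return intervals_left * interval_hours / 24.0

# Hypothetical week of 6-hour samples: ~160 MB/day growth, 4 GB limit.
readings = [2.0e9 + i * 4.0e7 for i in range(28)]
days = days_until_exhaustion(readings, limit_bytes=4.0e9)
if days is not None and days < 14:
    print(f"projected memory exhaustion in {days:.1f} days -- investigate")
```

The math is trivial; the value is the artifact it leaves behind. “Projected memory exhaustion in 5.8 days” is evidence of work in a way that “no action needed” never is.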
The Remote Work Context
Distributed teams make this harder. When everyone was co-located, executives saw engineers responding to incidents. Now they see Jira tickets that say “investigated potential issue—no action needed.”
Studies on remote engineering teams emphasize output-based measurement. But how do you measure output when the output is “nothing bad happened”?
What I’m Trying
We started tracking:
- Near-misses: Issues caught before they became incidents (23 last quarter)
- Proactive fixes: Vulnerabilities patched before exploitation (17)
- Chaos engineering results: Failure modes tested and hardened (12 scenarios)
I calculate the estimated cost of each prevented incident and sum them: $2.3M in avoided losses.
But it feels like made-up math. How do I know that memory leak would have cost $200K? Maybe it would have self-healed. Maybe the impact would have been only $50K. I’m guessing.
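One thing I’m experimenting with to make the guessing explicit instead of hidden: attach a probability and a low/high impact range to each near-miss, then report the spread rather than a single total. A rough sketch follows; every probability and dollar figure in it is an illustrative assumption, not data.

```python
from dataclasses import dataclass

@dataclass
class NearMiss:
    name: str
    p_incident: float   # estimated probability it would have become an incident
    impact_low: float   # conservative loss estimate, USD
    impact_high: float  # worst-plausible loss estimate, USD

    def expected_range(self):
        """Probability-weighted low and high loss, keeping uncertainty visible."""
        return (self.p_incident * self.impact_low,
                self.p_incident * self.impact_high)

# Hypothetical entries: every number here is a guess, but a written-down
# guess that can be challenged line by line.
near_misses = [
    NearMiss("API gateway memory leak", p_incident=0.7,
             impact_low=50_000, impact_high=200_000),
    NearMiss("Auth hardening vs. credential stuffing", p_incident=0.2,
             impact_low=100_000, impact_high=1_500_000),
]

low = sum(nm.expected_range()[0] for nm in near_misses)
high = sum(nm.expected_range()[1] for nm in near_misses)
print(f"estimated avoided losses: ${low:,.0f} to ${high:,.0f}")
```

It’s still guessing, but the guesses are itemized: finance can argue with a 0.7 probability on one line instead of with a $2.3M total.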
The Question for This Community
How do you prove the value of preventative engineering in remote teams where “nothing happened” is the success?
What frameworks are you using for reliability and security work measurement?
Is there a better way than “here’s what probably would have gone wrong if we hadn’t been vigilant”?
Because right now, I’m defending our incident response budget by pointing to incidents that didn’t happen, and finance is skeptical.
Luis Rodriguez
Director of Engineering, Financial Services
Building resilient systems and diverse teams