Last quarter, I presented our cloud migration strategy to the board. The architecture was elegant: optimized for cost efficiency, with minimal redundancy and a single vendor for simplicity. The CFO loved the projected savings. Then semiconductor shortages tied to Taiwan hit our primary cloud provider, that vendor suffered a multi-day outage, and suddenly our “efficient” design became a liability we’re still recovering from.
Here’s the paradox I’m wrestling with: The same executives who approved aggressive cost-cutting measures are now asking pointed questions about “operational resilience” after watching supply chain disruptions cascade through our tech stack. They want resilience, but they’re rewarding efficiency.
What Resilience Actually Requires
Building truly resilient systems demands exactly what finance teams call “waste”:
Slack capacity — Engineering teams can’t be at 100% utilization if we expect them to handle unexpected failures. When the Taiwan shortage hit, we needed engineers who had bandwidth to investigate alternatives and implement workarounds. Teams running at 95% utilization simply couldn’t pivot fast enough.
Redundant systems and vendors — Multi-cloud architecture costs more upfront. Maintaining relationships with backup suppliers requires ongoing investment. But when your primary vendor goes down, redundancy is the only thing standing between you and a complete service outage.
Optionality in architecture — Every decision that reduces vendor lock-in or enables graceful degradation carries additional complexity cost. We’re now retrofitting optionality we should have built from the start, at 3x the original price.
Buffer inventory and capacity — Maintaining spare hardware, excess API quotas, and unused compute capacity looks inefficient—until geopolitical instability disrupts your supply chain.
The 2026 Reality We’re Operating In
These aren’t theoretical concerns anymore:
- US tariffs are at 17% — the highest in nearly a century — fundamentally changing hardware procurement economics and timeline assumptions
- Taiwan produces 85% of advanced AI chips — creating unprecedented concentration risk in a geopolitically unstable region
- State-sponsored groups like Volt Typhoon are targeting critical infrastructure with the explicit goal of pre-positioning for future disruption
- Climate events are disrupting shipping routes and manufacturing facilities with increasing frequency
A project plan that worked six months ago may fail tomorrow. But planning for disruption requires resources that finance teams categorize as inefficiency.
The Conversation I Can’t Win
Last week, my CFO sent over a utilization analysis showing our infrastructure team at 78% utilization. She framed this as an optimization opportunity: “If we right-size the team to 90% utilization, we could reduce headcount by two FTEs.”
I tried explaining that the 22% “slack” is precisely what enables us to respond to incidents, investigate new technologies, and maintain our systems. She asked me to quantify the ROI of slack time.
How do you quantify the value of an outage that didn’t happen because you had redundant systems? How do you measure the cost avoided when engineers had time to investigate a vendor stability issue before it became critical? These benefits are invisible until they’re absent—and by then, it’s too late.
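For what it’s worth, here is the back-of-the-envelope math behind that proposal, as a rough sketch. It assumes (hypothetically) that the team’s workload stays fixed and is fully interchangeable between engineers, so current utilization × headcount equals target utilization × (headcount − cut). Neither assumption holds in practice, and `implied_headcount` is just an illustrative helper, not anything from the actual analysis.

```python
# Back-of-the-envelope check of the "right-size to 90%" proposal.
# Hypothetical assumptions: workload is constant and evenly spread, so
#   current_util * n == target_util * (n - cut)


def implied_headcount(current_util: float, target_util: float, cut: int) -> float:
    """Solve current_util * n = target_util * (n - cut) for team size n."""
    if target_util <= current_util:
        raise ValueError("target utilization must exceed current utilization")
    return target_util * cut / (target_util - current_util)


cut = 2
n = implied_headcount(0.78, 0.90, cut)   # team size the numbers imply
work_fte = 0.78 * n                      # FTEs of planned work
slack_now = n - work_fte                 # spare FTEs today
slack_after = (n - cut) - work_fte       # spare FTEs after the cut

print(f"implied team size: {n:.1f}")
print(f"planned work: {work_fte:.1f} FTEs")
print(f"slack today: {slack_now:.1f} FTEs, after the cut: {slack_after:.1f} FTEs")
```

Under those assumptions, the numbers imply a 15-person team doing about 11.7 FTEs of planned work. Cutting two people trims headcount by roughly 13% but shrinks the slack from about 3.3 FTEs to about 1.3, a cut of roughly 60% to the capacity that absorbs incidents.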
The Question I’m Asking This Community
How do you justify resilience investments to executives who measure success by utilization rates, cost reduction, and quarterly efficiency gains?
I’ve tried framing it as insurance, as technical debt prevention, as competitive differentiation. Sometimes these arguments work, sometimes they don’t. The deeper issue is that resilience and efficiency operate on different time horizons and value systems.
What frameworks, metrics, or storytelling approaches have actually worked for you? I’m especially interested in perspectives from other industries. Financial services seems to have regulatory cover for resilience investments, but what about those of us in competitive markets where “lean” is celebrated?
The uncomfortable truth: I think we’re going to see major failures in 2026 from organizations that optimized for efficiency when they should have been building for resilience. I’d rather learn from your experiences than become the cautionary tale.