I’ve been in back-to-back architecture reviews for the past month, and I keep hearing the same word: resilience. Every team is building for resilience. Every roadmap prioritizes resilience. Every vendor pitch promises resilience.
But here’s the problem—nobody seems to agree on what resilience actually means.
The Problem: Resilience Means Everything (and Nothing)
At my company (financial services, 40+ engineers), I’ve seen three teams define “resilience” in completely different ways:
- Infrastructure team: Resilience = chaos engineering. They want to randomly kill pods in production to test recovery.
- Backend team: Resilience = redundancy. They want multi-region failover and backup databases.
- SRE team: Resilience = observability. They want distributed tracing and real-time alerting so we can respond faster.
All three are technically correct. But when I ask “What does building for resilience look like?” I get wildly different answers—and wildly different budget requests.
The Industry Isn’t Much Clearer
I’ve been reading up on this. Waydev says resilience is the new engineering principle for 2026, and engineering trend roundups call it foundational. But when I dig into implementation guidance, I find:
- BMC defines resilience engineering as “the ability to adjust functioning during changes” (broad but vague)
- InformationWeek’s five pillars include monitoring, backup plans, and eliminating single points of failure (good practices, but is that resilience specifically?)
- Resilium Labs says resilience is about learning from surprises and human factors (organizational, not just technical)
So… is resilience the same as reliability? Is it fault tolerance by another name? Is it about systems, or teams, or both?
The Questions I’m Wrestling With
I’m trying to define a resilience strategy for our org, and I keep hitting these questions:
- Is resilience just a rebrand of reliability? Or is there something genuinely new here?
- How do you measure it? MTTF (mean time to failure) and MTTR (mean time to recovery) are reliability metrics. Do we need different metrics for resilience?
- Is chaos engineering resilience, or just expensive theater? We don’t have Netflix-scale systems—do we really need to randomly break things?
- Does organizational resilience matter more than technical resilience? If my team burns out from on-call, no amount of redundancy helps.
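To make the measurement question concrete, here is a minimal sketch of how I'd compute the reliability metrics above from an incident log. The `Incident` record, the `reliability_metrics` function, and the sample dates are all made up for illustration, and the sketch assumes incidents don't overlap and fall entirely inside the observation window. What I can't tell is whether numbers like these say anything about resilience specifically.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Incident:
    """One outage (hypothetical record shape): when it began and when service was restored."""
    started: datetime
    resolved: datetime


def reliability_metrics(incidents, window_start, window_end):
    """Return MTTR, MTTF and availability over an observation window.

    Assumes incidents do not overlap and lie fully inside the window.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR/MTTF")
    window = window_end - window_start
    # Total downtime is the sum of each incident's duration.
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    uptime = window - downtime
    count = len(incidents)
    return {
        "mttr": downtime / count,         # mean time to recovery
        "mttf": uptime / count,           # mean operating time per failure
        "availability": uptime / window,  # fraction of the window spent up
    }


# Illustrative data: a 30-day window with a 1-hour and a 3-hour outage.
sample_incidents = [
    Incident(datetime(2025, 1, 5, 10, 0), datetime(2025, 1, 5, 11, 0)),
    Incident(datetime(2025, 1, 20, 2, 0), datetime(2025, 1, 20, 5, 0)),
]
metrics = reliability_metrics(
    sample_incidents, datetime(2025, 1, 1), datetime(2025, 1, 31)
)
print(metrics["mttr"], metrics["mttf"], round(metrics["availability"], 4))
```

With this sample data, MTTR comes out to 2 hours and MTTF to 358 hours. Both describe how often things break and how fast they come back, not how well the system or the team copes with a failure nobody anticipated.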
What I’m Looking For
I’m hoping this community can help me move beyond buzzwords. If you’re “building for resilience” at your company:
- What does that actually look like? (Specific practices, not principles)
- How do you define success? (What metrics or outcomes tell you resilience is working?)
- How did you sell it to leadership? (Especially if it competes with feature work)
I suspect the answer is “it depends”—but I’m hoping we can identify some common patterns or frameworks that cut through the hype.
Looking forward to hearing how other teams are approaching this.