I need to have an honest conversation about error budgets, because there’s a massive gap between how they work in Google’s SRE book and how they work in every other company I’ve seen.
The Theory Is Elegant
For those unfamiliar, error budgets are the operational counterpart to Service Level Objectives (SLOs). If your service’s SLO is 99.9% availability, you have a 0.1% “error budget”, which works out to roughly 43 minutes of downtime in a 30-day month. The idea, popularized by Google’s SRE book, is beautifully simple:
- When the error budget is healthy, product teams ship features freely
- When the error budget is burning fast, teams slow down and invest in reliability
- When the error budget is spent, feature deployments stop and all engineering effort goes to reliability
The self-balancing mechanism is what makes it theoretically compelling. Product teams don’t have to be convinced that reliability matters — the error budget creates a natural feedback loop. SRE teams don’t have to be the “no” police — the budget speaks for itself. Everyone has shared accountability with clear, objective thresholds.
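For concreteness, here is a minimal sketch of the arithmetic and the three-state policy described above. The function names and the `fast_burn` threshold of 2.0 are illustrative assumptions, not values from the SRE book or from any particular tool.

```python
from datetime import timedelta


def error_budget(slo: float, window: timedelta = timedelta(days=30)) -> timedelta:
    """Allowed downtime in `window` for an availability SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction strictly between 0 and 1")
    return window * (1.0 - slo)


def policy(remaining: float, burn_rate: float, fast_burn: float = 2.0) -> str:
    """Map budget state to the three behaviours in the list above.

    remaining: fraction of the window's budget still unspent (0.0 to 1.0).
    burn_rate: consumption speed relative to the pace that would spend the
               budget exactly at the end of the window (1.0 = on pace).
    fast_burn: hypothetical threshold for "burning fast"; tune per service.
    """
    if remaining <= 0.0:
        return "freeze"
    if burn_rate >= fast_burn:
        return "slow down"
    return "ship"


if __name__ == "__main__":
    print(error_budget(0.999))              # downtime allowed per 30 days at 99.9%
    print(policy(remaining=0.12, burn_rate=0.8))
```

At 99.9% over 30 days the budget comes out to about 43 minutes, which matches the figure quoted earlier.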
The Reality: 18 Months, 4 Budget Burns, Zero Freezes
We implemented error budgets at my company 18 months ago. We did it properly — defined SLOs collaboratively with product and engineering, built dashboards, automated budget calculations, set up alerting for burn-rate thresholds. The technical implementation was solid.
In those 18 months, our primary user-facing services have burned through their error budgets four times. The number of times we actually froze deployments? Zero.
Here’s what happened each time:
Burn #1 (March 2025): Database migration caused 3 hours of degraded performance. Error budget spent. I proposed a two-week feature freeze. VP of Product said: “We have a quarterly revenue commitment. We can’t stop shipping.” Freeze overridden.
Burn #2 (June 2025): Cascading failure during peak traffic. Budget burned in 2 hours. I escalated to the CTO. CTO said: “Let’s do a targeted reliability sprint instead of a full freeze.” Reliability sprint lasted 3 days before being deprioritized for a competitive feature launch.
Burn #3 (September 2025): Third-party payment provider outage drained our budget (counted against our SLO even though the root cause was external). The team argued — correctly — that a deploy freeze wouldn’t prevent external outages. Budget reset without consequence.
Burn #4 (January 2026): Memory leak in a new service caused gradual degradation over 5 days. By the time we caught it, 80% of the monthly budget was gone. This time I got a one-week freeze approved. It lasted 3 days before the CEO asked about a feature the board was expecting at the next meeting.
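Burn #4 is a good case for looking at burn rate rather than remaining budget. A rough sketch of the arithmetic, assuming a 30-day window and that the leak started with the budget untouched (both are simplifying assumptions, and the function names are illustrative):

```python
def burn_rate(budget_consumed: float, elapsed_days: float, window_days: float = 30.0) -> float:
    """Budget consumption speed relative to an even spend across the window.

    A value of 1.0 means the budget would run out exactly at the end of the
    window; higher values mean it runs out earlier.
    """
    if elapsed_days <= 0 or window_days <= 0:
        raise ValueError("elapsed_days and window_days must be positive")
    return budget_consumed / (elapsed_days / window_days)


def days_to_exhaustion(rate: float, window_days: float = 30.0) -> float:
    """Days from the start of the burn until the whole budget is gone at this rate."""
    if rate <= 0:
        raise ValueError("rate must be positive")
    return window_days / rate


if __name__ == "__main__":
    rate = burn_rate(budget_consumed=0.80, elapsed_days=5)
    print(f"burn rate: {rate:.1f}x, budget gone after {days_to_exhaustion(rate):.2f} days")
```

Under those assumptions, 80% of the budget in 5 days is a burn rate of about 4.8, meaning the full budget would be gone a little over 6 days after the leak started. A sustained multiple like that over a multi-day window is the kind of signal a slow-burn alert is meant to catch before most of the budget is spent.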
Why Freezes Never Stick
The pattern is consistent, and it’s not about bad faith. The people overriding the freezes aren’t villains — they’re responding to real business pressures:
- Revenue commitments are contractual. When the VP of Sales says “we promised this feature by Q2,” that’s not a preference — it’s a commitment with financial consequences.
- Competitive pressure is existential. When your main competitor ships a feature your prospects are asking about, “we’re in a reliability freeze” doesn’t satisfy the board.
- Error budgets are abstract, revenue is concrete. Telling leadership “we have 12% error budget remaining” doesn’t create urgency the way “we’re $500K behind on Q2 pipeline” does.
- The SRE team can’t enforce a freeze alone. We can declare the budget burned and recommend a freeze, but we don’t have organizational authority to stop deployments. That requires executive backing that evaporates under business pressure.
What I Tried
Attempt 1: Automatic deploy freezes. We configured our CI/CD pipeline to block production deployments when the error budget was exhausted. This lasted exactly one day before engineering leadership demanded an override mechanism. Within a week, every deploy had an approved override. The automation became theater.
Attempt 2: Manual freeze decisions. We created a formal process: when budget burns, a cross-functional group (engineering, product, SRE) decides whether to freeze. In practice, product always outvoted SRE because product owns the business metrics that leadership tracks.
Attempt 3: “Reliability sprints.” Instead of freezing, we’d dedicate 30% of engineering capacity to reliability work for two weeks. These sprints were consistently raided for “urgent” feature work and never achieved their reliability goals.
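For anyone who wants to picture the gate from Attempt 1, here is a simplified sketch. It is not our actual pipeline code; the class name, fields, and approver strings are all hypothetical. The one design choice worth noting is that every decision is recorded, so the share of overridden deploys can be reported instead of silently accumulating.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class DeployGate:
    """Blocks deploys when the error budget is exhausted, unless overridden."""

    budget_remaining: float  # fraction of the window's error budget left
    decisions: List[Tuple[str, str, Optional[str]]] = field(default_factory=list)

    def check(self, deploy_id: str, override_by: Optional[str] = None) -> bool:
        """Return True if the deploy may proceed, and log the decision."""
        if self.budget_remaining > 0:
            outcome = "allowed"
        elif override_by:
            outcome = "overridden"
        else:
            outcome = "blocked"
        self.decisions.append((deploy_id, outcome, override_by))
        return outcome != "blocked"

    def override_rate(self) -> float:
        """Share of gated deploys (budget exhausted) that shipped via override."""
        gated = [d for d in self.decisions if d[1] != "allowed"]
        if not gated:
            return 0.0
        return sum(1 for d in gated if d[1] == "overridden") / len(gated)


if __name__ == "__main__":
    gate = DeployGate(budget_remaining=0.0)
    gate.check("deploy-101")                          # blocked
    gate.check("deploy-102", override_by="eng-lead")  # overridden
    print(f"override rate: {gate.override_rate():.0%}")
```

An override rate close to 100% is a measurable version of "the automation became theater", and it is a number that can sit next to the budget itself in a review.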
What Partially Worked
The one thing that moved the needle: making error budget status visible in weekly business reviews. When the CTO presents to the executive team and the dashboard shows “Error Budget: 12% remaining / SLA breach risk: HIGH” right next to “Revenue: on target / Churn: low,” it creates conversations that don’t happen in engineering-only meetings.
Leadership started asking: “What happens if we breach the SLA?” When the answer is “contractual penalties, customer escalations, potential churn,” error budgets suddenly feel less abstract. The budget hasn’t become an enforcement mechanism, but it’s become a planning input — leadership now factors reliability risk into prioritization decisions.
My Honest Assessment
Error budgets are useful as a measurement tool: they quantify how much reliability investment is needed and make the trade-off between velocity and reliability visible. But they don’t work as an enforcement mechanism in most organizations, because enforcement requires a level of organizational discipline that conflicts with how businesses actually operate under pressure.
The Google model works at Google because Google has the market position, revenue base, and engineering culture to absorb a feature freeze. Most companies don’t.
The Question
Has anyone here successfully implemented error-budget-driven deploy freezes that actually stick? How did you get organizational buy-in that survived contact with quarterly targets and competitive pressure? I’m genuinely looking for models that work outside of FAANG-scale companies.