Error Budgets Sound Great in Theory — But How Many Teams Actually Freeze Deploys When the Budget Burns? SLO-Driven Development's Accountability Gap

eng_director_luis · February 12, 2026, 8:51am

I need to have an honest conversation about error budgets, because there’s a massive gap between how they work in Google’s SRE book and how they work in every other company I’ve seen.

The Theory Is Elegant

For those unfamiliar, error budgets are the operational counterpart to Service Level Objectives (SLOs). If your service’s SLO is 99.9% availability, you have a 0.1% “error budget” — roughly 43 minutes of downtime per month. The idea, popularized by Google’s SRE book, is beautifully simple:

When the error budget is healthy, product teams ship features freely
When the error budget is burning fast, teams slow down and invest in reliability
When the error budget is spent, feature deployments stop and all engineering effort goes to reliability

The self-balancing mechanism is what makes it theoretically compelling. Product teams don’t have to be convinced that reliability matters — the error budget creates a natural feedback loop. SRE teams don’t have to be the “no” police — the budget speaks for itself. Everyone has shared accountability with clear, objective thresholds.
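
For anyone who wants to check the arithmetic behind the 43-minute figure, here is a minimal sketch. It assumes a 30-day window and treats the SLO as pure availability. The function names are illustrative, not taken from any particular tool.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO over a window."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo) * window_minutes


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    return 1.0 - downtime_minutes / error_budget_minutes(slo, window_days)


# 99.9% over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(0.999), 1))
# 21.6 minutes of downtime leaves about half the budget.
print(round(budget_remaining(0.999, 21.6), 2))
```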

The Reality: 18 Months, 4 Budget Burns, Zero Freezes

We implemented error budgets at my company 18 months ago. We did it properly — defined SLOs collaboratively with product and engineering, built dashboards, automated budget calculations, set up alerting for burn-rate thresholds. The technical implementation was solid.

In those 18 months, our primary user-facing services have burned through their error budgets four times. The number of times we actually froze deployments? Zero.

Here’s what happened each time:

Burn #1 (March 2025): Database migration caused 3 hours of degraded performance. Error budget spent. I proposed a two-week feature freeze. VP of Product said: “We have a quarterly revenue commitment. We can’t stop shipping.” Freeze overridden.

Burn #2 (June 2025): Cascading failure during peak traffic. Budget burned in 2 hours. I escalated to the CTO. CTO said: “Let’s do a targeted reliability sprint instead of a full freeze.” Reliability sprint lasted 3 days before being deprioritized for a competitive feature launch.

Burn #3 (September 2025): Third-party payment provider outage drained our budget (counted against our SLO even though the root cause was external). The team argued — correctly — that a deploy freeze wouldn’t prevent external outages. Budget reset without consequence.

Burn #4 (January 2026): Memory leak in a new service caused gradual degradation over 5 days. By the time we caught it, 80% of the monthly budget was gone. This time I got a one-week freeze approved. It lasted 3 days before the CEO asked about a feature the board was expecting at the next meeting.
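
For a sense of scale on Burn #4, the burn rate can be worked out with a few lines. This is a rough sketch that assumes a 30-day budget window and that the 80% was consumed during the 5 days of degradation. Neither detail is spelled out above, and the function names are illustrative.

```python
def burn_rate(budget_consumed: float, elapsed_days: float, window_days: float = 30.0) -> float:
    """How many times faster than 'exactly on budget' the budget is being spent.

    A rate of 1.0 means the budget would run out exactly at the end of the window.
    """
    if elapsed_days <= 0 or window_days <= 0:
        raise ValueError("elapsed_days and window_days must be positive")
    return budget_consumed / (elapsed_days / window_days)


def days_until_exhausted(remaining: float, rate: float, window_days: float = 30.0) -> float:
    """Days left before the budget hits zero if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return remaining * window_days / rate


rate = burn_rate(budget_consumed=0.8, elapsed_days=5)  # about 4.8x the sustainable pace
print(round(rate, 2))
print(round(days_until_exhausted(remaining=0.2, rate=rate), 2))  # about 1.25 days left
```

Under those assumptions the leak was spending budget at nearly five times the sustainable pace, which would leave a little over a day before exhaustion.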

Why Freezes Never Stick

The pattern is consistent, and it’s not about bad faith. The people overriding the freezes aren’t villains — they’re responding to real business pressures:

Revenue commitments are contractual. When the VP of Sales says “we promised this feature by Q2,” that’s not a preference — it’s a commitment with financial consequences.
Competitive pressure is existential. When your main competitor ships a feature your prospects are asking about, “we’re in a reliability freeze” doesn’t satisfy the board.
Error budgets are abstract, revenue is concrete. Telling leadership “we have 12% error budget remaining” doesn’t create urgency the way “we’re $500K behind on Q2 pipeline” does.
The SRE team can’t enforce alone. We can declare the budget burned and recommend a freeze, but we don’t have organizational authority to stop deployments. That requires executive backing that evaporates under business pressure.

What I Tried

Attempt 1: Automatic deploy freezes. We configured our CI/CD pipeline to block production deployments when the error budget was exhausted. This lasted exactly one day before engineering leadership demanded an override mechanism. Within a week, every deploy had an approved override. The automation became theater.

Attempt 2: Manual freeze decisions. We created a formal process: when budget burns, a cross-functional group (engineering, product, SRE) decides whether to freeze. In practice, product always outvoted SRE because product owns the business metrics that leadership tracks.

Attempt 3: “Reliability sprints.” Instead of freezing, we’d dedicate 30% of engineering capacity to reliability work for two weeks. These sprints were consistently raided for “urgent” feature work and never achieved their reliability goals.

What Partially Worked

The one thing that moved the needle: making error budget status visible in weekly business reviews. When the CTO presents to the executive team and the dashboard shows “Error Budget: 12% remaining / SLA breach risk: HIGH” right next to “Revenue: on target / Churn: low,” it creates conversations that don’t happen in engineering-only meetings.

Leadership started asking: “What happens if we breach the SLA?” When the answer is “contractual penalties, customer escalations, potential churn,” error budgets suddenly feel less abstract. The budget hasn’t become an enforcement mechanism, but it’s become a planning input — leadership now factors reliability risk into prioritization decisions.

My Honest Assessment

Error budgets are useful as a measurement tool: they quantify the reliability investment needed and make the trade-off between velocity and reliability visible. But they don’t work as an enforcement mechanism in most organizations because they require a level of organizational discipline that conflicts with how businesses actually operate under pressure.

The Google model works at Google because Google has the market position, revenue base, and engineering culture to absorb a feature freeze. Most companies don’t.

The Question

Has anyone here successfully implemented error budget-driven deploy freezes that actually stick? How did you get organizational buy-in that survived contact with quarterly targets and competitive pressure? I’m genuinely looking for models that work outside of FAANG-scale companies.

alex_infrastructure · February 12, 2026, 8:51am

This is painfully relatable, @eng_director_luis. I manage the error budget implementation on our SRE team and I’ve lived through the exact same pattern — budget burns, freeze proposed, freeze overridden, rinse and repeat.

The Binary Freeze Was Our Biggest Mistake

The fundamental problem with the Google model is that it presents a binary choice: freeze or don’t freeze. That’s politically untenable in most organizations because a full freeze is a dramatic action with visible business consequences. It forces leadership to choose between “reliability” and “revenue” in a single high-stakes decision, and revenue wins every time.

Graduated Consequences Changed the Game

What finally worked for us was replacing the binary freeze with a graduated consequence model. Instead of one threshold that triggers a full stop, we have five levels tied to budget consumption:

Budget Remaining	Consequence
75%+	Normal operations. Ship freely.
50-75%	All deploys must include rollback capability. Canary deployments mandatory. No “YOLO” pushes to prod.
25-50%	Deploys require SRE review and approval. SRE can reject deploys that lack adequate testing or monitoring.
0-25%	New feature deploys require VP-level sign-off. SRE presents risk assessment to VP before each deploy.
Budget exhausted	Deploys require CTO sign-off with documented business justification.

The key insight: we never actually stop deployments. We make them progressively more expensive in terms of process overhead. Nobody wants to write a business justification memo for the CTO every time they want to push code. The friction is the enforcement mechanism, not a binary gate.
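
The table maps naturally onto a small lookup that a deploy pipeline could call. The sketch below is one possible encoding, not the poster's actual tooling. The function name is hypothetical, and sending a value that sits exactly on a boundary (for example 50%) to the less strict tier is an assumption, since the table's ranges overlap at the edges.

```python
def deploy_requirement(remaining: float) -> str:
    """Return the process requirement for a deploy, given budget remaining as a fraction."""
    if remaining <= 0.0:
        return "CTO sign-off with documented business justification"
    if remaining < 0.25:
        return "VP-level sign-off after SRE risk assessment"
    if remaining < 0.50:
        return "SRE review and approval"
    if remaining < 0.75:
        return "Rollback capability and canary deployment"
    return "Normal operations"


for remaining in (0.90, 0.60, 0.40, 0.10, 0.0):
    print(f"{remaining:.0%} remaining: {deploy_requirement(remaining)}")
```

Note that no branch returns "blocked". Every tier still allows the deploy, and only the cost of getting there changes.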

Why This Works Politically

The graduated model works because it doesn’t force leadership into a dramatic all-or-nothing decision. Instead, it creates natural deceleration:

In the 50-75% band, canary requirements slow deployments by about 30% (teams have to set up proper canary analysis instead of just pushing)
In the 25-50% band, the SRE review adds a 4-8 hour delay per deploy and catches risky changes before they hit production
Below 25%, the VP/CTO sign-off doesn’t stop deploys, but it makes the business explicitly accept the reliability risk

In practice, teams self-regulate. When they see the budget dropping below 50%, they voluntarily pull reliability work into the current sprint because they don’t want to deal with the additional process overhead. The error budget becomes a leading indicator that changes behavior before it becomes a crisis.

The Data After 12 Months

Since implementing graduated consequences:

Mean time to detect budget burn dropped from 3 days to same-day (teams actually watch the dashboards now)
Voluntary reliability investment increased by 40% (teams prefer proactive work over reactive process)
We’ve only hit the “CTO sign-off” level once, and that was an external dependency issue

It’s not perfect — we still don’t have a true freeze mechanism. But the graduated approach creates enough friction to meaningfully change team behavior without requiring the organizational authority that most SRE teams don’t have.

The biggest win: framing it as “risk management” instead of “punishment.” Leadership understands risk management. They don’t understand why engineers want to stop shipping code.

cto_michelle · February 12, 2026, 8:52am

I’m going to offer a counterpoint here, @eng_director_luis, because we DO freeze — and it works. But the reason it works has nothing to do with engineering culture or organizational discipline. It’s about money.

The Secret: Tie Error Budgets to Financial Consequences

Our enterprise customers have SLAs that guarantee 99.95% uptime with contractual financial penalties for breaches. If we miss the SLA target in any given month, we owe credits. For our largest customers, a single SLA breach can cost us between $200K and $500K in service credits, plus the relationship damage and churn risk that come with it.

When I implemented error budgets, I explicitly mapped them to our SLA obligations. The error budget isn’t an abstract engineering metric — it’s a countdown to a financial penalty. The dashboard doesn’t just say “Error Budget: 15% remaining.” It says “Error Budget: 15% remaining. Projected SLA breach in 4 days at current burn rate. Estimated financial exposure: $340K.”

That number gets attention in a way that “we might have reliability issues” never does.
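
A dashboard line like that can be generated from three inputs. The sketch below is illustrative only: it assumes that exhausting the budget coincides with breaching the SLA, that the burn rate stays constant, and that the exposure figure is supplied by whoever owns the contract data. How the $340K estimate is actually derived is not described here, so it is treated as an input.

```python
def budget_status_line(remaining: float, daily_burn: float, exposure_usd: float) -> str:
    """Format an error budget status line in business terms.

    remaining    -- fraction of the budget left (0.15 means 15%)
    daily_burn   -- fraction of the total budget consumed per day at the current rate
    exposure_usd -- estimated penalty if the SLA is breached (supplied, not computed)
    """
    pct = remaining * 100
    if daily_burn <= 0:
        return f"Error Budget: {pct:.0f}% remaining. No breach projected at current burn rate."
    days = remaining / daily_burn
    return (
        f"Error Budget: {pct:.0f}% remaining. "
        f"Projected SLA breach in {days:.0f} days at current burn rate. "
        f"Estimated financial exposure: ${exposure_usd / 1000:.0f}K."
    )


# 15% left, burning 3.75% of the budget per day, works out to a 4-day projection.
print(budget_status_line(remaining=0.15, daily_burn=0.0375, exposure_usd=340_000))
```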

How the Freeze Actually Works

When our error budget drops below 25%, the freeze is automatic and the override process is intentionally painful:

The deploy pipeline blocks. Not “warns” — blocks.
To override, a VP must submit a written request that includes: the business justification, the estimated risk to SLA, and explicit acknowledgment of the potential financial penalty.
That request goes to me (CTO) and our VP of Customer Success for approval.
If we approve, the override is logged and reported in the next board meeting.

The key mechanism: nobody wants their name on a board report next to “$500K SLA penalty because I overrode a reliability freeze to ship a feature.” The accountability is personal and visible.
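
The steps above can be sketched as a gate function. This is a hedged illustration, not the actual pipeline: the class, field, and function names are hypothetical, and requiring both the CTO and the VP of Customer Success to approve is an assumption drawn from "if we approve."

```python
from dataclasses import dataclass
from typing import List, Optional

FREEZE_THRESHOLD = 0.25  # deploys block when less than 25% of the budget remains


@dataclass
class OverrideRequest:
    requested_by_vp: str
    business_justification: str
    estimated_sla_risk: str
    acknowledges_penalty: bool
    approved_by_cto: bool = False
    approved_by_vp_customer_success: bool = False

    def is_complete(self) -> bool:
        """All written elements of the request are present."""
        return bool(
            self.requested_by_vp.strip()
            and self.business_justification.strip()
            and self.estimated_sla_risk.strip()
            and self.acknowledges_penalty
        )

    def is_approved(self) -> bool:
        """Assumption: both approvers must sign."""
        return self.approved_by_cto and self.approved_by_vp_customer_success


def may_deploy(remaining: float, override: Optional[OverrideRequest],
               board_log: List[OverrideRequest]) -> bool:
    """Block below the threshold unless a complete, approved override exists."""
    if remaining >= FREEZE_THRESHOLD:
        return True
    if override is None or not override.is_complete() or not override.is_approved():
        return False
    board_log.append(override)  # every granted override is kept for the board report
    return True
```

The important design detail is the last branch: an override never passes silently, because granting it also writes the requester's name into the log that goes to the board.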

Why This Wouldn’t Work Everywhere

I want to be honest about the limitations. This works because:

We have enterprise customers with real SLAs. If your revenue model is self-serve or consumer, you probably don’t have contractual uptime guarantees with financial teeth.
The penalties are large enough to matter. A $5K credit doesn’t change behavior. $500K does.
Our board cares about customer retention. In a growth-at-all-costs environment, the board might prioritize feature velocity even with SLA risk.

For companies without these conditions, I’d suggest @alex_infrastructure’s graduated model as a more pragmatic approach.

My Advice: Create Real Consequences

The core lesson from our experience: error budgets work as an enforcement mechanism only when burning the budget has consequences that leadership genuinely fears. At Google, the consequence is cultural — burning the budget is a mark of shame in an engineering-driven organization. At our company, the consequence is financial. At other companies, you might find a different lever.

Some options I’ve seen work:

Customer contractual penalties (our approach)
Public status page commitments — if you’ve published an SLA on your status page, breaching it is a PR problem, not just an engineering problem
Executive compensation ties — one company I know ties VP-level bonuses partially to SLO attainment. Burning error budgets affects their compensation
Customer advisory board visibility — presenting error budget status to your customer advisory board creates external accountability

The point is the same: abstract engineering metrics don’t drive organizational behavior. Real consequences do. If you can’t find a consequence that matters to your leadership team, error budgets will remain a measurement tool, not an enforcement mechanism — and honestly, that’s still valuable, just not what the SRE book promises.

product_david · February 12, 2026, 8:52am

I’m going to out myself here: I was the VP of Product who kept overriding error budget freezes. Not at Luis’s company, but I’ve done exactly what he describes — looked at the error budget burn, looked at the quarterly target, and chosen the target every single time. I want to explain why, and what finally changed my mind.

The Honest Confession

When your quarterly OKR is “ship features X, Y, and Z by March 31” and the CEO reviews progress every two weeks, reliability feels like an engineering concern that doesn’t directly affect your metrics. I know that sounds short-sighted. It is. But here’s the incentive structure I was operating in:

Ship the feature on time: I get recognized in the all-hands, my team gets praised, my performance review is strong.
Miss the feature for a reliability freeze: I have to explain to the CEO why we’re behind, my team’s velocity metrics drop, and I spend the next quarter rebuilding credibility.
The reliability incident happens: It’s an “engineering problem.” The post-mortem focuses on technical root causes. My name isn’t on it. The organizational consequence flows to the SRE team, not to product.

The incentive structure actively punishes prioritizing reliability and doesn’t punish ignoring it. Until that changes, no amount of error budget dashboards will change product leader behavior.

What Changed My Mind: A $2M Lesson

Six months ago, we were in a final demo for a $2M enterprise deal. The prospect had flown in their CTO and VP of Engineering. Midway through the live demo, our platform hit a reliability issue — elevated error rates, slow response times, intermittent failures. The demo was a disaster.

We lost the deal. $2M in annual revenue, gone because of a reliability issue that would have been caught by the work we deprioritized during the last error budget override. The prospect’s CTO told our sales team: “If this is what the product looks like when you’re trying to impress us, what does it look like in production?”

That single incident changed my perspective more than any dashboard, SRE presentation, or engineering argument ever could. Reliability isn’t an engineering concern — it’s a revenue concern that manifests unpredictably.

My Framework Now: Reliability Buffer

Instead of fighting about error budget freezes after the fact, I now build what I call a “reliability buffer” into quarterly planning:

15% of engineering capacity is pre-allocated for reliability work every quarter. This isn’t negotiable and isn’t tied to error budgets.
The SRE team owns the prioritization of that 15%. They decide what reliability work matters most.
The remaining 85% is for features, and product owns the prioritization.
If the error budget burns, we can pull additional capacity from the feature allocation — but the 15% baseline means we’re always investing in reliability, not just when things break.

This approach avoids the freeze-or-don’t-freeze binary entirely. Reliability investment is a constant, not a reaction. It’s baked into the plan from day one, so I never have to choose between “ship the feature” and “fix reliability” — both are happening simultaneously.
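
In planning terms the buffer is just a fixed split with an optional top-up. A minimal sketch follows, assuming capacity is measured in a single unit such as engineer-weeks. The function name and the extra_reliability_share parameter are illustrative.

```python
BASELINE_RELIABILITY_SHARE = 0.15  # pre-allocated every quarter, owned by SRE


def split_capacity(total_capacity: float, extra_reliability_share: float = 0.0) -> dict:
    """Split quarterly capacity between reliability and feature work.

    extra_reliability_share is the additional fraction pulled from the feature
    allocation when the error budget burns; the 15% baseline is never reduced.
    """
    if not 0.0 <= extra_reliability_share <= 1.0 - BASELINE_RELIABILITY_SHARE:
        raise ValueError("extra share must fit inside the feature allocation")
    reliability = total_capacity * (BASELINE_RELIABILITY_SHARE + extra_reliability_share)
    return {"reliability": reliability, "features": total_capacity - reliability}


# A normal quarter with 100 engineer-weeks: roughly 15 for reliability, 85 for features.
print(split_capacity(100))
# After a budget burn, pulling an extra 10% from features: roughly 25 and 75.
print(split_capacity(100, extra_reliability_share=0.10))
```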

The Advice I’d Give to SRE Teams

If you’re struggling to get product buy-in for error budgets, stop making engineering arguments. Product leaders respond to:

Revenue risk. “This reliability gap could cost us a deal” hits differently than “our p99 latency is elevated.”
Customer stories. Put real customer complaints in front of leadership. “Customer X is evaluating competitors because of our reliability” is more persuasive than any dashboard.
Pre-allocated capacity. Don’t ask for capacity after the budget burns. Get it committed at the start of the quarter when planning goodwill is highest.
Personal accountability. Make the product leader co-own the SLO. When my name is on the SLO alongside the engineering lead, I have skin in the game.

The error budget isn’t the problem. The incentive structure around it is. Fix the incentives and the budget becomes meaningful.