We hit a milestone this quarter that made me realize we’re in trouble.
Our sprint retrospective data shows that 60% of our engineering capacity is going to “keep the lights on” work: production incidents, technical debt, scaling existing features, and infrastructure maintenance.
Only 40% is going to new customer-facing features.
Two years ago, the balance ran the other way: 70% new features, 30% maintenance.
The Slow Slide Into Crisis
It didn’t happen overnight. Every six months, the maintenance percentage crept up another 5 to 10 points:
- Q1 2024: 30% maintenance, 70% features
- Q3 2024: 40% maintenance, 60% features
- Q1 2025: 50% maintenance, 50% features
- Q3 2025: 55% maintenance, 45% features
- Q1 2026: 60% maintenance, 40% features
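To put a number on that drift, here is a minimal sketch that computes the change between consecutive data points and a naive linear projection. The percentages come straight from the list above; the function names and the straight-line extrapolation are assumptions for illustration, not a forecast. The last two steps are smaller than the first two, so a linear projection may well overstate where this is heading.

```python
# Maintenance share of engineering capacity (%), from the list above.
# Data points are six months apart.
maintenance_share = {
    "Q1 2024": 30,
    "Q3 2024": 40,
    "Q1 2025": 50,
    "Q3 2025": 55,
    "Q1 2026": 60,
}


def half_year_deltas(shares):
    """Point change between each pair of consecutive data points."""
    values = list(shares.values())
    return [later - earlier for earlier, later in zip(values, values[1:])]


def average_delta(shares):
    """Mean point change per six-month step."""
    deltas = half_year_deltas(shares)
    return sum(deltas) / len(deltas)


def naive_projection(shares, periods_ahead):
    """Straight-line extrapolation from the last data point, capped at 100%.

    Assumption: the average historical step continues unchanged.
    """
    last = list(shares.values())[-1]
    return min(100.0, last + average_delta(shares) * periods_ahead)


print(half_year_deltas(maintenance_share))     # [10, 10, 5, 5]
print(average_delta(maintenance_share))        # 7.5
print(naive_projection(maintenance_share, 2))  # 75.0 (one year out)
```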
At first, we told ourselves it was normal. “We’re scaling, of course there’s more operational work.” But somewhere along the way we crossed into something different.
The Warning Signs We Missed
Looking back, the signals were clear:
Product delivery:
- Features that used to take 2 sprints now take 4-5 sprints
- Simple changes require touching 6-8 services
- Every deploy carries anxiety because the blast radius is unpredictable
Operational:
- Incident rate up 45% year-over-year
- Mean time to recovery doubled
- On-call rotation is burning people out
Team:
- Senior engineers requesting transfers to other teams
- New hires spend 4-6 weeks just understanding the system
- “We should refactor this” discussions happen weekly but never get prioritized
Customer:
- Support escalations up 30% (performance, bugs, reliability)
- Feature requests piling up in backlog
- Competitive losses because we can’t ship fast enough
Is This the Tipping Point?
Forrester predicts that 75% of tech decision-makers will face moderate to high technical debt severity levels by 2026. I think we’re part of that 75%.
The technical debt tipping point is usually described as the moment “when debt interferes with business operations and can no longer be ignored.”
We’re there. But here’s my question:
What metrics signal you’ve definitively crossed from sustainable debt to crisis mode?
Is it a specific percentage of capacity on maintenance? A velocity drop threshold? Incident rate? Customer churn?
In financial services, we’re risk-averse by nature. But I’m struggling to build a data-driven case for “we need to stop feature development for 2 quarters and address architecture” when the business still wants growth.
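One rough starting point for that case is to combine two numbers already in this post: the share of capacity going to features (70% then, 40% now) and how long a typical feature takes (2 sprints then, 4-5 now). The sketch below multiplies the two ratios to estimate feature throughput relative to two years ago. It assumes team size and average feature scope are unchanged, which is a simplification, and the function name is hypothetical.

```python
def relative_feature_throughput(feature_share_then, feature_share_now,
                                sprints_then, sprints_now):
    """Feature output now as a fraction of output then.

    Assumptions: same team size and comparable feature scope in both periods.
    """
    capacity_ratio = feature_share_now / feature_share_then  # less time on features
    speed_ratio = sprints_then / sprints_now                 # each feature takes longer
    return capacity_ratio * speed_ratio


# Figures from this post: 70% -> 40% feature capacity, 2 -> 4-5 sprints per feature.
pessimistic = relative_feature_throughput(0.70, 0.40, 2, 5)
optimistic = relative_feature_throughput(0.70, 0.40, 2, 4)

print(f"{pessimistic:.0%} to {optimistic:.0%} of the feature throughput of two years ago")
# prints "23% to 29% of the feature throughput of two years ago"
```

Framed this way, the conversation with executives is less about pausing growth and more about the fact that, under these assumptions, the same team is already delivering roughly a quarter of what it used to.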
How do you quantify the tipping point in a way that gets executive buy-in?