I’ve been diving into the latest observability research and found something that should make every leader pause: observability leaders are achieving 2.6x annual ROI according to Splunk’s State of Observability report.
But here’s the catch that most teams miss.
The Headline Numbers
The data is compelling:
- 2.6x annual ROI for observability leaders (Splunk 2024)
- Median $2 return per $1 invested (New Relic 2023 Forecast)
- 41% of organizations receive more than $1 million in total annual value
- IBM reports 219% ROI with 90% reduction in developer troubleshooting time
So why isn’t everyone celebrating?
The Measurement Gap
Because most teams aren’t measuring what matters. The same reports reveal:
| What Teams Measure | Percentage |
| --- | --- |
| Exclusively operational metrics (SLIs/SLOs) | 17% |
| Primarily operational, business as "perk" | 58% |
| Elevated business impact metrics | 24% |
Only 24% of observability teams have elevated business impact metrics - including SLAs, revenue impact, and customer experience - to the same importance as operational data.
The Reporting Problem
It gets worse when you look at how teams communicate value:
- 93% report financial/business impact to leadership in some form
- Only 19% do so regularly as part of established processes
- 43% report occasionally
- 31% only when specifically requested by leadership
We’re collecting the data. We’re not translating it.
What Business Metrics Actually Matter
The 2.6x ROI comes from connecting observability to:
- Revenue at risk - What’s the business impact of this service degrading?
- Customer experience scores - How do technical metrics correlate to NPS/CSAT?
- Cost per transaction - What does it cost to process a customer request?
- Conversion impact - How does performance affect checkout/signup rates?
- SLA attainment - Are we meeting customer commitments?
The Real-World Evidence
Lenovo cut MTTR by 85% and maintained 100% uptime during peak e-commerce periods. That's not an MTTR story - that's a revenue protection story.
The organizations achieving 2.6x ROI aren’t better at collecting metrics. They’re better at connecting metrics to business outcomes.
The Cost of Not Measuring Right
- 61% say downtime costs at least $100,000 per hour
- 32% say critical business app outages cost more than $500K per hour
- Organizations with full-stack observability: $6.17M median annual outage cost
- Organizations without: $9.83M median annual outage cost
- Difference: $3.66 million per year
That $3.66M gap? That’s the cost of incomplete measurement.
The Strategic Question
If you’re investing in observability, are you measuring its impact in terms your CFO cares about?
Because the teams getting 2.6x ROI aren’t running better dashboards. They’re running better businesses.
How is your team connecting observability to business outcomes? I’d love to hear what’s working.
David, this resonates deeply with a transformation we went through over the past two years.
Observability as Strategic Investment
When I joined as CTO, our observability spend was categorized as “infrastructure cost” - essentially a tax on running systems. The conversation was always about minimizing it.
That framing was fundamentally wrong.
The Mental Model Shift
Observability isn’t a cost center. It’s an investment in decision quality.
Every business decision we make - from capacity planning to feature prioritization to incident response - is only as good as the data informing it. Observability is the infrastructure that makes those decisions possible.
What Changed Our Approach
We started asking different questions:
| Old Questions | New Questions |
| --- | --- |
| How do we reduce observability costs? | What decisions are we unable to make without better observability? |
| What's our MTTR? | How much revenue is at risk during incidents? |
| Are we meeting SLOs? | Are we meeting customer expectations? |
| What's our uptime? | What's the business impact of degraded performance? |
The Executive Conversation
I now present observability to our board the same way I present R&D investment:
ROI Framework:
- Incidents prevented or shortened → revenue protected
- Developer time saved → velocity increase
- Customer experience improved → retention and expansion
- Decision latency reduced → competitive advantage
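To show the shape of that board-level conversation, here's a minimal Python sketch of the ROI roll-up. Every figure is hypothetical - the point is how the value streams combine into a single multiple, not the numbers themselves.

```python
# Illustrative roll-up of the ROI framework above into one multiple.
# All inputs are hypothetical placeholders, not real figures.

def observability_roi(revenue_protected: float,
                      engineer_hours_saved: float,
                      hourly_eng_cost: float,
                      retention_value: float,
                      annual_spend: float) -> float:
    """Annual value returned per dollar of observability spend."""
    total_value = (revenue_protected                      # incidents prevented/shortened
                   + engineer_hours_saved * hourly_eng_cost  # velocity increase
                   + retention_value)                     # retention and expansion
    # Decision-latency gains are real but hard to dollarize, so they
    # are omitted from this simple sketch.
    return total_value / annual_spend

multiple = observability_roi(
    revenue_protected=1_200_000,   # hypothetical
    engineer_hours_saved=4_000,    # hypothetical
    hourly_eng_cost=120,
    retention_value=420_000,       # hypothetical
    annual_spend=800_000,
)
```

With these placeholder inputs the multiple works out to about 2.6x - the same order as the headline figure, which is exactly the kind of back-of-envelope a CFO can interrogate line by line.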
The Budget Protection Effect
Elastic’s research noted that observability budgets stay protected because “every business runs on IT now.”
But it’s more than that. The teams that frame observability as business enablement rather than operational necessity get more budget, not just protected budget.
We increased our observability investment 40% last year. The CFO approved it because we showed the connection to outcomes.
David, the 58% figure for teams treating business impact as a “perk” hits close to home. That was us until about 18 months ago.
The Translation Problem
Engineering teams are naturally good at measuring technical metrics. We understand p99 latency, error rates, and throughput intuitively. But translating those to business impact requires a different muscle.
Why Engineering Defaults to Operational Metrics
- It’s what we control - I can directly improve MTTR. Revenue is influenced by many factors.
- It’s precise - Technical metrics are unambiguous. Business impact often involves estimation.
- It’s our language - SLOs make sense to engineers. Revenue at risk sounds like finance-speak.
- It’s comfortable - We know how to dashboard SLIs. We’re not sure how to dashboard business impact.
What Changed for Us
We partnered with our finance and analytics teams to build what we call the Impact Translation Layer:
Technical Signal → Service Context → Business Mapping → Dollar Impact
Example:
Checkout API latency > 2s
→ Checkout service degraded
→ Conversion rate drops 2.3% per 100ms delay
→ $47K revenue at risk per hour
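That translation chain can be sketched in a few lines of Python. The 2.3%-per-100ms sensitivity comes from the example above; the baseline latency and hourly revenue figures are hypothetical placeholders chosen for illustration.

```python
# Minimal sketch of one Impact Translation Layer step:
# technical signal -> business mapping -> dollar impact.
# The 2.3%-per-100ms sensitivity is from the worked example above;
# baseline_latency_ms and baseline_revenue_per_hour are hypothetical.

def revenue_at_risk_per_hour(observed_latency_ms: float,
                             baseline_latency_ms: float,
                             conversion_drop_per_100ms: float,
                             baseline_revenue_per_hour: float) -> float:
    """Estimate hourly revenue at risk from added checkout latency."""
    added_delay_ms = max(0.0, observed_latency_ms - baseline_latency_ms)
    # Fraction of conversions (and hence revenue) lost to the delay.
    lost_fraction = (added_delay_ms / 100.0) * conversion_drop_per_100ms
    return baseline_revenue_per_hour * min(lost_fraction, 1.0)

# Hypothetical: checkout latency at 2s against an 800 ms baseline.
risk = revenue_at_risk_per_hour(
    observed_latency_ms=2000,
    baseline_latency_ms=800,
    conversion_drop_per_100ms=0.023,   # 2.3% per 100 ms, from the example
    baseline_revenue_per_hour=170_000, # hypothetical
)
```

Even a crude linear model like this is enough to put a defensible dollar figure next to an alert.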
The Key Insight
The translation doesn’t need to be perfect to be useful. We started with rough estimates:
- “If the payment service is down for an hour during peak, we lose approximately $X in revenue”
- “Each minute of checkout degradation costs approximately $Y in abandoned carts”
Even imprecise business context changed how we prioritized incidents and investments.
Team Adoption
The hardest part was getting engineers to think this way. We started including business impact in:
- Incident severity definitions
- Post-mortem templates
- SLO documentation
- On-call handoffs
Now it’s second nature. Engineers talk about “revenue-impacting services” and “customer experience endpoints” rather than just “tier-1 services.”
David, Luis - you’ve both touched on the analytical challenge that’s become central to my work: how do we systematically connect observability data to business KPIs?
The Data Science Perspective on Business KPI Alignment
The research shows only 28% of organizations currently use AI to align observability data with business KPIs. That’s a massive opportunity gap.
Building the Correlation Models
We’ve been working on what I call Observability-to-Business Correlation Models. The approach:
1. Establish Business Metrics as Dependent Variables
- Conversion rate
- Cart abandonment rate
- Session duration
- Customer satisfaction scores
- Revenue per session
2. Map Technical Metrics as Independent Variables
- Page load time
- API response latency
- Error rates by endpoint
- Availability by service
- Request success rates
3. Build Statistical Relationships
```sql
-- Example: correlating checkout latency to conversion rate by hour
SELECT
    DATE_TRUNC('hour', timestamp) AS hour,
    AVG(checkout_latency_ms) AS avg_latency,
    SUM(conversions)::float / NULLIF(SUM(sessions), 0) AS conversion_rate
FROM merged_observability_analytics
GROUP BY 1
ORDER BY 1;
```
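Once the hourly rows come back, the statistical relationship itself can be computed with a plain Pearson correlation. This is a self-contained sketch with illustrative numbers - in practice the two series would be the output of a query like the one above.

```python
# Sketch of step 3: quantify the relationship between an hourly
# technical metric and an hourly business metric via Pearson
# correlation. The sample data below is illustrative only.
from math import sqrt

def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical hourly observations: avg checkout latency (ms) and
# conversion rate for the same hours.
avg_latency = [310, 450, 900, 1500, 2100, 700]
conversion = [0.061, 0.058, 0.049, 0.040, 0.031, 0.052]

r = pearson(avg_latency, conversion)  # strongly negative for this data
```

A strongly negative `r` is what justifies putting the metric pair into a dashboard; running the same computation across every (technical, business) metric pair is what surfaces non-obvious leaders like search latency.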
The Surprising Findings
When we ran this analysis, we discovered:
- Search latency had 3x stronger correlation to conversion than checkout latency (users abandon before they even add to cart)
- Image load time on product pages was the #2 predictor of bounce rate
- Mobile API performance had different impact curves than desktop
Real-Time Impact Scoring
Now we surface a Business Impact Score alongside technical metrics in our dashboards:
| Alert | Technical Severity | Business Impact Score |
| --- | --- | --- |
| Checkout API > 2s | High | $47K/hour |
| Search degraded | Medium | $82K/hour |
| Product images slow | Low | $23K/hour |
The Business Impact Score often inverts our traditional severity assumptions.
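The inversion is easy to demonstrate: rank the same alerts by estimated business impact instead of technical severity. The dollar figures are the illustrative ones from the table above.

```python
# Sketch: rank active alerts by estimated business impact rather than
# by technical severity. Figures are the illustrative ones from the
# table above; note the ranking no longer matches severity order.

alerts = [
    {"alert": "Checkout API > 2s", "severity": "High", "impact_per_hour": 47_000},
    {"alert": "Search degraded", "severity": "Medium", "impact_per_hour": 82_000},
    {"alert": "Product images slow", "severity": "Low", "impact_per_hour": 23_000},
]

by_impact = sorted(alerts, key=lambda a: a["impact_per_hour"], reverse=True)
# The medium-severity search alert now outranks the high-severity one.
```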
The Feedback Loop
The best part? This data flows back into incident prioritization. On-call engineers now have business context to make better real-time decisions.