Are We Measuring What AI Actually Makes Possible? The $2.5T Blind Spot

Last week my CFO asked me to justify our AI tool spend for next quarter’s budget. I pulled up adoption metrics: 93% of our engineering team uses AI coding assistants daily. I showed throughput: PRs merged up 60%.

Then she asked the real question: “What business outcomes did this drive? Revenue? Customer satisfaction? Time-to-market on strategic features?”

I had no data. And apparently, I’m not alone.

The Investment Paradox

According to recent research, 25% of AI investments have been deferred to 2027 as CFOs demand measurable ROI. Meanwhile, the same studies show AI drove a 59% increase in engineering throughput. We’re seeing real gains—so why are companies pulling back?

The answer: only 14% of finance chiefs report seeing clear, measurable impact from their AI investments. The throughput is there, but it’s not translating to business value we can point to.

Where Our Metrics Break Down

Here’s what I’m seeing at our startup: Our engineering team is shipping more PRs than ever. Velocity charts look great. But our deployment cadence? Unchanged. Features reaching customers? Same timeline as six months ago.

We’re measuring activity (code written, PRs merged, tickets closed) instead of outcomes (features shipped, customer value delivered, revenue impact).

It’s like judging a restaurant by how fast the kitchen cooks, without asking if customers actually received their meals.

The Measurement Blind Spot

The research on this is sobering:

  • 39% of executives say measurement problems prevent calculating AI ROI clearly
  • High-AI-adoption teams completed 21% more tasks, but organizational-level performance showed zero correlation with AI adoption
  • PR review time increased 91% despite throughput gains

The bottleneck shifted. AI made developers faster, but our delivery systems, quality gates, and organizational capacity didn’t scale with the output volume.

What Should We Actually Measure?

I’m a product person, not an engineer, but here’s what I think we need to track:

System-level flow metrics:

  • Time from code commit to production deployment
  • Feature lead time (idea to customer hands)
  • Change failure rate and recovery time
  • Deployment frequency for strategic initiatives

Business outcome metrics:

  • Customer-facing feature velocity
  • Revenue per engineer (for product work)
  • Reduction in tech debt incidents
  • Customer satisfaction with product velocity

Resource efficiency metrics:

  • Cost per deployed feature
  • Engineering capacity freed for strategic work
  • Rework rate / technical debt creation
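Several of the system-level flow metrics above can be computed directly from a deployment log. A minimal sketch, assuming a made-up record shape of (commit time, deploy time, caused-failure flag, recovery time); the data and field layout are illustrative, not any particular tool's schema:

```python
from datetime import datetime

# Hypothetical deployment log: (commit_time, deploy_time, caused_failure, recovered_at)
deploys = [
    (datetime(2026, 1, 5, 9), datetime(2026, 1, 6, 15), False, None),
    (datetime(2026, 1, 7, 10), datetime(2026, 1, 7, 18), True, datetime(2026, 1, 7, 20)),
    (datetime(2026, 1, 9, 11), datetime(2026, 1, 10, 9), False, None),
]

def lead_time_hours(records):
    """Average commit-to-production time, in hours."""
    hours = [(dep - com).total_seconds() / 3600 for com, dep, _, _ in records]
    return sum(hours) / len(hours)

def change_failure_rate(records):
    """Fraction of deployments that caused a production failure."""
    return sum(1 for r in records if r[2]) / len(records)

def recovery_time_hours(records):
    """Mean time to recovery for failed deployments, in hours."""
    failed = [(rec - dep).total_seconds() / 3600
              for _, dep, bad, rec in records if bad]
    return sum(failed) / len(failed) if failed else 0.0
```

Deployment frequency is then just the count of records in the reporting window, filtered to strategic initiatives.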

But I’m honestly not sure if this is the right framework. What are other product and engineering leaders measuring to prove AI tool value to finance?

Are we leaving billions in gains on the table simply because we haven’t updated our dashboards to measure what AI actually makes possible?


Sources: Waydev 2026 Engineering Leadership Blind Spot, Second Talent - Measuring AI ROI 2026, Faros AI Productivity Metrics

This hits close to home. I just presented our Q1 AI tool metrics to the board last week, and the first question from our lead investor was exactly this: “What’s the business impact?”

The Bottleneck Has Shifted

Here’s what we’re seeing at our company:

  • PR volume up 98% year-over-year
  • Main branch success rate dropped from 85% to 68%
  • MTTR (mean time to restore) increased 40%
  • Deployment frequency for customer-facing features? Basically flat.

The problem isn’t that AI isn’t working—it absolutely is. Our engineers are writing code faster. The problem is that our CI/CD pipeline, code review processes, and deployment gates weren’t designed for this volume.

It’s like you widened the on-ramp to the highway but forgot the highway itself is still two lanes. Now you just have a bigger traffic jam.

What We’re Actually Measuring Now

We shifted to DORA metrics but with an AI-specific lens:

  1. Deployment Frequency - For strategic features, not just hotfixes
  2. Lead Time for Changes - Segmented by feature size and complexity
  3. Change Failure Rate - Tracking if AI-generated code has higher regression rates
  4. Mean Time to Recovery - This is where we’re seeing the pain

We’re also tracking review cycle time by stage: initial review, security review, QA, deployment approval. That’s where we found the bottleneck—not in code writing.
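Tracking review cycle time by stage is mostly bookkeeping; a sketch of how it might look (stage names and hours are invented for illustration, not our real data):

```python
from collections import defaultdict

# Hypothetical per-PR stage durations in hours.
pr_stage_times = [
    {"initial_review": 6, "security_review": 20, "qa": 10, "deploy_approval": 2},
    {"initial_review": 4, "security_review": 30, "qa": 8, "deploy_approval": 3},
    {"initial_review": 8, "security_review": 25, "qa": 12, "deploy_approval": 1},
]

def stage_averages(prs):
    """Average hours spent in each review stage across PRs."""
    totals = defaultdict(float)
    for pr in prs:
        for stage, hours in pr.items():
            totals[stage] += hours
    return {stage: total / len(prs) for stage, total in totals.items()}

def bottleneck(prs):
    """The stage with the highest average dwell time."""
    avgs = stage_averages(prs)
    return max(avgs, key=avgs.get)
```

With this toy data, `bottleneck(pr_stage_times)` points at security review, which is the kind of finding that never shows up in PR-count dashboards.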

The CFO Translation Problem

But here’s the challenge, David: Even with these metrics, I struggle to translate this to dollar impact that satisfies our CFO. She wants to see:

  • Cost savings (headcount avoided or time saved × hourly cost)
  • Revenue impact (faster feature delivery = more deals closed?)
  • Risk reduction (fewer incidents × incident cost)
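Those three framings reduce to simple cost-avoidance arithmetic. A sketch with purely illustrative rates and counts (the hourly cost, hours saved, and incident figures below are assumptions, not our actuals):

```python
def time_savings_value(hours_saved_per_eng_per_week, num_engineers,
                       loaded_hourly_cost, weeks_per_year=48):
    """Annualized value of developer time saved (cost-avoidance framing)."""
    return (hours_saved_per_eng_per_week * num_engineers
            * loaded_hourly_cost * weeks_per_year)

def risk_reduction_value(incidents_avoided_per_year, avg_incident_cost):
    """Annualized value of fewer production incidents."""
    return incidents_avoided_per_year * avg_incident_cost

# Illustrative numbers only:
savings = time_savings_value(4, 50, 120)   # 4 h/wk x 50 engineers x $120/h x 48 wk
risk = risk_reduction_value(6, 25_000)     # 6 incidents avoided x $25k each
```

The hard part isn't the arithmetic, it's defending the inputs (especially "hours saved") to a skeptical finance team.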

We’re experimenting with mapping deployment frequency of revenue-driving features to actual ARR growth, but the correlation is messy. Too many confounding variables.

How are other CTOs framing this for finance? Anyone successfully tied AI throughput gains to revenue or cost metrics that CFOs actually care about?

Michelle and David, you’re both describing exactly what I’m living through in financial services. But we have an extra constraint: compliance and security reviews.

When the Highway Hits Downtown

I love Michelle’s highway analogy. In our world, it’s like the highway hits downtown, and now every car needs to stop at three checkpoints: security review, compliance approval, and change management board.

Our data:

  • 40% more code written (AI-assisted)
  • Feature delivery timeline from inception to production: unchanged at 8-12 weeks
  • New bottleneck: Security and compliance review capacity

We hired more engineers to handle the volume, but that’s not where the constraint is. The constraint is in review capacity and architectural decision-making.

The Measurement Infrastructure Gap

David, your framework is solid, but here’s what I’d add: Flow metrics segmented by stage.

We implemented this using a simple workflow tracker:

  1. Development time (Code writing) - This accelerated 45%
  2. Code review time - Increased 30% (volume overwhelm)
  3. Security/compliance review - Increased 60% (more code = more surface area)
  4. Testing/QA time - Increased 40% (more test cases needed)
  5. Deployment approval - Unchanged
  6. Monitoring/validation post-deploy - Unchanged

The insight: AI accelerated stage 1, but we didn’t invest in stages 2-6. The system can’t absorb the output.
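One way to see why the stage 1 gain disappears end to end: apply each stage's observed change to a hypothetical baseline pipeline. The baseline hours below are invented, and "accelerated 45%" is read here as a 45% reduction in stage time:

```python
# Hypothetical baseline hours per stage, and the observed percentage changes.
baseline = {
    "development": 40, "code_review": 16, "security_compliance": 24,
    "testing_qa": 20, "deploy_approval": 8, "post_deploy_validation": 8,
}
change_pct = {
    "development": -45, "code_review": +30, "security_compliance": +60,
    "testing_qa": +40, "deploy_approval": 0, "post_deploy_validation": 0,
}

def new_lead_time(base, deltas):
    """Apply per-stage percentage changes to baseline stage durations."""
    return {s: base[s] * (1 + deltas[s] / 100) for s in base}

before = sum(baseline.values())                          # 116 hours end to end
after = sum(new_lead_time(baseline, change_pct).values())
```

Under these assumed baselines, total lead time rises from 116 to roughly 125 hours: the downstream stages more than absorb the development speed-up.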

What We’re Doing About It

We’re now investing in:

  • Automated security scanning to reduce manual review time
  • Pre-commit compliance checks so issues are caught earlier
  • Senior IC capacity for architectural reviews (not just managers doing this)
  • Observability tooling to reduce post-deployment validation time

My Question Back to the Group

How do other engineering leaders justify the infrastructure investment in measurement and delivery systems to CFOs who only see “AI tools already paid for”?

My CFO’s logic: “We already bought the AI tools. Why do we need to spend more on measurement platforms and delivery pipeline improvements?”

How do you make the case that the secondary investments are what unlock the primary investment’s value?

Coming at this from a different angle—design and product quality—but same fundamental problem.

Our engineers are shipping UI changes and feature iterations way faster now with AI assistance. Sounds great, right?

Except: Design review is now the constraint.

We went from reviewing 15 PR-level design changes per week to 40+. Our 3-person design team can’t keep up. Result:

  • Inconsistent UI patterns shipping to prod
  • Accessibility regressions we catch post-deploy
  • Design system violations that create tech debt
  • More rework and “oops we need to fix that” cycles

The Quality Gate Problem

Luis, your flow metrics by stage are spot on. But here’s what I’d add: Quality gates passed vs. rework rate.

At my failed startup (painful lesson learned), we optimized for velocity without any quality discipline. We shipped fast, but:

  • 30% of shipped features needed immediate follow-up fixes
  • Customer complaints about inconsistent UX tripled
  • Engineering morale dropped because they felt like they were “just churning”

Velocity without quality = churn.

What This Means for Measurement

David’s framework needs a quality dimension. Maybe:

  • First-time quality rate (features that don’t need immediate rework)
  • Design review cycle time (separate from code review)
  • Accessibility/compliance defects per release
  • Customer-reported UX issues post-deploy
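The first of those is the easiest to operationalize; a tiny sketch with invented numbers:

```python
def first_time_quality_rate(features_shipped, features_reworked):
    """Share of shipped features that needed no immediate follow-up fixes."""
    return (features_shipped - features_reworked) / features_shipped

# e.g. 40 features shipped in a release cycle, 12 needed immediate rework
rate = first_time_quality_rate(40, 12)   # 0.7
```

A rate trending down while PR volume trends up is exactly the "velocity without quality = churn" pattern in numbers.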

Provocative Thought

Maybe AI is amplifying our lack of design systems and quality infrastructure.

When code was slower to write, we had natural gates that caught quality issues. Now that code flows fast, we’re seeing all the gaps in our design systems, review processes, and quality standards.

AI didn’t create these problems—it’s just making them impossible to ignore.

Anyone else seeing quality debt accumulate faster despite (or because of?) AI-accelerated development?

This thread is gold. I literally brought these exact stats to our board meeting last week, so this is very real for me right now.

The Data That Made My Board Go Silent

  • 60% more PRs merged quarter-over-quarter
  • 91% longer PR review times
  • 21% more individual tasks completed
  • Net productivity at the organizational level: Zero correlation with AI adoption

That last one is what got their attention. Despite all the individual gains, when we zoom out to team and org level, the productivity improvements evaporate.

This Is a Leadership Capacity Problem

Michelle, you mentioned the CI/CD bottleneck. Luis, you pointed to review capacity. Maya, you highlighted design review constraints.

I think these are all symptoms of the same root cause: Leadership and review capacity hasn’t scaled with output volume.

Current average span of control for engineering managers: 12.1 reports (up from 10.9 in 2024). Research suggests the optimal span is 5-10 for technical teams.

Our managers are drowning in:

  • Code reviews (60% more volume)
  • Architectural decisions (more proposals due to faster iteration)
  • Quality gate approvals (Maya’s point about design review)
  • 1-on-1s and team coordination (unchanged headcount)

What the Bottleneck Really Is

It’s not CI/CD pipelines. It’s not tooling. It’s human judgment capacity at the senior/lead/manager level.

When individual developers get 59% faster but the review/approval layer stays constant, you get exactly what we’re seeing: more work in progress, longer cycle times, and no net organizational throughput gain.

Our Approach: Invest in Senior IC Capacity

We’re doing three things:

  1. Reducing manager spans from 12 to 8 - Hiring more managers is expensive, but necessary
  2. Investing in Staff/Principal IC roles - Senior engineers who can unblock architectural and review decisions without manager involvement
  3. Measuring flow efficiency at the team level, not individual productivity

The metric we’re tracking: Team-level flow efficiency = (Value-add time) / (Total lead time)

When AI accelerates individual coding but review time explodes, flow efficiency drops even if individual metrics look good.
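The metric itself is trivial to compute; the point is what it surfaces. A sketch with invented before/after numbers:

```python
def flow_efficiency(value_add_hours, total_lead_time_hours):
    """Team-level flow efficiency: share of lead time spent on value-add work."""
    return value_add_hours / total_lead_time_hours

# Before AI: 30 h of active work inside a 100 h lead time.
before = flow_efficiency(30, 100)   # 0.30
# After AI: coding shrinks to 18 h, but review queues push lead time to 120 h.
after = flow_efficiency(18, 120)    # 0.15
```

Individual coding got faster, yet flow efficiency halved, which is precisely the gap between per-developer dashboards and organizational throughput.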

The Challenge I’m Wrestling With

David asked how to justify this to CFOs. Here’s my framing:

“AI gave us a 59% throughput increase at the individual level. But we’re losing 40-60% of that gain to review and coordination bottlenecks. Investing $X in senior capacity and tooling will capture $Y of that lost value.”

But I’ll be honest: quantifying that $Y is hard. I’m estimating based on opportunity cost of delayed features and engineer productivity lost to context switching.

Anyone found a more rigorous way to quantify the value of “unblocking the review layer”? How do you put a dollar figure on senior IC capacity or reduced manager spans?

This is the conversation every VP Eng and CTO needs to have with their CFO in 2026. AI didn’t solve our delivery problems—it made our organizational design problems impossible to ignore.