We Measure AI Productivity Wrong: 31% Developer Speed Gains, 0% DORA Improvement. What Gives?

Our CFO pulled me aside last week after the board meeting. “Michelle, we’ve invested 00K in AI coding tools this year. What’s the ROI?” I had plenty of individual developer testimonials—engineers saying they’re 30-40% faster, shipping features in days instead of weeks. But when I looked at our DORA metrics? Nothing. Flat. Some actually worse.

I’m not alone in this. The research is painting a confusing picture:

The Individual vs. Organizational Paradox

Individual level: Studies show 30-60% speed improvements for scoped programming tasks. Faros AI’s data shows developers on high-AI-adoption teams completing 21% more tasks and merging 98% more pull requests[1].

Organizational level: But deployment frequency? Same. Lead time for changes? Same. Change failure rate? Actually increased in some teams[2].

What gives?

Where the Gains Disappear

I think I’ve figured out where our productivity gains are going: They’re getting absorbed by downstream bottlenecks we never optimized.

The data is brutal:

  • PR review time up 91%[1] - Our senior engineers are drowning in review queues because AI tripled code output but we didn’t scale review capacity
  • Testing bottlenecks - AI writes code fast, but our test suites weren’t designed for this volume
  • Deployment pipeline unchanged - Same release processes, approval gates, deployment windows

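It’s Amdahl’s Law playing out in real time: speeding up one stage only helps in proportion to that stage’s share of the total time. We accelerated one part (code writing) without modernizing the rest (review, testing, deployment), so the end-to-end gain is capped by everything we left alone.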
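To put rough numbers on the Amdahl’s Law point, here is a minimal sketch. The 25% coding share is a made-up assumption, not something we measured, and I’m reading the “31% faster” figure from the title as a 1.31x speedup of the coding stage only. The function name `overall_speedup` is just for illustration.

```python
def overall_speedup(fraction_accelerated: float, stage_speedup: float) -> float:
    """Amdahl's Law: speedup of the whole when only part of it gets faster."""
    if not 0.0 <= fraction_accelerated <= 1.0:
        raise ValueError("fraction_accelerated must be between 0 and 1")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    untouched = 1.0 - fraction_accelerated
    return 1.0 / (untouched + fraction_accelerated / stage_speedup)


# Hypothetical split: coding is 25% of lead time; review, testing, and
# deployment make up the other 75% and are unchanged.
CODING_SHARE = 0.25

for label, speedup in [("31% faster coding", 1.31), ("10x faster coding", 10.0)]:
    gain = overall_speedup(CODING_SHARE, speedup)
    print(f"{label}: end-to-end speedup {gain:.2f}x")
```

Under those assumptions the end-to-end speedup comes out at about 1.06x for 31% faster coding and about 1.29x for 10x faster coding, and it can never exceed 1 / 0.75 ≈ 1.33x no matter how fast code gets written. If coding is a smaller share of your lead time than 25%, the ceiling is lower still.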

The Measurement Crisis

Here’s the hard part: My CFO doesn’t care about individual velocity. She cares about:

  • Are we shipping features to customers faster?
  • Are we reducing incidents?
  • Are we delivering more business value per engineering dollar?

And honestly? I don’t have good answers yet.

The traditional metrics—DORA, velocity, throughput—were designed for a different era. They assume humans write code at human speed. When AI 10x’s code generation but we’re still measuring deployment frequency, we’re missing the real story.

What I’m Trying

We’re experimenting with a three-layer measurement approach[3]:

  1. Usage metrics: AI tool adoption rates, prompt usage, code acceptance rates
  2. System metrics: PR throughput, review latency, merge-to-deploy time
  3. Business metrics: Feature time-to-market, customer-reported defects, support ticket reduction
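For the system-metrics layer, the calculation can be sketched roughly like this. The `PullRequest` record and its field names are hypothetical; you would map them from whatever your Git host and deploy tooling actually export. I use medians rather than means because a few stuck PRs tend to skew the average.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median


@dataclass
class PullRequest:
    # Hypothetical record shape; map these fields from your own tooling.
    opened_at: datetime
    first_review_at: datetime
    merged_at: datetime
    deployed_at: datetime


def hours(delta: timedelta) -> float:
    """Convert a timedelta to fractional hours."""
    return delta.total_seconds() / 3600


def system_metrics(prs: list[PullRequest], window_days: int) -> dict[str, float]:
    """Median latencies in hours, plus merged PRs per day over the window."""
    if not prs or window_days <= 0:
        raise ValueError("need at least one PR and a positive window")
    return {
        "pr_throughput_per_day": len(prs) / window_days,
        "review_latency_h": median(hours(p.first_review_at - p.opened_at) for p in prs),
        "merge_to_deploy_h": median(hours(p.deployed_at - p.merged_at) for p in prs),
    }


# Made-up sample data: offsets in hours from when each PR was opened.
t0 = datetime(2025, 1, 6, 9, 0)
sample = [
    PullRequest(t0, t0 + timedelta(hours=h_review), t0 + timedelta(hours=h_merge),
                t0 + timedelta(hours=h_deploy))
    for h_review, h_merge, h_deploy in [(4, 10, 58), (30, 36, 40), (8, 12, 36)]
]
metrics = system_metrics(sample, window_days=7)
print(metrics)
```

Tracking these per week, before and after AI adoption, is what should show whether the time saved on writing is reappearing as review latency or merge-to-deploy time.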

Early hypothesis: AI is making us busy but not necessarily effective. We’re writing more code, but are we solving more customer problems?

Questions for This Community

For engineering leaders: What metrics are you actually using to prove AI ROI? What’s resonating with your CFO/board?

For product leaders: How are you connecting AI developer productivity to business outcomes?

For anyone measuring this: Are we fundamentally measuring the wrong things? Should we abandon DORA metrics in the AI era?

I need to have real answers by Q2 earnings. Help me think through this.



  1. Faros AI - DORA Report 2025 Key Takeaways

  2. The AI Productivity Paradox Research

  3. How to Measure AI Developer Productivity and ROI

This resonates deeply, Michelle. In fintech, the review bottleneck is even worse because of our regulatory environment.

AI generates compliant-looking code fast—authentication flows, transaction logging, audit trails. But we still need the same level of rigorous review we’ve always done. Actually, more scrutiny, because regulators don’t accept “the AI wrote it” as an excuse when things go wrong.

The Regulatory Paradox

Our compliance team now spends 40% more time reviewing PRs than before we adopted AI tools. Why?

  • Volume: 3x more code to review per sprint
  • Complexity: AI-generated code sometimes takes unusual approaches that require deeper analysis
  • Accountability: Every line needs a human sign-off for audit trails

We’re exploring AI-assisted code review to match the velocity:

  • GitHub Copilot Workspace for initial security scanning
  • Claude Code reviews for compliance pattern detection
  • Custom LLM trained on our regulatory requirements

Early results: We can triage faster (automated checks catch 60% of issues), but senior engineers still bottleneck on the architectural and business logic review.

The Real Question

Are we just automating ourselves into a different bottleneck? We optimized writing, now we need to optimize review. What’s next—deployment? Testing? Product validation?

It feels like we’re playing whack-a-mole with process bottlenecks.
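A toy model of why the review queue hurts so much: if PRs arrive faster than reviewers can clear them, the backlog grows every sprint instead of settling. The numbers below are invented for illustration (40 PRs per sprint before, 3x that after, capacity of 45), and `review_backlog` is a hypothetical helper, not something from our tooling.

```python
def review_backlog(arrivals_per_sprint: int, capacity_per_sprint: int, sprints: int) -> list[int]:
    """PRs still waiting for review at the end of each sprint."""
    backlog = 0
    history = []
    for _ in range(sprints):
        # Unreviewed PRs carry over; the queue cannot go below zero.
        backlog = max(0, backlog + arrivals_per_sprint - capacity_per_sprint)
        history.append(backlog)
    return history


before_ai = review_backlog(arrivals_per_sprint=40, capacity_per_sprint=45, sprints=4)
after_ai = review_backlog(arrivals_per_sprint=120, capacity_per_sprint=45, sprints=4)
print("before:", before_ai)
print("after: ", after_ai)
```

With these made-up inputs the queue stays empty before and grows by 75 PRs every sprint after. The same arithmetic applies to whichever stage becomes the next bottleneck, which is why it feels like whack-a-mole.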

This is the classic feature factory problem, just with AI acceleration.

Michelle, you’re seeing at the engineering level what I see at the product level: Shipping more ≠ delivering value.

The Product Paradox

Since we adopted AI coding tools 6 months ago:

  • Features shipped per quarter: +65%
  • Feature adoption (30-day active usage): -12%
  • Customer satisfaction (NPS): Flat

We’re building faster, but we’re not building better. In some cases, we’re building the wrong things faster.

Why This Happens

AI doesn’t help with:

  • Problem definition - Understanding what customers actually need
  • Discovery - Validating assumptions before building
  • Strategic prioritization - Saying no to features

It only helps with execution. So if your product strategy is weak, AI just helps you fail faster.

What I’m Measuring Instead

I stopped trying to connect AI productivity to output metrics (velocity, features shipped) and started measuring outcome metrics:

  • Time to validated learning: How fast can we test a hypothesis?
  • Customer problem resolution rate: Are we actually solving pain points?
  • Feature retention: Do users keep using what we built?
  • Revenue per feature: Business impact, not just activity
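As one example, feature retention can be computed from plain usage logs. This is a minimal sketch under my own assumptions: a user counts as retained if they touch the feature again in a 7-day window starting 30 days after their first use. The function name, the input shape, and both thresholds are illustrative, so adjust them to how your analytics are set up.

```python
from datetime import date, timedelta


def feature_retention(usage: dict[str, list[date]], after_days: int = 30,
                      window_days: int = 7) -> float:
    """Share of adopters who used the feature again in a later window.

    usage maps a user id to the dates that user touched the feature.
    A user is retained if any use falls within
    [first_use + after_days, first_use + after_days + window_days).
    """
    adopters = {user: sorted(days) for user, days in usage.items() if days}
    if not adopters:
        return 0.0
    retained = 0
    for days in adopters.values():
        start = days[0] + timedelta(days=after_days)
        end = start + timedelta(days=window_days)
        if any(start <= d < end for d in days):
            retained += 1
    return retained / len(adopters)


# Made-up usage log for four users.
launch = date(2025, 3, 1)
usage = {
    "u1": [launch, launch + timedelta(days=31)],
    "u2": [launch],
    "u3": [launch + timedelta(days=2), launch + timedelta(days=10)],
    "u4": [launch + timedelta(days=1), launch + timedelta(days=33)],
}
print(f"feature retention: {feature_retention(usage):.0%}")
```

In this sample two of the four adopters come back in their retention window, so retention is 50%. Comparing that number across features shipped before and after AI adoption says more than the raw count of features shipped.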

My hypothesis: AI’s biggest ROI is in faster iteration cycles, not more features. Build → measure → learn should be faster. But most companies just do “build” faster and skip the rest.

CFOs understand revenue and retention. Start there.

Michelle, this hits on something I’ve been wrestling with for months: We’re measuring individual productivity in a team sport.

Your CFO is asking the right question, but DORA metrics were never designed to answer it in the AI era.

The Organizational Effectiveness Gap

What worries me more than the DORA metrics:

  1. Knowledge transfer: AI helps individuals work in isolation. Are we still building shared understanding?
  2. Code quality long-term: Will the code we write fast now still be maintainable in 2 years?
  3. Junior developer growth: If AI handles syntax/algorithms, what fundamentals are juniors actually learning?
  4. Team cohesion: When everyone’s heads-down with AI assistants, are we still collaborating?

I ran a survey last month. Here’s what our engineers said:

  • 87% feel more productive individually
  • 62% feel code quality has decreased
  • 71% worry about long-term codebase health
  • Only 31% think the team is shipping faster overall

What I’m Tracking

Beyond DORA, I’m measuring:

  • Developer retention (are we keeping senior engineers who are drowning in review?)
  • Onboarding velocity (time to first PR for new hires—AI helps here!)
  • Cross-team collaboration (PR reviews across team boundaries)
  • Incident response time (are we building systems we can debug?)
  • Technical debt ratio (using SonarQube, tracking AI-generated code separately)
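For the technical debt ratio, the arithmetic is roughly remediation cost divided by the estimated cost to write the code. The sketch below is not SonarQube’s API; it assumes you have already exported lines of code and remediation effort per bucket, and the 30 minutes per line is a configurable assumption that you should check against your own SonarQube settings. How code gets attributed to the “ai-assisted” bucket is also an assumption you would need to define. All figures are invented.

```python
from dataclasses import dataclass


@dataclass
class CodeBucket:
    # Hypothetical aggregate; attribution to "ai-assisted" is up to you.
    name: str
    lines_of_code: int
    remediation_minutes: float


def debt_ratio(bucket: CodeBucket, minutes_per_line: float = 30.0) -> float:
    """Remediation cost divided by the estimated cost to write the code."""
    if bucket.lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    return bucket.remediation_minutes / (minutes_per_line * bucket.lines_of_code)


# Illustrative numbers only, not real measurements.
buckets = [
    CodeBucket("human-written", lines_of_code=40_000, remediation_minutes=36_000),
    CodeBucket("ai-assisted", lines_of_code=20_000, remediation_minutes=42_000),
]
ratios = {b.name: debt_ratio(b) for b in buckets}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1%}")
```

With these invented inputs the ratios come out at 3.0% and 7.0%. The absolute values matter less than the trend per bucket over time.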

The Strategic Question

AI is a productivity amplifier. The 2025 DORA report said it perfectly: AI magnifies the strengths of high-performing orgs and the dysfunctions of struggling ones.

If your processes, culture, and technical practices were already strong, AI will make them better. If they were weak, AI will just help you fail faster.

Maybe the real question isn’t “how do we measure AI productivity” but “are we building organizational capability to absorb AI velocity?”

Coming from the design systems world, I’m seeing the exact same pattern play out.

The Design Parallel

AI tools (Figma AI, v0, Galileo AI) help designers generate components, mockups, variants crazy fast. Individual designer productivity is through the roof.

But guess what’s slower than ever?

  • Design review (checking brand consistency, accessibility, user context)
  • Design system integration (does this new component fit our patterns?)
  • Quality assurance (responsive behavior, edge cases, WCAG compliance)

AI can generate a login form in 30 seconds. It takes me 3 hours to review it for accessibility, make it responsive, add error states, integrate with design tokens, write documentation, and get stakeholder approval.

Quality Gates Are Not Shortcuts

The bottleneck isn’t the creative work anymore. It’s the judgment, taste, and contextual knowledge that ensures quality.

Michelle, you mentioned your DORA metrics didn’t improve. From a design lens, I’d ask:

  • Is the code you’re shipping actually better for users?
  • Are you shipping accessible, performant, maintainable solutions?
  • Or just shipping more code without corresponding quality improvements?

What I Measure

In design systems, we don’t measure “components created.” We measure:

  • Design system adoption rate (are teams using shared components?)
  • Accessibility audit pass rate (WCAG compliance)
  • Component reuse (code/design DRY principles)
  • Time to consistent implementation (not just time to first draft)

Speed is great. But speed without quality is just technical debt with better PR throughput.

The AI era requires us to get really good at the parts AI can’t do: Judgment, context, quality assessment, system thinking.