AI writes 41% of our code. Vendors claim 55% speedups. Bain says gains are 'unremarkable.' Who's measuring wrong?

As of 2026, AI coding assistants write 41% of all code, and 84% of developers use these tools daily. That’s not a pilot program anymore—it’s production infrastructure for how we build software.

But here’s where it gets interesting: the productivity numbers tell wildly different stories depending on who’s measuring.

The Vendor Story

GitHub, Google, and Microsoft—all vendors of AI coding tools—published studies showing developers completing tasks 20% to 55% faster. GitHub Copilot users report feeling more productive. Google’s internal studies show significant velocity gains. Microsoft’s research highlights developer satisfaction improvements.

Those are compelling numbers. If your engineers could ship 50% faster, that would fundamentally change your roadmap capacity, right?

The Independent Reality Check

Bain & Company’s analysis paints a different picture. Their research shows teams using AI assistants see 10% to 15% productivity boosts—and critically, that time saved often isn’t redirected toward higher-value work. So even those modest gains don’t translate into positive returns.

Why the gap? Jue Wang, a partner at Bain, points out that developers spend only 20% to 40% of their time coding. Even a significant speedup there translates to more modest overall gains.

More striking: writing and testing code accounts for about 25% to 35% of the time from initial idea to product launch. Speeding up these steps does little to reduce time to market if other stages remain bottlenecked.
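
To make that concrete, here is the back-of-the-envelope version. The coding share is Bain’s range above; the assumption that AI cuts coding time in half is hypothetical, roughly the most generous reading of the vendor claims:

```python
# Back-of-the-envelope: how much does a coding speedup shrink the whole
# idea-to-launch timeline? Coding share is Bain's 25-35% range; the
# "AI halves coding time" assumption is hypothetical and generous.

def overall_time_saved(coding_share: float, coding_time_cut: float) -> float:
    """Fraction of total cycle time saved when only the coding share shrinks."""
    return coding_share * coding_time_cut

for share in (0.25, 0.35):
    saved = overall_time_saved(share, coding_time_cut=0.50)
    print(f"coding is {share:.0%} of the cycle, coding time halved -> "
          f"{saved:.1%} faster to market")

# Output:
# coding is 25% of the cycle, coding time halved -> 12.5% faster to market
# coding is 35% of the cycle, coding time halved -> 17.5% faster to market
```

Even under a generous assumption, the ceiling lands in the same neighborhood as Bain's 10% to 15% figure.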

The Paradox Nobody’s Talking About

Here’s what really keeps me up at night: analysis of over 10,000 developers across 1,255 teams shows developers using AI complete 21% more tasks and merge 98% more pull requests.

Yet company-wide delivery metrics for throughput and quality show no improvement.

Individual developers are working faster. But companies aren’t shipping better software any faster. Where are those productivity gains going?

The Measurement Methodology Problem

I think we’re measuring the wrong things. Vendor studies measure task completion velocity in controlled environments. But production software development isn’t a series of isolated coding tasks.

Independent studies like METR’s randomized controlled trial found developers take 19% longer when using AI tools once you include review and debugging time. That’s the opposite of vendor claims.

The overhead of prompting, waiting, reviewing AI suggestions, and debugging generated code can exceed the coding speedup. Cursor data shows only 39% of AI generations were accepted—that’s a lot of friction.

The Questions I’m Wrestling With

As someone responsible for product velocity and delivery, I need better frameworks for evaluating AI productivity claims:

  1. What should we actually measure? Task completion speed? Time to production? Change failure rate? Developer satisfaction?

  2. Are we optimizing for individual velocity or organizational throughput? Because those appear to be different outcomes.

  3. How do we account for the systemic effects? If AI speeds up coding but creates downstream bottlenecks in code review and testing, did we actually gain anything?

  4. Is the value in speed or quality? Maybe AI’s real benefit isn’t velocity but enabling better architecture or more thorough testing.

I’d love to hear from this community:

  • How are you measuring AI coding assistant impact at your organizations?
  • What metrics matter for evaluating these tools?
  • Have you seen the vendor-claimed productivity gains translate to actual business outcomes?

Because right now, it feels like we’re either measuring the wrong things, or selling tools that optimize for the wrong part of the development process.


David, this resonates deeply with what we’re seeing in our engineering org. You’ve identified the core measurement problem, and I’d add a systems-thinking perspective.

Coding isn’t the bottleneck anymore. Bain’s research showing coding represents only 25-35% of time-to-market is consistent with our experience in fintech. When we instrumented our delivery pipeline, we found:

  • Code authoring: ~30% of cycle time
  • Code review and iteration: ~25%
  • Testing and QA: ~20%
  • Deployment and validation: ~15%
  • Requirements clarification and design: ~10%

AI accelerates that 30%. But if code review now becomes the bottleneck—which it has—we’ve just shifted the constraint downstream.
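
A crude way to see the constraint shift, using the stage shares above and a hypothetical 40% cut in authoring time (the 40% is illustrative, not something we measured):

```python
# Crude constraint-shift sketch: cut authoring time and see how little the
# total cycle shrinks, and which stage is now the biggest slice. Stage shares
# are the rough numbers from the list above; the 40% cut is illustrative.

stages = {
    "authoring": 0.30,
    "review_and_iteration": 0.25,
    "testing_and_qa": 0.20,
    "deploy_and_validate": 0.15,
    "requirements_and_design": 0.10,
}

authoring_time_cut = 0.40  # illustrative: AI removes 40% of authoring time
stages["authoring"] *= (1 - authoring_time_cut)

new_total = sum(stages.values())
print(f"cycle time after the speedup: {new_total:.0%} of before")
print(f"largest remaining stage: {max(stages, key=stages.get)}")

# Output:
# cycle time after the speedup: 88% of before
# largest remaining stage: review_and_iteration
```

Review becomes the largest slice, which matches exactly what we saw.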

The individual vs. organizational productivity disconnect is real. We ran an internal pilot with 40 engineers using AI assistants. Individual commit velocity went up 18%. Pull request volume increased 32%. Sprint velocity? Flat. Why?

  1. Review burden exploded. More PRs to review, each requiring the same cognitive load
  2. Test maintenance increased. AI-generated code had more edge cases that surfaced in integration
  3. Documentation lagged. Fast code, slow context-sharing
  4. Architecture drift. Without coordinated design, 10 engineers moving fast created inconsistency

We need to measure what actually matters for delivery: DORA metrics.

  • Deployment frequency: Are we shipping to production more often?
  • Lead time for changes: Time from commit to production
  • Change failure rate: Quality of what we ship
  • Time to restore service: Resilience when things break

In our pilot, none of these improved. Individual velocity gains got absorbed by systemic friction.
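
For anyone who wants to instrument this, here is a minimal sketch of how the four metrics can be computed from deployment and incident records. The record shapes and sample numbers are made up for illustration; in practice they would come out of your CI/CD and incident-tracking systems:

```python
# Minimal sketch: compute the four DORA metrics from deployment and incident
# records. Record shapes and sample data are made up for illustration.
from datetime import datetime
from statistics import median

# Each deployment: when it shipped, when its changes were first committed,
# and whether it caused a failure in production.
deployments = [
    {"deployed_at": datetime(2026, 1, 5),  "first_commit_at": datetime(2026, 1, 2),  "caused_failure": False},
    {"deployed_at": datetime(2026, 1, 9),  "first_commit_at": datetime(2026, 1, 7),  "caused_failure": True},
    {"deployed_at": datetime(2026, 1, 16), "first_commit_at": datetime(2026, 1, 10), "caused_failure": False},
]
# Each incident: how long it took to restore service.
incidents = [{"restore_hours": 3.5}]

window_days = 28  # reporting window

deploy_frequency = len(deployments) / window_days  # deploys per day
lead_times = [(d["deployed_at"] - d["first_commit_at"]).total_seconds() / 3600
              for d in deployments]
lead_time_hours = median(lead_times)
change_failure_rate = sum(d["caused_failure"] for d in deployments) / len(deployments)
time_to_restore_hours = median(i["restore_hours"] for i in incidents)

print(f"deployment frequency: {deploy_frequency:.2f}/day")
print(f"median lead time: {lead_time_hours:.0f} h")
print(f"change failure rate: {change_failure_rate:.0%}")
print(f"median time to restore: {time_to_restore_hours:.1f} h")
```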

The measurement methodology you’re wrestling with needs to account for the entire value stream, not just the coding portion. Vendor studies optimize for demo-ability, not deployability.

The METR study finding that developers are actually 19% slower with AI tools hit me hard because I’ve felt this friction but couldn’t quantify it.

Here’s what nobody talks about in the “AI makes you faster” narrative: the cognitive overhead of working with AI.

When I’m using an AI coding assistant, my workflow looks like this:

  1. Think about what I need → 30 seconds
  2. Craft a prompt that’s specific enough → 45 seconds
  3. Wait for generation → 10 seconds
  4. Read and evaluate the output → 60 seconds
  5. Accept/reject/modify → 30-90 seconds depending on quality
  6. Context-switch back to my mental model → 20 seconds

That’s 3-4 minutes per interaction. And the kicker? Only 39% of AI generations get accepted according to Cursor’s data.
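
Put numbers on that friction using the rough timings above and Cursor's acceptance rate (everything here is back-of-the-envelope, not a measurement):

```python
# Back-of-the-envelope cost of AI suggestions, using the rough step timings
# above and the 39% acceptance rate. All inputs are approximate.

seconds_per_interaction = 30 + 45 + 10 + 60 + 60 + 20  # step 5 taken at its 60s midpoint
acceptance_rate = 0.39

# Rejected generations still consume the interaction time, so the expected
# wall-clock cost per *accepted* suggestion is much higher than per attempt.
minutes_per_accepted = seconds_per_interaction / acceptance_rate / 60

print(f"~{seconds_per_interaction / 60:.1f} min per interaction")
print(f"~{minutes_per_accepted:.0f} min of interaction overhead per accepted suggestion")

# Output:
# ~3.8 min per interaction
# ~10 min of interaction overhead per accepted suggestion
```

Roughly ten minutes of overhead for every suggestion that actually lands, before any debugging of subtle mistakes.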

Compare that to just writing it myself when I’m in flow state: I know exactly what I want, my fingers execute it, I stay in my mental model. No context switches. No evaluation overhead. No “is this subtly wrong?” paranoia.

The real cost is context-switching. Every AI interaction pulls me out of deep work. I’m not just coding anymore—I’m managing an assistant, reviewing its work, debugging its misunderstandings.

Luis mentioned the review burden increasing. That’s the other side: when I’m reviewing AI-generated code from teammates, I have to be more skeptical. I can’t trust patterns I’d normally recognize because I don’t know if they were thoughtfully chosen or just what the AI suggested first.

The vendor studies measure isolated task completion. They don’t measure:

  • Cognitive load and mental fatigue
  • Flow state disruption
  • Reviewer trust and thoroughness
  • Debugging time for subtle AI bugs
  • Long-term code maintainability

Maybe AI tools will get better. Maybe we’ll develop better prompting skills. But right now, for me, the productivity equation isn’t as simple as “faster code authoring = faster shipping.”

Sometimes the fastest way to ship is to stay in flow and write it yourself.

From the CTO seat, this debate has real financial implications. Our CFO wants ROI justification for every dollar we spend on AI tooling, and the vendor claims aren’t holding up under scrutiny.

The investment reality: We’re spending ~$20-30/developer/month on AI coding assistants. For a 100-person engineering team, that’s $24-36K annually. Not huge, but our CFO is asking: what’s the return?
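
The licensing line item is the easy part of the math: the break-even gain on licensing alone is tiny, which is exactly why the CFO question is about realized value, not the invoice. A quick sketch (the fully loaded cost figure is a hypothetical placeholder, not our actual number):

```python
# Break-even arithmetic on licensing alone. $25/dev/month is the midpoint of
# the range above; the fully loaded cost per engineer is a hypothetical
# placeholder, not an actual figure.

team_size = 100
license_per_dev_per_month = 25        # midpoint of $20-30
fully_loaded_cost_per_eng = 180_000   # hypothetical annual fully loaded cost

annual_license_cost = team_size * license_per_dev_per_month * 12
annual_eng_cost = team_size * fully_loaded_cost_per_eng

# Productivity gain (realized as delivered value) needed to cover the licenses.
break_even_gain = annual_license_cost / annual_eng_cost

print(f"annual licensing: ${annual_license_cost:,}")
print(f"break-even productivity gain: {break_even_gain:.2%}")

# Output:
# annual licensing: $30,000
# break-even productivity gain: 0.17%
```

The problem isn't covering $30K; it's that the gains have to show up somewhere we can see them, and the governance costs described below sit on the other side of the ledger.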

Vendors promise 20-55% productivity gains. If true, that should translate to at least one of:

  1. Shipping features faster (increased revenue/reduced time-to-market)
  2. Reducing headcount needs (lower costs)
  3. Higher quality/fewer incidents (reduced operational burden)

We’re seeing none of these outcomes materially improve. And we’re not alone—Bain’s research shows enterprises deferring 25% of planned AI investments to 2027 because fewer than one-third of decision-makers can link AI tools to financial growth.

The measurement problem is a governance problem. Vendor studies are marketing, not science. They’re designed to be optimistic. When GitHub runs a study on Copilot productivity, they:

  • Select for high-engagement users (survivorship bias)
  • Measure in controlled environments (not production complexity)
  • Focus on metrics that favor the tool (task completion, not delivery)

What we need are controlled experiments with business metrics (a rough analysis sketch follows the list):

  • A/B test teams with/without AI tools for 6 months
  • Measure actual delivery: features shipped, revenue impact, incident rates
  • Account for total cost: licensing + oversight + review burden + quality issues
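
A minimal sketch of the end-of-pilot comparison, assuming per-team delivery counts for the AI and control arms over the same window (the data here are placeholders, and a real study would pre-register the analysis and control for team size and domain):

```python
# Minimal end-of-pilot comparison: features shipped per team for AI-tooling
# vs. control teams over the same six months. All numbers are placeholders.
from statistics import mean, stdev
from math import sqrt

ai_teams      = [14, 11, 13, 12, 15, 12]   # features shipped per AI-arm team
control_teams = [12, 13, 12, 14, 12, 13]   # features shipped per control team

def summarize(label, xs):
    print(f"{label}: mean {mean(xs):.1f}, sd {stdev(xs):.1f}")

summarize("AI teams     ", ai_teams)
summarize("control teams", control_teams)

# Crude effect size (Cohen's d); a real analysis would use a pre-registered
# test, more teams per arm, and the revenue/incident metrics as well.
pooled_sd = sqrt((stdev(ai_teams) ** 2 + stdev(control_teams) ** 2) / 2)
d = (mean(ai_teams) - mean(control_teams)) / pooled_sd
print(f"Cohen's d: {d:.2f} (below ~0.2 is usually read as negligible)")

# Output:
# AI teams     : mean 12.8, sd 1.5
# control teams: mean 12.7, sd 0.8
# Cohen's d: 0.14 (below ~0.2 is usually read as negligible)
```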

When we piloted this rigorously, the results were sobering. Individual developers reported feeling more productive. But organizational metrics showed marginal gains at best, and in some cases regression due to increased review load and quality issues.

The governance overhead is real. AI-generated code requires more scrutiny. We’ve had to invest in:

  • Enhanced code review processes
  • Additional security scanning (AI loves to generate vulnerable patterns)
  • Architecture review to prevent drift
  • Documentation standards to counteract AI’s context-free output

These aren’t free. The cost of governance can exceed the productivity gains.

I’m not anti-AI. But I am pro-rigor. We need better measurement frameworks and honest accounting of total costs before we can claim productivity wins.

This conversation highlights something I’ve been wrestling with: we’re treating AI coding assistants as individual productivity tools when they’re actually organizational capability questions.

The measurement challenge isn’t just technical—it’s cultural and systemic.

Different contexts need different metrics. What I’ve learned leading engineering through rapid growth:

For senior engineers: AI might slow them down (Maya’s flow state point resonates). They have strong mental models, type fast, and know the codebase. The overhead of managing AI assistance can exceed the benefit. Metric: Flow state preservation, not task speed.

For mid-level engineers: Mixed results. Sometimes AI helps with unfamiliar patterns. Sometimes it generates plausible-but-wrong code they don’t catch. Metric: Learning velocity and code review feedback quality.

For junior engineers: This is complex. AI can help them contribute faster initially. But are they learning the underlying patterns, or just becoming good at prompting? Metric: Skill development trajectory, not just output volume.

Luis mentioned architecture drift—this is where organizational culture matters. If 40 engineers use AI without alignment, you get 40 different implementation styles. The productivity gains fragment into integration costs.

What I’m measuring instead:

  1. Developer satisfaction and engagement: If tools make work feel like “managing an assistant” (Maya’s phrase), we’re degrading the experience regardless of output metrics

  2. Team cohesion and knowledge sharing: Are we building shared understanding, or just generating code faster in isolation?

  3. Sustainable pace: Can teams maintain this velocity without burning out from increased review burden and cognitive load?

  4. Learning and growth: Are engineers developing deeper expertise, or becoming dependent on AI for patterns they should internalize?

Michelle’s ROI framing is crucial for executive conversations. But from the VP Engineering seat, I also care about human sustainability. A 15% productivity gain that increases burnout and reduces learning isn’t a win.

The vendor vs. consultant measurement gap exists because they’re optimizing for different outcomes. Vendors optimize for adoption metrics. Consultants optimize for business outcomes. Neither is optimizing for engineering team health and capability development.

Maybe the right question isn’t “who’s measuring wrong?” but “what are we actually trying to optimize for?”

Because if we’re optimizing for short-term velocity at the expense of team capability, code quality, and sustainable pace, we’re making a terrible trade.