Developers Think AI Made Them 24% Faster. Study Shows 19% Slower. How Do We Measure What's Real?

I just read the METR study that dropped in mid-2025 and I’m honestly floored by the numbers. Before using AI coding tools, developers predicted they’d be 24% faster. The actual result? 19% slower. And here’s the kicker—even after experiencing that slowdown, they still believed they were 20% faster.

That’s a 39-percentage-point gap between what developers believed afterward (20% faster) and what was actually measured (19% slower). How do we make tool decisions when the people using them can’t accurately assess their own productivity?
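For anyone who wants to check the arithmetic, here is a minimal sketch. It treats "faster" as a positive number and "slower" as a negative one on the same scale, which is a simplification of how the study reports its results. The function name `gap_points` is just an illustrative helper.

```python
# Figures as quoted in the post; positive = faster, negative = slower.
predicted_speedup = 24   # forecast before the tasks, in percent
measured_speedup = -19   # measured change, in percent
believed_speedup = 20    # self-estimate after the tasks, in percent


def gap_points(perceived: int, measured: int) -> int:
    """Difference between a perceived and a measured change, in percentage points."""
    return perceived - measured


forecast_gap = gap_points(predicted_speedup, measured_speedup)
hindsight_gap = gap_points(believed_speedup, measured_speedup)

print(f"Forecast vs measured: {forecast_gap} points")    # 43
print(f"Hindsight vs measured: {hindsight_gap} points")  # 39
```

The 39-point figure is the hindsight gap. Measured against the up-front forecast, the gap is 43 points.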

Why This Matters for Product Leaders

I’m coming at this from the product side, not engineering, but this has massive implications for how we evaluate and justify AI tool spend:

1. Self-reported productivity is unreliable
Developers feel faster because they complete individual coding tasks more quickly. But they’re discounting the expanded time spent debugging, reviewing, and validating AI-generated code. The METR study found this pattern across 16 experienced developers working on real issues in their own repositories—not synthetic benchmarks.

2. CFOs are demanding ROI proof, and we don’t have it
Only 29% of executives can confidently measure AI ROI. Meanwhile, 56% of CEOs report zero measurable ROI from AI investments in the past 12 months. That’s not sustainable when we’re asking for budget increases for these tools.

3. The measurement gap creates strategic risk
When developers love a tool but it doesn’t improve team-level metrics (DORA, cycle time, throughput), what do we do? Some teams report 40% individual efficiency gains but see no improvement in delivery metrics. That disconnect matters.

The Context Problem Nobody’s Talking About

The METR researchers believe the slowdown happens because experienced developers have tons of project context that AI assistants don’t have. So developers spend time retrofitting their own intent and problem-solving strategies onto the AI’s output, then debugging that output extensively.

It’s like having a very fast junior developer who doesn’t understand the broader architecture or product goals. The code gets written faster, but the total cycle time increases.

What Should We Actually Measure?

Here’s where I’m struggling: traditional delivery metrics (DORA metrics such as deployment frequency, plus cycle time) aren’t showing the gains that developers swear they’re experiencing. But developer happiness and tool adoption are off the charts.

Should we:

  • Trust the subjective experience and invest more in tools developers love?
  • Trust the objective metrics and question whether these tools actually help?
  • Develop new frameworks that capture something neither self-reports nor DORA metrics are measuring?

I’m genuinely curious how other product and engineering leaders are navigating this. Are you measuring AI tool impact? What metrics are you using? How do you reconcile the perception gap with business outcomes?

David, this hits close to home. We rolled out GitHub Copilot across our 40+ person engineering org six months ago and I’ve been getting weekly questions from our CFO about ROI. The honest answer? I don’t have clean numbers.

What I’m Seeing on the Ground

Your point about the context problem is dead-on. My senior engineers spend way more time reviewing PRs now because the volume of code has increased. Juniors are shipping 2-3x more code per sprint, but about 30% of it comes back in review with “this works but doesn’t fit our patterns” or “this will create tech debt downstream.”

The code runs. Tests pass. But it’s optimized for the immediate problem, not the system architecture.

The Executive Pressure Is Real

Our CFO literally asked me last month: “We spent $150K on AI tools this year. Show me the $300K in productivity gains.” I showed him:

  • Developer satisfaction scores (up 40%)
  • Self-reported productivity (up 25-30%)
  • Deployment frequency (flat)
  • Cycle time (slightly worse)
  • Incident rate (up 15%)

He said “This looks like we paid $150K to make developers happier but ship the same amount with more incidents.” And I couldn’t really argue with that framing.
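A rough sketch of the math behind the CFO's ask, in case it helps frame the conversation. The spend, the target, and the headcount come from the post above. The hourly cost is a placeholder assumption, not a real number from anyone's org, and both function names are hypothetical helpers.

```python
def simple_roi(gain: float, cost: float) -> float:
    """Return ROI as a fraction: 1.0 means the gain was double the cost."""
    return (gain - cost) / cost


def hours_needed_per_engineer(target_gain: float, engineers: int, hourly_cost: float) -> float:
    """Hours each engineer must save per year for the savings to reach target_gain."""
    return target_gain / (engineers * hourly_cost)


tool_spend = 150_000    # from the post
target_gain = 300_000   # the CFO's ask, from the post
engineers = 40          # "40+ person engineering org", from the post
hourly_cost = 100.0     # ASSUMPTION: placeholder loaded cost per engineer-hour

roi_if_target_met = simple_roi(target_gain, tool_spend)
roi_if_no_gain = simple_roi(0, tool_spend)
hours_each = hours_needed_per_engineer(target_gain, engineers, hourly_cost)

print(f"ROI if the target is met: {roi_if_target_met:.0%}")
print(f"ROI if no gain is measurable: {roi_if_no_gain:.0%}")
print(f"Hours each engineer must save per year: {hours_each:.0f}")
```

Under that assumed rate the target works out to about 75 hours per engineer per year, roughly an hour and a half a week. The catch is that saved hours only count if they show up in delivery, and flat deployment frequency suggests they aren't.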

We Need Better Metrics, Fast

I think we need to separate “coding velocity” from “delivery throughput” from “business value delivered.” Right now we’re conflating all three.

Some ideas I’m exploring:

  • Track time from “problem identified” to “solution in production and validated” (not just coding time)
  • Measure technical debt accumulation separately for AI-generated vs human-written code
  • Survey developers specifically about where AI helps vs hurts (boilerplate vs architecture vs debugging)
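The first idea can be sketched in a few lines. This assumes each work item records when the problem was identified and when the fix was validated in production. The `WorkItem` fields and the `ai_assisted` tag are hypothetical names, and the sample dates are made up purely to show the mechanics, not real results.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import median
from typing import Optional

SECONDS_PER_DAY = 86_400


@dataclass
class WorkItem:
    identified_at: datetime          # when the problem was first logged
    validated_in_prod_at: datetime   # when the fix was confirmed in production
    ai_assisted: bool                # hypothetical tag, set by the author or reviewer


def median_lead_time_days(items: list[WorkItem], ai_assisted: bool) -> Optional[float]:
    """Median days from 'problem identified' to 'validated in production'."""
    durations = [
        (item.validated_in_prod_at - item.identified_at).total_seconds() / SECONDS_PER_DAY
        for item in items
        if item.ai_assisted == ai_assisted
    ]
    return median(durations) if durations else None


# Made-up sample data, for illustration only.
items = [
    WorkItem(datetime(2025, 1, 1), datetime(2025, 1, 5), ai_assisted=True),
    WorkItem(datetime(2025, 1, 2), datetime(2025, 1, 10), ai_assisted=True),
    WorkItem(datetime(2025, 1, 3), datetime(2025, 1, 8), ai_assisted=False),
]

ai_median = median_lead_time_days(items, ai_assisted=True)
human_median = median_lead_time_days(items, ai_assisted=False)
print(ai_median, human_median)
```

Using the median rather than the mean keeps one stuck ticket from dominating the number. The hard part in practice is getting a trustworthy "identified" timestamp and an honest `ai_assisted` tag, not the calculation.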

But honestly, I’m flying blind here. The industry needs better frameworks for this, like yesterday.

Luis, your CFO’s framing is uncomfortably accurate. “We paid to make developers happier but ship the same amount with more incidents.” That’s the conversation I’m dreading with our board.

The Strategic Tension

What’s happening here is a classic short-term vs long-term tradeoff that we’re not being honest about:

Short term: Individual developers feel more productive, morale improves, recruitment might get easier (“we use the latest AI tools”)

Long term: We might be accumulating technical debt faster, creating knowledge gaps in junior developers, and building systems that are harder to maintain

The question is: are we willing to sacrifice long-term system health for short-term developer happiness? Because that’s essentially what the data suggests we’re doing.

The Quality Tradeoff Nobody Wants to Admit

David mentioned the METR study, but there’s more troubling data: AI-assisted code has 1.7× more issues and 23.7% more security vulnerabilities. That’s not a small gap—it’s a fundamental quality problem.

I’m at the stage where I’m asking: should we be more restrictive about where AI tools can be used? Not ban them, but create guardrails:

  • Use AI for boilerplate, tests, documentation
  • Require human-first design for core architecture and security-critical code
  • Implement stricter review requirements for AI-heavy PRs

But that creates cultural friction. Developers see it as “not trusting them” when really it’s “the tool isn’t ready for high-stakes decisions.”

What I’d Tell the Board

If I were being brutally honest at the next board meeting, I’d say: “We’re in an experimental phase. The tools are genuinely helpful for certain tasks, but we’re learning the hard way where they create more problems than they solve. Our current ROI is negative, but we believe it will improve as the technology matures and we get smarter about where to apply it.”

That’s a hard sell when competitors are claiming huge productivity gains. But it’s the truth.

Coming from the design side, I find this perception vs reality gap very familiar. We went through something similar when Figma added AI features and later with tools like Framer.

The “Feels Fast” Problem

Designers would use AI to generate component variations and say “this is amazing, I’m so much faster!” But when we looked at actual project timelines, they weren’t shipping faster. They were generating more options and spending longer in revision cycles.

The tool made the generation phase feel effortless, so designers discounted all the time spent:

  • Evaluating and filtering generated options
  • Fixing inconsistencies and edge cases
  • Convincing stakeholders that the design was thoughtfully considered (not just “AI generated it”)

Sound familiar?

Does “Feeling Productive” Even Matter?

Here’s a contrarian take: maybe the perception of productivity matters more than we think, just not in the way we expect.

If developers feel good about their work, they stay longer, refer friends, engage more deeply with problems. The morale boost might have real but hard-to-measure business value—reduced attrition, better culture, more innovation.

But—and this is a big but—only if we’re honest about what we’re optimizing for. If we say “productivity” but really mean “morale,” that’s deceptive. And if morale comes at the cost of system quality (Michelle’s security vulnerability stat is terrifying), we’re buying short-term happiness with long-term technical debt.

The UX of Productivity Tools

There’s another angle here: these AI tools have incredible UX. They feel magical. They give instant gratification. That creates emotional attachment that’s hard to separate from actual utility.

We’re not just evaluating whether a tool makes us faster—we’re evaluating how using the tool makes us feel. And feelings are terrible at ROI calculations.

Maybe the real question isn’t “does AI make us faster” but “what type of work do we want to optimize for, and does AI support that vision?” If we want thoughtful, sustainable systems, AI might be actively harmful. If we want rapid iteration and learning-through-shipping, it might be valuable despite the quality tradeoffs.

But we need to be honest about which path we’re choosing.

Maya, your point about morale vs productivity is exactly the tension I’m wrestling with as VP Eng. Let me add the leadership complexity:

The Talent Retention Dimension

Here’s the uncomfortable reality: developers expect AI tools now. When we interview senior engineers, they ask “what AI coding assistants do you use?” If we say “we’re being cautious about adoption,” candidates sometimes see it as “they’re behind the curve.”

So there’s a talent acquisition and retention angle that’s completely separate from productivity:

  • Top talent wants to work with cutting-edge tools
  • Banning or heavily restricting AI could make us less attractive as an employer
  • Competitor companies are advertising their AI-first engineering cultures

Michelle’s point about negative ROI might be true, but if the tools prevent attrition of senior engineers (replacement cost: 6-9 months’ salary plus 3-6 months of ramp time), maybe they’re still worth it?
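Here is a back-of-the-envelope version of that tradeoff. Only the 6-9 month and 3-6 month ranges come from the paragraph above. The salary, the productivity loss during ramp-up, and the tool spend are placeholder assumptions chosen to make the arithmetic easy to follow, and `replacement_cost` is a hypothetical helper.

```python
def replacement_cost(annual_salary: float, salary_months: float,
                     ramp_months: float, ramp_productivity_loss: float) -> float:
    """Hiring cost expressed in months of salary, plus output lost during ramp-up."""
    monthly = annual_salary / 12
    return salary_months * monthly + ramp_months * monthly * ramp_productivity_loss


annual_salary = 180_000        # ASSUMPTION: placeholder senior salary
ramp_productivity_loss = 0.5   # ASSUMPTION: half of normal output while ramping
tool_spend = 150_000           # ASSUMPTION: placeholder annual AI tool spend

low = replacement_cost(annual_salary, 6, 3, ramp_productivity_loss)
high = replacement_cost(annual_salary, 9, 6, ramp_productivity_loss)

# Departures the tools would need to prevent per year to cover their cost.
most_departures_needed = tool_spend / low
fewest_departures_needed = tool_spend / high

print(f"Replacement cost per senior engineer: ${low:,.0f} to ${high:,.0f}")
print(f"Avoided departures needed: {fewest_departures_needed:.2f} to {most_departures_needed:.2f}")
```

With these placeholder numbers, preventing roughly one senior departure a year would cover the spend. That only helps the argument if you can credibly attribute the retention to the tools, which is its own measurement problem.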

The Measurement Problem Is Also a Leadership Problem

David asked “what should we actually measure?” but I think the harder question is: who decides what we optimize for?

Should it be:

  • The CFO (optimize for cost/ROI, may restrict AI use)
  • The engineering leadership (optimize for delivery metrics, mixed results so far)
  • The developers themselves (optimize for happiness/tools, disconnected from business outcomes)
  • The CEO/board (optimize for competitive positioning, may mandate AI regardless of ROI)

I’ve been in rooms where these stakeholders have completely different views on AI tools, and there’s no clear tiebreaker.

What I’m Actually Doing

In practice, here’s my approach:

  1. Let developers use the AI tools they want (morale, retention, learning)
  2. Track quality and velocity metrics closely (watch for degradation)
  3. Be brutally honest with executives (show data, don’t overpromise ROI)
  4. Create space for experimentation (not every initiative needs immediate ROI)

But I also acknowledge this might be wrong. Maybe we should be more restrictive. Maybe we should double down. The honest answer is: I don’t know yet, and neither does anyone else in the industry.

We’re all flying blind and hoping we don’t regret our decisions two years from now.