Engineers Using AI Were 19% Slower, Yet Convinced They Were Faster. The Perception Gap Is Now a Leadership Problem

I need to share some research that’s been keeping me up at night. As a VP of Product, I’m constantly evaluating tools and asking “what’s the ROI?” But what happens when the data tells a completely different story than what your team believes?

The Study That Changes Everything

METR (Model Evaluation & Threat Research) just published findings that should make every product and engineering leader pause. They ran a randomized controlled trial with experienced open-source developers working on their own repositories—people who knew their codebases intimately.

The results were shocking:

  • Developers expected AI tools to speed them up by 24%
  • Reality: They were 19% slower when using AI
  • Even after experiencing this slowdown, developers still believed they were 20% faster

That’s a 39-point gap between what developers believed afterward (+20%) and what actually happened (-19%), and a 43-point gap against their initial expectations (+24%). Let that sink in.

Why This Is a Measurement Crisis

Here’s what really concerns me from a product perspective: if developers can’t accurately assess their own productivity with AI tools, how are we supposed to make informed decisions about:

  • Tool selection and procurement? We’re spending budget on tools that might be slowing teams down
  • Performance evaluation? Self-reported productivity metrics are unreliable
  • Roadmap planning? Are velocity estimates meaningless now?
  • Resource allocation? What if we’re solving the wrong problems?

The METR researchers found that developers accepted less than 44% of AI-generated code suggestions, meaning more than half of what the AI produces gets rejected or heavily modified. Yet the experience of having code suggested feels productive, even as the total time increases.

The Business Reality Check

This isn’t just an engineering problem—it’s hitting the C-suite:

  • Only 29% of executives can confidently measure AI ROI (Gartner)
  • 56% of CEOs report zero measurable ROI from AI investments in the past 12 months
  • CFOs are deferring 25% of AI investments to 2027 pending ROI proof

Meanwhile, 93% of developers are using AI tools. We can’t put this genie back in the bottle.

What I’m Struggling With

From a product lens, here’s my dilemma:

Subjective experience matters. If AI tools reduce cognitive load and make work feel less tedious, that’s real value—even if it doesn’t show up in cycle time metrics. Developer satisfaction and retention are crucial, especially when 85%+ of engineers expect AI tools.

But objective outcomes matter more. If we’re shipping slower, introducing more bugs (studies show 1.7× more issues with AI-generated code), and not seeing velocity gains… can we justify the investment?

I keep coming back to this question: Are we measuring the wrong things, or are we just not measuring correctly?

What I’m Curious About

For the product and engineering leaders here:

  1. How are you measuring AI tool impact? Beyond asking developers “do you like it?”
  2. What metrics actually matter? Cycle time? Defect rates? Code review iterations? Developer retention?
  3. Have you seen this perception gap in your teams? How did you address it?
  4. How do you balance team morale (they want AI tools) with organizational performance (unclear if it helps)?

The research papers are clear: self-reporting is unreliable. But what’s the alternative when we can’t just ignore how our teams feel about their work?

I’m genuinely curious how others are navigating this. Because right now, it feels like we’re flying blind with very expensive instruments that might be miscalibrated.


David, this hits on something I’ve been wrestling with at the exec level. The perception gap you’re describing isn’t just affecting developers—it’s distorting decision-making all the way up the chain.

The Security and Quality Alarm

What keeps me up at night isn’t just the 19% slowdown. It’s this data point:

  • 1.7× more issues in AI-generated code
  • 23.7% more security vulnerabilities

We’re trading velocity gains (which may be illusory anyway) for very real accumulation of technical debt and security risk. And here’s the kicker: our review processes weren’t designed for this volume of AI-generated code.

When 93% of developers are using AI tools, a large share of the code hitting PR review has likely been drafted by an AI. Are your senior engineers equipped to catch 23.7% more security issues? Are they even looking for them?

The “Feels Faster” Problem at Scale

The METR study shows AI makes work feel productive even when it isn’t. This subjective improvement creates a political problem at the leadership level:

  1. Developers advocate for AI tools because the experience is genuinely better (less tedious)
  2. Executives see “developer satisfaction” metrics improve and interpret this as ROI
  3. Meanwhile, actual delivery velocity is flat or declining

I’ve sat in board meetings where we presented “40% efficiency gains from AI adoption” based on developer surveys. Then the CFO asks: “So why didn’t our release cadence improve? Why are we still missing deadlines?”

The honest answer? We were measuring satisfaction, not performance.

What Actually Needs to Change

From a CTO perspective, here’s what I think we need:

Upgraded review processes: We can’t rely on human code review alone when AI is generating this volume of suggestions. We need automated security scanning, more comprehensive test coverage, and honestly—more time allocated for review. The “feels faster to write” benefit evaporates if review becomes the bottleneck.

Layered measurement: Track both subjective (developer experience, cognitive load) and objective (cycle time, defect escape rate, incident frequency) metrics. When they diverge—like in the METR study—that’s a signal to investigate, not ignore.
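
To make that concrete, here’s a minimal sketch of the divergence check I have in mind. It’s pure Python, and the metric names and numbers are illustrative (I plugged in the METR figures), not pulled from any real dashboard:

```python
from dataclasses import dataclass

@dataclass
class MetricTrend:
    name: str
    change_pct: float  # percent change since AI rollout; positive = improvement

def divergence_report(subjective: list[MetricTrend],
                      objective: list[MetricTrend],
                      threshold: float = 10.0) -> list[str]:
    """Flag pairs where perception and delivery data point in opposite directions."""
    flags = []
    for s in subjective:
        for o in objective:
            gap = s.change_pct - o.change_pct
            # Perception up, measured outcome down, and the spread is large:
            if s.change_pct > 0 > o.change_pct and gap >= threshold:
                flags.append(f"{s.name}: +{s.change_pct:.0f}% vs "
                             f"{o.name}: {o.change_pct:.0f}% (gap {gap:.0f} pts)")
    return flags

# Illustrative inputs echoing the METR numbers:
subjective = [MetricTrend("self-reported speedup", +20.0)]
objective = [MetricTrend("measured speed change", -19.0)]
for flag in divergence_report(subjective, objective):
    print(flag, "-> investigate, don't ignore")
```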

Honest ROI conversations: I’d rather tell the board “we don’t know yet” than present misleading productivity numbers based on self-reporting. The integrity of our measurement matters more than having an answer executives want to hear.

The Question We’re Avoiding

Are we trading short-term developer experience improvements for long-term technical debt and security risk?

If the answer is yes, we need to be explicit about that tradeoff. If we’re not sure, we need better measurement before doubling down on AI tool investments.

To your point about “flying blind with miscalibrated instruments”—I think the instruments are fine. We’re just reading the wrong gauges. Developer satisfaction is important, but it can’t be the primary metric for AI tool ROI when the research shows perception and reality diverge this dramatically.

This conversation is perfectly timed. I’m literally in the middle of evaluating AI tool licenses for next quarter, and my team is convinced these tools are making them more productive. But when I look at our actual sprint metrics… I’m not seeing it.

The Trust Problem

Here’s my challenge: How do you have this conversation with your team without it sounding like “I don’t trust your judgment”?

My engineers are smart, experienced people. When they tell me “AI saves me 2 hours a day,” I don’t think they’re lying. The METR study shows they’re accurately reporting their experience—it really does feel faster. But feeling faster and being faster are not the same thing.

Michelle’s point about the CFO asking “why aren’t we shipping faster then?” hits home. I’ve had that exact conversation with our VP Eng. The answer I gave was “we need more time to evaluate the data.” But honestly? I didn’t have the data to evaluate.

What I’m Actually Measuring Now

After reading similar research a few months ago, I started tracking the following (a rough script sketch follows the list):

  • Code review iteration count (how many rounds of feedback before merge)
  • Time from PR open to merge (total review duration)
  • Defect escape rate (bugs that make it to production)
  • Incident response time (how fast we diagnose and fix production issues)
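
If anyone wants to replicate the first two, here’s a rough sketch of how I pull them, assuming a GitHub-hosted repo and the PyGithub client. The repo name and token are placeholders, and I approximate “review rounds” as the number of CHANGES_REQUESTED reviews plus one:

```python
from statistics import median
from github import Github  # PyGithub: pip install PyGithub

def pr_review_metrics(token: str, repo_name: str, sample: int = 100):
    """Median open-to-merge hours and review rounds for recent merged PRs."""
    repo = Github(token).get_repo(repo_name)
    hours_to_merge, review_rounds = [], []
    for pr in repo.get_pulls(state="closed", sort="created", direction="desc")[:sample]:
        if pr.merged_at is None:
            continue  # closed without merging; not relevant here
        hours_to_merge.append((pr.merged_at - pr.created_at).total_seconds() / 3600)
        # Approximate review iterations: each CHANGES_REQUESTED review is one
        # extra round of feedback on top of the initial submission.
        rework = sum(1 for r in pr.get_reviews() if r.state == "CHANGES_REQUESTED")
        review_rounds.append(rework + 1)
    return median(hours_to_merge), median(review_rounds)

# Placeholder token and repo name:
# hours, rounds = pr_review_metrics("<token>", "acme/platform", sample=200)
# print(f"median time to merge: {hours:.1f}h, median review rounds: {rounds}")
```

Defect escape rate and incident response time come from our issue tracker and on-call tooling, so they don’t fit in one snippet.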

Early findings (3 months of data):

  • Review iteration count is up 15%
  • Time to merge is up 12% (despite code being written faster)
  • Defect escape rate is up 8%
  • Incident response time is actually down 6% (AI helps with debugging)

So we have ONE positive metric and three concerning ones. But my team would tell you they’re more productive because the debugging win is very visible and feels great.

The Junior Developer Concern

David, you mentioned the 44% acceptance rate for AI suggestions. What worries me is the split between senior and junior developers:

  • Senior engineers are critical readers—they treat AI output like Stack Overflow answers (useful but verify everything)
  • Junior developers are more likely to accept suggestions wholesale because they don’t have the pattern recognition to spot subtle issues

I’m seeing junior devs become productive faster (good!) but not building the foundational problem-solving skills they need to grow (bad!). It’s like learning to drive with automatic transmission—you get moving quickly, but you never learn how the engine works.

One of my senior engineers put it bluntly: “The juniors can code, but they can’t debug their own AI-generated code when it breaks.”

The Measurement Dilemma

To David’s question about “how do you balance team morale with organizational performance”—I don’t have a great answer yet. Here’s what I’m trying:

Transparent dashboards: I share the metrics I’m tracking with the team. When they say “we’re more productive,” I can say “what does that mean to you?” and we can look at data together.

Differentiated adoption: Maybe AI tools make sense for some tasks (boilerplate, tests, documentation) but not others (core business logic, security-critical code)?

Training budget reallocation: If AI is supposed to save time, where’s that time going? I’m pushing to redirect “productivity gains” into deeper skill development and architectural thinking time.

The Question I Keep Coming Back To

If AI makes developers feel 20% more productive but makes them 19% slower in reality—and both those things are true simultaneously—what’s the right decision?

I can’t ban AI tools. That would crush morale and make recruiting harder (85% of developers expect them, as David mentioned). But I also can’t ignore the metrics showing we’re not actually faster.

Maybe the answer is: we’re asking the wrong question. Instead of “does AI make us faster?” maybe we should ask “what does AI enable us to do that we couldn’t before?”

If the answer is “it reduces tedious work and lets us focus on harder problems,” that has value even if raw throughput doesn’t increase. But we need to be honest about that tradeoff with our executives and budget holders.

Okay, I’m coming at this from a totally different angle as someone who leads design systems, but this perception gap hits so close to home.

The Design Parallel

I see the exact same phenomenon with AI design tools. Designers generate dozens of variations in minutes with tools like Midjourney or Galileo AI, and they’re genuinely excited about how fast they can explore concepts.

But then what happens? Hours of cleanup work. The AI-generated designs:

  • Don’t follow our design system tokens
  • Have accessibility issues baked in
  • Ignore brand guidelines in subtle ways
  • Create components that can’t actually be built

So the designer feels productive in the exploration phase, but the total time from “idea” to “production-ready design” hasn’t actually decreased. Sometimes it’s longer because now we’re refining AI output instead of starting from our component library.

Sound familiar?

Maybe We’re Measuring the Wrong Thing

Luis, your comment about “what does AI enable us to do that we couldn’t before?” really resonates with me. Here’s what I think we’re missing in the productivity debate:

Cognitive load reduction is real value, even if throughput doesn’t increase.

When I use AI to generate documentation, I’m not doing it because it’s faster overall—I’m doing it because writing documentation is boring and draining. The AI draft gives me something to edit rather than facing a blank page. Is that worth the cost even if my docs-per-day metric doesn’t change? I think yes!

Similarly, when developers say AI “saves them 2 hours a day”—maybe they mean it saves them 2 hours of the most tedious work. The stuff that grinds you down and makes you want to quit your job. That has retention value that doesn’t show up in velocity metrics.

The Experience vs. Output Tension

Here’s what I’m struggling with: If work feels better but outcomes aren’t better, does that matter?

In design, we talk about “quality of life improvements” separate from “productivity improvements.” A better mouse doesn’t make you design faster, but it reduces hand fatigue. Worth it? Probably.

But if that mouse cost $500/seat and you promised your CFO it would increase design throughput by 40%… now you have a problem.

I think the AI conversation got framed as a productivity tool when maybe it should’ve been framed as a quality of work life tool. Different value prop, different metrics, different ROI calculation.

The Skills Development Blind Spot

Luis’s point about junior developers really worries me from a career development perspective. I mentor bootcamp UX students, and I see a similar pattern:

Students who learn with AI assistance can quickly produce “portfolio-quality” work—lots of pretty screens, smooth animations. But when I ask them why they made specific design decisions, they struggle. They’re curating AI output instead of developing design judgment.

This is a 2-5 year time bomb. These folks will hit a ceiling when they need to make strategic design decisions and realize they never built the foundational thinking skills.

What Actually Matters?

David asked: “Are we measuring the wrong things, or are we just not measuring correctly?”

I think we’re optimizing for the wrong goal. The question shouldn’t be “Does AI make us more productive?”

Maybe it should be:

  • “Does AI reduce burnout from repetitive work?” (Yes, probably)
  • “Does AI help us tackle problems we’d otherwise avoid?” (Mixed evidence)
  • “Does AI improve our ability to learn and grow?” (Concerning data says no)

If we framed AI tools as burnout prevention rather than productivity enhancement, we’d set more realistic expectations and measure different outcomes.

And honestly? Reducing burnout in tech is worth paying for, even if your sprint velocity doesn’t budge. But that’s a harder sell to a CFO than “40% productivity gains.”

This thread is giving me life because I thought I was the only one wrestling with this impossible tradeoff. As a VP Eng, I’m caught between my team’s genuine enthusiasm for AI tools and the executive team asking “where’s the ROI?”

The Leadership Dilemma

Here’s my reality:

On one side: My engineers love AI tools. Their satisfaction scores are up. Retention is strong in a brutal job market. When I survey the team, AI tools are consistently cited as a “reason to stay.”

On the other side: Our DORA metrics are flat. Deployment frequency is unchanged. Lead time is actually up slightly. The CFO is asking why we’re spending $200/engineer/month on tools that haven’t moved the needle.

Both of these things are true simultaneously.

The Measurement Framework I’m Using

Michelle’s point about “layered measurement” is exactly right. Here’s what I’m tracking now, organized into three categories, with a toy calculation for the DORA side after the lists:

Subjective Experience Metrics

  • Developer satisfaction scores (up 15% since AI adoption)
  • Self-reported time savings (average claim: 2.3 hours/day)
  • Cognitive load surveys (significant reduction in “tedious work” complaints)
  • Retention and recruiting (AI tools mentioned in 40% of offer acceptances)

Individual Performance Metrics

  • Code review iteration count (up 12%, echoing Luis’s data)
  • Time from first commit to PR open (down 8% - code written faster)
  • Time from PR to merge (up 15% - review takes longer)
  • Lines of code per day (up 25%, but is this even meaningful?)

Organizational Performance Metrics

  • Deployment frequency (flat)
  • Lead time for changes (up 5%)
  • Mean time to recovery (down 7% - AI helps with incident response)
  • Change failure rate (up 9% - more bugs reaching production)
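
And here’s the toy DORA calculation I mentioned above. The deployment records are illustrative; in practice this is fed from our CI/CD and incident tooling, and MTTR additionally needs incident open/close timestamps, so I’ve left it out:

```python
from datetime import datetime, timedelta

# Each record: (merge_time, deploy_time, caused_incident) - illustrative data.
deploys = [
    (datetime(2025, 1, 6, 9, 0), datetime(2025, 1, 6, 15, 0), False),
    (datetime(2025, 1, 8, 11, 0), datetime(2025, 1, 9, 10, 0), True),
    (datetime(2025, 1, 13, 14, 0), datetime(2025, 1, 14, 9, 0), False),
]
window_days = 28

deploy_frequency = len(deploys) / window_days  # deploys per day
lead_times = [deploy - merge for merge, deploy, _ in deploys]
avg_lead_hours = sum(lead_times, timedelta()).total_seconds() / 3600 / len(lead_times)
change_failure_rate = sum(1 for *_, failed in deploys if failed) / len(deploys)

print(f"deployment frequency: {deploy_frequency:.2f}/day")
print(f"average lead time:    {avg_lead_hours:.1f} hours")
print(f"change failure rate:  {change_failure_rate:.0%}")
```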

The pattern is clear: Individual experience improves, organizational outcomes don’t.

The Retention Angle Nobody Talks About

David mentioned 85% of developers expect AI tools. Let me add context from my recruiting pipeline:

When I interview senior engineers from FAANG companies, AI tool access is a negotiation point. Candidates literally ask “what AI coding assistants do you provide?” in the same breath as asking about comp and equity.

If I say “we’re still evaluating AI tools,” I’ve lost credibility with top talent. They assume we’re behind the curve.

So even if the METR study shows 19% slowdown, I can’t just ban AI tools without harming recruiting. This isn’t a purely rational productivity decision anymore—it’s a talent market reality.

What I’m Doing About It

Maya’s reframing from “productivity tool” to “quality of work life tool” is exactly the conversation I’m having with our executive team. Here’s my pitch:

“AI coding assistants are not making us ship faster. The data doesn’t support that claim. What they are doing is reducing burnout from repetitive work and improving retention in a competitive talent market. That’s the ROI—not velocity gains.”

Is this what we promised 6 months ago when we rolled out these tools? No. But it’s the honest assessment based on real data.

The Transparency Strategy

Luis asked how to have this conversation without seeming like you don’t trust your team. Here’s what worked for me:

I shared the METR study with my entire engineering org.

Then I said: “This research shows a perception gap. Let’s measure it ourselves. Here are the metrics I’m tracking. Let’s check back in 90 days and see what our data shows.”

The result? My team became partners in figuring this out rather than defensive about their tool preferences. We’re now having productive conversations about:

  • Which tasks benefit most from AI assistance
  • Where code review is becoming a bottleneck
  • How to maintain skill development for junior engineers

Transparency builds trust, even when the data is uncomfortable.

The Question I’m Taking to My Board

Next month I have to present our AI strategy to the board. Here’s the framing I’m planning:

“AI coding tools are delivering value, but not in the ways we initially expected. The value is in talent retention and developer experience, not in shipping velocity. Given the 19% slowdown observed in research and our own mixed metrics, we should adjust our ROI expectations and investment decisions accordingly.”

Will the board love this? Probably not. But it’s honest, data-driven, and sets realistic expectations. Which is better than promising productivity gains we can’t demonstrate.

To David’s Original Question

How do you balance team morale (they want AI tools) with organizational performance (unclear if it helps)?

You don’t balance it. You make both visible.

Track both subjective experience and objective outcomes. When they diverge—like they clearly do with AI tools—that’s not a problem to solve. It’s a reality to manage with honest communication and clear tradeoffs.

And maybe, just maybe, we stop promising CFOs that every new tool will increase velocity by 40%. Some tools make work better without making it faster. That’s okay—as long as we’re honest about it.