I need to share some research that’s been keeping me up at night. As a VP of Product, I’m constantly evaluating tools and asking “what’s the ROI?” But what happens when the data tells a completely different story than what your team believes?
The Study That Changes Everything
METR (Model Evaluation & Threat Research) just published findings that should make every product and engineering leader pause. They ran a randomized controlled trial with experienced open-source developers working on their own repositories—people who knew their codebases intimately.
The results were shocking:
- Developers expected AI tools to speed them up by 24%
- Reality: They were 19% slower when using AI
- Even after experiencing this slowdown, developers still believed they were 20% faster
That’s a 39 to 43 percentage-point gap between perception and reality (expected +24% vs. actual −19% gives 43 points; perceived +20% vs. actual −19% gives 39 points). Let that sink in.
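The arithmetic behind those two endpoints, as a quick sanity check using the study's figures:

```python
# Speedup figures from the METR study (positive = faster, negative = slower).
expected = 0.24    # developers' forecast before using the AI tools
actual = -0.19     # measured outcome: 19% slower
perceived = 0.20   # developers' belief after the fact

# Gap between forecast and reality, and between belief and reality,
# expressed in percentage points.
gap_forecast = (expected - actual) * 100   # 24 - (-19) = 43 points
gap_belief = (perceived - actual) * 100    # 20 - (-19) = 39 points
print(f"{gap_belief:.0f}-{gap_forecast:.0f} percentage points")
```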
Why This Is a Measurement Crisis
Here’s what really concerns me from a product perspective: if developers can’t accurately assess their own productivity with AI tools, how are we supposed to make informed decisions about:
- Tool selection and procurement? We’re spending budget on tools that might be slowing teams down
- Performance evaluation? Self-reported productivity metrics are unreliable
- Roadmap planning? Are velocity estimates meaningless now?
- Resource allocation? What if we’re solving the wrong problems?
The METR researchers found that developers accepted fewer than 44% of AI-generated code suggestions. That means more than half of what AI produces gets rejected outright or heavily modified. Yet the experience of having code suggested feels productive, even as the total time increases.
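If you want to track the same statistic in your own team, here is a minimal sketch. The `Suggestion` record and its fields are assumptions for illustration; in practice you would populate them from whatever suggestion/acceptance telemetry your AI coding assistant exposes:

```python
from dataclasses import dataclass

@dataclass
class Suggestion:
    accepted: bool          # developer kept the suggestion
    modified_after: bool    # accepted, but then heavily edited

def acceptance_stats(log):
    """Return (raw acceptance rate, rate surviving without heavy edits)."""
    total = len(log)
    accepted = sum(s.accepted for s in log)
    survived = sum(s.accepted and not s.modified_after for s in log)
    return accepted / total, survived / total

# Toy data: 4 of 9 suggestions accepted, 1 of those heavily modified.
log = ([Suggestion(True, False)] * 3
       + [Suggestion(True, True)]
       + [Suggestion(False, False)] * 5)
raw, surviving = acceptance_stats(log)
print(f"accepted: {raw:.0%}, survived unmodified: {surviving:.0%}")
```

The second number matters more than the first: an accepted-then-rewritten suggestion costs review time without saving typing time.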
The Business Reality Check
This isn’t just an engineering problem—it’s hitting the C-suite:
- Only 29% of executives can confidently measure AI ROI (Gartner)
- 56% of CEOs report zero measurable ROI from AI investments in the past 12 months
- CFOs are deferring 25% of AI investments to 2027 pending ROI proof
Meanwhile, 93% of developers are using AI tools. We can’t put this genie back in the bottle.
What I’m Struggling With
From a product lens, here’s my dilemma:
Subjective experience matters. If AI tools reduce cognitive load and make work feel less tedious, that’s real value—even if it doesn’t show up in cycle time metrics. Developer satisfaction and retention are crucial, especially when 85%+ of engineers expect AI tools.
But objective outcomes matter more. If we’re shipping slower, introducing more bugs (studies show 1.7× more issues with AI-generated code), and not seeing velocity gains… can we justify the investment?
I keep coming back to this question: Are we measuring the wrong things, or are we just not measuring correctly?
What I’m Curious About
For the product and engineering leaders here:
- How are you measuring AI tool impact? Beyond asking developers “do you like it?”
- What metrics actually matter? Cycle time? Defect rates? Code review iterations? Developer retention?
- Have you seen this perception gap in your teams? How did you address it?
- How do you balance team morale (they want AI tools) with organizational performance (unclear if it helps)?
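One way to surface the perception gap directly, rather than debating it: pair each developer's self-reported speedup with a measured proxy such as the change in median PR cycle time, and look at the difference. A minimal sketch; the field names and numbers below are illustrative assumptions, not real data:

```python
from statistics import mean

# Per-developer records: self-reported speedup vs. change in a measured
# proxy (e.g. median PR cycle time) after AI adoption. Negative = slower.
reports = [
    {"dev": "a", "self_reported_speedup": 0.20, "measured_speedup": -0.10},
    {"dev": "b", "self_reported_speedup": 0.15, "measured_speedup": 0.05},
    {"dev": "c", "self_reported_speedup": 0.25, "measured_speedup": -0.20},
]

# Perception gap: how much rosier the self-report is than the measurement.
gaps = [r["self_reported_speedup"] - r["measured_speedup"] for r in reports]
print(f"mean perception gap: {mean(gaps):+.0%}")
```

Even a rough version of this, run quarterly, turns "do you like it?" into a calibration exercise: if the gap is consistently large and positive, self-reports are telling you about morale, not throughput.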
The research papers are clear: self-reporting is unreliable. But what’s the alternative when we can’t just ignore how our teams feel about their work?
I’m genuinely curious how others are navigating this. Because right now, it feels like we’re flying blind with very expensive instruments that might be miscalibrated.
Sources: