Last quarter, we ran an AI coding tools pilot that should be a case study in how perception diverges from reality.
We gave half our engineering team access to GitHub Copilot. Classic A/B test. After three months, we surveyed both groups.
The Copilot group reported:
- 72% said they felt “significantly more productive”
- Average self-reported speed increase: 24%
- 89% wanted to keep using it permanently
- Overwhelmingly positive sentiment
Great, right? Ship it to everyone?
Then we looked at the actual DORA metrics: deployment frequency, lead time for changes, change failure rate, time to restore service.
No statistically significant difference between the two groups.
Zero. After three months of developers feeling 24% faster, the business outcomes were identical.
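For anyone wanting to replicate the comparison: a significance check like ours can be sketched as a simple permutation test on per-engineer deployment counts. The numbers below are hypothetical, not our real data, and the test is a generic difference-of-means check, not a specific DORA tool.

```python
import random

# Hypothetical weekly deployment counts per engineer (illustrative only).
copilot_group = [5, 7, 4, 6, 5, 8, 6, 5, 7, 6]
control_group = [6, 5, 5, 7, 4, 6, 7, 5, 6, 6]

def mean(xs):
    return sum(xs) / len(xs)

def permutation_test(a, b, n_iter=10_000, seed=42):
    """Two-sided permutation test on the difference of group means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / n_iter

p = permutation_test(copilot_group, control_group)
print(f"p-value: {p:.3f}")  # with means this close, p lands well above 0.05
```

With a mean difference of 0.2 deployments against that much per-engineer spread, the test finds nothing, which is exactly the "zero" result we saw.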
How is that even possible?
Then I found research showing developers believed AI sped them up by about 24%, while measured completion times came out roughly 19% slower. Our data matched. Self-reporting is unreliable when people are excited about a new tool.
Here’s my current theory on what’s happening:
1. We’re measuring velocity, not value
AWS research shows teams with AI complete 21% more tasks and create 98% more pull requests per developer. Sounds amazing until you realize: more PRs doesn’t mean better software.
If AI helps you ship 30% more features but they’re the wrong features, or if they introduce subtle bugs that slow down everything downstream, the net impact is zero or negative.
2. The bottleneck shifted
Our seniors report being overwhelmed with code review. Junior velocity went up, but now everything piles up waiting for experienced engineers to review AI-generated code that’s “almost right but not quite.”
Product managers spend more time clarifying requirements because AI can generate the wrong thing very quickly.
3. We’re measuring the wrong phase
Traditional metrics track coding speed. But AI changes the time distribution. What I’m seeing:
- Initial coding: 40% faster (feels amazing)
- Debugging time: 60% longer (feels terrible)
- Net result: slower overall, but feels faster because writing code is more fun than debugging
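The bullets above net out with back-of-envelope arithmetic. Assuming a task splits evenly between writing and debugging (an assumption, not a measurement), 40% faster coding plus 60% longer debugging still comes out slower:

```python
# Back-of-envelope: 40% faster coding + 60% longer debugging.
# The 50/50 split of a task between phases is an assumption.
coding_share = 0.5
debugging_share = 0.5

new_time = coding_share * (1 - 0.40) + debugging_share * (1 + 0.60)
print(f"Total time vs baseline: {new_time:.2f}x")  # 1.10x -> 10% slower
```

The more of the timeline debugging occupies, the worse the net result gets, even as the coding phase keeps feeling great.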
What I’m trying to measure instead:
- Time from idea to validated learning: not shipped code, but code that proves or disproves a hypothesis about user behavior
- Rework rate: how often we have to redo AI-generated work because it didn't solve the actual problem
- Debugging time ratio: time writing code vs time fixing code
- Senior engineer cognitive load: are our most expensive people spending more time reviewing, or less?
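Two of those metrics are trivial to compute once you tag tasks. A minimal sketch, where the `Task` fields and sample records are all hypothetical (no real tracker exposes exactly this schema):

```python
from dataclasses import dataclass

# Hypothetical task records; the schema and numbers are illustrative.
@dataclass
class Task:
    ai_assisted: bool
    hours_writing: float
    hours_debugging: float
    reworked: bool  # redone because it missed the actual problem

tasks = [
    Task(True, 2.0, 4.0, True),
    Task(True, 1.5, 2.5, False),
    Task(False, 4.0, 2.0, False),
    Task(False, 3.5, 1.5, True),
]

def rework_rate(ts):
    """Fraction of tasks that had to be redone."""
    return sum(t.reworked for t in ts) / len(ts)

def debug_ratio(ts):
    """Hours spent fixing code per hour spent writing it."""
    return sum(t.hours_debugging for t in ts) / sum(t.hours_writing for t in ts)

ai = [t for t in tasks if t.ai_assisted]
other = [t for t in tasks if not t.ai_assisted]
print(f"AI:     rework {rework_rate(ai):.0%}, debug ratio {debug_ratio(ai):.2f}")
print(f"Non-AI: rework {rework_rate(other):.0%}, debug ratio {debug_ratio(other):.2f}")
```

The hard part isn't the arithmetic, it's the tagging discipline: someone has to honestly record when work was AI-assisted and when it got thrown away.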
Early findings: AI tools optimize for the wrong part of the value stream. They make the easy parts faster and the hard parts harder.
The executive problem:
How do I justify AI tool investments when developers love them but business metrics don’t improve? How do I push back when everyone FEELS more productive even if they’re not?
And the bigger question: if we can’t measure AI impact accurately, how do we make any investment decisions around these tools? Developer sentiment? Lines of code? Vibes?
Michelle’s right about skill debt accumulating silently. We won’t measure it until it causes a crisis. But we also can’t measure current impact reliably.
Anyone built a measurement framework that actually works for AI tool ROI?