Something that’s been haunting me since David’s original post: the 39-point perception gap.
Developers thought they were 20% faster. The study measured them as 19% slower. That's not a small measurement error; it's a fundamental disconnect between subjective experience and objective reality.
This creates a massive decision-making problem for engineering leaders.
How do we evaluate tool effectiveness when our primary data source (developer feedback) is this unreliable?
The METR study specifics:
The researchers gave experienced open-source developers real tasks from their own repositories. Developers using AI tools:
- Took 19% longer to complete tasks
- Believed they were 20% faster
- Net result: a 39-point perception gap
A later cohort (57 developers, 800+ tasks) showed a smaller effect: a -4% point estimate, with a confidence interval of -15% to +9%. That interval spans zero, so the later data can't distinguish a modest slowdown from a modest speedup.
Either way, the perceived speed improvement doesn't match the measured one.
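If it helps to see the arithmetic laid out, here's a minimal sketch, assuming the figures above are percentage-point changes in completion speed (the variable names are mine, not the study's):

```python
# Point estimates from the numbers quoted above, as percentage-point
# changes in completion speed (positive = faster, negative = slower).
perceived = +20   # developers' self-estimate: 20% faster
measured = -19    # measured outcome: 19% slower

# The perception gap is the distance between the two, in points.
gap = perceived - measured
print(f"Perception gap: {gap} points")  # -> 39

# Later cohort: -4% point estimate, confidence interval -15% to +9%.
# An interval that spans zero can't rule out "no effect" in either direction.
ci_low, ci_high = -15, 9
print("Interval spans zero:", ci_low < 0 < ci_high)  # -> True
```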
This breaks our evaluation frameworks:
- We run developer surveys: “Do AI tools help?” → 9/10 say yes
- We measure actual velocity: flat or slightly negative
- We measure developer happiness: way up
- We measure business outcomes: unchanged
Which metric drives decisions?
The parallels to product are uncomfortable:
It’s like shipping a feature with:
- High NPS (users love it)
- No retention improvement (they don’t use it more)
- No revenue impact (doesn’t change behavior)
Product teams know that NPS without business metrics is vanity. But in engineering, we’re treating developer satisfaction as a success metric independent of productivity.
Are we optimizing for the wrong things?
Questions for leaders:
- How do you weight subjective vs objective measures when they contradict?
- Can we trust any self-reported productivity data now?
- If developers feel faster but aren’t, is that worth something anyway?
- What objective metrics actually matter for AI tool evaluation?
What I’m struggling with:
Our developers genuinely love AI tools. Morale is high. Retention improved. That has real value.
But our velocity metrics show no improvement. Security is worse. Code quality requires more review.
Is “happier developers, same output” a successful outcome?
For retention and recruiting, maybe yes. For CFO ROI conversations, definitely no.
The perception gap means we can’t rely on the people doing the work to accurately assess the tools they’re using. That’s a weird place to be as a leader.
How are you navigating this?