Developers Feel 20% Faster, Measured 19% Slower. How Do We Make Decisions When Self-Reporting Fails?

Something that’s been haunting me since David’s original post: the 39-point perception gap.

Developers thought they were 20% faster. The METR study measured them as 19% slower. That’s not a small measurement error; it’s a fundamental disconnect between subjective experience and objective reality.

This creates a massive decision-making problem for engineering leaders.

How do we evaluate tool effectiveness when our primary data source (developer feedback) is this unreliable?

The METR study specifics:

They gave experienced open-source developers real tasks from their own repositories. Developers using AI tools:

  • Took 19% longer to complete tasks
  • Believed they were 20% faster
  • 39-point perception gap

A later cohort (800+ tasks, 57 developers) showed a -4% change (roughly 4% slower), with a confidence interval of -15% to +9%. That interval spans zero, so the effect is statistically indistinguishable from no change.

Either way, the perceived speed improvement doesn’t match the measured one.

This breaks our evaluation frameworks:

We run developer surveys: “Do AI tools help?” → 9/10 say yes
We measure actual velocity: Flat or slightly negative
We measure developer happiness: Way up
We measure business outcomes: Unchanged

Which metric drives decisions?

The parallels to product are uncomfortable:

It’s like shipping a feature with:

  • High NPS (users love it)
  • No retention improvement (they don’t use it more)
  • No revenue impact (doesn’t change behavior)

Product teams know that NPS without business metrics is vanity. But in engineering, we’re treating developer satisfaction as a success metric independent of productivity.

Are we optimizing for the wrong things?

Questions for leaders:

  1. How do you weight subjective vs objective measures when they contradict?
  2. Can we trust any self-reported productivity data now?
  3. If developers feel faster but aren’t, is that worth something anyway?
  4. What objective metrics actually matter for AI tool evaluation?

What I’m struggling with:

Our developers genuinely love AI tools. Morale is high. Retention improved. That has real value.

But our velocity metrics show no improvement. Security is worse. Code quality requires more review.

Is “happier developers, same output” a successful outcome?

For retention and recruiting, maybe yes. For CFO ROI conversations, definitely no.

The perception gap means we can’t rely on the people doing the work to accurately assess the tools they’re using. That’s a weird place to be as a leader.

How are you navigating this?

Keisha, this is the executive dilemma in a nutshell.

At the board level, I have to defend AI investment with data. “Developers are happier” doesn’t get budget approval. “Measurable productivity gains” does.

The perception gap makes this conversation dishonest.

Executives: “Are AI tools delivering ROI?”
Me: “Developers love them, but velocity is flat.”
Executives: “So… no?”
Me: “But retention is up and recruiting is easier.”
Executives: “How much easier?”
Me: “…I don’t have numbers.”

We’re asking for continued investment based on vibes and retention hopes, not measurable outcomes.

The product analogy you made is perfect:

When product ships a feature with high NPS but no business impact, we call that a “vanity metric” and course-correct.

When engineering adopts tools with high developer satisfaction but no productivity impact, we call it “investment in developer experience” and keep spending.

Double standard.

I think the brutal answer to your “happier developers, same output” question depends on three things:

  1. How much are we paying for that happiness? ($500K/year? Maybe. $2M/year? Harder sell; see the rough math after this list.)
  2. Does happiness actually improve retention, or is that also perception vs reality?
  3. Can we get the same retention benefits cheaper (better benefits, interesting projects, career growth)?
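
Rough math on that first question, since “maybe” versus “harder sell” is really a break-even calculation. Every number below is a hypothetical placeholder (tool spend, headcount, salary, backfill cost), not anyone’s actuals:

```python
# Back-of-envelope: how much attrition would AI tooling have to prevent
# to pay for itself purely as a retention play?
# Every input is a hypothetical placeholder, not a measured number.

tool_spend_per_year = 500_000        # annual AI tooling cost (hypothetical)
engineers = 200                      # team size (hypothetical)
fully_loaded_salary = 200_000        # per engineer (hypothetical)
backfill_cost_multiplier = 0.5       # cost to replace one engineer, as a fraction
                                     # of salary (rule of thumb; varies widely)

cost_per_departure = fully_loaded_salary * backfill_cost_multiplier

# Departures the tooling must prevent each year just to break even:
breakeven_departures = tool_spend_per_year / cost_per_departure
print(f"Break-even: ~{breakeven_departures:.1f} avoided departures per year")

# Expressed as the required absolute drop in annual attrition for this team:
required_attrition_drop = breakeven_departures / engineers
print(f"Attrition would need to fall by {required_attrition_drop:.1%} (absolute)")
```

With those placeholder numbers, the tooling has to prevent about five departures a year, a 2.5-point drop in attrition, before the retention argument carries the spend on its own. Plug in your own figures; the point is that “retention is up” only becomes a board answer once it’s converted into avoided backfill cost.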

The perception gap means we need better measurement systems. We can’t trust self-reporting. We need objective business metrics tied to AI tool usage.

This perception gap thing is wild, and it reminds me of a PM lesson that applies here.

Users are terrible at predicting what they want.

We learned decades ago: don’t ask users what features they want. Watch what they actually do. Behavioral data beats stated preferences.

The AI productivity gap is the same pattern:

  • Stated preference: “AI makes me faster, I love it”
  • Behavioral reality: Takes longer, requires more review, same delivery velocity

In product, we solve this with A/B testing:

  • Control group: No AI tools
  • Test group: AI tools
  • Measure: Actual delivery velocity, quality, customer value

But engineering teams don’t run controlled experiments on themselves. We just adopt tools and hope.

What if we treated AI adoption like product experimentation?

  1. Define success metrics BEFORE adoption (velocity, quality, time-to-production)
  2. Measure baseline for 3 months
  3. Roll out AI tools to half the team
  4. Measure for 6 months
  5. Compare objective outcomes, not surveys (a rough sketch of the comparison follows)
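
Here’s roughly what step 5 could look like in practice, assuming you can export per-task cycle times for both cohorts. The file and column names below are made up for illustration; the point is that the answer comes with a confidence interval instead of a survey score:

```python
# Minimal sketch of step 5: compare objective outcomes between cohorts.
# Assumes a CSV export (name and columns are hypothetical) with one row per
# completed task: cohort ("control" or "ai"), and cycle_time_hours.
import numpy as np
import pandas as pd

df = pd.read_csv("task_cycle_times.csv")  # hypothetical export from your tracker
control = df.loc[df.cohort == "control", "cycle_time_hours"].to_numpy()
treated = df.loc[df.cohort == "ai", "cycle_time_hours"].to_numpy()

observed_diff = treated.mean() - control.mean()

# Bootstrap a confidence interval for the difference in mean cycle time,
# so the result carries uncertainty (like METR's -15% to +9%).
rng = np.random.default_rng(0)
boot = [
    rng.choice(treated, treated.size).mean() - rng.choice(control, control.size).mean()
    for _ in range(10_000)
]
lo, hi = np.percentile(boot, [2.5, 97.5])

print(f"Mean cycle-time change: {observed_diff:+.1f}h (95% CI {lo:+.1f}h to {hi:+.1f}h)")
# If that interval spans zero, there is no evidence of a speedup,
# whatever the satisfaction survey says.
```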

Keisha’s question about which metrics matter: The ones that existed before AI.

Don’t invent new metrics to justify AI tools. Use the same metrics you used to measure team effectiveness before (a sketch of computing two of them follows the list):

  • Cycle time
  • Defect rate
  • Customer-reported bugs
  • Revenue per engineer
  • Feature velocity
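
For the first two of those, here’s a rough sketch of pulling them out of data most teams already have. File and column names are hypothetical; adapt them to whatever your tracker exports:

```python
# Rough sketch: cycle time and defect rate from existing exports.
# File and column names are hypothetical; adjust to your own tooling.
import pandas as pd

prs = pd.read_csv("merged_prs.csv", parse_dates=["opened_at", "merged_at"])
bugs = pd.read_csv("customer_bugs.csv", parse_dates=["reported_at"])

# Cycle time: time from opening a change to shipping it, per month.
prs["cycle_time_days"] = (prs.merged_at - prs.opened_at).dt.total_seconds() / 86400
cycle_time = prs.groupby(prs.merged_at.dt.to_period("M")).cycle_time_days.median()

# Defect rate: customer-reported bugs per 100 merged changes, per month.
merged_per_month = prs.groupby(prs.merged_at.dt.to_period("M")).size()
bugs_per_month = bugs.groupby(bugs.reported_at.dt.to_period("M")).size()
defect_rate = (bugs_per_month / merged_per_month * 100).rename("bugs_per_100_prs")

print(pd.concat([cycle_time, defect_rate], axis=1))
```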

If AI tools improve those, great. If they don’t, but developers are happier, that’s a retention play, not a productivity play.

Be honest about what you’re buying.