We're 19% slower with AI, yet convinced we're 20% faster. What are we actually measuring?

I just read the METR research study and I’m experiencing serious cognitive dissonance.

Here’s the headline that stopped me cold: Experienced developers using AI coding tools were 19% slower at completing tasks. But here’s the kicker—those same developers believed they were 20% faster.

That’s a 39-point perception gap between what we think is happening and what’s actually happening.

My Own Experience

I’ll be honest: I feel more productive with AI tools. That dopamine hit when Copilot autocompletes exactly what I was thinking? The satisfaction of having a back-and-forth with Claude to solve a gnarly bug? It feels like I’m crushing it.

But the METR data suggests I might be deluding myself. And I’m not alone—the study participants predicted a 24% speedup before using AI, and even after experiencing the actual slowdown, they still estimated a 20% improvement.

Why the Disconnect?

The METR researchers have a theory that resonates uncomfortably:

“Developers overestimate speed-up because it’s so much fun to use AI. We sit and work on these long bugs, and then eventually AI will solve the bug. But we don’t focus on all the time we actually spent—we just focus on how it was more enjoyable.”

The enjoyment is masking the reality. We’re measuring how good it feels instead of how fast we actually shipped.

The Measurement Problem

This raises bigger questions for me:

  1. Are we measuring the wrong things? Individual velocity (PRs merged, lines written) vs team outcomes (features shipped, customer value)?

  2. Does the enjoyment premium matter? If devs are happier and more engaged (better retention, less burnout), does that offset the slowdown?

  3. Is this an AI problem or an org problem? Maybe the 19% individual slowdown reveals 19% of waste in our existing processes that AI makes visible?

  4. What about 2026 tools? METR’s newer data shows only a 4% slowdown (vs 19% in early 2025). Are we just early in the adoption curve?

The Context That Matters

To be fair, the study focused on experienced developers working on complex open-source issues. That’s different from:

  • Junior devs learning (where AI might genuinely accelerate)
  • Routine CRUD work (where autocomplete probably helps)
  • Exploratory prototyping (where conversation might unlock ideas)

But it’s similar to the complex, ambiguous work most of us do day-to-day.

What I’m Struggling With

If I can’t trust my own perception of productivity, what can I trust?

Do I ignore the data and keep using tools that feel good? Do I time-track everything to objective-truth my way out of delusion? Do I focus on team-level outcomes and stop worrying about individual velocity?

How does your team actually measure AI productivity impact? And more importantly—do you trust those measurements?

Asking because I genuinely don’t know if we’re all collectively fooling ourselves or if we’re just measuring the wrong things.


Sources: METR research study, MIT Technology Review, GetDX analysis

This hits close to home. We’re seeing exactly this pattern at our organization—and it’s creating some uncomfortable conversations with leadership.

The Org-Level Reality

At the individual level, our velocity metrics look great:

  • PR velocity up 31%
  • Commit frequency up 28%
  • Time-to-PR down 24%

But here’s what hasn’t changed:

  • Feature delivery timelines: unchanged
  • Sprint completion rates: actually down 6%
  • Production incidents: up 12%

The “organizational friction tax” is eating all the individual gains—and then some.

Where the Gains Vanish

I’ve been tracking where the productivity disappears. Three big sinks:

1. Code Review Overhead (+40% time)
More code means more review cycles. AI-generated code often looks plausible but has subtle issues that take longer to catch. We’re spending more time in review than we saved in writing.

2. Debugging AI Output (+35% incidents)
Production bugs from AI-assisted code are harder to debug because the author didn’t fully understand the implementation. We’re context-switching between “what I intended” and “what the AI wrote.”

3. Integration Complexity
Individual PRs ship faster, but integrating them takes longer. AI optimizes for local solutions, not system coherence. We’re refactoring more to maintain architectural consistency.

The Metrics Problem

You asked about measurement—here’s what I’m learning the hard way:

Wrong Metrics (what we tracked):

  • Lines of code written
  • PRs merged
  • Individual developer velocity
  • “Productivity sentiment”

Right Metrics (what actually matters):

  • Time-to-production (not time-to-PR)
  • Feature completion rate
  • Post-deployment stability
  • Customer value delivered

We were measuring output when we should have been measuring outcomes.
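
To make the distinction concrete, here’s a minimal sketch; the event names and timestamps are hypothetical, not our actual pipeline:

```python
from datetime import datetime

# Hypothetical lifecycle events for one change (names and times illustrative).
events = {
    "work_started": datetime(2025, 3, 3, 9, 0),
    "pr_opened":    datetime(2025, 3, 3, 15, 0),  # output: code exists
    "pr_merged":    datetime(2025, 3, 5, 11, 0),
    "deployed":     datetime(2025, 3, 7, 16, 0),  # outcome: customers have it
}

# Output metric: time-to-PR. This is the number AI tools improve.
time_to_pr = events["pr_opened"] - events["work_started"]

# Outcome metric: time-to-production. This is the number the business feels.
time_to_production = events["deployed"] - events["work_started"]

print(f"time-to-PR:         {time_to_pr}")
print(f"time-to-production: {time_to_production}")
```

AI can shrink the first number while review, integration, and debugging leave the second one flat. That gap is exactly what our dashboards were hiding.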

The Uncomfortable Truth

Your “enjoyment premium” question is spot-on. My team is happier—AI makes debugging feel like problem-solving instead of grunt work. Retention is up, burnout indicators are down.

But I can’t sell “happier developers” to our CFO when delivery timelines are flat and incident rates are up.

The productivity paradox is creating an ROI problem: we’re investing in tools that make developers feel great but don’t move business metrics. That’s… unsustainable.

What We’re Trying

Two experiments in flight:

1. Outcome-Based Metrics
Stopped tracking individual velocity. Now measuring feature delivery end-to-end. Early signal: reveals bottlenecks in design review and QA that AI doesn’t touch.

2. “AI Review” Culture
Treating AI-assisted code differently in review. Explicitly asking: “Did you understand what the AI wrote? Could you debug this at 2am?”

Still early days, but the conversation has shifted from “are we faster?” to “are we delivering more value?”


The 19% slowdown data matches what we’re seeing. The perception gap is the real problem—it’s creating misaligned expectations about what AI tools can actually deliver at organizational scale.

Coming from the design side, this perception vs reality gap feels painfully familiar. We’ve been through this exact pattern with design tools.

The Design Parallel

Remember when Figma’s auto-layout first shipped? Designers loved it: components snapping into place automatically felt so productive. But our actual design-to-dev handoff time? Didn’t budge.

Why? Because auto-layout made it easy to create complex layouts fast. But complexity creates communication overhead. Engineers spent more time interpreting intricate designs than they did with our old, simpler wireframes.

Sound familiar? :thinking:

The “Flow State Illusion”

Here’s what I think is happening with AI coding tools (and design tools, and probably every productivity tool ever):

The enjoyment you’re feeling isn’t fake—it’s flow state. That back-and-forth with Claude, the dopamine hit from Copilot autocomplete—that’s your brain releasing reward chemicals because you’re in deep focus.

But flow state ≠ productivity. Flow state = engagement.

We mistake “I was deeply engaged for 3 hours” for “I accomplished 3 hours of work.” But if you spent 2 of those hours in an enjoyable conversation with AI that led you down wrong paths… you worked 1 hour, not 3.

Tools That Feel Good vs Tools That Work

I ran a startup that failed. One of our biggest mistakes? Building features that users loved using but that didn’t actually solve their problems.

We optimized for engagement (time-in-app, feature usage) instead of outcomes (customer success, problem solved). Metrics looked great, retention tanked after month 2.

AI tools might be doing the same thing to us. They’re incredibly engaging—but are they effective?

The Measurement Trap

You asked “are we measuring the wrong things?” and I think yes, but it’s worse than that.

We measure what’s easy to measure.

Easy to measure:

  • PRs merged :white_check_mark:
  • Lines written :white_check_mark:
  • Developer sentiment :white_check_mark:
  • “Feels productive” :white_check_mark:

Hard to measure:

  • Did this feature solve the customer problem? :person_shrugging:
  • Is this code maintainable 6 months from now? :person_shrugging:
  • Did we ship the right thing? :person_shrugging:
  • Team learning and capability growth? :person_shrugging:

So we optimize for what we measure (easy metrics), and ignore what actually matters (hard metrics).

Classic streetlight effect: looking for keys under the lamppost because that’s where the light is, not because that’s where you dropped them. :light_bulb:

The Question I Can’t Stop Thinking About

Your point about “enjoyment premium” having value (retention, morale) is real. I’m not dismissing it.

But here’s my uncomfortable question: Are we using AI tools because they make us better, or because they make us feel better?

And if it’s the latter—is that okay? Can “happier developers who ship at the same pace” be a valid outcome?

Or is that just… expensive placebo?

I genuinely don’t know. But the 39-point perception gap suggests we might be lying to ourselves about which one it is.

What Design Taught Me

After my startup failed, I learned to distinguish between:

  • Vanity metrics: things that look good in dashboards
  • Actionable metrics: things that drive decisions
  • Outcome metrics: things that matter to customers

Most AI productivity measurement feels like vanity metrics. We’re showing beautiful charts of PR velocity to justify tool costs, while delivery timelines stay flat.

The real test: if you removed AI tools tomorrow, would customer outcomes change? If not, you’re paying for developer happiness, not productivity.

(Which might be fine! Just be honest about what you’re buying.)


The METR study is a gift: it’s giving us permission to question our assumptions. Maybe the question isn’t “how do we measure AI productivity?” but “what does productivity even mean?”

Are we optimizing for feeling productive or being productive? Because those might not be the same thing. :thought_balloon:

This conversation is happening in every C-suite right now, and the perception gap is creating serious strategic risk.

The CFO Problem

Last quarter, our CFO asked me to justify our AI tooling spend. I showed him the metrics:

  • GitHub Copilot: $20/dev/month × 120 devs = $28.8K/year
  • Cursor subscriptions: $40/dev/month × 40 senior devs = $19.2K/year
  • Infrastructure for self-hosted AI tools: ~$15K/month = $180K/year

Total: ~$228K annually
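
For anyone auditing the math, the annualization is simple; the seat prices and headcounts are the ones listed above:

```python
# Annualized AI tooling spend, from the per-seat prices and headcounts above.
def annual_seat_cost(monthly_per_dev: float, devs: int) -> float:
    return monthly_per_dev * devs * 12

copilot     = annual_seat_cost(20, 120)  # $28,800
cursor      = annual_seat_cost(40, 40)   # $19,200
self_hosted = 15_000 * 12                # $180,000 of infrastructure

print(f"total: ${copilot + cursor + self_hosted:,.0f}/year")  # $228,000/year
```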

His response: “What’s the business return?”

I showed him developer sentiment surveys (↑), PR velocity (↑), and retention rates (↑).

He asked: “How many more features did we ship? How much faster did we hit market milestones? What was the revenue impact?”

I had no answer.

The Metrics Mismatch

Alex, you asked if we’re measuring the wrong things. From an executive perspective, the answer is unequivocally yes.

Engineering measures:

  • Developer experience metrics
  • Code output volume
  • Individual velocity
  • Team sentiment

Business measures:

  • Features delivered per quarter
  • Time-to-market for strategic initiatives
  • Customer value delivered
  • Revenue enabled by technical capabilities
  • Incident impact on customers

The pattern behind the METR study’s 19% slowdown shows up in our business metrics as zero change. That’s a $228K investment with no measurable ROI.

The Organizational Risk

The perception gap isn’t just interesting psychology—it’s a liability.

When developers believe they’re 20% faster but are actually 19% slower, we get:

  1. Misallocated Resources: Doubling down on tools that don’t deliver business value
  2. Unrealistic Commitments: Teams over-promise based on false productivity gains
  3. Strategic Missteps: “If we’re 20% faster, we can cut the team 20%” (actual proposal from our board)
  4. Budget Erosion: Spending on feel-good tools while underfunding actual bottlenecks

That third one keeps me up at night. If leadership believes AI makes us 20% more efficient, they’ll cut headcount to “right-size” for the new productivity level. But if the data shows we’re actually slower…

We’re setting up for a massive disconnect between expectations and reality.

What Actually Matters

Luis is right about outcome-based metrics. Here’s what I’m tracking now:

Leading Indicators:

  • Feature completion rate (not story points)
  • Time from commit to customer value
  • Production stability trends
  • Customer-facing incident frequency

Business Metrics:

  • Revenue per engineer
  • Customer feature requests closed
  • Competitive feature parity gaps
  • Platform uptime and reliability

Investment ROI:

  • Cost per feature delivered
  • Engineering cost as % of revenue
  • R&D efficiency trends

None of these metrics show AI tools driving improvement. Some show decline.
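
To show what the “Investment ROI” bucket looks like in practice, here’s a minimal sketch of the cost-per-feature trend. The quarterly figures are invented for illustration, not our actuals:

```python
# Hypothetical quarterly data: fully loaded engineering cost (including AI
# tooling) and features delivered to customers. Figures are illustrative.
quarters = {
    "pre-AI":  {"eng_cost": 2_500_000, "features": 25},
    "post-AI": {"eng_cost": 2_557_000, "features": 24},  # +$57K/qtr tooling
}

for label, q in quarters.items():
    print(f"{label}: ${q['eng_cost'] / q['features']:,.0f} per feature delivered")

# If this number doesn't fall after the investment, "productivity" is the
# wrong label for the spend, no matter how good PR velocity looks.
```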

The Uncomfortable Conversation

Maya’s question—“Are we paying for developer happiness or productivity?”—is the right one.

From a business perspective, developer happiness has value if it drives retention and performance. We’ve seen:

  • Retention up 12% YoY ✓
  • Performance (business outcomes) flat or declining ✗

That’s a problem. We’re investing in retention but not in capability or delivery.

The honest answer: We’re subsidizing developer preference, not business results.

That might be acceptable—if we’re explicit about it. Retention in this market is valuable. But we can’t call it a “productivity investment” when the data shows otherwise.

What I’m Doing Differently

  1. Separate retention from productivity: fund AI tools from the talent budget, not the engineering-efficiency budget
  2. Outcome-based planning: Roadmaps measured in customer value delivered, not story points or velocity
  3. Executive education: Teaching leadership the difference between output metrics (PRs, commits) and outcome metrics (features, value)
  4. Honest communication: “AI tools improve morale but haven’t accelerated delivery” vs “AI makes us 20% faster”

The METR study gives us cover to have these honest conversations. The perception gap is industry-wide, not a failing of our team.

The Strategic Question

Alex, you asked what you can trust if not your own perception. From an executive lens:

Trust business outcomes. Full stop.

Did customers get value faster? Did we ship strategic features on time? Did reliability improve?

Everything else—velocity, sentiment, enjoyment—is input, not output. Inputs are worth measuring to understand the system, but outputs are what matter for investment decisions.

The 39-point perception gap is a wake-up call: we’ve been optimizing for the wrong things.


This isn’t about being anti-AI. It’s about being honest about what we’re buying and why. If we’re buying developer happiness, fine—but let’s fund it from retention budgets and measure retention ROI.

If we’re buying productivity gains, we need to see them in business metrics. Otherwise we’re just fooling ourselves with vanity KPIs.

Sitting between engineering and business, I’m watching this play out in real-time—and it’s creating some fascinating tensions.

What I’m Seeing From Product

On the engineering side:

  • Devs happier in retros ✓
  • “Velocity” up in sprint reports ✓
  • Code review volume up significantly :up_right_arrow:
  • Team morale genuinely better ✓

On the business side:

  • Feature delivery dates… unchanged :neutral_face:
  • Sprint completion rate flat or down :chart_decreasing:
  • Product quality issues up :chart_increasing:
  • Customer-facing feature requests backlog growing :chart_increasing:

The disconnect is real and it’s creating planning chaos.

The Roadmap Problem

Here’s where the perception gap hurts product:

Before AI tools:
“This feature will take 3 sprints based on our historical velocity.”
✓ We’d hit it ~80% of the time.

After AI tools:
Devs: “We’re 20% faster now, so this’ll take 2.4 sprints—call it 2.5.”
Reality: Still takes 3 sprints (or more, if we rush).

We’re making commitments based on perceived velocity gains that don’t materialize in actual delivery.

Leadership sees missed dates. Engineering sees “we delivered the story points.” Product is stuck in the middle explaining why velocity ≠ outcomes.

The Quality Tension

Michelle’s point about incident rates rings true. We’re seeing:

More code shipped faster
→ More edge cases not considered
→ More integration issues
→ More post-release bugs
→ More hotfix cycles

So yes, we “shipped” faster. But we also had to ship again. And again. Net delivery time? No improvement.

The “enjoyment premium” Luis mentioned? It evaporates fast when you’re on call debugging AI-generated code you don’t fully understand.

But There’s Real Value Here

I don’t want to pile on the negativity because there ARE genuine wins:

1. Prototyping Speed
When exploring ideas, AI tools are legitimately faster. We can test 3 approaches in the time we used to test 1. This has real value in discovery phases.

2. Learning Acceleration
Junior PMs and designers using AI to understand technical constraints learn faster. The AI conversation acts as a tutor, not just a tool.

3. Grunt Work Elimination
Boilerplate code, test scaffolding, documentation generation—these genuinely speed up. It’s the complex, novel work where AI struggles.

4. Morale & Retention
Maya’s “expensive placebo” framing is harsh but… retention in this market is insanely valuable. If AI tools keep talented people engaged and on the team, that might justify the cost even without productivity gains.

What Product Can Measure

To Alex’s original question about measurement, here’s what I’m tracking:

Cycle Time Breakdown:

  • Idea → PRD: unchanged
  • PRD → First PR: 18% faster ✓ (AI helps here)
  • First PR → Approved PR: 35% slower ✗ (review bottleneck)
  • Approved → Deployed: unchanged
  • Deployed → Stable: 22% slower ✗ (more bugs)

Net result: 8% slower end-to-end

So AI helps in coding phase but creates downstream friction that eats the gains.
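
If you want to reproduce the net figure: per-stage deltas can’t just be averaged; they have to be weighted by how long each stage takes at baseline. A minimal sketch, with assumed baseline durations (our real ones differ):

```python
# Net end-to-end change from per-stage changes, weighted by baseline duration.
# Baseline days are assumptions for illustration; the deltas are the
# stage-level changes reported above.
stages = {
    #                        (baseline_days, pct_change)
    "idea_to_prd":           (5.0,  0.00),
    "prd_to_first_pr":       (4.0, -0.18),  # 18% faster
    "first_pr_to_approved":  (3.0, +0.35),  # 35% slower
    "approved_to_deployed":  (1.0,  0.00),
    "deployed_to_stable":    (4.0, +0.22),  # 22% slower
}

baseline = sum(days for days, _ in stages.values())
current  = sum(days * (1 + delta) for days, delta in stages.values())
print(f"net end-to-end change: {current / baseline - 1:+.1%}")
# With these made-up weights it lands around +7%, the same ballpark as the
# ~8% we measured; the exact number depends on the real stage durations.
```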

Feature Quality:

  • Customer bug reports per feature: up 31%
  • Post-launch iteration cycles: up 2.1x
  • Time to “feature complete and stable”: basically flat

We’re shipping faster but less completely. Trading quality for velocity.

The Business Case Reconsidered

Michelle’s framework about retention vs productivity budget is brilliant. Let me extend it:

If we’re buying retention:

  • Measure: attrition rate, offer acceptance, team satisfaction
  • ROI: Cost to replace engineer (~$100K+) vs tool cost ($20-40/month)
  • Verdict: Clear win (40 devs retained = $4M saved vs $50K tool cost)

If we’re buying productivity:

  • Measure: features delivered, customer value, time-to-market
  • ROI: Features/quarter vs tool cost
  • Verdict: No clear gain (sometimes negative)

The problem: we’re selling it internally as productivity but getting value from retention.

That’s a messaging problem, not a tool problem.
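
To put rough numbers on the retention case (replacement cost is a conservative industry figure, and the avoided-attrition count is the assumption doing all the work, so I’ve made it deliberately cautious):

```python
# Back-of-envelope retention ROI. "avoided" is how many devs stayed because
# of the tools; that's unknowable, so use a deliberately cautious guess.
replacement_cost = 100_000  # per engineer, conservative
tool_cost_annual = 50_000
avoided          = 2        # far more cautious than "40 devs retained"

roi = (avoided * replacement_cost - tool_cost_annual) / tool_cost_annual
print(f"retention ROI: {roi:+.0%}")  # +300% even at two saved engineers
```

Even with pessimistic assumptions, the retention framing clears the bar; it’s the productivity framing that can’t.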

What I’m Proposing

Instead of “AI makes us X% faster” (which our own delivery data contradicts), try:

“AI tools improve developer experience and retention”

  • Retention up 12%
  • Morale up significantly
  • Recruiting easier (devs want modern tools)
  • Cost: ~$40K annually
  • Savings from avoided attrition: ~$1-2M

“AI tools change where speed happens”

  • Faster: prototyping, boilerplate, exploration
  • Slower: complex features, code review, debugging
  • Net: roughly neutral on delivery, but shifts work to higher-value activities

“We’re in learning mode”

  • Early data (METR, our own metrics) suggests we’re still early on the adoption curve
  • 2026 tools are better than 2025’s (a 4% slowdown vs 19%)
  • Investing in upskilling to use tools effectively
  • Measuring business outcomes, not just velocity

This reframes the conversation from “productivity miracle” to “strategic investment in talent and learning.”

It’s honest. It’s defensible. And it doesn’t set unrealistic expectations.


The 39-point perception gap is our wake-up call to get real about what we’re buying and why.

We can keep AI tools—but let’s be honest about the value proposition. It’s retention and morale, not velocity and delivery. That’s still valuable. Just… different.