We Measured Our Team's AI Coding Impact: 19% Slower Despite Everyone Feeling Faster—What Are We Missing?

Six months ago, I rolled out AI coding assistants (primarily Cursor with Claude) across our 40-person engineering team at a Fortune 500 financial services company. The team was excited—finally, a tool to help us move faster through our massive modernization backlog.

The response was overwhelmingly positive. In our quarterly surveys, 85% of developers reported feeling “significantly more productive.” Stand-ups were full of stories about refactoring entire modules in an afternoon or knocking out bug fixes in minutes instead of hours. Morale was genuinely up.

But here’s where it gets uncomfortable: I decided to measure the actual impact.

We tracked cycle time (from first commit to production), code review duration, bug rates, and PR merge times for three months before AI adoption and six months after. We controlled for project complexity and team changes.
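
For anyone who wants to replicate the measurement, here’s a minimal sketch of the cycle-time calculation, assuming you can export one record per shipped PR with a first-commit and a production-deploy timestamp (the field names, sample data, and adoption date are placeholders, not our actual pipeline):

```python
from datetime import datetime
from statistics import median

# Hypothetical PR export: one record per shipped change. Field names are
# placeholders; adapt to whatever your Git host / CI actually emits.
prs = [
    {"first_commit_at": "2024-01-08T10:12:00", "deployed_at": "2024-01-11T16:40:00"},
    {"first_commit_at": "2024-06-03T09:00:00", "deployed_at": "2024-06-10T14:05:00"},
    # ... the rest of your export
]

AI_ADOPTION = datetime(2024, 3, 1)  # illustrative rollout date

def cycle_days(pr):
    """Days from first commit to production deploy."""
    start = datetime.fromisoformat(pr["first_commit_at"])
    end = datetime.fromisoformat(pr["deployed_at"])
    return (end - start).total_seconds() / 86400

before = [cycle_days(p) for p in prs
          if datetime.fromisoformat(p["first_commit_at"]) < AI_ADOPTION]
after = [cycle_days(p) for p in prs
         if datetime.fromisoformat(p["first_commit_at"]) >= AI_ADOPTION]

print(f"median cycle before: {median(before):.1f} days")
print(f"median cycle after:  {median(after):.1f} days")
print(f"change: {median(after) / median(before) - 1:+.0%}")
```

Medians beat means for this comparison because cycle-time distributions are heavily right-skewed; a few monster PRs shouldn’t dominate the result.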

The data showed we were 19% slower.

Not a little slower. Not within margin of error. Nearly 20% slower from commit to shipped feature. And this isn’t an isolated finding—it matches exactly what the METR study found in 2025 with experienced open-source developers.

The Perception-Reality Gap

Here’s what really bothers me: Even after I shared this data with the team, most developers still insist they’re faster. The disconnect is profound:

  • Team perception: “I’m 50-100% faster at writing code”
  • Measured reality: 19% slower to ship features
  • Team explanation: “The metrics must be wrong” or “We’re working on harder problems now”

But we controlled for complexity. The metrics aren’t wrong. Yet the team’s lived experience is that they feel dramatically more productive.

What I Think Is Happening

Looking at the detailed data, I see patterns:

  1. First-draft speed is real: Developers ARE faster at generating initial code
  2. Debugging time exploded: Time spent fixing bugs in AI-generated code is up significantly
  3. Review became a bottleneck: Code reviewers are spending 40% more time per PR
  4. More iterations to “done”: Average PR has 2.3x more commits before merge

So yes, developers feel faster because they’re typing less. But they’re spending more time debugging, sitting longer in review, and going through more iterations to reach shippable quality.

The immediate feedback loop of AI code generation creates what researchers call “illusory productivity”—activity feels like progress even when it’s slowing down delivery.

The Questions I’m Wrestling With

1. Are we measuring the wrong things?
Maybe cycle time isn’t the right metric. What if the real value is in happier developers, better retention, and more creative problem-solving? Should I care that we’re 19% slower if 85% of the team feels significantly more productive?

2. Is this a learning curve issue?
Are we just bad at using AI tools right now? Will we get better? The METR study was with experienced developers, so skill level doesn’t seem to be the issue.

3. How do we evaluate tools when users can’t accurately self-assess?
This is the Dunning-Kruger problem amplified. If developers genuinely can’t tell when a tool is helping vs. hurting, how do we make good purchasing and adoption decisions?

4. Do we trust the data or the developers?
I trust my team. They’re not lying about their experience. But the data is also clear. How do I reconcile this?

5. What’s the role of AI literacy in overconfidence?
Recent research suggests that higher AI literacy brings more overconfidence. Are we creating “hollow senior engineers” who can prompt well but lack deep problem-solving skills?

The Uncomfortable Truth

The hardest part of this isn’t the data—it’s the conversation. When I tried to discuss this with the team, I faced defensiveness, pushback, and genuine confusion. Developers feel I’m attacking their competence or their tools. But I’m not—I’m trying to understand reality so we can make informed decisions.

I’m not suggesting we ban AI tools. I use them myself. But I’m deeply concerned about making organizational decisions based on perception when reality tells a different story.

For other engineering leaders: Are you measuring AI’s actual impact? What are you finding? How are you having this conversation with your teams?

For the whole community: Is the perception gap a problem we need to solve, or is developer happiness valuable enough that we should optimize for perception even when it diverges from reality?

I’m genuinely seeking advice here. We’ve invested significantly in these tools, and I need to figure out whether we’re on the right path or if we need to fundamentally rethink our approach.


Cross-referencing research: The Productivity Paradox of AI Coding Assistants, MIT Technology Review on AI Coding

Luis, this hits close to home. We’re seeing the same pattern at our EdTech startup.

The Organizational Effectiveness Lens

What concerns me most isn’t just the 19% slowdown—it’s what this perception gap means for organizational decision-making. If developers can’t accurately assess their own productivity, we have a systemic problem that affects:

  1. Performance evaluations: We promoted three engineers last quarter partly based on their “exceptional productivity” with AI tools. Were those the right promotion decisions? How do I know?

  2. Resource allocation: We justified headcount based on velocity projections that assumed AI would maintain or improve throughput. Those projections were wrong.

  3. Tool investments: We’re now locked into significant contract commitments for AI tools based on enthusiastic (but potentially inaccurate) developer feedback.

What We’re Measuring Differently

After reading the METR study, I started tracking downstream impacts:

  • Bug escape rate to production: Up 23% since AI adoption
  • Production incidents: Up 18% (though not all directly attributable to AI)
  • Tech debt tickets created: Up 35% (“Fix rushed AI refactor” is a common theme)
  • Code review rejection rate: Up 40%

The pattern is clear: First-draft code comes faster, but quality is lower, creating more work downstream.
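
If anyone wants to track the first of those numbers, here’s a minimal sketch of the bug-escape-rate calculation, assuming your tracker can export where each bug was caught (the field names and sample tickets are invented):

```python
# Hypothetical ticket export; "caught_in" records where each bug surfaced.
bugs = [
    {"id": "BUG-101", "caught_in": "code_review"},
    {"id": "BUG-102", "caught_in": "qa"},
    {"id": "BUG-103", "caught_in": "production"},
    # ... the rest of the export
]

escaped = sum(1 for b in bugs if b["caught_in"] == "production")
print(f"bug escape rate: {escaped / len(bugs):.0%}")  # production bugs / all bugs
```

Compute it over windows before and after adoption, the same way Luis split cycle time.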

The Performance System Problem

Here’s what keeps me up at night: If developers genuinely feel they’re performing better when data shows they’re not, how do we build fair performance and promotion systems?

We’ve always relied heavily on self-assessment and peer feedback. But if perception is systematically divorced from reality, those inputs become unreliable. Do we:

  • Discount self-assessments entirely? (Feels wrong, erodes trust)
  • Weight data more heavily? (But engineering isn’t just velocity)
  • Accept that perception IS a valid outcome? (David’s point below about retention)

I don’t have a clean answer. But I know that promoting people based on false productivity signals creates long-term organizational capability problems.

The Inclusion Angle

There’s another dimension here that worries me: Research on confidence gaps shows that women and underrepresented minorities tend to underestimate their contributions while majority groups tend to overestimate.

If AI tools amplify existing confidence biases (which the Dunning-Kruger research suggests they do), we’re potentially creating a system where:

  • Overconfident developers feel even MORE productive (and get promoted)
  • Realistic self-assessors feel LESS confident relative to peers (and get overlooked)

This could exacerbate existing equity gaps in performance systems.

What I’m Trying

We’re experimenting with “reality-based confidence building”:

  1. Pairing metrics with perception: When engineers report feeling fast, show them their actual cycle time, bug rate, and review duration (rough sketch after this list)
  2. Celebrating thoughtfulness, not speed: Explicitly recognizing engineers who take time to write clean, well-tested code
  3. Measuring team impact, not individual output: Shifting from “How much code did you write?” to “What problems did you solve? What value did you ship?”
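
For item 1, a rough sketch of what that metrics-versus-perception side-by-side can look like; all numbers and names are invented for illustration:

```python
# Hypothetical join of survey answers and delivery metrics, per engineer.
# felt_speedup: self-reported multiplier ("I'm 1.8x faster").
# cycle_change: measured change in median cycle time (+0.12 = 12% slower).
engineers = {
    "eng_a": {"felt_speedup": 1.8, "cycle_change": +0.12},
    "eng_b": {"felt_speedup": 1.5, "cycle_change": +0.25},
    "eng_c": {"felt_speedup": 1.1, "cycle_change": -0.05},
}

for name, e in engineers.items():
    measured = 1 / (1 + e["cycle_change"])  # >1.0 means genuinely faster
    print(f"{name}: feels {e['felt_speedup']:.1f}x faster, "
          f"measures {measured:.2f}x, gap {e['felt_speedup'] - measured:+.2f}")
```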

But I’ll be honest—this is hard. Developers don’t like being told their perception is wrong. It feels like criticism even when it’s not meant that way.

Luis, to your questions:

Do we trust the data or the developers?

Both. Trust that developers’ experience is real (they DO feel faster). Trust that the data is also real (they ARE slower to ship). The work is helping developers see both truths simultaneously and understand why the disconnect exists.

Is the perception gap a problem we need to solve?

Yes. Not because developers are “wrong,” but because we can’t build effective organizations on false foundations. If our mental model of productivity doesn’t match reality, all our downstream decisions will be suboptimal.

I don’t think this means banning or limiting AI tools. It means getting much better at helping developers use them thoughtfully and assess their impact accurately.

Would love to hear how others are tackling the “measurement + morale” challenge.

This entire thread is giving me flashbacks to why my startup failed, and I mean that in the most helpful way possible.

The “Feels Fast” Trap

When we were building our B2B SaaS product, we used a ton of no-code and low-code tools. Webflow for the landing page, Zapier for integrations, Bubble for prototyping features. It felt AMAZING. We were shipping “features” every week. Our velocity felt incredible.

But here’s what we didn’t see at the time:

  • We were building features nobody asked for (because they were easy to build)
  • We were creating technical debt we couldn’t see (because it was buried in tool abstractions)
  • We were confusing “activity” with “progress” (shipping != solving problems)

The Psychology of Immediate Feedback

There’s this dopamine hit when you see something materialize quickly. With design tools, when I drag-and-drop a component and it instantly looks good, my brain goes “YES! PRODUCTIVE!”

But am I actually solving the user’s problem? Am I thinking through the interaction model? Am I considering edge cases?

Or am I just… doing stuff that feels like work?

AI coding seems to create the exact same pattern. The code appears fast, you feel accomplished, dopamine hits. But the thinking part—understanding the problem, considering the architecture, anticipating edge cases—that gets short-circuited.

Speed Without Intentionality Is Just Thrashing

This was our fatal mistake at the startup. We moved fast, but we weren’t thoughtful. We generated lots of code and features, but we didn’t solve the core user problem. Six months of “productivity” that ultimately amounted to nothing because we were optimizing for activity, not outcomes.

Luis, your 2.3x more commits before merge? That’s thrashing. It FEELS like iteration and improvement, but it’s actually just… more work to get to the same place you would have gotten with more upfront thinking.

The Measurement Challenge

Here’s what keeps me up: How do you measure thoughtfulness? How do you measure problem-solving quality?

In design, we face this constantly. A quick template-based design can look “good enough” but lack the nuance that makes it truly work for users. A thoughtful custom design takes longer but solves the actual problem.

Metrics like “designs shipped” or “iterations completed” don’t capture this difference at all.

What I’m Trying Personally

I’ve started journaling at the end of each day:

  • “What problem did I solve today?” (not “What did I ship?”)
  • “Did I understand WHY before I jumped to HOW?”
  • “If I couldn’t use AI/tools, could I still solve this problem?”

That last question is brutal but important. If I’m dependent on the tool to solve the problem, I haven’t actually learned anything.

The Uncomfortable Truth

I STILL fall for “feels productive” tools all the time. Just last week I used an AI tool to generate a bunch of component variations, felt super accomplished, then realized none of them actually fit our design system principles. Had to throw it all out.

The immediate gratification of generation is so seductive that it overrides the harder work of thinking.

Keisha’s point about celebrating thoughtfulness over speed resonates hard. We need to valorize the engineer who takes time to write clean, well-architected code, not the engineer who ships the most PRs.

But that’s a cultural shift, not just a measurement problem.

This data is both validating and concerning, Luis. From the product side, I’ve been seeing similar patterns that I initially couldn’t explain.

The Business Impact

Since our engineering team adopted AI coding tools 6 months ago, we’ve had some confusing metrics:

  • Sprint velocity up 15% (more story points completed)
  • But feature delivery timeline roughly the same
  • Customer satisfaction scores actually down 8%
  • Support ticket volume up 25%

I kept asking engineering “why aren’t we shipping faster if velocity is up?” Now this perception gap explains it perfectly.

The Contrarian Take

Here’s where I might push back a bit: Maybe the perception IS a reality that matters. Developer satisfaction and retention are real business outcomes. If the team feels more productive and engaged, that has value even if the hard metrics don’t show it yet.

BUT - and this is a big but - this only works if we’re making decisions based on accurate information. If we’re:

  • Buying more AI tool licenses based on false productivity signals
  • Promoting developers based on perceived (not actual) output
  • Setting unrealistic timelines because everyone “feels” faster

Then we’re building on a foundation of illusions, and that’s dangerous.

The A/B Testing Framework

What if we treated this like a product experiment?

  • Control group: Team A uses AI tools freely
  • Test group: Team B follows structured AI usage (defined guidelines, mandatory code review)
  • Measure: Both sentiment (how do you feel?) AND delivery (what shipped?)
  • Duration: 3 months

This way we separate the morale effect from the productivity effect. Maybe AI tools are worth it JUST for morale, even if velocity doesn’t improve. That’s a valid business case. But we need to know what we’re paying for.
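
A sketch of what the readout might look like; the data is invented, and with a real sample you’d want a significance test (e.g., Welch’s t-test) before acting on any difference:

```python
from statistics import mean

# Invented results after the 3-month window.
# cycle: days from first commit to production; sentiment: 1-5 survey score.
team_a = {"cycle": [4.2, 6.1, 3.8, 7.0], "sentiment": [4.6, 4.8, 4.1, 4.5]}  # free AI use
team_b = {"cycle": [3.9, 4.4, 5.0, 4.1], "sentiment": [3.9, 4.2, 3.8, 4.0]}  # structured use

for label, team in (("A (free use)", team_a), ("B (structured)", team_b)):
    print(f"Team {label}: mean cycle {mean(team['cycle']):.1f} days, "
          f"mean sentiment {mean(team['sentiment']):.1f}/5")
```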

The Worst Outcome

The worst case isn’t “AI tools don’t work.” It’s “we think they work when they don’t, and we make 10 other bad decisions based on that false belief.”

Curious what others think - am I being too product-manager-ish about this? Is there value in just letting people use tools that make them happy, even if the metrics don’t back it up?

Luis, this is one of the most important conversations we can be having right now. I’ve seen this exact pattern play out over 25 years across multiple “productivity” tools, and AI coding is following the same trajectory.

The Historical Pattern

Let me share what I’ve seen:

  • 1990s: CASE tools promised 10x productivity through code generation
  • 2000s: UML and model-driven development would eliminate manual coding
  • 2010s: Low-code platforms would make everyone a developer
  • 2020s: AI coding assistants will make developers 10x faster

Every generation’s “silver bullet” followed the same pattern: huge initial excitement, a perceived leap in productivity, then reality setting in, and finally a nuanced understanding of when, how, and for whom the tool actually helps.

The Dunning-Kruger Connection

Here’s what really concerns me: Recent research shows AI tools amplify overconfidence in developers. The more “AI literate” developers think they are, the more overconfident they become about their output.

This creates what I call “Hollow Senior Engineers” - people who are fast at generating code but lack the depth to understand architectural implications, edge cases, and long-term maintainability.

Where the 19% Goes

From what I’ve measured in my organization, here’s the breakdown:

  • Coding time: DOWN 40% (this is what developers feel)
  • Code review time: UP 60% (reviewers need more time to validate AI output)
  • Debugging/fixing time: UP 80% (AI generates plausible but flawed code)
  • Thinking/design time: DOWN 30% (jumping to coding too fast)

Net effect: 15-20% slower end-to-end, but developers only perceive the coding time reduction.
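
To sanity-check that those phase shifts really net out to 15-20%, here’s the arithmetic with an assumed baseline split of where delivery time goes; the split is an illustrative guess, not a measurement:

```python
# Assumed baseline share of end-to-end delivery time per phase (percent).
# These proportions are illustrative, not measured.
baseline = {"thinking": 20, "coding": 30, "review": 20, "debugging": 30}

# Per-phase changes from the breakdown above.
change = {"thinking": -0.30, "coding": -0.40, "review": +0.60, "debugging": +0.80}

after = {phase: share * (1 + change[phase]) for phase, share in baseline.items()}
print(after)  # {'thinking': 14.0, 'coding': 18.0, 'review': 32.0, 'debugging': 54.0}
print(f"net: {sum(after.values()) - 100:+.0f}% end-to-end")  # ~18% slower
```

Developers directly experience only the shrinking coding row; the growth in review and debugging shows up downstream, which is exactly where the perception gap lives.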

The Framework I Use

I separate development into three phases:

  1. Thinking time: Understanding the problem, considering approaches
  2. Coding time: Actually writing the code
  3. Fixing time: Debugging, reviewing, adjusting

AI speeds up #2 dramatically, but often at the expense of #1 (less thought up front) while inflating #3 (more bugs to fix). We’re optimizing the wrong part of the cycle.

The Training Gap

Most developers taught themselves AI tools the same way they taught themselves Git: by trying things until something works, without understanding the underlying model. This leads to:

  • Over-reliance on AI for problems they should solve mentally
  • Acceptance of AI suggestions without understanding why
  • Atrophy of fundamental problem-solving skills

What We’re Doing

At my organization, we’re implementing:

  1. AI usage training: When to use AI, when not to, how to validate output
  2. Problem statement requirement: Document the problem BEFORE using AI to solve it
  3. Review attention: PRs with heavy AI usage get extra scrutiny
  4. Skill maintenance: Regular “AI-free” exercises to maintain fundamentals
  5. Metric evolution: Tracking “fix my own recent code” commits as a quality signal (first-cut sketch after this list)
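
For item 5, a rough first cut of that signal from git history. The “fix” keyword heuristic is an assumption, and a real version would need file-level attribution to confirm authors are fixing their own recent code:

```python
import subprocess
from collections import Counter

# Crude proxy: per author, what share of recent commits look like fixes?
# A real "fix my own recent code" metric would also check that the fixed
# lines were authored by the same person in the preceding weeks.
log = subprocess.run(
    ["git", "log", "--since=90 days ago", "--pretty=%ae|%s"],
    capture_output=True, text=True, check=True,
).stdout.splitlines()

fixes, totals = Counter(), Counter()
for line in log:
    author, _, subject = line.partition("|")
    totals[author] += 1
    if any(word in subject.lower() for word in ("fix", "hotfix", "revert")):
        fixes[author] += 1

for author in totals:
    share = fixes[author] / totals[author]
    print(f"{author}: {fixes[author]}/{totals[author]} fix-type commits ({share:.0%})")
```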

The Controversial Question

Should we gate AI tool access by skill level? Let senior engineers use AI freely but restrict junior engineers until they build fundamentals?

I know it sounds like gatekeeping, but consider: we don’t let junior engineers deploy to production without oversight. Why would we let them use tools that can generate production code at scale without similar oversight?

Bottom Line

The perception gap isn’t a bug, it’s a feature of how these tools work. The immediate feedback loop creates an illusion of productivity. Our job as leaders is to help our teams see reality clearly while building genuine capability.

This is hard, important work. Thanks for starting this conversation, Luis.