AI Writes 41% of Our Code But Productivity Only Rose 10%. Are We Measuring the Wrong Things?

We’re sitting in this bizarre moment where AI is generating 41% of all new code in 2026, and yet when you look at actual organizational productivity gains, we’re seeing a plateau around 10%. That’s it. Not the 20-55% that the early GitHub/Google/Microsoft studies promised.

But here’s what really keeps me up at night as a product leader: We might be measuring the wrong things entirely.

The Perception Gap Is Real

The METR study stopped me cold. They ran a proper RCT with 16 experienced open-source developers—people with an average of 5 years on their own repositories. The results?

  • Developers using AI took 19% longer to complete tasks
  • But they believed they were 24% faster
  • That’s a 43-point perception gap between feeling (+24%) and reality (−19%)

Think about what this means for how we’re evaluating AI tools today. If developers feel faster but are slower, and we’re making tool adoption decisions based on subjective feedback… we’re flying blind.

Three Levels of Measurement, Three Different Stories

I’ve started thinking about this in layers:

Individual Developer Level:

  • Time saved writing code: ✅ Real (3.6 hours/week average)
  • Feeling of productivity: ✅ Real (developers report 10-30% boost)
  • Actual task completion speed: ❓ Mixed evidence (METR says slower, other studies say faster)

Team Level:

  • Sprint velocity: 😕 Mostly unchanged despite AI adoption
  • Code review burden: 📈 Increasing (66% say AI code is “almost right, but not quite”)
  • Technical debt accumulation: 📈 AI-assisted code has 1.7× more issues
  • DORA metrics: 😐 No meaningful improvement at most orgs

Business Level:

  • Time to market: ❓ Unknown
  • Cost per feature: ❓ Unknown
  • Revenue impact: ❓ Only 33% of decision-makers link AI to financial growth

The CFO Problem

Here’s the business reality: CFOs are deferring 25% of AI investments to 2027 because they can’t see the ROI. And honestly? I don’t blame them.

If I’m being asked to justify K+/year in AI coding tool licenses, what’s my business case? “Developers feel more productive” isn’t going to cut it. “We’re generating more code” isn’t a win if that code needs more debugging time.

What I need is:

  • Time to market improvement for new features
  • Cost reduction per delivered capability
  • Customer impact metrics (faster bug fixes, more features shipped)
  • Risk metrics (security vulnerabilities, production incidents)

So What Should We Be Measuring?

I think we need a framework that acknowledges the complexity:

  1. Code generation speed (individual metric, short-term)
  2. Code durability (team metric, medium-term) - How long does it survive in production without modification?
  3. Review efficiency (team metric, short-term) - Are we spending more time fixing AI suggestions than we saved writing them?
  4. Production quality (business metric, long-term) - Incident rates, security vulnerabilities, customer-facing bugs
  5. Developer capability (organizational metric, long-term) - Are we building or eroding engineering skills?
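Of these, code durability (item 2) is the easiest to operationalize. Here's a minimal sketch, assuming you can derive per-line lifetimes by diffing `git blame` snapshots; the function name and data shape are hypothetical, not an existing tool:

```python
def durability(line_lifetimes, horizon_days=90):
    """Fraction of lines that survived `horizon_days` unmodified.

    `line_lifetimes` holds, per line of a change, the number of days
    until it was first modified again, or None if it was never touched.
    Assumes every sampled line is at least `horizon_days` old, so a
    None really means "survived the horizon."
    """
    if not line_lifetimes:
        return 0.0
    survived = sum(1 for d in line_lifetimes if d is None or d >= horizon_days)
    return survived / len(line_lifetimes)
```

Comparing this number for AI-heavy versus human-written changes is what turns "durability" from a talking point into a trend line.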

But here’s the uncomfortable truth: Even with better metrics, we might find that AI tools are net negative in some contexts and net positive in others. And that’s okay! The goal isn’t to prove AI is universally good—it’s to understand when it helps and when it hurts.

The Question for This Group

What are you actually measuring to evaluate AI coding tools?

Are you tracking anything beyond “developer sentiment” and “adoption rates”? Have you found metrics that meaningfully correlate with business outcomes?

Because right now, it feels like we’re in a measurement crisis. Adoption sits at 93% while only 46% of developers say they fully trust these tools, and we can’t definitively say whether they’re making us faster or slower.

That’s… not great.



David, this resonates hard. I’m seeing exactly this disconnect in my org.

My developers swear they’re more productive with AI tools. In our quarterly surveys, 85% report feeling faster. But when I look at our sprint velocity over the last 6 months since we rolled out GitHub Copilot and Claude… it’s basically flat.

We’re delivering the same number of story points per sprint. Same number of features per quarter. The throughput hasn’t changed.

The Hidden Costs Are Real

What HAS changed:

Code Review Time: Up ~30% on PRs that heavily use AI-generated code. Reviewers report needing to dig deeper because they can’t trust the AI code to “just work.”

Debugging Cycles: Our incident postmortems increasingly show issues from AI-generated edge case handling. The code looks right at first glance, passes tests, but has subtle logic errors.

Revert Rate: We started tracking this last quarter. PRs with >40% AI-generated code (we can estimate based on commit patterns and developer notes) have a 2.3× higher revert rate within 30 days.
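For anyone who wants to replicate this tracking, the computation itself is trivial once PRs carry an estimated AI share; a sketch with hypothetical field names (the AI share would come from commit patterns and developer notes, as described above):

```python
def revert_rate(prs, ai_threshold=0.4, window_days=30):
    """Share of AI-heavy PRs reverted within `window_days`.

    Each PR record carries an estimated AI share of its diff (0..1)
    and the number of days until it was reverted, or None if it never
    was. Field names are illustrative, not from any real tracker.
    """
    cohort = [p for p in prs if p["ai_share"] > ai_threshold]
    if not cohort:
        return 0.0
    reverted = sum(
        1 for p in cohort
        if p["reverted_after"] is not None and p["reverted_after"] <= window_days
    )
    return reverted / len(cohort)
```

Run the same function over the sub-threshold cohort and the ratio of the two rates is the 2.3× figure's equivalent for your org.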

So yes, developers write code faster. But we’re spending more time in review, more time fixing bugs, more time reverting changes.

The time savings are being eaten by the quality tax.

Measurement Question

David, your “acceptance rate” question is spot-on. I’d add:

  • Production durability: What % of AI-generated code survives 90 days without modification?
  • Review comments per line: Are reviewers leaving more correction comments on AI code?
  • Test coverage required: Do we need more comprehensive tests to catch AI edge cases?
  • Developer understanding: Can the author explain and defend their AI-generated code in review?

That last one is becoming my informal test. If a developer can’t explain why their AI-generated code works, I ask them to rewrite it.

The Nuance

I don’t think AI tools are bad. I think they’re context-dependent, and we’re treating them like a universal productivity boost when they’re not.

Where I see wins:

  • Boilerplate and repetitive patterns
  • Test case generation (with human verification)
  • Documentation and comments
  • Refactoring well-understood code

Where I see losses:

  • Complex business logic
  • Performance-critical code
  • Security-sensitive operations
  • Novel architectural patterns

But we’re not measuring these contexts separately. We’re just looking at aggregate “AI usage” and trying to draw conclusions.

The “almost right but not quite” problem Luis mentioned is haunting me. 😬

I’m not an engineer, but I work closely with our dev team on our design system, and I’m seeing this quality erosion play out in real-time. It’s giving me serious flashbacks to my failed startup where we optimized for speed over understanding.

The “Good Enough” Trap

66% of developers say AI code is “almost right, but not quite.” That phrase makes my skin crawl.

Because “almost right” means you spend time:

  1. Reading and understanding what the AI generated
  2. Identifying what’s wrong or missing
  3. Fixing it
  4. Testing to make sure your fix didn’t break something else

Versus writing it from scratch where you:

  1. Think through the problem
  2. Write the solution
  3. Test it

The cognitive load is different. When you write code, you understand it from the ground up. When you fix AI code, you’re reverse-engineering someone else’s (the AI’s) thought process first.

Design Systems Parallel

We went through this exact thing with our component library. We had templates that could generate 80% of what you needed for a new component—layout, basic props, styling hooks.

At first, productivity seemed amazing! Designers were spinning up new components in hours instead of days.

Then we noticed: The components built from templates were brittle. They worked for the happy path but broke in edge cases. People didn’t understand the underlying architecture, so they couldn’t extend or customize effectively.

We were generating more components, but our component quality and system coherence dropped.

Sound familiar?

The Understanding Gap

Luis’s test is perfect: “Can the author explain and defend their AI-generated code?”

I’d go further: Can they extend it? Can they debug it when it breaks in production at 3am? Can they teach a junior developer why it works this way?

If the answer is no, then what we’ve created is a dependency, not a productivity boost. We’ve outsourced understanding to an AI that won’t be there when things go wrong.

Time vs Quality Tradeoff

David, you mentioned time-to-market and cost per feature. But there’s a hidden cost we’re not measuring:

Time spent fixing “almost right” code accumulates as technical debt.

In design, we call this “design debt”—when you ship fast without understanding the system, you create inconsistencies that compound over time. Every future change becomes harder because you’re working around these poorly understood foundations.

I suspect the same is happening with AI-generated code. We’re optimizing for the first implementation speed while ignoring the maintenance burden and evolution cost.

The Question I Can’t Shake

Are we building engineering capability or eroding it?

Because if AI makes code easier to generate but harder to understand, maintain, and evolve… we haven’t actually won anything. We’ve just shifted when the complexity hits us—from upfront design to long-term maintenance.

And maintenance is where startups die. I learned that the hard way.

This is the conversation I’ve been trying to have with our board for three months. Thank you for framing it so clearly, David.

The CFO just challenged me on our K annual spend for AI coding tools (GitHub Copilot Enterprise + Claude + Cursor licenses for 120 engineers). She wants ROI justification before we renew.

And honestly? I’m struggling to make the business case.

The Numbers Don’t Add Up Yet

Here’s what I presented last week:

Individual Productivity Claims:

  • Developers report 3.6 hours/week saved
  • 120 engineers × 3.6 hours × 50 weeks = 21,600 hours saved
  • At /hour fully loaded cost = .24M in “value”
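The arithmetic behind that claim is easy to reproduce; the hourly cost below is a hypothetical stand-in rather than the figure I actually used, and that's exactly the problem — the "value" scales linearly with whatever rate you plug in:

```python
# Self-reported figures from the survey cited above.
engineers = 120
hours_saved_per_week = 3.6
working_weeks = 50

# Hypothetical fully loaded hourly cost -- a stand-in, not our real number.
hourly_cost = 150

hours_saved = engineers * hours_saved_per_week * working_weeks
claimed_value = hours_saved * hourly_cost

print(f"{hours_saved:,.0f} hours ≈ ${claimed_value / 1e6:.2f}M in 'value'")
```

Notice there is no term in this model for review time, rework, or incidents — which is precisely why the CFO didn't buy it.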

CFO’s response: “Show me where those hours went. Did we ship more features? Reduce time-to-market? Close more customer feature requests?”

I couldn’t answer.

Because our DORA metrics are flat. Deployment frequency, lead time, change failure rate—all basically unchanged year-over-year.

The Risk Metrics Are Concerning

What I CAN show:

Security Vulnerabilities: Our security scans flag 23.7% more issues in code with heavy AI assistance. That’s not free—each vulnerability costs security team time to triage, engineering time to fix, and risk exposure.

Production Incidents: 18% increase in P2/P3 incidents traced to subtle logic errors in AI-generated code. These don’t take down the site, but they degrade user experience and create support burden.

Code Review Bottleneck: Our senior engineers (the ones doing thorough reviews) are now review-constrained rather than code-constrained. We shifted the bottleneck, not eliminated it.

The CFO Question: What’s the Business Outcome?

Maya’s point about technical debt accumulation is what keeps me up at night. Because the CFO doesn’t care about “code quality” or “developer experience” as abstract concepts.

She cares about:

  • Revenue enabled: Can we ship revenue-generating features faster?
  • Costs avoided: Are we reducing engineering headcount needs or incident response costs?
  • Risk mitigation: Are we reducing security breaches, compliance violations, SLA misses?
  • Customer satisfaction: Are we closing more feature requests, reducing bug reports?

I need to measure AI impact in THOSE terms, not engineering metrics.

What I’m Proposing to Track

Starting next quarter, we’re implementing this framework:

Tier 1 - Business Metrics (Monthly):

  • Features shipped per quarter (by revenue impact)
  • Customer-reported bugs (severity-weighted)
  • Time-to-resolution for customer feature requests
  • Security incidents attributable to code defects

Tier 2 - Team Metrics (Weekly):

  • Deployment frequency and lead time (DORA)
  • Change failure rate by code source (human vs AI-heavy)
  • PR review time and comment density
  • Code revert rate within 30/60/90 days

Tier 3 - Individual Metrics (Daily):

  • AI tool usage and acceptance rate
  • Developer sentiment surveys
  • Code authorship patterns
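The Tier 2 split of change failure rate by code source needs nothing fancier than tagged deploy records; a sketch assuming each deploy is already labeled (the labeling scheme and field names are hypothetical):

```python
from collections import defaultdict

def change_failure_rate(deploys):
    """DORA change failure rate, split by code source.

    Each deploy record is assumed to be pre-labeled 'human' or
    'ai_heavy' and to carry a boolean `failed` flag. Returns a
    failure rate per source so the two cohorts can be compared.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for d in deploys:
        totals[d["source"]] += 1
        failures[d["source"]] += int(d["failed"])
    return {src: failures[src] / totals[src] for src in totals}
```

The hard part is the labeling, not the math — but even a rough human/AI-heavy tag gives you a cohort comparison that aggregate DORA numbers hide.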

The hypothesis: Tier 1 metrics tell us if AI is delivering business value. If they improve, great—AI works. If they’re flat or declining, then Tier 2/3 metrics help us diagnose why.

The Hard Conversation

But here’s what I’m also preparing the board for: AI tools might be net negative for our specific use case.

We’re a B2B fintech platform with complex compliance requirements, legacy integration challenges, and high security standards. The “move fast” benefits of AI code generation might be outweighed by the “break things” costs in our domain.

And if that’s true, I need to know—and be willing to pull back on AI adoption despite industry hype.

That’s uncomfortable. It means potentially swimming against the current when everyone else is going all-in on AI. But better to face that reality now than justify a sunk cost for another year.

The Question

Luis, you mentioned context-dependent wins and losses. Have you considered domain-specific AI adoption policies? Like: “Use AI for test generation and documentation, but prohibited for payment processing and auth logic”?

Because I’m starting to think blanket “AI everywhere” adoption was the wrong move.

This thread is giving me a lot to think about, but I keep circling back to something that’s not getting enough attention:

If AI makes experienced developers 19% slower (METR study), what’s it doing to our junior engineers?

The Trust Paradox at Scale

David, you mentioned 93% adoption but only 46% trust. That statistic terrifies me from a talent development perspective.

Because when I look at my team:

  • Senior engineers (5+ years) use AI selectively—they know when to trust it and when not to
  • Mid-level engineers (2-5 years) use AI frequently—they’re still building judgment about when it helps
  • Junior engineers (0-2 years) use AI constantly—it’s how they learned to code

And here’s what I’m seeing in code reviews:

Junior developers increasingly can’t explain their own code. They can tell you what it does (because the AI told them), but not why it works, what the tradeoffs are, or how to modify it safely.

The Learning Curve Is Breaking

We have a new engineer who joined 8 months ago. Talented, smart, great culture fit. But when I pair with them on debugging a production issue, they’re… lost.

They can generate code quickly. They can Google and paste AI suggestions. But when something breaks in an unexpected way—which is ALWAYS in production—they don’t have the mental models to reason through it.

The Anthropic research showing 17% lower mastery scores with AI assistance isn’t just an academic finding. It’s playing out on my teams right now.

Long-Term Organizational Risk

Michelle, your CFO is asking about ROI. Here’s a cost we’re not measuring:

What’s the value of engineering capability we’re NOT building?

  • Junior developers who can’t function without AI tools
  • Mid-level engineers who never learned to architect systems from first principles
  • Senior engineers who spend all their time reviewing and fixing AI-generated code instead of mentoring

In 3-5 years, we might have a generation of engineers who are incredibly fast at generating code but fundamentally don’t understand what they’re building.

That’s an organizational capability crisis.

The Uncomfortable Question

Should we restrict AI tool access for junior engineers during their first 12-18 months?

I know how that sounds—like I’m being a gatekeeper, like “back in my day we walked uphill both ways.” But hear me out:

Learning to code is fundamentally about building mental models. You need to struggle with syntax, fight with debugging, understand error messages deeply. That friction is where learning happens.

If AI shortcuts all that friction, juniors never build the foundation they need to grow into mid-level and senior engineers.

Luis’s context-dependent approach makes sense to me, but I’d add an experience-level dependency too:

  • Juniors (0-18 months): Prohibited for production code, allowed for learning exercises with guidance
  • Mid-level (18 months - 4 years): Allowed but with mandatory explanation in PRs
  • Senior (4+ years): Full access with judgment about appropriate use
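If you wanted to enforce this, the tier lookup is a one-liner per band; a sketch of the policy proposed above (thresholds and labels are my proposal, not an existing standard):

```python
def ai_policy(months_of_experience):
    """Experience-gated AI-tool policy, per the tiers sketched above.

    Under 18 months: AI only for learning exercises, never production.
    18 months to 4 years: allowed, but explanations mandatory in PRs.
    Beyond 4 years: full access at the engineer's own judgment.
    """
    if months_of_experience < 18:
        return "learning-only"
    if months_of_experience < 48:
        return "explanation-required"
    return "full-access"
```

Whether you encode it in tooling or just in review norms, writing the policy down forces the conversation about what each band is supposed to be learning.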

The Productivity Trap

But here’s the business pressure I’m facing: My CFO (same conversation Michelle is having) expects us to deliver the same or more with fewer engineers.

AI tools look like the answer—if one developer can generate 2× more code, we need fewer developers, right?

Except that logic only works if:

  1. Code generation = value delivery (it doesn’t)
  2. AI-generated code has same quality (it doesn’t—1.7× more issues)
  3. We’re not creating long-term capability debt (we are)

I’m stuck between short-term productivity pressure and long-term organizational health. And I don’t have a good answer yet.

What Are Others Doing?

Has anyone implemented AI-specific onboarding policies for junior engineers? Or measured skill development trajectories with/without AI assistance?

Because we’re making massive decisions about tool adoption without understanding the talent development implications. And by the time we figure out we’ve damaged our engineering capability… it might be too late to fix it.