Developers Report 45% Productivity Gains with AI, But We're Not Shipping Faster—Are We Measuring the Wrong Things?

We’ve had GitHub Copilot and other AI coding assistants rolled out across our engineering teams for eight months now. The individual feedback has been overwhelmingly positive—developers report feeling more productive, shipping code faster, and spending less time on boilerplate. Our internal surveys show 78% of engineers say AI tools make them “significantly more productive.”

But here’s what’s puzzling me: our sprint velocity hasn’t changed. Features still take the same amount of time to ship. Our DORA metrics are flat. Time-to-production for new capabilities looks identical to a year ago.

The Data Says There’s a Disconnect

I started digging into the research, and we’re not alone. Recent studies show:

  • Developers using AI complete 26% more tasks and merge 60% more pull requests (IT Revolution research)
  • AI coding tools save an average of 3.6 hours per week per developer
  • 84% of developers report using AI tools, with 51% using them daily

Yet the same research reveals a stunning paradox: companies report NO measurable improvement in delivery velocity or business outcomes despite these individual productivity gains (Faros AI Productivity Paradox Report).

Are We Measuring Activity Instead of Value?

I think we’re tracking the wrong things. We celebrate:

  • More commits pushed
  • More PRs opened
  • Faster code completion times
  • Lines of code written per hour

But none of these are business outcomes. They’re activity metrics, not value metrics.

A developer can use AI to refactor a module 3× faster than before—that shows up as productivity. But if that refactoring doesn’t enable new features or improve customer experience, did we actually become more productive as an organization?

Where Do the Gains Disappear?

My hypothesis: AI speeds up 5-10% of our delivery pipeline—the coding part—while the other 90-95% remains unchanged.

The bottlenecks are still:

  • Code review (now arguably harder because reviewers need to verify AI-generated patterns)
  • Testing and QA (same capacity, but more code to validate)
  • Integration and deployment (unchanged process)
  • Product refinement and requirement clarity (no AI help here)

We optimized one step in a multi-step pipeline and expected the entire pipeline to speed up. That’s not how systems work.

The Real Question: What Should We Measure?

If individual coding speed isn’t translating to business results, what should we be tracking?

I’m wrestling with this question for our next board meeting. Some possibilities:

Outcome-focused metrics:

  • Features delivered per sprint (not story points, actual customer-facing capabilities)
  • Time from customer request to production deployment
  • Business KPI impact (activation, retention, revenue) per engineering sprint
  • Customer problems solved, not tickets closed

System health metrics:

  • End-to-end cycle time (idea → production)
  • Change failure rate and rollback frequency
  • Technical debt accumulation vs reduction
  • Code quality trends (not just code volume)
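Several of these system-health metrics are mechanically computable once you can export ticket and deployment records. A minimal sketch of end-to-end cycle time and change failure rate — field names and data here are hypothetical stand-ins for whatever your issue tracker exports:

```python
from datetime import datetime
from statistics import median

# Hypothetical export: one record per shipped change, with ISO timestamps.
changes = [
    {"created": "2024-03-01T09:00", "deployed": "2024-03-12T16:00", "caused_incident": False},
    {"created": "2024-03-03T10:00", "deployed": "2024-03-20T11:00", "caused_incident": True},
    {"created": "2024-03-05T14:00", "deployed": "2024-03-11T09:00", "caused_incident": False},
]

def cycle_time_days(change):
    """End-to-end cycle time: idea (ticket created) to production deploy."""
    start = datetime.fromisoformat(change["created"])
    end = datetime.fromisoformat(change["deployed"])
    return (end - start).total_seconds() / 86400

median_cycle = median(cycle_time_days(c) for c in changes)
failure_rate = sum(c["caused_incident"] for c in changes) / len(changes)

print(f"median cycle time: {median_cycle:.1f} days")
print(f"change failure rate: {failure_rate:.0%}")
```

The point of the sketch is the denominator: cycle time starts at ticket creation, not at first commit, so coding speedups that stall in review or QA won't move it.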

But I’m not confident these capture it either.

Your Perspectives?

How are you measuring AI productivity impact in your organizations?

Are you seeing the same disconnect between individual developer velocity and team delivery outcomes?

What metrics have you found that actually correlate with business value instead of just engineering activity?

I’d love to hear what’s working—or what’s failing—for others navigating this. Because right now, I’m celebrating productivity gains I can’t actually demonstrate to the business.

Michelle, this hits home. We’re seeing the exact same pattern in our fintech engineering org.

Our developers absolutely love Copilot—I hear constant praise about how it handles boilerplate, suggests test cases, and speeds up refactoring. Individual velocity metrics are up across the board.

But our feature delivery pace hasn’t budged.

Where I’m Seeing the Time Go

The time “saved” in coding is getting consumed elsewhere:

Code review has become the new bottleneck. My senior engineers are spending 30-40% more time in code review than they did pre-AI. Why? They’re trying to understand AI-generated patterns, verify that the code actually solves the right problem (not just a problem), and check for subtle logic bugs that weren’t obvious at first glance.

One of my tech leads said it perfectly: “I can review human code by scanning for familiar patterns. AI code requires me to read every line like it’s the first time I’ve seen it.”

Testing hasn’t scaled with code volume. We’re generating more code per sprint, but our QA team size and CI/CD pipeline capacity are unchanged. Features stack up waiting for test coverage. The increased output from dev just created a backlog downstream.

Technical debt conversations are harder. When a developer writes code, they understand the tradeoffs they made. With AI-generated code, sometimes the tradeoffs aren’t visible until code review—and by then, developers are committed to defending the AI’s approach because it “works.”

The Measurement Trap

Your point about activity vs value metrics is critical. We track:

  • Story points completed ✅
  • PRs merged ✅
  • Commits per engineer ✅

But we don’t effectively track:

  • End-to-end cycle time (from ticket creation to production deployment)
  • How long features actually spend waiting in review, testing, or deployment queues
  • The quality of what shipped (maintainability, not just functionality)

We’re measuring what’s easy to measure, not what actually matters.

What I’m Trying

I’m experimenting with tracking end-to-end lead time for different types of work:

  • New features: idea → production
  • Bug fixes: report → deployed fix
  • Technical debt: identification → completion
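The experiment above is cheap to automate if your tracker exposes a work-type label and the two timestamps. A rough sketch — the labels and fields are hypothetical; substitute whatever your tickets actually carry:

```python
from collections import defaultdict
from datetime import date
from statistics import median

# Hypothetical ticket export; dates simplified to day granularity.
tickets = [
    {"type": "bug_fix",   "reported": date(2024, 4, 1),  "deployed": date(2024, 4, 3)},
    {"type": "bug_fix",   "reported": date(2024, 4, 2),  "deployed": date(2024, 4, 8)},
    {"type": "feature",   "reported": date(2024, 3, 20), "deployed": date(2024, 4, 10)},
    {"type": "tech_debt", "reported": date(2024, 3, 1),  "deployed": date(2024, 4, 5)},
]

# Group end-to-end lead times (report/idea -> production) by work type.
lead_times = defaultdict(list)
for t in tickets:
    lead_times[t["type"]].append((t["deployed"] - t["reported"]).days)

# Median per type; compare the same breakdown before and after AI rollout.
medians = {work_type: median(days) for work_type, days in lead_times.items()}
print(medians)
```

Medians rather than means keep one pathological multi-quarter ticket from drowning out the signal in a small sample.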

Early data suggests AI helps most with isolated, well-defined changes (bug fixes, small features). It helps least with complex, multi-system changes that require architectural thinking.

The speedup is real—but it’s uneven across different types of work. We might be measuring the wrong granularity.

Curious if others are seeing similar patterns in where AI helps vs where it doesn’t?

This discussion is fascinating from a product perspective, because we’re seeing the exact inverse on our side.

Engineering keeps telling us they’re more productive. The data they share looks impressive—more PRs, faster code completion, higher story point velocity.

But from where Product sits, nothing has changed.

The Product Reality Check

Our sprint commitments look the same. Feature delivery dates are just as unpredictable. The number of customer-facing capabilities we ship per quarter hasn’t increased.

When I ask engineering why a feature that “only needs 3 days of coding” still takes 2 weeks to ship, the answers are everything Luis mentioned: review delays, testing queues, integration complexity, deployment scheduling.

But here’s what worries me: I think we might be optimizing for the wrong outcomes entirely.

Are We Confusing Activity with Impact?

Michelle, your list of “activity metrics” is exactly what I hear celebrated in standups:

  • “Merged 15 PRs this week!”
  • “Completed 8 story points!”
  • “AI helped me refactor the entire auth module!”

Great. But did that:

  • Improve customer activation rates?
  • Reduce support tickets?
  • Enable new revenue opportunities?
  • Solve a user pain point?

We’re measuring engineering throughput, not customer outcomes.

What Product Actually Cares About

From a business perspective, productivity should mean:

  1. Customer problems solved (not tickets closed—actual user pain eliminated)
  2. Time to market (how fast we can test hypotheses with real users)
  3. Learning velocity (how quickly we can iterate based on feedback)
  4. Business KPI movement (activation, retention, revenue, NPS)

If AI makes developers 45% more productive but our product metrics don’t improve, then we’re being productive at the wrong things.

The Uncomfortable Question

Here’s what I’m starting to suspect: Developers are using AI to work faster, but not necessarily to work on what matters most.

Example from last sprint:

  • Dev used AI to refactor our API caching layer. Code quality improved. Response times got marginally better. They were very productive.
  • But what customers actually needed was a bulk upload feature that’s been in the backlog for 3 months.

The AI made it easier to do the refactoring work (which was more interesting), so that’s what got prioritized. The harder work—understanding customer needs, designing the bulk upload UX, coordinating with multiple teams—didn’t get easier, so it didn’t happen.

AI might be letting us dodge the hard, valuable work in favor of the easy, measurable work.

My Proposal

We should measure AI productivity impact the same way we measure any product investment:

  • Leading indicators: Adoption rates, developer satisfaction (are we removing friction?)
  • Lagging indicators: Feature delivery rate, customer satisfaction scores, business metrics

And critically: Stop celebrating activity metrics in isolation. More code, more commits, more PRs only matter if they’re connected to outcomes customers care about.

Otherwise we’re just moving faster in the wrong direction.

What do you all think—am I being too harsh on the engineering productivity narrative?

Coming at this from the design systems side, and David’s point about “working faster on the wrong things” really resonates.

The Code Quality Gap

We’re seeing AI help developers ship components faster—but those components often bypass our design system entirely.

Recent example: Developer used Copilot to create a new form validation component in ~30 minutes. Impressive! Except:

  • It didn’t use our design tokens
  • Color contrast failed WCAG AA standards
  • Error messaging was inconsistent with our voice guidelines
  • Mobile responsive breakpoints didn’t match our system

So I spent 2 hours fixing accessibility issues and bringing it into design system compliance. Net productivity gain: negative.
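Of the four failures above, the contrast one is mechanically checkable, so it never needs to reach a human reviewer. A minimal implementation of the WCAG 2.x contrast-ratio formula that could run as a CI gate (the formula is from the spec; the threshold policy is your call):

```python
def relative_luminance(hex_color: str) -> float:
    """Relative luminance per WCAG 2.x, from an sRGB hex color like '#777777'."""
    channels = []
    for i in (1, 3, 5):
        c = int(hex_color[i:i + 2], 16) / 255
        # Linearize each sRGB channel before weighting.
        channels.append(c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4)
    r, g, b = channels
    return 0.2126 * r + 0.7152 * g + 0.0722 * b

def contrast_ratio(fg: str, bg: str) -> float:
    l1, l2 = sorted((relative_luminance(fg), relative_luminance(bg)), reverse=True)
    return (l1 + 0.05) / (l2 + 0.05)

def passes_aa(fg: str, bg: str, large_text: bool = False) -> bool:
    """WCAG AA: 4.5:1 for normal text, 3:1 for large text."""
    return contrast_ratio(fg, bg) >= (3.0 if large_text else 4.5)

print(passes_aa("#777777", "#FFFFFF"))  # about 4.48:1 — fails AA for normal text
print(passes_aa("#767676", "#FFFFFF"))  # about 4.54:1 — passes
```

A gate like this turns "2 hours of fixing" into a red build the moment the AI-generated component lands, which is the only place the speed gain survives.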

Speed vs Craft

I think there’s a tension here between speed and craft that AI is exposing.

AI optimizes for “code that works.” But in design systems, we care about:

  • Code that’s consistent
  • Code that’s maintainable
  • Code that’s accessible
  • Code that teaches patterns to other developers

When developers lean heavily on AI, they sometimes skip the learning that happens during manual implementation. They don’t internalize why we do things a certain way.

Result: More code, but less understanding. Faster output, but more cleanup required.

Measurement Blind Spots

Michelle’s original question about metrics hits differently from a design perspective:

We don’t have good metrics for:

  • Design system compliance rates
  • Accessibility debt introduced vs resolved
  • Component reusability (are we creating one-off solutions faster, or reusable patterns?)
  • Cross-team consistency

All the things that make a codebase good for the long term aren’t captured in velocity metrics.

What I’m Advocating For

In our design systems team, we’re trying to track:

  • Quality gates passed (accessibility, design token usage, responsive design)
  • Component reuse rate (are new features using existing components, or creating new ones?)
  • Design QA time (how long does it take to bring AI-generated code up to standards?)
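Of these, component reuse rate is the easiest to approximate: scan source files for imports from the design-system package. A rough sketch — the package name `@acme/design-system` and the file contents are hypothetical, and in practice you'd walk the repo rather than hardcode sources:

```python
import re

# Hypothetical source files; a real check would glob the repository.
sources = {
    "CheckoutForm.tsx": "import { Button, TextField } from '@acme/design-system';",
    "AdminPanel.tsx":   "import { CustomButton } from './one_off/CustomButton';",
    "Signup.tsx":       "import { Button } from '@acme/design-system';",
}

# Files that pull at least one component from the design system.
DS_IMPORT = re.compile(r"from\s+'@acme/design-system'")

using_ds = sum(bool(DS_IMPORT.search(code)) for code in sources.values())
reuse_rate = using_ds / len(sources)
print(f"design-system reuse rate: {reuse_rate:.0%}")  # 2 of 3 files
```

It's a crude proxy (a file can import one token and still be 90% one-off), but tracked over time it shows whether AI-generated components are drifting away from the system.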

Not sure if these are the right metrics, but at least they acknowledge that speed without quality isn’t productivity—it’s technical debt accumulation.

The paradox might be: AI makes bad code faster to write, but we’re still measuring code quantity, not code quality.

This thread is highlighting a critical leadership challenge I’m wrestling with right now.

The Leadership Paradox

From where I sit as VP Engineering:

Developers are happier. Engagement scores are up. Retention has improved. Engineers genuinely love the AI tools—it’s one of the most popular initiatives we’ve rolled out.

But business outcomes haven’t improved. We’re not shipping features faster. Product velocity is flat. Customer-facing improvements per quarter: unchanged.

How do I reconcile these two realities?

If I tell the business “AI tools aren’t generating ROI,” I risk losing budget and disappointing a team that’s already energized by the tools.

If I claim success based on developer satisfaction alone, I’m lying about business impact.

The Systemic View

What this discussion is revealing: individual productivity gains don’t automatically translate to organizational productivity.

Luis’s point about code review bottlenecks, David’s observation about working on the wrong priorities, Maya’s concerns about quality debt—these are systems problems, not individual developer problems.

We gave developers better tools, but we didn’t:

  • Redesign our code review process for higher volume
  • Retrain reviewers on evaluating AI-generated code
  • Scale our QA and testing infrastructure
  • Improve our prioritization framework to focus on business value
  • Establish quality gates for AI-generated code

We optimized one node in a complex network and expected the whole system to improve. That’s not how systems work.

What’s Needed: Multi-Dimensional Measurement

Michelle, I think your instinct about outcome-focused metrics is right, but incomplete. We need a framework with multiple dimensions:

1. Adoption Metrics (are people using the tools?)

  • AI tool usage rates
  • Developer satisfaction scores

2. Impact Metrics (is it affecting the work?)

  • Cycle time across the full pipeline
  • Code quality trends (defect rates, security findings)
  • Technical debt velocity (accumulating vs paying down)

3. Outcome Metrics (is it creating business value?)

  • Feature delivery rate (customer-facing capabilities, not story points)
  • Time-to-market for new capabilities
  • Business KPI movement attributable to engineering work

4. Cost Metrics (is the ROI positive?)

  • AI tooling costs per engineer
  • Productivity gain (measured in outcomes, not activity) per dollar spent
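Dimension 4 is simple arithmetic once you commit to an outcome denominator instead of an activity one. A sketch with purely illustrative numbers — every figure here is hypothetical:

```python
# Hypothetical quarterly figures for one engineering org.
engineers = 120
ai_cost_per_engineer = 39 * 3   # e.g. a $39/month seat, over one quarter
features_before = 18            # customer-facing features/quarter, pre-AI
features_after = 18             # ...and post-AI (the flat line this thread describes)

total_tool_cost = engineers * ai_cost_per_engineer
extra_features = features_after - features_before

if extra_features > 0:
    print(f"cost per additional feature: ${total_tool_cost / extra_features:,.0f}")
else:
    # Outcome-denominated ROI is undefined until delivery actually moves.
    print(f"${total_tool_cost:,} spent with no measured outcome gain")
```

The uncomfortable property of this framing is that with a flat delivery rate the ROI isn't small, it's undefined — which is exactly the conversation the adoption metrics alone let us avoid.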

Right now, most orgs are measuring #1 (adoption) and claiming victory. But we need all four dimensions to know if AI productivity is real.

The Cultural Shift Required

Here’s the harder part: We need to change what we celebrate.

If we keep celebrating PRs merged, commits pushed, and story points completed, that’s what teams will optimize for—regardless of whether it creates customer value.

We should celebrate:

  • Features that move business KPIs
  • Quality improvements that prevent future bugs
  • Architectural decisions that enable future velocity
  • Learning and innovation that compounds over time

The real productivity question isn’t “can developers code faster?”—it’s “can our organization deliver more customer value per unit of time and investment?”

Those are very different questions, and they require very different measurement approaches.

What measurement frameworks are others experimenting with? I’d love to learn from what’s working elsewhere.