We shipped 40% more code last quarter. Our bug count increased 65%. Something is fundamentally broken with how we’re measuring productivity in 2026.
I’m Rachel, managing an ML team at Anthropic, and I’ve been staring at our engineering dashboards trying to reconcile two contradictory realities. On paper, we’re crushing it. Commit frequency up. PRs merged faster. Story-point velocity tracing beautiful upward curves. Leadership loves our sprint reviews.
But talk to any engineer on my team and you’ll hear a different story. Code reviews feel rushed. Technical debt accumulating faster than we can address it. People working weekends to fix bugs that shouldn’t have shipped. One senior engineer told me last week: “I’m shipping more code than ever, but I’ve never felt less confident in my work.”
The AI Metrics Paradox is real, and it’s breaking our measurement systems.
Since GitHub Copilot, Cursor, and other AI assistants became standard tools, traditional productivity metrics have become actively misleading. We’re measuring what’s easy to count (lines of code, commits, PR velocity) while completely missing what actually matters (code quality, maintainability, developer confidence, sustainable pace).
The data backs this up. Research in 2026 shows that only 20% of teams are effectively measuring AI’s impact on engineering work. The other 80% are either not measuring it at all, or worse, using metrics that incentivize the wrong behaviors.
Here’s what I’m seeing in practice:
Our team’s GitHub activity metrics show 40% more commits per developer. Sounds great, right? But when I dug into the data, I found that bug-fix commits increased 85% and refactoring commits tripled. We’re not building more features faster - we’re creating more cleanup work.
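One rough way to surface this kind of commit-mix shift is to bucket commit messages by type. This sketch assumes messages follow the Conventional Commits “type: subject” convention; the prefixes and bucketing are illustrative assumptions, not a description of our actual tooling:

```python
from collections import Counter

def commit_mix(messages):
    """Bucket commit messages by their Conventional Commits prefix.

    Anything without a recognized 'fix:', 'feat:', or 'refactor:'
    prefix lands in 'other'. Returns each bucket's share of the total.
    """
    buckets = Counter()
    for msg in messages:
        prefix = msg.split(":", 1)[0].strip().lower()
        kind = prefix if prefix in {"fix", "feat", "refactor"} else "other"
        buckets[kind] += 1
    total = sum(buckets.values())
    return {kind: count / total for kind, count in buckets.items()}

mix = commit_mix([
    "fix: null check in parser",
    "feat: add export endpoint",
    "fix: race in cache eviction",
    "refactor: extract retry logic",
])
# mix["fix"] == 0.5 — half the "activity" here is cleanup, not features
```

Running something like this over two quarters of history makes the “more commits” headline number much less flattering.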
Cycle time (commit to production) improved by 30%, but our mean time to recovery (MTTR) worsened by 45%. We’re shipping faster, but we’re also breaking things more often and taking longer to fix them. The net effect on customer experience? Negative.
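The MTTR side of that comparison is simple to compute once you have incident timestamps. A minimal sketch, where the (opened, resolved) pair shape is an assumption rather than any particular incident tracker’s schema:

```python
from datetime import datetime, timedelta

def mttr(incidents):
    """Mean time to recovery: average (resolved - opened) across incidents.

    `incidents` is a list of (opened, resolved) datetime pairs — an
    assumed shape; in practice you'd pull these from an incident tracker.
    """
    durations = [resolved - opened for opened, resolved in incidents]
    return sum(durations, timedelta()) / len(durations)

incidents = [
    (datetime(2026, 1, 3, 9, 0), datetime(2026, 1, 3, 10, 30)),   # 1.5 h
    (datetime(2026, 1, 7, 14, 0), datetime(2026, 1, 7, 16, 30)),  # 2.5 h
]
# mttr(incidents) == timedelta(hours=2)
```

Tracking this next to cycle time, rather than instead of it, is what exposed the trade-off for us.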
One team celebrated hitting 100% of their sprint commitments for three quarters straight. Their incident count quadrupled in the same period. They were gaming velocity metrics by deferring quality work and accumulating technical debt.
The uncomfortable truth is that AI tools are amplifying our output without necessarily improving our outcomes. Developers can write code faster, but they’re also producing more bugs, creating more technical debt, and experiencing more cognitive overload from reviewing AI-generated suggestions.
What we need is a fundamental rethinking of engineering metrics for the AI era.
Traditional frameworks like DORA give us deployment frequency, lead time for changes, change failure rate, and time to restore service. These are still useful, but they don’t capture the full picture when developers are working alongside AI assistants.
What should we be measuring instead?
First, quality metrics need equal weight to velocity metrics. Not just bug counts, but time spent on unplanned work, technical debt trends, production incident severity, and customer-reported issues. If velocity is up but quality is down, that’s not productivity - it’s just technical debt accumulation with extra steps.
Second, developer confidence and cognitive load. Are engineers confident in the code they’re shipping? Do they understand the AI-generated suggestions they’re accepting? Or are they blindly accepting because they’re under pressure to move fast? We need to measure “time to confident deploy” not just “time to deploy.”
Third, value creation over code production. Did the code we shipped actually move business metrics? Are we solving customer problems or just generating commits? This requires connecting engineering metrics to product outcomes and customer satisfaction.
Fourth, sustainable pace indicators. Are people working longer hours to maintain higher velocity? Is on-call load increasing? Are code review quality standards slipping? Productivity that requires unsustainable effort isn’t productivity - it’s burnout waiting to happen.
I started tracking some of these alternative metrics with my team, and the results were eye-opening. When we measured “confident deploys” (code shipped where the engineer feels confident it will work), our apparent productivity dropped 25%, but our actual bug rate dropped 60%. We were shipping less, but shipping better.
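Operationally, “confident deploys” is just a ratio once you capture a confidence signal at ship time. A minimal sketch, assuming each deploy record carries a self-reported boolean (a hypothetical schema — any one-click prompt in your deploy flow could populate it):

```python
def confident_deploy_rate(deploys):
    """Fraction of deploys the shipping engineer flagged as confident.

    Assumes each deploy record has a boolean 'confident' field set by
    the engineer at deploy time — an illustrative schema, not a
    specific tool's API.
    """
    if not deploys:
        return 0.0
    return sum(d["confident"] for d in deploys) / len(deploys)

rate = confident_deploy_rate([
    {"confident": True},
    {"confident": True},
    {"confident": False},
    {"confident": True},
])
# rate == 0.75
```

The signal is subjective, but trended over time and paired with the bug rate, it told us more than any velocity chart did.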
The hardest part hasn’t been collecting better data - it’s been communicating these insights to leadership who are used to seeing nice upward-trending velocity charts. How do you tell executives that the productivity gains they’re celebrating might be illusory?
This is where I’m struggling and why I’m bringing this to the community. How are others navigating this measurement crisis?
What metrics are you tracking that actually reflect engineering effectiveness in the age of AI assistants?
How do you balance velocity with quality when leadership is focused on shipping faster?
Have you found ways to measure cognitive load, developer confidence, or sustainable pace that actually drive better decisions?
I can’t be the only one feeling like our productivity dashboards are lying to us. We need new frameworks, new metrics, and new ways of thinking about engineering effectiveness that account for how fundamentally AI has changed the way we work.
Would love to hear what’s working (and not working) for others wrestling with this challenge.