We're Tracking Lines of Code and PR Count, But Are These the Right AI Productivity Metrics?

The discussions Maya and Luis started have surfaced a critical question that I think every engineering org needs to grapple with: If traditional metrics don’t capture AI’s real impact, what should we actually be measuring?

The Measurement Problem

Here’s what I’m seeing across my teams: We’re drowning in data but starving for insight.

Metrics We’re Tracking (And They’re Not Helping):

  • Lines of code written
  • PR count and merge rate
  • Story points completed per sprint
  • Individual commit frequency
  • Time to “code complete”

The Problem:
All of these metrics are up with AI usage. So by these measures, we’re “more productive.”

But as Luis showed, bug volume is up. As Michelle showed, full-cycle time is worse. As we’re all seeing, actual value delivered to customers is flat or down.

We’re measuring activity, not outcomes. And that’s dangerous.

The Three-Level Measurement Framework

I’ve been working with our data analytics team to rethink how we measure AI productivity impact. We’re using a three-level framework:

Level 1: Developer Experience (Individual)

What we measure:

  • Perceived productivity (survey: “Do you feel AI makes you more productive?”)
  • Tool satisfaction and engagement
  • Time saved on specific tasks (self-reported)
  • Learning and skill development

Current data:

  • 87% say AI makes them feel more productive
  • 4.2/5 satisfaction with AI tools
  • Self-reported 3.8 hours saved per week
  • But… 34% report feeling less confident in their code understanding

Interpretation:
Developer experience is positive on balance. AI makes work feel better. But there are early warning signs about skill development and code understanding.

Level 2: Team Effectiveness (Delivery System)

What we measure:

  • Full cycle time (idea → production stable)
  • DORA metrics:
    • Deployment frequency
    • Lead time for changes
    • Change failure rate
    • Mean time to recovery
  • Code review time and quality
  • Defect escape rate (bugs reaching production)
  • Rework percentage (time spent fixing vs building)
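
Two of these are straightforward to compute from delivery records. A minimal sketch (the record fields and numbers are hypothetical, not from any real tool):

```python
from dataclasses import dataclass

# Hypothetical deployment record; field names are illustrative.
@dataclass
class Deployment:
    lead_time_hours: float  # commit to running in production
    failed: bool            # triggered a rollback or incident

def change_failure_rate(deploys):
    # Fraction of deployments that caused a production failure.
    return sum(d.failed for d in deploys) / len(deploys)

def rework_percentage(hours_fixing, hours_total):
    # Share of engineering time spent fixing rather than building.
    return hours_fixing / hours_total

deploys = [Deployment(24, False), Deployment(30, True),
           Deployment(18, False), Deployment(40, True)]

print(f"Change failure rate: {change_failure_rate(deploys):.0%}")  # 2 of 4 deployments
print(f"Rework: {rework_percentage(38, 100):.0%}")                 # 38 of 100 hours
```

The point of automating these two in particular: they are the counterweight metrics that catch regressions the activity metrics hide.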

Current data:

  • Cycle time: +11% (worse) :chart_decreasing:
  • Deployment frequency: +3% (flat)
  • Lead time: +8% (worse)
  • Change failure rate: +14% (worse) :chart_decreasing:
  • MTTR: About flat
  • Review time: +47% (worse)
  • Defect escape: +22% (worse) :chart_decreasing:
  • Rework: 38% of time (up from 24%)

Interpretation:
Team delivery effectiveness has declined. We’re not shipping faster or more reliably—we’re shipping more bugs and spending more time on rework.

Level 3: Business Outcomes (Value)

What we measure:

  • Features delivered to production
  • Feature adoption rate (% of users using new features)
  • Customer satisfaction (NPS for new features)
  • Time-to-value (from idea to customer impact)
  • Revenue impact of shipped features
  • Customer-reported bug/issue rate

Current data:

  • Features shipped: +12%
  • Feature adoption: -3% (slightly worse)
  • NPS for new features: +1 (flat)
  • Time-to-value: +15% (worse) :chart_decreasing:
  • Revenue impact: Roughly flat per feature
  • Customer issues: +18% (worse) :chart_decreasing:

Interpretation:
We’re shipping more features, but they’re delivering less value per feature, taking longer to reach customers, and generating more support burden.

The Disconnection

Here’s what’s fascinating (and concerning): The three levels tell completely different stories.

Level 1 (Developer): :white_check_mark: AI is great! I feel faster and more satisfied.

Level 2 (Team): :warning: We’re slower, buggier, spending more time on rework.

Level 3 (Business): :cross_mark: Value delivery hasn’t improved and might be getting worse.

This is the perception-reality gap Maya identified, but at an organizational scale.

Why Traditional Metrics Fail for AI

I think traditional engineering metrics fail for AI productivity because they were designed for a different bottleneck.

Traditional assumption: Coding speed is the constraint.
Modern reality: Requirements clarity, architectural decisions, coordination, and quality are the constraints.

AI makes the non-constraint faster. But as Theory of Constraints teaches us, optimizing a non-constraint doesn’t improve system throughput—it just creates inventory (in our case, code awaiting review, bugs awaiting fixes, features awaiting adoption).

What We Should Measure Instead

Based on what I’m learning, here are metrics that actually matter:

For Individual Developers:

  • Time spent on high-value activities (architecture, design, problem-solving vs debugging, rework)
  • Code understanding (can you explain and defend your code?)
  • Learning velocity (are you developing new skills or relying on AI as a crutch?)
  • Code quality (defects per KLOC, security issues, maintainability)

For Teams:

  • Flow efficiency (value-add time / total cycle time)
  • Quality metrics (defect density, change failure rate, security findings)
  • Waste metrics (rework time, code churn, abandoned work)
  • Collaboration effectiveness (code review quality, knowledge sharing)
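
The flow-efficiency ratio above is simple enough to sanity-check by hand (the numbers below are hypothetical):

```python
def flow_efficiency(value_add_hours, total_cycle_hours):
    # Flow efficiency: share of elapsed cycle time spent actively adding value.
    return value_add_hours / total_cycle_hours

# Hypothetical feature: 80 elapsed working hours idea-to-done,
# of which only 20 were active design/coding/review; the rest was waiting.
print(f"Flow efficiency: {flow_efficiency(20, 80):.0%}")  # 25%
```

If most of the cycle is wait time (reviews queued, decisions pending, releases blocked), accelerating the active coding slice barely moves this ratio, which is exactly the Theory of Constraints point above.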

For Organizations:

  • Time-to-value (idea → customer impact)
  • Customer outcomes (adoption, satisfaction, problem resolution)
  • Business impact (revenue, retention, growth attributable to features)
  • Strategic progress (are we advancing our platform, or just churning?)

The Uncomfortable Truth

When I look at this data with our exec team, the conclusion is hard to avoid:

By the metrics that actually matter—team effectiveness and business outcomes—AI hasn’t made us more productive. It might have made us less productive.

But by the metrics that are easy to measure and feel good—individual activity and velocity—AI looks great.

We’re at risk of optimizing for the wrong things.

Questions for the Community

  1. What metrics are you tracking? Are they measuring activity or outcomes?

  2. How do you balance qualitative and quantitative data? Surveys say one thing, delivery metrics say another—which do you trust?

  3. What’s a realistic measurement framework that doesn’t require a data science PhD but actually captures AI’s impact?

  4. How do you communicate this to non-technical leadership? Executives see “PR count up 60%!” and want to know why we’re not shipping more value.

  5. Are there leading indicators we should track? Metrics that predict future productivity impact before it shows up in delivery?

Where I’m Headed

I’m moving toward a dashboard that shows all three levels side by side:

  • Developer experience (important for retention)
  • Team effectiveness (important for delivery)
  • Business outcomes (important for value)

And being honest when they don’t align. If developers feel great but outcomes are poor, that’s a signal we need to investigate, not ignore.

The goal isn’t to prove AI is good or bad—it’s to understand where and how it creates value, and where it doesn’t, so we can make informed decisions.

What measurement approaches are working for y’all? I’d love to learn from teams that have figured this out better than we have.


Sources: DORA Report and Developer Metrics, 2025 Stack Overflow Developer Survey - AI

Keisha, this three-level framework is exactly what I’ve been trying to articulate to my engineering partners. As someone living in the Level 3 world (business outcomes), I can’t tell you how frustrating it is to see Level 1 metrics used to justify AI investments when Level 3 results aren’t materializing.

The Product Perspective on Measurement

Here’s my framework for thinking about this (coming from a product/business lens):

The Iron Triangle of Product Development:

  1. Build the right thing (effectiveness)
  2. Build it well (quality)
  3. Build it efficiently (speed)

AI is helping with #3 (maybe). It’s not helping—and might be hurting—#1 and #2.

But guess what matters most to customers and the business? #1 and #2.

What I’m Measuring (And Why)

In product, we’ve learned the hard way that output metrics are vanity metrics. Here’s what I actually track:

Discovery and Validation Metrics (Pre-Build)

Time in discovery:

  • How long from problem identified → solution validated
  • AI impact: None (AI doesn’t help with customer research)

Idea validation rate:

  • What % of ideas survive customer validation
  • AI impact: Potentially negative (easier to skip validation when building is fast)

Iteration cycles before build:

  • How many prototype/test cycles before committing to build
  • AI impact: Down (we’re skipping straight to build)

Build Metrics (During Development)

Time to first usable version:

  • How long until we can put something in front of users
  • AI impact: Down ~20% (genuinely faster)

Scope creep rate:

  • How much does scope expand during development
  • AI impact: Up significantly (engineers add “quick AI-generated features”)

Outcome Metrics (Post-Launch)

Feature adoption rate:

  • % of target users who use the feature within 30 days
  • AI impact: -8% (worse)
  • This is the killer metric: we’re shipping features fewer people want

Time-to-value realized:

  • How long until customers report value from the feature
  • AI impact: +22% (worse; taking longer)
  • Why: More bugs, more iterations, more “it almost works”

NPS impact:

  • How much does the feature move NPS
  • AI impact: -2 points (worse)
  • More bugs, less polished experiences

Support ticket volume:

  • How many support tickets does the feature generate
  • AI impact: +34% (worse)
  • AI-generated code = more edge cases and bugs

Feature stickiness:

  • % of users who use feature repeatedly
  • AI impact: -6% (worse)
  • We’re shipping features that don’t stick

The Business Reality

Here’s what this looks like from a business perspective:

Investment:

  • AI tool costs: $50-80 per engineer per month
  • For our team of 25 engineers: ~$20K/year
  • Plus: Training, infrastructure, security tooling: ~$50K/year
  • Total AI investment: ~$70K/year

Returns:

  • Features shipped: +12% (3 additional features)
  • Revenue impact of those 3 features: ~$40K/year (lower adoption)
  • Support costs: +$30K/year (more bugs)
  • Rework costs: +$80K/year in engineering time (using Luis’s data)
  • Net impact: roughly -$140K/year once the AI investment is included :grimacing:

We’re paying for AI tools and getting negative ROI at the business level.

Why This Is Happening

I think the fundamental issue is what Michelle identified: We’re optimizing a non-constraint.

In product development, the constraints are:

  1. Understanding the problem (customer research)
  2. Designing the right solution (product design)
  3. Making trade-off decisions (prioritization)
  4. Coordinating across functions (alignment)
  5. Validating with users (iteration)

Coding speed is #6 or #7 on the constraint list.

Making #6 faster doesn’t help if #1-5 are still slow. It just means we build the wrong thing faster.

The Metrics That Actually Predict Success

Based on analysis of our last 30 features, here are the metrics that actually correlated with feature success:

Strong Positive Correlation with Adoption:

  • Time spent in customer discovery (r = 0.74)
  • Number of design iterations before build (r = 0.68)
  • Cross-functional alignment scores (r = 0.71)
  • QA thoroughness (test coverage) (r = 0.65)

Weak or Negative Correlation:

  • Development speed (r = 0.12)
  • Lines of code (r = -0.08)
  • PR count (r = 0.05)

In other words: The metrics AI improves don’t predict success. The metrics AI doesn’t touch do predict success.
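
For anyone who wants to run the same analysis on their own feature history, correlations like these are just Pearson's r over per-feature pairs. A self-contained sketch with made-up data:

```python
import math

def pearson_r(xs, ys):
    # Pearson correlation coefficient between two equal-length samples.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical per-feature data: days of customer discovery vs 30-day adoption %.
discovery_days = [2, 5, 1, 8, 4, 10]
adoption_pct   = [20, 28, 10, 40, 18, 48]

print(f"r = {pearson_r(discovery_days, adoption_pct):.2f}")
```

With only 30 features the confidence intervals on these coefficients are wide, so treat them as directional evidence rather than proof.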

What I’m Advocating For

When I talk to engineering leaders, here’s what I push for:

1. Outcome-Based Success Criteria

Before starting any feature:

  • Define what customer outcome we’re trying to achieve
  • Define how we’ll measure it
  • Define what success looks like

Then measure whether we achieved it. Period.

2. Time-to-Validated-Value

Not “time to code complete” or even “time to production.”

Measure: Time from idea → validated customer value.

This includes:

  • Discovery time
  • Build time
  • Iteration time
  • Adoption time

AI might speed up build, but if it slows down the others, it’s not helping.
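
The stage times compose additively, which is why a build-only speedup has a bounded effect on the total (an Amdahl's-law-style argument; the durations below are hypothetical):

```python
# Hypothetical stage durations in days, idea -> validated customer value.
stages = {"discovery": 10, "build": 15, "iteration": 12, "adoption": 20}

total = sum(stages.values())
build_share = stages["build"] / total
# Even a 30% faster build only shortens the whole journey by ~8%.
overall_saving = (0.30 * stages["build"]) / total

print(f"Total: {total} days; build is {build_share:.0%} of it; "
      f"a 30% build speedup saves {overall_saving:.0%} overall")
```

And if the faster build adds iteration time downstream, even that modest saving can evaporate.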

3. Customer-Centric Quality Metrics

  • Bug reports from customers (not QA—customers!)
  • Feature ratings and feedback
  • Support ticket volume
  • Customer satisfaction with feature

These capture the real quality impact.

4. Product-Market Fit Indicators

  • Adoption rate
  • Usage frequency
  • Feature stickiness
  • Customer retention impact
  • Revenue impact

These tell you if you built something that matters.

The Uncomfortable Conversation with Engineering

Here’s a conversation I had with our VP Eng last week:

Engineering: “We shipped 20% more features this quarter using AI!”

Me: “But adoption is down 8%, support costs are up 30%, and NPS hasn’t moved. Did we actually deliver more value?”

Engineering: “Well, we can’t control whether customers adopt the features…”

Me: “Actually, we can—by building better features. And building better features requires discovery, design, and quality—the things AI doesn’t help with.”

This was a tough conversation. But it needed to happen.

My Recommendations

For product and engineering leaders working together:

1. Align on What Matters

Stop celebrating activity (PRs, features shipped). Start celebrating outcomes (adoption, satisfaction, value).

2. Measure Full Value Stream

Discovery → Build → Launch → Adoption → Impact

AI might speed up one part. Measure the whole thing.

3. Be Honest About Trade-Offs

If AI makes building faster but quality worse, acknowledge that trade-off. Don’t pretend speed is free.

4. Invest Savings Wisely

If AI saves time, invest it in discovery and design—not in building more half-validated features.

5. Link Metrics to Business Impact

Show executives how delivery metrics connect to business outcomes. Make it impossible to celebrate vanity metrics.

Keisha, your three-level framework is perfect. I’d add one more level: Level 4 - Customer & Business Impact. Because ultimately, that’s what our jobs depend on.

If we’re shipping faster but customers aren’t happier and the business isn’t growing, we’re just burning money efficiently. :fire::money_bag:


Context: Best AI Coding Agents for 2026 - Real-World Reviews

David, that ROI analysis is brutal but exactly the kind of honest accounting we need. Keisha, your three-level framework should be standard practice for every engineering org evaluating AI productivity claims.

Let me add the CTO strategic perspective on measurement and share our approach.

The Strategic Measurement Question

As CTO, I have to answer to the board and investors. They ask: “We’re investing in AI tools—what’s the return?”

For the first 6 months, I showed them the easy metrics:

  • PR count up 45%
  • Developer satisfaction up 40%
  • “Productivity gains” estimated at 25%

The board loved it. More investment approved.

Then I did what David just did: Full cost-benefit analysis using outcome metrics.

The results were sobering, and I had to go back to the board and say: “The metrics I showed you were misleading. Here’s the real picture.”

Our Full ROI Analysis

Investment (Annual):

  • AI tool licenses (120 engineers): $96K
  • Infrastructure for AI governance (security scanning, extended CI/CD): $180K
  • Training and enablement: $50K
  • Additional QA capacity to handle increased bug volume: $200K
  • Total: $526K

Returns:

  • Features delivered: +10% (from 80 to 88 features)
  • Estimated revenue from additional features: $450K
  • BUT: Defect-related costs up $280K
  • Rework costs up ~$400K in engineering time
  • Net business value: -$756K

Negative ROI of 144%. We lost $1.44 for every dollar invested.
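
For anyone checking the arithmetic, the figures quoted above reconcile; a short script restating them:

```python
# Figures quoted above (annual, USD).
investment = 96_000 + 180_000 + 50_000 + 200_000  # licenses, governance, training, QA
revenue_gain = 450_000
defect_costs = 280_000
rework_costs = 400_000

net_value = revenue_gain - defect_costs - rework_costs - investment
roi = net_value / investment  # net business value per dollar invested

print(f"Investment: ${investment:,}; net value: ${net_value:,}; ROI: {roi:.0%}")
```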

This got the board’s attention.

Leading vs Lagging Indicators

David’s correlation analysis is spot-on. Let me add some thoughts on leading vs lagging indicators for AI productivity:

Lagging Indicators (What We Usually Measure):

  • Features shipped
  • Deployment frequency
  • Revenue impact
  • Customer satisfaction

These tell you what happened, but by the time you know, the damage is done.

Leading Indicators (What We Should Watch):

  • Code review feedback quality (declining = quality issues brewing)
  • Time spent in rework (increasing = efficiency problems)
  • Security scan findings (increasing = vulnerability debt accumulating)
  • Engineer confidence in code (declining = skill development problem)
  • Cross-functional friction (increasing = trust erosion)

The leading indicators started flashing warning signs 3-4 months before the lagging indicators showed the full impact.

If I’d been watching them, I could have course-corrected earlier.

The Balanced Scorecard Approach

After that board meeting, I implemented a balanced scorecard for AI productivity:

Dimension 1: Developer Experience

  • Satisfaction: 8.2/10 :white_check_mark:
  • Perceived productivity: 7.8/10 :white_check_mark:
  • Engagement: 7.5/10 :white_check_mark:
  • Skill development: 6.1/10 :warning:

Dimension 2: Engineering Effectiveness

  • DORA deployment frequency: -2% :cross_mark:
  • DORA lead time: +9% :cross_mark:
  • DORA change failure rate: +15% :cross_mark:
  • DORA MTTR: +3% :warning:

Dimension 3: Quality & Security

  • Defect density: +28% :cross_mark:
  • Security findings: +45% :cross_mark:
  • Code review cycles: +35% :cross_mark:
  • Technical debt velocity: +22% :cross_mark:

Dimension 4: Business Outcomes

  • Customer feature adoption: -5% :cross_mark:
  • Customer satisfaction (NPS): +1 (flat) :warning:
  • Time-to-customer-value: +18% :cross_mark:
  • Support burden: +31% :cross_mark:

Dimension 5: Economics

  • Engineering cost per feature: +12% :cross_mark:
  • ROI on AI investment: -144% :cross_mark:
  • Opportunity cost (time on rework): High :cross_mark:

Overall Score: only 3 of 19 metrics improved :grimacing:
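
Tallying a scorecard like this is mechanical once each metric has a status. A sketch encoding the statuses listed above ("ok" = improved, "warn" = flat/at-risk, "bad" = declined):

```python
from collections import Counter

# Status of each metric from the five dimensions above.
statuses = {
    "satisfaction": "ok", "perceived_productivity": "ok", "engagement": "ok",
    "skill_development": "warn",
    "deployment_frequency": "bad", "lead_time": "bad",
    "change_failure_rate": "bad", "mttr": "warn",
    "defect_density": "bad", "security_findings": "bad",
    "review_cycles": "bad", "tech_debt_velocity": "bad",
    "feature_adoption": "bad", "nps": "warn", "time_to_value": "bad",
    "support_burden": "bad",
    "cost_per_feature": "bad", "ai_roi": "bad", "rework_opportunity_cost": "bad",
}

tally = Counter(statuses.values())
print(f"improved={tally['ok']}, warning={tally['warn']}, declined={tally['bad']}")
```

Having the tally in code (or a dashboard query) keeps the headline number honest as individual metrics move.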

What This Taught Me

1. Vanity Metrics Are Seductive

It’s easy to show metrics that look good. It’s hard to show metrics that matter.

As a leader, I have a responsibility to show the hard truth, not the easy story.

2. System Thinking Is Critical

Optimizing one part of the system (coding speed) without considering the whole (value delivery) is worse than doing nothing—because you invest resources and get negative returns.

3. The Hawthorne Effect Is Real

People perform better when they know they’re being measured and when they’re excited about new tools. Some of the initial “AI productivity gains” were probably just Hawthorne effect—people trying harder because AI was new and exciting.

As the novelty wears off, the real signal emerges. And for us, the real signal is: marginal gains at best, significant costs.

4. We Need Better Measurement Literacy

Most engineering orgs don’t have robust measurement practices. We measure what’s easy (activity) not what’s important (outcomes).

AI is exposing this weakness. If we can’t measure real productivity, we can’t determine if AI helps.

The Framework I’m Using Now

North Star Metric: Time-to-validated-customer-value
(From idea → customer using feature and reporting value)

Supporting Metrics (All Must Improve):

  • DORA metrics (especially change failure rate)
  • Customer outcomes (adoption, satisfaction, retention)
  • Engineering health (quality, rework rate, morale)
  • Economic efficiency (cost per outcome, not cost per output)

AI-Specific Metrics:

  • Where is AI being used? (What types of tasks)
  • What’s the quality difference? (AI code vs human code)
  • What’s the full-cycle impact? (Including rework)
  • What’s the skill development impact? (Are engineers learning?)

Recommendations for CTOs and VPs

1. Don’t Trust Early Metrics

First 3-6 months of AI adoption will show inflated benefits due to novelty and Hawthorne effect. Wait for the signal to emerge.

2. Measure Multiple Dimensions

Developer experience alone is insufficient. Quality alone is insufficient. You need all dimensions.

3. Be Prepared to Show Negative Results

If the data shows AI isn’t helping (or is hurting), you need to communicate that honestly to leadership—even if it’s uncomfortable.

4. Connect Metrics to Business Outcomes

Executives care about revenue, growth, customer satisfaction. Show how engineering metrics connect to those business outcomes.

5. Build Measurement Infrastructure

Invest in tools and practices to measure what matters. This pays dividends beyond AI evaluation.

6. Create Feedback Loops

Measurement without action is waste. Use metrics to drive decisions and improvements.

Where We’re Going

Based on our analysis, we’re shifting strategy:

From: “Use AI to code faster”
To: “Use AI selectively where it creates value without downstream costs”

Specific Changes:

  • AI encouraged for: boilerplate, test generation, documentation
  • AI discouraged for: security-sensitive code, complex business logic, architectural decisions
  • All AI code requires enhanced review
  • Measuring quality and full-cycle time, not just coding speed

Early results (2 months in): Quality metrics improving, full-cycle time decreasing, developer satisfaction still high.

We’re optimizing for outcomes, not activity. And the board appreciates the honesty and rigor.

Keisha, David, Luis, Maya—this community discussion is the kind of honest, data-driven conversation our industry desperately needs.

Thank you for having the courage to question the hype and look at what’s really happening. :folded_hands:


Related: Top 100 Developer Productivity Statistics with AI Tools

This whole thread is making my brain hurt in the best way. :exploding_head: Keisha, David, Michelle—y’all are dropping some serious knowledge about measurement and I’m taking notes.

From a design/IC perspective, let me add thoughts on what we can actually measure at the individual contributor level that might be useful leading indicators.

The Designer’s Perspective on Quality Measurement

In design, we learned something important: You can’t inspect quality in—you have to design it in.

Same with code, right? If you’re measuring quality at the end (bug reports, security findings), you’ve already lost. You need to measure the behaviors and practices that create quality.

Traditional Approach:

  1. Build feature
  2. Test feature
  3. Count bugs
  4. Fix bugs
  5. Repeat

Quality-First Approach:

  1. Understand problem thoroughly
  2. Design elegant solution
  3. Implement thoughtfully
  4. Test continuously
  5. Ship confidently

AI enables the traditional approach (build fast, fix later). We need to use it to support the quality-first approach.

Individual-Level Leading Indicators

Here are metrics I track for myself (and might be useful for engineers):

Understanding Depth:

  • Can I explain every decision I made?
  • Can I defend the approach without referring to “AI suggested it”?
  • Do I understand the edge cases and failure modes?

Design Quality:

  • How many refactors did this code need before feeling “right”?
  • Does the API/interface feel intuitive?
  • Will my future self understand this in 6 months?

Impact Thinking:

  • Did I validate the problem before building the solution?
  • Did I consider alternatives or just implement the first thing that worked?
  • Am I building for the user’s need or just following a spec?

Learning Growth:

  • Did I learn something new from this work?
  • Did I rely on AI as a teacher or as a crutch?
  • Can I do this better next time without AI?

These are hard to measure quantitatively but crucial for long-term craft development.

The “Can You Explain It?” Test

Maya’s suggestion in the earlier thread resonated: If you can’t write the tests without AI, you don’t understand the code.

I’d extend that: If you can’t explain why your code is correct to a junior engineer, you don’t understand it well enough to ship it.

This becomes a quality gate that’s easy to implement but hard to fake.

Time Allocation as a Metric

One thing I track manually: Where does my time actually go?

High-Value Time:

  • Problem understanding and research
  • Design and architecture thinking
  • Code review and mentoring
  • Learning and skill development

Low-Value Time:

  • Fighting with syntax and configuration
  • Googling error messages
  • Copying boilerplate
  • Context switching

Ideally, AI should reduce low-value time and free up high-value time.

But if I’m honest with my time tracking: AI has reduced some low-value time (syntax, boilerplate) but also added low-value time (debugging AI code, fixing AI mistakes, reviewing AI suggestions).

And I’m not sure the time saved exceeds the time cost.
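
An honest version of that accounting, in the spirit of the time tracking above (all numbers hypothetical):

```python
# Hypothetical weekly hours from honest time tracking.
time_saved_by_ai = {"syntax_and_config": 2.0, "boilerplate": 2.0}
time_added_by_ai = {"debugging_ai_code": 2.5, "reviewing_suggestions": 1.5}

net_hours = sum(time_saved_by_ai.values()) - sum(time_added_by_ai.values())
print(f"Net weekly time impact of AI: {net_hours:+.1f} hours")  # +0.0 -> a wash
```

If your own ledger nets out near zero, the "3.8 hours saved per week" self-reports are probably counting only the left-hand column.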

The Satisfaction Paradox

Here’s something weird I’ve noticed: I feel more satisfied when I use AI, but I feel more proud of work I do without AI.

With AI:

  • Lots of quick wins
  • Constant progress feeling
  • Dopamine hits from accepting suggestions
  • But… kind of hollow? Like I’m assembling IKEA furniture, not building custom woodwork

Without AI:

  • Slower
  • More frustrating moments
  • But deeper satisfaction when it works
  • Pride in craft and understanding

I wonder if this satisfaction paradox shows up in others’ experiences?

What I Actually Care About (And How to Measure It)

At the end of the day, here’s what I want from my work:

  1. Solving real problems → Measure: user feedback, problem resolution
  2. Creating elegant solutions → Measure: code review feedback, maintainability
  3. Learning and growing → Measure: skill development, mastery
  4. Making an impact → Measure: feature adoption, customer satisfaction

None of these are improved by typing faster or generating more code.

AI might support these goals if used thoughtfully. Or it might undermine them if used carelessly.

My Personal Metric Framework

I started tracking my own “personal productivity scorecard” (yes, I’m that person :sweat_smile:):

Weekly Review Questions:

  • What problems did I solve this week? (not: what code did I write)
  • What did I learn? (specific skills or insights)
  • What feedback did I get on code quality?
  • How much time did I spend on high-value vs low-value work?
  • Am I proud of what I shipped?

Monthly Review:

  • Are my solutions getting more elegant or just faster?
  • Am I growing my skills or relying more on AI?
  • Is my work having impact?

Quarterly Review:

  • What would I do differently knowing what I know now?
  • What patterns am I seeing in my successful vs unsuccessful work?

This qualitative reflection catches things that quantitative metrics miss.

Recommendations for Individual Contributors

If you’re an engineer or designer trying to figure out if AI is really making you more productive:

1. Track Your Time Honestly

Where does time actually go? Include AI debugging and review time.

2. Measure Understanding, Not Output

Can you explain and defend your work? That’s productivity.

3. Track Learning

Are you developing skills or developing AI dependency?

4. Get Feedback

What do reviewers say about your code quality? Is it improving or declining?

5. Measure Impact

Did your work solve the problem? Did users adopt it? Did it create value?

6. Reflect Regularly

Weekly/monthly reflection on what’s working and what’s not.

The individual-level metrics ladder up to Keisha’s team metrics and David’s business metrics.

If individuals aren’t truly productive (in terms of impact and growth), teams can’t be effective, and businesses can’t deliver value.

Thanks for this incredible discussion, y’all. I’m rethinking how I use AI and how I measure my own work. :light_bulb::sparkles:


Further reading: MIT Technology Review - Rise of AI Coding