If Lines of Code and PR Velocity Don't Capture AI Productivity, What Metrics Actually Matter?

Following up on Michelle’s thread about the AI productivity paradox, I want to propose a concrete measurement framework.

Keisha mentioned we need multi-dimensional measurement—I agree completely. But most organizations don’t know where to start or what to actually track.

Here’s what I’m experimenting with in our fintech engineering org.

The Problem with Current Metrics

Most teams track what’s easy, not what’s meaningful:

Vanity Metrics (easy to measure, misleading):

  • AI acceptance rate (“developers accepted 73% of AI suggestions!”)
  • Token consumption (“we used 2M tokens this month!”)
  • Lines of AI-generated code
  • Time saved in coding (based on developer self-reporting)

These tell us people are using the tools. They don’t tell us if it matters.

A Three-Layer Framework

I’m proposing we measure AI productivity across three layers, from input to outcome:

Layer 1: Input Metrics (Are we using AI?)

Track adoption and usage:

  • % of developers actively using AI tools weekly
  • % of code commits with AI assistance
  • Developer satisfaction scores with AI tools

Purpose: Understand adoption patterns and identify blockers
Warning: High usage doesn’t mean high value

Layer 2: Process Metrics (Is the work moving differently?)

Track how work flows through the system:

  • End-to-end cycle time: Time from commit to production across different work types
  • DORA metrics: Deployment frequency, lead time for changes, change failure rate, MTTR
  • Queue times: How long work waits in review, testing, deployment stages
  • Rework rate: How often AI-generated code needs significant revision

Purpose: Identify bottlenecks and understand where AI helps vs hinders
Key insight: If cycle time doesn’t improve despite faster coding, the bottleneck is elsewhere
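As a concrete sketch, the core Layer 2 numbers can be computed from per-change deployment records. The `Deployment` fields here are assumptions about what your delivery pipeline can export, not any specific tool’s schema:

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Deployment:
    first_commit_at: datetime  # first commit of the change
    deployed_at: datetime      # when the change reached production
    failed: bool               # caused an incident or rollback?

def layer2_metrics(deploys: list[Deployment]) -> dict:
    """Lead time (days, commit -> production) and change failure rate."""
    lead_days = [(d.deployed_at - d.first_commit_at).total_seconds() / 86400
                 for d in deploys]
    return {
        "lead_time_days": round(mean(lead_days), 1),
        "change_failure_rate": sum(d.failed for d in deploys) / len(deploys),
    }
```

Run the same function over AI-assisted and non-AI changes separately and the cohort comparison in the next section falls out directly.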

Layer 3: Outcome Metrics (Are we delivering more value?)

Track business impact:

  • Feature delivery rate: Customer-facing capabilities shipped per sprint (not story points)
  • Time-to-value: Days from customer request to production solution
  • Quality outcomes: Defect rates, security findings, performance metrics
  • Business KPI movement: Revenue, activation, retention, engagement tied to engineering work

Purpose: Connect engineering activity to business results
Key insight: This is what actually matters, but it’s hardest to measure

Specific Implementation

Here’s what we’re tracking weekly:

AI-assisted work:

  • Cycle time: 8.2 days average (commit → production)
  • Change failure rate: 12%
  • Rework rate: 18% (code needing significant post-AI revision)

Non-AI work (control group):

  • Cycle time: 9.1 days average
  • Change failure rate: 8%
  • Rework rate: 7%

Interpretation: AI-assisted work moves through the pipeline ~10% faster end-to-end, but it introduces more failures and requires more rework. The net benefit is marginal, nowhere near the 45% individual gains we expected.
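Computing that split is easy to automate once each shipped change carries an AI-assistance flag; how you set the flag (commit trailers, IDE telemetry) is the hard part, and the records below are purely hypothetical:

```python
from statistics import mean

# Hypothetical per-change records; `ai_assisted` tagging is up to your tooling.
changes = [
    {"ai_assisted": True,  "cycle_days": 8.0, "failed": True,  "reworked": False},
    {"ai_assisted": True,  "cycle_days": 8.4, "failed": False, "reworked": True},
    {"ai_assisted": False, "cycle_days": 9.1, "failed": False, "reworked": False},
]

def cohort_summary(changes: list[dict], ai: bool) -> dict:
    """Aggregate the three weekly metrics for one cohort."""
    cohort = [c for c in changes if c["ai_assisted"] == ai]
    n = len(cohort)
    return {
        "cycle_days": round(mean(c["cycle_days"] for c in cohort), 1),
        "change_failure_rate": sum(c["failed"] for c in cohort) / n,
        "rework_rate": sum(c["reworked"] for c in cohort) / n,
    }
```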

What This Reveals

The measurement itself is revealing problems:

  1. Review bottleneck: AI code spends 35% longer in review than human-written code
  2. Testing challenges: AI-generated code has a higher defect escape rate (fewer defects are caught pre-production)
  3. Type-dependent value: AI helps significantly with isolated bug fixes, marginally with new features

We’re learning that AI productivity is highly context-dependent.

The Measurement Warning

Here’s the trap: research shows that “without changes in measurement and governance, productivity gains rarely translate into business outcomes.”

You can’t just measure. You also need to:

  • Act on what you learn (fix bottlenecks AI exposes)
  • Adjust processes (review, testing, deployment for AI era)
  • Train teams (how to use AI well, how to review AI code)

Measurement without action is just surveillance.

Questions for the Community

  1. What are you measuring to track AI productivity?
  2. Are you seeing different results for different types of work (bugs vs features vs refactoring)?
  3. How are you connecting engineering metrics to business outcomes?
  4. What measurement mistakes have you made that others should avoid?

I’m particularly interested in Layer 3 metrics—connecting engineering work to business KPIs. That’s where we’re struggling most.

Luis, your Layer 3 (Outcome Metrics) is exactly where Product needs engineering to go—but it’s also the hardest layer to implement well.

The Attribution Challenge

Here’s what makes Layer 3 so difficult: How do you attribute business outcomes to engineering work, let alone to AI-assisted engineering work?

Example from our product analytics:

  • User activation rate improved 8% last quarter
  • Engineering shipped 3 features during that time: onboarding redesign, email integration, performance improvements
  • Which feature drove the improvement? All three? None of them? (Maybe it was the marketing campaign that happened simultaneously)

We can’t isolate the AI variable when we can’t even isolate the engineering variable.

What We’re Trying: Feature-Level Instrumentation

Instead of measuring “overall business impact of AI,” we’re trying to track impact at the feature level:

For each shipped feature:

  1. Hypothesis: What customer behavior should change?
  2. Instrumentation: What metrics will we track?
  3. Timeline: How long before we expect to see impact?
  4. Attribution: Was this AI-assisted or traditional development?

Then we compare AI-assisted vs traditional features on:

  • Time to ship (Layer 2)
  • Customer impact (Layer 3)
  • Quality/stability post-launch
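A sketch of what the per-feature record and comparison might look like. The field names are illustrative, not from any particular analytics stack:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FeatureRecord:
    name: str
    hypothesis: str            # what customer behavior should change
    metric: str                # what we instrument to test the hypothesis
    review_after: date         # when impact should be measurable
    ai_assisted: bool          # AI-assisted or traditional development?
    days_to_ship: int          # Layer 2: time to ship
    adoption_vs_target: float  # Layer 3: 1.0 = met expectations
    bugs_first_two_weeks: int  # post-launch stability

def group_averages(features: list[FeatureRecord], ai: bool) -> dict:
    """Average each comparison axis for one group (AI-assisted or traditional)."""
    group = [f for f in features if f.ai_assisted == ai]
    n = len(group)
    return {
        "days_to_ship": sum(f.days_to_ship for f in group) / n,
        "adoption_vs_target": sum(f.adoption_vs_target for f in group) / n,
        "bugs_first_two_weeks": sum(f.bugs_first_two_weeks for f in group) / n,
    }
```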

Early (Inconclusive) Findings

After 6 weeks of tracking:

AI-assisted features:

  • Shipped 15% faster on average
  • Customer adoption rates: Mixed (2 above expectations, 3 below, 1 neutral)
  • Post-launch bugs: 2.3× higher in first two weeks

Traditional features:

  • Slower to ship
  • Customer adoption: Also mixed (can’t detect meaningful difference yet)
  • Fewer immediate bugs, but also fewer features (so absolute bug count similar)

Problem: Sample size too small to draw conclusions. We need 6+ months of data.

The Uncomfortable Reality

Luis, your framework is exactly right in structure. But I think the honest answer is: Layer 3 is too noisy to measure AI impact in the short term.

Too many variables:

  • Feature complexity varies wildly
  • Market conditions change
  • Product-market fit shifts
  • Multiple features ship simultaneously
  • Customer needs evolve

Maybe the better question is: Can we measure Layer 2 (process) well enough that we trust it correlates with Layer 3 (outcomes)?

If AI improves cycle time, reduces defects, and maintains quality—can we assume that translates to business value over time, even if we can’t prove it feature-by-feature?

My Proposal: Proxy Metrics

Instead of trying to tie every engineering change to a business KPI, what if we measure:

Customer-facing velocity:

  • New capabilities available to users per month
  • Customer-requested features shipped vs requested
  • Time from customer feedback to production resolution

Product health:

  • User-reported bugs per 1000 active users
  • Feature usage rates (are we shipping things people use?)
  • Customer satisfaction scores (NPS, CSAT) trends

Delivery consistency:

  • Predictability of delivery timelines
  • Percentage of commitments met
  • Variance in sprint delivery

These are closer to business outcomes than “PRs merged,” but more measurable than “did this specific AI-generated code improve revenue?”

Thoughts? Is this close enough to Layer 3, or am I cheating by using proxies?

Luis, this framework is excellent—and it’s highlighting a problem we should have solved before rolling out AI tools.

We Don’t Have Baseline Metrics

Here’s my embarrassing admission: We didn’t measure DORA metrics, cycle time, or defect rates before introducing AI.

So now we can’t answer the basic question: “Are we better than we were before AI?”

We have anecdotal evidence (developers say they’re faster) and high-level intuition (feature delivery feels the same), but no actual data comparison.

This is a failure of leadership—mine specifically.

The Cold Start Problem

For organizations in our position—which I suspect is most organizations—how do you start measuring productivity improvement when you don’t have a baseline?

Option 1: Cohort comparison

  • Track AI users vs non-AI users within the same org
  • Compare their cycle times, quality metrics, delivery rates
  • Problem: Self-selection bias (early AI adopters are often already high-performers)

Option 2: Before/after with lag

  • Start measuring everything now
  • Wait 6 months to establish current-state baseline
  • Compare next 6 months to this baseline
  • Problem: 12 months before you can make any comparison; the business won’t wait that long

Option 3: Proxy estimation

  • Use git history and JIRA data to retroactively estimate past cycle times
  • Establish rough baseline from historical data
  • Compare to current forward-looking metrics
  • Problem: Retrospective data is always messier and less reliable
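For Option 3, a minimal sketch: assume you’ve already extracted, per merged PR, the first-commit date and the merge date from git history (e.g. via `git log`; the extraction itself is omitted here), then:

```python
from datetime import datetime
from statistics import median

def historical_cycle_days(prs: list[tuple[datetime, datetime]]) -> float:
    """Median days from first commit to merge -- a rough proxy for past cycle time.

    Note: this measures merge time, not production time, so it undercounts
    true cycle time wherever deployment lagged merge. Treat it as a floor.
    """
    return median((merged - first).total_seconds() / 86400
                  for first, merged in prs)
```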

I don’t love any of these options.

What I’m Actually Doing

We’re doing a hybrid:

  1. Start tracking everything now (Layer 1, 2, 3 metrics)
  2. Use cohort comparison (AI heavy users vs light users)
  3. Focus on improvement trends rather than absolute productivity claims

Instead of saying “AI made us 26% more productive,” we’ll say “teams using AI heavily show 15% better cycle time improvement quarter-over-quarter compared to light users.”

It’s not a clean before/after, but it’s better than nothing.
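The trend framing is straightforward to compute; the quarterly numbers below are made up just to show the shape of the comparison:

```python
def avg_qoq_improvement(cycle_days_by_quarter: list[float]) -> float:
    """Average quarter-over-quarter % reduction in cycle time (positive = faster)."""
    deltas = [(prev - cur) / prev * 100
              for prev, cur in zip(cycle_days_by_quarter,
                                   cycle_days_by_quarter[1:])]
    return sum(deltas) / len(deltas)

heavy_users = [10.0, 9.0, 8.1]   # hypothetical cycle times (days) per quarter
light_users = [10.0, 9.8, 9.6]
```

On these made-up numbers, heavy users improve roughly 10% per quarter versus roughly 2% for light users, which is the shape of claim we’d report instead of an absolute productivity figure.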

The Investment Question

David’s point about noisy Layer 3 metrics is important. But here’s the executive reality:

The board doesn’t care about cycle time improvements. They care about business outcomes.

If I can’t connect AI investment to business KPIs, I’ll lose budget—regardless of how well-instrumented my Layer 2 metrics are.

So we have to attempt Layer 3 measurement, even if it’s imperfect. The alternative is making decisions blind or losing the resources to make decisions at all.

My Ask to the Community

For those who did establish baselines before AI adoption: What did you measure, and how are you using that data now?

For those in my position (no baseline, need to show ROI): What proxy metrics or comparison approaches have actually worked for you?

I’m trying to avoid “productivity theater”—where we measure things that look impressive but don’t actually tell us if we’re creating more value.

But I also need to avoid “measurement paralysis”—where we wait for perfect data and never make decisions.

What’s the pragmatic middle ground?

This thread is so valuable. Luis’s framework + David’s attribution challenges + Michelle’s baseline problem = the reality most of us are living in.

Adding a Fourth Layer: Organizational Health

I want to propose we add organizational health metrics to Luis’s three-layer framework, because I’m seeing AI impact culture in ways that don’t show up in cycle time or business KPIs.

Layer 4: Organizational Health Metrics

  • Developer satisfaction and engagement
  • Knowledge sharing and learning velocity
  • Team collaboration quality
  • Innovation capacity (time spent on exploratory work vs feature factory)
  • Retention and attrition rates

Here’s why this matters:

The Silent Risk

We can optimize for Layer 2 (faster cycle time) and Layer 3 (business outcomes) while quietly destroying Layer 4 (organizational sustainability).

Example from our org:

Senior engineers are getting frustrated. They feel like code review has become “checking AI’s homework” instead of mentoring junior developers. The learning conversations that used to happen during PR reviews aren’t happening anymore.

One of my tech leads said: “Junior devs are using AI as a crutch. They’re not learning architecture patterns because AI gives them something that works. Then they can’t debug it when it breaks.”

Result: Faster code today, but weaker engineering capability tomorrow.

This shows up in Layer 4 (engagement, learning) long before it shows up in Layer 2 or 3.

The Innovation Question

I listed “time spent on exploratory work” under Layer 4 above because I think it’s critical.

If AI makes us more efficient at executing known patterns, but we use that efficiency to cram in more feature tickets, are we accidentally reducing innovation time?

We should track:

  • % of engineering time spent on learning, experimentation, R&D
  • Rate of technical proposals and architectural discussions
  • Cross-pollination (engineers contributing outside their primary area)

AI should create space for higher-value work, not just accelerate the feature treadmill.

The Equity Dimension

Michelle, your cohort comparison approach has a hidden bias risk:

High-performing developers adopt AI early and use it well. Lower-performing developers adopt later and struggle more.

If we compare these cohorts and conclude “AI makes people 15% more productive,” we might actually be measuring the productivity difference between high and low performers, not the AI impact.

Worse: AI might be amplifying existing performance gaps rather than creating universal productivity gains.

Are we tracking:

  • Productivity improvement rates across different skill levels?
  • Whether AI is helping struggling teams or widening the gap?
  • Impact on underrepresented groups (who may have less access to learning resources)?

If AI benefits high performers more than struggling teams, it’s a strategic risk, not just a measurement problem.

My Modified Framework

Layer 1: Adoption (Are we using it?)
Layer 2: Process (Is work moving differently?)
Layer 3: Outcomes (Are we delivering more value?)
Layer 4: Sustainability (Are we building capability or burning it down?)

All four matter. Optimizing one at the expense of others creates long-term problems.

Pragmatic Measurement for Layer 4

You don’t need fancy tools:

  • Pulse surveys (monthly, 3 questions): Are you learning? Are you energized? Do you feel productive?
  • Retention tracking (quarterly): Are we losing talent? Who’s leaving and why?
  • Promotion velocity (annually): Are people growing into new roles at healthy rates?
  • Innovation metrics (quarterly): Technical proposals submitted, architectural discussions held, R&D projects launched

These are all measurable with existing data. We just have to decide they’re as important as cycle time.
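For instance, the monthly pulse can be scored with nothing fancier than a mean per question; the response format here is an assumption, and the trend month over month matters more than any single level:

```python
from statistics import mean

# Hypothetical pulse responses: each engineer rates three questions on a 1-5 scale.
responses = [
    {"learning": 4, "energized": 3, "productive": 5},
    {"learning": 2, "energized": 3, "productive": 4},
]

def pulse_summary(responses: list[dict]) -> dict:
    """Mean score per question across all respondents."""
    return {q: round(mean(r[q] for r in responses), 2)
            for q in responses[0]}
```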

The Uncomfortable Truth

Sustainable productivity requires balancing all four layers.

If we hit our Layer 3 targets (business outcomes) by burning out engineers (Layer 4), we’ll hit a cliff next year when everyone leaves.

If we optimize Layer 2 (process efficiency) without Layer 3 (business value), we’re busy but directionless.

The goal isn’t maximum productivity. It’s sustainable, value-creating, capability-building productivity.

That’s a much harder thing to measure, but it’s what actually matters for long-term success.

Are others thinking about organizational health as part of AI productivity measurement, or am I overthinking this?