If 93% of Developers Use AI But Productivity Only Increased 10%, What Should We Actually Be Measuring?

Google reports that 25% of their code is now AI-assisted, yet they only see ~10% engineering velocity gains. The math doesn’t add up.

Either:

  1. AI-generated code requires disproportionate review/debugging time (the hidden cost)
  2. We’re measuring the wrong outputs (velocity isn’t the right metric)

I suspect it’s #2. We’re optimizing for activity (code written) when we should be optimizing for outcomes (value delivered).

The Metrics That Lie

Most teams measure AI impact using:

  • Lines of code written: Meaningless (more code ≠ better code)
  • PRs merged per week: Activity not outcomes
  • Cycle time: Can decrease while quality suffers
  • AI tool usage rate: Correlation not causation

These are output metrics that don’t tell you if you’re building the right things or building things right.

What We Should Measure Instead

I’ve been thinking about this through a product management lens. Here’s my proposed framework:

Level 1: Input Metrics (Are People Using AI?)

  • AI tool adoption rate (what % of engineers use it?)
  • Training completion rate (did they learn how?)
  • Engagement (how often do they use it?)

Purpose: Tells you if adoption is happening, not if it’s working.

Level 2: Output Metrics (Is More Code Being Produced?)

  • Code written per engineer
  • PRs created per week
  • Features shipped per sprint

Purpose: Tells you if activity increased, not if value increased.

Level 3: Outcome Metrics (Is Better Software Being Delivered?)

  • Time to value: Idea → production → customer validation
  • Code quality: Defect rate, post-deployment incidents
  • Customer satisfaction: NPS, support tickets, adoption
  • Team health: Developer satisfaction, retention, onboarding time

Purpose: Tells you if you’re delivering more value to customers.

The critical insight: Level 3 metrics matter most, but most teams only measure Level 1-2.

The Prevented Disasters Problem

Here’s what’s hard to measure: value created by problems that didn’t happen.

AI tools provide value through:

  • Bugs caught before production
  • Security issues identified in review
  • Performance problems spotted early
  • Architecture mistakes prevented

How do you measure a bug that never made it to customers? A security breach that never happened?

This is real value but nearly impossible to quantify.

The Option Value Challenge

Another measurement gap: AI creates option value—the ability to pivot, experiment, and respond to opportunities.

Example:

  • Engineer uses AI to prototype new approach in 2 hours instead of 2 days
  • Learns it won’t work, abandons it
  • Traditional metrics: “wasted 2 hours on dead-end”
  • Option value perspective: “validated hypothesis 16 hours faster”

We’re not measuring this learning velocity benefit.

What Google’s 10% Gain Actually Means

Let’s unpack that Google data point:

  • 25% of code is AI-assisted
  • Only 10% velocity improvement

This suggests:

  1. AI speeds up some parts of the process (writing code)
  2. But doesn’t speed up others (requirements, review, testing, deployment)
  3. So end-to-end cycle time improves modestly

If coding is 30% of the software delivery cycle, and AI makes it 50% faster:

  • 30% × 50% = 15% total cycle time improvement
  • Factor in review overhead, testing, etc.
  • Net result: ~10% velocity gain

The math actually checks out. AI optimizes one part of a multi-stage process.
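That back-of-the-envelope math is easy to sanity-check in a few lines. A minimal sketch (the function name and the 5-point overhead figure are illustrative assumptions, not from the post):

```python
def cycle_time_saved(coding_share, coding_speedup, overhead=0.0):
    """Fraction of end-to-end cycle time saved when only the
    coding stage gets faster.

    coding_share   -- fraction of the delivery cycle spent coding
    coding_speedup -- fraction of coding time AI eliminates
    overhead       -- cycle-time fraction added back by extra review/testing
    """
    return coding_share * coding_speedup - overhead

# The post's numbers: coding is 30% of the cycle, AI makes it 50% faster,
# and an assumed 5 points go back into review and testing overhead.
print(round(cycle_time_saved(0.30, 0.50, overhead=0.05), 2))  # → 0.1
```

The general lesson is Amdahl's law in miniature: speeding up one stage can never save more than that stage's share of the whole cycle.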

The Quality-Velocity Tradeoff

What if the real story isn’t “AI only improved productivity 10%” but “AI improved sustainable productivity 10%”?

Velocity without quality is just accumulated technical debt.

Better question: How much does AI improve velocity while maintaining or improving quality?

Metrics I’d track:

  • Sustainable velocity: Features shipped that don’t require follow-up bug fixes
  • Rework rate: % of PRs that need subsequent fixes
  • Customer-facing quality: Incidents, performance regressions, user complaints

If AI increases velocity 30% but defect rate also increases 30%, you haven’t actually improved productivity—you’ve just shifted work from coding to debugging.

My Proposed Measurement Framework

Here’s what I’d recommend tracking:

Developer Experience:

  • Satisfaction with AI tools (7/10 or higher?)
  • Perceived productivity (do engineers feel more effective?)
  • Learning velocity (faster ramping on new domains?)

Engineering Efficiency:

  • Lead time for changes (commit → production)
  • Deployment frequency (shipping more often?)
  • Change failure rate (fewer incidents?)
  • Time to restore service (faster recovery?)

Business Outcomes:

  • Feature adoption (are customers using what we built?)
  • Customer satisfaction (NPS, retention)
  • Engineering retention (are engineers staying?)
  • Recruiting (can we attract better talent?)

Connect the dots: AI training → faster cycle time → more experiments → better product-market fit → revenue growth.

Questions for the Community

  1. What are you measuring? And is it actually telling you if AI is working?
  2. How do you measure prevented disasters? Bugs that never shipped, issues caught in review?
  3. What’s the right timeframe for ROI? Should we expect results in 3 months? 12 months? 24 months?
  4. Are qualitative metrics undervalued? Developer satisfaction, team morale, etc.?

I’m convinced most teams are measuring the wrong things. We need to shift from output metrics (code written) to outcome metrics (value delivered).

What’s everyone else seeing in their organizations?

This is the right question. Most orgs are drowning in vanity metrics while missing what actually matters.

The Input-Output-Outcome Framework

Your three-level framework is spot-on. Let me add specificity:

Input Metrics (Are we investing?)

  • Training completion: 75%+
  • Tool adoption: 60%+ daily active
  • Budget allocated: $X per engineer

Output Metrics (Are we producing?)

  • Code volume: Lines/PRs per week
  • Cycle time: Commit → PR → merge
  • Review time: Hours in review

Outcome Metrics (Are we delivering value?)

  • DORA Four Keys: Lead time, deploy frequency, change failure rate, MTTR
  • Customer metrics: Feature adoption, NPS, support tickets
  • Team health: Retention, satisfaction, onboarding time

Most companies measure inputs and outputs, then wonder why business impact is unclear.

The Prevented Disasters Measurement Challenge

You’re right that counterfactuals are hard. Here’s how we approach it:

1. Track Near-Misses

  • Bugs caught in code review (that would’ve shipped without AI)
  • Security issues flagged by AI (that manual review missed)
  • Performance problems identified before production

Tag these in your issue tracker: “prevented-by-AI”
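If you do tag them, the monthly count is trivial to pull. A sketch assuming a hypothetical issue schema (a `labels` list and a `created` date; real trackers expose this differently):

```python
from collections import Counter
from datetime import date

def prevented_per_month(issues, label="prevented-by-AI"):
    """Count issues carrying the tag, grouped by calendar month."""
    return Counter(
        issue["created"].strftime("%Y-%m")
        for issue in issues
        if label in issue["labels"]
    )

issues = [
    {"created": date(2025, 4, 3), "labels": ["prevented-by-AI", "security"]},
    {"created": date(2025, 4, 20), "labels": ["prevented-by-AI"]},
    {"created": date(2025, 5, 2), "labels": ["bug"]},
]
print(prevented_per_month(issues))  # → Counter({'2025-04': 2})
```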

2. A/B Test Where Possible

  • Team A uses AI tools, Team B doesn’t
  • Compare defect rates, incident rates, customer satisfaction
  • Requires discipline but provides clean signal

3. Qualitative Case Studies

  • Document 5-10 “AI prevented this disaster” stories
  • Use these for storytelling to executives
  • Numbers + narratives = compelling case

Example from our org:

  • AI flagged SQL injection vulnerability in PR
  • Engineer had missed it, would’ve shipped to production
  • Potential data breach prevented
  • Value: Immeasurable (could’ve been millions + reputation damage)

One prevented breach justifies 10 years of AI tool investment.

Sustainable Velocity: The Real Metric

I love your framing: sustainable velocity matters more than raw velocity.

Traditional velocity: Features shipped per sprint
Sustainable velocity: Features shipped that don’t require follow-up fixes

Track:

  • Rework rate: % of PRs followed by bug fix PRs within 30 days
  • Incident rate: Production issues per 100 deployments
  • Customer-reported bugs: Issues customers find (vs we find)

If AI increases velocity 30% but rework rate increases 30%, you’re running in place.
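That 30-day rework definition is concrete enough to compute directly. A sketch over a hypothetical PR record shape (an `id`, a merge date, and a `fixes` field pointing at the feature PR a bug fix addresses):

```python
from datetime import date

def rework_rate(prs, window_days=30):
    """Fraction of feature PRs that needed a bug-fix PR within
    window_days of merging. A PR with fixes=None is a feature PR;
    otherwise `fixes` names the feature PR it patches.
    """
    features = {p["id"]: p for p in prs if p["fixes"] is None}
    reworked = set()
    for p in prs:
        target = features.get(p["fixes"])
        if target and (p["merged"] - target["merged"]).days <= window_days:
            reworked.add(target["id"])
    return len(reworked) / len(features)

prs = [
    {"id": 1, "merged": date(2025, 3, 1), "fixes": None},
    {"id": 2, "merged": date(2025, 3, 2), "fixes": None},
    {"id": 3, "merged": date(2025, 3, 10), "fixes": 1},  # fix inside window
    {"id": 4, "merged": date(2025, 5, 1), "fixes": 2},   # fix after window
]
print(rework_rate(prs))  # → 0.5
```

Linking a bug fix back to the feature it patches is the hard part in practice; self-reported tags or issue-tracker links are common approximations.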

Our data (18 months of AI tool usage):

  • Velocity increased 12%
  • Rework rate decreased 8%
  • Net sustainable velocity: ~20% improvement

That’s real productivity.

Time to Value > Cycle Time

Cycle time measures “commit to production” but misses:

  • Time from idea to commit (product definition, design)
  • Time from production to customer validation (adoption, feedback)

Time to value measures end-to-end: idea → validated customer outcome.

This is the metric executives should care about because it’s tied to business results.

At my company:

  • Cycle time (commit → prod): Improved 15% with AI
  • Time to value (idea → validation): Improved 8%

Why the gap? Because AI doesn’t speed up product decisions or customer adoption.

ROI Timeframe: 12-18 Months Minimum

To your question #3: expect 12-18 months before measurable business impact.

Compare to other platform investments:

  • Kubernetes: 18-24 months to mature adoption
  • Microservices: 24-36 months to full benefits
  • DevOps transformation: 18-30 months to elite DORA metrics

AI tool adoption is organizational habit change, not feature deployment. It takes time.

Milestone timeline:

  • Q1-Q2: Training, early adoption (input metrics)
  • Q3-Q4: Usage scaling, quality metrics (output metrics)
  • Year 2: Business impact visible (outcome metrics)

Manage expectations: This is a marathon, not a sprint.

Qualitative Metrics Are Undervalued

Your question #4: Absolutely yes.

Numbers tell you “what happened” but stories tell you “why it matters.”

We track:

  • Developer satisfaction: Quarterly survey, 1-10 scale
  • Anecdotal wins: Monthly “AI success stories” in all-hands
  • Manager observations: Do reports seem more productive/engaged?

Last quarter’s survey insight:

  • 78% of engineers say AI makes them “more effective”
  • But when asked what changed: “I can tackle new problems faster” (learning velocity)

That’s real value that doesn’t show up in code metrics.

My Measurement Stack

Here’s our actual dashboard:

Weekly (Inputs & Outputs):

  • AI tool usage rate
  • PRs created, cycle time
  • Code review time

Monthly (Outcomes):

  • DORA metrics
  • Defect rate, incident rate
  • Developer satisfaction (pulse survey)

Quarterly (Business Impact):

  • Feature velocity (shipped to customers)
  • Customer satisfaction (NPS)
  • Engineering retention
  • Recruiting funnel health

Connect these with narrative: “AI tools → faster cycle time → more experiments → better features → higher NPS → revenue growth.”

The Real Answer to Google’s 10% Paradox

Your math is right. If coding is 30% of the process and AI speeds it up 50%:

  • 30% × 50% = 15% potential improvement
  • Minus review overhead, testing, etc.
  • Net: ~10% actual improvement

But here’s the reframe: 10% sustained productivity improvement is massive.

If you could make your entire engineering org 10% more effective every year:

  • Year 1: 1.10x productivity
  • Year 3: 1.33x productivity
  • Year 5: 1.61x productivity

Compound that across 100s of engineers over multiple years. That’s enormous leverage.
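The compounding is worth spelling out, since the numbers above come straight from raising 1.10 to the year:

```python
# Sustained 10% annual productivity improvement, compounded.
for year in (1, 3, 5):
    print(f"Year {year}: {1.10 ** year:.2f}x")
# Year 1: 1.10x
# Year 3: 1.33x
# Year 5: 1.61x
```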

The problem isn’t that 10% is too small. It’s that expectations were set unrealistically high (“AI will make us 10x faster!”).

Recalibrate: 10% sustained improvement is the goal, not the disappointment. :bar_chart:

Measurement is hard. Here’s what we’re actually tracking and what we’ve learned:

Proxy Metrics When Direct Measurement is Hard

You’re right that prevented disasters and option value are nearly impossible to measure directly.

Our approach: Track proxy metrics that correlate with value.

Instead of “bugs prevented” (can’t measure), we track:

  • Code review rejection rate: % of PRs sent back for revision
  • Review comments per PR: More comments = more issues caught
  • Time in review: Longer review = more thorough (up to a point)

These don’t directly measure “prevented bugs” but they correlate with code quality gatekeeping.

Our Actual Metrics (18 Months of Data)

Here’s what we measure and what changed:

Before AI Tools:

  • Cycle time: 4.2 days (commit → production)
  • Defect rate: 3.2 bugs per 100 deployments
  • Review time: 8 hours/week per senior engineer
  • Developer satisfaction: 7.1/10

After AI Tools (18 months):

  • Cycle time: 3.6 days (15% improvement)
  • Defect rate: 3.5 bugs per 100 deployments (8% worse)
  • Review time: 13 hours/week per senior engineer (62% more)
  • Developer satisfaction: 7.8/10 (10% improvement)

Interpretation:

  • Shipping faster :white_check_mark:
  • Quality slightly degraded :cross_mark:
  • Senior engineers working harder :warning:
  • Team happier overall :white_check_mark:

Net: Still positive but not a slam dunk. We’re iterating on training to improve the quality aspect.

A/B Testing: One Team With AI, One Without

We tried this for 6 months:

  • Team A (Platform, 12 engineers): AI tools
  • Team B (Data, 10 engineers): No AI tools

Results:

  • Team A shipped 18% more features
  • Team B had 12% fewer post-deploy issues
  • Team A’s senior engineers complained about review burden
  • Team B felt “left behind” and requested AI tools

Lesson: Hard to run clean experiments when cultural factors interfere. Team B’s morale drop from “not having the shiny tools” was a confound.

We ended the experiment and gave everyone AI tools.

Qualitative > Quantitative (Sometimes)

Your question about undervaluing qualitative metrics: Yes.

Most impactful feedback we got wasn’t from dashboards, it was from interviews:

Engineer quote: “AI doesn’t make me write code faster. It makes me braver about tackling unfamiliar domains.”

This is learning velocity—an outcome we weren’t measuring but is arguably more valuable than code velocity.

How do you measure “engineers are more willing to take on stretch projects”?

We started tracking:

  • % of engineers working outside their primary domain each quarter
  • Self-reported confidence in new tech (survey question)
  • Cross-team collaboration frequency

These aren’t perfect but they’re better than ignoring qualitative value.

Comparing AI-Assisted vs Non-AI-Assisted Work

One thing we do: Tag PRs as AI-assisted or not (self-reported).

Then compare:

  • First-time review pass rate: 87% (AI) vs 92% (non-AI)
  • Review cycles: 2.1 (AI) vs 1.6 (non-AI)
  • Defect rate (30-day): 4.1% (AI) vs 2.8% (non-AI)

This tells us: AI-assisted code requires more review and has a higher defect rate (consistent with research).

But we also see:

  • AI-assisted PRs tackle harder problems (median complexity higher)
  • Engineers use AI for unfamiliar domains more (learning)

So it’s not apples-to-apples. AI-assisted code is lower quality for the same problem, but engineers are tackling harder problems with AI.
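The tag-and-compare approach is simple to script once PRs carry the self-reported flag. A sketch with illustrative field names (not an actual tracker API):

```python
from statistics import mean

def compare_by_tag(prs):
    """Compare review burden and 30-day defect rate for
    AI-assisted vs non-AI-assisted PRs."""
    summary = {}
    for name, group in (
        ("ai", [p for p in prs if p["ai_assisted"]]),
        ("non_ai", [p for p in prs if not p["ai_assisted"]]),
    ):
        if not group:
            continue
        summary[name] = {
            "first_pass_rate": mean(p["review_cycles"] == 1 for p in group),
            "avg_review_cycles": mean(p["review_cycles"] for p in group),
            "defect_rate": mean(p["defect_within_30d"] for p in group),
        }
    return summary

prs = [
    {"ai_assisted": True,  "review_cycles": 2, "defect_within_30d": True},
    {"ai_assisted": True,  "review_cycles": 1, "defect_within_30d": False},
    {"ai_assisted": False, "review_cycles": 1, "defect_within_30d": False},
    {"ai_assisted": False, "review_cycles": 1, "defect_within_30d": False},
]
print(compare_by_tag(prs))
```

As the thread notes, raw comparisons like this ignore problem difficulty; controlling for PR complexity is what makes the comparison honest.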

Time Horizon: Expect 12+ Months

Your question about timeframe: We saw meaningful impact at 12 months, not before.

Month 1-3: Training, adoption chaos, metrics worse
Month 4-6: Usage stabilizes, metrics return to baseline
Month 7-12: Quality improves as engineers learn best practices
Month 12+: Sustained improvements visible

If we’d measured at Month 6, we would’ve concluded AI tools failed. But we were in the “trough of disillusionment” part of the hype cycle.

Patience required.

What We’re NOT Measuring (But Probably Should)

Gaps in our measurement:

  • Option value: Experiments run that informed decisions (even if code was discarded)
  • Documentation quality: AI-generated docs might be more comprehensive
  • Onboarding time: New engineers ramp faster with AI-assisted learning

We’re adding these to our Q2 measurement plan.

My Main Lesson: Measure Multiple Levels, Connect the Dots

Don’t just track inputs OR outputs OR outcomes. Track all three and show the connections:

“AI training → higher tool usage → faster cycle time → more experiments → better product decisions → higher feature adoption → revenue growth”

This narrative + data combo is what convinces executives to keep investing.

And yes, 10% sustained productivity improvement is huge. Set expectations accordingly. :chart_increasing:

Adding the people and culture dimension that’s missing from this metrics discussion:

Developer Experience Metrics Matter Most

Code metrics (velocity, cycle time, defect rate) are important but incomplete.

People metrics often predict business outcomes better:

  • Developer satisfaction
  • Engineering retention
  • Recruiting funnel health
  • Diversity of hires

Why? Because happy, engaged engineers:

  • Stay longer (knowledge retention)
  • Attract better talent (recruiting flywheel)
  • Ship better products (intrinsic motivation)

AI tools impact these in ways that don’t show up in code metrics.

What We Measure (People-Focused)

Quarterly Developer Satisfaction Survey:

  • “How satisfied are you with your work?” (1-10)
  • “Do AI tools make you more effective?” (Yes/No/Unsure)
  • “What can you do now that you couldn’t before?” (Open-ended)

Engineering Retention:

  • Voluntary attrition rate
  • Exit interview themes (“Why did you leave?”)
  • Stay interview themes (“Why are you staying?”)

Onboarding Time:

  • Time to first PR (new hire → first meaningful contribution)
  • Time to full productivity (self-reported confidence)
  • Mentorship burden on senior engineers

AI tools should improve these. If they don’t, something’s wrong.
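Of these, time to first PR is the easiest to automate. A sketch, assuming you can export each hire's start date and first merged-PR date (hypothetical data shape):

```python
from datetime import date
from statistics import median

def median_weeks_to_first_pr(hires):
    """Median weeks from start date to first merged PR.
    `hires` is a list of (start_date, first_pr_date) pairs."""
    return median((pr - start).days / 7 for start, pr in hires)

hires = [
    (date(2025, 1, 6), date(2025, 1, 27)),  # 3 weeks
    (date(2025, 2, 3), date(2025, 2, 17)),  # 2 weeks
    (date(2025, 3, 3), date(2025, 3, 10)),  # 1 week
]
print(median_weeks_to_first_pr(hires))  # → 2.0
```

Median beats mean here: one hire who starts on a long-running project would otherwise skew the whole cohort.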

Our Data (24 Months)

Before AI tools (2024):

  • Developer satisfaction: 7.2/10
  • Voluntary attrition: 14%
  • Time to first PR: 3.2 weeks
  • Mentorship hours: 8 hours/week per senior

After AI tools (2026):

  • Developer satisfaction: 7.9/10 (10% improvement)
  • Voluntary attrition: 10% (29% reduction)
  • Time to first PR: 1.8 weeks (44% faster)
  • Mentorship hours: 6 hours/week per senior (25% reduction)

These are the outcomes that matter to me as a VP Eng: happier engineers, lower turnover, faster onboarding.

Code velocity is nice, but retention and satisfaction drive long-term success.

Learning Velocity: The Hidden Benefit

@eng_director_luis mentioned this and it’s so important.

Engineers with AI tools:

  • Tackle unfamiliar domains more confidently
  • Learn new languages/frameworks faster
  • Ask for stretch assignments more often

This is learning velocity—how fast your team acquires new capabilities.

How to measure:

  • % of engineers working outside primary domain (cross-training)
  • Self-reported confidence in new tech (survey)
  • Number of “I learned X using AI” stories in retrospectives

At my company:

  • Before AI: 25% of engineers worked outside primary domain in a given quarter
  • After AI: 43% worked outside primary domain

This skill expansion is huge for organizational flexibility.
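That cross-domain percentage is mechanical to compute if you record, per engineer, a primary domain and the domains they touched in the quarter (a hypothetical schema, illustrating the metric rather than any particular HR system):

```python
def cross_domain_rate(engineers):
    """Fraction of engineers who worked outside their primary domain.
    `engineers` maps name -> (primary_domain, set of domains worked)."""
    outside = sum(
        1 for primary, worked in engineers.values() if worked - {primary}
    )
    return outside / len(engineers)

engineers = {
    "ana": ("backend", {"backend", "infra"}),
    "ben": ("frontend", {"frontend"}),
    "chi": ("data", {"data", "ml", "backend"}),
    "dev": ("infra", {"infra"}),
}
print(cross_domain_rate(engineers))  # → 0.5
```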

Career Development Impact

Unexpected benefit: AI tools are changing career progression.

Before AI:

  • Juniors spent 1-2 years writing CRUD code (learning fundamentals)
  • Graduated to complex features after proving competence

After AI:

  • Juniors can generate CRUD code day one (with supervision)
  • Spend more time on architectural thinking and system design

This accelerates career development but requires different mentorship:

  • Less “here’s how to write this code”
  • More “here’s how to think about this system”

We’re tracking:

  • Time to promotion (IC1 → IC2, IC2 → IC3)
  • Self-reported readiness for next level
  • Manager assessment of growth trajectory

Early signs: Engineers are progressing faster but with different skill profiles than before.

Diversity and Inclusion Considerations

One thing I watch carefully: Are AI tools improving or harming diversity outcomes?

Questions I ask:

  • Do underrepresented engineers use AI at the same rate? (Adoption gap?)
  • Does AI-assisted code review have bias? (Lower approval rates for some groups?)
  • Are performance expectations changing in ways that disadvantage certain populations?

So far our data shows no significant disparities, but I’m monitoring closely.

Measuring What Executives Care About

At the end of the day, I need to show:

  • CFO: Is this worth the cost? (ROI, retention, efficiency)
  • CEO: Are we shipping better products? (Customer satisfaction, revenue)
  • Board: Are we building a stronger team? (Retention, recruiting, culture)

My dashboard for executives:

Engineering Health (People):

  • Developer satisfaction: 7.9/10 :white_check_mark:
  • Retention: 90% annual :white_check_mark:
  • Time to full productivity: 6 weeks :white_check_mark:

Engineering Productivity (Delivery):

  • DORA lead time: 3.2 days :white_check_mark:
  • Deploy frequency: 12/week :white_check_mark:
  • Change failure rate: 2.1% :white_check_mark:

Business Impact (Outcomes):

  • Feature velocity: +15% YoY :white_check_mark:
  • Customer NPS: 42 → 48 :white_check_mark:
  • Engineering cost per feature: -12% :white_check_mark:

This connects people metrics → productivity metrics → business metrics.

The 10% Paradox Reframed

To your Google question: 25% AI code, 10% velocity gain isn’t a paradox. It’s a sign of healthy measurement.

If it were 25% AI code, 50% velocity gain, I’d be suspicious:

  • Are we cutting corners?
  • Is quality suffering?
  • Are we creating technical debt?

10% sustained productivity improvement while maintaining quality is exactly what you want.

It’s compounding gains:

  • 10% more features shipped
  • 10% less turnover (knowledge retention)
  • 10% faster onboarding (scale faster)

Over 5 years, that’s 1.6x better outcomes. Massive.

My Core Advice

Don’t measure AI tools in isolation. Measure holistically:

  • Are engineers happier? (People)
  • Are we shipping faster? (Productivity)
  • Are customers more satisfied? (Business)

If all three improve, AI tools are working—regardless of what the code metrics say.

And qualitative feedback matters as much as quantitative. Numbers tell you what; stories tell you why. :glowing_star:

Coming from design, this entire thread resonates. We went through the same measurement struggle with design tools.

The Figma Parallel

When we adopted Figma (from Sketch), everyone asked: “Did it make us faster?”

Metrics we tried:

  • Designs created per week (↑ 40%)
  • Design review cycles (→ no change)
  • Customer satisfaction with product (↑ 8%)

The speed increase was real but mostly showed up in iteration velocity (trying more options) not delivery velocity (shipping to customers faster).

Sound familiar? That’s your AI coding paradox.

What We Actually Improved: Collaboration

The real value of Figma wasn’t speed—it was collaboration quality.

Before: Designers worked in isolation, shared static screenshots
After: Designers worked in shared files, real-time feedback

Metrics that captured this:

  • Cross-functional collaboration frequency (meetings with eng/product)
  • Design review participation (more people engaged earlier)
  • Misalignment issues (post-launch “this isn’t what I expected”)

AI tools might be similar: Value is in collaboration and learning, not just raw speed.

Measuring Creative Velocity

In design, we track:

  • Concepts explored per project (more options considered)
  • Time to rule out bad ideas (faster validation)
  • Designer confidence in decisions (fewer second-guesses post-launch)

These aren’t traditional productivity metrics but they matter.

For engineering, analogous metrics:

  • Architectures prototyped before committing
  • Time to validate technical approaches
  • Engineer confidence in solutions

Your “option value” problem is our “creative exploration” problem. Hard to measure but critical.

The Quality-Speed Tradeoff

Design rule: More design options doesn’t mean better design. Sometimes constraints breed creativity.

Could AI tools have the same issue?

  • Engineers generate 10 solutions in the time they used to generate 2
  • But spend more time evaluating which is best
  • Net result: Same time to decision, higher cognitive load

We saw this with AI-generated design variations. Had to add constraints (design system guardrails) to reduce overwhelming choice.

Maybe engineering needs similar: AI within guardrails to prevent option overload.

Measuring Confidence and Learning

One of the best questions we added to our designer survey: “How confident are you tackling problems outside your comfort zone?”

This captures learning velocity better than any quantitative metric.

Before AI design tools: 6.1/10 confidence
After AI design tools: 7.8/10 confidence

Engineers should measure the same: “How confident are you working in unfamiliar codebases/domains?”

This is the “bravery to tackle new problems” that @eng_director_luis mentioned.

The Prevented Disaster Problem (Design Edition)

We have the same measurement challenge: How do you measure bad designs that never shipped?

Our proxy: Track design review feedback themes

  • “This doesn’t meet accessibility standards” (caught early)
  • “This breaks our design system” (prevented inconsistency)
  • “This won’t work on mobile” (avoided customer pain)

Tag these as “prevented by design review” and count them monthly.

For engineering: Tag code review feedback with “prevented issues” category. That’s your prevented disasters proxy.

My Measurement Philosophy: Measure What Matters, Ignore the Rest

Most organizations measure what’s easy (lines of code, PRs, commits) instead of what matters (value delivered).

Ask yourself:

  • If this metric goes up, does the business get better?
  • If this metric goes down, do we care?

“Lines of code” going up doesn’t make the business better. “Customer satisfaction” going up does.

Measure the second one, ignore the first.

The 10% Google Gain is Actually Great

Everyone’s disappointed it’s “only 10%” but let me offer the design perspective:

Most design tool improvements give you 0-5% productivity gains. Figma’s real-time collaboration was a 10-15% gain—and that was transformational.

10% sustained improvement is huge. Anyone expecting 50-100% gains doesn’t understand how productivity actually works.

Productivity improvements are usually:

  • 1-5% (small win)
  • 5-15% (big win)
  • 15%+ (transformational)

AI tools at 10% are in “big win” territory. Celebrate that. :artist_palette::sparkles:


Summary from the group:

We’re all saying the same thing from different angles.

The real answer: Track multiple levels, connect the dots, tell the story.

And 10% sustained productivity improvement is the goal, not the disappointment. :bar_chart: