Let's Build the AI Productivity Measurement Framework We Actually Need—Together

After our discussions on measurement challenges, junior developer impact, and metrics frameworks, I want to synthesize what we’ve learned into something actionable.

Because here’s what’s clear: We’re all flying blind, and we need to fix that.

What We’ve Established

The Paradox:

  • 41% of code is AI-generated
  • 93% of developers use AI tools
  • Productivity gains plateau at 10%
  • Developers expect to be 24% faster but are actually 19% slower (METR study)

The Hidden Costs:

  • 1.7× more issues in AI-assisted code
  • 23.7% more security vulnerabilities
  • 30% increase in code review time
  • 2.3× higher revert rate for AI-heavy code (Luis’s data)
  • 17% lower skill mastery for AI-assisted learning (Anthropic research)

The Measurement Gap:

  • DORA metrics show flat or declining performance
  • Individual time savings don’t translate to team velocity
  • CFOs can’t see business value despite K+ tool budgets
  • We lack baselines and control groups

Proposed Three-Tier Framework

Building on Michelle’s structure:

Tier 1: Business Outcomes (Monthly)

What CFOs Care About

  • Revenue per engineer
  • Cost per delivered feature
  • Time-to-market for revenue features
  • Customer-reported bugs (severity-weighted)
  • Security incidents and compliance violations
  • Support burden and customer satisfaction

Decision Rule: If these are improving, AI is working. If flat/declining, investigate with Tier 2.

Tier 2: Team Health (Weekly)

What Engineering Leaders Care About

  • Code durability (% surviving 30/60/90 days without modification; measurement sketch below)
  • Review burden (time and comments per PR, by code source)
  • Change failure rate by code source (human vs AI-heavy)
  • Technical debt indicators (complexity, test coverage requirements)
  • Incident response time (by developer experience level)
  • Knowledge transfer effectiveness (PR explanation quality)

Decision Rule: Use these to diagnose why Tier 1 is flat/declining and identify context-specific policies.
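Code durability is the most mechanical of these metrics to instrument. Here’s a rough sketch of the idea in Python, assuming a local git checkout and full 40-character commit SHAs; how you label commits as AI-heavy vs human-written is up to your own attribution tooling, and every function name below is illustrative, not an existing API.

```python
"""Durability sketch: what fraction of the lines a commit added are still
attributed to it at HEAD? Run against commits that are 30/60/90+ days old.
Labeling commits AI-heavy vs human-written is your attribution tooling's job."""
import subprocess

def _git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def lines_added(commit: str) -> int:
    # --numstat rows look like "added<TAB>deleted<TAB>path" ("-" for binaries).
    total = 0
    for row in _git("show", "--numstat", "--format=", commit).splitlines():
        added = row.split("\t")[0]
        if added.isdigit():
            total += int(added)
    return total

def lines_surviving(commit: str) -> int:
    # Count HEAD lines still blamed to `commit` across the files it touched.
    survived = 0
    for path in filter(None, _git("show", "--name-only", "--format=",
                                  commit).splitlines()):
        try:
            blame = _git("blame", "--line-porcelain", "HEAD", "--", path)
        except subprocess.CalledProcessError:
            continue  # file deleted or renamed since the commit
        # In --line-porcelain output, each entry's header begins with the SHA.
        survived += sum(1 for line in blame.splitlines()
                        if line.startswith(commit))
    return survived

def durability(commit: str) -> float:
    """1.0 means every added line survives unmodified at HEAD."""
    added = lines_added(commit)
    return lines_surviving(commit) / added if added else 1.0
```

Aggregate `durability()` over AI-heavy vs human-labeled commits and you have the 30/60/90-day survival comparison.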

Tier 3: Individual Patterns (Daily/Real-time)

What Individual Contributors Care About

  • AI tool usage patterns and acceptance rates
  • Developer sentiment and satisfaction
  • Code authorship attribution
  • Skill development trajectory (debugging proficiency, time-to-promotion)

Decision Rule: Use for coaching and identifying AI dependency or underutilization.
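Taken together, the three decision rules form a simple drill-down: Tier 1 tells you whether to act, Tier 2 tells you where, and Tier 3 tells you who to coach. A compact sketch of that flow; the trend values and flag names are examples, not prescribed vocabulary.

```python
# Illustrative drill-down across the three tiers; inputs come from the
# monthly (Tier 1) and weekly (Tier 2) reviews described above.
def drill_down(tier1_trend: str, tier2_flags: list[str]) -> str:
    if tier1_trend == "improving":
        return "AI is working at the business level; keep current policy."
    if tier2_flags:  # e.g. ["low code durability", "rising review burden"]
        return ("Diagnose via Tier 2 and scope context-specific policies: "
                + ", ".join(tier2_flags))
    return ("Tier 1 flat with clean team health: check Tier 3 usage patterns "
            "for dependency or underutilization before changing policy.")
```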

Implementation Roadmap

Month 1: Baseline

  • Establish pre-optimization metrics across all three tiers
  • Identify 1-2 “control” teams for AI-light comparison
  • Set up instrumentation and dashboards

Months 2-4: Pilot & Measure

  • Run A/B test within teams (AI-heavy vs AI-light for similar features)
  • Collect data across all three tiers
  • Weekly Tier 2 reviews, monthly Tier 1 reviews

Month 5: Analysis & Decision

  • Correlate Tier 3 patterns with Tier 2 outcomes and Tier 1 business value
  • Identify contexts where AI helps vs hurts
  • Make evidence-based policy decisions
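The core of that correlation step is comparing AI-heavy and AI-light cohorts on the same Tier 2 outcome. A minimal sketch, assuming a hypothetical weekly export with `ai_cohort` and `revert_rate` columns; substitute whatever your instrumentation actually emits.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical export: one row per team-week with a cohort label and outcome.
df = pd.read_csv("tier2_weekly.csv")
heavy = df.loc[df["ai_cohort"] == "heavy", "revert_rate"]
light = df.loc[df["ai_cohort"] == "light", "revert_rate"]

# Non-parametric test: a few months of team-weeks is a small, skewed sample.
stat, p = mannwhitneyu(heavy, light, alternative="two-sided")
print(f"median revert rate: heavy={heavy.median():.3f}, "
      f"light={light.median():.3f}, p={p:.3f}")
```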

Month 6: Policy Implementation

  • Context-specific AI adoption guidelines (Luis’s approach)
  • Experience-level access tiers (Keisha’s approach)
  • Ongoing measurement and iteration

Context-Specific AI Policies

Based on our discussions, proposed policies:

By Code Type:

  • ✅ High-value: Boilerplate, tests, documentation, refactoring well-understood code
  • ⚠️ Medium-value: Business logic, API implementations (require senior review)
  • ❌ Restricted: Security-critical, compliance-regulated, performance-critical code

By Developer Experience:

  • Juniors (0-18 months): Restricted to learning exercises, no production code
  • Mid-level (18 months-4 years): Full access with mandatory explanation in PRs
  • Senior (4+ years): Unrestricted with professional judgment

By Business Context:

  • Fintech/Healthcare/Regulated: Stricter controls, audit requirements
  • Consumer/SaaS: More permissive, optimize for speed
  • Infrastructure/Platform: Restricted for critical paths, allowed for tooling
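These three dimensions compose naturally into a policy-as-code lookup, so the rules live somewhere reviewable instead of in tribal knowledge. A sketch with most-restrictive-wins semantics; every key and level name below is illustrative.

```python
# Illustrative policy matrix. Levels are ordered least -> most strict;
# the strictest applicable dimension wins.
LEVELS = ["encouraged", "senior_review", "restricted"]

CODE_TYPE = {"boilerplate": "encouraged", "tests": "encouraged",
             "docs": "encouraged", "business_logic": "senior_review",
             "api": "senior_review", "security": "restricted",
             "compliance": "restricted", "perf_critical": "restricted"}
EXPERIENCE = {"junior": "restricted",      # no production AI code
              "mid": "senior_review",      # AI use explained in PRs
              "senior": "encouraged"}      # professional judgment
CONTEXT = {"regulated": "senior_review", "consumer": "encouraged",
           "platform_critical": "restricted", "tooling": "encouraged"}

def policy(code_type: str, experience: str, context: str) -> str:
    picks = (CODE_TYPE[code_type], EXPERIENCE[experience], CONTEXT[context])
    return max(picks, key=LEVELS.index)

# policy("tests", "senior", "consumer")          -> "encouraged"
# policy("business_logic", "junior", "consumer") -> "restricted"
```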

Success Criteria

After 6 months, AI tools justify their cost if:

✅ Revenue per engineer increases 10%+
✅ Time-to-market for revenue features decreases 15%+
✅ Customer-reported bugs stay flat or decrease
✅ Junior → mid-level promotion rate stays constant or improves
✅ Team velocity (DORA) improves

AI tools should be adjusted if:

😐 Business metrics flat despite productivity claims
❌ Quality metrics declining (bugs, security issues)
❌ Skill development trajectory slowing
❌ Review burden offsetting coding speed gains

AI tools should be pulled back if:

❌ Business value declining
❌ Quality problems creating customer impact
❌ Organizational capability eroding
❌ Cost of quality exceeds tool savings
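One way to keep the 6-month call honest is to pre-register these thresholds as code before any data arrives. A sketch; the field names and the exact pull-back trigger are my illustrative reading of the criteria above, not a fixed rule.

```python
from dataclasses import dataclass

@dataclass
class SixMonthDeltas:
    """Change vs baseline after 6 months, as fractions (+0.12 == +12%)."""
    revenue_per_engineer: float
    time_to_market: float        # negative means faster
    customer_bugs: float         # negative means fewer
    promotion_rate: float
    dora_velocity: float

def verdict(d: SixMonthDeltas) -> str:
    # Illustrative thresholds, taken from the success criteria above.
    if (d.revenue_per_engineer >= 0.10 and d.time_to_market <= -0.15
            and d.customer_bugs <= 0 and d.promotion_rate >= 0
            and d.dora_velocity > 0):
        return "continue"                  # tools justify their cost
    if d.revenue_per_engineer < 0 and d.customer_bugs > 0:
        return "pull back"                 # declining value plus customer harm
    return "adjust"                        # flat or mixed: change policy
```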

The Commitment

The goal isn’t to prove AI is good or bad—it’s to understand when it helps and when it hurts.

I’m committing to:

  • ✅ Implement this framework at my org starting Q2
  • ✅ Share data quarterly (anonymized) back to this community
  • ✅ Be willing to pull back on AI adoption if data says it’s not working
  • ✅ Focus on business outcomes, not engineering vanity metrics

The Ask

Who else is willing to run this experiment?

If 5-10 companies implement this framework and share learnings, we’ll have real data on:

  • Which contexts AI helps vs hurts
  • What experience-level policies work
  • Whether business value actually materializes
  • How to measure AI impact properly

Then we can stop flying blind and start making evidence-based decisions.

Who’s in?

I’m in. Count us as the first org committing to this framework.

David, this is exactly what we need—a structured, evidence-based approach instead of industry hype-driven adoption.

Our Commitment

Starting Q2 2026, we’re implementing:

Baseline Month (April):

  • Instrument all three tiers of metrics
  • Establish control team (Platform Engineering—already skeptical of AI tools)
  • Set up quarterly business review cadence with CFO

Pilot Quarter (May-July):

  • A/B test: Some teams AI-heavy, some AI-light
  • Weekly Tier 2 reviews with eng leadership
  • Monthly Tier 1 reviews with exec team

Decision Month (August):

  • Full analysis of 3 months of data
  • Present findings to board
  • Make go/adjust/retreat decision on K tool budget

Transparency Commitment:

  • Will share anonymized quarterly results back to this community
  • Will publish methodology and lessons learned
  • Will be honest about failures, not just successes

What I’m Nervous About

Political pressure to declare success regardless of data. We’ve made a big bet on AI tools. If the data shows they’re not working, can I actually get the org to pull back?

Sunk cost fallacy. We’ve spent K this year. Even if data says it’s not working, the CFO might push to “give it more time” rather than admit the investment was wrong.

Developer morale. If we restrict AI access after developers have gotten used to it, there will be pushback. Need to manage that change carefully.

But better to face these now than in 2 years when we’ve doubled down and eroded our engineering capability beyond repair.

The Meta Question

This framework itself is a bet that rational, data-driven decision-making will win over industry momentum and fear of missing out.

Not sure that’s a safe bet in today’s AI hype cycle. But it’s the right approach regardless.

Count me in as well. This is too important to wing it.

Our Implementation Plan

Tier 1 Addition - Talent Development:

I’m adding one metric to your Tier 1 that I think is business-critical:

Junior → Mid-Level Promotion Rate:

  • % of juniors promoted to mid-level within 24 months
  • Current baseline: 65%
  • Target with AI tools: Maintain or improve

Why it’s Tier 1 (business metric, not just eng metric):

  • Cost of failed junior hire: ~K (recruiting, onboarding, opportunity cost)
  • Cost of keeping someone stuck at junior level: ~K/year (overpaying for junior output)
  • Retention impact: Juniors who don’t grow leave, creating hiring treadmill

If AI tools tank this metric, they’re creating business harm even if code output looks good.

Tier 3 Addition - Skill Development Tracking:

For individual developers, I want to track:

Debugging Proficiency Score:

  • Time to resolve test environment issues (weekly practice sessions)
  • Root cause analysis quality in incident postmortems
  • Ability to solve novel problems without AI assistance

Mental Model Strength:

  • Can explain architectural decisions (assessed in design reviews)
  • Understands system interactions (assessed in cross-team projects)
  • Can mentor other developers effectively (peer feedback)

These are leading indicators—if they drop, we’ll see it in promotion rates 12-18 months later.
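To watch the trend rather than anecdotes, these components can be rolled into a simple weighted composite per developer per quarter. The weights and the 0-5 rubric below are assumptions to be tuned with your own leads, not part of the framework.

```python
# Illustrative composite: each component scored 0-5 on a team rubric.
WEIGHTS = {
    "debugging_time": 0.30,         # weekly practice-session timings
    "rca_quality": 0.20,            # postmortem root-cause scoring
    "novel_problem_solving": 0.20,  # no-AI problem-solving exercises
    "design_explanations": 0.15,    # design-review assessments
    "mentoring": 0.15,              # peer feedback
}

def skill_score(rubric: dict[str, float]) -> float:
    """Weighted 0-5 score; the quarterly trend matters, not the number."""
    return sum(WEIGHTS[k] * rubric[k] for k in WEIGHTS)
```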

The Political Challenge

Michelle’s point about political pressure is real. Here’s my version:

My CFO has already publicly announced AI adoption as a strategic initiative. “Leveraging AI to do more with less” is in our investor deck.

If I come back in 6 months and say “the data shows it’s not working,” that’s egg on leadership’s face.

Need to frame it as “learning and optimizing” not “failed bet.” Maybe:

  • “AI works great for X contexts, restricting from Y contexts based on data”
  • “Implementing smarter AI policies to maximize ROI”
  • “Evidence-based optimization of our AI strategy”

Not “we were wrong about AI.”

The Timeline Pressure

David’s 6-month timeline is aggressive but right. We need answers before next budget cycle.

But 3 months of data might not be enough to see long-term capability erosion. Junior skill development takes 12-24 months to fully measure.

Maybe we need:

  • Short-term metrics (3 months): Code quality, review burden, velocity
  • Medium-term metrics (6-12 months): Skill development, promotion rates
  • Long-term metrics (12-24 months): Organizational capability, retention

And make incremental decisions at each checkpoint, not one big go/no-go at 6 months.

I’m Committing To

✅ Implement the framework starting Q2
✅ Add talent development metrics to Tier 1
✅ Share quarterly learnings (anonymized)
✅ Be willing to restrict AI access if data shows skill erosion
✅ Focus on long-term organizational health over short-term productivity claims

Who else is joining this experiment?

I’m in. Let’s do this properly.

Our Existing Data

Good news: We’ve already been tracking some of these metrics informally. I can contribute baseline data:

Code Durability (Last 6 Months):

  • AI-heavy code (>40% AI-generated): 58% survives 90 days
  • Human-written code: 76% survives 90 days
  • Difference: 18 percentage points (statistically significant)
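For anyone who wants to sanity-check that significance claim against their own numbers, a two-proportion z-test is the standard tool. The cohort sizes below are placeholders, not our actual counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder cohort sizes -- substitute your actual 90-day-old change counts.
n_ai, n_human = 400, 500
survivors = [round(0.58 * n_ai), round(0.76 * n_human)]   # 58% vs 76%
z, p = proportions_ztest(count=survivors, nobs=[n_ai, n_human])
print(f"z = {z:.2f}, p = {p:.2g}")  # at these n's the 18-point gap is p << 0.01
```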

Revert Rate (Last Quarter):

  • AI-heavy PRs: 2.3× higher revert rate within 30 days
  • Mostly due to edge case handling and subtle logic errors
  • Not caught in code review because code “looked right”

Review Burden (Last Quarter):

  • AI-heavy PRs: 30% more review time
  • AI-heavy PRs: 2.1× more review comments per line of code
  • Senior engineers spending more time explaining why AI suggestions are wrong

So we already have evidence that AI code has quality issues. Now we need to connect that to business outcomes.

What We’re Adding

Context-Specific Policies (Starting Q2):

Based on our experience, we’re implementing Michelle’s domain-based approach:

Prohibited (AI not allowed):

  • Payment processing and financial transactions
  • Authentication and authorization logic
  • Data encryption and security-critical code
  • Performance-critical paths (< 100ms latency requirements)

Restricted (Senior review required):

  • Complex business logic
  • API contracts and integrations
  • Database schema changes
  • Infrastructure and deployment code

Encouraged (AI welcome):

  • Boilerplate and repetitive patterns
  • Test case generation (with human verification)
  • Documentation and comments
  • Internal tooling and scripts

Experience-Based Policies:

  • Juniors (0-18 months): AI prohibited for production code
  • Mid-level (18 months-4 years): Follow context policies above
  • Senior (4+ years): Professional judgment
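To make the prohibited list bite, I’d rather enforce it mechanically in CI than by convention. A sketch of a pre-merge gate; the path globs and the `ai-attributed` PR label are conventions I’m assuming, not an existing tool.

```python
"""Pre-merge gate sketch: fail if an AI-attributed PR touches prohibited paths."""
import subprocess
import sys
from fnmatch import fnmatch

# Illustrative globs; note fnmatch's "*" also matches "/" (crosses directories).
PROHIBITED = ["services/payments/*", "auth/*", "crypto/*", "hotpaths/*"]

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return [p for p in out.splitlines() if p]

def main(pr_labels: set[str]) -> int:
    if "ai-attributed" not in pr_labels:   # label applied by authors or tooling
        return 0
    hits = [f for f in changed_files()
            if any(fnmatch(f, pat) for pat in PROHIBITED)]
    if hits:
        print("AI-assisted changes touch prohibited paths:", *hits, sep="\n  ")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(set(sys.argv[1:])))
```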

The Measurement Commitment

I’m committing to track and share:

  • Weekly: Code quality metrics (durability, revert rate, review burden)
  • Monthly: Velocity and throughput (story points, features shipped)
  • Quarterly: Business outcomes (bugs, incidents, customer impact)
  • Annually: Talent development (promotion rates, skill assessments)

Will publish anonymized quarterly reports back to this group.

The Honest Conversation

Here’s what I’m most worried about: Developer retention.

If we restrict AI access when other companies go all-in, will developers leave for more “AI-friendly” environments?

Especially if they’re getting recruited with pitches like:

  • “We use the latest AI tools—you’ll be way more productive!”
  • “No annoying restrictions on Copilot/Claude like at your current company”
  • “We trust our engineers to use AI responsibly”

I need to be able to explain to my team WHY we’re being more cautious:

  • ✅ “We’re optimizing for your long-term career growth, not short-term code output”
  • ✅ “Data shows AI restrictions improve skill development and promotion rates”
  • ✅ “We want you to be great engineers, not AI-dependent code generators”

But that only works if the data backs it up. If restrictions DON’T improve outcomes, we’re just handicapping ourselves.

That’s the experiment.

I’m In

✅ Implement context-specific and experience-based AI policies Q2
✅ Track all metrics across three tiers
✅ Share quarterly data and learnings
✅ Make evidence-based adjustments every quarter
✅ Be willing to pivot if data shows we’re wrong

Looking forward to comparing notes in 90 days.