Let's Build the AI Productivity Measurement Framework We Actually Need—Together

After our discussions on measurement challenges, junior developer impact, and metrics frameworks, I want to synthesize what we’ve learned into something actionable.

Because here’s what’s clear: We’re all flying blind, and we need to fix that.

What We’ve Established

The Paradox:

  • 41% of code is AI-generated
  • 93% of developers use AI tools
  • Productivity gains plateau at 10%
  • Developers expect to be 24% faster but are actually 19% slower (METR study)

The Hidden Costs:

  • 1.7× more issues in AI-assisted code
  • 23.7% more security vulnerabilities
  • 30% increase in code review time
  • 2.3× higher revert rate for AI-heavy code (Luis’s data)
  • 17% lower skill mastery for AI-assisted learning (Anthropic research)

The Measurement Gap:

  • DORA metrics show flat or declining performance
  • Individual time savings don’t translate to team velocity
  • CFOs can’t see business value despite K+ tool budgets
  • We lack baselines and control groups

Proposed Three-Tier Framework

Building on Michelle’s structure:

Tier 1: Business Outcomes (Monthly)

What CFOs Care About

  • Revenue per engineer
  • Cost per delivered feature
  • Time-to-market for revenue features
  • Customer-reported bugs (severity-weighted)
  • Security incidents and compliance violations
  • Support burden and customer satisfaction

Decision Rule: If these are improving, AI is working. If flat/declining, investigate with Tier 2.

Tier 2: Team Health (Weekly)

What Engineering Leaders Care About

  • Code durability (% surviving 30/60/90 days without modification; measurement sketch below)
  • Review burden (time and comments per PR, by code source)
  • Change failure rate by code source (human vs AI-heavy)
  • Technical debt indicators (complexity, test coverage requirements)
  • Incident response time (by developer experience level)
  • Knowledge transfer effectiveness (PR explanation quality)

Decision Rule: Use these to diagnose why Tier 1 is flat/declining and identify context-specific policies.
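Code durability is the most mechanical of these metrics to instrument. Here’s a rough sketch of the idea in Python, assuming a local git checkout and full 40-character commit SHAs; how you label commits as AI-heavy vs human-written is up to your own attribution tooling, and every function name below is illustrative, not an existing API.

```python
"""Durability sketch: what fraction of the lines a commit added are still
attributed to it at HEAD? Run against commits that are 30/60/90+ days old.
Labeling commits AI-heavy vs human-written is your attribution tooling's job."""
import subprocess

def _git(*args: str) -> str:
    return subprocess.run(["git", *args], capture_output=True,
                          text=True, check=True).stdout

def lines_added(commit: str) -> int:
    # --numstat rows look like "added<TAB>deleted<TAB>path" ("-" for binaries).
    total = 0
    for row in _git("show", "--numstat", "--format=", commit).splitlines():
        added = row.split("\t")[0]
        if added.isdigit():
            total += int(added)
    return total

def lines_surviving(commit: str) -> int:
    # Count HEAD lines still blamed to `commit` across the files it touched.
    survived = 0
    for path in filter(None, _git("show", "--name-only", "--format=",
                                  commit).splitlines()):
        try:
            blame = _git("blame", "--line-porcelain", "HEAD", "--", path)
        except subprocess.CalledProcessError:
            continue  # file deleted or renamed since the commit
        # In --line-porcelain output, each entry's header begins with the SHA.
        survived += sum(1 for line in blame.splitlines()
                        if line.startswith(commit))
    return survived

def durability(commit: str) -> float:
    """1.0 means every added line survives unmodified at HEAD."""
    added = lines_added(commit)
    return lines_surviving(commit) / added if added else 1.0
```

Aggregate `durability()` over AI-heavy vs human-labeled commits and you have the 30/60/90-day survival comparison.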

Tier 3: Individual Patterns (Daily/Real-time)

What Individual Contributors Care About

  • AI tool usage patterns and acceptance rates
  • Developer sentiment and satisfaction
  • Code authorship attribution
  • Skill development trajectory (debugging proficiency, time-to-promotion)

Decision Rule: Use for coaching and identifying AI dependency or underutilization.
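Taken together, the three decision rules form a simple drill-down: Tier 1 tells you whether to act, Tier 2 tells you where, and Tier 3 tells you who to coach. A compact sketch of that flow; the trend values and flag names are examples, not prescribed vocabulary.

```python
# Illustrative drill-down across the three tiers; inputs come from the
# monthly (Tier 1) and weekly (Tier 2) reviews described above.
def drill_down(tier1_trend: str, tier2_flags: list[str]) -> str:
    if tier1_trend == "improving":
        return "AI is working at the business level; keep current policy."
    if tier2_flags:  # e.g. ["low code durability", "rising review burden"]
        return ("Diagnose via Tier 2 and scope context-specific policies: "
                + ", ".join(tier2_flags))
    return ("Tier 1 flat with clean team health: check Tier 3 usage patterns "
            "for dependency or underutilization before changing policy.")
```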

Implementation Roadmap

Month 1: Baseline

  • Establish pre-optimization metrics across all three tiers
  • Identify 1-2 “control” teams for AI-light comparison
  • Set up instrumentation and dashboards

Months 2-4: Pilot & Measure

  • Run A/B test within teams (AI-heavy vs AI-light for similar features)
  • Collect data across all three tiers
  • Weekly Tier 2 reviews, monthly Tier 1 reviews

Month 5: Analysis & Decision

  • Correlate Tier 3 patterns with Tier 2 outcomes and Tier 1 business value
  • Identify contexts where AI helps vs hurts
  • Make evidence-based policy decisions
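The core of that correlation step is comparing AI-heavy and AI-light cohorts on the same Tier 2 outcome. A minimal sketch, assuming a hypothetical weekly export with `ai_cohort` and `revert_rate` columns; substitute whatever your instrumentation actually emits.

```python
import pandas as pd
from scipy.stats import mannwhitneyu

# Hypothetical export: one row per team-week with a cohort label and outcome.
df = pd.read_csv("tier2_weekly.csv")
heavy = df.loc[df["ai_cohort"] == "heavy", "revert_rate"]
light = df.loc[df["ai_cohort"] == "light", "revert_rate"]

# Non-parametric test: a few months of team-weeks is a small, skewed sample.
stat, p = mannwhitneyu(heavy, light, alternative="two-sided")
print(f"median revert rate: heavy={heavy.median():.3f}, "
      f"light={light.median():.3f}, p={p:.3f}")
```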

Month 6: Policy Implementation

  • Context-specific AI adoption guidelines (Luis’s approach)
  • Experience-level access tiers (Keisha’s approach)
  • Ongoing measurement and iteration

Context-Specific AI Policies

Based on our discussions, proposed policies:

By Code Type:

  • ✅ High-value: Boilerplate, tests, documentation, refactoring well-understood code
  • ⚠️ Medium-value: Business logic, API implementations (require senior review)
  • ❌ Restricted: Security-critical, compliance-regulated, performance-critical code

By Developer Experience:

  • Juniors (0-18 months): Restricted to learning exercises, no production code
  • Mid-level (18 months-4 years): Full access with mandatory explanation in PRs
  • Senior (4+ years): Unrestricted with professional judgment

By Business Context:

  • Fintech/Healthcare/Regulated: Stricter controls, audit requirements
  • Consumer/SaaS: More permissive, optimize for speed
  • Infrastructure/Platform: Restricted for critical paths, allowed for tooling
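These three dimensions compose naturally into a policy-as-code lookup, so the rules live somewhere reviewable instead of in tribal knowledge. A sketch with most-restrictive-wins semantics; every key and level name below is illustrative.

```python
# Illustrative policy matrix. Levels are ordered least -> most strict;
# the strictest applicable dimension wins.
LEVELS = ["encouraged", "senior_review", "restricted"]

CODE_TYPE = {"boilerplate": "encouraged", "tests": "encouraged",
             "docs": "encouraged", "business_logic": "senior_review",
             "api": "senior_review", "security": "restricted",
             "compliance": "restricted", "perf_critical": "restricted"}
EXPERIENCE = {"junior": "restricted",      # no production AI code
              "mid": "senior_review",      # AI use explained in PRs
              "senior": "encouraged"}      # professional judgment
CONTEXT = {"regulated": "senior_review", "consumer": "encouraged",
           "platform_critical": "restricted", "tooling": "encouraged"}

def policy(code_type: str, experience: str, context: str) -> str:
    picks = (CODE_TYPE[code_type], EXPERIENCE[experience], CONTEXT[context])
    return max(picks, key=LEVELS.index)

# policy("tests", "senior", "consumer")          -> "encouraged"
# policy("business_logic", "junior", "consumer") -> "restricted"
```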

Success Criteria

After 6 months, AI tools justify their cost if:

✅ Revenue per engineer increases 10%+
✅ Time-to-market for revenue features decreases 15%+
✅ Customer-reported bugs stay flat or decrease
✅ Junior → mid-level promotion rate stays constant or improves
✅ Team velocity (DORA) improves

AI tools should be adjusted if:

😐 Business metrics flat despite productivity claims
❌ Quality metrics declining (bugs, security issues)
❌ Skill development trajectory slowing
❌ Review burden offsetting coding speed gains

AI tools should be pulled back if:

❌ Business value declining
❌ Quality problems creating customer impact
❌ Organizational capability eroding
❌ Cost of quality exceeds tool savings
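One way to keep the 6-month call honest is to pre-register these thresholds as code before any data arrives. A sketch; the field names and the exact pull-back trigger are my illustrative reading of the criteria above, not a fixed rule.

```python
from dataclasses import dataclass

@dataclass
class SixMonthDeltas:
    """Change vs baseline after 6 months, as fractions (+0.12 == +12%)."""
    revenue_per_engineer: float
    time_to_market: float        # negative means faster
    customer_bugs: float         # negative means fewer
    promotion_rate: float
    dora_velocity: float

def verdict(d: SixMonthDeltas) -> str:
    # Illustrative thresholds, taken from the success criteria above.
    if (d.revenue_per_engineer >= 0.10 and d.time_to_market <= -0.15
            and d.customer_bugs <= 0 and d.promotion_rate >= 0
            and d.dora_velocity > 0):
        return "continue"                  # tools justify their cost
    if d.revenue_per_engineer < 0 and d.customer_bugs > 0:
        return "pull back"                 # declining value plus customer harm
    return "adjust"                        # flat or mixed: change policy
```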

The Commitment

The goal isn’t to prove AI is good or bad—it’s to understand when it helps and when it hurts.

I’m committing to:

  • ✅ Implement this framework at my org starting Q2
  • ✅ Share data quarterly (anonymized) back to this community
  • ✅ Be willing to pull back on AI adoption if data says it’s not working
  • ✅ Focus on business outcomes, not engineering vanity metrics

The Ask

Who else is willing to run this experiment?

If 5-10 companies implement this framework and share learnings, we’ll have real data on:

  • Which contexts AI helps vs hurts
  • What experience-level policies work
  • Whether business value actually materializes
  • How to measure AI impact properly

Then we can stop flying blind and start making evidence-based decisions.

Who’s in?

I’m in. Count us as the first org committing to this framework.

David, this is exactly what we need—a structured, evidence-based approach instead of industry hype-driven adoption.

Our Commitment

Starting Q2 2026, we’re implementing:

Baseline Month (April):

  • Instrument all three tiers of metrics
  • Establish control team (Platform Engineering—already skeptical of AI tools)
  • Set up quarterly business review cadence with CFO

Pilot Quarter (May-July):

  • A/B test: Some teams AI-heavy, some AI-light
  • Weekly Tier 2 reviews with eng leadership
  • Monthly Tier 1 reviews with exec team

Decision Month (August):

  • Full analysis of 3 months of data
  • Present findings to board
  • Make go/adjust/retreat decision on K tool budget

Transparency Commitment:

  • Will share anonymized quarterly results back to this community
  • Will publish methodology and lessons learned
  • Will be honest about failures, not just successes

What I’m Nervous About

Political pressure to declare success regardless of data. We’ve made a big bet on AI tools. If the data shows they’re not working, can I actually get the org to pull back?

Sunk cost fallacy. We’ve spent K this year. Even if data says it’s not working, the CFO might push to “give it more time” rather than admit the investment was wrong.

Developer morale. If we restrict AI access after developers have gotten used to it, there will be pushback. Need to manage that change carefully.

But better to face these now than in 2 years when we’ve doubled down and eroded our engineering capability beyond repair.

The Meta Question

This framework itself is a bet that rational, data-driven decision-making will win over industry momentum and fear of missing out.

Not sure that’s a safe bet in today’s AI hype cycle. But it’s the right approach regardless.

Count me in as well. This is too important to wing it.

Our Implementation Plan

Tier 1 Addition - Talent Development:

I’m adding one metric to your Tier 1 that I think is business-critical:

Junior → Mid-Level Promotion Rate:

  • % of juniors promoted to mid-level within 24 months
  • Current baseline: 65%
  • Target with AI tools: Maintain or improve

Why it’s Tier 1 (business metric, not just eng metric):

  • Cost of failed junior hire: ~K (recruiting, onboarding, opportunity cost)
  • Cost of keeping someone stuck at junior level: ~K/year (overpaying for junior output)
  • Retention impact: Juniors who don’t grow leave, creating hiring treadmill

If AI tools tank this metric, they’re creating business harm even if code output looks good.

Tier 3 Addition - Skill Development Tracking:

For individual developers, I want to track:

Debugging Proficiency Score:

  • Time to resolve test environment issues (weekly practice sessions)
  • Root cause analysis quality in incident postmortems
  • Ability to solve novel problems without AI assistance

Mental Model Strength:

  • Can explain architectural decisions (assessed in design reviews)
  • Understands system interactions (assessed in cross-team projects)
  • Can mentor other developers effectively (peer feedback)

These are leading indicators—if they drop, we’ll see it in promotion rates 12-18 months later.
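To watch the trend rather than anecdotes, these components can be rolled into a simple weighted composite per developer per quarter. The weights and the 0-5 rubric below are assumptions to be tuned with your own leads, not part of the framework.

```python
# Illustrative composite: each component scored 0-5 on a team rubric.
WEIGHTS = {
    "debugging_time": 0.30,         # weekly practice-session timings
    "rca_quality": 0.20,            # postmortem root-cause scoring
    "novel_problem_solving": 0.20,  # no-AI problem-solving exercises
    "design_explanations": 0.15,    # design-review assessments
    "mentoring": 0.15,              # peer feedback
}

def skill_score(rubric: dict[str, float]) -> float:
    """Weighted 0-5 score; the quarterly trend matters, not the number."""
    return sum(WEIGHTS[k] * rubric[k] for k in WEIGHTS)
```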

The Political Challenge

Michelle’s point about political pressure is real. Here’s my version:

My CFO has already publicly announced AI adoption as a strategic initiative. “Leveraging AI to do more with less” is in our investor deck.

If I come back in 6 months and say “the data shows it’s not working,” that’s egg on leadership’s face.

Need to frame it as “learning and optimizing” not “failed bet.” Maybe:

  • “AI works great for X contexts, restricting from Y contexts based on data”
  • “Implementing smarter AI policies to maximize ROI”
  • “Evidence-based optimization of our AI strategy”

Not “we were wrong about AI.”

The Timeline Pressure

David’s 6-month timeline is aggressive but right. We need answers before next budget cycle.

But 3 months of data might not be enough to see long-term capability erosion. Junior skill development takes 12-24 months to fully measure.

Maybe we need:

  • Short-term metrics (3 months): Code quality, review burden, velocity
  • Medium-term metrics (6-12 months): Skill development, promotion rates
  • Long-term metrics (12-24 months): Organizational capability, retention

And make incremental decisions at each checkpoint, not one big go/no-go at 6 months.

I’m Committing To

✅ Implement the framework starting Q2
✅ Add talent development metrics to Tier 1
✅ Share quarterly learnings (anonymized)
✅ Be willing to restrict AI access if data shows skill erosion
✅ Focus on long-term organizational health over short-term productivity claims

Who else is joining this experiment?

I’m in. Let’s do this properly.

Our Existing Data

Good news: We’ve already been tracking some of these metrics informally. I can contribute baseline data:

Code Durability (Last 6 Months):

  • AI-heavy code (>40% AI-generated): 58% survives 90 days
  • Human-written code: 76% survives 90 days
  • Difference: 18 percentage points (statistically significant)
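For anyone who wants to sanity-check that significance claim against their own numbers, a two-proportion z-test is the standard tool. The cohort sizes below are placeholders, not our actual counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder cohort sizes -- substitute your actual 90-day-old change counts.
n_ai, n_human = 400, 500
survivors = [round(0.58 * n_ai), round(0.76 * n_human)]   # 58% vs 76%
z, p = proportions_ztest(count=survivors, nobs=[n_ai, n_human])
print(f"z = {z:.2f}, p = {p:.2g}")  # at these n's the 18-point gap is p << 0.01
```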

Revert Rate (Last Quarter):

  • AI-heavy PRs: 2.3× higher revert rate within 30 days
  • Mostly due to edge case handling and subtle logic errors
  • Not caught in code review because code “looked right”

Review Burden (Last Quarter):

  • AI-heavy PRs: 30% more review time
  • AI-heavy PRs: 2.1× more review comments per line of code
  • Senior engineers spending more time explaining why AI suggestions are wrong

So we already have evidence that AI code has quality issues. Now we need to connect that to business outcomes.

What We’re Adding

Context-Specific Policies (Starting Q2):

Based on our experience, we’re implementing Michelle’s domain-based approach:

Prohibited (AI not allowed):

  • Payment processing and financial transactions
  • Authentication and authorization logic
  • Data encryption and security-critical code
  • Performance-critical paths (< 100ms latency requirements)

Restricted (Senior review required):

  • Complex business logic
  • API contracts and integrations
  • Database schema changes
  • Infrastructure and deployment code

Encouraged (AI welcome):

  • Boilerplate and repetitive patterns
  • Test case generation (with human verification)
  • Documentation and comments
  • Internal tooling and scripts

Experience-Based Policies:

  • Juniors (0-18 months): AI prohibited for production code
  • Mid-level (18 months-4 years): Follow context policies above
  • Senior (4+ years): Professional judgment
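To make the prohibited list bite, I’d rather enforce it mechanically in CI than by convention. A sketch of a pre-merge gate; the path globs and the `ai-attributed` PR label are conventions I’m assuming, not an existing tool.

```python
"""Pre-merge gate sketch: fail if an AI-attributed PR touches prohibited paths."""
import subprocess
import sys
from fnmatch import fnmatch

# Illustrative globs; note fnmatch's "*" also matches "/" (crosses directories).
PROHIBITED = ["services/payments/*", "auth/*", "crypto/*", "hotpaths/*"]

def changed_files(base: str = "origin/main") -> list[str]:
    out = subprocess.run(["git", "diff", "--name-only", f"{base}...HEAD"],
                         capture_output=True, text=True, check=True).stdout
    return [p for p in out.splitlines() if p]

def main(pr_labels: set[str]) -> int:
    if "ai-attributed" not in pr_labels:   # label applied by authors or tooling
        return 0
    hits = [f for f in changed_files()
            if any(fnmatch(f, pat) for pat in PROHIBITED)]
    if hits:
        print("AI-assisted changes touch prohibited paths:", *hits, sep="\n  ")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(main(set(sys.argv[1:])))
```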

The Measurement Commitment

I’m committing to track and share:

  • Weekly: Code quality metrics (durability, revert rate, review burden)
  • Monthly: Velocity and throughput (story points, features shipped)
  • Quarterly: Business outcomes (bugs, incidents, customer impact)
  • Annually: Talent development (promotion rates, skill assessments)

Will publish anonymized quarterly reports back to this group.

The Honest Conversation

Here’s what I’m most worried about: Developer retention.

If we restrict AI access when other companies go all-in, will developers leave for more “AI-friendly” environments?

Especially if they’re getting recruited with pitches like:

  • “We use the latest AI tools—you’ll be way more productive!”
  • “No annoying restrictions on Copilot/Claude like at your current company”
  • “We trust our engineers to use AI responsibly”

I need to be able to explain to my team WHY we’re being more cautious:

  • ✅ “We’re optimizing for your long-term career growth, not short-term code output”
  • ✅ “Data shows AI restrictions improve skill development and promotion rates”
  • ✅ “We want you to be great engineers, not AI-dependent code generators”

But that only works if the data backs it up. If restrictions DON’T improve outcomes, we’re just handicapping ourselves.

That’s the experiment.

I’m In

✅ Implement context-specific and experience-based AI policies Q2
✅ Track all metrics across three tiers
✅ Share quarterly data and learnings
✅ Make evidence-based adjustments every quarter
✅ Be willing to pivot if data shows we’re wrong

Looking forward to comparing notes in 90 days.