Following up on our productivity measurement discussion and junior developer concerns, I want to get tactical about measurement.
The problem: We’re using the same metrics we always have—DORA, velocity, commit counts—to evaluate a fundamentally different way of working.
And those metrics are showing no improvement despite massive AI adoption.
What DORA Isn’t Capturing
Our DORA metrics over the last 12 months (since we went all-in on AI tools):
- Deployment Frequency: Basically flat (2.3 deployments/day → 2.4)
- Lead Time for Changes: Slightly worse (6.2 hours → 6.8 hours)
- Change Failure Rate: Worse (4.1% → 5.7%)
- Time to Restore Service: Flat (52 minutes → 54 minutes)
So by traditional engineering metrics, AI tools have been neutral at best and a net negative at worst.
But I don’t think these metrics are capturing what’s actually happening. They’re too coarse-grained.
What We Should Be Measuring Instead
I’m proposing three new categories of AI-specific metrics:
1. Code Quality and Durability Metrics
Code Durability Score:
- % of code that survives 30/60/90 days without modification
- Hypothesis: AI-generated code gets rewritten more often
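One way to operationalize the durability score: snapshot per-line authorship (e.g. via `git blame`) at commit time and again after the window elapses, then compute the fraction of old-enough lines still attributed to their original commit. A minimal sketch, assuming that extraction is done upstream — `durability_score` and its input shapes are hypothetical, not an existing tool:

```python
from datetime import date

def durability_score(authored: dict[str, date],
                     still_attributed: set[str],
                     as_of: date,
                     window_days: int) -> float:
    """Fraction of lines at least `window_days` old that survive unmodified.

    `authored` maps a line id (e.g. "file.py:123@commit") to its original
    commit date; `still_attributed` is the subset of those ids that blame
    still attributes to the original commit as of `as_of`.
    """
    # Only lines old enough to have "had a chance" to be rewritten count.
    eligible = {lid for lid, d in authored.items()
                if (as_of - d).days >= window_days}
    if not eligible:
        return 0.0
    return len(eligible & still_attributed) / len(eligible)
```

Run it per 30/60/90-day window and segment by code source to test the hypothesis directly.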
Bug Attribution:
- Production bugs per 1000 lines, segmented by code source (human vs AI-heavy)
- Security vulnerabilities by code source
- Hypothesis: AI code has higher defect density
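The segmentation itself is simple once bugs and lines are tagged by source; the hard part is the tagging. A sketch of the per-1000-lines calculation, where "human" and "ai_heavy" are placeholder labels for whatever attribution scheme you adopt:

```python
def defect_density(bugs_by_source: dict[str, int],
                   loc_by_source: dict[str, int]) -> dict[str, float]:
    """Production bugs per 1000 lines of code, segmented by code source.

    Sources with zero recorded lines are skipped rather than divided by.
    """
    return {src: bugs_by_source.get(src, 0) / (loc_by_source[src] / 1000)
            for src in loc_by_source if loc_by_source[src] > 0}
```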
Review Efficiency:
- Review comments per PR (normalized by PR size)
- Review time per line of code
- Acceptance without changes rate
- Hypothesis: AI code requires more review scrutiny
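A sketch of the normalization, assuming PR metadata can be exported from your review tool — the `PullRequest` shape here is illustrative, not any tool's actual API:

```python
from dataclasses import dataclass

@dataclass
class PullRequest:
    lines_changed: int
    review_comments: int
    merged_without_changes: bool

def review_burden(prs: list[PullRequest]) -> dict[str, float]:
    """Review comments per 100 changed lines, plus clean-acceptance rate."""
    total_lines = sum(p.lines_changed for p in prs)
    total_comments = sum(p.review_comments for p in prs)
    clean = sum(p.merged_without_changes for p in prs)
    return {
        "comments_per_100_lines": 100 * total_comments / total_lines,
        "accepted_without_changes": clean / len(prs),
    }
```

Compute it separately for AI-heavy and human-authored PRs; the hypothesis predicts a higher comments-per-100-lines figure for the former.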
Technical Debt Accumulation:
- Code complexity metrics (cyclomatic, cognitive) over time
- Test coverage requirements (do we need more tests for AI code?)
- Refactoring frequency
- Hypothesis: AI code accumulates debt faster
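For trend tracking you don't need a full static-analysis platform; a crude stdlib proxy is enough to plot complexity over time, as long as you use the same proxy consistently. This is a rough stand-in for a real cyclomatic-complexity analyzer, not a replacement for one:

```python
import ast

def cyclomatic_proxy(source: str) -> int:
    """Rough cyclomatic complexity: 1 + number of branch points.

    Counts if/for/while/except/boolean-op decision nodes. Crude, but
    stable enough to compare the same codebase against itself over time.
    """
    tree = ast.parse(source)
    branches = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp)
    return 1 + sum(isinstance(n, branches) for n in ast.walk(tree))
```

Record the score per file at each release; the hypothesis predicts a steeper upward slope for AI-heavy modules.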
2. Developer Capability Metrics
Debugging Proficiency:
- Time to resolve production incidents (by developer experience level)
- Ability to solve issues without AI assistance
- Root cause analysis quality
- Hypothesis: AI-dependent developers struggle with novel debugging
Knowledge Transfer:
- Code explanation quality in PRs
- Documentation completeness
- Mentorship effectiveness (can seniors teach juniors who used AI?)
- Hypothesis: AI reduces deep understanding, harming knowledge transfer
Skill Development Trajectory:
- Time-to-promotion (Junior → Mid → Senior)
- Technical interview performance over time
- Architecture decision quality
- Hypothesis: Heavy AI usage slows skill acquisition
3. Business Outcome Metrics
Value Delivery:
- Time from idea to customer value (not just deployment)
- Feature adoption and usage metrics
- Customer-reported bugs per release
- Hypothesis: More code ≠ more value
Cost of Quality:
- Security incident costs (time + $ + reputation)
- Support burden from production bugs
- Engineering time spent on rework vs new features
- Hypothesis: AI code quality issues have real business costs
Engineering Efficiency:
- Revenue per engineer (ultimate productivity measure)
- Cost per delivered feature
- Engineering headcount as % of company (do we actually need fewer engineers?)
- Hypothesis: AI should improve these if it’s really a productivity boost
Proposed Measurement Framework
Tier 1 - Business Metrics (Monthly Review):
- Revenue per engineer
- Customer-reported critical bugs
- Time-to-market for revenue features
- Security incidents and compliance violations
If Tier 1 is improving: AI is working, keep going.
If Tier 1 is flat/declining: Investigate with Tier 2.
Tier 2 - Team Metrics (Weekly Review):
- Code durability scores
- Review burden and PR cycle time
- Change failure rate by code source
- Technical debt indicators
If Tier 2 shows AI code quality problems: Adjust policies (restrict AI use for critical paths, require senior review for AI-heavy PRs).
Tier 3 - Individual Metrics (Daily/Real-time):
- AI tool usage patterns
- Code authorship attribution
- Developer sentiment
- Learning and skill development
Use Tier 3 for coaching: Identify developers who are too AI-dependent or not leveraging AI effectively.
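The escalation logic across the three tiers can be encoded as a simple routing function. A sketch only — the trend labels are assumed summary judgments of the Tier 1 review, not precisely defined metrics:

```python
def next_review_action(tier1_trend: str, tier2_flags_ai_quality: bool) -> str:
    """Route the monthly review per the tiered framework.

    `tier1_trend` is "improving", "flat", or "declining"; the Tier 2 flag
    comes from code-durability and change-failure-by-source checks.
    """
    if tier1_trend == "improving":
        return "continue current AI policy"
    if tier2_flags_ai_quality:
        return "adjust policy: restrict AI on critical paths, senior review for AI-heavy PRs"
    return "investigate further: drill into Tier 3 individual metrics"
```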
The Baseline Problem
Here’s my challenge: We don’t have pre-AI baselines for most of these metrics.
We adopted AI tools in a rush (“everyone else is doing it”) without establishing measurement frameworks first. Now we’re trying to retrofit baselines from historical data that may not be comparable.
If I could do it over:
1. Establish baseline metrics (DORA + new metrics above)
2. Run controlled pilot (one team with AI, one without, same projects)
3. Measure for 2 quarters before org-wide rollout
4. Make data-driven adoption decision
But we skipped steps 1-3 and went straight to org-wide adoption. Now we’re flying blind.
What I’m Implementing Next Quarter
Starting Q2, we’re instrumenting:
Weeks 1-2: Baseline measurement
- Current code durability scores
- Review metrics across all teams
- Developer capability assessments
Weeks 3-12: A/B testing within teams
- Some features: “AI-heavy” approach (full tool access)
- Some features: “AI-light” approach (restricted to boilerplate/tests only)
- Track everything in the framework above
End of quarter: Business case review
- Did AI-heavy features ship faster? With what quality?
- Did AI-light features cost more engineer time but have better outcomes?
- Which approach delivered more customer value?
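With only one quarter of features per arm, the samples will be small, so the end-of-quarter comparison shouldn't assume normality. A stdlib permutation test on, say, per-feature lead times is one defensible option — the function and inputs below are illustrative:

```python
import random
from statistics import mean

def permutation_pvalue(a: list[float], b: list[float],
                       n_iter: int = 10_000, seed: int = 0) -> float:
    """Two-sided permutation test on the difference of means.

    Repeatedly reshuffles the pooled observations into two groups of the
    original sizes and counts how often the shuffled mean gap matches or
    exceeds the observed one.
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(perm_a) - mean(perm_b)) >= observed:
            hits += 1
    return hits / n_iter
```

A large p-value here would itself be informative: it would mean a quarter of data can't distinguish the two approaches, and the policy question stays open.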
Then we make evidence-based policy decisions about where AI helps and where it hurts.
The Question for This Group
What metrics are you using to evaluate AI tools beyond “developer sentiment”?
Has anyone found leading indicators that actually correlate with business outcomes? Or proven that DORA metrics are sufficient and I’m overthinking this?
Because right now, I'm preparing to defend (or cancel) a substantial annual tool budget, and "developers like it" isn't going to cut it with our CFO.