Beyond DORA: What Metrics Actually Capture AI's Impact on Engineering Organizations?

Following up on our productivity measurement discussion and junior developer concerns, I want to get tactical about measurement.

The problem: We’re using the same metrics we always have—DORA, velocity, commit counts—to evaluate a fundamentally different way of working.

And those metrics are showing no improvement despite massive AI adoption.

What DORA Isn’t Capturing

Our DORA metrics over the last 12 months (since we went all-in on AI tools):

  • Deployment Frequency: Basically flat (2.3 deployments/day → 2.4)
  • Lead Time for Changes: Slightly worse (6.2 hours → 6.8 hours)
  • Change Failure Rate: Worse (4.1% → 5.7%)
  • Time to Restore Service: Flat (52 minutes → 54 minutes)

So by traditional engineering metrics, AI tools have been net negative or neutral.

But I don’t think these metrics are capturing what’s actually happening. They’re too coarse-grained.

What We Should Be Measuring Instead

I’m proposing three new categories of AI-specific metrics:

1. Code Quality and Durability Metrics

Code Durability Score:

  • % of code that survives 30/60/90 days without modification
  • Hypothesis: AI-generated code gets rewritten more often
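To make this hypothesis testable, here's a minimal sketch of the durability calculation. It assumes you've already extracted per-line (written, rewritten) date pairs from `git blame` history; the extraction itself, the function name, and the input shape are all mine, not a standard tool:

```python
from datetime import date

def durability_score(lines, window_days):
    """Fraction of lines that survived `window_days` without modification.

    `lines` is a list of (written, rewritten_or_None) date pairs, e.g.
    reconstructed from `git blame`/`git log -L` history (not shown here).
    Only lines old enough to have been observable for the full window count,
    so recent code doesn't inflate the score.
    """
    today = date.today()
    eligible = [(w, r) for w, r in lines if (today - w).days >= window_days]
    if not eligible:
        return None
    survived = sum(
        1 for w, r in eligible
        if r is None or (r - w).days >= window_days
    )
    return survived / len(eligible)
```

Run it at 30, 60, and 90 days and segment by code source, and you have the durability comparison the hypothesis needs.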

Bug Attribution:

  • Production bugs per 1000 lines, segmented by code source (human vs AI-heavy)
  • Security vulnerabilities by code source
  • Hypothesis: AI code has higher defect density
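A sketch of the per-1000-lines segmentation, assuming you can attribute both bugs and lines of code to a source (commit trailers or IDE telemetry, both imperfect, and the attribution method is the hard part, not this arithmetic):

```python
def defect_density(bugs_by_source, loc_by_source):
    """Production bugs per 1000 lines, segmented by code source.

    `bugs_by_source` maps source -> production bug count attributed to it;
    `loc_by_source` maps source -> lines of code from that source.
    """
    return {
        source: 1000 * bugs_by_source.get(source, 0) / loc
        for source, loc in loc_by_source.items()
        if loc > 0
    }
```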

Review Efficiency:

  • Review comments per PR (normalized by PR size)
  • Review time per line of code
  • Rate of PRs accepted without changes
  • Hypothesis: AI code requires more review scrutiny
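These three review metrics fall out of one pass over exported PR records. A sketch, with illustrative field names rather than any real tracker's API:

```python
def review_metrics(prs):
    """Aggregate review-burden stats from a list of PR records.

    Each PR dict is assumed to carry `comments`, `lines_changed`,
    `review_minutes`, and `accepted_unchanged` (illustrative names;
    map them onto whatever GitHub/GitLab actually exports for you).
    """
    n = len(prs)
    total_lines = sum(p["lines_changed"] for p in prs)
    return {
        "comments_per_100_lines": 100 * sum(p["comments"] for p in prs) / total_lines,
        "review_minutes_per_line": sum(p["review_minutes"] for p in prs) / total_lines,
        "accept_unchanged_rate": sum(p["accepted_unchanged"] for p in prs) / n,
    }
```

Compute this separately for AI-heavy and human-authored PRs and the scrutiny hypothesis becomes a direct comparison.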

Technical Debt Accumulation:

  • Code complexity metrics (cyclomatic, cognitive) over time
  • Test coverage requirements (do we need more tests for AI code?)
  • Refactoring frequency
  • Hypothesis: AI code accumulates debt faster

2. Developer Capability Metrics

Debugging Proficiency:

  • Time to resolve production incidents (by developer experience level)
  • Ability to solve issues without AI assistance
  • Root cause analysis quality
  • Hypothesis: AI-dependent developers struggle with novel debugging

Knowledge Transfer:

  • Code explanation quality in PRs
  • Documentation completeness
  • Mentorship effectiveness (can seniors teach juniors who used AI?)
  • Hypothesis: AI reduces deep understanding, harming knowledge transfer

Skill Development Trajectory:

  • Time-to-promotion (Junior → Mid → Senior)
  • Technical interview performance over time
  • Architecture decision quality
  • Hypothesis: Heavy AI usage slows skill acquisition

3. Business Outcome Metrics

Value Delivery:

  • Time from idea to customer value (not just deployment)
  • Feature adoption and usage metrics
  • Customer-reported bugs per release
  • Hypothesis: More code ≠ more value

Cost of Quality:

  • Security incident costs (time + $ + reputation)
  • Support burden from production bugs
  • Engineering time spent on rework vs new features
  • Hypothesis: AI code quality issues have real business costs

Engineering Efficiency:

  • Revenue per engineer (ultimate productivity measure)
  • Cost per delivered feature
  • Engineering headcount as % of company (do we actually need fewer engineers?)
  • Hypothesis: AI should improve these if it’s really a productivity boost

Proposed Measurement Framework

Tier 1 - Business Metrics (Monthly Review):

  • Revenue per engineer
  • Customer-reported critical bugs
  • Time-to-market for revenue features
  • Security incidents and compliance violations

If Tier 1 is improving: AI is working, keep going.
If Tier 1 is flat/declining: Investigate with Tier 2.

Tier 2 - Team Metrics (Weekly Review):

  • Code durability scores
  • Review burden and PR cycle time
  • Change failure rate by code source
  • Technical debt indicators

If Tier 2 shows AI code quality problems: Adjust policies (restrict AI use for critical paths, require senior review for AI-heavy PRs).

Tier 3 - Individual Metrics (Daily/Real-time):

  • AI tool usage patterns
  • Code authorship attribution
  • Developer sentiment
  • Learning and skill development

Use Tier 3 for coaching: Identify developers who are too AI-dependent or not leveraging AI effectively.

The Baseline Problem

Here’s my challenge: We don’t have pre-AI baselines for most of these metrics.

We adopted AI tools in a rush (“everyone else is doing it”) without establishing measurement frameworks first. Now we’re trying to retrofit baselines from historical data that may not be comparable.

If I could do it over:

  1. :white_check_mark: Establish baseline metrics (DORA + new metrics above)
  2. :white_check_mark: Run controlled pilot (one team with AI, one without, same projects)
  3. :white_check_mark: Measure for 2 quarters before org-wide rollout
  4. :white_check_mark: Make data-driven adoption decision

But we skipped steps 1-3 and went straight to org-wide adoption. Now we’re flying blind.

What I’m Implementing Next Quarter

Starting Q2, we’re instrumenting:

Weeks 1-2: Baseline measurement

  • Current code durability scores
  • Review metrics across all teams
  • Developer capability assessments

Weeks 3-12: A/B testing within teams

  • Some features: “AI-heavy” approach (full tool access)
  • Some features: “AI-light” approach (restricted to boilerplate/tests only)
  • Track everything in the framework above
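The tracking for this A/B test doesn't need much: one record per feature, tagged by arm. A sketch of the schema and the headline comparison, with field names that are mine, not from any tool:

```python
from dataclasses import dataclass

@dataclass
class FeatureExperiment:
    """One feature tracked under the AI-heavy vs AI-light A/B test.

    Field names are illustrative; populate them from your issue tracker
    and CI exports however you can.
    """
    name: str
    arm: str                      # "ai_heavy" or "ai_light"
    engineer_days: float = 0.0
    cycle_time_days: float = 0.0
    post_release_bugs: int = 0

def compare_arms(features):
    """Post-release bugs per engineer-day, per arm (one possible headline)."""
    out = {}
    for arm in ("ai_heavy", "ai_light"):
        group = [f for f in features if f.arm == arm]
        days = sum(f.engineer_days for f in group)
        out[arm] = sum(f.post_release_bugs for f in group) / days if days else None
    return out
```

Swap the numerator for cycle time, rework hours, or customer-reported bugs to answer the other end-of-quarter questions from the same records.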

End of quarter: Business case review

  • Did AI-heavy features ship faster? With what quality?
  • Did AI-light features cost more engineer time but have better outcomes?
  • Which approach delivered more customer value?

Then we make evidence-based policy decisions about where AI helps and where it hurts.

The Question for This Group

What metrics are you using to evaluate AI tools beyond “developer sentiment”?

Has anyone found leading indicators that actually correlate with business outcomes? Or proven that DORA metrics are sufficient and I’m overthinking this?

Because right now, I’m preparing to defend (or cancel) a K/year tool budget, and “developers like it” isn’t going to cut it with our CFO.

Michelle, this is exactly the framework I’ve been trying to articulate. The three-tier structure is perfect.

The CFO Translation Layer

Your Tier 1 metrics are in CFO language, and that’s critical. Let me add how I’d pitch each one:

Revenue Per Engineer:

  • CFO hears: “Engineering efficiency”
  • What it actually measures: Are we delivering customer value faster or just generating more code?
  • Target: Should increase 10-15% annually if AI is working

Cost Per Feature:

  • CFO hears: “Unit economics improvement”
  • What it actually measures: Total engineering cost (salaries + tools + overhead) divided by features shipped
  • Target: Should decrease if AI truly improves productivity
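The arithmetic is trivial, but writing it down forces the point that AI tool spend belongs in the numerator. Illustrative numbers only:

```python
def cost_per_feature(salaries, tool_spend, overhead, features_shipped):
    """Fully loaded engineering cost per feature shipped in a period.

    Including tool_spend matters: if AI licenses grow the numerator
    faster than they grow features_shipped, this metric gets worse.
    """
    return (salaries + tool_spend + overhead) / features_shipped
```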

Customer-Reported Bugs:

  • CFO hears: “Quality assurance and support cost reduction”
  • What it actually measures: Does AI code quality affect customers?
  • Target: Should stay flat or decrease, not increase

Time-to-Market:

  • CFO hears: “Revenue acceleration”
  • What it actually measures: Idea to customer value (not just deployment)
  • Target: Should decrease for revenue-generating features

The Quarterly Review Framework

I’m stealing your approach but adding a business review cadence:

Monthly: Tier 1 metrics only (executive dashboard)
Quarterly: Full three-tier review with engineering leadership
Annually: Strategic decision (double down, adjust, or retreat on AI adoption)

This matches how CFOs think about investments—monthly monitoring, quarterly deep dives, annual strategic reviews.

The Baseline Challenge

You’re right that we skipped establishing baselines. But we can retrofit:

Historical Baselines (Imperfect but Better Than Nothing):

  • Revenue per engineer: Look at last 8 quarters, establish trend
  • Customer bugs: Historical severity-weighted bug rates
  • Time-to-market: Pick 10 recent features, reconstruct their timelines
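For the retrofitted trend, a least-squares slope over the historical quarters is enough. A minimal sketch in plain Python, assuming evenly spaced quarterly values:

```python
def quarterly_trend(values):
    """Least-squares slope of a metric over consecutive quarters.

    Use this to turn, say, 8 quarters of pre-AI revenue-per-engineer
    into a baseline trend, so post-adoption numbers are judged against
    the trajectory rather than a single point-in-time snapshot.
    """
    n = len(values)
    xs = range(n)
    x_mean = sum(xs) / n
    y_mean = sum(values) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, values))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den
```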

Control Group Baselines (Starting Now):

  • Identify 1-2 teams to be “AI-light” for comparison
  • Match them to similar teams on AI-heavy approach
  • Track everything for 2 quarters

Yes, there’s selection bias (teams opt into being AI-light). But it’s better than no comparison.

What I’m Proposing as Success Criteria

After 2 quarters of measurement:

AI Tools Are Working If:

  • :white_check_mark: Tier 1 metrics improving (revenue per engineer up, time-to-market down)
  • :warning: Tier 2 shows some quality tradeoffs (acceptable if Tier 1 justifies it)
  • :warning: Tier 3 shows mixed developer sentiment (some love it, some hate it)

AI Tools Are Questionable If:

  • :neutral_face: Tier 1 metrics flat (no business value despite productivity claims)
  • :cross_mark: Tier 2 shows quality degradation (more bugs, slower reviews)
  • :warning: Tier 3 shows capability concerns (juniors can’t function without AI)

AI Tools Should Be Restricted If:

  • :cross_mark: Tier 1 metrics declining (business value going down)
  • :cross_mark: Tier 2 shows significant quality/debt problems
  • :cross_mark: Tier 3 shows skill erosion and AI dependency

The key: Be willing to pull back if data says AI isn’t working, despite industry hype.

The Missing Metric: Customer Value

One thing I’d add to your framework:

Feature Value Realization Rate:

  • Of features shipped, what % are actually used by customers?
  • What % drive measurable customer outcomes (engagement, conversion, retention)?
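A sketch of how I'd compute the realization rate, assuming product analytics can give you 90-day adoption per feature; the 5% threshold is an arbitrary cut-off to tune per product, not a standard:

```python
def value_realization_rate(features, adoption, threshold=0.05):
    """Share of shipped features that clear a minimum adoption bar.

    `adoption` maps feature name -> fraction of active users who touched
    the feature in its first 90 days (from your analytics tool);
    `threshold` is the illustrative 5% bar.
    """
    if not features:
        return None
    used = sum(1 for f in features if adoption.get(f, 0.0) >= threshold)
    return used / len(features)
```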

Because one risk of AI-accelerated development is shipping features faster than customers can absorb them, or shipping the wrong features faster.

I want to make sure “productivity” means “delivering customer value” not just “generating code.”