Measuring AI Coding Tools: Are We Tracking Velocity When We Should Measure Cognitive Load?

We’ve crossed a threshold in engineering: 91% of organizations now use AI coding tools. But the conversation has fundamentally shifted—we’re no longer debating whether to adopt AI, we’re struggling with how to measure what it’s actually doing.

I’m leading a 120-person engineering organization through AI tool rollout, and I’m hitting a measurement wall that I suspect many of you are facing too.

The Velocity Trap We’re All Walking Into

Here’s what I’m seeing: Teams feel faster. Developers report increased productivity in surveys. Managers point to higher PR counts. But when I look at our delivery metrics—actual features shipped to customers, time from idea to production—the needle hasn’t moved. In some cases, it’s moved backward.

We’re measuring the wrong thing.

The One Clear Win: Stack Trace Analysis

There’s one use case where the ROI is undeniable: debugging. When a developer hits a cryptic error message—especially in our distributed systems—AI tools excel at answering “What does this error actually mean?”

We’ve measured this: our mean time to recovery (MTTR) for infrastructure-related bugs dropped 40% after deploying AI-assisted debugging. The workflow is clean: copy stack trace, get explanation, verify fix, ship. No ambiguity, minimal cognitive load, easy to measure.
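
For anyone who wants to reproduce the measurement, here's a minimal sketch of how we compute MTTR, assuming you can export incident records with opened/resolved timestamps and a category tag. The field names and data below are illustrative, not from any particular incident tool.

```python
from datetime import datetime
from statistics import mean

# Illustrative incident records; in practice these come from your
# incident tracker's export (field names here are assumptions).
incidents = [
    {"category": "infrastructure", "opened": "2024-03-01T10:00", "resolved": "2024-03-01T13:30"},
    {"category": "infrastructure", "opened": "2024-03-04T09:15", "resolved": "2024-03-04T10:00"},
    {"category": "application",    "opened": "2024-03-05T14:00", "resolved": "2024-03-05T15:00"},
]

def mttr_hours(records, category):
    """Mean time to recovery, in hours, for one incident category."""
    durations = [
        (datetime.fromisoformat(r["resolved"]) -
         datetime.fromisoformat(r["opened"])).total_seconds() / 3600
        for r in records if r["category"] == category
    ]
    return mean(durations) if durations else None

# Run once over the pre-rollout window and once over the post-rollout
# window; the delta between the two is the number worth reporting.
print(f"Infra MTTR: {mttr_hours(incidents, 'infrastructure'):.2f} h")
```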

This is the AI success story we can actually articulate to the board.

But Here’s What Keeps Me Up at Night

While our debugging metrics improved, other signals are flashing yellow:

  • Pull requests increased 20% (sounds great!)
  • Incidents per pull request jumped 23.5% (not great)
  • Change failure rate increased 30% (definitely not great)

METR’s study of AI-assisted development found that on complex, novel tasks, senior developers were actually 19% slower when using AI. The culprit? Cognitive load from verification overhead. AI generates code that “looks right” but requires deep inspection to ensure it’s actually right.

The time cost isn’t in the generation—it’s in the review, the debugging of subtle issues, the refactoring of code that technically works but doesn’t fit our patterns.

The Metric We’re Not Tracking: Cognitive Load

What if velocity is the wrong metric entirely?

When developers switch from “coding mode” to “prompting mode” to “verification mode,” they’re paying a context-switching tax that flow state research shows is measurable and significant. The dopamine hit from instant AI suggestions creates a halo effect—developers believe they’re 20% faster even when the data shows they’re slower.

I’m starting to think the real value isn’t speed—it’s cognitive load reduction. And we’re barely measuring that at all.

Stack trace analysis works precisely because it reduces cognitive load (no more hunting through documentation and Stack Overflow for obscure errors). Code generation often increases cognitive load (now you’re verifying someone else’s code that you didn’t write).

So Here’s My Question to This Community

What metrics are you actually using to measure AI coding tools’ impact?

Are you tracking:

  • Flow time (sustained focus periods without context switching)?
  • Friction points (where developers get stuck and for how long)?
  • Developer satisfaction (subjective but important)?
  • Quality metrics tied to AI usage (defect rates, review cycles, technical debt)?

Or are you, like many organizations, defaulting to velocity metrics (PRs, lines of code, commits) because they’re easy to measure, even if they don’t tell the full story?

We’re at an inflection point. The teams that figure out how to measure AI’s actual impact—not just its perceived speed—are going to make much better decisions about where to invest in these tools and where to pull back.

I’d love to hear what’s working for you. What are you measuring? What have you tried and abandoned? Where are you seeing clear ROI vs. measurement confusion?


This resonates deeply with what I’m seeing in financial services. The measurement challenge you’re describing isn’t just an AI problem—it’s a fundamental challenge in how we think about engineering productivity, and AI is amplifying it.

Velocity Metrics Created the Wrong Incentives (Even Before AI)

Three years ago, we made the mistake of optimizing for velocity: story points completed, PRs merged, deployment frequency. The result? Developers gamed the metrics. Changes that should have shipped as one cohesive PR got split into five small ones. Complex refactors got deprioritized because they didn’t show immediate velocity.

When we introduced AI tools last year, those same broken incentives got supercharged. More PRs, faster commits, higher velocity numbers—and our change failure rate went through the roof.

We Shifted to Outcome Metrics, Not Output Metrics

Here’s what actually moved the needle for us:

Instead of measuring velocity, we measure (a rough computation sketch follows this list):

  • Mean Time to Recovery (MTTR) - How fast can we recover from incidents?
  • Deployment frequency to production - Not to staging, to actual customers
  • Change failure rate - What percentage of changes cause incidents?
  • Lead time for changes - From commit to customer value
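
For concreteness, here's roughly how three of these fall out of deployment logs (MTTR works the same way from incident records, as above). It assumes each deployment record carries its environment, its timestamp, its first-commit timestamp, and whether it caused an incident; the field names are made up, so adapt them to whatever your CI/CD actually emits.

```python
from datetime import datetime

# Assumed deployment records exported from CI/CD (fields are illustrative).
deploys = [
    {"env": "production", "deployed": "2024-06-03T16:00",
     "first_commit": "2024-06-01T09:00", "caused_incident": False},
    {"env": "production", "deployed": "2024-06-05T11:00",
     "first_commit": "2024-06-04T10:00", "caused_incident": True},
    {"env": "staging", "deployed": "2024-06-05T12:00",
     "first_commit": "2024-06-05T08:00", "caused_incident": False},
]

prod = [d for d in deploys if d["env"] == "production"]  # staging doesn't count

window_days = 7
deployment_frequency = len(prod) / window_days            # deploys/day to customers
change_failure_rate = sum(d["caused_incident"] for d in prod) / len(prod)
lead_time_hours = sum(
    (datetime.fromisoformat(d["deployed"]) -
     datetime.fromisoformat(d["first_commit"])).total_seconds() / 3600
    for d in prod
) / len(prod)

print(f"freq: {deployment_frequency:.2f}/day, "
      f"CFR: {change_failure_rate:.0%}, lead time: {lead_time_hours:.1f} h")
```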

AI’s impact on these metrics has been mixed:

  • ✅ MTTR improved dramatically (your stack trace point is spot-on)
  • ⚠️ Change failure rate initially got worse before we intervened
  • ✅ Deployment frequency stayed stable (not worse, which is something)
  • ❓ Lead time unclear - faster coding, but longer review cycles

The Cognitive Load Measurement Gap

You’re absolutely right that cognitive load is the missing metric, but here’s my challenge: How do you measure it?

We’ve experimented with:

  • Developer experience surveys (monthly NPS-style)
  • Friction point logging (devs report when they get stuck)
  • Flow state tracking (time blocks without context switches; see the sketch after this list)
  • Code review iteration counts (how many back-and-forths before merge)
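
On the flow-state tracking, since people often ask how we do it: a heavily simplified sketch. It assumes you can export timestamped activity events per developer (editor focus, chat, browser, and so on); the 25-minute threshold and the event source are our choices, not any standard.

```python
from datetime import datetime

# Assumed activity events: (timestamp, app) per developer, from whatever
# telemetry you have. App names and the threshold are illustrative choices.
events = [
    ("2024-06-10T09:00", "editor"),
    ("2024-06-10T09:20", "editor"),
    ("2024-06-10T09:55", "editor"),
    ("2024-06-10T09:58", "chat"),    # context switch ends the block
    ("2024-06-10T10:05", "editor"),
]

FLOW_MINUTES = 25  # minimum uninterrupted stretch we count as "flow"

def flow_blocks(evts, flow_app="editor"):
    """Return lengths (in minutes) of uninterrupted runs in flow_app."""
    blocks, start, prev = [], None, None
    for ts_str, app in evts:
        ts = datetime.fromisoformat(ts_str)
        if app == flow_app:
            start = start or ts
            prev = ts
        else:
            if start and (prev - start).total_seconds() / 60 >= FLOW_MINUTES:
                blocks.append((prev - start).total_seconds() / 60)
            start = prev = None
    if start and (prev - start).total_seconds() / 60 >= FLOW_MINUTES:
        blocks.append((prev - start).total_seconds() / 60)
    return blocks

print(flow_blocks(events))  # [55.0] -> one ~55-minute flow block
```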

The surveys revealed something important: junior engineers love AI tools (feel empowered), senior engineers are frustrated (spending more time reviewing AI-generated code). That’s a cognitive load redistribution, not reduction.

Has Anyone Measured Cognitive Load Directly?

I’d love to hear if anyone has cracked this. We’re using proxies (satisfaction surveys, review cycles), but those feel indirect.

One thing I am confident about: Measurement drives behavior. If you measure lines of code, you get more lines. If you measure PRs, you get more PRs. If you measure customer value and quality, you get teams thinking about those things.

AI tools will optimize for whatever we measure. We need to get the metrics right, or we’ll optimize for the wrong outcomes at scale.

Your point about the 19% slower finding is fascinating—I suspect that’s real in our org too, but hidden in aggregate velocity numbers. Might need to instrument this at the individual developer level to see it.

This conversation is hitting on something I’ve been thinking about from a product perspective: velocity without outcomes is a vanity metric, and it applies just as much to engineering productivity as it does to product development.

The Product Parallel: Features Shipped vs Customer Value

In product, we learned this lesson years ago. You can ship 50 features in a quarter and feel incredibly productive, but if none of those features drive adoption, retention, or revenue, what did you actually accomplish?

I’ve seen teams celebrate velocity while customer satisfaction scores decline. The metrics said “success,” but the business said “problem.”

AI Tools Are Amplifying This Disconnect

What I’m noticing: Engineering teams using AI to ship more code, but that doesn’t automatically translate to more customer value.

Real example from our B2B fintech startup: We used AI tools to accelerate development of a new enterprise feature. Engineering velocity metrics looked great—shipped 30% faster than projected.

But:

  • Product quality suffered (more customer-reported bugs in first month)
  • Feature adoption was lower than expected (UX wasn’t as polished)
  • We spent the next sprint fixing issues instead of building new capabilities

Net result: Slower time to customer value, despite faster time to initial deployment.

A Framework: Input → Output → Outcome

I propose thinking about AI measurement in three tiers (sketched as a data model after the lists):

Input Metrics (Are we using AI?)

  • AI tool adoption rates
  • Percentage of code assisted by AI
  • Developer engagement with AI features

Output Metrics (Are we producing more?)

  • PRs per developer
  • Lines of code
  • Deployment frequency
  • Velocity/story points

Outcome Metrics (Are we delivering value?)

  • Time to customer value (from idea to customer usage)
  • Customer-reported bugs (not internal test coverage)
  • Feature adoption and engagement
  • Customer satisfaction (NPS, CSAT)
  • Revenue impact (for B2B features)
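
To make the tiers concrete, here's a minimal sketch of the framework as a data model. Every field name is illustrative; the point is structural: outcomes live in a separate bucket from outputs, so a dashboard can't quietly substitute one for the other.

```python
from dataclasses import dataclass

@dataclass
class InputMetrics:        # Are we using AI?
    ai_adoption_rate: float        # fraction of devs using the tools
    ai_assisted_code_share: float  # fraction of merged code AI-assisted

@dataclass
class OutputMetrics:       # Are we producing more?
    prs_per_dev: float
    deploys_per_week: float

@dataclass
class OutcomeMetrics:      # Are we delivering value?
    days_to_customer_value: float  # idea -> customers actually using it
    customer_reported_bugs: int
    feature_adoption_rate: float

@dataclass
class QuarterSnapshot:
    inputs: InputMetrics
    outputs: OutputMetrics
    outcomes: OutcomeMetrics

    def outputs_up_outcomes_down(self, prev: "QuarterSnapshot") -> bool:
        """Flag the trap: producing more while delivering less."""
        return (self.outputs.prs_per_dev > prev.outputs.prs_per_dev
                and self.outcomes.customer_reported_bugs
                    > prev.outcomes.customer_reported_bugs)
```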

The Trap: Optimizing Outputs While Outcomes Degrade

Your data perfectly illustrates this:

  • Output: PRs increased 20% ✅
  • Outcome: Incidents per PR jumped 23.5%, change failure rate up 30% ❌

We’re optimizing the wrong level of the stack.

What Should Engineering Leaders Actually Measure?

Here’s my challenge to this community: How do we tie AI productivity gains to business outcomes, not just engineering outputs?

Some ideas I’m exploring:

  • Time to customer value: From commit to feature in customer hands (and being used)
  • Quality-adjusted velocity: Velocity discounted by defect rates or rework (see the sketch after this list)
  • Customer impact per sprint: What percentage of work directly improves customer metrics?
  • Technical debt trajectory: Are we accruing debt faster with AI assistance?
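
On quality-adjusted velocity specifically, here's a tiny sketch of one way to operationalize it, assuming you can attribute rework (bug-fix or revert points) back to the sprint whose output caused it. The subtraction is our starting guess, not an established formula.

```python
def quality_adjusted_velocity(raw_points: float, rework_points: float) -> float:
    """Velocity minus the later rework the same work generated.

    raw_points:    story points completed in the sprint
    rework_points: points later spent fixing/reverting that sprint's output
    """
    return raw_points - rework_points

# Two sprints with identical raw velocity read very differently once
# rework is charged back to the sprint that caused it:
print(quality_adjusted_velocity(40, 2))   # 38 -> healthy
print(quality_adjusted_velocity(40, 15))  # 25 -> a "fast" sprint that wasn't
```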

The honest truth from our startup: We’re faster at generating code, but I’m not convinced we’re faster at delivering customer value. The bottlenecks have shifted (from coding to review, from development to QA, from shipping to bug fixing), but the end-to-end cycle hasn’t improved.

The Question Nobody Wants to Ask

Are we measuring AI productivity because it makes engineering feel productive, or because it actually moves business outcomes?

If AI tools help us ship low-quality features 30% faster, is that actually better for the business? Or would we be better off shipping fewer, higher-quality features at the original pace?

I don’t have the answer, but I think we’re asking the wrong questions if we’re focused purely on velocity.

Okay, can I be brutally honest from a design systems perspective? This whole conversation is giving me flashbacks to what we’re dealing with right now.

AI Makes Developers Fast at the Wrong Things

Here’s what I’m seeing: Developers use AI to generate components quickly. Sounds great, right? Ship faster, build more, everyone wins.

Except…

The AI-generated code works. Tests pass. Feature ships. But then I look at it from a design perspective and it’s a consistency nightmare.

Real example: Developer needed a button variant. Instead of using our design system component, they asked AI to generate one. AI created a perfectly functional button with:

  • Inline styles instead of design tokens
  • Custom padding values instead of our spacing system
  • One-off color hex codes instead of theme variables
  • No accessibility attributes from our a11y guidelines

Technically correct. Functionally works. Completely undermines our design system.

The Verification Overhead Is Crushing Us

Here’s the part that connects to your cognitive load point: Reviewing AI-generated code is mentally exhausting.

When I review human-written code, I can usually spot pattern violations quickly. But AI-generated code looks right. It’s syntactically perfect, follows basic conventions, and the logic makes sense.

The problems are subtle:

  • Uses valid CSS but not our design tokens
  • Creates components that work but don’t fit our component library
  • Implements functionality correctly but with different naming patterns
  • Passes tests but has accessibility issues our a11y team would catch

I have to read every line with deep focus because AI code looks plausible and polished. Human code often has obvious tells that reveal the developer’s intent (or mistakes).

We’re Trading Short-Term Velocity for Long-Term Maintainability

Your metrics mirror what I’m seeing:

  • ✅ Initial development faster
  • ❌ Code review cycles longer (more iterations to fix design system violations)
  • ❌ Design debt accumulating (inconsistent components proliferating)
  • ❌ A11y audit findings increased (AI doesn’t understand accessibility context)

Net result: We’re shipping individual features faster but the codebase is degrading. Six months from now, maintenance costs will be higher.

What We’re Trying to Measure

Since you asked what metrics we’re using, here’s what we’re experimenting with:

Design System Compliance

  • Percentage of PRs using design system components (vs one-off implementations)
  • Design token usage (are devs using tokens or hardcoded values? see the scanner sketch after this list)
  • Component proliferation (how many “button” variants exist in the codebase?)
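
To show what the token-usage metric looks like in practice: a toy scanner that flags hardcoded hex colors and pixel values in stylesheet source. Real enforcement belongs in a lint rule (Stylelint or similar), but even a crude script like this, run over a PR diff, gives you a compliance percentage to trend. The patterns assume tokens look like var(--name), so adjust for your own system.

```python
import re

# Patterns for values that should come from design tokens instead.
# These are assumed conventions; tune them to your token syntax.
HARDCODED = re.compile(r"#[0-9a-fA-F]{3,8}\b|\b\d+px\b")
TOKEN = re.compile(r"var\(--[\w-]+\)")

def token_compliance(css_source: str) -> float:
    """Fraction of style values drawn from tokens vs hardcoded."""
    hardcoded = len(HARDCODED.findall(css_source))
    tokens = len(TOKEN.findall(css_source))
    total = hardcoded + tokens
    return tokens / total if total else 1.0

sample = """
.button { padding: 12px; color: #1a73e8; background: var(--color-primary); }
"""
print(f"{token_compliance(sample):.0%} token usage")  # 33% -> flag in review
```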

Code Review Burden

  • Review iteration cycles (how many rounds before merge? see the sketch after this list)
  • Time spent in review (per PR, per reviewer)
  • Design team involvement in code reviews (increasing = bad signal)
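
For the iteration-cycle count, here's the shape of what we do, assuming you've already fetched the review timeline (GitHub's pull request reviews endpoint returns records with a state field). Treating each CHANGES_REQUESTED as opening a new round is our convention, not a standard.

```python
# Reviews as returned by GitHub's
# GET /repos/{owner}/{repo}/pulls/{number}/reviews endpoint.
# Only the "state" field matters here; the list below is illustrative.
reviews = [
    {"state": "CHANGES_REQUESTED"},
    {"state": "COMMENTED"},
    {"state": "CHANGES_REQUESTED"},
    {"state": "APPROVED"},
]

def review_rounds(reviews) -> int:
    """Iteration cycles before merge: each CHANGES_REQUESTED opens a
    new round, and the final approval closes the last one."""
    return sum(r["state"] == "CHANGES_REQUESTED" for r in reviews) + 1

print(review_rounds(reviews))  # 3 rounds for this PR
```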

Quality Metrics

  • Accessibility audit findings
  • Visual regression test failures
  • Design QA findings

The trend since AI adoption: All of these metrics got worse.

The Honest Reality

AI doesn’t understand the why behind design decisions. It knows “how to make a button” but not “why we use this specific button pattern in this context.”

That contextual knowledge—design intent, accessibility requirements, brand consistency—requires human judgment. AI can’t replicate it, so we’re stuck with the verification overhead you described.

I love the stack trace debugging use case because it’s narrow and verifiable. But code generation? We’re finding the cognitive load is significant and the quality trade-offs are real.

Maybe the answer isn’t “how do we measure AI velocity” but “where should we use AI and where shouldn’t we?”

This conversation is crystallizing something I’ve been wrestling with as VP of Engineering: we’re measuring at the wrong level entirely.

Individual vs Team vs Organizational Measurement

Let me frame this differently. AI impact varies dramatically depending on which level you’re measuring:

Individual Developer Level

  • Flow state and focus time
  • Cognitive load and context switching
  • Learning velocity and skill development
  • Job satisfaction and autonomy

Team Level

  • Code quality and review effectiveness
  • Collaboration patterns and knowledge sharing
  • Cycle time and delivery predictability
  • Technical debt trajectory

Organizational Level

  • Business value delivery
  • Innovation capacity
  • Talent retention and culture
  • Competitive positioning

Here’s the problem: AI might help individuals while hurting teams, or help teams while organizational outcomes remain unclear.

The Data I’m Seeing at Scale

We’re scaling from 25 to 80+ engineers, and I’m watching AI impact play out across these levels:

Individual level: Mixed signals

  • Junior engineers report feeling empowered (can tackle problems above their level)
  • Senior engineers report frustration (spending more time reviewing, less time architecting)
  • Net cognitive load: Unclear - it’s been redistributed, not necessarily reduced

Team level: Concerning trends

  • Code review becoming bottleneck (more code to review, higher scrutiny needed)
  • Quality metrics degrading (more incidents, longer resolution times for subtle bugs)
  • Knowledge sharing shifting (less pairing, more solitary AI-assisted coding)

Organizational level: Too early to tell

  • Delivery throughput hasn’t improved measurably
  • Innovation capacity questionable (are we building new things or just faster old things?)
  • Culture impact unclear but worth monitoring (developer satisfaction, retention)

The 19% Slower Finding Resonates

That research about senior developers being 19% slower on complex tasks? I think that’s real, and it’s a scaling problem.

If your best engineers—the ones who should be focused on architecture, complex problem-solving, and mentoring—are instead spending more time:

  • Verifying AI-generated code
  • Reviewing AI-assisted PRs from junior engineers
  • Fixing subtle bugs that AI introduced
  • Teaching juniors not to over-trust AI suggestions

…then you’ve got an organizational capacity problem, not an individual productivity win.

What We Should Actually Be Measuring

Based on this discussion, I’m evolving my thinking on metrics:

Leading Indicators (Individual/Team)

  • Flow state quality (sustained focus without AI interruption)
  • Code review burden (time, iterations, complexity)
  • Knowledge distribution (how many people can solve X problem? see the git-history sketch after this list)
  • Developer sentiment on cognitive load
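
One cheap proxy we've tried for knowledge distribution is counting distinct recent committers per area of the codebase straight from git log. A sketch, assuming it runs inside the repo; a directory where this prints 1 is a bus-factor risk no matter what velocity says.

```python
import subprocess
from collections import defaultdict

def committers_by_dir(since="6 months ago"):
    """Distinct recent authors per top-level directory, via `git log`."""
    out = subprocess.run(
        ["git", "log", f"--since={since}", "--format=--%an", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    authors_for = defaultdict(set)
    author = None
    for line in out.splitlines():
        if line.startswith("--"):
            author = line[2:]              # commit boundary: author name
        elif "/" in line:
            authors_for[line.split("/")[0]].add(author)
    return {d: len(a) for d, a in sorted(authors_for.items())}

# Directories where this prints 1 are places only one person can fix.
print(committers_by_dir())
```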

Lagging Indicators (Organizational)

  • Customer value delivery time (idea to customer usage)
  • Quality-adjusted throughput (velocity * quality)
  • Technical health (debt accumulation, system complexity)
  • Talent metrics (retention, satisfaction, growth)

The key insight: Don’t optimize one level at the expense of others.

If individual velocity goes up 20% but team quality drops 30% and organizational delivery stays flat, that’s not a win.
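
To make that check mechanical, here's a toy version over per-level deltas, normalized so positive means improved. The thresholds and metric names are placeholders; the signal is the comparison across levels, never any single number.

```python
# Quarter-over-quarter deltas, positive = improvement (values mirror
# the example above; where they come from is up to your instrumentation).
deltas = {
    "individual": {"velocity": +0.20},
    "team":       {"quality": -0.30},
    "org":        {"delivery": 0.00},
}

def misaligned(deltas, threshold=0.10):
    """True when one level clearly improves while another clearly degrades."""
    best = max(v for metrics in deltas.values() for v in metrics.values())
    worst = min(v for metrics in deltas.values() for v in metrics.values())
    return best > threshold and worst < -threshold

print(misaligned(deltas))  # True: an individual win the team is paying for
```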

The Question That Keeps Me Up

Are we optimizing for individual productivity or organizational effectiveness?

Because those might not be the same thing. And if AI helps individuals generate more code but creates organizational review bottlenecks, quality degradation, and senior engineer burnout… that’s a Pyrrhic victory.

Michelle, your question about “what metrics matter” is the right one. But I think we need to be measuring at multiple levels simultaneously and looking for misalignments.

When individual metrics look great but organizational metrics don’t move (or worse, degrade), that tells you something important: you’ve optimized the wrong thing.

What I’d love to see from this community: frameworks for multi-level AI impact measurement. Not just “are developers faster” but “is the organization healthier, more effective, and delivering more value?”

Anyone else measuring across these levels? What are you seeing?