80% of Orgs Don't Measure AI Impact - Flying Blind on the Biggest Tech Shift?

I spend my days building measurement frameworks and data pipelines, so when I read the latest industry reports on AI adoption in engineering, I genuinely felt a chill. The data is damning: only 20% of engineering teams are using metrics to measure the impact of their AI tools (Cortex 2026 Benchmark Report). That means the vast majority of organizations have deployed AI coding assistants across their engineering orgs with no systematic way to know if the investment is working.

Let me put this in perspective. According to Deloitte’s State of AI Report 2026, organizations have moved from experimental sandboxes to full-scale commercial deployment. Gartner reports that 90% of engineering teams now use AI in their workflows, up from 61% just a year ago. Yet the measurement infrastructure has not kept pace with the adoption curve. We are flying blind on the biggest technology shift in software engineering since the move to cloud.

The “Feeling Productive” Trap

Here is what I see most often: teams adopt Copilot or Cursor, developers report “feeling more productive,” and leadership checks the box on AI transformation. But feelings are not data. The METR study — a rigorous randomized controlled trial with experienced open-source developers working on their own repositories — found that developers using AI tools actually took 19% longer to complete tasks, despite self-reporting that they felt faster. Read that again. They felt faster. They were slower.

This perception gap is exactly why measurement matters. Without baseline data and controlled comparisons, organizations are making multi-million-dollar investment decisions based on vibes.

What You Should Be Measuring

If I were building an AI impact measurement framework from scratch (and I have, twice now), here is what I would track:

Tier 1 — Delivery Metrics (Before/After AI Adoption):

  • PR cycle time — from first commit to merge. Slice by AI-assisted vs. non-assisted PRs.
  • Change failure rate — critical, because the Cortex 2026 data shows a 30% increase in change failure rates alongside AI adoption. Speed without stability is regression, not progress.
  • Deployment frequency — are you actually shipping more, or just writing more code?
  • Incident resolution time — does AI help you recover faster?
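
As a minimal sketch of the first bullet, here is how the slicing could work, assuming each PR record carries an `ai_assisted` label (the field name and the records are hypothetical; in practice the label might come from tool telemetry or a PR-template checkbox, and the records from your Git host's API):

```python
from datetime import datetime
from statistics import median

# Hypothetical PR records; real data would come from your Git host's API,
# with `ai_assisted` derived from telemetry or a PR-template checkbox.
prs = [
    {"first_commit": datetime(2026, 1, 5, 9),  "merged": datetime(2026, 1, 6, 15), "ai_assisted": True},
    {"first_commit": datetime(2026, 1, 5, 10), "merged": datetime(2026, 1, 8, 10), "ai_assisted": False},
    {"first_commit": datetime(2026, 1, 7, 9),  "merged": datetime(2026, 1, 7, 18), "ai_assisted": True},
    {"first_commit": datetime(2026, 1, 7, 11), "merged": datetime(2026, 1, 9, 11), "ai_assisted": False},
]

def median_cycle_hours(records, ai_assisted):
    """Median first-commit-to-merge time, in hours, for one cohort."""
    durations = [
        (r["merged"] - r["first_commit"]).total_seconds() / 3600
        for r in records
        if r["ai_assisted"] == ai_assisted
    ]
    return median(durations)

print(f"AI-assisted:  {median_cycle_hours(prs, True):.1f} h")
print(f"Non-assisted: {median_cycle_hours(prs, False):.1f} h")
```

The point of the cohort split is that an org-wide median hides exactly the comparison you need: whether AI-assisted work moves through the pipeline faster, slower, or the same.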

Tier 2 — Quality Signals:

  • Code churn rate — code added then quickly deleted or rewritten. Industry data shows code churn climbing to nearly 7%, with AI-generated code a major driver.
  • Defect escape rate — bugs that make it to production. Cortex found incidents per PR jumped 23.5% alongside AI adoption gains.
  • Review turnaround — LinearB’s analysis of 8.1M pull requests shows AI-generated PRs wait 4.6x longer for review. That bottleneck matters.
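
To make the churn metric concrete, here is one way to define it, assuming you can attribute each added line to a removal date (the record format is hypothetical; real data would come from mining `git log -p` or a tool that does this for you):

```python
from datetime import datetime, timedelta

# Hypothetical line-level history: when each line was added and, if it was
# later deleted or rewritten, when. Illustrative records, not real data.
lines = [
    {"added": datetime(2026, 1, 2), "removed": datetime(2026, 1, 9)},   # churned
    {"added": datetime(2026, 1, 2), "removed": None},                   # survived
    {"added": datetime(2026, 1, 3), "removed": datetime(2026, 3, 1)},   # aged out
    {"added": datetime(2026, 1, 4), "removed": datetime(2026, 1, 10)},  # churned
]

def churn_rate(line_records, window_days=14):
    """Fraction of added lines deleted or rewritten within `window_days`."""
    window = timedelta(days=window_days)
    churned = sum(
        1 for r in line_records
        if r["removed"] is not None and r["removed"] - r["added"] <= window
    )
    return churned / len(line_records)

print(f"{churn_rate(lines):.0%} of new lines churned within 14 days")
```

The window length is a judgment call; the key is to pick one definition and hold it constant across your before/after comparison.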

Tier 3 — Developer Experience:

  • Developer satisfaction surveys — quarterly, anonymous, with specific questions about AI tool effectiveness.
  • Context switching frequency — are AI tools reducing or adding to cognitive load?
  • Learning and growth perception — are developers still building skills or becoming dependent?

What You Should NOT Measure

This matters just as much. Do not use these as AI impact metrics:

  • Lines of code — AI inflates this metric into meaninglessness. More code is not better code.
  • Commits per day — same problem. Frequency of commits tells you nothing about value delivered.
  • AI tool acceptance rate — a high Copilot acceptance rate might mean developers are accepting suggestions uncritically, not that the suggestions are good.
  • Individual developer output — I cannot stress this enough. Measuring AI impact at the individual level creates perverse incentives and surveillance anxiety. More on this below.

A Practical Framework

The companies getting this right — and Webflow is a good public example — are slicing their existing DORA metrics by AI vs. non-AI usage, by developer tenure, by language, and by repository complexity. Webflow discovered that engineers with 3+ years at the company benefit most from AI tools, seeing roughly a 20% throughput increase, while newer engineers saw negligible or negative effects.

That kind of nuance only emerges from deliberate measurement. You cannot gut-feel your way to that insight.

The bottom line: if your organization has deployed AI coding tools without a measurement framework, you are not being innovative. You are being negligent. The 2026 data is making it abundantly clear that AI adoption without measurement is a recipe for mounting technical debt, degraded reliability, and wasted budget — all hidden behind a comforting narrative of developer satisfaction surveys that say “we feel faster.”

I would love to hear how others are approaching this. What are you measuring? What surprised you in the data?

Rachel, this post could not have come at a better time. I just walked out of a board meeting where the question was put to me directly: “Carlos, what is the ROI on our $2.1M annual AI tooling investment?” And the honest answer was that nobody in the organization could give them a defensible number.

Here is the financial reality. We have 450 developers on GitHub Copilot Enterprise at $39/user/month — that is $210,600 annually just for the seat licenses. But when I tried to build a business case for renewal, our engineering leadership gave me sentiment data. Surveys. Anecdotes about faster code completion. The CFO looked at me like I had brought finger paintings to an earnings call.

So I built my own framework, and I want to share it because I think finance leaders across the industry are facing the same problem.

The AI Tool ROI Framework (Beyond Productivity)

Direct Productivity Value:
Take your average fully-loaded developer cost ($185K at our company), divide by working hours, and you get a blended rate. If AI tools save even 30 minutes per day per developer — a conservative estimate from GitHub’s own studies — that is roughly $11,500 per developer per year in recaptured time. Against $468/year in license costs, the math looks great on paper. But as Rachel points out, the METR study suggests those 30 minutes might be illusory.
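
Written out explicitly, the arithmetic in that paragraph looks like this (the cost, license price, and 30-minute figure are from the paragraph above; the 2,000 working hours, i.e. roughly 250 days at 8 hours, is my assumption for the denominator):

```python
FULLY_LOADED_COST = 185_000                       # annual fully loaded dev cost
WORKING_HOURS = 2_000                             # ~250 days x 8 h (assumption)
HOURLY_RATE = FULLY_LOADED_COST / WORKING_HOURS   # blended rate: $92.50/h

MINUTES_SAVED_PER_DAY = 30                        # conservative estimate cited above
WORKING_DAYS = 250
hours_recaptured = MINUTES_SAVED_PER_DAY / 60 * WORKING_DAYS  # 125 h/year

value_per_dev = HOURLY_RATE * hours_recaptured    # ~$11,500/dev/year
license_cost = 39 * 12                            # Copilot Enterprise seat: $468/year

print(f"Recaptured time value: ${value_per_dev:,.0f}/dev/year")
print(f"License cost:          ${license_cost}/dev/year")
```

Note how sensitive the result is to the minutes-saved input: if the METR finding holds and the saved time is illusory, the entire numerator goes to zero or negative.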

Retention and Hiring Competitiveness:
This is the value most finance teams miss. In our last engagement survey, 73% of engineers said access to AI tools was a factor in their decision to stay. With replacement costs at 1.5x to 2x annual salary, even a marginal improvement in retention — say, 2-3 fewer departures per year — saves $550K-$740K annually. That alone covers the entire Copilot investment three times over.
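
The retention arithmetic spelled out, using the figures from the paragraph above (the $185K salary proxy and the two-departure count are the conservative ends of the stated ranges; the text's "$550K" is this $555K rounded):

```python
SALARY = 185_000                      # using fully loaded cost as salary proxy
REPLACEMENT_MULTIPLIER = (1.5, 2.0)   # replacement cost rule of thumb cited above
fewer_departures = 2                  # conservative end of the 2-3 range

low  = fewer_departures * REPLACEMENT_MULTIPLIER[0] * SALARY   # $555,000
high = fewer_departures * REPLACEMENT_MULTIPLIER[1] * SALARY   # $740,000

copilot_spend = 450 * 39 * 12         # $210,600/year for 450 seats
print(f"Retention savings: ${low:,.0f} to ${high:,.0f}")
print(f"Covers Copilot spend {low / copilot_spend:.1f}x to {high / copilot_spend:.1f}x over")
```

The caveat, of course, is that survey-stated intent ("a factor in their decision to stay") is a much weaker input than observed attrition deltas, so treat this line of the ROI model as the softest one.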

Risk and Compliance Costs:
Here is where it gets tricky. The Cortex data Rachel cited — 30% increase in change failure rates — translates directly into incident costs. At our scale, each production incident costs roughly $15K-$40K in engineer time, customer impact, and remediation. If AI adoption is driving even a modest increase in incidents, that cost erodes the productivity gains fast.

The “AI Sprawl” Tax:
I audited our AI tool spending last quarter and found we were paying for Copilot, Cursor, Tabnine, Claude Pro seats, ChatGPT Teams, and three different internal LLM experiments — all on separate budget lines across different cost centers. Total shadow AI spending was 40% above what was officially budgeted. That is not an ROI problem. That is a governance problem.

My recommendation to any VP Finance reading this: do not wait for engineering to build the measurement framework. Build it yourself from the finance side, because the board is going to ask, and “the developers feel more productive” is not an answer that survives a downturn.

Rachel, I want to share a story from the trenches because we actually tried to measure AI impact rigorously, and the results left us more confused than when we started.

Last year I ran a six-month controlled experiment across my org (roughly 120 engineers, eight teams). Four teams got full access to Copilot Enterprise and Cursor, four teams stayed with traditional tooling. We tracked every DORA metric, code quality signal, and developer experience data point I could get my hands on.

The Results Were Baffling

Individual task completion time dropped by about 25% for AI-assisted teams. Clear win, right? Except sprint velocity stayed completely flat. Same number of story points delivered per sprint, same features shipped, same deployment cadence. Six months of data, no statistically significant difference in team-level output.
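
For anyone wanting to replicate this, here is a sketch of the kind of significance check we ran on team-level velocity, as a simple permutation test in pure Python (the velocity numbers are made up for illustration, not our real data):

```python
import random
from statistics import mean

random.seed(42)  # reproducible shuffles

# Hypothetical story points per sprint: AI-assisted teams vs control teams.
ai_velocity      = [34, 31, 36, 33, 35, 32, 34, 33]
control_velocity = [33, 32, 35, 34, 33, 31, 35, 34]

def permutation_p_value(a, b, trials=10_000):
    """Two-sided p-value for the difference in means under label shuffling."""
    observed = abs(mean(a) - mean(b))
    pooled = a + b
    hits = 0
    for _ in range(trials):
        random.shuffle(pooled)
        diff = abs(mean(pooled[:len(a)]) - mean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    return hits / trials

p = permutation_p_value(ai_velocity, control_velocity)
print(f"p = {p:.2f}")  # with near-identical cohorts like these, p is large
```

With only eight teams the test is underpowered for small effects, which is exactly why a flat result needs months of data before you trust it; but it keeps you from reading noise as a win.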

When I dug in, I found what I now call the AI Productivity Paradox, and Harness just published a report confirming it at scale: productivity gains at the individual task level get absorbed by the system before they reach the team level.

Here is what was actually happening:

1. Scope Expansion: When developers finished tasks faster, product managers filled the freed capacity with additional scope. The backlog is infinite. Faster task completion just meant more tasks got pulled in, not that the sprint goal was reached any sooner.

2. Review Bottleneck Shift: This was the killer. AI-assisted developers were producing more PRs, but the review burden shifted to senior engineers who became the bottleneck. LinearB’s data showing AI PRs wait 4.6x longer for review? We saw exactly that. Our senior engineers were spending 40% more time in code review, which meant less time on their own high-impact work.

3. The Verification Tax: Your point about the 30% increase in change failure rates maps directly to our experience. Teams had to build new verification processes — additional test coverage, AI-specific linting rules, architecture review checkpoints — that consumed the time savings from faster initial coding.

4. Debugging Displacement: Developers wrote code faster but spent significantly more time debugging code they did not fully understand. The net time-on-task for a feature, from start to production-verified, barely moved.

The most frustrating part? My executive leadership saw “25% faster task completion” and declared AI adoption a success. They did not want to hear about the sprint velocity data or the downstream costs. The narrative of AI productivity had already been set, and the nuanced data was unwelcome.

My advice for other engineering directors: measure at the team and system level, not the task level. Individual task speed is the AI equivalent of measuring lines of code. It feels meaningful and is almost entirely misleading. Track end-to-end cycle time from idea to production. Track change failure rate. Track developer experience holistically. And be prepared for the possibility that the data tells a story nobody wants to hear.

Rachel, I agree with the overall thesis — organizations need to measure AI impact — but I want to push back on something and then share the IC perspective that I think is missing from this conversation.

Measuring AI impact at the individual developer level is not just unproductive. It is actively harmful.

I have worked at two companies that tried it. At the first, management started tracking per-developer Copilot acceptance rates and “AI-assisted PR velocity.” Within a month, the culture shifted. Developers started accepting more AI suggestions uncritically to keep their numbers up. Code quality tanked. Senior engineers who spent more time on architecture and mentoring — work that does not generate PRs — looked “less productive” by the AI metrics. One of our best architects got a mediocre performance review partly because his AI utilization was low. He left. We lost someone irreplaceable because we measured the wrong thing.

At my current company, we handle it differently, and I think it is the right approach:

What I Want Measured (Team/Org Level)

  • Team-level cycle time — from ticket creation to production deploy. Did AI adoption move this needle? If not, why not?
  • Production incident rate per deploy — are we shipping faster at the cost of stability? The Cortex data Rachel cited (23.5% more incidents per PR) suggests this is a real risk.
  • Developer experience surveys — but anonymous and aggregate. I want to tell my company honestly whether AI tools are helping or getting in my way, without worrying that my answer affects my performance review.
  • Time spent in code review — Luis’s point about the review bottleneck resonates. At my company, I have seen the same thing. I am spending nearly double the time reviewing PRs because half of them are AI-generated code that is syntactically correct but architecturally questionable.

What I Would Resist Having Measured

  • My personal AI tool usage rate — whether I use Copilot or vim macros is my business. Judge me on outcomes, not on which tools I choose to use.
  • My AI suggestion acceptance rate — this incentivizes accepting bad suggestions. A low acceptance rate might mean I have high standards, not low productivity.
  • Any metric that compares my AI-assisted output to a colleague’s — this creates an arms race where developers optimize for the metric rather than for code quality.

Here is the uncomfortable truth: the best use of AI for a senior developer is often invisible to metrics. I use Claude to think through architectural decisions, to rubber-duck complex debugging sessions, to draft RFC documents. None of that shows up as “AI-assisted code generation.” If you measure AI impact by code output, you are measuring the least valuable application of the technology.

Carlos raises a good point about the board wanting ROI numbers. My honest response as an IC: if the measurement framework makes developers feel surveilled, you will get two outcomes — gaming the metrics and attrition of your best people. Both are more expensive than the AI tools themselves.

Measure outcomes. Measure team health. Do not measure my keystrokes.