
Org-Level Goodhart: When Teams Game AI Adoption Metrics

· 9 min read
Tian Pan
Software Engineer

According to one study, 95% of generative AI pilots technically succeed—and 74% of companies using generative AI still haven't shown tangible business value. That gap is not a coincidence. It's a measurement problem dressed up as a technology problem, and most organizations won't diagnose it correctly because the people doing the measuring are the same ones being measured.

This is Goodhart's Law at the organizational level: once an AI adoption metric becomes a performance target, it ceases to measure what you care about. The metric keeps going up. The underlying outcome stays flat or gets worse.

The Three Ways Teams Game AI Adoption KPIs

Gaming happens at every level of the org chart, and it rarely looks like fraud. It looks like people doing their jobs under the incentive structure they were given.

Product managers optimize for conversation count. The easiest way to hit an "AI interactions per month" target is to put a chat button on every page—whether or not a conversational interface makes sense for those users. Teams learn to report the number of sessions rather than the number of sessions that achieved something. An AI assistant that gets opened and immediately dismissed counts the same as one that saves someone thirty minutes of work.

Engineering teams wrap features in feature flags and call them AI-powered. When "shipped AI features this quarter" is a KPI, the fastest path is to take an existing workflow, add a single LLM call somewhere in it, and count it. Leadership reports a feature count that sounds transformative. The actual improvement to users ranges from marginal to zero.

Leadership reports hours logged in AI tools as productivity. This is the most persistent form of gaming because it's usually unintentional. If your developers have Copilot open for eight hours a day, you report high AI utilization. You don't report that a 2025 randomized controlled trial found experienced open-source developers took 19% longer to complete tasks when using AI tools, because that study doesn't show up in your utilization dashboard.

The common thread is that all three metrics measure inputs or activity, not outputs or outcomes. Low marginal cost makes this worse: AI makes it cheap to generate more conversations, more features, more code, more summaries—so organizations drown in volume while starving for value.

The Signals That Tell You It's Happening

By the time a metric is clearly being gamed, the damage is usually done. But there are early signals that show up before leadership catches on.

PR size inflates without a corresponding quality signal. One analysis found that AI adoption correlates with a 154% increase in average pull request size, while code churn doubles and deployment stability drops around 7.2%. When the metric is "PRs opened," engineers open more PRs. The code output goes up. Whether any of it should have been written at all is a different question.

Adoption dashboards go up while support tickets stay flat or grow. Real productivity improvement shows up in downstream metrics. If AI chat volume is climbing but customer support costs aren't falling and user task completion rates aren't improving, the chat sessions aren't doing what the dashboard implies.

The 96/77 split. A useful canary: survey executives and individual contributors separately about whether AI has increased or decreased workload. Research found that 96% of executives expected productivity gains from AI, while 77% of the employees who actually use the tools said AI had increased their workload. If your org shows that same split—leadership optimistic, contributors exhausted—the metrics being reported upward are probably not describing the same reality as what's happening on the ground.
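
A hedged sketch of how to run that canary internally, assuming a simple survey export with hypothetical `role` and `workload` fields; none of the names come from a specific survey tool:

```python
# Compute the leadership-vs-IC split from an internal survey.
# The record format below is illustrative, not prescriptive.
responses = [
    {"role": "exec", "workload": "decreased"},
    {"role": "ic", "workload": "increased"},
    # ... your own survey rows ...
]

def share_reporting_more_work(responses, role):
    """Fraction of respondents in a role who say AI increased their workload."""
    rows = [r for r in responses if r["role"] == role]
    if not rows:
        return float("nan")
    return sum(r["workload"] == "increased" for r in rows) / len(rows)

exec_burden = share_reporting_more_work(responses, "exec")
ic_burden = share_reporting_more_work(responses, "ic")
print(f"Execs reporting more work: {exec_burden:.0%}")
print(f"ICs reporting more work:   {ic_burden:.0%}")
# A wide gap (ICs high, execs low) is the split described above.
```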

ROI claims that can't survive a methodology audit. IBM data shows that 79% of organizations claim productivity gains from AI, but only 29% can measure ROI confidently. If your productivity gain number exists but nobody can explain how it was calculated—what the baseline was, what the control group was, what was held constant—it's probably fiction.

Why Standard Software Metrics Fail Here

Traditional software adoption metrics had a structural advantage: the outputs were determinate. A ticket was resolved or it wasn't. A deployment went out or it didn't. Code passed tests or failed.

AI outputs are neither determinate nor uniformly observable. An AI agent can take five different paths to a correct answer, all of which look different to a reviewer but are equally valid. A language model can generate fluent, confident text that contains factually wrong information that's hard to catch without domain expertise. A code suggestion can look correct and pass tests while introducing subtle logic errors that surface months later.

This makes the standard measurement playbook fail in specific ways:

  • Accuracy metrics miss path diversity. "Did the AI get the right answer?" doesn't capture whether it got there efficiently, whether it tried unnecessarily risky alternatives, or whether a human would have caught something it missed.
  • Adoption metrics don't penalize low-quality output. When the tool is free at the margin, users generate more output because generating is easier than evaluating. Volume metrics reward this; outcome metrics don't.
  • DORA metrics survive better than most, but still don't close the loop. Deployment frequency and change failure rate measure system-level outcomes rather than code quality proxies, so they're harder to game. But they still won't tell you whether the software you're shipping is actually solving user problems.

The deeper issue is that most organizations haven't defined what a good AI-assisted outcome looks like at the task level. Without that definition, any measurement will drift toward measuring something else.

Metric Designs That Are Harder to Game

The key property you want from a metric is that gaming it requires actually doing the thing you care about. These three designs come closer than most.

Task completion rate on user-selected tasks. This measures what percentage of tasks—specifically tasks the user chose to start—an AI system finishes without human intervention. Enterprise targets for structured tasks run 85–95%. The hard part is defining "complete" in a way that doesn't let the AI wriggle out of difficult cases. If the system can declare a task complete by punting to a human, you haven't solved anything. Define completion as "the user didn't need to re-do this themselves."
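
A minimal sketch of that definition in code, assuming a hypothetical task log with `user_initiated`, `ai_declared_done`, and `human_redo_required` flags; your own telemetry will name these differently:

```python
from dataclasses import dataclass

@dataclass
class TaskRecord:
    user_initiated: bool       # the user chose to start this task
    ai_declared_done: bool     # the system marked it complete
    human_redo_required: bool  # a person had to re-do the work afterward

def completion_rate(tasks: list[TaskRecord]) -> float:
    """Share of user-selected tasks the AI finished without a human re-do.

    Punting to a human or being re-done later counts as a miss, so the
    metric can't be inflated by declaring hard cases 'complete'.
    """
    eligible = [t for t in tasks if t.user_initiated]
    if not eligible:
        return float("nan")
    completed = [t for t in eligible
                 if t.ai_declared_done and not t.human_redo_required]
    return len(completed) / len(eligible)
```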

Error-correction cost. Track the downstream human time spent fixing AI-generated work. A tool that speeds up initial generation but doubles review burden isn't saving time—it's moving cost. This metric surfaces the quality degradation that throughput metrics hide, and it's hard to manipulate because you'd have to make the AI accurate to reduce the cost.
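
A rough sketch of how this could be tracked, assuming each work item records hypothetical `generation_minutes` and `correction_minutes` fields; the point is to compare total cost, not just generation speed:

```python
from dataclasses import dataclass

@dataclass
class WorkItem:
    ai_assisted: bool
    generation_minutes: float   # time to produce the first version
    correction_minutes: float   # downstream human time spent fixing it

def error_correction_cost(items: list[WorkItem]) -> dict:
    """Average fix-up and total time for AI-assisted vs. unassisted work.

    If AI items generate faster but cost more to correct, the tool is
    moving cost downstream rather than removing it.
    """
    def avg(rows, field):
        return (sum(getattr(r, field) for r in rows) / len(rows)
                if rows else float("nan"))

    ai = [i for i in items if i.ai_assisted]
    manual = [i for i in items if not i.ai_assisted]
    return {
        "ai_correction_avg": avg(ai, "correction_minutes"),
        "manual_correction_avg": avg(manual, "correction_minutes"),
        "ai_total_avg": avg(ai, "generation_minutes") + avg(ai, "correction_minutes"),
        "manual_total_avg": avg(manual, "generation_minutes") + avg(manual, "correction_minutes"),
    }
```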

User-reported effort reduction on specific workflows. This is softer than the above, but more honest than general satisfaction scores. Ask users whether a specific task—one you've defined in advance—took meaningfully less effort with the tool than without it. The specificity matters: "did this save you time on X?" produces an answer you can measure and compare over time; "do you like the AI tool?" is a sentiment survey.

A common structure: use a control group. A small, randomly assigned group that doesn't have access to the tool for 60–90 days gives you a real baseline. This is harder to game because you'd need to suppress the control group's results—which is its own kind of signal.
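
Here is a minimal sketch of the holdout assignment under those assumptions; the user-ID list, holdout fraction, and seed are all illustrative:

```python
import random

def assign_holdout(user_ids: list[str],
                   holdout_fraction: float = 0.1,
                   seed: int = 42) -> dict[str, str]:
    """Randomly assign a small, stable control group that keeps the old workflow.

    Assignment happens once, up front, from a fixed seed and is recorded,
    so the control group can't be quietly reshuffled after the fact.
    """
    rng = random.Random(seed)
    ids = sorted(user_ids)
    rng.shuffle(ids)
    cutoff = int(len(ids) * holdout_fraction)
    return {uid: ("control" if i < cutoff else "treatment")
            for i, uid in enumerate(ids)}

# After 60-90 days, compare the same outcome metric (e.g. task completion
# rate or error-correction cost) between the two groups, rather than
# against a remembered pre-AI baseline.
```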

What this framework leaves out deliberately: conversation counts, features shipped, hours logged in tools, benchmark scores on vendor evaluations, and lines of code generated. These can all look great while the org is moving sideways.

The Failure Mode Nobody Fixes

There's a specific organizational failure mode that combines all of the above, and it's the reason these problems persist even after engineering leaders recognize them.

The teams that report the metrics are the teams whose performance is evaluated on the metrics. The AI program lead's job security is tied to the AI adoption numbers looking good. The engineering manager's performance review notes their team's AI feature velocity. The product manager's bonus includes AI engagement. Nobody in that structure has the incentive—or often the authority—to report that the numbers are mostly measurement theater.

This creates a compounding problem: the longer the metrics are gamed, the more dependent the org becomes on the inflated numbers. An AI program that's reported 40% productivity gains for eighteen months can't suddenly acknowledge that the gains were 5%. The accumulated fiction becomes a liability that's easier to maintain than to unwind.

The organizations that escape this pattern have one thing in common: someone who doesn't depend on the metrics for their own performance evaluation has authority to audit them. External measurement teams, user research operations that report outside the product chain, or leadership that treats flat or negative AI productivity data as useful signal rather than a problem to explain away. Measurement requires independence from the outcome of the measurement.

The Honest Baseline

Most organizations are at 5–15% real productivity improvement from current AI tools, concentrated in specific, well-defined task categories: code autocompletion on routine patterns, summarization of large documents, first-draft generation for known formats. Outside those task types, the gains are near zero or negative.

That's a useful tool. It's just not the transformative productivity revolution that adoption dashboards describe. The gap between the real number and the reported number is Goodhart's Law at scale—not fraud, not incompetence, but a predictable outcome of measuring the wrong things under the wrong incentives.

The fix isn't better AI. It's defining what task completion means before you deploy the tool, measuring downstream outcomes rather than upstream activity, and separating the people who need the numbers to look good from the people who need the numbers to be accurate.

