
The AI Product Metrics Trap: When Engagement Looks Like Value but Isn't

11 min read
Tian Pan
Software Engineer

A METR study published in 2025 asked 16 experienced open-source developers to predict how much faster AI tools would make them. They guessed 24% faster. The study then measured what actually happened across 246 real tasks — bug fixes, features, refactors — randomly assigned to AI-allowed and AI-disallowed conditions. The result: developers with AI access were 19% slower. After the study concluded, participants were surveyed again. They still believed AI had made them 20% faster.

That gap — between perceived productivity and measured productivity — is not a quirk of one study. It is the central problem with how most teams currently measure AI features. The signals that feel like success are, in many cases, measuring the novelty of the tool rather than its usefulness. And the first 30 days are the worst time to look.

The First 30 Days Are Not What They Appear

When a new AI feature ships, usage metrics almost always look good. Session counts climb. Completion rates are high. User satisfaction scores peak. Product managers screenshot the dashboard and share it in Slack.

What they're seeing is the novelty effect. Research shows novelty effects cause scores to inflate by 30% of a standard deviation on average, with decay happening within eight weeks. For consumer AI features, the pattern is even more dramatic. Statsig's analysis of product launches found that "if your metrics are still strong after a month, you're probably onto something real" — implying the default expectation is that most metrics are artificially elevated before that point.

The consumer data confirms this. Consider a major AI video generation app that reached 1 million downloads faster than ChatGPT. Day 1 retention: 10%. Day 7: 2%. Day 30: 1%. Downloads were completely uncorrelated with durable value. The app burned an estimated $15M/day in inference costs during peak while generating a fraction of that in lifetime revenue.

This isn't an outlier. The a16z State of Consumer AI 2025 report found that none of ChatGPT's major new feature launches — Pulse, Group Chats, Record, Shopping Research, Tasks, Study Mode — broke through in retention, despite impressive initial adoption numbers. Only 3% of the estimated 1.8 billion free AI platform users ever convert to paid subscriptions.

The practical implication is that the typical measurement window for AI feature success (launch plus two weeks) is exactly the window most contaminated by novelty inflation. The launch looks like product-market fit. What you're actually measuring is curiosity.

The Metrics That Feel Good but Predict Nothing

Nearly every team tracks the same three metrics for AI features, and all three have demonstrated limited or even inverse predictive value for actual impact.

Session count and completion rate measure whether users showed up and got to the end of the flow. They don't measure whether the outcome was useful. A user who runs an AI-generated summary, reads it, realizes it's wrong, and manually re-does the task has a 100% completion rate in your analytics. So does a user who found the summary genuinely helpful. The two events look identical.
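The indistinguishability is easy to see in the event stream itself. A minimal sketch, using hypothetical event names, of two sessions that look identical to completion-rate analytics:

```python
# Two sessions with the same events: the analytics cannot tell them apart.
helped_user = [
    {"event": "summary_requested"},
    {"event": "summary_rendered"},
    {"event": "flow_completed"},   # read the summary, found it accurate, moved on
]

burned_user = [
    {"event": "summary_requested"},
    {"event": "summary_rendered"},
    {"event": "flow_completed"},   # read the summary, found it wrong...
    # ...then redid the task by hand outside the flow, emitting no events at all
]

def completed(events):
    """A session 'completes' if it ever reached the end of the flow."""
    return any(e["event"] == "flow_completed" for e in events)

# Both sessions count toward a 100% completion rate.
print(completed(helped_user), completed(burned_user))  # True True
```

The manual rework that signals failure happens outside the instrumented flow, so it can never show up in this metric.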

Acceptance rate is the canonical example in developer tooling. GitHub Copilot's initial research reported acceptance rates and framed them as productivity evidence. But subsequent independent analysis found that developers with Copilot access had significantly higher bug rates while issue throughput remained unchanged. The code was accepted. The code was worse. Acceptance rate measured a UI interaction, not an outcome.

Deflection rate is the customer service equivalent of acceptance rate. A chatbot that frustrates customers until they abandon the conversation has a 100% deflection rate. So does one that provides fabricated answers with enough confidence that users leave believing they were helped. Qualtrics' 2025 survey of 20,000+ consumers found that AI customer service fails at four times the rate of non-AI channels — yet deflection rate as a category continues to be the primary metric used to justify AI customer service investments.

The structural problem with all three metrics is that they measure the AI system's behavior, not its effect on the thing the system was supposed to help with. They are proxies that accidentally measure the wrong thing.

What the Perception-Reality Gap Looks Like in Practice

The METR developer study is worth dwelling on because of what it reveals about self-reporting as a fallback measurement strategy. When objective measurement isn't in place, teams substitute user surveys and satisfaction scores. Developers in the METR study didn't need to be deceived — they genuinely experienced AI tools as productivity-boosting. The sensation of productivity was real. The productivity was not.

Several mechanisms drive this inversion:

  • Visible activity bias: Watching code generate rapidly creates a subjective sense of progress.
  • Cognitive load reduction: Less typing feels like less work, which registers as more efficient.
  • Attribution asymmetry: Successes that involve AI are attributed to AI. Failures are attributed to unclear requirements, complex problems, or the underlying code.
  • Sunk cost rationalization: After spending time learning to work with an AI tool, people are motivated to believe the investment paid off.

McKinsey's research adds another dimension: on tasks that developers themselves rated as high complexity, time savings shrank to less than 10%. Junior developers on some tasks took longer with AI assistance. The productivity gains, to the extent they exist, are concentrated in low-complexity, well-defined work — exactly the work that was easier to measure in early controlled experiments.

The 88%/6% split from McKinsey's broader survey captures the aggregate effect: 88% of organizations actively use AI tools, but only 6% report measurable financial results. Engagement is nearly universal. Value is rare.

The Metrics That Actually Predict Value

Replacing vanity metrics requires instrumenting outcomes, not events. The specific instruments depend on the feature domain, but there are consistent patterns across teams that have gotten this right.

For developer tools:

  • Committed code ratio — what fraction of AI suggestions survive code review unmodified vs. heavily edited? Acceptance rate answers when the user clicked accept. Committed code ratio answers whether the AI output was actually correct.
  • Downstream bug rate — do tasks completed with AI assistance ship bugs at the same rate as those done without? This requires correlating deployment data with AI usage data, which is tedious but directly measures quality.
  • Cycle time per task type — not overall velocity (which is noisy) but time on specific recurring task categories. If you can tag "fixing a test I wrote vs. fixing a test AI wrote," you'll find the signal.
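Committed code ratio can be approximated without special tooling by comparing each accepted suggestion to the version that eventually merged. A sketch, assuming you can join accepted suggestions to merged code via commit metadata; the 0.95 similarity threshold is illustrative, not a standard:

```python
import difflib

def committed_code_ratio(suggestions, threshold=0.95):
    """Fraction of accepted AI suggestions that survive review near-unmodified.

    `suggestions` is a list of (ai_output, committed_version) string pairs
    for accepted suggestions. Similarity >= threshold counts as "survived".
    """
    if not suggestions:
        return 0.0
    survived = sum(
        1 for ai, final in suggestions
        if difflib.SequenceMatcher(None, ai, final).ratio() >= threshold
    )
    return survived / len(suggestions)

pairs = [
    # Shipped exactly as suggested.
    ("def add(a, b):\n    return a + b\n",
     "def add(a, b):\n    return a + b\n"),
    # Heavily rewritten in review: error handling the AI omitted.
    ("def parse(s):\n    return int(s)\n",
     "def parse(s):\n    try:\n        return int(s)\n"
     "    except ValueError:\n        return None\n"),
]
print(committed_code_ratio(pairs))  # 0.5
```

Both pairs would count as "accepted" in a raw acceptance-rate metric; only the first one counts here.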

For AI customer service:

  • Verified resolution rate — confirmation through follow-up contact behavior (did the customer return with the same issue?) or explicit verification (post-interaction check). Not deflection.
  • CSAT delta — AI-handled ticket satisfaction minus agent-handled ticket satisfaction for the same issue categories. This controls for issue complexity.
  • Repeat contact rate — same customer, same issue, same week. Resolved deflections don't come back. Deflections that gave up do.
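Repeat contact rate is straightforward to compute once contacts carry a customer ID and an issue category. A minimal sketch with a hypothetical contact schema:

```python
from datetime import datetime, timedelta

def repeat_contact_rate(contacts, window_days=7):
    """Share of AI-'deflected' contacts where the same customer returned
    with the same issue category inside the window, i.e. the deflection
    resolved nothing. Hypothetical schema per contact:
    (customer_id, issue_category, timestamp, deflected_by_ai).
    """
    deflected = [c for c in contacts if c[3]]
    if not deflected:
        return 0.0
    repeats = sum(
        1 for cust, issue, ts, _ in deflected
        if any(c[0] == cust and c[1] == issue
               and ts < c[2] <= ts + timedelta(days=window_days)
               for c in contacts)
    )
    return repeats / len(deflected)

d = datetime(2025, 3, 1)
contacts = [
    ("alice", "billing", d, True),                       # deflected...
    ("alice", "billing", d + timedelta(days=2), False),  # ...came right back
    ("bob",   "login",   d, True),                       # deflected, stayed gone
]
print(repeat_contact_rate(contacts))  # 0.5
```

Both of the deflected contacts above would count as successes under a plain deflection metric; only Bob's actually was one.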

For any AI feature:

  • Human escalation and override rate — how often do users substantially rewrite, undo, or manually redo the AI's work? A target of under 25% override rate at 90 days (after novelty fades) is a useful benchmark.
  • Feature toggle-off rate — the strongest signal. If more than 20% of users who were exposed to an AI feature actively turn it off, that is a direct vote against value.
  • Return usage after 30 days — not raw retention, but specifically whether users who experienced the feature in week one came back to use it in week five. This filters out casual explorers.
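Override rate reduces to counting corrective follow-up actions, assuming your telemetry records what the user did with each AI output. A sketch with hypothetical action names:

```python
def override_rate(followup_actions):
    """Fraction of AI outputs the user substantially rewrote, undid, or
    redid by hand. `followup_actions` is one recorded action per AI output;
    the action names here are placeholders for whatever your telemetry emits.
    """
    overrides = {"rewrote", "undid", "redid_manually"}
    if not followup_actions:
        return 0.0
    return sum(1 for a in followup_actions if a in overrides) / len(followup_actions)

day90_actions = ["accepted", "rewrote", "accepted", "accepted", "undid"]
rate = override_rate(day90_actions)
print(f"{rate:.0%}")  # 40% -- above the 25% benchmark at 90 days
```

The key design choice is measuring this after the novelty window, when the remaining users are the ones deciding whether the output is worth keeping.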

The a16z retention research suggests rebasing retention calculations: instead of measuring M12/M0 (which includes all the novelty-effect adopters who churned in months 1-2), measure M12/M3. By month 3, the users who remain are the ones who found actual value. That cohort's long-term retention is a much better signal of whether the feature works.
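The rebasing is a one-line change to the retention calculation, but it can flip the verdict. A sketch with made-up cohort numbers:

```python
def rebased_retention(monthly_active, base_month=3, target_month=12):
    """Retention rebased past the novelty window: M12/M3 instead of M12/M0,
    so users who churned during months 1-2 don't dilute the signal.
    `monthly_active[m]` = launch-cohort users still active in month m.
    """
    return monthly_active[target_month] / monthly_active[base_month]

# Hypothetical cohort: big novelty spike, heavy early churn, stable core.
active = {0: 10000, 1: 4000, 2: 2200, 3: 2000, 12: 1600}

print(active[12] / active[0])     # 0.16 -- looks like a failed feature
print(rebased_retention(active))  # 0.8  -- the post-novelty core retains well
```

Same cohort, same data: the M12/M0 number says abandon the feature, the M12/M3 number says the users who got past the novelty phase are sticking.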

Leading vs. Lagging Indicators and Why the Distinction Matters

One reason teams get trapped in vanity metrics is that the metrics that actually matter take time to appear. Worklytics' analysis of Copilot deployments found that the tool's effect shows up in leading indicators first — commit frequency, code review turnaround, developer sentiment — before it appears in lagging indicators like deployment frequency, product release velocity, and incident rates.

This creates a real organizational problem. Teams need early signals to make product decisions. The early signals available are mostly the ones that turn out to be misleading. The signals that would actually tell you whether the feature is delivering value won't be visible for 4-8 weeks.

The mitigation is to instrument for leading indicators that correlate with downstream outcomes, not leading indicators that feel good but correlate with nothing. Verification cost — the time and effort users spend checking AI output — is a practical leading indicator for trust and therefore retention. Teams that instrument verification behavior (time between AI output and next user action, edit distance on AI suggestions, re-query rate after AI answers) have an early signal of whether users trust the output enough to act on it.
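These verification behaviors are cheap to compute from ordinary interaction logs. A sketch, with a hypothetical per-interaction schema, of two of the three signals:

```python
import statistics

def verification_signals(interactions):
    """Two leading indicators of trust in AI output:
    - median seconds between the AI's answer and the user's next action
      (long pauses suggest users are fact-checking before acting)
    - re-query rate: how often the next action is asking again
    `interactions` is a list of (seconds_to_next_action, next_action) pairs;
    the schema and action names are placeholders.
    """
    delays = [s for s, _ in interactions]
    requeries = sum(1 for _, a in interactions if a == "requery")
    return {
        "median_verify_seconds": statistics.median(delays),
        "requery_rate": requeries / len(interactions),
    }

logged = [(4, "acted"), (180, "requery"), (240, "acted"), (6, "acted")]
print(verification_signals(logged))
# {'median_verify_seconds': 93.0, 'requery_rate': 0.25}
```

Edit distance on AI suggestions, the third signal, can be computed the same way the committed code ratio is, by diffing the AI output against what the user actually kept.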

A high verification cost means users are treating the AI as a rough draft generator rather than a reliable assistant. That might be fine for some workflows, but it needs to be explicit. A team that measures completion rate and sees high completion rate doesn't know whether users are trusting the output or spending 10 minutes fact-checking it after every response.

The Organizational Pattern Behind the Measurement Failure

MIT's study of 150 enterprise AI deployments found that organizations struggle to measure AI's impact on productivity and profit — and that the measurement gap predicts failure. Where tracking existed, value realization rose and risk incidents fell. Where teams tracked adoption rather than outcomes, initiatives were abandoned at higher rates.

The pattern is consistent: teams default to measuring AI usage rather than business outcomes because usage data is immediately available and outcomes take weeks to materialize. This is compounded by organizational incentives — the team that shipped the AI feature has already moved on to the next sprint by the time the lagging outcomes become measurable.

The fix requires treating AI feature instrumentation as a first-class deliverable, not an afterthought. Define outcome metrics before writing the prompt. Establish a 90-day measurement window as the minimum for a feature assessment. Separate the novelty cohort (first 30 days) from the evaluation cohort (days 31-90) in every analysis. And build in a mechanism for measuring the thing the feature was supposed to change — whether that's task time, error rate, re-work rate, or contact volume — not just the feature's own usage.

The teams that will extract real value from AI features are not the ones with the highest usage metrics. They're the ones who can tell the difference between a graph that shows people trying something new and a graph that shows people finding something indispensable.

Building a Measurement Stack That Doesn't Lie

The minimum viable instrumentation for any AI feature:

  1. A control condition or counterfactual — either A/B test with AI off, or shadow-track what users would have done without AI access. Without this, you cannot separate AI impact from external factors.
  2. A downstream outcome metric — whatever the AI was supposed to improve. If you cannot define this before launch, you have a specification problem, not a measurement problem.
  3. A 90-day retention measurement — specifically for users who meaningfully used the feature (not users who saw it once), at day 30 and day 90.
  4. An override/escalation rate — how often users substantially correct or abandon AI output.
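A day-90 readout over these four instruments can be as simple as a handful of threshold checks. A sketch; the outcome and return-rate thresholds below are illustrative assumptions, and only the 25% override benchmark comes from earlier in this piece:

```python
def feature_assessment(metrics):
    """Minimal day-90 go/no-go over the four instruments. Thresholds for
    outcome lift and day-90 return rate are illustrative placeholders.
    """
    checks = {
        # Instruments 1 + 2: downstream outcome improved vs. control.
        "beats_control": metrics["outcome_vs_control"] > 0,
        # Instrument 3: meaningful users still returning at day 90.
        "retains_users": metrics["day90_return_rate"] >= 0.20,
        # Instrument 4: under the 25% override benchmark.
        "low_override": metrics["override_rate"] <= 0.25,
    }
    return all(checks.values()), checks

verdict, detail = feature_assessment({
    "outcome_vs_control": 0.08,  # 8% improvement in the downstream outcome
    "day90_return_rate": 0.34,
    "override_rate": 0.18,
})
print(verdict, detail)
```

The point is not these particular thresholds; it is that the verdict is computed from outcomes and post-novelty behavior, not from session counts.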

These four instruments will give you an honest answer faster than any amount of session analytics. Most teams find they're a significant investment to set up. That investment is the correct price of knowing whether your AI feature is creating value or just generating activity.

The goal is not to be pessimistic about AI features. It's to be accurate. Accurate measurement is what lets you iterate on the features that are working and stop investing in the ones that aren't. The teams running on session counts and completion rates are flying blind — and the first 30 days of data are making it worse.
