AI Product Metrics Nobody Uses: Beyond Accuracy to User Value Signals

· 9 min read
Tian Pan
Software Engineer

A contact center AI system achieved 90%+ accuracy on its validation benchmark. Supervisors still instructed agents to type notes manually. The product was killed 18 months later for "low adoption." This pattern plays out repeatedly across enterprise AI deployments — technically excellent systems that nobody uses, measured by metrics that couldn't see the failure coming.

The problem is a systematic mismatch between what teams measure and what predicts product success. Engineering organizations inherit their measurement instincts from classical ML: accuracy, precision/recall, BLEU scores, latency percentiles, eval pass rates. These describe model behavior in isolation. They tell you almost nothing about whether your AI is actually useful.

The Measurement Gap Nobody Talks About

When an AI product fails, it rarely fails on the metrics the team was tracking. It fails on metrics nobody was watching.

A technically correct output can be completely useless in context. A fast response that requires five minutes of cognitive verification can increase total task time compared to a slightly slower response that's immediately actionable. An accuracy score measured on a held-out test set tells you how the model performs on examples similar to your labeled data — not on the messy, ambiguous, context-dependent queries users actually send.

Goodhart's Law is waiting around every corner. Once a team starts optimizing a benchmark as a proxy for quality, the benchmark stops measuring the thing it was supposed to measure. Model developers game leaderboards by over-fitting to benchmark formats. Internal teams game eval scores by cherry-picking test sets that match their best-performing cases. The metric moves up; the product gets worse.

The core failure is treating system performance metrics (accuracy, precision, latency) as proxies for product performance (task success, user value, retention). They correlate weakly at best, and the correlation breaks down under pressure.

Task Completion Rate: The Metric That Actually Predicts Retention

Task completion rate (TCR) — the percentage of initiated tasks that reach a successful end state — is the closest thing to a north star metric for AI products.

Some teams call this Intent Resolution Rate (IRR): the percentage of conversations where the user's stated or inferred intent was successfully resolved. The threshold that distinguishes healthy products from struggling ones sits around 70%. Products above this threshold show significantly better 30-day retention than those below 55%, even when DAU numbers look identical. Usage metrics can be flat or even growing while the product quietly erodes.

The implementation detail matters: a "completed task" must be defined at the product level, not at the model level. Model-level completion means "the model returned a response without error." Product-level completion means "the user got what they came for and didn't need to ask again." These are often wildly different numbers.

For a coding assistant, TCR means the user accepted code that actually worked, not just code that was generated. For a search product, it means the user found what they were looking for and didn't reformulate the query. For an AI writing tool, it means the document was saved or published, not just drafted.

Most teams measure the wrong end of this: they track suggestion impressions and token counts while ignoring whether the suggestions were used for anything.
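The model-level vs. product-level distinction above can be made concrete with a small sketch. The event fields here (`generated`, `used_successfully`) are illustrative assumptions, not a schema from any particular product:

```python
from dataclasses import dataclass

# Hypothetical task records; the field names are illustrative assumptions.
@dataclass
class TaskEvent:
    task_id: str
    generated: bool          # model-level: a response came back without error
    used_successfully: bool  # product-level: user got what they came for

def model_level_tcr(events):
    """Model-level "completion": a response was produced."""
    return sum(e.generated for e in events) / len(events)

def product_level_tcr(events):
    """Product-level completion: the task actually resolved."""
    return sum(e.used_successfully for e in events) / len(events)

events = [
    TaskEvent("t1", True, True),
    TaskEvent("t2", True, False),   # generated, but the user had to re-ask
    TaskEvent("t3", True, True),
    TaskEvent("t4", False, False),  # generation error
]
print(model_level_tcr(events))    # 0.75
print(product_level_tcr(events))  # 0.5
```

Same event stream, two very different completion rates — which is exactly the gap the product-level definition is meant to expose.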

Edit Rate: A High-Signal Trust Indicator

When users receive an AI output and make changes before using it, those edits are data. The pattern of edits reveals more about your product's real-world quality than any offline benchmark.

Edit rate (the percentage of outputs users modify before accepting) combined with edit distance (how much they change) gives you a trust signal that accuracy scores cannot provide. Low edit rate plus high task completion means the AI is producing output that users trust and can use directly. High edit rate plus task completion means users are spending cognitive overhead compensating for AI output that's directionally useful but not quite right. Low edit rate plus low task completion is the danger zone — users aren't editing because they're abandoning.
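One way to operationalize these combinations is a simple quadrant classifier. The thresholds below are illustrative assumptions (the article only gives the ~70% TCR line), and the fourth quadrant's label is my own:

```python
def trust_quadrant(edit_rate, tcr, high_edit=0.5, healthy_tcr=0.70):
    """Map edit rate x task completion onto the combinations described above.
    Thresholds are assumptions; the fourth-quadrant label is hypothetical."""
    if tcr >= healthy_tcr:
        # Completion is healthy: the question is how much editing it cost.
        return "trusted_direct_use" if edit_rate < high_edit else "useful_but_costly"
    # Completion is low: low editing here means abandonment, not trust.
    return "danger_zone" if edit_rate < high_edit else "struggling_but_engaged"
```

For example, `trust_quadrant(0.2, 0.4)` lands in the danger zone: users aren't editing because they've stopped trying.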

GitHub Copilot's enterprise research found that developers retained 88% of accepted characters in their editor — a stronger signal of trust than raw acceptance rate. Glean targets 70%+ response acceptance without modification as a benchmark for healthy trust calibration.

The important distinction is edit rate versus regeneration rate. Editing means users think the output is salvageable. Requesting a new response entirely means they've given up on the current one. Both carry information. Regeneration frequency climbing over time is an early warning signal that output quality is degrading relative to user expectations — often before any accuracy metric moves.
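A minimal sketch of separating these signals, using stdlib `difflib` for edit distance. The outcome labels and function shapes are assumptions for illustration:

```python
import difflib
from typing import Optional

def classify_output(shown: str, final: Optional[str], regenerated: bool) -> str:
    """Bucket one AI output by what the user did with it.

    regenerated: user requested a new response instead of using this one.
    final: the text the user ultimately kept (None if abandoned).
    """
    if regenerated:
        return "regenerated"   # gave up on this output entirely
    if final is None:
        return "abandoned"
    if final == shown:
        return "accepted_as_is"
    return "edited"            # salvageable, but not quite right

def edit_distance_ratio(shown: str, final: str) -> float:
    """Fraction of the output the user changed (1 - similarity)."""
    return 1.0 - difflib.SequenceMatcher(None, shown, final).ratio()
```

Tracking the `regenerated` bucket as its own time series is what surfaces the early-warning trend: regeneration frequency can climb while edit rate holds steady.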

Session Depth: An Engagement Proxy That Misleads Without Segmentation

Session depth (average turns per conversation) is the metric that produces the most misleading interpretation when reported as a single number.

A 12-turn conversation could mean a user worked through something genuinely complex with the AI and found value at each step. It could also mean the user asked the same question five different ways, got confused responses each time, and abandoned without resolution. These look identical in aggregate.

The correct use of session depth is segmented by outcome. Split sessions into resolved and abandoned, then track depth separately. For resolved conversations in coding assistants, the productive range is 4–7 turns — enough to establish context and iterate, not so many that the user is stuck in a loop. For abandoned conversations, depth above 10 turns with no resolution rarely leads to re-engagement within 7 days.

The ratio of deep-resolved sessions to deep-abandoned sessions is a diagnostic metric for interaction quality. A growing share of deep-abandoned sessions signals a specific failure mode: users are trying hard and failing repeatedly, which is worse for trust than a quick failure. Session depth as a standalone engagement metric is almost meaningless. Session depth segmented by outcome is a strong health indicator.
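The segmentation described above is a few lines of aggregation. The input shape `(turns, resolved)` and the >10-turn cutoff for "deep" are taken from the text; everything else is an illustrative sketch:

```python
from collections import defaultdict

def depth_by_outcome(sessions):
    """sessions: list of (turns, resolved: bool) tuples.

    Returns average depth per outcome, plus the deep-session counts
    whose ratio serves as the interaction-quality diagnostic."""
    buckets = defaultdict(list)
    for turns, resolved in sessions:
        buckets["resolved" if resolved else "abandoned"].append(turns)
    avg_depth = {k: sum(v) / len(v) for k, v in buckets.items()}
    deep_resolved = sum(1 for t, r in sessions if r and t > 10)
    deep_abandoned = sum(1 for t, r in sessions if not r and t > 10)
    return avg_depth, deep_resolved, deep_abandoned

sessions = [(5, True), (12, False), (11, True), (15, False)]
avg, dr, da = depth_by_outcome(sessions)
print(avg)      # {'resolved': 8.0, 'abandoned': 13.5}
print(dr, da)   # 1 2
```

The same 10.75-turn overall average would look identical for a product doing deep productive work and one trapping users in loops; the split is what separates them.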

Return Rate and the Activation Threshold

Retention in AI products concentrates around a single upstream variable: whether users reached a first meaningful success early. Notion found this clearly — retention correlated strongly with "time to first note" in their AI writing feature rollout. Users who hit a first success quickly came back. Users who didn't, didn't.

The activation metric that predicts 3-month retention most reliably is 7-day activation — whether users had a successful outcome during their first week. This is more predictive than session count, query volume, or any accuracy-related metric.

GitHub Copilot's license utilization sits around 80% in enterprise deployments, with roughly 1% monthly churn. ChatGPT Plus 6-month retention is 71%. Claude Pro is at 62%. The difference between these numbers isn't driven by accuracy benchmarks — it's driven by how quickly users reach outcomes they couldn't have achieved without the tool.

Return rate as a standalone metric is lagging. By the time you see churn, the failure happened weeks ago. The leading indicator is the Frustration Index — a composite behavioral signal that combines message repetition, short follow-ups after lengthy responses, explicit clarification requests, and session abandonment after AI responses. Users whose frustration index rises across two consecutive sessions are approximately 3x more likely to churn within 14 days. You can act on that before you lose them.
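The article names the Frustration Index's components but not a formula, so the weights below are purely illustrative assumptions; any real deployment would fit them against observed churn:

```python
def frustration_index(session):
    """Composite behavioral signal from the four components named above.
    session: dict of per-session counts. Weights are illustrative, not canonical."""
    return (0.30 * session["repeated_messages"]
            + 0.25 * session["short_followups_after_long_responses"]
            + 0.25 * session["clarification_requests"]
            + 0.20 * session["abandoned_after_ai_response"])

def churn_risk_flag(prev_session, curr_session):
    """Flag users whose index rises across two consecutive sessions —
    the leading-indicator condition described in the text."""
    return frustration_index(curr_session) > frustration_index(prev_session)
```

The point of the composite is timing: it moves within a session or two, weeks before return rate does.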

Instrumenting for User Value: What to Actually Track

The instrumentation gap between teams that measure system performance and teams that measure user value comes down to event schema. Teams that only measure system performance log generation events. Teams that measure user value log what happens after generation.

The event pairs that matter:

  • suggestion_shown → suggestion_accepted / suggestion_rejected / suggestion_edited
  • task_started → task_completed / task_abandoned
  • output_generated → output_used_as_is / output_modified / output_regenerated
  • session_started → session_completed_with_success / session_abandoned
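A minimal sketch of logging against this paired schema and finding the gap between them. The event names come from the list above; the logger shape and ID fields are assumptions:

```python
import time

# Opening events mapped to their terminal events, per the pairs above.
PAIRS = {
    "suggestion_shown": {"suggestion_accepted", "suggestion_rejected", "suggestion_edited"},
    "task_started": {"task_completed", "task_abandoned"},
    "output_generated": {"output_used_as_is", "output_modified", "output_regenerated"},
    "session_started": {"session_completed_with_success", "session_abandoned"},
}

def log_event(stream, name, entity_id, **props):
    """Append one event record; fields beyond name/id/ts are illustrative."""
    stream.append({"event": name, "id": entity_id, "ts": time.time(), **props})

def unresolved(stream):
    """Entities with an opening event but no terminal event yet — the
    blind spot for teams that only log that output was produced."""
    opened, closed = {}, set()
    for e in stream:
        if e["event"] in PAIRS:
            opened[e["id"]] = e["event"]
        else:
            closed.add(e["id"])
    return {i: ev for i, ev in opened.items() if i not in closed}

stream = []
log_event(stream, "task_started", "t1")
log_event(stream, "task_completed", "t1")
log_event(stream, "task_started", "t2")
print(unresolved(stream))  # {'t2': 'task_started'}
```

Every metric in this article — TCR, edit rate, session depth by outcome — is a fold over event pairs like these; without the terminal half of each pair, none of them are computable.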

Cognitive latency — the time between output generation and user acceptance or rejection — is a particularly underused signal. Fast acceptance means immediate trust. Long deliberation followed by editing means the user is performing manual verification. Long deliberation followed by rejection means the output failed to help. All three have the same latency shape at the generation level but very different shapes at the acceptance level.
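The three patterns above can be bucketed once acceptance-level events carry timestamps. The 5-second fast-acceptance threshold is an assumption for illustration; the outcome labels mirror the text:

```python
def cognitive_latency(generated_ts: float, decided_ts: float) -> float:
    """Seconds between output generation and the user's accept/reject/edit."""
    return decided_ts - generated_ts

def latency_shape(latency_s: float, outcome: str, fast_threshold: float = 5.0) -> str:
    """Bucket one decision by the three patterns described above.
    fast_threshold is an illustrative assumption, not a published cutoff."""
    if latency_s <= fast_threshold and outcome == "accepted":
        return "immediate_trust"
    if outcome == "edited":
        return "manual_verification"   # user is verifying, then compensating
    if outcome == "rejected":
        return "failed_to_help"
    return "deliberated_acceptance"
```

All four buckets look identical if you only log generation latency; they only separate once the decision event is instrumented.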

Verification cost is the metric beneath cognitive latency: how much effort does a user need to validate that the AI output is correct? High verification costs indicate low user confidence regardless of technical accuracy. Contact center agents who continued typing their own notes rather than accepting AI summaries were paying extremely high verification costs — reviewing the AI output, finding inconsistencies, correcting them, and deciding it was faster to start from scratch.

Microsoft measures "Copilot-assisted hours" — an estimate of total time employees were assisted, derived from action counts and research-derived multipliers. Their employees reported saving 9 hours per month on average. This is a value metric. API call counts are not.

The Framework: System Performance vs. Product Performance

The diagnostic question for any AI product metric: does this metric move when the AI helps users succeed, or does it move when the AI generates more output?

System performance metrics move when the AI generates more output: accuracy, latency, token consumption, suggestion impressions, API invocations, benchmark scores. These are necessary for engineering quality control. They are insufficient for product quality signals.

Product performance metrics move when users succeed: task completion rate, edit rate and edit distance, time to first successful outcome, return rate, abandonment rate after AI responses. These are what determine whether users keep coming back.

The goal is not to abandon system performance metrics — accuracy matters, latency matters. The goal is to stop using them as proxies for user value. They're upstream inputs. Completion rate, edit rate, and return rate are the outputs that determine whether you have a product.

Most AI teams have rich infrastructure for system performance metrics and almost nothing for product performance metrics. The fix is instrumentation design: build the event schema that captures what users do with AI output, not just that AI output was produced. Then measure completion, trust, and return. Those are the numbers that tell you whether you're building something people will actually use.
