
The AI Feature Metric Trap: Why DAU and Retention Lie About Stochastic Surfaces

11 min read
Tian Pan
Software Engineer

A PM walks into the AI feature review with a slide that reads "+12% engagement, +8% session length, retention up 3 points." The room nods. Two desks over, the support lead is staring at a different chart: tickets touching the AI surface are up 22%, and the most common resolution code is "user gave up, agent helped manually." Both numbers are real. Both come from the same product. The PM's dashboard is built on the assumption that the AI feature emits the same shape of event as the button it replaced. It doesn't. And the gap between what the dashboard counts and what the user experienced is where AI features quietly fail in plain sight.

The deterministic-feature playbook treats interaction as a click stream: user fires an event, the system reacts, the user moves on. AI features have a different event shape — a task arc with phases, retries, side trips to a human, and an offline judgment the telemetry never sees. Importing the deterministic dashboard onto that arc is the analytics equivalent of running 2018's interview loop against 2026's job. The numbers go up. The thing the numbers were supposed to predict goes down.

DAU, Conversion, and Retention Were Built for a World That No Longer Exists

The standard product-analytics triad — daily active users, conversion through a defined funnel, retention curves over weeks — assumes three things that AI features violate.

First, that an event is atomic. A click either happened or didn't; the meaning is in the count. An AI request happened and produced an output and the output was acted on or ignored and the user may have re-asked because they didn't trust the answer. The single "AI feature used" event in your warehouse compresses four orthogonal facts into one bit. Anything you build on top of that bit will be wrong in ways the bit itself can't reveal.
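
A tiny sketch of what unpacking that bit could look like, assuming a hypothetical per-request record; the field names are illustrative, not a standard schema:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class AIRequestEvent:
    """One AI request, with the four facts kept as separate fields
    instead of being collapsed into a single 'AI feature used' bit."""
    task_id: str                     # the task arc this request belongs to
    request_id: str                  # this specific request
    output_produced: bool            # did the model return an output at all
    output_acted_on: Optional[bool]  # did the user act on it (None = not yet known)
    reasked: bool                    # did the user re-ask because they didn't trust the answer
```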

Second, that engagement correlates with value. For a feed product, more sessions usually mean more value. For an AI feature, repeated sessions on the same task often mean the user couldn't get it right the first time. The same metric inverts its sign depending on whether the surface is deterministic or stochastic, and the dashboard has no idea which mode it's in.

Third, that retention measures satisfaction. Retention on a chat surface looks identical whether the user is delighted or whether the user is trapped in a fallback loop because the agent keeps almost-but-not-quite handling their request. The behavioral signature is the same: they came back. The interpretation is opposite. PostHog's own write-up on LLM product metrics buries the lede here — the quality of the output as rated by users correlates with growth and churn, but the rating is the input the standard dashboard never collected.

Two patterns make the failure mode concrete. The launched-and-celebrated pattern: a team ships an AI summary feature, sees a 30% lift in time-on-page during the first month, and books the win. By month three, the lift is gone, support tickets about "the summary was wrong but I didn't notice until I shared it" are up, and the team retroactively realizes it measured novelty, not utility. The "+12% engagement" pattern: a help center adds an AI agent, deflection metrics rise (fewer tickets opened by the same set of users), and the team declares victory — until the support lead points out that ticket severity is up because the easy ones got deflected and the hard ones now arrive with frustrated users who already burned twenty minutes on the agent.

Neither story is about the model being bad. Both are about measurement looking at the wrong shape.

The Event Shape You Actually Need: Task Arcs, Not Clicks

The unit of analysis for an AI feature is the task, not the event. A task has phases — request, response, follow-up, resolution — and the resolution may happen in a different session than the request, on a different surface, with a different actor (the user, a human agent, or no one at all because the user abandoned). Your event schema has to make the task a first-class entity that survives the cross-session arc.

Concretely, the instrumentation that makes the right metrics computable looks like this. Every AI request opens a task_id that persists. Every model response, every user reply, every escalation to a human, every silent abandonment after N minutes — all stamp that task_id. The OpenTelemetry semantic conventions for generative AI agentic systems went this direction explicitly: tasks are the minimal trackable unit, and actions roll up into them. If your warehouse can't answer "how did task X resolve" with a single key, you do not have AI feature analytics — you have AI feature logs.
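
A minimal sketch of that stamping discipline, assuming a hypothetical emit() that writes rows to the warehouse; the event names, fields, and resolution vocabulary are illustrative, not the OpenTelemetry GenAI conventions themselves:

```python
import time
import uuid

EVENTS: list[dict] = []   # stand-in for the warehouse sink

def emit(row: dict) -> None:
    EVENTS.append(row)

def open_task(surface: str, user_id: str) -> str:
    """Every AI request opens a task_id that every later event stamps."""
    task_id = str(uuid.uuid4())
    emit({"event": "task_opened", "task_id": task_id, "surface": surface,
          "user_id": user_id, "ts": time.time()})
    return task_id

def record_response(task_id: str, model: str, output: str) -> None:
    emit({"event": "model_response", "task_id": task_id, "model": model,
          "output_chars": len(output), "ts": time.time()})

def record_escalation(task_id: str, routed_to: str) -> None:
    # escalation stays on the SAME task; it is not a new task
    emit({"event": "escalated", "task_id": task_id, "routed_to": routed_to,
          "ts": time.time()})

def record_resolution(task_id: str, how: str) -> None:
    # assumed vocabulary: "ai_resolved" | "human_resolved" | "abandoned"
    emit({"event": "resolved", "task_id": task_id, "how": how, "ts": time.time()})
```

With this shape, "how did task X resolve" is a filter on one key rather than a join across three systems.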

Two boundary problems to handle on day one. The cross-session boundary: a user asks a question on Tuesday, doesn't trust the answer, comes back on Thursday and asks again — that's one task with two requests, not two unrelated requests. Without explicit task continuity (a thread id, a conversation id, a client-side replay heuristic on near-duplicate prompts within a window), retention goes up while satisfaction goes down. The cross-surface boundary: the user asks the agent, the agent fails, the user clicks the "talk to a human" link, and the human resolves the ticket. That's a task that resolved successfully — but the AI surface's metrics will record an abandonment, and the support tool's metrics will record a fresh ticket, and nobody's dashboard will say "the system worked." The escalation has to be a stamped event on the same task, not a new event on a new task.
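
For the cross-session boundary, one cheap heuristic is to attach a new request to a recent unresolved task when the prompt is a near-duplicate. A sketch, assuming a 3-day continuity window and difflib similarity as the duplicate test; both thresholds are illustrative knobs:

```python
import difflib

CONTINUITY_WINDOW_S = 3 * 24 * 3600   # assumed: re-asks within 3 days join the old task
SIMILARITY_THRESHOLD = 0.85           # assumed near-duplicate cutoff

def find_continuing_task(user_tasks: list[dict], prompt: str, now: float) -> str | None:
    """Return the task_id of a recent, unresolved task whose last prompt is a
    near-duplicate of the new prompt; otherwise None (caller opens a new task)."""
    for task in reversed(user_tasks):   # newest first
        if task["resolved"] or now - task["last_ts"] > CONTINUITY_WINDOW_S:
            continue
        similarity = difflib.SequenceMatcher(
            None, task["last_prompt"].lower(), prompt.lower()).ratio()
        if similarity >= SIMILARITY_THRESHOLD:
            return task["task_id"]
    return None
```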

This is unglamorous schema work. It is also the precondition for every metric below being meaningful. Skipping it and shipping the dashboard anyway is the most common version of the trap.

The Five Metrics That Survive Stochastic Surfaces

With the task-arc instrumentation in place, the metric set that actually predicts AI feature value looks different from the deterministic playbook.

Task-completion rate at the task level, not the session level. The denominator is "tasks the user opened," not "sessions the user started." A user who opens one task and resolves it counts as one successful task, not as a low-engagement session. A user who needed three sessions to resolve one task counts as one task, not as a high-engagement user. The shift in denominator catches the inversion that engagement metrics paper over. Industry benchmarks for well-implemented agents land around 85–95% on structured tasks; below 80% reliably correlates with adoption death regardless of how good the output looks in cherry-picked demos.
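
A sketch of the denominator shift, assuming task records rolled up from the stamping events above, with a resolution field and a session_count per task:

```python
def task_completion_rate(tasks: list[dict]) -> float:
    """Denominator: tasks the user opened, not sessions the user started.
    Counts only tasks the AI surface itself resolved; escalations are
    covered by the next metric."""
    if not tasks:
        return 0.0
    return sum(t["resolution"] == "ai_resolved" for t in tasks) / len(tasks)

def sessions_per_resolved_task(tasks: list[dict]) -> float:
    """The inversion the prose describes: one task across three sessions is
    one resolved task, not three engaged sessions."""
    resolved = [t for t in tasks if t["resolution"] == "ai_resolved"]
    if not resolved:
        return 0.0
    return sum(t["session_count"] for t in resolved) / len(resolved)
```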

Escalation-to-human rate as a first-class number, not a failure flag. The temptation is to treat every escalation as a loss for the AI. Don't. Escalation is the right outcome for tasks the model shouldn't have attempted, and it's a wrong outcome for tasks the model should have nailed. The useful framing splits escalation by task class: for tasks the agent is supposed to handle, an escalation is a quality signal; for tasks the agent is supposed to refer, an escalation is a routing success. Reporting one number across both classes is how teams end up either over-tuning the model into refusing everything or under-tuning it into hallucinating answers it should have punted.
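
A sketch of the split, assuming each task carries a task_class label ("agent_handles" vs "agent_refers", names illustrative) and an escalated flag stamped on the task arc:

```python
def escalation_rates_by_class(tasks: list[dict]) -> dict[str, float]:
    """Report escalation rate separately for tasks the agent should handle
    and tasks it should refer, instead of one blended number."""
    rates: dict[str, float] = {}
    for cls in ("agent_handles", "agent_refers"):
        in_class = [t for t in tasks if t["task_class"] == cls]
        if in_class:
            rates[cls] = sum(t["escalated"] for t in in_class) / len(in_class)
    return rates
```

The same escalation count reads as a quality problem in the first class and as correct routing in the second.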

Repeat-task rate that distinguishes engaged users from broken trust. The same user asking three related but distinct questions is engagement. The same user re-phrasing the same question three times within an hour is failure. The cheap proxy is a near-duplicate detector on prompts within a window, with the duplicate flagged on the second request and the task arc rolled up so the metric reads as "fraction of resolved tasks that required user re-asks." Without this split, repeat usage looks like product-market fit when it's often product-market frustration.
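
A sketch of the rolled-up metric, assuming the continuity heuristic above already stamped a reask_count onto each task:

```python
def repeat_task_rate(tasks: list[dict]) -> float:
    """Fraction of resolved tasks that needed at least one user re-ask,
    i.e. the same question re-phrased and stitched onto the same task arc."""
    resolved = [t for t in tasks
                if t["resolution"] in ("ai_resolved", "human_resolved")]
    if not resolved:
        return 0.0
    return sum(t["reask_count"] > 0 for t in resolved) / len(resolved)
```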

Abandonment-after-output as a quality proxy. The user got an answer; the user did not act on the answer; the user did not re-ask either. They looked at it and walked away. This is the quietest failure mode and the hardest one to surface in a click-stream warehouse, because the absence of a follow-up event is the signal. The instrumentation has to assert a timeout on the task — N minutes after the response, with no follow-up and no abandonment-cause event from the client (closed tab, navigated away, copied output to clipboard), mark the task abandoned-after-output. Then watch the rate by feature surface, by task class, by model version. A spike in abandonment after a model upgrade is a regression even if every other metric improved.
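
A sketch of the timeout sweep, assuming a 30-minute N and the task fields from the stamping example; the window, field names, and the extra resolution value are placeholders:

```python
import time

ABANDON_TIMEOUT_S = 30 * 60   # assumed N: 30 minutes of silence after the output

def sweep_abandoned_after_output(open_tasks: list[dict], now: float | None = None) -> list[str]:
    """Mark tasks abandoned-after-output: the model responded, then nothing,
    no follow-up, no escalation, no abandonment-cause event from the client."""
    now = now or time.time()
    abandoned = []
    for t in open_tasks:
        if (t.get("last_response_ts")
                and not t.get("followed_up")
                and not t.get("escalated")
                and not t.get("client_abandon_event")
                and now - t["last_response_ts"] > ABANDON_TIMEOUT_S):
            t["resolution"] = "abandoned_after_output"
            abandoned.append(t["task_id"])
    return abandoned
```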

Response edit-distance for AI-generated content the user keeps but rewrites. GitHub Copilot's published acceptance-rate metric — about 30% of suggestions accepted on average — is the headline number, but the more interesting number for product quality is what happens to the accepted suggestion in the next ten minutes. If the median accepted suggestion gets edited beyond a Levenshtein threshold within a short window, the acceptance rate is over-counting wins. The same logic applies to any AI surface that produces text the user can keep and edit: drafts, summaries, code, replies. Edit distance against the original output, computed at a sensible delay, separates "acceptance" from "approval."
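
A sketch of the acceptance-versus-approval check, using a plain Levenshtein implementation and an assumed 20% edit-ratio threshold; the threshold and the delay at which you capture the kept text are knobs, not published numbers:

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance; fine for draft-sized text."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def acceptance_was_approval(original: str, kept_after_delay: str,
                            max_edit_ratio: float = 0.2) -> bool:
    """An accepted output only counts as approved if what the user kept a
    few minutes later is within the edit-ratio threshold of the original."""
    if not original:
        return False
    return levenshtein(original, kept_after_delay) / len(original) <= max_edit_ratio
```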

None of these metrics replace the deterministic dashboard outright. They live next to it and constrain its interpretation. Engagement going up while abandonment-after-output is also going up means the engagement is fake. Repeat-task going up while task-completion is going up means engagement is real. The interaction matters more than any single number.

The Variance Problem the Deterministic Dashboard Was Allowed to Ignore

Deterministic features have measurement noise from sampling — yesterday's traffic is not exactly today's traffic — but the underlying event is stable. AI features add a second noise source: the model itself is stochastic, and the same prompt on the same day can produce different outputs with different downstream task outcomes. Recent work on decomposing LLM evaluation noise calls this prediction noise and shows it routinely exceeds data noise in eval suites, which means averaging across users doesn't smooth it the way analysts are used to.
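
A rough way to see the two noise sources in your own data, assuming each prompt in an eval set was run at least twice and each output given a numeric score; prediction noise is the within-prompt variance across repeated runs, data noise the variance of per-prompt means:

```python
from statistics import mean, pvariance

def decompose_noise(scores_by_prompt: dict[str, list[float]]) -> tuple[float, float]:
    """Return (prediction_noise, data_noise) for a set of scored prompt runs.
    Needs two or more runs per prompt for the within-prompt term to mean anything."""
    prediction_noise = mean(pvariance(runs) for runs in scores_by_prompt.values())
    data_noise = pvariance([mean(runs) for runs in scores_by_prompt.values()])
    return prediction_noise, data_noise
```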

In practice this hits A/B tests in two ways. First, the minimum detectable effect inflates because per-task variance is higher than per-click variance. A 5% improvement that a deterministic test could call in two weeks may need six weeks for a stochastic feature, and the team that doesn't budget for that ends up shipping prompt changes on noise. Second, day-to-day metric drift looks larger than it is, and the dashboard alarm that pages on a 10% dip will fire constantly. Teams that publish their LLM observability practice have settled on coarser windows, paired comparisons across model versions, and explicit confidence intervals on every reported number — none of which are habits the deterministic-feature analyst arrived with.

The cohort-of-judges pattern from the eval literature has a product-analytics analog: when measuring quality online, sample multiple independent quality signals — task-completion, escalation rate, abandonment, edit distance — and require that two of them move together before declaring a regression. A single metric moving is noise; a coordinated movement across orthogonal signals is signal. Teams that ship this discipline catch real regressions weeks before teams that wait for one number to cross a threshold.
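
A sketch of the two-signals-must-agree guard, assuming metric deltas are signed so that a positive value always means worse; the per-metric thresholds are whatever your variance budget says they should be:

```python
def regression_detected(metric_deltas: dict[str, float],
                        thresholds: dict[str, float],
                        required_agreement: int = 2) -> bool:
    """Declare a regression only when at least `required_agreement` orthogonal
    signals cross their thresholds in the bad direction."""
    worsened = [name for name, delta in metric_deltas.items()
                if delta >= thresholds.get(name, float("inf"))]
    return len(worsened) >= required_agreement
```

For example, abandonment-after-output ticking up alone stays a watch item; abandonment and edit distance moving together after a model version change pages someone.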

The Org Failure Mode the Dashboard Hides

The dashboard says +12% engagement. The support team's queue says +22% AI-routed tickets with frustrated openers. The model evals say quality is flat. The CFO sees the dashboard. The support lead sees the queue. The model team sees the evals. None of them are wrong. None of them are looking at the same thing.

The fix isn't a better metric. It's a single owner of the AI surface's analytics who sits between the PM dashboard, the support telemetry, and the eval pipeline, and whose job is to flag when the three diverge. Most orgs don't have this role. They have an analyst who owns the dashboard, an ops lead who owns the queue, and an ML engineer who owns the evals — and the divergence between the three is everyone's problem and nobody's job. The AI feature metric trap is, at the org level, a coordination problem dressed up as a measurement problem.

Treat AI product analytics as its own discipline with its own event shape, its own metric set, its own variance budget, and a named owner of the cross-system view. The deterministic dashboard isn't going away — it still answers questions about funnels and pricing pages and onboarding. It just stops being the right tool the moment the surface it's pointed at can produce a different output every time it's asked.
