AI Feature PMF Signals: Why Your Metrics Are Lying to You
When your AI feature ships and the metrics light up — DAU spikes, NPS climbs, thumbs-up feedback floods in — you could be looking at genuine product-market fit. Or you could be watching the first act of a two-part story where the second act ends with a retention cliff nobody saw coming.
The problem is that these signals are structurally broken for probabilistic AI features. They were designed for deterministic software, where "activated" means something, where a five-star rating predicts future use, and where novelty fades in days instead of masking a six-month churn wave. AI features behave differently, and the standard PMF toolkit is calibrated for the wrong inputs.
What follows is a breakdown of why the conventional metrics mislead, what behavioral patterns actually distinguish genuine PMF from novelty effect, and how to instrument your cohort analysis to tell the difference before the cliff arrives.
Why Conventional Signals Break for AI Features
NPS conflates novelty with satisfaction. When a new AI feature launches, users are curious. They explore. The interactions feel surprising and valuable — partly because they are, and partly because anything novel registers as good before the baseline shifts. NPS collected in weeks two through six of a launch captures this novelty premium and embeds it in your dashboard as a signal of product quality. By month four, when the novelty premium has evaporated and you're comparing against inflated historical scores, the feature looks like it's declining. It wasn't declining — it was never where you thought it was.
Thumbs-up ratings reflect revealed politeness, not revealed preference. There's a well-documented gap between what people rate favorably and what they actually use. Users rate documentaries five stars and watch reality TV. For AI outputs specifically, users will thumbs-up a response that is competent and complete even if they don't trust it enough to act on it unsupervised, even if they corrected it before using it, even if they'd never ask this feature to do that job again. The binary rating captures sentiment in the moment; it says nothing about whether the feature earned a place in their workflow.
Activation rates measure whether users arrived, not whether they stayed. High activation combined with low D30 retention is the fingerprint of novelty, not PMF. Users try an AI feature because it's interesting — same reason they watch a product demo or click a TechCrunch article. The activation event tells you the marketing worked, or that curiosity is universal, or that the onboarding didn't immediately repel people. It does not tell you the feature solved a problem worth solving repeatedly.
The underlying issue across all three metrics is that they conflate a first impression with a durable relationship. For traditional software features — export to CSV, multi-factor auth, saved filters — that conflation is mostly harmless because there's no novelty premium to expire. AI features carry a novelty premium that is large, temporary, and at its peak during exactly the window when teams are most aggressively measuring PMF.
The Novelty Cliff in the Wild
The pattern is visible in the retention data for major AI models and products. Andreessen Horowitz's analysis of AI consumer retention curves shows a consistent phenomenon: early launch cohorts hold relatively well — around 35–40% D30 for top performers — while cohorts arriving two or three months later see dramatically steeper drop-off. By month five or six, the later cohorts have churned almost entirely, even as the early cohorts continue to show reasonable retention.
This isn't because the product got worse. It's because the early cohort contained users who found a genuine fit — their use cases aligned with the model's strengths, they integrated it into workflows, they became dependent on it in the way that generates real retention. The later cohort arrived after those high-fit use cases were already claimed. They experimented, found weaker alignment, and left.
If you measure retention only on your first two cohorts, you'll read early numbers that suggest PMF. If you measure across cohorts and track how retention evolves over successive waves of new users, you'll see whether you're accumulating genuinely sticky users or just temporarily exciting new ones.
The session depth signal is related. Perplexity has long average sessions — around 23 minutes — because it's being used for high-friction research tasks where users have strong intent and are actively engaged in a workflow. Shallow sessions of two to three minutes indicate curiosity and browsing, not task completion and workflow integration. Activation rate alone can't distinguish these; session depth by use case can.
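If you want to check this in your own data, a minimal sketch is below. It assumes a hypothetical sessions export with user_id, use_case, and duration_minutes columns, and the three-minute cutoff is illustrative rather than a benchmark.

```python
import pandas as pd

# Hypothetical sessions export with one row per session.
sessions = pd.read_parquet("sessions.parquet")  # columns: user_id, use_case, duration_minutes, ...

depth = (
    sessions.groupby("use_case")["duration_minutes"]
    .agg(median_minutes="median", sessions="count")
    .reset_index()
    .sort_values("median_minutes", ascending=False)
)
# Flag use cases where the typical session looks like browsing rather than a workflow.
depth["shallow"] = depth["median_minutes"] < 3
print(depth)
```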
The Three Signals That Actually Matter
D30+ retention disaggregated by cohort. Not overall D30 — D30 broken out by launch cohort, by primary use case, and ideally by week of acquisition. If cohort three retains at 40% and cohort seven retains at 8%, the feature has a use-case saturation problem and the early number is not predictive. If cohort seven is close to cohort three, you're accumulating a genuinely large base of sticky users and the signal is real.
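Here's a rough sketch of that disaggregation in pandas, assuming a hypothetical event export with one row per AI interaction; the column names and the weekly cohort grain are illustrative, not prescriptive.

```python
import pandas as pd

# Hypothetical event export: one row per AI feature interaction.
events = pd.read_parquet("feature_events.parquet")  # columns: user_id, ts, use_case, ...
events["ts"] = pd.to_datetime(events["ts"])

first_use = events.groupby("user_id")["ts"].transform("min")
events["days_since_first"] = (events["ts"] - first_use).dt.days
events["cohort_week"] = first_use.dt.to_period("W")

# A user counts as D30-retained if they have any event 30 or more days after first use.
user_d30 = events.groupby("user_id").agg(
    cohort_week=("cohort_week", "first"),
    d30_retained=("days_since_first", lambda d: (d >= 30).any()),
)

# Only score cohorts old enough to have had a fair shot at D30; otherwise the
# newest cohorts look like they are collapsing simply because they are young.
cutoff = events["ts"].max() - pd.Timedelta(days=30)
mature = user_d30[user_d30["cohort_week"].dt.end_time <= cutoff]
print(mature.groupby("cohort_week")["d30_retained"].mean())
# Add use case (or primary use case) to the groupby to break the same number out further.
```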
Override rate over time. For AI features that generate outputs users then act on, track how often users correct, modify, or ignore the output rather than accepting it. High override rates at launch are expected — users are calibrating trust, exploring edge cases, learning the model's failure modes. The signal is in the trajectory: if override rate declines over the first 60 to 90 days, users are developing trust and integrating the feature into their workflow at increasing depth. Flat or rising override rates after the initial calibration period signal that the model's quality isn't crossing the threshold users need to delegate to it.
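A sketch of the trajectory calculation, under the assumption that you log one row per AI output with a boolean overridden flag covering corrections, edits, and discards:

```python
import pandas as pd

# Hypothetical outputs table: one row per AI output, with a boolean `overridden`
# flag set when the user corrected, modified, or discarded the output.
outputs = pd.read_parquet("ai_outputs.parquet")  # columns: user_id, ts, overridden, ...
outputs["ts"] = pd.to_datetime(outputs["ts"])

first_use = outputs.groupby("user_id")["ts"].transform("min")
outputs["week_of_life"] = (outputs["ts"] - first_use).dt.days // 7

# Override rate by week since each user's first use, over roughly the first 90 days.
trajectory = (
    outputs[outputs["week_of_life"] <= 12]
    .groupby("week_of_life")["overridden"]
    .mean()
)
print(trajectory)  # a declining curve suggests growing trust; flat or rising is the warning sign
```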
Intent resolution rate — whether the user's underlying goal was actually achieved — is the metric that ties this together. Products with intent resolution above 70% show substantially stronger D30 retention than those below 55%. The gap isn't because better products keep users longer in some abstract sense; it's because resolved intent means the user got value this session and will come back to get value next session.
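Measuring that relationship is straightforward once sessions carry a resolution label. The sketch below continues from the frames above and assumes a per-session intent_resolved boolean, however you choose to derive it (explicit confirmation, follow-up behavior, or sampled human review).

```python
# Continuing from the `sessions` and `user_d30` frames above; `intent_resolved`
# is an assumed per-session boolean label, not a column your tooling provides for free.
resolution = (
    sessions.groupby("user_id")["intent_resolved"].mean().rename("resolution_rate")
)
joined = user_d30.join(resolution)

high = joined.loc[joined["resolution_rate"] >= 0.70, "d30_retained"].mean()
low = joined.loc[joined["resolution_rate"] <= 0.55, "d30_retained"].mean()
print(f"D30 retention: {high:.0%} for high-resolution users vs {low:.0%} for low")
```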
Task diversity expansion. Genuine PMF users don't just return to the same use case repeatedly — they discover adjacent use cases. A user who starts using an AI writing tool for email drafts and then begins using it for meeting notes, then for proposal outlines, is demonstrating that they've found a workflow fit deep enough to extend. That task diversity expansion is a leading indicator of retention: users who expand task types in weeks three through six retain materially better at D30 than users who stay in a single task type.
Novelty users don't expand task types. They try the headline use case, form an impression, and either retain or churn based on that impression alone. PMF users discover the feature's value in one context and naturally ask what else it can do — because the underlying value proposition resonates broadly, not just for the specific demo scenario that first attracted them.
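A sketch of the expansion signal, continuing from the frames above and assuming each event carries a hypothetical task_type label:

```python
# Continuing from the `events` and `user_d30` frames above; `task_type` is a
# hypothetical label on each event (e.g. "email_draft", "meeting_notes").
week1_types = (
    events[events["days_since_first"] < 7]
    .groupby("user_id")["task_type"].nunique()
)
weeks3_to_6_types = (
    events[events["days_since_first"].between(14, 41)]
    .groupby("user_id")["task_type"].nunique()
)
expanded = (
    weeks3_to_6_types.reindex(week1_types.index, fill_value=0) > week1_types
).rename("expanded")

# D30 retention for users who expanded task types vs those who did not.
print(user_d30.join(expanded).groupby("expanded")["d30_retained"].mean())
```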
How to Build Cohort Analysis for AI Features
Standard cohort analysis treats cohort date as the primary segmentation axis. For AI features, that's insufficient. You need two additional dimensions.
Use-case cohorts. Segment users by their primary use case at first meaningful engagement, not by acquisition date. A user who first uses your AI feature for data analysis and a user who first uses it for content generation are in fundamentally different situations — different quality thresholds, different override rate expectations, different task expansion trajectories. Aggregating them into a single cohort by week obscures which use cases have genuine PMF and which are novelty-driven.
Feature depth cohorts. Track how much of the relevant workflow the feature owns for each user. A feature that handles 90% of a user's document review process is structurally different from one that handles one step in a ten-step process. Deep workflow integration predicts retention; shallow integration is much more susceptible to displacement by alternatives or to being abandoned when novelty fades. Tools like Amplitude and PostHog support the behavioral segmentation needed to build these cohorts, though the analytical work of defining what "workflow depth" means for your specific use case is something you have to do yourself.
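A sketch of both assignments is below. The meaningful-engagement threshold of three interactions per session, the depth buckets, and the steps_ai_handled and steps_total columns are all assumptions to swap for definitions that fit your product.

```python
import pandas as pd

events = pd.read_parquet("feature_events.parquet")  # user_id, session_id, ts, use_case, ...
events["ts"] = pd.to_datetime(events["ts"])

# Use-case cohort: the use case of each user's first session with 3+ AI interactions.
session_sizes = events.groupby(["user_id", "session_id"]).size().rename("n")
meaningful = session_sizes[session_sizes >= 3].reset_index()[["user_id", "session_id"]]
first_meaningful = (
    events.merge(meaningful, on=["user_id", "session_id"]).sort_values("ts")
    .groupby("user_id")["use_case"].first()
    .rename("primary_use_case")
)

# Feature-depth cohort: share of workflow steps the feature handled, bucketed.
# `steps_ai_handled` and `steps_total` are hypothetical columns you would have to log.
depth = events.groupby("user_id").agg(
    owned=("steps_ai_handled", "sum"), total=("steps_total", "sum")
)
depth["depth_cohort"] = pd.cut(
    depth["owned"] / depth["total"],
    bins=[0, 0.25, 0.6, 1.0],
    labels=["shallow", "partial", "deep"],
    include_lowest=True,
)
extra_cohorts = pd.concat([first_meaningful, depth["depth_cohort"]], axis=1)
```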
The goal is to find even one narrow use case where D30 retention is high, override rates are declining, and task diversity is expanding. That is more informative than a mediocre average across every use case. A feature with genuine PMF in a small segment can be expanded; a feature with middling retention everywhere is likely novelty-driven across the board and will decline as the novelty wears off.
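As a rough illustration, a per-use-case scan that combines these signals might look like the sketch below, continuing from the frames built earlier; the thresholds are placeholders, not benchmarks.

```python
# Continuing from `user_d30`, `expanded`, and `extra_cohorts` above. The thresholds
# are placeholders to replace with your own baselines, not benchmarks.
scan = (
    user_d30
    .join(extra_cohorts["primary_use_case"])
    .join(expanded.astype(float))
    .groupby("primary_use_case")
    .agg(
        d30=("d30_retained", "mean"),
        expansion_rate=("expanded", "mean"),
        users=("d30_retained", "size"),
    )
)
# Candidate segments: strong retention, real task expansion, and enough users that
# the numbers are not noise. Layer the override-rate slope per use case on top.
print(scan[(scan["d30"] >= 0.30) & (scan["expansion_rate"] >= 0.20) & (scan["users"] >= 50)])
```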
PMF Is Continuous, Not a Milestone
Traditional PMF is often framed as a threshold event — you either have it or you don't, and once you have it, you move to scaling. For AI features, this framing breaks down because model quality changes monthly, user expectations calibrate continuously to the current state of the art, and what constitutes a "solved" task keeps moving.
A feature that had genuine PMF on a model from six months ago may lose that PMF when users upgrade their internal baseline based on exposure to better models elsewhere. The override rate can climb back up. The task diversity expansion can stall. The D30 cohort retention can start diverging between cohorts that joined before and after the user's expectations shifted.
This means the behavioral signals aren't a one-time check — they're a continuous instrumentation requirement. Teams that instrument well in the first 90 days and then move on to other priorities will miss the model-expectation drift that erodes their PMF over the following two quarters. The signals that distinguish genuine PMF from novelty are the same signals that distinguish sustained PMF from decaying PMF, and they require the same ongoing monitoring.
The teams that get this right are the ones that stop treating AI feature PMF as a launch milestone and start treating retention signal analysis as part of the product's operational health monitoring — as fundamental as latency dashboards or error rates. The metrics are behavioral, they're cohort-disaggregated, and they're lagging by design. That's what makes them reliable in a way that thumbs-up ratings and activation spikes are not.
- https://a16z.com/the-cinderella-glass-slipper-effect-retention-rules-in-the-ai-era/
- https://a16z.com/ai-retention-benchmarks/
- https://a16z.com/state-of-consumer-ai-2025-product-hits-misses-and-whats-next/
- https://www.bvp.com/atlas/mastering-product-market-fit-a-detailed-playbook-for-ai-founders
- https://agnost.ai/blog/intent-resolution-rate-ai-quality-revenue/
- https://agnost.ai/blog/6-metrics-every-ai-native-product-should-track/
- https://andrewchen.substack.com/p/how-novelty-effects-and-dopamine
- https://andrewchen.substack.com/p/why-high-growth-high-churn-products
- https://www.bejoyous.ai/ceo-blog/pmf-to-aimf
- https://www.thevccorner.com/p/ai-product-market-fit-framework-openai
