Why LLMs Make Confident Mistakes When Analyzing Your Product Data
Product teams have started routing analytical questions directly to LLMs: "What's causing the churn spike?" "Why did conversion drop after the redesign?" "Which cohort should we focus retention spend on?" The outputs land in executive decks, drive roadmap decisions, and get presented to investors. The models answer confidently, in polished prose, with specific numbers. And a significant fraction of those answers are wrong in ways that don't announce themselves.
This isn't a general criticism of LLMs for data work. There are tasks where they genuinely help. The problem is that the failure modes are invisible — the model doesn't hedge, doesn't caveat, and doesn't distinguish between "I computed this from your data" and "I generated something that sounds like what this number should be." Practitioners who understand where the breakdowns happen can capture the genuine value and route around the landmines.
The Denominator Problem
The most consistent failure in product analytics is denominator mishandling. LLMs struggle with counting problems, and specifically with tracking which population a rate or percentage applies to. The failure shows up in two variants.
The first is straight inversion. An LLM asked to analyze Q4 purchase behavior might report "50% of your power users purchased in Q4" when the actual data shows "50% of Q4 purchasers were power users." The numbers are the same; the meaning is opposite. The model generates what sounds correct for the pattern without tracking the direction of conditioning.
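The direction of conditioning is easy to make concrete in code. Here's a minimal sketch in pandas, using toy data and hypothetical column names: the same two flags yield different percentages depending on which group sits in the denominator.

```python
import pandas as pd

# Toy data: one row per user, a segment flag and a Q4 purchase flag.
df = pd.DataFrame({
    "user_id":    [1, 2, 3, 4, 5, 6, 7, 8],
    "power_user": [True, True, True, False, False, False, False, False],
    "q4_buyer":   [True, False, False, True, True, True, False, False],
})

# P(purchased in Q4 | power user): denominator is the power users.
buy_given_power = df.loc[df["power_user"], "q4_buyer"].mean()

# P(power user | purchased in Q4): denominator is the Q4 purchasers.
power_given_buy = df.loc[df["q4_buyer"], "power_user"].mean()

print(f"{buy_given_power:.0%} of power users purchased in Q4")    # 33%
print(f"{power_given_buy:.0%} of Q4 purchasers were power users")  # 25%
```

The two statements use the same columns and sound interchangeable in prose, which is exactly why the inversion survives a casual read.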
The second variant is filtered-dataset blindness. If you hand an LLM a CSV containing only users who viewed a new checkout flow, it has no representation of the population that didn't view it. It will analyze the subset as if it's the full user base. When it reports a "45% conversion rate," the actual denominator excludes users who churned before seeing the feature, users who were shown the old flow, and users who encountered errors in routing. The model doesn't know what it can't see, and it doesn't ask.
In practice this means any metric reported by an LLM from a filtered or pre-aggregated dataset requires explicit denominator verification. The model should not be trusted to track which rows were excluded and whether those exclusions change interpretation.
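One way to make that verification concrete, sketched below with toy data (the population sizes and column names are illustrative): recompute the rate against the full population and count the rows the export never contained, instead of accepting the slice's own denominator.

```python
import pandas as pd

# Toy stand-ins for the full user base and the filtered export handed to the LLM.
all_users = pd.DataFrame({
    "user_id":   [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
    "converted": [1, 0, 1, 0, 0, 0, 1, 0, 0, 0],
})
# Only users routed to the new checkout flow made it into the export.
slice_df = all_users[all_users["user_id"].isin([1, 3, 5, 7])]

# The rate the LLM reports from the slice alone: 3/4 = 75%.
print(f"conversion in the export: {slice_df['converted'].mean():.0%}")

# The audit: how many rows the export never saw, and the rate against
# the population the decision is actually about: 3/10 = 30%.
missing = ~all_users["user_id"].isin(slice_df["user_id"])
print(f"rows excluded from the export: {missing.sum()} of {len(all_users)}")
print(f"conversions across all users: {all_users['converted'].mean():.0%}")
```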
Confident Spurious Correlations
LLMs identify statistical associations and present them as explanatory. When you show them event sequences in behavioral data, they map temporal ordering to causation: Feature X launched January 15. Churn spiked January 20. Conclusion: Feature X caused the spike.
This is the post-hoc fallacy, and research measuring it quantitatively shows roughly a 32-percentage-point accuracy drop in causal reasoning tasks when events are narrated in reverse order. The model's confidence doesn't shift with the reversal — it produces a confident wrong answer in both directions.
The deeper issue is that confounding variables are invisible. An LLM analyzing customer discount data might confidently assert "customers with high discounts churn more" without modeling that discounts were disproportionately given to customers who already had poor product fit. The model sees the correlation and reports the conclusion. It cannot see that the discount assignment was itself caused by the very thing that predicts churn.
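Stratifying by the suspected confounder is the standard check, and it's one the model won't run unprompted. The sketch below uses fabricated numbers arranged so the naive churn-by-discount gap disappears once you condition on product fit; the column names are hypothetical.

```python
import pandas as pd

# Fabricated toy data: poor-fit customers were far more likely to
# receive big discounts, and fit (not the discount) drives churn.
df = pd.DataFrame({
    "poor_fit":      [1]*8 + [0]*8,
    "high_discount": [1, 1, 1, 1, 1, 1, 0, 0] + [1, 1, 0, 0, 0, 0, 0, 0],
    "churned":       [1, 1, 1, 0, 0, 0, 1, 0] + [0]*8,
})

# Naive view (what the LLM reports): 37.5% vs 12.5% churn by discount.
print(df.groupby("high_discount")["churned"].mean())

# Stratified view: within each fit stratum the discount effect vanishes
# (50% vs 50% among poor-fit customers, 0% vs 0% among good-fit ones).
print(df.groupby(["poor_fit", "high_discount"])["churned"].mean())
```

The naive comparison triples the apparent churn risk of discounts; conditioning on fit reveals the discount as a marker of poor fit rather than a cause of churn.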
Aggregation reversals — Simpson's Paradox — compound this. A product with higher overall satisfaction than a competitor can have lower satisfaction in every individual region, with the aggregate direction driven entirely by market mix. LLMs analyzing aggregate data will report the overall trend as the finding without decomposing by subgroup, and without flagging that subgroup reversal is even possible.
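The reversal takes only a handful of rows to reproduce. In this toy sketch (fabricated counts, hypothetical region names), "ours" wins overall, 74% to 36%, while losing in both regions, purely because its responses are concentrated in the strong east market.

```python
import pandas as pd

# Fabricated satisfaction counts by product and region.
df = pd.DataFrame({
    "product":   ["ours", "ours", "rival", "rival"],
    "region":    ["east", "west", "east", "west"],
    "responses": [90, 10, 10, 90],
    "satisfied": [72, 2, 9, 27],
})

# Aggregate view: ours wins, 74% vs 36%.
agg = df.groupby("product")[["satisfied", "responses"]].sum()
print(agg["satisfied"] / agg["responses"])

# Per-region view: ours loses in BOTH regions (80% vs 90%, 20% vs 30%).
per = df.groupby(["region", "product"])[["satisfied", "responses"]].sum()
print(per["satisfied"] / per["responses"])
```

An LLM handed only the aggregate table cannot detect the reversal; the subgroup counts have to be in front of it, and the decomposition has to be requested.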
Selection Bias the Model Can't See
When you query behavioral data, the dataset you hand to an LLM is always a filtered slice of a larger population. Users who engaged with a feature. Customers who filed support tickets. Sessions that converted. The model has no representation of the users who never showed up, who abandoned silently, or who were excluded by upstream filtering logic.
This matters for retention analysis specifically. "Show me which features correlate with 90-day retention" on a dataset of users who survived 90 days is survivorship analysis, not retention analysis. The model will analyze the survivors and generate confident conclusions about what causes retention without flagging that the entire analysis population was selected by the outcome variable.
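The structural fix is to build the cohort from every user whose outcome is observable and carry retention as a column, rather than filtering to survivors before the analysis starts. A minimal sketch with fabricated data; the timestamps and column names are illustrative.

```python
import pandas as pd

# Fabricated signup log; everyone signed up long enough ago to observe the outcome.
users = pd.DataFrame({
    "user_id":        [1, 2, 3, 4, 5, 6],
    "used_feature_x": [True, True, False, False, True, False],
    "signup_at":      pd.to_datetime(["2024-01-05"] * 6),
    "last_seen_at":   pd.to_datetime([
        "2024-05-01", "2024-01-20", "2024-04-10",
        "2024-01-15", "2024-02-01", "2024-06-01",
    ]),
})
tenure_days = (users["last_seen_at"] - users["signup_at"]).dt.days

# Wrong: filter to survivors first. Retention is 100% by construction,
# so any "driver" found here just describes who happened to survive.
survivors = users[tenure_days >= 90]
print(len(survivors), "of", len(users), "users survive the filter")

# Right: keep churned users in the denominator and compare the outcome.
users["retained_90d"] = tenure_days >= 90
print(users.groupby("used_feature_x")["retained_90d"].mean())
```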
Survey and feedback data has the same problem. Users who submit NPS responses, churn surveys, or feature requests are systematically different from users who don't. An LLM analyzing submitted feedback will characterize your user base as the feedback-submitting subset. If power users submit and casual users don't, you'll build a product story around power user preferences while your broader base churns quietly.
There is no way for the model to correct for this without explicit information about what was excluded. The mitigation is to describe the dataset construction clearly in the prompt and ask the model to characterize what population the analysis applies to before drawing any conclusions. Most LLMs will comply with this constraint when it's explicit; none will enforce it on their own.
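In practice this means prepending a dataset-construction preamble to every analytical prompt. A minimal template, with wording and placeholders that are purely illustrative:

```python
# Illustrative prompt scaffold; adapt the fields to your own pipeline.
DATASET_CONTEXT = """\
You are analyzing a CSV with one row per {grain}.
It was produced by this filter: {filter_description}.
Populations NOT present in this file: {excluded_populations}.

Before drawing any conclusion, state explicitly which population each
metric applies to, and flag any claim that would require data on the
excluded populations to verify.
"""

prompt = DATASET_CONTEXT.format(
    grain="user who viewed the new checkout flow",
    filter_description="events table, step = 'checkout_view', Q4 only",
    excluded_populations="users shown the old flow; users who churned before Q4",
)
```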
The Hallucinated Metric
This failure mode is distinct from the ones above: LLMs sometimes generate specific numbers that aren't in your data at all. It's the most straightforward type of hallucination, but in analytics contexts it's particularly dangerous because the numbers are plausible.
"Conversion rate improved from 3.2% to 4.8% following the redesign" is the kind of statement that passes a casual read. If the actual conversion rate was 3.5% and the model generated the second number from a pattern that matched what post-redesign improvement "should" look like, the error won't surface in the output itself. The model has no mechanism for distinguishing "I calculated this from your data" from "I generated a number that fits the narrative."
When LLMs are used to write reports about pre-computed dashboards, this is lower-risk because the metrics were computed by deterministic tools and the LLM is interpreting them. When LLMs are asked to compute metrics directly from raw data or to describe trends they see, the risk is much higher. Treat any specific number in an LLM-generated analysis as requiring verification against the source before it moves into any decision-making context.
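Part of that verification can be mechanized. A rough sketch of the idea, with an illustrative helper name; the regex only catches percentages and the tolerance is arbitrary, so this is a tripwire rather than a guarantee.

```python
import re

def unverified_percentages(llm_text: str, computed: set, tol: float = 0.05):
    """Return percentages asserted in llm_text that match nothing we computed."""
    claimed = [float(m) for m in re.findall(r"(\d+(?:\.\d+)?)\s*%", llm_text)]
    return [c for c in claimed if not any(abs(c - v) <= tol for v in computed)]

summary = "Conversion improved from 3.2% to 4.8% following the redesign."
computed = {3.5, 4.8}  # values your own deterministic pipeline produced

print(unverified_percentages(summary, computed))  # [3.2] -> needs a human look
```

Anything the check flags goes back to the source data; anything it passes still deserves a skim, since a hallucinated number can coincidentally match a real one.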
When SQL Gets Generated
