
Why LLMs Make Confident Mistakes When Analyzing Your Product Data

11 min read
Tian Pan
Software Engineer

Product teams have started routing analytical questions directly to LLMs: "What's causing the churn spike?" "Why did conversion drop after the redesign?" "Which cohort should we focus retention spend on?" The outputs land in executive decks, drive roadmap decisions, and get presented to investors. The models answer confidently, in polished prose, with specific numbers. And a significant fraction of those answers are wrong in ways that don't announce themselves.

This isn't a general criticism of LLMs for data work. There are tasks where they genuinely help. The problem is that the failure modes are invisible — the model doesn't hedge, doesn't caveat, and doesn't distinguish between "I computed this from your data" and "I generated something that sounds like what this number should be." Practitioners who understand where the breakdowns happen can capture the genuine value and route around the landmines.

The Denominator Problem

The most consistent failure in product analytics is wrong denominator handling. LLMs struggle with counting problems — specifically with understanding which population a rate or percentage applies to. This shows up in two variants.

The first is straight inversion. An LLM asked to analyze Q4 purchase behavior might report "50% of your power users purchased in Q4" when the actual data shows "50% of Q4 purchasers were power users." The numbers are the same; the meaning is opposite. The model generates what sounds correct for the pattern without tracking the direction of conditioning.

The second variant is filtered-dataset blindness. If you hand an LLM a CSV containing only users who viewed a new checkout flow, it has no representation of the population that didn't view it. It will analyze the subset as if it's the full user base. When it reports a "45% conversion rate," the actual denominator excludes users who churned before seeing the feature, users who were shown the old flow, and users who encountered errors in routing. The model doesn't know what it can't see, and it doesn't ask.
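The inversion is easy to see with a toy dataset. The sketch below (entirely fabricated rows) computes both conditional rates; the two numbers share no arithmetic relationship, so swapping them silently changes the claim.

```python
# Fabricated rows: (is_power_user, purchased_in_q4)
users = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, False), (False, False), (False, False),
]

purchasers = [u for u in users if u[1]]
power_users = [u for u in users if u[0]]

# "50% of Q4 purchasers were power users" — denominator: purchasers
share_of_purchasers_who_are_power = sum(1 for u in purchasers if u[0]) / len(purchasers)

# "X% of power users purchased in Q4" — denominator: power users
share_of_power_who_purchased = sum(1 for u in power_users if u[1]) / len(power_users)

print(share_of_purchasers_who_are_power)  # 0.5
print(share_of_power_who_purchased)       # 0.25
```

Same raw counts, opposite business story: "half our purchasers are power users" vs. "only a quarter of power users bought anything."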

In practice this means any metric reported by an LLM from a filtered or pre-aggregated dataset requires explicit denominator verification. The model should not be trusted to track which rows were excluded and whether those exclusions change interpretation.

Confident Spurious Correlations

LLMs identify statistical associations and present them as explanatory. When you show them event sequences in behavioral data, they map temporal ordering to causation: Feature X launched January 15. Churn spiked January 20. Conclusion: Feature X caused the spike.

This is the post-hoc fallacy, and research measuring it quantitatively shows roughly a 32-percentage-point accuracy drop in causal reasoning tasks when events are narrated in reverse order. The model's confidence doesn't shift with the reversal — it produces a confident wrong answer in both directions.

The deeper issue is that confounding variables are invisible. An LLM analyzing customer discount data might confidently assert "customers with high discounts churn more" without modeling that discounts were disproportionately given to customers who already had poor product fit. The model sees the correlation and reports the conclusion. It cannot see that the discount assignment was itself caused by the very thing that predicts churn.

Aggregation reversals — Simpson's Paradox — compound this. A product with higher overall satisfaction than a competitor can have lower satisfaction in every individual region, with the aggregate direction driven entirely by market mix. LLMs analyzing aggregate data will report the overall trend as the finding without decomposing by subgroup, and without flagging that subgroup reversal is even possible.
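A minimal fabricated example makes the reversal concrete: product B wins inside every region while A wins the aggregate, purely because of where each product's survey volume concentrates.

```python
# Fabricated satisfaction counts (satisfied, surveyed), chosen to exhibit
# the reversal — not real product data.
regions = {
    "east": {"A": (85, 100), "B": (9, 10)},
    "west": {"A": (10, 20),  "B": (100, 180)},
}

def rate(satisfied, surveyed):
    return satisfied / surveyed

def overall(product):
    satisfied = sum(r[product][0] for r in regions.values())
    surveyed = sum(r[product][1] for r in regions.values())
    return satisfied / surveyed

# B beats A inside every individual region...
for name, r in regions.items():
    print(name, f"A={rate(*r['A']):.2f}", f"B={rate(*r['B']):.2f}")

# ...yet A beats B in the aggregate, driven entirely by market mix.
print(f"overall A={overall('A'):.2f} B={overall('B'):.2f}")
```

An LLM handed only the aggregate row would report A as the winner and never flag that the subgroup direction could be opposite.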

Selection Bias the Model Can't See

When you query behavioral data, the dataset you hand to an LLM is always a filtered slice of a larger population. Users who engaged with a feature. Customers who filed support tickets. Sessions that converted. The model has no representation of the users who never showed up, who abandoned silently, or who were excluded by upstream filtering logic.

This matters for retention analysis specifically. "Show me which features correlate with 90-day retention" on a dataset of users who survived 90 days is survivorship analysis, not retention analysis. The model will analyze the survivors and generate confident conclusions about what causes retention without flagging that the entire analysis population was selected by the outcome variable.

Survey and feedback data has the same problem. Users who submit NPS responses, churn surveys, or feature requests are systematically different from users who don't. An LLM analyzing submitted feedback will characterize your user base as the feedback-submitting subset. If power users submit and casual users don't, you'll build a product story around power user preferences while your broader base churns quietly.

There is no way for the model to correct for this without explicit information about what was excluded. The mitigation is to describe the dataset construction clearly in the prompt and ask the model to characterize what population the analysis applies to before drawing any conclusions. Most LLMs will comply with this constraint when it's explicit; none will enforce it on their own.

The Hallucinated Metric

This failure mode is distinct from the ones above: LLMs sometimes generate specific numbers that aren't in your data at all. It's the most straightforward hallucination type, but in analytics contexts it's particularly dangerous because the numbers are plausible.

"Conversion rate improved from 3.2% to 4.8% following the redesign" is the kind of statement that passes a casual read. If the actual conversion rate was 3.5% and the model generated the second number from a pattern that matched what post-redesign improvement "should" look like, the error won't surface in the output itself. The model has no mechanism for distinguishing "I calculated this from your data" from "I generated a number that fits the narrative."

When LLMs are used to write reports about pre-computed dashboards, this is lower-risk because the metrics were computed by deterministic tools and the LLM is interpreting them. When LLMs are asked to compute metrics directly from raw data or to describe trends they see, the risk is much higher. Treat any specific number in an LLM-generated analysis as requiring verification against the source before it moves into any decision-making context.
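One pragmatic version of that verification is mechanical: extract every number the narrative claims and compare it against the metrics you actually computed. A hedged sketch (the metric set and narrative are illustrative):

```python
import re

# The real values, produced by deterministic tooling — the ground truth.
source_metrics = {3.5, 3.8}

# An LLM-written narrative to audit.
narrative = "Conversion rate improved from 3.2% to 4.8% following the redesign."

# Pull out every percentage the narrative asserts and flag any that
# don't match a metric we actually computed.
claimed = {float(m) for m in re.findall(r"(\d+(?:\.\d+)?)%", narrative)}
unsupported = claimed - source_metrics

if unsupported:
    print(f"unverified numbers: {sorted(unsupported)}")  # [3.2, 4.8]
```

This doesn't prove a narrative correct — a matched number can still be attached to the wrong claim — but it catches the cleanest hallucination type before it reaches a deck.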

When SQL Gets Generated

LLMs asked to write analytics queries produce syntactically valid SQL with incorrect semantics more often than practitioners expect. The failures are specific and recurring.

The wrong-denominator problem appears in GROUP BY logic. A request for "churn rate by acquisition cohort" can produce a query that uses the total user count as the denominator rather than the cohort-specific count, generating percentages that look reasonable but are computed over the wrong population. The query runs, returns rows, and nothing signals that the logic is wrong.

Aggregation mistakes — applying MAX where SUM is needed, or COUNT when COUNT DISTINCT is required — are a second category. Without explicit semantics about what your event table's rows represent (one row per event vs. one row per session vs. one row per user), the model generates plausible-looking aggregations that may not match your data model.

The most reliable mitigation is a semantic layer: explicit documentation of what each field means, how each table is keyed, and what filtering assumptions apply. When teams provide schema alone (table names, column names, types), text-to-SQL accuracy improves modestly over pure guessing. When teams provide explicit business logic — "each row is one purchase event, users can have multiple rows, user_id is the join key to the users table" — accuracy improves substantially. The model needs the business meaning, not just the data shape.
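The wrong-vs-right denominator is easy to demonstrate against an in-memory SQLite table (fabricated schema and rows: one row per user, with an acquisition cohort and a churned flag). Both queries run and return plausible-looking percentages; only one is correct.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE users (user_id INTEGER, cohort TEXT, churned INTEGER);
INSERT INTO users VALUES
  (1,'2024-01',1),(2,'2024-01',0),(3,'2024-01',0),(4,'2024-01',0),
  (5,'2024-02',1),(6,'2024-02',1);
""")

# Wrong: denominator is the TOTAL user count. The query runs, returns
# rows, and nothing signals that the population is wrong.
wrong = dict(con.execute("""
    SELECT cohort,
           1.0 * SUM(churned) / (SELECT COUNT(*) FROM users) AS churn_rate
    FROM users GROUP BY cohort
""").fetchall())

# Right: denominator is the cohort-specific count.
right = dict(con.execute("""
    SELECT cohort, 1.0 * SUM(churned) / COUNT(*) AS churn_rate
    FROM users GROUP BY cohort
""").fetchall())

print(wrong)  # 2024-02 looks like ~33% churn
print(right)  # 2024-02 actually churned at 100%
```

The 2024-02 cohort churned completely, but the total-count denominator reports a reassuring one-third — exactly the kind of number that survives a casual read.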

What LLMs Actually Do Well: Hypothesis Generation

The failure modes above share a common pattern: the LLM is asked to produce a conclusion that requires statistical rigor it doesn't have. The safe inversion of this is using LLMs for hypothesis generation, where the output is candidate explanations that statistical tools then validate or falsify.

"Here's a cohort with unexpectedly high 30-day churn. Generate ten hypotheses about what might explain it" is a task LLMs perform well. The output is a set of candidates, not a conclusion. The candidates go through holdout validation, A/B testing, or proper causal analysis before any of them become findings. The LLM is doing ideation, not inference.

This framing captures the genuine value. LLMs have broad pattern recognition from training on large volumes of product, engineering, and business content. They can surface hypotheses that a time-pressed analyst might not have considered. They can connect a behavioral pattern to a mechanism that exists in the literature. They can suggest which control variables to include in a regression model.

None of that requires the model to be statistically correct. It requires the model to be creative and to generate enough plausible candidates that the right ones are in the list. Hypothesis recall matters more than precision when the cost of a false negative is missing a real explanation. Statistical validation handles precision downstream.

The workflow that works:

  1. Give the LLM a well-described anomaly with the population explicitly specified
  2. Ask for multiple competing hypotheses, not a single conclusion
  3. Rank hypotheses by testability and expected effect size
  4. Run statistical validation on the top candidates using proper experimental design
  5. Report findings that survived validation, not LLM-generated interpretations
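Step 4 can be as lightweight as a pooled two-proportion z-test on one candidate hypothesis ("users who hit the checkout bug churn more"). A stdlib-only sketch with fabricated counts:

```python
import math

def two_proportion_ztest(success_a, n_a, success_b, n_b):
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail
    return z, p_value

# Churned counts among users who hit the bug vs. those who didn't
# (fabricated: 120/400 vs. 80/400).
z, p = two_proportion_ztest(120, 400, 80, 400)
print(f"z={z:.2f}, p={p:.4f}")
```

The LLM proposed the hypothesis; the test decides whether it survives. A proper rollout would use an experiment rather than observational counts, but the division of labor is the point: ideation from the model, inference from statistics.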

Grounding Patterns That Reduce Risk

For use cases where LLMs are generating narrative around pre-validated metrics, several practices reduce the hallucination rate.

Provide the numbers explicitly. Paste the actual metric values into the prompt. "Conversion was 3.5% in Q3 and 3.8% in Q4. Write an interpretation of this trend for the weekly report." The model is now narrating a fact, not computing one. Hallucinations still occur — the model might manufacture a cause for the trend — but they're more constrained when it isn't also responsible for the numbers themselves.

Require citations back to the provided data. "Your interpretation should reference only the data provided and identify if any claim you're making isn't supported by the numbers given." This doesn't eliminate hallucination but creates an output structure where unsupported claims are more visible.

Specify the denominator explicitly. "This dataset contains 12,400 users who completed onboarding in Q3. The total onboarding starts in Q3 were 38,200. Analyze retention in this cohort with awareness that it represents 32% of onboarding starts." This gives the model the information it needs to avoid mischaracterizing the population.

Validate any output metric before it moves downstream. Any specific number in an LLM-generated analysis that will be used for a decision should be traced back to the source before it leaves the team. This is not a prompt engineering fix — it's an operational constraint.

The Organizational Risk

The dangerous property of LLM-generated analytics is the presentational quality of the output. A wrong conclusion wrapped in confident, well-organized prose is harder to challenge than a wrong conclusion presented tentatively in a spreadsheet comment. The polish creates authority. Teams that treat the LLM as an analyst — routing questions to it and presenting its outputs in decision contexts — are betting that the model's failure modes won't hit them.

That bet goes fine until a roadmap decision gets made based on a hallucinated conversion lift, or a retention investment gets directed at a cohort that was an artifact of survivorship selection. Post-hoc discovery that the analysis was wrong is costly in proportion to the decisions it influenced.

The framing that works is treating LLMs as junior analysts who are fast, broad in pattern knowledge, and unreliable on statistical rigor. You review their outputs before they go into a deck. You verify the numbers. You validate causal claims through experiments, not by asking the model again with more confidence. The speed benefit remains. The risk gets managed.

The specific heuristic: if the question is "generate hypotheses about what might explain X," LLMs are useful. If the question is "tell me definitively what is causing X," statistical tools own that answer and LLMs review the narrative after the analysis is done.
