Free-Tier Traffic Is Your Real Eval Set
The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous and exploratory, often adversarial, often non-native English speakers stress-testing the product in a second language, and they exercise the long tail of failure modes the eval set was never built to cover.
This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The weird free-tier traces — the ones with no template, the ones where someone is genuinely trying to figure out what the product does — never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
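One way to see the skew before it costs anything is to compare the tier mix of the eval set against the tier mix of live traffic. A minimal sketch, assuming each trace and eval case is a dict carrying a `tier` field; the field name is illustrative, not tied to any particular tracing tool:

```python
from collections import Counter

def tier_skew(eval_cases, traffic_sample):
    """Difference between the eval set's tier mix and live traffic's tier mix.

    Hypothetical sketch: assumes each record is a dict with a 'tier'
    field ('free' or 'paid').
    """
    def shares(records):
        counts = Counter(r["tier"] for r in records)
        total = sum(counts.values())
        return {tier: n / total for tier, n in counts.items()}

    eval_mix = shares(eval_cases)
    traffic_mix = shares(traffic_sample)
    return {
        tier: eval_mix.get(tier, 0.0) - traffic_mix.get(tier, 0.0)
        for tier in set(eval_mix) | set(traffic_mix)
    }

# A strongly negative value for 'free' means the eval set under-represents
# the cohort that decides whether free users ever upgrade.
```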
The conversion data backs this up uncomfortably. Roughly two-thirds of freemium upgrades come from users hitting a usage limit on something they were already doing successfully — not from users discovering a premium feature. The pre-upgrade experience is the free-tier experience. If the model feels broken in that experience, the funnel doesn't recover by sending them a discount email.
The Paid-Cohort Sample Is a Sanitized Highlight Reel
Paying users behave differently in ways that are obvious once you look. They've already burned a free trial figuring out the prompt syntax. They've already discovered which questions get good answers and which ones get confused responses. They have a workflow built around the product — a Notion doc with prompt templates, a Slack reminder to re-run the same daily query, a habit of breaking long requests into smaller ones. The traces they generate are well-formed, narrowly scoped, and concentrated in a handful of intents.
Free-tier users have none of this scaffolding. They arrived from a tweet, a side-by-side comparison, a coworker's recommendation, a moment of curiosity. They type the way they'd type into a search bar, text a friend, or poke at a model they're trying to break. They speak English as a second language at a rate the paid cohort doesn't, because the paid cohort over-indexes toward markets where the product has marketing presence and the language matches. They ask questions the product doesn't support and don't know they shouldn't. They jailbreak it for fun.
These two cohorts produce qualitatively different traces. Aggregate accuracy on a paid-weighted sample tells you the model handles its happy path. It tells you nothing about whether the cohort that hasn't yet decided to pay is getting the experience that earns the upgrade.
Why the Eval Set Drifts Toward the Paying User by Default
Eval sets don't get sampled randomly — they get sampled where the team is looking, and the team looks where the support tickets come from. Paid users complain through formal channels: a Zendesk ticket, an account manager, a Slack escalation from someone with a procurement contact. Free users complain by bouncing. A free user who got a bad answer doesn't write a ticket — they close the tab and remember the product as "not very good." The signal disappears before anyone on the AI team can label it.
The labeling budget compounds this. Domain experts are expensive, eval cases are time-consuming to write, and the natural impulse is to spend that budget on the cases that came in through ticketed channels. Six months in, the eval set has hundreds of carefully curated examples from paid traces and a handful of synthetic adversarial cases someone added during a red-team week. The weird free-tier traces — the ones that would actually surface the regressions that lose conversions — never made it into the dataset because the labeling budget was already spoken for.
There's also a quieter dynamic where free-tier traces are seen as "noise" by reviewers. A junior PM scanning failed cases sees an obviously malformed query, marks it as "user error," and moves on. But malformed queries are the product's job to handle gracefully, especially in the cohort that hasn't yet learned what well-formed looks like. The model that bails on these queries is the model that fails the upgrade test.
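Those discarded traces convert directly into eval cases. A hypothetical sketch of what one could look like, with pass criteria that grade graceful handling rather than a single right answer; all field names and criteria here are illustrative:

```python
# A regression case built from a trace a reviewer would have marked "user error".
malformed_query_case = {
    "id": "free-tier-malformed-001",
    "tier": "free",
    "input": "how do i make it do the thing with the csv but not all the columns??",
    "pass_criteria": [
        "asks a clarifying question or states what it can do with a CSV",
        "does not refuse outright or return a raw error",
        "does not invent a feature the product lacks",
    ],
}
```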
The Discipline: Stratified Sampling With a Deliberate Free-Tier Over-Weight
The fix is not to sample randomly across all traffic. Random sampling would give you a representative distribution, which is almost as bad — it lets the paid happy path drown out the long tail just by volume. The discipline is stratified sampling with explicit over-weighting of the free-tier slice.
A workable shape, with a code sketch after the list:

- Stratify by tier before sampling anything, so the paid happy path can't crowd out the free slice on volume alone.
- Over-weight the free-tier stratum well beyond its current share of labeled traffic. If the eval set is overwhelmingly paid today, even a 50/50 split is a correction, not an exaggeration.
- Within the free tier, sub-sample the slices this post keeps pointing at: non-English queries, malformed queries, first-session exploratory traffic, and adversarial attempts.
- Reserve labeling budget per stratum up front, so ticketed paid complaints can't quietly absorb the whole allocation.
- Retire "user error" as a terminal label. A malformed query that gets a useless response is a failure the model owns, not one the user does.
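The sampler itself is small. A minimal sketch, assuming the same illustrative `tier` field as above; the 60/40 split is a policy knob, not a recommendation:

```python
import random

# Illustrative quotas: deliberately over-weight the free tier relative to
# its share of labeled traffic. The exact split is a policy choice.
DEFAULT_QUOTAS = {"free": 0.6, "paid": 0.4}

def stratified_sample(traces, n, quotas=None):
    """Draw a labeling batch of roughly n traces, stratified by tier.

    Assumes each trace is a dict with a 'tier' key, as in the sketches
    above; swap in whatever your tracing setup actually records.
    """
    quotas = quotas or DEFAULT_QUOTAS
    by_tier = {}
    for trace in traces:
        by_tier.setdefault(trace["tier"], []).append(trace)

    batch = []
    for tier, share in quotas.items():
        pool = by_tier.get(tier, [])
        k = min(len(pool), round(n * share))
        batch.extend(random.sample(pool, k))
    return batch
```

The `min()` matters: early on, the free-tier pool may be smaller than its quota, and the honest move is to label everything in it rather than quietly backfill from paid traces.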
