2 posts tagged with "sampling-bias"

The Eval Set That Sampled Production Traffic at 3am EST

· 10 min read
Tian Pan
Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.
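One way to keep warehouse quietness from choosing the sample is to stratify by arrival time, so each hour's share of the eval set matches its share of traffic. The following is a minimal sketch, not the team's pipeline: it assumes each trace is a dict with a hypothetical `hour` field, and `stratified_sample` is an illustrative name.

```python
import random
from collections import defaultdict


def stratified_sample(traces, n, key, seed=0):
    """Draw roughly n traces so each stratum's share of the sample
    matches its share of `traces` (proportional allocation)."""
    rng = random.Random(seed)

    # Group traces by stratum, e.g. hour of day or customer segment.
    strata = defaultdict(list)
    for trace in traces:
        strata[key(trace)].append(trace)

    total = len(traces)
    sample = []
    for stratum in sorted(strata):
        members = strata[stratum]
        # Quotas are rounded, so the final size can differ slightly from n.
        quota = round(n * len(members) / total)
        sample.extend(rng.sample(members, min(quota, len(members))))
    return sample


# Hypothetical usage: traces carry the hour of day they arrived.
# eval_set = stratified_sample(week_of_traces, 5000, key=lambda t: t["hour"])
```

The same function works with any key, so stratifying by customer segment or revenue tier instead of hour is a one-line change. It only helps if the input covers the full week of traffic rather than a single window.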

Free-Tier Traffic Is Your Real Eval Set

· 10 min read
Tian Pan
Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The weird free-tier traces (the ones with no template, the ones where someone is genuinely trying to figure out what the product does) never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
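To see how much a paid-heavy sample flatters the headline number, the pass rate can be recomputed per cohort and recombined using the traffic mix the team actually cares about. This is a rough sketch under stated assumptions: the cohort labels, the `(cohort, passed)` result shape, and the name `reweighted_pass_rate` are all hypothetical.

```python
from collections import defaultdict


def reweighted_pass_rate(results, target_mix):
    """Combine per-cohort pass rates using `target_mix` weights
    instead of the cohort proportions found in the eval sample.

    results:    iterable of (cohort, passed) pairs
    target_mix: dict mapping cohort -> desired weight, summing to 1
    """
    if abs(sum(target_mix.values()) - 1.0) > 1e-9:
        raise ValueError("target_mix weights must sum to 1")

    totals = defaultdict(int)
    passes = defaultdict(int)
    for cohort, passed in results:
        totals[cohort] += 1
        passes[cohort] += int(passed)

    score = 0.0
    for cohort, weight in target_mix.items():
        # A cohort with no labeled traces cannot be scored at all,
        # which is itself the problem described above.
        if totals[cohort] == 0:
            raise ValueError(f"no labeled traces for cohort {cohort!r}")
        score += weight * passes[cohort] / totals[cohort]
    return score
```

Raising on an empty cohort is deliberate: if free-tier traces were never labeled, the honest output is an error rather than a score computed from paid traces alone.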