3 posts tagged with "sampling-bias"

The Canary Cohort Your Rollout Hashed by ID That Clustered Power Users Into One Arm

June 3, 2026 · 10 min read

Software Engineer

A rollout team ships a new model behind a percentage flag. The flag bucket is computed as hash(user_id) % 100, the canary is buckets 0–4, the lift on per-user engagement is large and stable for two weeks, and the team ramps to 20%, then 50%, then global. The lift evaporates somewhere between 50% and global, and the post-mortem traces it back to the canary cohort. The treatment didn't move the metric. The canary arm was a different population.

The team thought it had been sampling users. It had been sampling IDs.
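
A cheap guard against this is to compare a pre-treatment metric between the canary buckets and everyone else before reading any lift. The sketch below is illustrative, not the rollout team's code: `bucket_for`, the salt string, `canary_balance`, and the synthetic `prior` population are all hypothetical names and data. It swaps the bare `hash(user_id) % 100` for a salted SHA-256 bucket, and it leans on one fact about CPython: `hash()` of a small non-negative integer is the integer itself, so on integer IDs the original expression reduces to `user_id % 100`. That is one way ID structure can leak into an arm, not necessarily what happened here.

```python
import hashlib
from statistics import mean

def bucket_for(user_id: int, salt: str = "rollout-2026-06") -> int:
    """Map a user to a bucket in [0, 100) using a salted SHA-256 digest.

    The salt is a hypothetical per-experiment string; changing it reshuffles
    which users land in the canary.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100

def canary_balance(prior_metric, assign, canary=range(0, 5)):
    """Return (canary_mean, rest_mean) of a metric measured BEFORE the rollout."""
    canary_vals, rest_vals = [], []
    for user_id, value in prior_metric.items():
        (canary_vals if assign(user_id) in canary else rest_vals).append(value)
    return mean(canary_vals), mean(rest_vals)

# Synthetic assumption: users whose ID ends in 00-04 happen to be heavy users.
prior = {uid: (50.0 if uid % 100 < 5 else 5.0) for uid in range(10_000)}

naive = canary_balance(prior, lambda uid: hash(uid) % 100)  # hash(int) == int here
salted = canary_balance(prior, bucket_for)
```

On this synthetic population the naive assignment puts every heavy user in the canary, while the salted buckets leave the two arms with similar pre-treatment means. A large gap in a pre-treatment metric is the signal to stop and re-bucket before trusting the lift.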

The Eval Set That Sampled Production Traffic at 3am EST

June 2, 2026 · 10 min read

Tian Pan

Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.
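
One alternative to a single 3am scoop is to sample from a full window of logs and give each hour a quota proportional to its share of traffic. This is a minimal sketch under stated assumptions: traces are dicts with a `ts` datetime already converted to the timezone you care about, and `stratified_by_hour` is a hypothetical helper, not anything from the team's pipeline.

```python
import random
from collections import defaultdict
from datetime import datetime

def stratified_by_hour(traces, sample_size, seed=0):
    """Sample traces so each hour-of-day appears in proportion to its traffic.

    Quotas are rounded per hour, so the result can differ from sample_size
    by a few traces when the shares do not divide evenly.
    """
    rng = random.Random(seed)
    by_hour = defaultdict(list)
    for trace in traces:
        by_hour[trace["ts"].hour].append(trace)

    total = len(traces)
    sample = []
    for hour in sorted(by_hour):
        group = by_hour[hour]
        quota = round(sample_size * len(group) / total)
        sample.extend(rng.sample(group, min(quota, len(group))))
    return sample

# Hypothetical log: 12% of traffic at 3am, 88% at 2pm.
logs = (
    [{"ts": datetime(2026, 6, 1, 3, 0)} for _ in range(120)]
    + [{"ts": datetime(2026, 6, 1, 14, 0)} for _ in range(880)]
)
eval_sample = stratified_by_hour(logs, sample_size=100)
```

To weight by revenue rather than request count, the same shape works if `len(group) / total` is replaced by each hour's share of some revenue field, assuming the traces carry one.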

Free-Tier Traffic Is Your Real Eval Set

May 14, 2026 · 10 min read

Tian Pan

Software Engineer

The team optimizing the model against paid-cohort traces is grading itself on the easy distribution. Paying users have a workflow. They self-selected into the product because something about it justified pulling out a credit card, which means by the time they're in the eval set, they've already learned which prompts work, which features deliver, and which corners not to wander into. Free-tier users do none of that. They're anonymous, exploratory, often adversarial, often non-native English speakers stress-testing a product in their second language, and they exercise the long tail of failure modes the eval set was never built to cover.

This is the asymmetry that quietly eats the conversion funnel of every freemium AI product. The team grades the model against a curated sample drawn disproportionately from paid traces. The weird free-tier traces (the ones with no template, the ones where someone is genuinely trying to figure out what the product does) never get labeled, never get a regression test, and never inform the next prompt edit. The model gets better against the paid distribution and slowly worse against the distribution that decides whether free users ever upgrade.
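
To make the skew visible, one option is to score the eval per tier and then weight those scores by the production tier mix instead of the eval set's own mix. The sketch below assumes each eval result is a dict with a `tier` label and a boolean `passed`; `reweighted_score`, the tier names, and every number in the example are hypothetical.

```python
from collections import Counter

def reweighted_score(eval_results, production_mix):
    """Average per-tier pass rates, weighted by production share of each tier."""
    passed, seen = Counter(), Counter()
    for result in eval_results:
        seen[result["tier"]] += 1
        passed[result["tier"]] += int(result["passed"])

    # A tier with zero eval traces cannot be reweighted; fail loudly instead.
    missing = set(production_mix) - set(seen)
    if missing:
        raise ValueError(f"eval set has no traces for tiers: {sorted(missing)}")

    return sum(
        share * passed[tier] / seen[tier]
        for tier, share in production_mix.items()
    )

# Hypothetical eval set: 90 paid traces (81 pass), 10 free traces (4 pass).
results = (
    [{"tier": "paid", "passed": i < 81} for i in range(90)]
    + [{"tier": "free", "passed": i < 4} for i in range(10)]
)
raw = sum(r["passed"] for r in results) / len(results)
adjusted = reweighted_score(results, {"free": 0.8, "paid": 0.2})
```

With these made-up numbers the raw pass rate is 0.85, while the score weighted to an 80% free-tier production mix comes out near 0.5. Reweighting only exposes the gap; with ten free-tier traces the per-tier estimate is still noisy, so the eval set needs more of them labeled.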
