
The Eval Set That Sampled Production Traffic at 3am EST

Tian Pan
Software Engineer

A team I worked with had an eval set that quietly drifted into being a survey of their batch automation. The sampling cron ran at 3am Eastern, scooped 5,000 traces out of the production log table, and dropped them into the eval corpus. The leaderboard was clean. The new prompt won by four points. They shipped it. Within a day, the support queue filled with a kind of complaint they had never seen during regression testing — pricing questions that the model now hedged on, in a customer segment whose entire workday started after the eval window closed.

The eval was not wrong about what it measured. It was wrong about who it measured. At 3am EST, the production fleet was dominated by overnight batch retries, scheduled report generation, and a handful of APAC daytime sessions that mostly asked navigational questions. The new prompt was genuinely better on that slice. The slice was twelve percent of weekly traffic and zero percent of revenue-weighted traffic. Nobody had asked the question "what shape of user is in this dataset" because the dataset was constructed by a cron job that ran when the warehouse was quietest, and quietness was the only sampling criterion anyone had thought to optimize for.

This failure mode is not exotic. It is the default outcome of constructing an eval from logs without explicitly stating which population the eval is meant to represent. The cron is convenient. The off-peak window is convenient. The resulting dataset looks reasonable in aggregate. And the temporal slice it captures is silently encoding choices — about geography, about user intent, about workload shape — that nobody made on purpose.

The Sampling Window Is a Population Filter

Production traffic is not stationary. It varies on at least four axes that all matter for evaluation: geography, customer segment, intent mix, and human-versus-automation. A sampling cron that runs at a fixed clock time picks a single point along each of those axes and freezes it.

At 3am EST you are sampling a population that is heavily APAC daytime, heavily overnight batch, heavily automated retries, and almost entirely missing North American business-hours traffic. That cocktail has a specific shape. Navigational queries dominate over advisory ones. Token counts skew short. Latency tolerance is high because nobody is watching. Refusal sensitivity barely matters because the requests are routine and the requesters are scripts.

If your real users are mostly North American knowledge workers asking advisory questions at 2pm local time, the 3am EST sample does not represent them on any axis that matters. The eval will tell you the truth about a population that is not your customer.

This is the classic exposure-and-sampling bias problem that recommender-system evaluation has chewed on for years. Recent work formalizes it for offline evaluation: the choice of how to sample logged interactions interacts with the underlying exposure bias to produce eval results that may not predict online behavior at all. LLM eval pipelines have inherited the failure mode without inheriting the literature.
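One way to stop the clock from choosing the population is to stratify the sample by time bucket rather than take whatever a fixed window holds. The following is a minimal sketch, not the team's actual pipeline: it assumes each trace is a dict with a timezone-aware `ts` field, and that the input pool spans at least a full week of traffic. The function name and schema are hypothetical.

```python
import random
from collections import defaultdict
from datetime import datetime, timezone


def stratified_sample_by_hour(traces, n, seed=0):
    """Draw roughly n traces so each UTC hour-of-day appears in
    proportion to its share of the pool.

    Assumes `traces` is a list of dicts with a timezone-aware `ts`
    datetime (hypothetical schema).
    """
    rng = random.Random(seed)

    # Group traces by the UTC hour in which they occurred.
    by_hour = defaultdict(list)
    for trace in traces:
        hour = trace["ts"].astimezone(timezone.utc).hour
        by_hour[hour].append(trace)

    total = len(traces)
    sample = []
    for hour in sorted(by_hour):
        bucket = by_hour[hour]
        # Proportional allocation; rounding means the result can be
        # a few traces above or below n.
        quota = min(len(bucket), round(n * len(bucket) / total))
        sample.extend(rng.sample(bucket, quota))
    return sample
```

Hour-of-day is only one of the axes above. The same shape works with a composite stratum key such as (hour, region, human-or-automation), provided those labels exist on the trace.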

Why Off-Peak Sampling Feels Safe

The 3am cron is not a mistake of analysis. It is the result of three reasonable engineering instincts converging on a wrong answer.

First, warehouse load. Sampling production logs hits the same tables that customer-facing analytics queries hit, and the platform team would rather not contend with the morning dashboard load. So the sampling job moves to whenever the warehouse is quietest.

Second, log completeness. Some teams worry that mid-day samples will miss late-arriving log records from out-of-region writes. A late-night cron gives the ingestion pipeline a buffer. This is a real concern, but the fix should be a watermark, not a clock time chosen to avoid it.

Third, the eval is a side project. The first version of the offline eval is almost always built by one engineer in a sprint, and "run nightly" is the path of least resistance. Nobody designs the v1 eval pipeline as a sampling experiment because at v1 the pipeline does not exist yet.

The trouble is that v1 calcifies. The leaderboard gets wired to dashboards. Release decisions start citing it. By the time anyone audits where the data came from, the sampling cron is load-bearing for production deploys, and nobody is going to volunteer to change the methodology because every historical comparison would have to be re-baselined.
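The watermark fix for the log-completeness concern can be sketched roughly as follows. The idea is that the job may run at any clock time, but it only samples events older than an allowed-lateness buffer, so late-arriving records have had time to land. The six-hour lateness, the seven-day span, the `event_ts` field, and both function names are assumptions for illustration; the right lateness depends on the ingestion pipeline.

```python
from datetime import datetime, timedelta, timezone


def sampling_window(now, allowed_lateness=timedelta(hours=6),
                    span=timedelta(days=7)):
    """Return (start, end) of the window considered complete.

    `end` is the watermark: events after it may still be arriving,
    so they are excluded no matter when the job runs.
    """
    watermark = now - allowed_lateness
    return watermark - span, watermark


def eligible_traces(traces, now, **window_kwargs):
    """Keep traces whose event time falls in [start, end).

    Assumes each trace has a timezone-aware `event_ts`
    (hypothetical field name).
    """
    start, end = sampling_window(now, **window_kwargs)
    return [t for t in traces if start <= t["event_ts"] < end]
```

Because the window is defined relative to event time rather than job start time, moving the job to a quiet warehouse hour no longer changes which traffic is eligible for sampling.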

What the Time-of-Day Slice Actually Encodes

When the eval is silent about its population, you have to reconstruct it by reading what the dataset over-represents. The 3am EST window typically over-represents:

  • Automated retries from earlier outages. A flaky tool call at 9pm EST cascades into retry storms at midnight, then the long tail bleeds into your sample. The eval set ends up rewarding models that are good at handling a specific failure-recovery shape that has nothing to do with normal usage.
  • APAC daytime navigational traffic. Asia-Pacific users hitting the product during their business hours. The intent distribution skews toward "find a thing" rather than "advise me." Models that hedge well on advisory questions lose points; models that retrieve well on navigational ones win.
  • Internal employee usage. Engineers in California testing fixes at 11pm Pacific. The query shape is technical, the patience for slow responses is high, and the queries are not representative of any external user.
  • Scheduled jobs and batch consumers. Cron-driven pipelines hitting your API on the hour. These have stable, scriptable query shapes and they punish any model behavior that varies stochastically.

None of those are bad to evaluate on. They are bad to evaluate on as a proxy for the whole product. A 3am sample treats the long tail as the trunk.
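Reading what a dataset over-represents can be made mechanical by comparing segment shares in the eval set against shares in a reference window of production traffic. This is a small sketch under the assumption that every trace already carries a segment label (for example `batch`, `apac_nav`, `internal`, `na_business`); the labels and the function name are hypothetical.

```python
from collections import Counter


def representation_ratio(sample_labels, population_labels):
    """Return {segment: sample_share / population_share}.

    A ratio above 1 means the eval set over-represents the segment;
    below 1 means it under-represents it; 0 means the segment is
    missing from the eval set entirely. Only segments present in the
    population are reported.
    """
    sample_counts = Counter(sample_labels)
    population_counts = Counter(population_labels)
    n_sample = len(sample_labels)
    n_population = len(population_labels)

    ratios = {}
    for segment, count in population_counts.items():
        sample_share = sample_counts[segment] / n_sample
        population_share = count / n_population
        ratios[segment] = sample_share / population_share
    return ratios
```

Publishing this table alongside the leaderboard states the population the eval actually covers, and a segment with a ratio near zero is one the leaderboard says nothing about.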

The Leaderboard Story That Hides the Population
