
Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season

· 12 min read
Tian Pan
Software Engineer

The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.

Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.

This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.

Eval sets are time-stamped samples of a non-stationary distribution

The classical ML monitoring literature has taught us to think about drift in two flavors. Data drift is when the inputs change — the distribution of features the model sees in production diverges from the distribution it was trained on. Concept drift is when the relationship between inputs and outputs changes — the same input now warrants a different answer. The literature is also clear that not all drift is equal: recurring seasonal drift is expected and predictable, while genuine concept drift is the dangerous kind (Evidently AI).

What teams shipping LLM features have not internalized is that the eval set is itself a frozen sample of a moving distribution. A golden dataset, by construction, captures what production looked like on the day you stopped sampling. When you run it three months later, you are not measuring "is the model still good." You are measuring "is the model still good at the kind of work people were sending in March." If August traffic doesn't look like March traffic, the answer is irrelevant to the question your PM is actually asking, which is whether August users are getting good answers.

This problem has been measured. The LENS framework explicitly treats prompt distribution as something that evolves over time, between user groups, and across regions, and shows that natural prompt shift degrades fine-tuned LLM performance in ways that don't show up against the original eval set (arXiv 2604.17650). The literature on golden datasets is converging on the same point from a practitioner direction: a key drawback is curation overhead and risk of staleness, and the recommended remedy is a 90-day expiry on golden rows unless re-verified, with frequent small refreshes pulling fresh scenarios from production (Confident AI, Maxim).

The metaphor that survives best is from credit risk modeling, where teams have been monitoring distribution stability for decades. The Population Stability Index reports a single number for how far the current population has drifted from the reference population. PSI under 0.1 is no meaningful change; 0.1 to 0.25 is a minor shift that warrants attention; above 0.25 is a major shift that should trigger investigation and likely action (Fiddler, NannyML). The trick is to apply that same instinct to your eval set, not just to your features. The reference population is whatever your eval set was sampled from. The current population is your live traffic. When PSI between them crosses 0.25, the eval set is no longer a representative gate, and the score you read off it is reporting on a customer cohort that has moved on.
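As a minimal sketch, assuming traffic and eval rows have already been bucketed by intent, the comparison is a few lines of Python — the counts below are made up for illustration:

```python
import math

def psi(expected_counts: dict[str, int], actual_counts: dict[str, int],
        floor: float = 1e-4) -> float:
    """Population Stability Index between a reference distribution
    (e.g. the eval set's intent mix) and a current one (live traffic).

    `floor` guards against zero shares, which would blow up the log term
    when an intent appears in one population but not the other.
    """
    intents = set(expected_counts) | set(actual_counts)
    exp_total = sum(expected_counts.values()) or 1
    act_total = sum(actual_counts.values()) or 1
    score = 0.0
    for intent in intents:
        e = max(expected_counts.get(intent, 0) / exp_total, floor)
        a = max(actual_counts.get(intent, 0) / act_total, floor)
        score += (a - e) * math.log(a / e)
    return score

# eval set sampled in summer vs. traffic in late January (illustrative counts)
eval_mix    = {"paycheck_deposit": 400, "card_dispute": 250, "account_login": 150}
january_mix = {"paycheck_deposit": 180, "card_dispute": 210, "account_login": 140,
               "1099k_reporting": 270}
print(f"PSI = {psi(eval_mix, january_mix):.2f}")  # well above the 0.25 action threshold
```

Note how a single intent that exists in production but not in the eval set dominates the score — which is exactly the 1099-K situation the opening anecdote describes.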

The seasonal calendar is the engineering calendar

The first discipline this changes is how the team picks refresh dates. Most teams refresh their eval set when somebody on the team has time, which is to say, never on schedule. The refresh cadence ends up keyed to the engineering calendar — sprint planning, quarterly OKR cycles, the lull between launches. None of these have anything to do with when the user's input distribution actually moves.

The replacement is a seasonal eval refresh keyed to the product's traffic calendar. For a fintech, that means the eval set is refreshed before tax season opens, again at the April filing deadline, and again at quarterly estimated-tax due dates. For a sales tool it means the eval set is refreshed two weeks before each end-of-quarter scramble. For an education product it means an August refresh before back-to-school. The point is not the specific dates; it is that the team can name them in advance, because they map to the customer's calendar rather than to whatever the team happened to be doing that week.

The artifact that makes this concrete is a one-page document called something like "the traffic calendar." It lists the four to six dates per year when the input distribution is known to shift, the products or surfaces affected, and the eval-refresh deadline that precedes each one by two to three weeks. This document lives next to the runbook for the on-call rotation, because the refresh deadline is an on-call obligation, not a stretch goal.
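A sketch of what that page can look like when it is kept as data rather than prose — the dates, class names, and surfaces here are illustrative for a fintech, not prescriptive:

```python
from dataclasses import dataclass
from datetime import date, timedelta

@dataclass
class TrafficShift:
    name: str            # what moves in the customer's calendar
    shift_date: date     # when the input distribution is expected to change
    surfaces: list[str]  # products or assistants affected

    @property
    def refresh_deadline(self) -> date:
        # eval refresh is due two to three weeks before the shift lands
        return self.shift_date - timedelta(weeks=3)

# illustrative fintech calendar; the real one belongs next to the on-call runbook
TRAFFIC_CALENDAR = [
    TrafficShift("IRS filing season opens", date(2025, 1, 27), ["support_assistant"]),
    TrafficShift("April filing deadline",   date(2025, 4, 15), ["support_assistant"]),
    TrafficShift("Q2 estimated-tax due",    date(2025, 6, 16), ["support_assistant"]),
    TrafficShift("Q3 estimated-tax due",    date(2025, 9, 15), ["support_assistant"]),
]
```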

A traffic-distribution dashboard, not just a quality dashboard

Quality dashboards typically report one number — the eval score on the gold set — and segment it by maybe two or three slices. That tells the quality story, but it says nothing about whether the input distribution has moved. The complement that has to live next to the quality dashboard is a traffic-distribution dashboard.

The minimum useful version of this has three panels.

The first panel is a per-month input mix, sliced by intent or by topic. If your support assistant has fifteen common intents, the panel shows what fraction of last month's traffic landed in each one, and what fraction of the eval set covers each one. The visual you want is two overlapping bar charts that should look the same. When they stop looking the same, that is your signal.

The second panel is a single PSI number computed between the current month's input distribution and the eval set's input distribution. The number changes slowly day to day, and that is the point. When it crosses 0.1, it is on the dashboard for awareness. When it crosses 0.25, it is paging somebody.

The third panel is a list of the top intents or topics whose share in production has grown the most relative to their share in the eval set. This is the answer to "what specifically is the eval set under-covering," and it is the thing the team will use to decide which fresh examples to label first.
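The third panel is the simplest of the three to compute: a sort on the gap between production share and eval-set share. A sketch, reusing the same illustrative intent counts as the PSI example above:

```python
def coverage_gaps(eval_counts: dict[str, int], traffic_counts: dict[str, int],
                  top_k: int = 5) -> list[tuple[str, float, float]]:
    """Rank intents by how much their production share exceeds their eval-set share.

    Returns (intent, production_share, eval_share), largest gap first: the list
    of what the eval set is under-covering, and what to label next.
    """
    eval_total = sum(eval_counts.values()) or 1
    traffic_total = sum(traffic_counts.values()) or 1
    rows = []
    for intent in set(eval_counts) | set(traffic_counts):
        eval_share = eval_counts.get(intent, 0) / eval_total
        traffic_share = traffic_counts.get(intent, 0) / traffic_total
        rows.append((intent, traffic_share, eval_share))
    rows.sort(key=lambda r: r[1] - r[2], reverse=True)
    return rows[:top_k]

eval_mix    = {"paycheck_deposit": 400, "card_dispute": 250, "account_login": 150}
january_mix = {"paycheck_deposit": 180, "card_dispute": 210, "account_login": 140,
               "1099k_reporting": 270}
for intent, live, gold in coverage_gaps(eval_mix, january_mix):
    print(f"{intent:22s} production {live:5.1%}   eval set {gold:5.1%}")
```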

The team that has these three panels can argue about eval-refresh priority with evidence. The team that does not has to argue from intuition, and intuition is wrong about input distributions in roughly the same way it is wrong about latency tail shapes — humans systematically miss the slow-moving structural drift.

Shadow eval as a continuous distance measure

A traffic dashboard is necessary but not sufficient, because it tells you the inputs have moved without telling you whether the model is keeping up. The complement is a shadow eval — a sample of live production traffic, scored every week against whatever judging method the team trusts, and reported as a continuous distance measure from the held-out gold set.

The implementation is unglamorous. Sample one to ten percent of production requests, depending on volume — the typical production sampling rate for online evals is in that range, with the exact number tuned to traffic volume and the per-request cost of scoring (Statsig, LangChain). Run the same judge or rubric you run on the gold set. Report a weekly number alongside the gold-set score on the same dashboard. Two numbers, one holding steady and one trending down, make a much louder signal than a single number whose interpretation depends on which Monday you read it.
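A minimal sketch of the sampling and the weekly rollup, assuming some `judge_score` callable that wraps whatever judge or rubric already scores the gold set — every name here is a placeholder, not a particular library's API:

```python
import random
from collections import defaultdict
from datetime import date

SAMPLE_RATE = 0.05  # 1-10% of production requests, tuned to volume and judge cost

def maybe_shadow_eval(request_id: str, query: str, answer: str,
                      judge_score, shadow_log: list) -> None:
    """Score a small sample of live traffic with the same judge used on the gold set."""
    if random.random() >= SAMPLE_RATE:
        return
    score = judge_score(query, answer)  # placeholder: your existing judge or rubric
    shadow_log.append({"id": request_id, "week": date.today().isocalendar()[1],
                       "query": query, "score": score})

def weekly_shadow_scores(shadow_log: list) -> dict[int, float]:
    """Average shadow score per ISO week: the number plotted next to the gold-set score."""
    buckets: dict[int, list[float]] = defaultdict(list)
    for row in shadow_log:
        buckets[row["week"]].append(row["score"])
    return {week: sum(scores) / len(scores) for week, scores in buckets.items()}
```

The low-scoring rows in `shadow_log` are also the staging area described below: reviewed, labeled, and promoted into the next gold-set refresh.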

The shadow eval is also where the team finds the new failure modes. If 1099-K queries land in production in late January and the assistant is dropping them, the shadow eval surfaces that pattern before the gold-set score notices, because the gold-set score never sees a 1099-K. The shadow eval becomes the staging area for new gold-set rows: production samples scored low, reviewed by a human, labeled, and promoted into the next refresh cycle (Klu, Arize).

A small calibration note. Statistical drift on the production side is best monitored on a rolling baseline — a 7-day window against the previous 7 days, or month-over-month — rather than against a static training snapshot, which prevents alert fatigue from long-past seasonal patterns (Evidently). The eval-set comparison is different: there you do want a static reference, because the question is precisely how far the current world is from the world your gate was calibrated against. Both belong on the dashboard. They answer different questions.

A release gate that knows how old its eval set is

The last piece is the release-gate clause, and it is the place where most of the engineering judgment lives. Every team running prompt or model PRs through CI has some version of an eval-score threshold that the PR has to clear. The bug is that the threshold is treated as constant, when the appropriate threshold should be a function of how stale the eval set is relative to current traffic.

The simplest version of this is a clause that says: if the eval set was last refreshed more than N weeks ago, and the PSI between current traffic and the eval-set distribution is above a threshold, the non-regression bar tightens. The team is shipping into a world the eval set didn't sample, so a smaller measured regression is more likely to be a real one, and the gate has to compensate.

A worked example, with deliberately round numbers. Default policy: a PR may regress global eval score by no more than 2 percent and no slice may regress by more than 5 percent. Stale-eval policy, triggered when the eval set is more than 8 weeks old or PSI exceeds 0.25: a PR may regress global eval score by no more than 1 percent and no slice may regress at all. The team can ship through the stale-eval gate, but it is harder, and the harder gate is itself a forcing function — the cheapest way to get the easier gate back is to refresh the eval set.
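As a sketch, the same worked example expressed as a CI check — the eval-set age and traffic PSI would come off the dashboards above, and the thresholds are the round numbers from the example, not a recommendation:

```python
from dataclasses import dataclass

@dataclass
class GatePolicy:
    max_global_regression: float  # allowed drop in global eval score
    max_slice_regression: float   # allowed drop in any single slice

DEFAULT_POLICY = GatePolicy(max_global_regression=0.02, max_slice_regression=0.05)
STALE_POLICY   = GatePolicy(max_global_regression=0.01, max_slice_regression=0.00)

def pick_policy(eval_set_age_weeks: float, traffic_psi: float) -> GatePolicy:
    """Tighten the non-regression bar when the eval set no longer samples current traffic."""
    if eval_set_age_weeks > 8 or traffic_psi > 0.25:
        return STALE_POLICY
    return DEFAULT_POLICY

def gate_passes(baseline: dict[str, float], candidate: dict[str, float],
                policy: GatePolicy) -> bool:
    """Compare the candidate PR's scores against the current baseline.
    The 'global' key is the overall eval score; every other key is a slice."""
    for slice_name, base_score in baseline.items():
        drop = base_score - candidate.get(slice_name, 0.0)
        limit = (policy.max_global_regression if slice_name == "global"
                 else policy.max_slice_regression)
        if drop > limit:
            return False
    return True
```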

The clause does two useful things at once. It hardens the release process during the periods when the team is most likely to be flying blind, and it puts a tax on the team's failure to keep the eval set fresh, which is the failure mode the calendar is supposed to prevent in the first place (ZenML, Inference.net).

What this looks like when it works

The team that has internalized this stops talking about the eval set as a thing they built. They talk about it as a thing they keep building. The traffic calendar lives in the runbook. The traffic-distribution dashboard sits next to the quality dashboard. The shadow eval reports a weekly number that the on-call engineer reads. The release gate has a stale-eval clause that everyone has read at least once. The 1099-K query that broke the assistant in January is in the gold set by February, and when the next tax season opens, the gate is calibrated for it.

The team that has not internalized this ships PRs through a green gate, watches the dashboard tick down three points on the first Monday of tax season, opens an incident, finds nothing in the diff, closes the incident with a "must be data drift" line in the postmortem, and does the same thing again next year. The eval set sits unchanged in the repo, the customer keeps moving, and the gap between what the gate is measuring and what the user is experiencing widens by another season (Braintrust, VentureBeat).

The architectural realization is small and load-bearing. An eval set is not a benchmark. It is a time-stamped sample of a non-stationary distribution, and the team that runs an unchanged gold set across a fiscal year is calibrating a quality gate against a customer who has already left.
