Eval Sets Have Seasons: Why Quality Drops on the First Monday of Tax Season
The dashboard fired its first regression alert on a Monday morning in late January. Quality score on the support assistant dropped three points overnight. No prompt change shipped over the weekend. No model swap. The eval suite — a hand-curated 800-row gold set that the team had built six months earlier — was unchanged. Somebody opened an incident.
Two days of bisecting later, the answer was uninteresting and structural. It was the first business Monday after the IRS opened tax filing for the year. Half the inbound queries had shifted from "where is my paycheck deposit" to "how do I report a 1099-K from a payment app." The eval set, sampled in summer, had nothing to say about a 1099-K. The model wasn't worse. The customer was different. The gate was calibrated against a customer who no longer existed.
This pattern repeats every quarter in every product that has a seasonal user — fintech in tax season, sales tools at end-of-quarter, education at back-to-school, e-commerce in returns season, travel at booking season, healthcare at enrollment season. The eval-set-as-fixed-asset is a comfortable abstraction, and it is wrong on a calendar that nobody updates.
