Eval Set Decay: Why Your Benchmark Becomes Misleading Six Months After You Build It
You spend three weeks curating a high-quality eval set. You write test cases that cover the edge cases your product manager worries about, sample real queries from beta users, and get a clean accuracy number that the team aligns on. Six months later, that number is still in the weekly dashboard. You just shipped a model update that looked great on evals. Users are filing tickets.
The problem isn't that the model regressed. The problem is that your eval set stopped representing reality months ago, and nobody noticed.
This failure mode has a name: eval set decay. It affects most production AI teams, and it usually goes uncaught until the damage shows up in user behavior.
