The Golden Dataset Decay Problem: When Your Eval Set Becomes a Liability
Most teams treat their golden eval set like a constitution — permanent, authoritative, and expensive to touch. They spend weeks curating examples, getting domain experts to label them, and wiring them into CI. Then they move on.
Six months later, the eval suite reports 87% pass rate while users are complaining about broken outputs. The evals haven't regressed — they've decayed. The dataset still measures what mattered in October. It just no longer measures what matters now.
This is the golden dataset decay problem, and it's more common than most teams admit.
