Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship
The eval set that scored 92% last quarter is now scoring 94%, and the team is calling that progress. It isn't. The labels in that eval set were written against a rubric the annotators no longer hold in their heads. The product the model is being graded on has moved. The standards have moved. The annotators' own calibration has moved. What looks like a two-point improvement is the silent gap between a frozen artifact and a living product, and that gap widens every week the team doesn't refresh.
Annotation drift is the quiet failure mode of mature LLM eval programs. It doesn't show up as a regression — regressions are the easy case, because the number goes down and somebody investigates. It shows up as a number that stays green while the thing it's supposed to measure decays underneath it. Teams that have already built an eval set, written a rubric, and recruited annotators are the most exposed, because they trust the system they built and stop auditing the foundation.
