Annotation Drift: How Your Eval Set Stops Measuring the Product You Ship
The eval set that scored 92% last quarter is now scoring 94%, and the team is calling that progress. It isn't. The labels in that eval set were written against a rubric the annotators no longer hold in their heads. The product the model is being graded on has moved. The standards have moved. The annotators' own calibration has moved. What looks like a two-point improvement is the silent gap between a frozen artifact and a living product, and that gap widens every week the team doesn't refresh.
Annotation drift is the quiet failure mode of mature LLM eval programs. It doesn't show up as a regression — regressions are the easy case, because the number goes down and somebody investigates. It shows up as a number that stays green while the thing it's supposed to measure decays underneath it. Teams that have already built an eval set, written a rubric, and recruited annotators are the most exposed, because they trust the system they built and stop auditing the foundation.
The Rubric in Your Head Is Not the Rubric on the Page
A rubric is a written artifact. The rubric the annotator actually applies is a fuzzy posterior over examples they've seen, conversations they've had with the lead, edge cases they've adjudicated, and tacit preferences they've absorbed about what the product is supposed to do. Those two rubrics start out aligned and then diverge — not because anyone is doing anything wrong, but because every week the annotator sees a new failure mode that the written rubric didn't anticipate, and they make a judgment call about it. The next week, they make a slightly different judgment call about a similar case, because their memory of the first decision has decayed. By month three, two annotators looking at the same case will disagree, and neither of them can tell you why with reference to the written rubric — they'll explain their decision in terms of cases they remember, not rules they're following.
Practitioners call this "rubric drift," and it shows up most clearly in inter-rater reliability metrics. Cohen's kappa and Krippendorff's alpha are the standard tools: kappa for a pair of annotators, alpha for larger pools or incomplete label sets, both measuring how often annotators agree after accounting for chance. Production-grade datasets target agreement above 0.7; below that you don't have a rubric, you have two different rubrics with the same name. The number to watch isn't the absolute agreement; it's the slope. A program where agreement was 0.82 in January and 0.71 in April hasn't gotten worse at annotation. The annotators have drifted apart, and nobody noticed because nobody re-measured.
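As a concrete version of that check, here is Cohen's kappa computed from scratch on two hypothetical annotators' labels; sklearn.metrics.cohen_kappa_score gives the same statistic, and Krippendorff's alpha is the move once the pool grows past two.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same cases."""
    n = len(labels_a)
    # Observed agreement: fraction of cases where both annotators chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement: how often they'd match by chance, from each annotator's marginals.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a.keys() | freq_b.keys())
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

# Two annotators grading the same ten cases against (nominally) the same rubric.
ann_a = ["good", "good", "bad", "good", "bad", "good", "bad", "bad", "good", "good"]
ann_b = ["good", "bad",  "bad", "good", "bad", "good", "good", "bad", "good", "bad"]
print(f"kappa = {cohens_kappa(ann_a, ann_b):.2f}")  # 0.40: two rubrics with the same name
```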
The fix is mechanical: a calibration ritual where two annotators independently score the same sample of cases, the disagreements are pulled into a meeting, and every disagreement that traces to ambiguity in the written rubric becomes a rubric edit. The rubric becomes a versioned artifact — rubric_v3.7.md checked into the repo, with a changelog. Every eval result gets tagged with the rubric version it was graded against. When somebody asks "is this eval comparable to last quarter's," the answer is a version diff, not a guess.
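What the tagging looks like in practice is a one-field schema change. The record layout and version policy below are illustrative, not a standard; the point is that comparability becomes a computable property of two results.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EvalResult:
    case_id: str
    score: float
    rubric_version: str  # the version of rubric_vX.Y.md this grade was assigned under
    graded_at: str       # ISO date

def comparable(a: EvalResult, b: EvalResult) -> bool:
    """Results are directly comparable only within a major rubric version.
    A major bump is the changelog saying the meaning of "good" changed."""
    return a.rubric_version.split(".")[0] == b.rubric_version.split(".")[0]

last_quarter = EvalResult("case-0042", 0.92, "3.2", "2025-01-15")
this_quarter = EvalResult("case-0042", 0.94, "4.0", "2025-04-15")
print(comparable(last_quarter, this_quarter))  # False: the "gain" crosses a rubric boundary
```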
The Product Moved and the Eval Didn't
The second half of annotation drift comes from the opposite direction: the labels are stable, but the thing being labeled has moved. The eval set was assembled when the product handled five intents. The product now handles eleven. User expectations were calibrated against an agent that took thirty seconds to respond; the new agent streams in eight seconds, and users are now comparing it against a different mental baseline. The label "good" in the eval set meant something specific in October that means something different in May, even if nobody changed the rubric.
This is concept drift, and it's the eval-set version of the well-documented model-side concept drift problem. The HR-chatbot example that keeps showing up in industry write-ups is illustrative: a chatbot passes 99% of an offline eval set covering standard payroll questions, the company rolls out a new equity plan on Monday, and by Tuesday the production traffic is dominated by vesting-schedule questions that don't exist in the eval set at all. The eval is still scoring 99%. The product is failing in production. The disconnect is the eval set's failure to track the product's capability surface.
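One cheap canary for this failure, assuming production traces carry some intent tag (the field and values here are hypothetical), is to measure what fraction of recent traffic the eval set has never seen.

```python
def coverage_gap(eval_intents: list[str], prod_intents: list[str]) -> float:
    """Fraction of recent production traffic whose intent never appears in the eval set.
    A crude proxy: the eval can't grade what it never sees."""
    covered = set(eval_intents)
    return sum(intent not in covered for intent in prod_intents) / len(prod_intents)

eval_set = ["payroll"] * 80 + ["pto"] * 20          # the offline set scoring 99%
tuesday  = ["payroll"] * 30 + ["pto"] * 10 + ["equity_vesting"] * 60

print(f"{coverage_gap(eval_set, tuesday):.0%}")     # 60%: the eval is grading last month's product
```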
The discipline that addresses this is a rolling refresh. A percentage of the eval set — call it 10–20% per quarter — gets retired and replaced with new cases drawn from recent production traces. The retired cases go into an archive, not into the trash, because you want to be able to ask "did we get worse on the old distribution," and that question requires keeping the old set readable. The new cases get annotated against the current rubric, and the rubric itself gets re-examined every time a new capability ships. The eval set becomes a depreciating asset with a known half-life, the way training data is, the way documentation is, the way any artifact pinned to a moving product is.
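A sketch of the refresh mechanics, with a hypothetical case schema; a real version would stratify the retirement sample by intent or age rather than drawing it uniformly.

```python
import random
from datetime import date

def quarterly_refresh(eval_set: list[dict], prod_traces: list[dict],
                      fraction: float = 0.15, seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Retire `fraction` of the eval set and backfill from recent production traces.
    Returns (live set, archive batch); the archive keeps the old distribution queryable."""
    rng = random.Random(seed)
    n = int(len(eval_set) * fraction)
    retire_idx = set(rng.sample(range(len(eval_set)), n))
    retired = [c for i, c in enumerate(eval_set) if i in retire_idx]
    kept = [c for i, c in enumerate(eval_set) if i not in retire_idx]
    fresh = rng.sample(prod_traces, n)
    for case in fresh:
        case["needs_annotation"] = True  # new cases get graded against the *current* rubric
    for case in retired:
        case["retired_on"] = date.today().isoformat()  # archived, not deleted
    return kept + fresh, retired
```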
The Two Failure Modes Compound
The dangerous case is when both drifts happen at once, because they hide each other. If only the rubric drifts, the eval score gets noisier — annotators disagree, agreement metrics fall, somebody investigates. If only the product drifts, the eval starts missing important capabilities — production incidents surface failures the eval didn't catch, somebody investigates. But when both drift together, the eval score stays remarkably stable. The annotators have re-calibrated against the product's new behavior without writing it down, the rubric on the page no longer matches what the annotators are doing, and the score on the dashboard is measuring an undocumented agreement between drifted annotators and a drifted product. Everything looks healthy. Nothing is.
This is the eval-program equivalent of a thermometer whose mercury has separated. The reading is steady. The thermometer is broken. You only find out when you cross-check against another thermometer — in eval terms, when a new annotator joins and their scores diverge wildly from the established team, or when a stakeholder reviews a sample of cases and asks why "good" was assigned to something they consider unambiguously bad. By that point, the team has been making decisions against a broken instrument for months, and the instinctive response (re-train the new annotator to match the established team) is exactly wrong — the established team is the source of the drift.
The defensive move is to maintain a small "anchor set" of cases whose labels do not move. These are cases the team has discussed exhaustively, where the rubric application is unambiguous, and where any disagreement is a signal that the rubric or the annotators have moved. The anchor set is graded every batch, by every annotator, and the agreement against the canonical labels is the calibration metric. When that agreement falls, the team has a clear signal — earlier than the production incident, earlier than the angry stakeholder — that the measurement system needs work.
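The per-batch check is a few lines. Three anchors stand in for the real twenty here, the case IDs are made up, and the alarm threshold is a policy choice, not a constant of nature.

```python
ANCHORS = {"a-01": "good", "a-02": "bad", "a-03": "good"}  # locked labels; never refreshed

def anchor_agreement(annotator_labels: dict[str, str]) -> float:
    """Raw agreement between one annotator's batch grades and the canonical anchor labels.
    Chance correction matters less here: the anchors are unambiguous by construction."""
    hits = sum(annotator_labels[cid] == canon for cid, canon in ANCHORS.items())
    return hits / len(ANCHORS)

batch = {"a-01": "good", "a-02": "good", "a-03": "good"}   # one annotator, this batch
score = anchor_agreement(batch)
if score < 0.9:
    print(f"calibration alarm: anchor agreement {score:.2f}; rubric or annotator has moved")
```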
The Org Cost Nobody Budgets For
The hardest part of fixing annotation drift is that the work is structurally unfunded. Eval sets get built during a project. The build is a line item with a deliverable. The maintenance is not. The annotator headcount gets justified against the build; once the build is done, the team that owns the eval has to keep a partial allocation of annotator time forever, and that allocation competes with the next project's build in every quarterly planning conversation. Maintenance loses, because maintenance doesn't ship anything visible. By the time the team realizes the eval has rotted, the budget required to refresh it is bigger than the budget required to maintain it would have been, and the conversation with leadership is now about "why did this break" rather than "let's keep funding this."
The reframe that helps is treating the eval set as infrastructure with depreciation, not as a one-time asset. Cloud infrastructure has a maintenance budget. Documentation has a maintenance budget. The eval set is not different. A reasonable starting point is to budget 10–15% of the original build cost annually as maintenance — recurring annotator hours, recurring rubric-revision cycles, recurring refresh batches. That number is rough, and it should be calibrated against the team's drift metrics over time, but the principle is non-negotiable: an unmaintained eval set has a useful life of about a year, and budgeting against zero maintenance is budgeting against the calendar.
The leadership ask is different from the engineering ask. Engineering wants tools — rubric versioning, inter-rater agreement dashboards, anchor sets, refresh tooling. Leadership wants a contract: a written agreement that the eval program will report drift metrics quarterly, that the rubric is versioned, and that the eval set has a documented refresh cadence. The contract is what survives team turnover. The tools without the contract get deprioritized by the next manager. The contract without the tools is theater. Both are necessary.
What to Build, in Order
If a team has an eval set today and no defense against annotation drift, the cheapest first move is to measure. Pull a recent batch and have two annotators re-grade the same hundred cases. Compute the agreement. Compare it against whatever baseline exists from the build. If the number has dropped, that's the surface evidence. Then sample a dozen of the disagreements and read them together — the source of the drift is usually obvious once two annotators are looking at the same case and noticing they had different answers in mind.
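The agreement number can come from the kappa sketch above; pulling the disagreement sample for that joint read is one more small helper, with hypothetical naming again.

```python
import random

def disagreement_sample(labels_a: dict[str, str], labels_b: dict[str, str],
                        k: int = 12) -> list[str]:
    """Case IDs where the two re-graders diverged, sampled down to a joint-review batch."""
    diverged = [cid for cid in labels_a if labels_a[cid] != labels_b[cid]]
    return random.sample(diverged, min(k, len(diverged)))

# review_batch = disagreement_sample(grades_annotator_a, grades_annotator_b)
```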
The second move is to version the rubric. Whatever rubric exists today becomes v1.0. Every change after that bumps a version. Every eval result gets tagged. This is a one-day engineering project that pays for itself the first time somebody asks "is this comparable to last quarter."
The third move is the anchor set. Pull twenty cases the team agrees on, lock the labels, grade them every batch. The agreement-against-anchor metric is the early warning system.
The fourth move is the refresh cadence. Set a quarterly date. Retire 10–20% of the set. Replace with production-trace samples. Annotate against the current rubric. Track the agreement.
None of these moves are exotic. They're the kind of thing a team would design if they thought of the eval set as a piece of measurement infrastructure rather than a deliverable. The reason they don't get built is that the failure mode they prevent is silent, and silent failures are structurally hard to fund. The team that funds them anyway is the team whose eval scores still mean something a year from now.
The takeaway for engineers is simple: the eval set is not the rubric. The rubric is not the annotation. The annotation is not the product. Every link in that chain drifts, and a measurement system that doesn't account for the drift will eventually report a number that has stopped meaning anything. The work isn't to prevent the drift — drift is a property of the system, not a bug. The work is to instrument it, so that the day the number stops meaning something is the day a graph turns red, not the day a stakeholder finds out by accident.
