Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage
Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.
This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.
The painful part: the loop looks healthy from the outside. Coverage grows, reviewer agreement goes up, scores trend up and to the right. Every signal your dashboard tracks confirms that the eval set is improving. What the dashboard cannot see is that the direction of improvement has rotated away from "what users wanted" and toward "what the model has learned to produce."
The Three Production Pipelines That Leak
Contamination rarely arrives through one fat pipe. It leaks through three thin ones, each one individually reasonable, collectively corrosive.
User feedback as labels. Thumbs-up / thumbs-down, regenerate-rate, dwell time on a response — product teams rightly mine these signals and promote them into eval inputs. The hidden assumption is that "user approved" means "objectively correct." It does not. Users approve outputs that match their priors, their phrasing, and the length conventions the model has trained them to expect. If your model trended toward verbose bullet-heavy answers last quarter, users adapted, and their thumbs-up now rewards the same shape. Promote those as labels and your eval set now encodes "verbose and bullet-heavy" as correct. A terser, better model will score lower. You will read that as a regression and not ship it.
Human labelers who see the draft. The common annotation UI shows the model's candidate response and asks a labeler to edit, accept, or rank. It feels efficient — labelers anchor on the draft and only correct the obvious misses. That anchoring is the contamination vector. Research on RLHF labeling traps documents exactly this: if labelers lean on model output as a starting point, their "gold" labels inherit model-specific phrasings, structural choices, and failure blind spots. The label teaches the model to produce what the labeler already expected to see, which was the model's own draft. Independent labeling — labeler writes the answer before looking at any model output — is expensive and slow, so almost nobody does it consistently. That's how the drift starts.
RLHF and judge-model artifacts. Preference pairs collected during RLHF, reward-model scores, and LLM-as-judge evaluations all carry the stylistic preferences of the model that generated the candidates. Reuse those preference traces to build your eval or your reward shaping, and the eval set becomes a reflection of the policy that produced the candidates. Cross-contamination between the reward model, the policy, and the eval set is especially insidious because all three share an ancestry, and that ancestry is not logged anywhere a normal eng team would look. Subsequent generations of models get graded against a rubric the previous generation wrote.
Any one of these pipelines, on its own, introduces modest skew. Combined and running for a year, they produce an eval set that is statistically indistinguishable from "outputs this model family would have produced, filtered for surface-level correctness."
Why the Loop Feels Like Progress
The mechanics of the self-congratulation loop are straightforward once you name them. At each iteration: the model produces outputs, a filtered slice of those outputs becomes labels, labels go into evals, evals are used to decide which next-generation models ship. Selection pressure on the next model is therefore "match the previous model's filtered outputs." Run this a few times and preference drift compounds in one direction.
Recent work on self-consuming performative loops shows that non-creative preference bias increases faster under incremental fine-tuning when the training data partially originates from the model's own outputs. The eval case is the same dynamic running slower and quieter, because evals are used for go/no-go decisions rather than weight updates. The loop does not train the model to be more like itself in one shot. It trains the selection process to reject anything that is not like the incumbent.
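To make the dynamic concrete, here is a toy simulation (not taken from the cited work) in which "style" is a single number, users prefer 0.0, and the incumbent starts at 1.0. The only mechanism in play is the one described above: eval labels are a filtered slice of the incumbent's outputs, and the candidate that scores best against them ships.

```python
# Toy model of the self-congratulation loop. All numbers are illustrative.
import random

random.seed(0)
USER_PREFERENCE = 0.0   # what users actually want
incumbent_style = 1.0   # the incumbent's idiom, noticeably off from users

for generation in range(5):
    # 1. Incumbent produces outputs; a filtered slice becomes the eval labels.
    labels = [random.gauss(incumbent_style, 0.1) for _ in range(200)]
    label_center = sum(labels) / len(labels)

    # 2. Candidate next-generation models; some are genuinely closer to users.
    candidates = [random.gauss(USER_PREFERENCE, 0.8) for _ in range(10)]
    best_for_users = min(candidates, key=lambda c: abs(c - USER_PREFERENCE))

    # 3. Ship whichever candidate scores best on the eval, i.e. whichever is
    #    closest to labels derived from the incumbent's own outputs.
    shipped = min(candidates, key=lambda c: abs(c - label_center))

    print(f"gen {generation}: shipped style {shipped:+.2f}, "
          f"best available for users was {best_for_users:+.2f}")
    incumbent_style = shipped
```

Every generation, a candidate much closer to user preference exists and is rejected, because the eval only knows how to reward proximity to the incumbent.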
The symptom most teams notice first is not low quality — it is silent plateau. Eval scores tick up reliably, but real-world outcome metrics (task completion, user retention, complaint rate) stop correlating with eval deltas. That decoupling is the smoke signal. When your eval has become a mirror, moving eval scores requires the new model to converge on the old model's idiom; moving outcome metrics requires the opposite. You cannot do both at once.
A second symptom is that challenger models from outside your family perform inexplicably poorly on your internal evals, even when their public benchmark numbers are better. Teams often interpret this as "we have higher standards" or "our eval catches something generic benchmarks miss." Sometimes that is true. More often, your eval rewards the specific phrasings and output shapes of your incumbent family, and a challenger gets penalized for producing valid answers in a different voice.
The Provenance Discipline That Fixes It
The fix is not one magic rotation of labels — it is treating eval-set provenance as a first-class engineering artifact, with the same rigor you would apply to a training dataset. Three practices matter more than any specific tooling choice.
Source metadata on every label. Every item in the eval set should record where its label came from: independently authored, post-hoc human review of model output, user feedback, judge-model score, synthetic generation. Stamp each label with the labeling method, the model version it saw (if any), the labeler role, and the date. Without this metadata you cannot audit contamination, and you cannot answer the question that matters: "what fraction of our current eval set was ever exposed to model output before being labeled?" For most teams that fraction, once measured honestly, is north of 80%. That alone is usually enough to force action.
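A minimal sketch of what that metadata could look like, assuming labels live in a structured store you control; the field names and source categories below are illustrative, not a standard schema.

```python
# Illustrative per-label provenance record and the audit it enables.
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional

class LabelSource(Enum):
    INDEPENDENT_HUMAN = "independent_human"  # labeler never saw model output
    POST_HOC_REVIEW = "post_hoc_review"      # labeler edited or approved a model draft
    USER_FEEDBACK = "user_feedback"          # thumbs, regenerate-rate, dwell time
    JUDGE_MODEL = "judge_model"              # LLM-as-judge score or preference
    SYNTHETIC = "synthetic"                  # generated, then filtered

@dataclass
class EvalItem:
    item_id: str
    prompt: str
    gold_label: str
    source: LabelSource
    model_version_seen: Optional[str]  # model whose output the labeler saw, if any
    labeler_role: str                  # e.g. "vendor_annotator", "domain_expert"
    labeled_on: date

def exposed_fraction(items: list[EvalItem]) -> float:
    """Fraction of the eval set whose label was exposed to model output before
    being finalized -- the question the metadata exists to answer."""
    if not items:
        return 0.0
    exposed = [i for i in items if i.source is not LabelSource.INDEPENDENT_HUMAN]
    return len(exposed) / len(items)
```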
Rotation windows. Treat evals like security credentials — old labels decay in authority, not just relevance. Cap the share of the eval set that may come from any single labeling generation (say, no more than 30% from labels produced during the last model version's deployment). Retire older segments on a cadence even if they still seem valid. Rotation is not about throwing away good data; it is about making sure no single model's fingerprint dominates the signal.
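A sketch of how the cap might be enforced, assuming each eval item records the model generation its label was collected under (as in the provenance sketch above); the 30% cap is the example figure from this section, not a recommended constant.

```python
# Flag labeling generations that exceed their allowed share of the eval set.
from collections import Counter

MAX_SHARE_PER_GENERATION = 0.30  # illustrative cap

def over_represented_generations(generation_per_item: list[str]) -> dict[str, int]:
    """Map each over-represented generation to the number of its items that
    must be retired or relabeled. First-pass estimate only: retiring items
    shrinks the denominator, so re-run after each retirement pass."""
    counts = Counter(generation_per_item)
    total = len(generation_per_item)
    allowed = int(MAX_SHARE_PER_GENERATION * total)
    return {gen: n - allowed for gen, n in counts.items() if n > allowed}
```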
Adversarial reseeding. Periodically commission a slice of the eval set from independent labelers who never see any model output. Pay for the expensive version. Keep this slice small but load-bearing: it is your contamination canary. When scores on the reseeded slice diverge from scores on the mainline eval, you know mainline has drifted. When they track, you have earned the right to trust mainline for another quarter. Reseeded data also makes a reasonable acceptance criterion for promoting a challenger model: does it beat the incumbent on the untainted slice, not just on the mainline?
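A sketch of the canary and the promotion criterion, assuming both slices are scored on the same 0-to-1 metric; the divergence threshold is a placeholder to calibrate against your metric's normal run-to-run noise.

```python
# Reseeded-slice canary: mainline score running hot relative to the untainted
# slice is the drift signal. Threshold is illustrative.
DIVERGENCE_THRESHOLD = 0.05

def reseed_canary(mainline_score: float, reseeded_score: float) -> dict:
    gap = mainline_score - reseeded_score
    return {"gap": gap, "drift_suspected": gap > DIVERGENCE_THRESHOLD}

def promote_challenger(incumbent_reseeded: float, challenger_reseeded: float) -> bool:
    """Acceptance criterion from above: the challenger must beat the incumbent
    on the untainted slice, not just on the mainline eval."""
    return challenger_reseeded > incumbent_reseeded
```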
A stronger but less common practice: contamination-resilient regeneration, where eval items are rebuilt against fresh, date-stamped world knowledge rather than frozen historical answers. This is overkill for most teams, but for domains where facts move — regulation, pricing, product state — it is the only way to keep evals meaningful. Recent work on automated contamination-resilient dataset construction takes this approach and is worth watching even if you do not adopt it wholesale.
The Organizational Failure Mode
None of the provenance practices work if the org chart fights them. The single most predictive contamination signal is not technical — it is whether the team that writes the prompts also writes the evals that grade those prompts. When they are the same team, selection pressure is unopposed: a prompt change that happens to align with how the eval already scored will always "win," because the eval was designed with that prompt's outputs in mind.
Separating prompt authorship from eval authorship is cheap structurally and hard culturally. The prompt team wants evals that reward their intuitions; the eval team wants to test things the prompt team did not think to optimize for. That tension is the feature, not the bug. If there is no tension, you have a rubber-stamp process, and your reported quality is whatever the prompt team believes quality should look like this quarter.
For small teams that cannot fully separate roles, a lighter-weight version works: enforce that the person labeling an eval item is not the person who wrote (or iterated on) the prompt that generated it, and that labelers rotate across prompt versions so no single labeler becomes a stylistic echo chamber for a single prompt author. It is less robust than true separation but meaningfully better than the default of "one engineer writes the prompt, runs it, looks at outputs, and decides they look good."
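A sketch of that lighter-weight rule as an assignment constraint; the roster and history structures are hypothetical, and a real team would enforce this inside its labeling tool rather than a helper function.

```python
# Assign a labeler who is not the prompt author, rotating so no single labeler
# becomes the echo chamber for one author's prompts.
import random

def assign_labeler(prompt_author: str, team: list[str],
                   history: dict[str, list[str]]) -> str:
    eligible = [p for p in team if p != prompt_author]
    if not eligible:
        raise ValueError("cannot separate roles: the prompt author is the only labeler")
    past = history.setdefault(prompt_author, [])
    # Prefer whoever has labeled this author's prompts least often so far.
    fewest = min(past.count(p) for p in eligible)
    choice = random.choice([p for p in eligible if past.count(p) == fewest])
    past.append(choice)
    return choice
```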
A common and costly mistake is to delegate this separation entirely to LLM-as-judge systems on the theory that "the judge is independent." A judge model from the same family as the generator shares most of its stylistic priors; using it as the arbiter of quality just launders the same contamination through a different pipe. Judge models are useful for scale, not for independence. If you use them, pair them with a human-labeled reseeded slice as calibration and treat divergence between the two as a contamination alarm, not a judge-model bug to be patched away.
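A sketch of that calibration check, assuming judge and human verdicts can both be reduced to pass/fail on the same reseeded items; the agreement floor is an illustrative starting point, not an established standard.

```python
# Judge-model calibration against the human-labeled reseeded slice.
AGREEMENT_FLOOR = 0.80  # illustrative; set from your slice size and noise

def judge_calibration(judge_verdicts: list[bool], human_verdicts: list[bool]) -> dict:
    """Low agreement is treated as a contamination alarm to investigate,
    not as a judge prompt to be patched."""
    if not judge_verdicts or len(judge_verdicts) != len(human_verdicts):
        raise ValueError("verdict lists must be the same non-zero length")
    agree = sum(j == h for j, h in zip(judge_verdicts, human_verdicts))
    rate = agree / len(judge_verdicts)
    return {"agreement": rate, "contamination_alarm": rate < AGREEMENT_FLOOR}
```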
What to Watch For Next Week
Eval-set contamination is a slow failure — it will never page you at 3 AM. But a few concrete checks can surface it in one sitting. First, audit how many items in your current eval set have source-metadata fields at all; if the answer is "none" or "I don't know," that is the project. Second, compute the correlation between your last six eval deltas and your last six outcome-metric deltas. If correlation is under 0.3 you are already decoupled, and chasing the eval number is not going to help your users. Third, commission a small independently labeled slice and run your last three shipped models against it. If the ordering differs from your internal eval's ordering, you have your receipt.
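Checks two and three reduce to a few lines, assuming the score histories are on hand; scipy is used here for convenience, and the 0.3 floor is the rule of thumb above.

```python
# One-sitting checks: eval/outcome decoupling and model-ordering disagreement.
from scipy.stats import pearsonr, kendalltau

def eval_outcome_decoupling(eval_deltas: list[float],
                            outcome_deltas: list[float]) -> dict:
    """Check 2: correlate recent eval-score deltas with outcome-metric deltas."""
    r, _ = pearsonr(eval_deltas, outcome_deltas)
    return {"correlation": r, "decoupled": r < 0.3}

def ordering_check(internal_scores: dict[str, float],
                   independent_scores: dict[str, float]) -> dict:
    """Check 3: do the internal eval and the independently labeled slice rank
    the same shipped models in the same order?"""
    models = sorted(internal_scores)
    tau, _ = kendalltau([internal_scores[m] for m in models],
                        [independent_scores[m] for m in models])
    return {"kendall_tau": tau, "orderings_match": tau == 1.0}
```

With only three models a rank correlation is blunt; the point is simply to get the disagreement on the record.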
The thing to internalize: an eval is not a test, it is a compressed statement of what you think good looks like. When that statement is written, even indirectly, by the model you are evaluating, the eval stops being an external check and becomes a feedback amplifier. The work is not to build a cleverer eval; it is to keep the authorship of "what good looks like" structurally separate from the authorship of the outputs being graded. Get the pipelines and the org chart right and the evals take care of themselves. Get them wrong and no amount of rubric engineering will stop the mirror from winning.
Sources
- https://arxiv.org/html/2406.04244v1
- https://arxiv.org/html/2502.14425v2
- https://arxiv.org/html/2411.03923v1
- https://arxiv.org/html/2601.05184
- https://arxiv.org/html/2504.12501v3
- https://aclanthology.org/2025.emnlp-main.1173.pdf
- https://arxiv.org/html/2511.18889
- https://arxiv.org/html/2502.08177v2
- https://neurotechnus.com/2025/09/21/llm-as-a-judge-bias-reliability/
- https://medium.com/@Quaxel/preference-pairs-7-rlhf-labeling-traps-762d9aa94082
