
Your Gold Labels Learned From Your Model: Eval-Set Contamination via Production Leakage

10 min read
Tian Pan
Software Engineer

Your eval suite passed. Quality dashboards are green. A week later, users are quietly churning and nobody can explain why. The eval set did not lie by being wrong — it lied by being a mirror. The labels you graded against were, traceably, produced or filtered by the very model family you were trying to evaluate. Passing that eval is not evidence of quality. It is evidence that your model agrees with its own past outputs.

This is the quiet failure mode of mature LLM pipelines: eval-set contamination via production leakage. Not the famous benchmark contamination where a model trained on GSM8K also gets graded on GSM8K — that story is well told. The subtler one is downstream. Your gold labels come from user feedback, from human annotators who saw the model's draft first, from RLHF reward traces, from LLM-as-judge preference data. Each of those pipelines carries a fingerprint of the current model's idiom back into your "ground truth." Over a few quarters, the test set quietly memorizes your model's biases, and the eval becomes a self-congratulation loop.

The painful part: the loop looks healthy from the outside. Coverage grows, reviewer agreement goes up, scores trend up and to the right. Every signal your dashboard tracks confirms that the eval set is improving. What the dashboard cannot see is that the direction of improvement has rotated away from "what users wanted" and toward "what the model has learned to produce."

The Three Production Pipelines That Leak

Contamination rarely arrives through one fat pipe. It leaks through three thin ones, each individually reasonable and collectively corrosive.

User feedback as labels. Thumbs-up / thumbs-down, regenerate-rate, dwell time on a response — product teams rightly mine these signals and promote them into eval inputs. The hidden assumption is that "user approved" means "objectively correct." It does not. Users approve outputs that match their priors, their phrasing, and the length conventions the model has trained them to expect. If your model trended toward verbose bullet-heavy answers last quarter, users adapted, and their thumbs-up now rewards the same shape. Promote those as labels and your eval set now encodes "verbose and bullet-heavy" as correct. A terser, better model will score lower. You will read that as a regression and not ship it.
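To make the leak concrete, here is a minimal sketch of the kind of promotion job that causes it. The log schema (query, response, feedback, model_version) is hypothetical; the point is that everything it writes as "expected" is an incumbent-model output, in the incumbent model's idiom.

```python
import json

def promote_feedback_to_gold(log_path: str, gold_path: str) -> int:
    """Copy every thumbs-up production interaction into the eval set as 'gold'."""
    promoted = 0
    with open(log_path) as logs, open(gold_path, "w") as gold:
        for line in logs:
            event = json.loads(line)
            if event.get("feedback") != "thumbs_up":
                continue
            # The trap: 'expected' is whatever the incumbent model produced,
            # shaped by the length and formatting habits users adapted to.
            gold.write(json.dumps({
                "input": event["query"],
                "expected": event["response"],
                "label_source": f"user_feedback:{event['model_version']}",
            }) + "\n")
            promoted += 1
    return promoted
```

Even keeping the label_source tag, as above, is more than most pipelines do, and it is the cheapest thing you can add today: without provenance you cannot later ask which labels trace back to which model.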

Human labelers who see the draft. The common annotation UI shows the model's candidate response and asks a labeler to edit, accept, or rank. It feels efficient — labelers anchor on the draft and only correct the obvious misses. That anchoring is the contamination vector. Research on RLHF labeling traps documents exactly this: if labelers lean on model output as a starting point, their "gold" labels inherit model-specific phrasings, structural choices, and failure blind spots. The label teaches the model to produce what the labeler already expected to see, which was the model's own draft. Independent labeling — labeler writes the answer before looking at any model output — is expensive and slow, so almost nobody does it consistently. That's how the drift starts.
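If you cannot afford fully independent labeling, you can at least measure how anchored your labels are. A rough check, assuming you log both the draft shown to the labeler and the label they submitted (field names are hypothetical): compare the two and track the similarity across a batch.

```python
from difflib import SequenceMatcher
from statistics import mean

def anchoring_score(records: list[dict]) -> float:
    """Mean draft-vs-gold similarity; 1.0 means labels copy the draft verbatim."""
    ratios = [
        SequenceMatcher(None, r["model_draft"], r["gold_label"]).ratio()
        for r in records
    ]
    return mean(ratios) if ratios else 0.0
```

If the score for a batch stays high (say, above 0.8, a threshold you would tune for your domain), treat the batch as lightly edited model output rather than gold, and spot-check it against a small independently written sample.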

RLHF and judge-model artifacts. Preference pairs collected during RLHF, reward-model scores, and LLM-as-judge evaluations all carry the stylistic preferences of the model that generated the candidates. Reuse those preference traces to build your eval or your reward shaping, and the eval set becomes a reflection of the policy that produced the candidates. Cross-contamination between the reward model, the policy, and the eval set is especially insidious because all three share an ancestry, and that ancestry is not logged anywhere a normal eng team would look. Subsequent generations of models get graded against a rubric the previous generation wrote.
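One mitigation is to make that ancestry explicit. A provenance-filter sketch, under the assumption that each eval item records which model family generated or filtered its label (the ancestry map and item schema here are illustrative, not a standard format):

```python
# Which model families share training lineage. "chat-v3" was tuned from
# "chat-v2", which was tuned from "chat-v1" -- an assumed example lineage.
MODEL_ANCESTRY = {
    "chat-v3": {"chat-v3", "chat-v2", "chat-v1"},
    "chat-v2": {"chat-v2", "chat-v1"},
}

def uncontaminated_items(eval_items: list[dict], model_under_eval: str) -> list[dict]:
    """Keep only items whose label provenance is outside the model's lineage."""
    lineage = MODEL_ANCESTRY.get(model_under_eval, {model_under_eval})
    return [
        item for item in eval_items
        if item.get("label_source_model") not in lineage
    ]
```

The filtered subset is usually small, which is itself informative: it tells you how little of your "ground truth" is actually independent of the thing being graded.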

Any one of these pipelines, on its own, introduces modest skew. Combined and running for a year, they produce an eval set that is statistically indistinguishable from "outputs this model family would have produced, filtered for surface-level correctness."

Why the Loop Feels Like Progress

The mechanics of the self-congratulation loop are straightforward once you name them. At each iteration: the model produces outputs, a filtered slice of those outputs becomes labels, labels go into evals, evals are used to decide which next-generation models ship. Selection pressure on the next model is therefore "match the previous model's filtered outputs." Run this a few times and preference drift compounds in one direction.
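A toy simulation makes the selection pressure visible. This is not a claim about any real system: "style" is collapsed to a single number, users prefer style 0.0, the incumbent sits at 0.6, and each release ships whichever candidate best matches an eval set built from the incumbent's own outputs.

```python
import random

random.seed(0)
user_preferred_style = 0.0
incumbent = 0.6

for generation in range(5):
    eval_target = incumbent  # eval labels are filtered incumbent outputs
    candidates = [incumbent + random.uniform(-0.4, 0.4) for _ in range(8)]
    best_for_users = min(candidates, key=lambda c: abs(c - user_preferred_style))
    # Selection rewards agreement with the eval set, not with users.
    shipped = min(candidates, key=lambda c: abs(c - eval_target))
    print(f"gen {generation}: shipped style {shipped:+.2f} "
          f"(closest-to-users candidate was {best_for_users:+.2f})")
    incumbent = shipped
```

Generation after generation, the candidate that would have moved toward users loses to the candidate that best matches the incumbent's idiom, so the incumbent's style persists even though every release "passes eval."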

Recent work on self-consuming performative loops shows that non-creative preference bias increases faster under incremental fine-tuning when the training data partially originates from the model's own outputs. The eval case is the same dynamic running slower and quieter, because evals are used for go/no-go decisions rather than weight updates. The loop does not train the model to be more like itself in one shot. It trains the selection process to reject anything that is not like the incumbent.

The symptom most teams notice first is not low quality — it is silent plateau. Eval scores tick up reliably, but real-world outcome metrics (task completion, user retention, complaint rate) stop correlating with eval deltas. That decoupling is the smoke signal. When your eval has become a mirror, moving eval scores requires the new model to converge on the old model's idiom; moving outcome metrics requires the opposite. You cannot do both at once.
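You can watch for that decoupling directly, assuming you keep per-release eval scores alongside a real-world outcome metric such as task completion rate. A minimal sketch: correlate the release-over-release deltas of the two series; when the correlation trends toward zero while eval scores keep climbing, the eval has stopped predicting outcomes.

```python
from statistics import correlation  # Python 3.10+

def deltas(series: list[float]) -> list[float]:
    """Release-over-release changes."""
    return [b - a for a, b in zip(series, series[1:])]

# Illustrative numbers only: eval scores up and to the right,
# task completion flat -- the silent plateau.
eval_scores = [0.71, 0.74, 0.78, 0.81, 0.84]
task_completion = [0.62, 0.64, 0.64, 0.63, 0.64]

print(correlation(deltas(eval_scores), deltas(task_completion)))
```

The exact statistic matters less than the habit of plotting the two delta series next to each other at every go/no-go review.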
