The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.

How the two corpora become one

Most teams do not set out to leak their eval set. The merge happens incrementally, and each step is locally justified.

Quarter one, a small team curates two hundred examples covering the request distribution. The CSV lives in the same repo as the prompt template, because there is no separate eval infrastructure yet. The headline accuracy number is computed by feeding each example through the current prompt and comparing the output to the labeled answer.

Quarter two, someone adds dynamic few-shot retrieval to the serving prompt. The retrieval index needs a source of labeled examples. The eval CSV is the only labeled corpus the team has, so the retrieval index points at it. The serving prompt now selects three examples from the eval CSV at runtime and includes them in context. The eval pipeline does not change; it still iterates over the same CSV, calling the same prompt template, which now retrieves from the CSV it is iterating over.
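The loop described above can be reproduced in a few lines. This is a minimal sketch, not the pipeline from the story: the CSV contents, the column names `question` and `answer`, the token-overlap similarity, and the helper `leaked_fraction` are all assumptions made for illustration. It counts how often the evaluator's own row comes back as one of the retrieved few-shot examples.

```python
import csv
import io

# Hypothetical stand-in for the shared eval CSV; the column names are assumptions.
EVAL_CSV = """question,answer
How do I reset my password?,Use the account settings page.
How do I change my email address?,Open profile and edit the email field.
Where can I download my invoice?,Invoices are under billing history.
How do I cancel my subscription?,Go to billing and choose cancel.
"""


def load_examples(text):
    """Parse the CSV text into a list of {'question': ..., 'answer': ...} rows."""
    return list(csv.DictReader(io.StringIO(text)))


def tokens(text):
    """Lowercased word set; a crude stand-in for a real retrieval embedding."""
    return set(text.lower().replace("?", "").split())


def jaccard(a, b):
    """Token-overlap similarity between two questions, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def retrieve(query, pool, k=3):
    """Return the k pool rows most similar to the query."""
    ranked = sorted(pool, key=lambda ex: jaccard(query, ex["question"]), reverse=True)
    return ranked[:k]


def leaked_fraction(eval_rows, pool, k=3):
    """Fraction of eval rows whose own labeled answer lands in their few-shot context."""
    hits = 0
    for row in eval_rows:
        shots = retrieve(row["question"], pool, k)
        if any(s["question"] == row["question"] and s["answer"] == row["answer"] for s in shots):
            hits += 1
    return hits / len(eval_rows)


rows = load_examples(EVAL_CSV)
# The evaluator iterates over the same rows the retriever indexes.
print(leaked_fraction(rows, rows))
```

When the evaluator and the retriever share one list, every row is its own nearest neighbor, so the fraction is 1.0. Pointing the retriever at a disjoint pool drops it to zero.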

Quarter three, the team adds more examples to grow few-shot coverage. The examples land in the same CSV because that is where examples live. The eval set grows in lockstep with the few-shot pool because they are the same file.

By quarter four, the eval number has decoupled from anything a user would notice. The headline accuracy is a measure of nearest-neighbor lookup against an index that contains the answer. Every prompt change that improves retrieval relevance moves the benchmark. None of them improve generalization.

The root cause is not the few-shot retrieval move. The root cause is that the example bank and the eval bank are the same physical artifact, queried by both the serving path and the evaluator path, with no enforced separation. The two corpora were never named as distinct things, and so they merged.
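One way to make the boundary explicit is to give every example exactly one home and to fail loudly when the two banks overlap. The sketch below is an assumption about how that might look, not a description of any particular team's tooling: `assign_bank`, `overlap`, `assert_disjoint`, and the 20 percent eval share are hypothetical names and values. It fingerprints normalized question text, so it catches exact and whitespace or case duplicates but not paraphrases.

```python
import hashlib


def normalize(question):
    """Lowercase and collapse whitespace so trivial variants share a fingerprint."""
    return " ".join(question.lower().split())


def fingerprint(question):
    """Stable hash of the normalized question text."""
    return hashlib.sha256(normalize(question).encode("utf-8")).hexdigest()


def assign_bank(question, eval_percent=20):
    """Deterministically route an example to 'eval' or 'fewshot' (the share is an assumption)."""
    bucket = int(fingerprint(question)[:8], 16) % 100
    return "eval" if bucket < eval_percent else "fewshot"


def overlap(eval_questions, fewshot_questions):
    """Few-shot questions whose fingerprint also appears in the eval bank."""
    eval_prints = {fingerprint(q) for q in eval_questions}
    return sorted(q for q in fewshot_questions if fingerprint(q) in eval_prints)


def assert_disjoint(eval_questions, fewshot_questions):
    """Raise if any example sits in both banks; intended as a pre-eval or CI check."""
    shared = overlap(eval_questions, fewshot_questions)
    if shared:
        raise ValueError(f"{len(shared)} few-shot example(s) also appear in the eval bank")
```

Because the routing depends only on the question text, an example added in quarter three lands in the same bank every time, regardless of which file someone drops it into.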

Why the contamination is hard to see from the dashboard

The leak does not look like a bug. It looks like the benchmark working as designed.

Recent contamination research has shown that even partial exposure to benchmark content gives a model enough signal to score high without learning the task. Large models exploit small patterns in their prompts, and a few-shot retrieval path that surfaces lexically similar eval examples is doing exactly what it was built to do. The model is not cheating; it is using the context window as the engineering team instructed. The benchmark score drifts away from what users experience, but it drifts upward, so from inside the dashboard there is no anomaly to flag.

The team's instinct is to trust the trend line. The trend line is real — the model is genuinely better at the task of "produce an output that matches the labeled answer for an input that closely resembles a labeled example you were just shown in context." That is a different task from the one users perform, but the dashboard has no field to surface the distinction.
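The distinction can be surfaced by reporting two numbers instead of one: the score with the shared pool as is, and the score when each eval input is barred from retrieving itself. The following is a toy sketch under loud assumptions: `copy_nearest` is a stand-in for the model that simply copies the answer of the most similar example, the four examples are invented, and the similarity is plain token overlap.

```python
def tokens(text):
    """Lowercased word set; a crude stand-in for a real retrieval embedding."""
    return set(text.lower().replace("?", "").split())


def similarity(a, b):
    """Token-overlap (Jaccard) similarity between two questions."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def copy_nearest(question, pool):
    """Toy 'model': return the answer attached to the most similar pooled example."""
    if not pool:
        return None
    best = max(pool, key=lambda ex: similarity(question, ex[0]))
    return best[1]


def accuracy(eval_set, pool, exclude_self):
    """Score the toy model; optionally hide each eval input from its own retrieval."""
    correct = 0
    for question, answer in eval_set:
        shots = [ex for ex in pool if not (exclude_self and ex[0] == question)]
        correct += copy_nearest(question, shots) == answer
    return correct / len(eval_set)


# Invented (question, label) pairs; one file plays both eval set and few-shot pool.
examples = [
    ("How do I reset my password?", "settings"),
    ("How do I change my password?", "settings"),
    ("Where can I download my invoice?", "billing"),
    ("How do I cancel my subscription?", "billing"),
]

leaky = accuracy(examples, examples, exclude_self=False)
held_out = accuracy(examples, examples, exclude_self=True)
print(f"leaky={leaky:.2f} held_out={held_out:.2f}")
```

On this toy data the leaky score is 1.00 and the held-out score is 0.50. The size of the gap here means nothing; the point is that a dashboard showing only the first number has no way to reveal that the second one exists.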

The gap shows up first in qualitative review. A product manager opens a sampled session, sees a confident wrong answer to a question whose phrasing is slightly off-distribution, and asks why the benchmark did not catch it. The answer is that the benchmark cannot catch it, because every input it scores is already sitting in the few-shot pool the model retrieves from. The off-distribution case is the entire user-visible failure mode, and it is the one case the eval pipeline structurally cannot see.

This is the back-channel version of the contamination problem that academic benchmark teams have been chronicling for the past two years. The same mechanism that inflates leaderboard scores — implicit overlap between training corpus and test corpus — is now operating inside a single production pipeline, on a much shorter feedback loop, with much less visibility.

Symptoms before someone names it

Three signals tend to surface before a team identifies the leak.
