The Eval Set Your Prompt Engineers Turned Into Production Few-Shots
The eval dashboard had been climbing for three sprints. Quality up six points on the hard slice, up nine on the regression slice, up twelve on the slice the support team had hand-curated from last quarter's worst tickets. The team shipped a model promotion off the back of it. Two days later, a customer asked a question that looked nothing like anything in the eval set, and the answer was worse than what they had been getting six months ago.
The forensic was quick once someone thought to run it. The prompt engineers had been working out of the same repo as the eval team. They had found the curated examples — the painstaking ones, the ones where someone had argued for an hour about the correct phrasing of the ideal answer — and over a few sprints they had copy-pasted the strongest of them as few-shot demonstrations into the production system prompt. The dashboard kept going up because the model was being graded on inputs it had seen verbatim at inference time. Nobody flagged it. Nobody owned the boundary between "the examples we measure quality against" and "the examples we ship in the prompt." Both teams were doing exactly the job they had been hired to do.
This is not a story about anyone being careless. It is a story about a team-level data leak that no individual contributor can recognize as one, because the leak only exists when you look at the system from above. The eval team did not put their examples into production. The prompt-engineering team did not corrupt the eval. The repo policy did not say "these examples are sacred." The CI did not check for substring overlap between system prompts and eval inputs. The leak was the absence of a boundary that nobody had been asked to draw.
Why "Best-Available Examples" Is Exactly the Wrong Material
Here is the uncomfortable structural fact. The work product of a good eval team and the work product of a good prompt-engineering team look identical at the level of files in a repo. Both are collections of carefully chosen input-output pairs where the output is the answer your product wants the model to produce. The eval team curates them to grade the model on its hardest known failure modes. The prompt-engineering team curates them to demonstrate the behavior they want the model to copy. Same shape, opposite purpose.
So when a prompt engineer goes looking for "high-quality examples that capture what good looks like on the cases we keep getting wrong" — exactly the brief — and the highest-quality examples in the entire org are sitting in the eval team's shared folder, what happens next is not a violation of any rule. It is the rational behavior of a competent engineer who has not been told that those examples are radioactive. The cases the eval team curated are, by construction, the cases where the model used to fail. Of course they are the best teaching material. That is also why they are the worst teaching material: if you put them in the prompt, you have just trained your evaluation harness to grade open-book.
The literature on benchmark contamination has been documenting this for years, but at the level of pretraining data. The recent work on data contamination in LLM evaluation finds that contamination levels reach as high as 45% on commonly used benchmarks, and that exposure to evaluation content causes 5–15% absolute accuracy drops to disappear from headline metrics. What is less discussed is the same phenomenon recapitulated inside a single product team's workflow over the course of a few sprints, where the contamination vector is not the training run but the system prompt.
The mechanism is the same. The model is being asked to demonstrate behavior on inputs it has already been shown the correct answer for. The fact that the showing happened at inference time rather than during training does not change what the metric is measuring. It is measuring memorization, not generalization, and your dashboard cannot tell the difference because the dashboard was built to measure quality on this exact set of inputs.
The Failure Mode Hides Inside a Real Improvement Curve
The thing that makes eval-set-to-few-shot contamination especially insidious is that the dashboard does not look broken. It looks great. It looks the way a team that is doing the work hopes their dashboard will look. There is a real improvement happening in some of the per-example scores, because few-shot demonstrations genuinely do improve the model's behavior on similar inputs. The contamination does not flatten the curve. It bends it the wrong way.
What you cannot see from inside the metric is the bifurcation between two populations of inputs. On the contaminated population — anything that resembles the few-shot demos closely enough that the model can pattern-match its way to the answer — scores climb. On the rest of the production traffic, which is the part the eval set was supposed to be a proxy for, scores stay flat or get slightly worse, because the prompt is now longer, more cluttered, and weighted toward a narrow phrasing style that may not generalize. Your overall metric averages these together, weighted by the composition of the eval set, which has been transparently optimized for over the same sprints.
There is a growing body of research on what happens when the model behaves differently on data it has seen versus data it has not. The Memorization Generalization Index work argues that you cannot trust a single aggregate accuracy number without a procedure that quantifies how much of the lift is memorization. In a team-level few-shot leak, the same principle applies: the aggregate is hiding two populations, and the population that matters in production is the one that is not represented in the few-shots.
What Independence Actually Requires
The standard advice — "use a held-out eval set" — is correct but underspecified for this failure mode. A held-out set that lives in the same repo as the prompt is held out only by convention. A held-out set that is rotated quarterly but whose rotation cadence is known to the prompt-engineering team is held out only against the previous quarter. A held-out set whose contents can be inferred from the public docs the prompt engineers also read is held out only against direct copy-paste, not against the broader category of behaviors the held-out set is sampling from.
- https://arxiv.org/html/2502.14425v2
- https://arxiv.org/abs/2402.15938
- https://aclanthology.org/2025.emnlp-main.1173.pdf
- https://aclanthology.org/2025.eval4nlp-1.3.pdf
- https://arxiv.org/pdf/2407.07565
- https://arxiv.org/html/2312.16337v1
- https://arxiv.org/pdf/2509.13196
- https://www.prompthub.us/blog/the-few-shot-prompting-guide
- https://montecarlo.ai/blog-llm-as-judge/
- https://www.truefoundry.com/blog/prompt-management-tools
