The Eval Set That Started Leaking Into Your Prompt
The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.
This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
