3 posts tagged with "contamination"

The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It

· 11 min read
Tian Pan
Software Engineer

Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, and it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.

Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.

The Eval Set That Started Leaking Into Your Prompt

· 10 min read
Tian Pan
Software Engineer

The benchmark number went up for four quarters in a row. User satisfaction did not. Nobody on the team could explain the gap until someone diffed the prompt template and noticed that the few-shot examples were being pulled from the same CSV that the evaluator was reading. The eval set had quietly become the in-context examples. The number was no longer measuring generalization. It was measuring how well the model could copy the nearest neighbor of a question whose answer it had just been shown.

This is the failure mode I want to name: eval-to-prompt leakage. It is structurally identical to test-set contamination in classical machine learning, but it happens through a back channel the team built deliberately. Few-shot retrieval is a reasonable engineering move. Eval banks are a reasonable engineering artifact. The contamination emerges when the two converge on the same storage layer without anyone naming the boundary.
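
One way to name that boundary is a guard that fails the build whenever the few-shot pool and the eval bank share a question. The sketch below is a minimal illustration, not the author's method: the function names (`normalize`, `fingerprint`, `find_overlap`) and the sample questions are hypothetical, and it assumes both stores can be read as lists of question strings. It catches only matches that are identical after case and whitespace normalization; paraphrased near-duplicates would need fuzzy or embedding-based matching on top.

```python
import hashlib


def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not hide a shared example.
    return " ".join(text.lower().split())


def fingerprint(text: str) -> str:
    # Hash the normalized text so the two stores can be compared
    # without copying raw eval cases into the few-shot pipeline.
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(eval_questions, few_shot_questions):
    # Return every few-shot question that also appears in the eval set.
    eval_prints = {fingerprint(q) for q in eval_questions}
    return [q for q in few_shot_questions if fingerprint(q) in eval_prints]


# Hypothetical data standing in for the eval bank and the few-shot pool.
eval_questions = ["How do I reset my password?", "What is the refund window?"]
few_shot_questions = ["how do I reset my   password?", "Where is my invoice?"]

leaked = find_overlap(eval_questions, few_shot_questions)
print(leaked)
```

Run as a pre-merge check, a non-empty `leaked` list would be a reason to block the change until the two sources are separated.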

The benchmark leak: how your eval set quietly joins the training corpus

· 11 min read
Tian Pan
Software Engineer

The benchmark you trust most is the one most likely lying to you. Public evals are a closed loop: you publish the test, someone scrapes it, the next generation of models trains on the scrape, and the score on your trusted yardstick rises by ten points without anyone touching the underlying capability. The measurement apparatus stays still while the thing it measures shifts under it, and the gap between "the model is better at this benchmark" and "the model is better at this task" widens every quarter. By the time the divergence is loud enough to argue about, the eval has already shipped six leaderboard updates and three product roadmaps that all assumed the number meant something.

This is not a hypothetical failure mode. The non-public pre-RLHF GPT-4 base model has been shown to reproduce the BIG-Bench canary GUID verbatim, and Claude 3.5 Sonnet has done the same, both indicating that supposedly quarantined task data ended up in training. Roughly 40% of HumanEval examples have been identified as contaminated, and removing the contaminated subset from GSM8K drops measured accuracy by about 13 points. SWE-bench Verified now shows a documented 10.6% data leakage rate, and OpenAI publicly stopped reporting results on it in late 2025 after their internal audit found every major frontier model could reproduce verbatim gold patches for some tasks. The numbers we use to compare models are increasingly numbers about memorization, not capability.
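
The canary mechanism mentioned above can be sketched in a few lines: stamp every eval case with a unique GUID that has no reason to exist anywhere else, then check whether a model's completion ever contains it. This is an illustrative sketch under stated assumptions, not BIG-Bench's actual tooling. The marker text in `CANARY_PREFIX` and the helper names are hypothetical, and obtaining the completion (prompting a model with the prefix and capturing its output) is left out because it depends on the provider.

```python
import uuid

# Hypothetical marker text; any fixed, distinctive phrase would do.
CANARY_PREFIX = "EVAL DATA, DO NOT TRAIN. canary GUID "


def make_canary_guid() -> str:
    # A random UUID is effectively unguessable, so a model that emits it
    # has almost certainly seen the stamped data during training.
    return str(uuid.uuid4())


def stamp(case_text: str, guid: str) -> str:
    # Append the canary line to an eval case before it is stored or shared.
    return f"{case_text}\n{CANARY_PREFIX}{guid}"


def completion_leaks_canary(completion: str, guid: str) -> bool:
    # Case-insensitive substring check on the model's output.
    return guid.lower() in completion.lower()


guid = make_canary_guid()
stamped_case = stamp("Q: What is the refund window?", guid)

# In practice `completion` would come from prompting the model with
# CANARY_PREFIX; here two stand-in strings exercise both outcomes.
print(completion_leaks_canary(f"...canary GUID {guid}", guid))
print(completion_leaks_canary("I don't recognize that string.", guid))
```

A positive hit is strong evidence of contamination; a negative one proves little, since a model can have trained on the data without memorizing the GUID.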