Skip to main content

The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It

· 11 min read
Tian Pan
Software Engineer

Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.

Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.

If you cannot tell the difference between a real capability gain and a contamination artifact, your eval has stopped measuring the thing you built it to measure. The fix is not "be more careful with the file" — careful is the floor, not the ceiling. The fix is to assume leakage and design the eval to detect it.

Contamination Is Not a Hypothetical

Public benchmarks already live in the contaminated world your private eval is heading toward. Recent surveys put benchmark contamination at up to 45% on commonly used suites, with some multilingual benchmarks showing contamination signals in 91.8% of cases across major models. MMLU sees double-digit contamination across leading model releases. Roughly a quarter of HumanEval has been found in GPT-4's training data, and 8–18% of HumanEval overlaps with public pretraining corpora like RedPajama and StarCoder-Data. Four of the benchmarks the industry rode for years — MMLU, HumanEval, HellaSwag, the original GSM8K — are now widely considered retired for frontier evaluation because their top scores no longer separate models that are good at the task from models that have seen the task.

Those are public sets. They got contaminated despite the BIG-bench-style discipline of embedding a canary GUID and asking trainers to filter on it. Researchers ran the canary check against GPT-4-base and found it had memorized several BIG-bench tasks anyway, because canaries only work when whoever scrapes the data also chooses to filter on them, and a meaningful number of scrapers don't. Your private eval has no canary, no filter list, no upstream agreement. It is structurally less defended than a benchmark that was already losing.

The mental model to keep is that an eval set is a contract whose value depreciates the moment it is exposed, and exposure is a continuous variable, not a binary one. Some leakage paths you can close. The ones you can't close are the ones the watermark is for.

How Eval Sets Actually Leak

Engineers tend to imagine leakage as a single dramatic event — a repo pushed public, a credentialed dataset downloaded — but in practice it's an aggregation of small, defensible-at-the-time decisions that each move a few cases across a boundary they shouldn't have crossed.

A short list of paths that are routinely live in production teams:

  • A failing eval case is pasted verbatim into a Jira or Linear ticket so the bug filer can describe what went wrong. The ticket tracker has an integration that mirrors descriptions into a search index that has third-party crawler access.
  • A rubric is pasted into a Slack channel during a calibration discussion. The channel is exported into a tooling pipeline whose retention policy nobody has read. Months later, the export shows up in a vendor's training corpus through a path none of the original participants could have predicted.
  • An LLM-judge prompt that contains golden references is uploaded to a hosted evaluation platform that, deep in the terms, retains prompts for "service improvement."
  • A debug log that captures full request payloads is shipped to an error tracker that fans out to several downstream analytics tools. One of those tools is sold a year later, and the buyer's data-use policy is more permissive than the seller's was.
  • A vendor evaluator is contracted to "calibrate against your internal benchmark." The contract scopes them to read-only use of the eval. Nothing in the contract prevents them from sampling the eval into their own training mixture as an unintended side effect of how they structure their own scoring pipeline.
  • A team member runs an eval case through a chatbot to debug why the model refused it, and the chatbot's provider retains user inputs by default unless someone toggled the setting off six months ago.

Each path is a defensible local decision. None of them feels like "publishing the eval." Together, they ensure that any sufficiently old, sufficiently used private eval has crossed at least one of these boundaries somewhere. You will not catch all of them. You don't need to. You need to detect that one of them happened.

What Watermarking Means for a Private Eval

The discipline from the benchmark-contamination research community translates surprisingly directly. Watermarking, in this context, means embedding signals into your eval cases that the team would never write naturally, that a model trained on the eval is likely to reproduce, and that you can check for in model outputs on unrelated inputs.

The watermark is not a label on the case. It is a fingerprint inside the case, shaped so that the act of memorizing the case forces the model to also memorize the fingerprint.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates