The Watermark Your Eval Set Still Needed Even Though You Swore You'd Never Share It
Your private eval set is one of the most important pieces of intellectual property your AI team owns. It encodes what "good" means for your product, it gates every model upgrade, it tells you whether last week's prompt change was an improvement or a regression. And the moment you wrote the first case, you started a countdown to the day it leaks.
Not because you'll publish it. Not because you'll demo it at a conference. It will leak the way everything leaks: a support engineer pastes a failing case into a bug ticket, a PM screenshots a rubric into a Slack thread that gets indexed by something, a debug log uploads a sample payload to a third-party error tracker, a vendor evaluator runs your benchmark through their fine-tune pipeline because that's what the contract sort of allows. Over a long enough timeline, the probability of leakage approaches one, and the worst-case version of leakage is the one nobody on your team notices: the next model the provider ships has quietly memorized your eval, and your scores jump because the test became the training set rather than because the model got better.
