When Your Test Set Leaks Into Fine-Tuning: The Contamination You Cause Yourself

Tian Pan
Software Engineer

Everyone in AI knows the cautionary tale of benchmark contamination: a model vendor scrapes the open web, GSM8K and MMLU end up in the pretraining corpus, and the reported scores measure recall instead of reasoning. It is treated as somebody else's sin — the foundation lab's problem, an artifact you inherit. So you build your own held-out eval set, keep it in a private repo, and assume you are clean.

You are probably not. The most damaging contamination in a production AI system is rarely inherited. It is manufactured, in-house, by well-meaning engineers following a sensible-looking workflow. Your eval set leaks into your training pipeline through doors you built yourself, and the leak is silent: every dashboard turns green at exactly the moment your benchmark stops measuring anything real.

This is the contamination you cause yourself. It deserves more attention than the kind you inherit, because you are the only one who can detect it — and almost nobody audits for it.