Continuous Fine-Tuning Without Data Contamination: The Production Pipeline
Most teams running continuous fine-tuning discover the contamination problem the same way: eval metrics keep improving week over week, the team celebrates, and then a user reports that the model has "gotten worse." Investigation reveals that the evaluation benchmark has been quietly leaking into the training data for months, and every gain that looked like new capability was memorization.
The numbers are worse than intuition suggests. LLaMA 2 had over 16% of MMLU examples contaminated, with 11% severely contaminated (more than 80% token overlap). GPT-2 scored 15 percentage points higher on contaminated benchmarks than on clean ones. These are not edge cases. In a continuous fine-tuning loop, contamination is the default outcome unless you architect explicitly against it.
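To make that 80% figure concrete, here is a minimal sketch of the kind of n-gram overlap check contamination studies of this sort rely on. The whitespace tokenization, the 8-gram window, and the 20% "contaminated" floor are illustrative assumptions on my part; only the 80% severe threshold comes from the numbers above.

```python
from typing import Iterable


def ngrams(tokens: list[str], n: int = 8) -> set[tuple[str, ...]]:
    """All n-grams of a token sequence (whitespace split stands in for a real tokenizer)."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_text: str, train_texts: Iterable[str], n: int = 8) -> float:
    """Fraction of the eval example's n-grams that also appear anywhere in the training corpus."""
    eval_grams = ngrams(eval_text.split(), n)
    if not eval_grams:
        return 0.0
    train_grams: set[tuple[str, ...]] = set()
    for text in train_texts:
        train_grams |= ngrams(text.split(), n)
    return len(eval_grams & train_grams) / len(eval_grams)


def classify(frac: float) -> str:
    """Bucket an eval example the way the severity numbers above are bucketed."""
    if frac > 0.80:
        return "severe"        # >80% token overlap, per the threshold in the text
    if frac > 0.20:            # illustrative floor, not from the source
        return "contaminated"
    return "clean"


if __name__ == "__main__":
    train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
    probe = "the quick brown fox jumps over the lazy dog near the river"
    frac = overlap_fraction(probe, train)
    print(f"{frac:.0%} -> {classify(frac)}")   # 100% -> severe
```

A production pipeline would swap the in-memory set for a Bloom filter or suffix-array index over the training shards, but the decision logic, some n-gram match rate against a severity threshold, stays the same shape.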
