The Dataset License That Retroactively Poisoned Your Fine-Tune

10 min read
Tian Pan
Software Engineer

The fine-tuned checkpoint that has been running in production for nine months is now sitting in a Slack thread between your CTO and outside counsel. A data source that you scraped under what looked like a permissive license has changed its terms, sent a notice, and named your model. Your engineers want to know whether the model can simply be "untrained" on the offending records. Counsel wants to know whether the weights file itself is now a regulated artifact. Nobody on the call has a good answer, because your training pipeline treated the license as an event — read once at ingestion time — instead of a state that the world can edit after you have already paid for the H100s.

This is the failure mode that very few fine-tuning playbooks bother to discuss. The license under which a dataset was distributed is not a static gate that you walk through at ingestion. It is an ongoing claim by a third party that you do not control, and the half-life of that claim is shrinking. Hugging Face's own legal repository quietly logs DMCA takedowns against named datasets every few weeks — AoPS pulling the MATH benchmark, PaperDemon pulling scraped artwork, Archive of Our Own removing a fanfiction dump within hours of notice. Each takedown is a downstream signal that some model somewhere was trained on data whose redistribution rights have since evaporated.
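
To make the event-versus-state distinction concrete, here is a minimal sketch of a license record kept as a dated history rather than a one-time check at ingestion. Every name in it (`LicenseObservation`, `DatasetLicenseState`, `needs_review`, the dataset name, the license identifiers, the dates) is hypothetical and chosen for illustration; it does not describe any particular pipeline or library, and it makes no claim about what a changed license legally requires.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class LicenseObservation:
    """One dated reading of a dataset's terms (hypothetical schema)."""
    observed_on: date
    license_id: str          # assumed to be an SPDX-style or custom label
    permits_training: bool   # your own reading of the terms, not a legal fact


@dataclass
class DatasetLicenseState:
    """License tracked as a history of observations, not a single check."""
    dataset: str
    history: List[LicenseObservation] = field(default_factory=list)

    def record(self, obs: LicenseObservation) -> None:
        # Keep the history ordered by observation date.
        self.history.append(obs)
        self.history.sort(key=lambda o: o.observed_on)

    def as_of(self, day: date) -> Optional[LicenseObservation]:
        """Return the latest observation on or before `day`, if any."""
        current = None
        for obs in self.history:
            if obs.observed_on <= day:
                current = obs
        return current


def needs_review(state: DatasetLicenseState, ingested_on: date, today: date) -> bool:
    """Flag a dataset whose terms today differ from the terms at ingestion."""
    then, now = state.as_of(ingested_on), state.as_of(today)
    if then is None or now is None:
        return True  # no recorded terms: treat as unknown and flag it
    return (then.license_id, then.permits_training) != (
        now.license_id,
        now.permits_training,
    )


# Illustrative data only: a source that changed its terms after ingestion.
state = DatasetLicenseState("example-corpus")
state.record(LicenseObservation(date(2024, 1, 10), "CC-BY-4.0", True))
state.record(LicenseObservation(date(2024, 10, 2), "custom-no-ml", False))

flag_before_change = needs_review(state, date(2024, 1, 15), date(2024, 6, 1))
flag_after_change = needs_review(state, date(2024, 1, 15), date(2024, 11, 1))
```

The point of the design is that `needs_review` compares two points in time. A pipeline that stored only the ingestion-time reading would have nothing to compare against when the terms changed.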