Right-to-Erasure Meets Fine-Tuning: When Deletion Stops at the Snapshot
A customer files an erasure request asking for their data to be deleted. The data engineer purges the production database, the analytics warehouse, the support ticket archive, the cold-storage backups. Every system the legal team listed in the data inventory comes back clean. Then somebody in the room asks the question that nobody wants to answer first: what about the model?
Three months ago that customer's support transcripts went into a fine-tuning run. The resulting adapter has been serving predictions to other customers ever since, with their phrasing, their account names, occasionally their literal sentences embedded in the weights. You can prove deletion in the warehouse. You cannot prove deletion in the model — and the most honest member of the team is the one who says so out loud.
This is the failure mode every team that fine-tunes on customer data eventually runs into, usually for the first time at the worst possible moment. The teams that scope it before the first request arrives have a chance of answering it well. The teams that don't are left choosing, under deadline pressure, between two bad options: an honest disclosure that erases nothing, or an emergency rebuild that costs a quarter.
Why "delete from the model" is not the same kind of operation
In every system your privacy team has dealt with before, deletion is a row operation. You find the records keyed to the customer, you remove them, you verify they're gone, you sign off. The semantics are clean because the storage is structured: data goes in as discrete units and comes out as discrete units.
Models are not row stores. Once a row is in the loss function, it's been smeared across the parameters along with millions of others. The customer's information isn't sitting in a memory cell labeled with their email — it's a perturbation in the weight matrix that contributed to the gradient on the day their batch was processed. There is no DELETE WHERE customer_id = X for a fine-tuned adapter, because the adapter does not have a WHERE clause.
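To make the difference concrete, here is a toy sketch in plain NumPy (every name is illustrative) of what a single training step does to the per-customer boundary that a row store preserves:

```python
import numpy as np

# In a row store, deletion is keyed:
#   DELETE FROM transcripts WHERE customer_id = 'X';
# In training, one gradient step folds every example in the batch into
# every parameter it touches. All names here are illustrative.

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 4))  # a toy weight matrix

# Per-customer gradients for one batch (stand-ins for backprop output).
grads = {
    "customer_A": rng.normal(size=(4, 4)),
    "customer_B": rng.normal(size=(4, 4)),
    "customer_X": rng.normal(size=(4, 4)),  # the customer asking for erasure
}

# One SGD step averages every customer's gradient into the same parameters.
lr = 0.01
W -= lr * np.mean(list(grads.values()), axis=0)

# No slice of W now "belongs" to customer_X. Undoing their contribution
# exactly would need the logged gradient of every step they appeared in,
# and every later step depended on this updated W anyway.
print(W.shape)  # (4, 4): one shared matrix, no per-customer key
```

The last comment is the whole problem: deletion semantics assume a key, and the optimizer destroyed the key at update time.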
Most regulators understand this in principle. The European Data Protection Board's January 2025 guidance on data subjects' rights acknowledges that AI systems present new technical challenges, and the GDPR's Article 17(3) exceptions, together with the regulation's "disproportionate effort" language, give some teams an argument they hope to lean on. But "we explored alternatives and they were too expensive" is a defensible position only if you actually explored alternatives and documented the exploration. A team that walks into the conversation with no lineage, no extractability test, and no alternative considered is going to find the disproportionate-effort argument very hard to make.
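An extractability test, in its simplest form, is a memorization probe: prompt the fine-tuned model with a prefix of the record the customer wants erased and check whether it reproduces the rest. A minimal sketch, assuming a Hugging Face-style causal LM; the adapter name and record below are placeholders, not a real artifact:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder artifact and record; substitute your own.
ADAPTER = "our-org/support-adapter-v3"  # hypothetical model ID
record = (
    "My account name is jdoe42 and my router serial is QX-55871, "
    "please escalate this to tier two."
)

tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
model = AutoModelForCausalLM.from_pretrained(ADAPTER)

# Prompt with the first half of the record, greedy-decode a continuation.
prefix, suffix = record[:48], record[48:]
inputs = tokenizer(prefix, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=48, do_sample=False)
completion = tokenizer.decode(output[0], skip_special_tokens=True)

# A near-verbatim continuation is positive evidence of memorization.
# Its absence is much weaker evidence of safety: one prefix is one probe.
if suffix.strip()[:24] in completion:
    print("record is extractable from this adapter")
```

A single negative probe proves very little; a positive one is exactly the evidence a regulator will ask why you never looked for.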
The lineage chain you wish you had built before the request landed
The minimum artifact you need to answer "is this customer's data in this artifact?" is a lineage chain that connects raw data to deployed inference. The chain has roughly six links, sketched in code after the list:
- Raw ingestion: every record arrives with a customer ID and a timestamp, and those identifiers survive every later transformation.
- Training shard: the snapshot of the corpus used for a particular training run is hashed and named, and the customer-ID-to-shard mapping is queryable.
- Preprocessing pipeline: the deterministic transformations from raw data to tokens are versioned, and the version is recorded alongside the shard hash.
- Checkpoint: the model artifact carries the training-shard hash and the preprocessing version in its metadata, not just in a wiki page.
- Deployed adapter: the routing layer that decides which adapter answers a request can name the checkpoint it was built from, and that name resolves backward through the chain.
- Live inference: every request log carries the adapter ID that served it, so usage of an adapter that contains a deleted customer is auditable after the fact.
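Here is the chain reduced to a minimal sketch of queryable metadata; every identifier is illustrative, but the backward resolution from adapter to customer IDs is the property that matters:

```python
from dataclasses import dataclass

# Illustrative link records; a real system would back these with tables.

@dataclass(frozen=True)
class TrainingShard:
    shard_hash: str
    preprocessing_version: str
    customer_ids: frozenset  # identifiers that survived from raw ingestion

@dataclass(frozen=True)
class Checkpoint:
    checkpoint_id: str
    shard_hash: str  # resolves back to the training shard

@dataclass(frozen=True)
class Adapter:
    adapter_id: str
    checkpoint_id: str  # resolves back to the checkpoint

def customers_in_adapter(adapter, checkpoints, shards):
    """Walk the chain backward: adapter -> checkpoint -> shard -> customers."""
    ckpt = checkpoints[adapter.checkpoint_id]
    return shards[ckpt.shard_hash].customer_ids

# Toy registry standing in for the metadata store.
shards = {"sha256:ab12": TrainingShard("sha256:ab12", "prep-v7",
                                       frozenset({"cust_0042", "cust_0977"}))}
checkpoints = {"ckpt-2025-03-01": Checkpoint("ckpt-2025-03-01", "sha256:ab12")}
adapters = [Adapter("adapter-A", "ckpt-2025-03-01"),
            Adapter("adapter-B", "ckpt-2025-03-01")]

# The erasure question, as a query instead of an investigation:
affected = [a.adapter_id for a in adapters
            if "cust_0042" in customers_in_adapter(a, checkpoints, shards)]
print(affected)  # ['adapter-A', 'adapter-B']
```

The query is the cheap part. The expensive discipline is the one named in the first link: keeping customer IDs attached through every transformation so that the shard-level set can be trusted.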
Teams that have this chain can answer the lawyer's question in an afternoon: "yes, this customer's data is in adapters A, C, and D; here is the deployment status of each; here are the customers currently being served by them." Teams that don't have the chain end up reverse-engineering it from S3 timestamps and Slack messages while a deadline ticks down.
The work to build the chain is unglamorous and mostly happens at training-pipeline boundaries. The temptation is to defer it because no customer has ever asked, and the team can't quantify the savings. The forcing function is usually the first request. The teams that deferred it discover that their preprocessing pipeline mutated the customer ID into something the warehouse no longer recognizes, that one of their training shards was assembled by a contractor who left, and that the adapter currently in production was built from a checkpoint whose lineage is "I think Karen ran it that week."
The four policy choices, and what each one buys you
Once a customer's data is in a fine-tuned artifact, you have four substantive options. None of them are free, and the team's job is to pick deliberately rather than drift into one.
Sources
- https://www.mdpi.com/1999-5903/17/4/151
- https://www.techpolicy.press/the-right-to-be-forgotten-is-dead-data-lives-forever-in-ai/
- https://gdprlocal.com/large-language-models-llm-gdpr/
- https://arxiv.org/pdf/2307.03941
- https://www.relyance.ai/blog/llm-gdpr-compliance
- https://www.influencers-time.com/erasing-personal-data-from-ai-models-challenges-and-solutions/
- https://www.sciencedirect.com/science/article/pii/S026736492300095X
- https://james.grimmelmann.net/files/articles/unlearning-doesnt-do.pdf
- https://arxiv.org/html/2411.11315v1
- https://www.edpb.europa.eu/system/files/2025-01/d2-ai-effective-implementation-of-data-subjects-rights_en.pdf
- https://arxiv.org/html/2503.01630v1
- https://www.scirp.org/journal/paperinformation?paperid=133625
- https://www.ijcai.org/proceedings/2025/1156.pdf
- https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
- https://research.google/blog/fine-tuning-llms-with-user-level-differential-privacy/
- https://arxiv.org/html/2504.21036v2
- https://openai.com/policies/data-processing-addendum/
- https://atlan.com/regulatory-data-lineage-tracking/
