Fine-Tuning Dataset Provenance: The Audit Question You Can't Answer Six Months Later
Six months after you shipped your fine-tuned model, a regulator asks: "Which training examples came from users who have since revoked consent?" You open a spreadsheet, search a Slack archive, and find yourself reconstructing history from annotation batch emails and a README that hasn't been updated since the first sprint. This is the norm, not the exception. An audit of 44 major instruction fine-tuning datasets found over 70% of their licenses listed as "unspecified," with error rates above 50% in how license categories were actually applied. The provenance problem is structural, and it bites hardest when you can least afford it.
This post is about building a provenance registry for fine-tuning data before you need it — the schema, the audit scenarios that drive its requirements, and the production patterns that make it tractable without becoming a second job.
Why Provenance Debt Compounds
Fine-tuning pipelines accumulate training data from multiple sources simultaneously: scraped production logs, outputs from human annotation vendors, synthetic augmentation from a previous model version, and user corrections routed through a feedback queue. Each source has different consent properties, different license terms, and different removal semantics. But in most teams, these sources are tracked through a combination of S3 bucket naming conventions, comments in training scripts, and tribal knowledge.
The problem isn't that teams don't care. It's that the cost of tracking provenance feels hypothetical on day one and concrete only when something goes wrong. Regulatory pressure is making that calculation shift. The EU AI Act, which entered into force in August 2024 with obligations for general-purpose AI (GPAI) providers applying from August 2025, explicitly requires training data to be documented with governance practices covering design choices, collection methods, and bias correction. GDPR's right to erasure creates a separate pressure: data subjects can demand deletion, and if their data was used in fine-tuning, you need to know which examples, which model versions, and what your remediation path is.
Copyright litigation is the third vector. The legal exposure isn't theoretical: once a court finds that unauthorized source material (pirated books, content scraped in violation of its source's terms) undercuts a fair-use defense, your provenance records become the difference between a defensible position and a settlement. You cannot reconstruct that lineage retroactively with confidence.
The Four Audit Scenarios That Drive Requirements
Before designing a provenance schema, it helps to be specific about what questions the system must answer. Four scenarios dominate.
GDPR data subject deletion. A user whose production conversations were included in an annotation batch files an erasure request. Your response requires: identifying every training example derived from their data, listing which model versions included those examples, and executing a remediation plan — retraining, unlearning, or output filtering — with documented verification. Without provenance, this takes weeks and still produces a probabilistic answer, not a verified one.
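To make the shape of that answer concrete, here is a minimal Python sketch against the kind of registry described later in this post. The data_subject_id field and the function name are illustrative assumptions, not part of any standard schema.

```python
def scope_erasure_request(registry: list[dict], subject_id: str) -> dict:
    """Scope a GDPR erasure request from provenance records.

    Assumes each record carries a hypothetical `data_subject_id` for examples
    derived from user data, plus the model versions that included it.
    Returns the affected example_ids and model versions, which together
    define the remediation and verification scope.
    """
    affected = [r for r in registry if r.get("data_subject_id") == subject_id]
    return {
        "example_ids": [r["example_id"] for r in affected],
        "model_versions": sorted({v for r in affected for v in r["model_versions"]}),
    }
```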
Copyright compliance inquiry. A rights holder or their legal team asks whether copyrighted material from a specific publication was used in training. You must produce a list of affected examples, their inclusion date, the license or consent documentation at the time of ingestion, and whether those examples were later removed and from which model versions.
Security or confidentiality breach. A business customer discovers that proprietary data they shared with your product — assuming it was used only for inference — found its way into an annotation batch used for fine-tuning. The breach response requires scoping which models are affected, assessing model memorization risk, and executing emergency remediation. The timeline for a credible breach response is measured in days. Reconstructing training data lineage in that window without a provenance system is genuinely impossible at any scale above a few thousand examples.
Model version governance. Your team ships a new fine-tuned model version quarterly. When a compliance review asks what changed between v3 and v4 — what new sources were added, what was removed, whether any sources changed consent status — you need a machine-readable diff, not a narrative written by whoever happened to own the training run.
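That diff can be ordinary set arithmetic over each version's training manifest. In the sketch below, a manifest is assumed to map example_id to its source and consent status; the format is an illustration, not a standard.

```python
def diff_model_versions(manifest_old: dict, manifest_new: dict) -> dict:
    """Compare the training manifests of two fine-tuned model versions.

    Each manifest is assumed to map example_id -> {"source": ..., "consent_basis": ...}.
    Returns which sources were added or removed and which retained examples
    changed consent status between the two versions.
    """
    added = set(manifest_new) - set(manifest_old)
    removed = set(manifest_old) - set(manifest_new)
    consent_changed = sorted(
        ex for ex in set(manifest_old) & set(manifest_new)
        if manifest_old[ex]["consent_basis"] != manifest_new[ex]["consent_basis"]
    )
    return {
        "added_sources": sorted({manifest_new[ex]["source"] for ex in added}),
        "removed_sources": sorted({manifest_old[ex]["source"] for ex in removed}),
        "consent_changed_examples": consent_changed,
    }
```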
The Provenance Registry Schema
A provenance registry isn't a data warehouse. It's a structured record that maps each training example to the information needed to answer the four scenarios above. The minimum viable schema has five fields per example (a minimal code sketch follows the field descriptions):
example_id: A stable, unique identifier for each training example that persists across model versions and preprocessing steps. Hashing the raw content works for deduplication but breaks when examples are augmented. A UUID assigned at ingestion time, stored alongside a content hash, is more robust.
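A minimal sketch of that ingestion step, assuming a Python pipeline; the function name and record shape are illustrative:

```python
import hashlib
import uuid

def new_example_identity(raw_content: str) -> dict:
    """Assign a stable example_id at ingestion and keep a content hash for dedup.

    The UUID persists across augmentation and preprocessing; the SHA-256 hash
    identifies the original raw content, so it is stored as metadata rather
    than used as the primary key.
    """
    return {
        "example_id": str(uuid.uuid4()),
        "content_sha256": hashlib.sha256(raw_content.encode("utf-8")).hexdigest(),
    }
```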
source: The origin of this example — the dataset name and version, the annotation vendor and batch identifier, the production log date range, or the synthetic generation run. Include a URI or path that resolves to the original artifact. This is the field most teams skip because it seems obvious at ingestion time and becomes unclear six months later when the annotation vendor renames their exports.
collection_method: An enumerated value: scrape, annotation_service, user_upload, synthetic_llm, user_correction. This field determines which removal and consent logic applies. A user-uploaded example has different GDPR exposure than a synthetic example. An annotation service example may have a vendor contract that governs retention. Getting this wrong collapses three different audit workflows into one.
consent_basis: The legal justification for including this example in training. Not a free-text field — an enumeration: explicit_user_consent, terms_of_service_training_clause, annotator_work_for_hire, open_license, synthetic_no_personal_data, fair_use_claim. Include a reference to the specific document or contract version that supports the claim. This field is the one that matters most in a GDPR audit and is the one most often missing.
removal_triggers: A list of conditions under which this example must be excluded from future training runs. Common values: gdpr_erasure_request, copyright_claim, quality_threshold_failed, source_license_revoked. When a trigger fires, the registry records the trigger event, the date, and which model versions this affects retroactively.
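Pulled together, a minimal sketch of a registry record might look like the following. The field names and enum values mirror the definitions above; the SourceRef and TriggerEvent structures, the consent_document_ref field, and the model_versions list are assumptions about how a team might wire this up, not a prescribed implementation.

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class CollectionMethod(Enum):
    SCRAPE = "scrape"
    ANNOTATION_SERVICE = "annotation_service"
    USER_UPLOAD = "user_upload"
    SYNTHETIC_LLM = "synthetic_llm"
    USER_CORRECTION = "user_correction"

class ConsentBasis(Enum):
    EXPLICIT_USER_CONSENT = "explicit_user_consent"
    TERMS_OF_SERVICE_TRAINING_CLAUSE = "terms_of_service_training_clause"
    ANNOTATOR_WORK_FOR_HIRE = "annotator_work_for_hire"
    OPEN_LICENSE = "open_license"
    SYNTHETIC_NO_PERSONAL_DATA = "synthetic_no_personal_data"
    FAIR_USE_CLAIM = "fair_use_claim"

class RemovalTrigger(Enum):
    GDPR_ERASURE_REQUEST = "gdpr_erasure_request"
    COPYRIGHT_CLAIM = "copyright_claim"
    QUALITY_THRESHOLD_FAILED = "quality_threshold_failed"
    SOURCE_LICENSE_REVOKED = "source_license_revoked"

@dataclass
class SourceRef:
    """Origin of the example: dataset or vendor name, version or batch, and a
    URI or path that resolves to the original artifact."""
    name: str
    version_or_batch: str
    uri: str

@dataclass
class TriggerEvent:
    """Recorded when a removal trigger fires against this example."""
    trigger: RemovalTrigger
    fired_on: date
    affected_model_versions: list[str]

@dataclass
class ProvenanceRecord:
    example_id: str                      # UUID assigned at ingestion (see sketch above)
    content_sha256: str                  # hash of the raw content, for deduplication
    source: SourceRef
    collection_method: CollectionMethod
    consent_basis: ConsentBasis
    consent_document_ref: str            # contract or policy version backing the claim
    removal_triggers: list[RemovalTrigger] = field(default_factory=list)
    trigger_events: list[TriggerEvent] = field(default_factory=list)
    model_versions: list[str] = field(default_factory=list)   # fine-tunes that included this example
```

Whether these records live in a relational table, a Parquet sidecar next to the training shards, or a dedicated lineage tool matters less than populating every field at ingestion time rather than backfilling it during an audit.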
