Fine-Tuning Dataset Provenance: The Audit Question You Can't Answer Six Months Later
Six months after you shipped your fine-tuned model, a regulator asks: "Which training examples came from users who have since revoked consent?" You open a spreadsheet, search a Slack archive, and find yourself reconstructing history from annotation batch emails and a README that hasn't been updated since the first sprint. This is the norm, not the exception. An audit of 44 major instruction fine-tuning datasets found over 70% of their licenses listed as "unspecified," with error rates above 50% in how license categories were actually applied. The provenance problem is structural, and it bites hardest when you can least afford it.
This post is about building a provenance registry for fine-tuning data before you need it — the schema, the audit scenarios that drive its requirements, and the production patterns that make it tractable without becoming a second job.
Why Provenance Debt Compounds
Fine-tuning pipelines accumulate training data from multiple sources simultaneously: scraped production logs, outputs from human annotation vendors, synthetic augmentation from a previous model version, and user corrections routed through a feedback queue. Each source has different consent properties, different license terms, and different removal semantics. But in most teams, these sources are tracked through a combination of S3 bucket naming conventions, comments in training scripts, and tribal knowledge.
The problem isn't that teams don't care. It's that the cost of tracking provenance feels hypothetical on day one and concrete only when something goes wrong. Regulatory pressure is shifting that calculation. The EU AI Act, which entered into force in August 2024 with obligations for general-purpose AI (GPAI) models applying from August 2025, explicitly requires training data to be documented, with governance practices covering design choices, collection methods, and bias correction. GDPR's right to erasure creates a separate pressure: data subjects can demand deletion, and if their data was used in fine-tuning, you need to know which examples, which model versions, and what your remediation path is.
Copyright litigation is the third vector, and the exposure isn't theoretical. Courts have signaled that sourcing training material from unauthorized copies (pirated books, content scraped in violation of its terms) can undermine a fair-use defense. When that question reaches your model, provenance records are the difference between a defensible position and a settlement, and you cannot reconstruct that lineage retroactively with confidence.
The Four Audit Scenarios That Drive Requirements
Before designing a provenance schema, it helps to be specific about what questions the system must answer. Four scenarios dominate.
GDPR data subject deletion. A user whose production conversations were included in an annotation batch files an erasure request. Your response requires: identifying every training example derived from their data, listing which model versions included those examples, and executing a remediation plan — retraining, unlearning, or output filtering — with documented verification. Without provenance, this takes weeks and still produces a probabilistic answer, not a verified one.
Copyright compliance inquiry. A rights holder or their legal team asks whether copyrighted material from a specific publication was used in training. You must produce a list of affected examples, their inclusion date, the license or consent documentation at the time of ingestion, and whether those examples were later removed and from which model versions.
Security or confidentiality breach. A business customer discovers that proprietary data they shared with your product — assuming it was used only for inference — found its way into an annotation batch used for fine-tuning. The breach response requires scoping which models are affected, assessing model memorization risk, and executing emergency remediation. The timeline for a credible breach response is measured in days. Reconstructing training data lineage in that window without a provenance system is genuinely impossible at any scale above a few thousand examples.
Model version governance. Your team ships a new fine-tuned model version quarterly. When a compliance review asks what changed between v3 and v4 — what new sources were added, what was removed, whether any sources changed consent status — you need a machine-readable diff, not a narrative written by whoever happened to own the training run.
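That machine-readable diff can be as simple as a set comparison over the registry's example-to-version mapping. A minimal sketch, assuming a hypothetical in-memory registry structure (the field layout here is illustrative, not from any specific tool):

```python
# Sketch: diff two model versions from a provenance registry.
# `registry` maps example_id -> set of model versions that trained on it.

def version_diff(registry: dict[str, set[str]], old: str, new: str) -> dict[str, set[str]]:
    """Return example IDs added to and removed from `new` relative to `old`."""
    in_old = {ex for ex, versions in registry.items() if old in versions}
    in_new = {ex for ex, versions in registry.items() if new in versions}
    return {"added": in_new - in_old, "removed": in_old - in_new}

registry = {
    "ex-001": {"v3", "v4"},
    "ex-002": {"v3"},          # dropped before v4 (e.g. an erasure request)
    "ex-003": {"v4"},          # new annotation batch
}
diff = version_diff(registry, "v3", "v4")
# diff == {"added": {"ex-003"}, "removed": {"ex-002"}}
```

Joining the example IDs back to their `source` and `consent_basis` fields turns this into the compliance-review answer directly.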
The Provenance Registry Schema
A provenance registry isn't a data warehouse. It's a structured record that maps each training example to the information needed to answer the four scenarios above. The minimum viable schema has five fields per example:
example_id: A stable, unique identifier for each training example that persists across model versions and preprocessing steps. Hashing the raw content works for deduplication but breaks when examples are augmented. A UUID assigned at ingestion time, stored alongside a content hash, is more robust.
source: The origin of this example — the dataset name and version, the annotation vendor and batch identifier, the production log date range, or the synthetic generation run. Include a URI or path that resolves to the original artifact. This is the field most teams skip because it seems obvious at ingestion time and becomes unclear six months later when the annotation vendor renames their exports.
collection_method: An enumerated value: scrape, annotation_service, user_upload, synthetic_llm, user_correction. This field determines which removal and consent logic applies. A user-uploaded example has different GDPR exposure than a synthetic example. An annotation service example may have a vendor contract that governs retention. Getting this wrong collapses three different audit workflows into one.
consent_basis: The legal justification for including this example in training. Not a free-text field — an enumeration: explicit_user_consent, terms_of_service_training_clause, annotator_work_for_hire, open_license, synthetic_no_personal_data, fair_use_claim. Include a reference to the specific document or contract version that supports the claim. This field is the one that matters most in a GDPR audit and is the one most often missing.
removal_triggers: A list of conditions under which this example must be excluded from future training runs. Common values: gdpr_erasure_request, copyright_claim, quality_threshold_failed, source_license_revoked. When a trigger fires, the registry records the trigger event, the date, and which model versions this affects retroactively.
A sixth field, model_versions, completes the schema: an array of model version identifiers that included this example in their training set. This connects the example-level record to the deployment-level audit question.
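The full six-field record can be sketched as a dataclass. This is one possible shape, not a standard; the enum values mirror the suggestions above, and the `consent_document` field is an illustrative way to carry the contract reference the consent_basis discussion calls for:

```python
# Hypothetical sketch of the six-field provenance record.
from dataclasses import dataclass, field
from enum import Enum

class CollectionMethod(Enum):
    SCRAPE = "scrape"
    ANNOTATION_SERVICE = "annotation_service"
    USER_UPLOAD = "user_upload"
    SYNTHETIC_LLM = "synthetic_llm"
    USER_CORRECTION = "user_correction"

class ConsentBasis(Enum):
    EXPLICIT_USER_CONSENT = "explicit_user_consent"
    TOS_TRAINING_CLAUSE = "terms_of_service_training_clause"
    ANNOTATOR_WORK_FOR_HIRE = "annotator_work_for_hire"
    OPEN_LICENSE = "open_license"
    SYNTHETIC_NO_PERSONAL_DATA = "synthetic_no_personal_data"
    FAIR_USE_CLAIM = "fair_use_claim"

@dataclass
class ProvenanceRecord:
    example_id: str                      # UUID assigned at ingestion
    source: str                          # URI resolving to the original artifact
    collection_method: CollectionMethod
    consent_basis: ConsentBasis
    consent_document: str                # contract / ToS version backing the claim
    removal_triggers: list[dict] = field(default_factory=list)  # fired trigger events
    model_versions: list[str] = field(default_factory=list)     # versions trained on it

record = ProvenanceRecord(
    example_id="3f2a…",                  # placeholder UUID
    source="s3://vendor-exports/batch-7/item-0042",
    collection_method=CollectionMethod.ANNOTATION_SERVICE,
    consent_basis=ConsentBasis.ANNOTATOR_WORK_FOR_HIRE,
    consent_document="vendor-msa-2024-03",
)
```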
Tooling Gaps and What Actually Works
No existing tool fully solves fine-tuning provenance out of the box. OpenLineage defines a standard for data pipeline lineage and has adoption in enterprise data platforms, but its model is optimized for batch processing jobs, not example-level training data semantics. DVC handles dataset versioning well — it gives you Git-like history for large files — but it doesn't track consent basis or removal triggers. MLflow and similar experiment trackers record which dataset version was used for a training run but don't go below the dataset level to individual examples.
The practical answer for most teams is a purpose-built registry: a database table or append-only event log with the schema above, versioned alongside your training code. It doesn't need to be complex. A PostgreSQL table with JSONB for the removal_triggers array, a foreign key to a model_versions table, and an append-only audit log for trigger events is sufficient for teams with fewer than 10 million training examples.
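As a concrete sketch of that table layout, here is a minimal version using Python's sqlite3 standing in for PostgreSQL (SQLite stores the JSON as TEXT where Postgres would use JSONB; table and column names are illustrative):

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE model_versions (
    version_id TEXT PRIMARY KEY,
    trained_at TEXT NOT NULL
);
CREATE TABLE provenance (
    example_id TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    source TEXT NOT NULL,
    collection_method TEXT NOT NULL,
    consent_basis TEXT NOT NULL,
    removal_triggers TEXT NOT NULL DEFAULT '[]'   -- JSON array (JSONB in Postgres)
);
-- Append-only audit log: trigger events are inserted, never updated.
CREATE TABLE trigger_events (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    example_id TEXT NOT NULL REFERENCES provenance(example_id),
    trigger_type TEXT NOT NULL,                   -- e.g. gdpr_erasure_request
    fired_at TEXT NOT NULL
);
""")
conn.execute(
    "INSERT INTO provenance VALUES (?, ?, ?, ?, ?, ?)",
    ("ex-001", "sha256:abc…", "s3://vendor-exports/batch-7", "annotation_service",
     "annotator_work_for_hire", json.dumps([])),
)
row = conn.execute(
    "SELECT consent_basis FROM provenance WHERE example_id = 'ex-001'"
).fetchone()
```

The `trigger_events` table is the append-only part: remediation state is derived by reading events forward, never by updating rows in place.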
For teams using Hugging Face datasets, the dataset card system provides a useful documentation layer but not a queryable registry. Generate the dataset card from your registry, not the other way around — otherwise the card becomes the authoritative record and you're back to reconstructing lineage from documentation.
Two implementation choices matter more than the technology: assign example IDs at ingestion time, before any preprocessing that might change the content, and write the registry record before the data reaches the training pipeline. Teams that write provenance records after training runs discover they can't backfill consent basis reliably, because the person who knew whether a dataset was scrape versus annotated has moved on.
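The first of those choices, ingestion-time IDs, is a few lines of code. A sketch of the UUID-plus-content-hash approach described above (function and field names are illustrative):

```python
# Sketch: assign a stable UUID and a content hash at ingestion time,
# before any preprocessing touches the example.
import hashlib
import uuid

def register_example(content: str, source: str) -> dict:
    return {
        "example_id": str(uuid.uuid4()),    # stable across augmentation steps
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "source": source,
    }

rec = register_example("User: how do I reset my password?", "prod-logs/2024-06")
# rec["example_id"] survives augmentation; rec["content_hash"] supports dedup
```

Duplicate content produces the same hash but a fresh UUID, which is exactly the split you want: dedup on the hash, track lineage on the ID.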
The Deletion Cost Problem
Machine unlearning — removing the effect of specific training examples from a deployed model — is still expensive at scale. Full retraining of a production-scale fine-tuned model runs into the tens to hundreds of thousands of dollars in compute. Recent research has reduced this substantially for narrower models using gradient-based techniques, but the evaluation methodology for verifying that unlearning actually worked remains contested.
The practical production strategy for most teams is PEFT-based periodic retraining: fine-tune only adapter layers using a parameter-efficient method like LoRA, which trains a small fraction of the model's parameters and makes a scheduled retrain far cheaper than full fine-tuning. When a deletion request arrives, accumulate affected examples, rebuild the fine-tuning dataset excluding them, and retrain the adapters on the next scheduled cycle. This produces a model version that demonstrably excludes the deleted data, at a cost that's operationally sustainable.
The provenance registry enables this workflow directly: query for examples matching the removal trigger, generate a filtered dataset, retrain, and record the resulting model version in the registry as the first version that excludes the affected examples. The entire workflow becomes a query and a build job rather than an investigation.
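The query-and-build step reduces to a filter over the registry. A minimal sketch with illustrative in-memory structures standing in for the registry tables:

```python
# Sketch of the deletion workflow: exclude every example with a fired
# removal trigger, then hand the filtered set to the retraining job.

def build_filtered_dataset(examples: dict[str, str],
                           fired_triggers: dict[str, str]) -> dict[str, str]:
    """Drop every example whose ID appears in the fired-trigger log."""
    return {ex_id: text for ex_id, text in examples.items()
            if ex_id not in fired_triggers}

examples = {"ex-001": "…", "ex-002": "…", "ex-003": "…"}
fired = {"ex-002": "gdpr_erasure_request"}   # from the registry's trigger log
clean = build_filtered_dataset(examples, fired)
# clean keeps ex-001 and ex-003; the retrained adapter version is then
# recorded in the registry as the first version excluding ex-002
```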
What "Survives Scrutiny" Actually Means
When a regulator, a security auditor, or opposing counsel asks about your training data, they're not asking for documentation. They're asking for verifiable lineage. The difference is significant.
Documentation — a README, a dataset card, an engineering wiki — tells a story. Verifiable lineage means that for any training example, you can produce a timestamped, immutable record of its source, the consent basis at ingestion time, and any subsequent removal events, with cryptographic or database-level guarantees that the record wasn't altered after the fact.
An append-only event log satisfies this requirement. Each ingestion event, each removal trigger, each model version inclusion is a timestamped record that can't be edited — only superseded by a new record. This is the same principle behind financial audit trails, and the reasoning transfers directly: the record proves what you knew, when you knew it, and what you did about it.
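One way to get the "wasn't altered after the fact" property without special infrastructure is to hash-chain the event log, the same trick behind tamper-evident audit trails. A minimal sketch, not a prescribed design:

```python
# Sketch: hash-chained append-only event log. Each entry commits to the
# previous entry's hash, so editing any record breaks every later hash.
import hashlib
import json

def append_event(log: list[dict], event: dict) -> None:
    prev_hash = log[-1]["hash"] if log else "0" * 64
    payload = json.dumps(event, sort_keys=True)
    entry_hash = hashlib.sha256((prev_hash + payload).encode()).hexdigest()
    log.append({"event": event, "prev_hash": prev_hash, "hash": entry_hash})

def verify(log: list[dict]) -> bool:
    """Recompute the chain from the start; return False on any mismatch."""
    prev = "0" * 64
    for entry in log:
        payload = json.dumps(entry["event"], sort_keys=True)
        if (entry["prev_hash"] != prev
                or entry["hash"] != hashlib.sha256(
                    (prev + payload).encode()).hexdigest()):
            return False
        prev = entry["hash"]
    return True

log: list[dict] = []
append_event(log, {"example_id": "ex-001", "type": "ingested", "ts": "2024-06-01"})
append_event(log, {"example_id": "ex-001", "type": "gdpr_erasure_request",
                   "ts": "2024-11-12"})
assert verify(log)
log[0]["event"]["ts"] = "2024-01-01"   # tampering with an old record...
assert not verify(log)                 # ...is detectable
```

A database-level alternative is simpler still: revoke UPDATE and DELETE on the event table and let timestamps plus backups carry the proof.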
Teams that have gone through regulatory review consistently report that auditors care less about whether everything was perfect and more about whether the process was systematic. A provenance registry with honest records of what you tracked and when you started tracking it is more defensible than retroactively reconstructed documentation that claims completeness it can't verify.
Starting Without Starting Over
If your fine-tuned model is already in production and your provenance records are in a Slack archive, you're not starting from zero — you're starting from a known gap. The right move is to document what you know, mark what you can't reconstruct, and begin tracking new ingestion systematically from a fixed date.
The first version of your registry doesn't need to be complete. It needs to be honest about what it covers. An audit response that says "examples ingested before [date] have source documentation only, examples from [date] forward have full provenance records" is defensible and sets a clear improvement trajectory. An audit response that claims complete provenance and then can't produce records is a liability.
The useful provenance question isn't "do we have perfect lineage" — it's "could we answer the four scenarios above with confidence?" Start with the scenario most likely to hit your team, build the schema to answer it, and expand from there. The regulator's question lands eventually. The teams that treat it as a systems design problem rather than a documentation exercise are the ones who answer it in hours, not weeks.
The most expensive provenance system is the one you build after the audit request arrives.
- https://arxiv.org/pdf/2510.09655
- https://www.nature.com/articles/s42256-024-00878-8
- https://artificialintelligenceact.eu/article/10/
- https://gdprlocal.com/gdpr-machine-learning/
- https://arxiv.org/abs/2406.16257
- https://arxiv.org/html/2503.01854v2
- https://openlineage.io/docs/
- https://huggingface.co/docs/hub/en/datasets-cards
- https://arxiv.org/html/2604.01904v1
- https://ai.stanford.edu/~kzliu/blog/unlearning/
