What Your Fine-Tuned LLM Is Leaking About Its Training Data
When a team fine-tunes an LLM on customer support tickets, internal Slack exports, or proprietary code, the instinct is to treat data ingestion as a one-way door: data goes in, a better model comes out. That's not how it works. A researcher with API access and $200 can systematically pull verbatim text back out, often including content the model was never supposed to surface. This isn't a theoretical edge case — it's a documented attack pattern that has been demonstrated against production systems including one of the world's most widely deployed language models.
The core problem is that fine-tuned models are fundamentally different from base models in their privacy posture. They've been trained on smaller, more distinctive datasets where individual examples are far more distinguishable from background model behavior. That distinctiveness is exactly what attackers exploit.
Three Ways Your Model Leaks Training Data
Privacy attacks on fine-tuned LLMs fall into three distinct categories, each with a different mechanism and different attacker profile.
Verbatim extraction is the most visceral. The model reproduces sequences from its training data word-for-word. Research on GPT-J found that at least 1% of its training dataset is memorized in this way — a small fraction that still represents thousands of recoverable sequences in any large model. More troubling is the rate at which targeted attacks can surface that memorized content. A divergence attack — prompting the model to repeat a word like "poem" or "send" indefinitely until it derails — causes models to emit training data at roughly 150x the rate of normal operation. Researchers demonstrated this against a production model, extracting over 10,000 verbatim training examples for under $200 in API costs. Roughly 5% of outputs under divergence conditions turn out to be direct 50-token-or-longer copies from training data.
Memorization scales with model size (log-linearly) and with how often a sequence appeared in training (super-linearly). Sequences that appear multiple times get burned in disproportionately deeply. If you've fine-tuned on a dataset with repeated records — as customer support data almost always contains, because the same issues recur — the memorization risk compounds.
Membership inference attacks are subtler. Instead of asking "can I extract this text?", they ask "was this specific document used in training?" An adversary who wants to know whether a competitor's proprietary contract or a specific user's record was in your fine-tuning corpus can now answer that question with high confidence using only black-box API access. A 2024 method raised the AUC for these attacks from 0.7 to 0.9 across multiple model types by calibrating against the target model's own outputs rather than an external reference dataset. That 0.9 figure means the attack is roughly nine times better than random guessing at the document level.
Attribute inference attacks don't extract text at all — they reconstruct private facts about people whose data contributed to training. By querying the model with carefully structured questions about behavioral patterns and preferences, attackers have demonstrated 85% top-1 accuracy and 95% top-3 accuracy on personal attributes like location, income, and demographics, at 100x lower cost and 240x less time than human annotators. For models fine-tuned on user-generated content, this means an adversary can profile users who contributed to training without needing a single verbatim sentence.
Who Is Actually Attacking Fine-Tuned Models in Production
The threat model for base models trained on public internet data is relatively forgiving — there's nothing secret to extract. Fine-tuned models on internal data face a fundamentally different attacker profile.
Competitors are the clearest threat. A company that fine-tunes on proprietary engineering documentation, internal pricing strategy discussions, or customer behavior data has a business interest in keeping that dataset private. Extracting even partial training data through a public-facing product reveals what the company considered important enough to train on, which is itself valuable intelligence.
System prompt extraction is a related but distinct attack. Testing of 50 real-world applications on a major AI platform found that system prompts — often containing business logic, proprietary instructions, and internal operational details — were successfully extracted from 68% of them through adversarial queries. System prompts weren't designed to be secret by default, but most teams treat them as confidential. That 68% figure suggests most teams are overestimating how much protection the model provides by default.
The insider threat is underappreciated. When employees use unapproved AI tools with internal documents, roughly 1 in 12 prompts contains sensitive corporate information. That data is ingested by services that may use it for training, creating leakage pathways that bypass any controls on the official fine-tuned model entirely.
OWASP's 2025 ranking of LLM application risks moved sensitive information disclosure from 6th to 2nd. That shift reflects where production exploits are actually concentrating.
Differential Privacy: Real Numbers on a Real Tradeoff
Differential privacy (DP) is the only mechanism that provides a mathematical bound on leakage. DP-SGD works by adding calibrated Gaussian noise to gradients during fine-tuning and clipping per-example gradients to bound sensitivity. The privacy guarantee ε controls the tradeoff: lower ε means stronger privacy but higher utility loss.
The practical question is whether the utility cost is survivable. Recent work on fine-tuning large models with user-level DP — where the guarantee covers all of a given user's contributions, not just individual examples — shows it's viable at scale. User-level DP provides stronger guarantees than example-level DP and is preferred in two scenarios: when strong privacy guarantees are genuinely required, or when compute budget is large enough to amortize the training overhead.
LoRA changes the economics substantially. Research published in 2024 showed that LoRA adapters inherently provide differential privacy-like protection through their random sketching properties, with the noise variance decreasing as the adaptation rank increases. This means lower-rank LoRA provides better implicit privacy. Explicitly combining DP-SGD with LoRA yields better privacy-utility tradeoffs than applying DP-SGD to full fine-tuning, because the reduced parameter count shrinks the attack surface without requiring the same level of gradient noise to achieve the same ε.
The practical guidance that emerges: if you're fine-tuning on sensitive data and can accept modest utility degradation, DP-LoRA with a low rank and ε around 4–8 provides meaningful protection against extraction attacks without catastrophic performance collapse. If you can't accept any utility loss, you need a different strategy.
- https://arxiv.org/abs/2311.06062
- https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html
- https://arxiv.org/pdf/2311.17035
- https://research.google/blog/fine-tuning-llms-with-user-level-differential-privacy/
- https://arxiv.org/abs/2407.07737
- https://arxiv.org/abs/2409.17538
- https://arxiv.org/abs/2503.06808
- https://owasp.org/www-project-top-10-for-large-language-model-applications/
- https://arxiv.org/html/2406.17975
- https://arxiv.org/html/2502.05087v1
