The PII Redactor Whose Own Training Corpus Was the Leak Vector
A team stands up a fine-tuned redaction model in front of their log pipeline. It strips names, emails, account numbers, and IP addresses before anything lands in long-term storage. The model is small, fast, and easy to deploy alongside the ingestion workers. The privacy review approves it. Six months later a customer support engineer pastes a strange-looking log line into a debugging tool, and the redactor produces an output that contains a real customer's email address — one that does not appear anywhere in the input.
The pipeline did exactly what it was built to do. The redactor was the leak.
This failure mode is not exotic. It follows directly from the construction. A redactor trained on real customer data is a model that has learned, at some level of fidelity, the distribution of real customer data. Anything a model learns it can be made to emit. The team that put an ML model on the privacy boundary added a new disclosure surface — one whose threat model the privacy team usually does not own, because the privacy team thinks of the redactor as a control, not as an asset that itself contains the protected class of data.
The redactor is not a wall, it is a witness
The mental model that gets teams into trouble is: "the redactor sits between the data and the store, so the data does not reach the store." That mental model is true for a regex. It is not true for a neural network whose weights were updated by gradient descent on examples that contained the very strings the regex would have matched.
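To make the contrast concrete, here is a minimal sketch of a regex redactor. The two patterns and the name `regex_redact` are illustrative assumptions, not a production rule set. The point is structural: every character in the output is either copied from the input or part of a fixed placeholder, so there is no stored state from which a customer's string could be emitted.

```python
import re

# Illustrative patterns only (assumption); real rule sets are much broader.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def regex_redact(line: str) -> str:
    """Replace matched spans with fixed placeholders.

    The output is a pure function of the input line and two constants,
    so it cannot contain a string that was seen at some earlier time.
    """
    line = EMAIL.sub("[EMAIL]", line)
    line = IPV4.sub("[IP]", line)
    return line


redacted = regex_redact("login ok user=alice@example.com ip=10.0.0.7")
```

A fine-tuned model offers no such guarantee, because its output also depends on weights that were shaped by the training corpus.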
Production language models can be made to emit verbatim training data. The 2023 work on scalable extraction from ChatGPT showed that around $200 of API queries to gpt-3.5-turbo yielded over 10,000 unique verbatim training examples, and the extracted set covered PII, code, news headlines, and log-shaped strings. The 2024 PII-Scope study found that fine-tuned models leak PII at rates significantly higher than their pretrained counterparts, with simple template-prompt attacks reaching above 50% extraction rates within 256 queries. A redactor that was fine-tuned on a corpus of real log lines is — empirically, not hypothetically — a fine-tuned model that leaks above the pretrained baseline.
The threat is not symmetrical with normal model risks. A misclassifying classifier produces a wrong label; an extracting redactor produces protected text. The blast radius of an error is the data itself.
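One way to observe this class of error is to compare what the redactor emits against what it was given. The sketch below is an assumption-laden illustration, not a complete control: it covers only email-shaped strings, and `novel_emails` is a hypothetical helper name. It flags any email-shaped string in the output that did not appear in the input, which is the symptom described in the opening incident.

```python
import re
from typing import Set

# Illustrative email pattern (assumption); other PII types need their own checks.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def novel_emails(raw_input: str, redacted_output: str) -> Set[str]:
    """Return email-shaped strings present in the output but absent from the input.

    A correct redactor only removes or masks text, so this set should be empty.
    A non-empty result means the model produced an address it was never shown
    in this request.
    """
    seen = {m.lower() for m in EMAIL.findall(raw_input)}
    emitted = {m.lower() for m in EMAIL.findall(redacted_output)}
    return emitted - seen
```

For example, `novel_emails("GET /x?q=zzzz", "GET /x?q=jane.doe@example.org")` returns a one-element set, while a normally redacted line returns an empty set.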
Why fine-tuned redactors memorize the values they redact
Three structural facts make redaction models particularly leaky as a class.
The training labels are the values. A standard supervised redaction dataset pairs raw text with span-level labels marking which substrings are PII. Both halves of the example contain the literal protected string. Loss is calculated against tokens, so the model is rewarded for fitting the exact characters of names, account numbers, and email addresses. This is not incidental memorization in the way a foundation model memorizes copyrighted prose — it is the supervised objective.
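A small sketch of what such a training example can look like, assuming a character-offset span format. The `Span` and `RedactionExample` types and the sample log line are hypothetical. It shows that the labels point directly at the protected strings, so every example in the corpus carries the literal value.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Span:
    start: int  # inclusive character offset into the text
    end: int    # exclusive character offset
    label: str  # PII type, e.g. "EMAIL"


@dataclass
class RedactionExample:
    text: str
    spans: List[Span]


def protected_strings(example: RedactionExample) -> List[Tuple[str, str]]:
    """Recover the literal PII values that the span labels mark."""
    return [(s.label, example.text[s.start:s.end]) for s in example.spans]


# Hypothetical training example; the address and account number are made up.
example = RedactionExample(
    text="password reset sent to jane.doe@example.org for acct 4417",
    spans=[Span(23, 43, "EMAIL"), Span(53, 57, "ACCOUNT")],
)
values = protected_strings(example)
```

Here `values` contains the email address and the account number verbatim, which is exactly the material the model is optimized to recognize.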
Rare entities have nowhere to hide. Most PII in a real log corpus follows a long-tail distribution. A handful of high-frequency entities (internal test accounts, customer success rep emails, the CTO's name) appear hundreds of times, while most real customer identifiers appear once or twice. Repetition is the dominant predictor of memorization, so the high-frequency head of the distribution gets memorized first and hardest. Those high-frequency strings are exactly the ones an attacker can extract with the cheapest probe.
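The skew is easy to measure before training. The following is a minimal sketch, assuming the labeled PII values have already been pulled out of the corpus into a flat list. The helper name `repeated_entities` and the threshold of 10 are illustrative choices, not recommendations from the cited work.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def repeated_entities(
    labeled_values: Iterable[str], min_count: int = 10
) -> List[Tuple[str, int]]:
    """List PII values that occur at least `min_count` times, most frequent first.

    These are the strings most exposed to memorization through repetition.
    """
    counts = Counter(labeled_values)
    return [(value, n) for value, n in counts.most_common() if n >= min_count]


# Hypothetical corpus: one internal test account dominates, customers are rare.
values = ["qa-bot@example.com"] * 300 + [
    "c1@example.net",
    "c2@example.net",
    "c2@example.net",
]
hot = repeated_entities(values, min_count=10)
```

In this made-up corpus only the internal test account clears the threshold. The two customer addresses appear once and twice and do not.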
The deployment shape rewards extraction. A redactor is called per log line, often per request, with attacker-controllable input on at least one path (user-submitted form fields, request bodies, URL parameters). The attacker can issue thousands of probe prompts through normal application traffic. They do not need API keys to the model; they just need to write log lines that reach the redactor. The 2022 work on unintended memorization in NER models showed that even task-specific redactors expose memorized phrases through both output content and inference-time timing differences — a side channel the team is unlikely to have considered.
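The timing channel can at least be checked for in an audit. The sketch below is a rough illustration under stated assumptions: `redact` stands for whatever callable wraps the deployed model, and `median_latency` is a hypothetical helper. It measures only wall-clock time from the caller's side, which is also roughly what an attacker can see.

```python
import statistics
import time
from typing import Callable


def median_latency(redact: Callable[[str], str], line: str, trials: int = 30) -> float:
    """Median wall-clock seconds for `redact(line)` over `trials` calls.

    The median is used rather than the mean to reduce the influence of
    scheduling noise and cold-start outliers.
    """
    samples = []
    for _ in range(trials):
        start = time.perf_counter()
        redact(line)
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

Comparing `median_latency` for a line built from strings known to be in the training corpus against a line built from strings known to be absent gives a first indication of whether the two cases can be told apart. A consistent gap across repeated runs is worth investigating. The absence of a gap in a noisy measurement proves little.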
How the leak actually surfaces in production
The first sign is usually not a privacy alarm. It is something noisy and ambiguous.
- https://arxiv.org/html/2508.05545v1
- https://arxiv.org/pdf/2410.06704
- https://arxiv.org/pdf/2311.17035
- https://arxiv.org/pdf/2211.02245
- https://arxiv.org/html/2507.05578v1
- https://arxiv.org/html/2506.10024v1
- https://aclanthology.org/2023.acl-long.74.pdf
- https://microsoft.github.io/presidio/
- https://www.nature.com/articles/s41598-025-04971-9
- https://aclanthology.org/2025.findings-acl.1174.pdf
