The PII Redactor That Protected Your Logs and Let the Model Leak the Outputs
A PII redactor that runs only on inbound traffic is a one-way valve installed at the wrong end of the pipeline. It catches user-submitted names, emails, and account numbers before they reach your logs. It does nothing about the model's own outputs, the place where the model is actively assembling text that may contain those same identifiers, drawn from RAG retrievals, tool returns, conversation history, or content the user pasted from another tenant's data. Every team I've watched ship an input-side redactor has a follow-up ticket in the backlog labeled "output-side parity." Most of those tickets never close, because no incident surfaces the gap for six months, and after six months the ticket has accumulated enough re-prioritization to look like a feature request rather than a missing half of a security control.
The failure mode is invariant: input redaction is treated as the canonical control because it is the easier engineering problem and the easier audit story. You wrote a regex set, you ran a labeled benchmark, you proved precision and recall on a fixed corpus, you shipped it behind a feature flag, and the security review accepted it as the PII boundary. The output side has none of that benefit. The model's response is generative, the surface area is unbounded, and the test methodology — "what should it not say in any of infinitely many contexts" — is structurally harder than "what should we strip from a known input." So the team that ships the inlet treats the outlet as future work and the future never arrives until a customer reports another customer's email landing in their transcript.
The redactor's trust assumption never survived the model becoming a writer
The original argument for input-only redaction was an architectural one. The model, the reasoning went, is a stateless function: it cannot leak what it never sees. Strip the PII from the prompt, and the response cannot contain it. This was approximately true in the pre-agent era, when an LLM was a single-call summarizer or classifier and the user's message was the only meaningful input. The redactor sat at the API gateway, scrubbed the inbound payload, and the architecture was symmetrical: clean in, clean out.
Agents broke the symmetry the day they started reading from RAG corpora, calling tools, and persisting conversation state. A model that retrieves a chunk from a vector database is reading content the input redactor never saw. A model that calls a customer-lookup tool is being fed an email address from a system the redactor never touched. A model that maintains a multi-turn transcript is composing its current response over a history whose earliest entries predate the redactor's deployment. The model is no longer a function of the current prompt; it is a function of every byte of context that flows into the request, and most of those bytes never crossed the inlet the redactor was protecting.
The path that matters most is the one the architecture explicitly trusts. A support agent needs the customer's email to look up the account. A coding assistant needs the user's code, which may contain identifiers the redactor would have stripped. A document-Q&A bot needs to quote from documents that contain names. The non-redacted path is not an oversight — it is the system's value proposition. And once you have a value proposition that requires the model to see real PII, the model can write that PII back into any output it produces, against any other conversation it serves, and the redactor at the inlet has no view into that.
Cross-tenant content arrives wearing the current tenant's clothes
The clean failure mode is the one where a user pastes another customer's email into their own conversation. The redactor at the inlet sees what looks like in-tenant content — the user is authenticated, the request is on their own session, the payload is their own message — and either does not redact it (because the path is trusted) or redacts and re-inserts a token the agent will faithfully decode back to the original value in its output. The model summarizes the pasted thread, the summary includes the other customer's email verbatim, the user notices, and the question of whether your platform is GDPR-compliant becomes a board-level conversation rather than an engineering one.
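The redact-and-re-insert path is easy to reproduce. The sketch below is a minimal illustration, not any particular library's behavior: the `<EMAIL_n>` placeholder format, the `mask` and `unmask` helpers, the sample address, and the stand-in for the model call are all assumptions made for the example.

```python
import re

# Simplified email pattern, good enough for the illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def mask(text: str) -> tuple[str, dict[str, str]]:
    """Replace each email with a placeholder and remember the mapping."""
    vault: dict[str, str] = {}

    def _sub(match: re.Match) -> str:
        token = f"<EMAIL_{len(vault) + 1}>"
        vault[token] = match.group(0)
        return token

    return EMAIL_RE.sub(_sub, text), vault


def unmask(text: str, vault: dict[str, str]) -> str:
    """Restore every placeholder, with no notion of whose data it was."""
    for token, original in vault.items():
        text = text.replace(token, original)
    return text


# The user pastes a thread that belongs to a different customer.
pasted = "Fwd: please refund alice@other-tenant.example for order 1182"
masked_prompt, vault = mask(pasted)

# Stand-in for the model call: it summarizes and keeps the placeholder.
placeholder = next(iter(vault))
model_output = f"Summary: the sender wants a refund sent to {placeholder}."

# The un-masking step faithfully restores the other customer's address.
final_response = unmask(model_output, vault)
```

The inlet did its job here: the prompt the model saw contained no address. The response the user sees contains it anyway, because the un-masking step has no idea whose address it is restoring.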
The Dutch Data Protection Authority received a notification of a closely related shape in 2024: a telecom employee pasted a file containing customer addresses into an AI chatbot, and the act of entering the data into the chatbot was treated as an unauthorized disclosure under GDPR. The 72-hour notification clock starts from the moment the controller becomes aware of the breach, and the standard for awareness is not the security team's awareness; it is the organization's awareness, which a support ticket from an affected user clears trivially. The fine ceiling is the larger of €20 million or 4% of global annual turnover, and 2024's aggregate GDPR fines crossed €1.2 billion. The incident does not need to be malicious to count. It does not need to be technically large. It just needs to involve personal data crossing a boundary the controller was supposed to enforce.
The architecture's failure here is that the redactor was designed against a threat model where the adversary is the user-as-attacker, trying to get their own PII into your logs. The actual threat model in a multi-tenant agent system is the user-as-data-mule, carrying another tenant's PII into the model's context and asking the model to do something normal with it. The redactor was looking the wrong direction.
Output redaction is not a smaller version of input redaction
The seductive fix is to take the inbound redactor, point it at the outbound path, and call it done. This is harder than it sounds, and the differences are why most teams treat output redaction as a separate, deferred project rather than a flag-flip.
Input redaction operates on a known schema. The user message has a structure, the redactor can tokenize it, run NER over it, regex-match emails and phone numbers, and produce a tagged span list with high recall. The output is unstructured natural language whose content distribution is whatever the model decided to produce. The same string john@example.com can appear in an output as john at example dot com, as the email johnatexampledotcom, or as a paraphrase like the customer's contact email starts with john and uses the example domain, each of which evades a regex tuned for the canonical form. Microsoft's Presidio framework is the most widely deployed open-source toolkit for this work, and even Presidio's documentation for its LiteLLM integration distinguishes between the output_parse_pii setting (which un-masks tokens after the model call) and the presidio_filter_scope: output setting (which actively scans the model's response). They are different code paths because they are different problems.
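The evasion is quick to demonstrate. The snippet below is a sketch under stated assumptions: the pattern is a simplified stand-in for whatever email regex an inlet redactor might use, and the sample sentences are invented for the example.

```python
import re

# A typical "canonical form" email pattern (simplified, assumed for illustration).
CANONICAL_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

# Ways the same address might surface in generated text.
variants = {
    "canonical": "Contact: john@example.com",
    "spelled_out": "Contact: john at example dot com",
    "collapsed": "Contact: johnatexampledotcom",
    "paraphrase": "The contact email starts with john and uses the example domain",
}

# True where the regex fires, False where the identifier slips through.
detected = {
    name: bool(CANONICAL_EMAIL.search(text)) for name, text in variants.items()
}

for name, hit in detected.items():
    print(f"{name:12s} detected={hit}")
```

Only the canonical form is caught. The other three carry the same information to a human reader and pass the pattern untouched, which is why an output-side scan usually needs something beyond the inlet's regex set.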
Latency is the second difference. Input redaction runs once, on a request whose size is bounded by the user's typing speed. Output redaction has to run on every chunk of a streaming response, with a budget measured against the model's tokens-per-second, or it has to buffer the entire response and stall the first paint until the scan completes. The team that bolts the inlet's redactor onto the outlet in an afternoon usually discovers that the user-perceived latency doubles, the streaming UX breaks, and the deployment gets rolled back the next morning. The output-side redactor is a real piece of infrastructure, not a config change.
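The streaming problem comes down to identifiers that straddle chunk boundaries. One way to handle it, sketched below with assumed names (`redact_stream`, `holdback`) and a simplified email pattern, is to hold back a tail of the buffer long enough to contain any partial identifier and release only the text in front of it. This is an illustration of the trade-off, not a production design: the holdback must be at least as long as the longest identifier you expect, and it delays every character by that much.

```python
import re
from typing import Iterable, Iterator

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def redact_stream(chunks: Iterable[str], holdback: int = 64) -> Iterator[str]:
    """Yield redacted text while keeping the last `holdback` chars unreleased."""
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        cut = max(0, len(buffer) - holdback)
        # Never cut through a detected address; back up to where it starts.
        for match in EMAIL_RE.finditer(buffer):
            if match.start() < cut < match.end():
                cut = match.start()
                break
        ready, buffer = buffer[:cut], buffer[cut:]
        if ready:
            yield EMAIL_RE.sub("[EMAIL]", ready)
    # End of stream: scan and release whatever is still held back.
    if buffer:
        yield EMAIL_RE.sub("[EMAIL]", buffer)


# The address is split across three chunks, as a token stream might split it.
chunks = ["Reach the customer at jo", "hn@exam", "ple.com for details."]

# Scanning each chunk on its own never sees a complete address.
naive = "".join(EMAIL_RE.sub("[EMAIL]", c) for c in chunks)

# Scanning with a holdback window does.
buffered = "".join(redact_stream(chunks, holdback=32))
```

The per-chunk scan lets the full address through because no single chunk matches the pattern. The buffered version catches it, at the price of delaying output by the holdback window, which is the latency cost the paragraph above describes.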
The third difference is the semantic gap. Input redaction can be strict because the cost of a false positive is the user retyping a query. Output redaction cannot be strict because the cost of a false positive is the model's response losing the information the user is trying to retrieve. A support agent that helpfully redacts the customer's own email out of the response it just produced about that customer's account is a worse product than the one that leaked. The fix is not "redact more"; the fix is to know whose PII you are redacting against. That requires a content-provenance signal — a tag on every span in the output that says which tenant's data the model drew it from — and that signal does not exist in any off-the-shelf redaction library.
Provenance is the control the redactor was supposed to be a proxy for
The architectural realization that closes the gap is that PII is not a class of content; it is a class of relationship. An email is sensitive when it belongs to a customer you do not own. The same email is fine to surface when it belongs to the user who pasted it. The redactor's job was never to find emails; it was to enforce a boundary between data that belongs to the conversation's owner and data that does not. Input redaction was a useful proxy for that boundary as long as the only PII entering the system was the user's own. Once the input includes RAG retrievals, tool returns, and pasted cross-tenant content, the proxy breaks because the data's origin is no longer the data's identity.
Content provenance is the control the redactor was approximating. Every chunk that enters the model's context — whether from a tool, a retrieval, a transcript, or a user paste — should arrive with a label that says which tenant authored it. The model's response can then be scanned for spans that match a labeled chunk, and the output-redaction policy can decide on a per-span basis whether to allow, mask, or quarantine. A span that matches a chunk authored by the current conversation's owner is allowed. A span that matches a chunk authored by a different tenant is masked or refused. A span that matches no labeled chunk is scrutinized by a generic PII detector as a fallback. This is the same control the redactor was trying to be, except it is correctly scoped to the boundary that matters.
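The per-span policy can be sketched in a few lines. Everything named here is hypothetical: `LabeledChunk`, `filter_output`, the tenant IDs, and the action names are illustrative, and the span detector is a simplified email regex standing in for a real PII detector. The sketch also simplifies the flow by detecting spans first and then looking up their provenance.

```python
import re
from dataclasses import dataclass

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass(frozen=True)
class LabeledChunk:
    text: str
    tenant_id: str   # which tenant the content belongs to
    source: str      # e.g. "rag", "tool", "transcript", "user_paste"


def filter_output(response: str, context: list[LabeledChunk], owner_tenant: str):
    """Return (filtered_response, decisions) using chunk provenance."""
    # Map each identifier seen in context to the tenants whose chunks held it.
    origins: dict[str, set[str]] = {}
    for chunk in context:
        for match in EMAIL_RE.finditer(chunk.text):
            origins.setdefault(match.group(0).lower(), set()).add(chunk.tenant_id)

    decisions: list[tuple[str, str]] = []

    def _decide(match: re.Match) -> str:
        value = match.group(0)
        tenants = origins.get(value.lower())
        if tenants is None:
            action = "mask_unlabeled"      # no provenance: generic fallback
        elif tenants == {owner_tenant}:
            action = "allow"               # belongs to the conversation's owner
        else:
            action = "mask_cross_tenant"   # seen in another tenant's data
        decisions.append((value, action))
        return value if action == "allow" else "[REDACTED]"

    return EMAIL_RE.sub(_decide, response), decisions


context = [
    LabeledChunk("Account owner: dana@acme.example", "tenant-a", "tool"),
    LabeledChunk("Escalation from raj@globex.example", "tenant-b", "rag"),
]
response = (
    "Your account email is dana@acme.example; "
    "see also raj@globex.example and eve@unknown.example."
)
filtered, decisions = filter_output(response, context, owner_tenant="tenant-a")
```

The owner's own address survives, so the support answer stays useful, while the address that arrived through another tenant's chunk and the one with no provenance at all are both masked. Exact-match lookup like this misses paraphrases, so it complements rather than replaces a fuzzier detector.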
Tool-allowlisting is the same idea applied to outbound calls. An agent's URL-fetch tool scoped to "any HTTPS endpoint" is a general-purpose data egress channel that the agent will happily use to exfiltrate context to wherever the model is told to send it. Scoping the tool to a per-tenant allowlist of known vendor domains turns the same tool into a per-tenant control surface. The cost is a small ergonomic loss; the benefit is that the model's URL-fetch instruction is no longer a freeform exfiltration primitive.
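A per-tenant allowlist check can sit in front of the fetch tool as a small gate. The sketch below uses only the standard library; the `TENANT_ALLOWLIST` contents, the tenant IDs, and the `is_allowed` helper are hypothetical names chosen for the example.

```python
from urllib.parse import urlsplit

# Hypothetical per-tenant allowlist of vendor hosts.
TENANT_ALLOWLIST: dict[str, set[str]] = {
    "tenant-a": {"api.vendor-one.example", "status.vendor-two.example"},
    "tenant-b": {"api.vendor-three.example"},
}


def is_allowed(tenant_id: str, url: str) -> bool:
    """Permit a fetch only for HTTPS URLs whose host is allowlisted for the tenant."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    # .hostname strips userinfo and port, so "allowed-host@evil-host" tricks fail.
    host = (parts.hostname or "").lower()
    return host in TENANT_ALLOWLIST.get(tenant_id, set())


checks = {
    "own_vendor": is_allowed("tenant-a", "https://api.vendor-one.example/v1/orders"),
    "other_tenant_vendor": is_allowed("tenant-a", "https://api.vendor-three.example/"),
    "plain_http": is_allowed("tenant-a", "http://api.vendor-one.example/v1/orders"),
    "userinfo_trick": is_allowed(
        "tenant-a", "https://api.vendor-one.example@attacker.example/collect"
    ),
    "unknown_tenant": is_allowed("tenant-z", "https://api.vendor-one.example/"),
}
```

Matching on the parsed hostname rather than on a substring of the URL is the design choice that matters: a string check for the vendor domain would pass the userinfo variant, which actually resolves to the attacker's host.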
The conversation-transcript layer is the third place provenance has to land. A transcript is a persistent system of record, and persisting an agent message that contains another tenant's PII into the wrong tenant's transcript is the act that converts a runtime leak into a stored breach. A policy at the transcript-write layer that refuses to persist any message containing PII not labeled as belonging to the conversation's owner catches the leak even when the redactor missed it and the model produced it.
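A write-time guard of this kind might look like the following. It is a sketch with assumed names (`TranscriptStore`, `owner_identifiers`, a `quarantined` list) and a simplified email detector; a real store would persist to a database and route quarantined messages to review.

```python
import re
from dataclasses import dataclass, field

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


@dataclass
class TranscriptStore:
    owner_tenant: str
    owner_identifiers: set[str]                      # PII labeled as the owner's
    messages: list[str] = field(default_factory=list)
    quarantined: list[str] = field(default_factory=list)

    def append(self, message: str) -> bool:
        """Persist the message only if every detected identifier is the owner's."""
        found = {value.lower() for value in EMAIL_RE.findall(message)}
        foreign = found - {value.lower() for value in self.owner_identifiers}
        if foreign:
            # Hold the message out of the system of record for review.
            self.quarantined.append(message)
            return False
        self.messages.append(message)
        return True


store = TranscriptStore("tenant-a", {"dana@acme.example"})
saved = store.append("Your login email is dana@acme.example.")
blocked = store.append("The other account uses raj@globex.example.")
```

The guard is deliberately independent of the output scan: it runs on whatever reaches the write path, so a span the scan missed is still stopped before it becomes a stored record.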
What to do this quarter
Three patterns close the gap without rearchitecting the platform. First, run an adversarial output-redaction test once a week: paste a message into the agent that contains cross-tenant PII (your team's own emails will do for the canary), ask the agent a question that requires summarizing the message, and assert that the response does not echo the address. This is the cheapest detector you can build, and it catches the regression class that broke the Dutch telecom case. Second, add an output-side scan with a tool like Presidio or your vendor's equivalent, even if it only runs in non-streaming mode at first; an outbound scan that adds 200 milliseconds to a response is a worse UX than a streaming response, and a better breach posture. Third, label every retrieved chunk and tool return with a tenant tag at ingest time, store the tag alongside the content, and propagate it into any audit trail your transcript layer writes; this is the foundation provenance-based output filtering will eventually need, and it is cheap to do early.
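The weekly canary from the first pattern can be a few lines of test code. In this sketch the agent is modeled as any callable that takes a prompt and returns a response; `CANARY_EMAIL`, `canary_leaks`, and the two fake agents are hypothetical stand-ins, and you would swap in a call to your own agent endpoint.

```python
from typing import Callable

# An address owned by your own team, used as the cross-tenant canary.
CANARY_EMAIL = "canary.cross-tenant@yourteam.example"


def canary_leaks(agent: Callable[[str], str]) -> bool:
    """Return True if the agent echoes the canary address in its response."""
    prompt = (
        "Here is a forwarded thread:\n"
        f"From: {CANARY_EMAIL}\n"
        "Please cancel my subscription.\n\n"
        "Summarize the thread in one sentence."
    )
    response = agent(prompt)
    return CANARY_EMAIL.lower() in response.lower()


# Two fake agents for demonstration: one echoes the header, one does not.
def leaky_agent(prompt: str) -> str:
    return "Summary: " + prompt.splitlines()[1] + " wants to cancel."


def safe_agent(prompt: str) -> str:
    return "Summary: the sender wants to cancel their subscription."


leaky_result = canary_leaks(leaky_agent)
safe_result = canary_leaks(safe_agent)
```

A verbatim check like this only catches exact echoes. It will not flag a paraphrased or spelled-out address, so treat a passing run as a floor rather than a clean bill of health.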
The harder question is the contract one. A redactor at the inlet is a control you ship once and forget. An output-side privacy posture is a continuous engineering commitment, because the model's response distribution shifts with every prompt change, every retrieval change, and every model upgrade. A team that treats output privacy as a one-shot ticket is a team that will be writing a 72-hour notification the first time the model paraphrases an email around the regex it was supposed to catch. The boundary the redactor was supposed to enforce is the boundary between data that belongs to the current conversation and data that does not, and the model is the actor most capable of crossing it. Pointing your privacy controls only at the side where the user types is leaving the side where the model writes wide open, and the model is the side that writes the most.
- https://genai.owasp.org/llmrisk/llm02-insecure-output-handling/
- https://owasp.org/www-project-top-10-for-large-language-model-applications/assets/PDF/OWASP-Top-10-for-LLMs-v2025.pdf
- https://github.com/microsoft/presidio/
- https://microsoft.github.io/presidio/samples/docker/litellm/
- https://docs.litellm.ai/docs/proxy/guardrails/pii_masking_v2
- https://www.alstonprivacy.com/dutch-data-protection-authority-warns-that-using-ai-chatbots-can-lead-to-personal-data-breaches/
- https://gdprlocal.com/data-breach-notification-requirements/
- https://layerxsecurity.com/generative-ai/multi-tenant-ai-leakage/
- https://airblackbox.ai/blog/ai-agent-pii-leaking
- https://futureagi.com/glossary/pii-leak/
- https://www.giskard.ai/knowledge/cross-session-leak-when-your-ai-assistant-becomes-a-data-breach
