PII in LLM Pipelines: The Leaks You Don't Know About Until It's Too Late
Every engineer who has built an LLM feature has said some version of this: "We're careful — we don't send PII to the model." Then someone files a GDPR inquiry, or the security team audits the trace logs, and suddenly you're looking at customer emails, account numbers, and diagnosis codes sitting in plaintext inside your observability platform. The Samsung incident — three separate leaks in 20 days after allowing employees to use a public LLM — wasn't caused by reckless behavior. It was caused by engineers doing their jobs and a data boundary that wasn't enforced anywhere in the stack.
The problem is that "don't send PII to the API" is a policy, not a control. And policies fail the moment your system does something more interesting than a single-turn chatbot.
The Five Places PII Actually Enters Your LLM Stack
Most teams audit the obvious paths — the user's input message, the system prompt — and stop there. The real exposure surface is wider.
Prompt templates with dynamic injection. System prompts often pull from user context at construction time: account tier, past interactions, current state. When that fully-assembled prompt gets logged — which every framework does by default — the log entry contains all of it. Your error tracker, your APM dashboard, your LangSmith trace: all of them now hold a copy of whatever you injected. A 2024 study found that 8.5% of production LLM prompts contain potentially sensitive data, and most teams didn't know it was there.
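One cheap mitigation is to scrub log records before any handler sees them. The sketch below uses Python's standard logging.Filter with two Tier-1 regexes (emails, SSNs); a real deployment would route records through a fuller detection pass, and the patterns here are deliberately simplified:

```python
import logging
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

class PIIRedactingFilter(logging.Filter):
    """Masks structured PII in log records before any handler emits them."""

    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()          # fully-formatted message
        msg = EMAIL.sub("<EMAIL>", msg)
        msg = SSN.sub("<SSN>", msg)
        record.msg, record.args = msg, None  # replace with the redacted text
        return True                          # keep the (now-scrubbed) record

logger = logging.getLogger("llm.prompts")
logger.addFilter(PIIRedactingFilter())
```

Attaching the filter to the logger (rather than a handler) means every sink, including third-party exporters that hook the same logger, receives only the redacted message.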
The embedding step in RAG pipelines. When you embed documents for retrieval, you send the raw text to an embedding API — OpenAI, Cohere, Voyage. If your document corpus contains employee records, patient notes, or customer contracts, all of it passes to a third-party provider in plaintext as part of the embedding call. Most teams conduct a PII audit on what goes into the LLM context. Almost none conduct one on what goes to the embedding endpoint, even though it's the same data in the same format.
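Closing this gap is mostly a matter of wrapping the embedding call so scrubbing cannot be skipped. A minimal sketch, where scrub stands in for your real detection pipeline and embed_batch is a hypothetical stand-in for whatever your provider's client exposes:

```python
import re
from typing import Callable, List

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def scrub(text: str) -> str:
    """Stand-in for the same detection pipeline used at the prompt layer."""
    return EMAIL.sub("<EMAIL>", text)

def safe_embed(texts: List[str],
               embed_batch: Callable[[List[str]], list]) -> list:
    # embed_batch is hypothetical (your provider client); the point is
    # that scrubbing happens before any text crosses the network boundary.
    return embed_batch([scrub(t) for t in texts])
```

If every code path embeds through safe_embed, the "it's just preprocessing" blind spot disappears.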
Retrieved chunks injected into context. When your RAG system does a top-k retrieval, it may pull documents that contain data belonging to other users. A support ticket retrieval for one customer can surface another customer's order details if access controls at the chunk level are absent. The LLM then reasons over mixed data from different access domains, all in the same context window. This is the vector the DEAL attack exploits: an adversary can craft a query that induces the model to reveal the private documents it retrieved, with near-99% success in the published evaluation.
Tool call results and agent memory. When an LLM agent calls a database query tool or an external API, the results land in the context window and get logged like everything else. An agent that fetches a user's medical history to answer a benefits question has now deposited that PHI into your trace log, your vector memory store, and potentially your fine-tuning dataset if you're doing any feedback collection. A 2024 ACL paper on privacy risks in LLM agent memory documented how sensitive data persists across sessions in ways that violate reasonable user expectations.
Fine-tuning amplification. If you fine-tune on production data without a PII scrub pass first, the model memorizes it. Research found roughly 19% PII leakage in outputs from models fine-tuned on datasets containing personal information. Unlike a leaked database, memorized training data cannot be patched out of the weights.
Why Detection Is Harder Than It Looks
The instinct is to run a regex scan on outbound requests and call it done. Regexes catch structured PII well — SSNs, credit cards, email addresses — but they miss most of what matters in production.
Names embedded in sentences don't match a pattern. "The appointment was scheduled for Sarah Chen at 3pm" passes a regex filter with no flags. Organization names, job titles, location references, and relationship context ("the patient's husband") are all semantically PII but syntactically invisible to pattern matching. In healthcare or legal pipelines, the combination of non-PII tokens can identify an individual even when each token individually is safe.
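The claim is easy to verify. Run the example sentence through typical Tier-1 patterns (simplified here) and nothing fires:

```python
import re

STRUCTURED = [
    re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),   # email
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),      # SSN
    re.compile(r"\b(?:\d[ -]?){13,16}\b"),     # card-like digit runs
]

sentence = "The appointment was scheduled for Sarah Chen at 3pm"
hits = [p.pattern for p in STRUCTURED if p.search(sentence)]
print(hits)  # [] -- the name passes every pattern untouched
```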
The current production-grade approach is a tiered pipeline:
- Tier 1 (regex): Fast path for structured entities — emails, phone numbers, SSNs, credit card numbers with Luhn validation, API key patterns (sk-..., AKIA...). Sub-millisecond, deterministic, and catches 60–70% of PII in well-structured input.
- Tier 2 (NER model): Named Entity Recognition for unstructured PII. Microsoft Presidio is the standard open-source framework, combining spaCy NER with the regex tier and checksum validation for financial identifiers. For production accuracy, practitioners in 2025 are replacing Presidio's default spaCy model with GLiNER or a DeBERTa fine-tune, gaining 10–20% better recall on real-world text.
- Tier 3 (LLM disambiguation, optional): For high-stakes pipelines, a lightweight LLM pass can resolve ambiguous cases — "Michael" as a person vs. a project codename, "Apple" as a company vs. a fruit in a recipe. This adds latency and cost, justified only for regulated data.
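A minimal Tier-1 sketch with the Luhn checksum mentioned above (patterns simplified; Tier 2 would sit behind this, e.g. via Presidio):

```python
import re

def luhn_valid(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

CARD_CANDIDATE = re.compile(r"\b(?:\d[ -]?){12,18}\b")
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def tier1_scan(text: str) -> list:
    """Deterministic patterns plus checksum validation to cut false hits."""
    findings = [("EMAIL", m.group()) for m in EMAIL.finditer(text)]
    for m in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", m.group())
        if 13 <= len(digits) <= 16 and luhn_valid(digits):
            findings.append(("CREDIT_CARD", m.group()))
    return findings
```

The checksum step is what keeps Tier 1's false positive rate low: a random 16-digit string matches the regex but fails Luhn roughly 90% of the time.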
Roblox published their production numbers in late 2025: using XLM-RoBERTa-Large with GPU inference, they hit 98% recall at 1% false positive rate, 94% F1, at 200,000 queries per second. The benchmark comparison was stark — simpler off-the-shelf models came in at 13–27% accuracy on the same corpus.
The Pseudonymization Architecture That Actually Scales
Blocking all PII from reaching the model is often too restrictive. The LLM frequently needs the data to do its job — a medical coding assistant that can't see the diagnosis can't code anything. The solution is to let the model see the data without seeing the real data.
Vault-based pseudonymization works like this:
- The detection layer scans the inbound prompt and identifies PII entities.
- Each entity is replaced with a typed token — <PERSON_1>, <DATE_OF_BIRTH_1>, <MRN_1> — and the mapping is stored in a session-scoped vault.
- The de-identified prompt goes to the LLM API. The model reasons over tokens, not real data.
- The LLM response contains the same tokens in its output.
- The de-anonymization layer restores the originals before returning the response to the user.
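The round trip above can be sketched in a few lines. This is a toy vault (in-memory, naive string replacement, no encryption or access control), not a production implementation:

```python
from collections import defaultdict

class SessionVault:
    """Replaces detected entities with typed tokens; restores them later."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.token_to_value = {}

    def pseudonymize(self, text: str, detections: list) -> str:
        # detections: (entity_type, value) pairs from the detection layer
        for etype, value in detections:
            self.counters[etype] += 1
            token = f"<{etype}_{self.counters[etype]}>"
            self.token_to_value[token] = value
            text = text.replace(value, token)
        return text

    def restore(self, text: str) -> str:
        # Runs on the model's response before it is returned to the user.
        for token, value in self.token_to_value.items():
            text = text.replace(token, value)
        return text
```

The LLM only ever sees the tokens; the vault and the restore step stay inside your security boundary.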
LangChain's PresidioReversibleAnonymizer implements this pattern. For production security requirements, commercial vault services like Skyflow provide deterministic tokenization (the same PII value always maps to the same token, enabling cross-session referential integrity) and fine-grained access controls over who can detokenize.
One non-obvious implementation detail: naive placeholder substitution degrades model quality. "The doctor prescribed medication to [REDACTED]" produces different reasoning than "The doctor prescribed medication to <PERSON_1: female, 34>." Using typed placeholders with lightweight metadata, or Faker-generated realistic stand-ins, preserves semantic coherence in the model's reasoning chain while keeping the real data out of the prompt.
The Observability Trap
Here is a failure mode that catches almost every team eventually: you implement PII controls on the data going to the model, then log everything on the way out for debugging, and your trace logs become a copy of every sensitive document your RAG system ever retrieved.
LangSmith captures all inputs and outputs from every chain run by default. If you haven't explicitly configured masking, your LangSmith workspace has a complete record of every prompt, including the context injected from your RAG retrieval. The same applies to Langfuse, Datadog LLM traces, and OpenTelemetry-based pipelines with LLM auto-instrumentation.
Langfuse has the cleanest architecture here: it ships a native masking API where you pass a masking function to the client constructor, and all trace data is processed through it client-side before transmission. That means PII never leaves your process in unmasked form. Langfuse also supports full self-hosting, which keeps traces within your security boundary entirely.
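A masking function for this pattern is just a callable over trace payloads. The sketch below shows the function itself; the wiring comment assumes the mask constructor parameter described in Langfuse's masking documentation:

```python
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")

def mask_pii(data, **kwargs):
    """Applied client-side to trace payloads before transmission."""
    if isinstance(data, str):
        return EMAIL.sub("<EMAIL>", data)
    return data  # non-string payloads pass through unchanged

# Wiring it up (assumes the langfuse package; per its masking docs the
# client constructor accepts a mask callable):
# from langfuse import Langfuse
# langfuse = Langfuse(mask=mask_pii)
```

Because the function runs in your process, a bug in it fails closed only if you make it fail closed; consider returning a fixed "<MASKING_ERROR>" sentinel on exceptions rather than the raw payload.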
For OpenTelemetry pipelines, the 2024 GenAI semantic conventions store prompt content in span events rather than attributes specifically so they can be dropped at the OTel Collector level with a custom processor, without changing application code. This is the architecturally correct approach for regulated environments: redact at the collector, log structural metadata (latency, token counts, error codes) separately from content.
The principle: treat your observability backend the same way you treat the LLM API. Your LLM provider has a DPA. Does your tracing SaaS?
What the Regulatory Landscape Actually Requires
The Italian data protection authority fined OpenAI €15 million in late 2024 for GDPR violations including inadequate privacy notices, missing age verification, and failure to report a data breach. The case is under appeal, but the signal is clear: regulators are no longer treating LLM privacy violations as a novel gray area.
The EDPB's April 2025 opinion set the bar explicitly: an LLM is considered "anonymous" only if extraction of personal data from it is insignificant taking into account all means reasonably likely to be used. Most production LLMs fail this standard. That means models trained on data containing personal information are processing personal data under GDPR, and the controller deploying the model needs a lawful basis.
For US healthcare, HIPAA requires a Business Associate Agreement with any vendor that processes Protected Health Information — including LLM API providers. Azure OpenAI, AWS Bedrock, and Google Vertex AI all offer BAA-eligible enterprise tiers. Consumer-tier API access does not qualify. Sending patient notes through a tracing platform without a BAA is an unauthorized disclosure under the Breach Notification Rule, regardless of whether anyone actually accesses those logs.
Practically, this means:
- Document every data flow where personal data could touch an LLM or embedding API, including the observability layer.
- Use enterprise-tier API products with signed DPAs or BAAs. Validate that your observability vendor has one too.
- Do not fine-tune on individual user data unless you have a clear retention and erasure policy — and understand that once data is in the weights, you cannot erase it.
- Implement audit logging of all LLM interactions after PII redaction, not before.
Putting It Together: The Defense Stack
A practical PII defense for an LLM product isn't a single control — it's a stack of interlocking layers:
- At ingestion: Run tiered NER detection on all user inputs and RAG-retrieved chunks before they enter the prompt. Apply redaction or pseudonymization depending on whether you need the data to be recoverable.
- At embedding: Apply the same detection pipeline to document text before sending to embedding APIs. This step is commonly forgotten because it feels like "preprocessing," but the data exposure is identical.
- At retrieval: Implement access controls at the chunk level, scoped to the requesting user's permissions. Don't rely on the LLM to enforce access boundaries — it can't.
- At the observability layer: Configure masking before traces are exported, or self-host your observability stack. Log structure, not content, wherever possible.
- At fine-tuning: Never fine-tune on raw production data. Run a PII scrub pass first and validate the output before training.
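The retrieval-layer control above reduces to a hard filter between the vector store and the context window. A minimal sketch with hypothetical names (Chunk, allowed_principals), assuming ACLs are attached to chunks at ingestion time:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    allowed_principals: frozenset  # set at ingestion time, not query time

def authorized_chunks(chunks: List[Chunk], user_id: str,
                      top_k: int = 5) -> List[Chunk]:
    """Drop retrieved chunks the requesting user may not see BEFORE they
    reach the context window. The LLM never enforces this boundary."""
    visible = [c for c in chunks if user_id in c.allowed_principals]
    return visible[:top_k]
```

Filtering after retrieval (post-filtering) is the simplest version; at scale you would push the ACL predicate into the vector store query itself so top-k is computed over authorized chunks only.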
IBM's 2025 Cost of a Data Breach Report found that shadow AI incidents — employees routing sensitive data through personal LLM accounts that bypass enterprise controls — added an average of $670,000 per incident. LayerX Security found that 40% of files uploaded to AI chatbots contain PII or PCI data. The gap between "we have a policy" and "the data doesn't actually leave" is where most of these incidents live.
The teams that close that gap aren't doing anything exotic. They're applying the same defense-in-depth principles that work for SQL injection or SSRF: assume the boundary will be crossed, build controls at every layer, and validate that they're actually running in production.
The "just don't send PII" policy sounds right. It's also incomplete by design the moment you add retrieval, tool use, or any kind of session context. The stack above is what it takes to make that policy real.
- https://github.com/microsoft/presidio
- https://python.langchain.com/v0.1/docs/guides/productionization/safety/presidio_data_anonymization/reversible/
- https://www.skyflow.com/post/generative-ai-data-privacy-skyflow-llm-privacy-vault
- https://openreview.net/forum?id=sx8dtyZT41
- https://aclanthology.org/2024.findings-acl.267/
- https://about.roblox.com/newsroom/2025/11/open-sourcing-roblox-pii-classifier-ai-pii-detection-chat
- https://langfuse.com/docs/observability/features/masking
- https://docs.langchain.com/langsmith/mask-inputs-outputs
- https://medium.com/secludy/fine-tuning-llm-on-sensitive-data-lead-to-19-pii-leakage-ee712d8e5821
- https://www.dataprotectionreport.com/2025/01/the-edpb-opinion-on-training-ai-models-using-personal-data-and-recent-garante-fined-openai-e15-million-for-gdpr-violations/
- https://arxiv.org/html/2412.04697v2
- https://www.protecto.ai/blog/ai-data-privacy-breaches-incidents-analysis/
- https://ijcjournal.org/InternationalJournalOfComputer/article/view/2458
