The Redaction Layer Your Agent Cannot Reason Through

May 31, 2026 · 9 min read

Software Engineer

A privacy review approves your redaction layer. Names, emails, account numbers, phone numbers — all scrubbed before the prompt reaches the model. Your single-turn classifier still hits 94% accuracy. Six weeks later your multi-step agent starts giving confidently wrong answers to questions like "is the email Sarah used to log in the same as the one on her billing record?" and nobody can reproduce it in dev.

The redaction layer did exactly what infosec asked it to do. It also quietly destroyed the property your agent's reasoning depended on: that two mentions of the same entity in different turns refer to the same thing. The agent isn't hallucinating. It's reading a transcript where Sarah has become three different people and the "same" email address has become two distinct placeholders.

This is the failure mode privacy reviews don't catch because they audit what leaves the boundary, not what the boundary preserved. The placeholder is opaque to the auditor — which is the point — and also opaque to the team measuring agent quality, who see a regression they cannot trace because the transformation that caused it happened upstream of every log they kept.

Redaction preserves classification utility and destroys reasoning utility

The original intuition behind PII placeholders comes from single-turn tasks. "Classify this support ticket as billing/technical/other" works fine after Sarah Chen becomes [NAME_1], because the classifier never needed to know it was Sarah. The replacement is information-lossy in a way that doesn't matter for the question being asked.

Multi-step agents are doing a different kind of work. They're tracking entities across turns, comparing them, joining them across tool calls, and deciding whether two mentions corefer. The placeholder format that classifiers tolerate breaks every one of those operations the moment the mapping isn't stable.

Three common transformations, three different failure shapes:

Per-utterance random tokens ([NAME_4f9a] in turn 1, [NAME_7b2c] for the same person in turn 3) — coreference is destroyed. The agent reads two strangers where the user typed one name twice.
Per-type fixed tokens (every name becomes [NAME]) — coreference is over-collapsed. The agent reads "the customer and the agent" as the same person because both are [NAME].
Per-session stable tokens ([PERSON_47] always means the same person within a conversation) — coreference is preserved, but only if the upstream tokenizer correctly resolved which spans were coreferent in the first place. If "Sarah" and "Ms. Chen" got different tokens because the rule-based extractor didn't link them, the agent inherits the linking error.
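The three regimes above can be sketched in a few lines of Python. This is a minimal illustration, not a real redaction library: the function and class names are hypothetical, the "extractor" is assumed to have already found the name spans, and the session-stable version keys on the exact surface string, which is the simplifying assumption that lets "Ms. Chen" drift away from "Sarah Chen".

```python
import secrets


def redact_random(mentions):
    """Per-utterance random tokens: every mention gets a fresh token."""
    return [f"[NAME_{secrets.token_hex(8)}]" for _ in mentions]


def redact_fixed(mentions):
    """Per-type fixed tokens: every name collapses to the same token."""
    return ["[NAME]" for _ in mentions]


class SessionRedactor:
    """Per-session stable tokens, keyed on the exact surface string.

    Hypothetical sketch: a real layer would need coreference resolution
    to decide that "Ms. Chen" and "Sarah Chen" are the same person.
    """

    def __init__(self):
        self._ids = {}

    def redact(self, mentions):
        out = []
        for mention in mentions:
            if mention not in self._ids:
                self._ids[mention] = len(self._ids) + 1
            out.append(f"[PERSON_{self._ids[mention]}]")
        return out


# Mentions 0, 2 and 3 are the same person; mention 1 is someone else.
mentions = ["Sarah Chen", "Raj Patel", "Sarah Chen", "Ms. Chen"]

random_tokens = redact_random(mentions)
fixed_tokens = redact_fixed(mentions)
stable_tokens = SessionRedactor().redact(mentions)

print(random_tokens[0] == random_tokens[2])  # same person, tokens differ
print(fixed_tokens[0] == fixed_tokens[1])    # different people, tokens match
print(stable_tokens)
```

With the stable redactor the two "Sarah Chen" mentions share a token and "Raj Patel" gets his own, but "Ms. Chen" is assigned a third identity because nothing linked her to "Sarah Chen".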

The privacy review measures whether Sarah Chen ever left the boundary. The reasoning bug lives in which of these three regimes you picked, and whether your tokenizer's coreference judgment was right.

The infosec frame and the reasoning frame are not the same review

When a security team approves a redaction layer they're asking: does PII ever reach the model provider, does it ever land in logs we don't control, does the audit trail demonstrate compliance? Those are real questions and the layer answers them well.

The reasoning team is asking a question the security review doesn't have a column for: after redaction, can the agent still answer entity-relationship questions correctly? "Did this user contact us before?" "Is this the same payment method on file?" "Did the same account open the prior ticket?" Each of these is an identity-continuity question, and the answer depends on whether redaction preserved the equality relation between two spans.
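One way to give that question a column is to measure how often redaction preserves the equality relation between mention pairs. The sketch below assumes you have a small labeled sample where each mention carries a ground-truth entity id; the function name `equality_preservation` and the sample data are hypothetical.

```python
from itertools import combinations


def equality_preservation(entity_ids, tokens):
    """Fraction of mention pairs whose token equality matches true identity.

    entity_ids[i] is the ground-truth entity for mention i (assumed labels);
    tokens[i] is the placeholder the redaction layer emitted for mention i.
    """
    pairs = list(combinations(range(len(tokens)), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (entity_ids[i] == entity_ids[j]) == (tokens[i] == tokens[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Mentions 0, 2 and 3 are Sarah; mention 1 is someone else.
truth = ["sarah", "raj", "sarah", "sarah"]

perfect = ["[PERSON_1]", "[PERSON_2]", "[PERSON_1]", "[PERSON_1]"]
missed_link = ["[PERSON_1]", "[PERSON_2]", "[PERSON_1]", "[PERSON_3]"]
collapsed = ["[NAME]", "[NAME]", "[NAME]", "[NAME]"]

score_perfect = equality_preservation(truth, perfect)
score_missed = equality_preservation(truth, missed_link)
score_collapsed = equality_preservation(truth, collapsed)
```

A score of 1.0 means every "same entity?" question has the same answer before and after redaction. The missed link scores 4 of 6 pairs and the fully collapsed scheme scores 3 of 6, even though none of the three outputs leaks any PII.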

The two frames are not contradictory but they are not interchangeable. An infosec review can approve a layer that passes its tests while degrading the agent in ways no privacy metric measures. The teams typically don't catch it because the eval suite ran on real data (before redaction) and prod ran on redacted data, so the eval was measuring a different system than the one in production.
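The gap between the two systems shows up as soon as the eval harness is run through the same redaction step as production. The following is a toy sketch under loud assumptions: `toy_agent` is a stand-in for the model that simply compares the last word of two turns, `per_utterance_redact` is a hypothetical regex-based layer that issues a fresh token per mention, and the two cases are made up for illustration.

```python
import re
import secrets

# Simplified email pattern, good enough for the illustration only.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def per_utterance_redact(text):
    """Hypothetical layer: each email mention gets a fresh random token."""
    return EMAIL.sub(lambda m: f"[EMAIL_{secrets.token_hex(8)}]", text)


def toy_agent(turns):
    """Stand-in for the model: compares the last word of the two turns."""
    same = turns[0].split()[-1] == turns[1].split()[-1]
    return "same" if same else "different"


def evaluate(cases, agent, redact=lambda text: text):
    """Accuracy of `agent` on `cases`, optionally through a redaction step."""
    correct = 0
    for case in cases:
        turns = [redact(turn) for turn in case["turns"]]
        correct += agent(turns) == case["expected"]
    return correct / len(cases)


cases = [
    {"turns": ["login email: sarah@example.com",
               "billing email: sarah@example.com"],
     "expected": "same"},
    {"turns": ["login email: sarah@example.com",
               "billing email: s.chen@example.org"],
     "expected": "different"},
]

raw_accuracy = evaluate(cases, toy_agent)
redacted_accuracy = evaluate(cases, toy_agent, redact=per_utterance_redact)
```

On raw text the toy agent gets both cases right. Through the per-utterance redactor it can only ever answer "different", so it fails the "same" case, which is the regression an eval suite built on unredacted data would never see.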

The patterns that keep both teams honest


About Tian Pan
