
How PII Redaction Sentinels Quietly Collapse Your Vector Index

· 10 min read
Tian Pan
Software Engineer

A support engineer pulled up your RAG console to debug a complaint. The customer had asked "what does my account look like right now," the answer had come back coherent and confident, and it had been about somebody else's account entirely. The top-3 retrieved chunks all belonged to other customers. The engineer ran the same query against a fresh corpus snapshot to rule out indexing lag. Same result. Then she ran it against a snapshot from six months ago, before the privacy redactor had shipped. The right customer's chunk came back at rank 1.

The redactor was working as designed. Every name was a [NAME], every email an [EMAIL], every account number an [ACCOUNT]. The legal team had a clean audit trail and the security team had a closed compliance ticket. What nobody on either team had modeled was that those sentinels, dropped into the same syntactic slots across millions of documents, were being seen by the embedding model as ordinary tokens — tokens that co-occurred more reliably with each other than any real content did. The redactor had not just removed information. It had added a new, very strong signal that every redacted document shared and nothing else did.

The retrieval index did exactly what retrieval indices do. It found the documents most similar to the query. And once enough records had been pushed through the redactor, the most similar documents were not the ones with the most relevant content — they were the ones with the most redactor artifacts. The top-k results started to look like a uniform sludge of other customers' redacted records, all sitting in a tight neighborhood the embedding model had inadvertently learned to recognize. The system was doing privacy correctly and retrieval wrong, and the two failures were the same failure seen from different angles.

The Seam Between Privacy and Retrieval Nobody Owns

The reason this failure is so durable is that no single team has end-to-end visibility into it. Privacy redaction is owned by security or legal. They evaluate the redactor on precision and recall against a PII detection benchmark: did it catch the names, did it leave the non-names alone. The benchmark says yes, and the team moves on.

Retrieval quality is owned by the AI platform team. They evaluate the embedder on a clean test set: queries with known relevant documents, NDCG at 10, mean reciprocal rank. The benchmark says nothing has regressed, because the test set was built before the redactor existed and the redactor was not in the loop when the evals ran.

The redactor's second-order effect on embedding geometry sits squarely between the two teams. Security has no model of what a transformer does with a [NAME] token. The platform team has no model of how often [NAME] will appear in a document or how many other sentinels it will appear next to. The seam is invisible to both org charts and to both eval suites, and the symptom — retrieval returning records that share nothing semantically except their redaction history — only shows up at serving time, in production, on real customer queries, where neither team is looking.

Some teams catch this when the support volume spikes. Many do not catch it at all, and instead conclude that "the embedding model is just bad on our data" and start shopping for a replacement embedder. The replacement embedder, of course, exhibits the same behavior on the same redacted corpus, because the problem was never the embedder.

Why the Embedding Model Treats Sentinels as Strong Signal

A transformer embedding model is trained to produce vectors that are close together when the underlying texts are semantically similar. "Semantically similar" in practice means "tend to appear in similar contexts." The training objective rewards the model for noticing reliable co-occurrence patterns and embedding them as geometric proximity.

A redactor sentinel is, from the model's perspective, an extremely reliable token. [NAME] appears next to [EMAIL] constantly. [ACCOUNT] appears near [NAME] constantly. The phrases around them are often templated as well, and once the redactor has replaced the only parts that varied, "the customer, [NAME], with email [EMAIL], has an account [ACCOUNT]" repeats with mechanical consistency across millions of records. The embedder does not need to be retrained on your corpus for this to matter: texts that share the same tokens in the same slots are mapped to nearby vectors, so these records collapse into what is effectively a "this is a redacted customer record" region. That region is geometrically tighter than any of the actual semantic clusters in your corpus, because the actual semantic clusters are noisier and less templated than your redactor's output.

The result is what the literature on high-dimensional spaces calls hubness — a small region of the embedding space that nearest-neighbor search keeps returning. Hubness is a well-studied pathology of nearest-neighbor retrieval in high dimensions: a few points become close to a disproportionate share of all queries, and retrieval quality degrades because those points are returned over and over while semantically relevant points get pushed out of the top-k. The redaction artifact is functionally a hubness-inducing feature you injected into your own corpus.
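One common way to put a number on hubness is the k-occurrence count: for each corpus vector, how many queries include it in their top-k. A strongly right-skewed distribution of those counts suggests a few points are absorbing a disproportionate share of results. The sketch below is a minimal brute-force version in plain Python; the function names are illustrative, and it assumes non-zero vectors and a corpus small enough to scan exhaustively.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def k_occurrence(queries, corpus, k):
    """For each corpus vector, count the queries whose top-k contains it."""
    counts = Counter()
    for q in queries:
        ranked = sorted(
            range(len(corpus)),
            key=lambda i: cosine(q, corpus[i]),
            reverse=True,
        )
        counts.update(ranked[:k])
    return [counts[i] for i in range(len(corpus))]


def skewness(values):
    """Population skewness; large positive values hint at hubs."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        return 0.0
    third_moment = sum((v - mean) ** 3 for v in values) / n
    return third_moment / var ** 1.5
```

Running this on a sample of production queries before and after the redactor ships gives a rough before/after comparison. What counts as "too skewed" is a judgment call for your corpus rather than a fixed threshold.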

What makes this worse is that the queries are usually not redacted. A user asking "what does my account look like" types the unredacted question. The query vector is positioned by the embedder according to ordinary semantic content. The corpus vectors are positioned by the embedder according to ordinary semantic content plus a pile of artificial co-occurrences from the redactor. The asymmetry means the queries that should match real records get pulled toward the redaction cluster, because that cluster is dense and central and your real records are not.

The Audit You Have To Invent

No standard observability surface in your stack will tell you that the similarity driving your top-k results is mostly redaction artifact and only marginally semantic content. Vector databases report latency, recall against a known set, and storage utilization. Embedding pipelines report throughput and dimension. Retrieval evals report NDCG against benchmarks that were never designed with redactor sentinels in mind.

You have to build the audit yourself, and the team that has to build it is whichever team is closest to the customer complaint when it arrives. A workable starting point looks like this:

  • Sample a few hundred production queries and pull the top-k for each.
  • For each retrieved chunk, count the density of known sentinel patterns ([NAME], [EMAIL], [ACCOUNT], plus whatever bespoke ones your redactor uses).
  • Compare that density against the corpus-wide average density.
  • If the top-k chunks have systematically higher sentinel density than a random sample of the corpus, you have artifact contamination — the index is sorting on the redactor's vocabulary rather than your content.
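The steps above can be sketched in a few lines. This is a minimal version that assumes the three sentinel patterns named in this article and measures density as sentinels per whitespace-delimited token; the function names and the density definition are illustrative choices, not a standard metric.

```python
import re

# Assumed sentinel patterns; extend with whatever your redactor emits.
SENTINEL_RE = re.compile(r"\[(?:NAME|EMAIL|ACCOUNT)\]")


def sentinel_density(text):
    """Sentinels per whitespace-delimited token (0.0 for empty text)."""
    tokens = text.split()
    if not tokens:
        return 0.0
    return len(SENTINEL_RE.findall(text)) / len(tokens)


def mean_density(chunks):
    """Average sentinel density over a collection of chunks."""
    chunks = list(chunks)
    if not chunks:
        return 0.0
    return sum(sentinel_density(c) for c in chunks) / len(chunks)


def contamination_ratio(retrieved_chunks, corpus_sample):
    """Density of retrieved chunks relative to a random corpus sample.

    A ratio well above 1.0 suggests the index is favoring redactor
    artifacts over content.
    """
    baseline = mean_density(corpus_sample)
    retrieved = mean_density(retrieved_chunks)
    if baseline == 0:
        return float("inf") if retrieved > 0 else 1.0
    return retrieved / baseline
```

In practice you would pool the top-k chunks from a few hundred sampled queries as `retrieved_chunks` and draw `corpus_sample` uniformly at random from the index, then track the ratio over time.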

A second, sharper test: take a redacted chunk and a non-redacted chunk that you know describe the same real customer fact. Embed both. Measure the distance. Then measure the distance between two redacted chunks describing totally unrelated customer facts. If the unrelated-but-both-redacted pair is closer than the same-content-different-redaction-state pair, you have proven the embedder is treating redaction state as a stronger signal than content. That is your problem in its purest form, and once you can produce that demonstration, both the privacy team and the platform team will finally understand they are looking at the same bug.
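A minimal sketch of that paired test follows. It takes the embedding function as a parameter, because the embedder is whatever your pipeline already uses; `embed` is assumed to map a string to a non-zero list of floats, and the function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """1 - cosine similarity, for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def redaction_dominates(embed, redacted, unredacted, unrelated_redacted):
    """Compare same-fact distance against same-redaction-state distance.

    redacted / unredacted describe the same customer fact;
    unrelated_redacted is a redacted chunk about a different fact.
    Returns (dominates, same_fact_distance, same_state_distance).
    """
    anchor = embed(redacted)
    same_fact = cosine_distance(anchor, embed(unredacted))
    same_state = cosine_distance(anchor, embed(unrelated_redacted))
    return same_state < same_fact, same_fact, same_state
```

A single triple proves little on its own. Running this over a few dozen hand-picked triples and reporting the fraction where `dominates` is true makes a more convincing demonstration for both teams.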

Patterns That Close the Gap

Once the problem is named, the fixes split along three different time horizons.

The most direct fix is to make the redactor produce sentinels that do not all collide in token space. Instead of using a single [NAME] for every redaction, hash the original value (under a key the security team holds) into a token from a large vocabulary of plausible-looking but meaningless sentinels: [NAME-7a2c], [NAME-d191], and so on. The embedding model now sees a diverse population of tokens in those slots, the templated co-occurrence pattern dissolves, and the hubness collapses. The privacy property is largely preserved because a keyed hash cannot be inverted, and without the key an attacker cannot rebuild the mapping by hashing guessed values. The same value does always map to the same sentinel, so records about one person become linkable to each other, which is a trade-off the privacy team should sign off on. The cost is a much larger sentinel vocabulary, a few more tokens per redaction, and a redactor that has to carry a hashing dependency and a key. The benefit is that the geometry of your vector index goes back to measuring meaning.
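As a rough illustration of keyed sentinels, the sketch below uses HMAC-SHA256 from Python's standard library and keeps a short hex prefix of the digest. The function name, the normalization step (trim and lowercase), and the default of eight hex characters are assumptions made for this example; a shorter prefix gives tidier tokens but more collisions between distinct values.

```python
import hashlib
import hmac


def keyed_sentinel(kind, value, key, digest_chars=8):
    """Map a PII value to a sentinel such as [NAME-1a2b3c4d].

    kind:  sentinel label, e.g. "NAME" or "EMAIL"
    value: the original PII string
    key:   secret bytes held by the security team
    """
    # Assumed normalization so "Alice " and "alice" map to one token.
    normalized = value.strip().lower().encode("utf-8")
    digest = hmac.new(key, normalized, hashlib.sha256).hexdigest()
    return f"[{kind}-{digest[:digest_chars]}]"
```

For example, `keyed_sentinel("EMAIL", "alice@example.com", key)` returns the same token every time for a given key, while a different address almost certainly returns a different one.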

A more careful long-term fix is embedding-aware redaction. Before deploying a new redactor or a new sentinel scheme, run a synthetic eval that measures the cluster impact of the proposed sentinels — embed a representative slice of the corpus with and without redaction, measure the change in pairwise distance distribution and the change in hubness statistics, and gate the deployment on those numbers the same way you gate it on PII recall. This makes the redactor a first-class consumer of retrieval metrics, which is the only way to break the org-chart seam permanently.
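One possible shape for such a gate, sketched under simple assumptions: embed the same slice of documents with and without redaction, compare the mean pairwise cosine distance, and fail the deployment if redaction shrinks it by more than a chosen fraction. The function names and the 10% default are placeholders, not recommended values, and a fuller gate would also compare hubness statistics.

```python
import math
from itertools import combinations


def cosine_distance(a, b):
    """1 - cosine similarity, for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def mean_pairwise_distance(vectors):
    """Average cosine distance over all unordered pairs."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        raise ValueError("need at least two vectors")
    return sum(cosine_distance(a, b) for a, b in pairs) / len(pairs)


def passes_geometry_gate(raw_vectors, redacted_vectors, max_relative_drop=0.10):
    """True if redaction did not collapse the corpus geometry too far.

    raw_vectors and redacted_vectors are embeddings of the same
    documents before and after the proposed redactor runs.
    """
    before = mean_pairwise_distance(raw_vectors)
    after = mean_pairwise_distance(redacted_vectors)
    if before == 0:
        return True  # nothing to collapse
    relative_drop = (before - after) / before
    return relative_drop <= max_relative_drop
```

The all-pairs computation is quadratic, so this is meant for a representative slice of a few thousand documents rather than the full index.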

The retrieval-time mitigation is the cheapest to bolt on and the easiest to get wrong. You can down-weight matches that score highly because of sentinel density, either by post-filtering the top-k or by training a small reranker that has explicitly seen redacted text. The risk is that you also down-weight legitimately redacted records that the user actually wants. A user asking about their own account does want to retrieve their own redacted record. The reranker has to learn to distinguish "redaction is part of why this matched" from "redaction is irrelevant to why this matched," which is a real problem that "just penalize sentinels" does not solve. Treat retrieval-time fixes as the short-term tourniquet, not the long-term answer.

What This Generalizes To

The PII redactor is one instance of a broader pattern that shows up whenever a pipeline component upstream of the embedder produces structurally repetitive output. Templated boilerplate from a CMS does this. Auto-generated header and footer text in PDFs does this. A summarization step that always opens with "this document discusses" does this. Anything that injects a consistent, low-entropy pattern into a large fraction of your corpus is a potential hubness source, and the embedder will treat it as the strongest feature in the room.

The lesson is not "be afraid of redaction" or "stop sanitizing your data." The lesson is that an embedding model is a sensitive instrument that responds to whatever is most reliably present in its input, and that "most reliably present" is rarely the same thing as "most informative." Every pipeline stage upstream of the embedder is, whether the team owning it knows it or not, a participant in your retrieval quality. The teams that ship durable RAG systems are the ones that treat retrieval geometry as a shared concern, audit it the way they audit latency and cost, and check the embedding space for artifacts the same way a database team checks an index for bloat.

The customer whose account got swapped for somebody else's deserved better than a system that confidently retrieved the wrong record. The fix was not a smarter embedder. The fix was noticing that the redactor had become a louder signal than the content, and giving both teams a shared way to see it.
