Permission-Aware Retrieval: Why Access Control in Enterprise RAG Must Live in the Vector Layer

· 9 min read
Tian Pan
Software Engineer

Here is a failure mode that shows up in nearly every enterprise RAG deployment: an employee asks the internal AI assistant a question about compensation policy. The system returns correct, specific information — pulled from an HR document the employee was never supposed to see. No one notices immediately, because no one is watching the retrieval layer. But the confidential document was indexed, the user's query hit it semantically, and the model faithfully reported what it found.

The mistake isn't unusual. It's the default outcome when teams apply public-web RAG patterns to private organizational knowledge without adapting the architecture. Web RAG has no access control layer because public web content has none. Enterprise data does — and that constraint changes the entire system design.

The Core Assumption That Breaks Everything

Public-web RAG is designed around a simple model: index everything, retrieve by semantic similarity, generate from what you find. This works because the retrieval corpus is homogeneous (text from the open web) and access-unrestricted. Anyone can read what anyone else retrieves.

Enterprise knowledge breaks both properties. A typical organization has Confluence documentation, Slack message history, Google Drive files, CRM records, org charts, meeting notes, and engineering wikis — each with its own permission model. A support engineer can read customer tickets but not salary bands. A junior analyst can read public dashboards but not board materials. The data lives across dozens of systems with overlapping, inconsistent access policies.

When you index all of this into a vector store and query by similarity, you merge these permission models into a single retrieval surface. Without additional controls, the retrieval layer becomes a privilege escalation vector: a user with limited access can trigger retrieval of documents they couldn't directly open, and the model will synthesize those documents into a response as if they were fair game.

The straightforward fix — "filter results in application code before showing them to the user" — is where most teams stop. It's insufficient. The model has already processed the document. If the application layer catches the retrieval only after generation, the damage is done. And application-layer filters are easy to bypass through prompt injection, API misuse, or simple bugs.

The real fix is enforcing access control earlier in the pipeline, at the vector layer itself.

Why BM25 Beats Embeddings for Internal Entity Queries

Before going further into security, there's a retrieval quality problem that bites before security ever does: enterprise knowledge contains a different kind of query than web RAG is optimized for.

Web queries tend to be conceptual: "what is the refund policy?" or "how do I configure SSL certificates?" These benefit from semantic embedding because the user's vocabulary may differ from the document's vocabulary. Embedding models compress both into a shared semantic space.

Internal queries are heavily entity-specific: "find the Q3 sales pipeline review for the EMEA team," "what's the on-call rotation for the payments service," "show me Sarah Kim's project assignment from last quarter." These are lookups against proper nouns, identifiers, and exact phrases that embedding models trained on public internet text handle poorly. Internal jargon, product codes, team names, and org-specific acronyms are statistically rare in public training data and underrepresented in embedding space.

For these queries, BM25 (Best Matching 25) — a lexical ranking algorithm based on term frequency and inverse document frequency — consistently outperforms dense embeddings. The reason is simple: BM25 rewards exact term match, and an exact match on a proper noun is almost always the right answer when someone asks for a specific document by name.
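The BM25 scoring function described above can be sketched in a few lines of plain Python. This is an illustrative implementation of the classic formula (the documents, queries, and parameter values are made up for the example; production systems would use an engine like Elasticsearch or Lucene rather than hand-rolled scoring):

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized document against the query using classic BM25:
    idf(t) * tf * (k1 + 1) / (tf + k1 * (1 - b + b * |D| / avgdl))."""
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency per term, with standard +0.5 smoothing in the IDF.
    df = Counter()
    for d in docs:
        for t in set(d):
            df[t] += 1
    def idf(t):
        return math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
    scores = []
    for d in docs:
        tf = Counter(d)
        s = 0.0
        for t in query_terms:
            if t not in tf:
                continue
            norm = 1 - b + b * len(d) / avgdl
            s += idf(t) * tf[t] * (k1 + 1) / (tf[t] + k1 * norm)
        scores.append(s)
    return scores

docs = [
    "q3 sales pipeline review emea team".split(),
    "general overview of sales best practices".split(),
    "emea marketing notes".split(),
]
print(bm25_scores("q3 emea pipeline".split(), docs))
```

Note how the entity-heavy query ranks the first document far above the others: exact matches on rare terms like "q3" and "pipeline" carry high IDF weight, which is precisely why BM25 wins on proper-noun lookups.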

Production enterprise search systems use hybrid retrieval: BM25 and dense embeddings run in parallel, and their ranked result lists are fused using Reciprocal Rank Fusion or weighted scoring. This handles both entity lookups (where BM25 wins) and conceptual queries (where embeddings win). Teams that rely exclusively on semantic search for internal knowledge bases sacrifice a significant portion of the retrieval quality they could get for free by adding a BM25 column.
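Reciprocal Rank Fusion itself is small enough to show inline. Each document's fused score is the sum of 1/(k + rank) over every result list it appears in, so a document ranked well by either retriever surfaces near the top (the document IDs here are placeholders, and k=60 is the commonly used default):

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of doc IDs: each doc accumulates
    1 / (k + rank) for every list it appears in, then sort by score."""
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_hits = ["doc_a", "doc_c", "doc_b"]   # lexical ranking
dense_hits = ["doc_b", "doc_a", "doc_d"]  # embedding ranking
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Because RRF operates only on ranks, it needs no score normalization between the lexical and dense retrievers, which is why it is a popular fusion choice in hybrid setups.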

Access Control Must Live in the Retrieval Layer

There are two places in a RAG pipeline where access control can be enforced. Most teams choose the application layer because it's where their existing authorization logic lives, but the security boundary belongs in the vector layer.

The application-layer approach works roughly like this: query the vector store without filters, retrieve the top-k documents, then discard results the user isn't allowed to see. This has two failure modes. First, it's inefficient — you're fetching documents only to discard them, which wastes both compute and context window space. Second, it's a thin defense: if the application logic has a bug, makes a wrong assumption, or is circumvented by a prompt injection attack, the documents flow through anyway.

The vector-layer approach pushes access control down into the retrieval query itself. At index time, each document chunk is tagged with metadata representing its access policy — which users or roles can retrieve it. At query time, the retrieval query includes a filter that restricts results to chunks the requesting user has permission to see. The vector store never returns an unauthorized document in the first place; the model never processes it.
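A minimal sketch of the idea, using a toy in-memory index (the chunk IDs, role names, and two-dimensional vectors are invented for illustration; a real deployment would express the same filter in its vector store's metadata query language):

```python
import math

# Each chunk is tagged at index time with the principals allowed to retrieve it.
INDEX = [
    {"id": "hr-comp-policy",  "vec": [0.9, 0.1], "allowed": {"role:hr"}},
    {"id": "eng-oncall",      "vec": [0.2, 0.8], "allowed": {"role:eng", "role:hr"}},
    {"id": "public-handbook", "vec": [0.7, 0.3], "allowed": {"role:all"}},
]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def retrieve(query_vec, principals, top_k=2):
    """Apply the ACL filter BEFORE similarity ranking, so unauthorized
    chunks are never candidates and never reach the model."""
    visible = [c for c in INDEX if c["allowed"] & principals]
    ranked = sorted(visible, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:top_k]]

# A user holding only engineering roles never retrieves the HR policy chunk,
# even when the query vector is semantically closest to it.
print(retrieve([0.9, 0.1], {"role:eng", "role:all"}))
```

The key design point is ordering: the permission predicate runs before, not after, the similarity ranking, so an unauthorized chunk is excluded from the candidate set entirely rather than fetched and then discarded.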
