The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

10 min read
Tian Pan
Software Engineer

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path. Any one of these is a shared substrate where one user's data can land in another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.

The Function Lies, the Substrate Tells the Truth

When an engineer registers search_documents(query: str) -> list[Doc] as a tool, the LLM sees a clean interface: take a string, return some documents. The tool's docstring promises stateless retrieval. The implementation, three layers down, hits a vector index that contains every tenant's documents in a single collection because that was cheaper to operate when the product had eleven customers. There is no tenant_id parameter, because the LLM does not need one — and that is precisely the problem. The tool was designed for a single-tenant prototype and shipped into a multi-tenant product without anyone reopening the question of where the authorization check lives.
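
One way to put the authorization check back where it belongs, sketched under assumptions: the model still sees a one-argument tool, but the tenant comes from the authenticated request and is closed over before the tool is registered. RequestContext, bind_tool, and the in-memory SHARED_INDEX are hypothetical names for illustration, not any framework's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class RequestContext:
    """Hypothetical per-request identity, filled in by the auth layer."""
    tenant_id: str
    user_id: str


@dataclass(frozen=True)
class Doc:
    tenant_id: str
    text: str


# Stand-in for the shared index: every tenant's documents in one collection.
SHARED_INDEX = [
    Doc("acme", "acme invoice 1042"),
    Doc("globex", "globex invoice 7"),
]


def search_documents_scoped(ctx: RequestContext, query: str) -> list[Doc]:
    """The tenant check lives here, next to the data access."""
    return [
        d for d in SHARED_INDEX
        if d.tenant_id == ctx.tenant_id and query.lower() in d.text.lower()
    ]


def bind_tool(ctx: RequestContext) -> Callable[[str], list[Doc]]:
    """Return the one-argument tool the model sees; ctx is closed over."""
    def search_documents(query: str) -> list[Doc]:
        return search_documents_scoped(ctx, query)
    return search_documents


acme_tool = bind_tool(RequestContext(tenant_id="acme", user_id="u1"))
globex_tool = bind_tool(RequestContext(tenant_id="globex", user_id="u2"))
acme_results = acme_tool("invoice")
globex_results = globex_tool("invoice")
```

The design choice is that the model never supplies the tenant. If tenant_id were a tool argument, a prompt injection could change it; bound at registration time, it can only be whatever the auth layer said it was.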

The substrate is the truth. If your Pinecone index is shared across tenants and your filter predicate lives in application code, then the moment someone forgets to add filter={"tenant_id": ctx.tenant_id} to a query, you have built a tenant-leaking system. If your pgvector table has no row-level security policy, then a query that omits the WHERE tenant_id = $1 clause returns everyone's neighbors. If your Redis cache uses the prompt hash as the key, then two tenants who happen to ask the same question — "summarize my latest invoice" — collide on the cache and one gets the other's invoice summary. The function signature said nothing about any of this, because the function signature lives in a different abstraction layer than the bug.
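
The cache collision is easy to reproduce in a few lines. This is a minimal sketch using a plain dict in place of Redis; prompt_only_key and tenant_scoped_key are hypothetical helpers, and the tenant names are made up.

```python
import hashlib


def prompt_only_key(prompt: str) -> str:
    """The leaky pattern: the key depends on the prompt text alone."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def tenant_scoped_key(tenant_id: str, prompt: str) -> str:
    """Hypothetical fix: mix the authenticated tenant into the key.

    The length prefix keeps ("ab", "c") and ("a", "bc") from
    serializing to the same byte string.
    """
    payload = f"{len(tenant_id)}:{tenant_id}:{prompt}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


prompt = "summarize my latest invoice"

# Leaky cache: tenant A writes, tenant B reads the same slot.
leaky_cache = {prompt_only_key(prompt): "summary of tenant A's invoice"}
seen_by_b = leaky_cache.get(prompt_only_key(prompt))

# Scoped cache: tenant B's key misses, so B falls through to a fresh call.
scoped_cache = {
    tenant_scoped_key("tenant-a", prompt): "summary of tenant A's invoice"
}
scoped_seen_by_b = scoped_cache.get(tenant_scoped_key("tenant-b", prompt))
```

In the leaky version, seen_by_b holds tenant A's summary. In the scoped version the lookup misses, which costs one extra model call and leaks nothing.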

The 2023 ChatGPT incident is the canonical example, and worth re-reading even three years later. A race condition in the redis-py asyncio client meant that a request canceled mid-flight could leave its connection in a corrupted state, and the next request that picked up that connection from the pool could receive data meant for the previous user. Chat titles surfaced in other users' sidebars, and for a small share of paying subscribers, names, billing addresses, and the last four digits of credit card numbers were exposed to other users during a nine-hour window before the contamination was caught. The application code was correct. The tool functions were stateless. The connection pool, two libraries down, was not.

Where the Leaks Actually Live

In practice the leaks cluster in five places, and a privacy audit that does not enumerate all five is incomplete.

Vector indexes shared across tenants. Approximate nearest neighbor is approximate. Even with a tenant filter applied at query time, a misconfigured index can return cross-tenant neighbors before the filter runs, and any logging layer between the index and the filter sees data the user should never have access to. The safe pattern is hard isolation: a separate namespace per tenant, or a separate index. Pinecone namespaces, Weaviate's multi-tenancy (which gives each tenant its own shard), or pgvector with row-level security enforced at the database, not at the service, close this gap. Filter-after-retrieve does not.
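
To make the difference between hard isolation and filter-after-retrieve concrete, here is a toy in-memory index with one partition per tenant. It is a sketch, not any vendor's client: NamespacedIndex is a hypothetical class, the search is brute-force cosine similarity rather than real ANN, and the two-dimensional vectors are placeholders for embeddings.

```python
import math
from collections import defaultdict


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class NamespacedIndex:
    """Toy index with one partition per tenant (hypothetical)."""

    def __init__(self) -> None:
        self._partitions: dict[str, list[tuple[list[float], str]]] = defaultdict(list)

    def upsert(self, tenant_id: str, vector: list[float], text: str) -> None:
        self._partitions[tenant_id].append((vector, text))

    def query(self, tenant_id: str, vector: list[float], k: int = 3) -> list[str]:
        # Only the caller's partition is ever scanned, so there is no
        # cross-tenant candidate set for a later filter (or a log line) to see.
        candidates = self._partitions.get(tenant_id, [])
        ranked = sorted(
            candidates, key=lambda item: cosine(vector, item[0]), reverse=True
        )
        return [text for _, text in ranked[:k]]


index = NamespacedIndex()
index.upsert("acme", [1.0, 0.0], "acme pricing memo")
index.upsert("globex", [1.0, 0.0], "globex pricing memo")

hits = index.query("acme", [1.0, 0.0])
unknown_tenant_hits = index.query("initech", [1.0, 0.0])
```

Both documents have identical vectors, so a shared index would rank them as equally good neighbors. With partitioning, the acme query can only ever return the acme document, and an unknown tenant gets an empty list.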

Prompt and response caches keyed on content. A "semantic cache" that stores (prompt_embedding -> response) and serves any request within a similarity threshold is a cross-tenant leak factory. Two tenants asking near-identical questions hit the same cache entry; the second tenant gets the first tenant's answer, which contains the first tenant's data because the model's response was personalized to their context. The fix is to make tenant ID part of the cache key — every cache key — and to audit every layer of caching, including the ones your framework added without telling you.
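
A tenant-partitioned semantic cache might look roughly like the following. This is a sketch under assumptions: TenantSemanticCache is a hypothetical class, the embeddings are hand-written two-dimensional vectors, and the 0.95 threshold is an arbitrary illustration rather than a recommended value.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class TenantSemanticCache:
    """Similarity cache whose entries are partitioned by tenant (hypothetical)."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: dict[str, list[tuple[list[float], str]]] = {}

    def put(self, tenant_id: str, embedding: list[float], response: str) -> None:
        self._entries.setdefault(tenant_id, []).append((embedding, response))

    def get(self, tenant_id: str, embedding: list[float]) -> Optional[str]:
        # Similarity is only ever computed against the caller's own entries.
        best_score, best_response = float("-inf"), None
        for stored, response in self._entries.get(tenant_id, []):
            score = cosine(embedding, stored)
            if score >= self.threshold and score > best_score:
                best_score, best_response = score, response
        return best_response


cache = TenantSemanticCache()
cache.put("tenant-a", [0.9, 0.1], "tenant A's personalized answer")

same_tenant_hit = cache.get("tenant-a", [0.9, 0.11])   # near-identical question
other_tenant_hit = cache.get("tenant-b", [0.9, 0.1])   # identical question
unrelated_hit = cache.get("tenant-a", [0.0, 1.0])      # different question
```

Tenant B asks the exact same question and still misses, because the tenant chooses the partition before similarity is ever considered. The hit rate drops compared with a global cache, and that drop is the price of not serving one tenant's answer to another.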

Long-term memory stores. Agent memory is the new session storage, and the bugs that haunted PHP session handling in 2008 are back, dressed up as embedding stores. Recent research on memory-extraction attacks (MEXTRA) shows that agents whose memory module lacks per-user partitioning will happily surface another user's stored facts when prompted with the right cue. If you have a "memory" tool that retrieves "what we talked about before," ask whose "before" it means, and verify that the partition key is the authenticated user: not a session token, not a workspace ID, and certainly not a hash that collides under load.
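
A minimal sketch of that partitioning, assuming the auth layer hands the memory tool an identity object: AuthenticatedUser and MemoryStore are hypothetical names, and substring matching stands in for whatever retrieval the real memory module uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthenticatedUser:
    """Hypothetical identity produced by the auth layer, never by the model."""
    tenant_id: str
    user_id: str


class MemoryStore:
    """Agent memory partitioned by (tenant, user)."""

    def __init__(self) -> None:
        self._facts: dict[tuple[str, str], list[str]] = {}

    def remember(self, user: AuthenticatedUser, fact: str) -> None:
        self._facts.setdefault((user.tenant_id, user.user_id), []).append(fact)

    def recall(self, user: AuthenticatedUser, cue: str) -> list[str]:
        # The partition comes from the authenticated identity. The cue, which
        # may be shaped by an attacker's prompt, only filters inside it.
        own = self._facts.get((user.tenant_id, user.user_id), [])
        return [fact for fact in own if cue.lower() in fact.lower()]


alice = AuthenticatedUser(tenant_id="acme", user_id="alice")
bob = AuthenticatedUser(tenant_id="acme", user_id="bob")

memory = MemoryStore()
memory.remember(alice, "Home address is 12 Elm St")

alice_recall = memory.recall(alice, "address")
bob_recall = memory.recall(bob, "address")
```

Alice and Bob share a tenant here on purpose. A workspace-level partition would hand Bob the address; the per-user key does not, no matter how the cue is worded.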

Tool result caches. The optimization is irresistible: tool calls are slow and expensive, so cache the result. The bug is that tool results are scoped to the caller, and a cache that does not include the caller in its key returns Tenant A's database rows to Tenant B. This shows up most often when a "stateless" lookup tool wraps an internal API that itself is multi-tenant; the tool layer caches by query parameters; the API would have rejected the cross-tenant call, but the cache never asked the API.
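
The failure mode and its fix can be sketched with a stub API and a small caching wrapper. Everything named here is hypothetical: internal_api stands in for a multi-tenant service that enforces ownership, cached_tool is an illustrative wrapper, and CALLS exists only so the example can show which requests actually reached the API.

```python
from typing import Callable

CALLS: list[tuple[str, str]] = []  # which calls actually reached the API


def internal_api(tenant_id: str, customer_id: str) -> str:
    """Stand-in for a multi-tenant internal API that enforces ownership."""
    CALLS.append((tenant_id, customer_id))
    owners = {"cust-1": "tenant-a"}
    if owners.get(customer_id) != tenant_id:
        raise PermissionError("customer does not belong to caller's tenant")
    return f"rows for {customer_id}"


def cached_tool(api: Callable[[str, str], str]) -> Callable[[str, str], str]:
    cache: dict[tuple[str, str], str] = {}

    def lookup(tenant_id: str, customer_id: str) -> str:
        key = (tenant_id, customer_id)  # the caller is part of the key
        if key not in cache:
            # A miss always asks the API, so its authorization check runs.
            # Exceptions propagate and are never cached.
            cache[key] = api(tenant_id, customer_id)
        return cache[key]

    return lookup


lookup = cached_tool(internal_api)
first = lookup("tenant-a", "cust-1")
again = lookup("tenant-a", "cust-1")  # served from cache, no API call

try:
    lookup("tenant-b", "cust-1")
    cross_tenant_blocked = False
except PermissionError:
    cross_tenant_blocked = True
```

Had the key been customer_id alone, tenant B's call would have been a cache hit and the API's ownership check would never have run. The same three calls double as a regression test: warm the cache as one tenant, then assert that a second tenant's identical request is rejected.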
