
The Privacy Boundary No One Tests: Why 'Stateless' Tools Are the AI-Era IDOR

10 min read
Tian Pan
Software Engineer

A tool labeled "stateless" is a promise the runtime cannot keep. Behind the function signature sits a Redis cache, a vector index, an embedding store, a rate-limit table, a memoization layer, an LRU on the hot path — any one of which is a shared substrate where one user's data can land on another user's response. The function is stateless. The system is not. And in 2026, this is the most common privacy bug I see in agentic systems, because almost no one tests for it.

The shape of the bug is depressingly familiar to anyone who has worked on classic web apps. Insecure Direct Object Reference — IDOR — was the bread and butter of bug bounty for a decade: a request handler that accepts a record ID and returns the record without checking whether the caller is allowed to see it. The AI-era version is the same bug with a worse blast radius: a tool call that accepts a query and returns data without checking whether the caller's tenant owns that data. The query is in natural language. The cache key is a hash. The retrieval is approximate. None of those things absolve you of authorization, but each of them makes the bug harder to spot in code review.

The Function Lies, the Substrate Tells the Truth

When an engineer registers search_documents(query: str) -> list[Doc] as a tool, the LLM sees a clean interface: take a string, return some documents. The tool's docstring promises stateless retrieval. The implementation, three layers down, hits a vector index that contains every tenant's documents in a single collection because that was cheaper to operate when the product had eleven customers. There is no tenant_id parameter, because the LLM does not need one — and that is precisely the problem. The tool was designed for a single-tenant prototype and shipped into a multi-tenant product without anyone reopening the question of where the authorization check lives.

The substrate is the truth. If your Pinecone index is shared across tenants and your filter predicate lives in application code, then the moment someone forgets to add filter={"tenant_id": ctx.tenant_id} to a query, you have built a tenant-leaking system. If your pgvector table has no row-level security policy, then a query that omits the WHERE tenant_id = $1 clause returns everyone's neighbors. If your Redis cache uses the prompt hash as the key, then two tenants who happen to ask the same question — "summarize my latest invoice" — collide on the cache and one gets the other's invoice summary. The function signature said nothing about any of this, because the function signature lives in a different abstraction layer than the bug.
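A minimal sketch of that last failure mode, with a plain dict-style key function standing in for whatever your Redis layer does; the names are illustrative. If the cache key is derived from the prompt alone, two tenants asking the same question collide on one entry; folding the authenticated tenant into every key removes the collision by construction.

```python
import hashlib

def leaky_key(prompt: str) -> str:
    # Key is the prompt hash alone: "summarize my latest invoice" from
    # Tenant A and Tenant B produces the same key and shares one entry.
    return hashlib.sha256(prompt.encode()).hexdigest()

def scoped_key(tenant_id: str, prompt: str) -> str:
    # The authenticated tenant is part of every cache key, so identical
    # prompts from different tenants can never collide.
    return hashlib.sha256(f"{tenant_id}:{prompt}".encode()).hexdigest()

assert leaky_key("summarize my latest invoice") == leaky_key("summarize my latest invoice")
assert scoped_key("tenant-a", "summarize my latest invoice") != scoped_key("tenant-b", "summarize my latest invoice")
```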

The 2023 ChatGPT incident is the canonical example, and worth re-reading even three years later. A race condition in the redis-py async client let a request that was canceled mid-flight leave stale bytes on a pooled connection, and the next request that picked up that connection received the previous user's response. Names, billing addresses, the last four digits of credit card numbers — all of it leaked across users for nine hours before the cache layer's cross-user contamination was caught. The application code was correct. The tool functions were stateless. The connection pool, two libraries down, was not.

Where the Leaks Actually Live

In practice the leaks cluster in five places, and a privacy audit that does not enumerate all five is incomplete.

Vector indexes shared across tenants. Approximate nearest neighbor is approximate. Even with a tenant filter applied at query time, a misconfigured index can return cross-tenant neighbors before the filter runs, and any logging layer between the index and the filter sees data the user should never have access to. The safe pattern is hard isolation: a separate namespace per tenant, or a separate index. Pinecone namespaces, Weaviate's native multi-tenancy (a dedicated shard per tenant), or pgvector with row-level security enforced at the database — not at the service — close this gap. Filter-after-retrieve does not.
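A sketch of the hard-isolation pattern, assuming the current Pinecone Python client; the index name, the tenant-to-namespace mapping, and the function shape are illustrative. The point is that the namespace, not an application-side filter, is the boundary.

```python
from pinecone import Pinecone

pc = Pinecone(api_key="...")   # credentials come from your secret store
index = pc.Index("docs")       # illustrative index name

def search_documents(query_embedding: list[float], tenant_id: str, top_k: int = 5):
    # Refuse to run without an authenticated tenant: there is no
    # "query everything" code path for a forgetful caller to hit.
    if not tenant_id:
        raise PermissionError("no authenticated tenant on this request")
    # The namespace is the isolation boundary. A query scoped to tenant-a's
    # namespace cannot return tenant-b's vectors, so there is no
    # filter-after-retrieve window for logs or bugs to exploit.
    return index.query(
        vector=query_embedding,
        top_k=top_k,
        namespace=f"tenant-{tenant_id}",
        include_metadata=True,
    )
```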

Prompt and response caches keyed on content. A "semantic cache" that stores (prompt_embedding -> response) and serves any request within a similarity threshold is a cross-tenant leak factory. Two tenants asking near-identical questions hit the same cache entry; the second tenant gets the first tenant's answer, which contains the first tenant's data because the model's response was personalized to their context. The fix is to make tenant ID part of the cache key — every cache key — and to audit every layer of caching, including the ones your framework added without telling you.
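A sketch of a tenant-scoped semantic cache, assuming nothing beyond numpy; the structure and similarity threshold are illustrative. The lookup only ever searches entries written by the same tenant, so a near-identical question from another tenant is a cache miss by construction.

```python
from collections import defaultdict
import numpy as np

class TenantScopedSemanticCache:
    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        # Entries are partitioned by tenant before any similarity math runs.
        self._entries: dict[str, list[tuple[np.ndarray, str]]] = defaultdict(list)

    def put(self, tenant_id: str, embedding: np.ndarray, response: str) -> None:
        self._entries[tenant_id].append((embedding / np.linalg.norm(embedding), response))

    def get(self, tenant_id: str, embedding: np.ndarray) -> str | None:
        query = embedding / np.linalg.norm(embedding)
        # Only this tenant's entries are candidates; another tenant's
        # near-identical question can never be served from here.
        for stored, response in self._entries[tenant_id]:
            if float(np.dot(stored, query)) >= self.threshold:
                return response
        return None
```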

Long-term memory stores. Agent memory is the new session storage, and the bugs that haunted PHP session handling in 2008 are back, dressed up as embedding stores. Recent research on memory-extraction attacks (MEXTRA) shows that agents whose memory module lacks per-user partitioning will happily surface another user's stored facts when prompted with the right cue. If you have a "memory" tool that retrieves "what we talked about before," ask which before it means — and verify that the partition key is the authenticated user, not a session token, not a workspace ID, and certainly not a hash that collides under load.
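A sketch of the partitioning rule, with hypothetical names: the memory tool takes its partition key from the authenticated request context, never from an argument the model (or a prompt injection) can supply.

```python
from dataclasses import dataclass, field

@dataclass
class RequestContext:
    user_id: str   # set by your auth middleware, never by the model

@dataclass
class MemoryStore:
    _facts: dict[str, list[str]] = field(default_factory=dict)

    def remember(self, ctx: RequestContext, fact: str) -> None:
        self._facts.setdefault(ctx.user_id, []).append(fact)

    def recall(self, ctx: RequestContext) -> list[str]:
        # "What we talked about before" means this authenticated user's
        # before: not a session token, not a workspace ID, not a hash
        # that can collide under load.
        return list(self._facts.get(ctx.user_id, []))
```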

Tool result caches. The optimization is irresistible: tool calls are slow and expensive, so cache the result. The bug is that tool results are scoped to the caller, and a cache that does not include the caller in its key returns Tenant A's database rows to Tenant B. This shows up most often when a "stateless" lookup tool wraps an internal API that is itself multi-tenant; the tool layer caches by query parameters alone; the API would have rejected the cross-tenant call, but the cache never asks it.
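One way to make the caller part of every memoized key, sketched with stdlib contextvars; the decorator, variable names, and the lookup_order tool are illustrative. The tenant comes from request context your framework sets at entry, so a tool author cannot forget it.

```python
import contextvars
import functools
import json

# Bound once per request by your auth middleware.
current_tenant: contextvars.ContextVar[str] = contextvars.ContextVar("current_tenant")

_cache: dict[str, object] = {}

def tenant_scoped_cache(fn):
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        # Raises LookupError if no tenant is bound: caching anonymously
        # is treated as a bug, not a fallback.
        tenant_id = current_tenant.get()
        key = json.dumps([tenant_id, fn.__name__, args, kwargs], sort_keys=True, default=str)
        if key not in _cache:
            _cache[key] = fn(*args, **kwargs)
        return _cache[key]
    return wrapper

@tenant_scoped_cache
def lookup_order(order_id: str) -> dict:
    ...  # wraps the internal multi-tenant API
```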

Sandbox and execution environment reuse. When a tool spawns a subprocess, a notebook kernel, or a code-execution sandbox, that environment may be pooled. Files written by Tenant A's code in /tmp, environment variables set during their session, network connections still in CLOSE_WAIT — all of it survives into Tenant B's execution if the runtime recycles the pod. This is the modern version of the warm-pool leak, and it is invisible to anyone who only audits the tool's code.
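A sketch of per-call scratch isolation for subprocess-style execution, using only the standard library. This is hygiene for leftover files and environment variables, not a security sandbox; real isolation (per-tenant pods, gVisor, Firecracker) sits below it.

```python
import subprocess
import tempfile

def run_untrusted(code: str, timeout_s: int = 10) -> str:
    # Every execution gets a fresh working directory and a minimal,
    # explicitly constructed environment; nothing the previous caller
    # wrote to the scratch dir or env vars is visible here, and the
    # directory is deleted when the call returns.
    with tempfile.TemporaryDirectory() as workdir:
        result = subprocess.run(
            ["python3", "-c", code],
            cwd=workdir,
            env={"PATH": "/usr/bin:/bin", "TMPDIR": workdir},
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    return result.stdout
```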

What a Cross-Session Privacy Audit Actually Looks Like

The audit is not a checklist of "did you set tenant_id somewhere." It is an adversarial test that two known-distinct identities cannot see each other's data through any path, including paths that should not exist. The structure I have found useful is a four-step protocol applied to every tool the agent can call.

First, enumerate the substrate. For each tool, write down every layer it touches that could hold state across calls: caches (request-level, function-level, framework-level, library-level), persistent stores (vector, relational, document, key-value), memory modules, sandbox pools, log aggregators, observability backends. Most teams underestimate this list by a factor of two. Datadog's LLM Observability, for example, collects prompts and tool arguments; if your application is multi-tenant but your tracing backend has no tenant partitioning, the traces themselves become a cross-tenant data store.

Second, inject canaries. As Tenant A, store a value that no other tenant could plausibly know — a long random token, embedded in a document, in a memory entry, in a tool argument. As Tenant B, run the agent through every plausible workflow and grep the responses, the cache contents, the trace logs, and the vector neighbor results for the canary. If it ever appears, you have a leak; the canary tells you which substrate.
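A sketch of the canary protocol as a test; run_agent, seed_document, seed_memory, the dump_* helpers, and TENANT_B_WORKFLOWS are hypothetical stand-ins for your own harness.

```python
import secrets

def test_cross_tenant_canary():
    canary = secrets.token_hex(16)   # a value no other tenant could plausibly know

    # As Tenant A: plant the canary in every substrate that persists.
    seed_document(tenant="tenant-a", text=f"Internal escalation code: {canary}")
    seed_memory(tenant="tenant-a", fact=f"escalation code is {canary}")

    # As Tenant B: run every plausible workflow, then grep everything.
    for prompt in TENANT_B_WORKFLOWS:
        response = run_agent(tenant="tenant-b", prompt=prompt)
        assert canary not in response.text

    # Also grep the substrates directly: cache contents, trace logs,
    # and the raw neighbor results returned to Tenant B's retrievals.
    for blob in (dump_cache(), dump_trace_logs(), dump_retrieved_neighbors("tenant-b")):
        assert canary not in blob
```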

Third, stress the cache layer. Cross-tenant leaks via caching are timing-dependent: they manifest under concurrency, under specific key collisions, under cache-warmup races. A privacy audit that runs serial requests will miss them. Run two tenant identities in parallel, with overlapping query patterns, against a shared cache layer, and verify that responses correlate only with the requesting tenant's content. The 2023 ChatGPT bug would not have been caught by serial testing; it required the canceled-request edge case under load.
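A sketch of the concurrent version with a thread pool; run_agent and the per-tenant markers are hypothetical stand-ins. The assertion is that each response correlates only with the requesting tenant's content, under overlapping query patterns rather than serial ones.

```python
from concurrent.futures import ThreadPoolExecutor

PROMPT = "summarize my latest invoice"   # deliberately identical across tenants
MARKERS = {"tenant-a": "ACME-INVOICE", "tenant-b": "GLOBEX-INVOICE"}

def one_request(tenant: str) -> tuple[str, str]:
    return tenant, run_agent(tenant=tenant, prompt=PROMPT).text

def test_parallel_tenants_do_not_cross():
    with ThreadPoolExecutor(max_workers=16) as pool:
        # Interleave the two identities hard enough to hit cache-warmup
        # races and key collisions that serial tests never reach.
        results = list(pool.map(one_request, ["tenant-a", "tenant-b"] * 100))

    for tenant, text in results:
        other = "tenant-b" if tenant == "tenant-a" else "tenant-a"
        assert MARKERS[other] not in text   # never another tenant's data
        assert MARKERS[tenant] in text      # and always your own
```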

Fourth, audit the negative space. The hardest leaks are not "Tenant B got Tenant A's data" — those are loud — but "Tenant B got slightly-altered output that depended on Tenant A's data," which is silent. Embedding-cache leaks often manifest this way: the response is generated fresh, but the retrieval step pulled cross-tenant context, and the model used it without ever quoting it. To catch this, instrument retrieval to emit the IDs of every document retrieved, and assert in CI that those IDs all belong to the requesting tenant.
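A sketch of that assertion as a retrieval wrapper; vector_index.search and the document metadata shape are hypothetical stand-ins for your own retrieval call. Emitting and checking the retrieved IDs catches the silent case where cross-tenant context shaped the answer without ever being quoted.

```python
class CrossTenantRetrieval(Exception):
    pass

def retrieve_for(ctx, query_embedding):
    docs = vector_index.search(query_embedding, top_k=8)   # your existing retrieval call
    leaked = [d.id for d in docs if d.metadata["tenant_id"] != ctx.tenant_id]
    if leaked:
        # Fail loudly in CI and in production: the model never sees the
        # cross-tenant context, and the violation surfaces as an error,
        # not as a subtly wrong answer.
        raise CrossTenantRetrieval(f"request for {ctx.tenant_id} retrieved {leaked}")
    return docs
```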

Instrumentation That Makes This Detectable

The reason cross-session leaks survive in production is that nothing in the standard observability stack flags them. A 200 response with a cross-tenant document inside is not an error. The fix is to make tenant boundaries first-class in your tracing.

Tag every span with the authenticated tenant ID at request entry, and propagate it through every tool call, every cache lookup, every retrieval. At every storage boundary — the vector index query, the SQL execution, the cache get — emit an assertion span that records the tenant filter actually applied. In CI, run a load test with two tenants and assert that no span emits a tenant_id mismatch between the request context and the storage filter. This is the AI-era equivalent of running OWASP ZAP against your auth boundaries; it will catch the bugs before a customer's SOC 2 auditor does.
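A sketch with the OpenTelemetry Python API; the attribute names are a convention to pick, not a standard, and the index call is the Pinecone-style query from the earlier sketch. The CI check then becomes a query over spans: any span where the request tenant and the applied filter disagree fails the build.

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.tools")

def query_vectors(ctx, embedding):
    with tracer.start_as_current_span("vector.query") as span:
        # The tenant from the authenticated request context...
        span.set_attribute("tenant.request", ctx.tenant_id)
        # ...and the tenant filter that was actually applied at the storage
        # boundary. A two-tenant load test in CI asserts these two
        # attributes are equal on every span.
        namespace = f"tenant-{ctx.tenant_id}"
        span.set_attribute("tenant.filter", namespace)
        return index.query(vector=embedding, top_k=5, namespace=namespace)
```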

A tighter pattern: enforce tenant filters at the storage layer, not the application layer. Postgres row-level security, scoped service accounts on Pinecone, Weaviate's per-tenant shards, BigQuery authorized views — anything that makes "forgot to add the filter" a database-level error rather than a silent application-level bug. The lesson from a decade of IDOR bugs is that authorization checks in handler code are too easy to forget; the authorization check has to live where the data lives.
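A sketch of the row-level-security version for pgvector, assuming psycopg and the pgvector Python adapter; the table, column, and setting names are illustrative. Once the policy exists, a query that forgets the tenant clause returns nothing rather than everything.

```python
import numpy as np
import psycopg
from pgvector.psycopg import register_vector

SETUP_SQL = """
ALTER TABLE documents ENABLE ROW LEVEL SECURITY;
ALTER TABLE documents FORCE ROW LEVEL SECURITY;  -- applies to the table owner too
CREATE POLICY tenant_isolation ON documents
    USING (tenant_id = current_setting('app.current_tenant'));
"""

def neighbors_for(conn: psycopg.Connection, tenant_id: str, embedding: np.ndarray):
    register_vector(conn)
    with conn.cursor() as cur:
        # Bind the authenticated tenant to this transaction; from here on
        # the database, not application code, enforces the boundary.
        cur.execute("SELECT set_config('app.current_tenant', %s, true)", (tenant_id,))
        # Note there is no WHERE tenant_id clause: forgetting it no longer
        # returns everyone's neighbors, because the policy already filtered.
        cur.execute(
            "SELECT id, content FROM documents ORDER BY embedding <-> %s LIMIT 5",
            (embedding,),
        )
        return cur.fetchall()
```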

The Discipline This Requires

Cross-session privacy is not a feature you ship; it is a property your system either has or doesn't, and you find out by attacking it. Most engineering organizations have not yet internalized that LLM tool calls are an authorization surface — that every def my_tool(query: str) is a request handler, every retrieval call is a database query, and every cache layer is a side channel. The teams that get this right run cross-tenant canary tests in CI, enforce tenant filters at the storage layer, and treat any cache that omits tenant ID from its key as a P0 bug. The teams that don't will find out the hard way, in a Hacker News thread, that their stateless tool was leaking data the whole time.

The cost of getting it right is real but bounded: per-tenant namespaces, partitioned memory stores, tenant-aware cache keys, RLS policies, and a privacy audit run before every release. The cost of getting it wrong is unbounded — regulatory exposure, customer trust, and the kind of incident that ends careers. If your AI product handles data for more than one customer and you cannot point to the exact line of code where the tenant filter is enforced for every tool call, that is the vulnerability. Find it before someone else does.
