The Data Residency Contract Your Provider Honored at the API Boundary and Broke at the Cache

· 9 min read
Tian Pan
Software Engineer

Your residency audit traced every outbound request from the tenant's traffic, watched it terminate on a hostname in Frankfurt, and signed off. The audit was correct about everything it measured. It was also looking at the wrong layer. The request went to the EU. The bytes that satisfied the request — the cached prefix the provider hashed and pulled from the nearest available node — lived in us-east-1. Your regional endpoint promised you a destination. The cache promised nothing, because the cache was a different product, governed by a different SLA, designed for cost rather than for compliance.

The customer's auditor caught it. Not yours. A different vendor's incident report mentioned that prompt cache placement was decoupled from inference region, and the customer's GRC team asked the obvious follow-up question: where do our prefixes go? The contract amendment to close the gap took ninety days. The renewal got suspended. The team that wrote the integration had done nothing wrong by the documentation they were handed.

The Two Boundaries Are Not the Same Boundary

A regional API endpoint is a routing commitment. You point your client at eu-frankfurt.provider.com and the request terminates on infrastructure inside the region. That is the boundary your network-level audit can see and certify. Outbound packets, destination IPs, TLS handshake — all observable, all auditable, all clean.

A prompt cache is something else. It is a content-addressed store, keyed by a hash of the prefix (the system prompt, the tool schema, the few-shot examples) so identical prefixes from any tenant can reuse the same entry. The whole point of the cache is to avoid recomputing attention states the provider has already computed for someone. That economic logic only works if the cache is large, shared, and placed near compute capacity rather than near the request origin. A per-region cache that mirrors your routing topology would defeat the savings model.
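
A minimal sketch of that keying scheme, in Python. The function names, the choice of SHA-256, and the in-memory dictionary are illustrative assumptions, not any provider's implementation. The point is only that nothing about the tenant or the region enters the key.

```python
import hashlib


def prefix_cache_key(system_prompt: str, tool_schema: str, few_shot: str) -> str:
    """Content-addressed key: derived from the prefix bytes and nothing else."""
    digest = hashlib.sha256()
    for part in (system_prompt, tool_schema, few_shot):
        data = part.encode("utf-8")
        # Length-prefix each part so "ab" + "c" and "a" + "bc" hash differently.
        digest.update(len(data).to_bytes(8, "big"))
        digest.update(data)
    return digest.hexdigest()


# Hypothetical shared store: key -> opaque precomputed state.
shared_cache: dict[str, str] = {}


def lookup_or_fill(key: str) -> bool:
    """Return True on a hit; on a miss, store a placeholder and return False."""
    if key in shared_cache:
        return True
    shared_cache[key] = "precomputed-attention-state"
    return False


PREFIX = ("You are a support assistant.", '{"tools": []}', "Q: hi\nA: hello")

eu_key = prefix_cache_key(*PREFIX)  # tenant A, calling the EU endpoint
us_key = prefix_cache_key(*PREFIX)  # tenant B, calling a US endpoint

eu_hit = lookup_or_fill(eu_key)  # first writer: miss
us_hit = lookup_or_fill(us_key)  # identical prefix: hit on tenant A's entry
```

Because the key is a pure function of the prefix, the second caller reuses the first caller's entry regardless of which endpoint either one used.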

So the provider builds the cache as a global pool. The prefix gets hashed, the resulting entry gets written to whichever node has eviction headroom, and that node is selected by capacity heuristics that do not know your contract exists. Your EU-tenant request terminates in Frankfurt. The cache write goes to us-east. The audit was looking at the front door. The bytes left through a window in the back.
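
A toy version of that placement decision, assuming hypothetical node names and a single free-capacity number per node. `place_by_capacity` mirrors the heuristic described above. `place_with_residency` is an assumption about what a region-aware alternative would look like; it is a design the provider would have to adopt, not something the consumer can bolt on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheNode:
    name: str
    region: str
    free_slots: int  # eviction headroom, in a hypothetical unit


def place_by_capacity(nodes: list[CacheNode]) -> CacheNode:
    """Pick the node with the most headroom. Region is never consulted."""
    return max(nodes, key=lambda n: n.free_slots)


def place_with_residency(nodes: list[CacheNode], allowed: set[str]) -> CacheNode:
    """Same heuristic, restricted to regions the contract allows."""
    eligible = [n for n in nodes if n.region in allowed]
    if not eligible:
        raise LookupError("no cache node available inside the allowed regions")
    return max(eligible, key=lambda n: n.free_slots)


# Hypothetical fleet: one node per region, with made-up headroom figures.
NODES = [
    CacheNode("cache-fra-1", "eu-frankfurt", free_slots=120),
    CacheNode("cache-use-1", "us-east", free_slots=900),
]

global_choice = place_by_capacity(NODES)
pinned_choice = place_with_residency(NODES, allowed={"eu-frankfurt"})
```

The pinned variant settles for the fuller Frankfurt node and gives up the headroom in us-east, which is the cost the shared pool was built to avoid.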

This is not negligence. The two systems were designed by different teams against different requirements. The inference layer ships with a regional commitment because customers ask for one. The cache layer ships with a cost commitment because the per-token economics demand one. Neither team is wrong about its own SLA. The customer's mistake is assuming the two SLAs compose.

Why the Inventory Misses It

The compliance team's residency control catalog has an entry for inference. It probably does not have an entry for cache. The reason is mundane: when the data inventory was built, the cache was either a private optimization the provider had not yet exposed or it was a feature with no per-region commitment to inventory. The inventory captured what was inventoryable. Anything the provider added later, sold as a free optimization, and enabled by default lives outside the catalog.

The pattern repeats across providers. Read the residency commitment carefully and you find clauses like "Extended prompt caching in regions that do not support regional processing may require that we process and temporarily store customer content outside of the region to deliver the services." That sentence is a complete description of the leak. It is also buried in a help-center article, not in the master agreement the GRC team reviews at procurement time. The contract the customer signed talks about inference. The footnote that mentions cache placement is two clicks deeper.

The other failure mode is just as common. Customers route through a hyperscaler-mediated path (Bedrock Frankfurt, Vertex EU), believing the hyperscaler's regional commitment subsumes the model provider's. It usually does for the inference call. It does not always do so for ancillary features. Bedrock's documentation on prompt caching includes the disclaimer that "caches are regional, so you might send two identical requests to the same inference profile and have the second one be a cache miss if they were routed to different regions", which is the same fact from the other direction. If the provider does honor a regional cache, you pay for it in cache miss rate. If the provider does not, you pay for it in the residency gap.

What the Residency Diagram Actually Covers

Draw the residency diagram you would show an auditor. A box labeled client, an arrow into a box labeled EU endpoint, an arrow into a box labeled inference cluster (eu-frankfurt). The diagram closes; the bytes stay in the region. The auditor signs.

Now draw the implementation diagram. Same client, same EU endpoint, same inference cluster. But before the cluster, there is a step labeled cache lookup. Before the response goes out, there is a step labeled cache write. Both of those steps talk to a service labeled global prompt cache, and that service has shards in regions the residency diagram does not list. Some of those shards hold the prefix of every request your tenant has ever made. Some of them hold the response, too, if the provider caches output.

The first diagram is the one your contract covers. The second diagram is the one your data flows through. The compliance posture you have is a function of the first. The compliance posture you actually need is a function of the second. The gap between them is the cache layer, and no amount of more careful routing can close it from your side.

There is a research literature on this gap. The arXiv paper that introduced MemPool — an elastic memory pool managing distributed KV caches across serving instances — describes a global scheduler that enhances cache reuse through a global prompt-tree-based locality-aware policy. The locality the scheduler optimizes for is prefix locality, not tenant locality. A separate NDSS paper on prompt leakage via shared KV-cache notes that seven of eight surveyed LLM providers share caches globally across users. Both papers are about systems-level efficiency. Both inadvertently describe a residency hazard.

The Patterns That Close the Gap

There is no architectural trick that fixes this from the consumer side alone. Every closure path requires either contractual change with the provider or a deliberate sacrifice of the optimization the cache exists to deliver. Pick the one whose cost you can stomach.

Amend the contract to name cache placement. The contract that names inference residency does not name cache residency. The fix is to add a clause that does — explicitly per region, with a provider certification that the cache layer's residency is audited separately from the inference layer's. Enterprise contracts at major providers can carry this language; it is not always offered by default, and procurement has to ask. The 90-day amendment cycle is the typical cost. Build the slack into your renewal window.

Opt out of the shared cache for regulated traffic. Where the provider exposes a per-request control over prompt caching, usually a header or a parameter, turn caching off for regulated traffic; where caching is automatic and has no opt-out, that absence is itself a question to put to the provider. Disabling it for EU-tenant traffic trades the token savings for the residency guarantee. The math depends on cache hit rate; if your prefixes are stable and reused often, the cost is real. If your prefixes change per request anyway, the cost is a rounding error. Compute it before assuming it is too expensive.
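
A back-of-the-envelope version of that computation. Every number here is a made-up assumption: the request volume, the prefix length, the price per million input tokens, and the fraction of the normal price charged on a cache hit. The sketch also ignores any surcharge a provider may apply to cache writes, so treat it as an upper-level estimate and substitute your own figures.

```python
def prefix_cost(requests: int, prefix_tokens: int, usd_per_mtok: float) -> float:
    """Monthly cost of sending the prefix uncached on every request."""
    return requests * prefix_tokens * usd_per_mtok / 1_000_000


def opt_out_premium(
    requests: int,
    prefix_tokens: int,
    usd_per_mtok: float,
    hit_rate: float,
    hit_price_fraction: float,
) -> float:
    """Extra monthly spend from disabling the cache.

    With caching on, a hit pays hit_price_fraction of the normal price, so
    each hit saves (1 - hit_price_fraction). Opting out forfeits that saving
    on the hit_rate share of requests.
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    base = prefix_cost(requests, prefix_tokens, usd_per_mtok)
    return base * hit_rate * (1.0 - hit_price_fraction)


# Hypothetical workload: 1M requests/month, 2,000-token prefix, $3 per Mtok,
# cache hits billed at 10% of the normal input price.
baseline = prefix_cost(1_000_000, 2_000, 3.0)
stable_prefixes = opt_out_premium(1_000_000, 2_000, 3.0, hit_rate=0.90, hit_price_fraction=0.10)
volatile_prefixes = opt_out_premium(1_000_000, 2_000, 3.0, hit_rate=0.02, hit_price_fraction=0.10)
```

Under these assumed numbers the uncached baseline is $6,000 a month. Opting out costs roughly $4,860 of extra spend at a 90% hit rate and roughly $108 at a 2% hit rate, which is the difference between a real budget line and a rounding error.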

Audit the cache layer like you audit the inference layer. A residency review that only traces outbound request destinations is reviewing the wrong layer. Extend the review to trace a sample prefix from request through cache write to verify the destination. The provider will not always give you the introspection you need. Push for it. The fact that you cannot see the cache layer is itself an audit finding.

Treat every "free optimization" as a contract surface. Prompt cache is one example. Model routing, request batching, sub-processor changes, default content-safety filtering — all of these are features the provider enables by default to improve cost or quality, and any of them can move bytes across boundaries your contract names. The architectural review at procurement should enumerate them and certify each against the residency catalog. The ones not on the catalog are the ones to ask about.
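
The enumeration step can be as simple as a set difference. In this sketch the feature identifiers are hypothetical labels for the examples named above, and the catalog contents are invented for illustration; a real catalog would come from the contract and its amendments.

```python
# Default-on features named in the text, as hypothetical identifiers.
DEFAULT_ON_FEATURES = [
    "prompt_cache",
    "model_routing",
    "request_batching",
    "sub_processor_changes",
    "content_safety_filtering",
]

# Hypothetical residency catalog: feature -> region the provider has
# committed to in writing. Anything absent has no written commitment.
RESIDENCY_CATALOG = {
    "inference": "eu-frankfurt",
    "content_safety_filtering": "eu-frankfurt",
}


def uncertified(features: list[str], catalog: dict[str, str]) -> list[str]:
    """Return the features with no written residency commitment, sorted."""
    return sorted(f for f in features if not catalog.get(f))


questions_for_provider = uncertified(DEFAULT_ON_FEATURES, RESIDENCY_CATALOG)
```

Here `questions_for_provider` comes out as the four features missing from the catalog, prompt cache among them. Rerunning the check whenever the provider announces a new default-on feature is what keeps the list current after signing.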

The Architectural Realization

The provider's regional endpoint is a commitment about where your request goes. It is not a commitment about where the bytes that satisfy your request live. Those are different statements about different layers, and the team that reads the first as a guarantee of the second has built a compliance posture that depends on a topology the provider never disclosed.

This generalizes beyond prompt cache. The same shape appears in CDN edge caching for non-AI workloads, in cross-region replication of feature stores, in queue-based message delivery where the queue's regional placement is independent of the producer's and consumer's. In every case, the auditable boundary (request routing) and the unauditable boundary (downstream storage placement) are governed by different SLAs with the same provider, and the customer's compliance posture is only as strong as the weaker of the two.

The defensive posture is to assume the cache topology is undisclosed, ask explicitly about it at contract time, and treat every default-on optimization as a candidate boundary crossing until the provider has put the residency commitment in writing for that specific feature. The offensive posture is to design the procurement review around enumerating cross-boundary surfaces rather than enumerating endpoints. The two postures arrive at the same checklist; only the second one catches the next feature the provider rolls out after the contract is signed.

The team that reads the residency contract as a routing guarantee builds a control surface that depends on a vendor disclosing what they are doing. The team that reads it as a layer-by-layer commitment builds one that depends on a vendor agreeing to what they will not do. The second posture is the one that survives the next product launch.
