
Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.

The root cause is KV cache locality. When a language model processes a prompt, it computes key-value tensors for every token in that prompt and stores them in GPU memory. The next request that shares a prefix with a cached prompt can skip recomputing those tensors — but only if that next request lands on the same GPU node that holds the cache. Cross-region routing doesn't just miss the cache; it behaves as if the cache never existed.
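To make the locality constraint concrete, here is a toy sketch of per-node prefix caching. It is illustrative only — real engines like vLLM cache KV tensors per block in GPU memory, not Python dicts — but it shows why an identical request on a different node gets zero reuse:

```python
class PrefixKVCache:
    """Toy model of a per-node prefix cache (illustrative only)."""

    def __init__(self):
        self.cached = {}  # prefix tuple -> placeholder for KV tensors

    def longest_cached_prefix(self, tokens):
        """Return how many leading tokens can skip prefill on this node."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cached:
                return n
        return 0

    def insert(self, tokens):
        # Cache every prefix so future requests can reuse partial matches.
        for n in range(1, len(tokens) + 1):
            self.cached[tuple(tokens[:n])] = f"kv[{n}]"


node_a, node_b = PrefixKVCache(), PrefixKVCache()
prompt = [101, 7, 7, 42, 9]      # shared system-prompt tokens
node_a.insert(prompt)            # node A ran the prefill and holds the cache

followup = prompt + [55, 60]     # same prefix, new user turn
hit_a = node_a.longest_cached_prefix(followup)  # 5: only 2 tokens to prefill
hit_b = node_b.longest_cached_prefix(followup)  # 0: full recompute on node B
```

The cache is an attribute of the node, not the request: route the follow-up anywhere other than node A and the shared prefix buys you nothing.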

Why Multi-Region LLM Serving Is Not Like Multi-Region APIs

Consider what happens when you add a second region to a well-tuned single-region LLM deployment. In region A, you've built up warm caches from repeated system prompts, shared conversation prefixes, and RAG context chunks that get prepended to most requests. Your cache hit rate sits at 60–70%. Users in region A see consistently low latency.

Region B starts cold. Every request is a full-context recomputation. Latency spikes. You add more GPUs. Costs go up. You tell yourself it'll warm up in a few days, and it does — but now you have a new problem. Your global load balancer, doing the sensible thing, occasionally routes a region-A user's request to region B because region A is under load. That request hits a cold cache and you see the latency spike in your P95 metrics. You add sticky sessions based on user ID. Latency normalizes. A week later, you realize sticky sessions by user ID means your load distribution is uneven, because some users generate 10x the traffic of others.

This is the pattern. Each fix reveals the next problem, and none of them are bugs in your code — they're architectural mismatches between stateless infrastructure assumptions and stateful inference requirements.

The KV Cache Is Your Real Unit of State

In traditional distributed systems, state lives in databases. You have clear separation between stateless compute and stateful storage. Scaling is predictable.

In LLM serving, the KV cache lives on the GPU that ran the prefill phase. It's co-located with the computation that produced it, it evicts under memory pressure with no durable record, and it has no API for external inspection. Two identical requests sent to two different nodes result in two cache misses, even though the "correct" behavior from a system design perspective would be to share that computation.

This creates a fundamental tension with the scaling strategies that work everywhere else. Round-robin load balancing, which is correct for stateless services, is actively harmful for cached LLM inference. Every request that doesn't land on a node with the matching prefix pays the full prefill cost. At scale, where prefill for a long system prompt can take hundreds of milliseconds, that's not a minor inefficiency — it's the difference between a p50 of 80ms and a p50 of 800ms.
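The gap compounds with hit rate. A rough expected-latency calculation — using hypothetical hit/miss figures in line with the p50 numbers above, and the simplifying assumption that round-robin across N nodes dilutes locality by roughly a factor of N — looks like this:

```python
def expected_prefill_ms(hit_rate, hit_ms=80.0, miss_ms=800.0):
    """Mean prefill latency for a given prefix-cache hit rate.

    hit_ms / miss_ms are illustrative assumptions, not measurements.
    """
    return hit_rate * hit_ms + (1 - hit_rate) * miss_ms


# Prefix-aware routing that preserves a 70% hit rate, vs. round-robin
# across 8 nodes, which spreads the same prefixes over all of them.
affinity = expected_prefill_ms(0.70)        # ≈ 296 ms
round_robin = expected_prefill_ms(0.70 / 8)  # ≈ 737 ms
```

Same hardware, same prompts — the only variable is whether the router knows where the warm cache lives.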

The practical consequence: your load balancer needs to know about KV cache state, not just node health and queue depth. Generic HTTP load balancers don't have this information. This is why purpose-built routers like vLLM Router — written in Rust to minimize overhead, and designed specifically to consume KV cache events from inference engines — exist. They route based on prefix hash matching, not just round-robin or least-connections.

Consistent hashing by prefix is the right default for single-region deployments. You hash the first N tokens of the prompt, map that hash to a node, and route accordingly. Add bounded load constraints so no single node gets overwhelmed, and you have a reasonable steady state. The research-backed implementation is Consistent Hashing with Bounded Loads (CHWBL), and it's specifically designed for this problem.
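A minimal CHWBL sketch, under simplifying assumptions: keys are prompt-prefix strings, nodes are GPU servers, and "load" is just an in-flight request counter. A node over `c` times the average load is skipped and the request spills to the next node clockwise on the hash ring:

```python
import hashlib
import math
from bisect import bisect


class CHWBLRouter:
    """Sketch of Consistent Hashing With Bounded Loads (illustrative)."""

    def __init__(self, nodes, c=1.25, vnodes=100):
        self.c = c
        self.loads = {n: 0 for n in nodes}
        # Virtual nodes smooth the key distribution across servers.
        self.ring = sorted(
            (self._hash(f"{n}#{i}"), n) for n in nodes for i in range(vnodes)
        )
        self._keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.sha256(key.encode()).hexdigest(), 16)

    def _capacity(self):
        # Per-node bound: ceil(c * (total_load + 1) / num_nodes).
        total = sum(self.loads.values())
        return math.ceil(self.c * (total + 1) / len(self.loads))

    def route(self, prefix_key):
        """Pick a node for this prefix, respecting the load bound."""
        start = bisect(self._keys, self._hash(prefix_key)) % len(self.ring)
        cap = self._capacity()
        for step in range(len(self.ring)):
            node = self.ring[(start + step) % len(self.ring)][1]
            if self.loads[node] + 1 <= cap:
                self.loads[node] += 1
                return node
        raise RuntimeError("no node under the load bound")

    def release(self, node):
        self.loads[node] -= 1


router = CHWBLRouter(["gpu-a", "gpu-b", "gpu-c"])
first = router.route("system-prompt-v1")
router.release(first)
# With the load released, the same prefix maps back to the same node,
# which is exactly the cache-affinity property we want.
assert router.route("system-prompt-v1") == first
```

The bound is what distinguishes this from plain consistent hashing: a hot prefix can't pin unbounded traffic to one node, because once that node exceeds `c` times the average load, overflow requests walk the ring to the next candidate.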

Data Residency Requirements Break Your Cache Optimization

If you serve EU users, you will eventually get a data residency requirement. The simplest version is: user data may not be processed outside the EU. The stricter version — increasingly common as organizations become more careful about cloud provider jurisdiction — is that data may not even transit through US-headquartered provider infrastructure, because of potential CLOUD Act exposure.

Here is where the architectural tension becomes a genuine conflict. Prompt caching is most effective when cache-warm requests stay on the same node or same regional cluster. But data residency requires that EU user data never leaves the EU region. Those two constraints are individually satisfiable. Together, they mean you cannot route EU traffic based on cache availability — you can only route it within your EU deployment.
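One way to express this in a router is to apply the residency filter before any cache-affinity scoring, so EU requests only ever consider the EU node pool. The node IDs, region names, and jurisdiction mapping below are hypothetical:

```python
# Hypothetical node metadata; region tags are assumptions for illustration.
NODES = [
    {"id": "gpu-ue1-0", "region": "us-east-1"},
    {"id": "gpu-ue1-1", "region": "us-east-1"},
    {"id": "gpu-euw1-0", "region": "eu-west-1"},
]

# EU traffic is pinned to EU regions; other traffic is unconstrained.
RESIDENCY = {"EU": {"eu-west-1"}}


def eligible_nodes(user_jurisdiction):
    """Residency filter: runs BEFORE cache-affinity routing, so compliance
    always wins over cache locality."""
    allowed = RESIDENCY.get(user_jurisdiction)
    if allowed is None:
        return NODES
    return [n for n in NODES if n["region"] in allowed]


eu_pool = eligible_nodes("EU")    # 1 node: the cache pool shrinks
us_pool = eligible_nodes("US")    # 3 nodes: full pool available
```

The ordering is the point: cache-aware routing then operates on whatever pool survives the filter, which is why the EU cache pool is structurally smaller no matter how clever the affinity logic is.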

The result: your EU region has a smaller cache pool than your global region, full stop. If your EU user base is smaller, the cache warms more slowly. If EU users have diverse query patterns with long, unique system prompts, hit rates may stay permanently low. You will pay more per token in EU than in your primary region, not because of infrastructure pricing differences, but because your cache architecture is constrained by compliance.

The only way to escape this is to stop relying on cloud-hosted LLM inference for EU data. Smaller models that can run on a single on-premises server avoid the multi-region problem entirely — there's no cross-region routing if there's only one region. This is a real tradeoff that teams making "EU expansion" decisions often discover only after they've already built their caching strategy around cloud APIs.

Model Weight Synchronization and Version Skew

Prompt caching and routing are the real-time problems in multi-region LLM serving. Model weight synchronization is the slow-burn problem that bites you during rollouts.
