
Multi-Region LLM Serving: The Cache Locality Problem Nobody Warns You About

10 min read
Tian Pan
Software Engineer

When you run a stateless HTTP API across multiple regions, the routing problem is essentially solved. Put a global load balancer in front, distribute requests by geography, and the worst thing that happens is a slightly stale cache entry. Any replica can serve any request with identical results.

LLM inference breaks every one of these assumptions. The moment you add prompt caching — which you will, because the cost difference between a cache hit and a cache miss is roughly 10x — your service becomes stateful in ways that most infrastructure teams don't anticipate until they're staring at degraded latency numbers in their second region.

The root cause is KV cache locality. When a language model processes a prompt, it computes key-value tensors for every token in that prompt and stores them in GPU memory. The next request that shares a prefix with a cached prompt can skip recomputing those tensors — but only if that next request lands on the same GPU node that holds the cache. Cross-region routing doesn't just miss the cache; for the rerouted request, the cache might as well never have existed.
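To see why this state is too large to casually replicate, it helps to put numbers on it. The per-token KV footprint is two tensors (K and V) per layer, sized by the number of KV heads and the head dimension. A minimal sketch, assuming a GQA model with shapes similar to Llama-3-70B (80 layers, 8 KV heads, head dim 128, fp16):

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # One K tensor and one V tensor per layer, hence the factor of 2
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

# Assumed Llama-3-70B-like shape: 80 layers, 8 KV heads (GQA), head_dim 128, fp16
per_token = kv_cache_bytes_per_token(80, 8, 128)   # 327,680 bytes (~320 KB)
prompt_8k = per_token * 8192                        # one 8k-token prompt
print(prompt_8k / 2**30)                            # exactly 2.5 GiB
```

At roughly 2.5 GiB per 8k-token prompt, shipping KV state between nodes — let alone between regions — is rarely cheaper than recomputing it locally.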

Why Multi-Region LLM Serving Is Not Like Multi-Region APIs

Consider what happens when you add a second region to a well-tuned single-region LLM deployment. In region A, you've built up warm caches from repeated system prompts, shared conversation prefixes, and RAG context chunks that get prepended to most requests. Your cache hit rate sits at 60–70%. Users in region A see consistently low latency.

Region B starts cold. Every request is a full-context recomputation. Latency spikes. You add more GPUs. Costs go up. You tell yourself it'll warm up in a few days, and it does — but now you have a new problem. Your global load balancer, doing the sensible thing, occasionally routes a region-A user's request to region B because region A is under load. That request hits a cold cache and you see the latency spike in your P95 metrics. You add sticky sessions based on user ID. Latency normalizes. A week later, you realize sticky sessions by user ID means your load distribution is uneven, because some users generate 10x the traffic of others.

This is the pattern. Each fix reveals the next problem, and none of them are bugs in your code — they're architectural mismatches between stateless infrastructure assumptions and stateful inference requirements.

The KV Cache Is Your Real Unit of State

In traditional distributed systems, state lives in databases. You have clear separation between stateless compute and stateful storage. Scaling is predictable.

In LLM serving, the KV cache lives on the GPU that ran the prefill phase. It's co-located with the computation that produced it, it evicts under memory pressure with no durable record, and it has no API for external inspection. Two identical requests sent to two different nodes result in two cache misses, even though the "correct" behavior from a system design perspective would be to share that computation.

This creates a fundamental tension with the scaling strategies that work everywhere else. Round-robin load balancing, which is correct for stateless services, is actively harmful for cached LLM inference. Every request that doesn't land on a node with the matching prefix pays the full prefill cost. At scale, where prefill for a long system prompt can take hundreds of milliseconds, that's not a minor inefficiency — it's the difference between a p50 of 80ms and a p50 of 800ms.

The practical consequence: your load balancer needs to know about KV cache state, not just node health and queue depth. Generic HTTP load balancers don't have this information. This is why purpose-built routers like vLLM Router — written in Rust to minimize overhead, and designed specifically to consume KV cache events from inference engines — exist. They route based on prefix hash matching, not just round-robin or least-connections.

Consistent hashing by prefix is the right default for single-region deployments. You hash the first N tokens of the prompt, map that hash to a node, and route accordingly. Add bounded load constraints so no single node gets overwhelmed, and you have a reasonable steady-state. The research-backed implementation is Consistent Hashing with Bounded Loads (CHWBL), and it's specifically designed for this problem.

Data Residency Requirements Break Your Cache Optimization

If you serve EU users, you will eventually get a data residency requirement. The simplest version is: user data may not be processed outside the EU. The stricter version — increasingly common as organizations become more careful about cloud provider jurisdiction — is that data may not even transit through US-headquartered provider infrastructure, because of potential CLOUD Act exposure.

Here is where the architectural tension becomes a genuine conflict. Prompt caching is most effective when cache-warm requests stay on the same node or same regional cluster. But data residency requires that EU user data never leaves the EU region. Those two constraints are individually satisfiable. Together, they mean you cannot route EU traffic based on cache availability — you can only route it within your EU deployment.

The result: your EU region has a smaller cache pool than your global region, full stop. If your EU user base is smaller, the cache warms more slowly. If EU users have diverse query patterns with long, unique system prompts, hit rates may stay permanently low. You will pay more per token in EU than in your primary region, not because of infrastructure pricing differences, but because your cache architecture is constrained by compliance.

The only way to escape this is to stop relying on cloud-hosted LLM inference for EU data. Smaller models that can run on a single on-premises server avoid the multi-region problem entirely — there's no cross-region routing if there's only one region. This is a real tradeoff that teams making "EU expansion" decisions often discover only after they've already built their caching strategy around cloud APIs.

Model Weight Synchronization and Version Skew

Prompt caching and routing are the real-time problems in multi-region LLM serving. Model weight synchronization is the slow-burn problem that bites you during rollouts.

When you fine-tune a model or update to a new base version, you need to distribute those weights to every region before switching traffic. For a 70B parameter model in bf16, that's roughly 140GB of data per region. S3 Cross-Region Replication with Replication Time Control guarantees 99.99% of objects replicate within 15 minutes, but "within 15 minutes" means you have a window where region A is running the new model and region B is still running the old one.
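The arithmetic behind that window is worth writing down, because the bandwidth assumption dominates. The 5 Gbps sustained throughput below is an assumption for illustration — real cross-region replication throughput varies widely with object count and provider:

```python
params = 70e9
bytes_per_param = 2                     # bf16
weights_gb = params * bytes_per_param / 1e9
print(weights_gb)                       # 140.0 GB per region

# Replication window at an assumed sustained cross-region throughput
throughput_gbps = 5                     # assumption; measure your own
seconds = weights_gb * 8 / throughput_gbps
print(seconds / 60)                     # ~3.7 minutes under this assumption
```

Even in the optimistic case, minutes of skew are unavoidable; the 15-minute RTC guarantee is the bound you should plan around, not the best case.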

For most applications, this is an acceptable window. But there are cases where it matters:

  • Prompt format changes: If the new model expects a different system prompt format, you need to coordinate the prompt update with the model update, or you'll have a region serving mis-paired prompts and weights during the replication window.
  • Evaluation datasets: Automated evals that run against your production endpoint may see inconsistent results during rollouts because they're hitting different model versions in different regions.
  • Cached prefixes becoming invalid: When you update the model, existing KV cache entries computed by the old model are invalid. Different models produce different KV tensors for identical prompts. Your routing logic needs to drain or invalidate caches on a per-region basis as the new weights propagate.
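One simple way to enforce the last point is structural rather than operational: include the model version in whatever key identifies a cached prefix, so entries computed by old weights can never be matched after a rollout. A hypothetical sketch — the key scheme and version strings are illustrative, not any engine's actual cache-key format:

```python
import hashlib

def cache_key(model_version: str, prompt_prefix: str) -> str:
    # NUL separator prevents ambiguity between version and prefix bytes
    payload = f"{model_version}\x00{prompt_prefix}"
    return hashlib.sha256(payload.encode()).hexdigest()

old = cache_key("llama3-70b-v1", "You are a helpful assistant.")
new = cache_key("llama3-70b-v2", "You are a helpful assistant.")
assert old != new   # a version bump implicitly invalidates every prefix
```

Invalidation becomes a no-op: stale entries are never hit, and they age out under normal memory pressure.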

The operational practice that avoids most of these problems: treat model rollouts as blue-green deployments per region, not as in-place weight swaps. Run new and old versions in parallel until the new weights are fully replicated and warmed, then shift traffic. This doubles your GPU requirements during rollouts, which is expensive, but it eliminates the version-skew window entirely.
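The gating logic for that per-region blue-green shift is small enough to sketch. This is an illustrative state machine, not a real orchestrator API; the field names, warm-up signal, and 25% step size are all assumptions:

```python
from dataclasses import dataclass

@dataclass
class RegionRollout:
    weights_replicated: bool = False   # new weights fully present in-region
    cache_warm: bool = False           # e.g. green hit rate above a threshold
    green_traffic_pct: int = 0

    def ready_to_shift(self) -> bool:
        # Both conditions must hold per region before any traffic moves
        return self.weights_replicated and self.cache_warm

    def shift(self, step: int = 25) -> int:
        if not self.ready_to_shift():
            return self.green_traffic_pct   # hold traffic on blue
        self.green_traffic_pct = min(100, self.green_traffic_pct + step)
        return self.green_traffic_pct
```

Each region advances independently: a region whose replication or warm-up lags simply holds at its current split, which is exactly the property that eliminates the global version-skew window.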

The Routing Architecture That Actually Scales

Given all of the above, here is a routing architecture that handles multi-region LLM serving without requiring a dedicated platform team to maintain it:

Layer 1: Global DNS-based routing directs requests to the correct regional cluster based on geography and compliance requirements. EU requests go to EU clusters only. This layer is ignorant of LLM-specific state — it's just geography.

Layer 2: Regional gateway handles authentication, rate limiting, and cost tracking. This is where managed solutions like Cloudflare AI Gateway or LiteLLM can sit. At this layer, you can also implement model routing — sending simple queries to smaller models and complex queries to larger ones — without needing to involve the inference nodes.

Layer 3: Cache-aware load balancer distributes requests across inference nodes within the region using prefix-hash consistent routing. This is the layer where vLLM Router or an equivalent purpose-built component lives. It consumes KV cache events from inference engines, maintains a view of which prefixes are cached on which nodes, and routes to maximize hit rates while staying within load bounds.

Layer 4: Inference nodes with local GPU KV cache. These are effectively stateful from the perspective of caching, but the routing layer above abstracts that statefulness from the rest of the system.
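The state that Layer 3 maintains can be sketched as a prefix-to-nodes map updated from engine events. vLLM exposes KV cache events, but the event schema below (node, prefix hash, stored/evicted kind) is a simplified assumption, not its actual wire format:

```python
from collections import defaultdict

class PrefixMap:
    """Tracks which prompt-prefix hashes are cached on which nodes,
    driven by (assumed) stored/evicted events from the inference engines."""
    def __init__(self):
        self.nodes_by_prefix = defaultdict(set)

    def on_event(self, node: str, prefix_hash: int, kind: str) -> None:
        if kind == "stored":
            self.nodes_by_prefix[prefix_hash].add(node)
        elif kind == "evicted":
            self.nodes_by_prefix[prefix_hash].discard(node)

    def candidates(self, prefix_hash: int) -> set:
        # Nodes that can serve this prefix warm; the router falls back to
        # consistent hashing when the set is empty
        return self.nodes_by_prefix.get(prefix_hash, set())
```

The router consults `candidates` first and applies its load bounds over that set; only a miss degrades to plain consistent hashing, so the map is an optimization layered on top of the deterministic routing, not a single point of failure.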

The forward-looking addition to this architecture is KV cache disaggregation: moving the KV cache off the GPU into distributed object storage (Redis Cluster, or specialized systems like LMCache) so that any inference node in the region can access any cached prefix. This breaks the hard affinity between routing and specific GPU nodes, enabling more flexible scaling. It's not yet the default deployment pattern, but it's where the production serving ecosystem is heading.

What You Should Not Over-Engineer

Multi-region LLM serving has enough genuine complexity that it's easy to over-architect. A few things that sound necessary but often aren't:

Cross-region KV cache sharing: The network latency for moving KV tensors between regions is almost always larger than the latency savings from a cache hit. Keep caches regional.

Real-time cache state synchronization across regions: Related to the above. Your global load balancer doesn't need to know which prefixes are cached in which region. It needs to know which region a user belongs to for compliance, and geography for latency. That's it.

Active-active with global session affinity: You can run active-active across regions without requiring that any individual user's requests always go to the same region. Within-region stickiness (via consistent hashing) is enough to get most of the caching benefit. Global affinity adds complexity that rarely pays off unless you have specific multi-turn conversation workloads that span long time periods.

The architecture that works for most teams at most scales: regional isolation with within-region cache-aware routing, model rollouts via blue-green per region, and compliance boundaries enforced at the DNS/gateway layer. Add disaggregated KV storage when your cache hit rates plateau and you're still GPU-constrained. That's the decision tree.

The Operational Reality

Multi-region LLM serving is not dramatically harder than multi-region API serving, but it requires understanding which assumptions you can borrow from traditional distributed systems and which you cannot. The load balancer needs to be smarter. The rollout process needs to account for cache invalidation. Data residency requirements constrain your caching strategy in ways that matter for unit economics.

The teams that get into trouble are those who deploy LLM inference the same way they deploy stateless microservices and are then surprised when adding a second region doesn't improve latency the way they expected. The teams that do it well treat KV cache locality as a first-class architectural concern from the beginning, and build their routing layer around that constraint before the second region launch, not after.
