The Session Affinity Your Provider Load Balancer Quietly Ignored

June 3, 2026 · 11 min read

Software Engineer

Your dashboard says cache hit rate is 71%. Your finance partner is pleased. Your latency p50 is fine. Then a customer support thread arrives from a long-running agent session: turn 14 took eleven seconds to produce the first token, turn 15 took eight, turn 16 took nine. You pull the trace. Every one of those turns reports a cache_read_input_tokens value of zero. The system prompt is sixteen thousand tokens. The user thinks the agent is broken. You think your provider is broken. Neither of you is right. The aggregate hit rate is a survivorship statistic — it averages over the short conversations that hit cache trivially and quietly absorbs the long conversations that have collapsed to cold-on-first-token mid-session.

This is the failure mode that no provider postmortem will ever describe to you, because from their telemetry the system is working as designed. The load balancer is making the routing decision it was told to make. The cache is being populated and evicted on the schedule it was told to follow. The hint you passed — the prompt_cache_key, the conversation ID, the user ID, whatever string you serialized into that field — was advisory the whole time, and "advisory" means "ignored when convenient." Under load, when a scaling event happens, when an upstream pod is draining, when the affinity-aware tier is saturated, your hint quietly degrades to a uniform routing decision. The request lands on a cold pod. The KV tensors that would have served the prefix at sub-millisecond cost are sixteen feet away in a sibling rack and unreachable. Your conversation pays full-prefix cost again, and your dashboard's headline number doesn't move because two thousand other one-turn conversations hit cache fine.

The hint is not a contract

Read the prompt caching documentation from any major provider carefully. The phrase you will not find is "session affinity guarantee." What you will find is language like "routing hint," "increases the likelihood," "improves the chance," and "may help." OpenAI's prompt_cache_key is documented as a value that gets combined with the prefix hash to influence routing; published numbers describe coding customers seeing hit rates climb from 60% to 87% after adopting it, but the documentation also notes that requests for the same key beyond roughly fifteen per minute can overflow to additional machines, reducing effectiveness. Anthropic's caching guarantees the cache exists at the model tier with a TTL, but says nothing about which pod will serve the next request from the same conversation. OpenRouter's documentation explicitly describes a "sticky routing" layer for caching providers, which is an admission that without explicit stickiness above the provider, you don't get it.

That distinction is the whole story. Affinity is best-effort. The load balancer treats your hint as a tie-breaker that improves cache locality when load is low and capacity is even, and treats it as a suggestion to be discarded when load is high and capacity is uneven — which is exactly when your latency budget matters most. Your application is built on the implicit assumption that the cache is a property of the conversation. The provider's architecture treats the cache as a property of the pod. As soon as a request hops pods, the conversation pays cold-prefix cost again, and the only signal you receive is the per-request cache_read_input_tokens value that nobody on your team is watching at the conversation granularity.

The dashboard hides what matters

There are two ways to compute cache hit rate, and they give different numbers. The first is fleet-level: across all requests in a time window, sum cache_read_input_tokens divided by sum of cacheable input tokens. This is the number you see on the cost dashboard, on the provider's status page, on the slide in your platform team's quarterly review. It looks great because most conversations are short, and short conversations either hit cache on the system prompt or don't care.

The second is per-conversation: within a single conversation_id, what fraction of the turns after turn one had a non-zero cache read? This is the number that predicts customer pain, and almost nobody computes it. When the affinity hint is honored, this number stays close to 100% for the lifetime of the conversation. When affinity collapses — usually starting around turn 6 to turn 10, when a backend scaling event or a rolling deploy crosses paths with your session — this number drops sharply for that one conversation, and the user perceives the agent slowing down or "forgetting" what it just did. The fleet-level number doesn't move. The conversation-level distribution develops a heavy tail.

A useful exercise: take a week of traces, group by conversation_id, and plot a histogram of "fraction of post-turn-one calls with cache hits" per conversation. A healthy system has a sharp peak near 1.0. A degraded system has the peak in the same place but with a long tail trailing down toward 0.0, and that tail is where your support tickets come from. The mean of that distribution is the headline metric. The mean does not see the tail. Customers do.

How affinity gets dropped under load

The mechanics matter because they constrain what you can do about it. The provider's edge load balancer maps your prompt_cache_key (or its internal equivalent — the prefix hash, the session token, whatever they use) to a pod via something resembling consistent hashing. When the pod is healthy and under-utilized, the request lands there and the KV cache hits. When the pod is at concurrency limit, the LB has three choices, and all of them break affinity:

The first is queueing. The request waits at the affinity-targeted pod until a slot opens. This is invisible to your application but shows up as increased time-to-first-token without a cache-read drop. Some providers do this; it's the least bad option from your perspective and the one most likely to break the provider's own latency SLO, so it's rare.

The second is overflow routing. The LB picks a different pod and forwards the request there. The new pod has a cold cache for your prefix, so it recomputes the KV tensors. Your time-to-first-token spikes and cache_read_input_tokens drops to zero for that turn. This is the modal failure mode under capacity pressure.

The third is full rebalancing. During a scaling event — pods coming up, pods draining, region failovers — the affinity map itself is in flux. Even a request to a "healthy" pod may be served by a different worker than the one that built the cache for your conversation a minute ago, because the worker behind that endpoint changed. This is the failure mode that produces multi-turn cold periods rather than single-turn glitches.

None of these are visible from your application's vantage point as anything other than "the cache read field went to zero." You cannot tell which mechanism dropped the affinity. You also cannot retry your way out of it — re-sending the same request usually gets routed the same way, because the LB's view of pod health hasn't changed in the milliseconds between attempts.

What you can actually do

The first practical step is the cheap one: instrument per-conversation cache continuity and alert on its tail, not its mean. Add a conversation_id tag to every trace span, compute cache_read_input_tokens > 0 as a boolean per turn, and surface the p10 (not p50, not mean) of "fraction of turns with cache hit" grouped by conversation length bucket. When that p10 drops, your tail is widening even if your headline number looks the same. This is the signal that something has changed in the provider's routing behavior — a deploy, a capacity event, a regional incident.

The second step is to actually pass the hint. A surprising number of teams enable prompt caching at the prefix level but never set the routing-hint parameter, because the documentation treats it as optional. It is optional in the sense that the request will succeed without it. It is not optional in the sense that the cache will work consistently without it. Pass a stable prompt_cache_key (or the provider's equivalent) per conversation. Do not use random per-request values. Do not use per-user values unless your user only has one conversation in flight at a time. Use the conversation ID directly when possible.

The third step is to think about your conversation rate-limit interaction. OpenAI's documentation notes that affinity for a key starts overflowing around fifteen requests per minute for the same prefix-and-key combination. This is fine for an interactive chat user, who never approaches that rate. It is not fine for a batch agent processing items concurrently against the same system prompt, where you may exceed the threshold during traffic spikes. If you do that, your batch job's effective cache hit rate falls off a cliff during exactly the periods you most want it to perform. Splitting batch traffic across multiple cache keys can help; so can rate-limiting your own concurrency before the provider does it for you.

The fourth step is provider-side awareness. Some providers offer dedicated capacity tiers (provisioned throughput, dedicated pods, enterprise endpoints) with stronger affinity guarantees. The pricing reflects this, but if your conversation-level p10 cache hit rate is the metric your customer satisfaction depends on, that capacity is what you are paying for, not just the throughput. The shared tier's affinity is best-effort by design because the provider needs the routing flexibility to absorb other tenants' load spikes — which means your conversation pays the cost of someone else's traffic, on a pod somewhere you can't see.

Don't confuse cache-read with cache-existence

A subtle measurement trap is worth naming. When a provider reports a cache miss, it does not necessarily mean the cache for your prefix was evicted. It often means the cache exists, but the request was routed to a pod where the cache does not exist. The cache state is partitioned across the fleet, and "miss" is a per-pod attribute, not a global one. Your prefix may be sitting in three other pods' GPU memory at the moment of the miss; you just got routed to a fourth.

This matters because the natural reaction to a wave of misses is to extend the TTL or pay for the explicit cache write feature. Neither of those helps when the problem is routing, not eviction. You can buy a one-hour TTL and still miss every turn if the LB sends each turn to a different pod. Diagnosing routing-induced misses versus eviction-induced misses requires either provider-side telemetry you don't have, or empirical patterns: eviction-induced misses correlate with TTL boundaries (multiples of five minutes or one hour from session start); routing-induced misses correlate with provider-side capacity events (scaling, deploys, regional load shifts), which you can sometimes infer from status-page incidents or from broader latency anomalies at the same timestamps.

The cultural fix

Cache-as-conversation-property is a mental model your team builds early and rarely revisits. The provider's documentation reinforces it because the documentation talks about prefixes and TTLs, not pods and routing. Building the alternative mental model — cache-as-pod-property, affinity-as-best-effort, hit-rate-as-distribution-not-mean — costs nothing but takes a deliberate effort to install. The teams that have working agent latency in 2026 are the ones who have already gone through the version of this incident that broke their product, and rebuilt their dashboards around per-conversation continuity rather than fleet-level averages.

The headline metric will lie to you. It will lie even when nothing is wrong, by averaging away the experience of your unhappiest users. It will lie especially loudly when something is wrong, because the broken cases are a small fraction of total traffic and the fleet number stays flat. Watch the tail, pass the hint, instrument by conversation, and assume the load balancer will betray you the moment capacity gets tight. That assumption is correct. It is the architecture; it is not a bug they will fix.

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

The Session Affinity Your Provider Load Balancer Quietly Ignored

The hint is not a contract

The dashboard hides what matters

How affinity gets dropped under load

What you can actually do

Don't confuse cache-read with cache-existence

The cultural fix

Recommended Reading

About Tian Pan

The hint is not a contract​

The dashboard hides what matters​

How affinity gets dropped under load​

What you can actually do​

Don't confuse cache-read with cache-existence​

The cultural fix​

Recommended Reading

About Tian Pan

The hint is not a contract

The dashboard hides what matters

How affinity gets dropped under load

What you can actually do

Don't confuse cache-read with cache-existence

The cultural fix