Skip to main content

One post tagged with "load-balancing"

View all tags

The Session Affinity Your Provider Load Balancer Quietly Ignored

· 11 min read
Tian Pan
Software Engineer

Your dashboard says cache hit rate is 71%. Your finance partner is pleased. Your latency p50 is fine. Then a customer support thread arrives from a long-running agent session: turn 14 took eleven seconds to produce the first token, turn 15 took eight, turn 16 took nine. You pull the trace. Every one of those turns reports a cache_read_input_tokens value of zero. The system prompt is sixteen thousand tokens. The user thinks the agent is broken. You think your provider is broken. Neither of you is right. The aggregate hit rate is a survivorship statistic — it averages over the short conversations that hit cache trivially and quietly absorbs the long conversations that have collapsed to cold-on-first-token mid-session.

This is the failure mode that no provider postmortem will ever describe to you, because from their telemetry the system is working as designed. The load balancer is making the routing decision it was told to make. The cache is being populated and evicted on the schedule it was told to follow. The hint you passed — the prompt_cache_key, the conversation ID, the user ID, whatever string you serialized into that field — was advisory the whole time, and "advisory" means "ignored when convenient." Under load, when a scaling event happens, when an upstream pod is draining, when the affinity-aware tier is saturated, your hint quietly degrades to a uniform routing decision. The request lands on a cold pod. The KV tensors that would have served the prefix at sub-millisecond cost are sixteen feet away in a sibling rack and unreachable. Your conversation pays full-prefix cost again, and your dashboard's headline number doesn't move because two thousand other one-turn conversations hit cache fine.