Cold Cache, Hot Cache: Why Your LLM Latency Numbers Lie in Staging
Your staging environment says p50 latency is 400ms. Your production dashboard says 1.8 seconds. You check the code — same model, same prompt, same provider. Nothing changed between deploy and release. The numbers shouldn't diverge this much, but they do.
The culprit is almost always cache state. Prompt caching — the single biggest latency optimization most teams rely on — behaves fundamentally differently under staging traffic patterns than production traffic patterns. And if you don't account for that difference, every latency number you collect before launch is fiction.
The Cache You Didn't Know You Were Depending On
Modern LLM providers cache aggressively. Anthropic caches prompt prefixes with a 5-minute TTL that refreshes on each hit. OpenAI caches automatically for any prompt over 1,024 tokens, with in-memory retention lasting 5–10 minutes of inactivity and extended caching persisting up to 24 hours. Google requires a 4,096-token minimum before caching activates.
The latency impact is enormous. Cached requests run 7% faster for short prompts (around 1,024 tokens) and up to 67% faster for long prompts (150K+ tokens). At the extreme end, a 100K-token prompt that takes 11.5 seconds uncached drops to 2.4 seconds with a warm cache — an 80% reduction.
Here's the problem: these numbers describe a perfectly warm cache. In staging, you run your test suite, each request warms the cache for the next one, and every measurement after the first few reflects best-case performance. In production, the cache state is a distribution — some requests hit warm entries, some hit cold ones, and the ratio depends on traffic patterns you can't simulate with a test harness that fires 50 requests and calls it done.
Why Staging Flatters You
Staging environments create artificially favorable cache conditions in three ways.
Low traffic diversity. Your test suite sends the same handful of prompts repeatedly. Each prompt warms the cache on first call, and every subsequent call benefits. In production, you have hundreds or thousands of distinct prompt prefixes — many of which appear too infrequently to maintain a warm cache entry before the TTL expires.
Predictable request timing. Staging tests fire sequentially or in small controlled bursts. The 5-minute cache TTL never expires because your test suite completes well within that window. Production traffic is bursty and uneven. A prompt prefix that's hot during business hours goes cold overnight, and the first morning request eats the full uncached latency.
Single-node routing. OpenAI's cache operates per-machine and starts spilling to additional machines at roughly 15 requests per minute per prefix. In staging, your low request volume keeps everything on one machine with a warm cache. In production, load balancing distributes requests across multiple machines, each starting with an empty cache. Those overflow requests are cache misses that don't show up in your staging metrics.
The Percentile Where It Hurts
Averages hide the damage. Your p50 might look reasonable because the majority of requests hit a warm cache. But p95 and p99 — the percentiles that drive user complaints and SLO violations — are dominated by cold-cache requests.
Consider a system where 70% of requests hit a warm cache (400ms) and 30% hit a cold cache (1.8s). The average looks like 820ms, which might pass your latency budget. But p95 is 1.8 seconds, and p99 could be worse if cold-cache requests also correlate with longer prompts or higher provider load. The tail isn't noise — it's a completely different performance regime that staging never exercises.
This effect compounds in agentic systems. A single agent loop might make 5–15 LLM calls. If each call has a 30% chance of hitting a cold cache, the probability that at least one call in a 10-step chain hits cold is over 97%. The end-to-end latency of the agent task is governed by the worst call in the chain, not the average.
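The arithmetic behind that claim is simple enough to check directly. A minimal sketch, using the 30% miss rate and 10-step chain from the example above:

```python
# Probability that at least one call in an agent chain hits a cold cache,
# assuming each call misses independently with probability p_cold.
def p_any_cold(p_cold: float, n_calls: int) -> float:
    return 1.0 - (1.0 - p_cold) ** n_calls

print(round(p_any_cold(0.30, 10), 3))  # 0.972: over 97% of 10-step chains hit cold at least once
```

The independence assumption is generous to the system; in practice cold hits cluster (one expired prefix often means several), which makes the tail worse, not better.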
The Three Cache Layers You Need to Understand
Production LLM latency involves multiple cache layers, each with different warm-up characteristics.
Provider-side prompt caching. This is the layer discussed above — prefix-based caching managed by the API provider. TTLs range from 5 minutes to 24 hours depending on provider and tier. You have limited control over eviction, and cache state is invisible to your application. The only signal is the cached_tokens field in the API response, which most teams don't monitor.
KV cache on the GPU. For self-hosted models, the key-value cache stores attention computations from previous tokens. Cold KV cache means the model recomputes attention from scratch for the full context window. This is the difference between 1.5 seconds and 11+ seconds at maximum context length. GPU memory fragmentation in naive implementations wastes 60–80% of allocated KV cache memory, though systems like vLLM's PagedAttention reduce waste to under 4%.
Application-level semantic caching. If you've built a semantic cache layer that returns stored responses for semantically similar queries, its hit rate depends entirely on the distribution of production queries — which you can't know until you're in production. A semantic cache that shows 40% hit rate on your eval set might drop to 8% against real user traffic.
Load Testing That Doesn't Lie
Standard load testing tools (k6, Locust, JMeter) fail for LLM workloads because they don't model cache thermodynamics. Here's a methodology that produces honest numbers.
Phase 1: Cold-start baseline. Measure latency with a completely cold cache. Fire each distinct prompt prefix exactly once, with enough spacing between requests that no cache entry survives. This gives you the worst-case floor. Every production request has some probability of hitting this floor, and your system needs to handle it.
Phase 2: Warm-up curve. Send the same prompt prefix repeatedly at increasing intervals — every 1 second, every 30 seconds, every 2 minutes, every 4 minutes, every 6 minutes. Plot TTFT against inter-request interval. This reveals the cache TTL boundary and shows you exactly where latency jumps from cached to uncached. For Anthropic, you'll see the cliff at 5 minutes. For OpenAI, it's fuzzier because caching is automatic and the eviction policy is opaque.
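The shape of that curve can be reasoned about before you run a single real request. Here is a toy simulation, assuming an Anthropic-style 5-minute TTL that refreshes on each hit; the 0.4s warm and 1.8s cold latencies are illustrative placeholders, not measured values:

```python
# Toy model of Phase 2: probe one prompt prefix at increasing inter-request
# intervals against a cache with a 5-minute TTL that refreshes on each hit.
TTL = 300.0           # seconds; 5-minute TTL, refreshed on every hit
WARM, COLD = 0.4, 1.8  # illustrative latencies in seconds

def warmup_curve(intervals):
    now = 0.0
    last_hit = None            # simulated time of the last cache refresh
    curve = []
    for gap in intervals:
        now += gap
        if last_hit is not None and now - last_hit <= TTL:
            curve.append((gap, WARM))
        else:
            curve.append((gap, COLD))  # first request, or entry expired
        last_hit = now                 # every request re-warms the cache
    return curve

for gap, latency in warmup_curve([1, 1, 30, 120, 240, 360]):
    print(f"interval {gap:>4}s -> {latency}s")
```

Plotted, this produces exactly the cliff the text describes: flat warm latency up to the TTL boundary, then a jump to cold. Against a real endpoint you would replace the simulated clock with wall-clock sleeps and measure TTFT per request.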
Phase 3: Traffic-shaped load test. Replay production-like traffic patterns, not synthetic uniform load. Use access logs from a similar feature or model request distributions from your analytics. The key variables are: number of distinct prompt prefixes, frequency distribution across those prefixes (Zipfian or otherwise), and inter-arrival times. Tools like GuideLLM can simulate configurable workloads that model these distributions.
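One way to sketch such a traffic shape with the standard library: sample prefixes from a Zipfian frequency distribution and draw exponential inter-arrival times instead of firing uniform round-robin load. The prefix names, exponent, and rates below are placeholders you would replace with values fitted from your own access logs:

```python
# Sketch of Phase 3 traffic shaping: Zipfian prefix popularity plus
# exponentially distributed inter-arrival times (bursty, not uniform).
import random
from collections import Counter

random.seed(42)

def zipf_weights(n_prefixes: int, s: float = 1.1):
    # weight of the rank-k prefix proportional to 1 / k^s
    return [1.0 / (k ** s) for k in range(1, n_prefixes + 1)]

def request_schedule(n_requests: int, n_prefixes: int, mean_gap_s: float):
    weights = zipf_weights(n_prefixes)
    prefixes = [f"prefix-{k}" for k in range(n_prefixes)]
    t = 0.0
    schedule = []
    for _ in range(n_requests):
        t += random.expovariate(1.0 / mean_gap_s)  # bursty arrivals
        schedule.append((t, random.choices(prefixes, weights)[0]))
    return schedule

sched = request_schedule(n_requests=1000, n_prefixes=50, mean_gap_s=2.0)
print(Counter(p for _, p in sched).most_common(3))
```

The head of the distribution stays reliably warm while tail prefixes arrive too rarely to survive the TTL, which is precisely the mix of cache states a uniform load test never produces.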
Phase 4: Measure what matters. Track p50, p95, and p99 separately for cache-hit and cache-miss requests. Tag each request with whether it received cached tokens (most providers return this in the response). This gives you four numbers instead of one, and all four matter: p50-cached, p50-uncached, p99-cached, p99-uncached. Your SLO should be set against the blended distribution at your expected cache hit rate, not against the cached-only numbers.
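A minimal sketch of that split reporting, assuming each request has already been tagged with a boolean cache-hit flag; the 70/30 latency mix reuses the illustrative numbers from earlier in the article:

```python
# Sketch of Phase 4: compute percentiles separately per cache state, given
# a list of (latency_seconds, was_cached) samples tagged from API responses.
def percentile(samples, p):
    # nearest-rank percentile over a sorted copy
    s = sorted(samples)
    idx = max(0, min(len(s) - 1, int(round(p / 100 * len(s))) - 1))
    return s[idx]

def split_report(tagged):
    by_state = {True: [], False: []}
    for latency, cached in tagged:
        by_state[cached].append(latency)
    return {
        f"p{p}-" + ("cached" if cached else "uncached"): percentile(lat, p)
        for cached, lat in by_state.items() if lat
        for p in (50, 99)
    }

# Illustrative mix: 70% warm at 0.4s, 30% cold at 1.8s.
tagged = [(0.4, True)] * 70 + [(1.8, False)] * 30
print(split_report(tagged))
```

The blended p50 of this sample is still the warm number, which is exactly why a single unsplit percentile hides the cold regime.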
Operational Patterns That Close the Gap
Once you accept that cache state drives latency variance, several operational patterns become obvious.
Monitor cache hit rate as a first-class metric. Extract the cached_tokens field from every API response and compute hit rate per prompt template, per time window. Alert when hit rate drops below your baseline — it means either traffic patterns shifted or the provider changed something. A 10% drop in cache hit rate can translate to a 30%+ increase in p95 latency.
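A minimal tracker might look like the following. The field path follows OpenAI's chat completions usage block (`usage.prompt_tokens_details.cached_tokens`); other providers report cache usage under different fields, so treat the extraction line as provider-specific:

```python
# Sketch of per-template cache hit-rate tracking from API responses.
from collections import defaultdict

class CacheHitTracker:
    def __init__(self):
        self.hits = defaultdict(int)
        self.total = defaultdict(int)

    def record(self, template: str, response: dict):
        # OpenAI-style field path; adjust per provider.
        details = response.get("usage", {}).get("prompt_tokens_details", {})
        self.total[template] += 1
        if details.get("cached_tokens", 0) > 0:
            self.hits[template] += 1

    def hit_rate(self, template: str) -> float:
        return self.hits[template] / max(1, self.total[template])

tracker = CacheHitTracker()
tracker.record("summarize", {"usage": {"prompt_tokens_details": {"cached_tokens": 1024}}})
tracker.record("summarize", {"usage": {"prompt_tokens_details": {"cached_tokens": 0}}})
print(tracker.hit_rate("summarize"))  # 0.5
```

In practice you would flush these counters to your metrics backend per time window and alert on the rate, not log them locally.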
Pre-warm after deploys. Deployments often reset cache state, especially for self-hosted models but sometimes for API providers too (if your deploy changes prompt prefixes). Run a warm-up script that fires representative requests for your most common prompt templates before routing production traffic. This converts the first few hundred users from cold-cache guinea pigs into normal-latency customers.
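The warm-up pass itself can be very small. A sketch, where `call_model` is a placeholder for your provider client and `TEMPLATES` stands in for your most common prompt prefixes:

```python
# Sketch of a post-deploy warm-up pass over representative prompt templates.
TEMPLATES = ["support-triage", "summarize", "extract-entities"]  # placeholders

def call_model(template: str) -> None:
    # Placeholder: fire one representative request for this template
    # through your real client here.
    pass

def prewarm(templates, rounds: int = 2):
    # Hit each prefix more than once: the first call writes the cache entry,
    # the second confirms it reads warm before production traffic arrives.
    fired = []
    for _ in range(rounds):
        for t in templates:
            call_model(t)
            fired.append(t)
    return fired

prewarm(TEMPLATES)
```

Run this as the last step of the deploy pipeline, gating the traffic cutover on its completion.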
Design prompts for cache stability. Provider caching requires exact prefix matching. A single character change invalidates the cache. Structure your prompts so that static content (system prompt, instructions, tool schemas) comes first and dynamic content (user input, conversation history) comes last. Avoid injecting timestamps, request IDs, or other per-request values into the cacheable prefix. Research shows that caching only the system prompt provides the most consistent benefits — naively caching the full context can paradoxically increase latency due to cache write overhead on dynamic content.
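The static-first, dynamic-last rule is easy to enforce in code. A sketch, where `SYSTEM_PROMPT` and `TOOL_SCHEMAS` are stand-ins for your real static content:

```python
# Sketch of cache-stable prompt assembly: static content first, per-request
# content last, and no timestamps or request IDs in the cacheable prefix.
SYSTEM_PROMPT = "You are a support assistant..."   # placeholder
TOOL_SCHEMAS = '{"tools": [...]}'                  # placeholder

def build_prompt(user_input: str, history: list[str]) -> list[str]:
    static_prefix = [SYSTEM_PROMPT, TOOL_SCHEMAS]  # byte-identical every call
    dynamic_tail = history + [user_input]          # varies per request
    return static_prefix + dynamic_tail

a = build_prompt("reset my password", [])
b = build_prompt("cancel my order", ["earlier turn"])
# The first two segments match exactly across requests, so the provider
# can reuse the cached prefix for both.
assert a[:2] == b[:2]
```

The same discipline applies to tool schemas: serialize them once at startup rather than re-serializing per request, since dict ordering or float formatting differences silently break exact-prefix matching.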
Budget for the uncached tail. Set your timeout and retry logic based on cold-cache latency, not warm-cache latency. If your timeout is tuned to cached performance, cold-cache requests will time out and retry, which doubles your cost and might still time out on the retry. A timeout of 2x your p99-uncached latency is a safer starting point than 2x your p50-cached latency.
Use separate latency SLOs for interactive and batch paths. Interactive requests need tight TTFT budgets, so cache misses hurt. Batch processing can tolerate cold-cache latency because no user is waiting. If your system mixes both, a single SLO will either be too tight for batch (causing false alerts) or too loose for interactive (missing real degradation).
The Honest Latency Curve
The gap between staging and production latency isn't a mystery, and it isn't the provider's fault. It's the predictable consequence of measuring performance in an environment where the cache is always warm and traffic is always uniform.
Before you ship, get your cold-cache numbers. Measure the warm-up curve. Tag requests by cache state and track percentiles separately. Build your SLOs around what production traffic actually looks like, not what your test harness makes it look like.
The most expensive latency surprise is the one you discover from user complaints. The cheapest is the one you find by running an honest load test with a cold cache on a Tuesday afternoon — before a single real user hits the endpoint.
