The Eval-Rig Latency Lie: Why Your p95 Doubles in Production
The eval team puts a number on the deck: "p95 latency is 1.2s." The launch ships. A week later, oncall posts a graph: production p95 is 4.8s and climbing through the dinner-time peak. Engineers spend the next five days arguing about whether something regressed, diffing model versions, opening tickets with the provider — and eventually discover that nothing changed except where the number was measured. The eval rig was reporting the latency of a quiet machine running serial calls against a warm cache. Production is a different system. The p95 was never wrong; it was answering a different question.
This is the eval-rig latency lie. It is not about bad benchmarks — most teams use reasonable tools and report the numbers honestly. It is about the gap between "the latency of the model" and "the latency a user experiences," and the fact that the rig you build for development almost always measures the first while implying the second. Once you internalize this, latency SLOs derived from a benchmark stop looking like product commitments and start looking like claims about a private testing environment that nobody else can reproduce.
The four hidden assumptions in your eval rig
Every eval suite makes assumptions about its environment. The dangerous ones are the assumptions nobody wrote down, because nobody on the team noticed them. Four show up in almost every rig I have seen.
Prompt caches are pre-populated. Anthropic, OpenAI, and Google all offer prefix caching, and the savings are real — Anthropic's documentation cites up to 90% cost reduction and 85% latency reduction for long, repeated prompts. The eval rig hits the same system prompt thousands of times in a row, and after the first request the cache is hot for the rest of the run. Production traffic is not like this: a user lands on a page cold, the system prompt may not be cached for that account or that region, and the very first call of a session pays the full TTFT. Your suite never reports the cold-start distribution because the rig is structurally incapable of producing one.
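One way to make the gap visible rather than structurally impossible is to tag each trial cold or warm based on whether its prompt prefix has already been sent during the run, and report the two TTFT distributions separately. The sketch below assumes a hypothetical streaming client, `stream_completion(prompt)`, that sends the request lazily and yields tokens; it illustrates the idea, not any particular SDK's API.

```python
import time
from collections import defaultdict

def measure_ttft(make_stream):
    """Seconds from issuing the request to the arrival of the first token.

    `make_stream` must actually send the request when called (lazy clients do);
    an eagerly-sent request would undercount TTFT here."""
    start = time.monotonic()
    next(iter(make_stream()))
    return time.monotonic() - start

def run_trials(prompts, stream_completion):
    """Bucket each trial as cold or warm by whether its prefix was seen earlier.

    `stream_completion(prompt)` is a placeholder for the rig's streaming call."""
    seen, ttft = set(), defaultdict(list)
    for prompt in prompts:
        prefix = prompt[:1024]          # crude proxy for the cacheable prefix
        bucket = "warm" if prefix in seen else "cold"
        seen.add(prefix)
        ttft[bucket].append(measure_ttft(lambda: stream_completion(prompt)))
    return dict(ttft)
```

Run this against a rig that loops over one system prompt and the cold bucket will hold a single sample; that lone sample is the distribution production first-calls actually live in.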
KV caches are resident from prior runs. Inside the inference engine, the KV cache survives across calls within the same worker. Repeated benchmark runs against a self-hosted endpoint are all warm. Across LMCache and llm-d benchmarks, warm-cache TTFT can be five to ten times faster than cold — work from the llm-d project shows roughly 88% faster TTFT on warm cache hits versus cold for representative workloads. If your eval rig keeps workers pinned and never evicts, you are reporting the steady-state behavior of a fully warmed pipeline, not the first-byte latency a real user pays after a model upgrade, an autoscale event, or a cache rotation.
Provider tiers are quiet. Eval suites tend to run on a schedule the team finds convenient — overnight, mid-morning, between standups. Whatever that schedule is, it correlates with provider load. Public benchmarks repeatedly show that the same model's TTFT can vary by 2-3x depending on time of day, with the worst values clustering around US business-hours peaks for hosted endpoints. Your suite is not measuring the model; it is measuring the model at 6am Pacific. Production users live in every timezone, and the load you see when traffic peaks is not the load the rig was calibrated against.
Concurrency is one. This is the big one. Most eval rigs make calls serially, or with a tiny static concurrency that fits comfortably under the provider's TPM and RPM ceilings. A real workload has bursts: a marketing email goes out, a job runs at the top of the hour, an autoscaler is slow to react. The moment your concurrency exceeds the rate-limit ceiling, you start eating 429s, and 429s are not separate from latency — they manifest as retry delays inside the same request span. A request that gets rate-limited, waits two seconds, retries, and succeeds is a 2-second-slower request from the user's point of view, not an "error." Eval rigs that bucket retries as errors and exclude them from the latency percentile are reporting a number that does not exist on the user-facing side of the API.
The combined effect is not additive — it compounds. A rig that is pre-cached, KV-warm, provider-quiet, and serial can underreport p95 by 3-5x against the same model serving the same prompt distribution in production.
Where the numbers actually diverge
The cleanest way to see the gap is to instrument both sides and compare. Three patterns show up almost universally.
The first is the cold-start tax on TTFT. Eval p95 TTFT looks tight — say 400ms. Production p95 TTFT for a brand-new session is 1.6s, because the user's prompt prefix is not in any cache yet, the model worker may not be holding their prefix in KV, and the request lands during a load spike. Across an entire user population, "first call of a session" is a meaningful chunk of all calls. The eval suite, which never has a "first call," collapses that distribution into a single tight band that does not represent any actual user.
The second is the rate-limit-induced fat tail. Most providers expose RPM and TPM limits, and most teams provision for a multiple of average load that breaks during peaks. A request that takes 800ms when nothing is queued can take 4.5s when it has to back off once and 8s when it has to back off twice. Naive retry logic makes this worse: if your client retries with a fixed delay or without jitter, retries from a hundred concurrent clients land at the same instant and re-trigger the limit. The eval rig, running serially, never exercises this code path. The first time anyone sees it is the day the production graph spikes, and the conclusion is "the model got slow" rather than "our concurrency is now meaningful."
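The client-side half of the fix is retrying with jittered exponential backoff, so a burst of rate-limited clients stops retrying in lockstep. A minimal sketch, assuming a placeholder `RateLimited` exception standing in for whatever 429 error your client library raises:

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for the client library's 429 / rate-limit error."""

def retry_with_full_jitter(call, max_attempts=5, base=0.5, cap=8.0):
    """Retry `call` on rate limits, sleeping a uniform-random slice of an
    exponentially growing window (the "full jitter" pattern).

    A fixed delay here is exactly the failure mode described above: a hundred
    clients hit the limit together, sleep the same two seconds, and retry
    together, re-triggering the limit they just hit."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```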
The third is the streaming jitter that doesn't show up in averages. Inter-token latency variance — the spacing between tokens — tends to be invisible to a rig that reports only TTFT and end-to-end time. But for streaming UX, a model with smooth ITL feels fast and a model with jitter feels broken even at the same total latency. Production load makes ITL worse: shared GPU contention, prefix-cache evictions mid-decode, and queue-depth spikes all stretch individual token gaps. Your eval rig, reporting clean medians, will say two models are equivalent that any user can tell apart in five seconds of streaming.
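Capturing this only requires recording the arrival time of every token rather than just the first and last. A sketch, assuming the same kind of lazy token stream as earlier:

```python
import statistics
import time

def inter_token_gaps(stream):
    """Seconds between consecutive tokens of one streamed response."""
    gaps, last = [], None
    for _token in stream:
        now = time.monotonic()
        if last is not None:
            gaps.append(now - last)
        last = now
    return gaps

def itl_summary(gaps):
    """Median vs tail gap. Two models can match on the median and differ
    wildly at p95, and the p95 gap is what the user perceives as stutter."""
    return {
        "median_itl_s": statistics.median(gaps),
        "p95_itl_s": statistics.quantiles(gaps, n=20)[18],   # ~p95
    }
```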
What a production-shaped eval looks like
The fix is not "throw out benchmarks." Benchmarks are useful — they just need to answer the question the SLO is asking. Production-shaped evals share four properties.
Cold runs that flush KV between trials. Add a flush step (or rotate workers) between a fraction of trials, and report the cold and warm distributions separately. The headline p95 should be a weighted mix that reflects your actual cache-hit rate in production, not the asymptotic warm number. If you do not know your cache-hit rate, that's the first metric to instrument before you publish a latency SLO at all.
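One way to produce that weighted headline number, assuming you have separate cold and warm TTFT samples plus a measured production cache-hit rate, is to resample the two distributions in proportion and take the percentile of the mixture:

```python
import random
import statistics

def mixed_p95(cold_ttfts, warm_ttfts, cache_hit_rate, n=100_000, seed=0):
    """p95 of a cold/warm mixture weighted by the production cache-hit rate,
    instead of the asymptotic warm-only number the rig produces by default."""
    rng = random.Random(seed)
    mixture = [
        rng.choice(warm_ttfts) if rng.random() < cache_hit_rate
        else rng.choice(cold_ttfts)
        for _ in range(n)
    ]
    return statistics.quantiles(mixture, n=20)[18]   # ~p95

# Illustrative numbers only: warm around 0.4s, cold around 1.6s, 70% hit rate.
print(mixed_p95([1.3, 1.6, 1.8, 2.1], [0.35, 0.40, 0.42, 0.45, 0.50], 0.7))
```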
Traffic-replay harnesses, not synthetic concurrency. Tools like LLMPerf, GuideLLM, and Gatling support traffic-shape replay rather than fixed RPS. Take a real day of production traces (or a representative slice), replay the timing pattern against your test endpoint, and report the latency distribution under that shape. The answer will be different — usually worse — than the synthetic-concurrency answer, and that difference is information the SLO owner needs.
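The core of a replay harness is small enough to sketch. Assuming a trace file with one JSON object per line of the form `{"t": seconds_since_window_start, "prompt": "..."}` and a hypothetical async `send_request(prompt)`:

```python
import asyncio
import json
import time

async def replay(trace_path, send_request, speedup=1.0):
    """Re-issue requests with the same relative timing as the captured trace,
    so bursts land as bursts instead of being flattened into fixed concurrency.

    `send_request(prompt)` is a placeholder for your async client call."""
    with open(trace_path) as f:
        events = [json.loads(line) for line in f]
    start = time.monotonic()
    tasks = []
    for ev in events:
        # Wait until this request's offset in the replayed timeline.
        delay = ev["t"] / speedup - (time.monotonic() - start)
        if delay > 0:
            await asyncio.sleep(delay)
        tasks.append(asyncio.create_task(send_request(ev["prompt"])))
    return await asyncio.gather(*tasks, return_exceptions=True)
```

Dedicated harnesses add warmup, percentile reporting, and backpressure handling on top of this loop, but the essential property is the same: the request arrival times come from production, not from a constant.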
Time-of-day stratified sampling. Run the suite at multiple times across the day, for at least a week, and report a banded p95 rather than a point estimate. If your provider's TTFT moves 2x between 3am and 3pm Pacific, an SLO of "p95 < 1.0s" is true for one of those windows and false for the other, and only the banded version tells you which.
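The banding itself is simple, assuming each measurement is logged with its wall-clock timestamp; the six-hour UTC windows below are illustrative and should be chosen to bracket your provider's observed peaks:

```python
import statistics
from datetime import datetime, timezone

def banded_p95(samples, min_samples=20):
    """Per-time-of-day-window p95 from (unix_timestamp, latency_seconds) pairs,
    reported as a band instead of a single point estimate."""
    bands = {"00-06": [], "06-12": [], "12-18": [], "18-24": []}
    for ts, latency in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        lo = hour // 6 * 6
        bands[f"{lo:02d}-{lo + 6:02d}"].append(latency)
    return {
        band: statistics.quantiles(vals, n=20)[18]   # ~p95
        for band, vals in bands.items() if len(vals) >= min_samples
    }
```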
Retry-aware accounting. Treat every retried request's wall-clock time, including the backoff, as part of its latency. If a request was 429'd twice and finally succeeded after 3.2 seconds of waiting, that is a 3.2-second-slower successful request, not three separate events. Reporting retries as a separate "error rate" hides the user-visible cost of rate-limit pressure inside a metric nobody looks at on the latency page.
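The accounting change is a small discipline: time the whole logical request, retries and backoffs included, and only count it as an error when the retries are exhausted. A sketch, where `make_request()` stands in for a client call that retries internally:

```python
import statistics
import time

def record_user_visible_latency(make_request, log):
    """Append (outcome, wall_clock_seconds) for one logical request.

    `make_request()` stands in for a client call that retries internally;
    every backoff it sleeps through lands inside the recorded span."""
    start = time.monotonic()
    outcome = "ok"
    try:
        make_request()
    except Exception:
        outcome = "failed"          # retries exhausted: the only real error
    log.append((outcome, time.monotonic() - start))

def user_facing_p95(log):
    """p95 over successes, with retry waits included rather than hidden
    in a separate error-rate metric."""
    return statistics.quantiles(
        [t for outcome, t in log if outcome == "ok"], n=20
    )[18]
```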
The discipline shift here is small but important: your SLO is a claim about what users experience, so the eval that defends it must be shaped like user traffic, not like a developer running a script.
The org failure mode: a private rig nobody can reproduce
The most expensive version of this problem is not the technical one. It is the org one. Once an eval rig becomes the source of latency numbers, the team that owns it owns those numbers, and the rig accumulates implicit assumptions: which workers it pins, how it handles retries, what time-of-day it runs, which prompt distribution it samples, whether it warms the cache before measurement. None of this is documented because it never had to be — the rig just produces a number, and the number gets quoted in standups.
When production diverges, the conversation goes like this. The eval team says "p95 is 1.2s, here's the run." Someone else says "p95 is 4.8s, here's the production graph." The eval team says "we ran it again, still 1.2s." The other team says "we ran traffic-replay against the same endpoint, got 5.1s." Both teams are correct about their own numbers. Neither can reproduce the other's. A week disappears proving that the discrepancy is environmental rather than a regression, and the actual fix — a backoff bug, a cache-warmup gap, a provisioning shortfall — only gets diagnosed once someone with credentials on both sides sits down and reconciles the rigs by hand.
The architectural takeaway is that latency is not a property of a model. It is a property of a deployment — the model, the cache state, the concurrency, the network path, the retry policy, the time of day, the user's geography. Any single number that ignores those is implicitly fixing them. The fix is to publish what you fixed: every latency claim should travel with the environment it was measured in, the same way every accuracy claim already travels with the eval set it was measured on. Without that, the rig becomes a private artifact that produces numbers nobody outside the rig can defend.
Treating the rig as production code
The cultural change that closes the gap is small: the eval rig is production code, not a notebook. It needs the same observability the runtime path has — a dashboard for cache-hit rate during eval runs, a banded p95 chart instead of a single number, an explicit field on every reported latency for the time-of-day window and the concurrency level. It needs version control on its assumptions: a documented retry policy, a documented prompt distribution, a documented cold-vs-warm split. And it needs an owner who is on the hook when production p95 diverges from eval p95, not a team that hands a number off and disclaims responsibility for what happens downstream.
The teams that get this right are not the ones with the fanciest harness. They are the ones whose eval rig produces numbers another team can reproduce with their own credentials, on their own schedule, against the same endpoint — because the assumptions are explicit and the measurement environment travels with the measurement. Until your rig clears that bar, the latency it reports is a private claim about a private machine. Calling that number a "p95" borrows the vocabulary of a production commitment, and that borrowed authority is what makes the production gap so much more painful than it needs to be.
References
- https://inference.net/content/llm-performance-benchmarks
- https://bentoml.com/llm/inference-optimization/llm-performance-benchmarks
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://research.aimultiple.com/llm-latency-benchmark/
- https://modal.com/llm-almanac/how-to-benchmark
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- https://llm-d.ai/blog/kvcache-wins-you-can-see
- https://developers.redhat.com/articles/2025/06/20/guidellm-evaluate-llm-deployments-real-world-inference
- https://github.com/ray-project/llmperf
- https://blog.christianposta.com/ai/learnings-from-load-testing-llms/
- https://gatling.io/blog/load-testing-an-llm-api
- https://bentoml.com/llm/inference-optimization/llm-inference-metrics
