The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation
The eval suite runs at 2 AM. Traffic is low. The queues are empty, and the prompt cache, cold for the first request, stays warm for every request after it. The provider's continuous batcher has spare slots and will service every request near its TTFT floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.
Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.
The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
Why the Wall Clock Is an Input
It helps to enumerate the mechanisms, because once they're named the diurnal effect stops looking like a mystery. There are at least four.
Continuous batching changes the math under load. Modern inference servers don't run requests one at a time. They use continuous batching, which interleaves tokens from many requests in the same forward pass and evicts each request the moment it finishes instead of waiting for the whole batch to drain. Anyscale's measurements on this technique have shown order-of-magnitude throughput gains over static batching, but the throughput win comes with a property eval suites rarely model: the batch size of the forward pass that processes your token depends on what other tokens are in flight. At 2 AM the batch is small; at 10 AM it's big. The matrix multiplications are larger, the kernel choices may differ, and on some hardware the floating-point accumulation order changes, which means the same logits you'd compute on an isolated request can differ by enough to flip a top-k decision near a boundary.
Floating-point arithmetic isn't associative. The Thinking Machines analysis of nondeterminism in LLM inference makes this concrete: even at temperature 0, the same input can produce different outputs because the kernels that run under different load conditions reduce floating-point sums in different orders, and (a + b) + c is not always equal to a + (b + c) in IEEE-754. Most of the time the difference is invisible. On boundary tokens — the ones where two candidates are within a hair of each other in the softmax — it isn't. And the rate at which boundary tokens flip is correlated with batch size, which is correlated with load, which is correlated with time of day.
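A toy NumPy sketch of the mechanism (illustrative only, nothing here is provider code): the same float32 values summed sequentially and in a blocked reduction, the way different batch sizes and kernel tilings would group them, typically land on slightly different totals, and a rival logit sitting between the two totals is enough to change a greedy pick.

```python
import numpy as np

# Sum the same 4096 float32 "contributions to a logit" in two orders:
# sequentially, and as a 64x64 blocked reduction. IEEE-754 addition is not
# associative, so the two results usually differ in the last few bits.
rng = np.random.default_rng(0)
contributions = rng.standard_normal(4096).astype(np.float32)

sequential = np.float32(0.0)
for x in contributions:
    sequential = np.float32(sequential + x)

blocked = contributions.reshape(64, 64).sum(axis=1).sum()

print(sequential, blocked, bool(sequential == blocked))  # typically not bit-identical

# If a competing token's logit happens to land between the two values,
# greedy decoding picks a different token depending on which order ran.
rival = (float(sequential) + float(blocked)) / 2.0
print("order A picks:", "ours" if sequential > rival else "rival")
print("order B picks:", "ours" if blocked > rival else "rival")
```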
The prompt cache has a temperature. Providers cache long shared prefixes (system prompts, retrieved context, few-shot exemplars) so they don't have to recompute them for every request. A request that lands when the prefix is warm pays one latency budget; a request that lands when the prefix has been evicted pays another. Eviction is driven by total cache pressure, which is driven by neighbor traffic, which is — again — diurnal. At 2 AM your eval suite is the only tenant exercising that prefix; the cache stays warm by virtue of repetition. At peak hour you're competing with everyone else for cache capacity, and your prefix gets cold-recomputed more often than your eval suggested it would.
The provider quietly re-routes you under load. Providers don't publish exactly when this happens, but the overload-error pattern is well documented: in 2026, both OpenAI and Anthropic have visible peak-hour capacity strain, and the 429 / 529 curves are heavily diurnal. What's less visible — and far more dangerous for evals — is the routing fallback that happens before the overload error: a request that would have hit the primary cluster gets shifted to a backup pool with different hardware, possibly different quantization, possibly different speculative-decoding configuration. The output is still on-spec for the model name on the invoice. It is not necessarily identical to what the primary cluster produced. And the team that evaluated on the primary cluster at 2 AM has no traceable record that some fraction of its production requests are being served by something subtly different.
Why Eval at 2 AM Is the Industry Default
Nobody chose this. It accreted.
CI runs nightly because that's when CI has historically run, in a world where the metric being measured was unit-test pass/fail and the cost-relevant axis was "did we free up the build farm before the morning standup." That's a reasonable schedule for a deterministic system. It is the wrong schedule for a stochastic distributed system whose behavior covaries with the load on someone else's hardware.
The provider-side incentive compounds it. Off-peak batch APIs offer steep discounts — Anthropic's batch tier is 50% off, OpenAI's is similar, and the windows are explicitly carved out around low-load hours. A team trying to hit a quarterly cost target will move evals into the discount window without thinking carefully about what that does to representativeness, because the line item in the budget reads "evals" and the line item that would catch the consequence reads "production quality regressions" and they live on different dashboards owned by different people.
And the human factor: the eval engineer wants to see results in the morning. A 2 AM run means stable conditions, no flakes from peak-hour rate limits, and a clean dashboard for the standup. Every individual incentive points at the wrong hour.
The Discipline That Has to Land
There are five practices that, taken together, close the gap. None of them is hard individually. The hard part is that the team has to admit the gap exists.
Sample evals across the diurnal curve, not at one convenient slot. The right eval cadence is not "nightly at 2 AM." It's a stratified sample across the full traffic distribution: weekday peak (9–11 AM and 2–4 PM in the user's primary region), weekend low, EU morning, APAC evening, and the off-peak slot the cost-conscious slice still uses. This is more expensive in dollars and operationally noisier, but it produces an eval result that means something for the conditions production actually runs in. If you can only afford one slot, run it during peak, not off-peak — the team should be optimizing against the conditions where its users actually live, not against the conditions that are easiest to measure.
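A sketch of what that stratification can look like in the harness's scheduler. The slot names, hours, and timezones below are illustrative assumptions, not a standard; the point is that the schedule is data the harness owns, and every result gets tagged with the slot it ran in.

```python
from dataclasses import dataclass
from datetime import datetime
from zoneinfo import ZoneInfo

@dataclass(frozen=True)
class EvalSlot:
    name: str
    tz: str                   # IANA timezone of the traffic this slot represents
    hours: tuple[int, ...]    # local hours-of-day at which to launch the run
    weekdays_only: bool = True

SLOTS = [
    EvalSlot("us_morning_peak", "America/Los_Angeles", (9, 10)),
    EvalSlot("us_afternoon_peak", "America/Los_Angeles", (14, 15)),
    EvalSlot("eu_morning", "Europe/Berlin", (9,)),
    EvalSlot("apac_evening", "Asia/Tokyo", (20,)),
    EvalSlot("offpeak_discount", "America/Los_Angeles", (2,), weekdays_only=False),
]

def due_slots(now_utc: datetime) -> list[EvalSlot]:
    """Return every slot whose local launch hour matches the current time."""
    due = []
    for slot in SLOTS:
        local = now_utc.astimezone(ZoneInfo(slot.tz))
        if slot.weekdays_only and local.weekday() >= 5:
            continue
        if local.hour in slot.hours:
            due.append(slot)
    return due

# Called hourly by the eval cron; each returned slot triggers a tagged run.
print([s.name for s in due_slots(datetime.now(ZoneInfo("UTC")))])
```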
Track latency by hour as a first-class SLO. A single p95 number averaged over a 24-hour window is a lie. The dashboard that matters is a heatmap: hour-of-week on one axis, p50/p95/p99 latency on the other, with request volume overlaid so you can see where the batching pressure lives. The Anthropic and OpenAI status pages already publish API response time trackers segmented by hour; an internal version of that dashboard, scoped to your traffic and your prompts, is the tool that makes diurnal drift visible to the team.
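A minimal pandas sketch of that heatmap, assuming a request log with ts (UTC timestamp) and latency_ms columns; the column names and the parquet path are placeholders for whatever your trace store exposes.

```python
import pandas as pd

# Load the request log and bucket every request into an hour-of-week (0-167).
df = pd.read_parquet("request_log.parquet")          # placeholder path
df["ts"] = pd.to_datetime(df["ts"], utc=True)
df["hour_of_week"] = df["ts"].dt.dayofweek * 24 + df["ts"].dt.hour

grouped = df.groupby("hour_of_week")["latency_ms"]
heatmap = pd.DataFrame({
    "p50_ms": grouped.quantile(0.50),
    "p95_ms": grouped.quantile(0.95),
    "p99_ms": grouped.quantile(0.99),
    "volume": grouped.size(),                        # overlay this to see batching pressure
})
print(heatmap)   # 168 rows; plot as a heatmap with volume overlaid
```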
Distinguish cold-cache and warm-cache budgets. Latency SLOs that don't separate these two regimes paper over a 3–10x gap. A prompt that's expected to run at warm-cache p95 of 800ms and cold-cache p95 of 3200ms should have both budgets named in the contract, and the alerting should know which budget applied to a given request. The cleanest way to instrument this is to hash the system-prompt prefix and tag every request with whether the provider returned a cache hit; if your provider exposes that signal (Anthropic's API does, OpenAI's automatic caching surfaces it via response metadata), use it.
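A sketch of what that instrumentation can look like. The usage field names are assumptions about provider response metadata (Anthropic-style cache_read_input_tokens, OpenAI-style prompt_tokens_details.cached_tokens); check your SDK's actual schema before relying on them, and fall back to the prefix-hash heuristic if the signal isn't exposed.

```python
# The two p95 budgets named in the contract, in milliseconds (example values).
BUDGET_MS = {"warm": 800, "cold": 3200}

def cache_regime(usage: dict) -> str:
    """Classify a request as warm- or cold-cache from the provider's usage metadata."""
    cached_tokens = (
        usage.get("cache_read_input_tokens")                               # Anthropic-style field
        or usage.get("prompt_tokens_details", {}).get("cached_tokens")     # OpenAI-style field
        or 0
    )
    return "warm" if cached_tokens > 0 else "cold"

def over_budget(usage: dict, latency_ms: float) -> bool:
    """Alert against the budget that actually applied to this request."""
    return latency_ms > BUDGET_MS[cache_regime(usage)]

print(cache_regime({"cache_read_input_tokens": 1024}))                       # warm
print(over_budget({"prompt_tokens_details": {"cached_tokens": 0}}, 2900.0))  # cold regime, within budget
```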
Run a synthetic-load eval that recreates peak conditions. This is the harder one. The point is not to load-test your own infrastructure (load tests do that). The point is to run your eval suite under realistic neighbor pressure. Tools like LLMPerf, GuideLLM, and Gatling can drive concurrent traffic at production-realistic shapes — varied prompt lengths, realistic distributions of input and output tokens, ramp-up and warm-up phases — while a separate worker thread fires the eval prompts. The latency, error rate, and judge score the eval prompts get under that condition are the numbers you should be staking your launch decisions on, not the ones you get when the eval prompts are the only thing your account is sending.
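A sketch of the shape of such a harness, using asyncio to keep neighbor pressure on the account while the eval prompts run. call_model is a stand-in for your real provider client, and the concurrency, ramp time, and prompt mix here are toy values that should be replaced with numbers taken from production traces (or with one of the load tools named above).

```python
import asyncio
import random
import time

async def call_model(prompt: str) -> str:
    # Stand-in for the real API call; replace with your provider client.
    await asyncio.sleep(random.uniform(0.3, 2.0))
    return "response"

async def background_load(concurrency: int, stop: asyncio.Event) -> None:
    """Keep `concurrency` synthetic requests in flight until told to stop."""
    async def worker():
        while not stop.is_set():
            await call_model("filler prompt " * random.randint(10, 400))
    await asyncio.gather(*(worker() for _ in range(concurrency)))

async def run_eval_under_load(eval_prompts: list[str], concurrency: int = 64) -> list[float]:
    stop = asyncio.Event()
    load_task = asyncio.create_task(background_load(concurrency, stop))
    await asyncio.sleep(5)                       # let the synthetic load ramp up first
    latencies = []
    for prompt in eval_prompts:
        t0 = time.perf_counter()
        await call_model(prompt)                 # the eval request, under neighbor pressure
        latencies.append(time.perf_counter() - t0)
    stop.set()
    await load_task
    return latencies

print(asyncio.run(run_eval_under_load(["eval prompt 1", "eval prompt 2"])))
```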
Get your provider on the record about what's stable across load. Some routing behaviors are stable: the model name on the invoice will not silently switch to a different family. Some are not: which cluster, which quantization tier under fallback, which speculative-decoding configuration, which kernel implementation. Ask the provider's enterprise team to enumerate which axes are guaranteed stable across load conditions and which the team has to re-measure after every significant traffic shift. The answer is usually less reassuring than you'd hope, and the conversation itself is the artifact — it forces explicit ownership of the contract that has been implicit and unmonitored.
The Two Failure Modes This Catches
Two specific incident patterns disappear once a team gets diurnal eval right.
The first is the launch that ships green and degrades silently. A feature launches with eval scores measured at 2 AM. Production users hit it during peak. Quality is consistently lower than the launch promised, but never bad enough to trigger the regression alert that's calibrated against the 2 AM baseline. The team spends a quarter wondering why qualitative user feedback is worse than the dashboard says it should be. With diurnal eval in place the discrepancy surfaces during the launch checklist, before users see it.
The second is the prompt change that passes eval and breaks production. An engineer rewrites a system prompt for clarity. The eval — at 2 AM, on a warm cache, with a small batch — shows a one-point improvement on the rubric. The prompt ships. Production at peak hour hits the new prompt with a cold cache and a large batch, and the shape of the cold-cache forward pass interacts with the new prompt's structure to flip a routing decision that the warm-cache eval never exercised. Users see a regression. The team blames the model. The actual root cause was that the eval was measuring under conditions that don't generalize to production, and the prompt change happened to be near a boundary that the eval didn't probe.
What Production Parity Actually Means
The phrase "production parity" gets used a lot in offline-vs-online eval discussions. It usually means "the eval prompts look like real user prompts" — the same length distribution, the same tag mix, the same sampling temperature settings, the same model name. That's necessary, but it isn't sufficient. Two prompts can be byte-identical and still see different model behavior because the wall clock differs. Production parity has to include the operating conditions of the call, not just the input bytes: the batch size that's likely to apply, the cache state that's likely to apply, the routing tier that's likely to apply, and the time of day at which the inference is going to run.
The teams that learn this the cheap way build it into their eval harness from day one: every eval result is tagged with the wall-clock hour, the cache-hit signal, and the observed end-to-end latency, and the comparison against the production distribution is made on those joint dimensions, not just on the prompt content. The teams that learn it the expensive way are the ones that ship a feature with a green eval and a quiet degradation that takes a quarter to trace back to the time of day on the eval cron.
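One sketch of the per-result record such a harness writes; the field names are illustrative, and the only real requirement is that the operating conditions ride alongside the score so the comparison against production can be made on the joint dimensions.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass
class EvalResult:
    prompt_id: str
    score: float
    wall_clock_utc: datetime
    hour_of_week: int      # 0-167, joins directly against the production heatmap
    cache_hit: bool        # the provider's cache signal, where exposed
    latency_ms: float
    model: str
    eval_slot: str         # e.g. "us_morning_peak" from the stratified schedule

def make_result(prompt_id: str, score: float, cache_hit: bool,
                latency_ms: float, model: str, eval_slot: str) -> EvalResult:
    now = datetime.now(timezone.utc)
    return EvalResult(prompt_id, score, now,
                      now.weekday() * 24 + now.hour,
                      cache_hit, latency_ms, model, eval_slot)

print(asdict(make_result("p-001", 4.0, True, 812.5, "example-model", "us_morning_peak")))
```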
The Architectural Frame
The deeper frame is that LLM inference is the first piece of infrastructure most product teams will operate that is both a model artifact (subject to weight and behavior changes) and a shared compute service (subject to neighbor and load effects), without the operational discipline that either community brings to its own piece. ML teams are used to versioning weights but not to thinking about co-tenants. Distributed systems teams are used to thinking about co-tenants but not to thinking about behavior that drifts on the prompt-cache axis. AI engineering, as a discipline, is the place where both vocabularies have to land at once.
The wall-clock-as-input observation is the entry point. Once a team accepts that the time of day is part of the input distribution, the rest of the discipline — stratified eval scheduling, cold-vs-warm budgets, synthetic-load eval, provider stability contracts — follows from a single principle: measure under the conditions you'll actually run under, not under the conditions that are easiest to measure. Until that principle is internalized, every dashboard the team builds is calibrating an instrument on conditions its users never experience, and every launch decision is being made against a version of reality that disappears at 6 AM when the load comes back.
- https://research.aimultiple.com/llm-latency-benchmark/
- https://opper.ai/blog/llm-router-latency-benchmark-2026
- https://www.codeant.ai/blogs/llm-throughput-rate-limits
- https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- https://arxiv.org/html/2408.04667v5
- https://gatling.io/blog/load-testing-an-llm-api
- https://github.com/ray-project/llmperf
- https://blog.premai.io/load-testing-llms-tools-metrics-realistic-traffic-simulation-2026/
- https://developers.openai.com/api/docs/guides/latency-optimization
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
- https://docs.anthropic.com/en/api/rate-limits
- https://insightfinder.com/blog/hidden-cost-llm-drift-detection/
- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- https://www.anyscale.com/blog/continuous-batching-llm-inference
- https://tokenmix.ai/blog/anthropic-overloaded-error-why-workarounds-2026
