The Same Prompt at 3 PM and 3 AM Is Not the Same Prompt: Diurnal Drift in LLM Evaluation
The eval suite runs at 2 AM. Traffic is low. The cache is cold but the queues are empty. The provider's continuous batcher has spare slots and will service every request near its time-to-first-token (TTFT) floor. The latency distribution is tight, the judge scores are stable, and the dashboard turns green. The team ships.
Six hours later, at 8 AM Pacific, the same prompts hit production during US morning peak. p95 latency is 2.4x what the eval reported. A non-trivial fraction of requests get a 529 from one provider and a fallback to a smaller routing tier from another. Streaming pacing is choppier. The judge — re-run on a sample of production traces that night — gives a half-point lower median score than the same judge gave the same prompts at 2 AM. Nothing changed in the codebase. Nothing changed in the prompt. The wall clock changed.
The architectural realization that has to land is this: an LLM call is not a pure function of its input tokens. It's a stochastic distributed system call where the input includes the wall clock, the load on the provider's cluster, the state of the prompt cache, the size of the current decode batch, and the routing decision the provider's load balancer made under the conditions that prevailed in the millisecond your request arrived. The team that runs evals at 2 AM is calibrating an instrument on conditions its users never experience.
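One way to make the gap visible is to bucket latency samples by hour of day and compare the tail at the eval hour with the tail at the peak hour. The sketch below is a minimal illustration under stated assumptions: the function names are hypothetical, the sample data is made up to mirror the 2.4x figure in the scenario above, and hours are in UTC (10 and 16 correspond to 2 AM and 8 AM Pacific standard time).

```python
from collections import defaultdict
from datetime import datetime, timezone

def p95(values):
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # integer ceil(0.95 * n), 1-based
    return ordered[rank - 1]

def p95_by_hour(samples):
    """samples: iterable of (unix_timestamp_seconds, latency_seconds).
    Returns {utc_hour: p95 latency}."""
    buckets = defaultdict(list)
    for ts, latency in samples:
        hour = datetime.fromtimestamp(ts, tz=timezone.utc).hour
        buckets[hour].append(latency)
    return {hour: p95(vals) for hour, vals in sorted(buckets.items())}

def peak_to_eval_ratio(samples, eval_hour, peak_hour):
    """How much worse the peak-hour tail is than the eval-hour tail."""
    by_hour = p95_by_hour(samples)
    return by_hour[peak_hour] / by_hour[eval_hour]

# Made-up illustration data, not measurements.
eval_ts = datetime(2025, 1, 15, 10, 0, tzinfo=timezone.utc).timestamp()
peak_ts = datetime(2025, 1, 15, 16, 0, tzinfo=timezone.utc).timestamp()
samples = [(eval_ts + i, 1.0) for i in range(20)]
samples += [(peak_ts + i, lat) for i, lat in enumerate([1.0] * 18 + [2.4, 3.0])]

ratio = peak_to_eval_ratio(samples, eval_hour=10, peak_hour=16)
print(ratio)
```

A ratio that stays near 1.0 across hours suggests the eval window is representative for latency; a ratio well above 1.0 means the eval is measuring conditions that production traffic does not see.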
Why the Wall Clock Is an Input
It helps to enumerate the mechanisms, because once they're named the diurnal effect stops looking like a mystery. There are at least four.
Continuous batching changes the math under load. Modern inference servers don't run requests one at a time. They use continuous batching, which slots tokens from many requests into the same forward pass and retires each request at token granularity as soon as it finishes. Anyscale's measurements on this technique have shown order-of-magnitude throughput differences vs static batching, but the throughput win comes with a property eval suites rarely model: the batch size of the forward pass that processes your token depends on what other tokens are in flight. At 2 AM the batch is small; at 10 AM it's big. In a bigger batch the matrix multiplications are larger, the kernel choices may differ, and on some hardware the floating-point accumulation order changes. The logits computed for your request can then differ from what an isolated request would produce, by enough to flip a top-k decision near a boundary.
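The dependence of batch size on neighbor traffic can be seen in a toy scheduler. This is a sketch, not a model of any real inference server: the function name, the slot limit, and the arrival patterns are all assumptions chosen for illustration. Each step decodes one token for every running request, retires finished requests immediately, and admits waiting requests into free slots.

```python
from collections import deque

def simulate_continuous_batching(arrivals_per_step, tokens_per_request, max_slots, steps):
    """Return the decode batch size observed at each step of a toy
    iteration-level scheduler."""
    waiting = deque()
    running = []  # remaining tokens to decode for each in-flight request
    batch_sizes = []
    for step in range(steps):
        # New requests join the queue.
        for _ in range(arrivals_per_step(step)):
            waiting.append(tokens_per_request)
        # Admit waiting requests into any free slots.
        while waiting and len(running) < max_slots:
            running.append(waiting.popleft())
        batch_sizes.append(len(running))
        # Decode one token per running request; drop the ones that finished.
        running = [r - 1 for r in running if r - 1 > 0]
    return batch_sizes

# Hypothetical load patterns: one request every 10 steps vs two requests every step.
quiet = simulate_continuous_batching(lambda s: 1 if s % 10 == 0 else 0, 5, 8, 20)
busy = simulate_continuous_batching(lambda s: 2, 5, 8, 20)

print(max(quiet), max(busy))
```

The request itself is identical in both runs. What differs is how many other requests share its forward pass, which is the quantity an off-peak eval never varies.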
Floating-point arithmetic isn't associative. The Thinking Machines analysis of nondeterminism in LLM inference makes this concrete: even at temperature 0, the same input can produce different outputs because the kernels that run under different load conditions reduce floating-point sums in different orders, and (a + b) + c is not always equal to a + (b + c) in IEEE-754. Most of the time the difference is invisible. On boundary tokens — the ones where two candidates are within a hair of each other in the softmax — it isn't. And the rate at which boundary tokens flip is correlated with batch size, which is correlated with load, which is correlated with time of day.
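A small demonstration of the effect, written as a sketch rather than a reproduction of any real kernel: `chunked_sum` is a hypothetical stand-in for a reduction whose tiling changes with batch size, and the magnitudes are deliberately exaggerated so the flip is visible in four numbers. An explicit loop is used instead of the built-in `sum`, since recent Python versions compensate for rounding error when summing floats.

```python
def naive_sum(values):
    """Left-to-right float accumulation, one rounding per addition."""
    total = 0.0
    for v in values:
        total += v
    return total

def chunked_sum(values, chunk_size):
    """Reduce fixed-size chunks first, then reduce the partial sums.
    Different chunk sizes stand in for different reduction orders."""
    partials = [naive_sum(values[i:i + chunk_size])
                for i in range(0, len(values), chunk_size)]
    return naive_sum(partials)

# The textbook case: same three numbers, different grouping.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right)  # False

# Exaggerated example: the same contributions reduced in two orders.
contributions = [1e16, 1.0, -1e16, 1.0]
logit_a_wide = chunked_sum(contributions, 4)    # 1.0
logit_a_narrow = chunked_sum(contributions, 2)  # 0.0
logit_b = 0.5

def pick(logit_a, logit_b):
    """Greedy choice between two candidate tokens."""
    return "A" if logit_a > logit_b else "B"

print(pick(logit_a_wide, logit_b), pick(logit_a_narrow, logit_b))  # A B
```

In real inference the discrepancies are many orders of magnitude smaller, which is why only near-tied candidates are affected.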
The prompt cache has a temperature. Providers cache long shared prefixes (system prompts, retrieved context, few-shot exemplars) so they don't have to recompute them for every request. A request that lands when the prefix is warm pays one latency budget; a request that lands when the prefix has been evicted pays another. Eviction is driven by total cache pressure, which is driven by neighbor traffic, which is — again — diurnal. At 2 AM your eval suite is the only tenant exercising that prefix; the cache stays warm by virtue of repetition. At peak hour you're competing with everyone else for cache capacity, and your prefix gets cold-recomputed more often than your eval suggested it would.
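The eviction dynamic can be sketched with a shared cache and varying amounts of neighbor traffic between your own requests. Everything here is an assumption made for illustration: providers do not publish their cache policies in this detail, LRU is used only as a simple stand-in for "eviction under pressure", and the class and function names are hypothetical.

```python
from collections import OrderedDict

class LRUPrefixCache:
    """Toy shared prefix cache with least-recently-used eviction."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._entries = OrderedDict()

    def touch(self, prefix_id):
        """Return True on a hit. On a miss, insert the prefix and evict
        the least recently used entry if the cache is over capacity."""
        if prefix_id in self._entries:
            self._entries.move_to_end(prefix_id)
            return True
        self._entries[prefix_id] = None
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)
        return False

def our_hit_rate(capacity, neighbors_between_requests, rounds=100):
    """Hit rate for one tenant's prefix when that many distinct neighbor
    prefixes touch the cache between each of the tenant's requests."""
    cache = LRUPrefixCache(capacity)
    hits = 0
    neighbor_id = 0
    for _ in range(rounds):
        if cache.touch("our-system-prompt"):
            hits += 1
        for _ in range(neighbors_between_requests):
            cache.touch(f"neighbor-{neighbor_id}")
            neighbor_id += 1
    return hits / rounds

off_peak = our_hit_rate(capacity=8, neighbors_between_requests=2)
peak = our_hit_rate(capacity=8, neighbors_between_requests=8)
print(off_peak, peak)
```

With light neighbor traffic only the first request misses. Once the neighbor traffic between two of your requests is enough to fill the cache, every request pays the cold path, even though your own request pattern has not changed.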
The provider quietly re-routes you under load. Providers don't publish exactly when this happens, but the overload-error pattern is well documented: in 2026, both OpenAI and Anthropic have visible peak-hour capacity strain, and the 429 / 529 curves are heavily diurnal. What's less visible — and far more dangerous for evals — is the routing fallback that happens before the overload error: a request that would have hit the primary cluster gets shifted to a backup pool with different hardware, possibly different quantization, possibly different speculative-decoding configuration. The output is still on-spec for the model name on the invoice. It is not necessarily identical to what the primary cluster produced. And the team that evaluated on the primary cluster at 2 AM has no traceable record that some fraction of its production requests are being served by something subtly different.
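A traceable record starts with logging, per request, the hour, the status code, and whatever serving identifier the response exposes. The sketch below assumes a hypothetical `backend_fingerprint` field; whether a given provider returns such an identifier, and what it actually covers, has to be checked against that provider's documentation. The trace data is made up.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

OVERLOAD_STATUSES = {429, 529}

@dataclass(frozen=True)
class CallTrace:
    utc_hour: int                       # hour of day the request was sent
    status_code: int                    # HTTP status returned by the provider
    backend_fingerprint: Optional[str]  # hypothetical serving identifier, if exposed

def summarize_by_hour(traces):
    """Per hour: overload rate and the distinct fingerprints observed."""
    grouped = defaultdict(list)
    for t in traces:
        grouped[t.utc_hour].append(t)
    summary = {}
    for hour, items in sorted(grouped.items()):
        overloaded = sum(1 for t in items if t.status_code in OVERLOAD_STATUSES)
        fingerprints = {t.backend_fingerprint for t in items if t.backend_fingerprint}
        summary[hour] = {
            "overload_rate": overloaded / len(items),
            "fingerprints": fingerprints,
        }
    return summary

# Made-up traces: a quiet hour and a peak hour.
traces = [CallTrace(10, 200, "fp-a") for _ in range(4)]
traces += [
    CallTrace(16, 200, "fp-a"),
    CallTrace(16, 200, "fp-a"),
    CallTrace(16, 200, "fp-b"),
    CallTrace(16, 529, None),
]

summary = summarize_by_hour(traces)
print(summary)
```

An hour in which more than one fingerprint appears, or in which the overload rate climbs, is an hour whose outputs should not be assumed identical to what the eval saw.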
Why Eval at 2 AM Is the Industry Default
Nobody chose this. It accreted.
CI runs nightly because that's when CI has historically run, in a world where the metric being measured was unit-test pass/fail and the cost-relevant axis was "did we free up the build farm before the morning standup." That's a reasonable schedule for a deterministic system. It is the wrong schedule for a stochastic distributed system whose behavior covaries with the load on someone else's hardware.
The provider-side incentive compounds it. Asynchronous batch APIs offer steep discounts: Anthropic's batch tier is 50% off and OpenAI's is similar, in exchange for a completion window of up to 24 hours that lets the provider run the work whenever capacity is spare. A team trying to hit a quarterly cost target will move evals into the discount tier without thinking carefully about what that does to representativeness. The line item in the budget reads "evals", the line item that would catch the consequence reads "production quality regressions", and the two live on different dashboards owned by different people.
And the human factor: the eval engineer wants to see results in the morning. A 2 AM run means stable conditions, no flakes from peak-hour rate limits, and a clean dashboard for the standup. Every individual incentive points at the wrong hour.
The Discipline That Has to Land
- https://research.aimultiple.com/llm-latency-benchmark/
- https://opper.ai/blog/llm-router-latency-benchmark-2026
- https://www.codeant.ai/blogs/llm-throughput-rate-limits
- https://thinkingmachines.ai/blog/defeating-nondeterminism-in-llm-inference/
- https://arxiv.org/html/2408.04667v5
- https://gatling.io/blog/load-testing-an-llm-api
- https://github.com/ray-project/llmperf
- https://blog.premai.io/load-testing-llms-tools-metrics-realistic-traffic-simulation-2026/
- https://developers.openai.com/api/docs/guides/latency-optimization
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
- https://docs.anthropic.com/en/api/rate-limits
- https://insightfinder.com/blog/hidden-cost-llm-drift-detection/
- https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- https://www.anyscale.com/blog/continuous-batching-llm-inference
- https://tokenmix.ai/blog/anthropic-overloaded-error-why-workarounds-2026
