
Time-of-Day Quality Drift: Why Your AI Feature Behaves Differently at 10 AM ET

9 min read
Tian Pan
Software Engineer

Your eval suite ran green at 2 AM PT on a quiet provider. QA smoke-tested at 11 PM the night before launch. The feature goes live, and by Tuesday at 10 AM Eastern your p95 is 40% higher than the dashboard you signed off on, your agent is dropping the last tool call in a six-step plan, and your support inbox is filling with tickets that all sound the same: "the AI was weird this morning." Nobody is wrong. The model is also not wrong. The eval set is wrong — it never saw a saturated provider, so it has no opinion on what the feature does when the queue depth triples and the deadline budget collapses.

Provider load is not a latency problem with a quality side effect. It is a distribution shift in the inputs your model and your agent loop receive, and you have built every quality signal you trust on the wrong half of that distribution. The fix is not a faster region or a better model. The fix is to stop pretending your eval harness is sampling from the same world your users are.

The eval suite is sampling a fiction

Most teams run evals against a cold provider in a low-traffic window because that is when nothing else is happening, the runs are cheaper, and the variance is lower. Variance is lower because the provider is not under load. The eval suite then anchors the team's mental model of "what the model does" on the behavior of an inference fleet operating at maybe 30% utilization with empty batches and headroom on autoscaling.

Production at 10 AM ET on a Tuesday is a different fleet. Public benchmarks show that OpenAI's GPT-4o time to first token (TTFT) can swing 2–4x between off-peak and peak hours, ranging from roughly 300ms to 800ms, while Anthropic's Claude shows roughly 400–900ms TTFT with tighter variance under load. Those are the numbers the benchmarks capture. They do not capture the second-order effects that ride on top of the latency: when the token stream slows down, your client-side deadlines expire before the work finishes, your agent loop runs out of budget mid-plan, and your retry path engages with a slightly different prompt than the one your eval suite ever evaluated.

What you measured was the ceiling of model quality. What users get is the floor of the joint system — model, provider load, network path, your own deadline budget, your fallback chain — at the moment they happened to ask. The eval ceiling and the production floor diverge most when you most need them to agree.

Latency degrades, but quality degrades through latency

The tempting framing is that latency and quality are separate axes. They are not. Once a provider's batching density rises and its queue deepens, the token stream slows. Three things downstream of that slowdown silently degrade quality, none of which are visible in a latency dashboard:

  • Reasoning gets truncated when the wall-clock deadline expires before the token budget is spent. Your prompt allocates, say, 4,000 tokens for chain-of-thought. At off-peak streaming speed, those tokens fit comfortably inside your client-side deadline, so the model finishes its reasoning and then answers. At peak speed, the same deadline cuts the reasoning short and the model answers with a partial trace it would not have committed to on a calm provider.
  • Fallback routes fire silently. Production routers commonly have a declarative fallback chain — primary model, cheaper model, semantic cache, error — that trips on 429s, on rising error rates, or on cost velocity. Under load, you are not running the model you evaluated; you are running an unannounced mixture of it and a cheaper sibling.
  • Tool-call sequences get cut off. An agent loop with a 30-second deadline that needs five tool calls on a fast day needs the same five calls on a slow day. On the slow day, you get four and a guess (see the sketch after this list).
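
A minimal sketch of that last failure, assuming a hypothetical `call_tool` client wrapper and a fixed wall-clock budget. Nothing in the loop changes between 2 AM and 10 AM except how long each call takes, and the truncation happens in your own code, not the model's:

```python
import time

DEADLINE_S = 30.0          # same budget all day
PER_CALL_OVERHEAD_S = 0.5  # prompt assembly, parsing, network

def run_agent(plan, call_tool):
    """Execute tool calls until the plan finishes or the deadline expires.

    `plan` is a list of tool-call specs; `call_tool` is a stand-in for
    whatever client wrapper your stack uses (hypothetical here).
    """
    started = time.monotonic()
    results = []
    for step in plan:
        elapsed = time.monotonic() - started
        if elapsed + PER_CALL_OVERHEAD_S > DEADLINE_S:
            # Off-peak: five calls fit. At peak, each call is slower, the
            # budget runs out here, and the model answers from four results
            # and a guess. No error is raised, nothing is logged as a
            # failure -- the response is just quietly worse.
            break
        results.append(call_tool(step, timeout=DEADLINE_S - elapsed))
    return results
```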

Each of these is invisible to a quality eval that grades whole-response correctness on a fixed prompt. The eval set scores fine because the post-fallback, post-truncation, post-cut-off responses are still grammatical and topical. They are just systematically worse on the dimensions your users care about, and the gap is correlated with the hour of the day.

The org seam that makes this invisible

This failure mode survives in production because of a clean org boundary that nobody questions. SRE owns latency. The AI team owns quality. Both teams have dashboards. Neither team has a dashboard that overlays the two on the same time axis.

When SRE sees the p95 climb at 10 AM, they file it as expected diurnal load and move on. When the AI team sees a slight quality regression in the weekly eval-on-traffic report, they file it as model drift and queue up a re-prompt. The correlation never surfaces because no individual reviewer is looking at both signals together with provider-side congestion as the third axis. The org chart is reproducing the bug.

The fix here is structural, not technical. The latency-quality joint dashboard has to exist before any team owns it, because no team will request a dashboard for a problem they have been organized not to see. The leadership move is to pick one engineer — pragmatically, the one who already complains the loudest about cross-team correlations — and give them the explicit charter to overlay quality metrics on provider congestion signals on the same time axis, then make the result a standing item in whatever cross-functional review already exists. The point is not the dashboard. The point is that the org now has someone whose job it is to notice.
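
As a starting point for that charter, a minimal sketch of the overlay, assuming you already export hourly eval scores and provider-side latency as two time series (the file and column names are placeholders):

```python
import pandas as pd

# Two exports nobody currently reads side by side (paths and columns are placeholders).
quality = pd.read_csv("eval_scores_hourly.csv", parse_dates=["hour"])   # hour, score
latency = pd.read_csv("provider_p95_hourly.csv", parse_dates=["hour"])  # hour, p95_ms

joined = quality.merge(latency, on="hour", how="inner")

# One number that makes the seam visible: how strongly quality tracks
# provider congestion across the day.
print(joined[["score", "p95_ms"]].corr().loc["score", "p95_ms"])

# The view worth putting in front of the cross-functional review:
# mean quality and mean p95, hour by hour, on the same axis.
print(joined.groupby(joined["hour"].dt.hour)[["score", "p95_ms"]].mean())
```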

What an eval harness that respects time-of-day looks like

A quality regression that only appears under provider load is a regression that an off-peak eval suite cannot, in principle, catch. The eval harness has to sample across the load curve, not against an idealized empty fleet. Three concrete moves change this:

  • Run evals on a schedule that matches user load, not infra convenience. If your users hit the feature hardest at 10 AM Eastern, your eval suite should run at 10 AM Eastern, on the same provider tier and region as production traffic. Off-peak runs are still useful as a baseline; they are not sufficient on their own. The cost of the extra runs is rounding error next to the cost of the quarter of your users who show up at peak getting a degraded experience.
  • Use synthetic load to manufacture provider congestion in your own staging path. Open-source toolkits like GuideLLM let you simulate real-world traffic patterns against an inference deployment and measure throughput and latency under controlled load. Pair the load generator with an eval harness and you now have a way to grade model quality against saturation conditions you can re-create on demand. This is the cheap version of the test (see the sketch after this list).
  • Shadow your evals onto production traffic during peak windows. Shadow testing — duplicating live requests to a candidate model without affecting users — is the established pattern for new-model rollouts. The same plumbing works for grading the current model against itself under real peak conditions. You get the unforgiving inputs your users actually send, at the load profile that matters, without exposing anyone to the result.
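
As a concrete starting point for the second move, here is a minimal sketch, not GuideLLM's actual API, of pairing a crude load generator with a graded eval pass against the same staging endpoint. The URL, request shape, and `grade` function are placeholders for whatever your harness already uses:

```python
import asyncio
import httpx

STAGING_URL = "https://staging.example.com/v1/chat"  # placeholder endpoint

async def background_load(client, stop, concurrency):
    """Keep `concurrency` filler requests in flight until told to stop."""
    async def worker():
        while not stop.is_set():
            try:
                await client.post(STAGING_URL, json={"prompt": "filler", "max_tokens": 256})
            except httpx.HTTPError:
                pass  # keep the pressure on; errors under load are part of the point
    await asyncio.gather(*(worker() for _ in range(concurrency)))

async def eval_under_load(eval_cases, grade, concurrency=64):
    """Grade the eval set while the same endpoint is saturated with filler traffic."""
    stop = asyncio.Event()
    async with httpx.AsyncClient() as client:
        load = asyncio.create_task(background_load(client, stop, concurrency))
        scores = []
        for case in eval_cases:
            resp = await client.post(STAGING_URL, json=case["request"], timeout=30.0)
            scores.append(grade(case, resp.json()))
        stop.set()
        await load
    return {"concurrency": concurrency, "mean_score": sum(scores) / len(scores)}
```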

The point is not to pick one of these. It is to give up on the idea that a single low-variance, low-traffic eval window represents production. It does not, and the gap between the two is the bug.

Comms and posture: name the diurnal floor

There is a customer-facing dimension to this that engineering teams tend to under-invest in until it shows up as a renewal risk. If your feature has time-of-day quality drift and you have decided not to fully fix it — a defensible choice; provider load is upstream of you — then the comms posture has to name it. Promising flat latency that the team cannot deliver is worse than naming the variance up front.

Three specific moves help:

  • Status pages need a quality dimension, not just an uptime one. "AI features may run slower or with reduced reasoning depth during peak provider load" is a sentence the support team can point at instead of answering the same ticket five times every morning.
  • The product surface can hint at the floor. A subtle "this response took longer than usual" indicator is cheaper than a refund and earns more trust than silent degradation. Anchor the user's expectation to the actual distribution, not to the best case.
  • Sales and success need a vocabulary for it. "AI feature responsiveness varies with provider load" should be a phrase the customer-facing org can say without flinching. If it is not, the renewal call where the customer raises the issue is one your engineering team is going to lose.

These are not concessions. They are an acknowledgment that AI feature quality has a diurnal component your eval suite did not see, your dashboard does not surface, and your customer notices first.

What changes when the team sees the curve

The interesting downstream effect of sampling across the load curve is that the team stops treating quality as a single number and starts treating it as a distribution conditioned on provider state. The first time you see a quality-vs-congestion overlay, the conversation in the launch meeting changes from "ship at 87%" to "ship at 87% median, 74% at the peak-load tenth percentile — is the floor acceptable?" That is a better conversation, and it is the conversation your users have already been having on your behalf in support tickets every morning.
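
A minimal sketch of what that launch-meeting number looks like, assuming each eval run is stored alongside the provider congestion it ran under (file and column names are placeholders, and scores are assumed to be on a 0–1 scale):

```python
import pandas as pd

runs = pd.read_csv("eval_runs.csv")  # columns: score, provider_p95_ms (placeholders)

# "Peak load" here: the most-congested tenth of eval runs.
peak_cutoff = runs["provider_p95_ms"].quantile(0.9)
peak = runs[runs["provider_p95_ms"] >= peak_cutoff]

print(f"median quality:          {runs['score'].median():.0%}")
print(f"quality under peak load: {peak['score'].median():.0%}")
```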

The next move is to treat the application's own deadline and concurrency budgets like an autoscaling policy. Pre-emptively widening deadlines during known peak windows beats chasing the regression after it lands; the agent loop that finishes five tool calls at 2 AM should be allowed to finish five tool calls at 10 AM, even if each call is slower, rather than being forced to truncate to four because the deadline was tuned to the off-peak case.
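
A minimal sketch of that policy, assuming a simple hour-of-day table; the windows and budgets are illustrative, not recommendations:

```python
from datetime import datetime, timezone

# Agent-loop deadline budget keyed by hour of day (UTC). The window and the
# budgets are illustrative; 14:00-18:00 UTC roughly covers the 10 AM ET peak
# described above.
PEAK_HOURS_UTC = range(14, 18)
OFF_PEAK_DEADLINE_S = 30.0
PEAK_DEADLINE_S = 45.0  # wide enough for five slower tool calls

def deadline_for(now: datetime | None = None) -> float:
    """Return the agent-loop wall-clock budget for the current hour."""
    now = now or datetime.now(timezone.utc)
    return PEAK_DEADLINE_S if now.hour in PEAK_HOURS_UTC else OFF_PEAK_DEADLINE_S
```

The point of the table is that the deadline follows the provider's day rather than a single number tuned on an empty fleet.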

The architectural realization here is small but consequential: AI feature quality is a joint property of your model, your prompt, your agent loop, and the provider's load curve at the moment of inference. The team that treats it as a property of the first three is over-fitting to the shape of an empty provider. The team that treats it as joint is the one that ships a feature whose floor it has actually measured.
