Diurnal Latency: Why Your AI Feature Is Slowest at 9am ET
Sometime in the last quarter, an engineer on your team opened a Slack thread that started with "the model got slow." They had a graph: p95 latency for your assistant feature climbed steadily from 7am, peaked around 10am Eastern, plateaued through lunch, and quietly recovered after 5pm. The shape repeated the next day, and the day after that. The team retraced their deploys, blamed a tokenizer change, then a context-length regression, then nothing in particular. The fix never landed because the bug never lived in your code.
Frontier model providers run shared inference fleets. When your users wake up, so does the rest of North America, plus the European afternoon, plus every internal tool at every other company that bought into the same API. Queue depth at the provider doubles, GPU contention rises, and your p95 doubles with it — without a single line of your codebase changing. It is the most predictable production incident in your stack and almost no team builds a dashboard for it.
The shape of the curve
Public latency trackers and provider postmortems converge on the same daily rhythm. Weekday peak load on the major US-hosted endpoints sits roughly between 8am and 2pm Eastern, with a smaller secondary bump after dinner when consumer chat traffic picks up. Off-peak windows — call it midnight to 6am Eastern — routinely run 30 to 40 percent faster on identical workloads, sometimes more on long prompts where the time-to-first-token is dominated by prefill scheduling rather than decode throughput.
The variance does not announce itself as an outage. There is no 5xx, no rate-limit error, no incident on the provider's status page. Requests still complete; they just take longer. A workload that finishes in 4 seconds at 4am can take 9 seconds at 10am, and your tracing tool will quietly fold both into the same histogram. If you aggregate across the day, the daily average looks fine and the daily p95 looks like noise. The signal lives entirely in the time dimension you threw away.
Research workloads have measured this directly. A two-month trace of campus ChatGPT usage published in a 2025 systems paper showed request rates fluctuating up to 3× within minutes while following broader diurnal patterns, and operators were forced to overprovision GPUs to the peak just to hold their service-level objectives during the bad hours. That overprovisioning is what your provider also does, but only up to the contractual budget — the rest of the load lands on you as latency.
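You do not need a research trace to see this in your own stack. A standing probe that sends the same prompt around the clock and keeps the timestamp will surface the curve within a week or two. Here is a minimal sketch, with a hypothetical call_model() standing in for whatever client you already use:

```python
# probe.py -- fire a fixed prompt at the model on a schedule and keep the timestamp.
# call_model() is a placeholder for your actual client (OpenAI SDK, Anthropic SDK,
# an internal gateway); only the timing and logging logic matters here.
import csv
import time
from datetime import datetime, timezone

PROMPT = "Summarize the following paragraph in one sentence: ..."  # fixed workload

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire this to your actual model client")

def probe_once(path: str = "latency_log.csv") -> None:
    started = datetime.now(timezone.utc)
    t0 = time.monotonic()
    call_model(PROMPT)
    elapsed = time.monotonic() - t0
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([started.isoformat(), f"{elapsed:.3f}"])

if __name__ == "__main__":
    while True:
        probe_once()
        time.sleep(15 * 60)  # every 15 minutes, around the clock
```

Holding the prompt fixed matters: it pins your side of the workload, so the only thing left to vary is the provider's load.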
Why your dashboard hides it
Most teams set up latency monitoring for a model API the same way they set it up for a database: a histogram aggregated across the full sampling window, with p50, p95, and p99 lines on a single chart. That is the right shape for a service whose load you control. It is the wrong shape for a service whose load is dominated by other people's traffic.
The aggregation hides the bimodality. If 30 percent of your traffic lands in the off-peak window and 70 percent in peak, your p95 reflects a weighted blend that no individual user ever experiences. A 2am batch job and a 10am customer-facing call are wired through the same client and accounted for in the same bucket, even though they live on different planets latency-wise. The dashboard reads as "stable with high tail variance" when the truth is "two stable regimes with a deterministic switch in between."
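The arithmetic is easy to check. A toy simulation with the article's rough numbers (about 4 seconds off-peak, about 9 seconds at peak, a 30/70 split; synthetic values, not measurements) shows two tight regimes producing blended statistics that describe neither cohort:

```python
# Two stable latency regimes blended into one histogram: each regime is tight,
# but the blend reads as one noisy service with a wide tail.
import numpy as np

rng = np.random.default_rng(0)
off_peak = rng.normal(4.0, 0.5, 3_000).clip(0.5)   # ~30% of traffic
peak = rng.normal(9.0, 1.5, 7_000).clip(0.5)       # ~70% of traffic
blended = np.concatenate([off_peak, peak])

for name, xs in [("off-peak", off_peak), ("peak", peak), ("blended", blended)]:
    print(f"{name:>8}  p50={np.percentile(xs, 50):4.1f}s  "
          f"p95={np.percentile(xs, 95):4.1f}s  std={xs.std():.1f}s")
```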
The fix is not subtle: cohort by hour-of-day. Cut the same histogram into 24 sub-histograms, one per hour, and lay them out as a heatmap. The diurnal pattern will jump out instantly — usually a hot band from 13:00 to 19:00 UTC that fades on weekends. Layer per-region latency on top and you can see the European morning contributing to the slope before North America wakes up. Once you have that view, every model-side latency question gets one extra column on the way to an answer: was this user in the bad hours?
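A minimal version of that cut, assuming the CSV produced by the probe sketch above (any log with one timestamp column and one latency column works the same way):

```python
# Cohort the latency log by hour of day and print per-hour p95 with a crude
# text heatmap. Expects one row per request: UTC timestamp, latency in seconds.
import pandas as pd

df = pd.read_csv("latency_log.csv", names=["ts", "latency_s"], parse_dates=["ts"])

hourly_p95 = (
    df.groupby(df["ts"].dt.hour)["latency_s"]
      .quantile(0.95)
      .reindex(range(24))          # keep empty hours visible
)

worst = hourly_p95.max()
for hour, p95 in hourly_p95.items():
    bar = "" if pd.isna(p95) else "#" * int(20 * p95 / worst)
    label = "   -" if pd.isna(p95) else f"{p95:4.1f}"
    print(f"{hour:02d}:00 UTC  {label}s  {bar}")
```

Add a day-of-week axis to the same groupby and the weekend fade shows up too.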
What you can actually do
The levers follow from everything above: keep the timestamp on every model call, cohort latency by hour of day, move non-interactive work into the off-peak window, and look at failover routing or batch APIs for the traffic that has to stay interactive. The links below (trackers, benchmarks, research traces, and provider guides) are reasonable starting points, and a sketch of off-peak scheduling follows the list.
- https://gptforwork.com/tools/openai-api-and-other-llm-apis-response-time-tracker
- https://llmoverwatch.com/
- https://research.aimultiple.com/llm-latency-benchmark/
- https://kickllm.com/research/ai-api-latency-comparison.html
- https://tokenmix.ai/blog/ai-api-latency-benchmark
- https://www.lololai.com/blog/the-strategic-timing-guide-how-time-of-day-impacts-genai-performance
- https://community.openai.com/t/avoiding-throttling-during-peak-hours/1358839
- https://www.sentisight.ai/the-busiest-times-for-generative-ai-usage/
- https://jovans2.github.io/files/DynamoLLM_HPCA2025.pdf
- https://arxiv.org/html/2410.01228v1
- https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
- https://www.together.ai/blog/batch-api
- https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/reduce-latency
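For the non-interactive slice of your traffic, the cheapest win is simply not running it in the bad hours. Below is a sketch of deferring a job until the off-peak window named earlier; the midnight-to-6am Eastern boundaries are an assumption to tune against your own heatmap, and run_nightly_embedding_backfill() is a placeholder for your own job.

```python
# Defer a non-interactive job to the off-peak window (roughly midnight to 6am
# Eastern). The boundaries are an assumption, not a provider guarantee.
from datetime import datetime, timedelta
from zoneinfo import ZoneInfo

EASTERN = ZoneInfo("America/New_York")
OFF_PEAK_START, OFF_PEAK_END = 0, 6   # hours, local Eastern time

def seconds_until_off_peak(now: datetime | None = None) -> float:
    """Return 0 if we are already in the window, else seconds until it opens."""
    now = (now or datetime.now(tz=EASTERN)).astimezone(EASTERN)
    if OFF_PEAK_START <= now.hour < OFF_PEAK_END:
        return 0.0
    next_open = now.replace(hour=OFF_PEAK_START, minute=0, second=0, microsecond=0)
    if now.hour >= OFF_PEAK_END:
        next_open += timedelta(days=1)
    return (next_open - now).total_seconds()

# Example: sleep until the window opens, then run the batch job.
# time.sleep(seconds_until_off_peak()); run_nightly_embedding_backfill()
```

Provider batch endpoints, like the one linked above, make a similar trade by handing the scheduling to the provider in exchange for looser completion deadlines.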
