Your Load Tests Are Lying: LLM Provider Capacity Contention in Production
You ran a load test. Your p95 latency was 450ms. You felt good about it, shipped the feature, and then your on-call rotation lit up two weeks later because users were seeing 25-second response times at 9 AM on a Tuesday.
Nothing changed in your code. No deployment, no config change. The provider's status page said "operational." And yet your app was unusable for 20 minutes during peak business hours.
This is the LLM capacity contention problem, and it's one of the most common failure modes engineers don't see coming until they've already been burned.
Why Your Load Test Cannot Reproduce This
When you load test a traditional API, you control the server: the system you test is the system you'll run in production. Apply enough traffic and you'll find the breaking point.
LLM APIs work differently. You are one tenant among thousands on a shared infrastructure. The provider's capacity is pooled — the same GPUs that run your requests also run requests from every other customer on the same tier, the same region, the same model.
Your load test runs at 2 AM in a dedicated test environment. It hits the provider when global demand is low. Capacity is available, queues are empty, and you get representative response times. You tune your timeouts around those numbers, call it done, and ship.
What you haven't tested: what happens to your latency when the East Coast start-of-day traffic hits the same model you're using, simultaneously, from thousands of other organizations. Or when a viral product launch by some company you've never heard of spikes demand on gpt-4o while your job runs. Or when the provider rolls out a new model tier that quietly draws capacity away from the tier you're on.
The gap between your test environment and this shared production reality is structural. No amount of tuning your test harness closes it.
The Specific Pattern: Shared Peak Demand Pressure
LLM providers publish rate limits in terms of requests per minute (RPM) and tokens per minute (TPM). These are per-customer limits, not global capacity numbers. The implication most teams miss: staying within your rate limits does not mean you're protected from latency degradation when the provider's shared infrastructure is under pressure.
When global demand spikes, several things happen at the provider level:
- Queue depth grows. Your requests enter a longer queue before being scheduled on a GPU. This adds latency before the first token even starts generating.
- Batching decisions shift. Inference systems batch requests together to improve GPU utilization. Under high load, batch composition changes, which affects Time to First Token (TTFT) even for requests that would otherwise have been fast.
- Regional imbalance appears. Providers route traffic across regions, but capacity isn't perfectly distributed. If a region gets overwhelmed, requests overflow to farther-away regions, adding round-trip latency.
The Azure OpenAI incident in November 2025 illustrated this clearly: latency spiked from 2–5 seconds to over 60 seconds in the Sweden Central region due to backend capacity constraints from high global demand. The per-customer rate limits were never hit. The shared infrastructure was simply saturated.
OpenAI's December 2024 incidents followed similar patterns — load balancer misconfigurations and infrastructure upgrades during high-demand periods produced cascading failures where 45% of API requests returned errors for over 90 minutes. Not because any individual customer's limits were exceeded, but because shared systems failed under load.
By Q1 2025, LLM API uptime across the industry had fallen from 99.66% to 99.46% year-over-year — a 60% increase in downtime as demand growth outpaced infrastructure scaling. Individual providers like Anthropic averaged roughly 158 incidents in a 90-day window, with median incident duration over an hour.
The Metrics You Should Be Tracking (But Probably Aren't)
Most load tests measure what traditional tools measure: requests per second, error rates, response time. These are the wrong metrics for LLM APIs.
The metrics that actually matter:
Time to First Token (TTFT): The latency from request submission to receiving the first token of the response. This is what users perceive as "lag." Target values: under 500ms for conversational interfaces, under 200ms for copilot-style assistants. Under provider capacity pressure, TTFT is the first metric to degrade — it reflects queue depth directly.
Inter-Token Latency (ITL): The time between consecutive tokens in a streaming response. Values above 100ms create visible stuttering. This reflects the provider's current decoding throughput and is sensitive to GPU resource contention.
p99 latency, not p50: Your median response time is a vanity metric. Users who hit the p99 case, the ones who wait 20 seconds when everyone else waits 2, are the ones who file support tickets and churn. Track your p95 and p99 in production separately from your test environments.
Goodput: The fraction of requests that completed successfully within your SLO deadline. A request that takes 90 seconds and succeeds isn't a success for a product with a 10-second timeout — it's a failure that burned quota and returned nothing to the user.
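A minimal sketch of how you might compute these from a streaming response, assuming a generic token iterator (the `stream_tokens` argument and the 10-second SLO deadline are illustrative stand-ins, not any particular SDK's API):

```python
import time

# Measure TTFT and ITL from any streaming token iterator, then classify the
# request against an SLO deadline (the goodput test described above).
def measure_stream(stream_tokens, slo_deadline_s: float = 10.0):
    start = time.monotonic()
    first_token_at = None
    gaps = []                  # inter-token latency samples
    last = start
    token_count = 0

    for _token in stream_tokens:
        now = time.monotonic()
        if first_token_at is None:
            first_token_at = now - start       # TTFT
        else:
            gaps.append(now - last)            # ITL
        last = now
        token_count += 1

    total = time.monotonic() - start
    return {
        "ttft_s": first_token_at,
        "itl_avg_s": sum(gaps) / len(gaps) if gaps else None,
        "tokens": token_count,
        "total_s": total,
        "within_slo": total <= slo_deadline_s,  # counts toward goodput
    }
```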
Multi-Provider Hedging: The Practical Architecture
The only structural defense against provider capacity contention is not depending on a single provider for all traffic. This isn't a backup-provider-for-emergencies setup — it's active traffic distribution.
The pattern used by teams that have been through this:
Latency-based routing: Route each request to the provider currently showing the lowest latency for that model tier. Tools like LiteLLM support this natively. You configure a pool of providers (OpenAI + Azure OpenAI + Anthropic, or any combination), set a health-check interval, and the router distributes traffic dynamically based on observed response times. When one provider starts slowing down, traffic naturally shifts away from it.
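A sketch of what this looks like with LiteLLM's Router, based on its documented routing strategies; the model names, keys, and endpoints below are placeholders, so check the current LiteLLM docs for exact parameter names before relying on this:

```python
from litellm import Router

# Two deployments registered under the same alias ("gpt-4o-pool"); the router
# sends each request to whichever has shown the lowest recent latency.
model_list = [
    {
        "model_name": "gpt-4o-pool",
        "litellm_params": {"model": "openai/gpt-4o", "api_key": "sk-..."},
    },
    {
        "model_name": "gpt-4o-pool",
        "litellm_params": {
            "model": "azure/my-gpt4o-deployment",
            "api_key": "...",
            "api_base": "https://my-resource.openai.azure.com",
        },
    },
]

router = Router(model_list=model_list, routing_strategy="latency-based-routing")

response = router.completion(
    model="gpt-4o-pool",
    messages=[{"role": "user", "content": "ping"}],
)
```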
Fallback chains: Every primary provider configuration should have an explicit fallback sequence. A common pattern: preferred model on preferred provider → same model on secondary provider → smaller/faster model on any provider. The degradation is visible to users (a faster but less capable model), but the alternative — a timeout — is worse.
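A provider-agnostic sketch of an explicit fallback chain; `call_llm`, the provider names, and the model choices are illustrative stand-ins for your own client code:

```python
# Try each (provider, model) pair in order until one succeeds within the
# per-attempt timeout. The last entry is the smaller/faster degradation step.
FALLBACK_CHAIN = [
    ("openai", "gpt-4o"),                # preferred model, preferred provider
    ("azure", "gpt-4o"),                 # same model, secondary provider
    ("anthropic", "claude-3-5-haiku"),   # smaller/faster model, any provider
]

def complete_with_fallbacks(prompt: str, call_llm, per_attempt_timeout_s: float = 8.0):
    last_error = None
    for provider, model in FALLBACK_CHAIN:
        try:
            return call_llm(provider, model, prompt, timeout=per_attempt_timeout_s)
        except Exception as exc:         # timeout, 429, 5xx, etc.
            last_error = exc
            continue
    raise RuntimeError("all providers in the fallback chain failed") from last_error
```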
Request hedging: For latency-sensitive paths, send the same request to two providers simultaneously and return whichever responds first. Cancel the slower one. This is expensive (you pay for two requests) but appropriate for user-facing flows where latency variance is unacceptable. Reserve it for the critical path only.
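A minimal asyncio sketch of request hedging; `call_a` and `call_b` stand in for async calls to two different providers:

```python
import asyncio

# Issue the same request to two providers, return whichever finishes first,
# and cancel the loser. A real implementation would also handle a fast
# failure from the "winner" by falling back to the other task.
async def hedged_request(call_a, call_b, timeout_s: float = 8.0):
    tasks = {asyncio.create_task(call_a()), asyncio.create_task(call_b())}
    done, pending = await asyncio.wait(
        tasks, timeout=timeout_s, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()                    # stop paying for the slower request
    if not done:
        raise TimeoutError("neither provider responded within the deadline")
    return done.pop().result()           # re-raises if the winner itself failed
```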
The important implementation detail: make sure your routing layer is measuring actual production latency, not just checking status page indicators. A provider can report "operational" while running significantly slower than normal. Your p95 latency measurement at the request level will catch this before the status page does.
Capacity Reservation: When Hedging Isn't Enough
Multi-provider hedging reduces your exposure to any single provider's bad day. But if you have hard SLO commitments — internal or external — and predictable, high-volume workloads, hedging alone may not be sufficient. You'll still be competing with every other tenant on the shared pool.
The providers have answered this with reserved capacity tiers:
AWS Bedrock Provisioned Throughput: Dedicated capacity for a chosen model, billed hourly regardless of actual usage. You're no longer on the shared pool — your requests go to reserved compute. The break-even point: if you're using more than roughly 50–70% of the reserved capacity at the price point you'd otherwise pay on-demand, provisioned throughput saves money and buys predictability.
Azure OpenAI Provisioned Throughput Units (PTUs): Similar reserved model, billed monthly per PTU. The tradeoff is explicit: you commit to capacity in advance, and in exchange you get predictable latency and no exposure to shared-tier congestion.
OpenAI's usage tiers: the paid tiers (Tier 3 and above) come with higher rate limits and, implicitly, priority access. Teams with consistently high spend are less exposed to shared-tier capacity pressure.
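A back-of-envelope sketch of the break-even arithmetic behind the Bedrock paragraph above, using entirely hypothetical prices and throughput numbers; substitute your provider's real rates:

```python
# All three inputs are placeholders for illustration only.
reserved_cost_per_hour = 40.00               # hypothetical hourly price of one reserved unit
ondemand_cost_per_1k_tokens = 0.01           # hypothetical blended on-demand price
reserved_unit_tokens_per_hour = 6_000_000    # hypothetical throughput of one reserved unit

# What the reserved unit's full hourly capacity would cost at on-demand rates:
ondemand_equivalent = reserved_unit_tokens_per_hour / 1000 * ondemand_cost_per_1k_tokens

# Utilization at which the reserved unit pays for itself:
break_even_utilization = reserved_cost_per_hour / ondemand_equivalent
print(f"break-even at {break_even_utilization:.0%} utilization")   # ~67% with these numbers
```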
The caveat: reserved capacity does not help if the provider's infrastructure has a broader incident. It isolates you from tenant-to-tenant contention, not from infrastructure failures. That's why reservation and multi-provider hedging are complementary rather than substitutes.
Circuit Breakers and SLO-Aware Retries
Retries are the default response to LLM API failures, and they're frequently counterproductive during capacity pressure events. When a provider is degraded, each retry is another request competing for scarce capacity. A retry storm — thousands of clients retrying simultaneously — can keep an already-struggling provider down longer than the underlying issue would have lasted on its own.
The right pattern is layered:
Timeout-first, not retry-first: Set aggressive per-request timeouts that reflect your actual SLO budget, not the maximum the provider is capable of. If your product requires a 10-second response, timeout at 8 seconds, not 30. Don't let slow requests accumulate.
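For example, with the OpenAI Python SDK (v1+) the timeout can be set on the client or per call; the 8-second and 5-second values below are the example budgets from this section, not universal recommendations:

```python
from openai import OpenAI

# Client-wide default: every request times out at the SLO budget, not the
# provider's maximum.
client = OpenAI(timeout=8.0)

# Per-call override for a path with a tighter budget.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "ping"}],
    timeout=5.0,
)
```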
Exponential backoff with full jitter: When retrying, don't retry at a fixed interval. Double the wait time each attempt, add random jitter to spread the retry distribution, and cap the total retry budget (not just the count — a single request that retries 5 times over 30 seconds is too slow to matter). Jitter is not optional; without it, retries cluster and recreate the thundering herd.
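A minimal sketch of exponential backoff with full jitter plus a total retry budget; the attempt counts and delays are illustrative defaults, and `do_request` stands in for a single LLM call with its own timeout:

```python
import random
import time

def call_with_retries(do_request, max_attempts: int = 4,
                      base_delay_s: float = 0.5, max_delay_s: float = 8.0,
                      total_budget_s: float = 20.0):
    start = time.monotonic()
    for attempt in range(max_attempts):
        try:
            return do_request()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a uniform random time up to the exponential cap.
            cap = min(max_delay_s, base_delay_s * (2 ** attempt))
            delay = random.uniform(0, cap)
            # Respect the total time budget, not just the attempt count.
            if time.monotonic() - start + delay > total_budget_s:
                raise
            time.sleep(delay)
```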
Circuit breakers on the provider: Track error rate and p95 latency per provider. When either metric exceeds your threshold, open the circuit — stop sending requests to that provider for a cooldown period. After cooldown, allow a single probe request to test recovery. The three-state model (closed → open → half-open) prevents both cascade failures and permanent exclusion.
The configuration that works in practice: circuit opens after a 20–30% error rate over a 60-second window, or after p95 latency exceeds 2× your normal baseline for 5 minutes. Cooldown period: 30–60 seconds. These numbers need to be tuned to your specific workload, but they're reasonable starting defaults.
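A sketch of the three-state breaker described above, keyed per provider and triggered here on error rate only (the p95-latency trigger is omitted for brevity); the thresholds are the starting defaults just mentioned, not tuned values:

```python
import time

class ProviderCircuit:
    def __init__(self, error_rate_threshold=0.25, window_s=60, cooldown_s=45):
        self.error_rate_threshold = error_rate_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.events = []            # (timestamp, ok) samples in the rolling window
        self.opened_at = None       # None means the circuit is closed
        self.half_open = False

    def allow_request(self) -> bool:
        if self.opened_at is None:
            return True             # closed: traffic flows normally
        if self.half_open:
            return False            # a probe is already in flight
        if time.monotonic() - self.opened_at >= self.cooldown_s:
            self.half_open = True   # half-open: let one probe through
            return True
        return False                # open: shed traffic to other providers

    def record(self, ok: bool):
        now = time.monotonic()
        if self.half_open:          # probe result decides: close or re-open
            self.half_open = False
            self.opened_at = None if ok else now
            self.events.clear()
            return
        self.events.append((now, ok))
        self.events = [(t, r) for t, r in self.events if now - t <= self.window_s]
        errors = sum(1 for _, r in self.events if not r)
        if self.events and errors / len(self.events) >= self.error_rate_threshold:
            self.opened_at = now    # open the circuit
```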
Observability: Seeing Provider Capacity Pressure Before Users Do
The gap between "something is wrong" and "users are impacted" is where your observability stack matters. The signals that precede user-visible degradation:
TTFT by provider and region: A 20% increase in median TTFT on a specific provider-region combination is an early warning that capacity pressure is building. You'll see this 5–10 minutes before p99 latency becomes user-visible. Alert on TTFT trends, not just absolute thresholds.
Request queue age: If you're using a gateway layer (LiteLLM, Portkey, or your own), track how long requests spend in the queue before being dispatched. Queue age growing is a leading indicator of both provider slowness and your own throughput limits.
Error rate by provider: 429 (rate limited) and 5xx errors from specific providers are the clearest signals. Aggregate these at 1-minute granularity and alert at 5% error rate on any single provider.
Token throughput efficiency: Track actual tokens delivered per second versus expected. A model that normally generates 80 tokens/second but is averaging 20 is experiencing backend contention you won't see in your error rate.
OpenTelemetry's LLM observability conventions provide a standard schema for these metrics. Platforms like Langfuse and LangSmith capture them out of the box if you instrument your LLM calls through them. If you're rolling your own, emit TTFT, ITL, total tokens, and error code on every request as structured log events — you can build dashboards and alerts from that foundation.
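A minimal sketch of emitting those fields as structured log events; the field names are illustrative rather than any formal OpenTelemetry schema, so adapt them to your own conventions:

```python
import json
import logging
import time

logger = logging.getLogger("llm.requests")

# One structured event per LLM request, carrying the fields named above.
def log_llm_request(provider: str, model: str, ttft_s, itl_avg_s,
                    total_tokens: int, error_code=None):
    event = {
        "ts": time.time(),
        "provider": provider,
        "model": model,
        "ttft_ms": round(ttft_s * 1000, 1) if ttft_s is not None else None,
        "itl_avg_ms": round(itl_avg_s * 1000, 1) if itl_avg_s is not None else None,
        "total_tokens": total_tokens,
        "error_code": error_code,        # e.g. 429, 503, or None on success
    }
    logger.info(json.dumps(event))
```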
Putting It Together: The Production-Ready Architecture
The teams that don't get paged on LLM capacity events run roughly this architecture:
- A gateway layer in front of all LLM calls — not each individual service calling the provider directly. The gateway centralizes routing decisions, retry logic, and circuit breaker state.
- Active-active across 2+ providers for the same model capability, with latency-based routing. Not warm standby — live traffic split.
- Per-provider circuit breakers with p95 latency + error rate triggers.
- Retry budgets tied to the global deadline of the user-facing request, not the individual LLM call timeout.
- TTFT and p99 latency dashboards per provider, with alerts at 1.5× baseline that notify before the on-call phone rings.
- Reserved capacity for the highest-priority, highest-volume flows where SLO commitments are contractual.
None of this is exotic. The patterns are borrowed directly from how teams run databases and external APIs with uptime requirements. The difference is that LLM APIs are younger, the outage history is less well-documented, and the shared-capacity model means your blast radius includes every other organization on the same tier.
Your load tests were not lying on purpose. They just couldn't see what they couldn't control. Build the architecture that doesn't assume the test environment is the whole story.
Sources
- https://reintech.io/blog/llm-load-testing-benchmark-ai-application-production
- https://gatling.io/blog/load-testing-an-llm-api
- https://status.openai.com/history
- https://status.anthropic.com/history
- https://docs.litellm.ai/docs/routing
- https://openrouter.ai/docs/guides/routing/provider-selection
- https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://aws.amazon.com/bedrock/pricing/
- https://www.palantir.com/docs/foundry/aip/llm-capacity-management
- https://opentelemetry.io/blog/2024/llm-observability/
- https://zuplo.com/learning-center/token-based-rate-limiting-ai-agents
