The Provider Reliability Trap: Your LLM Vendor's SLA Is Now Your Users' SLA

· 9 min read
Tian Pan
Software Engineer

In a formal incident report, Zendesk stated that from June 10 through June 11, 2025, customers lost access to all Zendesk AI features for more than 33 consecutive hours. The engineering team's remediation section was empty — there was nothing to do. The outage was caused entirely by their upstream LLM provider going down, and Zendesk had no architectural path to restore service without it.

This is the provider reliability trap in its clearest form: you ship a feature, make it part of your users' workflows, promise availability through implicit or explicit SLA commitments, and then discover that your entire reliability posture is bounded by a dependency you don't control, can't fix, and may not have formally evaluated before launch.

The trap is structural. You can't engineer your way out of it at incident time. The decisions that matter happen months before any specific outage, during architecture and product design. Most teams make those decisions badly — not because they're careless, but because the math is invisible until it bites.

The SLA Arithmetic Nobody Does Before Launch

The fundamental property of serial dependencies is that their availabilities multiply. If your own infrastructure runs at 99.9% and your LLM provider delivers 99.5%, your theoretical maximum availability is 0.999 × 0.995 = 99.4%. You cannot offer 99.9% to your customers if a serial dependency in your critical path only guarantees 99.5%.

This math gets worse as you add dependencies:

  • 3 services each at 99.9%: 0.999³ ≈ 99.7%
  • 5 services each at 99.9%: 0.999⁵ ≈ 99.5%
  • Your infra (99.9%) + LLM at 99.5% + vector DB at 99.9%: ≈ 99.3%

What do those numbers feel like in practice? A 99.5% SLA allows 43.8 hours of downtime per year — roughly 3.6 hours per month. A 99.9% SLA allows 8.76 hours per year. If you're offering 99.9% to enterprise customers and your LLM provider is capped at 99.5%, you have a 5x gap before a single line of application code fails.
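The serial-dependency math above is easy to script into a launch checklist. A minimal sketch (the figures are the ones quoted in this article; `composite_availability` and `downtime_hours_per_year` are illustrative helper names, not from any particular tool):

```python
# Serial dependencies multiply: the chain is only as available as the
# product of its parts. Downtime budget is the complement of that.

HOURS_PER_YEAR = 24 * 365  # 8760

def composite_availability(*availabilities: float) -> float:
    """Theoretical maximum availability of a serial dependency chain."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result

def downtime_hours_per_year(availability: float) -> float:
    """How many hours of downtime per year a given SLA permits."""
    return (1.0 - availability) * HOURS_PER_YEAR

# Your infra (99.9%) in series with an LLM provider at 99.5%:
chain = composite_availability(0.999, 0.995)   # ≈ 0.9940 — below a 99.9% promise
budget = downtime_hours_per_year(0.995)        # ≈ 43.8 hours/year
```

Running the same helpers over the other figures reproduces the bullet list: three serial 99.9% services land at roughly 99.7%, five at roughly 99.5%.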

The problem is compounded by the current state of provider SLAs. Anthropic's standard tier has no published SLA — it's best-effort availability. OpenAI's direct API for standard customers also offers no contractual guarantee. Google's Vertex AI publishes a 99.5% target. Azure OpenAI offers 99.9%, which is why enterprises route through it specifically. Even the best publicly available LLM SLA maxes out at 99.9%, and most standard-tier users have nothing in writing.

Why Incidents Cluster at the Worst Possible Time

A common mental model treats provider outages as independent events uniformly distributed across time. This model is wrong, and acting on it underestimates your real risk.

LLM inference is GPU-constrained. Unlike CPU-based services that can spin up thousands of additional instances in minutes, GPU provisioning has lead times measured in months. When a new model launches or a viral use case drives sudden demand, providers can't elastically scale to absorb it. The result is that capacity pressure and infrastructure stress co-occur with the usage spikes you care most about.

The OpenAI December 2024 outage was triggered by a new telemetry service that overwhelmed the Kubernetes API servers — an infrastructure change that accompanied rapid platform evolution. The March 2025 image generation launch coincided with infrastructure strain and subsequent outages. Multiple documented incidents show the same pattern: new capability launches drive usage surges, infrastructure changes are deployed to support growth, and outages follow.

Academic analysis of LLM service telemetry from 2023–2024 found that ChatGPT was fully accessible on only 88.85% of days during that period. The key implication: if you're running a B2B product where customers use your AI features during business hours, and your provider's outages correlate with high-traffic periods, your user-visible reliability is lower than the headline availability numbers suggest.

The Product Decisions That Must Happen Before an Outage

The most consequential reliability work isn't incident response — it's the categorization exercise you should do before launch. Every AI feature in your product falls into one of three categories:

Features that can serve from cache. If the LLM output is deterministic or slowly varying given similar inputs — product descriptions, FAQ responses, content summaries, recommendation explanations — you can cache responses and serve them during provider outages. One documented production case improved effective uptime from 99.2% to 99.87% by adding circuit breakers with cached response fallback. This category often represents more of your feature surface than you expect.

Features that can degrade to a lighter model or rule-based fallback. Simple intent classification, basic routing, keyword matching — for many tasks, a cheaper or locally-hosted model provides enough functionality to maintain the feature at reduced capability. A customer support bot that can't access Claude might still route tickets to the right queue using a rule-based classifier. Users experience degraded quality, not outage.

Features that hard-fail. Novel document analysis, multi-step agentic workflows, real-time code generation — these have no reasonable pre-computable fallback. When your provider goes down, these features go down. The critical question isn't how to prevent this but whether you've been honest with users about the dependency and designed the failure message accordingly.

Skipping this categorization means every feature hard-fails by default. Doing it upfront means you can preserve a meaningful fraction of your product's value during incidents.
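One way to make the categorization executable is to encode it as a per-feature failure mode that your request handler consults during an outage. This is a hypothetical sketch — the feature names, `FailureMode` enum, and the `"routed_by_rules"` stand-in for a rule-based classifier are all illustrative, not from any specific codebase:

```python
from enum import Enum

class FailureMode(Enum):
    SERVE_FROM_CACHE = "cache"    # deterministic / slowly varying output
    DEGRADE_TO_RULES = "degrade"  # lighter model or rule-based fallback
    HARD_FAIL = "hard_fail"       # no reasonable pre-computable fallback

# The categorization exercise, written down where code can see it.
FEATURE_MODES = {
    "product_descriptions": FailureMode.SERVE_FROM_CACHE,
    "ticket_routing": FailureMode.DEGRADE_TO_RULES,
    "document_analysis": FailureMode.HARD_FAIL,
}

def handle_provider_outage(feature: str, cache: dict) -> str:
    """What the user sees when the LLM provider is down."""
    # Uncategorized features hard-fail by default — exactly the trap.
    mode = FEATURE_MODES.get(feature, FailureMode.HARD_FAIL)
    if mode is FailureMode.SERVE_FROM_CACHE and feature in cache:
        return cache[feature]
    if mode is FailureMode.DEGRADE_TO_RULES:
        return "routed_by_rules"  # stand-in for a rule-based classifier
    return "AI features are temporarily unavailable"
```

The default branch makes the article's point concrete: any feature you never classified behaves as hard-fail.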

The Multi-Provider Architecture

For features in the "must stay up" category, the engineering answer is multi-provider fallback. The pattern is straightforward in principle:

Sequential failover triggers on 429 (rate limit), 503 (service unavailable), and 5xx errors. Critically, it should not trigger on 400-level user errors — those indicate bad requests that will fail at any provider, and cascading them wastes credits and latency. The router detects failure from provider A and re-routes to provider B within a single request lifecycle.
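A minimal sketch of that status-code discipline, assuming each provider is a callable that either returns text or raises an error carrying an HTTP status (`ProviderError` and the provider callables are illustrative):

```python
class ProviderError(Exception):
    """Illustrative error type carrying the provider's HTTP status."""
    def __init__(self, status: int):
        super().__init__(f"provider returned {status}")
        self.status = status

def retriable(status: int) -> bool:
    # 429 (rate limit) and 5xx cascade to the next provider;
    # 4xx user errors would fail everywhere, so they fail fast.
    return status == 429 or 500 <= status <= 599

def call_with_failover(prompt, providers):
    """Try providers in order within a single request lifecycle."""
    last_error = None
    for provider in providers:
        try:
            return provider(prompt)
        except ProviderError as e:
            if not retriable(e.status):
                raise           # bad request: don't waste credits cascading
            last_error = e      # 429/5xx: re-route to the next provider
    raise last_error            # every provider failed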

Hedged requests fire a request at provider B if provider A hasn't responded within a threshold latency. Return whichever responds first. This handles degraded (slow) providers, not just failed ones.
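In Python, the hedging pattern can be sketched with `concurrent.futures` — fire the primary, and if it hasn't answered within the threshold, race it against the secondary (the providers here are stand-in callables, and the 300 ms default threshold is an arbitrary illustration):

```python
import concurrent.futures

def hedged_call(prompt, primary, secondary, hedge_after_s=0.3):
    """Return the primary's answer if it's fast; otherwise race both."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=2) as pool:
        first = pool.submit(primary, prompt)
        try:
            # Fast path: primary responds within the hedge threshold.
            return first.result(timeout=hedge_after_s)
        except concurrent.futures.TimeoutError:
            # Primary is slow (degraded, not necessarily failed):
            # hedge with the secondary and take whichever finishes first.
            second = pool.submit(secondary, prompt)
            done, _ = concurrent.futures.wait(
                {first, second},
                return_when=concurrent.futures.FIRST_COMPLETED,
            )
            return next(iter(done)).result()
```

Note the cost trade-off: a hedged request sometimes pays for two inference calls to buy one fast response.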

Circuit breakers track rolling error rates across a time window. When the rate crosses a threshold — say, 20% errors in 60 seconds — the circuit opens and all traffic routes to the secondary provider for a cooldown period. This prevents thundering-herd retries against a failing provider.
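A rolling-window circuit breaker can be sketched in a few dozen lines. This version takes the clock as an argument for testability; the 60-second window, 20% threshold, and cooldown mirror the numbers above, while `min_samples` is an added assumption to avoid tripping on a single early error:

```python
from collections import deque

class CircuitBreaker:
    def __init__(self, window_s=60.0, error_threshold=0.2,
                 min_samples=10, cooldown_s=30.0):
        self.window_s = window_s
        self.error_threshold = error_threshold
        self.min_samples = min_samples
        self.cooldown_s = cooldown_s
        self.events = deque()    # (timestamp, was_error) pairs
        self.open_until = None   # while open, route traffic to secondary

    def record(self, now: float, was_error: bool) -> None:
        self.events.append((now, was_error))
        # Evict events that fell out of the rolling window.
        while self.events and now - self.events[0][0] > self.window_s:
            self.events.popleft()
        errors = sum(1 for _, e in self.events if e)
        if (len(self.events) >= self.min_samples
                and errors / len(self.events) >= self.error_threshold):
            self.open_until = now + self.cooldown_s  # trip the breaker

    def is_open(self, now: float) -> bool:
        return self.open_until is not None and now < self.open_until
```

While `is_open` returns true, the router sends everything to the secondary instead of hammering the failing provider — which is what prevents the thundering herd.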

Load splitting runs traffic across providers continuously at something like 70/30 rather than treating provider B purely as a failover target. This keeps your secondary path warm, validates its behavior with real traffic, and reduces the blast radius when provider A fails, since 30% of users are already being served from B.
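The 70/30 split itself can be as simple as hashing a stable request key into a bucket, so a given user consistently lands on the same provider (a sketch under that stickiness assumption; the provider names are placeholders):

```python
import hashlib

def pick_provider(user_id: str, primary_weight: float = 0.7) -> str:
    """Deterministically map a user to a provider at ~70/30 weights."""
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "provider_a" if bucket < primary_weight else "provider_b"
```

Because the split is keyed rather than random per request, a user's conversation doesn't bounce between providers mid-session, and when provider A fails you already know the ~30% cohort being served from B.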

LiteLLM's Router class implements these patterns in open source. Portkey and similar AI gateways offer managed versions with governance and observability layers. Building it yourself is straightforward but requires more operational discipline around failure testing.

One production configuration reported by an open source developer building AI-powered tools: Anthropic Claude as primary → OpenAI as secondary → OpenRouter as a catchall. The pattern is becoming the enterprise default: 2025 enterprise spending data shows multi-provider architectures spanning Anthropic (40% of enterprise AI spend) and OpenAI (27%) as the dominant arrangement — a revealed preference for not putting all workloads on one vendor.

Prompt Compatibility Is the Hidden Tax

Provider fallback sounds clean until you encounter the practical problem: prompt syntax, tool calling schemas, tokenizer behavior, and model response patterns differ meaningfully across providers. A prompt tuned for Claude 3.7 Sonnet may produce different outputs on GPT-4o — different in ways that may or may not matter for your use case, but that you won't know until you test.

The operational discipline required: treat your fallback provider as a first-class production path, not a cold spare. Run the same eval suite against it. Monitor its outputs. When you tune a prompt against your primary provider, check whether the changes hold on your fallback. This isn't free work, and it's the reason many teams add a fallback provider to their config but never verify it actually works.

The minimum viable test: run your 50 most common request types against provider B under synthetic load once per week. Surface behavioral divergence in your eval pipeline. It won't catch everything, but it will catch the worst regressions before users encounter them mid-incident.
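That weekly check reduces to a small harness: run each case through both providers and flag the ones where your per-case check passes on the primary but fails on the fallback. A minimal sketch, with illustrative stand-in providers and checks:

```python
def divergences(cases, primary, fallback):
    """cases: list of (prompt, check) where check(output) -> bool.

    Returns the prompts that regress only on the fallback path —
    the behavioral divergence worth surfacing in an eval pipeline.
    """
    flagged = []
    for prompt, check in cases:
        ok_primary = check(primary(prompt))
        ok_fallback = check(fallback(prompt))
        if ok_primary and not ok_fallback:
            flagged.append(prompt)
    return flagged
```

In practice `primary` and `fallback` would be real API calls against your 50 most common request types, and the checks would come from your existing eval suite; the scheduling (weekly, under synthetic load) lives outside this function.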

The Dependency Risk Calculation for Your Launch Checklist

The question isn't whether your team has built graceful degradation for every LLM feature. That's often the wrong level of investment for a product in its first year. The question is whether you've been explicit about the calculation.

For each AI feature before launch:

  1. Classify it. Cache-eligible, lighter-model degradable, or hard-fail. This shapes the honest SLA you can offer.
  2. Do the serial dependency math. Multiply your own infra availability by your provider's published SLA. If you're on a standard tier with no published SLA, treat it as 99.5% for conservative planning.
  3. Compare to your commitment. If you're offering customers 99.9% and the math gives you 99.4%, you need either fallback architecture or a conversation with your enterprise customers about what "uptime" includes.
  4. Build the failure mode. For hard-fail features, write the failure message now. "AI features are temporarily unavailable" — with a fallback to static help content or human routing — is better than a broken UI with no explanation.

The teams that get hurt by provider outages are usually not the ones who couldn't build fallback — they're the ones who never did the math that would have told them fallback was necessary. The provider reliability trap is a trap precisely because the gap between what you promise users and what your dependencies can deliver is invisible until an incident makes it concrete.

The SLA You're Actually Offering

When your LLM provider goes down, it doesn't matter what's in your service agreement with your customers. What matters is whether they can do their jobs. The Zendesk engineers who watched 33 hours of AI feature downtime roll by had no contractual obligation that could restore service — the math had been wrong before any incident occurred.

The correct mental model: your product's reliability is the minimum of your own infra and every serial dependency in your critical path. Every LLM provider you add without fallback architecture is a dependency that sets your ceiling. The ceiling is lower than most teams think, and it drops further during the product launches and high-traffic moments when reliability matters most.

Design for the gap before the gap finds you.
