The LLM Provider Incident Runbook: Staying Up When Your AI Stack Goes Down
In December 2024, OpenAI's entire platform went dark for over four hours. A new telemetry service had been deployed with a configuration that caused every node in a massive fleet to simultaneously hammer the Kubernetes API. DNS broke. The control plane buckled. Every service went with it. Recovery took so long partly because the team lacked what they later called "break-glass tooling" — pre-built emergency mechanisms they could reach for when normal procedures stopped working.
If you were running an AI-powered product that day, you were making decisions fast under pressure. Multi-provider routing? Graceful degradation? Cached responses? Or just a status page and a prayer?
This is the runbook you should have written before that call came in.
LLM providers target 99–99.5% uptime, which sounds reasonable until you do the math: that's 6 to 14 times worse than the SLAs you probably assume for your own cloud infrastructure. And degradations — silent output quality issues, elevated latency, partial availability — happen far more often than full outages. A 2025 empirical analysis of 792 incidents across major providers found the failure modes varied significantly: OpenAI's architecture tended toward compartmentalized failures, while Anthropic's service interdependencies led to more cascading cross-service impacts.
The engineering implication: you cannot treat LLM provider reliability as someone else's problem. You have to build for it.
The Three Layers of Failure (And Why They Need Different Responses)
Before writing any runbook, you need to categorize what failure actually looks like. There are three distinct classes, and they require entirely different responses.
Hard failures are the obvious ones: 503s, timeouts, the provider's status page going red. Your monitoring catches these within seconds. The response is routing: send traffic somewhere else.
Soft failures are more dangerous. The provider returns 200 OK, but the responses are wrong — lower quality, truncated, semantically degraded. Anthropic's August 2025 post-mortem documented exactly this: three overlapping infrastructure bugs caused 16% of Claude Sonnet requests to be silently misrouted or degraded over weeks. Standard HTTP uptime monitoring saw nothing abnormal. Standard APM tools saw nothing abnormal. Only output quality monitoring — sampling model responses and evaluating them — would have surfaced the issue early.
Latency failures sit in between. The provider is technically up, but p99 latency has tripled. Your streaming responses feel broken. Users give up before completion. This one is particularly nasty because it's hard to attribute: is it the provider, your network, your prompt length, or something downstream?
Your runbook needs distinct detection paths and response protocols for all three. Teams that only plan for hard failures leave themselves blind to the conditions that often precede them.
Detection: What You Should Be Measuring Before the Incident
Most teams instrument LLM calls the way they instrument any API: request count, error rate, p50 latency. That's necessary but not sufficient.
The metrics that actually matter during an incident:
- p99 latency by provider and model, not just average. Averages mask tail degradation that users experience acutely.
- Error rate by error type: 429 (rate-limited), 503 (unavailable), and provider-specific failure codes each mean different things and call for different responses. A surge in 429s means you've hit a quota; a surge in 503s means the provider is struggling.
- TPM consumption versus limits, not just RPM. Teams running agents or RAG pipelines routinely ignore token-per-minute limits until an incident reveals that their long prompts are hitting the TPM ceiling long before the request ceiling.
- Fallback activation rate: how often is your routing layer switching to backup providers? A gradual increase here is often your first signal of a developing incident at the primary.
- Output quality signal: sample a percentage of production responses and run lightweight evals. You don't need perfect quality scoring — you need an anomaly detector that fires when output length, structure, or semantic similarity to expected patterns shifts meaningfully.
Alerting should trigger on combinations, not single metrics. p99 latency spiking alone might be a fluke. p99 spiking while error rate climbs while fallback activation increases is an incident.
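The combination rule can be sketched as a simple predicate. Everything here — the signal names, the thresholds, the "two of three" rule — is illustrative, not a recommendation; tune against your own baselines.

```python
from dataclasses import dataclass

@dataclass
class ProviderSignals:
    """Rolling-window health signals for one provider (names are illustrative)."""
    p99_latency_ms: float
    error_rate: float      # fraction of requests failing in the window
    fallback_rate: float   # fraction of traffic served by backup providers

def is_incident(s: ProviderSignals,
                p99_threshold_ms: float = 8000,
                error_threshold: float = 0.05,
                fallback_threshold: float = 0.02) -> bool:
    """Fire only when at least two independent signals degrade together.

    A lone p99 spike may be a fluke; correlated degradation is an incident.
    Thresholds are placeholders, not recommendations.
    """
    breaches = [
        s.p99_latency_ms > p99_threshold_ms,
        s.error_rate > error_threshold,
        s.fallback_rate > fallback_threshold,
    ]
    return sum(breaches) >= 2
```

The "two of three" rule is the simplest way to encode correlation; a production system would more likely weight signals or require sustained breaches across consecutive windows.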
The Routing Layer: Building Failover That Works Under Pressure
The routing layer is the core of your resilience strategy. Here's how to think about it structurally.
Retry with jitter before failing over. Transient errors — network blips, brief load spikes — resolve within seconds. A simple exponential backoff with randomized jitter absorbs most of these without touching your fallback providers. The jitter matters: without it, all clients retry simultaneously, creating the retry storms that amplify roughly 40% of cascading failures.
Circuit breakers for sustained failures. When a provider's error rate crosses a threshold (say, 40% of requests failing within a 60-second window), stop sending traffic and trip the circuit breaker. Route to your backup. Periodically probe the primary with a small percentage of traffic to detect recovery. This reduces failover latency from the 10+ seconds of waiting for timeouts to milliseconds of proactive detection.
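A sliding-window circuit breaker in sketch form. The 40%/60-second numbers mirror the illustrative figures above; the cooldown-then-probe behavior is simplified (a real implementation would use a half-open state that limits probe traffic):

```python
import time

class CircuitBreaker:
    """Trip when the error rate over a sliding window crosses a threshold;
    after a cooldown, allow traffic again so a probe can test recovery."""

    def __init__(self, error_threshold: float = 0.4, window_s: float = 60,
                 cooldown_s: float = 30, min_samples: int = 20,
                 clock=time.monotonic):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_samples = min_samples   # don't trip on tiny samples
        self.clock = clock               # injectable for testing
        self._events = []                # (timestamp, ok) pairs
        self._tripped_at = None

    def record(self, ok: bool) -> None:
        """Record one request outcome and re-evaluate the trip condition."""
        now = self.clock()
        self._events.append((now, ok))
        self._events = [(t, o) for t, o in self._events
                        if now - t <= self.window_s]
        failures = sum(1 for _, o in self._events if not o)
        if (len(self._events) >= self.min_samples and
                failures / len(self._events) >= self.error_threshold):
            self._tripped_at = now

    def allow_primary(self) -> bool:
        """False while tripped; True again once the cooldown elapses,
        letting probe requests test the primary."""
        if self._tripped_at is None:
            return True
        return self.clock() - self._tripped_at >= self.cooldown_s
```

The injectable clock is what makes the breaker testable without real waiting — the same property you want when rehearsing failover in a drill.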
Model equivalence mapping. For each model you use in production, define the fallback mapping before any incident. GPT-4o maps to Claude Sonnet or Gemini 1.5 Pro. For each mapping, document the known behavioral differences — response verbosity, JSON adherence, instruction following — so you know what to expect when the switch happens automatically at 2 AM.
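The mapping itself can be a small piece of version-controlled configuration. Model names and caveats below are examples, not endorsements of specific pairings — the point is that the table exists before the incident:

```python
# Illustrative fallback map. Define yours in advance and keep it in
# version control next to the routing code.
FALLBACK_MAP = {
    "gpt-4o": {
        "fallbacks": ["claude-sonnet", "gemini-1.5-pro"],
        "caveats": "More verbose responses; re-verify JSON adherence.",
    },
    "gpt-4o-mini": {
        "fallbacks": ["claude-haiku"],
        "caveats": "Weaker instruction following on multi-step prompts.",
    },
}

def fallback_chain(model: str) -> list[str]:
    """Return the pre-agreed, ordered fallback chain for a model;
    empty means no equivalent exists and the feature should degrade instead."""
    entry = FALLBACK_MAP.get(model)
    return list(entry["fallbacks"]) if entry else []
```

An empty chain is itself useful information: it forces the "no equivalent model" case into the degradation decision tree rather than leaving it implicit.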
Avoid mid-stream failover. If you're streaming responses, attempting to switch providers mid-generation produces a jarring and often broken user experience. Trigger failover before the first token is returned, not after. This requires your circuit breaker to have made the routing decision upfront.
One team achieved 99.97% effective uptime through a full-stack approach: primary provider with circuit breaker → secondary provider with same model class → simplified fallback model → cached response. The trade-off was a 20–30% increase in evaluation effort to maintain quality assurance across all nodes in the fallback chain.
Graceful Degradation: What Your Product Does When AI Isn't Available
Failover to another provider is the best case. But what if no provider is available, or the task requires a specific capability that your fallback can't match? You need a degradation strategy.
Semantic caching is often your first line. A cache keyed on semantic similarity (not exact string match) can serve prior responses for queries that approximate what users have already asked. Cache hit rates of 30–70% are achievable depending on use-case repetitiveness. For high-volume support or Q&A applications, this means the majority of user queries survive an LLM outage with no visible degradation.
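A minimal sketch of the lookup side, assuming you have an embedding function available. The bag-of-words "embedder" here is a deliberately toy stand-in so the example is self-contained; in production you'd use a real sentence embedder, and the 0.85 threshold is a placeholder to tune:

```python
import math
from collections import Counter

def _toy_embed(text: str) -> Counter:
    """Toy stand-in for a real embedding model: bag-of-words counts."""
    return Counter(text.lower().split())

def _cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Serve a prior response when a new query is close enough in
    embedding space to one already answered."""

    def __init__(self, threshold: float = 0.85, embed=_toy_embed):
        self.threshold = threshold
        self.embed = embed
        self._entries = []   # (embedding, response) pairs

    def put(self, query: str, response: str) -> None:
        self._entries.append((self.embed(query), response))

    def get(self, query: str):
        """Best-matching cached response above the threshold, else None."""
        q = self.embed(query)
        best, best_sim = None, 0.0
        for emb, resp in self._entries:
            sim = _cosine(q, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None
```

The linear scan is fine for a sketch; at production scale you'd back this with a vector index, and you'd add TTLs so stale answers age out.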
Rule-based fallbacks trade accuracy for reliability. For structured tasks — classification, routing, simple extraction — you often have a rules-based implementation that was superseded by the model but never deleted. These run in milliseconds and are 100% reliable. Keeping them alive as fallback paths is cheap insurance.
Feature flags, not hardcoded routing logic, should control when degradation activates. This lets you switch behaviors without a deployment, test degradation paths in production before you need them, and roll back quickly if a fallback behavior has unexpected consequences.
Disabled is a valid degraded state. For AI features that are genuinely non-critical — summarization, suggestions, enhancements — showing nothing is better than showing wrong. A missing "AI summary" is a missing feature. A confidently wrong AI summary during an incident is a trust problem.
The decision tree for each AI feature should be explicit: if the primary fails, do we try a fallback model, serve a cached response, apply rules, or disable? That decision should be made in advance and encoded in your configuration, not improvised during an incident.
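Encoded as configuration, that decision tree can be as simple as an ordered list per feature. Feature names and orderings here are illustrative:

```python
# Illustrative per-feature degradation policy. Each list is tried in
# order until a step succeeds; the safe default for unknown features
# is to disable rather than guess.
DEGRADATION_POLICY = {
    "support_answers": ["fallback_model", "semantic_cache", "disable"],
    "ticket_routing":  ["fallback_model", "rules_engine"],
    "ai_summaries":    ["disable"],   # non-critical: hide rather than risk wrong
}

def degradation_steps(feature: str) -> list[str]:
    """Pre-made decision, looked up at incident time — never improvised."""
    return DEGRADATION_POLICY.get(feature, ["disable"])
```

Keeping this as data rather than branching logic is what lets a feature-flag system override it at runtime without a deployment.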
The Runbook Itself: What to Do in the First Thirty Minutes
When an alert fires, the first thirty minutes determine whether an incident is contained or escalates. Here's the minimal structure you need pre-written.
Triage (first 5 minutes):
- Check the provider's status page. Note the time and status for the incident log.
- Pull your real-time dashboard: error rate, p99 latency, fallback activation rate.
- Classify the failure type: hard failure, soft degradation, or latency failure.
- Confirm your circuit breakers have tripped correctly. If not, trip them manually.
Containment (5–15 minutes):
- Verify traffic is routing to fallback providers. Check spend rate — unconstrained failover can spike costs unexpectedly if fallback providers have different pricing.
- If degradation is affecting output quality rather than availability, decide whether to disable the feature or continue with degraded output. This decision should be pre-made for each feature.
- Update your internal status channel. The first post should go out within 10–15 minutes even if it only says "investigating."
Communication (15–30 minutes):
- If user-facing impact is confirmed, post to your external status page. Specificity beats vagueness: "AI features are experiencing degraded response quality due to a provider infrastructure issue" is better than "some features may be affected."
- Do not promise resolution times you cannot commit to. Update every 30–60 minutes until resolved.
Resolution:
- Before declaring the incident resolved, probe the primary provider with a small percentage of traffic for at least 10 minutes. Wait for stability before fully reopening the circuit.
- Post a brief summary within 24 hours. A full post-mortem within 72 hours. This isn't just transparency — it's how your team learns.
The single most common failure mode in AI incident response is making decisions in the first thirty minutes that should have been made before the incident ever happened. Which models are acceptable fallbacks? Who has authority to disable a feature? What's the cost ceiling before escalation? Those questions need answers written in the runbook, not improvised under pressure.
Silent Degradation: The Failure Mode Your Monitoring Probably Misses
The Anthropic August 2025 incident is worth studying specifically because it reveals a gap in most teams' observability.
Three infrastructure bugs compounded over weeks. At peak, 16% of requests were being misrouted or served from incorrectly configured servers. None of this produced HTTP errors. Standard monitoring showed healthy systems. Roughly 30% of Claude Code users experienced at least one degraded interaction before the issue was identified.
This is the failure mode that standard uptime monitoring cannot detect: semantic output degradation. The service is up. The API returns 200. The response is just wrong, or subtly off, in ways that require understanding what a good response looks like.
Defending against this requires output quality monitoring: sampling production responses and evaluating them. The evaluation doesn't need to be sophisticated — length anomalies, structural violations (expected JSON that isn't JSON), semantic similarity to a small set of known-good responses — but it needs to exist. Build it before you need it. Once you're in an incident, you need to know whether the degradation started an hour ago or three weeks ago.
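A sketch of the cheap structural tier of those checks, applied to sampled responses. All thresholds are placeholders to tune per feature, and a nonempty result means "route this sample to closer inspection," not "this response is definitely bad":

```python
import json

def quality_flags(response: str,
                  expect_json: bool = False,
                  min_len: int = 40,
                  max_len: int = 4000) -> list[str]:
    """Cheap structural anomaly checks on a sampled production response.
    Semantic-similarity checks against known-good responses would be a
    second, more expensive tier on top of this."""
    flags = []
    if len(response) < min_len:
        flags.append("too_short")
    if len(response) > max_len:
        flags.append("too_long")
    if expect_json:
        try:
            json.loads(response)
        except ValueError:   # json.JSONDecodeError subclasses ValueError
            flags.append("invalid_json")
    return flags
```

The operational signal is the *rate* of flagged samples over time, persisted so that during an incident you can answer exactly the question above: did this start an hour ago or three weeks ago?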
Cost as an Incident Metric
One failure mode that gets under-discussed: during an incident, your cost structure can collapse before you notice.
A financial services team documented their agents entering recursive loops when their primary provider started returning malformed responses. Their weekly LLM spend had ballooned to $47,000 before an alert fired. The circuit breaker logic had been implemented for availability, not for cost. Per-provider cost thresholds had never been set.
Add cost to your incident monitoring:
- Alert when per-provider spend rate increases by more than 3x within any 15-minute window.
- Set hard caps with automatic circuit-breaker behavior — not just soft alerts — so runaway agent loops self-terminate.
- Log provider switches alongside cost so you can reconstruct exactly when and why spend spiked during a post-incident review.
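The first rule in the list reduces to a one-liner over bucketed spend data. The function name and the assumption that spend arrives as fixed-width (e.g. 15-minute) buckets are illustrative:

```python
def spend_rate_alert(window_spend: list[float], multiplier: float = 3.0) -> bool:
    """Fire when the most recent spend bucket exceeds `multiplier` times
    the average of the preceding buckets. Mirrors the 3x-in-15-minutes
    rule above; the numbers are illustrative, not recommendations."""
    if len(window_spend) < 2:
        return False
    *history, current = window_spend
    baseline = sum(history) / len(history)
    return baseline > 0 and current > multiplier * baseline
```

The `baseline > 0` guard matters for newly onboarded providers: a jump from zero spend to any spend is a routing event to log, not necessarily an anomaly to page on.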
Incidents that start as availability events can become cost events through failover alone. A request that costs far less on your primary might cost $0.08 on your backup. At scale, that's not background noise.
Conclusion: Write the Runbook Before You Need It
LLM providers are more reliable than they were two years ago, and less reliable than your database. That gap will not close completely — the infrastructure is too complex, the demand too unpredictable, the hardware ecosystem too heterogeneous.
The teams that navigated the OpenAI December 2024 outage and the Anthropic August 2025 degradation without significant user impact had one thing in common: they had built for this. Multi-provider routing was already running. Fallback models were already mapped. Cost alerts were already configured. Feature degradation decisions were already documented.
The runbook isn't a document you write during an incident. It's the document that makes the incident manageable when it arrives. Write it now, while you have time to think clearly. Then test it — a planned failover drill on a Thursday afternoon reveals gaps that a 2 AM incident will not let you discover calmly.
Your LLM provider will go down. The only question is whether you're ready.
Sources
- https://www.requesty.ai/blog/handling-llm-platform-outages-what-to-do-when-openai-anthropic-deepseek-or-others-go-down
- https://www.statsig.com/perspectives/providerfallbacksllmavailability
- https://www.assembled.com/blog/your-llm-provider-will-go-down-but-you-dont-have-to
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://portkey.ai/blog/failover-routing-strategies-for-llms-in-production/
- https://www.helicone.ai/blog/top-llm-gateways-comparison-2025
- https://www.braintrust.dev/articles/what-is-llm-monitoring
- https://www.elastic.co/observability/llm-monitoring
- https://dev.to/waxell/when-your-ai-agent-has-an-incident-your-runbook-isnt-ready-1ag6
- https://simonwillison.net/2024/Dec/13/openai-postmortem/
- https://www.anthropic.com/engineering/a-postmortem-of-three-recent-issues
- https://arxiv.org/html/2503.12185
- https://statusray.com/blog/best-practices-for-incident-communication-building-trust-during-outages
