
Treating Your LLM Provider as an Unreliable Upstream: The Distributed Systems Playbook for AI

· 10 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Response times look fine. Error rates are near zero. And yet your users are filing tickets about garbage answers, your agent is making confidently wrong decisions, and your support queue is filling up with complaints that don't correlate with any infrastructure alert you have.

Welcome to the unique hell of depending on an LLM API in production. It's an upstream service that can fail you while returning a perfectly healthy 200 OK.

Most backend engineers have internalized a playbook for unreliable third-party dependencies: circuit breakers, retries with backoff, bulkheads, timeouts, fallback responses. These patterns are battle-tested across decades of distributed systems practice. But LLM APIs break every assumption those patterns were built on. Naively applying them gives you a false sense of safety while the real failures slip through undetected.

The Failure Modes You're Not Monitoring For

Traditional upstream services fail in ways your monitoring stack knows how to catch. A database returns a connection error. A payment API sends back a 500. A microservice times out. Your alerts fire, your circuit breaker trips, your fallback kicks in. The system degrades gracefully.

LLM APIs introduce a failure mode that infrastructure monitoring was never designed to detect: semantic degradation. The API returns a 200, latency looks normal, the response body is well-formed JSON — but the content is wrong. The model hallucinated a policy that doesn't exist. It answered a question about your product using information from a competitor. It followed instructions from a prompt injection embedded in a user's input. None of these trigger a single alert in your Datadog dashboard.

This is the fundamental difference between LLM upstreams and every other API you depend on. With a payment processor, a 200 means the charge went through. With an LLM, a 200 means the model produced tokens. Whether those tokens are useful, accurate, or safe is a separate question your infrastructure layer has no opinion about.

The failures that actually hurt in production tend to follow a pattern:

  • Hallucination drift: The model starts grounding answers in less relevant context, but responses remain fluent and confident while correctness decays.
  • Behavioral regression: A provider updates their model, and your carefully tuned prompts start producing subtly different output. Tone shifts, format changes, edge cases that used to work now don't.
  • Silent tool-call corruption: In agent workflows, small changes in model behavior trigger excessive or incorrect tool calls that your infrastructure metrics don't flag.

Why Your Timeout Strategy Is Probably Wrong

Setting timeouts for LLM APIs is harder than it sounds because the latency distribution is unlike anything else in your stack. A typical microservice might have a P50 of 50ms and a P99 of 200ms — a 4x spread. LLM APIs routinely show P50 of 500ms and P99 of 4-5 seconds — a 10x spread. On a bad day, that P99 can spike to 15-20 seconds without the provider reporting any incident.

This means your timeout has to accommodate an enormous range. Set it too tight and you'll kill requests that were going to succeed. Set it too loose and you'll have users staring at spinners for 20 seconds before getting an error. Neither option is acceptable.

The conventional wisdom of "set your timeout to 2x P99" breaks down here because the P99 itself is unstable. During peak hours or after a provider model update, your baseline latency can shift by 3-5x with no warning. A timeout calibrated to yesterday's latency distribution may be wrong today.

What works better in practice:

  • Adaptive timeouts based on a rolling window of recent latency percentiles, not a static configuration value. If your P95 has been climbing for the last 10 minutes, your timeout should adjust automatically.
  • Per-operation timeout budgets that account for the expected output length. A request generating 50 tokens should have a very different timeout than one generating 2,000 tokens.
  • Deadline propagation from your user-facing SLA backward through the call chain. If your API promises a response in 10 seconds and you've already spent 3 seconds on retrieval, your LLM call gets a 6-second budget, not whatever default your HTTP client is configured with.
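The three ideas above compose into a small helper. Here's a minimal sketch — class and parameter names (`AdaptiveTimeout`, `multiplier`, the floor/ceiling bounds) are illustrative, not from any particular library — that derives a timeout from a rolling window of observed latencies and clamps it to whatever remains of the caller's deadline budget:

```python
from collections import deque


class AdaptiveTimeout:
    """Timeout derived from a rolling window of recent latencies,
    capped by the caller's remaining deadline budget."""

    def __init__(self, window_size=200, multiplier=2.0, floor_s=1.0, ceiling_s=30.0):
        self.samples = deque(maxlen=window_size)  # rolling latency window
        self.multiplier = multiplier
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s

    def record(self, latency_s):
        """Feed back the observed latency of each completed request."""
        self.samples.append(latency_s)

    def current_p95(self):
        if not self.samples:
            return self.ceiling_s  # no data yet: assume the worst
        ordered = sorted(self.samples)
        idx = min(len(ordered) - 1, int(0.95 * len(ordered)))
        return ordered[idx]

    def next_timeout(self, deadline_remaining_s):
        """2x the rolling P95, clamped to sane bounds and the remaining deadline."""
        raw = self.multiplier * self.current_p95()
        clamped = max(self.floor_s, min(raw, self.ceiling_s))
        return min(clamped, deadline_remaining_s)
```

The last argument is where deadline propagation happens: the caller passes in how much of the user-facing SLA is left after retrieval and other steps, and the LLM call never gets more than that.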

Circuit Breakers Need a Semantic Dimension

The classic circuit breaker pattern monitors error rates: when failures exceed a threshold, the breaker trips open and stops sending traffic to the failing service. For LLM providers, this catches maybe 20% of the problems — the ones that actually produce HTTP errors.

The circuit breaker that production AI teams end up building monitors three signals simultaneously:

Error rate (the traditional signal): Track 429s, 500s, 502s, and 503s. Community consensus from production deployments lands around 5 failures to trip open, with a 60-second cooldown period. Alert at >5% error rate, escalate to critical at >15%.

Latency degradation: When P95 latency exceeds 3x its rolling baseline, treat it as a partial failure even if every request eventually succeeds. A provider that's technically up but taking 12 seconds per request is functionally down for any user-facing application.

Quality degradation: This is the hard one. You need inline quality checks — lightweight evaluations that run on a sample of responses to detect hallucination rate increases, format compliance drops, or instruction-following regressions. When your quality score drops below a threshold, the circuit breaker should route traffic to a fallback provider even though the primary is technically healthy.

Building that third signal is what separates teams that run LLMs in production from teams that demo LLMs in production.
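One way to wire the three signals together is a breaker that trips if any of them crosses its threshold. This is a sketch under the thresholds quoted above (5% error rate, 3x latency baseline, a 60-second cooldown); the class name, the `quality_floor` default, and the evaluate/allow split are my own assumptions, not a standard API:

```python
import time


class ThreeSignalBreaker:
    """Circuit breaker that trips on any of three signals:
    HTTP error rate, latency vs. rolling baseline, or sampled quality score."""

    def __init__(self, error_threshold=0.05, latency_multiplier=3.0,
                 quality_floor=0.8, cooldown_s=60.0):
        self.error_threshold = error_threshold
        self.latency_multiplier = latency_multiplier
        self.quality_floor = quality_floor
        self.cooldown_s = cooldown_s
        self.opened_at = None  # None means the breaker is closed (healthy)

    def evaluate(self, error_rate, p95_latency_s, baseline_p95_s,
                 quality_score, now=None):
        """Feed in the latest window of metrics; returns whether to allow traffic."""
        now = now if now is not None else time.monotonic()
        tripped = (
            error_rate > self.error_threshold
            or p95_latency_s > self.latency_multiplier * baseline_p95_s
            or quality_score < self.quality_floor
        )
        if tripped:
            self.opened_at = now
        return self.allow(now)

    def allow(self, now=None):
        now = now if now is not None else time.monotonic()
        if self.opened_at is None:
            return True
        # Half-open after the cooldown: let traffic probe the provider again.
        return (now - self.opened_at) >= self.cooldown_s
```

When `allow` returns `False`, the router sends the request down the fallback chain instead of to the primary provider.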

The Fallback Chain Is More Complex Than You Think

A typical fallback strategy for a REST API is straightforward: if Service A fails, try Service B. LLM fallbacks are trickier because the services aren't interchangeable.

Different models from different providers have different capabilities, different context windows, different instruction-following characteristics, and different failure modes. Your prompt that works perfectly with Claude may produce garbage with GPT-4o, and vice versa. A fallback chain of OpenAI → Anthropic → Google sounds good on a whiteboard but requires maintaining three sets of prompts, three sets of quality baselines, and three integration test suites.

What actually works in production:

  • Same-provider fallback first: Before crossing provider boundaries, try a different model tier from the same provider. Falling back from GPT-4o to GPT-4o-mini, or from Claude Opus to Claude Sonnet, keeps your prompt compatibility high while giving you a cheaper, often faster alternative during degradation.
  • Cached response fallback: For queries you've seen before (or queries similar enough), serve from a response cache. This works surprisingly well for information retrieval and FAQ-style interactions where freshness isn't critical.
  • Graceful degradation responses: For some features, the right fallback isn't another model — it's a static response that tells the user "this feature is temporarily operating in limited mode" rather than serving a lower-quality answer that looks authoritative.

One critical mistake: don't let your fallback share the same failure domain as your primary. If both your primary and fallback are routed through the same LLM gateway, a gateway outage takes out both paths. If both providers use the same cloud region, a regional incident hits both.
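Structurally, the chain above is just an ordered list of tiers with a terminal degraded response. A minimal sketch (names like `FallbackChain` and the `(name, callable)` tier shape are illustrative assumptions):

```python
class FallbackChain:
    """Tries completion tiers in order: same-provider cheaper tier first,
    then cross-provider, ending in a cached or static degraded response."""

    def __init__(self, tiers, degraded_response):
        self.tiers = tiers  # list of (tier_name, callable) pairs
        self.degraded_response = degraded_response

    def complete(self, prompt):
        for name, call in self.tiers:
            try:
                return name, call(prompt)
            except Exception:
                # In production: log the failure, record metrics,
                # and consult the tier's circuit breaker before retrying it.
                continue
        # Every tier failed: serve an honest "limited mode" response
        # rather than a lower-quality answer that looks authoritative.
        return "degraded", self.degraded_response
```

Returning the tier name alongside the response matters: downstream code (and your dashboard) should know which model actually answered, since quality baselines differ per tier.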

Building the Provider Health Dashboard You'll Wish You Had Earlier

Every team running LLMs in production eventually builds some version of a provider health dashboard. The ones that work track five things:

Request-level metrics: Latency percentiles (P50, P75, P95, P99), error rates by status code, and throughput. These are table stakes — you'd track them for any upstream.

Token economics: Input and output token counts per request, cost per request, and cache hit rates if you're using prompt caching. This is where you catch the subtle cost blowups from models that start generating verbose responses after an update.

Quality scores: Automated evaluation scores on a sampled percentage of responses. Even a simple format-compliance check ("did the response follow the JSON schema I asked for?") catches a surprising number of regressions.

Rate limit headroom: How close you are to your provisioned throughput and token-per-minute limits, tracked as a percentage. You want to alert at 70% utilization, not when you're already getting 429s.

Provider incident correlation: A timeline that overlays your internal quality metrics with provider status page updates. This lets you retrospectively identify whether a quality dip was caused by a provider change or by something on your side.

The most important insight from building this dashboard is that LLM provider reliability is not binary. There's no single "up or down" state. The provider can be up for error rates, degraded for latency, and failing for quality — simultaneously. Your monitoring needs to reflect that.
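The rate-limit headroom dimension in particular is cheap to compute. A sketch of the "alert at 70%, escalate before 429s" idea — the class name and the exact warn/critical cutoffs are illustrative assumptions:

```python
class RateLimitHeadroom:
    """Tracks utilization against provisioned requests-per-minute and
    tokens-per-minute limits, so alerts fire before 429s start."""

    def __init__(self, rpm_limit, tpm_limit, warn_at=0.70, critical_at=0.90):
        self.rpm_limit = rpm_limit
        self.tpm_limit = tpm_limit
        self.warn_at = warn_at
        self.critical_at = critical_at

    def status(self, requests_last_minute, tokens_last_minute):
        # Whichever limit is closer to exhaustion drives the alert level.
        utilization = max(requests_last_minute / self.rpm_limit,
                          tokens_last_minute / self.tpm_limit)
        if utilization >= self.critical_at:
            level = "critical"
        elif utilization >= self.warn_at:
            level = "warn"
        else:
            level = "ok"
        return level, round(utilization, 3)
```

Taking the max over both limits is the important part: token-per-minute limits are frequently exhausted well before request-per-minute limits, especially after a model update starts producing longer outputs.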

The Bulkhead Pattern: Isolate Your AI Features

If your application calls an LLM provider from multiple features — a chatbot, a summarization pipeline, a code review tool — and they all share the same API key and connection pool, a rate limit hit on one feature starves all the others.

The bulkhead pattern isolates features into separate compartments with independent resource pools. For LLM APIs, this means:

  • Separate API keys (or at minimum, separate rate limit tracking) per feature.
  • Independent circuit breakers per feature so a degraded chatbot doesn't trip the breaker for your summarization pipeline.
  • Per-feature token budgets that prevent any single feature from consuming your entire provisioned throughput.

This is the same pattern you'd use to prevent a noisy neighbor in a microservices architecture. The difference is that LLM API rate limits are typically account-wide, not per-endpoint, which makes the isolation harder to achieve without provider cooperation or multiple accounts.
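Within a single account, the bulkhead ideas above can still be enforced at the application layer. A sketch of a per-feature compartment combining a concurrency cap with a token budget (names and the `try_acquire`/`release` shape are my own assumptions):

```python
import threading


class FeatureBulkhead:
    """Per-feature compartment: a concurrency cap plus a token budget,
    so one noisy feature cannot starve the others."""

    def __init__(self, max_concurrent, token_budget):
        self.semaphore = threading.BoundedSemaphore(max_concurrent)
        self.tokens_remaining = token_budget
        self.lock = threading.Lock()

    def try_acquire(self, estimated_tokens):
        """Reserve a slot and tokens for one request; False means shed load
        for this feature only, without touching the others."""
        with self.lock:
            if self.tokens_remaining < estimated_tokens:
                return False  # this feature's budget is exhausted
            if not self.semaphore.acquire(blocking=False):
                return False  # this feature's concurrency cap is hit
            self.tokens_remaining -= estimated_tokens
        return True

    def release(self):
        self.semaphore.release()
```

You'd keep one instance per feature — e.g. `{"chatbot": FeatureBulkhead(8, 500_000), "summarizer": FeatureBulkhead(4, 200_000)}` — with each feature's circuit breaker layered on top.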

Non-Idempotency Is the Default

Most LLM API calls are non-idempotent by default. Ask the same question twice, get two different answers. This breaks retry logic in a subtle way: if a request times out and you retry it, you might get a completely different response. If the first request actually succeeded on the provider's side (you just didn't receive the response in time), you've now paid for two completions and need to decide which one to use.

For agent workflows where each step depends on the previous one, blind retries can be dangerous. The retry might produce a different plan, pick different tools, or make different assertions — and now your agent is trying to reconcile two divergent execution paths.

Mitigations that work:

  • Use seeds and temperature 0 when you need determinism for retries. This doesn't guarantee identical output across retries, but it reduces variance.
  • Idempotency keys at the application layer: Track which logical operation each request belongs to, and if you get a duplicate response, discard it.
  • Checkpoint-and-resume for agents: Instead of retrying an entire multi-step workflow, checkpoint after each successful step and resume from the last checkpoint on failure.
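The checkpoint-and-resume idea can be sketched as a workflow runner that persists each step's result before moving on; on retry, completed steps are replayed from storage rather than re-executed. The class name, the `(name, fn)` step shape, and the dict stand-in for durable storage are illustrative assumptions:

```python
class CheckpointedWorkflow:
    """Runs a multi-step agent workflow, checkpointing each step's result
    so a retry resumes from the last good step instead of replaying
    non-deterministic LLM calls from the start."""

    def __init__(self, steps, store=None):
        self.steps = steps  # list of (step_name, fn) pairs; fn(state) -> state
        # Stand-in for durable storage (a database or object store in production).
        self.store = store if store is not None else {}

    def run(self, initial_state):
        state = initial_state
        for name, fn in self.steps:
            if name in self.store:
                state = self.store[name]  # resume: reuse the checkpointed result
                continue
            state = fn(state)             # may raise on timeout/failure
            self.store[name] = state      # checkpoint after each successful step
        return state
```

The key property: a step that already succeeded is never re-run, so a retry cannot produce a divergent plan for work the agent has already committed to.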

What Your Playbook Should Look Like

If you're running LLM APIs in production today, here's the minimum viable resilience setup:

  1. Adaptive timeouts with per-operation budgets and deadline propagation.
  2. Three-signal circuit breakers covering errors, latency, and quality.
  3. Same-provider model fallback as the first tier, with cross-provider fallback as the second tier.
  4. Bulkheads isolating independent features from each other's blast radius.
  5. Quality monitoring on sampled traffic with automated regression detection.
  6. A provider health dashboard that tracks the five dimensions above.

None of this is exotic distributed systems theory. It's the same playbook you'd apply to any unreliable upstream dependency, adapted for the unique properties of LLM APIs: variable latency, non-idempotent responses, and the silent failure mode that makes all the other patterns necessary.

The teams that get this right don't treat their LLM provider as a magical service requiring novel infrastructure. They treat it as what it is — a third-party API that happens to be flakier, slower, and harder to monitor than any upstream they've depended on before. They apply the patterns they already know, extended where the old assumptions break down.
