Treating Your LLM Provider as an Unreliable Upstream: The Distributed Systems Playbook for AI

· 11 min read
Tian Pan
Software Engineer

Your monitoring dashboard is green. Response times look fine. Error rates are near zero. And yet your users are filing tickets about garbage answers, your agent is making confidently wrong decisions, and your support queue is filling up with complaints that don't correlate with any infrastructure alert you have.

Welcome to the unique hell of depending on an LLM API in production. It's an upstream service that can fail you while returning a perfectly healthy 200 OK.

Most backend engineers have internalized a playbook for unreliable third-party dependencies: circuit breakers, retries with backoff, bulkheads, timeouts, fallback responses. These patterns are battle-tested across decades of distributed systems practice. But LLM APIs break the core assumption those patterns were built on: that a failure announces itself. Naively applying them gives you a false sense of safety while the real failures slip through undetected.

The Failure Modes You're Not Monitoring For

Traditional upstream services fail in ways your monitoring stack knows how to catch. A database returns a connection error. A payment API sends back a 500. A microservice times out. Your alerts fire, your circuit breaker trips, your fallback kicks in. The system degrades gracefully.

LLM APIs introduce a failure mode that infrastructure monitoring was never designed to detect: semantic degradation. The API returns a 200, latency looks normal, the response body is well-formed JSON — but the content is wrong. The model hallucinated a policy that doesn't exist. It answered a question about your product using information from a competitor. It followed instructions from a prompt injection embedded in a user's input. None of these trigger a single alert in your Datadog dashboard.

This is the fundamental difference between LLM upstreams and every other API you depend on. With a payment processor, a 200 means the charge went through. With an LLM, a 200 means the model produced tokens. Whether those tokens are useful, accurate, or safe is a separate question your infrastructure layer has no opinion about.

The failures that actually hurt in production tend to follow a pattern:

  • Hallucination drift: The model starts grounding answers on less relevant context, but responses remain fluent and confident while correctness decays.
  • Behavioral regression: A provider updates its model, and your carefully tuned prompts start producing subtly different output. Tone shifts, formats change, and edge cases that used to work no longer do.
  • Silent tool-call corruption: In agent workflows, small changes in model behavior trigger excessive or incorrect tool calls that your infrastructure metrics don't flag.

Why Your Timeout Strategy Is Probably Wrong

Setting timeouts for LLM APIs is harder than it sounds because the latency distribution is unlike anything else in your stack. A typical microservice might have a P50 of 50ms and a P99 of 200ms, a 4x spread. LLM APIs routinely show a P50 of 500ms and a P99 of 4-5 seconds, an 8-10x spread. On a bad day, that P99 can spike to 15-20 seconds without the provider reporting any incident.

This means your timeout has to accommodate an enormous range. Set it too tight and you'll kill requests that were going to succeed. Set it too loose and you'll have users staring at spinners for 20 seconds before getting an error. Neither option is acceptable.

The conventional wisdom of "set your timeout to 2x P99" breaks down here because the P99 itself is unstable. During peak hours or after a provider model update, your baseline latency can shift by 3-5x with no warning. A timeout calibrated to yesterday's latency distribution may be wrong today.

What works better in practice:

  • Adaptive timeouts based on a rolling window of recent latency percentiles, not a static configuration value. If your P95 has been climbing for the last 10 minutes, your timeout should adjust automatically.
  • Per-operation timeout budgets that account for the expected output length. A request generating 50 tokens should have a very different timeout than one generating 2,000 tokens.
  • Deadline propagation from your user-facing SLA backward through the call chain. If your API promises a response in 10 seconds and you've already spent 3 seconds on retrieval, your LLM call gets at most the remaining 7 seconds (6 if you reserve a second for post-processing), not whatever default your HTTP client is configured with.
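
The three ideas above can be combined into one timeout calculation. What follows is a minimal sketch, not a definitive implementation: the names `AdaptiveTimeout`, `output_budget`, and `llm_timeout` are hypothetical, and the defaults (a 2x multiplier on the rolling P95, one second to first token, 0.02 seconds per output token) are illustrative assumptions you would replace with numbers measured from your own traffic.

```python
import math
from collections import deque


class AdaptiveTimeout:
    """Derives a timeout from a rolling window of recent latencies."""

    def __init__(self, window=200, percentile=0.95, multiplier=2.0,
                 floor_s=1.0, ceiling_s=30.0):
        self.samples = deque(maxlen=window)  # oldest samples drop off automatically
        self.percentile = percentile
        self.multiplier = multiplier         # assumed headroom over the percentile
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s

    def record(self, latency_s):
        self.samples.append(latency_s)

    def current(self):
        if not self.samples:
            return self.ceiling_s  # no data yet, so be permissive
        ordered = sorted(self.samples)
        # Nearest-rank percentile over the window.
        rank = max(1, math.ceil(self.percentile * len(ordered)))
        value = ordered[rank - 1] * self.multiplier
        return min(self.ceiling_s, max(self.floor_s, value))


def output_budget(max_tokens, first_token_s=1.0, seconds_per_token=0.02):
    """Rough per-operation budget: time to first token plus generation time.

    Both rates are assumptions; measure them for your provider and model.
    """
    return first_token_s + max_tokens * seconds_per_token


def llm_timeout(adaptive, max_tokens, deadline_s, elapsed_s, reserve_s=0.0):
    """Pick a timeout for one LLM call.

    The more permissive of the two local estimates is used, then capped by
    whatever is left of the end-to-end deadline, which always wins.
    """
    remaining = deadline_s - elapsed_s - reserve_s
    if remaining <= 0:
        raise TimeoutError("deadline exhausted before the LLM call started")
    local_estimate = max(adaptive.current(), output_budget(max_tokens))
    return min(local_estimate, remaining)
```

With a 10-second SLA, 3 seconds already spent, and a 1-second reserve, `llm_timeout` never returns more than 6 seconds, however slow the provider has been recently. Taking the larger of the two local estimates is a design choice that avoids killing long generations on a fast day; a stricter service could take the smaller one instead.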

Circuit Breakers Need a Semantic Dimension

The classic circuit breaker pattern monitors error rates: when failures exceed a threshold, the breaker trips open and stops sending traffic to the failing service. For LLM providers, this catches maybe 20% of the problems — the ones that actually produce HTTP errors.

The circuit breaker that production AI teams end up building monitors three signals simultaneously:

Error rate (the traditional signal): Track 429s, 500s, 502s, and 503s. Community consensus from production deployments lands around 5 failures to trip open, with a 60-second cooldown period. Alert at >5% error rate, escalate to critical at >15%.
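
A minimal sketch of this first signal, assuming the "5 failures" are consecutive (the figure above doesn't specify) and that any 2xx response counts as a success. The class and function names are hypothetical, and the injectable `clock` exists only so the cooldown can be tested without sleeping.

```python
import time

# Status codes treated as provider failures, taken from the list above.
FAILURE_STATUSES = {429, 500, 502, 503}


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=60.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.failures = 0       # consecutive failures seen so far
        self.opened_at = None   # None means the breaker is closed

    def allow_request(self):
        if self.opened_at is None:
            return True
        # After the cooldown, let traffic through again as a probe (half-open).
        return self.clock() - self.opened_at >= self.cooldown_s

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            # Trips the breaker, or re-opens it if a half-open probe failed.
            self.opened_at = self.clock()


def record_status(breaker, status):
    """Feed one HTTP status code into the breaker."""
    if status in FAILURE_STATUSES:
        breaker.record_failure()
    elif 200 <= status < 300:
        breaker.record_success()
    # Other codes (such as 400) are the caller's fault and are ignored here.
```

Callers check `allow_request()` before each call and pass the resulting status to `record_status()`. The >5% and >15% alert thresholds would live in your metrics system rather than in the breaker itself.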

Latency degradation: When P95 latency exceeds 3x its rolling baseline, treat it as a partial failure even if every request eventually succeeds. A provider that's technically up but taking 12 seconds per request is functionally down for any user-facing application.
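
One way to express that rule is to compare the P95 of a short recent window against the P95 of a longer baseline. The sketch below is an illustration under stated assumptions: the window sizes, the `min_baseline` guard, and the name `LatencyDegradationDetector` are all hypothetical, and samples only enter the baseline once they age out of the recent window so that a fresh spike doesn't immediately inflate the baseline it's compared against.

```python
import math
from collections import deque


def p95(samples):
    """Nearest-rank 95th percentile of a non-empty collection."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(0.95 * len(ordered)))
    return ordered[rank - 1]


class LatencyDegradationDetector:
    def __init__(self, baseline_window=1000, recent_window=50,
                 min_baseline=100, ratio=3.0):
        self.baseline = deque(maxlen=baseline_window)
        self.recent = deque(maxlen=recent_window)
        self.min_baseline = min_baseline  # don't judge until the baseline has data
        self.ratio = ratio                # the 3x multiplier from the text

    def record(self, latency_s):
        if len(self.recent) == self.recent.maxlen:
            # The oldest recent sample ages into the baseline before eviction.
            self.baseline.append(self.recent[0])
        self.recent.append(latency_s)

    def degraded(self):
        if len(self.baseline) < self.min_baseline:
            return False
        if len(self.recent) < self.recent.maxlen:
            return False
        return p95(self.recent) > self.ratio * p95(self.baseline)
```

When `degraded()` returns true, the breaker can treat the provider as partially failed even though no request has returned an error. Because slow samples eventually age into the baseline, a sustained slowdown will stop registering as degraded after a while; whether that is desirable depends on whether you want to alert on change or on absolute latency.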

Quality degradation: This is the hard one. You need inline quality checks — lightweight evaluations that run on a sample of responses to detect hallucination rate increases, format compliance drops, or instruction-following regressions. When your quality score drops below a threshold, the circuit breaker should route traffic to a fallback provider even though the primary is technically healthy.
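
As a rough illustration, here is a sampled quality monitor built around the cheapest check on that list, format compliance. Everything in it is an assumption made for the sketch: the required JSON keys, the 10% sample rate, the 0.9 threshold, and the names `format_compliant`, `QualityMonitor`, and `choose_provider`. Hallucination and instruction-following checks would plug in as other `check` functions, and are considerably harder to write.

```python
import json
import random
from collections import deque


def format_compliant(response_text, required_keys=("answer", "sources")):
    """Cheap inline check: is the response a JSON object with the expected keys?"""
    try:
        payload = json.loads(response_text)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and all(k in payload for k in required_keys)


class QualityMonitor:
    def __init__(self, check, sample_rate=0.1, window=100, min_samples=20,
                 threshold=0.9, rng=random.random):
        self.check = check
        self.sample_rate = sample_rate
        self.results = deque(maxlen=window)  # rolling pass/fail history
        self.min_samples = min_samples
        self.threshold = threshold
        self.rng = rng

    def observe(self, response_text):
        # Only evaluate a sample of responses to keep the overhead low.
        if self.rng() >= self.sample_rate:
            return
        self.results.append(1.0 if self.check(response_text) else 0.0)

    def score(self):
        if len(self.results) < self.min_samples:
            return None  # not enough evidence to judge yet
        return sum(self.results) / len(self.results)

    def healthy(self):
        score = self.score()
        return score is None or score >= self.threshold


def choose_provider(monitor, primary="primary", fallback="fallback"):
    """Route away from the primary when its rolling quality score drops."""
    return primary if monitor.healthy() else fallback
```

The monitor deliberately reports healthy until it has `min_samples` results, so a handful of bad responses right after a deploy doesn't flip traffic on its own.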

Building that third signal is what separates teams that run LLMs in production from teams that demo LLMs in production.

The Fallback Chain Is More Complex Than You Think

A typical fallback strategy for a REST API is straightforward: if Service A fails, try Service B. LLM fallbacks are trickier because the services aren't interchangeable.

Different models from different providers have different capabilities, different context windows, different instruction-following characteristics, and different failure modes. A prompt that works perfectly with Claude may produce garbage with GPT-4o, and vice versa. A fallback chain of OpenAI → Anthropic → Google sounds good on a whiteboard but requires maintaining three sets of prompts, three sets of quality baselines, and three integration test suites.
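
To make that maintenance cost concrete, here is a minimal sketch of a fallback chain in which each provider carries its own prompt template and context limit. It calls no real SDK: `ProviderConfig`, `complete_with_fallback`, and the `call` and `validate` callables are hypothetical stand-ins that you would wire to your own client code and quality checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProviderConfig:
    name: str
    prompt_template: str          # tuned per provider, not shared
    max_context_tokens: int
    call: Callable[[str], str]    # hypothetical wrapper around a provider client


class AllProvidersFailed(Exception):
    pass


def complete_with_fallback(chain, question, validate, estimated_tokens=0):
    """Try each provider in order and return (provider name, output)."""
    errors = []
    for provider in chain:
        if estimated_tokens > provider.max_context_tokens:
            errors.append((provider.name, "input exceeds context window"))
            continue
        # Each provider gets the prompt that was tuned for it.
        prompt = provider.prompt_template.format(question=question)
        try:
            output = provider.call(prompt)
        except Exception as exc:  # broad on purpose: any failure means move on
            errors.append((provider.name, repr(exc)))
            continue
        if validate(output):
            return provider.name, output
        errors.append((provider.name, "failed validation"))
    raise AllProvidersFailed(errors)
```

Passing the result through `validate` before returning matters as much as catching exceptions: a fallback that answers with a 200 and a malformed response is treated the same as one that errors out.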

What actually works in production:
