The Vendor SLA Gap: Why Your LLM Provider's Uptime Misses the Failure Mode That Breaks Your Product

· 9 min read
Tian Pan
Software Engineer

Your LLM provider says 99.95% availability. Your status page is green. Your latency dashboard is within SLO. Your product is broken anyway — the assistant started refusing routine requests this morning, the JSON outputs that powered the downstream parser shifted from compact to chatty, and a third of the support tickets you triage with a model are coming back with "I can't help with that." Every one of those responses returned 200 OK in under 800ms. None of them violated the SLA. The SLA covers a failure mode you don't actually have, and says nothing about the one you do.

This is the gap nobody priced into the procurement conversation. The vendor sells availability — a request-level promise that the API answered in time — and the product team consumes capability, which is a request-level promise that the answer was usable. The two are not the same metric, and the team that confuses them is one quiet model bump away from learning the difference.

What the SLA Actually Measures

Read any frontier-model vendor's availability commitment and the same definition appears in the legal text: an "available" request is one that returns a 2xx response within some latency bound. That definition is a sensible primitive for a hosted infrastructure service. It is also missing every failure mode that actually matters for an LLM-backed product.
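That availability primitive is simple enough to state in a few lines. A minimal sketch of the SLA's definition, where the 10-second latency bound is an illustrative assumption rather than any specific vendor's number:

```python
from dataclasses import dataclass

@dataclass
class Request:
    status_code: int   # HTTP status returned by the API
    latency_ms: float  # wall-clock time for the call to complete

def is_available(req: Request, latency_bound_ms: float = 10_000) -> bool:
    """The vendor's definition: any 2xx response within the latency bound."""
    return 200 <= req.status_code < 300 and req.latency_ms <= latency_bound_ms

def availability(requests: list[Request], latency_bound_ms: float = 10_000) -> float:
    """Fraction of requests counted 'available' under the SLA definition."""
    if not requests:
        return 1.0
    ok = sum(is_available(r, latency_bound_ms) for r in requests)
    return ok / len(requests)
```

Notice what the predicate never inspects: the response body. A refusal, a malformed JSON blob, or a silently degraded answer all score as available.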

The list of failures that return 200 OK and never appear in the SLA report is long enough to be its own taxonomy:

  • Refusal-rate spikes. A content-safety policy update tightens the model's refusal disposition, and prompts that were answered yesterday come back today with "I can't help with that." The HTTP layer says success. The product layer says outage.
  • Capability regressions after a silent model bump. Vendors push model updates without changing the endpoint name. Your "evergreen" alias resolves to a new snapshot, and a workflow that depended on the old model's behavior starts producing different (and sometimes worse) output. The provider didn't lie — they ship updates because they're improvements on aggregate benchmarks — but your specific use case lost ground without a release note.
  • Quota throttling that returns 200 with a degraded answer. Under load, some providers fall back to smaller models, lower context windows, or cheaper sampling settings rather than returning 429. The call succeeded, the bill went up, the answer got worse.
  • Content-safety policy changes shipped as feature updates. The model's behavior is shaped by a system-level safety stack that evolves independently of the model version. Two requests to the same pinned snapshot on different days can behave differently because of changes in what's around the model, not in it.
  • Regional capacity drops that route a continent of traffic to a backup region with different behavior. The region failover keeps the API up. The model serving that backup region may be running a different version, a different quantization, or a different system prompt. Your users hit the same URL and get a subtly different product.

Every item on this list passes the vendor's SLA. None of them pass the user's "is the feature working?" check. The team that signed the contract was promised availability. The team running the product needs functional availability — a different SLI that the vendor does not measure, does not publish, and cannot be held to.
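A functional-availability check has to look where the SLA doesn't: inside the response body. Here is a minimal sketch; the refusal markers and the JSON-shape check are illustrative assumptions, and a real product would define "usable" against its own downstream contract:

```python
import json

# Illustrative refusal markers -- a real product would maintain its own list.
REFUSAL_MARKERS = ("i can't help with that", "i cannot assist", "i'm unable to")

def is_functionally_available(body: str, expect_json: bool = False) -> bool:
    """Did the 200 OK response actually do the job?

    Checks the failure modes the HTTP layer never sees: refusals, and
    outputs whose shape the downstream parser can no longer consume.
    """
    text = body.strip().lower()
    if any(marker in text for marker in REFUSAL_MARKERS):
        return False   # refusal-rate spike: HTTP success, product outage
    if expect_json:
        try:
            parsed = json.loads(body)
        except json.JSONDecodeError:
            return False   # compact-to-chatty drift broke the parser
        if not isinstance(parsed, dict):
            return False
    return True
```

Every request in this sketch would have passed the vendor's availability check; only the body-level predicate distinguishes a working feature from a quiet outage.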

Why Procurement Cannot Close This Gap

The instinct, when you notice the gap, is to push it back to legal: negotiate a stronger SLA, demand model-version stability guarantees, get a refusal-rate ceiling written into the contract. This rarely works, and not because vendors are uncooperative.

The vendor cannot promise a refusal-rate ceiling because refusal behavior is a property of your prompts, not their API. The vendor cannot promise capability stability because they are continually retraining and reshaping the safety stack, and "your specific task got worse" is not a regression they can detect — they regression-test their own benchmarks, not yours. The vendor cannot promise that the model serving your region's failover is identical to your primary because the operational reality of running a multi-region inference service includes capacity-driven heterogeneity.

The functional-availability gap is structural, not commercial. The vendor is selling you a piece of infrastructure that has an irreducibly application-specific reliability story. Closing the gap is on you to instrument, because you are the only party who knows what "working" means for your product.
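In practice that means tracking your own SLI from production (or canary) traffic. A minimal sketch of such a monitor, with the window size and alert threshold as illustrative assumptions; how each request gets scored as usable is exactly the application-specific part only you can define:

```python
from collections import deque

class FunctionalAvailabilityMonitor:
    """Track a self-measured functional-availability SLI over a sliding window.

    Callers score each request as usable or not (no refusal, parseable
    output, expected behavior) and record the verdict here.
    """

    def __init__(self, window: int = 500, alert_threshold: float = 0.95):
        self.results = deque(maxlen=window)   # most recent `window` verdicts
        self.alert_threshold = alert_threshold

    def record(self, usable: bool) -> None:
        self.results.append(usable)

    def functional_availability(self) -> float:
        if not self.results:
            return 1.0
        return sum(self.results) / len(self.results)

    def breached(self) -> bool:
        """True when the self-measured SLI drops below your own objective."""
        return self.functional_availability() < self.alert_threshold
```

A monitor like this is what turns a silent model bump or a refusal-rate spike from an anecdote in support tickets into an alert you can page on.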

The Instrumentation That Closes the Gap
