
Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, four hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.

This post is about the math, the contract clauses that should exist but don't, and the observability work that surfaces the gap before users do. The recurring theme: a vendor SLA is a claim about a workload that vendor was benchmarking, and your agent is not generating that workload.

The composition math nobody negotiates

Series availability is the textbook case. If a system depends on N components, each with availability A, and the task fails when any one of them fails, total availability is approximately A^N. Three components at 99% each give 97% total. Twenty components at 99.9% each give 98.0%. The math is unforgiving: parallel redundancy can buy availability back, but a chain of independent calls in series multiplies the success probabilities together, so the small per-call failure rates compound into a much larger per-task failure rate.

Plug in real numbers from the industry. Provider availability over a recent 90-day window has been reported as low as 98.95% for one major model API and around 99.76% overall for another, with API components occasionally dipping below 99% for stretches. These are real, recent operating points — not the marketing 99.99% — and they exist because foundation-model infrastructure is younger and capacity-constrained in ways established cloud APIs are not.

Now compose them with an agent loop. Suppose your agent makes 12 inference calls per user task — a planner call, four retrieval-grounding calls, four tool-result-summarization calls, two replans, and a final synthesis. At an honest 99.5% per-call availability, task availability is 0.995^12 ≈ 94.2%. That is roughly forty hours of failed tasks per month, against a provider that is meeting a contractually clean 99.5% target. At 99.9% per-call you get 98.8% — still worse than what the provider's status page shows, by an order of magnitude in error rate.
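A few lines of arithmetic make the gap concrete. This is just the expressions above evaluated in code; the 99.5% and 99.9% per-call figures and the fanout of 12 are the illustrative numbers from this section, not measurements of any particular provider.

```python
# Per-task availability when a task fans out into N provider calls, assuming
# independent failures and that any essential call failing fails the task.

def task_availability(per_call_availability: float, fanout: int) -> float:
    """Series composition: every one of `fanout` calls must succeed."""
    return per_call_availability ** fanout

for per_call in (0.995, 0.999):
    per_task = task_availability(per_call, fanout=12)
    failed_hours_per_month = (1 - per_task) * 30 * 24
    print(
        f"per-call {per_call:.2%} -> per-task {per_task:.1%} "
        f"(~{failed_hours_per_month:.0f} hours of failed tasks / month)"
    )

# per-call 99.50% -> per-task 94.2% (~42 hours of failed tasks / month)
# per-call 99.90% -> per-task 98.8% (~9 hours of failed tasks / month)
```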

The SLA you signed describes per-call availability at the boundary the provider's load balancer can measure. The number your users live with is per-task availability at the boundary your product owns. There is no conspiracy here — it is just two different denominators.

Retries don't fix this; they make it worse along a different axis

The first instinct is to add retries. A 429 from a rate limiter, a 503 from a degraded region, a timeout from a slow node — most of these resolve within seconds, and a retry recovers the call. So the team adds exponential backoff with jitter and considers the problem handled.
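For concreteness, that policy usually looks something like the following: capped exponential backoff with full jitter around a generic call function. The retryable status codes, the attempt cap, and the delays are illustrative placeholders, not a recommendation for any specific provider.

```python
import random
import time

RETRYABLE = {429, 500, 502, 503, 504}  # illustrative; match your provider's actual semantics

def call_with_backoff(call, max_attempts=4, base_delay=0.5, max_delay=8.0):
    """Retry `call` (a zero-arg function returning (status, body)) on transient errors."""
    for attempt in range(max_attempts):
        status, body = call()
        if status not in RETRYABLE:
            return status, body
        if attempt == max_attempts - 1:
            break
        # Capped exponential backoff with full jitter, to avoid synchronized retries.
        delay = random.uniform(0, min(max_delay, base_delay * 2 ** attempt))
        time.sleep(delay)
    return status, body
```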

It isn't. Retries change the failure distribution but they introduce three new failure modes that the provider's SLA does not cover:

Latency tax on the happy-but-slow path. A retry-on-timeout policy turns a slow call into a slow-plus-retry call. If 5% of calls hit a 10-second timeout and retry to a successful 2-second response, you have shifted the p95 latency by a multiple, and the user perceives a slowdown that no per-call SLA captures. The eval rig running serial calls against a warm provider tier never sees this.

Retry storms feeding back into rate limits. A degraded provider region returns 429s. Your retry policy fires three more requests at the same region. Each agent that does this multiplies load by 3-4x. If 10% of your fleet is retrying at any given moment during a degradation, the effective load goes up sharply right when the provider can least handle it. The provider has not violated their SLA — they are returning correct 429s — but your system has self-amplified the problem. Retry storms that multiply load 10x within seconds are a documented production pattern in agent fleets.
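One standard way to keep a retry policy from self-amplifying during a degradation (a general SRE pattern, not something any provider mandates) is a retry budget: retries are allowed only while they stay under a fixed fraction of recent first-attempt traffic. A minimal per-process sketch, with illustrative numbers:

```python
import collections
import time

class RetryBudget:
    """Allow retries only while retries stay under `ratio` of recent requests.

    Per-process sliding-window sketch; a real fleet would share this state
    or enforce the budget at the routing layer.
    """

    def __init__(self, ratio: float = 0.1, window_s: float = 10.0):
        self.ratio = ratio
        self.window_s = window_s
        self.requests = collections.deque()  # timestamps of first attempts
        self.retries = collections.deque()   # timestamps of retries

    def _trim(self, now: float) -> None:
        for q in (self.requests, self.retries):
            while q and now - q[0] > self.window_s:
                q.popleft()

    def record_request(self) -> None:
        self.requests.append(time.monotonic())

    def allow_retry(self) -> bool:
        now = time.monotonic()
        self._trim(now)
        if len(self.retries) + 1 > self.ratio * max(len(self.requests), 1):
            return False  # budget exhausted: fail fast instead of amplifying load
        self.retries.append(now)
        return True
```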

Cost amplification on the unhappy path. A 12-call task that retries three times on the worst call becomes a 15-call task. Multiply across a fleet under degradation and you have a measurable monthly cost line item that exists only because the provider was slightly unhealthy. Your finance team sees the bill before your SRE team sees the cause.

The retry policy is doing exactly what it was designed to do. It is just that "transient errors" and "user-perceived task failures" are different events, and a fix calibrated for one does not improve the other.

The contract clauses that should exist but don't

Most LLM provider contracts inherit their SLA structure from synchronous web-API contracts. They speak in availability percentages and credit thresholds at the per-request boundary. The clauses that would actually protect an agent workload are usually missing, and asking for them at renewal is one of the highest-leverage things a procurement team can do once they understand the math.

Rate-limit burst headroom for fanout. A standard rate-limit clause is a steady-state ceiling: N requests per minute. Agents do not generate steady-state load — they generate bursty load shaped like the user's task arrival pattern, with each task firing 10-20 calls in rapid succession. A clause that gives a defined burst budget over a short window (say, 4x the steady-state for 15 seconds) is far more useful than a 50% bump in the steady-state ceiling.
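The difference between the two shapes is easiest to see with token-bucket arithmetic: the steady-state ceiling is a refill rate, while the burst clause adds bucket depth. The numbers below are made up to match the "4x for 15 seconds" example, not taken from any real contract.

```python
# Two rate-limit shapes, with illustrative numbers:
#   (a) steady-state ceiling bumped 50%:  900 req/min, no burst depth
#   (b) burst clause: 600 req/min refill, plus 4x steady-state sustained for 15 s

steady_rate = 600 / 60              # tokens per second at the base ceiling
bumped_rate = steady_rate * 1.5     # option (a)
burst_depth = 4 * steady_rate * 15  # option (b): 600 tokens of bucket depth

# An agent task that fires 15 calls in one second, with 40 tasks arriving together:
spike = 15 * 40  # 600 calls in roughly one second

print(f"option (a) absorbs ~{bumped_rate:.0f} of those calls in that second; the rest are throttled")
print(f"option (b) bucket depth is {burst_depth:.0f} tokens, so the whole spike fits")
```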

Retry-tolerant credit for transient failures. Most SLA credits trigger on per-request error rates above some threshold, computed against the provider's view of "request." Negotiate credit accounting that acknowledges retried-and-eventually-succeeded calls as partial failures — the provider's metric counts them as a clean success on the second attempt, but they cost you latency and tokens.

Inter-token latency ceilings, not just time-to-first-token. Streaming SLAs typically cover time-to-first-token. Users perceive inter-token jitter, not first-token latency, and a provider can hit their TTFT ceiling while emitting a 3-second pause mid-response that reads as "broken" to the user. Push for p99 inter-token latency as a contract metric, or at minimum as a published SLI.
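Even if the provider will not publish it, measuring inter-token latency on your side is cheap. The sketch below wraps whatever token iterator your client library yields while streaming and records the gaps so you can track a p99 alongside TTFT; the iterator is a generic stand-in, not a specific SDK call.

```python
import time

def measure_stream(token_iter):
    """Consume a streaming response, yielding tokens while recording latency stats.

    Usage: for token in measure_stream(provider_stream): render(token)
    """
    gaps = []
    start = time.monotonic()
    last = start
    ttft = None
    for token in token_iter:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start          # time to first token
        else:
            gaps.append(now - last)     # inter-token gap
        last = now
        yield token
    if gaps:
        gaps.sort()
        p99 = gaps[min(len(gaps) - 1, int(0.99 * len(gaps)))]
        # Emit both: a stream can meet its TTFT target and still stall mid-response.
        print(f"ttft={ttft:.2f}s p99_inter_token_gap={p99:.3f}s max_gap={gaps[-1]:.3f}s")
```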

Multi-region failover with prompt-version coherence. When a provider region fails and you reroute to a different region, the model version, prompt cache state, and even the tokenizer version may differ subtly. A clause that pins the model version across regions and commits to coordinated rollouts (so you cannot get cross-version skew during a regional cutover) prevents an entire class of incidents that look like regressions but are routing artifacts.

The cost of negotiating these is usually small relative to the deal size. The cost of not having them surfaces as incident toil three quarters in.

Multi-provider routing changes the math, but not for free

The natural response to the composition problem is multi-provider routing — if any single provider's availability is your ceiling, fan out across two and your composed availability goes up. This works, with caveats. By mid-2025, roughly 40% of production LLM teams had multi-provider routing in place, mostly motivated by visible outages at both major providers.

Parallel redundancy improves availability mathematically. Two providers each at 99.5% give a parallel availability of 1 - (1 - 0.995)^2 = 99.9975% — better than four nines, from two components that individually manage barely two and a half. But the math assumes independence (correlated outages, like a shared underlying cloud provider, break it), it assumes equivalent quality (the cheap fallback might pass the availability check while regressing on the quality check), and it assumes seamless cutover (which is harder than it sounds when prompts are tuned per-model).

Operational realities to budget for:

  • Prompt portability is a non-trivial engineering project. A prompt that works against model A often regresses against model B because of differences in instruction-following bias, tokenizer behavior, or output formatting priors. The eval suite is the backstop, but maintaining parallel passing eval suites across two providers is real ongoing work.
  • Tool-call schema drift across providers. Each provider has slightly different conventions for function/tool calling, structured output, and error semantics. The middleware that abstracts these has to be tested against the failure modes each provider actually produces, not just the happy path.
  • Cost of warm capacity. Keeping a meaningful share of traffic flowing to the secondary provider so it stays warm and so you have current quality data is not free. Running it as a cold-standby that activates during failover means the first traffic post-failover hits cold caches and degraded latency at exactly the worst moment.

Multi-provider routing is the right answer for most agent workloads at scale, but it widens the question from "what is my availability?" to "what are my availability, my quality, and my cost?", and the answer is no longer a single number on a dashboard.
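In code, the routing layer usually ends up as a thin wrapper that normalizes each provider behind a common adapter interface and fails over on transient errors. Everything named below (the adapter objects, their generate method, the error type) is a hypothetical shape for illustration, not any particular SDK.

```python
class TransientProviderError(Exception):
    """Timeouts, 429s, 5xx: anything worth failing over for."""

def complete(prompt: str, providers) -> tuple[str, str]:
    """Try each provider adapter in order; return (provider_name, normalized_text).

    `providers` is an ordered list of adapter objects, each exposing `name` and a
    `generate(prompt) -> str` method that handles that provider's request format,
    tool-call schema, and error mapping (hypothetical interface).
    """
    last_error = None
    for adapter in providers:
        try:
            return adapter.name, adapter.generate(prompt)
        except TransientProviderError as exc:
            # Availability is recovered by the next adapter, but quality and prompt
            # tuning are not: the fallback needs its own passing eval suite, or this
            # silently trades uptime for quality.
            last_error = exc
            continue
    raise RuntimeError("all providers failed") from last_error
```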

Observability: compose provider metrics with harness stats

The piece most teams underbuild is the observability layer that combines vendor metrics with your own harness statistics into a task-level availability number. The provider's status page shows you their view. Your APM shows you your code's view. Neither shows you the user's view.

A working setup needs three things stitched together:

  • Per-call instrumentation that distinguishes call outcomes. Not "success or failure," but "succeeded on first try," "succeeded after N retries," "failed after exhausting retries," "failed to a fallback provider." Each of these has different cost and latency implications.
  • Per-task aggregation that defines the user-visible boundary. A trace that maps every fan-out call back to the user request that initiated it, with a task-success label that depends on the agent's own definition of completion, not on individual call status codes.
  • Drift alarms on the gap between provider SLI and your SLI. When the provider's API availability drops 0.2 percentage points, your task availability often drops 1-2 percentage points. Alarming on the divergence — your number falling faster than theirs — surfaces a self-amplifying retry storm or a cascade across calls long before either side's absolute number breaches an SLO.

The drift alarm is the highest-leverage piece. The absolute task-success number is a lagging indicator that's easy to argue with after an incident. The divergence between provider availability and task availability tells you that something in your composition layer — retry policy, fanout shape, replan logic — is amplifying a small upstream wobble into a big downstream one. That is the signal you can act on while the incident is in flight, not after.
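A minimal sketch of that divergence check, assuming you already export two series over the same window: the provider's per-call availability (from their status feed or your own per-call success counts) and your per-task success rate. The baseline and threshold values are illustrative.

```python
def divergence_alarm(provider_sli: float, task_sli: float,
                     baseline_gap: float, max_extra_gap: float = 0.01) -> bool:
    """Alert when task availability falls faster than provider availability.

    provider_sli / task_sli: current-window availabilities in [0, 1]
    baseline_gap: the steady-state difference (provider_sli - task_sli) you
                  normally run with, e.g. computed over the last 30 days
    max_extra_gap: how much the gap may widen before paging (illustrative: 1 point)
    """
    gap = provider_sli - task_sli
    return gap - baseline_gap > max_extra_gap

# Example with the shape described above: the provider dips 0.2 points,
# tasks dip 1.5 points, so the gap widened by 1.3 points -> page.
assert divergence_alarm(provider_sli=0.997, task_sli=0.973, baseline_gap=0.011)
```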

What to actually do this quarter

If your team is shipping agent features against vendor SLAs, the playbook fits on one page:

  1. Compute your fanout factor. Look at a real day's traffic. How many provider calls does the median user task make? The 95th percentile? You probably do not know, and the number is the input to every other decision.
  2. Compose the math. Raise per-call provider availability to the power of the fanout factor. Compare to your task-success SLO. If the composed number is below the SLO, no amount of provider performance inside their SLA will save you.
  3. Audit the retry policy. For each retryable error, ask whether retrying is recovering the call or just amplifying load. Replanning on semantic errors and cancelling outstanding fanouts on critical errors are usually higher-leverage than a fourth retry.
  4. Negotiate the right clauses at renewal. Burst headroom, retry-aware credit accounting, inter-token latency ceilings, region-pinned model versions. Cheap to ask for, expensive to retrofit.
  5. Build the divergence alarm. Provider SLI minus task SLI, alerted on the gap, not the absolute. This is the metric that catches incidents that the provider's status page does not.

The architectural takeaway is short. "We run on top of a 99.9% provider" is a claim about a workload your provider was benchmarking, not the workload your agent is generating. The user-perceived availability of your feature is a composition — provider availability, retry policy, fanout shape, aggregation logic — and every one of those terms is yours to design. The contract is the floor; the math determines the ceiling.
