Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent
A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, under nine hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98%, a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug; it is arithmetic.
The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.
This post is about the math, the contract clauses that should exist but don't, and the observability work that surfaces the gap before users do. The recurring theme: a vendor SLA is a claim about a workload that vendor was benchmarking, and your agent is not generating that workload.
The composition math nobody negotiates
Series availability is the textbook case. If a system depends on N independent components, each with availability A, and the task fails when any one of them fails, total availability is A^N. Three components at 99% each give about 97% total. Twenty components at 99.9% each give about 98.0%. The math is unforgiving: redundancy buys availability back, but a chain of independent calls multiplies the success probabilities together, so the failure probability grows roughly in proportion to the length of the chain.
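Written out, as a sketch under the independence assumption above. A and N are the symbols from the text; p = 1 - A is introduced here for the per-component failure probability and is not part of the original notation.

```latex
\begin{align*}
A_{\text{task}} &= \prod_{i=1}^{N} A = A^{N} = (1 - p)^{N} \\
1 - A_{\text{task}} &= 1 - (1 - p)^{N} \approx N p \qquad \text{when } N p \ll 1 \\
N = 3,\ A = 0.99 &: \quad 0.99^{3} \approx 0.9703 \\
N = 20,\ A = 0.999 &: \quad 0.999^{20} \approx 0.9802
\end{align*}
```

The linear approximation is why a 20-call chain at 0.1% per-call failure lands near 2% task failure.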
Plug in real numbers from the industry. Provider availability over a recent 90-day window has been reported as low as 98.95% for one major model API and around 99.76% overall for another, with API components occasionally dipping below 99% for stretches. These are real, recent operating points — not the marketing 99.99% — and they exist because foundation-model infrastructure is younger and capacity-constrained in ways established cloud APIs are not.
Now compose them with an agent loop. Suppose your agent makes 12 inference calls per user task: a planner call, four retrieval-grounding calls, four tool-result-summarization calls, two replans, and a final synthesis. At an honest 99.5% per-call availability, task availability is 0.995^12 ≈ 94.2%. With traffic spread evenly, that is the equivalent of roughly 42 hours of failed tasks in a 30-day month, against a provider that is meeting a contractually clean 99.5% target. At 99.9% per-call you get 98.8%, still worse than what the provider's status page shows, by an order of magnitude in error rate.
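The same arithmetic as a small script, in case you want to plug in your own fan-out. This is a minimal sketch: the function names are illustrative, and the "failed hours" figure assumes a 30-day month with traffic spread evenly across it.

```python
HOURS_PER_MONTH = 30 * 24  # assumption: 30-day month, uniform traffic


def task_availability(per_call: float, calls_per_task: int) -> float:
    """Availability of a task that fails if any one of its calls fails."""
    return per_call ** calls_per_task


def failed_hours_per_month(availability: float) -> float:
    """Equivalent hours per month during which tasks fail."""
    return (1.0 - availability) * HOURS_PER_MONTH


for per_call in (0.995, 0.999):
    task = task_availability(per_call, calls_per_task=12)
    hours = failed_hours_per_month(task)
    print(f"{per_call:.1%} per call -> {task:.1%} per task, "
          f"about {hours:.0f} failed-task hours per month")
```

Changing `calls_per_task` to 8 or 20 shows how quickly the per-task number moves with fan-out, even when the per-call number does not change.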
The SLA you signed describes per-call availability at the boundary the provider's load balancer can measure. The number your users live with is per-task availability at the boundary your product owns. There is no conspiracy here — it is just two different denominators.
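Both denominators can be computed from the same call log. A minimal sketch, assuming a hypothetical `CallRecord` that carries a `task_id` and the final `ok` outcome of each essential call; neither name comes from any provider's API.

```python
from collections import defaultdict
from typing import NamedTuple, Sequence


class CallRecord(NamedTuple):
    task_id: str  # hypothetical: the user task this call belongs to
    ok: bool      # final outcome of one essential inference call


def per_call_availability(records: Sequence[CallRecord]) -> float:
    """Denominator = calls. Roughly what the provider can measure."""
    return sum(r.ok for r in records) / len(records)


def per_task_availability(records: Sequence[CallRecord]) -> float:
    """Denominator = tasks. A task succeeds only if all its calls did."""
    task_ok: dict[str, bool] = defaultdict(lambda: True)
    for r in records:
        task_ok[r.task_id] = task_ok[r.task_id] and r.ok
    return sum(task_ok.values()) / len(task_ok)


# Ten tasks of ten calls each; one failed call in each of two tasks.
records = [
    CallRecord(task_id=f"t{t}", ok=not (t < 2 and c == 0))
    for t in range(10)
    for c in range(10)
]
print(f"per call: {per_call_availability(records):.2%}")  # 98.00%
print(f"per task: {per_task_availability(records):.2%}")  # 80.00%
```

Two failed calls out of a hundred read as 98% on one dashboard and 80% on the other, from identical data.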
Retries don't fix this; they make it worse along a different axis
The first instinct is to add retries. A 429 from a rate limiter, a 503 from a degraded region, a timeout from a slow node — most of these resolve within seconds, and a retry recovers the call. So the team adds exponential backoff with jitter and considers the problem handled.
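For reference, the policy being described usually looks something like the following. This is a sketch using the "full jitter" variant of exponential backoff; `TransientError`, the default delays, and the attempt cap are illustrative assumptions, not any provider SDK's behavior.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for a 429, a 503, or a timeout."""


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 8.0) -> float:
    """Full jitter: uniform in [0, min(cap, base * 2**attempt))."""
    return random.random() * min(cap, base * (2 ** attempt))


def call_with_retries(
    call: Callable[[], T],
    max_attempts: int = 4,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `call`, retrying transient errors up to `max_attempts` tries."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt))
    raise RuntimeError("max_attempts must be at least 1")
```

Note what the wrapper returns on a recovered call: a plain success, with no record of the extra attempts or the time spent sleeping. That missing information is where the next three failure modes hide.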
It isn't. Retries change the failure distribution but they introduce three new failure modes that the provider's SLA does not cover:
Latency tax on the happy-but-slow path. A retry-on-timeout policy turns a slow call into a slow-plus-retry call. If 5% of calls hit a 10-second timeout and retry to a successful 2-second response, each of those calls now takes 12 seconds, the latency tail from p95 upward shifts by a multiple, and the user perceives a slowdown that no per-call SLA captures. The eval rig running serial calls against a warm provider tier never sees this.
Retry storms feeding back into rate limits. A degraded provider region returns 429s. Your retry policy fires three more requests at the same region. Each agent that does this multiplies its load by 3-4x. If 10% of your fleet is retrying at any given moment during a degradation, the effective load goes up sharply right when the provider can least handle it. The provider has not violated their SLA (they are returning correct 429s), but your system has self-amplified the problem. Retry storms that multiply load 10x within seconds are a documented production pattern in agent fleets.
Cost amplification on the unhappy path. A 12-call task that retries three times on the worst call becomes a 15-call task. Multiply across a fleet under degradation and you have a measurable monthly cost line item that exists only because the provider was slightly unhealthy. Your finance team sees the bill before your SRE team sees the cause.
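The three failure modes above can be put into rough numbers with the figures already used in this section. This is a back-of-envelope sketch: the function names are illustrative, calls within a task are assumed to run serially and to time out independently, and the load model ignores the feedback loop in which extra load causes more 429s.

```python
def p_task_hits_timeout(p_timeout: float, calls_per_task: int) -> float:
    """Chance that at least one call in a task hits the timeout."""
    return 1.0 - (1.0 - p_timeout) ** calls_per_task


def expected_timeout_delay(p_timeout: float, calls_per_task: int,
                           timeout_s: float) -> float:
    """Expected seconds per task spent waiting on calls that time out."""
    return calls_per_task * p_timeout * timeout_s


def fleet_load_multiplier(fraction_retrying: float,
                          extra_requests: int) -> float:
    """Offered load relative to baseline, first-order (no feedback)."""
    return 1.0 + fraction_retrying * extra_requests


def cost_multiplier(base_calls: int, retried_calls: int) -> float:
    """Billed calls per task relative to the no-retry case."""
    return (base_calls + retried_calls) / base_calls


print(f"tasks touching a timeout: {p_task_hits_timeout(0.05, 12):.0%}")
print(f"expected delay per task:  {expected_timeout_delay(0.05, 12, 10.0):.1f} s")
print(f"fleet load, 10% retrying: {fleet_load_multiplier(0.10, 3):.2f}x")
print(f"fleet load, all retrying: {fleet_load_multiplier(1.00, 3):.2f}x")
print(f"cost, 12 calls + 3 tries: {cost_multiplier(12, 3):.2f}x")
```

The first line is the one that tends to surprise: a 5% per-call timeout rate means close to half of 12-call tasks wait out at least one full timeout.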
The retry policy is doing exactly what it was designed to do. It is just that "transient errors" and "user-perceived task failures" are different events, and a fix calibrated for one does not improve the other.
The contract clauses that should exist but don't
Most LLM provider contracts inherit their SLA structure from synchronous web-API contracts. They speak in availability percentages and credit thresholds at the per-request boundary. The clauses that would actually protect an agent workload are usually missing, and asking for them at renewal is one of the highest-leverage things a procurement team can do once they understand the math.
Rate-limit burst headroom for fanout. A standard rate-limit clause is a steady-state ceiling: N requests per minute. Agents do not generate steady-state load — they generate bursty load shaped like the user's task arrival pattern, with each task firing 10-20 calls in rapid succession. A clause that gives a defined burst budget over a short window (say, 4x the steady-state for 15 seconds) is far more useful than a 50% bump in the steady-state ceiling.
Retry-tolerant credit for transient failures. Most SLA credits trigger on per-request error rates above some threshold, computed against the provider's view of "request." Negotiate credit accounting that acknowledges retried-and-eventually-succeeded calls as partial failures — the provider's metric counts them as a clean success on the second attempt, but they cost you latency and tokens.
Inter-token latency ceilings, not just time-to-first-token. Streaming SLAs typically cover time-to-first-token. Users perceive inter-token jitter too, not just first-token latency, and a provider can hit their TTFT ceiling while emitting a 3-second pause mid-response that reads as "broken" to the user. Push for p99 inter-token latency as a contract metric, or at minimum as a published SLI.
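Each of these three clauses maps to something you can measure on your own side before the negotiation. A minimal sketch, assuming you log request timestamps, attempts per call, and per-token arrival times; all function names and the nearest-rank percentile choice are illustrative, and the 4x / 15-second defaults simply mirror the example figures above.

```python
import math
from typing import Sequence


def within_burst_budget(request_times: Sequence[float],
                        steady_per_minute: float,
                        burst_factor: float = 4.0,
                        window_s: float = 15.0) -> bool:
    """True if no sliding window of `window_s` exceeds the burst budget."""
    allowed = steady_per_minute / 60.0 * window_s * burst_factor
    times = sorted(request_times)
    start = 0
    for end, t in enumerate(times):
        while t - times[start] >= window_s:
            start += 1  # drop requests that fell out of the window
        if end - start + 1 > allowed:
            return False
    return True


def retry_aware_rates(calls: Sequence[tuple[int, bool]]) -> tuple[float, float]:
    """Return (eventual success rate, first-attempt success rate).

    Each item is (attempts_made, finally_succeeded) for one logical call.
    """
    n = len(calls)
    eventual = sum(ok for _, ok in calls) / n
    first_try = sum(ok and attempts == 1 for attempts, ok in calls) / n
    return eventual, first_try


def p99_inter_token_gap(token_times: Sequence[float]) -> float:
    """Nearest-rank p99 of the gaps between consecutive token arrivals."""
    gaps = sorted(b - a for a, b in zip(token_times, token_times[1:]))
    if not gaps:
        raise ValueError("need at least two token timestamps")
    rank = math.ceil(0.99 * len(gaps))
    return gaps[rank - 1]


# A stream with a fast first token and a 3-second stall mid-response.
stream = [0.20, 0.25, 0.30, 3.30, 3.35]
print(f"TTFT-style view: first token at {stream[0]:.2f} s")
print(f"p99 inter-token gap: {p99_inter_token_gap(stream):.2f} s")
```

The gap between the two numbers returned by `retry_aware_rates` is the quantity the retry-tolerant credit clause is asking the provider to acknowledge.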
