
3 posts tagged with "sla"


Your Provider's 99.9% SLA Is Measured at the Wrong Boundary for Your Agent

11 min read
Tian Pan
Software Engineer

A model provider publishes a 99.9% availability SLA. The procurement team frames it as "three nines, roughly nine hours of downtime per year, acceptable for a non-tier-zero workload." Six months later the agent feature ships and the on-call dashboard shows a user-perceived task-success rate around 98% — a number nobody wrote into a contract, nobody can find on the provider's status page, and nobody owns. The provider is meeting their SLA. The product is missing its SLO. Both are true at the same time, and the gap is not a bug — it is arithmetic.

The arithmetic is the part most teams skip. A provider's 99.9% is measured against a synchronous-request workload — one user, one prompt, one response, one billing event. An agent does not generate that workload. A single user-perceived task fans out into 8 to 20 inference calls, retries on transient errors, hedges on slow ones, and aggregates partial outputs. Each of those calls is an independent draw against the provider's failure distribution, and the task fails if any essential call fails. The boundary the SLA covers and the boundary the user feels are not the same boundary.
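A back-of-the-envelope sketch of that compounding (the per-call availability and the fan-out sizes below are assumptions taken from the figures above, not measured numbers):

```python
# Back-of-the-envelope compounding: each call is an independent draw against the
# provider's failure distribution, and the task fails if any essential call fails.

def task_success_rate(per_call_availability: float, calls_per_task: int) -> float:
    """Probability that every call in one user-perceived task succeeds."""
    return per_call_availability ** calls_per_task

per_call = 0.999  # the provider's three-nines SLA, measured per request

for fan_out in (1, 8, 15, 20):
    print(f"{fan_out:>2} calls/task -> {task_success_rate(per_call, fan_out):.3%} task success")

# Prints:
#  1 calls/task -> 99.900% task success
#  8 calls/task -> 99.203% task success
# 15 calls/task -> 98.510% task success
# 20 calls/task -> 98.019% task success
```

At 15 to 20 calls per task, three nines per call lands right at the roughly 98% the on-call dashboard shows.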

What 99.9% Uptime Means When Your Model Is Occasionally Wrong

10 min read
Tian Pan
Software Engineer

A telecom company ships an AI support chatbot with 99.99% availability and sub-200ms response times — every traditional SLA metric is green. It is also wrong on 35% of billing inquiries. No contract clause covers that. No alert fires. The customer just churns.

This is the watermelon effect for AI: systems that look healthy on the outside while quietly rotting inside. Traditional reliability SLAs — uptime, error rate, latency — were built for deterministic systems. They measure whether your service answered, not whether the answer was any good. Shipping an AI feature under a traditional SLA is like guaranteeing that every email your support team sends will be delivered, without any commitment that the replies make sense.
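A minimal sketch of that gap, assuming a request log with two hypothetical fields: one recording whether the service answered at all, one recording whether a later grading step judged the answer correct:

```python
from dataclasses import dataclass

@dataclass
class Request:
    answered: bool        # the service returned a response within the latency budget
    answer_correct: bool  # a grading step judged the content of the reply correct

def availability_sli(log: list[Request]) -> float:
    """Fraction of requests that got any answer at all -- the traditional SLA view."""
    return sum(r.answered for r in log) / len(log)

def correctness_sli(log: list[Request]) -> float:
    """Fraction of answered requests whose answer was actually right."""
    answered = [r for r in log if r.answered]
    return sum(r.answer_correct for r in answered) / len(answered)

# Every request is answered, but 35% of the answers are wrong, as in the billing example.
log = [Request(answered=True, answer_correct=(i % 100 >= 35)) for i in range(1000)]

print(f"availability SLI: {availability_sli(log):.2%}")  # 100.00% -- every dashboard is green
print(f"correctness SLI:  {correctness_sli(log):.2%}")   # 65.00% -- and no alert fires on this
```

Populating the second field is the hard part: it needs human review, an eval harness, or a model-graded check, which is exactly the machinery a traditional SLA never asks for.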

The Provider Reliability Trap: Your LLM Vendor's SLA Is Now Your Users' SLA

9 min read
Tian Pan
Software Engineer

In a formal incident report, Zendesk stated that from June 10 through June 11, 2025, customers lost access to all Zendesk AI features for more than 33 consecutive hours. The engineering team's remediation steps were empty — there was nothing to do. The outage was caused entirely by their upstream LLM provider going down, and Zendesk had no architectural path to restore service without it.

This is the provider reliability trap in its clearest form: you ship a feature, make it part of your users' workflows, promise availability through implicit or explicit SLA commitments, and then discover that your entire reliability posture is bounded by a dependency you don't control, can't fix, and may not have formally evaluated before launch.
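One way to see the bound: with a hard serial dependency and no fallback path, the feature is up only when every dependency is up, so its availability cannot exceed the product of the individual availabilities. A rough sketch with purely illustrative numbers:

```python
def composite_availability(dependencies: dict[str, float]) -> float:
    """Availability of a feature that needs every listed dependency up at once."""
    result = 1.0
    for availability in dependencies.values():
        result *= availability
    return result

# Purely illustrative figures, not any vendor's real numbers.
feature_dependencies = {
    "our_api_layer": 0.9995,
    "llm_provider":  0.999,   # the provider's published SLA, measured at their boundary
    "vector_store":  0.9995,
}

a = composite_availability(feature_dependencies)
hours_down_per_year = (1 - a) * 365.25 * 24

print(f"composite availability: {a:.4%}")                           # ~99.80%
print(f"expected downtime/year: {hours_down_per_year:.1f} hours")   # ~17.5 hours

# When the llm_provider term goes to zero, the whole product goes to zero --
# which is the 33-hour outage above.
```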