What 99.9% Uptime Means When Your Model Is Occasionally Wrong
A telecom company ships an AI support chatbot with 99.99% availability and sub-200ms response times — every traditional SLA metric is green. It is also wrong on 35% of billing inquiries. No contract clause covers that. No alert fires. The customer just churns.
This is the watermelon effect for AI: systems that look healthy on the outside while quietly rotting inside. Traditional reliability SLAs — uptime, error rate, latency — were built for deterministic systems. They measure whether your service answered, not whether the answer was any good. Shipping an AI feature under a traditional SLA is like guaranteeing that every email your support team sends will be delivered, without any commitment that the replies make sense.
The gap matters because AI features are increasingly load-bearing for B2B products. When an AI system generates an invoice summary, classifies a contract clause, or drafts a compliance checklist, "it returned HTTP 200" is not a useful success criterion. The customer cares whether the output was correct, and they will hold you accountable for it — even if your contract doesn't.
Why Traditional SLAs Break for AI
Traditional SLAs work for deterministic systems because success is binary and observable. A database either returns a row or it doesn't. An API either responds within 300ms or it doesn't. You can measure it, log it, and alert on it. The SLA reflects ground truth.
AI output quality is none of those things. The same prompt can produce responses that range from excellent to factually wrong, depending on phrasing, context window state, model version, temperature, and dozens of other variables. Success is a distribution, not an event. And critically, detecting failure often requires a human in the loop — or a second AI evaluation pass — because the system returns a confident-sounding 200 with a hallucinated answer.
There are also second-order problems. Uptime SLAs measure availability, not quality degradation. A model provider can roll out a new version that reduces your task-specific accuracy by 15% while keeping p99 latency unchanged. Every dashboard is green; your product silently gets worse. Aggregate benchmarks compound this — a model that scores well on general benchmarks may perform significantly worse on your specific task distribution, and you won't discover this from vendor SLA metrics alone.
The legal exposure compounds the engineering problem. Courts have begun treating AI-generated outputs as factual representations in the context in which they were delivered. If your AI tool generates a financial summary that a customer relies on, hallucinations are not just a quality problem — they can be a breach of contract or misrepresentation problem. And 88% of AI vendors cap their own liability at monthly subscription fees, meaning the downstream risk lands on you.
What an Honest AI Quality Commitment Looks Like
The answer is not to promise perfect outputs. It is to define what "good enough" means for a specific task, and to make that commitment explicit and measurable.
The basic structure of an honest AI quality commitment has three parts:
A task scope definition. The quality floor applies to a named task with defined inputs, not to the model in general. "The system will correctly extract invoice date, vendor name, and total amount with ≥ 95% accuracy on standard invoice formats" is a defensible commitment. "The AI will be accurate" is not.
A measurement protocol. How do you know when the floor has been breached? Sampling-based evaluation with human or LLM judges, applied on a defined cadence (weekly spot-checks on 1% of production traffic), gives you a repeatable measurement method. Ad hoc internal assessments do not.
A remediation trigger. When quality drops below the floor, what happens? Escalation to human review, rollback to a prior model version, customer notification — these need to be defined in advance, not negotiated after a customer complaint.
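Those three parts can be written down as a single structured artifact that both engineering and legal can review. A minimal sketch in Python; the type and field names are illustrative, not drawn from any standard or framework:

```python
from dataclasses import dataclass

@dataclass
class QualityCommitment:
    """An explicit, measurable quality floor for one named AI task."""
    task: str                     # named task the floor applies to
    in_scope_inputs: list[str]    # defined input types; all else is out of scope
    quality_floor: float          # e.g. 0.95 field-level accuracy
    sample_rate: float            # fraction of production traffic evaluated
    eval_cadence_days: int        # how often the measurement protocol runs
    remediation_steps: list[str]  # actions agreed in advance, not after a complaint

invoice_extraction = QualityCommitment(
    task="invoice field extraction",
    in_scope_inputs=["standard invoice formats"],
    quality_floor=0.95,
    sample_rate=0.01,             # weekly spot-checks on 1% of traffic
    eval_cadence_days=7,
    remediation_steps=[
        "escalate flagged outputs to human review",
        "roll back to prior model version",
        "notify affected customers",
    ],
)
```

Writing the commitment as data rather than prose also makes it loadable by the monitoring stack described later, so the contract and the alerting share one source of truth.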
For internal SLOs, the frame is similar but the emphasis shifts to enabling teams to detect degradation before customers do.
Designing Internal SLOs That Actually Catch Quality Failures
Most engineering teams that adopt SLOs for AI features make the same mistake: they measure latency and error rate because those are easy to instrument, and they assume quality is someone else's problem.
A useful quality SLO has three properties. First, it measures something that correlates with task success, not just request completion. For an extraction task, this might be field-level accuracy on a sample of outputs evaluated against ground truth. For a summarization task, it might be a relevance score from an LLM judge, with periodic human calibration to verify the judge is tracking reality. For a classification task, it is precision and recall on a held-out test set that is refreshed quarterly as the input distribution shifts.
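For the extraction case, field-level accuracy reduces to a small comparison loop over sampled outputs and their ground truth. A minimal sketch with hypothetical field names and exact-match scoring; real pipelines usually normalize values (dates, currency) before comparing:

```python
def field_accuracy(predictions, ground_truth, fields):
    """Fraction of (record, field) pairs extracted correctly.

    predictions / ground_truth: parallel lists of dicts keyed by field name.
    """
    correct = total = 0
    for pred, truth in zip(predictions, ground_truth):
        for f in fields:
            total += 1
            if pred.get(f) == truth.get(f):
                correct += 1
    return correct / total if total else 0.0

preds = [{"date": "2026-01-03", "vendor": "Acme",   "total": "120.00"},
         {"date": "2026-01-04", "vendor": "Globex", "total": "89.50"}]
truth = [{"date": "2026-01-03", "vendor": "Acme",   "total": "120.00"},
         {"date": "2026-01-04", "vendor": "Globex", "total": "99.50"}]

acc = field_accuracy(preds, truth, ["date", "vendor", "total"])  # 5 of 6 fields match
```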
Second, it uses an error budget rather than a hard threshold. An error budget treats quality headroom as a finite resource. If your quality floor is 92% accuracy on invoice extraction, and your last 30-day sample is at 94%, you have 2 percentage points of budget to spend on experimentation, model rollouts, and edge cases before you breach the commitment. When the budget is low, you stop experimenting and stabilize. When it is high, you have room to take risk. This is the same framing SRE teams use for uptime — it aligns incentives without requiring zero tolerance for failure.
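The error budget arithmetic is simple enough to state in code. A sketch using the 92% floor and 30-day sample from above; how you collect the rolling window of per-output correctness judgments is assumed, not prescribed:

```python
def error_budget_remaining(floor: float, samples: list[bool]) -> float:
    """Percentage points of accuracy headroom above the committed floor.

    samples: per-output correctness from the rolling 30-day evaluation window.
    A negative result means the commitment is already breached.
    """
    observed = sum(samples) / len(samples)
    return observed - floor

# Floor of 92% accuracy; last 30-day sample measured 94 correct out of 100.
window = [True] * 94 + [False] * 6
budget = error_budget_remaining(0.92, window)
print(f"{budget:.2%} of budget remaining")  # prints "2.00% of budget remaining"
```

A release process can then gate on `budget`: block risky rollouts when it is near zero, allow experimentation when it is comfortably positive.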
Third, it catches version drift. Model providers update their models, sometimes with no warning. Prompt behavior shifts. Input distribution changes as your user base grows. A quality SLO that only runs at launch will not catch the slow degradation that occurs over months. Shadow evaluation — routing a small slice of production traffic to an evaluation pipeline that scores outputs against your quality criteria — gives continuous visibility into whether your quality floor is still being met, independent of vendor-side metrics.
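Shadow evaluation can be as simple as tee-ing a sampled fraction of traffic into an evaluation queue at serving time. A hypothetical sketch; `model` and `eval_queue` stand in for whatever your serving stack actually uses, and in production the enqueue would be non-blocking:

```python
import random

SHADOW_RATE = 0.02  # score roughly 2% of production traffic

def handle_request(prompt, model, eval_queue):
    """Serve the request normally; tee a sampled copy for offline scoring."""
    output = model(prompt)
    if random.random() < SHADOW_RATE:
        # In a real system this write goes to a queue or log, off the hot path.
        eval_queue.append({"prompt": prompt, "output": output})
    return output
```

Because the sample is drawn from live traffic rather than a fixed benchmark, it tracks the input distribution as it drifts, which is exactly what vendor-side metrics cannot do for you.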
The Measurement Window Problem
One subtlety that bites teams in practice: the measurement window for quality SLOs is fundamentally different from the measurement window for latency SLOs.
Latency SLOs can be evaluated on every request. A quality SLO based on human evaluation can realistically only be evaluated on a sample, with a lag. If your evaluation cadence is weekly, you might not detect a quality regression until it has been running for 5 days. That is not a failure of the SLO design — it is an honest reflection of the cost of measurement. The design implication is that you need leading indicators that fire faster: automated LLM-judge scoring on 100% of outputs, with the human evaluation as a calibration layer that confirms the judge is still correlated with your actual quality standard.
The other measurement window problem is burst versus sustained failure. A traditional window-based SLO treats a 2-minute degradation and a 2-hour one the same, as long as both fall inside a single measurement window. For AI quality, burst failures are worse than their aggregate numbers suggest: a 10-minute window where 60% of outputs are wrong can do more customer damage than a 48-hour window where 8% are wrong. Your SLO design should distinguish sustained degradation below the floor from acute accuracy events, and the remediation trigger for each should be different.
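One way to encode that distinction is to run two rolling windows with separate thresholds: a short window with a low acute threshold, and a long window held against the contractual floor. A sketch with illustrative window sizes and thresholds:

```python
from collections import deque

class QualityMonitor:
    """Two alarms over scored outputs: an acute one on a short recent
    window, and a sustained one on a long rolling window vs. the floor."""

    def __init__(self, floor=0.92, burst_floor=0.60,
                 burst_window=50, sustained_window=5000):
        self.floor = floor              # contractual quality floor
        self.burst_floor = burst_floor  # acute threshold, far below the floor
        self.short = deque(maxlen=burst_window)
        self.long = deque(maxlen=sustained_window)

    def record(self, correct: bool):
        """Record one scored output; return an alarm name or None."""
        self.short.append(correct)
        self.long.append(correct)
        if (len(self.short) == self.short.maxlen
                and sum(self.short) / len(self.short) < self.burst_floor):
            return "acute"        # e.g. page on-call, consider immediate rollback
        if (len(self.long) == self.long.maxlen
                and sum(self.long) / len(self.long) < self.floor):
            return "sustained"    # e.g. start the contractual remediation clock
        return None
```

The acute path can be wired to rollback automation while the sustained path feeds the customer-facing remediation process, matching the different triggers the SLO design calls for.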
The B2B Contract Language That Actually Works
When AI features are part of a B2B contract, the quality commitment needs to be explicit and bounded. Vague language — "the system will provide accurate responses" — creates open-ended liability that neither side can evaluate.
The patterns that hold up under scrutiny:
Defined task scope with in/out-of-scope carve-outs. The commitment covers identified task types (invoice extraction, not general document Q&A). Out-of-distribution inputs are explicitly excluded from the quality guarantee. This prevents customers from testing edge cases that were never part of the scope and then claiming SLA breach.
Measurement-based quality floors. "≥ 92% field-level accuracy on samples evaluated against customer-provided ground truth, measured monthly" is auditable. "High accuracy" is not. The contract should specify who conducts the evaluation, how ground truth is established, and what constitutes a valid sample.
Escalation triggers and remediation timelines. If the measured accuracy falls below the floor, what is the SLA for remediation? A reasonable structure: notify the customer within 48 hours of detection, provide a root-cause analysis within 5 business days, and implement a fix or rollback within 14 days. Without explicit timelines, "we're working on it" is the default.
Explicit model-change notification. If you are using a third-party model, changes to that model can affect your quality commitments. Your contracts should require you to notify customers before major model version changes and re-validate quality on their task distribution before cutover.
What these clauses share is that they make the quality commitment specific, measurable, and time-bounded — the same properties that make uptime SLAs enforceable. The goal is not to promise perfection; it is to give customers something concrete they can evaluate and give your team something concrete they can build toward.
Building the Internal Infrastructure
None of this works without the infrastructure to measure it. The minimal stack for a team shipping a quality-committed AI feature:
An evaluation dataset. A set of input/output pairs with human-verified ground truth, representative of your production task distribution, refreshed at least quarterly. This is the hardest part to build and the first thing teams skip. Without it, you cannot compute accuracy — you can only compute token counts.
An automated scoring pipeline. A system that evaluates a sample of production outputs against your quality criteria, with results written to a dashboard. For tasks where ground truth is available, this is direct comparison. For tasks where it is not (summarization, drafting), this is an LLM judge with human calibration on a subset.
A drift alert. A threshold on the automated scoring that fires when your observed quality drops below a warning level — say, 94% when your floor is 92%. The warning level gives you time to investigate before you breach the contractual commitment.
A model change runbook. A documented process for evaluating quality on your task distribution before any model version change goes to production. This should be treated with the same rigor as a database schema migration.
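Two of the pieces above, the drift alert and the runbook's pre-cutover check, reduce to small threshold functions. A sketch using the 92% floor and 94% warning level from above; the 1-point regression margin is an illustrative choice, not a standard:

```python
def drift_status(observed: float, floor: float = 0.92,
                 warn: float = 0.94) -> str:
    """Three-state drift alert: 'ok' above the warning level, 'warn' in
    the buffer zone, 'breach' below the contractual floor."""
    if observed < floor:
        return "breach"
    if observed < warn:
        return "warn"
    return "ok"

def approve_cutover(candidate_acc: float, serving_acc: float,
                    floor: float = 0.92, max_regression: float = 0.01) -> bool:
    """Model change runbook gate: block cutover if the candidate model dips
    below the floor or regresses materially against the serving model."""
    return (candidate_acc >= floor
            and candidate_acc >= serving_acc - max_regression)
```

The warning band is what buys investigation time: `warn` fires while there is still error budget left, so the team can act before `breach` starts the contractual remediation clock.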
The Mindset Shift
The deeper change is not technical — it is about how engineering teams think about what they are promising when they ship an AI feature.
For deterministic systems, shipping means the system works. For probabilistic systems, shipping means the system works well enough, often enough, for a defined set of inputs. The "well enough" and "often enough" need to be quantified before you commit to a customer, not after they complain.
This is uncomfortable because it forces you to admit, explicitly, that your system will sometimes be wrong. But the alternative — shipping under a vague quality promise and handling failures ad hoc — is worse. Customers who sign explicit quality floor agreements know what they bought. Customers who feel misled by implicit accuracy promises are the ones who churn and litigate.
The teams that are shipping durable AI products in 2026 are the ones that treated quality commitments as a first-class engineering problem from the start: designed the measurement infrastructure before the launch date, negotiated contract language before the customer asked, and built the error budget into their release process before the first model version change. That discipline is not a constraint on shipping AI features — it is what makes AI features shippable at all.
