SLOs for Non-Deterministic Systems: Defining Reliability When Every Response Is Different

· 8 min read
Tian Pan
Software Engineer

Your AI feature returns HTTP 200, completes in 180ms, and produces valid JSON. By every traditional SLI, the request succeeded. But the answer is wrong — a hallucinated product spec, a fabricated legal citation, a subtly incorrect calculation. Your monitoring is green. Your users are furious.

This is the fundamental disconnect that breaks SRE for AI systems. Traditional reliability engineering assumes a successful execution produces a correct result. Non-deterministic systems violate that assumption on every request. The same prompt, same context, same model version can produce a different — and differently wrong — answer each time.

A 2025 McKinsey survey found that 51% of organizations using AI experienced negative consequences, with nearly one-third attributing the issues to inaccuracy. Not downtime. Not latency. Inaccuracy. The systems were running perfectly while producing wrong answers.

If you're operating AI features in production, you need a new class of reliability objectives — ones that measure whether the system is right, not just whether it's running.

Why Traditional SLIs Miss the Dominant Failure Mode

The standard SRE toolkit — availability, latency percentiles, error rate — was designed for deterministic systems. A database query either returns the correct rows or throws an error. A payment API either processes the charge or fails with a status code. Success and correctness are the same thing.

For LLM-powered features, they diverge completely. Consider the failure modes that traditional SLIs cannot detect:

  • Hallucination: The model invents facts with high confidence. HTTP 200, valid schema, completely wrong content.
  • Semantic drift: A model update subtly changes the tone or reasoning style. No errors, but users notice the product "feels different."
  • Relevance decay: The retrieval layer returns stale documents, and the model confidently synthesizes an outdated answer.
  • Safety violations: The model produces harmful or policy-violating content. Structurally perfect, semantically catastrophic.

You can have 99.99% availability and 50ms p99 latency while your AI feature is actively damaging user trust. Traditional SLIs will never page you for this.

Defining Semantic SLIs

The solution is to introduce a second axis of indicators that measure output quality directly. These are semantic SLIs — metrics that assess what the system said, not how fast it said it.

Correctness rate. What percentage of responses pass automated semantic evaluation? This requires building or adopting evaluation frameworks that can score outputs against ground truth or rubrics. It's expensive to set up but non-negotiable for production AI.

Hallucination rate. What fraction of responses contain fabricated claims? This can be measured through grounding checks — verifying that assertions in the output trace back to source documents or known facts.
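A grounding check can be sketched with nothing more than lexical overlap: flag output sentences whose content words rarely appear in the retrieved sources. This is only an illustrative proxy — production systems typically use entailment or NLI models — and the threshold here is an assumption, not a recommendation.

```python
# Naive grounding check: score the fraction of output sentences whose
# content words mostly appear in the source documents. A real system
# would use an entailment model; lexical overlap is a crude stand-in.
import re

def grounding_score(output: str, sources: list[str], threshold: float = 0.5) -> float:
    """Fraction of output sentences considered grounded in the sources."""
    source_vocab = set(re.findall(r"[a-z0-9]+", " ".join(sources).lower()))
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", output.strip()) if s]
    grounded = 0
    for sent in sentences:
        # Only consider words long enough to carry content.
        words = [w for w in re.findall(r"[a-z0-9]+", sent.lower()) if len(w) > 3]
        if not words:
            grounded += 1  # no content words: nothing to fabricate
            continue
        overlap = sum(1 for w in words if w in source_vocab) / len(words)
        if overlap >= threshold:
            grounded += 1
    return grounded / max(len(sentences), 1)
```

A sentence like "The product ships with a lifetime battery guarantee" that shares no content words with the retrieved documents drags the score down, even though the response as a whole reads fluently.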

Policy compliance rate. How often does the model violate content policies, leak PII, or ignore guardrails? This is measurable with rule-based classifiers and is often the most straightforward semantic SLI to implement.

Schema-valid output rate. For structured generation tasks, what percentage of outputs conform to the expected schema? A JSON response that parses but contains null where a required field should be is semantically invalid even if structurally valid.
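The null-required-field case is easy to miss because `json.loads` succeeds. A minimal validity check might look like this, with the field names as illustrative placeholders:

```python
# Schema-valid output rate: a response can parse as JSON yet still be
# semantically invalid. This check treats a null or missing required
# field as a failure, which parsing alone would not catch.
import json

REQUIRED_FIELDS = ["product_id", "price"]  # illustrative schema

def is_schema_valid(raw: str) -> bool:
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict):
        return False
    # Required fields must be present AND non-null.
    return all(obj.get(field) is not None for field in REQUIRED_FIELDS)
```

In practice you would hand this to a real schema validator (JSON Schema, Pydantic, protobuf), but the principle is the same: structural success is not semantic success.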

Workflow completion rate. For agentic systems, what percentage of multi-step workflows complete successfully or explicitly escalate, rather than silently failing partway through? A 10-step agent pipeline at 90% per-step success yields only ~35% end-to-end completion — firmly in prototype territory.
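The compounding math is worth making explicit, because it is the reason per-step reliability targets must be far stricter for agents than for single-shot features:

```python
# Per-step reliability decays multiplicatively across an agent pipeline:
# ten steps at 90% each complete end-to-end only ~35% of the time.
def end_to_end_completion(per_step_success: float, steps: int) -> float:
    return per_step_success ** steps
```

Inverting the formula shows what it takes to fix this: a 10-step workflow needs roughly 99% per-step success just to reach ~90% end-to-end completion.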

The key insight: these metrics are inherently probabilistic. You cannot expect 100% correctness from a non-deterministic system. That is exactly why you need formal SLOs around them.

Setting Error Budgets When Your Baseline Is 85%, Not 99.9%

Here's where traditional SRE culture clashes hardest with AI reality. In classical SRE, an error budget of 0.1% feels generous. For AI systems, your baseline accuracy might be 85%, and getting to 90% could be a major engineering effort.

This requires rethinking error budgets from first principles.

Start with your actual baseline. Measure current performance across a representative sample before setting any targets. If your RAG pipeline currently answers 82% of queries correctly, a 95% SLO is aspirational fiction, not a useful reliability target.

Set budgets per failure category. Not all failures are equal. A hallucinated legal citation is catastrophic. A slightly verbose response is cosmetic. Define separate error budgets:

  • Factual accuracy: 5% error budget (95% correct)
  • Safety violations: 0.1% error budget (99.9% compliant)
  • Schema validity: 1% error budget (99% valid)
  • Relevance scoring: 10% error budget (90% relevant)
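Treating these budgets as configuration makes consumption trivially computable. A sketch, with category names mirroring the list above:

```python
# Per-category error budgets as configuration. Values mirror the list
# above: the budget is the tolerated failure rate for that category.
ERROR_BUDGETS = {
    "factual_accuracy": 0.05,
    "safety_violations": 0.001,
    "schema_validity": 0.01,
    "relevance": 0.10,
}

def budget_consumed(category: str, failures: int, total: int) -> float:
    """Fraction of the category's error budget consumed in this window."""
    observed_rate = failures / total
    return observed_rate / ERROR_BUDGETS[category]
```

Note how differently the same raw failure count lands per category: 25 factual errors in 1,000 responses consumes half the accuracy budget, while 25 safety violations would overrun that budget twenty-five-fold.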

Use sliding window baselines. Production LLM performance naturally fluctuates. A fixed threshold either fires too often during normal variance or stays silent during real degradation. Compare current performance against a rolling 7-day or 30-day window. React when the mean shifts beyond two standard deviations, not when it crosses an arbitrary line.
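The two-standard-deviation comparison needs only the window of recent daily scores, which the standard library can handle directly:

```python
# Sliding-window baseline: compare today's score against a rolling
# window of daily scores and flag shifts beyond two standard deviations,
# rather than against a fixed threshold.
import statistics

def shifted_beyond_two_sigma(window: list[float], current: float) -> bool:
    mean = statistics.fmean(window)
    stdev = statistics.stdev(window)
    return abs(current - mean) > 2 * stdev
```

With a stable window hovering around 90, a score of 90.5 is normal variance and stays silent; a drop to 85 stands far outside the band and fires.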

Budget burn rate matters more than absolute level. A system at 88% accuracy is concerning. A system that dropped from 92% to 88% in 48 hours is an incident. Track the velocity of error budget consumption, not just the current level. Alert at 50%, 75%, and 90% of budget consumption thresholds.
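The escalating thresholds can be encoded as a simple lookup; the severity labels here are illustrative, not prescriptive:

```python
# Burn-rate alerting: escalate as more of the error budget is consumed,
# at the 50/75/90% thresholds named above. Labels are illustrative.
ALERT_THRESHOLDS = [(0.90, "page"), (0.75, "ticket"), (0.50, "notify")]

def alert_level(consumed_fraction):
    """Map fraction of budget consumed to an alert severity, or None."""
    for threshold, level in ALERT_THRESHOLDS:
        if consumed_fraction >= threshold:
            return level
    return None
```

Pairing this with the velocity check (how fast `consumed_fraction` grew over the last 48 hours) is what separates "concerning" from "incident."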

The Alerting Architecture for Non-Deterministic Systems

Traditional alerting fires on binary signals: the error rate exceeded 1%, latency broke the p99 threshold. For AI systems, you need an alerting architecture that distinguishes genuine degradation from normal variance.

Layer 1: Infrastructure alerts (traditional SLIs). These still matter. If the model API returns 500s or latency spikes, your existing monitoring handles it. Keep these exactly as they are.

Layer 2: Quality regression alerts. Run continuous evaluation against a golden test set — a curated set of inputs with known-good outputs. When the score on this test set drops by more than a threshold (say, 5 points on a 100-point scale), fire an alert. This catches model-side regressions that produce no infrastructure signals.
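The regression check itself is a one-liner once the evaluation scores exist; the 5-point default below matches the threshold suggested above and is an assumption to tune per system:

```python
# Quality-regression alert: fire when the mean score on the golden test
# set drops more than `max_drop` points below the established baseline.
# How the per-case scores are produced (LLM-as-judge, rubric, exact
# match) is up to the evaluation pipeline.
def regression_alert(baseline: float, scores: list[float], max_drop: float = 5.0) -> bool:
    current = sum(scores) / len(scores)
    return (baseline - current) > max_drop
```

The hard part is not this function but keeping the golden set representative, which is why the final section feeds production failures back into it.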

Layer 3: Distribution shift alerts. Monitor the statistical distribution of model outputs: confidence scores, token probabilities, output length distributions, embedding space clustering. A change in these distributions often precedes a measurable quality drop. A less confident or less consistent model is frequently the earliest signal that it's seeing unfamiliar patterns.
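As a concrete instance of this layer, here is a sketch using output length as the monitored distribution; the same pattern applies to confidence scores or embedding norms:

```python
# Distribution-shift check on one observable proxy: output length.
# Reports how many baseline standard deviations the recent mean has
# moved; a large shift often precedes a measurable quality drop.
import statistics

def length_distribution_shift(baseline_lengths: list[int], recent_lengths: list[int]) -> float:
    base_mean = statistics.fmean(baseline_lengths)
    base_stdev = statistics.stdev(baseline_lengths)
    return abs(statistics.fmean(recent_lengths) - base_mean) / base_stdev
```

A shift score above 2 or 3 here would not page anyone on its own, but it is a cheap early-warning signal worth feeding into the Layer 2 evaluation schedule.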

Layer 4: Business metric correlation. Tie your AI quality metrics to downstream business signals. If your AI support agent's resolution rate drops, or your AI search feature's click-through rate falls, these are lagging indicators that confirm what your leading indicators should have caught.

The key design principle: your on-call team should never have to debug prompts at 3am without a runbook. Every alert needs a corresponding playbook — what to check, how to verify, and what the deterministic fallback is if the AI feature needs to be degraded gracefully.

Detecting Silent Degradation

The most dangerous failure mode in AI systems is silent degradation — the model gets worse gradually, without triggering any threshold-based alert. Individual responses look plausible. Aggregate quality is eroding.

Detection requires treating evaluation as a continuous service, not a one-time test:

Canary evaluation pipelines. Maintain a regression suite of golden conversations, domain-specific QA pairs, safety red-team prompts, and business-critical workflows. Run it on a schedule — daily at minimum, hourly for critical systems. Compare results against your baseline.

Shadow scoring. For new model versions or prompt changes, run the new version alongside the current one and compare semantic outputs before full rollout. This is the AI equivalent of canary deployments, but you're comparing meaning, not just error rates.

Human override rate tracking. If humans are reviewing AI outputs, the rate at which they reject or substantially modify the AI's answer is one of your most valuable SLIs. A rising override rate is a high-fidelity signal of quality degradation that no automated metric can match.

Audit cadence. Instrument semantic metrics on at least one critical LLM workflow per quarter. Review the model and data supply chain — did the embedding model get updated? Did the retrieval corpus change? Did a provider-side model update ship without announcement? These supply chain changes are the most common root cause of silent degradation.

Making It Operational

Defining semantic SLOs is the easy part. Operating them is where teams get stuck. Here's a practical implementation sequence:

Week 1: Instrument. Pick your highest-value AI feature. Add logging that captures the full input-output pair, not just metadata. You cannot evaluate quality without the actual content.

Week 2: Baseline. Score a representative sample of production outputs manually or with an LLM-as-judge pipeline. Establish your current accuracy, hallucination rate, and relevance scores. This is your starting point, not your target.

Week 3: Automate evaluation. Build an automated scoring pipeline that runs against sampled production traffic. Start with simple heuristics (response length, schema validation, keyword presence) and layer in semantic evaluation (embedding similarity, LLM-as-judge scoring).
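The heuristic layer can start as small as this sketch, where `min_len` and `required_terms` are illustrative knobs rather than recommended values:

```python
# First layer of the automated scorer: fast structural heuristics run on
# every sampled response. Semantic evaluation (embedding similarity,
# LLM-as-judge) is layered on top of this, not shown here.
def heuristic_score(response: str, min_len: int = 20, required_terms: tuple = ()) -> float:
    """Return a 0..1 score from cheap structural checks."""
    checks = [
        len(response.strip()) >= min_len,  # not empty or truncated
        all(t.lower() in response.lower() for t in required_terms),
    ]
    return sum(checks) / len(checks)
```

Even checks this crude catch a surprising share of failures (empty completions, refusals where an answer was expected) and cost nothing to run on every request while the semantic pipeline samples a fraction of traffic.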

Week 4: Set SLOs and alert. Based on your baseline, set realistic SLOs with appropriate error budgets. Configure alerts on budget burn rate. Write runbooks for each alert.

Ongoing: Close the loop. When incidents occur, add the failing case to your golden test set. Your evaluation suite should grow from real production failures, making it increasingly hard for the same category of failure to recur undetected.

The organizations that get this right aren't the ones with the most sophisticated AI models. They're the ones that treat AI output quality with the same rigor they've always applied to uptime — with measurable objectives, meaningful budgets, and operational playbooks that let teams respond before users notice.

Non-deterministic systems don't get a pass on reliability. They just require us to redefine what reliability means.
