SLOs for Non-Deterministic AI Features: Setting Error Budgets When Wrong Is Probabilistic
Your AI feature is "up." Latency is fine. Error rate is 0.2%. The dashboard is green. But over the past two weeks, the summarization quality quietly dropped — outputs are now technically coherent but factually shallow, consistently missing the key detail users care about. Nobody filed a bug. No alert fired. And you won't know until the next quarterly review when retention numbers come in.
This is the failure mode that traditional SLOs are blind to. Availability and latency measure whether your service is responding — not whether it's responding well. For deterministic systems, those two things are nearly equivalent. For LLM features, they can diverge silently for weeks.
The solution isn't to abandon SLOs. It's to extend them. Behavioral quality SLOs give you the same error budget discipline — with explicit burn rates, alert thresholds, and rollback triggers — applied to the dimension that actually matters for AI features: output quality over time.
Why Infrastructure SLOs Don't Transfer
Traditional SLOs assume a clean binary: a request either succeeded or it failed. The denominator is request count; the numerator is "good" requests. For web services, "good" means status 200 with a response under your latency budget. Failure is observable and usually immediate.
LLM systems break this model in three ways.
First, failures are silent. An LLM that hallucinates a fact, truncates a summary, or generates an off-tone customer response doesn't return an error code. The request completes successfully. HTTP 200, latency nominal, error budget unaffected — while the user gets a wrong answer.
Second, quality is continuous, not binary. A response isn't simply correct or incorrect; it exists on a spectrum from excellent to subtly misleading to confidently wrong. A binary error rate can't express the difference between a system that's occasionally terrible and one that's consistently mediocre.
Third, degradation is gradual. Infrastructure failures spike: you go from 99.9% to 40% in seconds and alerts fire immediately. Quality degradation drifts: you go from 91% good responses to 83% over three weeks as a model update slightly shifts output distributions, or as your user base gradually shifts toward input patterns your prompts weren't tuned for. By the time it's visible to humans, you're already deep in the hole.
This last point is the trap. SRE practice learned to fear slow burns precisely because they erode error budgets without triggering alerts. The same dynamic applies to AI quality — you just need different instruments to detect it.
The Three Pillars of Behavioral Quality SLOs
A behavioral quality SLO needs to answer the same question a traditional SLO does: "over the past 30 days, what fraction of requests were good enough?" The difference is how you define "good enough."
Pillar 1: Semantic Regression Thresholds
The most direct behavioral SLO is a semantic correctness rate: the fraction of sampled outputs that score above a threshold on an automated evaluator. The evaluator can be rule-based (does the output contain required fields?), embedding-based (is the BERTScore against a reference above 0.85?), or LLM-as-judge (does a grading model rate the response as satisfactory?).
Each evaluator type has a different cost-accuracy tradeoff. Rule-based checks are deterministic and cheap — run them on 100% of traffic. Embedding similarity catches semantic drift that exact-match misses, at low cost per request. LLM-as-judge is the most sensitive but introduces its own non-determinism and cost; use it on a 5-10% sample or on traffic flagged by cheaper signals.
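The tiering can be sketched in a few lines. Everything here is illustrative: the required fields, the 5% base sample rate, and the function names are assumptions, not a real API.

```python
import random

REQUIRED_FIELDS = ("summary", "key_points")  # hypothetical rubric fields

def rule_score(output: dict) -> bool:
    # Cheap deterministic check: every required field present and non-empty.
    # Runs on 100% of traffic.
    return all(output.get(f) for f in REQUIRED_FIELDS)

def should_judge(output: dict, base_rate: float = 0.05) -> bool:
    # Route to the expensive LLM-as-judge tier: always when the cheap
    # check fails, otherwise only on a small random sample of passing traffic.
    if not rule_score(output):
        return True
    return random.random() < base_rate

good = {"summary": "Q3 revenue rose 4%.", "key_points": ["revenue"]}
bad = {"summary": "", "key_points": []}
print(rule_score(good), rule_score(bad))  # True False
```

The design choice is that cheap signals gate access to expensive ones: a failed rule check always earns a judge call, while healthy traffic is only spot-checked.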
The SLO target becomes: "95% of sampled responses score ≥ 4.0 on the quality rubric over a 28-day rolling window." The error budget is the 5% tolerance. When failing responses consume that budget faster than expected, you get a meaningful alert — one that fires on output degradation, not infrastructure failure.
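The rolling-window arithmetic is simple enough to sketch. The scores below are an invented sample, and the 1-to-5 rubric scale is an assumption:

```python
QUALITY_TARGET = 0.95    # SLO: 95% of sampled responses pass
RUBRIC_THRESHOLD = 4.0   # pass/fail cut on the judge's rubric score

def quality_rate(scores, threshold=RUBRIC_THRESHOLD):
    # Fraction of sampled responses at or above the rubric threshold.
    return sum(1 for s in scores if s >= threshold) / len(scores)

# Illustrative sample of judge scores from the 28-day window.
window = [4.5, 4.2, 3.1, 4.8, 4.0, 4.6, 4.1, 4.3, 4.7, 4.4]
rate = quality_rate(window)                                    # 0.9
budget_used = (QUALITY_TARGET - rate) / (1 - QUALITY_TARGET)   # 1.0: budget spent
```

At a 90% measured rate against a 95% target, the full 5% budget is consumed, which is exactly the condition that should pause feature work.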
Pillar 2: Output Distribution Drift Detection
Quality thresholds can catch absolute degradation, but they miss relative drift — the case where outputs are still technically "passing" but have changed character in ways users will notice. Response length distributions shifting, hedging language increasing, topic coverage narrowing: these patterns can persist below the quality threshold while meaningfully worsening the user experience.
Output distribution monitoring tracks statistical properties of your outputs over time. At the simplest level, this means tracking the distribution of response lengths, confidence signals, or categorical labels (for classification tasks). At the more sophisticated level, it means monitoring the distribution of output embeddings using Wasserstein distance or Maximum Mean Discrepancy against a reference window, flagging when the distribution shifts beyond a threshold.
The SLO formulation here is subtler: rather than a quality rate, you're measuring distribution stability. "The Wasserstein distance between output embeddings this week and the reference window shall not exceed 0.15." When it does, you don't necessarily have a quality incident — but you have a signal to investigate.
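For equal-size one-dimensional samples, the Wasserstein distance reduces to the mean gap between sorted values, which is enough to sketch the check. The projection (normalized response lengths) and the sample values are assumptions; a real pipeline comparing embedding distributions would use a library implementation, sliced Wasserstein, or MMD.

```python
def wasserstein_1d(a, b):
    # 1-D earth mover's distance for two equal-size samples:
    # mean absolute difference between the sorted values.
    assert len(a) == len(b)
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)

DRIFT_SLO = 0.15  # threshold from the SLO statement above

reference = [0.38, 0.39, 0.40, 0.41, 0.42, 0.43, 0.45]  # e.g. normalized lengths
current   = [0.57, 0.58, 0.59, 0.60, 0.61, 0.62, 0.63]

drift = wasserstein_1d(reference, current)   # ~0.189
drifted = drift > DRIFT_SLO                  # True: signal to investigate
```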
Pillar 3: Preference Win-Rate Against a Baseline
Semantic thresholds and drift metrics tell you whether your system changed. Win-rate tells you whether it got better or worse relative to something you trust.
The idea is simple: periodically sample pairs of responses (current model vs. baseline), route them to a grading system (human annotators, LLM-as-judge, or a specialized reward model), and measure how often the grader prefers the current output. A win-rate above 50% means you're improving; below 50% means you've regressed.
This is the most expensive of the three signals, but it's the most aligned with actual user preference. It also gives you a natural rollback criterion: if win-rate against last week's model drops below 45% and stays there for 48 hours, that's an automated trigger to revert the model version.
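Both the metric and the rollback trigger fit in a short sketch. The two-window sustain period (standing in for "48 hours" at a daily evaluation cadence) and the label strings are assumptions:

```python
ROLLBACK_THRESHOLD = 0.45  # sustained win-rate below this triggers a revert
SUSTAIN_WINDOWS = 2        # e.g. two consecutive 24h evaluation windows

def win_rate(preferences):
    # preferences: list of "current" / "baseline" judgments from the grader.
    return sum(1 for p in preferences if p == "current") / len(preferences)

def should_rollback(window_rates):
    # True only when the most recent SUSTAIN_WINDOWS evaluation windows
    # are all below the rollback threshold (one bad window is noise).
    recent = window_rates[-SUSTAIN_WINDOWS:]
    return len(recent) == SUSTAIN_WINDOWS and all(
        r < ROLLBACK_THRESHOLD for r in recent
    )

print(should_rollback([0.52, 0.44, 0.43]))  # True
```

Requiring the regression to persist across windows keeps a single noisy evaluation batch from reverting a healthy model.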
Wiring Behavioral SLOs Into Incident Response
Having behavioral SLOs is only useful if they're wired into your operational processes — not sitting in a dashboard that someone checks monthly.
Alert design for quality SLOs differs from infrastructure alerts in one critical way: you want burn rate alerts, not threshold alerts. A threshold alert on "quality rate < 90%" fires the moment the line is crossed and then stays firing, generating noise. A burn rate alert fires when the quality rate is consuming your error budget faster than the budget allows — it fires when the trajectory is bad, not just when a point estimate crosses a line.
Concretely: if your monthly error budget is 5% (95% quality target), and you're burning 0.5% per day when the budget allows 0.17% per day, you'll exhaust the budget in 10 days. A burn rate alert at 3x normal burn rate would fire on day one. That's the model — borrow it directly from Google's SRE book and apply it to quality metrics.
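The arithmetic from that example, as a reusable check (budget figures are in percentage points; the 3x alert multiple is the one from the text):

```python
def burn_rate_multiple(budget_pct, window_days, burned_pct, days_elapsed):
    # How many times faster we are burning than the rate that would
    # exactly exhaust the budget at the end of the window.
    allowed_per_day = budget_pct / window_days
    actual_per_day = burned_pct / days_elapsed
    return actual_per_day / allowed_per_day

# Numbers from the text: 5% monthly budget, burning 0.5% per day.
multiple = burn_rate_multiple(5.0, 30, 0.5, 1)  # ~3.0
page_oncall = multiple >= 3.0                   # fires on day one
```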
Quality-specific incident runbooks need their own taxonomy. The incident classes that matter for LLM systems are distinct from infrastructure:
- Quality degradation: quality rate drops or win-rate regresses, origin unclear
- Distribution shift: outputs have changed character but quality threshold isn't yet breached
- Model drift: a provider model update changed behavior (detectable via behavioral fingerprinting)
- Data drift: the incoming request distribution has shifted away from the training/evaluation distribution
Each class has a different investigation path. Quality degradation starts with "did we ship a prompt change?" Distribution shift starts with "did input patterns change?" Model drift starts with "did the provider push a silent update?" The runbook needs to walk the on-call engineer through each decision tree, not just tell them to "investigate quality."
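A routing table is one way to make that taxonomy operational; the class names and first questions below mirror the list above, and the dispatch function is a hypothetical sketch, not a real incident-management API.

```python
# Incident class -> first question in that class's runbook.
RUNBOOK_ENTRY = {
    "quality_degradation": "Did we ship a prompt change?",
    "distribution_shift":  "Did input patterns change?",
    "model_drift":         "Did the provider push a silent update?",
    "data_drift":          "Has the request mix moved off the eval distribution?",
}

def triage(incident_class: str) -> str:
    # First step shown to on-call when a quality alert fires.
    return RUNBOOK_ENTRY.get(
        incident_class, "Unclassified: page the feature owner."
    )

print(triage("model_drift"))  # Did the provider push a silent update?
```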
Cost circuit breakers also belong in this architecture. LLM evaluation costs money: running an LLM-as-judge on every request is expensive at scale. Design your monitoring pipeline to sample aggressively on steady-state traffic, then increase sample rates automatically when cheaper signals (rule-based checks, distribution drift) indicate a potential incident. Spend your evaluation budget where it matters most.
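A minimal sketch of that circuit breaker, with the base rate, escalated rate, and daily call budget all assumed for illustration:

```python
BASE_RATE = 0.05                 # steady-state LLM-as-judge sample rate
ESCALATED_RATE = 0.50            # sample rate while cheap signals are alerting
MAX_DAILY_JUDGE_CALLS = 10_000   # hard cost ceiling on judge evaluations

def judge_sample_rate(cheap_signal_alerting: bool, judge_calls_today: int) -> float:
    # Escalate sampling when rule checks or drift metrics flag trouble,
    # but never past the daily evaluation budget.
    if judge_calls_today >= MAX_DAILY_JUDGE_CALLS:
        return 0.0  # budget exhausted: stop spending, rely on cheap checks
    return ESCALATED_RATE if cheap_signal_alerting else BASE_RATE
```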
The Organizational Hard Part
The technical design is the easier half. The harder half is getting AI teams to treat behavioral quality as an operational responsibility rather than a product launch checklist item.
The error budget conversation is where this usually breaks down. Traditional error budgets come with a forcing function: when you exhaust your budget, you stop shipping new features and focus on reliability. That's painful enough to motivate investment in reliability work. Behavioral quality error budgets need the same forcing function — if your quality error budget is exhausted, the roadmap pauses until the quality work is done — but AI teams resist this because quality work is less legible than infrastructure work. "Improve response quality" doesn't have a clear ship date.
The fix is to make quality work as legible as infrastructure work. Every behavioral SLO needs a corresponding investment: someone owns the eval pipeline, someone owns the grading rubric, someone reviews grading samples weekly to catch evaluator drift. These are scoped, assigned, and tracked — not aspirational.
Evaluator trust is a related problem. Teams resist SLOs they don't trust. If the LLM-as-judge is rating outputs inconsistently, if the semantic threshold was set arbitrarily, if win-rate comparisons are noisier than expected — the SLO will be quietly ignored when it fires at inconvenient times. Invest in evaluator validation before you make the SLO operational: measure inter-rater agreement, compare automated grades to human grades on a representative sample, set thresholds based on data rather than intuition.
Regression cadence also needs explicit planning. Behavioral quality SLOs are only as good as the frequency at which you run evaluations. Evaluating weekly means incidents can fester for seven days. Evaluating continuously on a sample means incidents surface within hours but requires a production evaluation pipeline that's maintained like infrastructure. The right answer depends on the cost sensitivity of your application — but teams that treat evaluation as a quarterly event are not running SLOs; they're doing periodic audits.
What to Build First
If you're starting from zero, resist the urge to build everything at once. A behavioral quality SLO can be bootstrapped in three phases:
Phase 1: Coverage baseline. Pick one high-volume, high-stakes output type. Write a rule-based scorer for the most obvious failure mode (missing required fields, off-topic response, excessive hedging). Run it on production traffic for two weeks and measure the baseline rate. Don't set an SLO target yet — just establish what "normal" looks like.
Phase 2: Threshold and alerting. Set a quality rate target based on the baseline (if normal is 94%, target 92% and treat the gap as your initial error budget). Wire a burn rate alert. Do a tabletop exercise: if this alert fires on a Tuesday morning, what do we do? Write that runbook before you need it.
Phase 3: Win-rate regression tests. Add a weekly automated win-rate evaluation against the previous week's model. Gate production deployments on this check: if win-rate drops below 48% on the eval set, require a human review before shipping. This turns your SLO from a monitoring instrument into a deployment gate — which is where it stops being optional.
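The Phase 3 gate reduces to a few lines; the return strings and function name are illustrative, and in practice this check would run inside your CI or deployment pipeline rather than as a standalone function.

```python
GATE_THRESHOLD = 0.48  # Phase 3: below this win-rate, require a human review

def deployment_gate(eval_win_rate: float) -> str:
    # Weekly win-rate regression check against last week's model,
    # used as a deploy gate rather than a passive monitor.
    if eval_win_rate < GATE_THRESHOLD:
        return "blocked: human review required"
    return "approved"

print(deployment_gate(0.51), deployment_gate(0.46))
```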
Conclusion
The reliability practices that make traditional services trustworthy — error budgets, burn rate alerts, deployment gates, runbooks — transfer to AI features. They just need to be applied to behavioral quality, not infrastructure availability.
The challenge isn't technical sophistication. It's the organizational discipline to treat quality degradation as a production incident, to invest in evaluation infrastructure with the same seriousness as monitoring infrastructure, and to accept that "the dashboard is green" is no longer sufficient assurance when your users are interacting with a probabilistic system.
Error budgets work because they make the implicit explicit: you've decided in advance what level of failure is acceptable, and you've committed to stopping new work when you exceed it. That discipline is exactly what AI features need — applied to the metric that actually matters.
