
Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

11 min read
Tian Pan
Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.

Why The Latency-SLO Playbook Breaks

The classic SRE framing assumes a tight loop between the SLI you measure and the user happiness you actually care about. Latency is the canonical example: a request that took 800ms instead of 200ms made a user unhappy in a way you can confirm at request time, in the same telemetry stream, with no labeling required. The SLI is not a proxy for user happiness — it is, for latency-sensitive systems, very nearly the same variable.

Agentic features violate this assumption on every axis. The thing the user wants from the agent is a correct, useful, contextually appropriate completion of their task. The thing your platform can observe is a sequence of tool calls, token streams, and HTTP status codes. None of those are correctness. The model can return a 200 with a confidently wrong answer; it can fail a tool call and recover gracefully into a useful response; it can hand off to a human in the right cases (good) or in the wrong cases (bad), and the dashboard records both as "escalations."

The deeper problem is the timing. Latency is observable inside the request. Tool errors are observable inside the session. Whether the agent did the right thing is sometimes only resolvable when the user comes back tomorrow and edits the artifact, or doesn't come back at all. An SLO whose ground truth has a 24-hour to 30-day lag is structurally different from an SLO whose ground truth is in the response headers. Pretending otherwise is how teams end up with the dashboard-vs-tickets divergence.

Proxy SLIs You Can Measure In Real Time

A proxy SLI is a variable correlated with true success that you can observe inside the telemetry stream you already have. None of these are the truth. Each of them, held to a budget the org agrees represents acceptable degradation, is honest enough to ship.

The proxies that hold up across most agent products:

  • Escalation rate — the percentage of agent sessions handed off to a human. A rising escalation rate is rarely good news, but a falling one isn't necessarily good either. An overcautious threshold escalates everything; a too-loose threshold lets bad answers through. Track the rate, and track the distribution of escalation reasons alongside it.
  • Retry rate — the percentage of sessions where the user re-sent or rephrased their prompt within a short window. Re-prompting is the user telling you the previous turn missed; it is one of the cleanest in-session signals of dissatisfaction you can collect without labels.
  • Abandonment — the percentage of sessions that end without a clear terminal action (purchase, save, send, accept). Abandonment is noisier than retry — users abandon for reasons that have nothing to do with quality — but a sustained delta against your baseline is meaningful.
  • Edit-distance on artifacts — when the agent produces something the user can modify (a draft, a config, a query), the size of the user's subsequent edit is a quantitative proxy for how far off the first draft was. Zero edit means accepted; 90% edit means the agent gave them a coat hanger when they asked for a coat.
  • Tool-call failure cluster rate — not just whether tool calls failed, but whether they failed in patterns the agent didn't recover from. Healthy retries are part of the loop; unbounded retry storms are a reliability event.

The honest framing for any of these is: "exceeding this budget alerts on a regression in something correlated with user dissatisfaction, even though we cannot say from this signal alone what fraction of true errors we are catching." That sentence belongs in the SLO document. Without it, the next quarter's reviewer will treat the proxy as ground truth and conclude the system is fine.
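
To make the budget concrete, here is a minimal sketch of how these proxies could be computed from the session logs you probably already have. The `Session` fields and the budget numbers are illustrative assumptions, not a schema or a recommendation; substitute whatever your telemetry actually records.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Session:
    # Hypothetical session record: the field names are assumptions about
    # what your existing telemetry captures.
    escalated: bool                  # handed off to a human
    retried_within_window: bool      # user re-sent or rephrased shortly after
    terminal_action: Optional[str]   # "purchase", "save", "send", "accept", or None
    draft_chars: int                 # size of the agent-produced artifact, 0 if none
    edited_chars: int                # size of the user's subsequent edit


def proxy_slis(sessions: list[Session]) -> dict[str, float]:
    """Compute the real-time proxy SLIs over a window of sessions."""
    n = max(len(sessions), 1)
    with_artifact = [s for s in sessions if s.draft_chars > 0]
    return {
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "retry_rate": sum(s.retried_within_window for s in sessions) / n,
        "abandonment_rate": sum(s.terminal_action is None for s in sessions) / n,
        # Mean fraction of the artifact the user rewrote; 0.0 means accepted as-is.
        "mean_edit_fraction": sum(
            min(s.edited_chars / s.draft_chars, 1.0) for s in with_artifact
        ) / max(len(with_artifact), 1),
    }


# Budgets the org has agreed represent acceptable degradation (illustrative numbers).
PROXY_BUDGETS = {
    "escalation_rate": 0.15,
    "retry_rate": 0.10,
    "abandonment_rate": 0.30,
    "mean_edit_fraction": 0.40,
}


def burned_budgets(slis: dict[str, float]) -> list[str]:
    """Names of the proxy budgets currently being exceeded."""
    return [name for name, value in slis.items() if value > PROXY_BUDGETS[name]]
```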

Gold Cohorts: How Asynchronous Ground Truth Closes The Loop

Proxies tell you something is off. They don't tell you what fraction of agent responses are actually wrong. To close that gap, you need a gold cohort: a sample of production traffic, drawn at a known rate, graded asynchronously against a rubric that matches your true success criteria — and then back-propagated into the SLO calculation so you can quote a real error rate, not just a proxy.

The mechanics that matter:

  • Sample at production scale, not at convenience scale. A gold cohort drawn from internal demos or beta users tells you almost nothing about how the system performs for the long tail of real prompts. Standard practice in production observability is to sample 1–10% of high-volume traffic for online scoring, with cheaper checks at higher rates and expensive LLM-judge evaluations at lower rates. Pick your sample rate against your judging budget, not your wishful thinking.
  • Grade asynchronously, off the request path. A judge that runs synchronously doubles your latency and your inference bill. A judge that runs in the background, against logged traces, costs nothing the user can perceive and produces a continuous quality signal you can dashboard.
  • Calibrate the judge against humans before you trust it. LLM-as-judge is a workable, scalable proxy for human grading, but only after you've measured the agreement rate between the judge and human reviewers on a held-out set. If your judge agrees with humans 70% of the time, your "real" error rate is also a proxy — it's just a better one than retry rate. Recalibrate quarterly; model upgrades shift judge behavior in ways that look like quality drift but aren't.
  • Back-propagate into the SLO. Once gold-cohort grading is producing a graded error rate at a known confidence interval, the SLO can have two tiers: a fast-moving proxy budget that alerts in real time, and a slow-moving graded budget that updates weekly and is the number the org reviews. The proxy catches regressions; the graded number catches the proxy lying.
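
Here is a minimal sketch of that back-propagation step, assuming the judge emits a boolean failed/passed verdict per gold-cohort item and that a human-labeled slice exists for calibration. The agreement threshold, the budget, and the choice of a Wilson interval are illustrative, not prescriptive.

```python
import math


def judge_human_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Fraction of the human-reviewed slice where the judge and the human agree."""
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)


def graded_error_rate(wrong: list[bool], z: float = 1.96) -> tuple[float, float, float]:
    """Graded error rate over the gold cohort with a Wilson score interval.

    wrong[i] is True when the judge marked response i as failing the rubric.
    Returns (point_estimate, ci_low, ci_high); z=1.96 gives roughly 95% confidence.
    """
    n = len(wrong)
    p = sum(wrong) / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    margin = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return p, max(center - margin, 0.0), min(center + margin, 1.0)


GRADED_ERROR_BUDGET = 0.05    # illustrative: at most 5% of graded responses may be wrong
MIN_JUDGE_AGREEMENT = 0.85    # illustrative: below this, recalibrate the judge first


def weekly_graded_review(wrong, judge_on_human_slice, human_on_human_slice) -> str:
    """The slow-moving tier: the number the org reviews weekly."""
    if judge_human_agreement(judge_on_human_slice, human_on_human_slice) < MIN_JUDGE_AGREEMENT:
        return "recalibrate_judge"   # the graded number is not trustworthy this week
    _, ci_low, _ = graded_error_rate(wrong)
    # Conservative choice: declare the graded budget burned only when even the
    # lower bound of the interval is above it. Keying on the point estimate
    # instead alerts earlier at the cost of more noise.
    return "graded_red" if ci_low > GRADED_ERROR_BUDGET else "graded_green"
```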

The economic reality nobody flags up front: the labeling and gold-set infrastructure required for honest agent SLOs is the same infrastructure required for honest evaluation. Treating these as separate budgets is how teams end up with neither — eval gets defunded because "we have monitoring," monitoring gets defunded because "we have evals," and the system in production has no honest measurement at all.

The "We Don't Know Yet" Tier

Standard alert hierarchies have two states: green and on-fire. Agent SLOs need a third one — explicitly unknown — because the lag between the request and the ground truth means there is a window where the proxy says fine, the graded number hasn't arrived yet, and you have no honest claim either way.

A workable hierarchy:

  • Green — proxies inside their budgets, graded number inside its budget, no anomalous patterns in the spend rate.
  • Yellow / unknown — proxies inside their budgets, graded number not yet available for the current window. Don't page on this; do show it on the dashboard so reviewers don't mistake "no signal" for "good signal."
  • Red proxy — at least one proxy budget is being burned faster than the SLO allows. Page on-call; investigate the underlying signal before you assume the proxy is wrong.
  • Red graded — the graded number has crossed the SLO. Page leadership and the model team; this is the equivalent of a real outage.

The point of the unknown tier is to keep the on-call rotation from getting paged on noise while making sure leadership doesn't get false comfort from a green dashboard during the window when the truth literally hasn't been graded yet. A proxy alert should be treated as a real signal, not a false positive — even when it later turns out the underlying responses were fine. The proxy moved for a reason; understanding that reason is the work.
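
As a sketch, the tiering can be derived mechanically from two inputs: whether any proxy budget is burning, and whether the graded number for the window exists yet. The state names and the precedence below are assumptions about how you'd wire the dashboard, not a standard.

```python
from enum import Enum
from typing import Optional


class SloState(Enum):
    GREEN = "green"
    UNKNOWN = "yellow_unknown"   # no signal is not good signal
    RED_PROXY = "red_proxy"      # page on-call
    RED_GRADED = "red_graded"    # page leadership and the model team


def slo_state(proxy_over_budget: bool, graded_over_budget: Optional[bool]) -> SloState:
    """graded_over_budget is None while the current window has not been graded yet."""
    if graded_over_budget:                 # graded number exists and is over budget
        return SloState.RED_GRADED
    if proxy_over_budget:
        return SloState.RED_PROXY
    if graded_over_budget is None:
        return SloState.UNKNOWN            # dashboard-visible, never paged
    return SloState.GREEN
```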

What An Error Budget Policy Looks Like Without Ground Truth

The error budget policy — the document that says what the team is allowed and required to do when budgets burn — needs an extra paragraph for agent systems that it doesn't need for stateless web services. The standard SRE policy is roughly: "if we burn the budget, we freeze risky changes until we recover." That works because the SLI is the truth.

For agent systems, the policy has to handle four cases the standard one doesn't:

  1. Proxy red, graded green — the proxy budget burned, but the graded number when it arrives says the responses were fine. Investigate why the proxy moved. Don't reflexively loosen the proxy: a moving proxy without a quality drop usually means user behavior changed (they're asking harder questions, they're using a new feature, they discovered an edge case), and the change deserves understanding.
  2. Proxy green, graded red — the proxy stayed in budget but the graded number says quality dropped. This is the most dangerous failure mode: it means your proxy is broken or its correlation with quality has decayed. The budget freeze applies and the proxy itself goes on the next-quarter roadmap.
  3. Both red — standard burn response: freeze, root-cause, fix, post-mortem.
  4. Graded number unavailable — explicitly named as a state, with a deadline. If the graded number is more than N days late, the policy escalates to "we have no honest reliability signal," which is itself an incident class.
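
The same four cases, spelled out as a decision table with the graded-number deadline made explicit. The action strings and the seven-day stand-in for "N days" are placeholders for whatever your policy actually commits to.

```python
from datetime import timedelta
from typing import Optional

GRADED_DEADLINE = timedelta(days=7)   # placeholder for the policy's "N days"


def policy_action(proxy_red: bool,
                  graded_red: Optional[bool],       # None means not yet available
                  window_closed_ago: timedelta) -> str:
    """Map the proxy/graded duality onto the four-case policy."""
    if graded_red is None:
        if window_closed_ago > GRADED_DEADLINE:
            # Case 4: no honest reliability signal, which is itself an incident class.
            return "declare_measurement_incident"
        return "wait_for_graded_number"
    if proxy_red and graded_red:
        return "freeze_and_postmortem"        # case 3: standard burn response
    if proxy_red:
        return "investigate_behavior_shift"   # case 1: understand why the proxy moved
    if graded_red:
        return "freeze_and_fix_the_proxy"     # case 2: the proxy's correlation has decayed
    return "no_action"
```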

The architectural realization here is that AI reliability engineering is not SRE-with-vibes — it requires a different observability primitive than the existing telemetry stack was built to express. The primitive is graded outcomes with asynchronous ground truth, and the entire policy has to be rewritten to handle the time lag and the proxy/graded duality. A team that ports the latency-SLO policy verbatim to an agent platform and ships it has not produced an agent SLO; they have produced a placebo.
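
If it helps to see the primitive concretely: the unit of observation is no longer a request row but an outcome row whose graded fields are nullable and arrive on a different clock than the trace. The field names below are an assumption about shape, not a schema proposal.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional


@dataclass
class GradedOutcome:
    """One agent session as the observability primitive: trace-time fields are
    filled at request time; graded fields arrive asynchronously, or never."""
    session_id: str
    trace_completed_at: datetime                 # known immediately
    proxy_signals: dict[str, float] = field(default_factory=dict)  # escalation, retry, edit fraction, ...
    graded_at: Optional[datetime] = None         # filled days later, if sampled
    judge_verdict: Optional[bool] = None         # True means it failed the rubric
    human_verdict: Optional[bool] = None         # filled only for the calibration slice
```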

What To Do This Quarter

If you are running an agent feature in production without explicit SLOs, the path forward is not to wait until you have a perfect labeling pipeline. The path is:

  1. Pick three proxy SLIs you can measure today against the data you already log. Escalation rate, retry rate, and one product-specific signal (edit distance, abandonment, time-to-completion) make a defensible starting set.
  2. Set initial budgets at "current performance, plus the largest regression the team could absorb before calling it unacceptable." Don't try to set the right budget on the first attempt; the iteration loop is the point.
  3. Stand up a gold-cohort sampler that pulls 1–10% of production traffic into a graded queue. Grade with an LLM judge first, and layer a human-review sample on top of it to calibrate the judge (a minimal sketch of the sampler follows this list).
  4. Write the error budget policy with the four-case structure above, and make sure the "unknown" tier is on the dashboard.
  5. Schedule a quarterly review where the proxy budgets get re-validated against the graded number. If correlation has decayed, the proxy itself is the bug.
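
For step 3, a deterministic hash-based sampler is usually enough to start: cheap checks at the higher rate, LLM-judge grading at the lower rate, per the 1–10% guidance above. The rates and queue names below are assumptions.

```python
import hashlib

# Illustrative tiered rates: cheap rule-based checks run on more traffic
# than the expensive LLM-judge grading.
CHEAP_CHECK_RATE = 0.10   # 10% of sessions get schema / tool-call / regex checks
LLM_JUDGE_RATE = 0.01     # 1% of sessions go to the graded gold-cohort queue


def _bucket(session_id: str) -> float:
    """Deterministic position in [0, 1) so the same session is always routed the same way."""
    digest = hashlib.sha256(session_id.encode()).hexdigest()
    return int(digest[:8], 16) / 16 ** 8


def grading_queues(session_id: str) -> list[str]:
    """Which off-path graders this session should be enqueued for (names are placeholders)."""
    b = _bucket(session_id)
    queues = []
    if b < CHEAP_CHECK_RATE:
        queues.append("cheap_checks")
    if b < LLM_JUDGE_RATE:
        queues.append("llm_judge")    # later human-sampled for judge calibration
    return queues
```

Because routing keys on a stable hash of the session id rather than a coin flip, re-running the pipeline over the same traffic reproduces the same cohort, which matters when you re-grade after a judge recalibration.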

The team that does this gets to say something the team running on response-success SLOs cannot: when something is wrong, they will know within a window measured in days rather than quarters, and when something looks wrong, they will know within minutes whether it's the system or the signal that has changed. That gap — between "we will be told by users" and "we will be told by our own telemetry" — is the entire difference between operating an agent platform and merely deploying one.
