
Agent SLOs Without Ground Truth: An Error Budget for Outputs You Can't Grade in Real Time

Tian Pan
Software Engineer

Your agent platform has met its 99.9% "response success" SLO every quarter for a year. Tickets are up 40%. Retention on the agent-touched cohort is down. The on-call rotation is bored, the product manager is panicking, and the executive review keeps asking why the dashboard says everything is fine while the support queue says everything is on fire. The dashboard isn't lying. It's just measuring the wrong thing — because the SRE who wrote the SLO defined success as "the model API returned 200," and that was the only definition of success the telemetry could express in the first place.

This is the central problem of agent reliability engineering: the success signal is not a status code. It is a judgment about whether the agent did the right thing for a specific task, and that judgment is unavailable at request time, often unavailable at session time, and sometimes only resolvable days later when the user files a ticket, edits the output, or quietly stops coming back. You cannot put a 200-vs-500 boolean on a column that doesn't exist yet.

The reflex is to wait for ground truth before declaring an SLO. This is wrong. Reliability does not pause while you build a labeling pipeline. The right move is to write an error budget against proxies you know are imperfect, name them as proxies, set the policy that governs how the team responds when they trip, and back-fill ground truth into the calculation as you produce it. This post is about how to do that without lying to yourself.

Why The Latency-SLO Playbook Breaks

The classic SRE framing assumes a tight loop between the SLI you measure and the user happiness you actually care about. Latency is the canonical example: a request that took 800ms instead of 200ms made a user unhappy in a way you can confirm at request time, in the same telemetry stream, with no labeling required. The SLI is not a proxy for user happiness — it is, for latency-sensitive systems, very nearly the same variable.

Agentic features violate this assumption on every axis. The thing the user wants from the agent is a correct, useful, contextually appropriate completion of their task. The thing your platform can observe is a sequence of tool calls, token streams, and HTTP status codes. None of those are correctness. The model can return a 200 with a confidently wrong answer; it can fail a tool call and recover gracefully into a useful response; it can hand off to a human in the right cases (good) or in the wrong cases (bad), and the dashboard records both as "escalations."

The deeper problem is the timing. Latency is observable inside the request. Tool errors are observable inside the session. Whether the agent did the right thing is sometimes only resolvable when the user comes back tomorrow and edits the artifact, or doesn't come back at all. An SLO whose ground truth has a 24-hour to 30-day lag is structurally different from an SLO whose ground truth is in the response headers. Pretending otherwise is how teams end up with the dashboard-vs-tickets divergence.

Proxy SLIs You Can Measure In Real Time

A proxy SLI is a variable correlated with true success that you can observe inside the telemetry stream you already have. None of these are the truth. Each of them, held to a budget the org agrees represents acceptable degradation, is honest enough to ship.

The proxies that hold up across most agent products:

  • Escalation rate — the percentage of agent sessions handed off to a human. A rising escalation rate is rarely good news, but a falling one isn't necessarily good either. An overcautious threshold escalates everything; a too-loose threshold lets bad answers through. Track the rate, and track the distribution of escalation reasons alongside it.
  • Retry rate — the percentage of sessions where the user re-sent or rephrased their prompt within a short window. Re-prompting is the user telling you the previous turn missed; it is one of the cleanest in-session signals of dissatisfaction you can collect without labels.
  • Abandonment — the percentage of sessions that end without a clear terminal action (purchase, save, send, accept). Abandonment is noisier than retry — users abandon for reasons that have nothing to do with quality — but a sustained delta against your baseline is meaningful.
  • Edit-distance on artifacts — when the agent produces something the user can modify (a draft, a config, a query), the size of the user's subsequent edit is a quantitative proxy for how far off the first draft was. Zero edit means accepted; 90% edit means the agent gave them a coat hanger when they asked for a coat.
  • Tool-call failure cluster rate — not just whether tool calls failed, but whether they failed in patterns the agent didn't recover from. Healthy retries are part of the loop; unbounded retry storms are a reliability event.
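As a rough illustration, the first four proxies can be computed from per-session records along these lines. This is a minimal sketch, not a reference implementation: the `Session` fields are hypothetical names to map onto your own trace schema, and `difflib.SequenceMatcher` stands in for whatever edit-distance measure suits your artifacts.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Optional


@dataclass
class Session:
    # All field names are hypothetical; map them onto your own trace schema.
    escalated: bool
    retried: bool                 # user re-sent or rephrased within the retry window
    terminal_action: bool         # purchase, save, send, accept
    draft: Optional[str] = None   # artifact the agent produced, if any
    final: Optional[str] = None   # artifact after the user's edits, if any


def edit_fraction(draft: str, final: str) -> float:
    """Approximate share of the draft the user changed; 0.0 means accepted as-is."""
    return 1.0 - SequenceMatcher(None, draft, final).ratio()


def proxy_slis(sessions: list[Session]) -> dict[str, float]:
    """Compute proxy SLIs over one window of sessions."""
    n = len(sessions)
    if n == 0:
        raise ValueError("no sessions in window")
    # Only sessions that produced an editable artifact contribute to the edit proxy.
    edits = [
        edit_fraction(s.draft, s.final)
        for s in sessions
        if s.draft is not None and s.final is not None
    ]
    return {
        "escalation_rate": sum(s.escalated for s in sessions) / n,
        "retry_rate": sum(s.retried for s in sessions) / n,
        "abandonment_rate": sum(not s.terminal_action for s in sessions) / n,
        "mean_edit_fraction": sum(edits) / len(edits) if edits else 0.0,
    }
```

The sketch reports rates only. The distribution of escalation reasons and the tool-call failure clusters need richer trace data than a flat session record carries.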

The honest framing for any of these is: "exceeding this budget alerts on a regression in something correlated with user dissatisfaction, even though we cannot say from this signal alone what fraction of true errors we are catching." That sentence belongs in the SLO document. Without it, the next quarter's reviewer will treat the proxy as ground truth and conclude the system is fine.
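Holding a proxy to a budget can look like ordinary error-budget arithmetic applied to the proxy's "bad" sessions. The sketch below assumes a target such as "at most 5% of sessions show a retry"; the function names, the 0.95 target, and the notion of an expected session volume for the period are illustrative choices, not values from any particular platform.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """How fast a window is spending budget; 1.0 means exactly on budget.

    slo_target is the fraction of sessions that must look "good" by the proxy,
    e.g. 0.95 allows 5% of sessions to show a retry.
    """
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (bad / total) / (1.0 - slo_target)


def budget_remaining(bad_so_far: int, expected_total: int, slo_target: float) -> float:
    """Fraction of the period's budget still unspent; negative means overspent.

    expected_total is the (assumed) forecast of sessions for the whole SLO period.
    """
    if expected_total <= 0:
        raise ValueError("expected_total must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * expected_total
    return 1.0 - bad_so_far / allowed_bad
```

For example, 30 retried sessions out of 200 in the last hour against a 0.95 target is a burn rate of about 3, meaning the budget is being spent roughly three times faster than the SLO allows. What the team does at a given burn rate belongs to the policy, not to this arithmetic.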

Gold Cohorts: How Asynchronous Ground Truth Closes The Loop

Proxies tell you something is off. They don't tell you what fraction of agent responses are actually wrong. To close that gap, you need a gold cohort: a sample of production traffic, drawn at a known rate, graded asynchronously against a rubric that matches your true success criteria — and then back-propagated into the SLO calculation so you can quote a real error rate, not just a proxy.
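Because the gold cohort is a sample, any error rate quoted from it carries sampling uncertainty. One way to report that is a Wilson score interval around the graded failure fraction. The sketch below assumes the cohort is a uniform random sample of production traffic and treats the grades as correct; the function name and the default z of 1.96 (roughly a 95% interval) are illustrative choices.

```python
import math


def wilson_interval(failures: int, graded: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for the true failure rate, given graded samples."""
    if graded <= 0:
        raise ValueError("graded must be positive")
    if not 0 <= failures <= graded:
        raise ValueError("failures must be between 0 and graded")
    p = failures / graded
    denom = 1.0 + z * z / graded
    center = (p + z * z / (2 * graded)) / denom
    half = z * math.sqrt(p * (1 - p) / graded + z * z / (4 * graded * graded)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at the extremes.
    return max(0.0, center - half), min(1.0, center + half)
```

With 40 failures in 500 graded traces, the point estimate is 8% and the interval runs from roughly 6% to 11%. Quoting the interval rather than the point estimate keeps the SLO review from reading more precision into a small cohort than it supports.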

The mechanics that matter:

  • Sample at production scale, not at convenience scale. A gold cohort drawn from internal demos or beta users tells you almost nothing about how the system performs for the long tail of real prompts. A common practice in production observability is to sample roughly 1–10% of high-volume traffic for online scoring, with cheaper checks at higher rates and expensive LLM-judge evaluations at lower rates. Pick your sample rate against your judging budget, not your wishful thinking.
  • Grade asynchronously, off the request path. A judge that runs synchronously doubles your latency and your inference bill. A judge that runs in the background, against logged traces, costs nothing the user can perceive and produces a continuous quality signal you can dashboard.
  • Calibrate the judge against humans before you trust it. LLM-as-judge is a workable, scalable proxy for human grading, but only after you've measured the agreement rate between the judge and human reviewers on a held-out set. If your judge agrees with humans 70% of the time, your "real" error rate is also a proxy; it's just a better one than retry rate. Recalibrate quarterly, because model upgrades shift judge behavior in ways that look like quality drift but aren't.
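Two of these mechanics are easy to sketch: drawing the cohort at a known rate, and measuring judge-versus-human agreement. The code below is one possible approach under stated assumptions. Hashing a trace ID is a common way to get deterministic sampling, but `in_gold_cohort`, `agreement`, and the use of Cohen's kappa alongside raw agreement are illustrative choices, not something this post prescribes.

```python
import hashlib


def in_gold_cohort(trace_id: str, sample_rate: float) -> bool:
    """Deterministic sampling: the same trace ID always gets the same answer."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


def agreement(judge: list[bool], human: list[bool]) -> tuple[float, float]:
    """Return (raw agreement, Cohen's kappa) for paired pass/fail labels."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge)
    observed = sum(j == h for j, h in zip(judge, human)) / n
    p_judge = sum(judge) / n
    p_human = sum(human) / n
    # Agreement expected by chance if the two raters labeled independently.
    expected = p_judge * p_human + (1 - p_judge) * (1 - p_human)
    if expected == 1.0:
        # Kappa is undefined when both raters use a single identical label.
        return observed, float("nan")
    return observed, (observed - expected) / (1 - expected)
```

Raw agreement can look healthy simply because most responses pass. Kappa discounts the agreement expected by chance, which is why the sketch reports both.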