
The Retry Your Dashboard Counted Three Different Ways

· 11 min read
Tian Pan
Software Engineer

An agent ran. The plan step crashed. The tool-call step then failed twice with a 500 before succeeding, on the run's fourth attempt overall. The user got their answer.

How many events was that? Ask product, and it's one — the user got a working result, so the funnel counts a conversion. Ask SRE, and it's three failures plus one success, a 75% error rate on the underlying step. Ask finance, and it's four billable inferences, two retried tool calls, and roughly four times the unit cost product is forecasting against. Each team's dashboard is correct. They are also irreconcilable, and the moment someone tries to reconcile them — usually during an incident review — they will discover the team has been operating against three contradictory pictures of reliability for months.

The problem isn't that one team is right and the others are wrong. It's that "the agent failed" is an under-specified phrase, and a healthy observability stack has to answer it at three different layers without those layers contradicting each other. Most stacks were built assuming one definition would do.

Why a single error rate stops working at the agent boundary

For a stateless HTTP service, "did it fail" is unambiguous. A 5xx is a failure. A 2xx after one or more 5xx retries is, by every standard SRE convention, still a success — the Google SRE workbook treats retries as part of the success budget when the eventual outcome is correct. The retry budget literature, including the Microsoft retry-storm antipattern guide, agrees: count attempts and attempt-results separately, but report user-visible success on the outermost call.

Agents break that convention in three ways. First, the unit of work is no longer a single call. A user's task is an opaque graph of plan steps, tool calls, model calls, retrieval steps, and sometimes nested sub-agents. Each step has its own failure semantics, and the relationship between step success and task success is non-monotonic — a step can succeed in a way that causes the task to fail (the model confidently called the wrong tool) or fail in a way that causes the task to succeed (the tool 500'd and the agent rerouted to a different tool that happened to be cheaper).

Second, every retry is a billable inference, and most of them rerun the entire growing context window, not just the failing chunk. The cost difference between "one successful task" and "one task that took three retries to succeed" is not 1 vs 1 — it's frequently 1 vs 8 or 1 vs 12, because the third retry carries the second retry's failed output in its context. Finance teams who built their unit-economics model on a clean success count are off by an order of magnitude, and they don't know it until the invoice lands.
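The multiplier is easier to see with a toy model. The sketch below assumes every attempt re-sends the full context and that each failed attempt appends its output to that context; the function name and the token counts are hypothetical, chosen only to illustrate the growth.

```python
def task_input_tokens(base_context: int, failed_output: int, attempts: int) -> int:
    """Total input tokens billed across all attempts of one task.

    Assumption: attempt k re-sends the base context plus the output of
    every earlier failed attempt.
    """
    total = 0
    for k in range(attempts):
        total += base_context + k * failed_output
    return total


# Hypothetical sizes: a 2,000-token prompt, 2,000 tokens added per failure.
one_shot = task_input_tokens(base_context=2_000, failed_output=2_000, attempts=1)
four_tries = task_input_tokens(base_context=2_000, failed_output=2_000, attempts=4)
ratio = four_tries / one_shot

print(one_shot, four_tries, ratio)  # 2000 20000 10.0
```

Under these assumptions, three retries cost ten times the one-shot task rather than four times, because input cost grows with the square of the attempt count when each failure enlarges the context.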

Third, the retry is sometimes the work. A self-correcting agent that verifies its own output and tries again on failure is not "succeeding after failures" in the SRE sense — it's executing its design. Counting the verification-step failures as incidents pollutes the error rate with normal operation. Counting them as successes hides genuine convergence failures where the agent retried thirty times before giving up.
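One way to keep designed retries out of the incident count without hiding runaway loops is to tag why each retry happened and cap the self-correction count. This is a minimal sketch under that assumption; the `TaskRun` fields, the labels, and the `MAX_SELF_CORRECTIONS` threshold are all hypothetical.

```python
from dataclasses import dataclass

MAX_SELF_CORRECTIONS = 5  # hypothetical cap; tune per agent


@dataclass
class TaskRun:
    verification_retries: int  # retries triggered by the agent's own checker
    fault_retries: int         # retries triggered by errors such as a tool 500
    converged: bool            # did the verifier eventually accept an output?


def classify(run: TaskRun) -> str:
    """Separate normal self-correction from faults and convergence failures."""
    if not run.converged or run.verification_retries > MAX_SELF_CORRECTIONS:
        return "convergence_failure"
    if run.fault_retries > 0:
        return "recovered_from_fault"
    return "normal_operation"


print(classify(TaskRun(verification_retries=2, fault_retries=0, converged=True)))
print(classify(TaskRun(verification_retries=30, fault_retries=0, converged=False)))
```

Only the second and third labels would feed an error rate; the first is the agent executing its design.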

The single error-rate metric collapses under the weight of three different questions: did the user get what they wanted, did the system perform within its budget, and did each component behave as designed. Most teams answer one of these and assume the other two will follow. They don't.

The three layers, named explicitly

The cleanest way out is to stop trying to make one metric serve three audiences. Build three layers and let each team subscribe to the one it needs.

Task outcome is the user-visible answer. Did the user's request, defined at the top of the trace, end in a state the user would call a success? This is binary at the task level and is the only number product should look at. Retries are invisible at this layer. A task that took four attempts to produce a correct answer is a single success; a task that took one attempt to produce a confidently wrong answer is a single failure. The conversion funnel, the activation curve, the "is the feature working" question all live here.
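As a rough sketch, the task layer reduces to one boolean per task, with the attempt count deliberately ignored. The `Task` record and the `user_accepted` field are hypothetical stand-ins for whatever signal defines a user-visible success in a given product.

```python
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    attempts: int        # recorded, but ignored on purpose at this layer
    user_accepted: bool  # hypothetical signal for a user-visible success


def task_success_rate(tasks: list[Task]) -> float:
    """Share of tasks that ended in a user-visible success."""
    if not tasks:
        return 0.0
    return sum(t.user_accepted for t in tasks) / len(tasks)


tasks = [
    Task("a", attempts=4, user_accepted=True),   # four tries, correct answer
    Task("b", attempts=1, user_accepted=False),  # one try, confidently wrong
]
print(task_success_rate(tasks))  # 0.5
```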

Step outcome is per-component health. For each span in the trace — plan call, tool call, retrieval, model inference — did this attempt succeed or fail on its own terms? This is what SRE wants, because it's the layer where you can compare a tool's reliability against its SLO, detect a degrading retrieval index, or notice that a specific model version is regressing. The step layer counts every attempt. A task that retried a tool three times before succeeding is three step-failures plus one step-success, regardless of what the task layer reports. The dashboards here look like classic service dashboards, and the observability literature for AI agents converges on this as the diagnostic layer.
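The step layer can be sketched as a per-component count over every attempt span. The component names and the `(component, succeeded)` tuple shape below are assumptions for illustration, not a particular tracing library's schema.

```python
from collections import Counter

# One entry per attempt span: (component, succeeded). Hypothetical data.
attempts = [
    ("search_tool", False),
    ("search_tool", False),
    ("search_tool", False),
    ("search_tool", True),
    ("planner", True),
]


def step_error_rates(spans: list[tuple[str, bool]]) -> dict[str, float]:
    """Error rate per component, counting every attempt, retries included."""
    totals: Counter = Counter()
    failures: Counter = Counter()
    for component, ok in spans:
        totals[component] += 1
        if not ok:
            failures[component] += 1
    return {c: failures[c] / totals[c] for c in totals}


print(step_error_rates(attempts))  # {'search_tool': 0.75, 'planner': 0.0}
```

The task that contains these spans still counts as one success at the task layer; the 75% belongs to the tool, not to the task.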

Budget consumption is what finance reads. How many billable units — input tokens, output tokens, tool invocations, reasoning tokens, cached vs uncached reads — did this task consume from start to finish, summed across every attempt and every retry? This is monotone with respect to work done; it never decreases for a given task. A task that succeeded on the first try costs 1x; a task that succeeded on the fourth try costs 4x or more depending on context growth. The token-attribution playbooks all converge on this being a separate ledger from outcome metrics, and they're right: outcome and cost are independent axes.
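A budget ledger along these lines might look like the sketch below. The prices are invented, expressed as integer micro-USD to avoid float rounding, and the `Attempt` record covers only three of the billable units named above.

```python
from dataclasses import dataclass
from itertools import accumulate

# Hypothetical prices in micro-USD.
INPUT_PRICE = 3          # per input token
OUTPUT_PRICE = 15        # per output token
TOOL_CALL_PRICE = 1_000  # per tool invocation


@dataclass
class Attempt:
    input_tokens: int
    output_tokens: int
    tool_calls: int


def attempt_cost(a: Attempt) -> int:
    return (a.input_tokens * INPUT_PRICE
            + a.output_tokens * OUTPUT_PRICE
            + a.tool_calls * TOOL_CALL_PRICE)


def running_task_cost(attempts: list[Attempt]) -> list[int]:
    """Cumulative cost after each attempt; failed attempts are billed too."""
    return list(accumulate(attempt_cost(a) for a in attempts))


first_try = [Attempt(1_000, 200, 1)]
fourth_try = [
    Attempt(1_000, 200, 1),
    Attempt(1_200, 200, 1),  # context grows with each failed attempt
    Attempt(1_400, 200, 1),
    Attempt(1_600, 200, 1),
]
print(running_task_cost(first_try))   # [7000]
print(running_task_cost(fourth_try))  # [7000, 14600, 22800, 31600]
```

The running total only ever rises, and with this modest context growth the four-attempt task lands at roughly 4.5 times the first-try cost.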

These three layers should not be reduced into each other. The "right" task success rate is not the average of step success rates, and the cost per successful task is not derived by dividing total tokens by task count. Each is a different aggregation over the same underlying trace, and each rolls up to a different stakeholder.
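A small worked example shows how far the aggregations drift apart. The three tasks below are made up: one clean success, one success after three failed steps, and one single-step task that returned a confidently wrong answer.

```python
# Hypothetical traces: per-attempt step results, task verdict, total tokens.
tasks = [
    {"steps": [True], "task_ok": True, "tokens": 1_000},
    {"steps": [False, False, False, True], "task_ok": True, "tokens": 10_000},
    {"steps": [True], "task_ok": False, "tokens": 1_000},
]

# Average of per-task step success rates: (1.0 + 0.25 + 1.0) / 3
mean_step_rate = sum(sum(t["steps"]) / len(t["steps"]) for t in tasks) / len(tasks)

# Task success rate: 2 of 3 tasks ended in a user-visible success
successes = sum(t["task_ok"] for t in tasks)
task_success_rate = successes / len(tasks)

# Two different cost denominators over the same ledger
total_tokens = sum(t["tokens"] for t in tasks)
tokens_per_task = total_tokens / len(tasks)
tokens_per_successful_task = total_tokens / successes

print(mean_step_rate)              # 0.75
print(round(task_success_rate, 3)) # 0.667
print(tokens_per_task)             # 4000.0
print(tokens_per_successful_task)  # 6000.0
```

The step average overstates task success here, and dividing by task count understates what each success actually cost, because the failed task's tokens were still spent.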

What the joins between layers reveal
