Skip to main content

MTBF Is Dead When Your Agent Self-Heals

· 10 min read
Tian Pan
Software Engineer

A team I talked to last quarter had every dashboard green. Tool error rate flat at 0.3%. End-to-end success at 98%. SLO budget barely touched. They were also burning four times their projected token spend, and nobody could explain it. When they finally instrumented retry depth per trace, the picture inverted: the median successful request was making 2.7 tool calls instead of the 1.0 the architecture diagram promised. The agent was not failing. It was failing and recovering, over and over, inside the same span, and the success rate metric had no way to tell them.

This is the part of agentic reliability that the old reliability vocabulary cannot reach. MTBF — mean time between failures — assumes failures are punctuated, observable events you can count between. You measure the gap, you compute the mean, you alert when the gap shrinks. It worked for hard drives, networks, deterministic services. It does not work for systems that retry, reroute, fall back, and recover silently inside a single user-visible operation.

The classical model has a tidy boundary between "the system worked" and "the system did not work." Agents erase that boundary. A modern agent run is a graph of attempts, and the leaf node that returned a green status hides everything the agent had to do to get there. A passing trace might involve three failed tool calls, a re-prompt, a fallback to a cheaper model, and a final successful retrieval — and all of that gets collapsed to a 200 response and a happy customer. The metric you have been watching for years has been quietly redefined under you.

The Failure Distribution Has No Gaps Anymore

MTBF is a calculation that depends on time between failures. There has to be a between. In a deterministic service, between is the easy part: the service is either up or down, and the gaps are obvious. In an agent, the gaps disappear. The agent's runtime is structured to absorb failures as part of normal operation. A timeout becomes a retry. A schema mismatch becomes a re-prompt with a corrected example. A tool error becomes a fallback to a different tool. Each of those was a failure under the old definition. Under the new architecture, none of them count, because the user got what they asked for.

This is not a bug in the runtime. It is the entire point. Self-healing is the feature you paid for when you adopted agent frameworks instead of plain LLM calls. The price is that the failure signal has been eaten. You no longer have a stream of discrete failures to count between — you have a continuous distribution of recovery cost, and the only thing the SLO sees is the tail that escapes recovery entirely. By the time MTBF moves, the system has already been degrading for weeks.

The math is also wrong in a subtler way. MTBF assumes failure events are roughly independent — that's why you can take a mean and have it mean something. Agent failures are systematic. When the model has a wrong belief about a tool's schema, every call into that tool fails the same way, and three retries make the same mistake three times. The retry distribution is correlated, not independent, and averaging it produces a number that does not describe the system you have.

What to Measure Instead

The replacement metric set has three pieces, and they only make sense together.

Success-rate-after-recovery is the user-visible quality signal. It is the rate at which the user got a correct answer, regardless of how many internal failures the agent burned through to produce it. This is what your existing success metric thinks it is measuring, but in most stacks it is contaminated — agents that fail silently and produce a wrong-but-plausible answer count as successes. To make this metric trustworthy, you have to define correctness at the action level, not the response level. ReliabilityBench's framing is useful here: define correctness by end-state rather than output text. Did the booking actually happen? Did the row actually update? Did the email actually send to the right address? The answer text is a witness, not the truth.

Recovery-cost-per-success is the hidden tax. For every successful trace, count the tool calls, retries, fallback hops, and tokens consumed beyond the optimal path. This is the metric that catches a healthy-looking system burning four times its expected spend. Recovery cost is a number you can budget against, and a number you can put on a dashboard next to revenue. When recovery cost rises while success rate stays flat, something has changed underneath you — a tool got slower, a model got dumber, a schema drifted, a user behavior shifted — and the agent is compensating without telling you.

Recovery-attempt distribution per trace is the latent risk indicator. Plot, per trace, how many recovery hops happened before success. A healthy system has most traces at zero recoveries, a thin tail at one, and almost nothing past two. A degrading system thickens its tail. Crucially, this metric moves before success rate moves. It is the early-warning channel for a class of slow degradations that MTBF will only catch after they have already broken something. Treat the shape of the distribution as the signal, not just its mean — a bimodal distribution with a cluster of traces at five-plus retries is a different problem than a uniform shift to one retry.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates