
MTBF Is Dead When Your Agent Self-Heals

10 min read
Tian Pan
Software Engineer

A team I talked to last quarter had every dashboard green. Tool error rate flat at 0.3%. End-to-end success at 98%. SLO budget barely touched. They were also burning four times their projected token spend, and nobody could explain it. When they finally instrumented retry depth per trace, the picture inverted: the median successful request was making 2.7 tool calls instead of the 1.0 the architecture diagram promised. The agent was not failing. It was failing and recovering, over and over, inside the same span, and the success rate metric had no way to tell them.

This is the part of agentic reliability that the old reliability vocabulary cannot reach. MTBF — mean time between failures — assumes failures are punctuated, observable events you can count between. You measure the gap, you compute the mean, you alert when the gap shrinks. It worked for hard drives, networks, deterministic services. It does not work for systems that retry, reroute, fall back, and recover silently inside a single user-visible operation.

The classical model has a tidy boundary between "the system worked" and "the system did not work." Agents erase that boundary. A modern agent run is a graph of attempts, and the leaf node that returned a green status hides everything the agent had to do to get there. A passing trace might involve three failed tool calls, a re-prompt, a fallback to a cheaper model, and a final successful retrieval — and all of that gets collapsed to a 200 response and a happy customer. The metric you have been watching for years has been quietly redefined under you.

The Failure Distribution Has No Gaps Anymore

MTBF is a calculation that depends on time between failures. There has to be a between. In a deterministic service, between is the easy part: the service is either up or down, and the gaps are obvious. In an agent, the gaps disappear. The agent's runtime is structured to absorb failures as part of normal operation. A timeout becomes a retry. A schema mismatch becomes a re-prompt with a corrected example. A tool error becomes a fallback to a different tool. Each of those was a failure under the old definition. Under the new architecture, none of them count, because the user got what they asked for.
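To make the absorption concrete, here is a minimal sketch of the loop most agent runtimes run internally. Every name in it is illustrative rather than any particular framework's API; the shape is the point: each failure branch feeds back into another attempt, and only total exhaustion ever surfaces.

```python
# A minimal sketch of a self-healing tool-call loop. All names here are
# illustrative, not any real framework's API.
class ToolTimeout(Exception): pass
class SchemaMismatch(Exception): pass
class ToolError(Exception): pass

FALLBACK_TOOLS = {"flaky_search": "cached_search"}  # hypothetical fallback table

def call_tool(tool, args):
    # Stand-in for a real tool invocation that may raise any of the above.
    return f"{tool} ok with {args}"

def reprompt_with_error(tool, args, err):
    # Stand-in for asking the model to fix its own arguments given the error.
    return {**args, "corrected": True}

def run_tool_with_recovery(tool, args, max_attempts=3):
    for _ in range(max_attempts):
        try:
            return call_tool(tool, args)               # the happy path
        except ToolTimeout:
            continue                                   # timeout -> silent retry
        except SchemaMismatch as e:
            args = reprompt_with_error(tool, args, e)  # schema error -> re-prompt
        except ToolError:
            tool = FALLBACK_TOOLS.get(tool, tool)      # tool error -> fallback tool
    raise RuntimeError("recovery exhausted")           # the only failure that surfaces
```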

This is not a bug in the runtime. It is the entire point. Self-healing is the feature you paid for when you adopted agent frameworks instead of plain LLM calls. The price is that the failure signal has been eaten. You no longer have a stream of discrete failures to count between — you have a continuous distribution of recovery cost, and the only thing the SLO sees is the tail that escapes recovery entirely. By the time MTBF moves, the system has already been degrading for weeks.

The math is also wrong in a subtler way. MTBF assumes failure events are roughly independent — that's why you can take a mean and have it mean something. Agent failures are systematic. When the model has a wrong belief about a tool's schema, every call into that tool fails the same way, and three retries make the same mistake three times. The retry distribution is correlated, not independent, and averaging it produces a number that does not describe the system you have.
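A toy simulation makes this concrete. Both systems below fail 30% of first attempts; one fails independently per attempt, the other through a single systematic cause that breaks every attempt in a trace. The mean of the successful traces says the correlated system is healthier. The tail says the opposite.

```python
import random

def attempts_to_success(fail_prob, correlated, max_attempts=4):
    """One simulated trace: attempts needed, or None if recovery exhausted."""
    broken = random.random() < fail_prob  # correlated: one systematic cause per trace
    for attempt in range(1, max_attempts + 1):
        failed = broken if correlated else (random.random() < fail_prob)
        if not failed:
            return attempt
    return None

random.seed(0)
for correlated in (False, True):
    runs = [attempts_to_success(0.3, correlated) for _ in range(100_000)]
    ok = [r for r in runs if r is not None]
    # Independent: visible retries, ~0.8% exhaustion. Correlated: successful
    # traces average a perfect 1.0 attempts while ~30% of traces exhaust.
    print(f"correlated={correlated}: mean attempts={sum(ok)/len(ok):.2f}, "
          f"exhausted={(len(runs)-len(ok))/len(runs):.1%}")
```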

What to Measure Instead

The replacement metric set has three pieces, and they only make sense together.

Success-rate-after-recovery is the user-visible quality signal. It is the rate at which the user got a correct answer, regardless of how many internal failures the agent burned through to produce it. This is what your existing success metric thinks it is measuring, but in most stacks it is contaminated — agents that fail silently and produce a wrong-but-plausible answer count as successes. To make this metric trustworthy, you have to define correctness at the action level, not the response level. ReliabilityBench's framing is useful here: define correctness by end-state rather than output text. Did the booking actually happen? Did the row actually update? Did the email actually send to the right address? The answer text is a witness, not the truth.
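A sketch of what an end-state oracle can look like, with a hypothetical bookings store standing in for your system of record; the trace fields here are illustrative, not a real schema:

```python
# Sketch of an end-state oracle. `bookings_db` and the trace fields are
# hypothetical stand-ins for your system of record and trace schema.
def trace_succeeded(trace, bookings_db):
    """Grade by what actually happened, not by what the answer text claims."""
    intent = trace["intent"]              # e.g. {"action": "book", "flight": "UA42"}
    if intent["action"] == "book":
        booking = bookings_db.get(intent["flight"])
        # The booking row is the truth; the agent's reply is only a witness.
        return booking is not None and booking["user"] == trace["user"]
    return False                          # unknown intents fail closed
```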

Recovery-cost-per-success is the hidden tax. For every successful trace, count the tool calls, retries, fallback hops, and tokens consumed beyond the optimal path. This is the metric that catches a healthy-looking system burning four times its expected spend. Recovery cost is a number you can budget against, and a number you can put on a dashboard next to revenue. When recovery cost rises while success rate stays flat, something has changed underneath you — a tool got slower, a model got dumber, a schema drifted, a user behavior shifted — and the agent is compensating without telling you.
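As a sketch, assuming traces carry their spans with token counts and a recovery-reason attribute (the field names are illustrative), the metric is a straightforward aggregation:

```python
# Sketch: mean extra tokens burned by recovery per successful trace,
# under the assumed trace shape described above.
def recovery_cost_per_success(traces):
    extra, successes = 0, 0
    for t in traces:
        if not t["succeeded"]:
            continue
        successes += 1
        for s in t["spans"]:
            if s.get("recovery_reason"):       # only count off-ideal-path spans
                extra += s["tokens_in"] + s["tokens_out"]
    return extra / max(successes, 1)
```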

Recovery-attempt distribution per trace is the latent risk indicator. Plot, per trace, how many recovery hops happened before success. A healthy system has most traces at zero recoveries, a thin tail at one, and almost nothing past two. A degrading system thickens its tail. Crucially, this metric moves before success rate moves. It is the early-warning channel for a class of slow degradations that MTBF will only catch after they have already broken something. Treat the shape of the distribution as the signal, not just its mean — a bimodal distribution with a cluster of traces at five-plus retries is a different problem than a uniform shift to one retry.
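Under the same assumed trace shape, the distribution and its tail mass are a few lines:

```python
from collections import Counter

def recovery_distribution(traces):
    """Fraction of successful traces at each recovery-hop count."""
    hops = [sum(1 for s in t["spans"] if s.get("recovery_reason"))
            for t in traces if t["succeeded"]]
    n = max(len(hops), 1)
    return {k: v / n for k, v in sorted(Counter(hops).items())}

def tail_mass(distribution, threshold=2):
    """Alert on this, not the mean: share of traces past the threshold."""
    return sum(frac for hops, frac in distribution.items() if hops > threshold)
```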

These three together replace what MTBF used to do. The user-visible quality, the cost of producing it, and the early signal that the cost is rising. None of them are in your monitoring stack by default. All of them require structured trace data with retry depth and recovery type as first-class attributes.

The 4x Recovery Tax Nobody Budgets For

The single most common failure mode I see is teams discovering, months in, that their agent has been running at three or four times the expected token cost since launch, and the dashboards never moved. The pattern is consistent enough that it deserves its own name. Call it the silent recovery tax.

It happens because agent runtimes are aggressive about retrying. The model, asked to call a tool with a malformed argument, will see the error, apologize, reformat, and retry — and that whole loop happens inside one response. The user sees a slightly slower reply. The aggregate dashboards see a successful request. The token meter sees three input contexts and three output completions instead of one, and the bill at the end of the month has a number on it that does not match the per-request cost projection from launch.

The arithmetic is brutal. A 4x recovery tax does not look like a 4x failure rate. It looks like a 0% failure rate and a 300% cost overrun. Engineering goes hunting for the prompt that's gotten too long, the retrieval that's pulling too many chunks, the model that's been silently re-routed. The actual cause — that the agent has been quietly recovering from a broken tool schema for six weeks — does not surface, because nothing in the standard observability stack treats recovery hops as a first-class entity.
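The arithmetic in miniature, with assumed numbers, modeling each retry as re-sending the full context:

```python
# All numbers here are assumed for illustration.
tokens_per_attempt = 3_000     # input + output for one clean attempt
mean_attempts = 4.0            # hidden recovery depth
success_rate = 0.98            # what the dashboard shows

projected = tokens_per_attempt                 # launch projection: one attempt
actual = tokens_per_attempt * mean_attempts    # what the meter records
print(f"success rate: {success_rate:.0%}")               # 98%: still green
print(f"cost overrun: {actual / projected - 1:.0%}")     # 300%
```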

The fix is structural. Every retry, fallback, re-prompt, and recovery action needs to be its own span in the trace, tagged with the reason for the recovery. Not a log line, not an attribute on the parent span — a child span. This is what makes recovery cost computable. Once each recovery is a span, you can group by reason, attribute cost to causes, and answer questions like "which tool's schema mismatch is responsible for 30% of our recovery cost this week?" without forensic effort. Without span-level recovery instrumentation, you are running on hope.
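With the OpenTelemetry Python API, the instrumentation looks roughly like this. The attribute names are a suggested convention, and `reprompt_with_error` and `call_tool` are the hypothetical helpers from the earlier sketch:

```python
from opentelemetry import trace

tracer = trace.get_tracer("agent.runtime")

def recover_with_span(tool, args, error, attempt):
    # Each recovery is its own child span, tagged with why it happened,
    # so cost can later be grouped and attributed by reason.
    with tracer.start_as_current_span(
        "recovery",
        attributes={
            "recovery.reason": type(error).__name__,  # e.g. "SchemaMismatch"
            "recovery.type": "re_prompt",             # vs tool_retry, model_fallback
            "recovery.attempt": attempt,
        },
    ):
        fixed = reprompt_with_error(tool, args, error)
        return call_tool(tool, fixed)
```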

The Observability Architecture That Surfaces Recovery, Not Just Outcomes

Most LLM observability stacks today are built on the implicit assumption that the interesting unit is the request. That assumption was inherited from web service monitoring, and it is wrong for agents. The interesting unit is the recovery graph: the structure of attempts and fallbacks that produced the final outcome. Surfacing that graph is the architectural task.

Three concrete shifts make this tractable. First, retry semantics in the trace: every retry is a sibling span with explicit retry-reason and retry-attempt attributes, not an attribute glommed onto the parent. Second, recovery typing: distinguish a tool retry from a model fallback from a re-prompt from a circuit-breaker reroute. They have different costs and different implications. Third, recovery-aware sampling: keep 100% of traces with non-zero recovery hops, even when sampling everything else. The interesting failures are the ones that succeeded — those are the traces you cannot afford to drop.
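A sketch of the third shift, recovery-aware sampling, as a decision over a completed trace. Head sampling cannot implement this, because the recovery hops have not happened yet when a head sampler decides; the trace shape is the same assumed one as in the earlier sketches.

```python
import random

def keep_trace(trace, baseline_rate=0.05):
    """Tail-based sampling: keep every trace that recovered, sample the rest."""
    if any(s.get("recovery_reason") for s in trace["spans"]):
        return True                          # keep 100% of traces that recovered
    return random.random() < baseline_rate   # sample the uneventful successes
```

If you run the OpenTelemetry Collector, its tail sampling processor can express a keep-if-attribute-present policy like this declaratively; the essential constraint is only that the decision runs after the trace is complete.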

There is a second-order benefit that usually justifies the work on its own. A recovery-aware trace store is a postmortem accelerator. When something does break loudly, the question "what was happening underneath the success rate before the failure became visible?" is suddenly answerable. You can rewind to a week before the incident and see the recovery distribution thickening, the recovery cost climbing, the specific tool whose retry rate was creeping up. Postmortems that used to be archaeology become forensics. The cause was always there in the data — the architecture just did not surface it.

What This Changes for SLOs

If you ship an agent with the SLO definitions you used for the deterministic service it replaced, you are flying blind. The new SLO surface needs at least three lines: a quality SLO defined as success-rate-after-recovery against an end-state oracle, a cost SLO defined as recovery-cost-per-success against the architectural ideal path, and a stability SLO defined as the percentage of traces with zero recoveries. The third one is the most uncomfortable to commit to, because it forces you to acknowledge that recovery is something you would prefer not to need — even when it works.
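Under the same assumed trace shape as the earlier sketches, all three SLO lines fall out of one pass over the trace store. `end_state_ok` is your end-state oracle and `ideal_tokens` is the cost of the architectural ideal path; both are inputs you have to define.

```python
def trace_hops(t):
    return sum(1 for s in t["spans"] if s.get("recovery_reason"))

def trace_extra_tokens(t):
    return sum(s["tokens_in"] + s["tokens_out"]
               for s in t["spans"] if s.get("recovery_reason"))

def agent_slos(traces, end_state_ok, ideal_tokens):
    n = max(len(traces), 1)
    good = [t for t in traces if end_state_ok(t)]
    quality = len(good) / n                                  # success-rate-after-recovery
    tax = (sum(map(trace_extra_tokens, good))
           / max(len(good), 1)) / ideal_tokens               # recovery tax vs ideal path
    stability = sum(1 for t in traces
                    if trace_hops(t) == 0) / n               # zero-recovery share
    return {"quality": quality, "recovery_tax": tax, "stability": stability}
```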

The healthiest production agents I have seen do not optimize for self-healing. They optimize for not needing to. Self-healing is treated as the airbag, not the steering. When recovery rates climb, the team treats it as a degradation, not a feature working harder. They go fix the schema, retrain the tool descriptions, simplify the prompt, narrow the model — whatever is upstream of the recovery — instead of celebrating that the airbag deployed correctly. That posture only becomes possible when the metric stack actually shows recovery as a tax, not a triumph.

MTBF was never going to survive an architecture where failures are absorbed by design. The job now is to measure what the absorption costs you, and to alert on the cost moving before the user-visible quality does. The dashboards that come with most agent platforms are not built for this yet. The teams that build their own — that treat every recovery as a first-class observable, that budget against recovery cost the way they budget against latency, that watch the distribution and not just the mean — are the ones whose agents are still cheap to run a year after launch. The ones that trust the green dashboard are the ones who will be reading their cloud bill in disbelief next quarter.
