
The Retry Budget That Hid Your Provider's Actual Error Rate From Your Dashboard

· 11 min read
Tian Pan
Software Engineer

The weekly review slide said 99.9%. The invoice said the bill had tripled. The two numbers had been on adjacent dashboards for months, and nobody had noticed that they were measuring different worlds. The reliability number was post-retry — every call that eventually returned a 200 counted as a success — and the cost number was every attempt the client made, billed by the token. Between them sat a generous five-attempt retry loop and a provider whose tail latency had been quietly degrading. The first time anyone looked at both numbers together was during an outage, when the cost-anomaly alert fired before the availability alert did.

That is the whole pattern. A retry budget that looks like a reliability mechanism is also a cost-quality knob, and the team that watches only one side of it is paying for an availability number the invoice will eventually correct.

The Two Definitions of "A Call" That Drifted Apart

The retry loop is one of the oldest patterns in distributed systems, and the AWS Builders' Library has been arguing for over a decade that exponential backoff with jitter is non-negotiable for any client talking to a remote service. The SRE Workbook makes the same case from the other direction: retries amplify low error rates into higher levels of traffic, and a single user action that traverses three retrying layers, each making up to four attempts, can produce sixty-four attempts (4^3) on the database. The standard advice is to put a budget on the retries so the amplification has a ceiling.
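As a rough illustration of the three ideas in that paragraph, here is a minimal Python sketch of a "full jitter" backoff delay, the worst-case amplification across stacked retrying layers, and a ratio-based retry budget. The names `full_jitter_delay`, `worst_case_attempts`, and `RetryBudget`, along with the default `base`, `cap`, and `ratio` values, are hypothetical choices for this example, not taken from any particular library.

```python
import random
from typing import Optional


def full_jitter_delay(attempt: int, base: float = 0.5, cap: float = 30.0,
                      rng: Optional[random.Random] = None) -> float:
    """Delay before retry number `attempt` (0-based).

    "Full jitter": pick uniformly between 0 and the capped exponential ceiling.
    """
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def worst_case_attempts(attempts_per_layer: int, layers: int) -> int:
    """Upper bound on attempts hitting the bottom layer when each layer retries on its own."""
    return attempts_per_layer ** layers


class RetryBudget:
    """Permit a retry only while total retries stay within `ratio` of requests seen."""

    def __init__(self, ratio: float = 0.1) -> None:
        self.ratio = ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_spend(self) -> bool:
        # Spend one unit of budget if doing so keeps retries <= ratio * requests.
        if (self.retries + 1) <= self.ratio * self.requests:
            self.retries += 1
            return True
        return False
```

The budget here is deliberately simple (lifetime counters, no time window); a production client would more likely track both counts over a sliding window.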

The advice is right. The problem is the metric the budget hides.

A success-rate SLO computed downstream of the retry loop measures whether the client eventually got an answer. A success-rate SLO computed before any retry happens measures whether the provider gave the client an answer on the first try. Those are different questions about different surfaces, and for years they were close enough that nobody had to distinguish them. LLM workloads have made them diverge.

The reason is that the cost of a retry stopped being negligible. Retrying a GET /healthz against a microservice is essentially free: a handful of bytes, a handful of milliseconds, a CPU cycle the autoscaler didn't notice. Retrying a tool-using LLM call is a re-serialization of the system prompt, the conversation history, the retrieval payload, and the user message, which can mean twenty thousand input tokens billed in full on every attempt. The Anthropic and OpenAI SDKs both default to two retries (three attempts in total) on 429s and 5xxs, which means a moderately bursty provider can multiply your input-token bill by 1.5 without changing your post-retry success metric by a single basis point.
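To make the billing effect concrete, the sketch below estimates an input-token multiplier from a log of calls. It assumes, as the paragraph above does, that every attempt bills the full prompt; the function names and the `(prompt_tokens, attempts)` record shape are hypothetical.

```python
from typing import Iterable, Tuple


def billed_input_tokens(prompt_tokens: int, attempts: int) -> int:
    """Input tokens billed for one logical call, assuming each attempt resends the full prompt."""
    return prompt_tokens * attempts


def input_bill_multiplier(calls: Iterable[Tuple[int, int]]) -> float:
    """Ratio of billed input tokens to what a retry-free world would have billed.

    `calls` is an iterable of (prompt_tokens, attempts) pairs, one per logical call.
    """
    calls = list(calls)
    ideal = sum(tokens for tokens, _ in calls)
    billed = sum(billed_input_tokens(tokens, attempts) for tokens, attempts in calls)
    return billed / ideal
```

For example, if half of otherwise identical 20,000-token calls need a second attempt, `input_bill_multiplier([(20_000, 1), (20_000, 2)])` evaluates to 1.5, even though both calls count as successes after retries.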

The CloudZero observability writeup describes this gap as the missing layer for the people who pay the bills. Engineers see the availability number, finance sees the invoice, and the two parties have no shared metric that explains why one is steady while the other is climbing. The cost is the leading indicator. The availability is the lagging one.

What a 99.9% Success Rate Is Hiding

Imagine a provider whose first-attempt success rate degrades linearly from 99% to 95% over six weeks. With a five-attempt retry budget and reasonably independent failure draws, the post-retry success rate stays above 99.99% for the entire window. The dashboard does not move. The weekly review reports green.
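The claim can be checked with a few lines of Python. This is a sketch under the same assumption as the paragraph, independent failure draws, in which case the post-retry success rate with n attempts is 1 - (1 - p)^n for a first-attempt success rate p. The function names are illustrative.

```python
def post_retry_success(p_first: float, max_attempts: int) -> float:
    """Probability that at least one of `max_attempts` independent attempts succeeds."""
    return 1.0 - (1.0 - p_first) ** max_attempts


def weekly_first_attempt_rates(start: float = 0.99, end: float = 0.95, weeks: int = 6):
    """Linear degradation of the first-attempt success rate, one sample per week."""
    step = (end - start) / weeks
    return [start + step * week for week in range(weeks + 1)]


if __name__ == "__main__":
    for week, p in enumerate(weekly_first_attempt_rates()):
        print(f"week {week}: first-attempt {p:.4f}, post-retry {post_retry_success(p, 5):.8f}")
```

Even at the end of the window the residual failure probability is 0.05^5, a few parts in ten million, which is why a dashboard rounded to two or three decimal places never moves.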

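Meanwhile, under the same independence assumption, the average number of attempts per successful call rises from about 1.01 to about 1.05 (it is 1/p for a first-attempt success rate p). On a workload that runs ten million calls a month at twenty thousand input tokens each, the baseline is two hundred billion input tokens, so the drift adds roughly eight and a half billion billed input tokens a month, about 4% more spend for the same volume. That is the floor, not the ceiling: when failures cluster, as they do during provider brownouts, a call that fails once is more likely to fail again, and the realized multiplier runs higher than the independent-draw estimate. Either way, the invoice grows not because the volume grew but because the same volume now costs more per success.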
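The token arithmetic can be sanity-checked with a short calculation. This sketch assumes independent failure draws, a hard cap on attempts, and that every attempt bills the full prompt; all function names are hypothetical.

```python
def attempts_per_call(p_first: float, max_attempts: int) -> float:
    """Expected attempts per logical call: attempt k+1 happens only if the first k all failed."""
    q = 1.0 - p_first
    return sum(q ** k for k in range(max_attempts))


def attempts_per_success(p_first: float, max_attempts: int) -> float:
    """Expected attempts divided by the probability the call eventually succeeds."""
    q = 1.0 - p_first
    return attempts_per_call(p_first, max_attempts) / (1.0 - q ** max_attempts)


def extra_input_tokens(calls: int, prompt_tokens: int,
                       p_before: float, p_after: float, max_attempts: int) -> float:
    """Additional input tokens billed when the first-attempt rate drifts from p_before to p_after."""
    delta = attempts_per_call(p_after, max_attempts) - attempts_per_call(p_before, max_attempts)
    return calls * prompt_tokens * delta


if __name__ == "__main__":
    print(attempts_per_success(0.99, 5), attempts_per_success(0.95, 5))
    print(extra_input_tokens(10_000_000, 20_000, 0.99, 0.95, 5))
```

Under these assumptions the attempts-per-success figure reduces algebraically to 1/p regardless of the cap, so it is a convenient number to compare against what the client logs actually show.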

The first time anyone connects the two numbers is usually during an outage. The provider's first-attempt rate falls off a cliff — 95% to 60% in an hour — the retry budget can no longer absorb the failure, and the post-retry number finally moves. The cost-anomaly alert fires four hours earlier than the availability alert because the cost signal is a continuous integral over attempts and the availability signal is a step function across the budget ceiling.

This is the same pattern the SRE Workbook discusses in its alerting guidance: alerts have to be sensitive to the symptom you actually care about, and a post-retry success rate is not the symptom. It is a heavily smoothed transformation of the symptom that the retry layer was designed to produce. Smoothed metrics are what you put on a status page. They are not what you alert on.

Pre-Retry and Post-Retry Are Two Different SLOs

The fix is conceptually small and operationally large: expose pre-retry success rate and post-retry success rate as separate metrics, each with its own threshold, each with its own alert.

Pre-retry success rate is the provider's actual error rate as your client sees it. It is the number that should drive a conversation with the vendor when it degrades. It is the number that should drive a model-or-region failover when it crosses a threshold. It is the number that an SRE team should consider the real availability of the dependency, because every basis point it gives up is a basis point your retry budget has to absorb in cost.

Post-retry success rate is the user-perceived availability of your product. It is the number that should drive an SLO conversation with leadership. It is the number that should govern whether your error budget is being burned. It is the number a status page should reflect.
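One way to expose both numbers is to count them at the retry wrapper itself. The following is a minimal sketch, not a drop-in client: `RetryMetrics` and `call_with_retries` are hypothetical names, the counters are in-process only, and the backoff sleep is omitted to keep the example short.

```python
from dataclasses import dataclass
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


@dataclass
class RetryMetrics:
    calls: int = 0                    # logical calls (one per user-visible request)
    attempts: int = 0                 # every attempt sent to the provider
    first_attempt_successes: int = 0  # numerator of the pre-retry success rate
    final_successes: int = 0          # numerator of the post-retry success rate

    @property
    def pre_retry_success_rate(self) -> float:
        return self.first_attempt_successes / self.calls if self.calls else 1.0

    @property
    def post_retry_success_rate(self) -> float:
        return self.final_successes / self.calls if self.calls else 1.0

    @property
    def attempts_per_success(self) -> float:
        return self.attempts / self.final_successes if self.final_successes else float("inf")


def call_with_retries(fn: Callable[[], T], metrics: RetryMetrics, max_attempts: int = 5,
                      retryable: Tuple[Type[BaseException], ...] = (Exception,)) -> T:
    """Run `fn`, retrying on `retryable` errors, and record pre- and post-retry outcomes."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    metrics.calls += 1
    for attempt in range(1, max_attempts + 1):
        metrics.attempts += 1
        try:
            result = fn()
        except retryable:
            if attempt == max_attempts:
                raise  # budget exhausted: counts against the post-retry rate
            continue   # a real client would sleep with backoff and jitter here
        if attempt == 1:
            metrics.first_attempt_successes += 1
        metrics.final_successes += 1
        return result
    raise AssertionError("unreachable")
```

With the counters split this way, the pre-retry rate and attempts-per-success can each carry their own threshold, while the post-retry rate continues to feed the user-facing SLO.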
