
Hidden SDK Retries: Why You're Paying Twice and Don't Know It

10 min read
Tian Pan
Software Engineer

Open the OpenAI Python SDK source and you will find a quiet line: DEFAULT_MAX_RETRIES = 2. The Anthropic SDK ships the same default. Most TypeScript SDKs match. Two retries, exponential backoff, automatic on connection errors, 408, 409, 429, and any 5xx — fired before your code ever sees the failure. You do not configure this. You do not opt in. You usually do not know it is happening, because the metric your app records is request_count, not attempt_count, and the only span your tracer ever sees is the outer one the SDK closes after the final attempt.
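The knob is visible on the client if you go looking for it. A minimal sketch against the current OpenAI Python SDK (v1.x), where max_retries is a client constructor argument and with_options is a per-call override; the model name is a placeholder, and the Anthropic client exposes the same argument:

```python
from openai import OpenAI

# Default client: max_retries falls back to the SDK's DEFAULT_MAX_RETRIES (2),
# so a single call can silently become up to three attempts.
client = OpenAI()

# Opt out: one attempt, failures surface to your own code immediately.
bare = OpenAI(max_retries=0)

# Per-call override without rebuilding the client.
resp = bare.with_options(max_retries=4).chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "ping"}],
)
```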

This is fine, mostly, until it is not. Add an application-level retry decorator on top of that SDK call — the kind every team writes after their first 429 — and you have built a 3×3 storm: the SDK tries three times, your wrapper tries three times around the SDK, and a single user request fans out to nine inference calls during a provider degradation. The provider's bill counts every attempt. Your dashboards count one. The reconciliation, when someone finally runs it, is a quarter-end conversation nobody enjoys.
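The storm is easy to build by accident. A sketch of the usual shape, assuming a tenacity wrapper with three attempts around a client still carrying its default of two silent retries:

```python
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential

client = OpenAI()  # default max_retries=2 -> up to 3 attempts per call

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def ask(prompt: str) -> str:
    # Each run of this body can hit the API up to 3 times on its own.
    # 3 wrapper attempts x 3 SDK attempts = up to 9 inference calls
    # for one user request during a sustained 429/5xx window.
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content
```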

The Default You Did Not Configure

The SDK default is sensible in isolation. A 429 from a transient throttle, a 503 from a load balancer reshuffle, a connection reset because the keep-alive expired mid-flight — the model would have answered fine on the next try. Two retries with backoff handles the most common transient errors invisibly and your code stays clean.

The trouble is that the SDK was written for a world where its caller was a single small service. In a real production stack, that single SDK call is wrapped by:

  • A tenacity decorator the platform team added to make "everything retry on 5xx."
  • A queue worker that retries the whole job on any exception, including the LLM call inside it.
  • An orchestrator (LangGraph, a step function, an agent loop) that retries failed steps as a node-level concern.
  • An ingress gateway with its own retry policy on upstream timeouts.

Each layer's author was thinking locally. None of them are wrong individually. The product of their retry budgets is the actual retry budget your inference workload is operating under, and nobody on the team has done the multiplication. A microservice intuition that "retries are cheap" calibrates well at 50ms RPC scale, where three attempts cost 150ms and a few CPU cycles. The same intuition transferred to a 4-second, 8000-token completion is operating with a cost coefficient orders of magnitude higher than the engineer who chose retries=3 was modeling.
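Doing the multiplication makes the point. The budgets below are illustrative, not measured from any particular stack:

```python
# Worst-case attempts per user request are the product of the layer budgets,
# not the largest one. Illustrative numbers for the layers listed above.
layer_budgets = {
    "sdk": 3,            # 1 attempt + DEFAULT_MAX_RETRIES = 2
    "app_decorator": 3,  # tenacity stop_after_attempt(3)
    "queue_worker": 2,   # job re-enqueued once on any exception
    "gateway": 2,        # one upstream retry on timeout
}

worst_case = 1
for budget in layer_budgets.values():
    worst_case *= budget

print(worst_case)  # 36 inference calls, worst case, for a single user request
```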

Why Your Trace Doesn't See It

The dirty part is that the retries are invisible to the tools you would normally use to find them.

Most APM and tracing setups instrument at the application boundary. You wrap your client.chat.completions.create(...) call in a span; the span starts when you enter, ends when the SDK returns. The SDK's internal retry loop runs entirely inside that span. From the trace's point of view, you made one call that took 9 seconds. From the provider's point of view, you made three calls that each took 3 seconds. From your bill's point of view, you bought three completions.

The standard OpenTelemetry gen_ai.* semantic-convention attributes are emitted once per outer call. They record the prompt, the response, and the token counts of the final successful attempt. The failed attempts that preceded it disappear into the void unless you have explicitly instrumented at the HTTP layer below the SDK — which most teams do not, because the SDK presents itself as the network boundary.
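One way to make the attempts visible without changing retry behavior is to instrument that HTTP layer directly. A sketch using the OpenAI Python client's http_client parameter and httpx event hooks, which fire once per outgoing request, including the SDK's silent retries; the dict counter stands in for whatever metric or child span you would actually emit:

```python
import httpx
from openai import OpenAI

attempts = {"count": 0}

def count_attempt(request: httpx.Request) -> None:
    # Fires once per outgoing HTTP request, including the SDK's own retries.
    # In a real setup, increment a counter metric or open a child span here.
    attempts["count"] += 1

client = OpenAI(
    http_client=httpx.Client(event_hooks={"request": [count_attempt]})
)

# After running traffic: attempts["count"] divided by your request_count
# metric is the hidden multiplier the outer span never showed you.
```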

The first telltale is the gap between provider-side metrics and your own. Pull the request count from the provider dashboard for a one-hour window; pull the request count from your application metrics for the same window. They should agree to within a few percent. They usually do not. The delta is invisible retry traffic and it gets larger every time the provider has a bad afternoon.

A more reliable audit is to disable SDK retries entirely in a controlled experiment and add explicit application-level retries on top. Now every attempt becomes a span. Re-run a representative slice of production load. The total attempt count, divided by the request count, is the multiplier you were paying for and could not see. In most teams that have not tuned this, the multiplier sits between 1.05 and 1.3 — and spikes to 3 or 4 during provider incidents.
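A sketch of that audit configuration, assuming tenacity for the explicit retries and an OpenTelemetry tracer for the per-attempt spans; the retryable-exception filter here is broader than the SDK's own status-code policy, which is acceptable for an audit:

```python
from openai import OpenAI, APIConnectionError, APIStatusError
from opentelemetry import trace
from tenacity import retry, retry_if_exception_type, stop_after_attempt, wait_exponential

tracer = trace.get_tracer("llm.retry_audit")
client = OpenAI(max_retries=0)  # the SDK now makes exactly one attempt per call

@retry(
    retry=retry_if_exception_type((APIConnectionError, APIStatusError)),
    stop=stop_after_attempt(3),
    wait=wait_exponential(min=1, max=10),
)
def attempt(prompt: str) -> str:
    # One span per attempt: failed attempts now show up in the trace and in
    # attempt_count instead of vanishing inside the SDK's retry loop.
    with tracer.start_as_current_span("llm.attempt"):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
        )
        return resp.choices[0].message.content
```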

The 3×3 Storm in a Provider Blip

The storm is not theoretical. The arithmetic is simple and ugly.

Imagine a normal afternoon. Provider rate limits tighten because a noisy neighbor is burning quota in the same region. Your error rate goes from 0.1% to 8%. The SDK's default policy kicks in: every failing call is automatically retried up to two more times. Most of those retries succeed on the second attempt — the throttle was brief — and your application metrics show a 99% success rate, slightly elevated p95 latency, and a green status page.

Underneath, every user request that hit a 429 actually hit the API two or three times. In the worst minutes of the window, when the throttle is rejecting most attempts, the token bill for that traffic runs close to triple what your dashboard attributes to it. The provider's load on that endpoint is correspondingly higher than it would have been without your SDK retries, and other customers' SDKs are retrying too. The throttle that was supposed to relieve pressure is now being fed up to three attempts for every request it rejected. This is the retry storm in textbook form, and the amplification is multiplicative across cooperating fleets.
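The back-of-envelope model, under the simplifying assumption that attempts fail independently at the same rate, is just a geometric sum over the retry budget:

```python
def expected_attempts(p_fail: float, max_retries: int) -> float:
    # Attempt k+1 only happens if the first k attempts all failed, so
    # expected API calls per request = sum of p_fail**k for k = 0..max_retries.
    return sum(p_fail ** k for k in range(max_retries + 1))

print(expected_attempts(0.08, 2))  # ~1.09: a mild blip is a ~9% hidden surcharge
print(expected_attempts(0.90, 2))  # ~2.71: a hard throttle nearly triples each request
```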
