
The Backpressure Signal Your Inference Provider Refuses to Send

Tian Pan
Software Engineer

Your retry logic backs off on 429. Your queue depth alarm fires when latency rises. Between those two signals there is a region of provider load where the right action is "slow down by twenty percent" — and the only thing the provider will tell you is the binary throttle that arrives too late. The single most useful signal for an agent fleet to coordinate on is the one no inference API actually exposes.

A 429 is a tombstone, not a warning. By the time you receive one, the provider has already decided your traffic is excessive, you have already wasted a request's worth of token accounting, and — if you are sharing a tenant with other consumers — they have probably gotten one too. The interesting failure mode is not the 429 itself; it is the seconds before it, when every client in the world is flying blind between "everything is fine" and "you are cut off."

The Dead Zone Between 200 and 429

Frame the problem as a control loop. Your client has two inputs for deciding whether to send the next request: the response code and the response latency. Neither is sufficient on its own.

The response code is a step function. It is 200 until it is 429, with nothing in between. There is no "you are at 87% of capacity" code. The rate-limit headers help — Anthropic returns anthropic-ratelimit-tokens-remaining on every response, OpenAI returns x-ratelimit-remaining-tokens — but those describe your bucket, not the provider's underlying capacity. You can be at 40% of your token bucket while the upstream GPU cluster is at 95% and about to start shedding load across all customers. The headers tell you about your contract; they do not tell you about the substrate.

Response latency is the other axis, and it is noisy in a way that defeats most simple thresholds. Latency on an LLM call is a function of prompt length, output length, model selection, prompt cache state, and provider load. The first four are properties of your request. The fifth is the signal you actually want, and it is buried under the others. A naive "slow down when p95 latency rises" rule will throttle you when a user pastes in a long document, not when the provider is under pressure.

So you end up with a control loop driven by two unreliable signals: a binary code that fires too late and a continuous metric whose variance comes mostly from things you control. The dead zone in the middle is where every team writes a different heuristic, every heuristic is wrong in a different way, and the bill arrives at the end of the month.

Correlated Overshoot Is a Multi-Tenant Disease

The single-client version of this problem is annoying. The multi-tenant version is structural.

If every consumer of an inference API is running its own client-side controller with no visibility into anyone else's traffic, the moment provider load rises, every client independently keeps sending at its previous rate until each one hits a 429. The 429s arrive in a wavefront. Every client backs off. The provider's load drops sharply. The clients' retry timers expire roughly together (especially if they use the same library's defaults, which they do). They all resume. The load spikes again. The provider throttles again. The recovery oscillates instead of converging.

This is the classic thundering herd, but with two LLM-specific twists. First, the unit of work is far more expensive than an ordinary HTTP call: a single retried request can cost seconds to tens of seconds of GPU time, not milliseconds. Second, the consumers are not just retrying user actions; they are agent loops that will themselves issue more calls in response to whatever they get back, so a successful retry is the start of an N-step burst, not the end of one.

The result is that "I am a well-behaved client" is not actually a defense. You can have perfectly polite jittered exponential backoff and still contribute to the correlated overshoot, because everyone else has it too, and the synchronization comes from the shared signal (the 429 wavefront) rather than from any coordination between clients. The problem is not bad client behavior. It is the absence of a shared upstream signal that lets clients adjust before the wavefront forms.
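
For reference, the "polite" baseline described above looks roughly like the sketch below: capped exponential backoff with full jitter. The function name and the default `base` and `cap` values are illustrative assumptions, not any particular library's defaults.

```python
import random


def full_jitter_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Seconds to wait before retry number `attempt` (0-based).

    The exponential ceiling doubles per attempt and is capped; the actual
    delay is drawn uniformly from [0, ceiling] so that clients which failed
    at the same instant do not all retry at the same instant.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


if __name__ == "__main__":
    for attempt in range(5):
        print(attempt, round(full_jitter_delay(attempt), 2))
```

Jitter spreads retries within one backoff round, but every client still starts that round from the same 429 wavefront, which is why it does not remove the correlation on its own.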

What TCP Got Right That LLM APIs Got Wrong

The networking analogy is not subtle. TCP solved this problem in the 1980s.

TCP does not just send packets until the receiver complains. It runs a congestion window (cwnd) that grows additively when things are working and shrinks multiplicatively when loss is detected. The genius of the design is that the sender infers congestion from observable signals (loss, ack timing, ECN bits) and adjusts before the network melts down. The protocol exposes enough information for the sender to make a graduated decision — slow down by half, not slow down to zero — and AIMD across many independent senders produces a roughly fair, roughly converged allocation of bandwidth without anyone coordinating.
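
To make the additive-increase, multiplicative-decrease idea concrete, here is a minimal sketch of an AIMD window applied to in-flight requests rather than packets. The class name, the default constants, and the idea of calling `on_congestion` on a 429 or a latency alarm are assumptions for illustration; real TCP stacks are considerably more involved.

```python
class AIMDWindow:
    """Congestion-window-style limit on concurrent in-flight requests."""

    def __init__(self, initial=1.0, floor=1.0, ceiling=64.0,
                 increase=1.0, decrease=0.5):
        self.cwnd = initial        # current window, kept as a float
        self.floor = floor         # never shrink below this
        self.ceiling = ceiling     # never grow above this
        self.increase = increase   # additive step per full window of successes
        self.decrease = decrease   # multiplicative factor on congestion

    def on_success(self):
        # Additive increase: each success adds increase/cwnd, so a full
        # window of successes grows the window by roughly `increase`.
        self.cwnd = min(self.ceiling, self.cwnd + self.increase / self.cwnd)

    def on_congestion(self):
        # Multiplicative decrease: cut the window by a fixed factor
        # instead of dropping to zero.
        self.cwnd = max(self.floor, self.cwnd * self.decrease)

    @property
    def limit(self):
        # Whole number of requests allowed in flight right now.
        return int(self.cwnd)
```

A client would call `on_success()` after each clean response, `on_congestion()` on whatever it treats as a congestion signal, and refuse to start a new request while its in-flight count is at `limit`.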

The LLM API equivalent does not exist. There is no header that says "the cluster you are calling is at 92% of capacity right now, please ease off." There is no equivalent of ECN, where the upstream can mark a successful response with a "by the way, you are part of the problem" flag. There is no equivalent of the slow-start phase, where a new client probes capacity gradually instead of arriving at its full configured concurrency in its first second. The closest thing, the rate-limit headers, describes your contractual entitlement rather than the substrate's health, and so does little for the coordination problem.

This is a protocol-design gap, not a client-side bug. You can write the cleverest adaptive concurrency controller in the world and you will still be inferring upstream load from a noisy proxy, because the provider does not consider its own load to be your business. The team treating this as a client problem will keep writing more elaborate Kalman filters over latency variance. The team treating it as a protocol problem will, eventually, demand the missing header.

The Patterns That Close the Gap From Below

You cannot wait for the providers to ship the header you want. In the meantime, three patterns help, ranked roughly by how much pain they remove and how much complexity they add.

Infer load from latency variance, not just latency. A single p95 latency number is too noisy to act on. The change in p95 over a rolling window, normalized for request shape, is a much better signal: when the variance of completion latency for a fixed prompt-and-output-length profile rises, that is upstream queueing leaking into your view. A controller that watches how latency and its spread are changing, rather than their absolute values, can throttle voluntarily a couple of seconds before the 429 wavefront. Vector's adaptive request concurrency and Uber's Cinnamon auto-tuner both do versions of this for general HTTP traffic; the LLM-specific twist is that you want to bucket by output length so a long completion does not look like congestion.
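
One way to sketch this: keep a rolling window of latencies per output-length bucket and flag congestion when the variance of the newer half of the window is much larger than the variance of the older half. Everything here is an assumption made for illustration (the `VarianceMonitor` name, power-of-two bucketing, the window size, and the 4x ratio), and it is not how Vector or Cinnamon implement their controllers.

```python
from collections import defaultdict, deque
from statistics import pvariance


class VarianceMonitor:
    """Flags a rise in latency variance within an output-length bucket."""

    def __init__(self, window=20, ratio_threshold=4.0):
        self.window = window
        self.ratio_threshold = ratio_threshold
        # Each bucket keeps 2 * window samples: older half is the baseline,
        # newer half is the "recent" view.
        self.samples = defaultdict(lambda: deque(maxlen=2 * window))

    @staticmethod
    def bucket(output_tokens):
        # Power-of-two buckets (1, 2-3, 4-7, 8-15, ...), so a long
        # completion is only compared against other long completions.
        return max(output_tokens, 1).bit_length()

    def record(self, output_tokens, latency_s):
        self.samples[self.bucket(output_tokens)].append(latency_s)

    def congested(self, output_tokens):
        values = list(self.samples[self.bucket(output_tokens)])
        if len(values) < 2 * self.window:
            return False  # not enough history to judge this bucket
        baseline = pvariance(values[:self.window])
        recent = pvariance(values[self.window:])
        # Guard against a zero baseline so the ratio test stays defined.
        return recent > self.ratio_threshold * max(baseline, 1e-9)
```

Bucketing on output length alone ignores prompt length and cache state, so a fuller version would likely key the buckets on more of the request shape.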

Use the headers as a leading indicator, even though they are about your bucket. A common rule that actually works: when x-ratelimit-remaining-tokens drops below 20% of the limit, preemptively cut your send rate by half. This does not solve the substrate-load problem, but it eliminates one of the two paths to a 429 (the path where you are the problem) so when a 429 does arrive, you know it is provider-side and can react differently. Treating the two cases identically is what produces the worst behavior.
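
A minimal sketch of that rule, assuming OpenAI-style header names (`x-ratelimit-remaining-tokens` and `x-ratelimit-limit-tokens`) delivered as a dict with lowercased keys. The function names and the `"own-bucket"` / `"provider-side"` labels are hypothetical, and the 20% and 50% constants are simply the ones from the paragraph above.

```python
def remaining_fraction(headers):
    """Fraction of the token bucket still available, or None if unknown."""
    try:
        remaining = float(headers["x-ratelimit-remaining-tokens"])
        limit = float(headers["x-ratelimit-limit-tokens"])
    except (KeyError, ValueError):
        return None  # headers missing or malformed: make no claim
    if limit <= 0:
        return None
    return remaining / limit


def next_send_rate(current_rate, headers, low_water=0.20, cut=0.5):
    """Halve the send rate once the bucket drops below the low-water mark."""
    frac = remaining_fraction(headers)
    if frac is not None and frac < low_water:
        return current_rate * cut
    return current_rate


def classify_429(last_fraction, low_water=0.20):
    """Guess which path produced a 429 from the last observed bucket state."""
    if last_fraction is None:
        return "unknown"
    return "own-bucket" if last_fraction < low_water else "provider-side"
```

In practice the cut would need to be applied once per rate-limit window rather than on every response, otherwise repeated low readings compound the reduction.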

Self-throttle before the provider does, as a circuit-breaker tier of your own. Pick a concurrency ceiling below the contractual maximum and treat it as a soft limit. When latency variance crosses a threshold, drop the ceiling further. This is voluntary headroom (you are paying for capacity you do not use during good times), and it is the price of not participating in the correlated overshoot during bad times. The teams that do this end up with smoother bills than the ones that run at 100% of contractual capacity and absorb the throttle storms when they arrive.
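
The soft ceiling can be sketched as a small admission gate with two levels. The class name, the `headroom` and `stressed_factor` parameters, and the idea of an external variance check calling `set_stress` are illustrative assumptions; the sketch is also single-threaded and would need a lock in a concurrent client.

```python
class SoftCeiling:
    """Concurrency gate that sits below the contractual maximum."""

    def __init__(self, contractual_max, headroom=0.8, stressed_factor=0.5):
        # Normal ceiling: a fraction of what the contract allows.
        self.normal = max(1, int(contractual_max * headroom))
        # Stressed ceiling: a further cut applied while variance is high.
        self.stressed = max(1, int(self.normal * stressed_factor))
        self.in_flight = 0
        self.under_stress = False

    @property
    def ceiling(self):
        return self.stressed if self.under_stress else self.normal

    def set_stress(self, flag):
        # Called by whatever watches latency variance.
        self.under_stress = bool(flag)

    def try_acquire(self):
        # Admit a request only if it keeps us at or under the ceiling.
        if self.in_flight >= self.ceiling:
            return False
        self.in_flight += 1
        return True

    def release(self):
        self.in_flight = max(0, self.in_flight - 1)
```

Requests already in flight are not cancelled when the ceiling drops; the gate just stops admitting new ones until enough have finished.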

For platform teams running an internal AI gateway in front of multiple consumer applications, there is a fourth pattern: do the coordination yourself. The gateway sees all the traffic. It can allocate capacity across consumers, throttle low-priority traffic when it sees variance rising, and present a single, well-behaved client to the provider. This is not a fix for the protocol gap; it is centralizing the workaround so each downstream team does not have to reinvent it. It works because you control both sides of the gateway boundary. It does not work across organizations.
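
A gateway's allocation step might look something like the sketch below: strict priority order over a shared budget, with the budget shrunk while variance is elevated so that low-priority consumers absorb the cut first. The `allocate` function, the tuple layout, and the `pressure_factor` default are hypothetical; a production gateway would more likely use weighted fair sharing than strict priority.

```python
def allocate(capacity, demands, pressure=False, pressure_factor=0.75):
    """Split `capacity` across consumers in priority order.

    demands: iterable of (name, priority, requested), where a lower
    priority number means more important traffic.
    Returns a dict mapping name to granted capacity.
    """
    # Under pressure, voluntarily shrink the total the gateway will send.
    budget = capacity * pressure_factor if pressure else capacity
    grants = {}
    for name, _priority, requested in sorted(demands, key=lambda d: d[1]):
        grant = min(requested, budget)
        grants[name] = grant
        budget -= grant
    return grants


if __name__ == "__main__":
    demands = [("batch", 2, 60), ("chat", 0, 50), ("agents", 1, 30)]
    print(allocate(100, demands))
    print(allocate(100, demands, pressure=True))
```

With these example numbers, the interactive consumer keeps its full request in both cases while the batch consumer drops to zero under pressure.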

The Contract-Design Argument

The longer-horizon answer is that providers should expose a load-pressure signal as a first-class header. Something like x-provider-load-class: green | yellow | red, or a numeric x-provider-utilization: 0.0–1.0, returned on every response. It does not need to be precise. It does not need to be SLA-backed. It just needs to be a non-binary, frequently-updated signal that well-behaved clients can use to ramp down voluntarily before the wavefront forms.
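
If such a header existed, the client side would be almost trivial. The sketch below assumes the hypothetical `x-provider-load-class` header proposed above, which no provider sends today, and picks illustrative multipliers (the 0.8 for yellow mirrors the "slow down by twenty percent" case).

```python
# Hypothetical mapping: the header and these factors are assumptions.
LOAD_CLASS_FACTOR = {"green": 1.0, "yellow": 0.8, "red": 0.5}


def target_rate(base_rate, headers):
    """Scale the send rate by the provider's advertised load class.

    `headers` is a dict with lowercased keys. If the (hypothetical)
    header is absent or unrecognized, the rate is left unchanged.
    """
    load_class = headers.get("x-provider-load-class", "").strip().lower()
    factor = LOAD_CLASS_FACTOR.get(load_class, 1.0)
    return base_rate * factor
```

The point of the sketch is the shape of the decision: a graduated multiplier applied on successful responses, rather than a hard stop applied after a failure.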

The objection providers will raise is that exposing the signal teaches clients to exploit it — to push right up to the line and then retreat. That objection misunderstands the asymmetry. Clients are already pushing up to the line, because the line is the only signal they have. Giving them a graduated signal does not make them push harder; it lets the polite ones back off earlier, which means the 429s arrive later, which means the throttle storms recover faster, which means the provider's tail latency improves. This is the same argument TCP's designers had to win in the 1980s, and they won it for the same reason: a shared signal that lets many independent actors coordinate is better than no signal, even if a few actors abuse it.

Until that header ships, every team building on inference APIs is implicitly betting that their adaptive controller is smarter than everyone else's. That bet is not winnable. The team treating backpressure as a client-side bug will keep building more elaborate machinery against a missing signal. The team treating it as a protocol-design gap will lobby the providers, build the gateway in the meantime, and stop blaming their own code for the wavefronts.

The backpressure signal you want is the one your provider has not yet decided to send. Build around its absence; ask for its presence.
