Skip to main content

One post tagged with "agent-fleets"

View all tags

The Backpressure Signal Your Inference Provider Refuses to Send

· 9 min read
Tian Pan
Software Engineer

Your retry logic backs off on 429. Your queue depth alarm fires when latency rises. Between those two signals there is a region of provider load where the right action is "slow down by twenty percent" — and the only thing the provider will tell you is the binary throttle that arrives too late. The single most useful signal for an agent fleet to coordinate on is the one no inference API actually exposes.

A 429 is a tombstone, not a warning. By the time you receive one, the provider has already decided your traffic is excessive, you have already wasted a request's worth of token accounting, and — if you are sharing a tenant with other consumers — they have probably gotten one too. The interesting failure mode is not the 429 itself; it is the seconds before it, when every client in the world is flying blind between "everything is fine" and "you are cut off."