Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services
A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.
Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.
This post walks through how to apply standard queue-theory tools — Little's Law, bulkheads, admission control, and token-bucket backpressure — specifically to the shape of LLM workloads, and why the naive retry logic that ships in most SDK examples is closer to a denial-of-service tool than a resilience pattern.
Your Provider Is a Variable-Capacity Downstream
Start with what the provider actually exposes. OpenAI, Anthropic, and the major hosted providers all enforce at least two limits in parallel: requests per minute (RPM) and tokens per minute (TPM). Anthropic adds a third dimension by splitting input and output tokens, so a Claude integration can hit one of three different ceilings depending on whether it's prompt-heavy, generation-heavy, or chat-style balanced. None of these limits are guaranteed; they're sliding-window soft caps that move with capacity, your account standing, and the provider's overall load.
This is already different from a database connection pool in two important ways. First, the binding constraint changes between requests — a long-context summarization request might hit TPM while a high-frequency classification workload hits RPM, even if the dollar cost is identical. Second, the actual capacity is partially observable: you only learn you're over the limit after the provider returns a 429, which means your local view of "how full is the pipe" is always stale and reactive. You cannot poll a "current quota usage" gauge the way you'd query a connection pool.
Add to this the variable-token-length problem. Recent queue-theory work on LLM inference using M/G/1 models showed that under heavy-tailed output distributions, a small fraction of unusually long responses can dominate queueing delay for everyone else. A single 8K-token generation behind your customer's request blocks an arbitrary number of small requests behind it. Your p99 isn't shaped by your average load; it's shaped by the longest output your traffic happens to produce in the window.
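The head-of-line effect is easy to reproduce in a few lines. The sketch below is a toy single-server FIFO simulation (all parameters are assumptions for illustration: Poisson arrivals at 1 request/second, and two service-time distributions with the *same* 0.6-second mean, one uniform and one heavy-tailed):

```python
import random

def fifo_waits(arrivals, services):
    """Queueing delay (time spent waiting before service starts) in a
    single-server FIFO queue."""
    free_at, waits = 0.0, []
    for t, s in zip(arrivals, services):
        start = max(t, free_at)      # wait until the server is free
        waits.append(start - t)
        free_at = start + s
    return waits

def p99(xs):
    xs = sorted(xs)
    return xs[int(0.99 * len(xs))]

random.seed(7)
n = 20_000
arrivals, t = [], 0.0
for _ in range(n):
    t += random.expovariate(1.0)     # Poisson arrivals, 1 req/sec
    arrivals.append(t)

# Same mean service time (0.6s, so ~60% utilization) in both cases:
uniform_len = [0.6] * n                                  # every response equal
heavy_len = [random.choice([0.2] * 95 + [8.2] * 5)       # 5% very long outputs
             for _ in range(n)]                          # mean is still 0.6s

p99_uniform = p99(fifo_waits(arrivals, uniform_len))
p99_heavy = p99(fifo_waits(arrivals, heavy_len))
```

At identical utilization, the heavy-tailed distribution produces a far larger p99 wait: the handful of long generations dominate everyone's queueing delay, which is exactly the M/G/1 result above.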
The implication is that the standard "treat the downstream as a fixed pipe and retry on failure" mental model doesn't survive contact with reality. You need to size, schedule, and shed load against a moving target.
Little's Law for Token-Based Services
Little's Law is the cheapest piece of math in distributed systems, and it applies directly here. The long-run average number of requests in flight in your system equals the arrival rate multiplied by the average time each request spends in the system: L = λW.
For an LLM call path, plug in real numbers. If your agent endpoint receives 50 requests per second and the average end-to-end latency (queueing + provider call + post-processing) is 4 seconds, you have 200 requests in flight on average. That's the number you need to size your worker pool, your connection pool, your in-memory buffers, and your provider concurrency limits against. If your worker pool can hold 100 concurrent in-flight requests, you're already in queueing territory at steady state — and steady-state queueing is not steady, it's a slow accumulation that turns sharp the moment latency wobbles.
The more useful application is in the other direction: Little's Law tells you what happens when latency degrades. If the provider's p50 doubles from 2s to 4s — a normal occurrence during a model rollout or regional load shift — your in-flight count doubles for the same arrival rate. Anything in your system that has a fixed concurrency cap (HTTP server worker count, async runtime task slots, provider client pool size) is now operating at a different load factor than it was an hour ago, and the slack you thought you had has silently evaporated.
Two practical rules fall out of this:
- Size every concurrency limit for 2x your steady-state Little's Law number, not 1.1x. The variance on LLM latency is large enough that running close to the line means a single bad provider minute breaks you.
- Track the in-flight number as a first-class metric. Most teams instrument arrival rate and latency separately and never compute the product. Wire λ × W into a single dashboard panel and put alerts on its second derivative — that's the leading indicator for when your bulkheads are about to fail.
The Admission-Control Layer
Once you accept that your downstream is variable and your in-flight count moves with it, the next question is what to do at the front door. Naive systems accept every request, queue it internally, and let pressure propagate as latency. This works until it doesn't, and "doesn't" arrives suddenly because user-facing latency hides queue growth until the queue is already pathological.
Admission control flips this around. Before a request enters your processing pipeline, you decide whether the system can absorb it. If not, you reject — explicitly and immediately, with a clear error, ideally with a Retry-After hint. The rejected request becomes the caller's problem, which sounds harsh but is actually the only stable equilibrium. A system that accepts requests it can't process eventually fails everything; a system that rejects requests it can't process fails some things and stays up.
The admission decision should be driven by your local view of in-flight work, not by the provider's response. By the time you see a 429, you've already paid the round-trip latency, you've consumed a request slot, and you've added one more failed call to the provider's view of your account. Your admission layer should reject requests before they're allowed to attempt the call.
A reasonable admission pipeline for an LLM endpoint looks like this:
- Compute an estimated token cost for the request from prompt length plus a max-output assumption.
- Check the estimate against a token-bucket budget that's sized at, say, 80% of your provider TPM.
- If the bucket has room, take the tokens and admit the request.
- If not, return a structured rejection telling the caller how long to wait.
The token-bucket part is critical. Provider rate limits are token-denominated, so a request-count bucket will silently let through a few enormous prompts that blow your TPM ceiling. The bucket has to be denominated in the same unit as the limit you're protecting.
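The four steps above can be sketched as a single token-denominated bucket. This is a minimal illustration, not a production limiter: the `TokenBudget` name is hypothetical, the 80% headroom figure comes from the text, and the injectable clock exists only to make the sketch testable:

```python
import time

class TokenBudget:
    """Token bucket denominated in tokens per minute, matching the
    provider's TPM limit rather than counting requests."""

    def __init__(self, provider_tpm, headroom=0.8, clock=time.monotonic):
        self.capacity = provider_tpm * headroom      # e.g. 80% of provider TPM
        self.refill_per_sec = self.capacity / 60.0
        self.tokens = self.capacity
        self.clock = clock
        self.last = clock()

    def _refill(self):
        now = self.clock()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_per_sec)
        self.last = now

    def try_admit(self, prompt_tokens, max_output_tokens):
        """Admit or reject *before* the provider call is attempted.
        Returns (admitted, retry_after_seconds)."""
        cost = prompt_tokens + max_output_tokens     # estimated token cost
        self._refill()
        if cost <= self.tokens:
            self.tokens -= cost
            return True, 0.0
        # Structured rejection: tell the caller how long until enough refill.
        wait = (cost - self.tokens) / self.refill_per_sec
        return False, wait
```

Note that the cost estimate uses the max-output assumption, so it over-reserves for short responses; a refinement is to refund the unused output tokens after the call completes.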
Bulkheads and the Noisy-Neighbor Problem
Even with a working admission layer, you usually need to slice your capacity into separate pools rather than running everything against a single global budget. The bulkhead pattern, borrowed from ship design, prevents one workload from sinking everything else.
Consider a typical multi-feature application: you have a chat assistant on the hot path, a background summarization worker, and an internal agent that runs scheduled enrichment jobs. All three call the same provider on the same API key. Without bulkheads, a runaway batch job — maybe an internal user kicked off a 50,000-document re-summarization — eats the entire account TPM, and your customer-facing chat assistant starts returning 429s. The user-visible feature with the tightest latency budget is the first one to fail.
The fix is to give each workload a dedicated slice of the global budget. The chat path gets, say, 60% of TPM with a hard reservation. Background workers get 30% and can burst up to 50% only when chat is underutilized. Scheduled jobs get 10% and run on a leaky-bucket schedule that smooths their load. None of these slices is enforced by the provider; you enforce them locally with separate token buckets, separate worker pools, and separate priority queues.
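A per-window sketch of that partitioning, using the illustrative 60/30/10 split and 50% burst cap from the text (the `Bulkheads` name is hypothetical, counters are assumed to reset each rate-limit window, and a production version would also make borrowed capacity preemptible when chat traffic returns):

```python
class Bulkheads:
    """Per-workload slices of a shared provider TPM budget."""

    def __init__(self, global_tpm):
        self.slices = {
            "chat":       {"cap": 0.60 * global_tpm, "used": 0.0},  # hard reservation
            "background": {"cap": 0.30 * global_tpm, "used": 0.0},
            "scheduled":  {"cap": 0.10 * global_tpm, "used": 0.0},
        }
        self.background_burst_cap = 0.50 * global_tpm

    def try_admit(self, workload, token_cost):
        s = self.slices[workload]
        if s["used"] + token_cost <= s["cap"]:
            s["used"] += token_cost
            return True
        # Burst path: background may borrow chat's unused headroom, up to its
        # own 50% burst cap. Borrowed tokens are charged against chat's slice
        # so the hard reservation stays honest.
        if workload == "background":
            chat = self.slices["chat"]
            if (s["used"] + token_cost <= self.background_burst_cap
                    and chat["used"] + token_cost <= chat["cap"]):
                chat["used"] += token_cost
                s["used"] += token_cost
                return True
        return False

    def reset_window(self):
        """Called at each rate-limit window boundary (simplification)."""
        for s in self.slices.values():
            s["used"] = 0.0
```

The key property: a runaway background job can never take more than its burst cap, and it can burst at all only while chat has slack.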
The discipline this requires is treating provider quota as an internal multi-tenant resource rather than a free pool. Once you have more than one product surface using LLM calls, every new feature has to come with a stated capacity ask and a slice of the budget allocated to it. This is the kind of platform work that nobody plans for at MVP and everyone is forced to do by quarter three.
Why Naive Retry Logic Breaks Backpressure
Most retry logic ships as a feature in client SDKs or HTTP wrappers, and almost all of it is wrong for LLM workloads. The default policy — exponential backoff with jitter, three attempts, fixed budget per request — was designed for stateless idempotent requests against a service whose throughput recovers quickly when you back off. Neither assumption holds well here.
The first failure mode is retry amplification. If your agent makes a chain of five LLM calls and each call has a 3-attempt retry policy, a transient provider blip can cost up to 15 attempted requests for one user-visible operation. Worse, retries stack multiplicatively across layers: if a wrapper above the chain also retries the whole workflow, and a gateway above that retries again, each level multiplies the one below it, so three nested 3-attempt layers produce up to 3³ = 27 attempts for a single logical call. Each of those attempts costs tokens. Each costs an RPM slot. Each delays the subsequent step. This is how a 1% transient error rate at the provider becomes a 50% latency degradation and a 5x cost spike at your application.
The second failure mode is retries colliding with backpressure. When your admission layer is doing its job and rejecting requests with a Retry-After, naive retry logic at the client immediately tries again, ignoring the hint or honoring it only loosely. The retries themselves are the load you were trying to shed. A correctly designed retry policy has to be subordinate to the admission decision: if the system says "we're full, come back in 30 seconds," the client's job is to wait at least that long, with jitter, and to count the wait against an overall request budget.
The third failure mode is the retry storm during recovery. When a provider degradation ends, every client that has been backing off comes back at roughly the same time. If your retry policy is exponential without proper jitter, you see a synchronized wave of retries hit the freshly recovered provider, and the recovery doesn't take. Real jitter — a uniform random delay across the entire backoff window, not a small additive perturbation — is the difference between recovery and a second outage.
The retry policy that actually works combines four things:
- Per-call attempt cap, sized so the chained worst case stays inside the user's latency budget.
- Per-session retry budget across all calls in a workflow, so an agent can't burn its entire token allocation on retries.
- Honoring Retry-After strictly, with the server-provided value as a floor, not a hint.
- Full-window jitter on backoff, especially for recovery scenarios.
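The four pieces fit in a few dozen lines. A sketch under stated assumptions (the `backoff_delay` and `RetryBudget` names are hypothetical, and the default caps are illustrative, not recommendations):

```python
import random

def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None):
    """Full-window jitter: uniform over [0, min(cap, base * 2**attempt)],
    with any server-provided Retry-After as a hard floor, not a hint."""
    window = min(cap, base * (2 ** attempt))
    delay = random.uniform(0, window)
    if retry_after is not None:
        delay = max(delay, retry_after)   # server value is a floor
    return delay

class RetryBudget:
    """Combines the per-call attempt cap with a per-session retry budget
    shared across all calls in a workflow, so a multi-step agent can't
    burn its whole allocation on retries."""

    def __init__(self, max_attempts_per_call=3, max_retries_per_session=6):
        self.per_call = max_attempts_per_call
        self.session_left = max_retries_per_session

    def may_retry(self, attempt):
        """attempt is 0-based: attempt 0 is the first try, not a retry."""
        if attempt + 1 >= self.per_call:
            return False                  # per-call cap exhausted
        if self.session_left <= 0:
            return False                  # session budget exhausted
        self.session_left -= 1
        return True
```

Usage: after a failed call at attempt `n`, check `may_retry(n)`; if it allows the retry, sleep for `backoff_delay(n, retry_after=...)` before trying again, passing through any Retry-After value the server or your own admission layer returned.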
What to Build First
If you're starting from a system that has none of this — a direct provider call with the SDK's default retry policy and no admission control — the ordering matters. Build the in-flight metric first; you can't size anything else without it. Then add a token-bucket admission layer in front of the provider call, sized at 70-80% of your TPM. Then partition the bucket by workload. Then fix the retry policy last, because a correctly partitioned bucket and an admission layer make most retry pathologies impossible by construction.
The mental model worth carrying into all of this: your LLM provider is not a service, it's a resource you're sharing with strangers, and the only number you fully control is how much of it you let your own application try to use. Everything else is queue theory and humility about variance.
The teams that survive their first big provider incident learn this the hard way. The teams that survive the second one have built admission control before the third happens.
- https://en.wikipedia.org/wiki/Little's_law
- https://blog.danslimmon.com/2022/06/07/using-littles-law-to-scale-applications/
- https://arxiv.org/abs/2407.05347
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://cookbook.openai.com/examples/how_to_handle_rate_limits
- https://platform.openai.com/docs/guides/rate-limits
- https://markaicode.com/anthropic-api-rate-limits-429-errors/
- https://www.typedef.ai/resources/handle-token-limits-rate-limits-large-scale-llm-inference
- https://compute.hivenet.com/post/llm-rate-limiting-quotas
