
Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

11 min read
Tian Pan
Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.

This post walks through how to apply standard queue-theory tools — Little's Law, bulkheads, admission control, and token-bucket backpressure — specifically to the shape of LLM workloads, and why the naive retry logic that ships in most SDK examples is closer to a denial-of-service tool than a resilience pattern.

Your Provider Is a Variable-Capacity Downstream

Start with what the provider actually exposes. OpenAI, Anthropic, and the major hosted providers all enforce at least two limits in parallel: requests per minute (RPM) and tokens per minute (TPM). Anthropic adds a third dimension by splitting input and output tokens, so a Claude integration can hit one of three different ceilings depending on whether it's prompt-heavy, generation-heavy, or chat-style balanced. None of these limits are guaranteed; they're sliding-window soft caps that move with capacity, your account standing, and the provider's overall load.

This is already different from a database connection pool in two important ways. First, the binding constraint changes between requests — a long-context summarization request might hit TPM while a high-frequency classification workload hits RPM, even if the dollar cost is identical. Second, the actual capacity is partially observable: you only learn you're over the limit after the provider returns a 429, which means your local view of "how full is the pipe" is always stale and reactive. You cannot poll a "current quota usage" gauge the way you'd query a connection pool.
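
Because the binding constraint flips between RPM and TPM from one request to the next, it helps to track both ceilings client-side rather than inferring them from 429s after the fact. Below is a minimal sketch of that tracking as a pair of token buckets, one denominated in requests and one in tokens; the class names, the refund step, and the up-front token estimate are illustrative choices, not any provider SDK's API.

```python
import time


class TokenBucket:
    """Continuously refilling bucket: at most `capacity` units, refilled at `rate_per_min` per minute."""

    def __init__(self, capacity: float, rate_per_min: float):
        self.capacity = capacity
        self.rate_per_sec = rate_per_min / 60.0
        self.level = capacity
        self.last = time.monotonic()

    def try_take(self, amount: float) -> bool:
        now = time.monotonic()
        self.level = min(self.capacity, self.level + (now - self.last) * self.rate_per_sec)
        self.last = now
        if self.level >= amount:
            self.level -= amount
            return True
        return False

    def put_back(self, amount: float) -> None:
        self.level = min(self.capacity, self.level + amount)


class ProviderLimiter:
    """Local view of both ceilings; a request has to clear both to proceed."""

    def __init__(self, rpm: float, tpm: float):
        self.requests = TokenBucket(capacity=rpm, rate_per_min=rpm)
        self.tokens = TokenBucket(capacity=tpm, rate_per_min=tpm)

    def admit(self, estimated_tokens: int) -> bool:
        # A long-context request can be TPM-bound while the request bucket is
        # nearly full; a chatty classification workload can be the reverse.
        if not self.tokens.try_take(estimated_tokens):
            return False
        if not self.requests.try_take(1):
            self.tokens.put_back(estimated_tokens)  # refund if RPM turns out to be the binding limit
            return False
        return True
```

Because the provider's real window is sliding and only observable through 429s, this local view is a conservative estimate, not a source of truth; its job is to stop most over-limit requests before they ever leave your process.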

Add to this the variable-token-length problem. Recent queue-theory work on LLM inference using M/G/1 models showed that under heavy-tailed output distributions, a small fraction of unusually long responses can dominate queueing delay for everyone else. A single 8K-token generation behind your customer's request blocks an arbitrary number of small requests behind it. Your p99 isn't shaped by your average load; it's shaped by the longest output your traffic happens to produce in the window.
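
The standard M/G/1 result makes the mechanism explicit. For arrival rate λ, service time S, and utilization ρ = λ·E[S], the Pollaczek–Khinchine formula gives the mean queueing delay as Wq = λ·E[S²] / (2(1 − ρ)). The numerator depends on the second moment of the service time, not its mean, and a heavy tail on output length inflates E[S²] far faster than it inflates E[S]; that is the precise sense in which a few very long generations set the delay for every small request sharing the queue.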

The implication is that the standard "treat the downstream as a fixed pipe and retry on failure" mental model doesn't survive contact with reality. You need to size, schedule, and shed load against a moving target.

Little's Law for Token-Based Services

Little's Law is the cheapest piece of math in distributed systems, and it applies directly here. The long-run average number of requests in flight in your system equals the arrival rate multiplied by the average time each request spends in the system: L = λW.

For an LLM call path, plug in real numbers. If your agent endpoint receives 50 requests per second and the average end-to-end latency (queueing + provider call + post-processing) is 4 seconds, you have 200 requests in flight on average. That's the number you need to size your worker pool, your connection pool, your in-memory buffers, and your provider concurrency limits against. If your worker pool can hold 100 concurrent in-flight requests, you're already in queueing territory at steady state — and steady-state queueing is not steady: it's a slow accumulation that turns sharp the moment latency wobbles.

The more useful application is in the other direction: Little's Law tells you what happens when latency degrades. If the provider's p50 doubles from 2s to 4s — a normal occurrence during a model rollout or regional load shift — your in-flight count doubles for the same arrival rate. Anything in your system that has a fixed concurrency cap (HTTP server worker count, async runtime task slots, provider client pool size) is now operating at a different load factor than it was an hour ago, and the slack you thought you had has silently evaporated.

Two practical rules fall out of this:

  • Size every concurrency limit at 2x your steady-state Little's Law number, not 1.1x. The variance on LLM latency is large enough that running close to the line means a single bad provider minute breaks you.
  • Track the in-flight number as a first-class metric. Most teams instrument arrival rate and latency separately and never compute the product. Wire λ × W into a single dashboard panel and put alerts on its second derivative — that's the leading indicator for when your bulkheads are about to fail. A sketch of that wiring follows this list.
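
A minimal sketch of that wiring, assuming you can observe request completions and their latencies in-process; the class name, the 60-second window, and the choice to approximate λ from completions (which matches arrivals at steady state) are illustrative:

```python
import time
from collections import deque


class LittlesLawGauge:
    """Derives the in-flight estimate L = lambda * W from a rolling window of completions."""

    def __init__(self, window_sec: float = 60.0):
        self.window_sec = window_sec
        self.samples = deque()  # (completion_time, latency_sec)

    def record(self, latency_sec: float) -> None:
        """Call once per completed request with its end-to-end latency."""
        now = time.monotonic()
        self.samples.append((now, latency_sec))
        self._evict(now)

    def in_flight_estimate(self) -> float:
        now = time.monotonic()
        self._evict(now)
        if not self.samples:
            return 0.0
        arrival_rate = len(self.samples) / self.window_sec                      # lambda (req/s)
        mean_latency = sum(lat for _, lat in self.samples) / len(self.samples)  # W (s)
        return arrival_rate * mean_latency                                      # L = lambda * W

    def _evict(self, now: float) -> None:
        while self.samples and self.samples[0][0] < now - self.window_sec:
            self.samples.popleft()
```

The same number feeds the first rule: size hard concurrency caps at roughly 2x the steady-state estimate this gauge reports, and alert when the estimate starts accelerating rather than when it crosses a fixed line.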

The Admission-Control Layer

Once you accept that your downstream is variable and your in-flight count moves with it, the next question is what to do at the front door. Naive systems accept every request, queue it internally, and let pressure propagate as latency. This works until it doesn't, and "doesn't" arrives suddenly because user-facing latency hides queue growth until the queue is already pathological.

Admission control flips this around. Before a request enters your processing pipeline, you decide whether the system can absorb it. If not, you reject — explicitly and immediately, with a clear error, ideally with a Retry-After hint. The rejected request becomes the caller's problem, which sounds harsh but is actually the only stable equilibrium. A system that accepts requests it can't process eventually fails everything; a system that rejects requests it can't process fails some things and stays up.

The admission decision should be driven by your local view of in-flight work, not by the provider's response. By the time you see a 429, you've already paid the round-trip latency, you've consumed a request slot, and you've added one more failed call to the provider's view of your account. Your admission layer should reject requests before they're allowed to attempt the call.

A reasonable admission pipeline for an LLM endpoint looks roughly like the sketch below. It assumes an asyncio-based Python service and reuses the limiter and gauge sketched earlier; the names, the Retry-After values, and the provider call are illustrative rather than any particular SDK's API.

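```python
import asyncio
import time


class AdmissionError(Exception):
    """Raised instead of queueing: the caller gets an explicit rejection with a hint."""

    def __init__(self, retry_after_sec: float):
        super().__init__(f"over capacity, retry after {retry_after_sec:.1f}s")
        self.retry_after_sec = retry_after_sec


class AdmissionController:
    def __init__(self, limiter, gauge, max_in_flight: int):
        self.limiter = limiter                          # local RPM/TPM view (ProviderLimiter above)
        self.gauge = gauge                              # lambda * W tracker (LittlesLawGauge above)
        self.slots = asyncio.Semaphore(max_in_flight)   # hard in-flight cap, sized via Little's Law

    async def handle(self, prompt: str, estimated_tokens: int, call_provider):
        # 1. Check the in-flight cap without waiting on it: waiting here would just
        #    hide pressure as latency, which is the failure mode we're trying to avoid.
        if self.slots.locked():
            raise AdmissionError(retry_after_sec=1.0)
        # 2. Check the local rate/token view. Rejecting here is free; rejecting after
        #    a provider 429 costs a round trip and another failed call on the account.
        if not self.limiter.admit(estimated_tokens):
            raise AdmissionError(retry_after_sec=5.0)
        # 3. Only an admitted request enters the pipeline and touches the provider.
        #    There is no await between the checks above and this acquire, so the
        #    slot we just observed as free is still free when we take it.
        async with self.slots:
            start = time.monotonic()
            try:
                return await call_provider(prompt)
            finally:
                self.gauge.record(time.monotonic() - start)
```

The property that matters is the ordering: every check that can reject a request runs before the provider ever sees it, so a rejection costs microseconds locally instead of a request slot, a round trip, and another 429 against your account. Map AdmissionError to an HTTP 429 with a Retry-After header at your edge, and the backpressure propagates to callers instead of accumulating inside your queue.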