Backpressure Patterns for LLM Pipelines: Why Exponential Backoff Isn't Enough
During peak usage, some LLM providers experience failure rates exceeding 20%. When your system hits that wall and responds by doubling its wait time and retrying, you are solving the wrong problem. Exponential backoff handles a single call's resilience. It does nothing for the system as a whole — nothing for wasted tokens, nothing for connection pool exhaustion, nothing for the 50 other requests queued behind the one that just got a 429.
The traffic patterns hitting LLM APIs have also changed fundamentally. Simple sub-100-token queries dropped from 80% to roughly 20% of traffic between 2023 and 2025, while requests over 500 tokens became the consistent majority. Agentic workflows chain 10–20 sequential calls in rapid bursts, generating traffic patterns that look indistinguishable from a DDoS attack under traditional request-per-minute rate limits. The infrastructure built for REST APIs with predictable payloads is not the infrastructure you need for LLM pipelines.
This post covers the production patterns that actually work: token bucket queuing, priority lane routing, circuit breakers with token-budget awareness, and proactive load shedding. Each layer addresses a failure mode that the previous layer cannot.
Why Exponential Backoff Fails at the System Level
Exponential backoff is reactive and stateless. It fires only after a 429 arrives, doubles the wait time (base × 2^n, capped at ~60 seconds), adds jitter to prevent synchronized retries from multiple clients, and tries again. In isolated testing, this works fine. In production with multiple callers sharing a rate limit, it creates a thundering herd problem.
Here is what actually happens: a traffic spike triggers 429s across your worker pool. All workers back off. The rate limit window resets. All workers — having received the same 429 at roughly the same time — resume simultaneously. The resulting burst reproduces the original overload. You have converted a traffic spike into a sustained oscillation.
The deeper problem is that exponential backoff operates at the wrong level of abstraction. It measures requests, but providers rate-limit on tokens. A 50-token prompt and a 10,000-token prompt count as one request each under RPM limits but have radically different resource consumption profiles. A system that treats them identically will hit token-per-minute limits long before request-per-minute limits, and no amount of backoff tuning will fix that mismatch.
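For reference, the per-call mechanics described above amount to a capped, jittered delay. A minimal sketch (the function name and defaults are illustrative, not any particular SDK's API):

```python
import random

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Capped exponential backoff with full jitter.

    base * 2^attempt, capped at `cap` seconds, then sampled uniformly
    so that clients which received the same 429 do not retry in lockstep.
    """
    exp = min(cap, base * (2 ** attempt))
    return random.uniform(0, exp)
```

Full jitter (sampling the whole interval rather than adding a small offset) is the variant that spreads synchronized clients most effectively — but as the rest of this post argues, no amount of jitter fixes the request-vs-token mismatch.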
Token Bucket Queuing: Matching Your Rate to the Provider's Reality
The token bucket algorithm maintains a virtual balance of capacity. Tokens are added at a fixed refill rate up to a maximum. Each outbound request consumes tokens proportional to its estimated token count. When the bucket is empty, requests wait rather than fire and fail.
The key insight is to run this logic locally, in front of the provider — not reactively in response to 429s. A typical configuration looks like:
- Maximum bucket size: 10,000 tokens (absorbs legitimate bursts)
- Refill rate: 1,000 tokens per second (matches your provider TPM tier)
- Consumption: estimated prompt tokens + max_tokens at queue entry
This is already how providers enforce their own limits — using continuous refill up to a maximum rather than hard resets at fixed intervals. Running a synchronized bucket locally means your system stays aligned with the provider's window. You also need to parse the rate limit headers on every response (x-ratelimit-remaining-tokens, x-ratelimit-reset-tokens, retry-after) to keep your local bucket synchronized with actual provider state rather than relying on estimates.
Token bucket differs from leaky bucket in an important way: it accommodates legitimate bursts and then enforces steady-state, whereas leaky bucket smooths all traffic to a constant output rate. For interactive LLM traffic, which is inherently bursty, token bucket is generally the right default. For background batch inference where the priority is not overloading the provider, leaky bucket is the correct choice.
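A minimal local token bucket along these lines might look like the following sketch. The class and method names are illustrative; the `sync` method assumes you parse the provider's `x-ratelimit-remaining-tokens` header as described above:

```python
import time

class TokenBucket:
    """Local token bucket paced to a provider TPM tier (continuous refill, no window resets)."""

    def __init__(self, capacity: int = 10_000, refill_rate: float = 1_000.0):
        self.capacity = capacity        # maximum burst size, in tokens
        self.refill_rate = refill_rate  # tokens added per second
        self.tokens = float(capacity)
        self.last = time.monotonic()

    def _refill(self) -> None:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.refill_rate)
        self.last = now

    def acquire(self, cost: int) -> float:
        """Reserve `cost` tokens (estimated prompt tokens + max_tokens).

        Returns seconds to wait before sending; 0.0 means send immediately.
        """
        self._refill()
        if self.tokens >= cost:
            self.tokens -= cost
            return 0.0
        deficit = cost - self.tokens
        self.tokens -= cost             # go negative: the request is committed
        return deficit / self.refill_rate

    def sync(self, remaining_tokens: int) -> None:
        """Clamp the local estimate to the provider's reported remaining tokens."""
        self._refill()
        self.tokens = min(self.tokens, float(remaining_tokens))
```

Callers sleep for the returned duration before dispatching, which converts 429 failures into short, predictable queueing delays.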
Priority Lanes: Not All Requests Are Equal
Once you have a queue, you have to decide what order requests leave it. The default is FIFO, and FIFO is a disaster for mixed-workload LLM systems.
A straightforward three-tier model covers most production cases:
- Interactive (P0): User-facing chat, real-time completions. Requires sub-5-second response. Direct churn risk if delayed.
- Non-interactive API (P1): Automated code reviews, webhook-triggered summaries. Tolerates seconds to minutes.
- Batch/Scheduled (P2+): Document indexing, eval runs, bulk classification. Can run over hours or overnight.
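The three tiers above can be modeled with a heap keyed on (lane, arrival order), which gives strict priority across lanes and FIFO within each lane. A sketch with illustrative names:

```python
import heapq
import itertools

# Priority lanes: lower number drains first. Tier names are illustrative.
INTERACTIVE, NON_INTERACTIVE, BATCH = 0, 1, 2

class PriorityLaneQueue:
    """Strict priority across lanes, FIFO within each lane."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker preserves arrival order

    def put(self, priority: int, request) -> None:
        heapq.heappush(self._heap, (priority, next(self._seq), request))

    def get(self):
        _, _, request = heapq.heappop(self._heap)
        return request

q = PriorityLaneQueue()
q.put(BATCH, "reindex-docs")
q.put(INTERACTIVE, "chat-turn")
q.put(NON_INTERACTIVE, "pr-review")
# Drains highest-priority first: chat-turn, then pr-review, then reindex-docs
```

Strict priority is the simplest policy; production schedulers often add aging so P2 work cannot starve indefinitely under sustained P0 load.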
Without explicit priority lanes, a burst of background P2 batch jobs clogs the queue and blocks every P0 interactive request. IBM Research's queue management research showed that purpose-built priority scheduling improves SLO attainment by 40–90% and throughput by 20–400% compared to FCFS baselines. That range is wide because the benefit scales with how much workload variance you have — but teams with any mix of interactive and batch traffic consistently see gains.
For self-hosted inference serving (vLLM and similar), priority scheduling now includes preemption: when a P0 interactive request arrives, the scheduler pauses a P2 batch request, preserves its KV cache, and resumes it when capacity is available. This prevents the latency penalty to interactive users without fully discarding in-flight work.
Two tactical points for production priority lanes:
First, measure P99 latency, not mean response time. If 1% of your users experience a 30-second delay, your system is unreliable regardless of average latency. The goal of priority lanes is to guarantee a P99 SLO for P0 traffic, not to maximize average throughput.
Second, use deadline propagation. Clients should pass deadline hints upstream — via a header or gRPC deadline — so servers can decline requests that have already expired before inference starts. A request waiting 60+ seconds in queue is almost certainly past its client timeout. Starting inference on it burns tokens and produces output that will never reach a user.
Circuit Breakers with Token-Budget Awareness
Circuit breakers are a standard resilience pattern: after a threshold of consecutive failures, stop sending requests to the failing endpoint (OPEN state), wait a cooldown period, then probe with a single request (HALF-OPEN), and return to normal operation (CLOSED) only if that probe succeeds. The performance difference is measurable — during simulated outages, circuit breakers reduce user-visible errors by 90%+ and eliminate thousands of wasted API calls to endpoints that are not responding.
The LLM-specific problem is that the standard triggers — HTTP error rate, consecutive 429 counts — miss three failure modes that cause damage before error rates climb:
Slow degradation. Provider responses slow from 2 seconds to 25 seconds before any 429s appear. Error rate stays low; latency climbs and token cost skyrockets. A circuit breaker watching only error rate stays closed while the system degrades.
Partial context failures. Providers sometimes return malformed or truncated completions on large-context requests before formally rate-limiting. The HTTP status code is 200. Standard circuit breakers do not see this.
Cost runaway. Agentic loops keep succeeding (200 OK) while burning token budget at an unsustainable rate. The GetOnStack incident — an undetected agent-to-agent infinite loop that ran for 11 days and drove costs to $47,000 — is the canonical example of a circuit breaker that watched the wrong metric.
A production LLM circuit breaker should monitor at least four signals:
- Token consumption rate against provider TPM limit — trip at 85% to leave headroom
- P95 latency — if P95 exceeds 3× baseline, open the circuit before errors accumulate
- Cost per hour — a dollar-denominated cap that catches runaway agents
- Consecutive 429 count — the traditional trigger, still necessary
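Combining the four signals into a single trip decision is straightforward: the circuit opens if any one of them crosses its threshold. A sketch, with the thresholds from the list above and illustrative field names:

```python
from dataclasses import dataclass

@dataclass
class BreakerSignals:
    tokens_per_min: float      # observed outbound token rate
    provider_tpm_limit: float  # provider tier limit
    p95_latency_s: float       # rolling P95 latency
    baseline_latency_s: float  # healthy-state P95
    cost_per_hour: float       # dollar-denominated spend rate
    consecutive_429s: int

def should_open(s: BreakerSignals, cost_cap_per_hour: float = 50.0,
                max_429s: int = 5) -> bool:
    """Open the circuit if ANY signal trips, not just the traditional 429 count."""
    return (
        s.tokens_per_min >= 0.85 * s.provider_tpm_limit  # 85% TPM headroom rule
        or s.p95_latency_s >= 3 * s.baseline_latency_s   # slow-degradation trigger
        or s.cost_per_hour >= cost_cap_per_hour          # runaway-agent trigger
        or s.consecutive_429s >= max_429s                # classic trigger
    )
```

The cost cap and 429 threshold here are placeholders; the point is the OR across heterogeneous signals, so that a latency or cost trip fires even while the error rate looks healthy.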
When the circuit is OPEN and you send a probe in HALF-OPEN state, use a lightweight canary: a simple "Respond with OK" call with max_tokens=5 on a fast model, with a 5-second timeout. Never use a real production prompt as a probe — it burns tokens, risks leaking data, and takes far longer to fail when the provider is still degraded.
Design your fallback chain to terminate at a resource you control:
Primary provider → Secondary provider → Local model (always-available)
The third tier is critical. If your fallback chain ends at another rate-limited cloud endpoint, you have not actually terminated the dependency — you have shifted it.
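Structurally, the chain is just an ordered list of callables where the last entry is local. A sketch with hypothetical providers (the stub functions stand in for real clients):

```python
def complete_with_fallback(prompt: str, providers: list) -> str:
    """Try each (name, call) pair in order. The last entry must be a local,
    non-rate-limited model so the chain terminates at a resource you control."""
    last_error = None
    for name, call in providers:
        try:
            return call(prompt)
        except Exception as e:  # in production, catch provider-specific errors
            last_error = e
            continue
    raise RuntimeError("fallback chain exhausted") from last_error

# Illustrative chain: two hypothetical degraded cloud providers, then a local stub.
def flaky(prompt):
    raise TimeoutError("provider degraded")

def local(prompt):
    return f"[local-model] {prompt[:20]}"

chain = [("primary", flaky), ("secondary", flaky), ("local", local)]
```

In real deployments each tier would also be wrapped by its own circuit breaker, so a known-OPEN provider is skipped without paying a timeout.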
Proactive Load Shedding: Reject Before You Fail
The previous patterns are flow control mechanisms — they regulate how fast requests enter and exit the system. Load shedding is different: it is the decision to not serve certain requests at all, rather than queue them indefinitely.
The core insight from distributed systems is that under overload, accepting every request and then failing most of them is strictly worse than accepting fewer requests and completing all of them. The former wastes compute, burns tokens, exhausts connection pools, and extends the latency for everything in the queue. The latter maintains a reliable service for the requests you do accept.
Queue-depth shedding is the simplest implementation: set a maximum pending queue depth (and an associated maximum wait time). Once either threshold is exceeded, new arrivals get an immediate 503 rather than a queue slot. This rejection is cheapest at the API gateway layer — the request never reaches an inference worker, no tokens are consumed, and the client gets a fast definitive signal to try again later rather than waiting in an overloaded queue.
For agentic pipelines, extend shedding to cumulative token budget within a session. Hard-stop a pipeline when it exceeds a token budget threshold within a single task or conversation. Shopify measured that tool call outputs consume roughly 100× more tokens than user messages — an agent in a tool-calling loop can exhaust a monthly budget in hours. A session-level token budget cap is the structural safeguard that a circuit breaker watching request counts will not catch.
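The session-level cap can be as simple as a counter that every call in the pipeline charges against. The 50,000-token default here is an arbitrary illustration; tune it per workload:

```python
class SessionTokenBudget:
    """Hard-stop an agentic session once cumulative token spend crosses a cap.
    The default cap is illustrative, not a recommendation."""

    def __init__(self, cap_tokens: int = 50_000):
        self.cap = cap_tokens
        self.spent = 0

    def charge(self, tokens: int) -> None:
        """Record spend for one call; raise to halt the pipeline if over budget."""
        self.spent += tokens
        if self.spent > self.cap:
            raise RuntimeError(
                f"session token budget exhausted ({self.spent}/{self.cap}); halting pipeline"
            )
```

Because the counter is cumulative across the whole session, it catches the tool-calling loop that a per-request or per-minute limit never sees.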
One practical detail: under high load, prefer shedding the newest arrivals, not the oldest ones. Requests that have been waiting longest have likely already exceeded their client timeout budget. Keeping them in the queue while shedding new arrivals wastes capacity on requests that will fail to deliver even if processed.
Putting It Together
These four layers address different failure modes and complement rather than replace each other:
- Token bucket queuing prevents 429s from occurring in the first place by pacing outbound traffic to match provider capacity
- Priority lanes ensure that when capacity is constrained, it flows to the most valuable work first
- Circuit breakers detect provider degradation quickly and reroute traffic before the failure propagates through the system
- Load shedding provides the last line of defense by converting queue overflow into fast, explicit rejections rather than slow cascading failures
The pattern that ties them together is asynchronous decoupling:
API Gateway → Message Queue → Worker Pool (token bucket) → LLM Provider
Workers pull from the queue at a rate governed by the token bucket. The queue enforces priority ordering. The circuit breaker sits between the worker pool and the provider. The gateway enforces queue-depth limits and sheds excess load before it ever enters the queue. Each layer can fail independently without bringing down the others.
The operational cost of this architecture is real — it is more complex to operate than a simple retry loop. But it is the architecture that holds under the traffic patterns LLM applications actually generate: bursty interactive traffic, background batch jobs, agentic workflows with unpredictable token consumption, and providers that occasionally degrade before they fail. Exponential backoff handles the easy case. The production case requires the full stack.
- https://dasroot.net/posts/2026/02/rate-limiting-backpressure-llm-apis/
- https://agentgateway.dev/blog/2025-11-02-rate-limit-quota-llm/
- https://markaicode.com/llm-api-rate-limiting-load-balancing-guide/
- https://markaicode.com/circuit-breaker-resilient-ai-systems/
- https://www.getmaxim.ai/articles/retries-fallbacks-and-circuit-breakers-in-llm-apps-a-production-guide/
- https://portkey.ai/blog/report/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://zuplo.com/learning-center/token-based-rate-limiting-ai-agents
- https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway
- https://stochasticsandbox.com/posts/api-rate-limits-compared-2026-03-22/
- https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms
- https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
- https://dl.acm.org/doi/10.1145/3698038.3698523
- https://arxiv.org/html/2512.12928v1
- https://portkey.ai/blog/tackling-rate-limiting-for-llm-apps/
