LLM Queuing Theory: Why Your Load Balancer Thinks in Requests While Your GPU Thinks in Tokens
Your load balancer distributes requests evenly across your GPU fleet. Each instance gets roughly the same number of concurrent requests. Everything looks balanced. Yet one instance is crawling at 40 tokens per second while another hums along at 200. The dashboard shows equal request counts, but your users are experiencing wildly different latencies.
The problem is fundamental: traditional load balancing operates at the request level, but LLM inference costs scale with tokens. A single request asking for a 4,000-token essay consumes 50x more GPU time than a request generating an 80-token classification. Treating them as equivalent units is like a highway toll booth counting vehicles without distinguishing motorcycles from 18-wheelers.
This mismatch between request-level thinking and token-level reality is where classical queuing theory meets its most interesting modern challenge.
Little's Law Doesn't Care About Your Tokens (Until It Does)
Little's Law — L = λW, where average queue length equals arrival rate times average wait time — is the bedrock of queuing theory. It holds for any stable system regardless of arrival distribution or service discipline. But applying it to LLM inference requires redefining what you're actually measuring.
In a traditional web service, a "unit of work" is a request. The service time is roughly predictable: a database query takes 5-50ms, an API call takes 100-500ms. You can model capacity as requests per second and plan accordingly.
LLM inference breaks this assumption in three ways:
- Bimodal processing: Each request has a prefill phase (processing the input prompt, parallelizable) and a decode phase (generating tokens sequentially, one per forward pass). These have fundamentally different computational profiles.
- Variable output length: You don't know the service time when a request arrives. A request might generate 10 tokens or 4,000. The variance in service time can span two orders of magnitude.
- Memory-bound scaling: Each active request holds a key-value (KV) cache that grows with every generated token. GPU memory, not compute, often becomes the binding constraint.
The practical implication: you need to apply Little's Law at the token level, not the request level. Your system's throughput capacity is measured in tokens per second, and the "queue" you need to manage is the total token workload — input tokens waiting for prefill plus output tokens being generated across all active sequences.
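To make the token-level framing concrete, here is a minimal sketch of Little's Law with tokens as the unit of work (the function name and parameters are illustrative, not from any library):

```python
def avg_token_backlog(request_rate: float, avg_tokens_per_request: float,
                      avg_residence_s: float) -> float:
    """Little's Law (L = lambda * W) applied at the token level.

    request_rate: requests arriving per second
    avg_tokens_per_request: input + output tokens per request
    avg_residence_s: average time a token spends in the system
    """
    token_arrival_rate = request_rate * avg_tokens_per_request  # tokens/sec
    return token_arrival_rate * avg_residence_s  # average tokens resident
```

At 10 requests/sec averaging 1,500 tokens each with a 2-second residence time, the system holds roughly 30,000 tokens of in-flight work, and that backlog, not the request count, is what the KV cache has to accommodate.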
When researchers modeled LLM inference as a discrete-time queuing system where each time slot corresponds to one GPU forward pass, the stability condition becomes:
λ(m_prefill + m_decode) < B / t_step
Where λ is request arrival rate, m_prefill and m_decode are average token counts, B is the per-step token budget, and t_step is the time per forward pass. Cross this threshold and your queue grows without bound — regardless of how clever your scheduler is.
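A quick way to sanity-check provisioning against this stability condition (function names are illustrative; plug in your own measured values):

```python
def is_stable(arrival_rate: float, avg_prefill_tokens: float,
              avg_decode_tokens: float, token_budget: int,
              step_time_s: float) -> bool:
    """Check lambda * (m_prefill + m_decode) < B / t_step.

    arrival_rate: requests per second (lambda)
    token_budget: tokens processed per forward pass (B)
    step_time_s: seconds per forward pass (t_step)
    """
    offered_load = arrival_rate * (avg_prefill_tokens + avg_decode_tokens)
    capacity = token_budget / step_time_s  # tokens per second
    return offered_load < capacity

def max_stable_arrival_rate(avg_prefill_tokens: float, avg_decode_tokens: float,
                            token_budget: int, step_time_s: float) -> float:
    """Largest request arrival rate the system can sustain without unbounded queues."""
    return (token_budget / step_time_s) / (avg_prefill_tokens + avg_decode_tokens)
```

With a 512-token budget, a 50ms step, and 800 average tokens per request, the ceiling is 12.8 requests/sec; anything above that grows the queue no matter how the scheduler orders work.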
Why Request-Level Load Balancing Fails
Consider a GPU instance with a token budget of 512 tokens per forward pass. Here are two scenarios with identical request counts:
Scenario A: 10 concurrent requests, each generating ~50 tokens. Total active decode tokens per step: ~10. Prefill is fast, decode steps are light. The GPU is underutilized.
Scenario B: 10 concurrent requests, each generating ~2,000 tokens. KV cache for all sequences: massive. The GPU runs out of memory at 6 concurrent sequences, forcing 4 requests into the waiting queue. Effective throughput craters.
A request-level load balancer sees "10 requests" in both cases and calls it balanced. A token-aware system sees a 40x difference in actual GPU workload.
This is why the N+1 query problem has an analog in LLM serving: the load balancer makes N routing decisions without knowing the actual cost of each decision. The information it needs — output token count — doesn't exist yet when the routing decision is made.
Practical workarounds include:
- Prompt-length-weighted routing: Use input token count as a proxy for total cost. Longer prompts correlate with longer outputs, though imperfectly.
- Active-token-count routing: Route to the instance with the fewest total tokens in flight (prefill + decode), not the fewest requests.
- KV-cache-aware routing: Route based on available GPU memory rather than request count. Some systems like NVIDIA Dynamo implement this by exposing memory utilization as a routing signal.
None of these fully solve the prediction problem, but they reduce the variance from 100x to roughly 3-5x — enough to keep tail latencies manageable.
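An active-token-count router can be sketched in a few lines. The `Instance` fields are assumptions about what your fleet telemetry exposes; real systems would refresh these counts from engine metrics rather than tracking them locally:

```python
from dataclasses import dataclass

@dataclass
class Instance:
    name: str
    prefill_tokens: int = 0   # input tokens queued for or in prefill
    decode_tokens: int = 0    # tokens across active decoding sequences

    @property
    def tokens_in_flight(self) -> int:
        return self.prefill_tokens + self.decode_tokens

def route(instances: list, prompt_tokens: int) -> Instance:
    """Send the request to the instance with the fewest tokens in flight,
    then charge it the new prompt so the estimate stays current."""
    target = min(instances, key=lambda inst: inst.tokens_in_flight)
    target.prefill_tokens += prompt_tokens
    return target
```

Note that the router charges only the known input tokens at decision time; the output tokens, the part it cannot predict, show up later in `decode_tokens` as the engine reports them.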
The Scheduling Discipline That Actually Matters
Classical queuing theory offers a menu of scheduling disciplines: FIFO, shortest-job-first (SJF), priority queuing, fair queuing. For LLM inference, the choice that matters most isn't which request to serve next — it's how to fill each GPU iteration with tokens.
Recent research has formalized this as the "work-conserving" property: a scheduler is work-conserving if it fills each iteration's token budget to capacity whenever sufficient tokens are available. The key insight is that mixing prefill and decode tokens in the same batch is essential for throughput optimality.
Here's why. In a decode-only batch, you might have 8 active sequences each contributing 1 token per step = 8 tokens processed per forward pass against a budget of 512. That's 1.5% utilization. A work-conserving scheduler would pack prefill tokens from waiting requests into the remaining 504 token slots, dramatically increasing GPU utilization per step.
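A minimal sketch of that packing step, assuming a fixed per-iteration token budget and chunked prefill (the function and its return shape are illustrative):

```python
def pack_iteration(active_decode_seqs: int, waiting_prefills: list,
                   token_budget: int):
    """Work-conserving packing of one GPU iteration.

    Each active decode sequence needs exactly 1 token slot; leftover
    budget is filled with chunks taken from waiting prefill requests.
    Returns (tokens_scheduled, per-request prefill chunk sizes).
    """
    remaining = token_budget - active_decode_seqs  # slots left after decode
    chunks = []
    for pending in waiting_prefills:
        chunk = min(pending, max(remaining, 0))  # chunked prefill
        chunks.append(chunk)
        remaining -= chunk
    tokens_scheduled = token_budget - max(remaining, 0)
    return tokens_scheduled, chunks
```

With 8 decode sequences and two waiting prompts of 300 and 400 tokens against a 512-token budget, the packer schedules all 512 slots (8 decode, 300 prefill, and a 204-token chunk of the second prompt) instead of the 8-token decode-only batch.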
The practical validation is striking:
- Sarathi-Serve and Orca: Proven throughput-optimal. Both mix prefill and decode tokens in the same batch using chunked prefill.
- FasterTransformer: Not throughput-optimal. Separates prefill and decode into distinct batches, leaving GPU cycles stranded.
- Vanilla vLLM (pre-chunked-prefill): Not throughput-optimal in its original form. Prefill-prioritized scheduling without mixing could starve decode tokens under certain arrival patterns.
The lesson: if your serving infrastructure separates prefill and decode into distinct phases that can't share a batch, you are leaving 30-70% of your GPU throughput on the table. Continuous batching with chunked prefill isn't an optimization — it's a correctness requirement for stable serving under load.
Priority Queuing: The Three-Tier Pattern
Not all inference requests deserve equal treatment. The standard pattern emerging in production systems uses three priority tiers:
Tier 1 — Interactive (latency-sensitive): Chat responses, real-time completions, streaming UI. Target: time-to-first-token under 500ms. These requests should preempt lower-priority work.
Tier 2 — Standard (balanced): API calls with reasonable SLAs, background feature generation, search augmentation. Target: end-to-end completion under 10 seconds. Can tolerate brief queuing.
Tier 3 — Batch (throughput-optimized): Bulk classification, dataset annotation, offline summarization. Target: maximize tokens per dollar. Can wait minutes or hours.
The implementation challenge is preemption. When a Tier 1 request arrives and the GPU is fully committed to Tier 3 work, you need to evict lower-priority sequences. This means saving their KV cache state (either to CPU memory or discarding it for later recomputation) and immediately starting the high-priority prefill.
vLLM 0.9+ supports per-request priority values, where higher-priority requests can preempt lower-priority ones in the active batch. But the scheduling system alone isn't sufficient — you also need an external admission controller that:
- Assigns priorities based on user context, not just request metadata
- Implements per-tier rate limiting to prevent priority inflation
- Demotes repeated requests from the same user (first request: priority 0, subsequent requests: 1, 2, 3, ...) so no single user can dominate the queue
The critical design insight: keep the backend queue short. Target fewer than 3 pending requests in the inference engine's internal queue. Your admission controller should hold requests in an upstream queue where you have full control over ordering, rather than pushing them into the backend where reordering is difficult or impossible.
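The per-user demotion scheme can be sketched as an upstream priority queue. This is a hypothetical standalone helper, not vLLM's scheduler; it follows the lower-number-first convention used above:

```python
import heapq
from itertools import count

class UpstreamQueue:
    """Upstream priority queue held in front of the inference engine.

    Lower priority numbers are released first. Each repeated request
    from the same user gets the next higher number, so one chatty user
    cannot starve everyone else.
    """
    def __init__(self):
        self._heap = []
        self._seq = count()   # tiebreaker: FIFO within a priority level
        self._per_user = {}   # user_id -> requests seen so far

    def submit(self, user_id: str, request) -> None:
        n = self._per_user.get(user_id, 0)
        self._per_user[user_id] = n + 1
        heapq.heappush(self._heap, (n, next(self._seq), request))

    def release(self):
        """Pop the next request to forward to the backend, or None."""
        return heapq.heappop(self._heap)[2] if self._heap else None
```

Alice's second request sorts behind Bob's first, even though it arrived earlier, which is exactly the anti-domination behavior the tiering is meant to produce.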
Admission Control: The Valve Your System Is Missing
Most production LLM deployments have a load balancer and an inference engine. What they're missing is the admission controller — the component that sits between them and answers: "Should this request enter the system right now, or should it wait?"
Without admission control, you get a failure mode that's invisible until it's catastrophic. As load increases, the inference engine accepts all incoming requests, KV cache memory fills up, the system starts evicting and recomputing cached states, throughput drops, latencies spike, and the system enters a death spiral where it's doing more cache management than actual inference.
Effective admission control for LLM inference monitors two signals:
Token generation speed: If the backend is producing fewer than 7 tokens per second per sequence (roughly 150ms per token), stop admitting new requests. This is the single most reliable indicator that the system is overloaded.
Backend queue depth: Fetch this from the inference engine's metrics endpoint (e.g., vLLM's Prometheus /metrics). When it exceeds your target (typically 2-3 requests), hold new arrivals in the admission controller's queue.
The feedback loop works like this:
- Request arrives at the admission controller
- Check backend queue depth and token generation speed
- If both are within thresholds, forward to the inference engine
- If either exceeds thresholds, hold in the upstream queue
- Continuously poll backend metrics and release held requests when capacity frees up
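The loop above might look like this in skeleton form, with the thresholds taken from the text. Fetching the actual metrics from the engine's endpoint is left out; the names here are illustrative:

```python
MAX_BACKEND_QUEUE = 3     # hold upstream beyond this many pending requests
MIN_TOKENS_PER_SEC = 7.0  # per-sequence generation-speed floor

def should_admit(backend_queue_depth: int, tokens_per_sec_per_seq: float) -> bool:
    """Forward to the engine only when both health signals are within bounds."""
    return (backend_queue_depth < MAX_BACKEND_QUEUE
            and tokens_per_sec_per_seq >= MIN_TOKENS_PER_SEC)

def admission_step(held: list, backend_queue_depth: int,
                   tokens_per_sec_per_seq: float) -> list:
    """One polling iteration: release held requests while capacity allows,
    counting each release against the observed backend queue depth."""
    released = []
    while held and should_admit(backend_queue_depth + len(released),
                                tokens_per_sec_per_seq):
        released.append(held.pop(0))
    return released
```

Counting in-flight releases against the queue depth matters: between metric polls, the controller would otherwise dump its entire held queue into a backend that only had room for one or two requests.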
For multi-tenant systems, add per-tenant fair queuing at the admission controller level. Round-robin across tenant queues ensures no single tenant can monopolize inference capacity, even if they're sending requests at 10x the rate of others.
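A sketch of that round-robin fair queue, assuming one FIFO per tenant (class and method names are illustrative):

```python
from collections import OrderedDict, deque

class FairQueue:
    """Round-robin across per-tenant queues so no tenant monopolizes capacity."""
    def __init__(self):
        self._queues = OrderedDict()  # tenant_id -> deque of requests

    def enqueue(self, tenant_id, request) -> None:
        self._queues.setdefault(tenant_id, deque()).append(request)

    def dequeue(self):
        """Serve the next tenant in rotation, skipping tenants with
        empty queues; the served tenant rotates to the back."""
        for tenant_id in list(self._queues):
            q = self._queues[tenant_id]
            if q:
                request = q.popleft()
                self._queues.move_to_end(tenant_id)
                return request
        return None
```

If tenant A submits three requests and tenant B one, the release order interleaves them (A, B, A, A) instead of letting A's burst run first.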
Capacity Planning: The Math That Prevents Surprises
Here's the capacity planning calculation that most teams skip:
Step 1 — Estimate daily token volume:
- Count requests per day × average tokens per request (input + output)
- Example: 100K requests/day × 1,500 avg tokens = 150M tokens/day
Step 2 — Convert to GPU-seconds:
- Measure your system's sustained tokens/second/GPU under realistic load (not benchmark conditions)
- Example: 200 tokens/sec/GPU → 150M / 200 = 750,000 GPU-seconds = ~208 GPU-hours/day
Step 3 — Apply the burstiness multiplier:
- Traffic is never uniform. Peak hours typically see 3-5x average load.
- If you provision for average load, you'll queue (or drop) requests during peaks.
- Multiply by your peak-to-average ratio: 208 × 4 = 832 GPU-hours of peak capacity needed
Step 4 — Add the KV cache tax:
- Each concurrent sequence holds KV cache memory proportional to (num_layers × hidden_dim × sequence_length)
- For a 70B parameter model with 80 layers, each active sequence at 2K context consumes roughly 2.5 GB of KV cache
- If your GPU has 80GB, KV cache alone limits you to ~30 concurrent sequences — regardless of compute availability
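The four steps above can be run as one calculation. The parameter names are illustrative, and the KV numbers are where per-model variation hides: whether `num_kv_heads` equals the attention head count depends on MHA vs. GQA/MQA, and `bytes_per_value` depends on the cache dtype:

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int = 2) -> int:
    """Per-token KV cache: a K and a V vector per layer per KV head
    (fp16 default = 2 bytes per value)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value

def plan_capacity(requests_per_day: float, avg_tokens_per_request: float,
                  sustained_tps_per_gpu: float, peak_to_average: float,
                  kv_bytes_per_tok: int, avg_seq_len: int,
                  kv_mem_per_gpu_bytes: int) -> dict:
    """Steps 1-4 of the capacity plan as a single calculation."""
    # Step 1: daily token volume
    tokens_per_day = requests_per_day * avg_tokens_per_request
    # Step 2: convert to GPU-hours at sustained (not benchmark) throughput
    gpu_hours = tokens_per_day / sustained_tps_per_gpu / 3600
    # Step 3: provision for peaks, not averages
    peak_gpu_hours = gpu_hours * peak_to_average
    # Step 4: KV cache bound on concurrency, independent of compute
    max_concurrency = kv_mem_per_gpu_bytes // (kv_bytes_per_tok * avg_seq_len)
    return {"gpu_hours_avg": gpu_hours,
            "gpu_hours_peak": peak_gpu_hours,
            "max_concurrent_seqs": max_concurrency}
```

Feeding in the article's numbers (100K requests/day, 1,500 tokens, 200 tok/s/GPU, 4x peaks, ~2.5 GB of KV per 2K-context sequence) reproduces the ~208 average and ~833 peak GPU-hours, and a concurrency ceiling in the low thirties.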
The KV cache constraint is what makes LLM capacity planning fundamentally different from CPU-bound services. You can't just "add more compute." Memory is the bottleneck, and it scales linearly with concurrency × sequence length.
What the Textbooks Don't Tell You
Classical queuing theory assumes you know (or can estimate) the service time distribution. LLM inference violates this: you genuinely don't know how many tokens a request will generate until it's done. The output length distribution depends on the prompt, the sampling parameters, the system prompt, and stochastic sampling (temperature > 0).
This means standard results like "SJF minimizes average wait time" are only partially applicable. You can approximate SJF by prioritizing requests with shorter input prompts (since shorter prompts weakly correlate with shorter outputs), but the prediction error is high enough that you'll frequently schedule a "short" job that turns into a 4,000-token generation.
The more practical approaches acknowledge this uncertainty:
- Speculative admission: Admit requests assuming average output length, but reserve capacity for the possibility of longer outputs. If KV cache pressure exceeds thresholds, preempt the longest-running sequences.
- Output-length hints: Allow API callers to specify max_tokens as a scheduling hint. Even if they set it higher than needed, it provides an upper bound for capacity planning.
- Empirical service time distributions: Build per-endpoint, per-model histograms of actual output lengths from production traffic. Use these distributions (not assumptions) for queuing model parameters.
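An empirical service-time model needs very little machinery; a hypothetical minimal version that records observed output lengths and answers quantile queries for planning:

```python
class OutputLengthModel:
    """Per-endpoint empirical distribution of observed output lengths,
    used in place of an assumed service-time distribution."""
    def __init__(self):
        self.samples = []

    def record(self, output_tokens: int) -> None:
        """Log the actual output length of a completed request."""
        self.samples.append(output_tokens)

    def quantile(self, q: float) -> int:
        """q-quantile of observed lengths, e.g. q=0.95 for capacity planning."""
        ordered = sorted(self.samples)
        return ordered[min(int(q * len(ordered)), len(ordered) - 1)]
```

The median and tail quantiles tell different stories for admission control: on a workload of nine 100-token classifications and one 4,000-token essay, the median says "short jobs" while the 95th percentile says "reserve KV headroom for essays."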
The teams that get this right treat their inference infrastructure like a packet-switched network, not a traditional web service. Tokens are packets. GPU iterations are time slots. KV cache is buffer memory. And just like in networking, the interesting problems aren't in the steady state — they're in what happens when your assumptions about traffic patterns stop holding.
Getting Started
If you're running LLM inference in production and haven't thought about queuing theory, here's the minimum viable improvement:
- Measure token throughput, not request throughput. Change your primary capacity metric from requests/second to tokens/second. This single change will reveal imbalances your current monitoring misses.
- Add an upstream queue. Don't push requests directly into your inference engine. Hold them in a queue you control, and release them based on backend health signals.
- Enable chunked prefill. If your serving framework supports it (vLLM, TensorRT-LLM), turn it on. The throughput improvement from mixing prefill and decode tokens is typically 30-50%.
- Route by tokens, not requests. Update your load balancer to weight routing decisions by estimated token count rather than request count.
These four changes won't make you a queuing theorist, but they'll prevent the most common failure mode: a system that looks healthy at the request level while silently degrading at the token level.
References
- https://arxiv.org/html/2504.07347v1
- https://pubsonline.informs.org/doi/10.1287/stsy.2025.0106
- https://huggingface.co/blog/tngtech/llm-performance-request-queueing
- https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
- https://www.ubicloud.com/blog/life-of-an-inference-request-vllm-v1
- https://arxiv.org/html/2407.05347v1
- https://dl.acm.org/doi/10.1145/3698038.3698523
- https://www.bentoml.com/blog/6-production-tested-optimization-strategies-for-high-performance-llm-inference
- https://www.truefoundry.com/blog/llm-load-balancing
