LLM Queuing Theory: Why Your Load Balancer Thinks in Requests While Your GPU Thinks in Tokens
Your load balancer distributes requests evenly across your GPU fleet. Each instance gets roughly the same number of concurrent requests. Everything looks balanced. Yet one instance is crawling at 40 tokens per second while another hums along at 200. The dashboard shows equal request counts, but your users are experiencing wildly different latencies.
The problem is fundamental: traditional load balancing operates at the request level, but LLM inference costs scale with tokens. A single request asking for a 4,000-token essay needs roughly 50x as many decode steps as a request generating an 80-token classification. Treating them as equivalent units is like a highway toll booth counting vehicles without distinguishing motorcycles from 18-wheelers.
This mismatch between request-level thinking and token-level reality is where classical queuing theory meets its most interesting modern challenge.
Little's Law Doesn't Care About Your Tokens (Until It Does)
Little's Law states that L = λW: the average number of items in a system equals the arrival rate times the average time each item spends in the system. It is the bedrock of queuing theory and holds for any stable system regardless of arrival distribution or service discipline. But applying it to LLM inference requires redefining what you're actually measuring.
In a traditional web service, a "unit of work" is a request. The service time is roughly predictable: a database query takes 5-50ms, an API call takes 100-500ms. You can model capacity as requests per second and plan accordingly.
LLM inference breaks this assumption in three ways:
- Bimodal processing: Each request has a prefill phase (processing the input prompt, parallelizable) and a decode phase (generating tokens sequentially, one per forward pass). These have fundamentally different computational profiles.
- Variable output length: You don't know the service time when a request arrives. A request might generate 10 tokens or 4,000. The variance in service time can span two orders of magnitude.
- Memory-bound scaling: Each active request holds a key-value (KV) cache that grows with every generated token. GPU memory, not compute, often becomes the binding constraint.
The practical implication: you need to apply Little's Law at the token level, not the request level. Your system's throughput capacity is measured in tokens per second, and the "queue" you need to manage is the total token workload — input tokens waiting for prefill plus output tokens being generated across all active sequences.
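As a rough illustration of the two views, the sketch below applies Little's Law once with requests as the unit and once with tokens as the unit. All numbers (arrival rate, latency, tokens per request, tokens in the system) are made-up assumptions, not measurements from any real deployment.

```python
def average_in_system(arrival_rate: float, mean_time_in_system: float) -> float:
    """Little's Law: L = lambda * W."""
    return arrival_rate * mean_time_in_system


def mean_time_in_system(average_count: float, arrival_rate: float) -> float:
    """Little's Law rearranged: W = L / lambda."""
    return average_count / arrival_rate


# Request-level view (illustrative numbers only).
request_rate = 2.0        # requests per second
request_latency_s = 6.0   # average end-to-end time per request
requests_in_flight = average_in_system(request_rate, request_latency_s)

# Token-level view of the same traffic. Each request is assumed to carry
# 300 prompt tokens and 700 output tokens on average.
tokens_per_request = 300 + 700
token_rate = request_rate * tokens_per_request   # tokens per second
tokens_in_system = 9_000.0                       # assumed monitoring reading
token_residence_s = mean_time_in_system(tokens_in_system, token_rate)

print(requests_in_flight)   # 12.0
print(token_residence_s)    # 4.5
```

The request-level number says nothing about how much work those 12 requests represent; the token-level number ties directly to the resource the GPU actually spends.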
When researchers model LLM inference as a discrete-time queuing system in which each time slot corresponds to one GPU forward pass, the stability condition becomes:
λ(m_prefill + m_decode) < B / t_step
Here λ is the request arrival rate, m_prefill and m_decode are the average prefill and decode token counts per request, B is the per-step token budget, and t_step is the time per forward pass. The left side is token demand per second and the right side is token capacity per second. Cross this threshold and your queue grows without bound, regardless of how clever your scheduler is.
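The stability condition can be checked numerically with a few lines of Python. This is a minimal sketch: the `Workload` class, the function names, and the example values (a 512-token budget, 50 ms per step, 400 prefill and 600 decode tokens per request) are illustrative assumptions rather than figures from the research.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Workload:
    arrival_rate: float          # lambda, requests per second
    mean_prefill_tokens: float   # m_prefill
    mean_decode_tokens: float    # m_decode


def utilization(w: Workload, token_budget: int, step_time_s: float) -> float:
    """Token demand divided by token capacity; the system is stable only below 1."""
    demand = w.arrival_rate * (w.mean_prefill_tokens + w.mean_decode_tokens)
    capacity = token_budget / step_time_s
    return demand / capacity


def max_stable_arrival_rate(w: Workload, token_budget: int, step_time_s: float) -> float:
    """Arrival rate at which demand exactly equals capacity (the stability boundary)."""
    tokens_per_request = w.mean_prefill_tokens + w.mean_decode_tokens
    return (token_budget / step_time_s) / tokens_per_request


B, T_STEP = 512, 0.05  # assumed token budget per step and seconds per step
light = Workload(arrival_rate=8.0, mean_prefill_tokens=400, mean_decode_tokens=600)
heavy = Workload(arrival_rate=12.0, mean_prefill_tokens=400, mean_decode_tokens=600)

for w in (light, heavy):
    rho = utilization(w, B, T_STEP)
    print(f"lambda={w.arrival_rate:.0f} req/s  rho={rho:.2f}  stable={rho < 1}")

print(f"boundary: {max_stable_arrival_rate(light, B, T_STEP):.2f} req/s")
```

With these assumed values the first workload sits below the boundary and the second above it, even though both send requests of the same shape.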
Why Request-Level Load Balancing Fails
Consider a GPU instance with a token budget of 512 tokens per forward pass. Here are two scenarios with identical request counts:
Scenario A: 10 concurrent requests, each generating ~50 tokens. Total active decode tokens per step: ~10. Prefill is fast, decode steps are light. The GPU is underutilized.
Scenario B: 10 concurrent requests, each generating ~2,000 tokens. KV cache for all sequences: massive. The GPU runs out of memory at 6 concurrent sequences, forcing 4 requests into the waiting queue. Effective throughput craters.
A request-level load balancer sees "10 requests" in both cases and calls it balanced. A token-aware system sees a 40x difference in actual GPU workload.
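To make the gap concrete, the sketch below compares the two scenarios by request count, total output tokens, and approximate peak KV cache size. The per-token KV formula (2 × layers × KV heads × head dimension × bytes per element) is standard, but the model shape and the 100-token prompt length are hypothetical assumptions chosen only for illustration.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes stored per token; the factor 2 covers keys and values."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def scenario(num_requests: int, prompt_tokens: int, output_tokens: int,
             per_token_bytes: int) -> dict:
    """Summarize a batch of identical requests by three different measures."""
    total_output = num_requests * output_tokens
    peak_kv = num_requests * (prompt_tokens + output_tokens) * per_token_bytes
    return {
        "requests": num_requests,
        "output_tokens": total_output,
        "peak_kv_gib": peak_kv / 2**30,
    }


# Hypothetical model shape: 32 layers, 8 KV heads, head_dim 128, fp16 values.
PER_TOKEN = kv_bytes_per_token(num_layers=32, num_kv_heads=8, head_dim=128)

a = scenario(num_requests=10, prompt_tokens=100, output_tokens=50, per_token_bytes=PER_TOKEN)
b = scenario(num_requests=10, prompt_tokens=100, output_tokens=2000, per_token_bytes=PER_TOKEN)

print(a["requests"] == b["requests"])            # True: identical to a request counter
print(b["output_tokens"] / a["output_tokens"])   # 40.0
print(f'{a["peak_kv_gib"]:.2f} GiB vs {b["peak_kv_gib"]:.2f} GiB')
```

Whether Scenario B actually exhausts memory depends on the real model, GPU, and prompt lengths; the point is only that the request counter cannot see the difference.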
This loosely mirrors the N+1 query problem from database access, where the real cost is invisible at the point of decision: the load balancer makes N routing decisions without knowing the actual cost of any of them. The information it needs, the output token count, doesn't exist yet when the routing decision is made.
Practical workarounds include:
- Prompt-length-weighted routing: Use input token count as a proxy for total cost. Longer prompts correlate with longer outputs, though imperfectly.
- Active-token-count routing: Route to the instance with the fewest total tokens in flight (prefill + decode), not the fewest requests.
- KV-cache-aware routing: Route based on available GPU memory rather than request count. Some systems like NVIDIA Dynamo implement this by exposing memory utilization as a routing signal.
None of these fully solve the prediction problem, but they reduce the variance from 100x to roughly 3-5x — enough to keep tail latencies manageable.
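Active-token-count routing can be sketched in a few lines. The `Instance` class and its `tokens_in_flight` field are hypothetical; a real router would need each serving instance to export an equivalent metric, and the fleet numbers below are invented.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    active_requests: int
    tokens_in_flight: int  # assumed metric: pending prefill + tokens held by decoding sequences


def route_by_requests(instances: list) -> Instance:
    """Classic least-connections choice: fewest active requests wins."""
    return min(instances, key=lambda i: i.active_requests)


def route_by_tokens(instances: list, prompt_tokens: int) -> Instance:
    """Pick the instance with the fewest tokens in flight, then charge it the new prompt."""
    target = min(instances, key=lambda i: i.tokens_in_flight)
    target.active_requests += 1
    target.tokens_in_flight += prompt_tokens
    return target


fleet = [
    Instance("gpu-a", active_requests=10, tokens_in_flight=21_000),  # long generations
    Instance("gpu-b", active_requests=11, tokens_in_flight=2_600),   # short generations
]

print(route_by_requests(fleet).name)                  # gpu-a
print(route_by_tokens(fleet, prompt_tokens=300).name) # gpu-b
```

The request-count router sends new work to the instance that is already carrying roughly eight times the token load, which is the failure mode described above.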
The Scheduling Discipline That Actually Matters
Classical queuing theory offers a menu of scheduling disciplines: FIFO, shortest-job-first (SJF), priority queuing, fair queuing. For LLM inference, the choice that matters most isn't which request to serve next — it's how to fill each GPU iteration with tokens.
Recent research has formalized this as the "work-conserving" property: a scheduler is work-conserving if it fills each iteration's token budget to capacity whenever sufficient tokens are available. The key insight is that mixing prefill and decode tokens in the same batch is essential for throughput optimality.
Here's why. In a decode-only batch, you might have 8 active sequences each contributing 1 token per step, so 8 tokens are processed per forward pass against a budget of 512. That's about 1.6% utilization. A work-conserving scheduler would pack prefill tokens from waiting requests into the remaining 504 token slots, dramatically increasing GPU utilization per step.
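A minimal sketch of that packing step follows. It is not how any particular engine implements it: `WaitingRequest` and `build_batch` are hypothetical names, and the logic simply gives every decoding sequence one token and then fills the leftover budget with prefill chunks in arrival order.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class WaitingRequest:
    request_id: str
    remaining_prefill: int  # prompt tokens not yet processed


def build_batch(num_decoding: int, waiting: deque, token_budget: int = 512):
    """Return (decode_tokens, prefill_chunks) for one forward pass."""
    decode_tokens = min(num_decoding, token_budget)  # one token per decoding sequence
    remaining = token_budget - decode_tokens
    chunks = []
    while remaining > 0 and waiting:
        req = waiting[0]
        take = min(req.remaining_prefill, remaining)  # chunk the prompt if it does not fit
        chunks.append((req.request_id, take))
        req.remaining_prefill -= take
        remaining -= take
        if req.remaining_prefill == 0:
            waiting.popleft()  # prefill done; the request would start decoding next step
    return decode_tokens, chunks


queue = deque([WaitingRequest("r1", 300), WaitingRequest("r2", 400)])
decode, prefill = build_batch(num_decoding=8, waiting=queue)

used = decode + sum(n for _, n in prefill)
print(decode, prefill)   # 8 [('r1', 300), ('r2', 204)]
print(used / 512)        # 1.0
```

The second request is split across iterations: 204 of its prompt tokens fit in this step and the remaining 196 wait for the next one, which is the chunked-prefill idea in miniature.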
The practical validation is striking:
- Sarathi-Serve and Orca: Proven throughput-optimal. Both mix prefill and decode tokens in the same batch: Orca through iteration-level scheduling, Sarathi-Serve through chunked prefill.
- FasterTransformer: Not throughput-optimal. Separates prefill and decode into distinct batches, leaving GPU cycles stranded.
- Vanilla vLLM (pre-chunked-prefill): Not throughput-optimal in its original form. Prefill-prioritized scheduling without mixing could starve decode tokens under certain arrival patterns.
The lesson: if your serving infrastructure separates prefill and decode into distinct phases that can't share a batch, you may be leaving 30-70% of your GPU throughput on the table. Continuous batching with chunked prefill isn't just an optimization; it's a prerequisite for stable serving under load.
Priority Queuing: The Three-Tier Pattern
Not all inference requests deserve equal treatment. The standard pattern emerging in production systems uses three priority tiers:
Tier 1 — Interactive (latency-sensitive): Chat responses, real-time completions, streaming UI. Target: time-to-first-token under 500ms. These requests should preempt lower-priority work.
Tier 2 — Standard (balanced): API calls with reasonable SLAs, background feature generation, search augmentation. Target: end-to-end completion under 10 seconds. Can tolerate brief queuing.
Tier 3 — Batch (throughput-optimized): Bulk classification, dataset annotation, offline summarization. Target: maximize tokens per dollar. Can wait minutes or hours.
The implementation challenge is preemption. When a Tier 1 request arrives and the GPU is fully committed to Tier 3 work, you need to evict lower-priority sequences. This means saving their KV cache state (either to CPU memory or discarding it for later recomputation) and immediately starting the high-priority prefill.
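The eviction decision itself can be sketched as below. This is a toy model under stated assumptions, not vLLM's scheduler: `TieredScheduler` is a hypothetical class, capacity is counted in sequences rather than KV cache blocks, and the actual save-or-discard of KV state is reduced to a comment.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class Request:
    tier: int                               # 1 = interactive, 3 = batch
    seq: int                                # arrival order, breaks ties within a tier
    request_id: str = field(compare=False)


class TieredScheduler:
    def __init__(self, max_active: int):
        self.max_active = max_active
        self.active: list[Request] = []
        self.waiting: list[Request] = []    # min-heap: best tier, then oldest, first
        self._counter = itertools.count()

    def submit(self, request_id: str, tier: int):
        """Admit a request; return the id of any sequence evicted to make room."""
        req = Request(tier, next(self._counter), request_id)
        if len(self.active) < self.max_active:
            self.active.append(req)
            return None
        victim = max(self.active)           # worst tier, most recent arrival
        if victim.tier > req.tier:
            self.active.remove(victim)
            # A real engine would swap the victim's KV cache to CPU memory
            # or drop it and recompute the prefill later.
            heapq.heappush(self.waiting, victim)
            self.active.append(req)
            return victim.request_id
        heapq.heappush(self.waiting, req)
        return None

    def finish(self, request_id: str) -> None:
        """Remove a completed sequence and resume the best waiting one, if any."""
        self.active = [r for r in self.active if r.request_id != request_id]
        if self.waiting and len(self.active) < self.max_active:
            self.active.append(heapq.heappop(self.waiting))


sched = TieredScheduler(max_active=2)
sched.submit("batch-1", tier=3)
sched.submit("batch-2", tier=3)
print(sched.submit("chat-1", tier=1))   # batch-2
```

Evicting the most recent arrival within the lowest tier is one possible policy; it wastes the least completed work if the KV cache is discarded, but other choices are equally defensible.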
vLLM 0.9+ supports continuous priority numbers where higher-priority requests can preempt lower-priority ones from the active batch. But the scheduling system alone isn't sufficient — you also need an external admission controller that:
- Assigns priorities based on user context, not just request metadata
- Implements per-tier rate limiting to prevent priority inflation
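A minimal sketch of such an admission controller, assuming a token bucket per tier and demotion (rather than rejection) when a tier's bucket is empty. Every name here is hypothetical, including the `user_plan` values, and the rates are placeholders; the caller passes the current time explicitly (for example from `time.monotonic()`) so the behavior is easy to test.

```python
class TokenBucket:
    """Allows `rate_per_s` admissions per second with bursts up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float, now: float = 0.0):
        self.rate = rate_per_s
        self.capacity = burst
        self.level = burst
        self.updated = now

    def allow(self, now: float) -> bool:
        # Refill for the elapsed time, capped at the burst size.
        self.level = min(self.capacity, self.level + (now - self.updated) * self.rate)
        self.updated = now
        if self.level >= 1.0:
            self.level -= 1.0
            return True
        return False


class AdmissionController:
    LOWEST_TIER = 3  # batch tier: always accepted, never rate limited here

    def __init__(self, limits: dict):
        # limits maps tier -> (rate_per_s, burst) for the rate-limited tiers.
        self.buckets = {tier: TokenBucket(rate, burst) for tier, (rate, burst) in limits.items()}

    def assign_tier(self, user_plan: str, wants_streaming: bool) -> int:
        """Derive the tier from user context instead of trusting a client-supplied field."""
        if user_plan == "batch":
            return 3
        if user_plan == "paid" and wants_streaming:
            return 1
        return 2

    def admit(self, tier: int, now: float) -> int:
        """Return the tier actually granted, demoting when a tier is over its limit."""
        for candidate in range(tier, self.LOWEST_TIER):
            if self.buckets[candidate].allow(now):
                return candidate
        return self.LOWEST_TIER


ctrl = AdmissionController({1: (1.0, 2.0), 2: (5.0, 5.0)})
tier = ctrl.assign_tier(user_plan="paid", wants_streaming=True)
print([ctrl.admit(tier, now=0.0) for _ in range(3)])   # [1, 1, 2]
print(ctrl.admit(tier, now=1.0))                       # 1
```

Demoting instead of rejecting keeps every request served while capping how much Tier 1 capacity any burst of traffic can claim.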
