Scheduling Fairness in Multi-Tenant LLM Inference: Why FIFO Is the Wrong Default
Your company runs a shared LLM serving cluster. Two tenants use it: a customer-facing chatbot with a 500ms first-token latency SLO, and a batch document enrichment pipeline that processes thousands of long-context prompts overnight. One morning, the chatbot team pages you at 3am because their P95 TTFT spiked to 12 seconds. Root cause: the batch job started earlier than expected, filled the GPU memory with prefill work, and the chatbot's short requests sat in queue behind a parade of 8,000-token prompts. Your FIFO scheduler gave them equal priority. The chatbot's SLO was violated 4,000 times before you killed the batch job manually.
This failure mode is common, well-understood in theory, and surprisingly widespread in practice. Most teams deploy vLLM or TGI with the default FIFO scheduler, add multiple workloads over time, and only discover the priority inversion when an incident happens.
Why LLM Scheduling Is Fundamentally Different
The standard intuition from web service scheduling doesn't transfer cleanly to LLM inference. In a typical API, requests have roughly similar execution times — a few milliseconds — so FIFO works fine. LLM inference breaks this assumption in three ways.
Output length is unknown at admission time. An LLM generates tokens autoregressively until it decides to stop. A request that looks like a simple classification prompt might turn into a 2,000-token reasoning chain. A scheduler that tries to estimate job duration (as shortest-job-first strategies do) has to either guess or measure — and prediction error of 80% or more is common without specialized models trained on your traffic distribution.
Input and output tokens have asymmetric costs. Processing input tokens (the prefill stage) is compute-bound and happens in one parallel pass. Generating output tokens (the decode stage) is memory-bound and sequential: each new token requires streaming the full model weights from HBM. A 4,000-token prompt with a 10-token output and a 100-token prompt with a 500-token output have very different cost profiles, and a raw token count ranks them backwards: the second request holds the GPU far longer because its 500 decode steps run one at a time. Schedulers that count tokens without distinguishing phase make poor decisions on exactly this kind of traffic.
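To make the asymmetry concrete, here is a back-of-envelope comparison of those two requests. The throughput numbers are illustrative assumptions for a mid-sized dense model, not measurements from any particular serving stack.

```python
# Rough comparison of two requests that naive token counting ranks backwards.
# Throughput figures below are illustrative assumptions, not measurements.

PREFILL_TOKENS_PER_SEC = 10_000   # prefill: compute-bound, one parallel pass
DECODE_TOKENS_PER_SEC = 60        # decode: memory-bound, sequential per request

def request_gpu_seconds(input_tokens: int, output_tokens: int) -> float:
    """Rough wall-clock time one request holds GPU resources."""
    return (input_tokens / PREFILL_TOKENS_PER_SEC
            + output_tokens / DECODE_TOKENS_PER_SEC)

print(request_gpu_seconds(4_000, 10))   # ~0.57 s: dominated by one prefill pass
print(request_gpu_seconds(100, 500))    # ~8.3 s: dominated by 500 sequential decode steps
```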
GPU memory is the binding constraint, not compute time. The key-value (KV) cache for in-flight requests consumes GPU HBM in proportion to sequence length and batch size. When the KV cache is full, the scheduler must preempt a running request — discarding its cache and forcing a restart from scratch. This isn't like swapping memory in a CPU process; it means re-running every prefill token again. A batch of large requests doesn't just delay small requests — it can trigger cascading evictions that multiply total GPU work by 2–3x under high load.
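A quick way to see why sequence length turns into memory pressure is the standard KV-cache size formula. The dimensions below assume a Llama-2-7B-class model (full multi-head attention) in fp16 and are only meant to convey magnitudes; substitute your own model's numbers.

```python
# Standard per-request KV-cache footprint:
#   bytes = 2 (K and V) * layers * kv_heads * head_dim * dtype_bytes * seq_len
# Dimensions assume a Llama-2-7B-class model in fp16 (illustrative only).

LAYERS, KV_HEADS, HEAD_DIM, DTYPE_BYTES = 32, 32, 128, 2

def kv_cache_bytes(seq_len: int) -> int:
    return 2 * LAYERS * KV_HEADS * HEAD_DIM * DTYPE_BYTES * seq_len

# One 8,000-token sequence pins roughly 3.9 GiB of HBM for its whole lifetime;
# a handful of such requests exhausts what is left after model weights,
# which is exactly when preemption and recompute start.
print(kv_cache_bytes(8_000) / 2**30)
```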
How FIFO Fails at Scale
Head-of-line blocking is the most visible failure. A single 8,000-token prompt takes the prefill slot for hundreds of milliseconds, during which every request behind it waits, regardless of how short it is or what SLO it carries. This is structurally identical to a slow query blocking a database connection pool, but with higher stakes: TTFT SLOs measured in hundreds of milliseconds can be violated by a single "elephant" request.
The convoy effect compounds this. If your traffic mix includes both large batch jobs and small interactive requests, the batch jobs cluster at GPU memory limits and force the scheduler into defensive preemption. Short requests that arrived after the batch job are repeatedly context-switched, their KV caches evicted and recomputed, paying the cost of the batch job's existence over and over.
Priority inversion is the subtler version: a low-priority batch workload starts before a high-priority interactive request arrives. With FIFO, the batch job runs to completion (or until memory forces preemption) before the interactive request is served, even though the interactive request has a far tighter deadline. The scheduler sees arrival order; it doesn't see SLOs.
KV cache thrashing closes the loop. Under sustained load with FIFO, the scheduler admits requests faster than they can complete. Memory fills. The scheduler preempts running requests to make room for new ones. Those preempted requests restart and refill memory. Throughput collapses while utilization stays high — the GPU is busy recomputing work it already did.
Token-Weighted Fair-Share Queuing
The core insight behind the Virtual Token Counter (VTC) algorithm, introduced in a 2024 OSDI paper on fairness in LLM serving, and related systems such as Justitia, is borrowed from network fair queuing: track how much service each tenant has received and always schedule the most underserved tenant next.
The VTC approach assigns each tenant a counter that increments as their requests consume tokens. The scheduler picks the next request from the tenant with the smallest counter. New tenants joining mid-stream get their counter "lifted" to match the current minimum, so they don't immediately starve everyone else by appearing perpetually underserved.
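A minimal sketch of that counter-and-lift mechanic might look like the following. The class and method names are mine, not the paper's reference implementation, and accounting here uses plain token counts; weighting is discussed next.

```python
from collections import deque

class VTCScheduler:
    """Sketch of Virtual-Token-Counter-style fair scheduling.

    Each tenant carries a counter of tokens served; the scheduler always
    dequeues from the backlogged tenant with the smallest counter.
    Structure is illustrative, not the paper's reference implementation.
    """

    def __init__(self):
        self.counters = {}   # tenant -> virtual token counter
        self.queues = {}     # tenant -> deque of pending requests

    def submit(self, tenant, request):
        if tenant not in self.counters:
            # Lift a newly arriving tenant to the current minimum so it does
            # not look infinitely underserved and monopolize the scheduler.
            self.counters[tenant] = min(self.counters.values(), default=0)
            self.queues[tenant] = deque()
        self.queues[tenant].append(request)

    def next_request(self):
        backlogged = [t for t, q in self.queues.items() if q]
        if not backlogged:
            return None
        tenant = min(backlogged, key=lambda t: self.counters[t])
        return tenant, self.queues[tenant].popleft()

    def account(self, tenant, input_tokens, output_tokens):
        # Charge for tokens actually served; a weighted version (below)
        # would charge decode tokens more than prefill tokens.
        self.counters[tenant] += input_tokens + output_tokens
```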
The critical detail is how tokens are counted. Counting every token equally under-charges decode-heavy traffic: an output token holds GPU resources for a full sequential decode step, while thousands of input tokens are processed in a single prefill pass. A tenant whose requests generate long outputs therefore accumulates counter value slowly relative to the GPU time it actually consumes, and can crowd out tenants with prefill-heavy requests. Better approaches weight output tokens more heavily than input tokens, reflecting that sequential output generation is what actually holds GPU resources over time.
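A weighted cost function that could replace the plain count in the sketch above is only a couple of lines; the 1:4 ratio is an arbitrary placeholder, not a tuned or recommended value.

```python
# Illustrative weighting: charge output (decode) tokens more than input
# (prefill) tokens. The 1:4 ratio is an arbitrary placeholder.
INPUT_WEIGHT, OUTPUT_WEIGHT = 1.0, 4.0

def weighted_cost(input_tokens: int, output_tokens: int) -> float:
    return INPUT_WEIGHT * input_tokens + OUTPUT_WEIGHT * output_tokens
```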
The Equinox system pushes this further with a dual-counter design: a User Fairness Counter tracking weighted token consumption from the tenant's perspective, and a Resource Fairness Counter tracking actual GPU utilization from the operator's perspective. These two views diverge when output token length predictions are wrong — and predictions are frequently wrong. Equinox adds a mixture-of-experts predictor trained on traffic patterns to reduce output token prediction error from roughly 80% to 33%, allowing the fairness counters to stay accurate even before requests complete.
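What dual-counter bookkeeping could look like is sketched below. The field names, the up-front charge from a predicted output length, and the correction step on completion are my guesses at the structure described above, not Equinox's actual implementation.

```python
class DualFairnessCounters:
    """Sketch of dual-counter bookkeeping in the spirit described above.

    The user-facing counter is charged up front from a predicted output
    length; the resource-facing counter records what was actually served.
    Names and the reconciliation step are illustrative guesses, not
    Equinox's implementation.
    """

    def __init__(self, input_weight=1.0, output_weight=4.0):
        self.w_in, self.w_out = input_weight, output_weight
        self.user_fairness = {}       # tenant -> charge based on predictions
        self.resource_fairness = {}   # tenant -> charge based on actual usage

    def _cost(self, inp, out):
        return self.w_in * inp + self.w_out * out

    def on_admit(self, tenant, input_tokens, predicted_output):
        self.user_fairness[tenant] = (
            self.user_fairness.get(tenant, 0.0)
            + self._cost(input_tokens, predicted_output))

    def on_complete(self, tenant, input_tokens, actual_output, predicted_output):
        # Record true usage, and correct the user-side charge by the
        # prediction error so the two views converge as requests finish.
        self.resource_fairness[tenant] = (
            self.resource_fairness.get(tenant, 0.0)
            + self._cost(input_tokens, actual_output))
        self.user_fairness[tenant] += self.w_out * (actual_output - predicted_output)
```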
The theoretical guarantee from virtual-time fair queuing is useful: for any two tenants who are both backlogged, the difference in service they receive is bounded by a constant proportional to the maximum request size. This means no tenant starves indefinitely, and the bound is computable from your request size distribution — not just a hand-wavy promise.
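Stated schematically (the exact constant depends on the token weights and is derived in the paper), for any two tenants $f$ and $g$ that are both continuously backlogged over an interval $(t_1, t_2)$, the weighted token service $W$ each receives satisfies a bound of the form

$$ \left| W_f(t_1, t_2) - W_g(t_1, t_2) \right| \le c \cdot L_{\max}, $$

where $L_{\max}$ is the maximum weighted size of a single request in your workload. Plugging your own request-size distribution into $L_{\max}$ turns the guarantee into a concrete worst-case unfairness budget in tokens.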
SLO-Tiered Admission Control
- https://www.usenix.org/system/files/osdi24-sheng.pdf
- https://arxiv.org/html/2510.17015v1
- https://arxiv.org/html/2508.16646v1
- https://arxiv.org/html/2505.23022
- https://www.usenix.org/system/files/osdi24-sun-biao.pdf
- https://arxiv.org/pdf/2512.12928
- https://docs.vllm.ai/en/latest/api/vllm/config/scheduler/
- https://arxiv.org/html/2604.11001v1
- https://arxiv.org/html/2407.00047v1
- https://arxiv.org/html/2602.16603
- https://fergusfinn.com/blog/scheduling-in-inference-engines/
- https://huggingface.co/blog/martinigoyanes/llm-inference-at-scale-with-tgi
