Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving
Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.
Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system that uses it and one that doesn't can be 4–8x in throughput on the same hardware.
Understanding why requires understanding what's actually broken about the naive approach.
Why Static Batching Wastes Your GPU
Conventional batching collects a set of requests, processes them as a unit through the model until every sequence has finished generating, then moves to the next batch. This works fine for image classification, where every input produces exactly one output of fixed size. For language generation, it's a disaster.
The problem is output length variance. A chat request might complete in 15 tokens. A code generation request in the same batch might run to 800 tokens. For the 785 iterations after the short request finishes, its allocated GPU memory and compute slots sit idle as padding while the batch waits for the longest sequence to terminate. You pay the cost of a full batch on every iteration while the utilization curves show 30–60% GPU engagement.
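The waste is easy to quantify. A minimal back-of-the-envelope calculation, using the output lengths above plus two hypothetical mid-length requests:

```python
# Token-slot utilization of a static batch: every sequence occupies its
# batch slot for as many iterations as the *longest* sequence runs.
output_lens = [15, 800, 120, 300]  # hypothetical per-request output lengths

iterations = max(output_lens)                 # batch runs until the longest finishes
slots_paid = iterations * len(output_lens)    # slot-iterations you pay for
slots_used = sum(output_lens)                 # slot-iterations doing real work

print(f"utilization: {slots_used / slots_paid:.0%}")  # -> utilization: 39%
```

With this mix, 61% of the slot-iterations you pay for compute nothing, which is exactly the 30–60% engagement range the utilization curves show.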
Dynamic batching improves on this by grouping requests within a time window (say, 50ms) to reduce admission latency, but the batch-level scheduling problem remains: once the window closes, the batch runs as a unit until the last sequence completes.
Continuous batching solves this by moving the scheduling decision from the request granularity to the iteration granularity. The scheduler runs once per model forward pass, not once per request. When a sequence emits an end-of-sequence token and finishes, its memory slot is freed immediately. The next waiting request is inserted into the batch before the next iteration begins. No request waits for another request to finish—the batch composition changes on every decoding step.
The throughput implication is significant. The ORCA paper (OSDI 2022), which introduced iteration-level scheduling at scale, demonstrated a 36.9x throughput improvement over FasterTransformer at equivalent latency targets. Anyscale's real-world benchmarks showed an 8x improvement over naive HuggingFace Transformers serving. Combined with PagedAttention-based KV cache management, vLLM's original release reported up to 24x higher throughput than HuggingFace Transformers and up to 3.5x over HuggingFace TGI.
How the Scheduler Actually Works
Each forward pass, the continuous batching scheduler executes a short loop:
- Scan the running batch for sequences that completed (EOS emitted)
- Free their KV cache blocks
- Pull waiting requests from the queue—as many as memory and batch-size limits allow
- Concatenate all active sequences into a single compound batch
- Run one model forward pass; each sequence produces its next token
- Repeat
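The loop above can be sketched in a few dozen lines. This is a toy: `fake_forward` stands in for the model, there is no KV cache, and the names are illustrative rather than vLLM's API—but the admit/step/retire structure is the real scheduling pattern.

```python
from collections import deque

EOS = "<eos>"

def fake_forward(batch):
    """Stand-in for the model: each sequence emits one more token,
    finishing (EOS) once it reaches its target output length."""
    return [EOS if len(seq["out"]) + 1 >= seq["target"] else f"t{len(seq['out'])}"
            for seq in batch]

def serve(requests, max_batch=4):
    waiting = deque(requests)          # each request: {"target": n}
    running, finished = [], []
    while waiting or running:
        # 1. Admit waiting requests up to the batch limit.
        while waiting and len(running) < max_batch:
            req = waiting.popleft()
            running.append({"target": req["target"], "out": []})
        # 2. One forward pass over the whole batch: one token per sequence.
        for seq, tok in zip(running, fake_forward(running)):
            seq["out"].append(tok)
        # 3. Retire finished sequences immediately, freeing their slots
        #    for the next iteration's admissions.
        finished += [s for s in running if s["out"][-1] == EOS]
        running = [s for s in running if s["out"][-1] != EOS]
    return finished

done = serve([{"target": t} for t in (3, 1, 5, 2, 4)], max_batch=2)
print(sorted(len(s["out"]) for s in done))   # every sequence ran to its own length
```

Note what never happens here: a finished sequence occupying a slot while a longer one drains. The 1-token request exits after one iteration and its slot is refilled on the very next pass.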
The concatenation step is what makes this structurally different. Static batching requires sequences to be padded to equal length because batched matrix operations need uniform tensor shapes. Continuous batching instead constructs a "super-sequence" with attention masks that prevent any request from attending to any other request's tokens. No padding, no wasted computation—every GPU FLOP processes a real token.
This concatenated formulation integrates naturally with FlashAttention's variable-length kernel variants, which process all sequences in a single GPU kernel call despite different lengths. The result is high GPU occupancy even when the batch contains a mix of short and long in-progress generations.
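The packing convention is simple enough to show without the kernel itself. Variable-length attention kernels (e.g. FlashAttention's varlen variants) take the concatenated tokens plus a cumulative-lengths array—conventionally called `cu_seqlens`—instead of a padded tensor; the sketch below builds both and the block-diagonal masking rule they imply:

```python
# Packing three variable-length sequences into one "super-sequence".
# No padding tokens: the kernel instead receives cumulative offsets
# (cu_seqlens) and restricts attention to each sequence's [start, end) range.
seqs = [["a", "b", "c"], ["d"], ["e", "f"]]

packed = [tok for seq in seqs for tok in seq]
cu_seqlens = [0]
for seq in seqs:
    cu_seqlens.append(cu_seqlens[-1] + len(seq))   # running offsets

# Equivalent block-diagonal mask: position i may attend to position j
# only if both fall inside the same sequence's offset range.
def same_seq(i, j):
    return any(lo <= i < hi and lo <= j < hi
               for lo, hi in zip(cu_seqlens, cu_seqlens[1:]))

print(packed)                          # ['a', 'b', 'c', 'd', 'e', 'f']
print(cu_seqlens)                      # [0, 3, 4, 6]
print(same_seq(0, 2), same_seq(2, 3))  # True False
```

Six tokens in, six tokens of real work out—compare with padding all three sequences to length 3, which would spend a third of the FLOPs on padding.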
PagedAttention: The Memory Management Layer
Continuous batching addresses when to schedule requests. PagedAttention (vLLM, SOSP 2023) addresses where to store their KV caches.
Prior to PagedAttention, LLM serving frameworks pre-allocated a contiguous block of GPU memory for each request sized to its maximum possible output length. This caused 60–80% memory waste from fragmentation: over-reserved slots, alignment gaps, and the simple fact that most sequences don't use their maximum allocation.
PagedAttention applies the OS virtual memory paging model to KV cache management. KV cache is partitioned into fixed-size blocks (16 tokens per block in vLLM's default), allocated on-demand as sequences generate tokens rather than up-front. A block table maps each sequence's logical blocks to physical GPU memory locations—blocks need not be contiguous. Memory waste drops to under 4% (only the last partially-filled block is wasted per sequence).
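A toy allocator makes the block-table indirection concrete. This is a sketch of the bookkeeping only (class and method names are invented for illustration, not vLLM's internals); the actual KV tensors would live in the physical blocks:

```python
BLOCK = 16  # tokens per block (vLLM's default block size)

class BlockPool:
    """On-demand paged KV allocation with a per-sequence block table."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))   # free physical block ids
        self.tables = {}                      # seq_id -> [physical block ids]
        self.lens = {}                        # seq_id -> tokens stored

    def append_token(self, seq_id):
        table = self.tables.setdefault(seq_id, [])
        n = self.lens.get(seq_id, 0)
        if n % BLOCK == 0:                    # last block full: grab a new one
            table.append(self.free.pop())
        self.lens[seq_id] = n + 1

    def release(self, seq_id):
        """Sequence finished: its blocks are reusable immediately."""
        self.free += self.tables.pop(seq_id)
        del self.lens[seq_id]

pool = BlockPool(num_blocks=8)
for _ in range(40):                  # 40 tokens -> ceil(40/16) = 3 blocks
    pool.append_token("req-1")
print(len(pool.tables["req-1"]), len(pool.free))   # 3 5
```

The block table means the three physical blocks need not be adjacent, and the only waste is the tail of the last block (8 of 16 slots here)—which is exactly why per-sequence waste stays under one block.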
The second benefit: physical blocks can be shared across sequences via copy-on-write semantics. Beam search branches, parallel samples, and requests sharing a common system prompt can all reference the same physical KV blocks until they diverge. For beam search, this reduces overhead by up to 55% and yields up to 2.2x throughput improvement over unshared allocation.
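The copy-on-write mechanics are the same as in an OS: reference-count shared blocks, and copy only when a sharer writes. A minimal sketch with invented helper names (`fork`, `write_block`)—the real implementation tracks refcounts per physical block inside the allocator:

```python
# Copy-on-write sharing: two sampling branches reference the same physical
# prompt blocks; a write to a shared block triggers a private copy first.
refcount = {}   # physical block id -> number of sequences referencing it

def fork(table):
    """Branch a sequence (beam/parallel sample): share blocks, bump refcounts."""
    for b in table:
        refcount[b] = refcount.get(b, 1) + 1
    return list(table)

def write_block(table, i, alloc):
    """Before mutating table[i], copy it if another sequence still shares it."""
    b = table[i]
    if refcount.get(b, 1) > 1:        # shared: copy-on-write
        refcount[b] -= 1
        table[i] = alloc()            # fresh private block
    return table[i]

free = iter(range(100, 200))          # toy physical block allocator
parent = [0, 1, 2]                    # prompt KV lives in blocks 0-2
child = fork(parent)                  # branching copies nothing

write_block(child, 2, lambda: next(free))   # branch diverges at the last block
print(parent, child)                  # [0, 1, 2] [0, 1, 100]
```

The branch point is the only copy: blocks 0 and 1—the shared prompt—stay deduplicated in GPU memory for as long as both branches live.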
SGLang extended this further with RadixAttention: a radix tree data structure that maintains KV cache across different requests, enabling automatic prefix reuse. Requests sharing a system prompt, few-shot examples, or RAG context reuse each other's cached KV blocks rather than recomputing them. On workloads with heavy prefix sharing, this delivers up to 5x faster inference.
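The payoff of prefix reuse is easiest to see in miniature. The sketch below substitutes a dict of cached prefixes for SGLang's radix tree (the tree makes the longest-prefix lookup efficient; the behavior is the same): prefill reuses the longest cached prefix and computes KV only for the tail.

```python
# Prefix-reuse sketch in the spirit of RadixAttention. A dict of token-tuple
# prefixes stands in for the radix tree; values are placeholder "KV" strings.
cache = {}

def prefill(tokens):
    """Return (tokens reused from cache, tokens actually computed)."""
    # Longest already-cached prefix of the incoming sequence.
    hit = max((i for i in range(len(tokens) + 1) if tuple(tokens[:i]) in cache),
              default=0)
    for i in range(hit + 1, len(tokens) + 1):
        cache[tuple(tokens[:i])] = f"kv[:{i}]"   # "compute" the missing tail
    return hit, len(tokens) - hit

sys_prompt = ["You", "are", "helpful", "."]
print(prefill(sys_prompt + ["Hi"]))    # (0, 5): cold cache, everything computed
print(prefill(sys_prompt + ["Bye"]))   # (4, 1): system prompt reused, 1 token computed
```

Scale the shared prefix up to a few-shot prompt or a RAG context of thousands of tokens and the second request's prefill cost nearly vanishes—that is where the large speedups on prefix-heavy workloads come from.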
