
Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system that uses it and one that doesn't can be 4–8x in throughput on the same hardware.

Understanding why requires understanding what's actually broken about the naive approach.

Why Static Batching Wastes Your GPU

Conventional batching collects a set of requests, processes them as a unit through the model until every sequence has finished generating, then moves to the next batch. This works fine for image classification, where every input produces exactly one output of fixed size. For language generation, it's a disaster.

The problem is output length variance. A chat request might complete in 15 tokens. A code generation request in the same batch might run to 800 tokens. For the 785 iterations after the short request finishes, its allocated GPU memory and compute slots sit idle—padding—while the batch waits for the longest sequence to terminate. You're paying full throughput cost while the utilization curves show 30–60% GPU engagement.

Dynamic batching improves on this by grouping requests within a time window (say, 50ms) to reduce admission latency, but the batch-level scheduling problem remains: once the window closes, the batch runs as a unit until the last sequence completes.

Continuous batching solves this by moving the scheduling decision from the request granularity to the iteration granularity. The scheduler runs once per model forward pass, not once per request. When a sequence emits an end-of-sequence token and finishes, its memory slot is freed immediately. The next waiting request is inserted into the batch before the next iteration begins. No request waits for another request to finish—the batch composition changes on every decoding step.

The throughput implication is significant. The ORCA paper (OSDI 2022), which introduced iteration-level scheduling at scale, demonstrated 36.9x throughput improvement over FasterTransformer at equivalent latency targets. Anyscale's real-world benchmarks showed 8x improvement over naive HuggingFace Transformers serving. Combined with PagedAttention-based KV cache management, vLLM's original release reached 24x higher throughput than HuggingFace Transformers and 3.5x over HuggingFace TGI.

How the Scheduler Actually Works

Each forward pass, the continuous batching scheduler executes a short loop:

  1. Scan the running batch for sequences that completed (EOS emitted)
  2. Free their KV cache blocks
  3. Pull waiting requests from the queue—as many as memory and batch-size limits allow
  4. Concatenate all active sequences into a single compound batch
  5. Run one model forward pass; each sequence produces its next token
  6. Repeat
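The loop above can be sketched in a few lines of Python. This is a toy simulation, not any runtime's real internals: the "model" is faked by giving each running sequence a 20% chance of emitting EOS per step, and all names are illustrative.

```python
import random
from collections import deque

random.seed(0)

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(10))
running = {}      # request id -> tokens generated so far
finished = []

def forward_pass(batch):
    """One simulated decode step: returns, per sequence, whether it emitted EOS."""
    return {rid: random.random() < 0.2 for rid in batch}

while waiting or running:
    # Step 3: admit waiting requests up to the batch limit -- on every
    # iteration, not once per batch as static batching would
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = 0
    # Step 5: one forward pass; every active sequence emits one token
    eos = forward_pass(list(running))
    for rid in list(running):
        running[rid] += 1
        if eos[rid]:
            # Steps 1-2: sequence finished; retire it and free its slot immediately
            finished.append(rid)
            del running[rid]

print(len(finished))    # 10 -- every request completed
```

The key property to notice: admission happens inside the iteration loop, so a freed slot is refilled on the very next forward pass.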

The concatenation step is what makes this structurally different. Static batching requires sequences to be padded to equal length because batched matrix operations need uniform tensor shapes. Continuous batching instead constructs a "super-sequence" with attention masks that prevent any request from attending to any other request's tokens. No padding, no wasted computation—every GPU FLOP processes a real token.

This concatenated formulation integrates naturally with FlashAttention's variable-length kernel variants, which process all sequences in a single GPU kernel call despite different lengths. The result is high GPU occupancy even when the batch contains a mix of short and long in-progress generations.
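Variable-length attention kernels typically take the packed token stream plus a vector of cumulative sequence-length offsets (the `cu_seqlens` convention). A minimal sketch of building those offsets, with illustrative lengths:

```python
# Packing variable-length sequences without padding: concatenate all tokens
# and record cumulative offsets. Each sequence i occupies the token range
# offsets[i]:offsets[i+1], so the kernel never lets attention cross a
# sequence boundary and no padding token exists.
seq_lens = [15, 800, 3, 120]          # tokens per active sequence

def pack_offsets(lengths):
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets

cu_seqlens = pack_offsets(seq_lens)   # [0, 15, 815, 818, 938]
total_tokens = cu_seqlens[-1]         # 938 real tokens, zero padding
```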

PagedAttention: The Memory Management Layer

Continuous batching addresses when to schedule requests. PagedAttention (vLLM, SOSP 2023) addresses where to store their KV caches.

Prior to PagedAttention, LLM serving frameworks pre-allocated a contiguous block of GPU memory for each request sized to its maximum possible output length. This caused 60–80% memory waste from fragmentation: over-reserved slots, alignment gaps, and the simple fact that most sequences don't use their maximum allocation.

PagedAttention applies the OS virtual memory paging model to KV cache management. KV cache is partitioned into fixed-size blocks (16 tokens per block in vLLM's default), allocated on-demand as sequences generate tokens rather than up-front. A block table maps each sequence's logical blocks to physical GPU memory locations—blocks need not be contiguous. Memory waste drops to under 4% (only the last partially-filled block is wasted per sequence).
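The block-table idea can be sketched with a toy allocator: demand-page a block only when the previous one fills, and keep a per-sequence mapping from logical to physical block ids. This is illustrative structure, not vLLM's implementation.

```python
BLOCK = 16   # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Toy paged KV allocator: a free-list of physical block ids."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.tokens = 0
    def append_token(self):
        if self.tokens % BLOCK == 0:   # last block full: demand-page one more
            self.block_table.append(self.allocator.alloc())
        self.tokens += 1

pool = BlockAllocator(num_blocks=64)
seq = Sequence(pool)
for _ in range(50):
    seq.append_token()

# 50 tokens need ceil(50/16) = 4 blocks; the only waste is the tail
# of the final block
print(len(seq.block_table), 4 * BLOCK - seq.tokens)   # 4 14
```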

The second benefit: physical blocks can be shared across sequences via copy-on-write semantics. Beam search branches, parallel samples, and requests sharing a common system prompt can all reference the same physical KV blocks until they diverge. For beam search, this reduces overhead by up to 55% and yields up to 2.2x throughput improvement over unshared allocation.
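Copy-on-write sharing can be sketched with reference counts: forking a sequence shares its physical blocks, and a write to a shared block first copies it. Again illustrative, not vLLM's actual code.

```python
class CowPool:
    """Toy refcounted block pool with copy-on-write semantics."""
    def __init__(self, n):
        self.free = list(range(n))
        self.refs = {}
    def alloc(self):
        b = self.free.pop()
        self.refs[b] = 1
        return b
    def fork(self, table):
        for b in table:
            self.refs[b] += 1       # child shares the same physical blocks
        return list(table)
    def write(self, table, i):
        b = table[i]
        if self.refs[b] > 1:        # shared: copy before writing
            self.refs[b] -= 1
            table[i] = self.alloc()
        return table[i]

pool = CowPool(8)
parent = [pool.alloc(), pool.alloc()]   # two prompt blocks
child = pool.fork(parent)               # e.g. a second beam or sample
pool.write(child, 1)                    # child diverges on its last block only
shared = sum(1 for a, b in zip(parent, child) if a == b)
print(shared)    # 1 block still physically shared
```

Until the write, the fork costs no KV memory at all; after it, only the diverged block is duplicated.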

SGLang extended this further with RadixAttention: a radix tree data structure that maintains KV cache across different requests, enabling automatic prefix reuse. Requests sharing a system prompt, few-shot examples, or RAG context reuse each other's cached KV blocks rather than recomputing them. On workloads with heavy prefix sharing, this delivers up to 5x faster inference.
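The payoff of prefix reuse can be illustrated with a toy token-level trie (a simplification of RadixAttention's radix tree; names are hypothetical): inserting a new prompt reports how many of its leading tokens are already cached from earlier requests.

```python
class PrefixCache:
    """Toy prefix cache: a trie over token ids."""
    def __init__(self):
        self.root = {}
    def insert(self, tokens):
        """Cache a prompt; return how many leading tokens were already present."""
        node, hit = self.root, 0
        for t in tokens:
            if t in node:
                hit += 1
            else:
                node[t] = {}
            node = node[t]
        return hit

cache = PrefixCache()
system = list(range(100))                # a shared 100-token system prompt
cache.insert(system + [500, 501])        # first request: nothing cached yet
reused = cache.insert(system + [600])    # second request hits the shared prefix
print(reused)    # 100 tokens of prefill skipped
```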

The Tradeoff Curve: When Continuous Batching Helps and When It Doesn't

Continuous batching's benefit scales directly with output length variance. On a workload where every request generates exactly 50 tokens, the advantage over static batching shrinks toward zero—there's no idle padding to eliminate. On a mixed chat workload where outputs range from 5 to 1000 tokens, the improvement is 4–8x.

The workload fit looks like this:

  • Chat, agents, interactive assistants: high variance, many concurrent users, mixed short/long responses. Continuous batching is strongly beneficial here—this is the workload it was designed for.
  • Online APIs under variable QPS: sequences are admitted immediately rather than waiting for a batch to fill, which significantly reduces time to first token (TTFT) at moderate load.
  • RAG pipelines with shared prefixes: RadixAttention (SGLang) or prefix caching (vLLM) compound the benefit via KV reuse across requests.
  • Offline batch inference with homogeneous outputs: static batching is competitive here and often simpler. When sequence lengths are known in advance and uniform, the scheduler overhead of continuous batching offers little gain.
  • Very low QPS (single-digit requests per second): all approaches perform similarly; the scheduling overhead of continuous batching matters more than the utilization benefit.

The interference problem at high concurrency is the most important nuance practitioners miss. When a long-context request enters the batch—a 32K-token RAG document, a long code file—its prefill computation is compute-bound and saturates GPU matrix multiply units for many milliseconds. While that prefill runs, the decode iterations of all currently active sequences stall. Users with in-progress generations see sudden TBT (time between tokens) spikes. This is the prefill-decode interference problem.

The standard mitigation is chunked prefill: rather than processing a long prompt in one shot, split it into chunks (512 tokens each, a common default) and interleave each chunk with regular decode iterations. vLLM exposes this via --enable-chunked-prefill. The tradeoff is a slight reduction in pure throughput in exchange for much more predictable TBT for active sessions.
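The interleaving can be sketched as a scheduling plan: each prefill chunk is followed by a decode step for the already-running batch, so no in-flight sequence stalls for the full prompt. Chunk size and structure are illustrative.

```python
CHUNK = 512   # tokens of new prompt processed per iteration

def schedule(prompt_len, chunk=CHUNK):
    """Yield the iteration plan for admitting one long prompt."""
    done = 0
    while done < prompt_len:
        n = min(chunk, prompt_len - done)
        done += n
        yield ("prefill", n)     # one chunk of the new prompt
        yield ("decode", 1)      # one token for every active sequence

plan = list(schedule(2000))
prefill_steps = [n for kind, n in plan if kind == "prefill"]
print(prefill_steps)    # [512, 512, 512, 464]
```

A 2000-token prompt thus costs four iterations of prefill instead of one, but active users get a decode step between each chunk: the predictable-TBT-for-throughput trade described above.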

A more aggressive solution—prefill-decode disaggregation (the DistServe architecture)—assigns prefill computation and decode computation to entirely separate GPU pools, eliminating interference at the cost of KV cache transfer overhead between pools. This pattern is emerging as the standard for large-scale deployments where interactive and long-context workloads share infrastructure.

Failure Modes at High Concurrency

Understanding the failure modes is what separates teams that operate continuous batching well from teams that don't.

The Eviction Cascade

The most severe production failure happens when long-context requests overwhelm KV cache capacity. The sequence:

  1. Several long-context requests (RAG, document analysis) consume large KV cache blocks
  2. Their memory pressure forces the scheduler to preempt shorter-context requests—their KV blocks are evicted, they're sent back to the waiting queue
  3. When memory frees, preempted requests re-enter the queue and must recompute their KV cache from scratch—including all tokens they already generated
  4. If new requests keep arriving, the GPU spends all its cycles computing prefills for preempted sequences, producing zero new output tokens

The visible signature: GPU utilization at 100%, throughput flat or declining, preemption counters rising, P99 latency spiking from 200ms to several seconds. The system has entered a state where it's busier than ever but doing no useful work.

Prevention requires workload segregation: never route long-context requests (RAG with 10K+ context) to the same serving instance as short-context interactive requests. The capacity math is unforgiving—on a 40GB A100 serving Llama 13B, the model weights consume ~26GB, leaving ~14GB for KV cache. At 2048-token average sequence length, that's roughly 7 concurrent sequences before preemptions begin.
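The capacity arithmetic above can be checked directly. The sketch below assumes fp16 weights and KV cache and the published Llama 13B shape (40 layers, 5120 hidden size); it ignores activation and allocator overhead, which is what pushes the practical figure from the raw 8 down toward the ~7 cited above.

```python
layers, hidden = 40, 5120
bytes_fp16 = 2

kv_per_token = 2 * layers * hidden * bytes_fp16   # K and V, every layer
per_seq = kv_per_token * 2048                     # a 2048-token sequence
free_for_kv = 14 * 10**9                          # ~40GB card - ~26GB weights

print(kv_per_token, round(per_seq / 1e9, 2), free_for_kv // per_seq)
# ~0.82 MB per token, ~1.68 GB per sequence, 8 sequences of raw capacity
```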

Memory-Based Head-of-Line Blocking

Continuous batching eliminates the classic head-of-line blocking problem where short requests wait behind long ones. But a subtler form persists at saturation.

When GPU memory is exhausted and the waiting queue cannot be admitted, FIFO-ordered queues stall entirely—even if shorter requests further back in the queue would fit in memory. Systems enforcing fairness constraints can't reorder the queue. Research has documented HOL blocking times reaching tens of seconds under continuous batching at sustained saturation. Properly designed schedulers detect this condition and either admit shorter requests out of FIFO order or apply backpressure at the API gateway before queue depth builds.
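One shape such a scheduler can take is memory-aware, out-of-FIFO admission: when the head of the queue doesn't fit in free KV memory, scan deeper for a request that does. This is an illustrative policy sketch, not any runtime's actual code.

```python
def admit(queue, free_blocks, blocks_needed):
    """Return the index of the first queued request that fits, else None."""
    for i, req in enumerate(queue):
        if blocks_needed(req) <= free_blocks:
            return i
    return None

# A long-context request stalls the head; a short one behind it would fit.
queue = [{"id": "big", "prompt": 8192}, {"id": "small", "prompt": 256}]
idx = admit(queue, free_blocks=40,
            blocks_needed=lambda r: r["prompt"] // 16)   # 16-token blocks
print(queue[idx]["id"])    # "small" is admitted past the stalled head
```

A real system would also bound how long the head can be bypassed, to avoid starving the long request entirely.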

Block Size Fragmentation

vLLM's default KV cache block size of 16 tokens is intentional. Increasing it to 128 tokens—a tempting configuration change for workloads with long contexts—causes significant internal fragmentation under high concurrency: with 256 concurrent sequences, 128-token blocks waste roughly 16,000 token slots that could fit additional sequences. Block size changes rarely improve throughput and frequently hurt it.
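The fragmentation figure follows from simple expected-value arithmetic: each sequence's last block is, on average, half empty, so waste scales linearly with block size.

```python
def expected_waste(block_size, num_seqs):
    """Expected internal fragmentation: half of each sequence's last block."""
    return num_seqs * block_size // 2

print(expected_waste(16, 256), expected_waste(128, 256))
# 2048 vs 16384 wasted token slots -- the ~16,000 figure cited above
```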

Choosing and Configuring a Continuous Batching Runtime

Current production options and their tradeoffs:

vLLM is the right default for most teams: widest model support, straightforward deployment, active development, and the v0.6.0 release (September 2024) resolved significant CPU overhead issues—before that release, only 38% of wall time was actual GPU computation; HTTP server overhead alone consumed 33%. Post-v0.6.0, Llama 8B shows 2.7x higher throughput and 5x faster time-per-output-token versus prior versions.

TensorRT-LLM delivers the highest raw throughput on NVIDIA hardware once compiled, but requires a compilation step that can take hours and is not trivially portable. Best suited for long-running single-model deployments on controlled hardware.

SGLang performs comparably to TensorRT-LLM on modern hardware and adds RadixAttention for cross-request KV reuse—a decisive advantage for chatbot and RAG workloads with heavy prefix sharing. LMSYS benchmarks showed SGLang achieving up to 3.1x higher throughput than vLLM on Llama-70B in 2024.

Critical configuration parameters for vLLM:

  • max_num_seqs: controls maximum concurrent sequences. Lowering this when seeing high preemption rates reduces cascade risk at the cost of throughput.
  • --enable-chunked-prefill: strongly recommended for mixed interactive/long-context workloads. Interleaves prefill chunks with decode to prevent TBT spikes.
  • gpu_memory_utilization (default 0.9): adjusting upward can increase KV cache capacity but leaves less headroom for model weight overhead. Don't push above 0.95.

For monitoring, track vllm:num_preemptions as the primary health signal. A rising preemption rate is an early warning of approaching the eviction cascade. Set an alert before the metric becomes critical—by the time throughput visibly degrades, the cascade is already underway.

The Scheduling Problem Is Not Fully Solved

Continuous batching brought LLM serving from 30–60% GPU utilization to 80–95% under realistic workloads. The ORCA paper's insight—that iteration-level scheduling matches the statistical structure of language generation better than request-level scheduling—was a genuine step change.

The open problems are at the edges: handling very mixed workloads without interference, distributing KV cache across disaggregated prefill and decode pools efficiently, and scheduling across multiple nodes without head-of-line blocking at the inter-node level. Each of these is an active research and engineering area, with chunked prefill and disaggregation being the current frontier answers.

For most production teams today, the practical action is simpler: ensure your serving runtime uses continuous batching (vLLM, SGLang, TGI, and TensorRT-LLM all do by default), segregate your long-context and short-context workloads onto separate instances, enable chunked prefill if you see TBT spikes, and watch your preemption counters. The hardware you already have will serve significantly more traffic.
