
Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system that uses it and one that doesn't can be 4–8x in throughput on the same hardware.

Understanding why requires understanding what's actually broken about the naive approach.

Why Static Batching Wastes Your GPU

Conventional batching collects a set of requests, processes them as a unit through the model until every sequence has finished generating, then moves to the next batch. This works fine for image classification, where every input produces exactly one output of fixed size. For language generation, it's a disaster.

The problem is output length variance. A chat request might complete in 15 tokens. A code generation request in the same batch might run to 800 tokens. For the 785 iterations after the short request finishes, its allocated GPU memory and compute slots sit idle—padding—while the batch waits for the longest sequence to terminate. You're paying full throughput cost while the utilization curves show 30–60% GPU engagement.

Dynamic batching improves on this by grouping requests within a time window (say, 50ms) to reduce admission latency, but the batch-level scheduling problem remains: once the window closes, the batch runs as a unit until the last sequence completes.

Continuous batching solves this by moving the scheduling decision from the request granularity to the iteration granularity. The scheduler runs once per model forward pass, not once per request. When a sequence emits an end-of-sequence token and finishes, its memory slot is freed immediately. The next waiting request is inserted into the batch before the next iteration begins. No request waits for another request to finish—the batch composition changes on every decoding step.

The throughput implication is significant. The ORCA paper (OSDI 2022), which introduced iteration-level scheduling at scale, demonstrated 36.9x throughput improvement over FasterTransformer at equivalent latency targets. Anyscale's real-world benchmarks showed 8x improvement over naive HuggingFace Transformers serving. Combined with PagedAttention-based KV cache management, vLLM's original release reached 24x higher throughput than HuggingFace Transformers and 3.5x over HuggingFace TGI.

How the Scheduler Actually Works

Each forward pass, the continuous batching scheduler executes a short loop:

  1. Scan the running batch for sequences that completed (EOS emitted)
  2. Free their KV cache blocks
  3. Pull waiting requests from the queue—as many as memory and batch-size limits allow
  4. Concatenate all active sequences into a single compound batch
  5. Run one model forward pass; each sequence produces its next token
  6. Repeat
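The loop above can be sketched in a few lines of Python. This is a toy simulation, not any runtime's real internals: the "model" is faked by giving each running sequence a 20% chance of emitting EOS per step, and all names are illustrative.

```python
import random
from collections import deque

random.seed(0)

MAX_BATCH = 4
waiting = deque(f"req{i}" for i in range(10))
running = {}      # request id -> tokens generated so far
finished = []

def forward_pass(batch):
    """One simulated decode step: returns, per sequence, whether it emitted EOS."""
    return {rid: random.random() < 0.2 for rid in batch}

while waiting or running:
    # Step 3: admit waiting requests up to the batch limit -- on every
    # iteration, not once per batch as static batching would
    while waiting and len(running) < MAX_BATCH:
        running[waiting.popleft()] = 0
    # Step 5: one forward pass; every active sequence emits one token
    eos = forward_pass(list(running))
    for rid in list(running):
        running[rid] += 1
        if eos[rid]:
            # Steps 1-2: sequence finished; retire it and free its slot immediately
            finished.append(rid)
            del running[rid]

print(len(finished))    # 10 -- every request completed
```

The key property to notice: admission happens inside the iteration loop, so a freed slot is refilled on the very next forward pass.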

The concatenation step is what makes this structurally different. Static batching requires sequences to be padded to equal length because batched matrix operations need uniform tensor shapes. Continuous batching instead constructs a "super-sequence" with attention masks that prevent any request from attending to any other request's tokens. No padding, no wasted computation—every GPU FLOP processes a real token.

This concatenated formulation integrates naturally with FlashAttention's variable-length kernel variants, which process all sequences in a single GPU kernel call despite different lengths. The result is high GPU occupancy even when the batch contains a mix of short and long in-progress generations.
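Variable-length attention kernels typically take the packed token stream plus a vector of cumulative sequence-length offsets (the `cu_seqlens` convention). A minimal sketch of building those offsets, with illustrative lengths:

```python
# Packing variable-length sequences without padding: concatenate all tokens
# and record cumulative offsets. Each sequence i occupies the token range
# offsets[i]:offsets[i+1], so the kernel never lets attention cross a
# sequence boundary and no padding token exists.
seq_lens = [15, 800, 3, 120]          # tokens per active sequence

def pack_offsets(lengths):
    offsets = [0]
    for n in lengths:
        offsets.append(offsets[-1] + n)
    return offsets

cu_seqlens = pack_offsets(seq_lens)   # [0, 15, 815, 818, 938]
total_tokens = cu_seqlens[-1]         # 938 real tokens, zero padding
```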

PagedAttention: The Memory Management Layer

Continuous batching addresses when to schedule requests. PagedAttention (vLLM, SOSP 2023) addresses where to store their KV caches.

Prior to PagedAttention, LLM serving frameworks pre-allocated a contiguous block of GPU memory for each request sized to its maximum possible output length. This caused 60–80% memory waste from fragmentation: over-reserved slots, alignment gaps, and the simple fact that most sequences don't use their maximum allocation.

PagedAttention applies the OS virtual memory paging model to KV cache management. KV cache is partitioned into fixed-size blocks (16 tokens per block in vLLM's default), allocated on-demand as sequences generate tokens rather than up-front. A block table maps each sequence's logical blocks to physical GPU memory locations—blocks need not be contiguous. Memory waste drops to under 4% (only the last partially-filled block is wasted per sequence).
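The block-table idea can be sketched with a toy allocator: demand-page a block only when the previous one fills, and keep a per-sequence mapping from logical to physical block ids. This is illustrative structure, not vLLM's implementation.

```python
BLOCK = 16   # tokens per KV block (vLLM's default)

class BlockAllocator:
    """Toy paged KV allocator: a free-list of physical block ids."""
    def __init__(self, num_blocks):
        self.free = list(range(num_blocks))
    def alloc(self):
        return self.free.pop()

class Sequence:
    def __init__(self, allocator):
        self.allocator = allocator
        self.block_table = []   # logical block index -> physical block id
        self.tokens = 0
    def append_token(self):
        if self.tokens % BLOCK == 0:   # last block full: demand-page one more
            self.block_table.append(self.allocator.alloc())
        self.tokens += 1

pool = BlockAllocator(num_blocks=64)
seq = Sequence(pool)
for _ in range(50):
    seq.append_token()

# 50 tokens need ceil(50/16) = 4 blocks; the only waste is the tail
# of the final block
print(len(seq.block_table), 4 * BLOCK - seq.tokens)   # 4 14
```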

The second benefit: physical blocks can be shared across sequences via copy-on-write semantics. Beam search branches, parallel samples, and requests sharing a common system prompt can all reference the same physical KV blocks until they diverge. For beam search, this reduces overhead by up to 55% and yields up to 2.2x throughput improvement over unshared allocation.
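Copy-on-write sharing can be sketched with reference counts: forking a sequence shares its physical blocks, and a write to a shared block first copies it. Again illustrative, not vLLM's actual code.

```python
class CowPool:
    """Toy refcounted block pool with copy-on-write semantics."""
    def __init__(self, n):
        self.free = list(range(n))
        self.refs = {}
    def alloc(self):
        b = self.free.pop()
        self.refs[b] = 1
        return b
    def fork(self, table):
        for b in table:
            self.refs[b] += 1       # child shares the same physical blocks
        return list(table)
    def write(self, table, i):
        b = table[i]
        if self.refs[b] > 1:        # shared: copy before writing
            self.refs[b] -= 1
            table[i] = self.alloc()
        return table[i]

pool = CowPool(8)
parent = [pool.alloc(), pool.alloc()]   # two prompt blocks
child = pool.fork(parent)               # e.g. a second beam or sample
pool.write(child, 1)                    # child diverges on its last block only
shared = sum(1 for a, b in zip(parent, child) if a == b)
print(shared)    # 1 block still physically shared
```

Until the write, the fork costs no KV memory at all; after it, only the diverged block is duplicated.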

SGLang extended this further with RadixAttention: a radix tree data structure that maintains KV cache across different requests, enabling automatic prefix reuse. Requests sharing a system prompt, few-shot examples, or RAG context reuse each other's cached KV blocks rather than recomputing them. On workloads with heavy prefix sharing, this delivers up to 5x faster inference.
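The payoff of prefix reuse can be illustrated with a toy token-level trie (a simplification of RadixAttention's radix tree; names are hypothetical): inserting a new prompt reports how many of its leading tokens are already cached from earlier requests.

```python
class PrefixCache:
    """Toy prefix cache: a trie over token ids."""
    def __init__(self):
        self.root = {}
    def insert(self, tokens):
        """Cache a prompt; return how many leading tokens were already present."""
        node, hit = self.root, 0
        for t in tokens:
            if t in node:
                hit += 1
            else:
                node[t] = {}
            node = node[t]
        return hit

cache = PrefixCache()
system = list(range(100))                # a shared 100-token system prompt
cache.insert(system + [500, 501])        # first request: nothing cached yet
reused = cache.insert(system + [600])    # second request hits the shared prefix
print(reused)    # 100 tokens of prefill skipped
```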

The Tradeoff Curve: When Continuous Batching Helps and When It Doesn't

Continuous batching's benefit scales directly with output length variance. On a workload where every request generates exactly 50 tokens, the advantage over static batching shrinks toward zero—there's no idle padding to eliminate. On a mixed chat workload where outputs range from 5 to 1000 tokens, the improvement is 4–8x.

The workload fit looks like this:

  • Chat, agents, interactive assistants: high variance, many concurrent users, mixed short/long responses. Continuous batching is strongly beneficial here—this is the workload it was designed for.
  • Online APIs under variable QPS: sequences are admitted immediately rather than waiting for a batch to fill, which significantly reduces time to first token (TTFT) at moderate load.
  • RAG pipelines with shared prefixes: RadixAttention (SGLang) or prefix caching (vLLM) compound the benefit via KV reuse across requests.
  • Offline batch inference with homogeneous outputs: static batching is competitive here and often simpler. When sequence lengths are known in advance and uniform, the scheduler overhead of continuous batching offers little gain.
  • Very low QPS (single-digit requests per second): all approaches perform similarly; the scheduling overhead of continuous batching matters more than the utilization benefit.

The interference problem at high concurrency is the most important nuance practitioners miss. When a long-context request enters the batch—a 32K-token RAG document, a long code file—its prefill computation is compute-bound and saturates GPU matrix multiply units for many milliseconds. While that prefill runs, the decode iterations of all currently active sequences stall. Users with in-progress generations see sudden TBT (time between tokens) spikes. This is the prefill-decode interference problem.

The standard mitigation is chunked prefill: rather than processing a long prompt in one shot, split it into chunks (512 tokens each, a common default) and interleave each chunk with regular decode iterations. vLLM exposes this via --enable-chunked-prefill. The tradeoff is a slight reduction in pure throughput in exchange for much more predictable TBT for active sessions.
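The interleaving can be sketched as a scheduling plan: each prefill chunk is followed by a decode step for the already-running batch, so no in-flight sequence stalls for the full prompt. Chunk size and structure are illustrative.

```python
CHUNK = 512   # tokens of new prompt processed per iteration

def schedule(prompt_len, chunk=CHUNK):
    """Yield the iteration plan for admitting one long prompt."""
    done = 0
    while done < prompt_len:
        n = min(chunk, prompt_len - done)
        done += n
        yield ("prefill", n)     # one chunk of the new prompt
        yield ("decode", 1)      # one token for every active sequence

plan = list(schedule(2000))
prefill_steps = [n for kind, n in plan if kind == "prefill"]
print(prefill_steps)    # [512, 512, 512, 464]
```

A 2000-token prompt thus costs four iterations of prefill instead of one, but active users get a decode step between each chunk: the predictable-TBT-for-throughput trade described above.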

A more aggressive solution—prefill-decode disaggregation (the DistServe architecture)—assigns prefill computation and decode computation to entirely separate GPU pools, eliminating interference at the cost of KV cache transfer overhead between pools. This pattern is emerging as the standard for large-scale deployments where interactive and long-context workloads share infrastructure.

Failure Modes at High Concurrency

Understanding the failure modes is what separates teams that operate continuous batching well from teams that don't.

The Eviction Cascade

The most severe production failure happens when long-context requests overwhelm KV cache capacity. The sequence:

  1. Several long-context requests (RAG, document analysis) consume large KV cache blocks
  2. Their memory pressure forces the scheduler to preempt shorter-context requests—their KV blocks are evicted, they're sent back to the waiting queue
  3. When memory frees, preempted requests re-enter the queue and must recompute their KV cache from scratch—including all tokens they already generated
  4. If new requests keep arriving, the GPU spends all its cycles computing prefills for preempted sequences, producing zero new output tokens

The visible signature: GPU utilization at 100%, throughput flat or declining, preemption counters rising, P99 latency spiking from 200ms to several seconds. The system has entered a state where it's busier than ever but doing no useful work.

Prevention requires workload segregation: never route long-context requests (RAG with 10K+ context) to the same serving instance as short-context interactive requests. The capacity math is unforgiving—on a 40GB A100 serving Llama 13B, the model weights consume ~26GB, leaving ~14GB for KV cache. At 2048-token average sequence length, that's roughly 7 concurrent sequences before preemptions begin.
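The capacity arithmetic above can be checked directly. The sketch below assumes fp16 weights and KV cache and the published Llama 13B shape (40 layers, 5120 hidden size); it ignores activation and allocator overhead, which is what pushes the practical figure from the raw 8 down toward the ~7 cited above.

```python
layers, hidden = 40, 5120
bytes_fp16 = 2

kv_per_token = 2 * layers * hidden * bytes_fp16   # K and V, every layer
per_seq = kv_per_token * 2048                     # a 2048-token sequence
free_for_kv = 14 * 10**9                          # ~40GB card - ~26GB weights

print(kv_per_token, round(per_seq / 1e9, 2), free_for_kv // per_seq)
# ~0.82 MB per token, ~1.68 GB per sequence, 8 sequences of raw capacity
```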

Memory-Based Head-of-Line Blocking

Continuous batching eliminates the classic head-of-line blocking problem where short requests wait behind long ones. But a subtler form persists at saturation.

When GPU memory is exhausted and the waiting queue cannot be admitted, FIFO-ordered queues stall entirely—even if shorter requests further back in the queue would fit in memory. Systems enforcing fairness constraints can't reorder the queue. Research has documented HOL blocking times reaching tens of seconds under continuous batching at sustained saturation. Properly designed schedulers detect this condition and either admit shorter requests out of FIFO order or apply backpressure at the API gateway before queue depth builds.
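One shape such a scheduler can take is memory-aware, out-of-FIFO admission: when the head of the queue doesn't fit in free KV memory, scan deeper for a request that does. This is an illustrative policy sketch, not any runtime's actual code.

```python
def admit(queue, free_blocks, blocks_needed):
    """Return the index of the first queued request that fits, else None."""
    for i, req in enumerate(queue):
        if blocks_needed(req) <= free_blocks:
            return i
    return None

# A long-context request stalls the head; a short one behind it would fit.
queue = [{"id": "big", "prompt": 8192}, {"id": "small", "prompt": 256}]
idx = admit(queue, free_blocks=40,
            blocks_needed=lambda r: r["prompt"] // 16)   # 16-token blocks
print(queue[idx]["id"])    # "small" is admitted past the stalled head
```

A real system would also bound how long the head can be bypassed, to avoid starving the long request entirely.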

Block Size Fragmentation

vLLM's default KV cache block size of 16 tokens is intentional. Increasing it to 128 tokens—a tempting configuration change for workloads with long contexts—causes significant internal fragmentation under high concurrency: with 256 concurrent sequences, 128-token blocks waste roughly 16,000 token slots that could fit additional sequences. Block size changes rarely improve throughput and frequently hurt it.
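The fragmentation figure follows from simple expected-value arithmetic: each sequence's last block is, on average, half empty, so waste scales linearly with block size.

```python
def expected_waste(block_size, num_seqs):
    """Expected internal fragmentation: half of each sequence's last block."""
    return num_seqs * block_size // 2

print(expected_waste(16, 256), expected_waste(128, 256))
# 2048 vs 16384 wasted token slots -- the ~16,000 figure cited above
```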

Choosing and Configuring a Continuous Batching Runtime

Current production options and their tradeoffs:

vLLM is the right default for most teams: widest model support, straightforward deployment, active development, and the v0.6.0 release (September 2024) resolved significant CPU overhead issues—before that release, only 38% of wall time was actual GPU computation; HTTP server overhead alone consumed 33%. Post-v0.6.0, Llama 8B shows 2.7x higher throughput and 5x faster time-per-output-token versus prior versions.

TensorRT-LLM delivers the highest raw throughput on NVIDIA hardware once compiled, but requires a compilation step that can take hours and is not trivially portable. Best suited for long-running single-model deployments on controlled hardware.

SGLang performs comparably to TensorRT-LLM on modern hardware and adds RadixAttention for cross-request KV reuse—a decisive advantage for chatbot and RAG workloads with heavy prefix sharing. LMSYS benchmarks showed SGLang achieving up to 3.1x higher throughput than vLLM on Llama-70B in 2024.

Critical configuration parameters for vLLM:

  • max_num_seqs: controls maximum concurrent sequences. Lowering this when seeing high preemption rates reduces cascade risk at the cost of throughput.
  • --enable-chunked-prefill: strongly recommended for mixed interactive/long-context workloads. Interleaves prefill chunks with decode to prevent TBT spikes.
  • gpu_memory_utilization (default 0.9): adjusting upward can increase KV cache capacity but leaves less headroom for model weight overhead. Don't push above 0.95.

For monitoring, track vllm:num_preemptions as the primary health signal. A rising preemption rate is an early warning of approaching the eviction cascade. Set an alert before the metric becomes critical—by the time throughput visibly degrades, the cascade is already underway.

The Scheduling Problem Is Not Fully Solved

Continuous batching brought LLM serving from 30–60% GPU utilization to 80–95% under realistic workloads. The ORCA paper's insight—that iteration-level scheduling matches the statistical structure of language generation better than request-level scheduling—was a genuine step change.

The open problems are at the edges: handling very mixed workloads without interference, distributing KV cache across disaggregated prefill and decode pools efficiently, and scheduling across multiple nodes without head-of-line blocking at the inter-node level. Each of these is an active research and engineering area, with chunked prefill and disaggregation being the current frontier answers.

For most production teams today, the practical action is simpler: ensure your serving runtime uses continuous batching (vLLM, SGLang, TGI, and TensorRT-LLM all do by default), segregate your long-context and short-context workloads onto separate instances, enable chunked prefill if you see TBT spikes, and watch your preemption counters. The hardware you already have will serve significantly more traffic.
