
GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

· 10 min read
Tian Pan
Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.

Why Kubernetes Gets This Wrong

Kubernetes treats GPUs as atomic units. A pod requesting nvidia.com/gpu: 1 receives an entire physical GPU—40GB or 80GB of HBM on an A100, 80GB on an H100—regardless of how much the workload actually needs. This was a reasonable design decision when workloads were training runs or batch inference jobs that genuinely consumed the whole device. It breaks down completely for mixed serving scenarios.

The specific ways it breaks:

No KV cache visibility. The Kubernetes scheduler has zero awareness of how full a running pod's KV cache is. In LLM inference, KV cache occupancy is the single most important resource constraint at runtime—it determines how many concurrent requests the instance can handle and whether new requests will queue or fail. Routing a long-context request to a pod with 90% cache utilization produces dramatically different latency than routing it to one at 10%. The scheduler cannot see this distinction at all.

Model loading latency is invisible. Spinning up a new inference pod isn't instantaneous. Container image pulls alone can take 1–5 minutes for images containing multi-gigabyte model weights. Then there's weight deserialization on CPU, transfer to GPU memory, and a warm-up forward pass. Total cold-start time in production routinely exceeds five minutes. Kubernetes will happily schedule a pod on a node that will make users wait minutes for a response, because it has no concept of model loading state.

Autoscaling watches the wrong signals. Horizontal Pod Autoscaler triggers on CPU and memory metrics. At the threshold where LLM inference most needs to scale—a deep queue of requests waiting for token generation—memory utilization might be 30% (normal for a GPU-bound inference server) and CPU utilization flat. The system looks healthy while users experience unbounded queue delays.
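A minimal sketch makes the mismatch concrete: HPA's default view against a queue-depth signal. All numbers and both helper functions are hypothetical, not any controller's real logic:

```python
import math

def hpa_would_scale(cpu_util: float, mem_util: float, target: float = 0.8) -> bool:
    # HPA's default view: host CPU/memory against a utilization target.
    return cpu_util > target or mem_util > target

def queue_aware_replicas(queue_depth: int, target_queue_per_replica: int = 4) -> int:
    # A queue-depth signal: size replicas to bound requests waiting for tokens.
    return max(1, math.ceil(queue_depth / target_queue_per_replica))

# A GPU-bound server with a deep queue: CPU flat, host memory ~30%.
print(hpa_would_scale(cpu_util=0.05, mem_util=0.30))  # False — HPA sees "healthy"
print(queue_aware_replicas(queue_depth=40))           # 10 — the queue says otherwise
```

The Gateway API Inference Extension discussed later exposes exactly this kind of queue-depth metric to HPA.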

The consequence: teams overprovision heavily, keeping warm instances for every model they serve, paying for idle GPU time to avoid the cold-start penalty. This isn't a configuration problem. It's a fundamental gap between what Kubernetes exposes and what LLM scheduling needs.

The Memory Fragmentation Problem

Even before requests arrive, mixed-model clusters lose significant capacity to fragmentation.

A vLLM instance serving an 8B model might pre-allocate 60GB of GPU memory while only using 35–40GB under typical load. This isn't waste in the traditional sense—the allocation is correct given the potential worst-case context length—but it means other models cannot share that device even when it's underutilized.

The numbers get starker when you look at KV cache requirements specifically. A 7B model with 100,000-token context needs roughly 50GB for KV cache—versus 14GB for the model weights themselves. Three-quarters of memory consumption is transient context state, not the model. Traditional static allocation treats this as permanent, blocking the capacity from other uses.
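The arithmetic behind those numbers can be checked directly. This sketch assumes a Llama-style 7B layout (32 layers, 32 KV heads of dimension 128); the function is illustrative, not any framework's API:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Per token, each layer stores one key and one value vector:
    # 2 * kv_heads * head_dim elements.
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem

# 100k-token context at FP16 (2 bytes/element):
print(kv_cache_bytes(100_000) / 1e9)   # 52.4288 — ~50 GB vs ~14 GB of FP16 weights
# The same context with an FP8 KV cache (1 byte/element) halves it:
print(kv_cache_bytes(100_000, bytes_per_elem=1) / 1e9)   # 26.2144
```

The second call is the quantization lever in action: halving bytes per element halves transient cache state, without touching the weights.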

Before PagedAttention (the core innovation in vLLM), systems wasted 60–80% of allocated KV cache memory through fragmentation: non-contiguous allocations, worst-case pre-allocation, internal fragmentation between requests. PagedAttention applies the same paging technique operating systems use for virtual memory, reducing KV cache waste to under 4%. This enabled 2–4x throughput on the same hardware—not by adding GPUs, but by using existing ones better.
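The paging idea reduces to fixed-size blocks, allocated on demand, indexed by a per-sequence block table. This is a toy model of the technique, not vLLM's implementation:

```python
class PagedKVCache:
    """Toy paged allocator: fixed-size physical blocks handed out on demand."""
    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(total_blocks))   # pool of physical block ids
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:            # last block full: take a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # A finished sequence returns its blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_blocks=1024)
for _ in range(100):
    cache.append_token(seq_id=0)
# 100 tokens occupy ceil(100/16) = 7 blocks; waste is bounded to less than
# one block per sequence, instead of a worst-case contiguous reservation.
print(len(cache.tables[0]))   # 7
```

Blocks need not be contiguous, so the fragmentation modes listed above—non-contiguous gaps, worst-case pre-allocation, internal slack between requests—all collapse into at most one partially filled block per sequence.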

Quantization adds another lever. Moving from FP16 to FP8 for KV cache reduces cache size by 2x with minimal quality impact; NVFP4 reduces it further, enabling larger batches or longer contexts on the same device. But quantization alone doesn't solve the scheduling problem—it just changes the numbers the scheduler needs to track.

What Actually Reclaims Stranded Capacity

Three strategies genuinely help, in roughly increasing order of implementation complexity.

Time-Slicing and MIG: Partitioning the GPU

When you need to run multiple small models on a single large GPU, partitioning is the most direct approach.

Time-slicing (available on all NVIDIA architectures) uses round-robin context switching between workloads, each getting 1–2ms before yielding. It's simple to enable and works universally, but the context-switch overhead is real and accumulates under sustained load. Better for bursty, latency-tolerant workloads than continuous inference.
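A back-of-the-envelope model shows how that overhead accumulates; the slice and switch-cost figures here are illustrative assumptions, not measured values:

```python
def time_slice_efficiency(slice_ms: float, switch_ms: float) -> float:
    # Fraction of GPU time spent on useful work under round-robin slicing.
    return slice_ms / (slice_ms + switch_ms)

# A 1.5 ms slice with a hypothetical 0.1 ms context-switch cost:
print(time_slice_efficiency(1.5, 0.1))   # 0.9375 — ~6% lost to switching
```

For bursty traffic that tax is paid rarely; under sustained inference load it is paid on every slice, which is why the paragraph above steers continuous workloads toward MIG.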

Multi-Instance GPU (MIG) (Ampere and newer only) partitions the physical GPU at the hardware level, giving each instance isolated compute cores, dedicated memory, and its own L2 cache. No context-switch overhead, guaranteed resource isolation. The tradeoff is inflexibility: MIG partition sizes are fixed at provisioning time, and any mismatch between partition size and workload size wastes capacity.

A hybrid approach—MIG partitions plus time-slicing within each partition—combines their strengths. One benchmark showed 6.2x throughput improvement and nearly 6x energy savings for general workloads; LLM-specific workloads showed more modest 1.4x throughput gains. The range reflects how sensitive these numbers are to workload characteristics.

Neither MIG nor time-slicing touches the scheduling decisions that determine which request goes to which pod. They're resource provisioning strategies, not routing intelligence.

Continuous Batching: Eliminate Head-of-Line Blocking

Static batching—waiting for a fixed number of requests before starting a batch—is the default behavior inherited from traditional ML inference. It creates head-of-line blocking: an early request with a long prompt holds the batch until it finishes, blocking all subsequent requests regardless of their size.

Continuous batching eliminates this by running iteration-level scheduling. Each time a sequence completes a decoding step, the batch can absorb a new request from the queue. No request waits for another to finish its full generation. The GPU stays busy, short requests don't queue behind long ones, and throughput scales with load rather than batch configuration.
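A toy event loop makes iteration-level scheduling concrete: each decode step advances every in-flight sequence and admits queued requests into freed slots. Illustrative only—real schedulers also respect KV-cache capacity:

```python
from collections import deque

def continuous_batching(requests, max_batch: int = 4):
    """requests: list of (req_id, tokens_to_generate). Returns completion order."""
    queue = deque(requests)
    in_flight = {}            # req_id -> tokens still to generate
    done = []
    while queue or in_flight:
        # Admit new work at iteration granularity, not batch boundaries.
        while queue and len(in_flight) < max_batch:
            rid, toks = queue.popleft()
            in_flight[rid] = toks
        # One decode step for every in-flight sequence.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] == 0:
                done.append(rid)
                del in_flight[rid]   # freed slot is reusable next iteration
    return done

# A long request ("A", 100 tokens) no longer blocks the short ones behind it:
print(continuous_batching([("A", 100), ("B", 3), ("C", 3), ("D", 3), ("E", 3)]))
# → ['B', 'C', 'D', 'E', 'A']
```

Under static batching, E would wait for the entire first batch—including A's 100 tokens—before generating anything.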

This isn't optional for production LLM serving. It's table stakes, and vLLM implemented it early. The gap between static and continuous batching at moderate concurrency is 2–4x throughput on identical hardware.

KV-Cache-Aware Routing: The New Frontier

The most impactful—and least commonly deployed—optimization is routing requests to pods that already hold relevant context in their KV cache.

The gains are significant. If a pod already has the first 4,000 tokens of a user's conversation cached, routing the next request in that conversation to the same pod reduces time-to-first-token by up to 74% compared to routing it to a cold pod. At enterprise scale—150 customers with 6,000-token shared context prefixes, total KV-cache demand at 73% of cluster capacity—intelligent routing can make the difference between a system that works and one that doesn't.

The Kubernetes Gateway API Inference Extension (GA in early 2026) added the primitives needed for this: model-aware routing, custom metrics for HPA based on queue depth and KV-cache utilization, and traffic splitting by model. Built on top of this, llm-d provides a Kubernetes-native inference scheduler that filters and scores pods by KV-cache state, prefill/decode phase, SLA constraints, and current load. When a request arrives, it can identify which pod holds the most relevant cached context and route there—turning the cluster's distributed KV cache into a coherent resource rather than isolated silos.

The catch is operational complexity. KV-cache-aware routing requires the gateway to know which pods have cached which prefixes. This state must be maintained, synchronized, and invalidated as caches evict old blocks. It's a non-trivial distributed systems problem.
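Stripped of the distributed-state machinery, the scoring idea is: per pod, find the longest cached prefix matching the incoming request, then trade that against load. A simplified sketch—the pod metadata and weights are hypothetical, and real schedulers like llm-d track cache state at block granularity:

```python
def shared_prefix_len(a: list, b: list) -> int:
    # Length of the common prefix between two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_pod(request_tokens, pods):
    """pods: list of (name, cached_tokens, load_fraction). Returns a pod name."""
    def score(pod):
        name, cached, load = pod
        hit = shared_prefix_len(request_tokens, cached)
        # Reward cache reuse (skipped prefill compute), penalize current load.
        return hit - 1000 * load
    return max(pods, key=score)[0]

conversation = list(range(4000)) + [9001]   # 4000 prior tokens + one new token
pods = [
    ("pod-a", list(range(4000)), 0.5),      # warm: holds the conversation prefix
    ("pod-b", [], 0.1),                     # cold but nearly idle
]
print(pick_pod(conversation, pods))   # pod-a — the 4000-token hit outweighs its load
```

The hard part the paragraph above names is keeping `cached_tokens` accurate across evictions—the sketch assumes that state is simply available.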

The Cold-Start Trap

All of the above applies to steady-state serving. The cold-start problem operates at a different timescale and undermines dynamic scaling strategies.

Breaking down a typical cold-start in production:

  • VM or node provisioning: 30–60 seconds
  • Container initialization: 5–10 seconds
  • Image pull: 1–5 minutes (10–30GB images)
  • Runtime startup and weight fetch: 30–60 seconds
  • CPU deserialization: 30–120 seconds
  • GPU memory allocation and weight transfer: 15–80 seconds
  • Warm-up forward pass: 5–30 seconds

Total: often exceeding 5 minutes from zero to first token.
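Summing the ranges above bears that out:

```python
# The cold-start stages listed above, as (low, high) ranges in seconds.
stages = {
    "node provisioning": (30, 60),
    "container init": (5, 10),
    "image pull": (60, 300),
    "runtime startup + weight fetch": (30, 60),
    "CPU deserialization": (30, 120),
    "GPU alloc + transfer": (15, 80),
    "warm-up pass": (5, 30),
}
lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(lo / 60, hi / 60)   # ~2.9 to 11.0 minutes from zero to first token
```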

A system that scales to zero to save money will violate SLOs during any traffic spike. A system that keeps instances warm to avoid cold starts pays for idle GPUs continuously. The industry is converging on predictive autoscaling as the answer: start instances before demand arrives by forecasting load from historical patterns and leading indicators. DynamoLLM (HPCA 2025) demonstrated this approach, predicting peak load for the next scheduling epoch and proactively provisioning instances. Lazy image pulling (stargz Snapshotter) and direct-to-GPU weight streaming can cut cold-start time by 60–80% on top of that.
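The predictive idea in miniature: forecast the next epoch's load from recent history and start instances early, so the multi-minute cold start overlaps with lead time instead of user requests. A sketch with a naive linear-trend forecast—DynamoLLM's actual predictor is far more sophisticated, and every number here is hypothetical:

```python
import math

def forecast_next_epoch(history: list) -> float:
    # Naive trend extrapolation from the last two epochs of request rate.
    if len(history) < 2:
        return history[-1] if history else 0.0
    trend = history[-1] - history[-2]
    return max(0.0, history[-1] + trend)

def instances_to_prewarm(history, rps_per_instance: float, current: int) -> int:
    predicted = forecast_next_epoch(history)
    needed = math.ceil(predicted / rps_per_instance)
    # Start these now: they'll be warm by the time the spike lands.
    return max(0, needed - current)

# Ramping traffic: 40 -> 60 -> 80 req/s; each instance handles ~10 req/s.
print(instances_to_prewarm([40, 60, 80], rps_per_instance=10, current=8))
# → 2 extra instances ahead of a predicted 100 req/s
```

Lazy pulling and direct-to-GPU streaming then shrink how much lead time each pre-warmed instance actually needs.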

Choosing a Serving Framework

Your scheduling strategy interacts directly with your inference framework choice.

vLLM is the current standard for high-throughput scenarios. PagedAttention, continuous batching, and prefix caching are all native. Highest throughput at scale.

SGLang outperforms vLLM by 10–20% for workloads with shared prompt prefixes—conversational AI, agent workflows, anything where multiple requests share a long common prefix. RadixAttention (its equivalent of prefix caching) is more aggressive about reuse.

TGI entered maintenance mode in December 2025. Hugging Face recommends vLLM or SGLang for new deployments. If you're running TGI in production, plan your migration.

Triton / NVIDIA Dynamo targets multi-framework clusters where you're serving models from different frameworks simultaneously. Not LLM-optimized, but the right choice for multi-modal or mixed-framework serving.

Ray Serve is a general-purpose Python serving framework—useful for flexibility, suboptimal without manual optimization for LLM-specific concerns.

What Actually Matters

The practical hierarchy, in order of impact:

  1. Switch to continuous batching. If you're still using static batching, this is your highest-leverage change. 2–4x throughput, same hardware, no scheduling changes required.

  2. Adopt PagedAttention. vLLM's memory management reduces KV cache waste from 60–80% to under 4%. This unlocks the capacity that fragmentation was hiding.

  3. Enable prefix caching. For workloads with repeated system prompts or shared context, automatic prefix caching can reduce time-to-first-token by 74% on cache hits. It's a configuration flag in vLLM; turn it on.

  4. Replace Kubernetes GPU atomicity with partitioning. MIG or time-slicing for models that don't need a full GPU. This is operationally straightforward and reclaims 1.4–6x capacity depending on workload mix.

  5. Add KV-cache-aware routing. Higher complexity, higher reward. Start with the Gateway API Inference Extension if you're Kubernetes-native; llm-d adds the scheduling intelligence on top.

  6. Address cold starts with predictive scaling. Once the above are in place, cold-start latency becomes the dominant failure mode under load spikes. Lazy pulling and predictive autoscaling close that gap.
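For the vLLM items in the list above, here is roughly how they surface in its offline Python API—a sketch assuming a recent vLLM release (the model name is only an example); continuous batching and PagedAttention are on by default, so prefix caching is the one flag to flip:

```python
# Sketch assuming vLLM's Python API; requires a GPU and vLLM installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,      # item 3: automatic prefix caching
    gpu_memory_utilization=0.90,     # fraction of HBM given to the paged KV cache
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```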

The teams who build scalable LLM serving infrastructure don't do it by adding GPUs. They do it by making the GPUs they have stop being idle while marked as allocated.
