
GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

· 10 min read
Tian Pan
Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.

Why Kubernetes Gets This Wrong

Kubernetes treats GPUs as atomic units. A pod requesting nvidia.com/gpu: 1 receives an entire physical GPU—40GB or 80GB of HBM on an A100, 80GB on an H100—regardless of how much the workload actually needs. This was a reasonable design decision when workloads were training runs or batch inference jobs that genuinely consumed the whole device. It breaks down completely for mixed serving scenarios.

The specific ways it breaks:

No KV cache visibility. The Kubernetes scheduler has zero awareness of how full a running pod's KV cache is. In LLM inference, KV cache occupancy is the single most important resource constraint at runtime—it determines how many concurrent requests the instance can handle and whether new requests will queue or fail. Routing a long-context request to a pod with 90% cache utilization produces dramatically different latency than routing it to one at 10%. The scheduler cannot see this distinction at all.

Model loading latency is invisible. Spinning up a new inference pod isn't instantaneous. Container image pulls alone can take 1–5 minutes for images containing multi-gigabyte model weights. Then there's weight deserialization on CPU, transfer to GPU memory, and a warm-up forward pass. Total cold-start time in production routinely exceeds five minutes. Kubernetes will happily schedule a pod on a node that will make users wait minutes for a response, because it has no concept of model loading state.

Autoscaling watches the wrong signals. Horizontal Pod Autoscaler triggers on CPU and memory metrics. At the threshold where LLM inference most needs to scale—a deep queue of requests waiting for token generation—memory utilization might be 30% (normal for a GPU-bound inference server) and CPU utilization flat. The system looks healthy while users experience unbounded queue delays.
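A minimal sketch makes the mismatch concrete: HPA's default view against a queue-depth signal. All numbers and both helper functions are hypothetical, not any controller's real logic:

```python
import math

def hpa_would_scale(cpu_util: float, mem_util: float, target: float = 0.8) -> bool:
    # HPA's default view: host CPU/memory against a utilization target.
    return cpu_util > target or mem_util > target

def queue_aware_replicas(queue_depth: int, target_queue_per_replica: int = 4) -> int:
    # A queue-depth signal: size replicas to bound requests waiting for tokens.
    return max(1, math.ceil(queue_depth / target_queue_per_replica))

# A GPU-bound server with a deep queue: CPU flat, host memory ~30%.
print(hpa_would_scale(cpu_util=0.05, mem_util=0.30))  # False — HPA sees "healthy"
print(queue_aware_replicas(queue_depth=40))           # 10 — the queue says otherwise
```

The Gateway API Inference Extension discussed later exposes exactly this kind of queue-depth metric to HPA.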

The consequence: teams overprovision heavily, keeping warm instances for every model they serve, paying for idle GPU time to avoid the cold-start penalty. This isn't a configuration problem. It's a fundamental gap between what Kubernetes exposes and what LLM scheduling needs.

The Memory Fragmentation Problem

Even before requests arrive, mixed-model clusters lose significant capacity to fragmentation.

A vLLM instance serving an 8B model might pre-allocate 60GB of GPU memory while only using 35–40GB under typical load. This isn't waste in the traditional sense—the allocation is correct given the potential worst-case context length—but it means other models cannot share that device even when it's underutilized.

The numbers get starker when you look at KV cache requirements specifically. A 7B model with 100,000-token context needs roughly 50GB for KV cache—versus 14GB for the model weights themselves. Three-quarters of memory consumption is transient context state, not the model. Traditional static allocation treats this as permanent, blocking the capacity from other uses.
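The arithmetic behind those numbers can be checked directly. This sketch assumes a Llama-style 7B layout (32 layers, 32 KV heads of dimension 128); the function is illustrative, not any framework's API:

```python
def kv_cache_bytes(tokens: int, layers: int = 32, kv_heads: int = 32,
                   head_dim: int = 128, bytes_per_elem: int = 2) -> int:
    # Per token, each layer stores one key and one value vector:
    # 2 * kv_heads * head_dim elements.
    return tokens * layers * 2 * kv_heads * head_dim * bytes_per_elem

# 100k-token context at FP16 (2 bytes/element):
print(kv_cache_bytes(100_000) / 1e9)   # 52.4288 — ~50 GB vs ~14 GB of FP16 weights
# The same context with an FP8 KV cache (1 byte/element) halves it:
print(kv_cache_bytes(100_000, bytes_per_elem=1) / 1e9)   # 26.2144
```

The second call is the quantization lever in action: halving bytes per element halves transient cache state, without touching the weights.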

Before PagedAttention (the core innovation in vLLM), systems wasted 60–80% of allocated KV cache memory through fragmentation: non-contiguous allocations, worst-case pre-allocation, internal fragmentation between requests. PagedAttention applies the same paging technique operating systems use for virtual memory, reducing KV cache waste to under 4%. This enabled 2–4x throughput on the same hardware—not by adding GPUs, but by using existing ones better.
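The paging idea reduces to fixed-size blocks, allocated on demand, indexed by a per-sequence block table. This is a toy model of the technique, not vLLM's implementation:

```python
class PagedKVCache:
    """Toy paged allocator: fixed-size physical blocks handed out on demand."""
    def __init__(self, total_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free = list(range(total_blocks))   # pool of physical block ids
        self.tables = {}                        # seq_id -> list of block ids
        self.lengths = {}                       # seq_id -> tokens stored

    def append_token(self, seq_id: int) -> None:
        n = self.lengths.get(seq_id, 0)
        if n % self.block_size == 0:            # last block full: take a new one
            self.tables.setdefault(seq_id, []).append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id: int) -> None:
        # A finished sequence returns its blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

cache = PagedKVCache(total_blocks=1024)
for _ in range(100):
    cache.append_token(seq_id=0)
# 100 tokens occupy ceil(100/16) = 7 blocks; waste is bounded to less than
# one block per sequence, instead of a worst-case contiguous reservation.
print(len(cache.tables[0]))   # 7
```

Blocks need not be contiguous, so the fragmentation modes listed above—non-contiguous gaps, worst-case pre-allocation, internal slack between requests—all collapse into at most one partially filled block per sequence.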

Quantization adds another lever. Moving from FP16 to FP8 for KV cache reduces cache size by 2x with minimal quality impact; NVFP4 reduces it further, enabling larger batches or longer contexts on the same device. But quantization alone doesn't solve the scheduling problem—it just changes the numbers the scheduler needs to track.

What Actually Reclaims Stranded Capacity

Three strategies genuinely help, in roughly increasing order of implementation complexity.

Time-Slicing and MIG: Partitioning the GPU

When you need to run multiple small models on a single large GPU, partitioning is the most direct approach.

Time-slicing (available on all NVIDIA architectures) uses round-robin context switching between workloads, each getting 1–2ms before yielding. It's simple to enable and works universally, but the context-switch overhead is real and accumulates under sustained load. Better for bursty, latency-tolerant workloads than continuous inference.
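A back-of-the-envelope model shows how that overhead accumulates; the slice and switch-cost figures here are illustrative assumptions, not measured values:

```python
def time_slice_efficiency(slice_ms: float, switch_ms: float) -> float:
    # Fraction of GPU time spent on useful work under round-robin slicing.
    return slice_ms / (slice_ms + switch_ms)

# A 1.5 ms slice with a hypothetical 0.1 ms context-switch cost:
print(time_slice_efficiency(1.5, 0.1))   # 0.9375 — ~6% lost to switching
```

For bursty traffic that tax is paid rarely; under sustained inference load it is paid on every slice, which is why the paragraph above steers continuous workloads toward MIG.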

Multi-Instance GPU (MIG) (Ampere and newer only) partitions the physical GPU at the hardware level, giving each instance isolated compute cores, dedicated memory, and its own L2 cache. No context-switch overhead, guaranteed resource isolation. The tradeoff is inflexibility: MIG partition sizes are fixed at provisioning time, and any mismatch between partition size and workload size wastes capacity.

A hybrid approach—MIG partitions plus time-slicing within each partition—combines their strengths. One benchmark showed 6.2x throughput improvement and nearly 6x energy savings for general workloads; LLM-specific workloads showed more modest 1.4x throughput gains. The range reflects how sensitive these numbers are to workload characteristics.

Neither MIG nor time-slicing touches the scheduling decisions that determine which request goes to which pod. They're resource provisioning strategies, not routing intelligence.

Continuous Batching: Eliminate Head-of-Line Blocking

Static batching—waiting for a fixed number of requests before starting a batch—is the default behavior inherited from traditional ML inference. It creates head-of-line blocking: an early request with a long prompt holds the batch until it finishes, blocking all subsequent requests regardless of their size.

Continuous batching eliminates this by running iteration-level scheduling. Each time a sequence completes a decoding step, the batch can absorb a new request from the queue. No request waits for another to finish its full generation. The GPU stays busy, short requests don't queue behind long ones, and throughput scales with load rather than batch configuration.
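A toy event loop makes iteration-level scheduling concrete: each decode step advances every in-flight sequence and admits queued requests into freed slots. Illustrative only—real schedulers also respect KV-cache capacity:

```python
from collections import deque

def continuous_batching(requests, max_batch: int = 4):
    """requests: list of (req_id, tokens_to_generate). Returns completion order."""
    queue = deque(requests)
    in_flight = {}            # req_id -> tokens still to generate
    done = []
    while queue or in_flight:
        # Admit new work at iteration granularity, not batch boundaries.
        while queue and len(in_flight) < max_batch:
            rid, toks = queue.popleft()
            in_flight[rid] = toks
        # One decode step for every in-flight sequence.
        for rid in list(in_flight):
            in_flight[rid] -= 1
            if in_flight[rid] == 0:
                done.append(rid)
                del in_flight[rid]   # freed slot is reusable next iteration
    return done

# A long request ("A", 100 tokens) no longer blocks the short ones behind it:
print(continuous_batching([("A", 100), ("B", 3), ("C", 3), ("D", 3), ("E", 3)]))
# → ['B', 'C', 'D', 'E', 'A']
```

Under static batching, E would wait for the entire first batch—including A's 100 tokens—before generating anything.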

This isn't optional for production LLM serving. It's table stakes, and vLLM implemented it early. The gap between static and continuous batching at moderate concurrency is 2–4x throughput on identical hardware.

KV-Cache-Aware Routing: The New Frontier

The most impactful—and least commonly deployed—optimization is routing requests to pods that already hold relevant context in their KV cache.

The gains are significant. If a pod already has the first 4,000 tokens of a user's conversation cached, routing the next request in that conversation to the same pod reduces time-to-first-token by up to 74% compared to routing it to a cold pod. At enterprise scale—150 customers with 6,000-token shared context prefixes, total KV-cache demand at 73% of cluster capacity—intelligent routing can make the difference between a system that works and one that doesn't.

The Kubernetes Gateway API Inference Extension (GA in early 2026) added the primitives needed for this: model-aware routing, custom metrics for HPA based on queue depth and KV-cache utilization, and traffic splitting by model. Built on top of this, llm-d provides a Kubernetes-native inference scheduler that filters and scores pods by KV-cache state, prefill/decode phase, SLA constraints, and current load. When a request arrives, it can identify which pod holds the most relevant cached context and route there—turning the cluster's distributed KV cache into a coherent resource rather than isolated silos.

The catch is operational complexity. KV-cache-aware routing requires the gateway to know which pods have cached which prefixes. This state must be maintained, synchronized, and invalidated as caches evict old blocks. It's a non-trivial distributed systems problem.
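Stripped of the distributed-state machinery, the scoring idea is: per pod, find the longest cached prefix matching the incoming request, then trade that against load. A simplified sketch—the pod metadata and weights are hypothetical, and real schedulers like llm-d track cache state at block granularity:

```python
def shared_prefix_len(a: list, b: list) -> int:
    # Length of the common prefix between two token sequences.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def pick_pod(request_tokens, pods):
    """pods: list of (name, cached_tokens, load_fraction). Returns a pod name."""
    def score(pod):
        name, cached, load = pod
        hit = shared_prefix_len(request_tokens, cached)
        # Reward cache reuse (skipped prefill compute), penalize current load.
        return hit - 1000 * load
    return max(pods, key=score)[0]

conversation = list(range(4000)) + [9001]   # 4000 prior tokens + one new token
pods = [
    ("pod-a", list(range(4000)), 0.5),      # warm: holds the conversation prefix
    ("pod-b", [], 0.1),                     # cold but nearly idle
]
print(pick_pod(conversation, pods))   # pod-a — the 4000-token hit outweighs its load
```

The hard part the paragraph above names is keeping `cached_tokens` accurate across evictions—the sketch assumes that state is simply available.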

The Cold-Start Trap

All of the above applies to steady-state serving. The cold-start problem operates at a different timescale and undermines dynamic scaling strategies.

Breaking down a typical cold-start in production:

  • VM or node provisioning: 30–60 seconds
  • Container initialization: 5–10 seconds
  • Image pull: 1–5 minutes (10–30GB images)
  • Runtime startup and weight fetch: 30–60 seconds
  • CPU deserialization: 30–120 seconds
  • GPU memory allocation and weight transfer: 15–80 seconds
  • Warm-up forward pass: 5–30 seconds

Total: often exceeding 5 minutes from zero to first token.
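Summing the ranges above bears that out:

```python
# The cold-start stages listed above, as (low, high) ranges in seconds.
stages = {
    "node provisioning": (30, 60),
    "container init": (5, 10),
    "image pull": (60, 300),
    "runtime startup + weight fetch": (30, 60),
    "CPU deserialization": (30, 120),
    "GPU alloc + transfer": (15, 80),
    "warm-up pass": (5, 30),
}
lo = sum(a for a, _ in stages.values())
hi = sum(b for _, b in stages.values())
print(lo / 60, hi / 60)   # ~2.9 to 11.0 minutes from zero to first token
```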

A system that scales to zero to save money will violate SLOs during any traffic spike. A system that keeps instances warm to avoid cold starts pays for idle GPUs continuously. The industry is converging on predictive autoscaling as the answer: start instances before demand arrives by forecasting load from historical patterns and leading indicators. DynamoLLM (HPCA 2025) demonstrated this approach, predicting peak load for the next scheduling epoch and proactively provisioning instances. Lazy image pulling (stargz Snapshotter) and direct-to-GPU weight streaming can cut cold-start time by 60–80% on top of that.
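The predictive idea in miniature: forecast the next epoch's load from recent history and start instances early, so the multi-minute cold start overlaps with lead time instead of user requests. A sketch with a naive linear-trend forecast—DynamoLLM's actual predictor is far more sophisticated, and every number here is hypothetical:

```python
import math

def forecast_next_epoch(history: list) -> float:
    # Naive trend extrapolation from the last two epochs of request rate.
    if len(history) < 2:
        return history[-1] if history else 0.0
    trend = history[-1] - history[-2]
    return max(0.0, history[-1] + trend)

def instances_to_prewarm(history, rps_per_instance: float, current: int) -> int:
    predicted = forecast_next_epoch(history)
    needed = math.ceil(predicted / rps_per_instance)
    # Start these now: they'll be warm by the time the spike lands.
    return max(0, needed - current)

# Ramping traffic: 40 -> 60 -> 80 req/s; each instance handles ~10 req/s.
print(instances_to_prewarm([40, 60, 80], rps_per_instance=10, current=8))
# → 2 extra instances ahead of a predicted 100 req/s
```

Lazy pulling and direct-to-GPU streaming then shrink how much lead time each pre-warmed instance actually needs.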

Choosing a Serving Framework

Your scheduling strategy interacts directly with your inference framework choice.

vLLM is the current standard for high-throughput scenarios. PagedAttention, continuous batching, and prefix caching are all native. Highest throughput at scale.

SGLang outperforms vLLM by 10–20% for workloads with shared prompt prefixes—conversational AI, agent workflows, anything where multiple requests share a long common prefix. RadixAttention (its equivalent of prefix caching) is more aggressive about reuse.

TGI entered maintenance mode in December 2025. Hugging Face recommends vLLM or SGLang for new deployments. If you're running TGI in production, plan your migration.

Triton / NVIDIA Dynamo targets multi-framework clusters where you're serving models from different frameworks simultaneously. Not LLM-optimized, but the right choice for multi-modal or mixed-framework serving.

Ray Serve is a general-purpose Python serving framework—useful for flexibility, suboptimal without manual optimization for LLM-specific concerns.

What Actually Matters

The practical hierarchy, in order of impact:

  1. Switch to continuous batching. If you're still using static batching, this is your highest-leverage change. 2–4x throughput, same hardware, no scheduling changes required.

  2. Adopt PagedAttention. vLLM's memory management reduces KV cache waste from 60–80% to under 4%. This unlocks the capacity that fragmentation was hiding.

  3. Enable prefix caching. For workloads with repeated system prompts or shared context, automatic prefix caching can reduce time-to-first-token by 74% on cache hits. It's a configuration flag in vLLM; turn it on.

  4. Replace Kubernetes GPU atomicity with partitioning. MIG or time-slicing for models that don't need a full GPU. This is operationally straightforward and reclaims 1.4–6x capacity depending on workload mix.

  5. Add KV-cache-aware routing. Higher complexity, higher reward. Start with the Gateway API Inference Extension if you're Kubernetes-native; llm-d adds the scheduling intelligence on top.

  6. Address cold starts with predictive scaling. Once the above are in place, cold-start latency becomes the dominant failure mode under load spikes. Lazy pulling and predictive autoscaling close that gap.
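For the vLLM items in the list above, here is roughly how they surface in its offline Python API—a sketch assuming a recent vLLM release (the model name is only an example); continuous batching and PagedAttention are on by default, so prefix caching is the one flag to flip:

```python
# Sketch assuming vLLM's Python API; requires a GPU and vLLM installed.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_prefix_caching=True,      # item 3: automatic prefix caching
    gpu_memory_utilization=0.90,     # fraction of HBM given to the paged KV cache
)
outputs = llm.generate(["Hello"], SamplingParams(max_tokens=32))
```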

The teams who build scalable LLM serving infrastructure don't do it by adding GPUs. They do it by making the GPUs they have stop being idle while marked as allocated.
