
GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

8 min read
Tian Pan
Software Engineer

Most teams serving LLMs in production are burning money on GPU capacity they don't need. The root cause isn't carelessness — it's that GPU memory sizing for LLM inference involves four interacting variables (model weights, KV cache, activation memory, and framework overhead), and getting any one wrong means you over-provision the entire stack. When you multiply that error across multiple models on shared infrastructure, the waste compounds fast.

The math itself isn't hard. But most teams never do it, because "just give it an 80GB A100" is easier than calculating whether a 48GB L40S would suffice. This article walks through the arithmetic that determines how many models you can pack onto a single GPU — and the quantization tradeoffs that make it possible.

The Four Components of GPU Memory in LLM Inference

Every model you serve consumes GPU memory across four categories, and you need to account for all of them:

Model weights are the baseline. A model with P parameters at B bytes per parameter consumes P × B bytes. A 70B-parameter model at FP16 (2 bytes) needs 140 GB. At INT4 (0.5 bytes), that same model fits in 35 GB. This is the number most people calculate and then stop, which is where the trouble starts.
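The weight arithmetic really is a one-liner. A minimal sketch (the function name is illustrative; GB here means 10⁹ bytes, matching the figures above):

```python
def weight_memory_gb(params_billion: float, bytes_per_param: float) -> float:
    """Model weight footprint: P parameters x B bytes per parameter."""
    # params_billion * 1e9 params * bytes_per_param bytes = result * 1e9 bytes
    return params_billion * bytes_per_param

print(weight_memory_gb(70, 2.0))   # FP16: 140.0 GB
print(weight_memory_gb(70, 0.5))   # INT4: 35.0 GB
```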

KV cache is the variable that actually dominates at production scale. For each token in the sequence, the model stores key and value vectors across every layer. The formula:

KV cache per token = 2 × layers × kv_heads × head_dim × bytes_per_element

For Llama 3.1 70B with Grouped Query Attention (80 layers, 8 KV heads, 128-dim heads, BF16): that's roughly 0.31 MB per token. At a 128K context window, a single request consumes ~40 GB of KV cache alone. Four concurrent requests at full context? 160 GB — more than the model weights themselves.
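Plugging the Llama 3.1 70B numbers into the formula, as a quick sketch (BF16 assumed at 2 bytes per element):

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    # Factor of 2 covers both the key and the value vector per layer/head
    return 2 * layers * kv_heads * head_dim * bytes_per_elem

per_tok = kv_bytes_per_token(80, 8, 128)   # Llama 3.1 70B, BF16
print(per_tok / 2**20)                     # 0.3125 MB/token (~0.31)
print(131_072 * per_tok / 2**30)           # 40.0 GB at a full 128K context
```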

Activation memory holds intermediate computation results during the forward pass. This is typically 5–15% of model weight memory, varying with batch size and sequence length.

Framework overhead comes from vLLM, TGI, or whatever serving engine you use. Memory managers, scheduling buffers, CUDA contexts, and the Python runtime all claim GPU memory. Budget 1–3 GB depending on the framework and number of loaded models.

The critical insight: at short contexts (under 2K tokens), model weights dominate. At long contexts (32K+), KV cache dominates. Your capacity plan must account for your actual context length distribution, not the model's maximum.

The KV Cache Math That Changes Everything

Here's why teams over-provision: most serving engines pre-allocate KV cache memory for the worst case. If your model supports 128K context and you configure 16 concurrent slots, the engine reserves 16 × 40 GB = 640 GB of KV cache space — even if 90% of your requests use under 4K tokens.

This is the single biggest source of memory waste in LLM serving. The fix is understanding your actual context length distribution and configuring accordingly.

If your P95 request length is 4K tokens, each slot only needs ~1.25 GB of KV cache for a 70B model. Sixteen concurrent slots at 4K context: 20 GB of KV cache instead of 640 GB. That's the difference between needing eight A100s and needing two.
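The worst-case-versus-P95 gap is easy to see in code. A minimal sketch (helper name is illustrative; per-token cost is the 70B BF16 figure derived above):

```python
PER_TOKEN_MB = 0.3125  # Llama 3.1 70B KV cache, BF16

def kv_pool_gb(slots: int, context_tokens: int,
               per_token_mb: float = PER_TOKEN_MB) -> float:
    """Total KV cache pool if every slot is sized for context_tokens."""
    return slots * context_tokens * per_token_mb / 1024

print(kv_pool_gb(16, 131_072))  # worst case, 128K per slot: 640.0 GB
print(kv_pool_gb(16, 4_096))    # P95 sizing, 4K per slot:    20.0 GB
```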

Practical KV cache sizes for common models (per token, BF16):

  • Llama 3.1 8B (32 layers, 8 KV heads, 128 dim): ~0.13 MB/token
  • Mistral 7B (32 layers, 8 KV heads, 128 dim): ~0.13 MB/token
  • Llama 3.1 70B (80 layers, 8 KV heads, 128 dim): ~0.31 MB/token
  • Llama 3.1 405B (126 layers, 8 KV heads, 128 dim): ~0.49 MB/token
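These per-token figures fall straight out of the formula above. A quick sketch to reproduce them (architecture tuples taken from the list; names and helper are illustrative):

```python
MODELS = {  # name: (layers, kv_heads, head_dim)
    "Llama 3.1 8B":   (32, 8, 128),
    "Mistral 7B":     (32, 8, 128),
    "Llama 3.1 70B":  (80, 8, 128),
    "Llama 3.1 405B": (126, 8, 128),
}

def per_token_mb(name: str, bytes_per_elem: int = 2) -> float:  # BF16
    layers, kv_heads, head_dim = MODELS[name]
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 2**20

for name in MODELS:
    print(f"{name}: {per_token_mb(name):.2f} MB/token")
```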

The Grouped Query Attention (GQA) design used by most modern models is a massive win here. Llama 3.1 70B uses only 8 KV heads versus 64 query heads, reducing KV cache by 8x compared to full multi-head attention. Older models without GQA will consume proportionally more cache per token.

The Quantization Decision Matrix

Quantization is where you reclaim the memory to make multi-model serving viable. But the landscape has fragmented into enough options that teams either pick randomly or default to the most conservative choice. Here's how to decide.

FP8 (8-bit floating point): The safe default for production. Near-zero quality degradation — perplexity increases by less than 0.5% on most benchmarks. Halves weight memory compared to FP16. Native hardware support on H100/H200/B100 GPUs means no throughput penalty. If you're on recent NVIDIA hardware, start here.

INT4 with AWQ (Activation-aware Weight Quantization): The aggressive choice that usually works. Cuts weight memory by 4x versus FP16. Quality retention is ~95% on general tasks, and with Marlin acceleration kernels, AWQ achieves 741 tokens/second versus 461 tok/s at FP16 baseline — 60% faster, not slower. The catch: code generation accuracy drops about 7.7% (from 56.1% to 51.8% Pass@1 on HumanEval). For non-code tasks, this is often the sweet spot.

INT4 with GPTQ: Similar compression to AWQ but slightly lower quality on code tasks (46.3% Pass@1, a 17% drop). Throughput with Marlin kernels is comparable at 712 tok/s. Use GPTQ when AWQ quantized weights aren't available for your model.

GGUF: The right choice for llama.cpp and Ollama deployments. Q4_K_M quality is on par with AWQ, but throughput in vLLM is poor (93 tok/s with 958ms time-to-first-token). If you're already in the vLLM ecosystem, avoid GGUF.

The key finding from recent benchmarks: kernels matter more than algorithms. The same AWQ quantized weights run at 67 tok/s with standard kernels versus 741 tok/s with Marlin — a 10.9x difference. Always check that your serving framework supports optimized kernels for your quantization format.

Mixed-precision is the emerging trend: using FP8 for attention layers (where precision matters most) and INT4 for MLP layers (which are more compression-tolerant). This approach reduces perplexity by 0.14 compared to pure INT4, while achieving nearly the same compression ratio.

Bin-Packing: Fitting 4 Models Where You Budgeted for 1

With the memory math in hand, multi-model serving becomes a bin-packing problem: given a GPU's total memory, how many models (with their KV caches and overhead) can you fit?

Example: Single A100 80GB

Start with usable memory after CUDA overhead: ~77 GB.

| Configuration | Weights | KV Cache (8 slots × 2K) | Overhead | Total | Fits? |
|---|---|---|---|---|---|
| 1× Llama 70B FP16 | 140 GB | 5 GB | 2 GB | 147 GB | No |
| 1× Llama 70B INT4 | 35 GB | 5 GB | 2 GB | 42 GB | Yes (35 GB free) |
| 1× Llama 70B INT4 + 1× Mistral 7B INT4 | 38.5 GB | 7 GB | 3 GB | 48.5 GB | Yes (~28 GB free) |
| 1× Llama 70B INT4 + 2× 7B INT4 | 42 GB | 9 GB | 4 GB | 55 GB | Yes (22 GB free) |
| 4× 7B INT4 | 14 GB | 8 GB | 5 GB | 27 GB | Yes (50 GB free) |
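Each row of the table reduces to a feasibility check. A toy sketch (KV figures recomputed from the per-token formula above, which gives ~0.13 MB/token for the 7B-class models; function name is illustrative):

```python
USABLE_GB = 77.0  # A100 80GB after CUDA overhead

def fits(models, usable_gb=USABLE_GB):
    """models: list of (weights_gb, kv_gb, overhead_gb) per loaded model."""
    total = sum(sum(m) for m in models)
    return total <= usable_gb, usable_gb - total

# 1x Llama 70B INT4 + 2x 7B INT4, each with 8 slots x 2K context
mix = [(35, 5, 2), (3.5, 2, 1), (3.5, 2, 1)]
ok, free = fits(mix)
print(ok, free)  # True 22.0

# 1x Llama 70B FP16 does not fit
print(fits([(140, 5, 2)])[0])  # False
```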

At INT4 quantization, a single A100 can comfortably serve a 70B model alongside two 7B models — hardware that teams typically dedicate to one FP16 model. That's the 3x over-provisioning gap.

Dynamic memory swapping takes this further. NVIDIA's Run:ai GPU memory swap technique offloads inactive models to CPU memory and reactivates them on-GPU within 2–3 seconds. Testing showed Llama 3.1 8B (32 GB) and Mistral 7B (27.5 GB) sharing a single 48 GB L40S GPU despite combined weights exceeding GPU capacity. Compared to cold-starting from zero (140–208 seconds), memory swap achieves 50–66x improvement in time-to-first-token.

This works when traffic patterns are complementary: if Model A peaks during business hours and Model B peaks overnight, they can time-share a single GPU effectively.

The Scheduling Layer That Makes It Work

Static allocation is the enemy of utilization. Research from the Aegaeon system (SOSP '25) and others shows that token-level scheduling across multiple models on shared GPUs can dramatically improve efficiency. The key patterns:

Workload-aware KV cache allocation. Instead of pre-allocating max context for every slot, allocate KV cache dynamically based on actual request length. vLLM's PagedAttention already does this to a degree, but most teams configure max_model_len conservatively, which still wastes memory.
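Why paging beats pre-allocation is easiest to see in a toy model. This is a simplified PagedAttention-style allocator, not vLLM's actual implementation; the page size and per-token cost are illustrative:

```python
PAGE_TOKENS = 16                  # tokens per KV page (block)
PAGE_MB = PAGE_TOKENS * 0.3125    # Llama 70B-class cost, BF16

class PagedKVPool:
    """Toy allocator: pages are granted on demand, so a 300-token request
    holds ceil(300/16) pages instead of a full max_model_len reservation."""
    def __init__(self, budget_mb: float):
        self.free_pages = int(budget_mb // PAGE_MB)

    def alloc(self, tokens: int):
        need = -(-tokens // PAGE_TOKENS)  # ceiling division
        if need > self.free_pages:
            return None                   # memory pressure: caller may preempt
        self.free_pages -= need
        return need

pool = PagedKVPool(budget_mb=20 * 1024)   # 20 GB KV budget
print(pool.alloc(300))  # 19 pages, vs the 8192 a 128K prealloc would reserve
```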

Priority-based preemption. When memory pressure hits, evict KV cache from lower-priority requests rather than rejecting new ones. This requires tracking request priority and re-computation cost, but keeps utilization high during bursts.

Cross-model memory coordination. The Aladdin system models latency using prefill/decode estimators and solves a bin-packing optimization to find the minimum-cost GPU configuration that satisfies all active SLOs, reporting up to 71% cost savings while maintaining latency guarantees.

The Sizing Checklist

Before you provision GPU infrastructure for LLM serving, work through this:

  1. Calculate model weight memory at your chosen precision. FP16 baseline, then evaluate FP8 and INT4 for your quality requirements.
  2. Profile your context length distribution. Use P95, not max. Most production workloads have dramatically shorter average contexts than the model's maximum.
  3. Size KV cache for actual concurrency × actual context. Not max concurrency × max context.
  4. Add framework overhead. 1–3 GB per loaded model, depending on your serving stack.
  5. Leave 10–15% headroom for memory fragmentation and burst handling.
  6. Verify kernel support for your quantization format in your serving framework. The wrong kernel can cost you 10x throughput.
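Rolled together, steps 1–5 become a single back-of-envelope estimator. A sketch (function name and defaults are illustrative):

```python
def gpu_budget_gb(weights_gb: float, p95_tokens: int, slots: int,
                  kv_mb_per_token: float, overhead_gb: float = 2.0,
                  headroom: float = 0.15) -> float:
    """Steps 1-5 of the checklist: weights + P95-sized KV cache
    + framework overhead, plus fragmentation/burst headroom."""
    kv_gb = slots * p95_tokens * kv_mb_per_token / 1024
    return (weights_gb + kv_gb + overhead_gb) * (1 + headroom)

# Llama 3.1 70B at INT4, P95 = 4K tokens, 16 concurrent slots
print(round(gpu_budget_gb(35, 4096, 16, 0.3125), 1))  # ~65.5 GB
```

At roughly 65 GB, this workload fits on a single 80 GB GPU with room to spare, where the naive max-context plan demanded several.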

The teams that do this math end up serving 3–4x more models per GPU than those who don't. In an environment where GPU hours are the largest line item in the AI infrastructure budget, that's not optimization — it's table stakes.
