GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

· 9 min read
Tian Pan
Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.

The Memory Formula Most People Get Wrong

The starting formula is simple: Memory = Parameters × Bytes per Parameter. A 70B-parameter model at FP16 (2 bytes per parameter) needs 140 GB just for weights. At INT8, that drops to 70 GB. At INT4, 35 GB.

Here's what this looks like across popular models:

| Model | Parameters | FP16 | INT8 | INT4 |
| --- | --- | --- | --- | --- |
| Mistral 7B | 7B | ~14 GB | ~7 GB | ~3.5 GB |
| Llama 3.1 70B | 70B | ~140 GB | ~70 GB | ~35 GB |
| Llama 4 Scout (MoE) | 109B (17B active) | ~218 GB | ~109 GB | ~55 GB |
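
The weight-only numbers in the table follow directly from the formula. Here is a minimal sketch that reproduces them; the function and dictionary names are illustrative, not part of any library, and 1 GB is taken as 10^9 bytes.

```python
# Bytes per parameter for the three precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight-only footprint in GB (billions of params x bytes per param)."""
    return params_billions * BYTES_PER_PARAM[precision]


if __name__ == "__main__":
    models = [("Mistral 7B", 7), ("Llama 3.1 70B", 70), ("Llama 4 Scout", 109)]
    for name, params in models:
        sizes = {p: weight_memory_gb(params, p) for p in BYTES_PER_PARAM}
        print(name, sizes)
```

Note that the table sizes the MoE model by its total 109B parameters rather than the 17B active ones.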

But this covers only the model weights, which is where most teams stop. In production, weights are often less than half the total memory footprint. The rest comes from three sources that scale with your traffic, not your model.

Activation memory consumes 5–10% of model weight size during inference. Framework overhead from vLLM, TGI, or your serving stack adds another 10–15%. And then there's the KV cache — the memory component that dominates everything else under concurrent load.
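
As a rough sketch, the two percentage ranges above can be turned into a low/high estimate of the footprint before any KV cache is counted. This assumes both percentages are taken relative to weight size, which is one reading of the figures above; the function name is illustrative.

```python
def static_footprint_gb(
    weights_gb: float,
    activation_frac=(0.05, 0.10),  # 5-10% of weight size
    framework_frac=(0.10, 0.15),   # 10-15%, assumed relative to weight size
):
    """Return (low, high) GB for weights + activations + framework overhead."""
    low = weights_gb * (1 + activation_frac[0] + framework_frac[0])
    high = weights_gb * (1 + activation_frac[1] + framework_frac[1])
    return low, high


if __name__ == "__main__":
    low, high = static_footprint_gb(140.0)  # Llama 3.1 70B at FP16
    print(f"{low:.0f}-{high:.0f} GB before KV cache")
```

Under those assumptions, a 140 GB model lands at roughly 161 to 175 GB before a single token of KV cache is stored.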

KV Cache: The Memory Component That Eats Your GPU

Every token your model processes generates key and value tensors that must be stored for attention computation. The formula is:

KV Cache per token = 2 × Layers × KV Heads × Head Dimension × Bytes per Element

For Llama 3.1 70B with its 80 layers, 8 KV heads (grouped-query attention), and 128-dimensional heads at BF16 precision, that's approximately 0.31 MB per token.
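
The per-token figure can be checked with a few lines of arithmetic. This is a sketch with an illustrative function name; the leading factor of 2 accounts for storing both keys and values.

```python
def kv_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                       bytes_per_elem: int = 2) -> int:
    """KV cache bytes stored per token: keys and values for every layer."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


if __name__ == "__main__":
    # Llama 3.1 70B figures quoted above: 80 layers, 8 KV heads, 128-dim heads, BF16.
    per_token = kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
    print(per_token)          # 327680 bytes
    print(per_token / 2**20)  # 0.3125 (binary megabytes)
```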

Sounds tiny. But it compounds fast:

| Context Length | KV Cache (1 request) | KV Cache (16 concurrent) |
| --- | --- | --- |
| 2K tokens | ~0.6 GB | ~10 GB |
| 8K tokens | ~2.5 GB | ~40 GB |
| 32K tokens | ~10 GB | ~160 GB |
| 128K tokens | ~40 GB | ~640 GB |

At 128K context with 16 concurrent requests, the KV cache alone (640 GB) dwarfs the model weights (140 GB) by more than 4x. This is where the over-provisioning trap springs. Teams size their GPUs for model weights plus a vague buffer, then discover they can serve either long contexts or high concurrency — but not both.

The critical insight: KV cache scales linearly with both context length and batch size. If you double either, you double the cache. Double both, and you quadruple it. Your capacity planning must account for your actual traffic distribution — the 95th percentile context length times your target concurrency — not the maximum context the model theoretically supports.
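
To make the planning rule concrete, here is a minimal sketch that scales the per-token figure by context length and concurrency, then inverts it to ask how many requests fit in a given KV budget. The 100 GB budget and the function names are hypothetical; sizes use binary units (2^30 bytes), which is what the table's figures correspond to.

```python
# Llama 3.1 70B at BF16: 2 x 80 layers x 8 KV heads x 128 dims x 2 bytes.
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2


def kv_cache_gb(context_tokens: int, concurrency: int = 1) -> float:
    """KV cache size, linear in both context length and concurrency."""
    return KV_BYTES_PER_TOKEN * context_tokens * concurrency / 2**30


def max_concurrency(kv_budget_gb: float, p95_context_tokens: int) -> int:
    """How many requests at the p95 context length fit in the KV budget."""
    return int(kv_budget_gb // kv_cache_gb(p95_context_tokens))


if __name__ == "__main__":
    for ctx in (2048, 8192, 32768, 131072):
        print(ctx, kv_cache_gb(ctx), kv_cache_gb(ctx, concurrency=16))
    # Hypothetical: 100 GB left for KV cache after weights and overhead.
    print(max_concurrency(kv_budget_gb=100, p95_context_tokens=8192))
```

Sizing against the p95 context instead of the 128K maximum is what changes the answer: the same hypothetical budget holds 40 requests at 8K but only 2 at 128K.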

Static Allocation: The Silent Waste

Most serving frameworks statically pre-allocate KV cache for the maximum supported sequence length. If your model supports 128K tokens but your median request uses 2K, you've reserved 64x more memory per request slot than typical traffic requires.

vLLM's PagedAttention addresses this by allocating KV cache in fixed-size blocks on demand, similar to virtual memory paging. Instead of reserving 128K tokens of cache per request, it allocates pages as the sequence grows. This alone enables 2–4x more concurrent requests on identical hardware.
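
The idea of on-demand block allocation can be illustrated with a toy allocator. This is a simplified sketch of the concept, not vLLM's implementation; the block size of 16 tokens and all class and function names are assumptions made for illustration.

```python
import math

BLOCK_TOKENS = 16  # hypothetical block size, in tokens


def static_tokens_reserved(seq_lens, max_model_len):
    """Static allocation: every request slot reserves the maximum length."""
    return len(seq_lens) * max_model_len


def paged_tokens_reserved(seq_lens, block_tokens=BLOCK_TOKENS):
    """Paged allocation: each sequence holds only the blocks it has filled."""
    return sum(math.ceil(n / block_tokens) * block_tokens for n in seq_lens)


class BlockPool:
    """Toy free-list allocator that hands out blocks as sequences grow."""

    def __init__(self, num_blocks, block_tokens=BLOCK_TOKENS):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks))
        self.tables = {}   # seq_id -> list of block ids (the "page table")
        self.lengths = {}  # seq_id -> tokens stored so far

    def append_token(self, seq_id):
        n = self.lengths.get(seq_id, 0)
        table = self.tables.setdefault(seq_id, [])
        if n == len(table) * self.block_tokens:  # all current blocks are full
            if not self.free:
                raise MemoryError("no free KV blocks")
            table.append(self.free.pop())
        self.lengths[seq_id] = n + 1

    def release(self, seq_id):
        """Return a finished sequence's blocks to the pool."""
        self.free.extend(self.tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)


if __name__ == "__main__":
    lens = [2000] * 4  # four requests near a 2K median
    print(static_tokens_reserved(lens, max_model_len=131072))
    print(paged_tokens_reserved(lens))
```

The gap between the two reservation counts is the memory that paging frees up for additional sequences.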

The practical impact: a team running Llama 3.1 70B on 4×A100-80GB with naive static allocation might support 8 concurrent requests at 8K context. With PagedAttention, the same hardware serves 20–30 concurrent requests at the same context length, because the memory freed from over-allocated slots becomes available for additional sequences.

If your serving stack doesn't support paged allocation, you're almost certainly over-provisioned. This is the single highest-ROI optimization for most inference deployments.

The Quantization-Quality Tradeoff Curve That Actually Matters

Quantization is the most effective lever for reducing memory requirements. But the benchmarks teams typically cite — perplexity on Wikitext-2 — don't tell the full story. Here's what production-relevant evaluation looks like:

Quality retention across methods (Llama-based models):

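| Method | Perplexity (Wikitext-2) | HumanEval Pass@1 | Memory Savings |
| --- | --- | --- | --- |
| FP16 (baseline) | 6.56 | 56.1% | |
| BitsandBytes INT8 | 6.67 | 51.8% | ~50% |
| AWQ INT4 | 6.84 | 51.8% | ~75% |
| GGUF Q4_K_M | 6.74 | 51.8% | ~75% |
| GPTQ INT4 | 6.90 | 46.3% | ~75% |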

Perplexity differences between quantization methods are small: all are within 6% of baseline. But HumanEval tells a different story. GPTQ drops code generation accuracy by nearly 10 percentage points (56.1% to 46.3%), while AWQ and GGUF, at the same 4-bit compression as GPTQ, match the 51.8% that BitsandBytes reaches at INT8.
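
The gap between the two metrics is easy to quantify from the table's own numbers. The sketch below computes the relative perplexity increase and the HumanEval drop in percentage points for each method; the values are copied from the table and the names are illustrative.

```python
BASELINE = {"ppl": 6.56, "humaneval": 56.1}  # FP16 row

# (Wikitext-2 perplexity, HumanEval Pass@1 in %) from the table above.
METHODS = {
    "BitsandBytes INT8": (6.67, 51.8),
    "AWQ INT4": (6.84, 51.8),
    "GGUF Q4_K_M": (6.74, 51.8),
    "GPTQ INT4": (6.90, 46.3),
}


def deltas(ppl: float, humaneval: float):
    """Return (perplexity increase in %, HumanEval drop in percentage points)."""
    ppl_increase_pct = (ppl / BASELINE["ppl"] - 1) * 100
    humaneval_drop_pts = BASELINE["humaneval"] - humaneval
    return ppl_increase_pct, humaneval_drop_pts


if __name__ == "__main__":
    for name, (ppl, he) in METHODS.items():
        ppl_up, he_down = deltas(ppl, he)
        print(f"{name}: perplexity +{ppl_up:.1f}%, HumanEval -{he_down:.1f} pts")
```

A perplexity increase of about 5% for GPTQ sits next to a HumanEval drop of nearly 10 points, which is why perplexity alone is a weak proxy for task quality.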

The throughput story is even more surprising. Quantized models aren't just smaller — with the right kernel, they're faster:

| Method | Output Throughput | vs. FP16 Baseline |
