GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x
Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.
The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.
The Memory Formula Most People Get Wrong
The starting formula is simple: Memory = Parameters × Bytes per Parameter. A 70B-parameter model at FP16 (2 bytes per parameter) needs 140 GB just for weights. At INT8, that drops to 70 GB. At INT4, 35 GB.
Here's what this looks like across popular models:
| Model | Parameters | FP16 | INT8 | INT4 |
|---|---|---|---|---|
| Mistral 7B | 7B | ~14 GB | ~7 GB | ~3.5 GB |
| Llama 3.1 70B | 70B | ~140 GB | ~70 GB | ~35 GB |
| Llama 4 Scout (MoE) | 109B (17B active) | ~218 GB | ~109 GB | ~55 GB |
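The table rows follow directly from the formula, and a minimal Python sketch can reproduce them. The function name `weight_memory_gb` and the `BYTES_PER_PARAM` mapping are illustrative, and the sketch assumes 1 GB = 10^9 bytes and ignores quantization metadata such as scales and zero points. For the MoE row it uses the full 109B parameter count, since all experts have to be resident in memory even though only 17B are active per token.

```python
# Bytes per parameter for the precisions used in the table above.
BYTES_PER_PARAM = {"fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(params_billions: float, precision: str) -> float:
    """Weight-only memory in GB (1 GB = 1e9 bytes).

    Ignores quantization metadata (scales, zero points), so real
    INT8/INT4 checkpoints will be slightly larger than this estimate.
    """
    return params_billions * BYTES_PER_PARAM[precision]


models = [("Mistral 7B", 7), ("Llama 3.1 70B", 70), ("Llama 4 Scout", 109)]
for name, params in models:
    sizes = ", ".join(
        f"{precision}: {weight_memory_gb(params, precision):.1f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name}: {sizes}")
```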
But this is only the model weights, the number most teams stop at. In production, weights are often less than half the total memory footprint. The rest comes from three sources, and the largest of them scales with your traffic rather than your model size.
Activation memory consumes 5–10% of model weight size during inference. Framework overhead from vLLM, TGI, or your serving stack adds another 10–15%. And then there's the KV cache — the memory component that dominates everything else under concurrent load.
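To see how these components stack up, here is a rough sketch that turns the percentages above into a low/high estimate. It assumes both the activation and framework percentages are taken relative to weight size (the text only states this explicitly for activations). The function name and the example KV cache figure are hypothetical.

```python
def serving_memory_range_gb(
    weights_gb: float,
    kv_cache_gb: float,
    activation_frac=(0.05, 0.10),   # 5-10% of weights, from the text
    framework_frac=(0.10, 0.15),    # 10-15%, assumed relative to weights
):
    """Return a (low, high) estimate of total serving memory in GB."""
    low = weights_gb * (1 + activation_frac[0] + framework_frac[0]) + kv_cache_gb
    high = weights_gb * (1 + activation_frac[1] + framework_frac[1]) + kv_cache_gb
    return low, high


# Example: 70 GB of INT8 weights plus a hypothetical 40 GB KV cache budget.
low, high = serving_memory_range_gb(weights_gb=70, kv_cache_gb=40)
print(f"Estimated footprint: {low:.1f} to {high:.1f} GB")
```

In this example the weights account for only a little over half the total, and the KV cache term is the only one that grows with traffic.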
KV Cache: The Memory Component That Eats Your GPU
Every token your model processes generates key and value tensors that must be stored for attention computation. The formula is:
KV Cache per token = 2 × Layers × KV Heads × Head Dimension × Bytes per Element
For Llama 3.1 70B with its 80 layers, 8 KV heads (grouped-query attention), and 128-dimensional heads at BF16 precision (2 bytes per element), that's 2 × 80 × 8 × 128 × 2 = 327,680 bytes, or approximately 0.31 MB per token.
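As a quick check, the per-token formula can be written as a small Python function. The function name is illustrative; the architecture numbers are the ones quoted above for Llama 3.1 70B.

```python
def kv_cache_bytes_per_token(layers: int, kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """KV cache size for one token, in bytes.

    The leading factor of 2 accounts for storing both keys and values.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_elem


# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), 128-dim heads, BF16 (2 bytes).
per_token = kv_cache_bytes_per_token(layers=80, kv_heads=8, head_dim=128)
print(per_token)           # 327680 bytes
print(per_token / 2**20)   # 0.3125 (binary megabytes)
```

Note that the count uses KV heads rather than query heads. With grouped-query attention that is what keeps the per-token figure this small.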
Sounds tiny. But it compounds fast:
| Context Length | KV Cache (1 request) | KV Cache (16 concurrent) |
|---|---|---|
| 2K tokens | ~0.6 GB | ~10 GB |
| 8K tokens | ~2.5 GB | ~40 GB |
| 32K tokens | ~10 GB | ~160 GB |
| 128K tokens | ~40 GB | ~640 GB |
At 128K context with 16 concurrent requests, the KV cache alone (640 GB) dwarfs the model weights (140 GB) by more than 4x. This is where the over-provisioning trap springs. Teams size their GPUs for model weights plus a vague buffer, then discover they can serve either long contexts or high concurrency — but not both.
The critical insight: KV cache scales linearly with both context length and batch size. If you double either, you double the cache. Double both, and you quadruple it. Your capacity planning must account for your actual traffic distribution — the 95th percentile context length times your target concurrency — not the maximum context the model theoretically supports.
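The planning rule can be sketched as two small helpers: one that sizes the cache for a given context length and concurrency, and one that inverts it to find how many requests fit in a memory budget. The function names and the 100 GB budget are hypothetical, and the sketch assumes every request reaches the planned context length, which makes it a conservative bound.

```python
# Per-token KV cache for Llama 3.1 70B at BF16: 2 * layers * kv_heads * head_dim * bytes.
LLAMA_70B_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes


def kv_cache_gib(context_tokens: int, concurrency: int, bytes_per_token: int) -> float:
    """Total KV cache in binary GB: linear in both context length and concurrency."""
    return context_tokens * concurrency * bytes_per_token / 2**30


def max_concurrency(kv_budget_gib: float, p95_context_tokens: int,
                    bytes_per_token: int) -> int:
    """Requests that fit in the budget if each one grows to the p95 context length."""
    per_request_bytes = p95_context_tokens * bytes_per_token
    return int(kv_budget_gib * 2**30 // per_request_bytes)


print(kv_cache_gib(8 * 1024, 16, LLAMA_70B_BYTES_PER_TOKEN))     # 40.0
print(kv_cache_gib(128 * 1024, 16, LLAMA_70B_BYTES_PER_TOKEN))   # 640.0

# Hypothetical 100 GB KV budget, sized for the p95 context versus the model maximum.
print(max_concurrency(100, 8 * 1024, LLAMA_70B_BYTES_PER_TOKEN))    # 40
print(max_concurrency(100, 128 * 1024, LLAMA_70B_BYTES_PER_TOKEN))  # 2
```

Planning for an 8K p95 instead of the 128K maximum is the difference between 40 and 2 concurrent requests on the same budget.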
Static Allocation: The Silent Waste
Serving stacks without paged allocation typically pre-allocate KV cache for the maximum supported sequence length. If your model supports 128K tokens but your median request uses 2K, you've reserved 64x more memory per request slot than typical traffic requires.
vLLM's PagedAttention addresses this by allocating KV cache in fixed-size blocks on demand, similar to virtual memory paging. Instead of reserving 128K tokens of cache per request, it allocates pages as the sequence grows. This alone enables 2–4x more concurrent requests on identical hardware.
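The difference between the two strategies can be illustrated with a toy accounting model. This is not vLLM's implementation: it only counts how many requests fit in a fixed token budget under each scheme and ignores on-demand growth, preemption, and prefix sharing. The function names, the 16-token block size, the 64K-token budget, and the request mix are all assumptions chosen for illustration.

```python
import math


def static_slots(budget_tokens: int, max_seq_len: int) -> int:
    """Static allocation: every slot reserves max_seq_len tokens up front."""
    return budget_tokens // max_seq_len


def paged_admitted(budget_tokens: int, request_lengths, block_size: int = 16) -> int:
    """Paged allocation: each request only consumes the blocks it actually needs.

    Requests are admitted in arrival order until one no longer fits.
    """
    free_blocks = budget_tokens // block_size
    admitted = 0
    for length in request_lengths:
        needed = math.ceil(length / block_size)
        if needed > free_blocks:
            break
        free_blocks -= needed
        admitted += 1
    return admitted


# Hypothetical traffic: mixed context lengths, with 8K as the longest.
request_lengths = [2048, 4096, 1024, 8192, 2048, 1024, 4096, 2048] * 4
budget_tokens = 64 * 1024

print(static_slots(budget_tokens, max_seq_len=8192))   # 8
print(paged_admitted(budget_tokens, request_lengths))  # 20
```

With this particular mix the same budget holds 8 statically allocated slots or 20 paged requests. The ratio depends entirely on how far typical requests fall below the reserved maximum.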
The practical impact: a team running Llama 3.1 70B on 4×A100-80GB with naive static allocation might support 8 concurrent requests at 8K context. With PagedAttention, the same hardware serves 20–30 concurrent requests at the same context length, because the memory freed from over-allocated slots becomes available for additional sequences.
If your serving stack doesn't support paged allocation, you're almost certainly over-provisioned. This is the single highest-ROI optimization for most inference deployments.
The Quantization-Quality Tradeoff Curve That Actually Matters
Quantization is the most effective lever for reducing memory requirements. But the benchmarks teams typically cite — perplexity on Wikitext-2 — don't tell the full story. Here's what production-relevant evaluation looks like:
Quality retention across methods (Llama-based models):
| Method | Perplexity (Wikitext-2) | HumanEval Pass@1 | Memory Savings |
|---|---|---|---|
| FP16 (baseline) | 6.56 | 56.1% | — |
| BitsandBytes INT8 | 6.67 | 51.8% | ~50% |
| AWQ INT4 | 6.84 | 51.8% | ~75% |
| GGUF Q4_K_M | 6.74 | 51.8% | ~75% |
| GPTQ INT4 | 6.90 | 46.3% | ~75% |
Perplexity differences between quantization methods are small: all are within 6% of baseline. But HumanEval tells a different story. GPTQ drops code generation accuracy by nearly 10 percentage points (56.1% to 46.3%), while AWQ and GGUF at 4-bit match BitsandBytes INT8 at 51.8%, a 4.3-point drop from baseline despite using half the memory of INT8.
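The comparison is simple arithmetic on the table, sketched here so the two metrics can be read side by side. The numbers are copied from the table above; the dictionary layout and the function name are illustrative.

```python
# (perplexity on Wikitext-2, HumanEval pass@1 in %) copied from the table above.
results = {
    "FP16 (baseline)": (6.56, 56.1),
    "BitsandBytes INT8": (6.67, 51.8),
    "AWQ INT4": (6.84, 51.8),
    "GGUF Q4_K_M": (6.74, 51.8),
    "GPTQ INT4": (6.90, 46.3),
}

base_ppl, base_pass = results["FP16 (baseline)"]


def deltas(perplexity: float, pass_at_1: float):
    """Return (relative perplexity increase in %, HumanEval drop in points)."""
    ppl_increase_pct = (perplexity - base_ppl) / base_ppl * 100
    pass_drop_points = base_pass - pass_at_1
    return ppl_increase_pct, pass_drop_points


for method, (ppl, pass_at_1) in results.items():
    ppl_pct, drop = deltas(ppl, pass_at_1)
    print(f"{method:18s} perplexity +{ppl_pct:.1f}%  HumanEval -{drop:.1f} pts")
```

Seen this way, a perplexity gap of about one percentage point between AWQ and GPTQ corresponds to a 5.5-point gap on HumanEval, which is why perplexity alone is a weak proxy for task quality.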
The throughput story is even more surprising. Quantized models aren't just smaller — with the right kernel, they're faster:
| Method | Output Throughput | vs. FP16 Baseline |
