
GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

9 min read
Tian Pan
Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.

The Memory Formula Most People Get Wrong

The starting formula is simple: Memory = Parameters × Bytes per Parameter. A 70B-parameter model at FP16 (2 bytes per parameter) needs 140 GB just for weights. At INT8, that drops to 70 GB. At INT4, 35 GB.

Here's what this looks like across popular models:

| Model | Parameters | FP16 | INT8 | INT4 |
| --- | --- | --- | --- | --- |
| Mistral 7B | 7B | ~14 GB | ~7 GB | ~3.5 GB |
| Llama 3.1 70B | 70B | ~140 GB | ~70 GB | ~35 GB |
| Llama 4 Scout (MoE) | 109B (17B active) | ~218 GB | ~109 GB | ~55 GB |
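The arithmetic behind the table is a one-liner. A minimal sketch, treating GB as 10^9 bytes and using the parameter counts from the table:

```python
def weight_memory_gb(params_billions: float, bytes_per_param: float) -> float:
    """Weight footprint in GB: parameters x bytes per parameter."""
    # billions of parameters x bytes per parameter = gigabytes
    return params_billions * bytes_per_param

# Llama 3.1 70B at the three precisions from the table
print(weight_memory_gb(70, 2.0))   # FP16 -> 140.0 GB
print(weight_memory_gb(70, 1.0))   # INT8 -> 70.0 GB
print(weight_memory_gb(70, 0.5))   # INT4 -> 35.0 GB
```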

But this covers only the model weights, and that is where most teams stop calculating. In production, weights are often less than half the total memory footprint. The rest comes from three sources that scale with your traffic, not your model.

Activation memory consumes 5–10% of model weight size during inference. Framework overhead from vLLM, TGI, or your serving stack adds another 10–15%. And then there's the KV cache — the memory component that dominates everything else under concurrent load.

KV Cache: The Memory Component That Eats Your GPU

Every token your model processes generates key and value tensors that must be stored for attention computation. The formula is:

KV Cache per token = 2 × Layers × KV Heads × Head Dimension × Bytes per Element

For Llama 3.1 70B with its 80 layers, 8 KV heads (grouped-query attention), and 128-dimensional heads at BF16 precision, that's approximately 0.31 MB per token.
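That per-token figure comes straight from the formula; a quick check in code, using Llama 3.1 70B's published architecture numbers:

```python
def kv_cache_per_token_mb(layers: int, kv_heads: int, head_dim: int,
                          bytes_per_elem: int) -> float:
    """Per-token KV cache: 2 (K and V) x layers x KV heads x head dim x element size."""
    return 2 * layers * kv_heads * head_dim * bytes_per_elem / 2**20

# Llama 3.1 70B: 80 layers, 8 KV heads (grouped-query attention),
# head dimension 128, BF16 = 2 bytes per element
print(kv_cache_per_token_mb(80, 8, 128, 2))  # 0.3125 MB per token
```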

Sounds tiny. But it compounds fast:

| Context Length | KV Cache (1 request) | KV Cache (16 concurrent) |
| --- | --- | --- |
| 2K tokens | ~0.6 GB | ~10 GB |
| 8K tokens | ~2.5 GB | ~40 GB |
| 32K tokens | ~10 GB | ~160 GB |
| 128K tokens | ~40 GB | ~640 GB |

At 128K context with 16 concurrent requests, the KV cache alone (640 GB) dwarfs the model weights (140 GB) by more than 4x. This is where the over-provisioning trap springs. Teams size their GPUs for model weights plus a vague buffer, then discover they can serve either long contexts or high concurrency — but not both.

The critical insight: KV cache scales linearly with both context length and batch size. If you double either, you double the cache. Double both, and you quadruple it. Your capacity planning must account for your actual traffic distribution — the 95th percentile context length times your target concurrency — not the maximum context the model theoretically supports.
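The linear-in-both scaling is worth verifying numerically; a sketch using the ~0.31 MB/token figure above:

```python
def kv_cache_total_gb(mb_per_token: float, context_len: int, concurrency: int) -> float:
    """Total KV cache grows linearly with context length AND with batch size."""
    return mb_per_token * context_len * concurrency / 1024

print(kv_cache_total_gb(0.3125, 8_192, 16))   # baseline: 40.0 GB
print(kv_cache_total_gb(0.3125, 16_384, 16))  # double context: 80.0 GB
print(kv_cache_total_gb(0.3125, 16_384, 32))  # double both: 160.0 GB
```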

Static Allocation: The Silent Waste

Most serving frameworks statically pre-allocate KV cache for the maximum supported sequence length. If your model supports 128K tokens but your median request uses 2K, you've reserved 64x more memory per request slot than typical traffic requires.
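Using the ~0.31 MB/token figure for Llama 3.1 70B at BF16, the per-slot waste from static sizing is easy to put numbers on (the 128K-max versus 2K-median split mirrors the example above):

```python
KV_MB_PER_TOKEN = 0.3125  # Llama 3.1 70B at BF16, from the per-token formula

def kv_slot_gb(tokens: int) -> float:
    """KV cache one request slot ties up when sized for `tokens`."""
    return KV_MB_PER_TOKEN * tokens / 1024

reserved = kv_slot_gb(128_000)  # static slot sized for max context: ~39 GB
typical = kv_slot_gb(2_000)     # what a median 2K request actually needs: ~0.6 GB
print(reserved / typical)       # 64.0x over-reservation per slot
```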

vLLM's PagedAttention addresses this by allocating KV cache in fixed-size blocks on demand, similar to virtual memory paging. Instead of reserving 128K tokens of cache per request, it allocates pages as the sequence grows. This alone enables 2–4x more concurrent requests on identical hardware.

The practical impact: a team running Llama 3.1 70B on 4×A100-80GB with naive static allocation might support 8 concurrent requests at 8K context. With PagedAttention, the same hardware serves 20–30 concurrent requests at the same context length, because the memory freed from over-allocated slots becomes available for additional sequences.

If your serving stack doesn't support paged allocation, you're almost certainly over-provisioned. This is the single highest-ROI optimization for most inference deployments.

The Quantization-Quality Tradeoff Curve That Actually Matters

Quantization is the most effective lever for reducing memory requirements. But the benchmarks teams typically cite — perplexity on Wikitext-2 — don't tell the full story. Here's what production-relevant evaluation looks like:

Quality retention across methods (Llama-based models):

| Method | Perplexity (Wikitext-2) | HumanEval Pass@1 | Memory Savings |
| --- | --- | --- | --- |
| FP16 (baseline) | 6.56 | 56.1% | |
| BitsandBytes INT8 | 6.67 | 51.8% | ~50% |
| AWQ INT4 | 6.84 | 51.8% | ~75% |
| GGUF Q4_K_M | 6.74 | 51.8% | ~75% |
| GPTQ INT4 | 6.90 | 46.3% | ~75% |

Perplexity differences between quantization methods are small — all within 6% of baseline. But HumanEval tells a different story: GPTQ drops code generation accuracy by nearly 10 percentage points, while AWQ and GGUF maintain parity with BitsandBytes at the same 4-bit compression.

The throughput story is even more surprising. Quantized models aren't just smaller — with the right kernel, they're faster:

| Method | Output Throughput | vs. FP16 Baseline |
| --- | --- | --- |
| Marlin-AWQ | 741 tok/s | +61% |
| Marlin-GPTQ | 712 tok/s | +54% |
| FP16 baseline | 461 tok/s | |
| Standard GPTQ | 277 tok/s | -40% |
| Standard AWQ | 68 tok/s | -85% |

The Marlin kernel delivers a 10.9x speedup for AWQ inference over the standard implementation. The quantization algorithm is only half the equation — the compute kernel determines whether you get a speed boost or a speed penalty.

Production recommendation: AWQ with Marlin kernels in vLLM gives the best combination of memory savings, quality retention, and throughput. For CPU or edge deployment, GGUF Q4_K_M is the native format for llama.cpp and Ollama. Avoid standard (non-Marlin) GPTQ and AWQ kernels in production — they're slower than FP16 despite using less memory.

Bin-Packing: Serving 4 Models on Hardware Budgeted for 1

Once you've right-sized individual models through quantization and efficient KV cache management, the next opportunity is packing multiple models onto the same GPU pool. Three patterns work in practice:

Pattern 1: Size-Class Routing

Route incoming requests to different model tiers based on query complexity. A lightweight classifier (itself running on a fraction of a GPU) examines each request and routes it to an appropriate model:

  • Simple queries (FAQ, classification, extraction): 7B model at INT4 — ~4 GB
  • Standard queries (summarization, analysis): 70B model at INT4 — ~35 GB
  • Complex queries (multi-step reasoning, code generation): 70B model at FP16, or a dedicated reasoning model

On a single A100-80GB, you can simultaneously serve a 7B INT4 model (~4 GB) and a 70B INT4 model (~35 GB) and still have roughly 40 GB for KV cache and overhead. Most production traffic distributions are heavily skewed toward simpler queries, so the small model handles 60–70% of volume while the large model handles the rest.
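A size-class router can be sketched in a few lines. Everything here is hypothetical: the tier names, and the keyword heuristics standing in for a real lightweight classifier.

```python
def route(query: str) -> str:
    """Pick a model tier from crude query features (stand-in for a real classifier)."""
    lowered = query.lower()
    # Complex: multi-step reasoning or code generation -> largest tier
    if any(k in lowered for k in ("step by step", "write code", "refactor")):
        return "70b-fp16"
    # Simple: short FAQ-style questions -> smallest tier
    if len(query.split()) < 20 and query.rstrip().endswith("?"):
        return "7b-int4"
    # Everything else: summarization, analysis -> mid tier
    return "70b-int4"

print(route("What are your support hours?"))         # 7b-int4
print(route("Write code to parse these log files"))  # 70b-fp16
```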

Pattern 2: LoRA Adapter Multiplexing

Instead of serving N separate fine-tuned models, serve one base model with N LoRA adapters that swap in at request time. LoRA adapters are typically less than 1% of the base model's size; rank-16 adapters for Llama 70B add roughly 100–300 MB each.

S-LoRA demonstrated serving thousands of concurrent LoRA adapters on a single GPU by dynamically swapping adapter weights between CPU and GPU memory. The base model stays resident in GPU memory while adapters load on demand, with a unified paging system that handles both KV cache and adapter weights without fragmentation.

For teams fine-tuning per customer, per domain, or per task, this is transformative: instead of N × (model size) memory, you need 1 × (model size) + N × (adapter size). A hundred customer-specific models that would require 100 GPU instances become one instance plus 30 GB of adapter storage.
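The memory win from adapter multiplexing, in numbers (base and adapter sizes as quoted above):

```python
def fleet_memory_gb(base_gb: float, adapter_gb: float, n: int,
                    multiplexed: bool = True) -> float:
    """Memory to serve n fine-tuned variants, with or without a shared base model."""
    if multiplexed:
        return base_gb + n * adapter_gb  # 1 x base + N x adapters
    return n * base_gb                   # N full model copies

# Llama 70B FP16 base (~140 GB), rank-16 adapters (~0.3 GB each), 100 customers
print(fleet_memory_gb(140, 0.3, 100))                     # 170.0 GB on one pool
print(fleet_memory_gb(140, 0.3, 100, multiplexed=False))  # 14000 GB across 100 instances
```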

Pattern 3: Prefill-Decode Disaggregation

The prefill phase (processing input context) is compute-bound. The decode phase (generating tokens) is memory-bound. Running both on the same GPU wastes whichever resource the current phase isn't using.

NVIDIA's Dynamo framework and similar systems disaggregate these phases onto separate GPU pools. Prefill GPUs can be packed with compute-heavy workloads, while decode GPUs are optimized for memory bandwidth and KV cache capacity. This architectural split typically improves overall GPU utilization by 40–60% compared to unified serving.

The Capacity Planning Worksheet

Before provisioning GPUs, calculate these four numbers:

1. Weight memory. Parameters × bytes per parameter at your chosen precision. Apply quantization savings.

2. KV cache budget. Use the formula (2 × layers × KV heads × head dim × bytes) × (p95 context length) × (target concurrent requests). This is your dominant cost at scale.

3. Overhead. Add 20% for activation memory, framework buffers, and CUDA context. For vLLM specifically, the gpu-memory-utilization parameter (default 0.9) controls how much of available VRAM the engine will use — the remaining 10% is reserved for this overhead.

4. Headroom. Reserve 10–15% for traffic spikes. If your p99 concurrency is 2x your median, you need buffer for burst absorption without OOM kills.

Example: Llama 3.1 70B serving a chatbot at 8K average context, 32 concurrent users, INT4 quantization.

  • Weights: 70B × 0.5 bytes = 35 GB
  • KV cache: 0.16 MB/token (8-bit KV cache elements, half the BF16 figure) × 8K tokens × 32 users = ~41 GB
  • Overhead: (35 + 41) × 0.2 = ~15 GB
  • Headroom: (35 + 41 + 15) × 0.15 = ~14 GB
  • Total: ~105 GB → 2× A100-80GB (with tensor parallelism)
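The four-step worksheet fits in a small function. This sketch reproduces the example above; the 0.16 MB/token input assumes a KV cache stored at one byte per element:

```python
def capacity_plan_gb(params_b: float, bytes_per_param: float,
                     kv_mb_per_token: float, p95_context: int, concurrency: int,
                     overhead_frac: float = 0.20, headroom_frac: float = 0.15) -> float:
    """The four-number capacity worksheet: weights, KV cache, overhead, headroom."""
    weights = params_b * bytes_per_param                     # 1. weight memory
    kv = kv_mb_per_token * p95_context * concurrency / 1024  # 2. KV cache budget
    overhead = (weights + kv) * overhead_frac                # 3. framework overhead
    headroom = (weights + kv + overhead) * headroom_frac     # 4. burst headroom
    return weights + kv + overhead + headroom

# Llama 3.1 70B, INT4 weights, 8-bit KV cache, 8K p95 context, 32 concurrent users
print(round(capacity_plan_gb(70, 0.5, 0.16, 8_192, 32)))  # ~105 GB -> 2x A100-80GB
```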

Without quantization at FP16: weights alone are 140 GB, KV cache at BF16 is ~80 GB, total exceeds 260 GB — requiring 4× A100-80GB. Quantization cut the GPU count in half.

Without the calculation: most teams would provision 4× A100-80GB "to be safe" for a 70B model, then discover they only use 60% of available memory under normal load.

The Mistakes That Cost 3x

The three most common over-provisioning patterns:

Sizing for max context instead of actual distribution. If your model supports 128K but 95% of requests use under 4K tokens, sizing KV cache for 128K wastes 32x the memory per request slot. Profile your traffic before provisioning.

Ignoring quantization because "quality might suffer." INT4 quantization with AWQ preserves 92–95% of model quality across benchmarks while cutting memory by 75%. The quality delta is smaller than the variance between different prompt phrasings of the same question. Run your own eval on your specific task — the numbers almost always justify quantization.

Running one model per GPU when traffic doesn't demand it. A 7B model at INT4 uses 4 GB on an 80 GB card. That's 95% waste. Co-locate smaller models, use LoRA multiplexing for fine-tuned variants, or share the GPU across model sizes with traffic-based routing.

GPU memory is the most expensive resource in your AI infrastructure. The difference between "provision and pray" and "calculate and pack" is typically a 2–3x reduction in hardware cost. The math takes an afternoon. The savings compound every month.
