
Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs provide 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they have run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.

What's Actually in GPU Memory at Inference Time

Total VRAM during inference has four components:

Model weights + KV cache + activations + framework overhead

Model weights are the familiar part. At FP16, a 70B parameter model occupies 140 GB. Quantize to INT4 and that drops to 35 GB. INT8 lands at 70 GB. FP8, which has become the production sweet spot on modern data center GPUs, sits at 70 GB with near-lossless accuracy.
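As a quick sanity check on those numbers, here is a minimal sketch of the weight arithmetic. The helper name `weight_gb` and the format table are illustrative, not taken from any library, and the figures ignore quantization metadata such as scales and zero points.

```python
# Bytes per parameter for each weight format mentioned above.
BYTES_PER_PARAM = {"fp16": 2.0, "bf16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}


def weight_gb(num_params: float, fmt: str) -> float:
    """Approximate weight footprint in decimal GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[fmt] / 1e9


for fmt in ("fp16", "fp8", "int8", "int4"):
    print(f"{fmt}: {weight_gb(70e9, fmt):.0f} GB")
```

For a 70B model this prints 140, 70, 70, and 35 GB, matching the figures above.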

The KV cache is where estimates go wrong. Every autoregressive decoding step needs access to the key and value vectors for all previous tokens in every active sequence. The formula:

KV Cache = 2 × layers × KV heads × head dimension × sequence length × batch size × bytes per element

For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dimension) at BF16, each token in an active sequence costs about 0.31 MB. That sounds small. But multiply by context length and concurrent requests, and it explodes:

  • A single request with 128K context consumes ~40 GB of KV cache alone
  • Four concurrent requests at that context length require 160 GB — just for the cache, before the model weights
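The formula is easy to check by hand. The sketch below plugs in the Llama 3.1 70B figures quoted above; the function name `kv_cache_bytes` is illustrative, and the results are reported in binary units (MiB, GiB), which is where the rounded ~0.31 MB and ~40 GB figures come from.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch_size, bytes_per_elem):
    # The leading 2 accounts for one key vector and one value vector
    # per token, per layer, per KV head.
    return 2 * layers * kv_heads * head_dim * seq_len * batch_size * bytes_per_elem


LLAMA_70B = dict(layers=80, kv_heads=8, head_dim=128)
BF16_BYTES = 2
MIB, GIB = 2**20, 2**30

per_token = kv_cache_bytes(**LLAMA_70B, seq_len=1, batch_size=1, bytes_per_elem=BF16_BYTES)
one_long = kv_cache_bytes(**LLAMA_70B, seq_len=128 * 1024, batch_size=1, bytes_per_elem=BF16_BYTES)
four_long = kv_cache_bytes(**LLAMA_70B, seq_len=128 * 1024, batch_size=4, bytes_per_elem=BF16_BYTES)

print(per_token / MIB)   # 0.3125 MiB per token
print(one_long / GIB)    # 40.0 GiB for one 128K-token request
print(four_long / GIB)   # 160.0 GiB for four of them
```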

At a more typical 8K context with a batch of 8 concurrent requests, the same formula gives about 20 GB of KV cache for Llama 3.1 70B at BF16, and the figure scales linearly with batch size: 32 concurrent requests at 8K need about 80 GB. That is more than twice the INT4 weights of the same model.

The naive sizing mistake is to plan memory for the model and leave the rest as headroom. The right framing is the opposite: plan for the KV cache first, then figure out where the model fits.
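One way to put "KV cache first" into practice is to ask how many full-length sequences fit once the weights are loaded. The sketch below is a rough planning aid, not a substitute for measuring: `max_concurrent_sequences` is a hypothetical helper, and the 10% reserve for activations and framework overhead is an assumed placeholder that you should replace with a measured value.

```python
# Llama 3.1 70B at BF16: 2 (K and V) x 80 layers x 8 KV heads x 128 dims x 2 bytes
KV_BYTES_PER_TOKEN = 2 * 80 * 8 * 128 * 2  # 327,680 bytes


def max_concurrent_sequences(total_vram_bytes, weight_bytes, context_len,
                             kv_bytes_per_token=KV_BYTES_PER_TOKEN,
                             overhead_fraction=0.10):
    """Number of full-length sequences whose KV cache fits after weights.

    overhead_fraction is an assumed reserve for activations and
    framework overhead, expressed as a share of total VRAM.
    """
    reserved = total_vram_bytes * overhead_fraction
    cache_budget = total_vram_bytes - weight_bytes - reserved
    if cache_budget <= 0:
        return 0
    return int(cache_budget // (kv_bytes_per_token * context_len))


two_a100 = 2 * 80e9  # two A100-80GB cards, in bytes

print(max_concurrent_sequences(two_a100, 140e9, 8192))  # FP16 weights
print(max_concurrent_sequences(two_a100, 70e9, 8192))   # FP8 weights
```

Under these assumptions, the two-A100 setup from the opening example has room for a single 8K-token sequence with FP16 weights, and 27 with FP8 weights.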

Quantization: The Real Tradeoffs Between INT4, FP8, and FP16

Most teams treat quantization as a binary choice — "do we compress or not?" — when it's actually three distinct operating regimes with different failure modes.

FP16/BF16 is the baseline. Full weights, no approximation error, maximum compatibility. It's the right choice when accuracy is non-negotiable — medical inference, legal document analysis, financial modeling — or when you're not memory-constrained and don't want to chase quantization bugs.

FP8 has become the production sweet spot for teams running modern data center GPUs. The H100, H200, and B200 have native FP8 tensor cores that deliver roughly 2x the throughput of FP16 while halving the weight footprint, with near-lossless accuracy. Meta's Llama 3.3-70B with FP8 quantization shows 99%+ quality recovery against the FP16 baseline, with 30% lower latency and 50% higher throughput. If you're running on H100s or newer, defaulting to FP8 for inference is almost always correct.

INT4 is the format that gets the most attention because it makes large models fit on much smaller hardware: a 70B model compresses to ~35 GB, fitting on a single A100-80GB with room for the KV cache. Modern quantization methods like GPTQ and AWQ recover most of the quality lost by 4-bit compression, often preserving 95%+ of benchmark performance. The tradeoff is that accuracy degradation is not uniform. Models tend to lose precision on edge cases, complex multi-step reasoning, and rare tokens first. If your application's failure cases are concentrated in exactly those areas, INT4 will quietly degrade quality in ways that are hard to catch in aggregate evals.

A pattern gaining traction in 2026 is format hybridization: FP8 for attention layers (where numerical sensitivity is higher) and INT4 for the MLP blocks (where compression is more forgiving). This lets you fit a 70B model on fewer GPUs while protecting the parts of the architecture most sensitive to rounding errors.
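To get a feel for what hybridization buys, the sketch below estimates the footprint of the transformer layer stack under three schemes. Several inputs are assumptions not stated in this article: the hidden size (8192), query head count (64), and MLP intermediate size (28672) are taken from the published Llama 3.1 70B configuration, and the estimate ignores embeddings, norms, and quantization scales.

```python
# Assumed Llama 3.1 70B dimensions (from the published model config).
HIDDEN, Q_HEADS, KV_HEADS, HEAD_DIM = 8192, 64, 8, 128
INTERMEDIATE, LAYERS = 28672, 80
GIB = 2**30


def attention_params_per_layer():
    q_and_o = 2 * HIDDEN * (Q_HEADS * HEAD_DIM)   # query and output projections
    k_and_v = 2 * HIDDEN * (KV_HEADS * HEAD_DIM)  # key and value projections (GQA)
    return q_and_o + k_and_v


def mlp_params_per_layer():
    return 3 * HIDDEN * INTERMEDIATE  # gate, up, and down projections


def layer_stack_bytes(attn_bytes_per_param, mlp_bytes_per_param):
    per_layer = (attention_params_per_layer() * attn_bytes_per_param
                 + mlp_params_per_layer() * mlp_bytes_per_param)
    return LAYERS * per_layer


print(layer_stack_bytes(1.0, 1.0) / GIB)  # all FP8: 63.75 GiB
print(layer_stack_bytes(1.0, 0.5) / GIB)  # FP8 attention + INT4 MLP: 37.5 GiB
print(layer_stack_bytes(0.5, 0.5) / GIB)  # all INT4: 31.875 GiB
```

Because the MLP blocks hold most of the parameters, keeping attention at FP8 costs only about 6 GiB over all-INT4 in this estimate, while saving about 26 GiB against all-FP8.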

One more thing about quantization that teams discover late: the KV cache can be quantized independently of the weights. Running FP8 KV cache quantization on a 70B model can match the latency of an FP16 model running with twice the tensor parallelism, which means you can use half the GPUs at equivalent speed. This optimization is available in vLLM and significantly changes the cost math for long-context workloads.

Choosing Between vLLM, TGI, and llama.cpp

The wrong way to pick an inference framework is to benchmark single-user throughput and call it done. The serving characteristics diverge dramatically under concurrency, and the framework you'd choose for a personal assistant is different from the one you'd choose for a customer-facing product.

vLLM is the default choice for multi-user production. Its PagedAttention memory management, borrowed from OS virtual memory concepts, reduces KV cache waste from 60–80% (with naive pre-allocation) to under 4% by managing memory in fixed-size blocks. That improvement alone can double or triple concurrent request capacity on fixed hardware. Benchmark numbers: at 100 concurrent requests for a 7B model, vLLM achieves around 15,000 tokens per second versus TGI's 4,100 — a 3.7x advantage that widens to 24x under extreme load. Against llama.cpp, vLLM delivers over 35x the request throughput at peak. The time-to-first-token stays nearly flat from 1 to 64 concurrent users because the scheduler handles queue pressure efficiently.
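The gap between pre-allocation and block-based allocation is easy to see in a toy model. The sketch below is not vLLM's implementation: it only compares how many token slots each strategy reserves for a set of in-flight sequences. The sequence lengths, the 2048-token maximum, and the 16-token block size are all assumed values chosen for illustration.

```python
import math


def naive_waste(seq_lens, max_len):
    """Fraction of reserved slots left unused when every sequence
    gets a contiguous max_len-sized region up front."""
    reserved = len(seq_lens) * max_len
    return 1 - sum(seq_lens) / reserved


def paged_waste(seq_lens, block_size=16):
    """Fraction of reserved slots left unused when memory is handed out
    in fixed-size blocks; only the last block of each sequence can be
    partially empty."""
    reserved = sum(math.ceil(n / block_size) * block_size for n in seq_lens)
    return 1 - sum(seq_lens) / reserved


seq_lens = [100, 500, 1200, 30]  # tokens currently held by four requests

print(f"naive: {naive_waste(seq_lens, max_len=2048):.1%}")
print(f"paged: {paged_waste(seq_lens):.1%}")
```

With these made-up lengths, pre-allocation wastes about 78% of the reserved slots and block allocation about 1.4%, which lines up with the ranges quoted above. Waste under paging is bounded by less than one block per sequence, regardless of how far the sequence is from the maximum length.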

The tradeoff: vLLM's inter-token latency increases slightly under high concurrency because large batches take longer to compute each step for any individual request. For interactive applications where perceived responsiveness matters as much as throughput, this is worth measuring rather than assuming.
