Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs fit 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've already run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.

What's Actually in GPU Memory at Inference Time

Total VRAM during inference has four components:

Model weights + KV cache + activations + framework overhead

Model weights are the familiar part. At FP16, a 70B parameter model occupies 140 GB. Quantize to INT4 and that drops to 35 GB. INT8 lands at 70 GB. FP8, which has become the production sweet spot on modern data center GPUs, sits at 70 GB with near-lossless accuracy.
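The weight arithmetic is simple enough to keep as a helper; a minimal sketch, assuming memory is just parameter count times bytes per parameter:

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

# 70B parameters at each precision discussed above
for fmt, bits in [("FP16", 16), ("FP8", 8), ("INT8", 8), ("INT4", 4)]:
    print(f"{fmt}: {weight_memory_gb(70e9, bits):.0f} GB")
```

Running this reproduces the figures above: 140 GB at FP16, 70 GB at FP8 or INT8, 35 GB at INT4.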

The KV cache is where estimates go wrong. Every autoregressive decoding step needs access to the key and value vectors for all previous tokens in every active sequence. The formula:

KV Cache = 2 × layers × KV heads × head dimension × sequence length × batch size × bytes per element

For Llama 3.1 70B (80 layers, 8 KV heads, 128 head dimension) at BF16, each token in an active sequence costs about 0.31 MB. That sounds small. But multiply by context length and concurrent requests, and it explodes:

  • A single request with 128K context consumes ~40 GB of KV cache alone
  • Four concurrent requests at that context length require 160 GB — just for the cache, before the model weights

At a more typical 8K context with a batch of 8 concurrent requests, the KV cache for Llama 3.1 70B still consumes around 20 GB. Because the cost scales linearly with both context length and concurrency, a batch of 32 at that same context already exceeds the 35 GB of INT4-quantized weights.
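The formula translates directly into a few lines of Python. The Llama 3.1 70B shape (80 layers, 8 KV heads, 128 head dimension) is from the text above, with BF16 assumed at 2 bytes per element:

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    """KV cache size: 2 (K and V) x layers x heads x head_dim x tokens."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Llama 3.1 70B: 80 layers, 8 KV heads (GQA), head dimension 128
per_token = kv_cache_bytes(80, 8, 128, seq_len=1, batch=1)
print(f"per token: {per_token / 2**20:.2f} MiB")

long_ctx = kv_cache_bytes(80, 8, 128, seq_len=128 * 1024, batch=1)
print(f"one 128K-context request: {long_ctx / 2**30:.0f} GiB")
```

This reproduces the ~0.31 MB per token and the ~40 GB for a single 128K-context request.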

The naive sizing mistake is to plan memory for the model and leave the rest as headroom. The right framing is the opposite: plan for the KV cache first, then figure out where the model fits.

Quantization: The Real Tradeoffs Between INT4, FP8, and FP16

Most teams treat quantization as a binary choice — "do we compress or not?" — when it's actually three distinct operating regimes with different failure modes.

FP16/BF16 is the baseline. Full weights, no approximation error, maximum compatibility. It's the right choice when accuracy is non-negotiable — medical inference, legal document analysis, financial modeling — or when you're not memory-constrained and don't want to chase quantization bugs.

FP8 has become the production sweet spot for teams running modern data center GPUs. The H100, H200, and B200 have native FP8 tensor cores that deliver roughly 2x throughput over FP16 at half the memory footprint, with near-lossless accuracy. Meta's Llama 3.3-70B with FP8 quantization shows 99%+ quality recovery against the FP16 baseline, with 30% lower latency and 50% higher throughput. If you're running on H100s or newer, defaulting to FP8 for inference is almost always correct.

INT4 is the format that gets the most attention because it makes large models fit on consumer hardware — a 70B model compresses to ~35 GB, fitting on a single A100-80GB with room for the KV cache. Modern quantization methods like GPTQ and AWQ recover most of the quality lost by 4-bit compression, often preserving 95%+ of benchmark performance. The tradeoff is that accuracy degradation is not uniform. Models tend to lose precision on edge cases, complex multi-step reasoning, and rare tokens first. If your application's failure cases are concentrated in exactly those areas, INT4 will quietly degrade quality in ways that are hard to catch in aggregate evals.

A pattern gaining traction in 2026 is format hybridization: FP8 for attention layers (where numerical sensitivity is higher) and INT4 for the MLP blocks (where compression is more forgiving). This lets you fit a 70B model on fewer GPUs while protecting the parts of the architecture most sensitive to rounding errors.
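The memory payoff of hybridization is easy to estimate. A sketch, where the 20% attention share is an illustrative assumption (the real split varies by architecture; for GQA models like Llama 70B the attention fraction is smaller than for full multi-head attention):

```python
def hybrid_weight_gb(num_params: float, attn_frac: float,
                     attn_bits: int, mlp_bits: int) -> float:
    """Weights memory under a mixed-precision scheme, in GB."""
    attn = num_params * attn_frac * attn_bits / 8
    mlp = num_params * (1 - attn_frac) * mlp_bits / 8
    return (attn + mlp) / 1e9

# Hypothetical split: ~20% of parameters in attention layers
print(f"FP8 attn + INT4 MLP: {hybrid_weight_gb(70e9, 0.20, 8, 4):.0f} GB")
print(f"uniform INT4:        {hybrid_weight_gb(70e9, 0.20, 4, 4):.0f} GB")
```

Under that split, the hybrid scheme costs roughly 42 GB against 35 GB for uniform INT4: a modest memory premium for protecting the numerically sensitive layers.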

One more thing about quantization that teams discover late: the KV cache has its own quantization. Running FP8-KV cache quantization on a 70B model can match the latency of an FP16 model running with twice the tensor parallelism — meaning you can use half the GPUs at equivalent speed. This optimization is available in vLLM and significantly changes the cost math for long-context workloads.
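In vLLM this is a single flag. A sketch of a launch command, assuming the OpenAI-compatible server on a 2-GPU node (the model name, parallelism, and context limit are illustrative):

```shell
# Serve with FP8 KV cache quantization; halves cache memory vs FP16/BF16
vllm serve meta-llama/Llama-3.1-70B-Instruct \
    --kv-cache-dtype fp8 \
    --tensor-parallel-size 2 \
    --max-model-len 32768
```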

Choosing Between vLLM, TGI, and llama.cpp

The wrong way to pick an inference framework is to benchmark single-user throughput and call it done. The serving characteristics diverge dramatically under concurrency, and the framework you'd choose for a personal assistant is different from the one you'd choose for a customer-facing product.

vLLM is the default choice for multi-user production. Its PagedAttention memory management, borrowed from OS virtual memory concepts, reduces KV cache waste from 60–80% (with naive pre-allocation) to under 4% by managing memory in fixed-size blocks. That improvement alone can double or triple concurrent request capacity on fixed hardware. Benchmark numbers: at 100 concurrent requests for a 7B model, vLLM achieves around 15,000 tokens per second versus TGI's 4,100 — a 3.7x advantage that widens to 24x under extreme load. Against llama.cpp, vLLM delivers over 35x the request throughput at peak. The time-to-first-token stays nearly flat from 1 to 64 concurrent users because the scheduler handles queue pressure efficiently.

The tradeoff: vLLM's inter-token latency increases slightly under high concurrency because large batches take longer to compute each step for any individual request. For interactive applications where perceived responsiveness matters as much as throughput, this is worth measuring rather than assuming.
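The waste mechanics behind PagedAttention are easy to see with toy numbers. A sketch comparing naive max-length pre-allocation against fixed-size block allocation, using an illustrative mixed-length batch and a 16-token block size:

```python
import math

def naive_alloc(seq_lens, max_len):
    """Pre-allocate max_len KV slots per sequence, regardless of length."""
    return len(seq_lens) * max_len

def paged_alloc(seq_lens, block=16):
    """Allocate fixed-size blocks on demand; waste is under one block/seq."""
    return sum(math.ceil(n / block) * block for n in seq_lens)

# Mixed-length batch: several short chats plus one long document
seqs = [120, 340, 95, 4000, 210, 180, 75, 512]
used = sum(seqs)
print(f"naive waste: {1 - used / naive_alloc(seqs, max_len=8192):.0%}")
print(f"paged waste: {1 - used / paged_alloc(seqs):.0%}")
```

With this batch, naive pre-allocation wastes about 92% of the reserved slots while block allocation wastes about 1%, which is the same mechanism behind the 60–80% versus under-4% figures above.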

TGI (Text Generation Inference) entered maintenance mode in December 2025. Hugging Face now recommends vLLM or SGLang for new production deployments. Teams already running TGI should plan a migration path, particularly as model architecture changes start breaking compatibility without upstream fixes.

llama.cpp is the right tool when portability matters more than throughput. It runs on any hardware — NVIDIA, AMD, Intel Arc, Apple Silicon, CPU-only — with zero framework dependencies and fast startup time. A single-user interactive application on an RTX 3090 running an 8B model at Q4 quantization gets roughly 120 tokens per second, which is more than fast enough. The problem appears the moment you add concurrent users. llama.cpp uses simple sequential queuing: requests wait in line before being processed. Response times grow exponentially with concurrency, making it structurally unsuitable for applications serving multiple simultaneous users. It also offers no PagedAttention equivalent, so KV cache waste is high under mixed-length workloads.

SGLang is worth mentioning as the framework gaining momentum for structured generation and multi-modal workloads. If your use case involves constrained decoding, function calling, or JSON output at high throughput, SGLang's RadixAttention (which caches KV across requests sharing common prefixes) can significantly outperform vLLM on those specific patterns.

The decision matrix in practice:

  • Single user or developer workstation: llama.cpp
  • Production multi-user serving on NVIDIA data center GPUs: vLLM
  • High-volume structured generation or prefix-heavy workloads: SGLang
  • Existing TGI deployment: migrate to vLLM or SGLang

The Real Break-Even Math

The case for self-hosting is usually framed as "cloud APIs are expensive per token." The math is true but incomplete. A realistic 70B production setup costs around $20,000 per month: two redundant GPU nodes, monitoring infrastructure, networking, and 20% of an ML infrastructure engineer's time. That last line item is the one teams consistently forget to include.

Cloud providers increasingly charge compliance premiums. Data residency guarantees and zero-retention agreements add 20–40% to enterprise API contracts. Some providers charge a multiplier just for US-only processing. For teams in regulated industries — healthcare (HIPAA), finance (SOX/PCI-DSS), government (FedRAMP) — the compliance cost of cloud APIs can be significant enough to shift the break-even point lower.

That said, LLM API pricing dropped roughly 80% from 2025 to 2026. The per-token costs that made self-hosting attractive at 2024 volumes no longer apply at current rates. The break-even thresholds look like:

  • Under $20k/month in API spend: Stay with managed services. Self-hosting overhead exceeds savings.
  • $20k–$80k/month: Optimize routing first. Sending high-volume, low-stakes requests to smaller models often achieves the same savings without infrastructure burden.
  • Over $80k/month: Run a detailed hybrid analysis with your actual traffic distribution. Self-hosting the high-volume workloads while keeping frontier model access for the tail cases often produces the best economics.
  • Data residency requirements: Self-host those specific workloads regardless of volume.
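The thresholds above can be sanity-checked with a toy calculator. The $20k/month self-hosting figure is from the text; the assumption that 80% of traffic is self-hostable is a hypothetical input you should replace with your own distribution:

```python
def monthly_costs(api_spend: float, selfhost_fixed: float = 20_000,
                  selfhostable_frac: float = 0.8) -> dict:
    """Compare pure-API spend against a hybrid stack: self-host the
    routine traffic, keep the frontier-model tail on the API."""
    hybrid = selfhost_fixed + api_spend * (1 - selfhostable_frac)
    return {"api_only": api_spend, "hybrid": hybrid,
            "hybrid_saves": api_spend - hybrid}

for spend in (15_000, 50_000, 120_000):
    c = monthly_costs(spend)
    print(f"${spend:>7,}/mo API -> hybrid ${c['hybrid']:,.0f}, "
          f"saves ${c['hybrid_saves']:,.0f}")
```

At $15k/month the hybrid loses $8k; at $50k it saves $20k, within reach of routing optimizations alone; at $120k it saves $76k, which is where the detailed analysis earns its keep.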

The model quality gap is the hidden consideration in this math. If even 20% of your use cases need frontier-level reasoning — the kind you can only get from the largest proprietary models — you end up maintaining a hybrid infrastructure anyway. The cost savings from self-hosting need to be calculated against the burden of running two inference stacks, not one.

Practical Sizing Workflow

When sizing a self-hosted LLM deployment, work through this sequence:

First, determine your model weights memory. Use the precision appropriate for your quality requirements — FP8 if you're on modern data center GPUs, INT4 if you're memory-constrained and your use case tolerates the tradeoff, FP16 if accuracy is non-negotiable.

Second, estimate your KV cache requirements. Take your expected context length, multiply by the per-token KV cost for your model at your chosen precision, then multiply by your peak concurrent request count. Add 30–50% headroom. This number usually surprises people.

Third, add framework overhead (typically 2–4 GB for CUDA drivers and PyTorch kernels, plus the vLLM safety buffer of 5–20% of remaining GPU memory).

Fourth, compare total VRAM requirements against available GPU configurations. If the numbers don't fit on a single GPU, evaluate tensor parallelism overhead before assuming you need to double hardware.

Finally, validate under realistic load. Single-request benchmarks are misleading. The metrics that matter in production are time-to-first-token and inter-token latency at your actual P99 concurrency, not synthetic peak throughput numbers.
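The first four steps collapse into a short budgeting function. The per-token KV cost and the 2–4 GB framework overhead follow the text; the 40% headroom is the midpoint of the suggested 30–50%, and the example workload is illustrative:

```python
def vram_budget_gb(num_params: float, weight_bits: int,
                   kv_bytes_per_token: float, context_len: int,
                   peak_concurrency: int, headroom: float = 0.40,
                   framework_gb: float = 3.0) -> float:
    """Total VRAM needed: weights + KV cache (with headroom) + overhead."""
    weights = num_params * weight_bits / 8 / 1e9
    kv = kv_bytes_per_token * context_len * peak_concurrency / 1e9
    return weights + kv * (1 + headroom) + framework_gb

# Llama 3.1 70B at FP8, 8K context, 16 concurrent, ~0.31 MB/token KV (BF16)
need = vram_budget_gb(70e9, 8, 0.31e6, 8192, 16)
print(f"need ~{need:.0f} GB -> compare against available GPU configurations")
```

For this workload the budget lands around 130 GB: it fits a single 141 GB H200 but not a single 80 GB GPU, which is exactly the kind of fit-or-split question step four is meant to answer.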

What Most Teams Get Wrong

The most common mistake is sizing for the model, not for the workload. A team that deploys a 70B model on two A100-80GB GPUs with comfortable weight headroom will hit OOM errors the first time a user opens a long document, because nobody calculated what 50 concurrent requests at 4K context each does to KV cache consumption.

The second mistake is treating the serving framework as an implementation detail to figure out later. PagedAttention vs. naive KV cache pre-allocation isn't a minor optimization — it's the difference between serving 8 concurrent users and serving 24 on identical hardware. That decision belongs in the architecture phase.

The third mistake is running INT4 quantization across all layers and assuming quality degradation is uniform. It isn't. The attention mechanisms are more sensitive than the FFN weights. Teams who switch to hybrid quantization schemes — FP8 or INT8 for attention, INT4 for the rest — often recover most of the quality loss from full INT4 while keeping the memory advantages.

Self-hosting an LLM is an infrastructure discipline, not just a model deployment. The teams that get it right treat the KV cache as a first-class resource to be managed, choose serving frameworks based on concurrency profiles rather than single-user benchmarks, and keep honest books on the engineering overhead that goes into running reliable inference at scale.
