Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You
Most engineers who decide to self-host an LLM start with the same calculation: the model has 70B parameters, FP16 takes 2 bytes per parameter, so the weights need 140 GB. They check that two A100-80GB GPUs provide 160 GB in total, feel satisfied, and order the hardware. Then they hit production and discover they've run out of memory before serving a single real user.
The model weights are only part of the story. The piece that surprises most teams is the KV cache, and understanding it changes nearly every decision you make, from quantization choice to serving framework to how many GPUs you actually need.
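To make the arithmetic concrete, here is a rough sketch of the naive weight calculation next to a KV cache estimate. The architecture numbers (80 layers, 8 KV heads, head dimension 128) are an assumption modeled on a Llama-2-70B-style model with grouped-query attention, and the workload (32 concurrent sequences of 4,096 tokens each) is purely hypothetical. It ignores activations, framework overhead, and fragmentation, so treat it as a lower bound rather than a sizing tool.

```python
GB = 1e9  # decimal gigabytes, matching the "140 GB" back-of-envelope figure


def weight_bytes(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory for the weights alone (FP16 = 2 bytes per parameter)."""
    return n_params * bytes_per_param


def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                             bytes_per_elem: int = 2) -> int:
    """KV cache cost of ONE token: a key and a value vector in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


# The naive calculation: 70B parameters in FP16.
weights_gb = weight_bytes(70e9) / GB

# Assumed architecture (Llama-2-70B-like, grouped-query attention).
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=8, head_dim=128)

# Hypothetical workload: 32 concurrent sequences, 4096 tokens each.
batch_size, seq_len = 32, 4096
kv_gb = per_token * batch_size * seq_len / GB

total_gb = weights_gb + kv_gb
print(f"weights:  {weights_gb:.1f} GB")
print(f"KV cache: {kv_gb:.1f} GB ({per_token} bytes per token)")
print(f"total:    {total_gb:.1f} GB vs. 160 GB on two A100-80GB")
```

Under these assumptions the KV cache alone adds roughly 43 GB, which pushes the total past the 160 GB that looked comfortable on paper. A model without grouped-query attention, a longer context, or a larger batch would make the gap wider.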
