Skip to main content

3 posts tagged with "quantization"

View all tags

Edge LLM Inference: When Latency, Privacy, or Cost Force You Off the Cloud

· 9 min read
Tian Pan
Software Engineer

A fine-tuned 7B parameter model running on a single RTX 4090 can outperform GPT-4 on domain-specific tasks while costing you nothing per token after the initial hardware investment. That is not a theoretical claim — Diabetica-7B, a diabetes-focused model, hit 87.2% accuracy on clinical queries, beating both GPT-4 and Claude 3.5 on the same benchmark. The catch? Getting there requires understanding exactly when edge inference makes sense and when it is an expensive distraction.

Most teams default to cloud APIs because they are easy — make an HTTP call, get tokens back. But that simplicity has costs that scale in ways engineers do not anticipate until it is too late, and those costs are not always measured in dollars.

GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

· 9 min read
Tian Pan
Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.

Self-Hosted LLMs in Production: The GPU Memory Math Nobody Tells You

· 10 min read
Tian Pan
Software Engineer

Most engineers who decide to self-host an LLM start with the same calculation: the model is 70B parameters, FP16 is 2 bytes per parameter, so that's 140 GB. They check that two A100-80GB GPUs fit 160 GB, feel satisfied, and order the hardware. Then they hit production and discover they've already run out of memory before serving a single real user.

The model weights are only part of the story. The piece that surprises almost every team is the KV cache — and understanding it changes every decision you make, from quantization choice to serving framework to how many GPUs you actually need.