The Quantization Quality Cliff: When int4 Passes the Median Eval and Fails on the Long Tail
A team swaps an fp16 model for an int4 quantization to halve serving cost. The eval suite scores within a point of the original on the curated test set. The rollout ships under the rationale "indistinguishable on the benchmark." Six weeks later, support is fielding catastrophic-failure quotes from regulated customers — code that compiles to nonsense, low-resource-language responses that drift into another script, multi-hop arithmetic that confidently returns numbers off by an order of magnitude. The benchmark didn't lie. It just measured the median, and quantization is not a uniform tax on the median. It is a non-uniform tax on the tail.
This is the quantization quality cliff: the moment your eval suite, your rollout discipline, and your cost-savings narrative all simultaneously fail because the metric you used to approve the swap had no signal on the capabilities you destroyed. Recent benchmarks make the magnitude concrete. On long-context tasks, 8-bit quantization preserves accuracy with roughly a 0.8% drop, while 4-bit methods lose up to 59% on the same workload — a regression invisible to any test set that doesn't oversample tail inputs. Median moved one point. Tail moved fifteen, or thirty, or fifty.
