
Quantization Slippage: The Capability Tax Your Eval Set Was Never Built to Catch

11 min read
Tian Pan
Software Engineer

A self-hosted LLM team quantizes the production model from fp16 to int4. Memory drops 4×, throughput nearly doubles, the GPU bill shrinks, and the team reruns the same eval suite that gated the fp16 release. MMLU-Pro retains 98.1% of baseline. Aggregate quality looks fine. They ship.

Six weeks later, a support engineer notices the math tutoring feature has gotten quietly worse. The compliance team flags an uptick in policy-violation completions on adversarial prompts. The structured-output retry rate has crept from 1.4% to 6.8%. None of these show up on the eval dashboard, because the eval dashboard was built to validate a different model — the one that shared the same underlying weights but carried four times more bits in each of them.

This is quantization slippage. The cost analysis priced the memory win and the latency win. It did not price the eval re-anchoring that the swap silently demanded, and the eval suite, calibrated against the fp16 distribution, is now grading the wrong model with the wrong rubric.

A quantized model is a different model that happens to share weights

The mental model most teams carry into a quantization rollout is "same model, less precision." The benchmark numbers reinforce it. An int8 checkpoint typically shows a 0.04% drop from bf16 on aggregate benchmarks — basically noise. Even aggressive int4 retains around 98% of baseline reasoning capability on MMLU-Pro under good methods like AWQ. The procurement deck rounds that to "free win" and moves on.

The procurement deck is wrong about what changed. Lowering the bit-width does not lower precision uniformly across the model's behavior — the quantizer spends its error budget where it hurts least on the calibration distribution, protecting the weights and activations that matter most for that data, which is not the same as the ones that matter most for your production traffic. A few specific things shift:

  • The output distribution moves. Token probabilities at the tail flatten or sharpen unevenly. The model picks slightly different tokens on prompts where the top-1 and top-2 were close. Aggregate scores barely move because most prompts aren't close calls. Tail prompts, where they matter, are exactly where the model now disagrees with its previous self.
  • Attention patterns flatten under aggressive quantization. This is well-documented in adversarial robustness work — int4 GPTQ checkpoints degrade by more than 10% on prompt-injection resistance relative to the fp16 baseline, while AWQ at the same bit-width sometimes improves over the dense baseline. Same model family, same prompt, completely different injection-resistance posture.
  • Refusal patterns drift. The int4 model is not uniformly more or less likely to refuse — it's differently likely on different prompt shapes. Some categories tighten, others loosen. If the team never measured a "calibrated refusal rate" as an SLI, the drift is invisible until a customer or auditor finds it.
  • Format compliance degrades non-uniformly. Across one published benchmark of 4-bit quantized models on JSON output, response-format success rates ranged from 0% to 100% depending on schema complexity. The model that hit 99% format compliance at fp16 may hit 87% at int4 on composite objects and 99% on flat schemas — and your eval set, if it has three flat-schema fixtures, will report no regression (a slicing sketch follows below).

The thing to internalize: weight-sharing is a deployment optimization, not a semantic guarantee. The quantized checkpoint is a function with the same parameter count and a different mapping. Treat it that way.
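
One concrete way to close the format-compliance blind spot is to score compliance per schema-complexity slice and put the two checkpoints side by side. A minimal sketch, assuming each fixture carries a JSON Schema and a slice tag; the field names, slice labels, and the jsonschema-based check are illustrative, not anything from the original rollout:

```python
import json
from collections import defaultdict

from jsonschema import ValidationError, validate


def is_compliant(completion: str, schema: dict) -> bool:
    # A completion counts only if it parses as JSON and satisfies the schema.
    try:
        validate(instance=json.loads(completion), schema=schema)
        return True
    except (json.JSONDecodeError, ValidationError):
        return False


def compliance_by_slice(fixtures: list[dict], completions: dict[str, str]) -> dict[str, float]:
    # fixtures: [{"id": ..., "schema": ..., "slice": "flat" | "nested" | "composite"}]
    # completions: fixture id -> raw model output for that fixture.
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for fx in fixtures:
        totals[fx["slice"]] += 1
        hits[fx["slice"]] += is_compliant(completions[fx["id"]], fx["schema"])
    return {s: hits[s] / totals[s] for s in totals}


# Run once per checkpoint and compare per slice, not in aggregate: a 97%
# average can hide a composite-schema slice that fell from 99% to 87%.
# fp16_rates = compliance_by_slice(fixtures, fp16_completions)
# int4_rates = compliance_by_slice(fixtures, int4_completions)
```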

The eval set was calibrated to the old distribution

The eval suite the team uses to greenlight the quantized rollout was not built from a blank slate. It accumulated over months: human annotators graded reference outputs against the fp16 model's stylistic tells, LLM-as-judge graders had their rubric prompts iterated until they correlated with the human scores on fp16 outputs, snapshot fixtures captured fp16's exact phrasing for golden-set comparison, and the team's intuition for "this output looks good" was trained on what fp16 produced.

Every one of those anchors moves silently when the model swaps. Concretely:

LLM-judge calibration drift. The judge was tuned to grade fp16 outputs. Run the same judge on int4 outputs and it now grades against a distribution it wasn't calibrated for. If the int4 model is slightly more verbose, the judge — which had implicitly internalized fp16's length norm — may grade it more harshly on concision and more leniently on completeness. The net score moves, and the team has no way to tell whether the model got worse, better, or just different.
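
A cheap guard here is to treat the judge as a component that also needs re-validation: hand-grade a small sample of int4 outputs and check that the judge still agrees with humans on that sample before letting its scores mean anything. A rough sketch, with the agreement threshold and data shapes as assumptions:

```python
# Re-check judge/human agreement on the int4 distribution before trusting
# the judge's verdicts on it.
from statistics import correlation  # Pearson r; Python 3.10+


def judge_still_calibrated(human_scores: list[float],
                           judge_scores: list[float],
                           min_r: float = 0.8) -> bool:
    # human_scores: fresh human grades on a sample of int4 completions.
    # judge_scores: the existing LLM judge's grades on the same completions.
    # Correlation measured on fp16 outputs says nothing about this pairing
    # until it is measured again on the new distribution.
    return correlation(human_scores, judge_scores) >= min_r
```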

Snapshot tests break for the wrong reason. A snapshot suite that asserts the model's exact wording is now noise. Snapshots fail not because the model regressed but because it took a different but equivalent path. The team disables the snapshots, loses regression coverage, and doesn't notice when a real regression slips through later.
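
One way to keep the regression coverage instead of disabling the suite is to pin the snapshot assertions to the properties the feature actually depends on rather than the exact bytes. A sketch of that idea; the specific fields checked are assumptions, not the post's harness:

```python
import json
import re


def normalize(text: str) -> str:
    # Collapse whitespace and case so formatting noise does not fail the test.
    return re.sub(r"\s+", " ", text).strip().lower()


def assert_equivalent(snapshot: str, output: str) -> None:
    try:
        expected, actual = json.loads(snapshot), json.loads(output)
    except json.JSONDecodeError:
        # Unstructured output: tolerate formatting noise at minimum; anything
        # stronger needs a semantic check rather than string equality.
        assert normalize(snapshot) == normalize(output)
        return
    # Structured output: compare the contract, not the phrasing.
    assert expected.keys() == actual.keys(), "schema drift"
    assert expected["answer"] == actual["answer"], "answer changed"
    assert actual.get("explanation", "").strip(), "missing explanation"
```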

Human-annotated gold labels expire. The "this prompt should produce this answer" labels in the eval set were chosen against fp16's behavior. Some prompts where fp16 confidently produced answer A now have int4 confidently producing answer B, which is also correct — and the gold label scores it wrong. Other prompts where fp16 produced A but int4 produces A' (a worse variant) score correct because A' contains the gold-label keywords. The net effect is noise in both directions, and the team trusts the score anyway.

Downstream classifiers shift their footing. A safety classifier the team trained on fp16 completions is now grading int4 completions. The classifier's input distribution moved. Its precision and recall on the new distribution are unknown until somebody measures, and the team usually doesn't measure because the classifier "wasn't touched."
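
The measurement is not expensive: draw a sample of int4 completions, get fresh human labels, and recompute the classifier's precision and recall on the new distribution before trusting the fp16-era numbers. A minimal sketch, with the sampling and labeling pipeline assumed:

```python
def precision_recall(preds: list[bool], labels: list[bool]) -> tuple[float, float]:
    # preds: classifier verdicts on a sample of int4 completions.
    # labels: fresh human labels on the same sample.
    tp = sum(p and l for p, l in zip(preds, labels))
    fp = sum(p and not l for p, l in zip(preds, labels))
    fn = sum(l and not p for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 1.0
    return precision, recall


# Compare against the numbers measured on fp16 completions; the classifier
# "wasn't touched," but its input distribution was.
```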

The eval set is not a fixed asset. It's a configuration coupled to the model that produced it. Swapping the model is a configuration change.

The slices the cost analysis forgot to price

The aggregate benchmark score is the headline number. It's also the wrong number for capability tracking, because capability loss under quantization is non-uniform across slices. The work in the past two years has been consistent on this point: GSM8K and other math tasks degrade earlier than the aggregate suggests because numeric consistency is precision-sensitive in a way semantic tasks aren't. Long-context prompts at 8k+ degrade on some int4 KV-cache configurations while short prompts stay clean. Code generation holds up on common patterns and fails on rare APIs. Low-resource languages degrade faster than English.
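
The operational consequence is that the rollout gate has to look at the worst slice, not the average. A minimal sketch of that gate; the slice names and the retention floor are placeholders, not recommendations:

```python
def retention_by_slice(fp16_scores: dict[str, float],
                       int4_scores: dict[str, float]) -> dict[str, float]:
    # Per-slice accuracies from the same eval run on each checkpoint, e.g.
    # {"gsm8k_math": 0.81, "long_context_8k": 0.74, "code_rare_api": 0.62, ...}
    return {s: int4_scores[s] / fp16_scores[s] for s in fp16_scores}


def failing_slices(retention: dict[str, float], floor: float = 0.97) -> list[str]:
    # Return the slices that slipped, worst first, so the report names the
    # capability that regressed instead of an average that hid it.
    return [s for s, r in sorted(retention.items(), key=lambda kv: kv[1]) if r < floor]
```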
