Silent Quantization: Why the Model You Pay For Today Isn't the Model You Paid For Last Quarter
The model name on your invoice is the same as it was last quarter. The version string in the API response hasn't changed. The model card and pricing page read identically. And yet your eval scores have drifted half a point downward, your refusal patterns shifted in ways your prompts didn't ask for, and a handful of customer complaints came in last Tuesday about output that "feels different." You debug your code. You don't find anything. The code didn't change. The weights did.
Silent quantization is the gap between the model you contracted for and the model the provider is actually serving. It happens because inference economics keep tightening — every dollar of GPU capacity has to feed more requests this quarter than last — and the cheapest way to absorb that pressure is to re-host the same model name on cheaper precision tiers. FP16 becomes FP8. FP8 becomes FP4 in some routes. Mixed-precision shards get swapped in. The version string doesn't move because the version string was never a precision contract; it was a marketing contract.
This isn't a conspiracy theory. OpenRouter has publicly disclosed that across billions of requests they routinely observed the same model with the same nominal quantization producing measurably different outputs depending on which provider served it — different enough that they shipped a separate "Exacto" tier for buyers who specifically wanted FP16 and were willing to pay for the guarantee. The Roo-Code issue tracker has a feature request to filter out quantized routes because aggressive quantization broke multi-byte CJK character decoding for users whose prompts hadn't changed. OpenAI's own developer forum has years of threads where users describe a "quality drop" from one week to the next without any version bump on the provider's side. Anthropic published a postmortem in April 2026 acknowledging that three concurrent changes — none of them advertised as model changes — measurably degraded Claude Code outputs for weeks before the company traced and reverted them. Not every one of these is strictly a quantization story. All of them are stories about the model you're paying for not being a stable artifact.
The Precision Tier Nobody Negotiated
The standard enterprise contract for an LLM vendor is a few pages of legal language, a usage-based pricing schedule, and a Service Level Agreement that talks almost exclusively about uptime, latency percentiles, and data handling. None of it specifies serving precision. None of it gives you a contractual handle on the question "is the model behind this endpoint FP16 or FP8 today?" Your contract says "GPT-class." It does not say "GPT at FP16."
This is the gap. The vendor's pricing page advertises a model name. The vendor's terms of service give them latitude to operate that endpoint efficiently. "Efficient operation" reasonably includes choosing a serving precision that meets internal quality bars. The bar the vendor uses is an aggregate one — typically MMLU-Pro, HumanEval+, a refusal-rate gate, and a basic safety panel. The recent benchmark data tells us why this works for the vendor: across 2026's open-weight 70B-class models, FP8 lands within 0.4 points of FP16 on MMLU-Pro, and calibrated INT8 lands within 0.7 points. From the vendor's perspective, that's a free 1.5x to 3x throughput lift on H100 hardware for less than a percentage point of accuracy. The decision is obvious.
It's obvious from your perspective too — until you remember that "less than a percentage point on MMLU-Pro" is an average across thousands of tasks. Your task is not the average. Quantization error is heavily task-dependent. World-knowledge retrieval (the kind MMLU stresses) is more susceptible than common-sense reasoning. Long-context attention is more sensitive than short-context. Code generation has tail behaviors that hide in averages. Multilingual prompts — especially CJK — can break catastrophically at FP4 because the tokenizer's edge cases interact badly with reduced numerical headroom. The vendor's macro number is fine. Your micro number isn't, and you have no contractual surface to discuss it.
How the Drift Actually Lands
The thing to understand about silent quantization is that it almost never lands as a single event. There is no announcement. There is no version bump. The new precision tier rolls out region by region, request by request, often as part of a routing layer rather than a model swap. A request that lands on cluster A in us-east-1 might still hit FP16 weights, and a request that lands on cluster B in us-east-1 might hit FP8. A multi-region deployment might see FP16 in EU and FP8 in US for the same model name. A retry on the same prompt might get a different precision than the original attempt.
This means three things for the team operating on top of the API. First, your aggregate metrics will see a slow drift, not a cliff: the kind of quality regression that snowballs into customer escalations over weeks rather than the kind that triggers an incident within the hour. Second, reproducibility is gone in a way it was not before. The same prompt, at temperature 0, with the same seed where the API supports one, will not necessarily produce the same output across requests, because it didn't necessarily hit the same weights. Third, the ground under your evals is moving while you stand on it. An eval suite that scored 92.4% last quarter is not lying when it scores 91.7% today; the suite is honest, the model is different, and the two numbers are not measurements of the same system.
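The second point is easy to check for yourself: fire one prompt at temperature 0 a couple of dozen times and count how many distinct outputs come back. A minimal sketch, assuming an OpenAI-compatible chat-completions endpoint and an API key in the environment; the endpoint URL, model name, and probe prompt are placeholders, not recommendations.

```python
import hashlib
import os
from collections import Counter

import requests

ENDPOINT = "https://api.openai.com/v1/chat/completions"  # assumption: any OpenAI-compatible endpoint
MODEL = "gpt-4o"                                          # placeholder model name
PROMPT = "Return the first ten prime numbers as a JSON array."

def completion_hash(prompt: str) -> str:
    """Send one greedy-decoding request and return a hash of the completion text."""
    resp = requests.post(
        ENDPOINT,
        headers={"Authorization": f"Bearer {os.environ['API_KEY']}"},
        json={
            "model": MODEL,
            "messages": [{"role": "user", "content": prompt}],
            "temperature": 0,
            "seed": 1234,  # honored only where the provider supports it
        },
        timeout=60,
    )
    resp.raise_for_status()
    text = resp.json()["choices"][0]["message"]["content"]
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

# Same prompt, same parameters, twenty times. A single stable serving stack should
# collapse this to one hash; traffic spread across precision tiers usually won't.
hashes = Counter(completion_hash(PROMPT) for _ in range(20))
print(f"{len(hashes)} distinct outputs across 20 identical requests")
```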
Anthropic's April 2026 postmortem is instructive even though it's not about quantization. Three separate changes — a default-reasoning-effort change, a thinking-block-clearing bug, and a verbosity-trimming system prompt — degraded Claude Code outputs simultaneously, none of them visible to API consumers as a version change, and all three required user complaints to surface. The mechanism was different from quantization, but the user experience was identical: the model you bought is not behaving like the model you bought, you can't tell why, and the version string is the same. Treat that postmortem as a template for thinking about quantization-class drift, because the surface area is the same: a vendor making operational decisions behind the interface you can see, and you finding out only by watching the outputs.
The Fingerprinting Probes
If the version string can't be trusted, you need something else. The thing you need is a fingerprinting probe set: a small, fixed corpus of prompts whose outputs are stable enough at FP16 that any meaningful precision shift will move them. The literature on black-box LLM fingerprinting has gotten useful enough in the last twelve months to actually use in production. ZeroPrint and similar zeroth-order methods can approximate gradient signal from input-output queries alone. Watermarking research has shown that domain-specific probes can achieve detection rates that survive quantization, pruning, and sampling variation — which is to say, they can also detect quantization, pruning, and sampling variation, because what makes a watermark robust is the same property that makes a probe sensitive.
The practical version of this for an engineering team isn't a research paper. It's a nightly job. Pick fifty to two hundred prompts that you understand intimately: long-tail prompts, prompts that exercise long-context attention, prompts that produce numerically sensitive outputs (think structured JSON with floating-point fields, or code that depends on a specific token at a specific position), multilingual prompts with diacritics or CJK, and a few prompts that historically sit on the edge of refusal behavior. Run them at temperature 0 against your provider every night. Hash the outputs. Diff against last night's hashes. When the diff rate jumps from its baseline, something moved on the vendor's side that the version string didn't tell you about.
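A minimal sketch of that nightly job. The file layout, manifest shape, and 10% alert threshold are assumptions for illustration, and the `completion_hash` stub stands in for whatever client you already use to call the provider at temperature 0 (the earlier sketch shows one way to wire it).

```python
import datetime
import hashlib
import json
import pathlib

PROBE_FILE = pathlib.Path("probes.json")     # pinned in version control; {"probe_id": {"prompt": ..., "category": ...}}
HISTORY_DIR = pathlib.Path("probe_history")  # one file of output hashes per night
DRIFT_THRESHOLD = 0.10                       # alert when more than 10% of probes change overnight

def completion_hash(prompt: str) -> str:
    # Stand-in so the sketch runs end to end. Replace with a real provider call
    # that sends `prompt` at temperature 0 and hashes the completion text.
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

def run_nightly() -> None:
    probes = json.loads(PROBE_FILE.read_text())
    today = {pid: completion_hash(meta["prompt"]) for pid, meta in probes.items()}

    HISTORY_DIR.mkdir(exist_ok=True)
    stamp = datetime.date.today().isoformat()
    (HISTORY_DIR / f"{stamp}.json").write_text(json.dumps(today, indent=2))

    runs = sorted(HISTORY_DIR.glob("*.json"))
    if len(runs) < 2:
        return  # first night: nothing to diff against yet
    yesterday = json.loads(runs[-2].read_text())

    changed = [pid for pid in today if yesterday.get(pid) != today[pid]]
    drift_rate = len(changed) / len(today)
    print(f"{stamp}: {len(changed)}/{len(today)} probes changed ({drift_rate:.1%})")
    if drift_rate > DRIFT_THRESHOLD:
        print("ALERT: probe drift above baseline; the weights behind the name likely moved.")

if __name__ == "__main__":
    run_nightly()
```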
This is not fancy. It is also far more than most teams have. The reason most teams don't have it is that "the version string didn't change" feels like sufficient assurance. It isn't. The version string is a label. The probe set is a measurement.
A few practical knobs make the probe set actually useful. Keep it small enough to run cheaply every night (under five hundred dollars per run), otherwise the budget conversation kills it. Keep at least ten prompts per failure mode you care about, because single-prompt drift is noise; pattern drift across a category is signal. Pin the prompt set in version control with a hash, because if your prompts drift you'll mistake your own changes for vendor drift. And keep the prompt set boring on purpose: do not put your most clever production prompts in there, because clever prompts are also the most fragile to model changes, including the ones you would eventually want to adopt as improvements rather than flag as regressions.
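To make the "pattern drift across a category" point operational, tag each probe with the failure mode it exercises and score drift per category rather than per prompt. A sketch that assumes the manifest and nightly hash files from the job above; the category names are examples, not a taxonomy.

```python
import json
import pathlib
from collections import defaultdict

MANIFEST = pathlib.Path("probes.json")       # same assumed shape as the nightly-job sketch
HISTORY_DIR = pathlib.Path("probe_history")  # nightly hash files written by that job

def drift_by_category() -> dict[str, float]:
    runs = sorted(HISTORY_DIR.glob("*.json"))[-2:]
    if len(runs) < 2:
        return {}
    yesterday, today = (json.loads(p.read_text()) for p in runs)
    manifest = json.loads(MANIFEST.read_text())

    changed, total = defaultdict(int), defaultdict(int)
    for pid, meta in manifest.items():
        category = meta["category"]  # e.g. "long_context", "cjk", "json_numeric", "refusal_edge"
        total[category] += 1
        if yesterday.get(pid) != today.get(pid):
            changed[category] += 1

    # One changed prompt in a category is noise; most of a category moving is signal.
    return {category: changed[category] / total[category] for category in total}

print(drift_by_category())
```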
Routing When You Can't Trust the Name
Once you accept that "model X" doesn't mean the same thing in two regions, the inference-routing strategy changes. You stop treating the model name as a unique identifier. You start treating the (provider, region, precision-tier) tuple as the unique identifier, even when the provider doesn't surface that tuple to you.
What this looks like in practice: a routing layer that knows which (provider, region) combinations have shown drift in the last seven days, and biases new requests toward combinations that haven't. A budget for "preferred precision" requests where you pay a premium to providers like OpenRouter's Exacto tier — or to the upstream provider directly with a contractual carve-out — for the routes you can't tolerate quantization on. A fallback rule that catches drift via the probe set and rolls a percentage of traffic to a secondary vendor when drift exceeds a threshold, the same way a circuit breaker rolls traffic away from a failing service. A pinning policy for the most quality-sensitive surfaces of your product, where the cost premium for guaranteed precision is justified by the user-facing impact, and an explicit acceptance of drift on the surfaces where the premium isn't worth it.
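A sketch of the drift-aware piece of that routing layer, keyed on the (provider, region, precision-tier) tuple. The route table, drift scores, cutoff, and canary fraction are invented for illustration; in practice the probe suite from the previous section is what populates the drift numbers.

```python
import random
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    provider: str
    region: str
    precision: str  # best-effort label: "fp16-pinned", "unknown", ...

# Hypothetical route table; in practice this comes from config plus contract terms.
ROUTES = [
    Route("vendor_a", "us-east-1", "fp16-pinned"),
    Route("vendor_a", "eu-west-1", "unknown"),
    Route("vendor_b", "us-east-1", "unknown"),
]

# Trailing-window probe-suite drift rate per route (0.0 = no drift observed this week).
recent_drift = {ROUTES[0]: 0.00, ROUTES[1]: 0.02, ROUTES[2]: 0.18}

DRIFT_CUTOFF = 0.10     # routes above this are treated like a tripped circuit breaker
CANARY_FRACTION = 0.05  # keep a trickle on degraded routes so recovery stays observable

def pick_route(quality_sensitive: bool) -> Route:
    if quality_sensitive:
        # Pay the premium: only precision-pinned routes qualify for sensitive surfaces.
        candidates = [r for r in ROUTES if r.precision == "fp16-pinned"]
    else:
        candidates = [r for r in ROUTES if recent_drift.get(r, 0.0) <= DRIFT_CUTOFF]
        degraded = [r for r in ROUTES if r not in candidates]
        if degraded and random.random() < CANARY_FRACTION:
            return random.choice(degraded)  # canary traffic to detect recovery
    if not candidates:
        candidates = ROUTES  # everything is drifting; degrade gracefully rather than fail
    return min(candidates, key=lambda r: recent_drift.get(r, 0.0))

print(pick_route(quality_sensitive=True))
print(pick_route(quality_sensitive=False))
```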
The hard part of this is not the technology. The hard part is admitting that "model X is model X" is no longer a load-bearing assumption you get to make, and reorganizing your inference layer around the fact that the same name can mean different things at the same moment. Teams that built their architecture before this was true tend to have a single global setting that says "use GPT-class for this feature" and then route freely under it. Teams that build their architecture after this becomes true tend to have a feature-by-feature precision policy that lives in code and is reviewed when the policy changes.
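What a feature-by-feature policy can look like when it lives in code rather than in a global setting. The product surfaces and tier names here are hypothetical; the point is that the policy is a reviewable artifact, not a hidden default.

```python
from enum import Enum

class PrecisionPolicy(Enum):
    PINNED = "pinned"            # pay for a contractually guaranteed precision tier
    DRIFT_GATED = "drift_gated"  # best effort, but reroute when the probe suite flags drift
    BEST_EFFORT = "best_effort"  # accept whatever the provider serves today

# Hypothetical product surfaces; the mapping changes only via a reviewed pull request.
FEATURE_POLICY = {
    "contract_summarization": PrecisionPolicy.PINNED,         # user-facing and quality-sensitive
    "multilingual_support_bot": PrecisionPolicy.DRIFT_GATED,  # CJK-heavy, historically fragile
    "internal_log_triage": PrecisionPolicy.BEST_EFFORT,       # the premium isn't worth it here
}
```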
What the Discipline Looks Like a Year From Now
The teams that are out ahead on this look like teams that apply the same quality-measurement discipline they would apply to any other critical vendor. They have a probe suite that runs nightly. They have an internal dashboard that shows drift over the last 90 days for every (provider, model, region) tuple they touch. They have a contract clause — at least with their tier-1 vendor — that specifies serving precision, with a notification window when it changes. They have a routing layer that treats the precision tier as a first-class attribute of a request, not a hidden vendor decision. And they have the cultural muscle to take the eval-drift signal seriously instead of dismissing it as noise, because they've already lived through one incident where they didn't.
The teams that aren't will, for a while longer, ship products on top of an inference substrate they don't actually have a measurement for. Most of the time they'll be fine, because most of the time the average quality on the average task is approximately the same. The trouble shows up at the tail — the customer with the multilingual workflow, the user whose long-context summary changed shape, the agent that started failing a tool call it succeeded at last quarter — and the trouble is hard to diagnose precisely because the version string never moved.
The takeaway worth holding onto is this: the model name on your invoice is a marketing artifact, not a technical contract. The technical contract — if you want one — is the probe set you build, the precision clause you negotiate, and the routing layer you operate. Until those three things exist for your product, you are not buying a model from a vendor. You are buying a probability distribution over models that happens to share a name. Most quarters that's fine. The quarters it isn't, you'll wish you'd been measuring.
- https://openrouter.ai/announcements/provider-variance-introducing-exacto
- https://www.anthropic.com/engineering/april-23-postmortem
- https://venturebeat.com/technology/mystery-solved-anthropic-reveals-changes-to-claudes-harnesses-and-operating-instructions-likely-caused-degradation
- https://community.openai.com/t/could-gpt-4-turbo-be-a-quantized-version/504188
- https://research.aimultiple.com/llm-quantization/
- https://www.digitalapplied.com/blog/quantization-tradeoffs-4bit-8bit-fp8-performance-data
- https://docs.gpustack.ai/2.0/performance-lab/references/the-impact-of-quantization-on-vllm-inference-performance/
- https://www.promptinjection.net/p/ai-llm-when-local-q2-outperforms-cloud-q8
- https://github.com/RooCodeInc/Roo-Code/issues/11325
- https://arxiv.org/pdf/2510.06605
- https://arxiv.org/html/2505.16723v2
- https://venturebeat.com/infrastructure/monitoring-llm-behavior-drift-retries-and-refusal-patterns
