6 posts tagged with "model-evaluation"

The Quantization Quality Cliff: When int4 Passes the Median Eval and Fails on the Long Tail

May 10, 2026 · 11 min read

Tian Pan

Software Engineer

A team swaps an fp16 model for an int4 quantization to halve serving cost. The eval suite scores within a point of the original on the curated test set. The rollout ships under the rationale "indistinguishable on the benchmark." Six weeks later, support is fielding catastrophic-failure reports from regulated customers: code that compiles to nonsense, low-resource-language responses that drift into another script, multi-hop arithmetic that confidently returns numbers off by an order of magnitude. The benchmark didn't lie. It just measured the median, and quantization is not a uniform tax on the median. It is a non-uniform tax on the tail.

This is the quantization quality cliff: the moment your eval suite, your rollout discipline, and your cost-savings narrative all simultaneously fail because the metric you used to approve the swap had no signal on the capabilities you destroyed. Recent benchmarks make the magnitude concrete. On long-context tasks, 8-bit quantization preserves accuracy with roughly a 0.8% drop, while 4-bit methods lose up to 59% on the same workload — a regression invisible to any test set that doesn't oversample tail inputs. Median moved one point. Tail moved fifteen, or thirty, or fifty.
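One way to give the approval metric some signal on the tail is to score each input slice separately instead of reporting a single aggregate. The sketch below is a minimal illustration, not a method from the post: the slice names, the `max_drop` threshold, and the toy data are all assumptions made up for the example.

```python
from collections import defaultdict

def accuracy_by_slice(records):
    """Compute accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        hits[slice_name] += int(is_correct)
    return {name: hits[name] / totals[name] for name in totals}

def overall_accuracy(records):
    """Aggregate accuracy across all slices (what a single headline score sees)."""
    records = list(records)
    return sum(int(ok) for _, ok in records) / len(records)

def tail_regressions(baseline, quantized, max_drop=0.02):
    """Return slices whose accuracy fell by more than max_drop (hypothetical threshold)."""
    base = accuracy_by_slice(baseline)
    quant = accuracy_by_slice(quantized)
    return {
        name: round(base[name] - quant[name], 4)
        for name in base
        if name in quant and base[name] - quant[name] > max_drop
    }

# Toy results, invented for illustration: a large "general" slice and a
# small "long_context" slice that the aggregate score barely notices.
baseline = (
    [("general", True)] * 950 + [("general", False)] * 50
    + [("long_context", True)] * 9 + [("long_context", False)] * 1
)
quantized = (
    [("general", True)] * 948 + [("general", False)] * 52
    + [("long_context", True)] * 5 + [("long_context", False)] * 5
)

aggregate_drop = overall_accuracy(baseline) - overall_accuracy(quantized)
regressions = tail_regressions(baseline, quantized)
print(f"aggregate drop: {aggregate_drop:.3f}")
print(f"slices over threshold: {regressions}")
```

In this toy data the aggregate moves by less than a point while the small slice loses 40 points, which is the shape of failure described above. A real gate would need enough examples per slice for the per-slice numbers to be meaningful.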

The Ghost in the Weights: How Pretraining Residue Breaks Your Fine-Tuned Model in Production

May 7, 2026 · 10 min read

Tian Pan

Software Engineer

Your fine-tuned model passes your eval suite with 93% accuracy. You ship it. Three weeks later, a customer sends a screenshot: the model answered a question it had never seen in training with complete confidence — and it was completely wrong. The answer wasn't a hallucination in the usual sense. It was a memory. A pattern baked in during pretraining, resurfacing on a distribution the fine-tune never covered. This is pretraining residue, and it's one of the most underdiagnosed failure modes in production fine-tuning.

Fine-tuning adjusts weights. It does not retrain the model from scratch. The patterns — the calibration mechanisms, the confidence signals, the world-model priors — developed during pretraining at trillion-token scale remain in the weights. Your fine-tuning dataset, no matter how carefully curated, is a thin layer on top of a much deeper prior. When inputs arrive that fall outside your fine-tuning distribution, the model doesn't say "I don't know." It reaches back to pretraining and answers as if it does.

What Your Vendor's Model Card Doesn't Tell You

April 20, 2026 · 10 min read

Tian Pan

Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

April 19, 2026 · 9 min read

Tian Pan

Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically: the new model scores higher on benchmarks, handles harder questions, and refuses fewer requests it should have answered. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now feels like it's on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.

The Calibration Gap: Your LLM Says 90% Confident but Is Right 60% of the Time

April 12, 2026 · 10 min read

Tian Pan

Software Engineer

Your language model tells you it is 93% sure that Geoffrey Hinton received the IEEE Frank Rosenblatt Award in 2010. The actual recipient was Michio Sugeno. This is not a hallucination in the traditional sense — the model generated a plausible-sounding answer and attached a high confidence score to it. The problem is that the confidence number itself is a lie.

This disconnect between stated confidence and actual accuracy is the calibration gap, and it is one of the most underestimated failure modes in production AI systems. Teams that build routing logic, escalation triggers, or user-facing confidence indicators on top of raw model confidence scores are building on sand.
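A common way to put a number on this gap is expected calibration error (ECE): bucket predictions by stated confidence, then take the weighted average of the difference between mean confidence and observed accuracy in each bucket. The sketch below is a minimal version using equal-width bins. The bin count and the example data are assumptions for illustration, not figures from the post.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean of |mean confidence - accuracy| over equal-width confidence bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        return 0.0

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 would index past the end, so clamp to the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / n) * abs(avg_conf - accuracy)
    return ece

# Illustrative data: ten answers all stated at 90% confidence, six correct.
confidences = [0.9] * 10
correct = [True] * 6 + [False] * 4
ece = expected_calibration_error(confidences, correct)
print(f"ECE: {ece:.2f}")
```

With these inputs the stated confidence is 0.9 and the observed accuracy is 0.6, so the ECE comes out near 0.3. A perfectly calibrated model would score 0.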

Capability Probing: How to Map Your Model's Limitations Before Users Do

April 11, 2026 · 10 min read

Tian Pan

Software Engineer

Most teams discover their model's limitations the same way users do: in production, through support tickets. A customer reports the extraction pipeline silently dropping nested addresses. An internal user notices the summarizer hallucinating dates on inputs longer than 8,000 tokens. A compliance review finds the classifier confidently labeling ambiguous cases instead of abstaining.

None of these are surprises. They are capability boundaries that were always there, waiting for the right input to expose them. You either map those boundaries before deployment, or your users map them for you — one incident at a time.

The difference is cost: a probe failure in CI costs you five minutes. A capability gap discovered in production costs you a customer's trust. The discipline of finding those boundaries systematically is capability probing — fault injection for language models. You wouldn't ship a bridge without load-testing the joints. The same logic applies to any model you put in front of users.
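As a rough sketch of what a probe suite in CI could look like, each probe pairs a targeted prompt with a check on the output, and the runner reports which probes fail. Everything named here is hypothetical: `Probe`, `run_probes`, the prompts, and `stub_model`, which stands in for a real model call so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Probe:
    name: str                      # identifier reported on failure
    prompt: str                    # input designed to stress one capability boundary
    check: Callable[[str], bool]   # returns True when the output is acceptable

def run_probes(model: Callable[[str], str], probes: List[Probe]) -> List[str]:
    """Run every probe against the model and return the names of those that fail."""
    failures = []
    for probe in probes:
        output = model(probe.prompt)
        if not probe.check(output):
            failures.append(probe.name)
    return failures

def stub_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption): always labels, never abstains."""
    if "ambiguous" in prompt:
        return "LABEL: spam"
    return "street=1 Main St; unit=Apt 4"

probes = [
    Probe(
        name="nested_address",
        prompt="Extract street and unit from: Apt 4, 1 Main St",
        check=lambda out: "street=" in out and "unit=" in out,
    ),
    Probe(
        name="abstains_when_ambiguous",
        prompt="Classify this ambiguous message, or answer ABSTAIN: 'ok'",
        check=lambda out: out.strip() == "ABSTAIN",
    ),
]

failures = run_probes(stub_model, probes)
print("failed probes:", failures)
```

Here the stub passes the extraction probe and fails the abstention probe. In CI, a non-empty `failures` list would fail the build, so the boundary surfaces before a user finds it.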
