2 posts tagged with "ai-evals"

Dogfooding Is Not an Eval Strategy

· 9 min read
Tian Pan
Software Engineer

Every team building an AI product reaches the same comfortable conclusion: "We use it every day, and it works great." That sentence feels like evidence. It is not. It is the single most misleading signal in the room, and it gets stronger — more convincing, more wrong — the better your team is.

Dogfooding tells you the product runs. It does not tell you the product works. Those are different claims, and the gap between them is exactly where your launch goes sideways. The people who built the system are, statistically, the worst possible sample of the people who will use it. They share its mental model, they know its soft spots, and they have spent months training themselves to phrase requests the way the model likes. That is not a test population. That is a control group for a study you never ran.

Quantization Slippage: The Capability Tax Your Eval Set Was Never Built to Catch

· 11 min read
Tian Pan
Software Engineer

A self-hosted LLM team quantizes the production model from fp16 to int4. Weight memory drops roughly 4×, throughput nearly doubles, the GPU bill shrinks, and the team reruns the same eval suite that gated the fp16 release. MMLU-Pro retains 98.1% of baseline. Aggregate quality looks fine. They ship.

Six weeks later, a support engineer notices the math tutoring feature has gotten quietly worse. The compliance team flags an uptick in policy-violation completions on adversarial prompts. The structured-output retry rate has crept from 1.4% to 6.8%. None of these show up on the eval dashboard, because the eval dashboard was built to validate a different model: the one that came from the same checkpoint but had four times as many bits behind every weight.

This is quantization slippage. The cost analysis priced the memory win and the latency win. It did not price the eval re-anchoring that the swap silently demanded, and the eval suite, calibrated against the fp16 distribution, is now grading the wrong model with the wrong rubric.
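
One way to make the re-anchoring concrete is to gate the quantized model on per-slice retention instead of a single aggregate number. The sketch below is illustrative only: the slice names, the scores, the `SliceResult` type, and the 0.97 threshold are all hypothetical, not taken from any real eval suite. The structured-output row reuses the 1.4% and 6.8% retry rates from the scenario above, under the added assumption that a retry happens exactly when the first output is invalid.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    name: str
    baseline: float   # fp16 score on this slice, higher is better
    quantized: float  # int4 score on the same slice


def retention(result: SliceResult) -> float:
    """Fraction of the baseline score that the quantized model keeps."""
    return result.quantized / result.baseline


def flag_regressions(results, min_retention=0.97):
    """Return the names of slices whose retention is below the threshold."""
    return [r.name for r in results if retention(r) < min_retention]


# Hypothetical per-slice scores, invented for illustration only.
results = [
    SliceResult("general_knowledge", 0.720, 0.712),
    SliceResult("math_tutoring", 0.810, 0.700),
    SliceResult("adversarial_refusal", 0.960, 0.900),
    # 1 - retry rate: 1.4% -> 0.986 valid, 6.8% -> 0.932 valid (assumption).
    SliceResult("structured_output_valid", 0.986, 0.932),
]

flagged = flag_regressions(results)
print(flagged)
```

With these made-up numbers, the general-knowledge slice passes the gate while the other three fail it, which is the shape of the failure described above: a benchmark-style slice can look healthy while the slices tied to product features do not. The threshold is a judgment call and would likely need to differ per slice.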