4 posts tagged with "model-evaluation"

What Your Vendor's Model Card Doesn't Tell You

· 10 min read
Tian Pan
Software Engineer

A model card will tell you that the model scores 88.7 on MMLU. It will not tell you that the model systematically attributes blame to whichever technology appears first in a list of possibilities, causing roughly 10% of its attribution answers to be semantically wrong even when factually correct. It will not tell you that adding "you are a helpful assistant" to your system prompt degrades performance on structured reasoning tasks compared to leaving the system prompt blank. It will not tell you that under load the 99th-percentile latency is 4x the median, or that the model's behavior on legal and financial queries changes measurably depending on whether you include a compliance disclaimer.

None of this is in the model card. You will learn it by shipping to production and watching things break.
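The latency tail, at least, is cheap to check before launch rather than after. Below is a minimal sketch of what such a probe might look like; `call_model` and its simulated delay are hypothetical stand-ins for whatever client you actually use, so the numbers it prints mean nothing until you wire in real traffic.

```python
import random
import statistics
import time

def call_model(prompt: str) -> str:
    """Hypothetical stand-in for your real model client; the simulated
    delay exists only so the snippet runs on its own."""
    time.sleep(random.expovariate(20))  # exponential delay, ~50 ms mean, long tail
    return "stub response"

def latency_profile(prompts, runs=200):
    """Time repeated calls and compare the median with the 99th percentile."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        call_model(random.choice(prompts))
        samples.append(time.perf_counter() - start)
    samples.sort()
    p50 = statistics.median(samples)
    p99 = samples[int(0.99 * (len(samples) - 1))]
    return p50, p99

if __name__ == "__main__":
    prompts = ["Summarize this clause ...", "Extract the parties from ..."]
    p50, p99 = latency_profile(prompts)
    print(f"p50={p50 * 1000:.0f}ms  p99={p99 * 1000:.0f}ms  tail ratio={p99 / p50:.1f}x")
```

Run it against real prompts under real load and the tail ratio it prints is exactly the number the model card never quotes.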

The Capability Elicitation Gap: Why Upgrading to a Newer Model Can Break Your Product

· 9 min read
Tian Pan
Software Engineer

You upgraded to the latest model and your product got worse. Not catastrophically — the new model scores higher on benchmarks, handles harder questions, and refuses fewer things it shouldn't. But the thing your product actually needs? It's regressed. Your carefully tuned prompts produce hedged, over-qualified outputs where you need confident assertions. Your domain-specific format instructions are being helpfully "improved" into something generic. The tight instruction-following that made your workflow reliable now slips, as if the model were running on autopilot.

This is the capability elicitation gap: the difference between what a model can do in principle and what it actually does under your prompt in production. And it gets systematically wider with each safety-focused training cycle.
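One way to make the gap concrete is to treat it as a measurable quantity: score the same labeled tasks under your production prompt and under a prompt tuned purely to elicit the capability, then compare. The sketch below shows the shape of that comparison; `call_model`, `grade`, both system prompts, and the sample task are hypothetical placeholders, not anything the post prescribes.

```python
def call_model(system: str, user: str) -> str:
    """Hypothetical stand-in for your real model client."""
    return "stub answer"

def grade(output: str, expected: str) -> bool:
    """Deliberately naive grader; swap in whatever your product already uses."""
    return expected.lower() in output.lower()

PRODUCTION_SYSTEM = "You are a helpful assistant. Answer carefully and note any uncertainty."
CEILING_SYSTEM = "Answer with one confident assertion in exactly the format requested."

TASKS = [
    {"user": "Name the governing statute for this clause, statute name only.",
     "expected": "UCC 2-207"},
    # ... drawn from real production traffic, with known-good answers
]

def accuracy(system: str) -> float:
    hits = [grade(call_model(system, t["user"]), t["expected"]) for t in TASKS]
    return sum(hits) / len(hits)

if __name__ == "__main__":
    prod, ceiling = accuracy(PRODUCTION_SYSTEM), accuracy(CEILING_SYSTEM)
    print(f"production={prod:.0%}  ceiling={ceiling:.0%}  gap={ceiling - prod:.0%}")
```

The point of the harness is the delta, not either number on its own: the same model, the same tasks, and only the prompt between them.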

The Calibration Gap: Your LLM Says 90% Confident but Is Right 60% of the Time

· 10 min read
Tian Pan
Software Engineer

Your language model tells you it is 93% sure that Geoffrey Hinton received the IEEE Frank Rosenblatt Award in 2010. The actual recipient was Michio Sugeno. This is not a hallucination in the traditional sense — the model generated a plausible-sounding answer and attached a high confidence score to it. The problem is that the confidence number itself is a lie.

This disconnect between stated confidence and actual accuracy is the calibration gap, and it is one of the most underestimated failure modes in production AI systems. Teams that build routing logic, escalation triggers, or user-facing confidence indicators on top of raw model confidence scores are building on sand.
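The claim is straightforward to check against your own logs: bucket answers by the confidence the model stated, then compare that with how often each bucket was actually right. A rough sketch follows, with made-up log entries standing in for real traffic and a judged-correct flag you would normally get from review or downstream feedback.

```python
from collections import defaultdict

# (stated confidence, answer judged correct?) -- illustrative values only
LOG = [(0.93, False), (0.90, True), (0.88, True), (0.95, False),
       (0.62, True), (0.70, False), (0.91, True), (0.85, False)]

buckets = defaultdict(list)
for confidence, correct in LOG:
    buckets[round(confidence, 1)].append((confidence, correct))  # 0.1-wide bins

ece = 0.0  # expected calibration error: |stated - observed|, weighted by bin size
for _, entries in sorted(buckets.items()):
    stated = sum(c for c, _ in entries) / len(entries)
    observed = sum(ok for _, ok in entries) / len(entries)
    ece += abs(stated - observed) * len(entries) / len(LOG)
    print(f"stated ~{stated:.0%}: observed {observed:.0%} over {len(entries)} answers")

print(f"expected calibration error: {ece:.0%}")
```

If the per-bucket gap looks anything like the 90-versus-60 split in the title, raw confidence is not a signal you want routing or escalation built on.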

Capability Probing: How to Map Your Model's Limitations Before Users Do

· 10 min read
Tian Pan
Software Engineer

Most teams discover their model's limitations the same way users do — in production, through support tickets. A customer reports the extraction pipeline silently dropping nested addresses. An internal user notices the summarizer hallucinating dates once inputs pass 8,000 tokens. A compliance review finds the classifier confidently labeling ambiguous cases instead of abstaining.

None of these are surprises. They are capability boundaries that were always there, waiting for the right input to expose them. You either map those boundaries before deployment, or your users map them for you — one incident at a time.

The difference is cost: a probe failure in CI costs you five minutes. A capability gap discovered in production costs you a customer's trust. The discipline of finding those boundaries systematically is capability probing — fault injection for language models. You wouldn't ship a bridge without load-testing the joints. The same logic applies to any model you put in front of users.
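What those probes might look like in CI is sketched below, assuming a `call_model` client you still have to wire in; the prompts, the 8,000-token-ish filler, and the expected behaviors are illustrative guesses at boundaries like the ones above, not a fixed suite from the post.

```python
import json

def call_model(prompt: str) -> str:
    raise NotImplementedError("wire in your real model client")

def probe_nested_extraction():
    # Boundary: nested structures inside a single field.
    prompt = ("Extract every address as a JSON list from: "
              "'Ship to Acme Corp (billing: 12 High St, Leeds), 44 Elm Rd, York.'")
    assert len(json.loads(call_model(prompt))) == 2, "nested billing address dropped"

def probe_long_context_dates():
    # Boundary: factual recall once the input gets long.
    filler = "lorem ipsum " * 4000  # push the input past the suspected token boundary
    prompt = filler + "\nThe audit took place on 2021-03-14. On what date was the audit?"
    assert "2021-03-14" in call_model(prompt), "date hallucinated on a long input"

def probe_abstention():
    # Boundary: ambiguous cases where the right answer is to abstain.
    prompt = ("Classify the sentiment of: 'well, that happened.' "
              "Reply POSITIVE, NEGATIVE, or UNSURE if it is ambiguous.")
    assert call_model(prompt).strip() == "UNSURE", "failed to abstain on an ambiguous case"

if __name__ == "__main__":
    for probe in (probe_nested_extraction, probe_long_context_dates, probe_abstention):
        probe()
        print(f"{probe.__name__}: ok")
```

Each probe is a five-minute CI failure standing in for the support ticket you would otherwise get.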