Skip to main content

15 posts tagged with "calibration"

View all tags

The 40-Point Gap Between Your Interviewers When the Candidate Says 'I'd Just Prompt It'

· 9 min read
Tian Pan
Software Engineer

The candidate hit the wall on the system-design question, paused for two seconds, and said: "I'd just prompt it." Your most senior interviewer wrote strong hire — this is exactly how good engineers work in 2026. Your second-most-senior interviewer wrote no hire — handing the problem to a chatbot is not engineering. Same five words. Same forty-minute window. A forty-point gap on the same scorecard.

The candidate didn't fail your loop. Your loop failed to have an opinion. And the worst part of the debrief is not the disagreement — it's the way each interviewer is so confident their read is the correct one that the meeting devolves into a referendum on AI itself rather than on whether this human can ship.

The Eval Harness Whose Judge Model Was Upgraded Silently

· 11 min read
Tian Pan
Software Engineer

A six-point lift across every eval category arrives the same week you shipped a prompt change. The room reads it as proof the change worked. Three weeks later, someone notices the lift also showed up in categories the prompt change could not possibly have touched — a control set you keep specifically to detect this — and the lift is uniformly distributed, the kind of shape a real product improvement never has. The judge model was rolled out under the same endpoint name on a Tuesday. Your scores moved before your system did.

This is the failure mode that breaks LLM-as-a-judge eval pipelines more quietly than any of the failure modes the literature warns about. Not bias, not position effects, not self-preference — those are properties of a judge at a point in time, and your eval design probably already accounts for them. The one that gets you is the judge changing while you're not looking, while your endpoint name and your eval code and your dashboards all keep claiming nothing happened. The unit of measurement shifted under a stable label. Every comparison across the migration boundary is now confounded, and you cannot decompose the delta into "our system improved" and "the ruler got more generous" because you never built the instrument to do that decomposition.

The Eval Rubric Pulled By Two Drift Vectors

· 9 min read
Tian Pan
Software Engineer

Your composite eval score went up two points last quarter. Nobody can tell you whether the system got better, whether the human cohort that scores it got more lenient, or whether the judge model you upgraded in March started weighting verbosity differently. The number moved. The thing the number is supposed to measure did not necessarily move with it.

This is what happens when an eval rubric is read by two populations at once — humans and an LLM judge — and both populations drift on different axes for different reasons. The composite score blends their motion together, and unless you have a measurement protocol that holds one fixed while the other moves, you have shipped a metric whose changes are not attributable to anything.

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Confidence Score Your Users Learned to Ignore

· 11 min read
Tian Pan
Software Engineer

You wanted to be honest. You put a little "92%" next to every answer your agent gave. After the third time the agent was confidently wrong at 92%, your users stopped reading the number. They did not get angry about it. They just learned, the way humans always learn around a misbehaving signal, that the gauge on the dashboard is not connected to the engine. The number is still there. It costs you tokens to produce it. It informs no decision anyone makes.

This is the failure mode that calibration UX research keeps rediscovering: surfacing a probability is a trust commitment, and the commitment goes one direction. The moment the number turns out to be uncorrelated with correctness in the user's lived experience, the score is dead — and the trust you spent putting it there is dead with it. You cannot un-ring that bell by fixing the number later. The number is now decoration.

The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong

· 10 min read
Tian Pan
Software Engineer

Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.

That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.

This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.

The Confident Hallucinator: Runtime Patterns for Knowledge Boundary Signaling in LLMs

· 10 min read
Tian Pan
Software Engineer

GPT-4 achieves roughly 62% AUROC when its own confidence scores are used to separate correct answers from incorrect ones. That's barely above the 50% baseline of flipping a coin. The model sounds certain and polished in both cases. If you're building a production system that assumes high-confidence responses are reliable, you're working with a signal that's nearly random.

This is the knowledge boundary signaling problem, and it sits at the center of most real-world LLM quality failures. The model doesn't know what it doesn't know — or more precisely, it knows internally but can't be trusted to express it. The engineering challenge isn't getting models to refuse more; it's designing systems that make uncertainty actionable without making your product feel broken.

The LLM-Judge Ceiling: Why Your Auto-Eval Stops Correlating With Users at the Score That Matters

· 10 min read
Tian Pan
Software Engineer

LLM-as-judge is the productivity unlock that let evaluation coverage scale 10x without growing the human grading team. The problem is that the unlock is not uniform across the score range. The judge's agreement with humans is highest in the muddy middle of the distribution — the answers nobody is going to escalate either way — and collapses on the long tail of high-stakes outputs that actually decide whether a feature ships, gets rolled back, or paged at 2am. The dashboard graph stays green through the score range that nobody is ever happy with.

That is the LLM-judge ceiling: a measurement instrument with a non-uniform error profile that the team is reading as a single number. Aggregate agreement of 80% with humans is the headline most vendors put on the page; it is also the number that gets the team to trust the judge most where the judge is least informative.

Calibrated Abstention: The Capability Every Layer of Your LLM Stack Punishes

· 11 min read
Tian Pan
Software Engineer

There is a capability your model could have that would, on the days it mattered, be worth more than any other behavioral upgrade you could ship: the ability to say "I don't have a reliable answer to this" and mean it. Not the keyword-matched safety refusal. Not the hedging tic the model picked up from RLHF on controversial topics. The real thing — a calibrated abstention that fires when, and only when, the model's internal evidence does not support a confident response.

You will never get it by accident. Every default in the LLM stack pushes the other way.

The Eval Pickle: When Your LLM Judge Gets Smarter Than the Model It Grades

· 9 min read
Tian Pan
Software Engineer

A regression alert fires on Monday morning. Faithfulness on your held-out eval set dropped from 0.86 to 0.78 over the weekend. Nobody shipped a new model. Nobody touched the prompt. Nobody changed the retrieval index. The on-call engineer spends three hours digging before noticing the only thing that changed was the judge model — the auto-evaluator quietly rolled forward to a newer snapshot that catches subtle hedging the old one waved through. Same answers. Same model. Worse score. Real number, fake regression.

This is the eval pickle: as your LLM-as-judge gets sharper, your scores on a frozen system slide down, and the dashboard that's supposed to detect regressions starts manufacturing them. The team that doesn't notice spends quarters chasing "quality drift" that lives entirely in the ruler.

Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

Your Accuracy Went Up and Your Calibration Collapsed

· 10 min read
Tian Pan
Software Engineer

A team ships a prompt refactor. The offline eval shows accuracy up three points. The PM posts the graph in Slack. Two weeks later, support tickets spike with a pattern nobody has a dashboard for: users trusted an answer they should not have, acted on it, and got burned. The model is right more often than it used to be. Trust in the model has gotten worse.

This is the calibration collapse. The model's confidence no longer matches its error rate, but the accuracy number went up, so the team thinks they shipped a win. They did not. They shipped a system that is more confidently wrong, and users — who calibrate trust on the model's voice (hedges, certainty, refusals) rather than on an accuracy number they never see — are now being misled on the exact fraction of queries where being misled matters most.

Accuracy and calibration are independent axes. You can move one without touching the other. You can improve one while destroying the other. Most teams measure only the first axis and ship against it, and most production incidents in LLM systems live on the second.