The Calibration Gap: Your LLM Says 90% Confident but Is Right 60% of the Time
Your language model tells you it is 93% sure that Geoffrey Hinton received the IEEE Frank Rosenblatt Award in 2010. The actual recipient was Michio Sugeno. This is not a hallucination in the traditional sense — the model generated a plausible-sounding answer and attached a high confidence score to it. The problem is that the confidence number itself is a lie.
This disconnect between stated confidence and actual accuracy is the calibration gap, and it is one of the most underestimated failure modes in production AI systems. Teams that build routing logic, escalation triggers, or user-facing confidence indicators on top of raw model confidence scores are building on sand.
The calibration gap matters more than raw accuracy in many production scenarios. A model that is right 70% of the time but honestly says "I'm uncertain" when it is wrong is far more useful than a model that is right 75% of the time but claims 90%+ confidence on everything. The first model lets you build reliable systems around it. The second one poisons every downstream decision.
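A small worked example makes the comparison concrete. The sketch below assumes a simple routing rule (auto-accept any answer at or above a confidence threshold) and two hypothetical models whose numbers are invented purely for illustration; `auto_accept_stats` is a name introduced here, not part of any library.

```python
def auto_accept_stats(pairs, threshold=0.9):
    """Return (coverage, error_rate) for answers at or above the threshold.

    pairs: list of (confidence, is_correct) tuples.
    """
    accepted = [ok for conf, ok in pairs if conf >= threshold]
    coverage = len(accepted) / len(pairs)
    # Error rate among auto-accepted answers; 0.0 when nothing is accepted.
    error_rate = accepted.count(False) / len(accepted) if accepted else 0.0
    return coverage, error_rate


# Hypothetical model A: 70% accurate, low confidence on its mistakes.
model_a = [(0.95, True)] * 14 + [(0.30, False)] * 6
# Hypothetical model B: 75% accurate, high confidence on everything.
model_b = [(0.92, True)] * 15 + [(0.92, False)] * 5

print(auto_accept_stats(model_a))  # coverage 0.7, error rate 0.0
print(auto_accept_stats(model_b))  # coverage 1.0, error rate 0.25
```

Under these assumed numbers, model A auto-accepts fewer answers but none of them are wrong, while model B passes every one of its errors straight through the threshold.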
What Calibration Actually Means
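A model is well-calibrated when its confidence scores match empirical accuracy. If a model says "90% confident" across 1,000 predictions, roughly 900 of those predictions should be correct. If only 600 are correct, the model is overconfident by 30 percentage points at that confidence level. Expected Calibration Error (ECE) summarizes this kind of gap across all confidence levels as a single number: the average difference between confidence and accuracy, weighted by how many predictions fall at each level.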
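For reference, the usual binned definition of ECE can be written as below. The notation is introduced here as an assumption rather than taken from the text: predictions are split into M confidence bins B_1, ..., B_M, n is the total number of predictions, acc(B_m) is the fraction of correct predictions in bin m, and conf(B_m) is the mean stated confidence in that bin.

```latex
\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}\,\bigl|\mathrm{acc}(B_m) - \mathrm{conf}(B_m)\bigr|
```

In the example above, all 1,000 predictions fall in a single bin with confidence 0.9 and accuracy 0.6, so ECE = (1000/1000) × |0.6 − 0.9| = 0.30.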
The standard way to visualize this is a reliability diagram. You bin predictions by confidence level (0-10%, 10-20%, and so on), compute the actual accuracy within each bin, and plot accuracy (vertical axis) against confidence (horizontal axis). A perfectly calibrated model produces points along the diagonal, where confidence equals accuracy at every level. In practice, most LLMs produce a curve that sits well below the diagonal in high-confidence regions, meaning they claim certainty far more often than they earn it.
Recent benchmarks paint a stark picture. On the SimpleQA benchmark, ECE values across major models range from 0.631 to 0.810. That means the weighted average gap between confidence and accuracy is 63 to 81 percentage points. Even GPT-4o, one of the strongest models available, achieves only 35% accuracy on this benchmark while maintaining an ECE of 0.45: on average, its stated confidence misses its actual accuracy by about 45 percentage points.
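The binning procedure described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming equal-width bins and confidences already expressed as numbers between 0 and 1; the function name and the dictionary keys are hypothetical choices made for this example.

```python
def reliability_bins(confidences, correct, n_bins=10):
    """Return (rows, ece): per-bin statistics and Expected Calibration Error.

    confidences: floats in [0, 1]; correct: booleans of the same length.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 belongs in the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, 1.0 if ok else 0.0))

    total = len(confidences)
    rows, ece = [], 0.0
    for index, items in enumerate(bins):
        if not items:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(hit for _, hit in items) / len(items)
        # Weight each bin's gap by its share of all predictions.
        ece += (len(items) / total) * abs(accuracy - mean_conf)
        rows.append({
            "bin_start": index / n_bins,
            "bin_end": (index + 1) / n_bins,
            "count": len(items),
            "mean_confidence": mean_conf,
            "accuracy": accuracy,
        })
    return rows, ece
```

Plotting `accuracy` against `mean_confidence` for each row gives the reliability diagram. Bins with very few predictions produce noisy accuracy estimates, so it is worth keeping an eye on `count` before reading much into any single point.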
Why Logprobs and Verbalized Confidence Diverge
There are two common ways to extract confidence from an LLM, and neither is reliable out of the box.
Token-level logprobs give you the log of the model's softmax probability for each generated token. In theory, these are the model's "true" internal uncertainty estimates. In practice, they are poorly calibrated despite high accuracy on common tokens. The probability distribution across the vocabulary is shaped by training dynamics: RLHF in particular pushes models toward confident, assertive outputs because human raters prefer them. A model that hedges gets lower ratings, so the training process systematically strips away calibrated uncertainty.
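Per-token logprobs still have to be collapsed into one number per answer before they can be compared with correctness. The sketch below shows three common aggregations, assuming you already have a list of natural-log token probabilities for the generated answer; how you obtain that list depends on your provider, and the function name and dictionary keys are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Collapse per-token log-probabilities into answer-level scores."""
    if not token_logprobs:
        raise ValueError("need at least one token logprob")
    total = sum(token_logprobs)
    return {
        # Probability of the exact token sequence; shrinks as answers get longer.
        "joint_prob": math.exp(total),
        # Geometric mean of token probabilities; normalizes for length.
        "mean_token_prob": math.exp(total / len(token_logprobs)),
        # Probability of the least likely token in the answer.
        "min_token_prob": math.exp(min(token_logprobs)),
    }
```

None of these aggregations fixes miscalibration on its own. They only produce a score that can then be checked against ground truth.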
Verbalized confidence is when you ask the model to state its confidence as a number: "How confident are you? Rate from 0 to 100." This approach has its own problems. Models cluster their verbal confidence scores near the top of the range regardless of actual performance — they have learned that confident-sounding answers get rewarded, and this manifests in self-reported numbers too.
Studies show that verbalized confidence correlates poorly with correctness, and the correlation degrades further on questions where the model lacks knowledge.
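To evaluate verbalized confidence at all, the number has to be pulled out of the reply reliably. The sketch below assumes the prompt instructs the model to end its answer with a line such as "Confidence: 85"; that output format, the regular expression, and the function name are all assumptions made for this example.

```python
import re

# Matches "Confidence: 85", "confidence = 7.5%", and similar; the lookahead
# rejects numbers with more than three integer digits.
_CONFIDENCE_RE = re.compile(
    r"confidence\s*[:=]\s*(\d{1,3}(?:\.\d+)?)(?!\d)", re.IGNORECASE
)


def parse_verbalized_confidence(reply):
    """Return the stated confidence as a probability in [0, 1], or None."""
    match = _CONFIDENCE_RE.search(reply)
    if match is None:
        return None  # the model did not follow the requested format
    value = float(match.group(1))
    if not 0.0 <= value <= 100.0:
        return None  # out-of-range values are treated as unparseable
    return value / 100.0
```

Returning `None` instead of a default keeps unparseable replies out of the calibration statistics, where a silent fallback value would distort the bins.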
The gap between these two signals creates an additional headache. Token-level logprobs and verbalized confidence often disagree, and neither aligns well with ground truth. A model might assign a 0.4 softmax probability to a token while simultaneously claiming 85% verbalized confidence in the complete answer. This inconsistency makes it harder to build any single confidence pipeline.
The RLHF Overconfidence Trap
The root cause of modern LLM miscalibration is not a mystery — it is RLHF. Pre-trained base models are often reasonably well-calibrated. The fine-tuning process, especially reinforcement learning from human feedback, systematically degrades calibration.
The mechanism is straightforward: human raters prefer confident, clear, direct answers. Hedging, qualifying, and expressing uncertainty all reduce preference scores. So RLHF optimizes for confident presentation regardless of underlying certainty. The model learns that saying "The answer is X" gets rewarded more than saying "I think the answer might be X, but I'm not entirely sure."
This creates a paradox for production systems. The same fine-tuning that makes models useful (instruction-following, conversational fluency, helpful responses) also makes their confidence signals unreliable. You cannot simply use a base model instead — base models are difficult to deploy in production. And you cannot trust the fine-tuned model's self-assessment.
The problem is worse than uniform overconfidence. Research shows that large RLHF-tuned models can display increased miscalibration specifically on easier queries — the tasks where you would expect the best calibration. On TriviaQA, GPT-4o's ECE actually increases from 0.071 to 0.083 despite accuracy gains, suggesting that scaling and RLHF together create calibration patterns that are hard to predict.
Measuring Calibration in Your System
Most teams skip calibration measurement entirely. They evaluate accuracy, latency, and cost, but never check whether confidence scores mean anything. Here is how to add calibration measurement without a research team.
Step 1: Collect confidence-accuracy pairs. For every model call where you can later verify correctness, log the confidence score alongside the ground truth label. For classification tasks, this is straightforward. For generation tasks, you need either human evaluation on a sample or an automated correctness check.
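One lightweight way to collect these pairs is an append-only JSON Lines log. The sketch below is an assumption-laden illustration, not a prescribed schema: the record fields, the file format, and the function names are hypothetical, and `correct` stays `None` until a human or automated check supplies the label.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class CalibrationRecord:
    request_id: str
    confidence: float               # probability in [0, 1]
    confidence_source: str          # e.g. "logprob" or "verbalized"
    correct: Optional[bool] = None  # None until ground truth is known


def append_record(path, record):
    """Append one record to a JSON Lines file."""
    with open(path, "a", encoding="utf-8") as handle:
        handle.write(json.dumps(asdict(record)) + "\n")


def load_labeled_pairs(path):
    """Return (confidence, correct) pairs for records that have a label."""
    pairs = []
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if not line.strip():
                continue  # skip blank lines
            row = json.loads(line)
            if row["correct"] is not None:
                pairs.append((row["confidence"], row["correct"]))
    return pairs
```

Recording `confidence_source` alongside each score keeps logprob-derived and verbalized confidences separate, so each signal can be evaluated on its own.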
- https://arxiv.org/html/2502.11028
- https://arxiv.org/abs/2409.19817
- https://arxiv.org/html/2410.06707v1
- https://news.mit.edu/2024/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731
- https://aclanthology.org/2024.naacl-long.366/
- https://github.com/apple/ml-calibration
- https://proceedings.iclr.cc/paper_files/paper/2024/file/06cf4bae7ccb6ea37b968a394edc2e33-Paper-Conference.pdf
