The Calibration Gap: Your LLM Says 90% Confident but Is Right 60% of the Time
Your language model tells you it is 93% sure that Geoffrey Hinton received the IEEE Frank Rosenblatt Award in 2010. The actual recipient was Michio Sugeno. The wrong answer itself is a garden-variety hallucination: a plausible-sounding fabrication. The deeper failure is that the model attached a 93% confidence score to it. The confidence number is the real lie.
This disconnect between stated confidence and actual accuracy is the calibration gap, and it is one of the most underestimated failure modes in production AI systems. Teams that build routing logic, escalation triggers, or user-facing confidence indicators on top of raw model confidence scores are building on sand.
The calibration gap matters more than raw accuracy in many production scenarios. A model that is right 70% of the time but honestly says "I'm uncertain" when it is wrong is far more useful than a model that is right 75% of the time but claims 90%+ confidence on everything. The first model lets you build reliable systems around it. The second one poisons every downstream decision.
What Calibration Actually Means
A model is well-calibrated when its confidence scores match empirical accuracy. If a model says "90% confident" across 1,000 predictions, roughly 900 of those predictions should be correct. If only 600 are correct, the model has an Expected Calibration Error (ECE) that quantifies this gap.
The standard way to measure this is a reliability diagram. You bin predictions by confidence level (0-10%, 10-20%, and so on), compute the actual accuracy within each bin, and plot confidence versus accuracy. A perfectly calibrated model produces points along the diagonal — confidence equals accuracy at every level. In practice, most LLMs produce a curve that sits well above the diagonal in high-confidence regions, meaning they claim certainty far more often than they earn it.
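The binning procedure above takes only a few lines. Here is a minimal NumPy sketch of the reliability-diagram bins and the ECE that summarizes them; the function names are illustrative, not from any particular library.

```python
# Sketch: bin predictions by confidence and compute Expected Calibration
# Error from per-prediction confidences (0-1) and binary correctness labels.
import numpy as np

def reliability_bins(confidences, correct, n_bins=10):
    """Return per-bin (avg confidence, accuracy, count) for a reliability diagram."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    rows = []
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        # Include the left edge only in the first bin.
        mask = ((confidences >= lo) if i == 0 else (confidences > lo)) & (confidences <= hi)
        if mask.any():
            rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: |avg confidence - accuracy| per bin, weighted by bin size."""
    n = len(confidences)
    return sum(abs(conf - acc) * count / n
               for conf, acc, count in reliability_bins(confidences, correct, n_bins))
```

Plotting the `reliability_bins` output as confidence versus accuracy gives the diagram; a model that says 95% and is right 60% of the time lands well above the diagonal and contributes a 0.35 gap to its bin.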
Recent benchmarks paint a stark picture. On the SimpleQA benchmark, ECE values across major models range from 0.631 to 0.810. That means the average gap between confidence and accuracy is 63 to 81 percentage points. Even GPT-4o, one of the strongest models available, achieves only 35% accuracy on this benchmark with an ECE of 0.45: on average, its stated confidence overshoots its actual accuracy by 45 percentage points.
Why Logprobs and Verbalized Confidence Diverge
There are two common ways to extract confidence from an LLM, and neither is reliable out of the box.
Token-level logprobs give you the model's softmax probability for each generated token. In theory, these are the model's "true" internal uncertainty estimates. In practice, they are poorly calibrated despite high accuracy on common tokens. The probability distribution across the vocabulary is shaped by training dynamics — RLHF in particular pushes models toward confident, assertive outputs because human raters prefer them. A model that hedges gets lower ratings, so the training process systematically strips away calibrated uncertainty.
Verbalized confidence is when you ask the model to state its confidence as a number: "How confident are you? Rate from 0 to 100." This approach has its own problems. Models cluster their verbal confidence scores near the top of the range regardless of actual performance — they have learned that confident-sounding answers get rewarded, and this manifests in self-reported numbers too.
Studies show that verbalized confidence correlates poorly with correctness, and the correlation degrades further on questions where the model lacks knowledge.
The gap between these two signals creates an additional headache. Token-level logprobs and verbalized confidence often disagree, and neither aligns well with ground truth. A model might assign a 0.4 softmax probability to a token while simultaneously claiming 85% verbalized confidence in the complete answer. This inconsistency makes it harder to build any single confidence pipeline.
The RLHF Overconfidence Trap
The root cause of modern LLM miscalibration is not a mystery — it is RLHF. Pre-trained base models are often reasonably well-calibrated. The fine-tuning process, especially reinforcement learning from human feedback, systematically degrades calibration.
The mechanism is straightforward: human raters prefer confident, clear, direct answers. Hedging, qualifying, and expressing uncertainty all reduce preference scores. So RLHF optimizes for confident presentation regardless of underlying certainty. The model learns that saying "The answer is X" gets rewarded more than saying "I think the answer might be X, but I'm not entirely sure."
This creates a paradox for production systems. The same fine-tuning that makes models useful (instruction-following, conversational fluency, helpful responses) also makes their confidence signals unreliable. You cannot simply swap in a base model; base models do not follow instructions reliably enough for production use. And you cannot trust the fine-tuned model's self-assessment.
The problem is worse than uniform overconfidence. Research shows that large RLHF-tuned models can display increased miscalibration specifically on easier queries — the tasks where you would expect the best calibration. On TriviaQA, GPT-4o's ECE actually increases from 0.071 to 0.083 despite accuracy gains, suggesting that scaling and RLHF together create calibration patterns that are hard to predict.
Measuring Calibration in Your System
Most teams skip calibration measurement entirely. They evaluate accuracy, latency, and cost, but never check whether confidence scores mean anything. Here is how to add calibration measurement without a research team.
Step 1: Collect confidence-accuracy pairs. For every model call where you can later verify correctness, log the confidence score alongside the ground truth label. For classification tasks, this is straightforward. For generation tasks, you need either human evaluation on a sample or an automated correctness check.
Step 2: Build a reliability diagram. Bin your confidence-accuracy pairs into 10-15 buckets. Plot average confidence (x-axis) versus average accuracy (y-axis). The gap between the curve and the diagonal is your calibration error, visualized. Apple's open-source ml-calibration library provides utilities for computing reliability diagrams and calibration metrics.
Step 3: Compute ECE. Expected Calibration Error is the average absolute difference between confidence and accuracy across bins, weighted by each bin's sample count. An ECE of 0.05 is good. An ECE of 0.20 means your confidence scores are almost meaningless. Most production LLMs without recalibration sit in the 0.15-0.40 range.
Step 4: Segment by query type. Calibration varies dramatically across input categories. Person-based queries show ECE of 0.71 in some benchmarks, while place-based queries are much better calibrated. Your model might be well-calibrated for common request types and wildly overconfident on rare ones — exactly the failure mode that hurts most.
For newer calibration metrics, Smooth ECE (smECE) uses kernel smoothing instead of hard binning and is increasingly recommended in the research literature. It avoids the bin-boundary artifacts that can make standard ECE misleading on small datasets.
Recalibration Techniques That Work in Production
Once you have measured the gap, you need to close it. The good news: post-hoc recalibration techniques can dramatically improve calibration without retraining the model.
Temperature scaling is the simplest and most widely used approach. You learn a single scalar parameter T that divides the logits before the softmax. If T > 1, the distribution becomes softer (less confident). If T < 1, it becomes sharper. You optimize T on a held-out validation set to minimize negative log-likelihood. Temperature scaling adds zero latency at inference time and preserves the model's predictions — it only adjusts confidence levels.
The limitation is that a single temperature applies uniformly to all predictions. If your model is overconfident on hard questions and well-calibrated on easy ones, a global temperature cannot fix both.
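Fitting the temperature is a one-parameter optimization. Here is a sketch using NumPy and SciPy, assuming you have raw logits and true labels for a held-out validation set; the function names are illustrative.

```python
# Sketch: fit a single temperature T by minimizing negative log-likelihood
# on held-out (logits, labels) pairs. T > 1 softens an overconfident model.
import numpy as np
from scipy.optimize import minimize_scalar

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(temperature, logits, labels):
    """Negative log-likelihood of the true labels under scaled logits."""
    probs = softmax(logits / temperature)
    return -np.log(probs[np.arange(len(labels)), labels] + 1e-12).mean()

def fit_temperature(logits, labels):
    """Find the scalar T > 0 that minimizes validation NLL."""
    result = minimize_scalar(nll, bounds=(0.05, 20.0),
                             args=(logits, labels), method="bounded")
    return result.x
```

For an overconfident model (say, logits implying 98% confidence on answers that are right 60% of the time), the fitted T comes out well above 1, and dividing future logits by it softens the distribution without changing which answer ranks first.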
Adaptive temperature scaling addresses this by predicting a per-sample temperature using a small auxiliary model. The MIT Thermometer approach trains a lightweight model on representative task examples, then generalizes to new similar tasks. A key finding: auxiliary models trained on smaller LLMs in a model family can transfer to larger ones, reducing the labeled data requirement.
Platt scaling fits a logistic regression on the model's logits to produce calibrated probabilities. It is more flexible than temperature scaling (two parameters instead of one) but still assumes a specific functional form. It works well when you have enough validation data — typically a few hundred labeled examples per task.
Isotonic regression is the non-parametric option. It fits a piecewise-constant, non-decreasing function from confidence to calibrated probability, making no assumptions about the shape of miscalibration. The tradeoff: it requires more validation data and can overfit on small datasets.
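Both techniques are a few lines with scikit-learn. The sketch below fits them on synthetic held-out (confidence, correctness) pairs from a deliberately overconfident model; the data-generating assumption (true accuracy roughly equal to raw confidence squared) is purely illustrative.

```python
# Sketch: Platt scaling and isotonic regression on held-out
# (confidence, correctness) pairs, using scikit-learn.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(0)
# Synthetic validation data from an overconfident model: true accuracy
# is roughly (raw confidence) ** 2, so e.g. 0.9 confidence -> ~0.81 accuracy.
raw_conf = rng.uniform(0.5, 1.0, size=2000)
correct = (rng.uniform(size=2000) < raw_conf ** 2).astype(int)

# Platt scaling: logistic regression on the log-odds of the raw confidence.
log_odds = np.log(raw_conf / (1 - raw_conf + 1e-12)).reshape(-1, 1)
platt = LogisticRegression().fit(log_odds, correct)
platt_probs = platt.predict_proba(log_odds)[:, 1]

# Isotonic regression: non-parametric, monotone map from confidence
# to empirical accuracy; no assumed functional form.
iso = IsotonicRegression(out_of_bounds="clip", y_min=0.0, y_max=1.0)
iso_probs = iso.fit_transform(raw_conf, correct)
```

In deployment you would fit once on validation data, then apply `platt.predict_proba` or `iso.predict` to production confidence scores; refit when the input distribution drifts.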
Verbalized confidence calibration is an emerging approach for black-box models where you lack logit access. You prompt the model to output a confidence score, collect a dataset of (verbalized confidence, correctness) pairs, and fit a calibration function on top. Recent work shows this can reduce ECE substantially, though the noisier signal means you need more calibration data.
For most production systems, start with temperature scaling. If your error analysis reveals input-dependent miscalibration patterns, graduate to adaptive temperature scaling or isotonic regression. Reserve verbalized confidence calibration for cases where you truly cannot access logprobs.
Designing Systems Around Uncertain Confidence
Even after recalibration, confidence scores are estimates, not guarantees. The real engineering challenge is building systems that degrade gracefully when confidence is wrong.
Threshold with hysteresis, not hard cutoffs. Instead of routing to human review at confidence < 0.8, use a band: auto-approve above 0.85, auto-reject below 0.3, human review in between. This prevents oscillation around a single threshold and acknowledges that calibration is imperfect near any single boundary.
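The banded routing rule is trivial to implement. A minimal sketch, using the thresholds from the text as illustrative defaults rather than recommendations:

```python
# Sketch: three-way routing on a calibrated confidence score, with a
# human-review band instead of a single hard cutoff.
def route(confidence, approve_above=0.85, reject_below=0.30):
    """Map a calibrated confidence in [0, 1] to a routing decision."""
    if confidence >= approve_above:
        return "auto_approve"
    if confidence < reject_below:
        return "auto_reject"
    return "human_review"
```

The width of the middle band should reflect how much residual calibration error you measured: the larger your post-recalibration ECE near the boundary, the wider the band.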
Monitor calibration drift. Calibration degrades over time as the input distribution shifts. The temperature you optimized last month may be wrong this month. Run your reliability diagram on a rolling window of production data and alert when ECE exceeds your threshold.
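A rolling-window monitor can be self-contained in a short class. The window size, bin count, and alert threshold below are assumptions to adjust for your traffic volume:

```python
# Sketch: keep a rolling window of (confidence, correct) pairs from
# production and alert when windowed ECE crosses a threshold.
from collections import deque

class CalibrationMonitor:
    def __init__(self, window=1000, n_bins=10, ece_alert=0.15):
        self.pairs = deque(maxlen=window)
        self.n_bins = n_bins
        self.ece_alert = ece_alert

    def record(self, confidence, correct):
        self.pairs.append((confidence, 1.0 if correct else 0.0))

    def ece(self):
        """Windowed ECE: |avg confidence - accuracy| per bin, size-weighted."""
        bins = [[] for _ in range(self.n_bins)]
        for conf, corr in self.pairs:
            idx = min(int(conf * self.n_bins), self.n_bins - 1)
            bins[idx].append((conf, corr))
        n = len(self.pairs)
        total = 0.0
        for b in bins:
            if b:
                avg_conf = sum(c for c, _ in b) / len(b)
                acc = sum(x for _, x in b) / len(b)
                total += abs(avg_conf - acc) * len(b) / n
        return total

    def should_alert(self):
        # Only alert on a full window to avoid noisy cold-start readings.
        return len(self.pairs) == self.pairs.maxlen and self.ece() > self.ece_alert
```

Feeding it labeled production samples as they are verified (human review outcomes, delayed ground truth) gives you a continuously updated ECE without a batch evaluation job.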
Ensemble confidence signals. Combine token-level logprobs, verbalized confidence, and semantic consistency (asking the same question multiple ways and checking agreement). No single signal is well-calibrated, but their disagreement is itself informative. When all three signals agree on high confidence, you can trust it more than any individual score.
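One simple way to operationalize this is to aggregate pessimistically and treat disagreement as its own signal. The thresholds below are illustrative assumptions:

```python
# Sketch: combine three confidence signals; trust high confidence only
# when the signals agree with each other.
def combined_confidence(logprob_conf, verbal_conf, consistency, agree_tol=0.2):
    """Given three signals in [0, 1], return (score, trusted).

    score: conservative aggregate (the minimum signal).
    trusted: True only when the aggregate is high AND the signals
             are mutually consistent (spread within agree_tol).
    """
    signals = [logprob_conf, verbal_conf, consistency]
    spread = max(signals) - min(signals)
    score = min(signals)  # pessimistic: weakest signal wins
    trusted = score >= 0.8 and spread <= agree_tol
    return score, trusted
```

Taking the minimum is deliberately conservative; a learned combiner (e.g. logistic regression over the three signals) can do better once you have labeled production data to fit it on.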
Report uncertainty to users. If your product exposes AI-generated answers, consider showing calibrated confidence rather than hiding it. Users who see "medium confidence" make better decisions than users who see a confidently stated wrong answer. The key is that displayed confidence must be calibrated — showing raw model confidence just trains users to ignore the signal.
The Calibration Investment
Adding calibration measurement and recalibration to your pipeline requires modest effort: a few hundred labeled examples, a reliability diagram, and a temperature scaling parameter. The return is disproportionate. Every downstream system that consumes confidence scores — routing, escalation, caching, user-facing indicators — becomes more reliable.
The teams that skip calibration are not saving time. They are deferring a debugging session to a much more expensive moment: when a production system makes a high-stakes decision based on a confidence score that meant nothing. By then, the cost is measured in user trust, not engineering hours.
Calibration is not glamorous work. It does not produce impressive demo numbers or appear in product announcements. But it is the difference between an AI system that knows what it does not know and one that confidently walks off a cliff. In production, that difference is everything.
- https://arxiv.org/html/2502.11028
- https://arxiv.org/abs/2409.19817
- https://arxiv.org/html/2410.06707v1
- https://news.mit.edu/2024/thermometer-prevents-ai-model-overconfidence-about-wrong-answers-0731
- https://aclanthology.org/2024.naacl-long.366/
- https://github.com/apple/ml-calibration
- https://proceedings.iclr.cc/paper_files/paper/2024/file/06cf4bae7ccb6ea37b968a394edc2e33-Paper-Conference.pdf
