The Confidence Score Your Users Learned to Ignore
You wanted to be honest. You put a little "92%" next to every answer your agent gave. After the third time the agent was confidently wrong at 92%, your users stopped reading the number. They did not get angry about it. They just learned, the way humans always learn around a misbehaving signal, that the gauge on the dashboard is not connected to the engine. The number is still there. It costs you tokens to produce it. It informs no decision anyone makes.
This is the failure mode that calibration UX research keeps rediscovering: surfacing a probability is a trust commitment, and the commitment goes one direction. The moment the number turns out to be uncorrelated with correctness in the user's lived experience, the score is dead — and the trust you spent putting it there is dead with it. You cannot un-ring that bell by fixing the number later. The number is now decoration.
The instinct to surface uncertainty comes from a good place. You read the paper about overconfidence. You watched a user act on a wrong answer because the model was too fluent and too definite. You decided your product would be different — it would tell the truth about what it knows. You wrote a sentence in the design doc about "trust calibration," and you shipped a confidence chip on every output. What you actually shipped, in most cases, was the appearance of calibration backed by an uncalibrated number, and your users did exactly what every study of automation bias predicts: they oscillated for a while, hit a few sharp disconfirmations, and quietly defaulted to ignoring the gauge.
What the Number Is Measuring (and What It Isn't)
The number you put on the screen has a name in the literature. It is the model's expressed confidence in its own next tokens — either a verbalized "I'm 92% sure" or a transformation of logit probabilities into a percentage. It is a statement about the model's internal certainty about the string it produced.
That is not what the user thinks the number means. The user reads it as a probability that the answer is correct in the world. Those are different quantities, and on contemporary chat-tuned models they are not even close.
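To make the distinction concrete, here is a minimal sketch of how a logit-derived percentage is typically produced. It assumes you have per-token log-probabilities for the generated answer; the function name and the sample values are hypothetical, and the length-normalized variant is only one common convention, not the only one.

```python
import math

def sequence_confidence(token_logprobs):
    """Turn per-token log-probabilities into two candidate 'confidence' numbers."""
    total = sum(token_logprobs)
    joint = math.exp(total)                              # probability of the exact string
    per_token = math.exp(total / len(token_logprobs))    # geometric mean per token
    return joint, per_token

# Hypothetical log-probabilities for a four-token answer.
logprobs = [-0.05, -0.10, -0.02, -0.15]
joint, per_token = sequence_confidence(logprobs)
print(f"joint={joint:.0%}  per-token={per_token:.0%}")
```

With these made-up values the per-token figure comes out near 92% while the joint figure is closer to 73%. Neither computation ever touches ground truth, which is the point: both are statements about the string, not about the world.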
The gap has been measured. Studies on GPT-3, GPT-3.5, and Vicuna-class models find Expected Calibration Error above 0.37 on verbalized confidence — the model's stated probability is off from its true accuracy by roughly thirty-seven percentage points on average, with most predictions clustered in the 90–100% bucket regardless of whether the answer is right. GPT-4 does better but still posts an AUROC of about 0.63 for using its verbalized confidence to discriminate correct from incorrect answers, only slightly better than a coin flip. One study found GPT-4 assigning its highest possible confidence to 87% of responses, including many that were factually wrong. The 90-and-above region of the histogram is doing all the work, and most of what is in that region does not belong there.
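For readers who have not computed these metrics, the following is a rough sketch of both, using equal-width bins for ECE and a pairwise-comparison definition of AUROC. The function names, the bin count, and the toy data are assumptions for illustration; the cited studies may use different binning or tooling.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Sample-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the top bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / total) * abs(accuracy - avg_conf)
    return ece

def auroc(confidences, correct):
    """Chance that a random correct answer outranks a random wrong one (ties count half)."""
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: ten answers, every one stated at 0.95, six of them actually correct.
confs = [0.95] * 10
labels = [True] * 6 + [False] * 4
ece = expected_calibration_error(confs, labels)
auc = auroc(confs, labels)
print(round(ece, 2), auc)
```

On this toy set the ECE is 0.35 and the AUROC is exactly 0.5. A model that says the same high number on everything is both badly calibrated and useless for telling right answers from wrong ones, which is the pattern the clustered 90-100% bucket produces.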
The mechanism is not mysterious. RLHF trains the model to sound the way humans approve of, and humans approve of confident-sounding answers. The reward signal during alignment systematically prefers a tone of certainty, even when the underlying answer is wrong. Recent mechanistic work has localized a compact set of MLP blocks and attention heads, mostly in middle-to-late layers, that consistently write a confidence-inflation signal at the final token position. Overconfidence is not a bug in the post-processing. It is a side effect the training procedure bakes in.
So when you take that number, dress it up as a probability, and stick it on the screen, you are surfacing an artifact of the alignment pipeline as if it were a calibrated probability. It is not. And the user does not get to know it isn't.
Why Showing It Anyway Is Worse Than Showing Nothing
Reasonable people will say: imperfect signal is better than no signal. If the number correlates at all, surely surfacing it nudges users in the right direction.
The empirical record disagrees. The trust-calibration literature documents a specific dynamic: when a confidence display has low or unreliable correlation with actual correctness, users do not weight it down to its true informational value. They oscillate. They over-trust early, hit a high-confidence wrong answer, swing into under-trust, hit a low-confidence right answer, swing back. Over a few days of use they settle into ignoring the signal entirely and, importantly, into a baseline of either over- or under-trust, depending on which side of the oscillation they were on when they gave up.
This is the part the "imperfect signal is better than none" argument misses. Showing the number is not a free act. It spends user attention. It spends user trust. If the number does not pay either of those back with usable information, the user ends up worse calibrated than if you had shown nothing — because their priors are now polluted by a phase of trying to use the gauge and learning that it lied.
There is a related, sneakier cost. Once you have shown a number, your product is now in the business of defending it. A user who saw 92% next to a wrong answer is not going to forget. The signal you wanted to give — "we are honest about uncertainty" — is the opposite of the signal they received, which is "this product is wrong with high confidence." You have to either fix the calibration, hide the number, or live with the credibility hit. There is no quiet exit.
The Three Honest Choices
Once you accept that an uncalibrated probability shown to a user is worse than no number, the design space collapses to three options. Each is a real commitment.
Hide the number. This is the option most products should actually take and don't, because it feels like a regression. You stop displaying a confidence score and let the answer stand on its own. The hidden assumption — and it is the right assumption for most consumer surfaces — is that the model's tone already encodes its confidence to the limited extent it can, and that an additional numeric overlay is at best redundant and at worst misleading. Hiding the number is not the same as giving up on calibration. It is recognizing that the user-facing surface is not where a probability belongs unless that probability is good.
Fix the number. This is the expensive option. You build the calibration infrastructure: reliability diagrams against held-out evals, post-hoc temperature scaling on logit-derived scores, or one of the verbal-elicitation strategies that consistently outperform raw probability extraction. You instrument the number against ground truth and you retune until your ECE is somewhere a probability is allowed to be — single-digit percentage points off accuracy, not thirty-seven. You do this per task type, because calibration is not a single number across the space of inputs. And you keep doing it, because the next model snapshot will move the curve. This is real work, and it is the only path that lets you keep showing a percentage with integrity.
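As one illustration of the post-hoc step, here is a minimal temperature-scaling sketch in plain Python. It assumes a task with a fixed answer set (multiple choice, say) where you can read per-option logits, and it fits a single temperature by grid search on held-out negative log-likelihood. The grid, the function names, and the toy logits are all assumptions; a production version would more likely use a numerical optimizer.

```python
import math

def softmax(logits, temperature):
    """Softmax over logits divided by a temperature (T > 1 flattens the distribution)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def nll(logit_rows, labels, temperature):
    """Mean negative log-likelihood of the true labels at a given temperature."""
    loss = 0.0
    for logits, y in zip(logit_rows, labels):
        loss -= math.log(softmax(logits, temperature)[y])
    return loss / len(labels)

def fit_temperature(logit_rows, labels, grid=None):
    """Pick the temperature on a coarse grid that minimizes held-out NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25 .. 10.0
    return min(grid, key=lambda t: nll(logit_rows, labels, t))

# Toy held-out set: the model always puts ~98% on option 0, but is right 3 times in 4.
held_out_logits = [[4.0, 0.0]] * 4
held_out_labels = [0, 0, 0, 1]
T = fit_temperature(held_out_logits, held_out_labels)
raw = softmax([4.0, 0.0], 1.0)[0]
scaled = softmax([4.0, 0.0], T)[0]
print(T, round(raw, 3), round(scaled, 3))
```

The fitted temperature is well above 1, and the rescaled top probability drops from roughly 0.98 to roughly the observed 75% accuracy. Temperature scaling changes how confident the model claims to be, not which answer it picks, so it fixes the number without touching accuracy.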
Replace the number with a tier. This is the option most product teams should consider seriously and don't because it feels less precise. Instead of "92%," you show one of three labels: "confident," "verify this," or "I'm not sure — check a source." The categorical tier is computed from the same underlying signals but only commits to a coarse-grained statement you can actually back. Tiering works because the threshold between buckets becomes the calibration target, not the absolute number. You only have to be right about the boundary between "confident" and "verify this" being the boundary where accuracy drops below some operating point. That is a much easier surface to keep honest, and crucially, it is a surface that does not invite the user to over-interpret a spurious percentage.
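A rough sketch of how tier boundaries might be derived from a held-out set follows. It assumes you have a raw score and a correctness label per held-out answer; the function names, the target accuracies (0.9 and 0.7), and the sample data are hypothetical, and the third label is shortened to "not sure" for brevity.

```python
def lowest_threshold_meeting(confidences, correct, target_accuracy):
    """Smallest cutoff t such that held-out answers scoring >= t reach the target accuracy."""
    for t in sorted(set(confidences)):
        kept = [ok for c, ok in zip(confidences, correct) if c >= t]
        if sum(kept) / len(kept) >= target_accuracy:
            return t
    return None  # no cutoff reaches the target on this data

def tier(confidence, confident_cutoff, verify_cutoff):
    """Map a raw score to one of three coarse labels."""
    if confident_cutoff is not None and confidence >= confident_cutoff:
        return "confident"
    if verify_cutoff is not None and confidence >= verify_cutoff:
        return "verify this"
    return "not sure"

# Hypothetical held-out scores and whether each answer was correct.
confs = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
right = [True, True, True, True, False, True, False, False]

confident_cutoff = lowest_threshold_meeting(confs, right, target_accuracy=0.9)
verify_cutoff = lowest_threshold_meeting(confs, right, target_accuracy=0.7)
print(confident_cutoff, verify_cutoff, tier(0.93, confident_cutoff, verify_cutoff))
```

Only the two cutoffs have to stay honest here. On a set this small the estimates are noisy, so in practice you would want enough held-out examples near each boundary to trust the accuracy on either side of it.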
There is a fourth option people sometimes propose — show the number plus a disclaimer — and it is the worst of all four. The number still anchors. The disclaimer is read once and discarded. You have hedged your text without hedging the cognitive effect.
What Calibration Actually Costs to Get Right
If you do choose to fix the number, the work is not glamorous. The honest list of what calibration requires:
- A held-out eval set per task type that mirrors production distribution, refreshed as the production distribution drifts. Without this you cannot draw a reliability diagram, and without a reliability diagram you cannot tell whether the score you are showing means anything.
- Either logit-based scores with post-hoc temperature scaling, or verbalized scores with the elicitation prompt structure tuned against the eval. Recent work suggests verbalized confidence tends to be better-calibrated than raw conditional probabilities for RLHF-tuned models — but only after you have specifically trained or prompted for it. The default verbalized number out of a chat model is the one that is off by thirty-seven percentage points.
- A monitoring system that re-runs the eval whenever the model snapshot moves and produces an updated reliability curve before the new model touches users. Calibration drifts model-to-model and sometimes within a single model over time. Whatever calibration you achieved is a property of a specific binding of model, prompt, and task — it does not transfer.
- An operating point chosen against a business cost, not a vibe. Once the score is calibrated, you still have to decide what to do with it. Selective-prediction work formalizes this as the accuracy-coverage tradeoff: at a given confidence threshold, what fraction of queries do you answer (coverage), and how accurate are you on the answered subset? Picking the threshold is a product decision about the cost of a wrong answer versus the cost of a refusal. It is not the model's job.
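The last item on the list can be sketched in a few lines. This assumes a held-out set of calibrated scores with correctness labels and two hypothetical per-query costs, one for a wrong answer and one for a refusal. The cost values and function names are illustrative, not recommendations.

```python
def accuracy_coverage(confidences, correct, threshold):
    """Coverage = share of queries answered; accuracy is measured on the answered subset."""
    answered = [ok for c, ok in zip(confidences, correct) if c >= threshold]
    coverage = len(answered) / len(confidences)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

def expected_cost(confidences, correct, threshold, cost_wrong, cost_refusal):
    """Average per-query cost if everything below the threshold is refused."""
    cost = 0.0
    for c, ok in zip(confidences, correct):
        if c < threshold:
            cost += cost_refusal
        elif not ok:
            cost += cost_wrong
    return cost / len(confidences)

def pick_threshold(confidences, correct, cost_wrong, cost_refusal):
    """Cheapest threshold among observed scores, plus 'refuse everything' (infinity)."""
    candidates = sorted(set(confidences)) + [float("inf")]
    return min(
        candidates,
        key=lambda t: expected_cost(confidences, correct, t, cost_wrong, cost_refusal),
    )

# Hypothetical held-out scores and correctness labels.
confs = [0.99, 0.97, 0.95, 0.90, 0.85, 0.80, 0.70, 0.60]
right = [True, True, True, True, False, True, False, False]

strict = pick_threshold(confs, right, cost_wrong=5.0, cost_refusal=1.0)
lenient = pick_threshold(confs, right, cost_wrong=1.0, cost_refusal=1.0)
print(strict, accuracy_coverage(confs, right, strict), lenient)
```

With a wrong answer priced at five refusals, the cheapest threshold on this toy data is 0.90, which answers half the queries. Price the two equally and the cheapest policy is to answer everything. The same scores give different operating points, which is why the threshold belongs to the product owner rather than the model.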
This is most of an applied-ML team's calendar. It is not something a UX designer adds in a sprint. Teams that show a confidence number without doing this work are almost always showing the raw, alignment-distorted confidence of an RLHF-tuned model, and the number is doing exactly what the literature predicts it will do.
The Trust You Already Spent
The hardest part of this conversation is the version where the number is already shipped. Users have already learned to ignore it. Your team has invested in the UI affordance. Pulling it would look like a step backward, and the design review will go badly.
The case to make internally is that the trust you spent putting the number there is not recoverable by leaving it. Every additional day the uncalibrated score sits in the interface is another day users are training their priors against your product's honesty signals. The cheapest path back is usually to demote the number — replace the percentage with a tier label or remove it entirely — and rebuild the uncertainty surface around something you can actually back. Users will not notice the absence the way you fear they will. They had already stopped reading the number.
The deeper lesson is that surfacing uncertainty is not a one-line design choice. It is a contract between the model's internal states and the user's decision-making, and the contract is enforceable only as far as the calibration goes. Showing a probability that you have not earned the right to show is worse than not showing one at all, because the only thing more expensive than missing information in a product is misleading information delivered with the visual grammar of a fact. The chip looked precise. The number looked like a fact. Users believed it for a while. Then they didn't.
The next time someone on your team proposes adding a confidence percentage to an output, the right question is not "will this help the user." It is "do we have a reliability diagram for this score on this task, and are we willing to maintain that diagram every time the model moves." If the answer is no, the right product move is to hide the gauge — or replace it with a categorical tier whose boundaries you can defend. Anything else is renting credibility you do not yet have, and the rent comes due fast.
- https://arxiv.org/abs/2503.15850
- https://arxiv.org/abs/2306.13063
- https://arxiv.org/abs/2410.09724
- https://arxiv.org/pdf/2305.14975
- https://arxiv.org/pdf/2412.14737
- https://arxiv.org/html/2502.11028v3
- https://arxiv.org/pdf/2502.06884
- https://www.visible-language.org/journal/issue-59-2-addressing-uncertainty-in-llm-outputs-for-trust-calibration-through-visualization-and-user-interface-design/
- https://dl.acm.org/doi/10.1145/3696449
- https://link.springer.com/article/10.1007/s00146-025-02422-7
- https://aclanthology.org/2025.tacl-1.26.pdf
