Confidence Strings, Not Scores: Why Your 0.87 Badge Moves Nobody
The product team ships a confidence badge next to every AI suggestion. Green for ≥85%, yellow for 60–84%, red below. They run an A/B test six weeks later and find no change in user behavior at any threshold. False positives at 0.92 confidence get accepted at the same rate as false positives at 0.61 confidence. The team's instinct is to tune the calibration — fit a temperature scaling layer, regenerate the badges, run the A/B again. The numbers shift; the behavior doesn't.
The problem isn't that the model is miscalibrated, though it almost certainly is. The problem is that calibrated probability is the wrong output. The signal a user can act on isn't "how sure" the model is. It's "what specifically the model didn't check." A 0.87 badge tells the user nothing they can verify. "I'm reasonably confident in the address but I haven't checked the unit number" tells them exactly where to look.
This is the gap between the academic frame for uncertainty — where the goal is a probability that matches empirical accuracy — and the production frame, where the goal is a string the user can act on. The two have been conflated for so long that most teams ship the academic artifact and assume users will translate it into action. They don't. They ignore it.
The probability badge is a UI dead end
The literature on automation bias gives the bad news clearly: users overtrust confident AI outputs and ignore the confidence signal when it conflicts with their existing belief. Studies across 2024–2025 show that a numeric confidence score functions less as a decision input and more as a rubber stamp — users who already wanted to accept the suggestion treat 0.61 as "good enough," and users who already wanted to reject it treat 0.92 as "the model is overfit."
There's a structural reason this happens. A probability is a summary statistic. It compresses every reason the model is uncertain into a single number, and the number doesn't tell the user which reason it is. Did the model not see this exact pattern in training? Was the input ambiguous? Was a tool call truncated? Did the retrieved context disagree with itself? All of these collapse into the same 0.73, and 0.73 doesn't tell the user where to look.
The user's actual decision question is not "how confident is the model" but "what should I check before I trust this." A scalar can't answer that question because the answer is a pointer to a specific field, claim, or assumption. The probability is the wrong shape of signal for the decision being made.
Calibration research has spent a decade trying to make these scalars more accurate. The Expected Calibration Error of frontier models in 2025 sits between 0.05 and 0.20 depending on task — meaning the displayed probability is off by 5 to 20 percentage points on average. Alignment training tends to make this worse, not better; preference-collapse during RLHF systematically pushes models toward overconfidence because confident answers get rewarded. Post-hoc calibration techniques like temperature scaling can recover some of this. None of it changes user behavior, because the user wasn't reading the number in the first place.
What does change user behavior
The HCI literature is more useful here than the calibration literature. A 2024 study at FAccT found that first-person uncertainty expressions like "I'm not sure, but…" significantly decreased user agreement with incorrect AI suggestions and increased user accuracy on the joint task — while general-perspective hedges like "It's not clear, but…" had no statistically significant effect. The grammatical person matters: users respond to a model that owns its uncertainty more than to a model that gestures at uncertainty in the abstract.
A separate 2025 paper on verbalized uncertainty in AI-assisted decision-making found that users prefer "medium" verbalized uncertainty — language that acknowledges limitations without sounding overwhelmed by them — over either confident assertion or constant hedging. Trust and task performance both peaked at this middle setting, and the effect was independent of the model's actual accuracy. Users were responding to the texture of the language, not the underlying probability.
The effect that's missing from the probability-badge approach is specificity. "I'm reasonably sure" doesn't help. "I'm reasonably sure, but I haven't checked whether this address has a unit number" helps a lot. The user now has a closed-loop action: look at the unit number, decide. The narrative names a verifiable claim, and the user can verify it in seconds. The probability badge would have shown 0.84 and the user would have accepted or rejected based on prior bias.
This is what an uncertainty narrative does that a probability score can't: it transfers the verification work from "is the model right" (unanswerable in finite time) to "is this specific claim right" (answerable in seconds). The narrative compresses the model's residual uncertainty into the smallest claim that, if checked, resolves it.
Generating uncertainty narratives as a first-class output
The implementation pattern looks different from a calibration layer. You don't post-process logits into a calibrated probability; you ask the model to emit, alongside its answer, a structured field that names which sub-claims it's confident in and which it isn't, and what the user would check to resolve each uncertain sub-claim.
The output schema has three fields per uncertain sub-claim: the claim, the source of uncertainty, and the cheapest verification. "Address: 1234 Elm St, Apt 7B" → uncertainty: "the unit number was extracted from a phone-call transcript with background noise" → verify: "confirm with the customer record." A short string the user can act on, generated alongside the answer rather than computed after it.
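Concretely, assuming the answer and its narrative come back as one structured object (the field names here are illustrative for this sketch, not an established standard), the output might look like this:

```python
# Illustrative shape of the answer-plus-narrative object. Field names
# ("uncertainties", "claim", "source", "verify", "span") are assumptions
# for this sketch, not an established schema.
response = {
    "answer": "Ship to: 1234 Elm St, Apt 7B, Springfield",
    "uncertainties": [
        {
            "claim": "the unit number is 7B",
            "source": "extracted from a phone-call transcript with background noise",
            "verify": "confirm the unit number against the customer record",
            "span": "Apt 7B",  # a pointer into the answer, not a global flag
        }
    ],
}
```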
A few engineering points are non-obvious; a schema-and-validation sketch that encodes them follows the list:
- Force the model to emit the narrative even when confident. If you only ask for an uncertainty narrative when the model thinks it might be wrong, the model will think it's right, and you'll get nothing. Treat the field as required and let the model say "no material uncertainties" when warranted. Empty narratives carry information too.
- Constrain to verifiable claims, not feelings. "I'm not sure if this is right" is useless. "I haven't verified that the SKU is in stock" is actionable. The model needs an explicit instruction that uncertainty narratives must name a fact the user can check, not a vibe.
- Bound the narrative length. Three to five sub-claims max. Past that you've reproduced the original problem in a different shape — a wall of hedges that users will skim and ignore. The discipline is to surface the most decision-relevant uncertainties, not all of them.
- Separate uncertainty narrative from refusal. A narrative is "I answered, here's what I'm not sure about." A refusal is "I won't answer because I'm too unsure." These have different downstream UX. Conflating them produces a model that hedges everything, which trains users to ignore the hedges.
- Pin the narrative to specific output spans, not the whole answer. "The middle paragraph mentions a date I haven't verified" beats "I'm somewhat unsure about parts of this." The user needs a pointer, not a global score in disguise.
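A minimal sketch of how those constraints could be enforced as a validation pass over the structured output, reusing the field names from the earlier example; the class names, thresholds, and checks are illustrative, not a prescribed implementation:

```python
from dataclasses import dataclass, field

MAX_SUBCLAIMS = 5  # bound the narrative; beyond this users skim and ignore it

@dataclass
class UncertainClaim:
    claim: str   # the specific, checkable sub-claim ("the unit number is 7B")
    source: str  # why the model is unsure ("noisy phone-call transcript")
    verify: str  # the cheapest action that resolves it ("check the customer record")
    span: str    # the exact answer text the claim points at

@dataclass
class NarratedAnswer:
    answer: str
    refused: bool = False  # refusal is a separate signal, not an uncertainty entry
    # Required even when confident: an empty list means the model asserted
    # "no material uncertainties", not that the field was skipped.
    uncertainties: list[UncertainClaim] = field(default_factory=list)

def validate(out: NarratedAnswer) -> list[str]:
    """Return reasons to reject and regenerate this output; empty means it passes."""
    problems = []
    if len(out.uncertainties) > MAX_SUBCLAIMS:
        problems.append("too many hedges; surface only decision-relevant uncertainties")
    for u in out.uncertainties:
        if not u.verify.strip():
            problems.append(f"claim {u.claim!r} names no verification action")
        if u.span and u.span not in out.answer:
            problems.append(f"span {u.span!r} does not appear in the answer")
    return problems
```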
The narrative is more expensive to generate than a confidence number — it's tokens, and they're typically generated in the same pass as the answer. In practice this is a 5–15% latency tax depending on the model and answer length. That tax is worth paying because the alternative — a confidence number that nobody acts on — is paying for the work and getting no decision-quality improvement at the user layer.
The eval methodology has to change too
The standard eval for a confidence score is calibration: bin predictions by stated probability, measure empirical accuracy in each bin, compute Expected Calibration Error. This is the right eval for a probability and the wrong eval for a narrative. A perfectly calibrated probability still doesn't change user behavior. An imperfectly calibrated narrative that names the right concrete claim does.
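For contrast, the calibration eval described above fits in a few lines; a minimal sketch, assuming `confidences` and `correct` are parallel arrays over a held-out set:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by stated confidence, then average the per-bin gap
    between stated confidence and empirical accuracy, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() is the fraction of predictions in this bin
    return ece
```

A perfect score here still says nothing about whether the user knew what to check.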
The eval for an uncertainty narrative grades actionability. The rubric needs three orthogonal dimensions:
- Specificity. Does the narrative name a verifiable sub-claim? "I'm not sure" fails. "I'm not sure about the unit number" passes. This grades against a binary criterion: can a user, reading only the narrative, identify a finite verification action?
- Coverage. Of the actual errors in the answer, what fraction were named in the narrative? This is recall. A narrative that perfectly flags the unit number but misses a wrong street name has 50% coverage. You build this eval by collecting answer-level errors, then checking whether each error was surfaced in the narrative.
- Precision. Of the items the narrative flagged as uncertain, what fraction were actually wrong? A narrative that flags everything has 100% coverage and ~10% precision, and trains users to ignore it. The same collapse into ignored noise that numeric badges suffer happens with verbose narratives.
The interesting dynamic is that these three dimensions trade off in different ways than calibration error does. You can have perfect calibration with terrible specificity (every answer gets a probability, no answer gets an explanation). You can have great specificity with terrible coverage (the model names one obvious uncertainty per answer and misses three subtle ones). Optimizing one doesn't get you the others, and the joint optimum is closer to "the right amount of medium-grained hedging" than to "a highly accurate probability."
A practical setup: hold out a labeled error set from production, where each error has been classified by which sub-claim is wrong. Run the model with uncertainty narratives enabled. For each example, score whether the narrative named the actually-wrong sub-claim (coverage), whether it named only sub-claims that turned out to be wrong (precision), and whether each named claim was specific enough that a user could verify it in under 30 seconds (specificity). Track all three over time; treat any single dimension dropping as a regression.
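A sketch of that scoring pass, assuming the labeled errors and the narrative's flags have already been matched to a shared set of sub-claim labels (how the matching is done, by exact field names or an LLM judge, is left open), and that a separate grader has marked each flag as verifiable in under 30 seconds or not:

```python
from dataclasses import dataclass

@dataclass
class GradedExample:
    wrong_subclaims: set[str]    # sub-claims that are actually wrong (labeled error set)
    flagged_subclaims: set[str]  # sub-claims the narrative flagged as uncertain
    specific_flags: set[str]     # flagged sub-claims a user could verify in under 30 seconds

def score(examples: list[GradedExample]) -> dict[str, float]:
    errors = flags = hits = specific = 0
    for ex in examples:
        errors += len(ex.wrong_subclaims)
        flags += len(ex.flagged_subclaims)
        hits += len(ex.wrong_subclaims & ex.flagged_subclaims)
        specific += len(ex.specific_flags & ex.flagged_subclaims)
    return {
        "coverage": hits / errors if errors else 1.0,       # recall over real errors
        "precision": hits / flags if flags else 1.0,        # flags that were actually wrong
        "specificity": specific / flags if flags else 1.0,  # flags a user could act on quickly
    }
```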
The architectural shift
Treating uncertainty as a probability and treating it as a narrative are different architectural decisions, not different surfaces on the same primitive. The probability path puts uncertainty at the end of the pipeline — calibrate logits, expose a number, render a badge. The narrative path puts uncertainty inside the answer — generate it in the same forward pass, schema-constrain it, eval it as a content field.
This has consequences upstream. If uncertainty is content, it's part of the system prompt's responsibility. The prompt has to instruct the model on what counts as a verifiable claim, what counts as too vague to surface, and how granular to go. It's part of the eval surface. It's part of the regression suite when you upgrade models. It's part of the few-shot examples. It's part of every artifact the model produces.
It also collapses one of the long-running false dichotomies in AI UX: "should we show the user a confidence indicator or not?" The right answer in the narrative frame is "always, because it's part of the answer, and the answer without it is a lie of omission." A model that generates an answer without naming what it didn't check is a model that's been pre-stripped of the most decision-relevant information it has.
The teams I've seen succeed at this stop treating uncertainty as a UI element and start treating it as a generation requirement. They write evals against narratives, not probabilities. They train product designers to ask "what would the user check" rather than "what threshold should the badge turn red." They retire the calibration sprint, ship the narrative schema, and find that user accuracy on the joint task improves measurably while the badge they spent six weeks tuning gets quietly removed.
The badge wasn't moving anyone. The string is.
- https://arxiv.org/abs/2405.00623
- https://www.sciencedirect.com/science/article/pii/S1071581925000126
- https://arxiv.org/abs/2306.13063
- https://arxiv.org/html/2402.07632v4
- https://arxiv.org/html/2503.15850
- https://arxiv.org/html/2511.11500
- https://openai.com/index/how-confessions-can-keep-language-models-honest/
- https://pair.withgoogle.com/chapter/explainability-trust/
- https://agentic-design.ai/patterns/ui-ux-patterns/confidence-visualization-patterns
- https://www.smashingmagazine.com/2026/02/designing-agentic-ai-practical-ux-patterns/
