Confidence Strings, Not Scores: Why Your 0.87 Badge Moves Nobody

· 10 min read
Tian Pan
Software Engineer

The product team ships a confidence badge next to every AI suggestion. Green for ≥85%, yellow for 60–84%, red below. They run an A/B test six weeks later and find no change in user behavior at any threshold. False positives at 0.92 confidence get accepted at the same rate as false positives at 0.61 confidence. The team's instinct is to tune the calibration — fit a temperature scaling layer, regenerate the badges, run the A/B again. The numbers shift; the behavior doesn't.

The problem isn't that the model is miscalibrated, though it almost certainly is. The problem is that calibrated probability is the wrong output. The signal a user can act on isn't "how sure" the model is. It's "what specifically the model didn't check." A 0.87 badge tells the user nothing they can verify. "I'm reasonably confident in the address but I haven't checked the unit number" tells them exactly where to look.

This is the gap between the academic frame for uncertainty — where the goal is a probability that matches empirical accuracy — and the production frame, where the goal is a string the user can act on. The two have been conflated for so long that most teams ship the academic artifact and assume users will translate it into action. They don't. They ignore it.

The probability badge is a UI dead end

The literature on automation bias gives the bad news clearly: users overtrust confident AI outputs and ignore the confidence signal when it conflicts with their existing belief. Studies across 2024–2025 show that a numeric confidence score functions less as a decision input and more as a rubber stamp — users who already wanted to accept the suggestion treat 0.61 as "good enough," and users who already wanted to reject it treat 0.92 as "the model is overfit."

There's a structural reason this happens. A probability is a summary statistic. It compresses every reason the model is uncertain into a single number, and the number doesn't tell the user which reason it is. Did the model not see this exact pattern in training? Was the input ambiguous? Was a tool call truncated? Did the retrieved context disagree with itself? All of these collapse into the same 0.73, and 0.73 doesn't tell the user where to look.

The user's actual decision question is not "how confident is the model" but "what should I check before I trust this." A scalar can't answer that question because the answer is a pointer to a specific field, claim, or assumption. The probability is the wrong shape of signal for the decision being made.

Calibration research has spent a decade trying to make these scalars more accurate. The Expected Calibration Error of frontier models in 2025 sits between 0.05 and 0.20 depending on task, meaning that stated confidence and observed accuracy differ by 5 to 20 percentage points on average. Alignment training tends to make this worse, not better; preference collapse during RLHF systematically pushes models toward overconfidence because confident answers get rewarded. Post-hoc calibration techniques like temperature scaling can recover some of this. None of it changes user behavior, because the user wasn't reading the number in the first place.
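For readers who haven't computed it, here is a minimal sketch of Expected Calibration Error using the common equal-width binning estimator. The function name, the 10-bin default, and the toy data are assumptions for illustration; they are not drawn from any specific benchmark.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """Weighted average of |accuracy - mean confidence| over equal-width bins."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    if not confidences:
        raise ValueError("need at least one prediction")

    # Each bin accumulates [sum of confidences, number correct, count].
    bins = [[0.0, 0, 0] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        if not 0.0 <= conf <= 1.0:
            raise ValueError("confidences must lie in [0, 1]")
        # Clamp so that a confidence of exactly 1.0 lands in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx][0] += conf
        bins[idx][1] += int(ok)
        bins[idx][2] += 1

    total = len(confidences)
    ece = 0.0
    for conf_sum, n_correct, count in bins:
        if count == 0:
            continue  # empty bins contribute nothing
        gap = abs(n_correct / count - conf_sum / count)
        ece += (count / total) * gap
    return ece


# Toy data, invented for illustration: four predictions at 0.9, three correct.
ece = expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False])
print(f"ECE = {ece:.2f}")
```

In the toy case the model claims 90% and is right 75% of the time, so the gap comes out near 0.15. Note what the function returns: one number for the whole dataset, with no indication of which predictions, or which parts of them, were the shaky ones.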

What does change user behavior

The HCI literature is more useful here than the calibration literature. A 2024 study at FAccT found that first-person uncertainty expressions like "I'm not sure, but…" significantly decreased user agreement with incorrect AI suggestions and increased user accuracy on the joint task — while general-perspective hedges like "It's not clear, but…" had no statistically significant effect. The grammatical person matters: users respond to a model that owns its uncertainty more than to a model that gestures at uncertainty in the abstract.

A separate 2025 paper on verbalized uncertainty in AI-assisted decision-making found that users prefer "medium" verbalized uncertainty — language that acknowledges limitations without sounding overwhelmed by them — over either confident assertion or constant hedging. Trust and task performance both peaked at this middle setting, and the effect was independent of the model's actual accuracy. Users were responding to the texture of the language, not the underlying probability.

What the probability-badge approach lacks is specificity. "I'm reasonably sure" doesn't help. "I'm reasonably sure, but I haven't checked whether this address has a unit number" helps a lot. The user now has a closed-loop action: look at the unit number, decide. The narrative names a verifiable claim, and the user can verify it in seconds. The probability badge would have shown 0.84 and the user would have accepted or rejected based on prior bias.

This is what an uncertainty narrative does that a probability score can't: it transfers the verification work from "is the model right" (unanswerable in finite time) to "is this specific claim right" (answerable in seconds). The narrative compresses the model's residual uncertainty into the smallest claim that, if checked, resolves it.

Generating uncertainty narratives as a first-class output

The implementation pattern looks different from a calibration layer. You don't post-process logits into a calibrated probability; you ask the model to emit, alongside its answer, a structured field that names which sub-claims it's confident in and which it isn't, and what the user would check to resolve each uncertain sub-claim.

The output schema has three fields per uncertain sub-claim: the claim, the source of uncertainty, and the cheapest verification. "Address: 1234 Elm St, Apt 7B" → uncertainty: "the unit number was extracted from a phone-call transcript with background noise" → verify: "confirm with the customer record." A short string the user can act on, generated alongside the answer rather than computed after it.
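One way to sketch that three-field schema is a small dataclass plus a renderer that turns it into the string the user sees. The names `UncertainClaim` and `render_narrative`, and the exact sentence template, are hypothetical; only the three fields and the address example come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UncertainClaim:
    claim: str         # the sub-claim the model is unsure about
    source: str        # why the model is unsure
    verification: str  # the cheapest check that would resolve it


def render_narrative(item: UncertainClaim) -> str:
    """Turn one uncertain sub-claim into a first-person string for the UI."""
    return (
        f"{item.claim}. I'm not fully sure about this: {item.source}. "
        f"To check: {item.verification}."
    )


example = UncertainClaim(
    claim="Address: 1234 Elm St, Apt 7B",
    source=(
        "the unit number was extracted from a phone-call transcript "
        "with background noise"
    ),
    verification="confirm with the customer record",
)
print(render_narrative(example))
```

The template is written in the first person on purpose, in line with the FAccT finding cited earlier. Keeping the fields separate until render time also means the UI could highlight the claim and turn the verification into a link or button, rather than parsing it back out of a sentence.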

A few engineering points are non-obvious:

  • Force the model to emit the narrative even when confident. If you only ask for an uncertainty narrative when the model thinks it might be wrong, the model will think it's right, and you'll get nothing. Treat the field as required and let the model say "no material uncertainties" when warranted. Empty narratives carry information too.
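A minimal sketch of enforcing that rule on the receiving side, assuming the model replies with JSON. The key names `answer` and `uncertainties`, the sentinel constant, and `parse_response` are all hypothetical choices for illustration, not a standard format.

```python
import json

# Hypothetical sentinel the model must emit when it has nothing to flag.
NO_UNCERTAINTY = "no material uncertainties"
REQUIRED_KEYS = {"claim", "source", "verification"}


def parse_response(raw: str) -> dict:
    """Parse a model reply; reject it if the uncertainty field is absent."""
    data = json.loads(raw)
    if not isinstance(data, dict):
        raise ValueError("response must be a JSON object")
    if "answer" not in data:
        raise ValueError("response is missing 'answer'")
    # The field is required: a missing key is an error, not "confident".
    if "uncertainties" not in data:
        raise ValueError("response is missing 'uncertainties'")

    field = data["uncertainties"]
    if field == NO_UNCERTAINTY:
        # An explicit "nothing to flag" is a valid, informative answer.
        return {"answer": data["answer"], "uncertainties": []}

    if not isinstance(field, list) or not field:
        raise ValueError(
            "'uncertainties' must be a non-empty list or the sentinel string"
        )
    for item in field:
        if not isinstance(item, dict):
            raise ValueError("each uncertainty entry must be an object")
        missing = REQUIRED_KEYS.difference(item)
        if missing:
            raise ValueError(f"uncertainty entry is missing {sorted(missing)}")
    return {"answer": data["answer"], "uncertainties": field}
```

The design choice worth noting is that an omitted field and an explicit "no material uncertainties" are treated differently: the first is a malformed reply that can be retried, the second is a statement the model made and can be held to.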