
5 posts tagged with "uncertainty"


The Confidence Score Your Users Learned to Ignore

· 11 min read
Tian Pan
Software Engineer

You wanted to be honest. You put a little "92%" next to every answer your agent gave. After the third time the agent was confidently wrong at 92%, your users stopped reading the number. They did not get angry about it. They just learned, the way humans always learn around a misbehaving signal, that the gauge on the dashboard is not connected to the engine. The number is still there. It costs you tokens to produce it. It informs no decision anyone makes.

This is the failure mode that calibration UX research keeps rediscovering: surfacing a probability is a trust commitment, and the commitment goes one direction. The moment the number turns out to be uncorrelated with correctness in the user's lived experience, the score is dead — and the trust you spent putting it there is dead with it. You cannot un-ring that bell by fixing the number later. The number is now decoration.

The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong

· 10 min read
Tian Pan
Software Engineer

Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.

That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.

This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.
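One way to check whether a confidence field is wired to anything is to compare stated confidence against observed accuracy on a labeled sample. The sketch below computes expected calibration error (ECE), a standard measure that the excerpt itself does not name. The function name, the bin count, and the input format are assumptions made for illustration.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 5,
) -> float:
    """Weighted average gap between stated confidence and observed accuracy.

    Hypothetical helper: `confidences` are model-reported values in [0, 1],
    `correct` are ground-truth labels from a hand-graded sample.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Clamp so that a confidence of exactly 1.0 lands in the top bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))

    total = len(confidences)
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_conf = sum(c for c, _ in bucket) / len(bucket)
        accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
        ece += (len(bucket) / total) * abs(avg_conf - accuracy)
    return ece
```

A value near zero means "0.9" behaves roughly like a 90% reliability estimate on that sample. A large value means routing on the number is routing on noise.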

Epistemic Trust in Agent Chains: How Uncertainty Compounds Through Multi-Step Delegation

· 10 min read
Tian Pan
Software Engineer

Most teams building multi-agent systems spend a lot of time thinking about authorization trust: what is Agent B allowed to do, which tools can it call, what data can it access. That's an important problem. But there's a second trust problem that doesn't get nearly enough attention, and it's the one that actually kills production systems.

The problem is epistemic: when Agent A delegates a task to Agent B and gets back an answer, how much should A believe what B returned?

This isn't a question of whether B was authorized to answer. It's a question of whether B actually could.
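A rough way to see how uncertainty compounds along a delegation chain is to multiply per-step reliabilities. The sketch below assumes each step succeeds independently, which is a simplifying assumption introduced here and not something the excerpt states; real agent errors are often correlated. The function names are hypothetical.

```python
from math import prod
from typing import Sequence


def chain_reliability(step_reliabilities: Sequence[float]) -> float:
    """Probability that every step in the chain is correct.

    Assumes independent steps (an illustrative simplification).
    """
    return prod(step_reliabilities)


def steps_until_below(per_step: float, floor: float) -> int:
    """Smallest chain length whose end-to-end reliability drops below `floor`."""
    if not 0.0 < per_step < 1.0:
        raise ValueError("per_step must be strictly between 0 and 1")
    if not 0.0 < floor <= 1.0:
        raise ValueError("floor must be in (0, 1]")
    steps, reliability = 0, 1.0
    while reliability >= floor:
        reliability *= per_step
        steps += 1
    return steps
```

Under that assumption, five delegated steps at 0.9 each leave end-to-end reliability at about 0.59, and a chain of seven such steps is more likely wrong than right.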

The Confident Hallucinator: Runtime Patterns for Knowledge Boundary Signaling in LLMs

· 10 min read
Tian Pan
Software Engineer

GPT-4 achieves an AUROC of roughly 0.62 when its own confidence scores are used to separate correct answers from incorrect ones. That's barely above the 0.5 you would get from a score that carries no information, the equivalent of flipping a coin. The model sounds certain and polished in both cases. If you're building a production system that assumes high-confidence responses are reliable, you're working with a signal that's only slightly better than random.

This is the knowledge boundary signaling problem, and it sits at the center of most real-world LLM quality failures. The model doesn't reliably signal what it doesn't know. More precisely, it may carry that information internally, but it can't be trusted to express it. The engineering challenge isn't getting models to refuse more; it's designing systems that make uncertainty actionable without making your product feel broken.
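For readers who want to measure this on their own traffic, AUROC can be computed directly from a graded sample: it is the probability that a randomly chosen correct answer received a higher confidence than a randomly chosen incorrect one, with ties counted as half. The sketch below is a minimal pairwise implementation; the function name and input format are assumptions, and it does not reproduce how the figure quoted above was obtained.

```python
from typing import Sequence


def auroc(confidences: Sequence[float], correct: Sequence[bool]) -> float:
    """Pairwise AUROC of confidence as a predictor of correctness.

    O(n^2) in the sample size, which is fine for a hand-graded sample.
    """
    pos = [c for c, ok in zip(confidences, correct) if ok]
    neg = [c for c, ok in zip(confidences, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(pos) * len(neg))
```

A result of 0.5 means the confidence score ranks correct and incorrect answers no better than chance; 1.0 means every correct answer outranks every incorrect one.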

Confidence Strings, Not Scores: Why Your 0.87 Badge Moves Nobody

· 10 min read
Tian Pan
Software Engineer

The product team ships a confidence badge next to every AI suggestion. Green for ≥85%, yellow for 60–84%, red below. They run an A/B test six weeks later and find no change in user behavior at any threshold. False positives at 0.92 confidence get accepted at the same rate as false positives at 0.61 confidence. The team's instinct is to tune the calibration — fit a temperature scaling layer, regenerate the badges, run the A/B again. The numbers shift; the behavior doesn't.

The problem isn't that the model is miscalibrated, though it almost certainly is. The problem is that calibrated probability is the wrong output. The signal a user can act on isn't "how sure" the model is. It's "what specifically the model didn't check." A 0.87 badge tells the user nothing they can verify. "I'm reasonably confident in the address but I haven't checked the unit number" tells them exactly where to look.
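As a sketch of what a "confidence string" could look like in an output schema, the example below replaces a single numeric field with two lists: what was verified and what was not. The class name, field names, and rendering format are all hypothetical and chosen only to illustrate the idea.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class VerificationNote:
    """Hypothetical replacement for a numeric confidence field."""

    checked: List[str] = field(default_factory=list)
    unchecked: List[str] = field(default_factory=list)

    def render(self) -> str:
        """Build the user-facing string that tells the reader where to look."""
        parts = []
        if self.checked:
            parts.append("Checked: " + ", ".join(self.checked) + ".")
        if self.unchecked:
            parts.append("Not checked: " + ", ".join(self.unchecked) + ".")
        return " ".join(parts) if parts else "Nothing was verified."


note = VerificationNote(checked=["address"], unchecked=["unit number"])
print(note.render())
```

The design choice is that every item in `unchecked` is something the user can go and verify, which a 0.87 badge never offers.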