The Confidence-Score Tax: Why Asking the Model How Sure It Is Costs More Than Being Wrong
Somewhere in the evolution of every AI feature, a reviewer asks a reasonable-sounding question: "Can we have the model tell us how confident it is, so we can route the low-confidence answers to a human or a fallback?" It sounds like free insurance. You add a confidence field to the output schema, the model dutifully fills it in, and now you have a dial to turn. Ship it.
That dial is not free, and worse, it is usually not wired to anything. The confidence number is a token sequence the model is happy to produce and under no obligation to mean. Teams pay real tokens and real latency to acquire it, never check whether it correlates with correctness, and then route production traffic on it as if "0.9" were a 90% reliability estimate. It is a gauge bolted to the dashboard with nothing behind the glass.
This post is about the two costs nobody priced: the per-request tax of generating the confidence field at all, and the much larger cost of trusting an uncalibrated number to make routing decisions.
The Field That Looks Free
Start with the part that is easy to ignore because it is small per request and invisible in aggregate.
Adding confidence to your output schema is not a metadata annotation. It is a generation task. The model produces those tokens the same way it produces the answer, and every component of that has a cost:
- Output tokens on every request. The confidence field is generated whether or not the answer was actually uncertain. A feature serving ten million requests a month pays for ten million confidence fields. If that feature is right more than 90% of the time, that includes nine-plus million requests where the model was going to be right anyway and you never looked at the number.
- Prompt tokens for the instruction. Telling the model to self-assess takes instruction text, and good confidence prompting is not one sentence. Research on verbalized confidence shows that the reliability of the number depends heavily on how you ask — the better-calibrated prompt methods involve probability-framed wording, detailed descriptions, and few-shot examples. Those examples are prompt tokens on every call.
- A reasoning preamble you did not ask for. Models often justify a confidence score by reasoning about it first. That justification rides on the critical path of the response. If your feature streams, the user waits for it.
- Latency on the confident majority. The tax is paid most heavily exactly where it helps least: on the large fraction of requests where the model was correct and confident, you spend tokens and milliseconds producing a number that triggers no action.
None of this shows up as a line item. It shows up as a slightly higher token bill, a slightly worse p95, and a schema that grew one field. The honest way to account for it is per-request cost multiplied by total volume: a confidence field is a tax levied on the confident majority to fund routing on the uncertain tail. Whether that trade is worth it is a real question, but you cannot answer it if you have priced the field at zero.
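The accounting above can be sketched in a few lines. This is a minimal illustration, not a billing tool: the class ConfidenceTax, its field names, and every number in the example are hypothetical placeholders, and the prices are made up rather than taken from any provider.

```python
from dataclasses import dataclass


@dataclass
class ConfidenceTax:
    """Cost model for one confidence field. All inputs are placeholders."""

    requests_per_month: int
    extra_prompt_tokens: int         # self-assessment instructions and few-shot examples
    extra_output_tokens: int         # the field plus any justification generated before it
    usd_per_million_input: float     # hypothetical price, not a real provider rate
    usd_per_million_output: float    # hypothetical price, not a real provider rate
    seconds_per_output_token: float  # assumed decode time per generated token

    def monthly_cost_usd(self) -> float:
        """Per-request token cost multiplied by total volume."""
        input_cost = (
            self.requests_per_month * self.extra_prompt_tokens
            * self.usd_per_million_input / 1e6
        )
        output_cost = (
            self.requests_per_month * self.extra_output_tokens
            * self.usd_per_million_output / 1e6
        )
        return input_cost + output_cost

    def added_latency_seconds(self) -> float:
        """Extra decode time per request (ignores prompt prefill time)."""
        return self.extra_output_tokens * self.seconds_per_output_token

    def cost_per_acted_on_request(self, acted_fraction: float) -> float:
        """Monthly cost spread over the requests where the number triggered an action."""
        if not 0 < acted_fraction <= 1:
            raise ValueError("acted_fraction must be in (0, 1]")
        return self.monthly_cost_usd() / (self.requests_per_month * acted_fraction)


# Illustrative inputs only: replace every value with your own measurements.
example = ConfidenceTax(
    requests_per_month=10_000_000,
    extra_prompt_tokens=150,
    extra_output_tokens=40,
    usd_per_million_input=1.0,
    usd_per_million_output=4.0,
    seconds_per_output_token=0.02,
)
print(f"monthly cost: ${example.monthly_cost_usd():,.2f}")
print(f"added latency per request: {example.added_latency_seconds():.2f}s")
print(f"cost per routed request: ${example.cost_per_acted_on_request(0.05):.4f}")
```

The last figure is the one worth arguing about: it divides the whole bill by the small fraction of requests where the confidence value actually changed what happened.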
The Schema Change Is Not Neutral
There is a subtler cost that even careful teams miss. Adding a field to a structured output does not just append data — it can change the answer itself.
LLMs generate left to right. The structure and ordering of an output schema are, functionally, part of the prompt the model is conditioning on as it writes. This is not a theoretical concern. Practitioners have documented cases where renaming a single output field, from final_choice to answer, moved task accuracy from single digits to the mid-nineties. Field order matters too: putting a reasoning field before the answer field, versus after, has been shown to swing accuracy by several points, because the model either thinks before it commits or rationalizes after.
Now drop a confidence field into that schema. Where does it go? If it lands before the answer, you have asked the model to commit to a confidence level before it has generated the thing it is supposedly confident about, which is incoherent and perturbs the answer that follows. If it lands after, the model has already produced the answer and is now narrating a number about text it cannot revise. Either way, the schema you shipped to get a confidence signal can change the distribution of the primary output you actually care about.
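To make the placement question concrete, here is a minimal sketch that builds the same JSON Schema with the confidence field absent, first, or last. The field names reasoning, answer, and confidence, the helper build_schema, and the VARIANTS table are all hypothetical. It also assumes your structured-output stack emits fields in the order the schema declares them, which is worth verifying rather than trusting.

```python
import json

# Hypothetical field definitions for a structured-output schema.
FIELD_TYPES = {
    "reasoning": {"type": "string"},
    "answer": {"type": "string"},
    "confidence": {"type": "number", "minimum": 0, "maximum": 1},
}


def build_schema(field_order):
    """Return a JSON Schema whose properties appear in the given order."""
    return {
        "type": "object",
        # Python dicts preserve insertion order, so this fixes the declared order.
        "properties": {name: FIELD_TYPES[name] for name in field_order},
        "required": list(field_order),
        "additionalProperties": False,
    }


# Three variants of the same task: only the confidence field differs.
VARIANTS = {
    "baseline": ("reasoning", "answer"),
    "confidence_first": ("confidence", "reasoning", "answer"),
    "confidence_last": ("reasoning", "answer", "confidence"),
}

for name, order in VARIANTS.items():
    print(name, json.dumps(build_schema(order)["properties"]))
```

Running the same evaluation set through each variant is what turns the placement argument into a measurement.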
The takeaway is uncomfortable: the act of measuring confidence is not a passive read. It is an intervention. You should A/B the answer quality of the schema-with-confidence against the schema-without before you assume the field is observational.
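One way to run that A/B is a two-proportion z-test on answer accuracy. The sketch below is a simplified illustration: the function name and the example counts are hypothetical, and it assumes the two schemas were evaluated on independent samples. If both schemas see the same inputs, a paired test such as McNemar's would be the better fit.

```python
import math


def two_proportion_z_test(correct_a, total_a, correct_b, total_b):
    """Compare accuracy of schema A (no confidence) and schema B (with confidence).

    Returns (accuracy_b - accuracy_a, z statistic, two-sided p-value).
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both arms need at least one graded answer")
    p_a = correct_a / total_a
    p_b = correct_b / total_b
    # Pooled accuracy under the null hypothesis that the schemas perform the same.
    pooled = (correct_a + correct_b) / (total_a + total_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    if se == 0:
        # Every answer right (or wrong) in both arms: no evidence of a difference.
        return p_b - p_a, 0.0, 1.0
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal, via the complementary error function.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return p_b - p_a, z, p_value


# Made-up counts for illustration: 1,000 graded answers per schema.
diff, z, p = two_proportion_z_test(correct_a=900, total_a=1000,
                                   correct_b=870, total_b=1000)
print(f"accuracy change: {diff:+.3f}, z = {z:.2f}, p = {p:.4f}")
```

A small p-value here says the confidence field is not observational for your task: adding it moved the answers.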
The Number You Bought Is Probably Not Calibrated
Here is the cost that dwarfs the others. Suppose you have paid the tax, the field is populated, and you are ready to route. The whole premise of routing is that "0.9" means something close to "right 90% of the time." It almost never does.
A model's stated confidence is a token sequence shaped by the training distribution's correlation between phrasing and correctness. It is not a probability estimate. It is what a confident-sounding answer looks like in the training data. The empirical record on this is blunt:
- Studies of verbalized confidence find that even strong models discriminate correct from incorrect answers barely above chance: area-under-ROC (AUROC) figures in the low 0.60s, where 0.5 is a coin flip and 1.0 is perfect separation.
- Models systematically cluster their self-reported confidence at the top of the scale. They report 90–100% confidence constantly, including on answers that are wrong, and the actual accuracy in those high-confidence bins sits well below the stated confidence, that is, below the diagonal of a reliability diagram.
- Calibration error varies wildly by model. Recent evaluations have measured one frontier model with an expected calibration error above 0.7 while answering correctly only a quarter of the time — a confidence signal that is essentially noise — and others, on the same benchmark, three to four times better calibrated. You cannot assume; you have to measure the specific model and the specific task.
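Measuring the specific model on the specific task comes down to two numbers computed on a labeled evaluation set: expected calibration error (does "0.9" mean right about 90% of the time?) and AUROC (does higher confidence separate right answers from wrong ones?). The sketch below is a minimal pure-Python version under stated assumptions: confidences are already scaled to the 0 to 1 range, correctness labels are 0 or 1, ECE uses ten equal-width bins, and the function names are illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between stated confidence and accuracy, per bin."""
    if len(confidences) != len(correct) or not confidences:
        raise ValueError("need equal-length, non-empty inputs")
    if any(c < 0 or c > 1 for c in confidences):
        raise ValueError("confidences must be in [0, 1]")
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the top bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(avg_conf - accuracy)
    return ece


def auroc(confidences, correct):
    """Probability that a random correct answer gets higher confidence than a
    random incorrect one, counting ties as half. 0.5 is a coin flip."""
    pos = [c for c, y in zip(confidences, correct) if y]
    neg = [c for c, y in zip(confidences, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect answer")
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Made-up evaluation data: the model says 0.9 every time and is right half the time.
conf = [0.9, 0.9, 0.9, 0.9]
right = [1, 1, 0, 0]
print(f"ECE = {expected_calibration_error(conf, right):.2f}")
print(f"AUROC = {auroc(conf, right):.2f}")
```

The pairwise AUROC loop is quadratic in the evaluation set size, which is fine for a few thousand graded answers; for larger sets a rank-based implementation from a statistics library would be the practical choice.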
- https://arxiv.org/abs/2412.14737
- https://arxiv.org/html/2412.14737v2
- https://arxiv.org/html/2502.11028v1
- https://arxiv.org/html/2510.26995
- https://arxiv.org/html/2410.13284v3
- https://python.useinstructor.com/blog/2024/09/26/bad-schemas-could-break-your-llm-structured-outputs/
- https://iclr-blogposts.github.io/2025/blog/calibration/
- https://1up.ai/blog/why-llms-suck-at-confidence-scoring/
- https://cleanlab.ai/blog/trustworthy-language-model/
- https://www.nature.com/articles/s42256-026-01217-9
