Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

· 13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits above it: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor, and only above the upper cutoff does the model answer on its own. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain, which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

Why one number lies to you

A single confidence threshold pretends that "the model shouldn't answer" and "a human should answer instead" are the same decision. They are not.

Consider what happens at threshold 0.7. A query at confidence 0.4 gets the same treatment as a query at confidence 0.65: both get the abstain message, both look identical to the user, both contribute to the same drop in answer rate. But these are radically different cases. The 0.4 query is genuinely outside the model's competence — no amount of reviewer time will change that, and the right answer is "this product can't help you, here's how to find someone who can." The 0.65 query is a near-miss where a five-minute human review would produce a correct answer that the user would value. Treating them as the same outcome means the team optimizes for the easier metric (raise the threshold to cut hallucinations) and quietly destroys the harder one (give users actual help in the gray zone).

The trust math is less forgiving than new teams expect. Empirical work on user perception of AI features consistently shows that a flat "I don't know" erodes trust at roughly the same rate as a confidently wrong answer when both happen frequently. The user doesn't have a clean mental model that separates "the model declined" from "the model failed"; they just know they didn't get help. If your single threshold is producing a 25% abstain rate, you are paying the trust cost of being wrong 25% of the time, with none of the upside of occasionally being right.

The calibration that has to land first

Before you can pick two thresholds, you need a confidence signal that means something. Most teams skip this step and pick numbers off a calibration curve that doesn't exist.

Token probability is the most commonly abused signal. It looks like a confidence score, the API returns it for free, and it correlates loosely enough with correctness that nobody questions it for a while. The problem is well-documented: token probability conflates lexical uncertainty (which of several phrasings will I emit) with factual uncertainty (do I know the answer). A model that is 95% sure the next token is "Paris" tells you nothing useful about whether Paris is the right answer. Recent measurements have shown average expressed confidence of 91% on tasks where actual accuracy was 39%, and verbalized confidence (asking the model to output a percentage) is worse than useless because it clusters in the 90–100% range regardless of what the model knows.

What works better is sample-based agreement. Ask the model the same question five times at non-zero temperature: if the answers converge, the model probably knows. If they diverge across plausible-but-different answers, the model is probably hallucinating. This requires a real budget (a five-sample fan-out is five times the cost), but it produces a confidence signal that survives contact with production traffic. Some teams pair this with a smaller verifier model that scores the answer against retrieved evidence, which gives a second axis of confidence that fails differently from sample agreement and can be combined with it.
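A minimal sketch of sample-based agreement, under stated assumptions: `sample_fn` stands in for whatever model call you actually use (it is hypothetical, as is the canned sampler below), and the string normalization is deliberately crude. Real answers usually need semantic matching rather than exact string comparison.

```python
from collections import Counter
from typing import Callable, List, Tuple


def normalize(answer: str) -> str:
    """Crude normalization so trivially different phrasings compare equal."""
    return " ".join(answer.lower().strip().rstrip(".").split())


def agreement_confidence(
    sample_fn: Callable[[str], str], question: str, n_samples: int = 5
) -> Tuple[str, float]:
    """Return (most common answer, fraction of samples that agree with it)."""
    answers = [normalize(sample_fn(question)) for _ in range(n_samples)]
    top_answer, top_count = Counter(answers).most_common(1)[0]
    return top_answer, top_count / n_samples


def fake_sampler_factory(canned: List[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a model sampled at non-zero temperature."""
    it = iter(canned)
    return lambda _question: next(it)


converged = fake_sampler_factory(["Paris", "paris.", "Paris", " Paris", "Paris"])
diverged = fake_sampler_factory(["1912", "1915", "1912", "1921", "1909"])

print(agreement_confidence(converged, "Capital of France?"))   # ('paris', 1.0)
print(agreement_confidence(diverged, "Year of the treaty?"))   # ('1912', 0.4)
```

The agreement fraction is still a raw score, not a probability of correctness; it has to be calibrated against labeled outcomes before any threshold is placed on it.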

Whatever signal you pick, the deliverable is a calibration curve: predicted confidence on the x-axis, observed accuracy on the y-axis, ideally a diagonal line. If the curve is nowhere near the diagonal, no choice of threshold will save you. The threshold is downstream of calibration, and calibration is downstream of evaluation discipline most teams don't have when they ship the first version.
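One way to produce that curve from logged data is to bin predictions by confidence and compare each bin's mean confidence with its observed accuracy. The sketch below assumes you already have a labeled evaluation set; the function names and the ten-bin default are illustrative choices, and expected calibration error is added as one common summary of the gap from the diagonal.

```python
from typing import List, Tuple

Curve = List[Tuple[float, float, int]]


def calibration_curve(
    confidences: List[float], correct: List[bool], n_bins: int = 10
) -> Curve:
    """Per non-empty bin: (mean predicted confidence, observed accuracy, count)."""
    bins: List[List[Tuple[float, bool]]] = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 lands in the last bin
        bins[idx].append((conf, ok))
    curve: Curve = []
    for bucket in bins:
        if bucket:
            mean_conf = sum(c for c, _ in bucket) / len(bucket)
            accuracy = sum(1 for _, ok in bucket if ok) / len(bucket)
            curve.append((mean_conf, accuracy, len(bucket)))
    return curve


def expected_calibration_error(curve: Curve) -> float:
    """Count-weighted average gap between confidence and accuracy."""
    total = sum(n for _, _, n in curve)
    return sum(n / total * abs(conf - acc) for conf, acc, n in curve)


# Toy data: the high-confidence bin is right only half the time.
curve = calibration_curve([0.95, 0.95, 0.25, 0.25], [True, False, False, False])
for mean_conf, accuracy, count in curve:
    print(f"confidence {mean_conf:.2f} -> accuracy {accuracy:.2f} (n={count})")
print(f"ECE = {expected_calibration_error(curve):.2f}")
```

A well-calibrated signal puts every row close to `confidence == accuracy`; bins with very few cases are noisy and should not drive threshold choices.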

Two thresholds, three regions, three product moments

Once the signal is calibrated, the architecture becomes straightforward. You pick two cutoffs that partition the confidence range into three regions, each with a distinct product behavior.

High confidence (above the upper threshold): the model answers directly. The latency budget is tight, the UX is conversational, and the eval target is "is the answer correct given that we chose to give one." The metric here is accuracy on answered queries, and it's the metric most teams already track.

Medium confidence (between the thresholds): the case is escalated to a human reviewer. The latency budget shifts from seconds to minutes or hours, the UX becomes "we're looking into this, here's a tracking link," and the eval target is "did the reviewer produce a better answer than the model would have." The metric here is escalation yield — what fraction of escalated cases produced a meaningfully better outcome than the model's draft.

Low confidence (below the lower threshold): the model declines and the UX redirects the user. This is not "I don't know" with no follow-up; it's "I can't help with this, but here's the relevant doc / contact / form / specialist." The eval target is "did the redirect get the user to a useful next step," and the metric is recovery rate — what fraction of declined cases produced a successful resolution outside the AI surface.
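The three regions above can be sketched as a small routing function. The names (`Route`, `Thresholds`, `route`) and the example cutoffs of 0.5 and 0.7 are assumptions for illustration, and the sketch presumes the confidence value is already calibrated.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    ANSWER = "answer"      # model responds directly
    ESCALATE = "escalate"  # hand off to a human reviewer
    ABSTAIN = "abstain"    # decline and redirect the user


@dataclass(frozen=True)
class Thresholds:
    abstain_below: float        # lower cutoff
    answer_at_or_above: float   # upper cutoff

    def __post_init__(self) -> None:
        if not 0.0 <= self.abstain_below <= self.answer_at_or_above <= 1.0:
            raise ValueError("need 0 <= abstain_below <= answer_at_or_above <= 1")


def route(confidence: float, t: Thresholds) -> Route:
    """Map a calibrated confidence score to one of three product behaviors."""
    if confidence >= t.answer_at_or_above:
        return Route.ANSWER
    if confidence >= t.abstain_below:
        return Route.ESCALATE
    return Route.ABSTAIN


t = Thresholds(abstain_below=0.5, answer_at_or_above=0.7)  # illustrative values
for score in (0.4, 0.65, 0.9):
    print(score, route(score, t).value)
```

With these cutoffs, the 0.4 and 0.65 queries that a single 0.7 threshold treats identically now take different paths: the first is redirected, the second goes to a reviewer.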

Three regions, three completely different product designs. Three different SLAs. Three different teams own them, in any company larger than a startup. The single-threshold version pretends that the bottom two regions are the same product moment, and the team that ships it then spends the next two quarters explaining why "abstain" rates and "escalate" rates can't be added together to produce a meaningful metric.
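To make the "can't be added together" point concrete, here is a hedged sketch that reports one metric per region instead of a blended rate. The `LoggedCase` record and its `success` field are hypothetical: what counts as success is defined differently in each region, which is exactly why the numbers stay separate.

```python
from typing import Dict, List, NamedTuple


class LoggedCase(NamedTuple):
    region: str    # "answer", "escalate", or "abstain"
    success: bool  # answer correct / reviewer beat the draft / redirect resolved it


METRIC_NAMES = {
    "answer": "accuracy_on_answered",
    "escalate": "escalation_yield",
    "abstain": "recovery_rate",
}


def region_metrics(cases: List[LoggedCase]) -> Dict[str, float]:
    """Compute each region's own success rate; regions with no cases are omitted."""
    metrics: Dict[str, float] = {}
    for region, metric in METRIC_NAMES.items():
        outcomes = [c.success for c in cases if c.region == region]
        if outcomes:
            metrics[metric] = sum(outcomes) / len(outcomes)
    return metrics


log = [
    LoggedCase("answer", True), LoggedCase("answer", True),
    LoggedCase("answer", False), LoggedCase("answer", True),
    LoggedCase("escalate", True), LoggedCase("escalate", False),
    LoggedCase("abstain", False), LoggedCase("abstain", True),
]
print(region_metrics(log))
```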

The reviewer queue is a hard constraint, not a footnote

The hidden gotcha in the two-threshold design is that the escalation threshold is bounded by something most teams don't model: the throughput of the human review queue. If your model emits 100,000 cases per day at a 15% escalation rate, you are committing 15,000 cases per day of reviewer time. At an optimistic three-minute average review time, that's 750 reviewer-hours per day, which is about 94 full-time reviewers on eight-hour shifts with no slack. Most teams discover this number after the threshold is already wired up and the queue overflows by lunchtime on day one.
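The staffing arithmetic above is simple enough to keep as a function you rerun whenever the threshold or the traffic changes. The eight-hour shift is an assumption (the text only says "standard shift"), and the sketch ignores breaks, utilization, and traffic spikes.

```python
from typing import Tuple


def reviewer_load(
    cases_per_day: float,
    escalation_rate: float,
    minutes_per_review: float,
    shift_hours: float = 8.0,  # assumed shift length
) -> Tuple[float, float, float]:
    """Return (escalated cases/day, reviewer-hours/day, full shifts needed)."""
    escalated = cases_per_day * escalation_rate
    reviewer_hours = escalated * minutes_per_review / 60
    return escalated, reviewer_hours, reviewer_hours / shift_hours


escalated, hours, shifts = reviewer_load(100_000, 0.15, 3)
print(f"{escalated:.0f} cases, {hours:.0f} reviewer-hours, {shifts:.2f} shifts")
# 15000 cases, 750 reviewer-hours, 93.75 shifts
```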

The right move is to start from the queue. Reviewer capacity is a fixed quantity — measured in reviewer-hours per day, with seasonal variation and turnover that make it less elastic than engineering capacity. The escalation threshold has to be tuned so that the volume above the abstain line and below the answer line fits in the queue, with headroom for traffic spikes. This is a queueing-theory problem, not a model-tuning problem. The team that doesn't model it will see the queue overflow, the support team will silently start dropping cases or batching them past the SLA, and the trust metric will tank because escalation in name only is worse than no escalation at all.
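Starting from the queue can be sketched as follows: convert reviewer capacity into a maximum escalation fraction, then place the abstain threshold so the band below the answer threshold fits inside it. This is a deterministic capacity check, not a full queueing model; it ignores arrival variance and assumes the sampled scores are representative of production traffic. All names and numbers are illustrative.

```python
from typing import List


def max_escalation_fraction(
    reviewer_hours_per_day: float,
    minutes_per_review: float,
    cases_per_day: float,
    headroom: float = 0.2,  # assumed share of capacity held back for spikes
) -> float:
    """Largest share of daily traffic the review queue can absorb."""
    reviewable_cases = reviewer_hours_per_day * 60 / minutes_per_review
    return reviewable_cases * (1 - headroom) / cases_per_day


def fit_abstain_threshold(
    scores: List[float], answer_threshold: float, max_fraction: float
) -> float:
    """Lowest abstain threshold whose escalation band fits within max_fraction.

    The band is [abstain_threshold, answer_threshold). Returning
    answer_threshold means there is no capacity for any escalation.
    """
    below = sorted((s for s in scores if s < answer_threshold), reverse=True)
    # Small epsilon guards against float error just under a whole number.
    k = min(int(max_fraction * len(scores) + 1e-9), len(below))
    # Back off past ties so the band never holds more than the allowed count.
    while 0 < k < len(below) and below[k] == below[k - 1]:
        k -= 1
    return below[k - 1] if k > 0 else answer_threshold


sample_scores = [0.95, 0.9, 0.85, 0.65, 0.6, 0.55, 0.5, 0.4, 0.3, 0.2]
budget = max_escalation_fraction(1250, 3, 100_000)
print(f"budget={budget:.2f}")
print("abstain below:", fit_abstain_threshold(sample_scores, 0.7, budget))
```

If the fitted abstain threshold ends up close to the answer threshold, the queue is telling you the gray zone is too wide for the staff you have, which is a calibration or scoping problem rather than a tuning one.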

The escalation rates that actually sustain in production tend to land in the 8–15% range for most consumer-facing features, with higher rates only viable in domains where the per-case value justifies dedicated reviewer staffing (legal, medical, financial advice). If your initial threshold pencil-out is producing 30% escalation, the threshold isn't the problem — your model isn't ready for the deployment shape you're targeting. Either invest in better calibration to compress the gray zone, narrow the feature's scope to higher-confidence intents, or accept a lower coverage target and tune the abstain threshold higher.
