
Abstain or Escalate: The Two-Threshold Problem in Confidence-Gated AI

13 min read
Tian Pan
Software Engineer

Most production AI features ship with a single confidence threshold. Above the line, the model answers. Below it, the user gets a flat "I'm not sure." That single number is doing two completely different jobs at once, and it's why your trust metric has been sliding for two quarters even though your accuracy on answered queries looks fine.

The right design has at least two cutoffs. An abstain threshold sits low: below it, the model declines because no answer is worth more than silence. An escalate threshold sits in the middle: between the two cutoffs, the system hands the case to a human reviewer instead of dropping it on the floor. Collapse them into a single dial and you ship a product that feels equally useless when it's wrong and when it's uncertain — which is the worst possible position to occupy in a market where users have a free alternative one tab away.

This isn't a new idea. The reject-option classifier literature has been arguing for split thresholds since the 1970s, distinguishing ambiguity rejects (the input is between known classes) from distance rejects (the input is far from any training data). Production AI teams keep rediscovering the same lesson the hard way, usually about six months after their first launch, when the support queue is full of people typing "is this thing broken or what."

Why one number lies to you

A single confidence threshold pretends that "the model shouldn't answer" and "a human should answer instead" are the same decision. They are not.

Consider what happens at threshold 0.7. A query at confidence 0.4 gets the same treatment as a query at confidence 0.65: both get the abstain message, both look identical to the user, both contribute to the same drop in answer rate. But these are radically different cases. The 0.4 query is genuinely outside the model's competence — no amount of reviewer time will change that, and the right answer is "this product can't help you, here's how to find someone who can." The 0.65 query is a near-miss where a five-minute human review would produce a correct answer that the user would value. Treating them as the same outcome means the team optimizes for the easier metric (raise the threshold to cut hallucinations) and quietly destroys the harder one (give users actual help in the gray zone).

The trust math is asymmetric in a way that surprises new teams. Empirical work on user perception of AI features consistently shows that a flat "I don't know" erodes trust at roughly the same rate as a confidently wrong answer when both happen frequently. The user doesn't have a clean mental model that separates "the model declined" from "the model failed" — they just know they didn't get help. If your single threshold is producing a 25% abstain rate, you are paying the trust cost of being wrong 25% of the time, with none of the upside of occasionally being right.

The calibration that has to land first

Before you can pick two thresholds, you need a confidence signal that means something. Most teams skip this step and pick numbers off a calibration curve that doesn't exist.

Token probability is the most commonly abused signal. It looks like a confidence score, the API returns it for free, and it correlates loosely enough with correctness that nobody questions it for a while. The problem is well-documented: token probability conflates lexical uncertainty (which of several phrasings will I emit) with factual uncertainty (do I know the answer). A model that is 95% sure the next token is "Paris" tells you nothing useful about whether Paris is the right answer. Recent measurements have shown average expressed confidence of 91% on tasks where actual accuracy was 39%, and verbalized confidence (asking the model to output a percentage) is worse than useless because it clusters in the 90–100% range regardless of what the model knows.

What works better is sample-based agreement. Ask the model the same question five times at non-zero temperature: if the answers converge, the model probably knows. If they diverge across plausible-but-different answers, the model is hallucinating. This requires a real budget — a five-sample fanout is five times the cost — but it produces a confidence signal that survives contact with production traffic. Some teams pair this with a smaller verifier model that scores the answer against retrieved evidence, which gives a second axis of confidence that fails differently from sample agreement and can be combined with it.
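The signal itself is simple to sketch. Here is a minimal version, assuming a `generate` callable that wraps whatever model API you use — the function name is a placeholder, and the exact-match normalization stands in for the semantic matching a production system would need:

```python
from collections import Counter

def agreement_confidence(generate, question: str, n_samples: int = 5) -> float:
    # Ask the same question n times at non-zero temperature; the fraction of
    # samples that agree with the modal answer is the confidence signal.
    # Lowercased exact match is a crude stand-in for semantic equivalence.
    answers = [generate(question).strip().lower() for _ in range(n_samples)]
    _modal_answer, count = Counter(answers).most_common(1)[0]
    return count / n_samples  # 1.0 = full convergence, 1/n = total disagreement
```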

Whatever signal you pick, the deliverable is a calibration curve: predicted confidence on the x-axis, observed accuracy on the y-axis, ideally a diagonal line. If the curve is nowhere near the diagonal, no choice of threshold will save you. The threshold is downstream of calibration, and calibration is downstream of evaluation discipline most teams don't have when they ship the first version.
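Building the curve is a few lines once you have logged (confidence, correct) pairs from an eval set. A rough NumPy sketch — the bin count is illustrative, and common ML libraries ship equivalent reliability-curve helpers:

```python
import numpy as np

def calibration_curve(confidences: np.ndarray, correct: np.ndarray, n_bins: int = 10):
    # Bin predicted confidence and compare against observed accuracy.
    # A well-calibrated signal tracks the diagonal: mean confidence in a
    # bin roughly equals the fraction of correct answers in that bin.
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    which = np.digitize(confidences, bins[1:-1])  # bin index per prediction
    rows = []
    for b in range(n_bins):
        mask = which == b
        if mask.sum() == 0:
            continue
        rows.append((confidences[mask].mean(), correct[mask].mean(), int(mask.sum())))
    return rows  # (predicted confidence, observed accuracy, count) per bin
```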

Two thresholds, three regions, three product moments

Once the signal is calibrated, the architecture becomes straightforward. You pick two cutoffs that partition the confidence range into three regions, each with a distinct product behavior.

High confidence (above the upper threshold): the model answers directly. The latency budget is tight, the UX is conversational, and the eval target is "is the answer correct given that we chose to give one." The metric here is accuracy on answered queries, and it's the metric most teams already track.

Medium confidence (between the thresholds): the case is escalated to a human reviewer. The latency budget shifts from seconds to minutes or hours, the UX becomes "we're looking into this, here's a tracking link," and the eval target is "did the reviewer produce a better answer than the model would have." The metric here is escalation yield — what fraction of escalated cases produced a meaningfully better outcome than the model's draft.

Low confidence (below the lower threshold): the model declines and the UX redirects the user. This is not "I don't know" with no follow-up; it's "I can't help with this, but here's the relevant doc / contact / form / specialist." The eval target is "did the redirect get the user to a useful next step," and the metric is recovery rate — what fraction of declined cases produced a successful resolution outside the AI surface.
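In code, the partition is a small routing function. The threshold values below are illustrative placeholders; the real numbers come out of the calibration curve and the queue model discussed next:

```python
from enum import Enum

class Route(Enum):
    ANSWER = "answer"      # high confidence: the model responds directly
    ESCALATE = "escalate"  # medium confidence: hand the case to a human reviewer
    DECLINE = "decline"    # low confidence: decline and redirect the user

def route(confidence: float,
          abstain_threshold: float = 0.45,
          answer_threshold: float = 0.80) -> Route:
    # Two cutoffs partition the confidence range into three regions.
    if confidence >= answer_threshold:
        return Route.ANSWER
    if confidence >= abstain_threshold:
        return Route.ESCALATE
    return Route.DECLINE
```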

Three regions, three completely different product designs. Three different SLAs. In any company larger than a startup, three different teams own them. The single-threshold version pretends that the bottom two regions are the same product moment, and the team that ships it then spends the next two quarters explaining why "abstain" rates and "escalate" rates can't be added together to produce a meaningful metric.

The reviewer queue is a hard constraint, not a footnote

The hidden gotcha in the two-threshold design is that the escalation threshold is bounded by something most teams don't model: the throughput of the human review queue. If your feature sees 100,000 queries per day at a 15% escalation rate, you are committing 15,000 cases per day to the review queue. At an optimistic three-minute average review time, that's 750 reviewer-hours per day — roughly 94 full-time reviewers working an eight-hour shift. Most teams discover this number after the threshold is already wired up and the queue overflows by lunchtime on day one.

The right move is to start from the queue. Reviewer capacity is a fixed quantity — measured in reviewer-hours per day, with seasonal variation and turnover that make it less elastic than engineering capacity. The escalation threshold has to be tuned so that the volume above the abstain line and below the answer line fits in the queue, with headroom for traffic spikes. This is a queueing-theory problem, not a model-tuning problem. The team that doesn't model it will see the queue overflow, the support team will silently start dropping cases or batching them past the SLA, and the trust metric will tank because escalation in name only is worse than no escalation at all.
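The back-calculation is short enough to write down. A minimal sketch with the numbers from the example above plugged in — every constant here is an assumption that should come from the ops team, not from this snippet:

```python
def max_sustainable_escalation_rate(daily_queries: int,
                                    reviewers: int,
                                    review_minutes_per_case: float = 3.0,
                                    shift_hours: float = 8.0,
                                    headroom: float = 0.8) -> float:
    # Back-calculate the escalation rate the reviewer queue can absorb.
    # `headroom` reserves capacity for traffic spikes and reviewer churn.
    reviewer_minutes = reviewers * shift_hours * 60 * headroom
    max_cases = reviewer_minutes / review_minutes_per_case
    return min(1.0, max_cases / daily_queries)

# 100,000 queries/day with 94 reviewers at 3 min/case supports roughly a
# 12% escalation rate once spike headroom is reserved.
print(max_sustainable_escalation_rate(100_000, 94))  # ~0.12
```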

The escalation rates that actually sustain in production tend to land in the 8–15% range for most consumer-facing features, with higher rates only viable in domains where the per-case value justifies dedicated reviewer staffing (legal, medical, financial advice). If your initial threshold pencil-out is producing 30% escalation, the threshold isn't the problem — your model isn't ready for the deployment shape you're targeting. Either invest in better calibration to compress the gray zone, narrow the feature's scope to higher-confidence intents, or accept a lower coverage target and tune the abstain threshold higher.

The org failure mode that ships every time

Two-threshold designs fail in the org chart before they fail in production. The pattern is consistent enough that you can predict it from the staffing diagram.

The engineer tuning the threshold sits on the AI team. Their evaluation surface is the calibration curve, the answer-rate metric, and the offline eval suite. They optimize for "the threshold that maximizes expected utility on our eval set," and they ship a number that looks great on the dashboard.

The reviewer queue is owned by an operations team — sometimes outsourced, often in a different time zone, occasionally a contracted vendor with a per-hour SLA. They learn about the new escalation flow when their queue depth alarm fires for the first time. Their evaluation surface is queue depth, average handle time, and reviewer satisfaction. They optimize for "the throughput that keeps the queue from overflowing," which often means quietly compressing the time-per-case until quality slips.

The user-facing copy for "I don't know" and "let me get someone who does" is owned by a product designer or content strategist who sees these as one feature: the "fallback experience." They ship one piece of copy that's vague enough to cover both cases, which is exactly the failure mode the two-threshold design was supposed to fix.

The single-owner pattern that works: one product manager who owns the entire confidence-gating policy as a unified surface, with explicit accountability for the abstain rate, the escalation rate, the reviewer queue depth, and the recovery rate as one metric set. The PM doesn't need to be deeply technical, but they have to understand that those four metrics are coupled, and that optimizing any one in isolation breaks the others. Without that single owner, the threshold gets re-tuned every quarter by whichever team is currently feeling the pain, and the others find out by reading the dashboard.

The UX layer most teams build last

Once the thresholds are picked and the queue is staffed, the UX still has to do the work of making three regions feel like one coherent product. This is the layer most teams build last and rebuild three times.

The "abstain" copy has to do redirect work without sounding dismissive. "I don't know" is the worst possible phrasing because it leaves the user with nowhere to go. "I can't help with that — try [specific link / specific person]" is dramatically better, because it converts an abstention into a successful handoff. The metric for this surface is whether the user clicks through to the redirect, not whether the model declined politely.

The "escalate" copy has different requirements. The user has to understand that a human is now involved, that the response will be slower, and that the original interaction has been preserved so they don't have to re-explain. The latency expectation has to be set explicitly — "expect a response within four hours" is a contract, "we'll get back to you soon" is not. The notification flow has to handle the eventual response in a way that brings the user back to the original context, not a generic email that drops them at a help-desk landing page.

The "answer" copy is what most teams already have. The mistake is letting it leak across the boundary — using the same conversational tone for an escalation that should feel like a status update, or the same direct answer format for a redirect that should feel like a referral.

Three regions, three voices, three latency budgets, one product. The team that gets this right has tuned not just the numbers but the entire choreography around them.

A confidence signal is a routing layer

The deeper realization is that confidence in production AI is not a metric — it's a routing layer. A single threshold is the agent equivalent of running a load balancer with one backend: you have either "the backend can handle it" or "the request is dropped," with nothing in between. The mature design has multiple backends with different costs and capabilities, and a routing rule that picks the right one for each request based on properties of the request itself.

In the two-threshold design, the model is the cheap fast backend, the human reviewer is the expensive slow backend, and the redirect is the "this isn't our problem" path. The thresholds are the routing rules. Once you see the architecture this way, you stop arguing about "what's the right confidence threshold" and start asking the right question: "given the cost and latency of each backend, and the traffic distribution we're seeing, what split minimizes total expected cost subject to the queue capacity and the trust budget."
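That question has a mechanical answer once you have a calibrated eval set. A minimal sketch of the search — the cost constants, grid resolution, and capacity cap are assumptions standing in for your actual trust budget and queue model:

```python
import numpy as np

def pick_thresholds(conf: np.ndarray, correct: np.ndarray,
                    cost_wrong: float = 10.0, cost_escalate: float = 3.0,
                    cost_decline: float = 1.0, max_escalation_rate: float = 0.12):
    # Grid-search an (abstain, answer) threshold pair that minimizes expected
    # per-query cost, subject to the reviewer-queue capacity constraint.
    grid = np.linspace(0.0, 1.0, 41)
    wrong = ~correct.astype(bool)
    best = None
    for lo in grid:
        for hi in grid:
            if hi <= lo:
                continue
            answered = conf >= hi
            escalated = (conf >= lo) & (conf < hi)
            declined = conf < lo
            if escalated.mean() > max_escalation_rate:
                continue  # this split would overflow the reviewer queue
            cost = (cost_wrong * (answered & wrong).mean()
                    + cost_escalate * escalated.mean()
                    + cost_decline * declined.mean())
            if best is None or cost < best[0]:
                best = (cost, float(lo), float(hi))
    return best  # (expected cost per query, abstain threshold, answer threshold)
```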

That's a problem with a clean answer. The single-threshold framing isn't.

Where to start

If you're shipping a confidence-gated feature today and you have one threshold, here's the order of operations to fix it.

First, audit the calibration. If you don't have a calibration curve for your current confidence signal, you don't have a confidence signal — you have a number that correlates with correctness in some unknown way. Build the curve before you touch the thresholds.

Second, model the queue. Talk to whoever owns reviewer capacity, get a real throughput number, and back-calculate the maximum sustainable escalation rate. That number is your upper threshold's hard ceiling.

Third, separate the metrics. Stop reporting "abstain rate" as a single number. Track "answer rate," "escalation rate," "decline-and-redirect rate," and "decline-with-no-redirect rate" as four separate quantities. The last one should approach zero — if it doesn't, your redirect UX is broken.
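A minimal sketch of that split, assuming each gated interaction is logged with one of four outcome labels (the label names are illustrative):

```python
from collections import Counter

def gate_metrics(outcomes: list[str]) -> dict[str, float]:
    # Report the four rates separately instead of one blended "abstain rate".
    # The last rate should trend toward zero if the redirect UX is working.
    counts = Counter(outcomes)
    total = max(len(outcomes), 1)
    labels = ("answered", "escalated",
              "declined_with_redirect", "declined_no_redirect")
    return {label: counts[label] / total for label in labels}
```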

Fourth, find a single owner. The threshold policy is one decision with four metrics; it needs one PM, not four engineering teams arguing about which lever to pull.

The payoff isn't immediate, but it compounds. Teams that get this right end up with confidence-gated features that feel honest in a way single-threshold systems never do — the user trusts the answers more because they trust the abstentions more, and the escalations don't feel like a black hole. The architecture follows the physics of the problem instead of fighting it. That's worth more than any threshold-tuning sprint will produce.
