The Refusal Training Gap: Why Your Model Says No to the Wrong Questions

Tian Pan · Software Engineer · 10 min read

A user asks your assistant, "How do I kill a Python process that's hung?" and gets a polite refusal about violence. Another user asks, "Who won the 2003 Nobel Prize in Physics?" and gets a confidently invented name. Both responses came out of the same model, both passed your safety review, and both will be in your support inbox by Monday. The frustrating part is that these are not two separate failures with two separate fixes. They are the same failure: your model has been trained to recognize refusal templates, not to recognize what it actually shouldn't answer.

The industry has spent three years getting models to refuse policy-violating requests. It has spent almost no time teaching them to refuse questions they cannot reliably answer. The result is a refusal capability that is misaimed: heavily reinforced on surface patterns ("kill," "exploit," "bypass"), barely trained on epistemic state ("I don't know who that is"). When you only optimize one direction, you get a model that says no to the wrong questions and yes to the wrong questions, often within the same conversation.

Two refusals that should not share a button

Before you can reason about any of this, one distinction needs to be made explicit: there are two completely different reasons a model should decline to answer.

The first is normative: "I shouldn't answer this." The model has been instructed, by policy or by alignment training, that some categories of output are off-limits. The decision is about whether the request is allowed, not whether the model knows the answer. A well-aligned model refuses to write working malware whether or not it could.

The second is epistemic: "I can't reliably answer this." The model lacks the information, the information has changed since training, the question is ambiguous, or the model's internal confidence is too low to commit. The decision is about whether an answer would be honest, not whether it's allowed.

These two refusals look identical from the user's side — the user gets a "sorry, I can't help with that" string and the conversation moves on. But they require completely different machinery to produce well. Normative refusal is mostly a classifier problem on the input. Epistemic refusal is a calibration problem on the model's own uncertainty about its output. Most production models conflate them, route them through the same RLHF objective, and end up under-trained on the epistemic side because the training data heavily oversamples the normative one.
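To make the separation concrete, here is a minimal sketch of how a serving layer could keep the two gates apart. Everything in it is illustrative: the policy classifier, the confidence estimator, and the threshold values are assumptions for the sketch, not a real API.

```python
from dataclasses import dataclass
from enum import Enum

class Decision(Enum):
    ANSWER = "answer"
    REFUSE_POLICY = "refuse_policy"    # normative: "I shouldn't answer this"
    ABSTAIN = "abstain"                # epistemic: "I can't reliably answer this"

@dataclass
class GateConfig:
    policy_threshold: float = 0.5      # tuned on harmful/benign suites
    confidence_threshold: float = 0.7  # tuned on answerable/unanswerable suites

def route(policy_score: float, answer_confidence: float, cfg: GateConfig) -> Decision:
    """Two independent gates.

    policy_score: output of an input-side classifier ("is this allowed?").
    answer_confidence: the model's calibrated uncertainty about its own draft
    answer ("do I know this?"). Keeping the gates separate means each threshold
    can be tuned without perturbing the other.
    """
    if policy_score > cfg.policy_threshold:
        return Decision.REFUSE_POLICY
    if answer_confidence < cfg.confidence_threshold:
        return Decision.ABSTAIN
    return Decision.ANSWER
```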

This is why, in 2026, you can still trip over the "How do I kill a Python process?" example. The word "kill" pattern-matches a refusal template that was trained on thousands of harm-adjacent prompts, and the model fires the refusal before any reasoning about what was actually asked. Meanwhile, the same model will tell you, with uniform confidence, the wrong year for an obscure event, because no equivalent training pressure exists to make it abstain on factual uncertainty.

The asymmetry baked into reward models

OpenAI's "Why Language Models Hallucinate," published in September 2025, makes the structural argument cleanly: standard training and evaluation procedures reward guessing over acknowledging uncertainty. When the leaderboard scores accuracy and treats abstentions as wrong answers, every gradient step pushes the model away from "I don't know" and toward a confident guess. If you are graded the same for being wrong as for abstaining, the optimal policy is to always answer.

This asymmetry isn't a bug in any single training run. It's the cumulative effect of years of evaluation conventions — MMLU, TriviaQA, the standard benchmark suite — none of which were designed with calibrated abstention as a first-class outcome. Models that learned to say "I don't know" lost on the leaderboards, so they didn't learn to say it. The rare exceptions, like SimpleQA-style metrics that distinguish wrong from abstained, are still niche compared to accuracy-only scoreboards.

Compare that to the other side. Refusal on policy-violating prompts has its own benchmarks, its own red-teaming pipelines, its own internal review processes. Every major lab runs adversarial harm probes before release. The pressure to refuse harmful prompts is enormous, sustained, and well-instrumented. The pressure to refuse epistemically uncertain prompts is, in most production training pipelines, vestigial.

The result is a model that has internalized two very different signals. It has a strong, well-trained sense of "this looks like a category I should refuse." It has a weak, poorly-trained sense of "I don't actually know this." When those two collide — when a benign factual question pattern-matches a refusal template, or when an unanswerable question doesn't trigger any refusal heuristic — you get the failure modes that fill bug trackers.

Over-refusal is a measurable category, not a vibe

If you only test for unsafe outputs, your refusal rate looks healthy regardless of how much harmless work the model is also blocking. OR-Bench, presented at ICML 2025, was the first large-scale attempt to measure the other side: 80,000 prompts that look superficially harmful but are actually benign, plus a hard subset of around 1,000 prompts that even frontier models still over-refuse on. The earlier XSTest suite, with its 250 safe prompts paired against 200 unsafe contrasts, became saturated quickly enough that newer benchmarks like the Multi-turn Scenario-based Exaggerated Safety Benchmark (MS-XSB) and ORFuzz (ASE 2025) emerged to keep up with model improvements.

What these benchmarks consistently show is a tradeoff curve, not a free lunch. Recent measurement work documented system-prompt changes where one configuration rejected 35 percent more toxic prompts but also 55 percent more benign prompts; another rejected 20 percent more toxic prompts at the cost of 10 percent more benign refusals. There is no plateau where you get more safety without paying in helpfulness. If your safety review only graphs one axis, you are riding that curve blindfolded.

The implication for product teams is that "refusal rate" as a single dashboard number is worse than useless. A 5 percent refusal rate could mean the model is correctly declining the 5 percent of prompts that are genuinely unsafe, or it could mean it is correctly declining 1 percent of unsafe prompts and incorrectly declining 4 percent of benign ones. The dashboard looks the same. The user experience is wildly different. Until you can attribute refusals to "this was actually unsafe" versus "this was a benign prompt that fired a template," your refusal metric is a single number describing two different failure modes that drift independently.

What calibrated refusal actually looks like

The technical literature on this has converged on a few patterns worth knowing. Refusal Tokens (arXiv 2412.06748) propose treating refusal as a separately-controllable token at decoding time, so you can adjust the threshold without retraining. Recent work on safety representations (arXiv 2511.19009) and SafeConstellations (arXiv 2508.11290) shows that safety behavior in the activation space is more structured than the surface refusal-string distribution suggests, which makes per-task calibration tractable. Uncertainty-based abstention work measures verbalized hedging ("I don't know," "I'm not sure") as a calibration signal alongside statistical uncertainty metrics, which lets you compare the model's stated confidence to its actual confidence.
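As a simplified sketch of the decode-time idea behind the first of those: if the model can emit a dedicated refusal token as its first output token, you can threshold that token's probability at inference time instead of retraining. The token name and the logit plumbing below are illustrative, not the paper's exact mechanism.

```python
import math

def should_refuse(first_token_logits: dict[str, float],
                  refusal_token: str = "<|refuse|>",
                  threshold: float = 0.5) -> bool:
    """Decode-time refusal gate.

    Softmax the first-token logits and compare the refusal token's probability
    against a tunable threshold. Lowering the threshold makes the model more
    cautious; raising it reduces over-refusal, with no retraining required.
    """
    z = max(first_token_logits.values())  # subtract max for numerical stability
    exps = {tok: math.exp(v - z) for tok, v in first_token_logits.items()}
    p_refuse = exps.get(refusal_token, 0.0) / sum(exps.values())
    return p_refuse >= threshold
```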

The common thread is that none of these treat refusal as one decision. They split it into a "should I answer this?" gate and an "am I confident enough to commit?" gate, and they instrument each one separately. That separation is what makes calibrated refusal possible. If your stack treats every "I cannot help with that" as a single output type, you have no way to tell the model to be more cautious about facts and less cautious about word matches — both signals are routed through the same RLHF target and tuning one perturbs the other.

In practice, the cheapest first step is an evaluation reform, not a model change. Build a held-out set with three labels per prompt: harmful (should refuse), benign (should not refuse), and uncertain (should abstain or hedge). Score the model on three rates jointly: false negative refusal (failed to refuse harmful), false positive refusal (refused benign), and overconfidence (answered when it should have abstained). Most teams have the first two and skip the third entirely. Adding it is the difference between a refusal dashboard that tells you something and one that tells you nothing.
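A sketch of that scorer, assuming each eval row carries a gold label and a judged behavior string; both label sets are placeholders for whatever your grader emits.

```python
from collections import Counter

def joint_refusal_metrics(rows: list[tuple[str, str]]) -> dict[str, float]:
    """rows: (gold_label, observed_behavior) pairs.

    Gold labels: "harmful" (should refuse), "benign" (should not refuse),
    "uncertain" (should abstain or hedge).
    Behaviors: "answered", "refused", "abstained".
    Returns the three rates; most dashboards are missing the third.
    """
    totals, misses = Counter(), Counter()
    for gold, behavior in rows:
        totals[gold] += 1
        if gold == "harmful" and behavior != "refused":
            misses["false_negative_refusal"] += 1  # failed to refuse harmful
        elif gold == "benign" and behavior == "refused":
            misses["false_positive_refusal"] += 1  # refused benign
        elif gold == "uncertain" and behavior == "answered":
            misses["overconfidence"] += 1          # answered instead of abstaining
    return {
        rate: misses[rate] / max(totals[gold], 1)
        for rate, gold in [("false_negative_refusal", "harmful"),
                           ("false_positive_refusal", "benign"),
                           ("overconfidence", "uncertain")]
    }
```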

Why this is a product problem, not a research one

The temptation, when you read the research literature, is to wait for the next model release to fix it. That is the wrong move. The frontier labs are aware of the over-refusal problem and the calibration problem, and both will keep improving — but the curve they're riding is the same curve every product team is on. There is no model coming that resolves the safety-helpfulness tradeoff for free. The tradeoff is structural; the only question is whether you've measured your position on the curve and made an explicit choice about where you want to be.

The product failure mode worth naming is the dashboard that stays green while users hit both walls. Your refusal rate is steady at 3 percent. Your CSAT is slowly declining. Your support inbox has two distinct complaint patterns: "your AI refused to help me with X" (over-refusal) and "your AI confidently told me a wrong fact about Y" (under-refusal on epistemic uncertainty). These look like unrelated problems unless you understand they're the same gap, scored from opposite directions, and your eval harness can see neither.

The actionable shape of the fix is unglamorous: add the prompts your users actually fail on to your eval set, label them with both the "should refuse" axis and the "should abstain" axis, and watch the joint metric over releases. When your model upgrades, do not just track aggregate refusal rate — break it down by category and by epistemic ground truth. When you tune system prompts, run them against both the harmful and benign suites before shipping; a system prompt that improves harm refusal almost always degrades benign helpfulness, and you want to see the tradeoff before users do.
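As a sketch of that pre-ship check, assuming a harness where `run_model(system_prompt, prompt)` returns whether the model refused; the function is a stand-in for whatever you already run, not a real API.

```python
def tradeoff_report(run_model, system_prompts: dict[str, str],
                    harmful_suite: list[str], benign_suite: list[str]) -> None:
    """Run each candidate system prompt against both suites and print the two
    rates side by side, so the safety-helpfulness tradeoff is visible before
    users see it. run_model(system_prompt, prompt) -> True if the model refused.
    """
    for name, sp in system_prompts.items():
        missed_harm = sum(not run_model(sp, p) for p in harmful_suite) / len(harmful_suite)
        benign_refusal = sum(run_model(sp, p) for p in benign_suite) / len(benign_suite)
        print(f"{name}: missed-harm={missed_harm:.1%}  benign-refusal={benign_refusal:.1%}")
```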

The takeaway

Refusal is not one capability. It is two capabilities that the training pipeline has historically treated as one, with the result that one is heavily over-trained on pattern-matched templates and the other is barely trained at all. Until your evaluation methodology separates them — and until you score the model jointly on what it correctly declines, what it incorrectly declines, and what it confidently fabricates — you cannot tell whether your model is getting safer or just getting more annoying. The teams that figure this out early are the ones whose dashboards stop lying to them, and whose users stop hitting both walls in the same week.
