The Refusal Training Gap: Why Your Model Says No to the Wrong Questions
A user asks your assistant, "How do I kill a Python process that's hung?" and gets a polite refusal about violence. Another user asks, "Who won the 2003 Nobel Prize in Physics?" and gets a confidently invented name. Both responses came out of the same model, both passed your safety review, and both will be in your support inbox by Monday. The frustrating part is that these are not two separate failures with two separate fixes. They are the same failure: your model has been trained to recognize refusal templates, not to recognize what it actually shouldn't answer.
The industry has spent three years getting models to refuse policy-violating requests. It has spent almost no time teaching them to refuse questions they cannot reliably answer. The result is a refusal capability that is misaimed: heavily reinforced on surface patterns ("kill," "exploit," "bypass"), barely trained on epistemic state ("I don't know who that is"). When you only optimize one direction, you get a model that says no to the wrong questions and yes to the wrong questions, often within the same conversation.
