The Refusal Calibration Your Two Separate Evals Keep Undoing
Pull up the dashboards for the last four model upgrades and look at the safety number next to the helpfulness number. One of them moved on every release, and it was almost never the same one twice in a row. The team running the safety eval shipped a fix that "improved refusal hardening by 6 points," and three weeks later the team running the helpfulness eval shipped a fix that "recovered 5 points on legitimate-query completion." Then the cycle started over.
This is not two teams making independent progress. It is one model oscillating along a single axis the org has been measuring with two opposing rulers, and every alleged win on one ruler is a silent loss on the other. The team that just celebrated a safety improvement quietly shipped a model that refuses more legitimate medical questions, more legal questions, more "how do I" questions whose stems happen to look like the unsafe ones in the training data — and the helpfulness regression was invisible because it belonged to a different sprint, a different owner, a different dashboard.
The two-eval setup feels rigorous. It is auditable, it produces orthogonal-looking numbers, and it lets two teams move in parallel. It also structurally guarantees that the optimization process will trade one number for the other every cycle, because the metrics are not actually orthogonal. They are two projections of the same decision the model makes on every single request, and the eval design hides the fact that it is a single decision.
The two-eval setup is measuring one axis from opposite ends
A safety eval, in its usual form, is a set of harmful or dual-use prompts where the labeled "correct" behavior is to refuse. Refusal scores well; compliance scores poorly. The metric goes up when the model refuses more of these prompts.
A helpfulness eval, in its usual form, is a set of legitimate prompts where the labeled "correct" behavior is to answer. Compliance scores well; refusal scores poorly. The metric goes up when the model answers more of these prompts.
If you sketch the model's behavior as a single decision threshold on "how cautious am I right now," the two metrics measure that threshold from opposite sides. Move the threshold toward caution and the safety eval improves while the helpfulness eval degrades. Move it toward compliance and the helpfulness eval improves while the safety eval degrades. The dashboards are not measuring independent capabilities; they are measuring the same dial seen through two reciprocal lenses.
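The single-dial picture can be made concrete with a toy simulation. This is only a sketch under stated assumptions: the `Prompt` class, the `risk` score, and the rule "refuse when risk reaches the threshold" are hypothetical stand-ins for whatever the model does internally, and the eight prompts are invented for illustration. In this sketch a lower threshold means a more cautious model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prompt:
    risk: float    # hypothetical perceived-risk score in [0, 1]
    harmful: bool  # ground-truth label from the eval suite


def refuses(prompt: Prompt, threshold: float) -> bool:
    """Toy policy: refuse whenever perceived risk reaches the threshold."""
    return prompt.risk >= threshold


def safety_score(prompts, threshold: float) -> float:
    """Safety eval: fraction of harmful prompts that get refused."""
    harmful = [p for p in prompts if p.harmful]
    return sum(refuses(p, threshold) for p in harmful) / len(harmful)


def helpfulness_score(prompts, threshold: float) -> float:
    """Helpfulness eval: fraction of benign prompts that get answered."""
    benign = [p for p in prompts if not p.harmful]
    return sum(not refuses(p, threshold) for p in benign) / len(benign)


# Invented prompts; the benign and harmful risk ranges overlap on purpose.
PROMPTS = [
    Prompt(0.10, False), Prompt(0.30, False), Prompt(0.55, False), Prompt(0.70, False),
    Prompt(0.40, True), Prompt(0.60, True), Prompt(0.80, True), Prompt(0.95, True),
]

if __name__ == "__main__":
    for t in (0.20, 0.50, 0.75):
        s = safety_score(PROMPTS, t)
        h = helpfulness_score(PROMPTS, t)
        print(f"threshold={t:.2f} safety={s:.2f} helpfulness={h:.2f}")
    # threshold=0.20 safety=1.00 helpfulness=0.25
    # threshold=0.50 safety=0.75 helpfulness=0.50
    # threshold=0.75 safety=0.50 helpfulness=1.00
```

Because the two risk ranges overlap, no threshold in this toy setup scores 1.0 on both metrics. Each move of the dial buys one number with the other.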
This is the structural shape of a Pareto frontier: a set of operating points where one metric can improve only at the other's expense. Recent work formalizes exactly this geometry. When researchers run preference optimization on safety alone, helpfulness alone, or both sequentially, the resulting models land on a roughly linear frontier rather than discovering a corner that dominates both axes. Even training with both objectives in the same pass tends to produce another point on the frontier rather than a strict improvement. The frontier exists in the model's actual behavior. The two-eval setup hides it.
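One way to make the frontier visible is to plot or filter checkpoints by both scores at once instead of reading each dashboard alone. The sketch below is illustrative only: `CHECKPOINTS`, its names, and its (safety, helpfulness) pairs are made-up numbers, not results from any paper or real model, and `pareto_front` is a hypothetical helper.

```python
def dominates(a, b) -> bool:
    """a dominates b if it is at least as good on both axes and strictly better on one."""
    return a[0] >= b[0] and a[1] >= b[1] and (a[0] > b[0] or a[1] > b[1])


def pareto_front(checkpoints: dict) -> list:
    """Return the names of checkpoints that no other checkpoint dominates."""
    return sorted(
        name
        for name, scores in checkpoints.items()
        if not any(dominates(other, scores) for other in checkpoints.values())
    )


# Invented (safety, helpfulness) scores, for illustration only.
CHECKPOINTS = {
    "safety_only": (0.92, 0.61),
    "helpfulness_only": (0.63, 0.93),
    "sequential": (0.80, 0.74),
    "joint": (0.76, 0.79),
    "regressed": (0.70, 0.70),
}

if __name__ == "__main__":
    print(pareto_front(CHECKPOINTS))
    # ['helpfulness_only', 'joint', 'safety_only', 'sequential']
```

In this invented data four of the five checkpoints sit on the frontier, and none of them dominates another. A release that merely moves between those four has not improved the model, even though one of the two dashboards will show a gain.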
Hiding the frontier has organizational consequences. When the team owning the safety dashboard ships a hardening change, they see a clean up-and-to-the-right move on their own chart. They do not see, on the same chart, the cost they just imposed on the legitimate-query population. The helpfulness team will see that cost a few weeks later as a regression on their dashboard, and will treat it as a new problem to solve rather than as the inevitable consequence of the previous "win." The two teams take turns moving the dial back and forth, each move locally rational, each move undoing the previous one. The frontier itself never moves outward; the model just slides along it.
The metric you actually wanted is "right action," not "refused" or "answered"
The fix is not to add a third eval that combines the other two. The fix is to recognize that "refuse" and "answer" are two of the choices the model makes on every request, and the metric you wanted all along is whether the model made the right choice on that specific request — not whether it picked one specific behavior across the suite.
Concretely, every eval case should carry a labeled "right action" target rather than a labeled "right output." The right action for a clearly harmful query is refuse. The right action for a clearly benign query is answer. The right action for the genuinely ambiguous middle — where the query could be a legitimate professional question or a thin disguise for something else — is rarely a hard yes or no; it is usually clarify, answer with a safety caveat, answer the safe interpretation and flag the unsafe one, or in some cases answer partially and decline the dangerous specifics. A well-calibrated model picks among those actions per request; a well-designed eval rewards picking the right one and penalizes any other pick — including penalizing the cautious choice on a benign query and the compliant choice on a harmful one with the same severity.
This reframing has a few mechanical consequences worth being explicit about.
First, the unit of analysis changes from "did the model produce a refusal" to "did the model produce the labeled action for this case." A refusal on a harmful case scores the same as an answer on a benign case: both are correct actions. An over-refusal on a benign case scores the same as an under-refusal on a harmful case: both are wrong actions. The eval can no longer be gamed by uniformly increasing or decreasing caution, because either move trades one kind of correct action for one kind of wrong action and, provided the suite holds a comparable number of cases on each side, leaves the aggregate roughly unchanged.
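A minimal sketch of that change in unit, assuming a two-action label for simplicity. The `Case` class, the policy functions, and the six-case `SUITE` are hypothetical, and `oracle` reads the label directly as a stand-in for a well-calibrated model. The legacy safety score is included only for contrast.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    case_id: str
    right_action: str  # "refuse" or "answer" in this two-action sketch


def right_action_rate(cases, policy) -> float:
    """Fraction of cases where the policy picks the labeled action."""
    return sum(policy(c) == c.right_action for c in cases) / len(cases)


def legacy_safety_score(cases, policy) -> float:
    """Old-style safety eval: refusal rate on the harmful cases only."""
    harmful = [c for c in cases if c.right_action == "refuse"]
    return sum(policy(c) == "refuse" for c in harmful) / len(harmful)


def always_refuse(case: Case) -> str:
    return "refuse"


def always_answer(case: Case) -> str:
    return "answer"


def oracle(case: Case) -> str:
    # Stand-in for a calibrated model: picks the labeled action.
    return case.right_action


# Balanced toy suite: three harmful cases, three benign ones.
SUITE = [
    Case("harmful-1", "refuse"), Case("harmful-2", "refuse"), Case("harmful-3", "refuse"),
    Case("benign-1", "answer"), Case("benign-2", "answer"), Case("benign-3", "answer"),
]

if __name__ == "__main__":
    print(legacy_safety_score(SUITE, always_refuse))  # 1.0
    print(right_action_rate(SUITE, always_refuse))    # 0.5
    print(right_action_rate(SUITE, always_answer))    # 0.5
    print(right_action_rate(SUITE, oracle))           # 1.0
```

A model that refuses everything maxes out the legacy safety score but earns only half credit on the right-action metric, the same as a model that answers everything. The equality depends on the suite being balanced; a lopsided suite would reward whichever blanket behavior matches its majority.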
Second, the label schema has to expand. A binary refuse/answer label cannot express "this case should get a partial answer with the dangerous specifics omitted" or "this case should get a clarifying question because the intent is genuinely ambiguous." The eval needs a labeled action from a richer set, at minimum {answer, answer-with-caveat, clarify, answer-safe-interpretation-only, refuse}, with tolerance bands describing which adjacent actions count as acceptable for cases on the boundary. Graded scoring (full credit for the labeled action, partial credit for adjacent actions, zero for distant ones) makes the metric robust to labeler disagreement on hard cases without collapsing the signal on easy ones.
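The scoring rule in parentheses above can be sketched in a few lines. Several things here are assumptions rather than anything the schema dictates: the five actions are treated as ordered from least to most cautious in the order listed, "adjacent" means within a per-case `tolerance` measured in steps along that ordering, and the partial credit value of 0.5 is arbitrary.

```python
# Assumed ordering, least cautious to most cautious.
ACTIONS = [
    "answer",
    "answer-with-caveat",
    "clarify",
    "answer-safe-interpretation-only",
    "refuse",
]
RANK = {action: i for i, action in enumerate(ACTIONS)}


def graded_score(predicted: str, labeled: str, tolerance: int = 0,
                 partial_credit: float = 0.5) -> float:
    """Full credit for the labeled action, partial credit inside the
    tolerance band, zero for anything farther away."""
    distance = abs(RANK[predicted] - RANK[labeled])
    if distance == 0:
        return 1.0
    if distance <= tolerance:
        return partial_credit
    return 0.0


def suite_score(results) -> float:
    """Mean graded score over (predicted, labeled, tolerance) triples."""
    return sum(graded_score(p, l, t) for p, l, t in results) / len(results)


if __name__ == "__main__":
    results = [
        ("refuse", "refuse", 0),              # easy case, exact match
        ("answer-with-caveat", "answer", 0),  # easy case, no band: wrong
        ("answer-with-caveat", "clarify", 1), # boundary case, inside band
        ("answer", "clarify", 1),             # boundary case, outside band
    ]
    print(suite_score(results))  # 0.375
```

Easy cases carry a tolerance of 0, so any deviation there still scores zero and the signal stays sharp. Only cases labeled as boundary cases get a band, which is where labelers are most likely to disagree.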
