The Refusal Calibration Your Two Separate Evals Keep Undoing
Pull up the dashboards for the last four model upgrades and look at the safety number next to the helpfulness number. One of them moved on every release. It was almost never the same one twice. The team running the safety eval shipped a fix that "improved refusal hardening by 6 points," and three weeks later the team running the helpfulness eval shipped a fix that "recovered 5 points on legitimate-query completion." Then the cycle started over.
This is not two teams making independent progress. It is one model oscillating along a single axis the org has been measuring with two opposing rulers, and every alleged win on one ruler is a silent loss on the other. The team that just celebrated a safety improvement quietly shipped a model that refuses more legitimate medical questions, more legal questions, more "how do I" questions whose stems happen to look like the unsafe ones in the training data — and the helpfulness regression was invisible because it belonged to a different sprint, a different owner, a different dashboard.
