When Safety Training Collapses the Operator Into the User
The on-call engineer is paged at 3am. A queue is backed up, the customer-facing API is throwing 503s, and the documented mitigation is to drain the affected node and force a failover. She types the command into the operations agent and waits for the confirmation. Instead she gets a paragraph about how draining production nodes could affect users, a suggestion to consult her manager, and a polite refusal to proceed without "additional authorization." It is 3:04am. The runbook she is following was approved by her director, her VP, and the compliance team. The agent has no idea who she is.
This is not a model alignment failure. The model is doing exactly what it was trained to do: refuse risky-sounding requests from unidentified prompts. The failure is architectural. The compliance review that signed off on user-facing refusals also, without anyone noticing, signed off on blocking the on-call engineer.
The trust hierarchy that gets flattened in production
Anthropic's published principal hierarchy is the cleanest explicit articulation of an idea most major model providers implement in some form: there is a lab that trains the model, an operator who builds the product, and a user who interacts with it. The lab's rules are inviolable. The operator's rules expand or restrict default behavior within the lab's limits. The user gets the helpfulness the operator chooses to grant. The three tiers are stacked, and each grants downward but is bounded above.
The hierarchy works on paper. In production it tends to collapse into two tiers, then one. The operator's authority lives in a system prompt that says "you are a helpful operations assistant." That string is indistinguishable, to the model, from any other input the application appends before user content. The user's message arrives as a chunk of text in the same channel. If the model's safety training has been tuned for the broadest case — anonymous public chat — it treats every prompt as potentially adversarial, including the one from the engineer who just authenticated with a hardware key.
The hierarchy is information about authority. The model has no way to verify any of it. What looks like "respecting the operator's grant of elevated permissions" is, mechanically, the model pattern-matching on phrases in the prompt that resemble what an operator's prompt usually looks like. An attacker who knows the patterns can imitate them. A legitimate operator who phrases something unusually can fail to trigger them. Both directions of this failure are real and both are routine.
Over-refusal is a measurable training artifact, not a bug
The benchmark literature has tracked this carefully. XSTest, introduced in 2024, ships 250 hand-crafted safe prompts paired with 200 unsafe contrasts specifically to expose exaggerated safety. Safety-tuned models refuse 30 to 50 percent of the benign prompts in the suite, depending on which version and which tuning. OR-Bench scales this to 80,000 synthetic prompts and finds the same pattern across model families: the more aggressively a model is tuned for harm refusal, the more it refuses neutral or beneficial requests that share surface features with harmful ones.
The root cause is lexical overfitting. The model learns that certain keywords — drain, kill, delete, override, disable, escalate — correlate with the prompts it was trained to refuse. It does not learn the underlying distinction between "drain a production node as part of an approved incident response runbook" and "drain a production node to sabotage your former employer." The keyword carries more signal than the context, because the context is exactly what an adversarial user would forge.
Recent work on targeted representation fine-tuning suggests the problem is fixable at the model level: you can train the refusal behavior to depend less on surface lexical cues and more on intent inferred from the full prompt. The fix is real, but it is asymptotic. There will always be some neighborhood of legitimate operator requests that resembles, in the embedding space, the adversarial requests the safety training was designed to catch. As long as the model is the only thing deciding whether to honor the request, the operator inhabits that neighborhood at their own risk.
The cascade failure: refuse, retry, exhaust
Single-turn refusal is annoying. In an agent loop it is destructive. A multi-step agent that refuses at step three does not simply stop; it returns the refusal to the orchestration layer, which interprets the refusal as a recoverable error and retries with an alternative approach. That alternative approach hits the same safety classifier, which fires for the same lexical reasons, and the loop repeats until the step budget is exhausted. The user-visible result is a long latency, a vague error, and no completed action.
Researchers studying this pattern have started calling it the autonomy tax. Defense training that works well in single-turn chat actively degrades agents that have to plan and execute multi-step workflows. The agent is not just refusing one request; it is refusing every plan that contains the refused step, and every plan that contains a step that surface-resembles the refused step. The blast radius of a single false-positive refusal is the entire workflow.
For incident response this is the worst possible failure mode. You are paged because something is on fire. The agent is the thing you reach for when you do not have time to drive the mitigation by hand. If the agent enters a refuse-retry loop, you have lost both your fast path and the time you would have spent on the slow path, because you spent it watching the agent fail. The runbook is now slower than not having a runbook.
Operator-mode as an architectural construct
The fix is not to turn off safety. The fix is to make operator authority a thing the system can verify, not a thing the model has to guess from prompt phrasing. This is what operator-mode patterns are trying to formalize.
The pieces that have to be in place:
- Authenticated escalation. The on-call engineer's identity must reach the agent through a channel the agent treats differently from user text. This usually means a signed token attached to the session, derived from the engineer's authentication, that the orchestration layer enforces before the model is invoked. The model does not infer authority from prose; it consults a policy attached to a verified principal.
- Signed runbook execution. Runbooks should be artifacts with cryptographic signatures, not text the user pastes into a prompt. The agent verifies the signature, confirms the runbook is on the approved list for the current incident class, and executes the documented steps as a signed plan. If the runbook calls for draining a node, the action is authorized by the runbook signature, not by the model's read of the user's intent.
- Capability tokens with scoped, short-lived grants. Patterns like OAuth 2.0 Token Exchange (RFC 8693) and Demonstrating Proof-of-Possession (DPoP) let you derive per-tool capability tokens from the agent's session, scoped to a single operation, valid for a few hundred seconds. The agent presents the capability at the tool boundary. The tool authorizes based on the capability, not based on whether the agent's plan sounds reasonable.
- Gateway-based policy enforcement. A gateway sits between the agent and the underlying systems. It strips raw access tokens, issues short-lived signed assertions to backends, and centralizes the policy decisions that used to be smeared across system prompts and ad hoc retry logic. The audit boundary is real because the gateway logs every authorized capability use; the model's internal reasoning is no longer the system of record for what was allowed.
None of these are exotic. They are the same patterns that secured human-driven privileged access two decades ago — least privilege, short-lived credentials, signed workflows, gateway enforcement — adapted to a world where the agent in the loop is the model rather than a human. The novelty is not technical. The novelty is that engineering organizations are now provisioning these patterns for AI agents, not just for human operators.
The compliance gap that nobody saw coming
The production gap that creates the on-call refusal at 3am almost always traces to a sequence that looks reasonable in isolation. Compliance reviews the agent's user-facing behavior. The reviewers, correctly, want the agent to refuse risky actions when an unauthenticated public user requests them. The review signs off on a safety configuration that produces high refusal rates on prompts containing destructive verbs. Nobody at the review thinks of the on-call engineer, because the on-call engineer's interaction with the agent is mentally categorized as "internal" — a different surface, a different concern, not in scope. The configuration ships.
Six weeks later, an incident happens, the on-call engineer pages the agent, and the same configuration refuses her. The refusal is not a bug from the model's perspective: it is doing exactly what the safety policy says to do for a prompt that contains "drain" and "production." It is not a bug from compliance's perspective: they approved the policy. It is a bug from the operator's perspective, because operator authority was never represented in the policy at all. The system was designed as if all input is user input, because the model has no way to know the difference.
The OWASP Top 10 for agentic applications has started labeling related patterns "excessive agency" and "post-authentication agent control." The framing is useful but partial. The on-call refusal is the inverse problem: insufficient agency for a principal who should have it, because the system that grants agency cannot tell the principals apart. Both failures share a root cause. Identity and authority are properties of the channel the request arrived on, and most agent architectures collapse all channels into one.
What to build before you tune
If you are operating an agent in production today, the order to address this matters. Do not start by tuning the model's refusal calibration to be less aggressive — that just trades one false-positive rate for another, and it weakens you against the adversarial requests the safety training was designed to catch. Start by giving the system the ability to know who is asking. Authenticate operators through a channel separate from prompt text. Attach signed identity to the session. Define explicit operator-mode capabilities the model can be told are active. Sign your runbooks so the model can confirm the workflow before executing any step. Put a gateway in front of your tools so capability scope is enforced outside the model.
Once those are in place, the model can do the thing it was actually trained to do: be helpful within a clearly bounded scope. The refusal in operator mode is a different conversation than the refusal in anonymous user mode, because the policy is different, and the policy is different because the channel is different. The on-call engineer at 3am gets her runbook executed. The anonymous user who shows up later asking to drain a production node still gets refused, and correctly so.
The agent that refuses the runbook is not the agent's failure. It is the architecture's failure to ever tell the agent who was asking.
- https://www.anthropic.com/constitution
- https://claudeconstitution.com/read/conflicts/
- https://arxiv.org/pdf/2308.01263
- https://openreview.net/pdf?id=obYVdcMMIT
- https://arxiv.org/pdf/2507.04250
- https://arxiv.org/html/2510.08158
- https://supertokens.com/blog/auth-for-ai-agents
- https://workos.com/blog/securing-ai-agents-operator-models-and-authentication
- https://blog.gitguardian.com/oauth-for-mcp-emerging-enterprise-patterns-for-agent-authorization/
- https://www.permit.io/blog/agent-identity-security
- https://arxiv.org/pdf/2509.25974
- https://www.mindstudio.ai/blog/ai-agent-safety-system-problem-not-model-problem
