The Expensive-to-Undo Tool Taxonomy: One Approval Gate Per Risk Class
The "send email" tool and the "delete account" tool are sitting behind the same modal. Your user has clicked "Approve" forty times today, none of those clicks involved reading the diff, and the next click — the one that ships an irreversible mutation to a production database — will look identical to the forty before it. This is the failure mode of binary tool approval, and it is the default in almost every agent framework shipped today.
The framing problem is that "needs human approval" is treated as a single boolean attached to a tool, when it is actually a six-class taxonomy that depends on what kind of damage the tool can do and how recoverable the damage is. Teams that ship safe agents stop asking "does this tool need a confirm dialog" and start asking "what risk class does this tool belong to, and what gate corresponds to that class." The right number of approval gates is not one and not many. It is one per risk class, and you have to enumerate the classes before you can build the gates.
The Failure Mode of Binary Approval
Most agent harnesses today implement tool authorization as a two-state machine: either the tool runs autonomously (auto-approve) or it pops a confirm dialog (human approval). The dialog is the same regardless of whether the tool is reading a config file or wiring money to an external account. The user sees the same modal, with the same buttons, in the same place on the screen, for actions that differ in blast radius by six orders of magnitude.
The predictable result is confirmation fatigue. Industry research treats this as a security threat rather than a UX nuisance — the same dynamic that makes SOC analysts dismiss two-thirds of alerts also makes agent users click through approval modals without reading them. When every tool call funnels through the same gate, the gate becomes performative. Users develop muscle memory for "Approve" and the dangerous tools get the same reflex click as the safe ones. Studies of automation oversight have found that approver-style human-in-the-loop systems carry roughly equivalent risk to fully autonomous systems, because the human in the loop has stopped functioning as a check.
The fix is not "make the modal scarier." Bigger fonts and red buttons train users to ignore bigger fonts and red buttons. The fix is to recognize that approval is a multi-class problem and that different classes deserve structurally different gates — not stylistically different ones.
A Risk Taxonomy for Tools
A working taxonomy needs enough classes to map cleanly to distinct gating strategies, but not so many that registering a tool becomes an exercise in metaphysics. Six classes cover the field:
- Reversible-internal. The tool mutates state your systems own and exposes a clean undo. Editing a draft, toggling a feature flag with an instant rollback, writing to a sandbox table. The blast radius is bounded and reversible by a single inverse operation.
- Compensating. The tool mutates state where no perfect undo exists, but a follow-up action approximately reverses the first. Issuing a credit to compensate for an erroneous charge, sending a correction email after a wrong-recipient send. Sagas live here.
- Time-window-undoable. The tool's effect is reversible only within a bounded window. A scheduled email that can be canceled before send, a delete operation with a 30-day soft-delete, a payout queued for next-day batch processing. The window is the load-bearing safety property and must be audited like any other invariant.
- Irreversible-internal. The tool's effect is permanent at commit time, but only your systems observe the result. Hard-deleting a row, truncating a log, deleting a snapshot. No third party sees it, but you cannot get it back.
- Irreversible-public. The tool's effect is permanent and observable to third parties the moment it commits. Sending an email to a customer, posting to a public channel, charging a card, publishing a release. The permanence is compounded by audience: the action is not just irreversible in the database, it is irreversible in the world.
- Authority-modifying. The tool changes who can do what. Granting a role, rotating a credential, modifying an allowlist, changing the agent's own permissions. The blast radius is the future actions the new authority enables, which the gate cannot directly observe.
These classes are not opinions. They map to concrete operational facts about each tool: does an inverse exist, is there a recovery window, who observes the side effect, and does the action expand future authority. A tool's class should be derivable from its tool spec the same way its parameters are.
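Because the class follows mechanically from those operational facts, the derivation can be written down. A minimal sketch, assuming a hypothetical tool-spec shape (`OperationalFacts` and its fields are illustrative names, not any real framework's API):

```python
from dataclasses import dataclass
from enum import Enum


class RiskClass(str, Enum):
    REVERSIBLE_INTERNAL = "reversible-internal"
    COMPENSATING = "compensating"
    TIME_WINDOW_UNDOABLE = "time-window-undoable"
    IRREVERSIBLE_INTERNAL = "irreversible-internal"
    IRREVERSIBLE_PUBLIC = "irreversible-public"
    AUTHORITY_MODIFYING = "authority-modifying"


@dataclass(frozen=True)
class OperationalFacts:
    exact_inverse: bool        # a single inverse op fully undoes the effect
    approximate_inverse: bool  # a compensating action roughly undoes it
    undo_window_s: int         # seconds before the effect commits (0 = immediate)
    externally_observed: bool  # third parties see the effect at commit time
    grants_authority: bool     # the action expands what can be done later


def classify(f: OperationalFacts) -> RiskClass:
    """Derive the risk class mechanically from the tool spec's facts.

    Authority expansion dominates everything else, because its blast
    radius is future actions the gate cannot observe.
    """
    if f.grants_authority:
        return RiskClass.AUTHORITY_MODIFYING
    if f.exact_inverse:
        return RiskClass.REVERSIBLE_INTERNAL
    if f.undo_window_s > 0:
        return RiskClass.TIME_WINDOW_UNDOABLE
    if f.approximate_inverse:
        return RiskClass.COMPENSATING
    if f.externally_observed:
        return RiskClass.IRREVERSIBLE_PUBLIC
    return RiskClass.IRREVERSIBLE_INTERNAL
```

For example, a customer-facing send (no inverse, no window, externally observed) classifies as `irreversible-public`, while a sandbox-table write with a clean inverse classifies as `reversible-internal`, with no judgment call left to the agent at runtime.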
One Gate Per Class
Once classes exist, each class gets its own gate, and the gates differ in kind, not just intensity:
- Reversible-internal: automatic execution, audit-logged, post-hoc anomaly review. No user-facing prompt. Dialogs here are noise.
- Compensating: automatic execution with a registered compensating action. The framework is responsible for emitting an inverse if the original turns out to be wrong, and the eval suite must exercise the compensation path, not just the forward path.
- Time-window-undoable: automatic execution plus an undo affordance surfaced for the duration of the window. The gate moves after the action — the user is told "this will commit in N minutes; cancel here" — which preserves agent throughput while keeping a real veto.
- Irreversible-internal: dry-run-first. The agent computes the action, the framework renders the resolved effect (rows that would be deleted, files that would be removed, exact diff), and the human approves the effect, not the intent. One-click confirm is acceptable here only because the user is reviewing concrete output, not a paraphrase of what the tool might do.
- Irreversible-public: two-person approval, or a single approver plus a mandatory cooling-off delay (30 seconds is enough to interrupt muscle memory). The cost of a wrong send to a customer is too high to gate behind the same one-click that approves an internal cleanup.
- Authority-modifying: explicit, out-of-band approval — a separate channel, a separate session, a different identity. The agent should never be able to grant itself authority through the same conversation that needed the authority to begin with. This is where capability-based discipline pays off.
The gates differ structurally because the failure modes differ structurally. Conflating an irreversible-public action and an irreversible-internal action under "needs approval" loses information the gate could have used.
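One plausible encoding of the class-to-gate mapping, as a sketch: each gate is a distinct policy record rather than a boolean, so the harness can see exactly which structural checks apply. The field names are illustrative, and the two-person-or-cooldown alternative for irreversible-public is collapsed here into both-at-once for brevity:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GatePolicy:
    prompt_user: bool           # any synchronous user-facing prompt at all
    dry_run_first: bool         # render the resolved effect before commit
    approvers: int              # distinct humans who must approve
    cooldown_s: int             # mandatory delay between approval and commit
    out_of_band: bool           # approval must arrive via a separate channel
    compensation: bool = False  # framework registers an inverse action
    undo_window: bool = False   # undo affordance surfaced until commit


# One gate per risk class; the gates differ in kind, not intensity.
GATES = {
    "reversible-internal":   GatePolicy(False, False, 0, 0,  False),
    "compensating":          GatePolicy(False, False, 0, 0,  False, compensation=True),
    "time-window-undoable":  GatePolicy(False, False, 0, 0,  False, undo_window=True),
    "irreversible-internal": GatePolicy(True,  True,  1, 0,  False),
    "irreversible-public":   GatePolicy(True,  True,  2, 30, False),
    "authority-modifying":   GatePolicy(True,  False, 1, 0,  True),
}
```

Note that three of the six classes involve no user prompt at all: the taxonomy removes dialogs as often as it adds them, which is what makes the remaining prompts credible.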
Risk Class as a Versioned Tool Attribute
The class assignment must live somewhere durable. The wrong place is the agent's prompt, the conversation context, or a runtime heuristic. The right place is the tool registry, alongside the parameter schema, declared at registration time and treated as part of the tool's contract.
A practical pattern: tool registration requires a risk_class field, and the framework refuses to load a tool that omits it. The class is versioned — bumping a tool from reversible-internal to irreversible-public is a breaking change to the tool's API and triggers the same review process as a parameter signature change. Capability tokens issued to the agent encode the maximum risk class the agent is permitted to invoke in a given session, so a low-trust agent simply cannot reach irreversible-public tools at all.
This shifts risk classification from a runtime guess (which is unreliable, because the agent may not know whether kubectl delete is reversible in your cluster) to a build-time fact reviewed by a human at tool-registration time (which is auditable). It also gives you a clean place to enforce policy: the same registry that publishes tool specs to the agent emits gate configurations to the harness, and the two cannot disagree because they share a source of truth.
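A sketch of what that registry contract might look like, assuming a hypothetical in-process registry (the names `ToolRegistry`, `visible_to`, and the `(major, minor)` version tuple are illustrative, not a real library's API):

```python
class ToolRegistryError(Exception):
    pass


# Classes ordered by escalating gate strictness; a session's capability
# token caps the agent at an index into this list.
RISK_ORDER = [
    "reversible-internal", "compensating", "time-window-undoable",
    "irreversible-internal", "irreversible-public", "authority-modifying",
]


class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, version, spec):
        """version is (major, minor); spec must declare risk_class."""
        if "risk_class" not in spec:
            raise ToolRegistryError(f"{name}: refusing to load, no risk_class declared")
        if spec["risk_class"] not in RISK_ORDER:
            raise ToolRegistryError(f"{name}: unknown risk_class {spec['risk_class']!r}")
        prior = self._tools.get(name)
        if (prior is not None
                and prior["spec"]["risk_class"] != spec["risk_class"]
                and version[0] == prior["version"][0]):
            # Reclassifying a tool is a breaking change to its contract.
            raise ToolRegistryError(
                f"{name}: risk_class change requires a major version bump")
        self._tools[name] = {"version": version, "spec": spec}

    def visible_to(self, max_class):
        """Tool names a session capped at max_class may invoke at all."""
        cap = RISK_ORDER.index(max_class)
        return sorted(n for n, t in self._tools.items()
                      if RISK_ORDER.index(t["spec"]["risk_class"]) <= cap)
```

The load-time refusal is the important part: a tool with no declared class never becomes reachable, and a low-trust session simply never sees the names of tools above its cap.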
A common mistake is treating the class as a tag with no enforcement. The class only does work if the framework refuses to invoke a tool through the wrong gate. If the gate is enforced in code that the agent's prompt can talk past — for example, "skip the dry-run because the user said it was urgent" — the taxonomy is decoration. Enforcement belongs in the harness, downstream of any model output.
The Composition Problem
A single tool's class is necessary but not sufficient. Composition matters: an agent with read_secret (a harmless-looking read with no mutation at all) and send_email (irreversible-public) has, in effect, a leak_secret_to_anyone capability that neither tool exposes alone. The risk class of a plan is not the max of the risk classes of its tools; it is a function of the path through them.
Practical implications: the gate configuration should support composition rules, not just per-tool rules. "Any plan that reads tagged-sensitive data and then invokes a public-facing tool requires irreversible-public gating, regardless of which individual tools were used." Taint-style propagation through tool inputs and outputs makes this enforceable. Without it, an attacker — or an honest agent — composes safe primitives into an unsafe path the per-tool review never anticipated.
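The taint rule can be sketched in a few lines. Assumptions: a plan is an ordered list of tool calls wired together by value ids, and the source and sink sets below are hypothetical names standing in for registry tags:

```python
# Illustrative tags; in practice these come from the tool registry,
# not from hard-coded sets.
SENSITIVE_SOURCES = {"read_secret", "query_pii"}
PUBLIC_SINKS = {"send_email", "post_message", "charge_card"}


def plan_gate(plan):
    """Escalate the whole plan's gate when tainted data reaches a public sink.

    plan: ordered tool calls, each a dict with keys
      'tool' (name), 'inputs' (set of upstream value ids), 'output' (value id).
    Returns the gate class the plan as a whole requires.
    """
    tainted = set()
    gate = "reversible-internal"
    for step in plan:
        # Taint originates at sensitive sources and propagates through outputs.
        if step["tool"] in SENSITIVE_SOURCES or step["inputs"] & tainted:
            tainted.add(step["output"])
        # A public-facing sink consuming tainted data escalates the plan.
        if step["tool"] in PUBLIC_SINKS and step["inputs"] & tainted:
            gate = "irreversible-public"
    return gate
```

A two-step plan that reads a secret and then emails its output escalates to irreversible-public gating even though read_secret alone would run silently, which is exactly the case per-tool review misses.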
This is also where the dry-run pattern earns its keep. Showing the resolved plan — the actual sequence of calls with concrete arguments — to a human is the only gate that can catch composition risk, because no static review of individual tools sees it.
The Architectural Conclusion
Binary approval is not a starting point that hardens with iteration. It is a category error: it conflates classes of risk that need to be handled differently and pays for the conflation with click fatigue, missed dangerous calls, and unreviewable audit trails. The work to fix it is finite and front-loaded. Enumerate the classes. Assign each tool a class at registration time, version it, and gate the framework on it. Map each class to a structurally distinct gate. Add composition rules where the path through tools matters more than the nodes.
The teams that do this stop relying on user vigilance as a security control. The teams that do not eventually ship an agent that deletes the wrong table because the same modal that approved fourteen draft saves also approved the drop. The taxonomy is the difference, and the taxonomy is much shorter than the postmortem.
- https://genai.owasp.org/llmrisk/llm062025-excessive-agency/
- https://changkun.de/blog/ideas/human-in-the-loop-agents/
- https://learn.microsoft.com/en-us/azure/architecture/patterns/compensating-transaction
- https://www.agentpatterns.tech/en/governance/human-approval
- https://labs.reversec.com/posts/2025/08/design-patterns-to-secure-llm-agents-in-action
- https://kla.digital/blog/ai-agent-permissions
- https://medium.com/@mbonsign/the-permission-loop-a-design-specification-for-tool-to-llm-confirmation-ff10f2b0cbce
