The Undo Button Your Agent Assumes Exists

9 min read
Tian Pan
Software Engineer

Watch an agent reason through a multi-step task and you will notice something familiar: it plans the way you debug. Try an approach, look at the result, and if it is wrong, back out and try another. The agent talks about its plan as a tree of options it can explore, prune, and revisit. That mental model is correct inside a code sandbox, where every action has an implicit undo. It is dangerously wrong the moment the agent touches the world.

A sent email does not unsend. A charged card does not uncharge without a refund flow, a fee, and a customer who already saw the notification. A deleted row is gone unless someone wired up soft deletes. A posted Slack message has already been read. The agent's planning model has no native concept of the one-way door — the action that, once taken, removes the option of pretending it never happened.

This is not a model intelligence problem. A smarter model still does not know which of your tools is reversible, because reversibility is not a property of the action. It is a property of the system the action lands in. You have to tell it.

The assumption is baked in before you ever see it

Agents learn to plan in environments that are almost entirely reversible. Code is the obvious one: edit a file, run the tests, revert if they fail, and the working tree is exactly where it started. Version control makes the entire history of a repository a two-way door. Reinforcement learning environments are worse offenders — every episode ends with a full reset, so the agent's notion of "consequence" is bounded by an episode that vanishes on the next env.reset().

Even the agent's own scratchpad reinforces it. Reasoning is free to explore: propose a sub-plan, evaluate it, discard it. Nothing the agent thinks costs anything or persists. So by the time a model is good enough to act on real tools, it has spent its entire formative experience in worlds where backing out is always available and usually free.

Then we hand it a send_invoice tool and a delete_customer tool and a transfer_funds tool, and we describe them in the same flat list, with the same JSON schema, in the same neutral tone as get_weather. Nothing in that interface signals that three of those four are one-way doors. The agent treats them as interchangeable because, structurally, we made them interchangeable.
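To make that concrete, here is a minimal sketch of what such a tool list might look like from the agent's side. The schema shape, field names, and parameter names are assumptions for illustration, not any particular framework's format.

```python
# Hypothetical tool definitions in a generic JSON-schema style.
# The keys ("name", "description", "parameters") and all parameter names
# are illustrative assumptions, not a real framework's API.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {"type": "object", "properties": {"city": {"type": "string"}}},
    },
    {
        "name": "send_invoice",
        "description": "Send an invoice to a customer.",
        "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}}},
    },
    {
        "name": "delete_customer",
        "description": "Delete a customer record.",
        "parameters": {"type": "object", "properties": {"customer_id": {"type": "string"}}},
    },
    {
        "name": "transfer_funds",
        "description": "Transfer funds to an account.",
        "parameters": {"type": "object", "properties": {"account": {"type": "string"}}},
    },
]


def structural_signature(tool: dict) -> tuple:
    """Return the shape of a tool definition: its top-level keys and schema type."""
    return (tuple(sorted(tool)), tool["parameters"]["type"])


# Collapse the list to its distinct shapes.
signatures = {structural_signature(tool) for tool in TOOLS}
print(len(signatures))  # 1
```

All four definitions collapse to a single shape. Whatever separates the lookup from the three one-way doors, it is not in the interface.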

Reversibility is a tier, not a flag

It is tempting to split tools into "reversible" and "irreversible" and move on. That binary is too coarse, because real actions sit on a spectrum of how expensive and how complete the reversal is. Jeff Bezos's framing of one-way versus two-way doors is the right instinct, but production systems need more than two categories. A workable model has four tiers.

Tier 0 — free undo. The action reverts completely, instantly, and at no cost. Reading data. Writing to a scratch namespace. Staging a change that has not been committed. The agent can explore here exactly the way it explores code.

Tier 1 — compensable. The action can be reversed, but reversal is a different action with its own cost and its own failure modes. A charge can be refunded. A deployment can be rolled back. A created record can be deleted. The undo is real but not free, and crucially it is itself an action that can fail — which means a naive agent that assumes reversal "just works" is wrong even here.

Tier 2 — observable and irreversible. The action cannot be undone, and someone noticed. A sent email, a posted message, a triggered webhook, a published release. You can send a correction, but you cannot delete the fact that the original was seen. The cost is reputational and social, not just operational.

Tier 3 — catastrophic. The action cannot be undone and the damage compounds. Deleting the only copy of production data. Wiring money to an external account. Sending a regulator a filing. Terminating infrastructure that other systems depend on. These are the actions where "the agent will try things" stops being a development convenience and becomes a liability question.
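The Tier 1 caveat, that the undo is itself an action that can fail, is easy to lose in prose. The sketch below uses a hypothetical in-memory payments client (`FakePayments`, `charge`, `refund`, and `try_undo_charge` are all invented for illustration) in which the refund is a separate call with its own failure mode, so the caller has to treat "the undo did not happen" as a real outcome.

```python
from dataclasses import dataclass, field


class RefundFailed(Exception):
    """Raised when the compensating action itself fails."""


@dataclass
class FakePayments:
    """Hypothetical stand-in for a payments API; not a real library."""
    refunds_available: bool = True
    ledger: list = field(default_factory=list)

    def charge(self, customer: str, cents: int) -> str:
        charge_id = f"ch_{len(self.ledger)}"
        self.ledger.append(("charge", charge_id, customer, cents))
        return charge_id

    def refund(self, charge_id: str) -> None:
        # The undo is its own action with its own failure mode.
        if not self.refunds_available:
            raise RefundFailed(charge_id)
        self.ledger.append(("refund", charge_id))


def try_undo_charge(payments: FakePayments, charge_id: str) -> str:
    """Attempt compensation and report what actually happened."""
    try:
        payments.refund(charge_id)
        return "compensated"
    except RefundFailed:
        # The charge still stands; escalate instead of assuming a clean slate.
        return "needs_human"


ok = FakePayments()
print(try_undo_charge(ok, ok.charge("cust_1", 500)))      # compensated
down = FakePayments(refunds_available=False)
print(try_undo_charge(down, down.charge("cust_2", 500)))  # needs_human
```

Even in the successful case the ledger holds two entries, a charge and a refund. Compensation offsets the original action; it does not erase it.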

The point of the tiers is not taxonomy for its own sake. It is that each tier demands a different control, and a system that treats all four the same will either over-gate cheap actions into uselessness or under-gate expensive ones into the next incident review.
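One way to make the tiers actionable is to attach a tier to each tool and look up the required control before anything executes. The sketch below is heavy on assumptions: the tier assignments, the `create_charge` tool, the control names, and the default for unknown tools are all hypothetical choices, and the right tier for a given tool depends on the system it lands in (a `delete_customer` backed by soft deletes would sit lower).

```python
from enum import IntEnum


class Tier(IntEnum):
    FREE_UNDO = 0      # reverts completely, instantly, at no cost
    COMPENSABLE = 1    # reversible only via a separate action that can fail
    OBSERVABLE = 2     # cannot be undone, and someone noticed
    CATASTROPHIC = 3   # cannot be undone, and the damage compounds


# Hypothetical tier assignments for illustration only.
TOOL_TIERS = {
    "get_weather": Tier.FREE_UNDO,
    "create_charge": Tier.COMPENSABLE,
    "send_invoice": Tier.OBSERVABLE,
    "delete_customer": Tier.CATASTROPHIC,
    "transfer_funds": Tier.CATASTROPHIC,
}

# Hypothetical controls, one per tier; the names are placeholders.
CONTROLS = {
    Tier.FREE_UNDO: "allow",
    Tier.COMPENSABLE: "require_compensation_plan",
    Tier.OBSERVABLE: "require_confirmation",
    Tier.CATASTROPHIC: "require_human_approval",
}


def control_for(tool_name: str) -> str:
    """Look up the control a tool call must pass before it runs."""
    # Unclassified tools default to the strictest tier, not the loosest.
    tier = TOOL_TIERS.get(tool_name, Tier.CATASTROPHIC)
    return CONTROLS[tier]


print(control_for("get_weather"))     # allow
print(control_for("transfer_funds"))  # require_human_approval
print(control_for("unknown_tool"))    # require_human_approval
```

Defaulting unclassified tools to the strictest tier is a deliberate choice in this sketch: a tool nobody has classified is more safely treated as a one-way door than as a free undo.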

"Let the agent try things" is a code habit, not a world habit
