Cancel-Safe Agents: The Side Effects Your Stop Button Already Shipped
A user clicks Stop because the agent misunderstood the request. The UI flashes "stopped." By the time the spinner disappears, the agent has already sent two emails, scheduled a Tuesday meeting on the user's calendar, opened a draft pull request against the wrong branch, and queued a Slack message that is racing the cancellation signal through the tool layer. The model has obediently stopped generating tokens. The world has not stopped reacting to the tokens it generated thirty seconds ago.
This is the failure mode nobody covered in the agent demo. Cancellation in ordinary concurrent code was already a hard problem with a generation of cooperative-cancellation theory behind it: Go contexts, Python's asyncio Task.cancel(), structured concurrency with task groups, the whole grammar of "ask politely, escalate carefully, don't leave resources behind." Agents take that already-hard problem and add a layer on top: the planner does not know that the user revoked authorization between step 4 and step 5, and the tools it kicked off in step 4 do not get a memo when step 5 is cancelled. Stop is a UI affordance. The system underneath stop has to be designed.
Stop Is a Lie Until It Names What Already Happened
Most agent UIs treat cancel as a binary: a button, a spinner that vanishes, a "stopped" toast. The user reads "stopped" and assumes nothing further will happen on their behalf. That assumption is the entire bug.
Inference cancellation is the easy half. The token stream stops cleanly because the model is a function the runtime can interrupt. Tool execution is the hard half. By the time the cancel signal reaches the tool layer, an HTTP request to a calendar API may already be in flight, an email may already be sitting in a queue waiting for a worker, and a database migration may already be halfway through a transaction the agent will never see commit or roll back. A trustworthy stop button has to surface these. The right UI affordance after cancel is not "stopped" — it is "here is what your agent did before you cancelled, here is what was interrupted, here is what is still in flight, here is what you can undo."
Building that UI requires the system underneath to know the answers. It almost never does, because the side-effect inventory is implicit: scattered across tool-call logs, vendor request IDs that may or may not have been written, and exception traces that describe what the agent attempted rather than what landed. The first architectural move is to make the inventory explicit. Every tool call has to be journaled before it is issued, with enough metadata that a post-cancel audit can reconstruct: what was attempted, what reached the upstream system, what came back, and which compensating action — if any — exists.
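A minimal sketch of such a journal, assuming an in-memory list stands in for durable storage and that tools are plain Python callables. The names Journal, JournalEntry, Status, and audit are hypothetical, chosen for illustration only. The point it shows is ordering: the intent is recorded before the call is issued, so an interrupted call still leaves a trace.

```python
import time
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Optional


class Status(Enum):
    INTENDED = "intended"    # journaled, outcome not yet known
    CONFIRMED = "confirmed"  # the tool returned a result
    FAILED = "failed"        # the tool raised; upstream state may be unknown


@dataclass
class JournalEntry:
    tool: str
    args: dict
    compensation: Optional[str]  # name of the undo tool, if one exists
    call_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    status: Status = Status.INTENDED
    result: Any = None
    started_at: float = field(default_factory=time.time)


class Journal:
    """Hypothetical in-memory journal; a real one would persist each entry."""

    def __init__(self) -> None:
        self.entries: list[JournalEntry] = []

    def call(self, tool, fn, args, compensation=None):
        entry = JournalEntry(tool, args, compensation)
        self.entries.append(entry)  # record the intent BEFORE issuing the call
        try:
            entry.result = fn(**args)
            entry.status = Status.CONFIRMED
        except Exception as exc:
            entry.result = repr(exc)
            entry.status = Status.FAILED
            raise
        return entry.result

    def audit(self):
        """Group journaled calls into the buckets a post-cancel UI needs."""
        report = {"landed": [], "unknown": [], "undoable": []}
        for e in self.entries:
            if e.status is Status.CONFIRMED:
                report["landed"].append(e.tool)
                if e.compensation:
                    report["undoable"].append(e.tool)
            else:
                # INTENDED or FAILED: we cannot say what reached upstream
                report["unknown"].append(e.tool)
        return report
```

Entries that never leave the INTENDED state are the interesting ones after a cancel: they mark calls that were started and never resolved, which is exactly the "still in flight" category the UI has to name.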
The Authorization Window Between Step 4 and Step 5
Long-running agents have a subtle authorization problem that synchronous tools never had to deal with: the user can change their mind during the run. They watch the agent take step 1, step 2, step 3, then realize at step 4 that the agent misread the goal. They press cancel. The planner is currently mid-token on step 5's tool call. The tool call lands.
In a well-designed system, the tool's permission to act is not a one-time grant at session start. It is re-checked at the moment of execution. The architectural primitive is less like a session-wide flag the planner reads once and more like an OAuth-style scoped token: revocable, and presented by the tool layer on every call. Authorization should be evaluated at request time rather than only at initial connection, so an agent that keeps pursuing a goal the user has revoked cannot continue acting on the old grant. Concretely: the cancel signal does not just stop the planner. It also invalidates the action grant that pending and future tool calls must hold to commit.
This shifts the failure mode from "tool call succeeded after cancel" to "tool call attempted after cancel and was rejected by the auth layer." That second failure mode is recoverable. The first is an incident.
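One way to picture a grant that is checked per call rather than per session is the sketch below. ActionGrant, GrantRevoked, run_tool, and on_cancel are hypothetical names, and a threading.Event stands in for whatever revocation store a real deployment would use; treat it as an illustration of the ordering, not a security design.

```python
import threading


class GrantRevoked(Exception):
    """Raised when a tool call presents a grant the user has revoked."""


class ActionGrant:
    """Hypothetical revocable grant that every tool call must present."""

    def __init__(self, scopes):
        self.scopes = frozenset(scopes)
        self._revoked = threading.Event()

    def revoke(self):
        self._revoked.set()

    def check(self, scope):
        if self._revoked.is_set():
            raise GrantRevoked(f"grant revoked before '{scope}' could run")
        if scope not in self.scopes:
            raise PermissionError(f"scope '{scope}' was never granted")


def run_tool(grant, scope, fn, *args):
    grant.check(scope)  # evaluated at call time, not at session start
    return fn(*args)


def on_cancel(grant, stop_planner):
    grant.revoke()   # first: no later tool call can pass the check
    stop_planner()   # then: stop token generation
```

Revoking before stopping the planner is a deliberate choice in this sketch: if the planner emits one more tool call while it winds down, that call is rejected instead of landing.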
A useful frame: every tool call is a two-phase commit where the second phase is conditional on a still-authorized signal. The tool prepares the side effect (composes the email, locks the calendar slot, opens the database transaction) but does not commit until it re-validates the authorization. Cancellation flips the bit, and prepared-but-uncommitted side effects fall on the floor instead of escaping into the world. This pattern is not free — every tool integration has to be designed with a prepare/commit boundary — but for irreversible actions it is the only way "cancel" can mean what users think it means.
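The prepare/commit boundary might look like the following for a single tool. Everything here is an assumption made for illustration: Prepared, prepare_email, and execute are hypothetical names, a list plays the role of the outbound mail queue, and still_authorized is whatever callable reports the current state of the grant.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Prepared:
    description: str
    commit: Callable[[], str]     # makes the side effect real
    discard: Callable[[], None]   # drops it without touching the world


def prepare_email(outbox, to, body):
    draft = {"to": to, "body": body}  # composed, but not yet sent

    def commit():
        outbox.append(draft)
        return f"sent to {to}"

    def discard():
        pass  # nothing escaped, so there is nothing to clean up

    return Prepared(f"email to {to}", commit, discard)


def execute(prepared, still_authorized):
    # Phase two runs only if authorization still holds at this moment.
    if not still_authorized():
        prepared.discard()
        return None
    return prepared.commit()
```

The check and the commit are not atomic in this sketch, so a cancel that arrives between them still loses the race. The window shrinks from the length of the whole tool call to the length of the commit, which is the practical gain.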
Forward Plans Need Backward Plans
The saga pattern from distributed transactions has been the answer to "what happens when a sequence of side effects fails halfway" for more than three decades. Agents are sagas, whether their authors realize it or not. The standard saga discipline applies: when a step has a side effect, define the compensating action that semantically undoes it before you ship that step. Refund the payment. Cancel the calendar invite. Recall the email if the protocol supports recall. Mark the database record reverted with a tombstone row, because the original write is still in the audit log.
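The saga discipline described above can be sketched in a few lines. This is a simplified illustration with a hypothetical run_saga helper: each step is a (name, action, compensate) tuple, cancellation is polled between steps, and completed steps are compensated in reverse order. It assumes compensations themselves succeed, which a production saga cannot assume.

```python
def run_saga(steps, cancelled):
    """steps: list of (name, action, compensate); cancelled: () -> bool."""
    completed = []
    log = []
    for name, action, compensate in steps:
        if cancelled():
            break  # stop issuing new side effects
        action()
        log.append(f"did {name}")
        completed.append((name, compensate))
    else:
        return log  # every step ran; nothing to undo

    # Cancelled partway: undo what landed, newest first.
    for name, compensate in reversed(completed):
        compensate()
        log.append(f"undid {name}")
    return log
```

Reverse order matters because later steps often depend on earlier ones: the calendar invite references the payment, so the invite is withdrawn before the payment is refunded.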
The harder, agent-specific point is that compensation cannot always be derived after the fact. A planner that can call any of fifty tools cannot generically know how to undo any of fifty tools. The undo plan has to be authored by the human integrating the tool, not generated by the model. The architectural pattern that works in production is to register, alongside each tool, a paired compensating action and a classification of reversibility.
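A registration step along these lines could enforce that pairing at integration time. The three reversibility classes below are illustrative assumptions, not the article's own taxonomy, and ToolSpec, Reversibility, REGISTRY, and register are hypothetical names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Reversibility(Enum):
    # Assumed classes, for illustration only.
    REVERSIBLE = "reversible"      # the compensating action fully undoes it
    COMPENSABLE = "compensable"    # can be offset, but leaves a trace
    IRREVERSIBLE = "irreversible"  # no undo exists once it lands


@dataclass(frozen=True)
class ToolSpec:
    name: str
    run: Callable[..., object]
    reversibility: Reversibility
    compensate: Optional[Callable[..., object]] = None


REGISTRY = {}


def register(spec):
    """Refuse tools that claim to be undoable without saying how."""
    needs_undo = spec.reversibility is not Reversibility.IRREVERSIBLE
    if needs_undo and spec.compensate is None:
        raise ValueError(f"{spec.name}: a compensating action is required")
    REGISTRY[spec.name] = spec
    return spec
```

Because the check runs when the tool is registered, a missing undo plan surfaces while a human is integrating the tool rather than during a cancelled run.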
