Skip to main content

The Budget Cap That Fires After the Action Already Shipped

· 9 min read
Tian Pan
Software Engineer

A single power user burns through your monthly token budget by 9am on day three. The kill-switch fires correctly — the gateway returns 429, the model calls stop, the bill flatlines. Meanwhile the agent has already booked the flight, sent the email confirmation, and closed the support ticket as resolved. The dashboard says "spend halted." The user says "why did you charge me for a trip I never asked for." Both are right. The budget cap stopped the model from thinking. It did not stop the world from changing.

This is the failure mode that almost every agent budget guardrail ships with: the cap is a signal in the spend plane, but the damage lives in the action plane, and the two planes were wired up with no shared transaction boundary. Telling the model to stop is not the same as telling the world to undo what the model just did.

The pattern is so consistent across teams that you can predict the postmortem before reading it. The runaway is detected. The kill-switch is praised for firing fast. The customer support queue then fills with refund requests for actions that the kill-switch was technically powerless to prevent, because they happened in the half-second between the last tool call and the cap evaluation. "We stopped spending" gets shipped as the win. "We stopped acting" was never actually in scope.

The Two Planes Were Never the Same Loop

Most agent runtimes evolved their cost controls and their tool-execution path independently. Cost controls live at the gateway: token counters, per-user quotas, budget envelopes, throttling. Tool execution lives at the agent runtime: function calls, HTTP requests, database writes, third-party API calls. The gateway sees model traffic. The runtime sees side effects. Neither sees both, and neither owns the question "is this next action recoverable if we stop now."

The result is a check ordering that reads correctly on paper and fails in practice. The flow looks like: model call → response with tool call → gateway counts tokens → tool executes → next model call → gateway checks budget → kill. The cap fires between model calls. It does not fire between tool calls. So a chain that emits one heavy reasoning response followed by three quick tool invocations will execute all three tool calls before the budget check has a chance to weigh in.

Worse, the tool calls in that chain are often the irreversible ones. The model thinks expensively, decides what to do, then issues cheap actions — send_email, confirm_booking, post_message, close_ticket. The budget signal arrives after the cheap actions, because the cheap actions are exactly the ones the budget did not flag as worth pausing for. Cost-as-control breaks because cost was never a good proxy for impact.

Cost Is Not Impact

Treating tokens as the universal currency of agent risk is the deeper error. A 100-token call to delete_customer_record is a regulatory incident. A 50,000-token reasoning trace that resolves a question without touching any tool is free of external consequence. Budgeting by token, then, conflates two things that have almost no causal relationship — how much the model thought, and how much the world will remember the model thinking.

Once you see the gap, the architectural move is to separate the budget into two distinct accounts: a token account and an action account. The token account is what gateways already measure. The action account is what nobody is measuring, but should be: a per-action "blast radius" weight that captures whether the side effect is internal or external, reversible or irreversible, idempotent or one-shot, soft-state or hard-commit.

A pragmatic taxonomy that works in production:

  • Free: pure model inference with no tool calls, sandboxed reads, dry-run queries that the tool itself flags as preview.
  • Cheap: writes to internal stores that the agent owns and can undo, writes that are idempotent with a known compensating action, sends to staging or test channels.
  • Expensive: writes to systems of record, customer-visible communications, money movement, third-party API calls that mutate external state.
  • One-way: actions that cannot be compensated even in principle — a deleted record without a tombstone, an email that has been read, a payment captured to a card that is now closed, a real-world event like a printed shipping label or a dispatched courier.

The budget cap, once you have this account, no longer fires against tokens alone. It fires against the projected one-way action count for the rest of the session. If the user is at 80% of their token budget and the agent's next planned step is send_external_email, the cap should reject the step regardless of token cost. If the user is at 99% of their token budget and the next step is a sandboxed search, the cap should let it through.

Loading…
References:Let's stay in touch and Follow me for more thoughts and updates