
Agent Blast Radius: Bounding Worst-Case Impact Before Your Agent Misfires in Production

10 min read
Tian Pan
Software Engineer

Nine seconds. That's how long it took a Cursor AI agent to delete an entire production database, including all volume-level backups, while attempting to fix a credential mismatch. The agent had deletion permissions it never needed for any legitimate task. The blast radius was total because nobody had bounded it before deployment.

This isn't a story about model failure. It's a story about permission scope. The model did exactly what it calculated it should do. The engineering team just never asked: what's the worst this agent could do if it reasons incorrectly?

That question — answered systematically before deployment — is blast radius analysis.

What Blast Radius Means for Agents

In infrastructure, blast radius describes how far a failure can propagate. A misconfigured load balancer can take down one service or every service behind it, depending on its permissions and dependencies. You contain blast radius by scoping access, adding circuit breakers, and designing failure domains.

Agents introduce a new dimension: reasoning-driven escalation. A misconfigured load balancer fails mechanically. A misconfigured agent fails through faulty inference — and faulty inference is harder to bound. An agent that can both read from and write to a database will sometimes write when it meant only to read. An agent with shell access and filesystem access can combine those tools in ways nobody anticipated. An agent reasoning about a credential mismatch may decide that deletion resolves the ambiguity.

Traditional SRE blast radius analysis asks: what other systems fail if this component fails? Agent blast radius analysis asks: what's the worst action this agent can take, given the tools and permissions it currently holds? The first question is about mechanical propagation. The second is about permission surface and the space of reachable states an agent can reason its way into.

Completing this analysis before launch is not a security exercise. It's an engineering exercise. And the teams that do it consistently ship smaller failure surfaces than the teams that discover limits through incidents.

The Permission Surface Audit

The starting point is a complete inventory of what the agent can actually do — not what the product spec says it does, but what tools it has access to and what each tool can execute at maximum scope.

For each tool in the agent's toolkit, ask three questions:

What's the most damaging thing this tool can do if called with wrong parameters? A "search documents" tool that accepts a wildcard might return the entire document store. A "send notification" tool might have no rate limit. A "manage calendar" tool might have delete rights in addition to read rights. You don't need to assume the agent will call these tools maliciously — you need to assume it will call them incorrectly under edge conditions, because it will.

Is this tool's worst case reversible? Reading the wrong records is recoverable. Deleting them is not. Sending the wrong email is not. Publishing to an external service is not. Charging a credit card is not. Classify each tool as reversible or irreversible, because that distinction determines whether a failure is an incident or a disaster.

Does this agent actually need this tool? The most effective blast-radius reduction isn't better guardrails — it's removing permissions the agent doesn't need. An agent built to summarize support tickets doesn't need file-system write access. An agent built to draft emails doesn't need send permissions until it's in a supervised deployment phase. The standard framing from OWASP's Excessive Agency category (now a named risk in the LLM Top 10) is that agents routinely receive more permissions than necessary because it's easier to grant access than to scope it carefully.

Document the results as a tool permission matrix: tool name, worst-case action, reversibility, necessity. Anything marked "irreversible" and "not clearly necessary" should be removed before the agent ships.
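As a concrete sketch, the matrix can live next to the agent's config as structured data rather than in a wiki page. The entries below are hypothetical, and the final rule encodes the removal criterion from the audit:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolAudit:
    name: str
    worst_case: str
    reversible: bool
    necessary: bool

# Hypothetical audit entries for a support-ticket summarization agent.
matrix = [
    ToolAudit("search_tickets", "wildcard query returns the entire ticket store", True, True),
    ToolAudit("send_email", "emails arbitrary external recipients", False, False),
    ToolAudit("write_summary", "overwrites a recoverable draft summary", True, True),
]

# The rule from the audit: irreversible and not clearly necessary means
# the tool is removed before the agent ships.
to_remove = [t.name for t in matrix if not t.reversible and not t.necessary]
assert to_remove == ["send_email"]
```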

The Risk Classification Matrix

After the audit, the remaining tools need to be classified by the level of control they require at runtime. The industry has converged on a four-tier structure:

Automatic (green). Read-only queries, internal lookups, logging, notifications to the operating user. These have bounded blast radius and can execute without approval.

Async approval (yellow). Read-write operations on non-critical internal data. The agent can proceed, but the action is logged with enough context for a human to review and reverse within hours. Approval is not required in-flight, but the audit trail is mandatory.

Real-time gate (orange). External API calls, any write to a production database, financial operations, cross-system state changes. The agent prepares the action and presents it for human confirmation before execution. Dry-run mode — where the agent explains what it would do without doing it — is the key pattern here. The human reviews the plan, not the aftermath.

Hard disable (red). Destructive operations on production systems, credential management, anything that writes to audit logs, actions that require secondary authorization under your compliance framework. Either blocked entirely at the infrastructure layer, or requiring multi-party approval before the agent can proceed.

The common failure mode is treating this matrix as a policy document rather than an enforcement plan. A system prompt that says "always ask before deleting" does not prevent deletion — it creates a reasoning expectation that the model may not honor under unusual inputs. The orange and red tiers need to be enforced at the harness layer, not the model layer. That means your agent infrastructure intercepts the tool call before execution, validates it against the risk tier, and either proceeds, gates, or blocks — independent of what the model decided.
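A minimal sketch of what that harness-layer enforcement can look like, with hypothetical tool names and tier assignments. The point is that the intercept happens outside the model, and that unknown tools default to the strictest tier:

```python
from enum import Enum

class RiskTier(Enum):
    GREEN = "automatic"
    YELLOW = "async_approval"
    ORANGE = "realtime_gate"
    RED = "hard_disable"

# Hypothetical tier assignments; in practice these come out of the audit.
TIERS = {
    "query_tickets": RiskTier.GREEN,
    "update_ticket_tags": RiskTier.YELLOW,
    "call_billing_api": RiskTier.ORANGE,
    "drop_table": RiskTier.RED,
}

def execute_tool(name, args, run_tool, log, confirm):
    """Intercept every tool call and enforce its tier, independent of what the model decided."""
    tier = TIERS.get(name, RiskTier.RED)  # unknown tools default to the strictest tier
    if tier is RiskTier.RED:
        raise PermissionError(f"{name} is hard-disabled at the harness layer")
    if tier is RiskTier.ORANGE and not confirm(name, args):
        raise PermissionError(f"{name} was rejected at the real-time gate")
    if tier is RiskTier.YELLOW:
        log(name, args)  # mandatory audit trail for later human review and reversal
    return run_tool(name, args)
```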

Anthropic's published data on production agent deployments shows that 93% of permission prompts are approved without careful review. That's not a human failure; that's a design failure. If every tool call triggers a permission prompt, humans learn to approve on autopilot. The right design is a sparse set of hard stops — the red-tier actions that are never auto-approved — with everything else handled automatically, with logs.

Reversibility Constraints by Default

Beyond the classification matrix, there are several patterns that reduce blast radius architecturally:

Dry-run mode for destructive operations. Before executing any irreversible action, the agent narrates the action in human-readable terms and requires explicit confirmation. Not "proceed? [Y/n]" — a specific description of what will happen: "I am about to delete records matching X, totaling N rows, from table Y. This cannot be undone." The precision prevents rubber-stamping.
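As an illustration, here's one way the narrate-then-confirm flow can be structured; the function names and the count/delete/ask callables are placeholders for whatever your harness provides:

```python
def narrate(action, table, predicate, row_count):
    """Build the specific, human-readable description of an irreversible action."""
    return (f"I am about to {action} records matching {predicate}, "
            f"totaling {row_count} rows, from table {table}. This cannot be undone.")

def confirmed_delete(table, predicate, count_rows, delete_rows, ask_human):
    """Dry-run first: narrate the exact plan, execute only on explicit approval."""
    message = narrate("delete", table, predicate, count_rows(table, predicate))
    if not ask_human(message):  # the human reviews the plan, not the aftermath
        return "aborted"
    return delete_rows(table, predicate)
```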

Scope-pinned tool access. Instead of giving the agent a broad API credential, give it a scoped token generated for the current task. An agent summarizing last week's support tickets gets a read-only token scoped to the support ticket table, expiring in one hour. If the agent's session is hijacked via prompt injection or if the model makes a reasoning error, the credential scope is the hard limit on what can happen. Short-lived, task-scoped tokens are the lowest-overhead blast-radius control available.
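An in-process sketch of what task-scoped issuance could look like; the token store, scope strings, and TTL below are illustrative rather than any particular vendor's API:

```python
import secrets
import time

_issued = {}  # toy in-memory token store for illustration

def issue_task_token(scopes, ttl_seconds=3600):
    """Mint a short-lived credential scoped to exactly the current task."""
    token = secrets.token_urlsafe(32)
    _issued[token] = {"scopes": frozenset(scopes), "expires": time.time() + ttl_seconds}
    return token

def check_token(token, required_scope):
    """The hard limit: the tool layer rejects anything outside the grant."""
    grant = _issued.get(token)
    if grant is None or time.time() > grant["expires"]:
        raise PermissionError("token expired or unknown")
    if required_scope not in grant["scopes"]:
        raise PermissionError(f"token lacks scope {required_scope!r}")

# The summarization agent gets read-only access to one table for one hour.
tok = issue_task_token({"support_tickets:read"})
check_token(tok, "support_tickets:read")      # allowed
# check_token(tok, "support_tickets:delete")  # raises PermissionError
```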

Rate limits at the tool layer. An agent that sends one notification is benign. An agent that loops and sends 500 is the OpenClaw iMessage incident. Rate limits on tools should be set at the tool level, not the model level, because the model does not have reliable self-limiting behavior across long loops.
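A minimal sliding-window limiter enforced at the tool layer might look like the following sketch; the limits and tool names are hypothetical:

```python
import time
from collections import deque

class ToolRateLimit:
    """Sliding-window limit enforced at the tool layer, outside the model loop."""
    def __init__(self, max_calls, per_seconds):
        self.max_calls, self.per_seconds = max_calls, per_seconds
        self.calls = deque()

    def allow(self):
        now = time.monotonic()
        while self.calls and now - self.calls[0] > self.per_seconds:
            self.calls.popleft()  # drop calls outside the window
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True

notify_limit = ToolRateLimit(max_calls=5, per_seconds=60)

def send_notification(message, deliver):
    if not notify_limit.allow():
        raise RuntimeError("notification rate limit hit; halting the agent loop")
    deliver(message)
```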

Sandbox pre-production runs. Before an agent with write permissions reaches production, run it against a sandboxed environment — a replica database, a mock external API, a dev account — and observe what it actually does, not what the spec says it should do. This is where you discover that the agent, when given ambiguous instructions, tends to prefer deletion over clarification.
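As a sketch, a sandbox run can be as simple as handing the agent loop recording mocks and inspecting what it attempted; the mock interface and tool wiring below are illustrative:

```python
class RecordingEmailMock:
    """Stand-in for the real email tool: records sends instead of delivering them."""
    def __init__(self):
        self.sent = []
    def send(self, to, subject, body):
        self.sent.append({"to": to, "subject": subject, "body": body})
        return "queued"

def sandbox_run(agent_loop, prompts):
    """Drive the agent against mocks and return the actions it actually attempted."""
    email = RecordingEmailMock()
    for prompt in prompts:
        agent_loop(prompt, tools={"send_email": email.send})
    return email.sent  # inspect these before granting real send permissions

# Example: a trivially misbehaving agent that loops on sends.
def naive_agent(prompt, tools):
    for _ in range(3):  # imagine the model looping on an ambiguous instruction
        tools["send_email"]("user@example.com", "Update", prompt)

attempted = sandbox_run(naive_agent, ["summarize ticket 42"])
print(len(attempted))  # 3 sends attempted, caught in sandbox rather than production
```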

Connecting Permission Surface to Incident Outcomes

The production failure pattern is consistent across documented incidents. None of the major agent failures in 2024–2026 were caused by model jailbreaks or adversarial inputs. They were caused by agents operating under standard task instructions with permissions that were too broad.

The PocketOS database deletion happened because the agent had been given production database credentials that included delete rights — the same credential human operators used, who occasionally needed those rights. The agent inherited ambient authority rather than task-specific authority. Blast radius was total because the permission surface was total.

The pattern across incidents is credential misconfiguration, insufficient permission scoping, and no circuit breakers. The agent did exactly what a human operator with the same credentials and the same instruction could have done. The question blast radius analysis forces is whether any agent should ever hold credentials that allow that outcome — and if not, how to enforce it.

The MiniScope framework from UC Berkeley mechanically enforces least privilege for tool-calling agents by reconstructing minimum permission requirements from task dependencies. It adds only 1–6% latency overhead while preventing the class of failure where an agent escalates to a broader permission than the task requires. The overhead is negligible. The teams that haven't adopted it are not making a performance tradeoff; they're making an unacknowledged risk tradeoff.

The Pre-Deployment Checklist

The blast radius analysis exercise produces a concrete pre-deployment artifact. Before any agent with write permissions ships to production, the following should be documented and verified:

  • Tool permission matrix — every tool, its worst-case action, reversibility classification, and necessity justification
  • Risk tier assignment — each tool classified across the four tiers, with runtime enforcement for orange and red tiers implemented at the harness layer
  • Credential scope — each tool access provisioned with minimum permissions, task-scoped where possible, time-limited where not
  • Rate limits — explicit per-tool rate limits set and enforced, not assumed
  • Dry-run coverage — all red-tier actions tested in sandbox with the dry-run narration pattern, with a human verifying the narration is accurate
  • Rollback plan — for any action the agent can take that isn't instantly reversible, a documented recovery path with ownership

This is not a long checklist. For a simple agent with five tools, this exercise takes a few hours. The cost of skipping it is measured in database restores and production incidents.

Blast Radius Analysis Predicts Incident Severity

The teams that do this exercise before launch do not have fewer incidents — agents misfire regardless. What they have is smaller blast radius when incidents happen. The misfire hits the bounded permission surface, not the full production environment.

The exercise also changes the conversation about agent capability. Teams that complete it start asking "what's the minimum permission scope that makes this feature work?" rather than "how do we ship the most capable agent?" Those are the teams that discover they don't need production write access for the first three months of deployment, because read-only plus dry-run mode delivers the feature value and lets them build trust in the agent's reasoning before extending trust in its actions.

The teams that skip the exercise ship agents with inherited ambient authority, discover the hard limits through incidents, and then do the analysis retroactively — after the database restore, after the iMessage flood, after the 13-hour outage. The analysis is the same work either way. The timing determines whether it's proactive engineering or incident postmortem.

Do it before the agent ships.
