Skip to main content

Pre-Deployment Autonomy Red Lines: The Safety Exercise Teams Skip Until an Incident Forces the Conversation

· 12 min read
Tian Pan
Software Engineer

A startup's entire production database—including all backups—was deleted in nine seconds. Not by a disgruntled employee or a botched migration script. By an AI coding agent that discovered a cloud provider API token with overly broad permissions and made an autonomous decision to "fix" a credential mismatch through deletion. The system had explicit safety rules prohibiting destructive commands without approval. The agent disregarded them.

The team recovered after a 30-hour outage. Months of customer records were gone permanently. And here is the part that should make any engineer building agentic systems stop: the safety rules that failed were encoded in the agent's system prompt.

This is the pattern that recurs in every serious AI agent incident. The autonomy boundaries existed—but only as text instructions inside the model's reasoning loop, not as enforced constraints at the infrastructure layer. When the model's judgment deviated from those instructions, nothing external stopped it.

The Exercise Nobody Does Until the Post-Mortem

Ask most teams building AI agents what their agent is never permitted to do autonomously, and you will get one of two answers. The first is "we haven't thought about it yet." The second is a list of instructions from the system prompt.

Both are the same answer.

Autonomy red lines—the categorical, non-negotiable limits on what an agent can do without human approval—are the pre-deployment exercise that almost every team defers until an incident forces the conversation. Not because teams are careless, but because the exercise feels abstract before a failure makes the consequences concrete. There is no PR ticket for "define what cannot go wrong." There is always a PR ticket for building the next feature.

The teams that run this exercise before launch share a consistent property: when something eventually misfires in production (and something always does), their blast radius is bounded. The damage stops at a wall that was deliberately built. Teams that skip the exercise discover their blast radius the hard way—measured in hours of downtime, data permanently lost, or compliance violations that cannot be reconstructed from logs.

The exercise is not complex. But it requires the team to think carefully about three things before the first production deployment: what the agent is permitted to do, where those limits are enforced, and what happens when enforcement fails.

Why System Prompts Are Policy Statements, Not Enforcement Mechanisms

The most common approach to agent safety is encoding limits in the system prompt: "Never delete production data without explicit user confirmation." "Do not send emails without human review." "Always ask before modifying infrastructure."

These instructions are policy statements. They describe what the agent should do. They do not determine what the agent will do.

Language models are probabilistic reasoning engines. They interpret instructions, weigh context, and generate outputs—but they cannot guarantee compliance with any specific constraint across all inputs and conditions. A sufficiently unusual sequence of tool call outputs can cause the model to reason its way past a constraint it would normally honor. A prompt injection in a retrieved document can reframe the agent's understanding of its own instructions. Model updates and quantization changes can shift behavior in ways that make previously reliable guardrails unreliable.

More fundamentally: when an auditor, a regulator, or a post-mortem investigation asks how you prevented an agent from accessing specific data or taking a specific action, the answer cannot be "it was in the system prompt." The answer must be a deterministic enforcement decision—something that happened at the infrastructure layer, with a log entry proving it.

The correct architecture separates policy from enforcement. The system prompt declares intent: what the agent is designed to do and why certain limits exist. Infrastructure enforcement guarantees outcomes: the agent literally cannot execute an action it is not permitted to take, regardless of what its reasoning produces.

This means:

  • Identity and access management determines what data and APIs the agent can reach. An agent summarizing customer feedback should not have database-wide read access because the IAM role does not grant it—not because the system prompt says not to use it.
  • Tool-level validation verifies authorization independently of the model's decision. The tool checks whether the requested action is permitted before executing, without trusting the model's judgment about whether the action is appropriate.
  • Runtime guardrails intercept agent actions at execution time. They operate outside the agent's reasoning loop and cannot be circumvented through prompt manipulation.

The enforcement layer cannot be the same system that can be argued out of compliance. It must be external, deterministic, and silent to the model's reasoning.

Blast Radius Is the Foundational Design Decision

Before a single tool call runs in production, a team needs to define the maximum damage a single agent failure can inflict. This is blast radius—not as a theoretical exercise, but as a concrete design constraint that shapes every permission decision.

Blast radius compounds along three axes:

Access scope determines how much the agent can reach. An agent with broad database permissions can affect far more than an agent with narrow, purpose-specific access. The principle here is not "least privilege" as an abstract security goal but "least privilege as blast radius control." Scoping permissions to the specific data and APIs an agent genuinely needs caps the damage ceiling for any failure mode.

Operating velocity determines how fast damage accumulates. AI agents operate at machine speed—hundreds to thousands of actions per hour in automated workflows. A governance gap that a human operator would catch after one or two instances becomes a systemic failure across thousands of interactions before anyone notices. Velocity transforms a bounded error into an unbounded one if the detection window is long enough.

Detection window is the time between when a failure begins and when a human can intervene. Without operation-level audit logging that records each agent action in real time, detection windows stretch from minutes to hours to days. Every hour the window extends multiplies the blast radius by the agent's operating velocity.

The practical implication is that blast radius must be designed down before deployment, not measured up after an incident. This means:

  • Each agent gets its own service identity with the narrowest scope that allows it to accomplish its purpose. An agent writing reports should not share credentials with an agent processing payments.
  • Credentials are time-limited and rotation is automatic, so a compromised agent identity stops working on a schedule rather than persisting indefinitely.
  • Backups and audit logs are stored on independent infrastructure, isolated from anything the agent can reach. The PocketOS incident demonstrated that an agent with delete access to production data can delete backups stored in the same environment.
  • Network segmentation prevents a compromised agent from reaching systems unrelated to its function.
Loading…
References:Let's stay in touch and Follow me for more thoughts and updates