The Coding Agent Autonomy Curve: Reading Is Free, Merging Is Incident-Class
The discourse on coding agents keeps collapsing to a binary: autonomous or supervised, YOLO mode or hand-on-the-wheel, --dangerously-skip-permissions or "approve every keystroke." That framing is a category error. A coding agent does not perform "an action." It performs a sequence of actions whose costs span at least seven orders of magnitude — from reading a file (free, undoable, no side effect) to merging to main (irreversible without a revert PR) to rolling out a binary to a fleet (six-figure incident-class). Treating that range with one autonomy switch is like setting a single speed limit for both a parking lot and a freeway.
The team that ships "the agent can do everything" without mapping each action to its blast radius is one prompt-injection-bearing GitHub comment away from a postmortem — and we already have public examples of that exact failure mode. Anthropic's Claude Code Security Review, Google's Gemini CLI Action, and GitHub Copilot Agent were all confirmed in 2026 to be hijackable through specially crafted PR titles and issue bodies, in an attack pattern the researchers named "Comment and Control." The agents weren't broken in some abstract sense. They executed a high-tier action — pushing code, opening a PR — on the basis of a low-trust input the autonomy tier had silently flattened into "all the same."
What follows is the discipline that has to land: a per-action curve, gates that scale with the tier, rollback velocity matched to blast class, and an eval program that tests for tool-composition escalation rather than single-action failure.
The Action Ladder Nobody Draws
Start by listing every tool a coding agent can call and labeling it with a blast-radius class. The ladder is roughly seven rungs:
- Tier 0 — Read. File reads, grep, AST queries, gh pr view, log fetches. No side effects, no token cost beyond inference, infinitely undoable. This is where 80% of agent traffic should live.
- Tier 1 — Sandboxed compute. Running tests inside a container, executing a one-off script in an ephemeral VM, running a linter. Side effects exist but are walled off. Cost: minutes of CI time.
- Tier 2 — Branch-local writes. Editing a file in a feature branch, creating commits, pushing the branch. Bounded by the branch — blast contained until human action moves it forward.
- Tier 3 — Reviewable proposals. Opening a PR, leaving a review comment, suggesting a change. Reaches humans, but humans gate the next step.
- Tier 4 — Main-line writes. Merging to main, closing issues, modifying GitHub Actions, editing CODEOWNERS. Touches shared state. Reversible only with a revert PR and the wall-clock latency of CI.
- Tier 5 — Production deploys. Promoting an artifact, running a database migration, flipping a feature flag in prod. Customer-visible. Reversible only with a tested rollback.
- Tier 6 — Fleet-class. Rolling a binary to N production hosts, broadcast cache invalidation, mass DNS or routing changes, anything that hits "everywhere at once." Recovery is measured in incident response, not commits.
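Written as data, the ladder stops being a diagram and becomes something a platform can enforce. A minimal sketch in Python, where the tier names and the registry shape are this article's illustration rather than any real platform's API:

```python
from enum import IntEnum

class Tier(IntEnum):
    READ = 0          # file reads, grep, AST queries, gh pr view
    SANDBOX = 1       # tests in a container, ephemeral VMs, linters
    BRANCH_WRITE = 2  # commits and pushes on a feature branch
    PROPOSAL = 3      # opening PRs, review comments
    MAINLINE = 4      # merge to main, CI config, CODEOWNERS
    DEPLOY = 5        # prod deploys, migrations, flag flips
    FLEET = 6         # fleet rollouts, mass invalidation, DNS

# Hypothetical tool registry: every tool the agent can call
# gets exactly one tier label. Unlabeled tools don't run.
TOOL_TIERS: dict[str, Tier] = {
    "read_file": Tier.READ,
    "run_tests": Tier.SANDBOX,
    "push_branch": Tier.BRANCH_WRITE,
    "open_pr": Tier.PROPOSAL,
    "merge_pr": Tier.MAINLINE,
    "deploy_artifact": Tier.DEPLOY,
}

def tier_of(tool: str) -> Tier:
    """Fail closed: an unregistered tool has no tier, so it can't run."""
    if tool not in TOOL_TIERS:
        raise PermissionError(f"tool {tool!r} has no tier label; refusing")
    return TOOL_TIERS[tool]
```

The fail-closed default is the point: the tool nobody bothered to label is exactly the tool most likely to be mislabeled in practice.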
The point of writing this down is not to memorize seven categories. It is to force the question: which tier does each tool belong to, and what is the appropriate gate for that tier? A coding agent platform that exposes one global "ask the user" toggle has implicitly answered this question with a single value, which means it answered every cell of the matrix wrong except by accident.
Approval Gates That Scale With the Tier
A single global confirmation prompt is approval theater. After 50 rapid-fire "approve / approve / approve" clicks for a Tier-0 file read, the human is reflex-clicking by the time the agent reaches a Tier-4 action — and in user studies of human-in-the-loop systems, this approval fatigue is the dominant failure mode, not malicious bypass. The fix is a tier-aware gate matrix:
- Tier 0–1: Allowlist by default, no prompt. The agent reads files and runs tests at full speed because the cost of asking is higher than the cost of being wrong.
- Tier 2: Lightweight gate — a summary diff in the chat, agent proceeds unless the human objects within a short window. The agent stays in flow.
- Tier 3: Explicit approval, but the artifact under review is the PR itself, which is already a familiar review surface. Reuse existing tooling instead of inventing a new modal.
- Tier 4: Hard gate. Human must affirmatively click and the agent must show a structured plan with the affected files, the rollback command, and the inferred risk class. This is also where the platform should require a justification string that names which user request the action is fulfilling — useful both for audit and for catching prompt-injection-induced drift.
- Tier 5–6: Out-of-band approval. The agent does not push the button. It stages the action, opens a change request in the existing change-management system, and waits for a human-initiated trigger. The agent never holds the production credential at the moment of execution.
Notice what this is not. It is not "more approvals everywhere." Tier 0 traffic should accelerate, not slow down. The gate budget is a finite human-attention resource and the only way to spend it well is to spend it where the cost of being wrong is large. Claude Code's distinction between a built-in safe-tool allowlist (file reads, search, code navigation) and a transcript-classifier gate for shell commands and external tool calls is one production instance of this idea — though even that two-tier split is coarser than what most teams need once an agent is doing real merges.
Rollback Velocity Is a Tier Property
Each tier has a rollback budget. If the agent can produce a Tier-4 outcome in 30 seconds but the recovery path is a 45-minute revert-and-re-deploy, the asymmetry is the whole problem. Two rules follow:
The agent's velocity at a tier must not exceed the fleet's recovery velocity at that tier. If your team cannot revert a merge to main in under a minute, the agent must not be able to merge to main without a delay that gives a human time to abort. This is not a model problem; it is a release-engineering problem that the agent makes visible. Teams discovering this typically find that their pre-agent main branch was held safe by the sociological friction of human PR authors getting tired and going to bed — friction the agent does not have.
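One way to encode the velocity rule as policy: measure the team's actual revert time at a tier, and if it exceeds the instant-revert budget, impose an abort window before the agent's action lands. The numbers and the 2x multiplier below are illustrative policy choices, not a standard:

```python
def required_abort_window_s(revert_time_s: float,
                            max_instant_revert_s: float = 60.0) -> float:
    """If the team can revert within the budget, the agent may act
    immediately; otherwise it must wait long enough for a human to
    abort. The 2x multiplier and 30-minute cap are arbitrary
    illustrative policy, tuned per team in practice."""
    if revert_time_s <= max_instant_revert_s:
        return 0.0
    return min(2 * revert_time_s, 1800.0)
```

The useful side effect is that the function makes slow rollback visible as agent latency, which gives the release-engineering fix an owner.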
Every Tier-4-and-above action must have a paired rollback action of the same tier or lower, and the rollback must be testable in CI. "Press the revert button" is not a rollback if nobody has run it in six months. Coding-agent platforms should require, as part of registering a high-tier tool, that the team also register the inverse and a synthetic test that exercises it. The failure case to design against is the one where the agent confidently writes data into four downstream systems — code, cache, vector embeddings, an audit event — and the rollback only undoes one of them.
Dry-Run as the Default for High-Tier Actions
For Tier 4 and above, the agent's default mode should be a structured dry-run that produces the diff, the affected resources, and the predicted side effects — without executing. The dry-run is the artifact a human reviews. Execution is a second, separate, explicitly-escalated step.
This is the inverse of the typical autonomy slider, which treats execution as the default and confirmation as an opt-in interruption. Inverting the default does two things. First, it forces the agent to articulate intent at the tier where intent is most load-bearing — and a malformed dry-run is a much earlier failure signal than a malformed execution. Second, it makes the prompt-injection blast surface much smaller: the attack has to convince both the model and the reviewing human, instead of just the model.
The implementation pattern that has shown up in 2026 production systems is a two-phase tool: propose_merge (Tier 3, dry-run, returns a structured plan) and execute_merge (Tier 4, requires the plan ID and a separate human-initiated trigger). The model can chain them in a normal session, but the gate sits between the two steps and is shaped like an artifact-review, not an interruption.
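A sketch of that two-phase shape, with propose_merge and execute_merge as illustrative names for the pattern described; the in-memory plan store and approval set stand in for whatever the platform actually persists:

```python
import uuid

_PLANS: dict[str, dict] = {}   # dry-run plans awaiting review
_APPROVED: set[str] = set()    # plan IDs a human has approved

def propose_merge(pr_number: int, files: list[str]) -> dict:
    """Tier 3: produce the structured dry-run artifact. No execution."""
    plan = {
        "id": str(uuid.uuid4()),
        "pr": pr_number,
        "files": files,
        "rollback": f"git revert of the merge commit for PR {pr_number}",
    }
    _PLANS[plan["id"]] = plan
    return plan

def approve_plan(plan_id: str) -> None:
    """The human-initiated trigger that arms execution for one plan."""
    if plan_id not in _PLANS:
        raise KeyError("unknown plan")
    _APPROVED.add(plan_id)

def execute_merge(plan_id: str) -> str:
    """Tier 4: refuses unless this exact plan was human-approved."""
    if plan_id not in _APPROVED:
        raise PermissionError("plan not human-approved; staying at dry-run")
    plan = _PLANS[plan_id]
    return f"merged PR {plan['pr']} per plan {plan_id}"
```

The plan ID is the load-bearing detail: the human approves a specific reviewed artifact, not a category of action, so the model cannot substitute a different merge after approval.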
Eval the Composition, Not the Action
The risk that does not show up in single-tool evals is tool composition. The agent has a Tier 0 read tool. It has a Tier 2 branch-write tool. It has a Tier 4 merge tool. Each looks fine in isolation. The exploit is the chain: the agent reads a file containing prompt-injected instructions, writes those instructions into a commit, opens a PR whose body contains more injections, and an autoreview step authorizes the merge. There are documented cases of compromised agents writing to other agents' configuration files — a hijacked Copilot writes to ~/.mcp.json and CLAUDE.md, the next Claude Code session loads the poisoned config, the chain continues.
The eval discipline that catches this is not "test each tool" but "red-team the whole reachability graph of the action space." The promptfoo-style approach is the right shape: feed adversarial inputs that try to escalate from low-tier reads to high-tier writes through tool composition, and instrument the agent's actual trace (LLM calls, tool selections, gate decisions) so the test fails on the trajectory, not just on the final answer. Single-shot prompt-injection benchmarks miss this because they don't stress the multi-step tool budget. The OWASP figure that 73% of production AI deployments have exploitable prompt-injection vulnerabilities reads less surprising once you realize most of those deployments tested the model and not the action graph.
A useful invariant to put in the eval suite: no action two or more tiers above a tainted input should be reachable without a human in the trace. Concretely, a Tier-0 input must never fan out into a Tier-2-or-higher action unattended. If an adversarial file-read can cascade into a merge, that is a failed test, full stop, regardless of whether the model "should have known better."
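The invariant is mechanical enough to assert directly on an instrumented trace. A sketch, with the event shape invented for illustration rather than taken from any real trace format:

```python
def violates_escalation_invariant(trace: list[dict]) -> bool:
    """trace: ordered events like {"kind": "tool", "tier": 0,
    "tainted": True} or {"kind": "human_approval"}. Returns True if
    any action runs two or more tiers above the lowest-trust tainted
    input seen so far, with no human approval in between."""
    tainted_tier = None        # tier of the lowest-trust input so far
    human_since_taint = False
    for ev in trace:
        if ev["kind"] == "human_approval":
            human_since_taint = True
        elif ev["kind"] == "tool":
            if ev.get("tainted"):
                t = ev["tier"]
                tainted_tier = t if tainted_tier is None else min(tainted_tier, t)
                human_since_taint = False
            if (tainted_tier is not None
                    and ev["tier"] >= tainted_tier + 2
                    and not human_since_taint):
                return True
    return False
```

Run as a red-team eval, the assertion fails on the trajectory, not the final answer, which is exactly the property the single-shot benchmarks lack.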
The Cost Frame: Friction Tax vs Blast Tax
Every approval gate is a latency tax — measured in seconds of human attention and in lost agent flow. Skeptics of tier discipline like to total up that tax and call it the productivity hit of safety. The framing is incomplete, because the alternative is not "no tax." It is "blast tax," paid in incident response, customer trust, regulator attention, and engineer-hours after the fact. The two taxes have very different distributions. The friction tax is small, smooth, and predictable. The blast tax is rare, fat-tailed, and the median engineer underestimates it because they have not yet had to write the postmortem.
The economically useful question is: at what tier does the friction tax exceed the expected blast tax? For Tier 0 and 1, friction wins easily — gates here are pure waste. For Tier 4 and above, blast wins easily — anyone who has ever paged a team at 3 AM because of an unreviewed change knows what an unbudgeted blast event costs. Tier 2 and 3 are where the real design work lives, and they are where teams should be spending their attention budget rather than re-arguing whether the agent should be allowed to read files.
A second cost the friction-vs-flow camp tends to miss: every approval gate is also a prompt-injection circuit-breaker. The "Comment and Control" attack worked partly because the gate matrix was flat. A tiered gate matrix would have stopped the same attack at Tier 2 — the agent could have processed the malicious comment, but it couldn't have escalated to a write without crossing a boundary the attacker doesn't control. Gates aren't just user-experience friction; they are part of the security perimeter.
Continuous Design, Not a Configuration Flag
Coding-agent autonomy is not a setting you choose at install time. It is a continuous design problem with a per-tool answer that has to be re-evaluated whenever the tool catalog changes — which, for any team running an agent in production, is approximately weekly. The teams that get this right do three things: they maintain a tool registry where every tool has a tier label and a paired rollback, they review the gate matrix as a regular cadence rather than an incident postmortem, and they treat a new tool as a security review rather than a feature ship.
The teams that get this wrong tend to share a tell: they describe their agent in marketing copy as "fully autonomous." That phrase, in 2026, mostly means the speaker has not yet drawn the action ladder, has not yet decided what their gate matrix looks like, and has not yet asked their CISO whether the rollback velocity matches the blast velocity. They will, eventually. The interesting question is whether they draw the ladder before or after their first incident.
- https://www.the-main-thread.com/p/ai-coding-tools-2026-java-developers-agents-control
- https://venturebeat.com/security/ai-agent-zero-trust-architecture-audit-credential-isolation-anthropic-nvidia-nemoclaw
- https://medium.com/@deudney/blast-radius-the-most-important-decision-you-make-before-you-build-9d21daef67f1
- https://huggingface.co/blog/Svngoku/agentic-coding-trends-2026
- https://lushbinary.com/blog/ai-agent-security-autonomous-coding-production-guide/
- https://medium.com/ai-plugged/security-by-design-agents-0ffe61a2700e
- https://fazm.ai/blog/limit-blast-radius-compromised-ai-agent
- https://heyvaldemar.com/ai-agent-blast-radius-assessment/
- https://www.securityweek.com/claude-code-gemini-cli-github-copilot-agents-vulnerable-to-prompt-injection-via-comments/
- https://venturebeat.com/security/ai-agent-runtime-security-system-card-audit-comment-and-control-2026
- https://developer.nvidia.com/blog/mitigating-indirect-agents-md-injection-attacks-in-agentic-environments/
- https://thehackernews.com/2026/04/google-patches-antigravity-ide-flaw.html
- https://botmonster.com/posts/ai-coding-agent-insider-threat-prompt-injection-mcp-exploits/
- https://owasp.org/www-community/attacks/PromptInjection
- https://www.promptfoo.dev/docs/red-team/agents/
- https://embracethered.com/blog/posts/2025/cross-agent-privilege-escalation-agents-that-free-each-other/
- https://www.arunbaby.com/ai-security/0001-agent-privilege-escalation-kill-chain/
- https://coder.com/blog/launch-dec-2025-agent-boundaries
- https://github.com/microsoft/agent-governance-toolkit
- https://platform.claude.com/docs/en/agent-sdk/permissions
- https://www.anthropic.com/engineering/claude-code-auto-mode
- https://www.armalo.ai/blog/ai-agent-kill-switch-6-ways
- https://law.stanford.edu/2026/03/07/kill-switches-dont-work-if-the-agent-writes-the-policy-the-berkeley-agentic-ai-profile-through-the-ailccp-lens/
- https://www.pedowitzgroup.com/ai-agent-kill-switches-practical-safeguards-that-work
- https://aipatternbook.com/rollback
- https://machinelearningmastery.com/building-a-human-in-the-loop-approval-gate-for-autonomous-agents/
- https://www.cio.com/article/4129620/agentic-ai-fails-without-an-architecture-of-flow-to-eliminate-the-friction-tax
