Sandboxing Agents That Can Write Code: Least Privilege Is Not Optional
Most teams ship their first code-executing agent with exactly one security control: API key scoping. They give the agent a GitHub token with repo:read and a shell with access to a working directory, and they call it "sandboxed." This is wrong in ways that become obvious only after an incident.
The threat model for an agent that can write and execute code is categorically different from the threat model for a web server or a CLI tool. The attack surface isn't the protocol boundary anymore — it's everything the agent reads. That includes git commits, documentation pages, API responses, database records, and any file it opens. Any of those inputs can contain a prompt injection that turns your research agent into a data exfiltration pipeline.
This post covers the four layers of isolation you actually need — container isolation, filesystem controls, network egress policy, and credential scoping — plus the capability audit process that tells you whether you've done it right. The goal isn't theoretical completeness; it's making your agent a boring lateral movement target instead of an interesting one.
Why API Key Scoping Isn't a Sandbox
The intuitive first defense is to restrict what the agent can call: read-only GitHub token, no delete operations, scoped to a single repository. This addresses direct privilege escalation — an agent with a read-only token cannot push code. But it leaves the most dangerous attack vectors untouched.
In 2025, a widely used open-source AI agent platform had an unauthenticated code-validation endpoint that allowed remote code execution. Before the patch, the only thing preventing exploitation was the network perimeter — not any isolation inside the agent's execution environment. In a separate class of incidents, attackers fed reconciliation agents wildcard queries phrased as routine business tasks; the agents found those requests semantically reasonable and executed them, exporting thousands of customer records. Neither of these would have been stopped by tighter API key scopes.
The underlying issue is that your agent is not just executing your code. The LLM generates code at runtime. You control the policy — what the agent is allowed to do — but not the specific execution path. If a malicious instruction reaches the agent through any input channel, and the agent has a general-purpose shell with no additional isolation, the instruction runs. The API key only gates what the agent can call over authenticated APIs; it doesn't gate what the generated code can do on the host.
A properly sandboxed agent has four distinct isolation layers. Weaken any one of them and you're relying on the others to compensate.
Layer 1: Container Isolation — Choose Your Kernel Boundary
The first question is what kernel boundary sits between the agent's code and your host infrastructure. There are four practical options, each with different tradeoff profiles.
Plain containers (Docker/runc) share the host kernel. A syscall that escapes the container namespace lands directly on the host. Three runc CVEs in 2025 demonstrated mount-race conditions that allowed writes to protected host paths from inside a container. Plain containers are appropriate only if layered with strict seccomp and AppArmor profiles — and even then, a kernel vulnerability is a single bug away from full host access.
gVisor interposes a user-space kernel between the container and the host kernel. All syscalls from the sandbox go through gVisor's "Sentry" before reaching the real kernel. This eliminates most kernel escape paths at the cost of 10–20% I/O overhead, higher on syscall-heavy workloads. gVisor is a good fit for general Python workloads where the performance tax is acceptable and you're running on GKE or a compatible Kubernetes environment.
Firecracker microVMs give each sandbox its own Linux kernel on top of KVM hardware virtualization. Escaping the sandbox now requires a hypervisor escape, not just a syscall escape — a substantially higher bar. Cold starts are around 125ms; memory overhead is roughly 5MB per instance. E2B runs every agent sandbox in its own Firecracker VM, and its growth from 40K sessions/month to 15M sessions/month in one year is evidence that the operational overhead is manageable. Firecracker is the right choice if you're running at scale and need the strongest isolation without going to bare metal.
WebAssembly (WASM/WASI) offers near-zero cold starts and formally verified memory safety, but Python scientific computing — NumPy, PyTorch, most ML tooling — has incomplete or no WASI support today. WASM is viable for narrow, pure-computation tasks but not for general agent workloads.
The practical baseline: use Firecracker or gVisor as your kernel boundary, and layer seccomp-BPF on top regardless. A well-tuned seccomp profile blocks ptrace, mount, pivot_root, and raw socket creation. A seccomp violation kills the process even inside a microVM, giving you defense in depth if the kernel boundary fails.
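As a concrete sketch, a profile like the one described above can be expressed in Docker's seccomp JSON format. The snippet below builds a deny-list profile in Python for illustration; the syscall names are the ones called out in this post, not an exhaustive set, and a production profile should invert this into an allowlist (default kill, permit only what the workload needs):

```python
import json

def build_seccomp_profile():
    """Docker-compatible seccomp profile: allow by default, kill the
    process on the dangerous syscalls named above. Illustrative only;
    production profiles should be allowlists."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {
                # Debugging / mount / namespace-escape syscalls: kill outright.
                "names": ["ptrace", "mount", "umount2", "pivot_root", "kexec_load"],
                "action": "SCMP_ACT_KILL_PROCESS",
            },
            {
                # Raw packet sockets only: socket(AF_PACKET, ...) is killed,
                # while ordinary TCP/UDP sockets still work. AF_PACKET == 17.
                "names": ["socket"],
                "action": "SCMP_ACT_KILL_PROCESS",
                "args": [{"index": 0, "value": 17, "op": "SCMP_CMP_EQ"}],
            },
        ],
    }

if __name__ == "__main__":
    # Write the profile out and pass it to the runtime, e.g.:
    #   docker run --security-opt seccomp=agent-seccomp.json ...
    print(json.dumps(build_seccomp_profile(), indent=2))
```

The same profile applies unchanged whether the container runs under runc, gVisor, or inside a microVM, which is what makes it useful as the defense-in-depth layer.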
Layer 2: Filesystem Namespacing — Every Path Access Is a Decision
The default filesystem layout in a container is far too permissive for an agent that generates and runs code. The generated code can traverse directories the agent was never intended to access, find credentials cached to disk, read configuration files, or write persistence mechanisms that survive sandbox teardown.
Linux gives you the tools to fix this precisely. Mount namespaces let you give the sandbox its own filesystem view, separate from the host. Landlock (available since Linux 5.13) provides capability-based filesystem access control at the kernel level: you can restrict the agent process to specific directory trees with specific access modes, and that restriction propagates to all child processes spawned by agent-generated code.
The correct filesystem layout for an agent sandbox:
- Input data and reference files: mounted read-only. The agent can read them; generated code cannot modify them.
- Workspace output directory: read-write, scoped to a session-specific path (/workspace/{session-id}/). Nothing outside this directory is writable.
- /tmp: ephemerally mounted, cleared on sandbox teardown. No persistence across sessions.
- System directories: inaccessible via mount namespace — the agent doesn't need to see /etc, /var, or /home.
- Secrets and API keys: injected as environment variables at runtime via a secrets manager. Never written to disk in the workspace where the agent's file tools can read them.
The secrets-on-disk problem deserves emphasis. An MCP server study in 2025 found that 82% of tested servers were vulnerable to path traversal when filesystem permissions weren't scoped to specific paths. If your agent has a file-reading tool and your API keys live anywhere on the host filesystem, a sufficiently clever prompt can find them. The fix is not better prompting — it's mounting secrets as in-memory environment variables and ensuring the agent's mount namespace has no access to paths outside the workspace.
Critically: block all writes outside the workspace directory as a hard kernel-level control, not a guideline. Persistence mechanisms, sandbox escapes, and RCE staging all require writing to a location that survives the session. Remove that capability structurally.
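Landlock and mount namespaces are the real enforcement layer, but the agent's file tools should fail closed too, by refusing traversal before a request ever reaches the kernel. A minimal sketch of that check, using a hypothetical `resolve_in_workspace` helper:

```python
from pathlib import Path

def resolve_in_workspace(workspace: str, requested: str) -> Path:
    """Resolve a path the agent's file tool asked for, refusing anything
    that escapes the session workspace via '..' segments or symlinks."""
    root = Path(workspace).resolve()
    candidate = (root / requested).resolve()
    # After symlink and '..' resolution, the candidate must still
    # sit at or under the workspace root.
    if candidate != root and root not in candidate.parents:
        raise PermissionError(f"path escapes workspace: {requested!r}")
    return candidate
```

The key detail is resolving before comparing: a naive string-prefix check passes `../../etc/passwd` through a symlinked subdirectory, which is exactly the traversal class the MCP server study found.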
Layer 3: Network Egress — Default Deny Is the Only Acceptable Default
An agent with unrestricted outbound network access can exfiltrate data to attacker-controlled infrastructure, establish reverse shells, call external APIs that incur costs, or use DNS-based exfiltration (encoding data in DNS query subdomains). After a prompt injection, unrestricted egress turns a contained incident into a data breach.
The control stack, in order of depth:
Network namespace isolation gives each sandbox its own network stack. Outbound connectivity requires explicit routing rules — there's no default route to the internet.
DNS allowlisting restricts which domains the sandbox can resolve. This cuts off DNS-based exfiltration and prevents the agent from discovering new egress endpoints at runtime by querying public DNS for attacker infrastructure.
HTTP proxy with domain allowlist routes all HTTP/S traffic through an explicit proxy. The sandbox has no direct internet access — it speaks only to the proxy, which enforces an allowlist. Requests to unlisted domains return 403. This is the most operationally convenient control at scale because the allowlist is centrally managed and auditable.
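Stripped to its core, the proxy's decision is a hostname check against the allowlist. A sketch, with illustrative domains (not a recommendation of what to allow):

```python
from urllib.parse import urlsplit

# Hypothetical per-task allowlist; in production this is centrally managed.
ALLOWED_DOMAINS = frozenset({"api.github.com", "pypi.org", "files.pythonhosted.org"})

def egress_decision(url: str, allowed=ALLOWED_DOMAINS) -> int:
    """Return the status the proxy answers with: 200 (forward) or 403 (refuse).
    Subdomains of an allowed domain pass; look-alike suffixes do not."""
    host = (urlsplit(url).hostname or "").lower()
    if any(host == d or host.endswith("." + d) for d in allowed):
        return 200
    return 403
```

Note the exact-match-or-dotted-suffix comparison: a plain `endswith("pypi.org")` would wave through `evilpypi.org`, a classic allowlist-bypass bug.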
Per-task egress policies apply network access only to sandboxes that need it. A code formatter agent needs no network access at all. A web research agent needs outbound HTTP but not PyPI access. Model the required egress surface per task type, not per agent class.
A common anti-pattern: granting broad network access during environment setup (to install packages) and not revoking it afterward. The correct approach is to open egress to package registries (PyPI, npm, apt) during the initialization phase, snapshot the environment, then re-lock egress before the agent begins executing user-directed code. Post-initialization, the agent's sandbox should not be able to reach package registries or any other external endpoint unless the task explicitly requires it.
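The two-phase lifecycle can be made explicit in the orchestrator's policy object. A sketch, with illustrative registry domains and a hypothetical `PhasedEgress` class:

```python
REGISTRY_DOMAINS = {"pypi.org", "files.pythonhosted.org", "registry.npmjs.org"}

class PhasedEgress:
    """Two-phase egress: package registries are reachable only while the
    environment is being initialized, then egress is re-locked before the
    agent runs user-directed code."""

    def __init__(self, task_domains=()):
        self.locked = False
        self.task_domains = set(task_domains)

    def allowed(self, domain: str) -> bool:
        if not self.locked:
            return domain in REGISTRY_DOMAINS or domain in self.task_domains
        # Post-snapshot: registries are gone; only task-declared egress remains.
        return domain in self.task_domains

    def lock(self):
        """Call after snapshotting the initialized environment."""
        self.locked = True
```

Making `lock()` an explicit, audited step in the orchestrator means "forgot to revoke setup-time egress" becomes a detectable state rather than a silent default.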
Layer 4: Credential Scoping — Dynamic Agents Need Dynamic Credentials
Traditional software has static credential requirements. You can audit them at deploy time, scope the credential, and be done. Agent systems call tools based on LLM reasoning at inference time — the exact set of API calls made for a given task cannot be fully predicted. Static credential scoping is necessary but not sufficient.
Three patterns that address the dynamic nature of agent credential use:
Short-lived tokens per session. Any integration the agent makes — MCP server connections, external API calls, database access — should use a token that expires with the session, not a long-lived API key. A leaked 10-minute token has a bounded blast radius; a leaked API key does not.
Just-in-time elevation. The agent starts with a read-only credential set. For operations that require elevated permissions — deletes, publishes, deploys — the agent requests elevation for a specific operation, receives a scoped one-time token valid for that single call, and the elevation is logged as an audit event. This pattern, borrowed from PAM (privileged access management), is unfamiliar in agent contexts but directly applicable.
One credential set per agent. Shared credentials across agents mean that compromising any one agent compromises all of them. The orchestration agent should not hold production API keys for all downstream agents simultaneously. Each agent or plugin gets its own low-privilege credential, issued by the secrets manager at session start.
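The three patterns above compose into a single broker component. The sketch below is a toy, not a specific product's API; the scope strings and TTLs are hypothetical:

```python
import secrets
import time

SESSION_TTL = 600   # 10-minute session tokens; illustrative value
ELEVATION_TTL = 60  # elevation tokens live just long enough for one call

class CredentialBroker:
    """Session-scoped read-only tokens plus just-in-time elevation,
    with every elevation recorded as an audit event."""

    def __init__(self):
        self._tokens = {}  # token -> (scope, expiry, one_time)
        self.audit_log = []

    def issue_session_token(self, agent_id: str) -> str:
        tok = secrets.token_urlsafe(16)
        self._tokens[tok] = ("read-only", time.time() + SESSION_TTL, False)
        return tok

    def elevate(self, agent_id: str, operation: str) -> str:
        """One-time token scoped to a single named operation, audit-logged."""
        tok = secrets.token_urlsafe(16)
        self._tokens[tok] = (operation, time.time() + ELEVATION_TTL, True)
        self.audit_log.append((agent_id, operation))
        return tok

    def authorize(self, tok: str, operation: str) -> bool:
        scope, expiry, one_time = self._tokens.get(tok, (None, 0.0, False))
        if scope is None or time.time() > expiry:
            return False
        if scope == "read-only":
            return operation.startswith("read:")
        if scope != operation:
            return False
        del self._tokens[tok]  # elevation tokens are single use
        return True
```

The property worth testing in a real implementation is the one the sketch enforces: a leaked elevation token is worthless after its one call, and a leaked session token can never perform a write.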
The MiniScope framework (Berkeley, 2025) formalizes automated permission minimization: it constructs a permission hierarchy graph based on relationships between tool calls and solves for the minimal credential set that still allows task completion. The overhead is 1–6% over vanilla tool calling. The practical takeaway is that you can automate the credential scoping decision rather than relying entirely on manual audit — but you still need to review and approve the generated scopes.
The Capability Audit: How Do You Know You Got It Right?
The four layers above are controls. The capability audit is how you verify they're working and that you haven't granted more than necessary.
Static audit: enumerate every tool the agent has access to, regardless of whether it calls them. If you cannot produce this list, you have already violated least privilege. This list is the starting point for the audit — every entry needs a justification for why the agent needs it.
Dynamic audit: run the agent against a representative distribution of production tasks. Log every tool call. Compare the actual call histogram to the declared tool set. Tools that never appear in the call log for any task in the distribution are candidates for removal. Tools that appear rarely are candidates for narrower scoping.
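The static-versus-dynamic comparison reduces to a set diff over the call log. A sketch, with an illustrative rarity threshold:

```python
from collections import Counter

def audit_tool_usage(declared, call_log, rare_threshold=2):
    """Diff the declared tool set against the observed call log.
    Returns (unused, rare, undeclared) buckets."""
    counts = Counter(call_log)
    unused = set(declared) - set(counts)          # removal candidates
    rare = {t for t, n in counts.items()
            if t in declared and n <= rare_threshold}  # narrower-scope candidates
    undeclared = set(counts) - set(declared)      # any entry here is a finding
    return unused, rare, undeclared
```

The `undeclared` bucket should be empty by construction; if your call log contains tools that aren't in the declared set, the audit has found a gap in your tool registration, which is itself the more urgent result.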
Trace-based policy generation: tools like Progent (April 2025) let you express, in a JSON DSL, runtime policies that restrict which tool calls are permissible in which agent states. Policies can be LLM-generated from task descriptions and then human-reviewed. Enforcement is deterministic — not an LLM guardrail, which can be jailbroken — and reduces attack success rates to near zero on standardized benchmarks while preserving task completion rates.
OWASP's least-agency audit criterion for agentic applications frames the test as three questions: Can the agent reach tools beyond its task scope? Do those tools operate with broader privileges than the task requires? Can high-impact irreversible actions proceed without human review? If any answer is "yes," the audit fails. For high-stakes operations, human-in-the-loop approval is a security control, not a UX tradeoff.
Finally, tag each agent with an autonomy level — from fully supervised (every action confirmed) to fully autonomous (no human in the loop). Higher autonomy levels require proportionally stricter sandboxing, more frequent re-audit, and lower thresholds for anomaly alerting. This creates an operational framework for reasoning about which agents need the heaviest isolation and which can run with lighter controls.
The Incident That Doesn't Happen
The concrete value of this stack is that it converts a category of incidents from "possible" to "structurally prevented." An agent running in a Firecracker VM with a namespaced filesystem, default-deny egress, and session-scoped credentials cannot exfiltrate data over the network — the kernel won't route the packets. It cannot write persistence mechanisms outside the workspace — the mount namespace blocks it. It cannot escalate to a production credential it doesn't hold — the secrets manager never issued it one. The prompt injection still reaches the agent; it just runs in a box with no exits.
The remaining attack surface after all four layers is considerably smaller: compromising the sandbox orchestrator itself, exploiting the hypervisor, or manipulating the agent's behavior within its permitted tool scope. These are harder problems that require different mitigations — hypervisor hardening, orchestrator access controls, and behavioral anomaly detection. But at least you're working on the hard problems instead of leaving the easy ones open.
The teams shipping agent systems that handle sensitive workloads have learned this the same way enterprise software learned defense-in-depth in the early 2000s: usually after the first incident large enough to require a post-mortem. The isolation stack exists. The tooling is mature enough to use in production. The only variable is whether you build it before or after you need to explain to someone why your agent became a lateral movement vector.
- https://developer.nvidia.com/blog/how-code-execution-drives-key-risks-in-agentic-ai-systems/
- https://developer.nvidia.com/blog/practical-security-guidance-for-sandboxing-agentic-workflows-and-managing-execution-risk/
- https://cheatsheetseries.owasp.org/cheatsheets/AI_Agent_Security_Cheat_Sheet.html
- https://genai.owasp.org/resource/owasp-top-10-for-agentic-applications-for-2026/
- https://www.softwareseni.com/firecracker-gvisor-containers-and-webassembly-comparing-isolation-technologies-for-ai-agents/
- https://northflank.com/blog/kata-containers-vs-firecracker-vs-gvisor
- https://arxiv.org/abs/2512.11147
- https://arxiv.org/abs/2504.11703
- https://modal.com/blog/top-code-agent-sandbox-products
- https://cloudsecurityalliance.org/blog/2025/02/06/agentic-ai-threat-modeling-framework-maestro
- https://aws.amazon.com/blogs/security/the-agentic-ai-security-scoping-matrix-a-framework-for-securing-autonomous-ai-systems/
- https://auth0.com/blog/access-control-in-the-era-of-ai-agents/
- https://agentnode.net/blog/ai-agent-permission-models-least-privilege
- https://adversa.ai/blog/adversa-ai-unveils-explosive-2025-ai-security-incidents-report-revealing-how-generative-and-agentic-ai-are-already-under-attack/
- https://www.obsidiansecurity.com/blog/cve-2025-34291-critical-account-takeover-and-rce-vulnerability-in-the-langflow-ai-agent-workflow-platform
