3 posts tagged with "code-execution"

Agent Sandboxing and Secure Code Execution: Matching Isolation Depth to Risk

March 9, 2026 · 11 min read

Software Engineer

Most teams shipping LLM agents with code execution capabilities make the same miscalculation: they treat sandboxing as a binary property. Either they skip isolation entirely ("we trust our users") or they deploy Docker containers and consider the problem solved. Neither position survives contact with production.

The reality is that sandboxing exists on a spectrum with five distinct levels, each offering a different isolation guarantee, performance profile, and operational cost. The mismatch between chosen isolation level and actual risk profile is the root cause of most agent security incidents — not the absence of any sandbox at all.

Why Your Agent Should Write Code, Not JSON

January 31, 2026 · 10 min read

Tian Pan

Software Engineer

Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.

There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows a 20% higher task success rate over JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.

Why Your AI Agent Should Write Code Instead of Calling Tools

January 29, 2026 · 11 min read

Tian Pan

Software Engineer

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.

About Tian Pan