
Why Your AI Agent Should Write Code Instead of Calling Tools

Tian Pan · Software Engineer · 11 min read

Most AI agents are expensive because of a subtle architectural mistake: they treat every intermediate result as a message to be fed back into the model. Each tool call becomes a round trip through the LLM's context window, and by the time a moderately complex task completes, you've paid to process the same data five, ten, maybe twenty times. A single 2-hour sales transcript passed between three analysis tools might cost you 50,000 tokens — not for the analysis, just for the routing.

There's a better way. When agents write and execute code rather than calling tools one at a time, intermediate results stay in the execution environment, not the context window. The model sees summaries and filtered outputs, not raw data. The difference isn't incremental — it's been measured at 98–99% token reductions on real workloads.

The Tool-Calling Tax

When you wire an agent to a collection of tools using the standard function-calling pattern, every tool invocation follows the same path: the model generates a structured call, the runtime executes it, the full result flows back into context, the model decides what to do next, and the cycle repeats. This is fine when tool outputs are small. It breaks down badly at scale.

Consider an agent integrated with a CRM, a calendar API, a document store, and a data warehouse. Each connected service might expose dozens of endpoints. Before the agent does any work, those tool definitions — all of them — need to load into the context window. A production deployment with several MCP servers attached can easily consume 150,000 tokens just on tool definitions, before the first real query is processed. At frontier model pricing, that's not a rounding error.

Then there's data amplification. When an agent fetches 10,000 rows from a database to answer a question about five of them, the naive implementation ships all 10,000 rows through the model. The model reads them, extracts the five relevant rows, formats them for the next tool call, and sends them along. The 9,995 irrelevant rows have now been paid for twice.
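The fix is to let the filtering happen where the data lives. A minimal sketch of the contrast, with `fetch_rows` as a made-up stand-in for a real database call:

```python
# Hypothetical sketch: the agent's generated code filters inside the
# execution environment, so only the relevant slice ever reaches the
# model's context. `fetch_rows` stands in for a real DB query.

def fetch_rows():
    # Stand-in for a query returning 10,000 records.
    return [{"id": i, "region": "EU" if i % 2000 == 0 else "US"}
            for i in range(10_000)]

def relevant_rows():
    # Runs in the sandbox: the model never sees the 9,995
    # rows that don't match.
    return [r for r in fetch_rows() if r["region"] == "EU"]

print(len(relevant_rows()))  # only this summary enters model context
```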

The underlying problem is that tool-calling treats the model as the data bus. Everything flows through it because that's how function calling is designed. Code execution breaks that assumption.

Code Is the Native Language of Agents

Here's an observation worth sitting with: programming languages were designed specifically to express operations a computer should perform. JSON was not. When an agent needs to loop over records, filter a dataset, handle errors, or chain operations — all of which are routine in real agentic tasks — it has to encode those intentions into a flat, literal sequence of tool call objects.

Code can express a conditional in a single line. JSON tool-calling requires the model to check a condition, emit a tool call, receive the result, emit another tool call contingent on the first, and so on. The 30% reduction in agent steps that code-executing agents show on benchmarks reflects exactly this: fewer round trips because more work happens per model invocation.
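To make the contrast concrete, here is a loosely sketched snippet an agent might generate, with `lookup_account` and `send_reminder` as hypothetical tool bindings stubbed out for illustration:

```python
# Illustrative only: one generated snippet replaces several tool-call
# round trips. `lookup_account` and `send_reminder` are assumed tool
# bindings, stubbed here with fake data.

def lookup_account(name):
    return {"name": name, "balance": {"acme": -120, "globex": 40}[name]}

def send_reminder(name):
    return f"reminder sent to {name}"

results = []
for name in ["acme", "globex"]:         # loop: free in code, a round
    account = lookup_account(name)      # trip per iteration in JSON
    if account["balance"] < 0:          # conditional: one line in code
        results.append(send_reminder(name))
print(results)
```

In the JSON tool-calling version, the loop body and the conditional each cost a full pass through the model; here they cost nothing.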

This isn't just a philosophical point. Three independent research papers converge on it — work showing that code actions elicit better LLM behavior than JSON tool chains, that code can naturally manage complex objects like dataframes and images between steps while JSON cannot, and that generative models have been trained on vastly more high-quality code than high-quality JSON action schemas. The models are simply better at expressing agent actions in code than in structured tool calls.

The MCP Code Execution Pattern in Practice

The cleanest implementation of this idea works as follows. Rather than loading all tool definitions into context upfront, tools are exposed as files in a virtual filesystem organized by server. An agent working with Google Drive, a CRM, and a database might see a directory structure like ./servers/google-drive/, ./servers/salesforce/, ./servers/postgres/. To use a tool, the agent reads its definition from the filesystem — on demand, when needed — then writes code that calls it.

This is progressive disclosure. The model never pays for tool definitions it doesn't use. Instead of 150,000 tokens of upfront loading, you might pay 2,000 tokens for the five tool definitions actually relevant to a given task.
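A toy sketch of on-demand loading, assuming the `./servers/` layout described above (the file names and definition format are illustrative, not a standard):

```python
# Progressive disclosure sketch: tool definitions live as files and are
# read one at a time, on demand. Layout and schema are assumptions.
import json
import pathlib
import tempfile

root = pathlib.Path(tempfile.mkdtemp())
tool_dir = root / "servers" / "postgres"
tool_dir.mkdir(parents=True)
(tool_dir / "query.json").write_text(json.dumps(
    {"name": "query", "params": {"sql": "string"}}
))

def load_tool(server: str, tool: str) -> dict:
    # Only this one definition's tokens enter model context,
    # not the whole registry.
    path = root / "servers" / server / f"{tool}.json"
    return json.loads(path.read_text())

defn = load_tool("postgres", "query")
print(defn["name"])
```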

Cloudflare's implementation of this pattern is even more minimal: two tools total, a search() function that queries an API spec and an execute() function that runs authenticated calls in an isolated Worker. When you have 2,500 API endpoints and the traditional pattern would put ~1,170,000 tokens of tool definitions in context, the code execution pattern brings that to roughly 1,000. The OpenAPI spec lives on the server; the model queries the fragment it needs.
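The shape of that two-tool pattern can be sketched as follows; the spec dictionary and function bodies are made up for illustration and are not Cloudflare's actual API:

```python
# Loose sketch of a search()/execute() pair: search() returns only the
# spec fragments matching a query, never the full endpoint catalog.
SPEC = {
    "GET /zones": {"desc": "List zones", "params": []},
    "GET /zones/{id}/dns_records": {"desc": "List DNS records",
                                    "params": ["id"]},
}

def search(query: str) -> dict:
    # Return matching fragments of the spec, not all 2,500 endpoints.
    return {k: v for k, v in SPEC.items()
            if query.lower() in v["desc"].lower()}

def execute(endpoint: str, **params) -> dict:
    # Stub for an authenticated call run inside the isolate.
    return {"endpoint": endpoint, "params": params, "status": 200}

hits = search("dns")
print(list(hits))
```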

Beyond token efficiency, several capabilities emerge from this architecture:

Data stays in the execution environment. An agent can load a large dataset, run transformations in code, and return only the output to the model's context. The model never sees the raw data at all.

Control flow is free. Loops, conditionals, retries, and error handling execute in the sandbox without requiring model round trips. Complex workflows that would take 20 tool-call iterations can complete in 3–4 model invocations.

Privacy becomes enforceable. Sensitive fields — PII, credentials, proprietary records — can be handled in code and never surface in model context. This matters for enterprise compliance: you can write code that processes customer data, extracts aggregate statistics, and returns only those statistics. The raw records never transit through the LLM.
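A small sketch of that aggregate-only pattern, with made-up data:

```python
# Sketch of the privacy property: raw records (including emails) stay
# in the sandbox; only aggregates cross into model context.
records = [
    {"email": "a@example.com", "spend": 120.0},
    {"email": "b@example.com", "spend": 80.0},
    {"email": "c@example.com", "spend": 40.0},
]

def summarize(rows):
    # Return only aggregate statistics -- no PII in the output.
    total = sum(r["spend"] for r in rows)
    return {"customers": len(rows),
            "total_spend": total,
            "avg_spend": total / len(rows)}

print(summarize(records))
```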

Skill libraries compound over sessions. When agents save useful functions from one task for reuse in later tasks, they build an evolving library of capabilities without any model fine-tuning. An agent that wrote a useful data normalization function yesterday can import it today.

Choosing a Sandbox

Code execution requires somewhere to run code safely. Three dominant approaches have emerged, each with different tradeoffs:

Firecracker microVMs (E2B and similar) are the production-grade option for cloud-deployed agents. Cold starts run around 150ms, which is effectively imperceptible in the context of LLM inference times. Each sandbox is a full isolated VM with network access, a real filesystem, and support for arbitrary Python packages. This is the right choice when agents need to interact with external services or run complex library code.

V8 isolates (Cloudflare Workers) offer sub-millisecond startup and are designed for the case where the agent primarily needs to make authenticated API calls against a known set of services. Network access to arbitrary external hosts is disabled by default — a useful security property — but this also limits what agents can do beyond structured API calls.

WebAssembly / Pyodide enables Python execution in-browser or in constrained environments without any server infrastructure. WASM's linear memory model is bounds-checked by design, making sandbox escapes architecturally much harder. The tradeoff is speed and library support: not everything compiles to WASM, and execution is slower than native. But for client-side agent deployments where you want zero server infrastructure and zero cross-user contamination risk, it's the right call.
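Whatever the isolation layer, the execution contract looks the same from the agent runtime's side: hand code to an isolated process, enforce a timeout, hand back only the captured output. A minimal local sketch of that contract (a separate interpreter process with a hard timeout, not a real sandbox and not a substitute for the isolation options above):

```python
# Minimal sketch of the execution contract. This is process isolation
# only; production deployments use microVMs, isolates, or WASM.
import subprocess
import sys

def run_untrusted(code: str, timeout: float = 5.0) -> str:
    proc = subprocess.run(
        [sys.executable, "-I", "-c", code],  # -I: isolated mode,
        capture_output=True,                 # no site/user paths
        text=True,
        timeout=timeout,  # hard wall-clock limit on the child process
    )
    return proc.stdout

out = run_untrusted("print(sum(range(10)))")
print(out.strip())  # 45
```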

The cold-start latency concern that made sandboxing feel expensive two years ago is largely obsolete. The bottleneck in agent execution is model inference. Sandbox overhead is noise.

The Security Reality Check

Here's what makes experienced engineers uncomfortable about code execution: the threat model is genuinely hard. AI-generated code must be treated as untrusted input, with the same rigor you'd apply to user-submitted code in a public execution environment.

Static sanitization — filtering output for dangerous patterns, blocklisting functions — does not work reliably. Motivated adversaries (or clever prompt injection attacks embedded in documents the agent retrieves) can bypass filters through encoding, library chaining, and namespace manipulation. One publicly disclosed vulnerability demonstrated multi-stage numpy attribute manipulation to achieve system command execution through what appeared to be safe scientific computing code.

The SaaStr incident in 2025 is worth knowing: an AI coding agent deleted a production PostgreSQL database containing records for 1,200+ executives. Not because of a sophisticated attack — because the agent had broad database permissions and insufficient confirmation logic before executing destructive operations. The model did exactly what it was instructed to do, in an environment where it had more authority than it should have.

These failures share a common root cause: agents granted capabilities broad enough for their worst possible use, rather than scoped to their intended use. The technical countermeasures that actually work:

  • Sandboxing is non-negotiable. Containment limits blast radius when sanitization fails, and sanitization will eventually fail. Sandboxes don't prevent bad code from being generated; they prevent bad code from affecting the host system.
  • Per-user isolation. Cross-contamination between users — where one agent's execution affects another's environment — is a real risk in shared sandbox deployments.
  • Least-privilege tool scoping. An agent tasked with reading from a database should not have write permissions by default, even if the system supports writes.
  • Remote execution over local Docker. Local containers share the host kernel. Hypervisor-based isolation (Firecracker, WASM) provides stronger guarantees.
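Least-privilege scoping can be enforced at the tool boundary before a call ever reaches the backing service. A hedged sketch (`run_sql` is a hypothetical executor stub; a prefix check like this is a convenience gate, and real enforcement should also come from a read-only database role):

```python
# Sketch of least-privilege tool scoping: the agent's database binding
# exposes reads only. `run_sql` stands in for a real driver call, and
# the real backstop should be a read-only role at the database level.
READ_ONLY_PREFIXES = ("select", "show", "explain")

def run_sql(sql: str) -> str:
    return f"executed: {sql}"  # stub for a real driver call

def scoped_query(sql: str) -> str:
    # Reject anything that isn't a read before it leaves the tool layer.
    if not sql.lstrip().lower().startswith(READ_ONLY_PREFIXES):
        raise PermissionError(f"write operation not permitted: {sql!r}")
    return run_sql(sql)

print(scoped_query("SELECT count(*) FROM customers"))
```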

The prompt injection problem — where malicious content in retrieved data hijacks code generation — remains genuinely unsolved. Defense in depth, with sandboxing as the last line, is currently the best available approach.

Reliability: The Problem Nobody Discusses

Code execution agents have a parsing failure rate of roughly 2.4% across large evaluation sets. That sounds small until you run 250 agent tasks and six of them produce malformed code the runtime can't execute. And when parsing fails, task success rates drop by over 20 percentage points relative to runs where parsing succeeds — because failures force expensive recovery paths or produce incorrect results.

The current best practice is to wrap agent output in a thin structured envelope: a JSON object with thoughts and code fields. This combines structured generation (reliable parsing) with code execution (expressive power) and improves end-to-end task success by 2–7 percentage points on standard benchmarks.
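The envelope itself is simple; the field names below are the ones mentioned above, and the single recovery path is illustrative:

```python
# Sketch of parsing the thoughts/code envelope: structured generation
# makes parsing a plain json.loads, with one cheap failure path
# instead of a full re-generation.
import json

def parse_agent_output(raw: str):
    try:
        obj = json.loads(raw)
        return obj["thoughts"], obj["code"]
    except (json.JSONDecodeError, KeyError) as err:
        # In a real system this would trigger a retry or repair prompt.
        raise ValueError(f"malformed agent output: {err}") from err

thoughts, code = parse_agent_output(
    '{"thoughts": "filter then count", "code": "print(len(rows))"}'
)
print(code)
```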

The catch: this only works reliably on large models. Below roughly 32 billion parameters, the cognitive overhead of generating valid JSON-wrapped Python degrades instruction following badly enough that you're better off with plain code generation or standard function calling. The "code agents are strictly better" story breaks down for small models.

This has a practical implication for system design. Code execution agents are not a universal drop-in replacement for function calling. For well-defined, low-variety workflows where you know the tool call schema in advance and task complexity is modest, standard function calling is still the right default — simpler, more predictable, easier to evaluate. Code execution pays off as task complexity grows, as the number of connected tools increases, and as intermediate data volumes rise. Match the architecture to the actual task requirements.

The Compound Benefit Over Time

The most underexplored advantage of code-executing agents is temporal. Every reusable function an agent writes and saves is a capability it carries into future sessions. An agent that figures out how to efficiently normalize customer address data from three different CRM formats can save that function and reuse it. The next time a similar task comes up, it imports the function rather than solving the problem from scratch.

This is a form of learning that doesn't require model fine-tuning, doesn't require curating new training data, and doesn't require any changes to the underlying model. It emerges naturally from giving agents a persistent place to store code. Skill libraries compound over time in a way that static tool registries cannot — because the skills were generated by the agent itself, tuned to the specific data shapes and quirks of the environment it operates in.

For teams building agents meant to run repeatedly in the same environment — internal tooling agents, data processing pipelines, workflow automation — this is worth designing for explicitly from the start. Build in a skill persistence layer. Track which functions get reused and which don't. The agents that accumulate reusable skills over weeks of operation will outperform those that start fresh every session.
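A skill persistence layer can be as simple as a directory of saved functions plus a reuse counter. The sketch below is one possible shape, not a standard — the paths, registry format, and `save_skill`/`load_skill` names are all assumptions:

```python
# Illustrative skill-persistence layer: save agent-written functions to
# a skills directory, import them by name later, and count reuse.
import importlib.util
import json
import pathlib
import tempfile

SKILLS = pathlib.Path(tempfile.mkdtemp())  # stand-in for a persistent dir
REGISTRY = SKILLS / "registry.json"
REGISTRY.write_text("{}")

def save_skill(name: str, source: str) -> None:
    (SKILLS / f"{name}.py").write_text(source)
    reg = json.loads(REGISTRY.read_text())
    reg.setdefault(name, 0)
    REGISTRY.write_text(json.dumps(reg))

def load_skill(name: str):
    reg = json.loads(REGISTRY.read_text())
    reg[name] += 1  # track which skills actually get reused
    REGISTRY.write_text(json.dumps(reg))
    spec = importlib.util.spec_from_file_location(
        name, str(SKILLS / f"{name}.py"))
    mod = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(mod)
    return mod

save_skill("normalize", "def run(s):\n    return s.strip().lower()\n")
skill = load_skill("normalize")
print(skill.run("  ACME Corp  "))
```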

What to Take Forward

The architecture shift from function-calling to code-executing agents is not a rewrite of everything — it's a change in where computation happens. Move processing into the execution environment. Keep context clean. Treat generated code with the same security discipline you'd apply to any untrusted input.

The token economics are compelling enough that any agent making more than a handful of tool calls per task should evaluate whether code execution is the right architecture. The Cloudflare numbers — roughly three orders of magnitude of token reduction for complex API workflows — are not representative of every use case, but even modest workloads consistently show 50–80% token reductions. At scale, that's a real cost and latency difference.

The security and reliability concerns are real and deserve serious engineering attention. Sandboxing, least-privilege access, and structured output for code generation are not optional features for production deployments. But these are solved problems with available tooling — not reasons to avoid the architecture.

