Why Your Agent Should Write Code, Not JSON
Most agent frameworks default to the same action model: the LLM emits a JSON blob, the host system parses it, calls a tool, returns the result. Repeat. It's clean, auditable, and almost universally used — which is exactly the problem. For anything beyond a single tool call, this architecture forces you to write scaffolding code that solves problems the agent could solve itself, if only it were allowed to write code.
There's a different approach: give the agent a Python interpreter and let it emit executable code as its action. One published benchmark shows up to 20% higher task success rates than JSON tool-calling. An internal benchmark shows 30% fewer LLM round-trips on average. A framework built around this idea hit #1 on the GAIA leaderboard (44.2% on validation) shortly after release. The tradeoff is a more complex execution environment — but the engineering required is tractable, and the behavioral gains are real.
The Structural Limits of JSON Tool-Calling
Consider a task where an agent needs to search for catering options and party themes simultaneously, then combine the results. In a JSON tool-calling agent, that's two separate LLM calls, two round-trips:
// Step 1
{"name": "web_search", "arguments": "Best catering services in Gotham City"}
// Step 2 (after observing step 1)
{"name": "web_search", "arguments": "Party theme ideas for superheroes"}
A code-action agent collapses this into one step:
results = {}
for query in ["Best catering services in Gotham City", "Party theme ideas for superheroes"]:
    results[query] = web_search(query)
print(results)
The JSON approach forces a sequential, call-by-call structure even when the operations are independent. This is a fundamental constraint, not an implementation detail. JSON cannot express loops, conditionals, variable assignment, or passing outputs between tool calls as first-class objects. Code can.
This matters practically. If a tool returns an image tensor, a pandas DataFrame, or any opaque object, JSON has no way to reference that output in the next action — you'd need the framework to serialize and deserialize it, which breaks for complex types. Code stores it in a variable. If you need to run the same tool a dozen times with different inputs and aggregate the results, JSON requires a dozen LLM round-trips. Code does it in one.
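To make this concrete, here is a minimal sketch using an invented tool, fetch_orders (a stand-in for any host-provided function that returns structured Python objects rather than strings). One code action calls it repeatedly and aggregates the raw results in a variable, with no serialization round-trip through the LLM:

```python
# Hypothetical tool for illustration -- in a real code-action agent this
# would be a host-provided function injected into the interpreter's namespace.
def fetch_orders(region):
    """Pretend tool: returns structured records, not a JSON string."""
    data = {
        "north": [{"item": "cape", "qty": 3}, {"item": "mask", "qty": 5}],
        "south": [{"item": "cape", "qty": 2}],
    }
    return data[region]

# A single code action: call the tool once per region, keep the raw
# objects in variables, and aggregate them in-place.
totals = {}
for region in ["north", "south"]:
    for order in fetch_orders(region):
        totals[order["item"]] = totals.get(order["item"], 0) + order["qty"]

print(totals)  # {'cape': 5, 'mask': 5}
```

In a JSON tool-calling agent, the same aggregation would cost one round-trip per call plus a final LLM pass to combine stringified results.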
What the Benchmarks Show
The empirical case for code actions does not rest on intuition; it has been measured.
The most rigorous evidence comes from a 2024 paper published at ICML, which introduced the CodeAct framework and tested it against JSON and text-based tool-calling across multiple benchmarks. CodeAct achieved up to 20% higher task success rates, not by using a better LLM, but by changing the action representation from JSON to Python.
A separate framework built on the same principle ran its own analysis across hundreds of benchmark tasks. The conclusion: a CodeAgent consistently used about 30% fewer steps than an equivalent ToolCallingAgent on the same model. Fewer steps means fewer LLM API calls, lower latency, and lower cost — not because the tasks got easier, but because code batches operations that JSON forces into sequential round-trips.
A third data point: a system built on code-action principles ranked first on the GAIA validation leaderboard (44.2%) and second on the test set. GAIA questions are designed to require multi-step tool use that closely tracks real-world assistant tasks. The code-action approach cleared that bar, including Level 3 (hardest) questions where most systems fail entirely.
The performance advantage exists because LLMs are extraordinarily well-trained on Python. Code is what they see most in pretraining. Asking an LLM to write a JSON action schema is asking it to use a format optimized for machine parsing, not for expressing the LLM's own reasoning. Code, on the other hand, is a natural continuation of the thought process — closer to how the model already represents computation internally.
The Execution Loop in Practice
The code-action agent loop follows a Thought → Code → Observation cycle (TCO) instead of the standard Thought → Action → Observation (TAO) of JSON agents.
In practice:
- The LLM receives the task and context, reasons through it, and emits a Python code block.
- The executor runs the code in a controlled environment.
- The stdout, return values, and any errors come back as the observation.
- The LLM processes the observation and decides whether to emit more code or return a final answer.
The observation matters. Errors come back too — tracebacks, type errors, AttributeErrors. This means the agent can debug itself. If search_tool("query") returns an object with an unexpected schema, the agent can inspect it, adapt its code, and retry. JSON tool-calling agents fail silently or error out; code agents iterate.
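The loop above can be sketched in a few lines. Everything here is illustrative: `model` is any callable mapping a message list to a reply string, the code extraction is deliberately naive, and the bare `exec` stands in for the restricted interpreter a production framework would use:

```python
import contextlib
import io
import traceback

def extract_code(reply):
    """Naive fenced-block extractor; returns None if no code was emitted."""
    if "```python" not in reply:
        return None
    return reply.split("```python", 1)[1].split("```", 1)[0]

def run_code_agent(model, task, max_steps=8):
    """Thought -> Code -> Observation loop. A reply without a code block
    is treated as the final answer."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        reply = model(messages)
        messages.append({"role": "assistant", "content": reply})
        code = extract_code(reply)
        if code is None:
            return reply                      # no code emitted: final answer
        buf = io.StringIO()
        try:
            with contextlib.redirect_stdout(buf):
                exec(code, {})                # real frameworks sandbox this step
            observation = buf.getvalue() or "(code ran, no output)"
        except Exception:
            # Tracebacks flow back as observations, so the agent can self-debug.
            observation = buf.getvalue() + traceback.format_exc()
        messages.append({"role": "user", "content": f"Observation: {observation}"})
    return "step limit reached"
```

The except branch is the self-debugging mechanism: the traceback becomes an ordinary observation, and the next code block the model writes can react to it.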
One practical issue: parsing errors. If the LLM generates invalid Python (an unclosed parenthesis, incorrect indentation), the executor fails before running a single line. An analysis of over 15,000 agent traces showed that parsing errors on the first call drop overall success rates by more than 21 percentage points and raise the average step count from 3.18 to 4.63. The fix is structured output: wrap the code block in a JSON envelope so the LLM's full output is always parseable as JSON, even if the code inside has syntax issues. This lets the framework extract and retry the code block cleanly. On capable models, this yields an average improvement of 2–7 percentage points.
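The envelope idea can be demonstrated directly. The field names below ("thought", "code") are an assumption for illustration, not a fixed schema; the point is that the outer JSON parses even when the inner Python is broken:

```python
import json

# Hypothetical model output: the envelope is valid JSON even though the
# Python inside has a syntax error (an unclosed parenthesis).
raw = json.dumps({
    "thought": "Search both queries, then combine the results.",
    "code": "results = web_search('catering'\nprint(results)",
})

envelope = json.loads(raw)   # always parses: the outer wrapper is JSON
code = envelope["code"]      # extracted cleanly despite the broken code

retry_hint = None
try:
    compile(code, "<agent>", "exec")
except SyntaxError as err:
    # Instead of a failed parse of the whole response, the framework can
    # send a targeted retry prompt for just the code block.
    retry_hint = f"SyntaxError: {err.msg}. Fix the code and resend."
```

Without the envelope, the framework would have to guess where the code block starts and ends inside free-form model output.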
The caveat: smaller models (roughly under 32B parameters) struggle with dual-formatting pressure — simultaneously producing valid JSON structure, valid Python syntax, and correct task reasoning. For small models, unstructured code output often outperforms structured code output, because eliminating one formatting constraint frees capacity for the actual problem.
Security Is the Hard Part
Letting an LLM write and execute arbitrary code creates a real attack surface. The threat model has four layers:
LLM error is the most benign — the model generates a loop that fills disk space, or calls os.remove() unintentionally. Rare, but observed.
Supply chain compromise means compromised model weights that systematically generate malicious code. Theoretical for most deployments, but worth noting for regulated environments.
Prompt injection is the most practical threat. An agent that browses the web reads pages, some of which may contain adversarial instructions embedded in the content. Those instructions get added to the agent's context and can influence subsequent code generation — telling the agent to exfiltrate data, call unexpected endpoints, or take destructive actions. An agent with code execution capabilities makes this far more dangerous than a retrieval-only agent.
Adversarial users matter for any public-facing agent — users who craft inputs specifically to exploit code execution.
The naive defense is an AST-based interpreter rather than raw exec(). Instead of running code directly, the interpreter parses it to an Abstract Syntax Tree and walks it operation by operation, enforcing an import allowlist, blocking submodule access outside the allowlist, and capping total operation count to kill infinite loops. This prevents the most direct attacks without preventing legitimate use of approved libraries.
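A minimal static version of the import gate can be sketched with the standard ast module. This sketch only covers the allowlist check plus a crude program-size cap; the runtime operation cap that kills infinite loops requires counting operations while stepping through the tree during execution, which is what a full AST interpreter does:

```python
import ast

ALLOWED_IMPORTS = {"math", "json", "collections"}  # example allowlist

def check_code(source, max_nodes=10_000):
    """Static pre-check in the spirit of an AST-walking interpreter:
    reject disallowed imports (including submodules of disallowed
    packages) and absurdly large programs before running anything."""
    tree = ast.parse(source)
    node_count = 0
    for node in ast.walk(tree):
        node_count += 1
        if node_count > max_nodes:
            raise ValueError("program size cap exceeded")
        if isinstance(node, ast.Import):
            for alias in node.names:
                root = alias.name.split(".")[0]
                if root not in ALLOWED_IMPORTS:
                    raise ImportError(f"import of '{alias.name}' is not allowed")
        elif isinstance(node, ast.ImportFrom):
            root = (node.module or "").split(".")[0]
            if root not in ALLOWED_IMPORTS:
                raise ImportError(f"import from '{node.module}' is not allowed")
    return True
```

For example, `check_code("import math")` passes, while `check_code("import os")` raises before any code runs.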
But the AST interpreter is not a sandbox. Even fully allowed packages can be exploited to exhaust resources. For production deployments exposed to any external input, the right isolation level is microVM or better:
- Firecracker-based VMs offer strong kernel-level isolation with cold start times around 150ms.
- Hibernation-based microVMs can achieve under 25ms startup by persisting warm state between invocations — viable for latency-sensitive user-facing paths.
- gVisor interposes a user-space kernel between the agent code and the host, catching most escape vectors with sub-second startup.
- Docker alone is not sufficient for adversarial workloads. Shared-kernel containers have documented escape paths; use them only for trusted, internal workflows.
One architectural decision that complicates sandboxing: multi-agent systems. If you sandbox only the code execution (Approach 1), managed sub-agents running inside that sandbox lack access to model API credentials and cannot make LLM calls. If you sandbox the entire agent system (Approach 2), you solve the credential problem but must explicitly provision API keys inside the sandbox, which adds operational complexity. There is no clean solution — choose based on your threat model.
Scale reference: code execution sandboxes for AI agents grew from tens of thousands of sessions per month in early 2024 to tens of millions per month by early 2025, with broad enterprise adoption. This is now a solved infrastructure problem — the sandboxing services exist and are production-hardened. The engineering question is which isolation level your use case requires.
When Not to Use Code Actions
Code-action agents are not universally better. The failure modes are different from JSON tool-calling, not absent.
For simple, single-tool workflows, the JSON approach is easier to audit, debug, and explain to non-engineers. If your agent does one thing — search, retrieve, classify — the code machinery adds overhead without benefit.
For small models, the structure tax of producing valid Python plus valid JSON plus correct reasoning is often too much. Test empirically before committing. A ToolCallingAgent on a small model may outperform a CodeAgent on the same model.
For workflows where the full task graph is predetermined, you don't need an agent at all. Agents introduce non-determinism, latency variance, and cost variance. If you can express the task as a fixed pipeline, do that. Reserve agents for tasks where the action sequence genuinely cannot be known in advance.
Context window saturation is a compounding problem in code-action loops. Each step appends observations — tool outputs, code traces, error messages — to the context. Long runs fill the window, drowning out relevant early context. This is not unique to code agents, but code agents tend to run longer and produce verbose observations. Explicit context management (summarization, selective retention) is required for production deployments, and most frameworks provide poor defaults here.
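A crude but effective form of selective retention is clipping each observation before it enters the context. The character budget below is an invented number to tune per model; keeping the tail matters because Python tracebacks put the actual error last:

```python
MAX_OBS_CHARS = 2_000  # illustrative budget; tune to your model's context size

def clip_observation(obs, limit=MAX_OBS_CHARS):
    """Keep the head and tail of a long observation and mark the elision,
    so the model knows content was dropped rather than silently missing."""
    if len(obs) <= limit:
        return obs
    half = limit // 2
    dropped = len(obs) - limit
    return obs[:half] + f"\n...[{dropped} chars truncated]...\n" + obs[-half:]
```

Summarizing old steps with a cheap model is the heavier-weight alternative; clipping is the floor every production loop should have.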
Building for Production
A few design decisions that matter in practice:
Set an explicit step cap and fail gracefully. Runaway loops are the most common production failure mode. Set max_iterations at the orchestration level and handle the limit as a first-class exit condition, not an exception.
Log full execution traces. The code-action loop is harder to debug than JSON tool-calling because the action space is unconstrained. Full traces — including the code generated, the observation returned, and the model's reasoning — are essential for diagnosing failures post-hoc.
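A minimal trace logger can be a single append-only JSONL file. The field names here are illustrative, not a standard schema; the point is that the generated code, the observation, and the reasoning are all captured per step:

```python
import json
import tempfile
import time

def log_step(path, step, code, observation, thought=""):
    """Append one agent step to a JSONL trace file for post-hoc debugging."""
    record = {
        "ts": time.time(),
        "step": step,
        "thought": thought,
        "code": code,
        "observation": observation,
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")

# Demo: record two steps of a run, the first of which failed.
path = tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False).name
log_step(path, 1, "print(total)",
         "NameError: name 'total' is not defined",
         thought="Print the running total.")
log_step(path, 2, "total = 0\nprint(total)", "0",
         thought="Define it before printing.")
```

JSONL keeps each step independently parseable, so a truncated or crashed run still yields a readable prefix of the trace.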
Think carefully about what tools to expose. Code agents can use any Python library available in their execution environment, plus any tools you explicitly provide. Providing too many tools adds noise to the context. Providing too few forces the agent to write everything from scratch. The right set is task-specific, but lean toward fewer, well-documented tools rather than more.
Test both agent types on your actual task distribution. Benchmarks show code agents win on average, but averages hide task-specific variance. Your specific workload may be simpler and more uniform than GAIA. Measure on your data before choosing an architecture.
The code-action model is not a universal upgrade to apply everywhere. It's a structural bet that the action space you need is better expressed in Python than in a predefined schema. For complex, multi-step tasks with dynamic tool composition requirements, that bet pays off at a measurable rate. For everything else, keep it simple.
- https://huggingface.co/blog/smolagents
- https://huggingface.co/blog/structured-codeagent
- https://huggingface.co/learn/agents-course/en/unit2/smolagents/tool_calling_agents
- https://huggingface.co/docs/smolagents/tutorials/secure_code_execution
- https://arxiv.org/abs/2402.01030
- https://arxiv.org/abs/2401.00812
- https://arxiv.org/abs/2411.01747
- https://northflank.com/blog/best-code-execution-sandbox-for-ai-agents
