Function Calling vs Code Generation for Agent Actions: The Tradeoffs Nobody Benchmarks

· 10 min read
Tian Pan
Software Engineer

An agent running in production once received the instruction "clean up the test data" and executed a DROP TABLE command against a production database. The tool call succeeded. The audit log showed a perfectly structured JSON payload. The agent had done exactly what it was asked — just not what anyone meant. This isn't a story about prompt injection. It's a story about an architectural choice: the team had given their agent the ability to generate and execute arbitrary code, and they had underestimated what that actually means at runtime.

The choice between function calling and code generation as the action layer for AI agents is one of the most consequential decisions in agent architecture, and almost nobody benchmarks it directly. Papers measure accuracy on task completion; they rarely measure the failure modes that matter in production — silent semantic errors, irreversible side effects, security exposure surface, and debugging cost when something goes wrong.

What the Two Approaches Actually Mean

Function calling (also called tool calling or structured tool use) works by giving the LLM a menu of predefined operations described by JSON schemas. The model selects which tool to invoke and generates structured arguments to match the schema. The execution environment then validates and runs the matching function. The LLM's output is always a structured data object — it never directly specifies execution logic.
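A minimal sketch of that contract, with a hypothetical `get_weather` tool and a hand-written catalog (the names and schema shape are illustrative, not any specific vendor's API): the model emits only a JSON payload, and the runtime owns validation and execution.

```python
import json

# Hypothetical tool catalog: the schema is what the model sees;
# the handler is what the execution environment controls.
def get_weather(city: str) -> str:
    return f"Sunny in {city}"  # stand-in for a real API read

TOOLS = {
    "get_weather": {
        "handler": get_weather,
        "parameters": {"city": {"type": "string", "required": True}},
    }
}

def dispatch(model_output: str) -> str:
    """Validate a structured tool call before executing it."""
    call = json.loads(model_output)  # {"name": ..., "arguments": {...}}
    spec = TOOLS[call["name"]]       # unknown tool names raise KeyError
    for param, rules in spec["parameters"].items():
        if rules.get("required") and param not in call["arguments"]:
            raise ValueError(f"missing required argument: {param}")
    return spec["handler"](**call["arguments"])

# The model's entire contribution is this structured payload:
print(dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}'))
# → Sunny in Oslo
```

Because the payload is data rather than logic, every call can be checked against the schema before anything runs.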

Code generation as the action layer is different in kind. The model outputs executable code — Python, JavaScript, or another language — and a runtime executes it. The model isn't selecting from a menu; it's writing a program. Libraries like CodeAct formalize this pattern, and OpenAI's code interpreter and similar sandboxed environments put it in production.
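The runtime side of that pattern can be sketched in a few lines. Here `generated` stands in for model output; stripping `__builtins__` is only a gesture at containment, not a real sandbox (a point the safety section below makes concrete).

```python
# Code-as-action in miniature: the model writes a program, the runtime runs it.
generated = """
total = 0
for n in [1, 2, 3, 4]:
    total += n * n
result = total
"""

namespace = {}
# An empty __builtins__ is NOT real isolation -- production systems use
# containers, interpreters with capability limits, or policy checks.
exec(generated, {"__builtins__": {}}, namespace)
print(namespace["result"])  # → 30
```

Note what the model controlled here: not just which operation ran, but the loop, the accumulator, and the control flow around them.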

The critical distinction isn't "structured vs. unstructured output." It's about what the model controls. With function calling, the model controls what to do; the execution system controls how. With code generation, the model controls both. That difference has downstream consequences across reliability, safety, flexibility, and cost.

Where Each Approach Actually Wins

Benchmarks like the Berkeley Function Calling Leaderboard reveal something counterintuitive: frontier models reliably select the correct function in straightforward cases. The failure rate on function selection itself is low. Where models fail is in what they do with the results — applying logic to tool outputs, chaining operations correctly, and recovering from unexpected return values. Accuracy on those downstream reasoning steps can drop by as much as 75% in studies that measure logic errors beyond simple parameter generation.

Code generation shows a different failure profile. Studies comparing CodeAct to structured tool-calling alternatives find that code generation achieves up to 20% higher success rates on complex multi-step tasks and reduces the number of interaction turns by up to 30%. The reason is compositional power: instead of making five sequential tool calls and waiting for results between each, a code-generating agent can write a loop, use intermediate variables, and express branching logic inline. A single generation handles what would otherwise require multiple model invocations.
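That compositional difference is easy to see side by side. With a hypothetical `fetch_price` tool, a function-calling agent would need one model round-trip per item plus another to compare the results; a code-generating agent can emit the whole procedure in one generation:

```python
# Hypothetical tool, stubbed locally for illustration.
def fetch_price(item: str) -> float:
    prices = {"apple": 1.2, "bread": 2.5, "milk": 1.8}
    return prices[item]  # stand-in for a real API call

# One generation: loop, intermediate values, and comparison logic inline.
cheapest = min(["apple", "bread", "milk"], key=fetch_price)
print(cheapest)  # → apple
```

Three tool invocations and a comparison collapse into a single model turn — which is where the reported reductions in interaction turns come from.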

The shape of this tradeoff suggests clear task categories:

Function calling wins when:

  • The task maps cleanly onto predefined operations (database queries, API reads, appointment scheduling)
  • Auditability and compliance matter — every action is logged as a discrete, typed call
  • Latency and token cost are constraints — structured calls avoid the overhead of code generation and execution
  • The action space should be bounded — you don't want the agent inventing new capabilities at runtime

Code generation wins when:

  • The task requires composing multiple operations with intermediate logic
  • The agent needs to adapt based on intermediate results — conditional paths, retries, transformations
  • The domain is software development, data analysis, or anything where expressing logic in code is natural
  • You want to reduce round-trips — writing a complete procedure in one generation is faster than orchestrating tool calls step by step

The Safety Asymmetry That Changes Everything

This is where the analysis gets uncomfortable for code generation advocates.

With function calling, the threat model is narrow: a malicious or confused model can call a function with wrong parameters, or call the wrong function. Both are bad, but both are predictable and constrainable. You can validate schemas before execution, require human confirmation for high-stakes operations, and audit every call against a fixed catalog.
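Those constraints are cheap to implement precisely because the action space is a fixed catalog. A sketch, with illustrative tool names and a human-confirmation hook passed in as a callback:

```python
# Hypothetical catalog; destructive tools are gated behind confirmation.
CATALOG = {
    "read_rows": lambda table: f"rows from {table}",
    "drop_table": lambda table: f"dropped {table}",
}
HIGH_STAKES = {"drop_table"}

def execute_call(name, args, confirm):
    """Run a tool call; high-stakes tools require an explicit yes."""
    if name not in CATALOG:
        raise PermissionError(f"unknown tool: {name}")
    if name in HIGH_STAKES and not confirm(name, args):
        return "blocked: confirmation denied"
    return CATALOG[name](**args)

deny = lambda name, args: False  # stand-in for a real human-in-the-loop prompt
print(execute_call("read_rows", {"table": "orders"}, deny))   # → rows from orders
print(execute_call("drop_table", {"table": "orders"}, deny))  # → blocked: confirmation denied
```

The equivalent gate for arbitrary generated code is much harder, because "is this operation high-stakes?" must be answered by analyzing a program rather than matching a name.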

With code generation, the threat model expands dramatically. An agent that can write arbitrary code can, in principle, write code that:

  • Reads files it shouldn't read
  • Makes network requests to external endpoints
  • Executes shell commands that the code generation prompt never explicitly authorized
  • Chains operations to achieve effects that no single tool call would permit

Prompt injection — already a first-class concern for tool-using agents — becomes operationally dangerous when the execution layer is code. In function-calling architectures, successful injection might cause the agent to call an API it shouldn't. In code-generation architectures, it could cause the agent to execute arbitrary commands.

Research from 2025 on runtime enforcement frameworks shows that naive sandboxing catches fewer than half of unsafe execution attempts when the code generation prompt is sophisticated. More robust approaches — information flow control, pre-execution policy checking, blocklist/allowlist enforcement — can achieve 90%+ unsafe execution prevention, but they require substantial infrastructure investment.
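To make "pre-execution policy checking" less abstract, here is a toy version of the idea using Python's `ast` module: parse the generated code and reject it if it imports outside an allowlist or calls blocklisted names. The lists are illustrative, and a determined attacker can evade static checks like this — it is one layer, not a sandbox.

```python
import ast

ALLOWED_IMPORTS = {"math", "json"}
BLOCKED_CALLS = {"eval", "exec", "open", "__import__"}

def violates_policy(source: str) -> bool:
    """Statically scan generated code before it is ever executed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            if any(a.name.split(".")[0] not in ALLOWED_IMPORTS for a in node.names):
                return True
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] not in ALLOWED_IMPORTS:
                return True
        elif isinstance(node, ast.Call):
            if isinstance(node.func, ast.Name) and node.func.id in BLOCKED_CALLS:
                return True
    return False

print(violates_policy("import math\nprint(math.sqrt(2))"))   # → False
print(violates_policy("import os\nos.remove('db.sqlite')"))  # → True
```

Real enforcement frameworks layer this kind of check with runtime information flow control and OS-level isolation, which is exactly the infrastructure investment the research describes.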

The operational upshot: for most production workflows that touch sensitive data or irreversible operations, function calling's constrained action space is a feature, not a limitation.
