
The Action Space Problem: Why Giving Your AI Agent More Tools Makes It Worse

9 min read
Tian Pan
Software Engineer

There's a counterintuitive failure mode that most teams encounter when scaling AI agents: the more capable you make the agent's toolset, the worse it performs. You add tools to handle more cases. Accuracy drops. You add better tools. It gets slower and starts picking the wrong ones. You add orchestration to manage the tool selection. Now you've rebuilt complexity on top of the original complexity, and the thing barely works.

The instinct to add is wrong. The performance gains in production agents come from removing things.

This isn't a hypothetical. Teams that have shipped long-running production agents — coding assistants, research systems, autonomous task runners — consistently report that their biggest performance improvements came from stripping down the action space, not expanding it. Understanding why this happens, and how to design around it, is one of the least-discussed but most impactful dimensions of agent engineering.

Why Tool Count Directly Degrades Agent Performance

When you register tools with an LLM agent, every tool definition occupies tokens. A well-documented tool with a name, description, parameters, and examples can consume 200–500 tokens. Put 50 such tools in a prompt and you've pre-loaded 10,000–25,000 tokens before the agent has processed a single message from the user.
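You can see this arithmetic for yourself with the common rough heuristic of ~4 characters per token. The tool schema below is illustrative (the name, description, and fields are invented for this sketch, not from any real registry), and the estimate is approximate — a real count would use the model's actual tokenizer:

```python
import json

# An illustrative, well-documented tool definition in the common
# JSON-schema style. Every name and field here is hypothetical.
tool = {
    "name": "search_documents",
    "description": "Full-text search over the indexed document store. "
                   "Returns up to `limit` matching snippets ranked by relevance.",
    "parameters": {
        "type": "object",
        "properties": {
            "query": {"type": "string", "description": "Search terms"},
            "limit": {"type": "integer", "description": "Max results, default 10"},
        },
        "required": ["query"],
    },
}

def estimate_tokens(obj, chars_per_token=4):
    """Rough token estimate using the ~4-chars-per-token heuristic."""
    return len(json.dumps(obj)) // chars_per_token

per_tool = estimate_tokens(tool)
print(f"~{per_tool} tokens per tool, ~{per_tool * 50} for a 50-tool registry")
```

Even this fairly lean definition lands in the low hundreds of tokens; richer descriptions with examples push toward the 500-token end of the range.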

This creates three compounding problems.

The first is attention dilution. Transformer attention is finite. More tokens in the context means each token gets proportionally less attention weight. Tool definitions are structural — they establish the action space — but they compete with instruction tokens, user input, and accumulated history for the model's limited attention budget. As context grows, the model's ability to correctly reason about which tool to use diminishes.

The second is selection noise. Research from the Berkeley Function Calling Leaderboard shows that accuracy drops measurably as the number of available tools increases. With 5 tools, frontier models pick correctly most of the time. With 20, errors appear. With 50+, the model starts confusing similar-looking tools, hallucinating tool names that don't exist, or defaulting to generic tools when specialized ones would be more effective. OpenAI caps agents at 128 tools — but the performance cliff arrives well before that limit.

The third is context confusion — the LLM's difficulty distinguishing between instructions, data, and structural markers when all three are present at high density. A large tool registry creates a kind of semantic noise: the model has to hold many more "if this, then that" relationships in working memory simultaneously, which crowds out higher-level reasoning. You built a Swiss Army knife and handed the model a hardware store.

The Hierarchical Action Space

The fix isn't removing capabilities. It's organizing them into tiers that match how agents actually operate, so the model only sees the tools relevant to what it's doing right now.

Level 1: Core atomic tools. Keep this to roughly 20 tools. File read/write, browser navigation, shell execution, HTTP requests, and a handful of agent-specific utilities like memory read/write. These are the primitives. Every task uses them. They're always present.

Level 2: General-purpose utilities. Things that expand Level 1 capabilities without requiring an LLM to invoke them directly. CLI tools, system commands, package managers. These are available but don't need to be in the model's active context. The agent can access them through shell execution at Level 1.

Level 3: Domain-specific logic. Complex operations, multi-step procedures, business rules. Instead of exposing these as LLM tools, encode them as code or library functions the agent calls directly — avoiding multiple LLM roundtrips for operations that should be deterministic.

This structure means the model's active context contains, at most, 20 core tools plus whatever Level 3 function is needed for the current task. Everything else is out of the way.
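The three tiers can be sketched as a small registry type. All tool names here are hypothetical, and the structure is a minimal illustration of the idea, not a framework API:

```python
from dataclasses import dataclass

@dataclass
class ActionSpace:
    core: list       # Level 1: always in the model's context (~20 max)
    utilities: list  # Level 2: reachable via `shell`, never sent to the model
    domain: dict     # Level 3: name -> plain function, called as code

    def active_tools(self, current_task_domain=None):
        """Only Level 1, plus the one domain entry the current task needs."""
        tools = list(self.core)
        if current_task_domain in self.domain:
            tools.append(current_task_domain)
        return tools

space = ActionSpace(
    core=["read_file", "write_file", "shell", "http_request", "browser_goto"],
    utilities=["rg", "jq", "pip"],                     # used through `shell`
    domain={"reconcile_invoices": lambda batch: ...},  # deterministic code path
)

print(space.active_tools("reconcile_invoices"))
# Level 2 and the unused Level 3 entries never enter the prompt.
```

The point of the type is the invariant it enforces: nothing in `utilities` or an inactive `domain` entry can leak into the model's context, because only `active_tools()` feeds the prompt.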

One failure mode worth calling out explicitly: dynamic RAG-based tool definitions. This seems clever — retrieve tools based on the current query, don't show the agent tools it doesn't need. In practice it creates unstable, non-deterministic context. The model sees different tool subsets for similar queries. It develops patterns around tools that may not always appear. It attempts to call tools that aren't present because a prior step made them seem available. Static, predictable tool sets are nearly always better than dynamic ones for production agents.

Tool RAG as a Middle Ground

For systems with genuinely large tool registries — enterprise integrations, API gateways, platform toolkits — pure static exposure isn't feasible. A retrieval-based approach can work, but only with careful implementation.

The principle: retrieve tools based on relevance to the current step, not the overall task. An agent working on a multi-stage data pipeline should see database tools when querying, transformation tools when processing, and notification tools when reporting. Mixing all three increases the decision surface unnecessarily.

Research on AutoTool (a tool-selection framework) shows this reduces inference costs by up to 30% while maintaining competitive task completion rates. Dynamic Tool Dependency Retrieval (DTDR) conditions on both the initial query and the evolving execution context, which handles cases where the right tool depends on what the agent has already done — a better fit than query-only retrieval.

The constraint: keep retrieved tool sets small. Five to ten tools per retrieval call is a reasonable ceiling. More than that and you're back to the selection noise problem.
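A step-scoped retrieval with a hard cap might look like the sketch below. The similarity function here is a toy word-overlap stand-in (a real system would use embeddings), and the registry entries are invented for illustration:

```python
MAX_TOOLS_PER_STEP = 8  # stays under the five-to-ten ceiling

def tools_for_step(step_description, registry, similarity, cap=MAX_TOOLS_PER_STEP):
    """Rank tools by relevance to the *current step*, not the overall
    task, and return at most `cap` of them."""
    scored = sorted(
        registry,
        key=lambda tool: similarity(step_description, tool["description"]),
        reverse=True,
    )
    return scored[:cap]

def word_overlap(a, b):
    # Toy similarity: count of shared lowercase words. A production
    # system would use embedding cosine similarity instead.
    return len(set(a.lower().split()) & set(b.lower().split()))

registry = [
    {"name": "sql_query", "description": "run a sql query against the warehouse"},
    {"name": "send_email", "description": "send a notification email"},
    {"name": "csv_transform", "description": "transform csv rows with a mapping"},
]
step = "query the warehouse for last month's orders"
print([t["name"] for t in tools_for_step(step, registry, word_overlap, cap=2)])
```

Note that the query passed to retrieval is the *step* description, not the user's original request — that single choice is what keeps database tools out of the notification stage and vice versa.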

The Agent-as-Tool Pattern

The most powerful pattern for managing action space complexity isn't tool organization — it's moving decision-making out of the action selection loop entirely.

Instead of having a general-purpose agent call specialized sub-agents through conversational coordination, treat specialized agents as deterministic functions. Invoke them with a structured input and an expected output schema. Get back an immediately usable result. Don't expose the sub-agent's internal reasoning or tool use to the parent agent unless the parent needs to understand why the sub-agent acted, not just what it produced.

This flips the framing. The parent agent doesn't need to understand how to use a code review agent — it needs to know "call code_review_agent(diff) → {issues: [], summary: string}." The sub-agent is a black box with a typed interface, the same way a function call works in normal software.
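A minimal sketch of that typed interface, with the sub-agent's internals stubbed out (everything here — the result schema, `run_review_agent`, the sample diff — is hypothetical):

```python
from dataclasses import dataclass

@dataclass
class ReviewResult:
    issues: list
    summary: str

def run_review_agent(diff):
    # Stub standing in for the sub-agent's internal multi-turn loop,
    # which would run its own prompts and tools out of the parent's sight.
    lines = diff.count("\n") + 1
    issues = ["unused import"] if "import os" in diff else []
    return {"issues": issues, "summary": f"reviewed {lines} changed lines"}

def code_review_agent(diff: str) -> ReviewResult:
    """The parent sees only this typed interface; the sub-agent's
    conversation history never enters the parent's context."""
    raw = run_review_agent(diff)
    return ReviewResult(issues=raw["issues"], summary=raw["summary"])

# Parent agent's side: a plain function call, like any other software.
result = code_review_agent("+import os\n+print('hello')")
print(result.summary)
if result.issues:
    print("needs fixes:", result.issues)
```

From the parent's perspective this is indistinguishable from an ordinary function — which is exactly the point.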

The practical benefits are significant:

  • The parent agent's action space stays small. It doesn't need to know about the tools the sub-agent uses internally.
  • Sub-agent results don't pollute the parent's context. You get a typed summary, not a full conversation history to parse.
  • Sub-agents can be developed, tested, and improved independently. The parent's context doesn't care about the sub-agent's implementation.

Where this breaks down: when the parent genuinely needs the trajectory of the sub-agent's reasoning to make its next decision. A software planning agent might need to understand why a sub-agent flagged a particular file, not just that it was flagged. In those cases, pass a structured reasoning summary rather than raw conversation history. Never dump the sub-agent's full context into the parent — that's the "communicate by sharing memory" anti-pattern.

Communication Over Shared Context

The Go programming language has a useful concurrency principle: "Don't communicate by sharing memory; share memory by communicating." It applies directly to multi-agent architectures.

The naive multi-agent design shares context: agent A and agent B both read from and write to a shared context object. A sees B's intermediate steps. B sees A's partial results. The total context grows with each step from each agent. By step 20 of a 3-agent pipeline, you have a context window that's carrying the history of all three agents simultaneously — and the models are spending attention on each other's internal reasoning rather than on the task.

The better design communicates results explicitly: agent A produces a typed output. That output becomes the input to agent B. B doesn't see A's intermediate steps — only A's final answer, in a structured format B was designed to consume. Agents coordinate through well-defined interfaces, not shared memory.

This matters at the architecture level, not just the implementation level. When designing a multi-agent system, the question isn't "what context should each agent have access to?" It's "what is the minimal structured output that the next agent needs to do its job?" Define those interfaces first. The context management follows from them.

Concretely:

  • Discrete, parallelizable tasks → fresh sub-agents with task-specific instructions only
  • Sequential tasks with dependencies → typed handoffs, not shared history
  • Complex reasoning requiring trajectory understanding → structured summaries, not raw history dumps
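The typed-handoff style above can be sketched with two stub agents. The agent bodies, schema, and sample data are all invented for illustration; the thing to notice is the contract between them:

```python
from dataclasses import dataclass

@dataclass
class ResearchFindings:  # A's output contract == B's input contract
    sources: list
    key_claims: list

def research_agent(question: str) -> ResearchFindings:
    # Internal tool calls, retries, and scratch reasoning all stay in
    # here; only the structured result crosses the boundary.
    return ResearchFindings(
        sources=["https://example.com/report"],
        key_claims=["Latency doubled after the v2 rollout"],
    )

def writer_agent(findings: ResearchFindings) -> str:
    # B consumes only the typed result -- it never sees A's steps.
    return f"{findings.key_claims[0]} (sources: {len(findings.sources)})"

draft = writer_agent(research_agent("What changed after the v2 rollout?"))
print(draft)
```

Because `ResearchFindings` is the entire interface, the writer's context stays constant no matter how long or messy the research agent's internal trajectory was.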

Measuring Action Space Efficiency

Two metrics are worth tracking explicitly in production:

Tool call precision: Of all tool calls the agent makes, what fraction were actually necessary? Agents with bloated action spaces tend to call tools defensively — "just in case" retrieval, redundant checks, multiple tools where one would do. High tool call count with moderate output quality is a signal the action space is over-supplied.

Tool selection accuracy: For tasks where there's a clear correct tool to use, does the agent consistently select it? Drifting selection accuracy — correct on step 1, wrong on step 7 — often indicates context rot interacting with a large action space. The model's grip on "which tool does what" loosens as context grows.

Both metrics are measurable against real production traces. Neither requires LLM-as-judge evaluation. A tool call is either unnecessary or it isn't. A tool selection is either correct or it isn't. These are the cleaner signals.
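Both metrics reduce to simple counting over labeled traces. In this sketch, the `necessary` and `expected_tool` labels are assumed to come from offline annotation of real production traces (the trace format itself is invented for illustration):

```python
def tool_call_precision(calls):
    """Fraction of tool calls that were actually necessary."""
    return sum(c["necessary"] for c in calls) / len(calls)

def tool_selection_accuracy(calls):
    """Among calls with a known correct tool, how often the agent chose it."""
    labeled = [c for c in calls if c.get("expected_tool")]
    hits = sum(c["tool"] == c["expected_tool"] for c in labeled)
    return hits / len(labeled)

trace = [
    {"tool": "read_file",  "necessary": True,  "expected_tool": "read_file"},
    {"tool": "web_search", "necessary": False, "expected_tool": "read_file"},
    {"tool": "shell",      "necessary": True,  "expected_tool": None},
]
print(tool_call_precision(trace))      # 2 of 3 calls were needed
print(tool_selection_accuracy(trace))  # 1 of 2 labeled calls was correct
```

Tracking both numbers over time is what surfaces the drift described above: precision sagging as the agent calls tools defensively, or accuracy decaying on later steps as context grows.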

The Subtractive Approach

The dominant instinct in agent development is additive: when the agent can't do something, add a tool. When coordination is clumsy, add an orchestration layer. When outputs are wrong, add validation.

The practitioners who've run agents in production for long enough eventually learn the subtractive version: when the agent is unreliable, remove tools. When coordination is clumsy, narrow the interface. When outputs are wrong, check whether the action space is too noisy for the model to reason clearly about its options.

Context engineering — what goes into the window, and when — is one half of this. Action space engineering — what the agent can do, and how it's organized — is the other. Most of the effort goes into context. Most of the remaining wins are in action space design.

The model hasn't changed. Your job is to create conditions under which it can reason well. Fewer tools, cleaner interfaces, and typed handoffs between agents are often more valuable than any amount of prompt optimization.
