The Agent Planning Module: A Hidden Architectural Seam

· 10 min read
Tian Pan
Software Engineer

Most agentic systems are built with a single architectural assumption that goes unstated: the LLM handles both planning and execution in the same inference call. Ask it to complete a ten-step task, and the model decides what to do, does it, checks the result, decides what to do next—all in one continuous ReAct loop. This feels elegant. It also collapses under real workloads in a way that's hard to diagnose because the failure mode looks like a model quality problem rather than a design problem.

The agent planning module—the component responsible purely for task decomposition, dependency modeling, and sequencing—is the seam most practitioners skip. It shows up only when things get hard enough that you can't ignore it.

Why Monolithic Agents Fail at Scale

The problem isn't that LLMs are bad at planning. It's that step-wise greedy reasoning is mathematically incompatible with long-horizon planning. A monolithic agent picks the locally most plausible next action at each step. That's a greedy policy, and greedy policies are provably suboptimal for multi-step problems—not occasionally, but in a way that compounds geometrically.

Research formalized this in 2026: on multi-hop question-answering benchmarks, greedy reasoning agents selected myopic actions that derailed the overall task 55.6% of the time at the very first decision point. Accuracy on shallow tasks hovered near 47%; on longer horizons it approached zero. Wider beam search made it worse, pushing the trap selection rate to 71.9%. Once an agent makes a wrong first move, recovery probability under greedy policies is around 5%. Compare that to approaches with even shallow lookahead: recovery probability jumps to nearly 30%.

But there's a second failure that compounds the first: error self-conditioning. When an LLM makes an error early in a task, it then conditions on that error as if it were ground truth. The mistake becomes part of the context that shapes every subsequent inference call. Benchmarks on repeated-execution tasks found that frontier models with hundreds of billions of parameters drop below 50% accuracy within 15 turns on straightforward sequential tasks. The degradation isn't gradual—it accelerates as the error history accumulates.

And beneath both of these is context rot. LLMs don't process their context window uniformly. The 10,000th token is not attended to with the same fidelity as the 100th. As a monolithic agent accumulates tool responses, observation histories, and intermediate reasoning across dozens of steps, the effective context available for new decisions shrinks. Research into this phenomenon found that even single distractors measurably reduce accuracy, and the impact compounds with context length—affecting even the largest frontier models. The effective context that actually shapes model behavior can be a small fraction of the nominal window.

Put these three failures together—greedy myopia, error self-conditioning, and context rot—and you have a system that degrades predictably as task complexity increases. On one of the most comprehensive long-horizon planning benchmarks released recently, the best frontier model scored 0.343 out of 1.0. The average across all models tested was 0.232. These aren't edge cases; they're the current state of the art.

The Architectural Fix: Separate Planning from Execution

The insight is straightforward even if the implementation isn't: planning and execution are cognitively distinct operations that should not share the same inference call.

A planner takes a high-level task and produces a structured multi-step decomposition. It reasons about goals, constraints, inter-step dependencies, and ordering. It doesn't execute anything. An executor takes a single step from that plan and translates it into concrete tool calls. It doesn't need to reason about the whole problem—it needs to complete one well-specified unit of work.

This mirrors the separation of concerns principle that appears throughout systems engineering. The executor can be a smaller, cheaper model or even deterministic code. It doesn't carry the cognitive burden of the entire task; it only needs step-level context and step-level tool access. The planner is a separate, slower inference call that runs upfront or at key checkpoints rather than on every step.
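The split can be made concrete in a few lines. The sketch below is illustrative, not a reference implementation: the plan function returns a canned decomposition where a real system would make a strong-model inference call, and the step fields (step_id, tool, args, deps) are assumptions about the plan format.

```python
from dataclasses import dataclass, field

@dataclass
class PlanStep:
    step_id: int
    tool: str                                  # name of the tool to invoke
    args: dict                                 # arguments, possibly referencing earlier outputs
    deps: list = field(default_factory=list)   # step_ids this step waits on

def plan(task: str) -> list[PlanStep]:
    # One slow, strong-model inference call: decompose the task into steps.
    # A canned decomposition stands in for the model output here.
    return [
        PlanStep(1, "search", {"query": "France capital"}),
        PlanStep(2, "search", {"query": "Germany capital"}),
        PlanStep(3, "compare", {"a": "$E1", "b": "$E2"}, deps=[1, 2]),
    ]

def execute(step: PlanStep, resolved_args: dict) -> str:
    # One cheap, narrow call: the executor sees a single step's tool and
    # resolved arguments, not the whole task or its history.
    return f"{step.tool}({resolved_args})"
```

The important property is the boundary: nothing in execute can see the full task, so it cannot accumulate history.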

The practical impact of this separation is measurable. Architectures that pre-plan all tool calls and use variable substitution—where later steps reference earlier outputs by name rather than re-reading full observation histories—achieve roughly 65% fewer tokens and meaningful accuracy improvements compared to ReAct-style loops. The executor never sees the bloated observation history; it sees $E1 and $E2 as inputs and produces $E3 as output. Context rot is structurally prevented.
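The substitution step itself is mechanical. A minimal sketch, assuming step outputs are stored in a dict keyed by their $E-names:

```python
def substitute(args: dict, outputs: dict) -> dict:
    """Replace $E-references in a step's arguments with the actual
    outputs of earlier steps, so the executor never re-reads the
    full observation history."""
    resolved = {}
    for key, value in args.items():
        if isinstance(value, str) and value.startswith("$E"):
            resolved[key] = outputs[value]   # e.g. "$E1" -> "Paris"
        else:
            resolved[key] = value            # literal argument, pass through
    return resolved
```

For example, substitute({"a": "$E1", "b": "$E2"}, {"$E1": "Paris", "$E2": "Berlin"}) returns {"a": "Paris", "b": "Berlin"}: the executor receives two short strings, not the tool transcripts that produced them.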

When the planner also produces a dependency graph rather than a flat ordered list, a further improvement emerges: independent tasks can run concurrently. A sequential agent doing three independent lookups does them one at a time. A graph-aware executor with a task scheduler runs them in parallel, substituting results into dependent steps only after their inputs are available. In production benchmarks, this approach produced latency improvements of 3–4x, cost reductions of 6x, and accuracy gains of roughly 9% compared to sequential ReAct-style execution.

Graph-Based Planning: Not Just an Optimization

The shift from an ordered list to a dependency graph is more than a performance improvement—it's a different correctness model.

A flat numbered plan implies an ordering that may not actually exist. Steps that are independent get sequentialized by accident. Steps whose outputs genuinely depend on earlier steps may be ordered correctly in the plan, but the dependency isn't encoded anywhere the system can check or enforce it.

A directed acyclic graph makes dependencies explicit. Each task node specifies its tool, its arguments, and the list of task IDs it depends on. A task scheduler—essentially a topological sort with parallel dispatch—runs tasks as soon as their dependencies are satisfied. Tasks with no dependencies run immediately and concurrently. The executor handling step 3 receives $E1 and $E2 already resolved; it doesn't need to reason about what those values are or where they came from.

In practice, this looks like a planner outputting something structured:

Task 1: search("France capital") → $E1  [deps: none]
Task 2: search("Germany capital") → $E2 [deps: none]
Task 3: compare($E1, $E2, question="which is older?") [deps: 1, 2]

Tasks 1 and 2 run in parallel. Task 3 runs when both complete. The planner expressed the structure; the scheduler enforced it.
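That scheduling logic fits in a short loop. A sketch using asyncio, with a stubbed run_tool standing in for real tool calls; tasks dispatch as soon as their dependencies are in the output store:

```python
import asyncio

# The plan from above: task_id -> (tool, args, deps)
PLAN = {
    1: ("search", {"query": "France capital"}, []),
    2: ("search", {"query": "Germany capital"}, []),
    3: ("compare", {"a": "$E1", "b": "$E2"}, [1, 2]),
}

async def run_tool(tool, args):
    await asyncio.sleep(0)   # stand-in for a real (slow) tool call
    return f"{tool}:{sorted(args.values())}"

async def run_plan(plan):
    done = {}                # task_id -> output (the $E-store)
    pending = dict(plan)
    while pending:
        # Every task whose dependencies are satisfied runs concurrently.
        ready = [tid for tid, (_, _, deps) in pending.items()
                 if all(d in done for d in deps)]
        if not ready:
            raise ValueError("dependency cycle in plan")
        results = await asyncio.gather(*[
            run_tool(pending[tid][0],
                     {k: done[int(v[2:])] if str(v).startswith("$E") else v
                      for k, v in pending[tid][1].items()})
            for tid in ready
        ])
        for tid, out in zip(ready, results):
            done[tid] = out
            del pending[tid]
    return done
```

On the first pass, tasks 1 and 2 are both ready and run in the same gather; task 3 becomes ready on the second pass with $E1 and $E2 already resolved into its arguments.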

The accuracy improvement from this pattern is consistent: on multi-hop question-answering benchmarks, graph-aware planning with parallel scheduling raises accuracy by 10–15 percentage points while reducing the number of sequential steps and total response length. The gains aren't from a smarter model—they're from a smarter execution structure.

Hierarchical planning extends this further. For genuinely complex tasks, a global planner produces high-level milestones. Domain-specific sub-planners decompose each milestone into concrete steps. Executors handle individual tool calls without ever seeing the full problem. Evaluations on long-horizon legal and research tasks found that global planning alone contributed 8+ percentage point improvements in ablation studies, with the full hierarchical setup achieving 12–20% gains over ReAct baselines.

Replanning Is a Feature, Not a Fallback

The most common objection to upfront planning is that plans go stale. The agent reaches step 4 and discovers something that invalidates step 7. A static plan can't adapt.

This objection is real, but it's an argument for building replanning into the architecture—not for collapsing planning back into execution. Replanning is a distinct trigger that fires under specific conditions and produces a revised plan. It's not free (it's another planning inference call), so the question is when to trigger it.

Four replanning strategies span most use cases:

Failure-triggered replanning fires only when an executor step returns an error or produces an output that violates a precondition. It's the cheapest option and handles most task deviations adequately when tasks are well-scoped.

Observation-triggered replanning fires after every executor step, passing the new environmental state back to the planner. It's expensive but produces the best results for tasks where the environment is dynamic or partially observable. Production evaluations on web navigation tasks found that observation-triggered replanning contributed more than 10 percentage points of accuracy improvement over static planning on the same tasks.

Periodic replanning fires every N steps and balances cost against adaptability. It's a reasonable default for long tasks where the environment changes gradually.

Bounded-horizon replanning commits only to the next K steps in the plan, replanning at each K-step boundary. It trades off global optimality for local correctness and works well when task structure is predictable at the step level but uncertain at the horizon.
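The four strategies above reduce to a small predicate checked after each executor step. A sketch, where the enum names, default period, and default horizon are assumptions layered on the taxonomy, not a standard API:

```python
from enum import Enum

class Strategy(Enum):
    FAILURE = "failure"          # replan only when a step errors
    OBSERVATION = "observation"  # replan after every step
    PERIODIC = "periodic"        # replan every N steps
    BOUNDED = "bounded"          # commit to K steps, replan at each boundary

def should_replan(strategy, step_index, step_failed, period=5, horizon=3):
    if step_failed:
        return True              # every strategy replans on failure
    if strategy is Strategy.OBSERVATION:
        return True
    if strategy is Strategy.PERIODIC:
        return step_index % period == 0
    if strategy is Strategy.BOUNDED:
        return step_index % horizon == 0
    return False                 # FAILURE: nothing to do on success
```

Structuring it this way keeps the trigger policy a one-line configuration choice rather than logic scattered through the execution loop.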

The key insight is that the plan is a hypothesis, not a contract. Encoding this explicitly in the architecture—rather than assuming the planner produced a perfect sequence—produces more robust systems.

What This Means for System Design

Separating the planning module creates a seam in your system that is worth designing deliberately rather than discovering accidentally.

The planner needs a plan representation format that the executor can consume. A numbered list works for sequential tasks. A JSON task graph with explicit dependency arrays works for parallel tasks. A structured DAG with variable substitution handles both. The format isn't incidental—it determines what execution patterns are possible and what the executor needs to reason about.
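As a concrete illustration (the field names here are assumptions, not a standard), the three-task example from earlier serializes naturally into a JSON task graph that both a scheduler and a validator can consume:

```json
{
  "tasks": [
    {"id": 1, "tool": "search", "args": {"query": "France capital"},
     "deps": [], "out": "$E1"},
    {"id": 2, "tool": "search", "args": {"query": "Germany capital"},
     "deps": [], "out": "$E2"},
    {"id": 3, "tool": "compare",
     "args": {"a": "$E1", "b": "$E2", "question": "which is older?"},
     "deps": [1, 2], "out": "$E3"}
  ]
}
```

Because dependencies are explicit arrays, invalid plans (unknown dependency IDs, cycles, references to undefined $E-variables) can be rejected before any executor runs.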

The executor should receive only what it needs for the current step. Not the full tool set—just the tools relevant to this step's operation. Not the full conversation history—just the plan step, its resolved input variables, and any step-specific context. This is both a performance property (shorter context, less rot) and a security property (smaller blast radius for hallucinated or adversarial tool calls).
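One way to make that least-privilege property concrete: build the executor's view per step, handing it exactly one tool and its resolved arguments. The tool registry and field names below are illustrative assumptions.

```python
# A registry of available tools; the executor never sees the whole thing.
TOOL_REGISTRY = {
    "search": lambda query: f"results for {query}",
    "compare": lambda a, b, question: f"comparing {a} and {b}: {question}",
}

def executor_view(step, outputs):
    """Build the minimal context for one step: the single tool it names
    and its resolved arguments. No history, no full tool set."""
    tool = TOOL_REGISTRY[step["tool"]]            # just this step's tool
    args = {k: outputs.get(v, v) if isinstance(v, str) else v
            for k, v in step["args"].items()}     # resolved $E-inputs only
    return tool, args
```

If a hallucinated or injected tool call names anything outside that single entry, the lookup fails loudly instead of silently reaching a tool the step was never supposed to touch.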

The planner and executor don't need to use the same model. Planners benefit from strong reasoning at the global level; executors benefit from speed and cost efficiency at the local level. Using a large frontier model for planning and a smaller specialized model for execution is architecturally clean and often substantially cheaper.

Monitoring needs to happen at both layers. Planning failures (invalid dependency graphs, impossible step orderings, ambiguous variable references) look different from execution failures (tool errors, malformed arguments, unexpected outputs). Treating them as a single category makes debugging significantly harder. Traces should mark the boundary between planning and execution events.

The Right Time to Add a Planning Module

Not every agent needs an explicit planning module. For tasks with two or three steps, no dependencies, and stable tool behavior, a ReAct loop is fine. The overhead of a separate planner isn't justified.

The signal that you need a planning module is usually one of three things: your agents fail predictably on tasks that require more than five or six sequential steps; accuracy degrades as context length grows rather than as per-step difficulty increases; or tasks with independent subtasks are being serialized for no reason other than the default sequential structure.

At that point, the planning module isn't an optimization. It's the architectural change that makes the difference between a system that sort of works and one that scales.

The seam was always there. Exposing it deliberately, designing the interface between planner and executor explicitly, and routing monitoring and replanning logic through it is the decision that separates agents that degrade gracefully from agents that fail mysteriously.
