The Agent Planning Module: A Hidden Architectural Seam
Most agentic systems are built with a single architectural assumption that goes unstated: the LLM handles both planning and execution in the same inference call. Ask it to complete a ten-step task, and the model decides what to do, does it, checks the result, decides what to do next—all in one continuous ReAct loop. This feels elegant. It also collapses under real workloads in a way that's hard to diagnose because the failure mode looks like a model quality problem rather than a design problem.
The agent planning module—the component responsible purely for task decomposition, dependency modeling, and sequencing—is the seam most practitioners skip. It shows up only when things get hard enough that you can't ignore it.
Why Monolithic Agents Fail at Scale
The problem isn't that LLMs are bad at planning. It's that step-wise greedy reasoning is a poor fit for long-horizon planning. A monolithic agent picks the locally most plausible next action at each step. That's a greedy policy, and greedy policies carry no optimality guarantee on multi-step problems: a locally plausible move can foreclose the globally correct path, and the cost of each early mistake compounds with every step that follows.
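To get a feel for how per-step mistakes compound, here is a back-of-the-envelope sketch. It assumes each step succeeds independently with the same probability, which is a simplification introduced here for illustration and not a model taken from the research discussed in this article; the function name chain_success and the 0.95 figure are likewise hypothetical.

```python
def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that every one of n_steps succeeds.

    Assumes steps succeed independently with probability p_step
    (an illustrative simplification, not a measured property of any model).
    """
    return p_step ** n_steps


# Even a 95%-reliable step leaves long chains unreliable.
for n in (1, 5, 10, 20):
    print(n, round(chain_success(0.95, n), 3))
# 1 0.95
# 5 0.774
# 10 0.599
# 20 0.358
```

Under this toy model a twenty-step task finishes cleanly only about a third of the time, before accounting for the self-conditioning and context effects described next.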
Research formalized this in 2026: on multi-hop question-answering benchmarks, greedy reasoning agents selected myopic actions that derailed the overall task 55.6% of the time at the very first decision point. Accuracy on shallow tasks hovered near 47%; on longer horizons it approached zero. Wider beam search made it worse, pushing the trap selection rate to 71.9%. Once an agent makes a wrong first move, recovery probability under greedy policies is around 5%. Compare that to approaches with even shallow lookahead: recovery probability jumps to nearly 30%.
But there's a second failure that compounds the first: error self-conditioning. When an LLM makes an error early in a task, it then conditions on that error as if it were ground truth. The mistake becomes part of the context that shapes every subsequent inference call. Benchmarks on repeated-execution tasks found that frontier models with hundreds of billions of parameters drop below 50% accuracy within 15 turns on straightforward sequential tasks. The degradation isn't gradual—it accelerates as the error history accumulates.
And beneath both of these is context rot. LLMs don't process their context window uniformly. The 10,000th token is not attended to with the same fidelity as the 100th. As a monolithic agent accumulates tool responses, observation histories, and intermediate reasoning across dozens of steps, the effective context available for new decisions shrinks. Research into this phenomenon found that even single distractors measurably reduce accuracy, and the impact compounds with context length—affecting even the largest frontier models. The effective context that actually shapes model behavior can be a small fraction of the nominal window.
Put these three failures together—greedy myopia, error self-conditioning, and context rot—and you have a system that degrades predictably as task complexity increases. On one of the most comprehensive long-horizon planning benchmarks released recently, the best frontier model scored 0.343 out of 1.0. The average across all models tested was 0.232. These aren't edge cases; they're the current state of the art.
The Architectural Fix: Separate Planning from Execution
The insight is straightforward even if the implementation isn't: planning and execution are cognitively distinct operations that should not share the same inference call.
A planner takes a high-level task and produces a structured multi-step decomposition. It reasons about goals, constraints, inter-step dependencies, and ordering. It doesn't execute anything. An executor takes a single step from that plan and translates it into concrete tool calls. It doesn't need to reason about the whole problem—it needs to complete one well-specified unit of work.
This mirrors the separation of concerns principle that appears throughout systems engineering. The executor can be a smaller, cheaper model or even deterministic code. It doesn't carry the cognitive burden of the entire task; it only needs step-level context and step-level tool access. The planner is a separate, slower inference call that runs upfront or at key checkpoints rather than on every step.
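A minimal sketch of that split might look like the following. Everything here is hypothetical scaffolding: Step, plan_task, execute_step, and the two toy tools are names made up for illustration, and the planner is a hard-coded stand-in for what would be a single up-front model call.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Step:
    """One well-specified unit of work produced by the planner."""
    id: int
    tool: str
    args: dict[str, Any]


def plan_task(task: str) -> list[Step]:
    """Stand-in for the planner. A real planner would prompt a model once;
    this one returns a hard-coded plan so the sketch runs offline."""
    return [
        Step(1, "upper", {"text": task}),
        Step(2, "length", {"text": task}),
    ]


def execute_step(step: Step, tools: dict[str, Callable[..., Any]]) -> Any:
    """Executor: sees one step and its tool, never the whole task history."""
    return tools[step.tool](**step.args)


# Toy deterministic tools (assumptions for the example).
TOOLS: dict[str, Callable[..., Any]] = {
    "upper": lambda text: text.upper(),
    "length": lambda text: len(text),
}


def run(task: str) -> dict[int, Any]:
    """Plan once, then execute each step in isolation."""
    return {step.id: execute_step(step, TOOLS) for step in plan_task(task)}


print(run("hello"))  # {1: 'HELLO', 2: 5}
```

The point of the shape is that execute_step has no access to the task description or to other steps' outputs unless the plan hands them over explicitly.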
The practical impact of this separation is measurable. Architectures that pre-plan all tool calls and use variable substitution, where later steps reference earlier outputs by name rather than re-reading full observation histories, use roughly 65% fewer tokens and show meaningful accuracy improvements compared to ReAct-style loops. The executor never sees the bloated observation history; it sees $E1 and $E2 as inputs and produces $E3 as output. Because the accumulating history never enters the executor's context, the main driver of context rot is removed by construction.
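Variable substitution itself can be very small. The sketch below assumes placeholders have the form $E followed by a step number, as in the text, and that step outputs are plain strings; the function name substitute and the sample values are hypothetical.

```python
import re

# Matches placeholders such as $E1 or $E12 and captures the step number.
PLACEHOLDER = re.compile(r"\$E(\d+)")


def substitute(template: str, results: dict[int, str]) -> str:
    """Replace each $E<n> in template with the stored output of step n."""

    def lookup(match: re.Match) -> str:
        step_id = int(match.group(1))
        if step_id not in results:
            # Fail loudly rather than passing an unresolved name to a tool.
            raise KeyError(f"step {step_id} has not produced an output yet")
        return results[step_id]

    return PLACEHOLDER.sub(lookup, template)


results = {1: "Paris", 2: "Berlin"}
print(substitute("compare($E1, $E2)", results))  # compare(Paris, Berlin)
```

Only the values a step names are copied into its input, so the size of the executor's context depends on the step, not on how long the run has been going.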
When the planner also produces a dependency graph rather than a flat ordered list, a further improvement emerges: independent tasks can run concurrently. A sequential agent doing three independent lookups does them one at a time. A graph-aware executor with a task scheduler runs them in parallel, substituting results into dependent steps only after their inputs are available. In production benchmarks, this approach produced latency improvements of 3–4x, cost reductions of 6x, and accuracy gains of roughly 9% compared to sequential ReAct-style execution.
Graph-Based Planning: Not Just an Optimization
The shift from an ordered list to a dependency graph is more than a performance improvement—it's a different correctness model.
A flat numbered plan implies an ordering that may not actually exist. Steps that are independent get sequentialized by accident. Steps that genuinely depend on earlier outputs may happen to be ordered correctly, but the dependency isn't encoded anywhere the system can check or enforce it.
A directed acyclic graph makes dependencies explicit. Each task node specifies its tool, its arguments, and the list of task IDs it depends on. A task scheduler—essentially a topological sort with parallel dispatch—runs tasks as soon as their dependencies are satisfied. Tasks with no dependencies run immediately and concurrently. The executor handling step 3 receives $E1 and $E2 already resolved; it doesn't need to reason about what those values are or where they came from.
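One way to sketch such a scheduler in Python is with the standard library's graphlib.TopologicalSorter and a thread pool. This is an illustrative sketch, not the scheduler of any particular framework; run_dag, the task callables, and the stubbed lookup values are assumptions made for the example.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter
from typing import Any, Callable

Task = Callable[[dict[int, Any]], Any]


def run_dag(tasks: dict[int, Task], deps: dict[int, set[int]],
            max_workers: int = 4) -> dict[int, Any]:
    """Run each task as soon as all of its dependencies have finished."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the plan has a cycle
    results: dict[int, Any] = {}
    in_flight = {}  # future -> task id
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        while sorter.is_active():
            # Dispatch everything whose dependencies are satisfied.
            for tid in sorter.get_ready():
                inputs = {d: results[d] for d in deps.get(tid, ())}
                in_flight[pool.submit(tasks[tid], inputs)] = tid
            # Block until at least one running task completes.
            done, _ = wait(in_flight, return_when=FIRST_COMPLETED)
            for fut in done:
                tid = in_flight.pop(fut)
                results[tid] = fut.result()
                sorter.done(tid)  # may unlock dependent tasks
    return results


# Stubbed tools: tasks 1 and 2 are independent, task 3 needs both.
tasks = {
    1: lambda inputs: "Paris",
    2: lambda inputs: "Berlin",
    3: lambda inputs: f"{inputs[1]} vs {inputs[2]}",
}
deps = {1: set(), 2: set(), 3: {1, 2}}
print(run_dag(tasks, deps))
```

Each task receives only the outputs of its declared dependencies, so an undeclared dependency shows up as a KeyError inside the task instead of silently working because of ordering luck.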
In practice, this looks like a planner outputting something structured:
Task 1: search("France capital") → $E1 [deps: none]
Task 2: search("Germany capital") → $E2 [deps: none]
Task 3: compare($E1, $E2, question="which is older?") [deps: 1, 2]
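Once a plan is in this form, the dependencies can be parsed and checked before anything runs. The sketch below assumes the exact line layout shown above (an optional "→ $En" output name and a "[deps: ...]" suffix); the regular expression, TaskNode, parse_plan, and validate are hypothetical helpers written for this example, not part of any existing planner.

```python
import re
from dataclasses import dataclass
from graphlib import TopologicalSorter
from typing import Optional

# Assumed layout: Task <id>: <tool>(<args>) [→ $E<n>] [deps: none | 1, 2]
LINE = re.compile(
    r"^Task (\d+): (\w+)\((.*)\)(?: → (\$E\d+))? \[deps: ([^\]]*)\]$"
)


@dataclass(frozen=True)
class TaskNode:
    id: int
    tool: str
    raw_args: str           # unparsed argument text, placeholders included
    output: Optional[str]   # e.g. "$E1", or None if the step names no output
    deps: frozenset


def parse_plan(text: str) -> dict[int, TaskNode]:
    """Turn the textual plan into task nodes keyed by task ID."""
    nodes: dict[int, TaskNode] = {}
    for line in text.strip().splitlines():
        match = LINE.match(line.strip())
        if match is None:
            raise ValueError(f"unparseable plan line: {line!r}")
        tid, tool, raw_args, output, deps = match.groups()
        dep_ids = (frozenset() if deps.strip() == "none"
                   else frozenset(int(d) for d in deps.split(",")))
        nodes[int(tid)] = TaskNode(int(tid), tool, raw_args, output, dep_ids)
    return nodes


def validate(nodes: dict[int, TaskNode]) -> None:
    """Reject plans that reference unknown tasks or contain a cycle."""
    for node in nodes.values():
        missing = node.deps - set(nodes)
        if missing:
            raise ValueError(f"task {node.id} depends on unknown {sorted(missing)}")
    # prepare() raises graphlib.CycleError (a ValueError) on cyclic plans.
    TopologicalSorter({n.id: n.deps for n in nodes.values()}).prepare()


PLAN = '''
Task 1: search("France capital") → $E1 [deps: none]
Task 2: search("Germany capital") → $E2 [deps: none]
Task 3: compare($E1, $E2, question="which is older?") [deps: 1, 2]
'''
nodes = parse_plan(PLAN)
validate(nodes)
print(sorted(nodes[3].deps))  # [1, 2]
```

Running validate before dispatch is what turns the dependency annotations from documentation into something the system enforces.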
