Tool Call Ordering Is a Partial Order, Not a Set
A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.
This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.
The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
Why the Order Drifts in the First Place
The autoregressive nature of LLMs means small input changes produce non-local output changes. A tool definition reordered in the system prompt, a few-shot example added, a system-message wording cleanup — any of these can shift which tool the model picks first on a given turn. The model isn't choosing an order from a stable distribution; it's resampling a sequence whose first element strongly conditions the rest.
This shows up three ways in practice:
- Across runs of the same prompt: temperature > 0, or even temperature = 0 with a different model build, produces different orderings. Calls that appear "almost always A→B→C" may turn out to be A→B→C 92% of the time, B→A→C 6%, and A→C→B 2%. That 8% tail never shows up in eval when eval replays three fixed traces.
- Across prompt edits: a "tighten the system prompt" cleanup that also reorders the tool list shifts the planner's preferred order across the entire tool surface. The diff looks like a wording change. The behavior change shows up in production a week later.
- Across model upgrades: the same prompt against a new model build picks a different order. The release notes say "improved tool use." What that means for your specific tool set is that the ordering distribution moved, and the long tail moved with it.
Recent work on order sensitivity in LLMs measures exactly this: shuffling the input sequence almost always decreases accuracy on multi-step tasks, with the magnitude depending on how much of the task's structure was implicit in the prompt's ordering rather than the model's reasoning. Few-shot prompting partially mitigates this. It does not fix it.
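One way to make the long tail visible is to tally the call order across many sampled runs instead of reading a few traces. The sketch below assumes traces are available as plain lists of tool names; the `ordering_distribution` helper and the synthetic data are illustrative, not taken from any particular harness or measurement.

```python
from collections import Counter


def ordering_distribution(traces):
    """Return the share of runs that produced each tool-call order.

    `traces` is a hypothetical list of runs, each a list of tool names
    in the order the planner called them.
    """
    counts = Counter(tuple(trace) for trace in traces)
    total = sum(counts.values())
    # most_common() sorts by frequency, so the dominant order comes first.
    return {order: n / total for order, n in counts.most_common()}


# Synthetic data mirroring the illustrative 92/6/2 split, not real measurements.
traces = (
    [["A", "B", "C"]] * 92
    + [["B", "A", "C"]] * 6
    + [["A", "C", "B"]] * 2
)

dist = ordering_distribution(traces)
for order, share in dist.items():
    print(" -> ".join(order), f"{share:.0%}")
```

Running the same tally before and after a prompt edit or model upgrade gives a cheap signal that the ordering distribution moved, even when every individual trace looks reasonable.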
The Trap of Treating Tools as a Set
The mental model most teams ship with is: "the planner is intelligent; the tools are commutative; the planner will figure out the right order." This is wrong on both counts.
The planner is not intelligent in the way you need. It's a next-token predictor with strong priors over common patterns. If your tool names are similar to a public dataset's tool names, the planner inherits the public dataset's ordering preference — which has nothing to do with your dependency structure. Custom tools with names like internal_create_workspace and internal_attach_member look unfamiliar to the model, so the prior is weaker and the variance is higher.
The tools are not commutative. This is the part teams underweight because the tools are almost commutative — for the cases the team thought about. The non-commutativity lives in the cases they didn't:
- An auth-refresh tool and a fetch tool are commutative when the token is fresh; non-commutative when it's expired.
- A create-resource tool and a notify-collaborator tool are commutative for personal workspaces; non-commutative for shared workspaces because notification renders the resource title.
- A search tool and a write tool are commutative for read-only queries; non-commutative when search is also building the agent's context for the write decision.
Each of these is a partial-order edge that lives in the team's heads, not in the tool definition. The planner has no access to the team's heads. So the discipline has to be: make the partial order explicit in the artifact the planner consumes, or make the tools genuinely commutative through idempotency.
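The "almost commutative" trap can be tested directly: run two tools in both orders from the same starting state and compare the results. The sketch below is a toy model of the auth-refresh and fetch case; `refresh_auth`, `fetch`, and the dictionary-shaped state are hypothetical stand-ins for real tools with real side effects.

```python
import copy


def refresh_auth(state):
    # Hypothetical tool: replaces whatever token is present with a fresh one.
    state["token"] = "fresh"


def fetch(state):
    # Hypothetical tool: logs a success only when the token is valid.
    ok = state["token"] == "fresh"
    state["log"].append("fetch:200" if ok else "fetch:401")


def commutes(tool_a, tool_b, initial_state):
    """True if running the two tools in either order ends in the same state."""
    ab = copy.deepcopy(initial_state)
    tool_a(ab)
    tool_b(ab)

    ba = copy.deepcopy(initial_state)
    tool_b(ba)
    tool_a(ba)

    return ab == ba


fresh = {"token": "fresh", "log": []}
expired = {"token": "expired", "log": []}

print(commutes(refresh_auth, fetch, fresh))    # True: order is irrelevant
print(commutes(refresh_auth, fetch, expired))  # False: fetch-first gets a 401
```

Every starting state where the check returns False marks a partial-order edge that has to be declared or engineered away.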
Declare the Order Where the Tool Lives
The cleanest fix is to extend the tool definition itself with a depends_on declaration that the harness enforces structurally. Graph-harness designs take this further by lifting the entire control structure into a static DAG: each node's ready-set is computed from the graph, the harness dispatches a tool only once its prerequisites are satisfied, and a planner-proposed sequence that violates the declared dependencies is reordered or rejected. The LLM Compiler pattern applies the same idea with a DEPENDS_ON: [node_id, ...] field on each task, where an empty list means the task has no prerequisites and is schedulable immediately.
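A minimal sketch of harness-side enforcement, assuming a hypothetical in-memory registry where each tool carries a `depends_on` list and is called at most once per plan. The tool names and the `ready_set`, `check_plan`, and `repair_plan` helpers are illustrative, not the API of any specific framework. It uses `graphlib` from the Python standard library (3.9+) for the reordering step.

```python
from graphlib import TopologicalSorter

# Hypothetical registry: prerequisites are declared next to the tool itself.
TOOLS = {
    "create_resource": {"depends_on": []},
    "attach_member": {"depends_on": ["create_resource"]},
    "notify_collaborator": {"depends_on": ["create_resource"]},
}


def ready_set(done):
    """Tools not yet run whose declared prerequisites have all completed."""
    return {
        name
        for name, spec in TOOLS.items()
        if name not in done and all(dep in done for dep in spec["depends_on"])
    }


def check_plan(plan):
    """Reject a planner-proposed sequence that violates the declared order."""
    done = set()
    for call in plan:
        missing = [d for d in TOOLS[call]["depends_on"] if d not in done]
        if missing:
            raise ValueError(f"{call} dispatched before {missing}")
        done.add(call)
    return plan


def repair_plan(plan):
    """Reorder the planner's calls to respect dependencies, then re-check.

    Only reorders calls the planner asked for; if a prerequisite is absent
    from the plan entirely, check_plan still rejects the result.
    """
    wanted = set(plan)
    graph = {
        name: [d for d in TOOLS[name]["depends_on"] if d in wanted]
        for name in plan
    }
    return check_plan(list(TopologicalSorter(graph).static_order()))


print(ready_set(set()))
print(repair_plan(["notify_collaborator", "create_resource"]))
```

Whether the harness repairs or rejects is a policy choice: repairing hides planner mistakes from the user, while rejecting surfaces them to the model as a tool error it can react to.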
The point isn't that you need a full DAG scheduler for every agent. The point is that the dependency information has to live somewhere the harness reads, not somewhere the prompt suggests. The two operative properties:
References
- https://arxiv.org/html/2604.11378v1
- https://www.huuphan.com/2026/04/deterministic-agentic-ai-architecture.html
- https://www.buildmvpfast.com/blog/idempotent-ai-agent-retry-safe-patterns-production-workflow-2026
- https://news.ycombinator.com/item?id=46933954
- https://www.philschmid.de/agent-harness-2026
- https://addyosmani.com/blog/agent-harness-engineering/
- https://arxiv.org/html/2502.04134v2
- https://medium.com/@abhaychougule0907/underlying-factors-behind-inconsistency-in-llm-responses-with-multi-tool-calling-628ce7b4de76
- https://agent-patterns.readthedocs.io/en/stable/patterns/llm-compiler.html
- https://arxiv.org/html/2508.01249v3
- https://arxiv.org/abs/2602.16708v2
