Tool Call Ordering Is a Partial Order, Not a Set
A "create then notify" sequence works in dev. A "notify then create" sequence emits a webhook for an entity that doesn't exist yet, the consumer 404s, and your team spends a week debugging what looks like a flaky integration test. The flake isn't flaky. It's deterministic given a hidden ordering invariant your tool set has and your planner doesn't know about.
This is the shape of most tool-call-ordering bugs in production agents: a tool set that secretly composes as a partial order — some operations must happen before others, others can run in any order — being treated by the planner as an unordered set of capabilities. The model picks an order that worked yesterday. A prompt edit, a model upgrade, or even a different temperature sample picks a different order tomorrow. Both look reasonable to anyone reading the trace. Only one is correct.
The team that doesn't declare the order is shipping a bug surface that the model's prompt sensitivity will eventually find.
Why the Order Drifts in the First Place
The autoregressive nature of LLMs means small input changes produce non-local output changes. A tool definition reordered in the system prompt, a few-shot example added, a system-message wording cleanup — any of these can shift which tool the model picks first on a given turn. The model isn't choosing an order from a stable distribution; it's resampling a sequence whose first element strongly conditions the rest.
This shows up three ways in practice:
- Across runs of the same prompt: temperature > 0, or even temperature = 0 with a different model build, produces different orderings. Calls that appear "almost always A→B→C" are actually A→B→C 92% of the time, B→A→C 6%, A→C→B 2%. The 8% never showed up in eval because eval used three fixed traces.
- Across prompt edits: a "tighten the system prompt" cleanup that reorders the tool list in the registry shifts the planner's preferred order across the entire surface. The diff looks like wording. The behavior change shows up in production a week later.
- Across model upgrades: the same prompt against a new model build picks a different order. The release notes say "improved tool use." What that means for your specific tool set is that the ordering distribution moved, and the long tail moved with it.
Recent work on order sensitivity in LLMs measures exactly this: shuffling the input sequence almost always decreases accuracy on multi-step tasks, with the magnitude depending on how much of the task's structure was implicit in the prompt's ordering rather than the model's reasoning. Few-shot prompting partially mitigates this. It does not fix it.
The Trap of Treating Tools as a Set
The mental model most teams ship with is: "the planner is intelligent; the tools are commutative; the planner will figure out the right order." This is wrong on both counts.
The planner is not intelligent in the way you need. It's a next-token predictor with strong priors over common patterns. If your tool names are similar to a public dataset's tool names, the planner inherits the public dataset's ordering preference — which has nothing to do with your dependency structure. Custom tools with names like internal_create_workspace and internal_attach_member look unfamiliar to the model, so the prior is weaker and the variance is higher.
The tools are not commutative. This is the part teams underweight because the tools are almost commutative — for the cases the team thought about. The non-commutativity lives in the cases they didn't:
- An auth-refresh tool and a fetch tool are commutative when the token is fresh; non-commutative when it's expired.
- A create-resource tool and a notify-collaborator tool are commutative for personal workspaces; non-commutative for shared workspaces because notification renders the resource title.
- A search tool and a write tool are commutative for read-only queries; non-commutative when search is also building the agent's context for the write decision.
Each of these is a partial-order edge that lives in the team's heads, not in the tool definition. The planner has no access to the team's heads. So the discipline has to be: make the partial order explicit in the artifact the planner consumes, or make the tools genuinely commutative through idempotency.
Declare the Order Where the Tool Lives
The cleanest fix is to extend the tool definition itself with a depends_on declaration that the harness enforces structurally. Graph-harness designs take this further by lifting the entire control structure into a static DAG: each node's ready-set is computed from the graph, the harness only dispatches a tool when its prerequisites have been satisfied, and the planner's preferred order is reordered or rejected against the declared dependencies. The LLM Compiler pattern runs the same idea with a DEPENDS_ON: [node_id, ...] field on each task, with an empty list meaning "no prerequisites — schedulable immediately."
The point isn't that you need a full DAG scheduler for every agent. The point is that the dependency information has to live somewhere the harness reads, not somewhere the prompt suggests. The two operative properties:
- Locality: the dependency is declared next to the tool, so a developer adding the tool can see and update it. A dependency expressed as a paragraph in the system prompt drifts immediately and silently.
- Enforcement: the harness rejects (or reorders) plans that violate the declaration, so the planner cannot ship a prompt-induced bug into production. A dependency expressed as "the system prompt asks the model to please call A before B" is a suggestion, not an invariant.
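A minimal sketch of what harness-side enforcement can look like, assuming a registry where each tool declares a `depends_on` list. The tool names and registry shape are illustrative, not any particular library's API:

```python
# Minimal sketch of harness-side dependency enforcement.
# Registry shape and tool names are illustrative assumptions.

TOOL_REGISTRY = {
    "create_workspace":    {"depends_on": []},
    "attach_member":       {"depends_on": ["create_workspace"]},
    "notify_collaborator": {"depends_on": ["create_workspace"]},
}

def ready_set(completed: set) -> set:
    """Tools whose declared prerequisites have all run."""
    return {
        name for name, spec in TOOL_REGISTRY.items()
        if name not in completed and set(spec["depends_on"]) <= completed
    }

def dispatch(plan: list) -> set:
    """Walk the planner's order, rejecting any call whose
    prerequisites have not been satisfied yet."""
    completed = set()
    for tool in plan:
        if tool not in ready_set(completed):
            raise ValueError(
                f"{tool} called before its dependencies: "
                f"{TOOL_REGISTRY[tool]['depends_on']}"
            )
        completed.add(tool)  # a real harness would invoke the tool here
    return completed
```

Note that the planner is still free to pick any order within the constraints: `attach_member` and `notify_collaborator` can run in either order, but neither can precede `create_workspace`.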
The lightweight version of this for teams not ready to adopt a full graph harness: a static check at agent-definition time that reads the tool registry, builds the implied dependency graph, and fails CI if a planner-generated trace from the eval suite ever produces a topological-order violation. This catches the regression at PR time rather than in production.
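A sketch of that static check, assuming dependencies are flattened into a list of (before, after) edges. The edge pairs and trace format are assumptions, not a standard:

```python
# Hypothetical CI check: fail the build if any eval-suite trace
# runs a declared (before, after) edge in the wrong order.

EDGES = [
    ("create_workspace", "notify_collaborator"),
    ("auth_refresh", "fetch_records"),
]

def violations(trace: list) -> list:
    """Return every declared edge this trace executes backwards."""
    position = {tool: i for i, tool in enumerate(trace)}
    return [
        (before, after) for before, after in EDGES
        if before in position and after in position
        and position[after] < position[before]
    ]

def check_traces(traces: list) -> None:
    """CI entry point: abort with a nonzero exit on any violation."""
    bad = [(i, v) for i, t in enumerate(traces) if (v := violations(t))]
    if bad:
        raise SystemExit(f"topological-order violations: {bad}")
```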
Idempotency Is the Other Half
Declared dependencies handle the case where order matters. Idempotency handles the case where it shouldn't. These are not alternatives — you need both, and the boundary between them is the design decision.
Idempotent agent patterns treat each tool call as a deterministic side effect: hash (workflow_id, tool, args) into an idempotency key, store the result in a ledger, and on replay return the cached result instead of re-executing. The agent-ledger library sits at the tool-call boundary specifically to prevent the "reasoning differently on retry causes double side effects" failure mode.
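The pattern is small enough to sketch directly. This is the shape of the ledger described above, not agent-ledger's actual API:

```python
import hashlib
import json

# Sketch of a tool-call ledger keyed on (workflow_id, tool, args).
# An in-memory dict stands in for durable storage.

_LEDGER = {}

def idempotency_key(workflow_id: str, tool: str, args: dict) -> str:
    """Deterministic hash of the call identity."""
    payload = json.dumps([workflow_id, tool, args], sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()

def call_with_ledger(workflow_id: str, tool: str, args: dict, execute):
    key = idempotency_key(workflow_id, tool, args)
    if key in _LEDGER:          # replay: return the cached result,
        return _LEDGER[key]     # never re-run the side effect
    result = execute(**args)
    _LEDGER[key] = result
    return result
```

The `sort_keys=True` matters: two semantically identical calls whose args serialize in different key orders must produce the same key, or the ledger silently stops deduplicating.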
For ordering specifically, idempotency lets you mean it when you say "these tools are commutative." If create_workspace is idempotent, the planner can call it twice — once before attach_member, once after — without harm. If send_email is idempotent on (template_id, recipient, idempotency_key), a reordering that ends up sending the email earlier than intended doesn't double-send when the planner recovers and tries again.
The trap is asserting commutativity without enforcing it. Teams write "this tool is idempotent" in a docstring, the tool isn't actually idempotent at the storage layer (it's only idempotent on a clean database), and a reordering that happens once a quarter produces a duplicate row that takes two days to track down. The discipline: idempotency is a property of the storage and side-effect path, not of the tool's signature, and the harness should treat undeclared idempotency as "not idempotent" rather than "probably fine."
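What storage-layer idempotency looks like in the simplest case: a uniqueness constraint that makes the duplicate row impossible regardless of how the planner reordered or retried. The table and key scheme are illustrative:

```python
import sqlite3

# Idempotency enforced at the storage layer, not in a docstring:
# the PRIMARY KEY makes a replayed insert a structural no-op,
# even against a database that already contains the row.

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE workspace (
        idempotency_key TEXT PRIMARY KEY,
        name            TEXT NOT NULL
    )
""")

def create_workspace(key: str, name: str) -> None:
    # INSERT OR IGNORE: a retry or reordered duplicate with the
    # same key changes nothing.
    conn.execute(
        "INSERT OR IGNORE INTO workspace (idempotency_key, name) VALUES (?, ?)",
        (key, name),
    )

create_workspace("wf1:create_workspace", "shared-docs")
create_workspace("wf1:create_workspace", "shared-docs")  # retried call
```

This is the property the docstring was asserting. The difference is that here a once-a-quarter reordering hits a constraint, not a duplicate row.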
Eval Adversarially, Or You Won't Catch It
Eval suites built from happy-path traces lock in the ordering bias of the model that generated the traces. They will not catch ordering bugs because they only sample one ordering per task.
The adversarial pattern that works: for any task whose tool set has more than two effectful tools, construct an eval case with the same task description and at least three plausible orderings — two correct, one incorrect (where "incorrect" means it violates a real partial-order edge). Score on whether the agent reaches the correct end state regardless of which permutation it chose. The two correct orderings catch the case where the agent only knows one path; the incorrect ordering catches the case where the agent will follow whatever order the prompt happens to suggest.
A second discipline: when you add a new tool, the eval suite should run a permutation-fuzz pass on existing tasks that now include the new tool, generating N random orderings and verifying the harness either reorders them to a valid topology or rejects them with a clear error. This is where the "team adds a sixth tool that introduces a new ordering dependency and 0.3% of production traffic starts producing inconsistent state" failure gets caught. 0.3% is the long tail of the prompt-sensitivity distribution, and it doesn't show up in canary unless canary specifically samples adversarial orderings.
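A sketch of that fuzz pass, assuming the same (before, after) edge declarations as the static check. What counts as "accept" versus "reject" is the harness's decision; the fuzzer only has to exercise both branches:

```python
import random

# Permutation-fuzz sketch: sample random orderings of an effectful
# tool set and classify each against the declared edges. Edge list
# and tool names are illustrative assumptions.

EDGES = [
    ("create_workspace", "attach_member"),
    ("create_workspace", "notify_collaborator"),
]

def respects(order: list) -> bool:
    pos = {tool: i for i, tool in enumerate(order)}
    return all(pos[a] < pos[b] for a, b in EDGES)

def fuzz(tools: list, n: int = 50, seed: int = 0) -> dict:
    rng = random.Random(seed)
    accepted = rejected = 0
    for _ in range(n):
        order = rng.sample(tools, len(tools))
        if respects(order):
            accepted += 1   # harness may dispatch as-is
        else:
            rejected += 1   # harness must reorder or fail loudly
    return {"accepted": accepted, "rejected": rejected}
```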
The metric to graph alongside accuracy: ordering-stability — given the same task across N samples at the production temperature, what fraction of runs choose the same order, and what fraction of different orderings reach the correct end state? A stable agent has high agreement on order. A robust agent has low agreement on order but high agreement on outcome. You want robust. Stable-and-fragile is the worst quadrant: it looks fine in eval and breaks the moment a model upgrade shifts the distribution.
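The metric reduces to two numbers per task. A minimal sketch, assuming each run is recorded as (chosen order, reached-correct-end-state):

```python
from collections import Counter

# Ordering-stability sketch: given N sampled runs of the same task,
# measure agreement on the order versus agreement on the outcome.
# The (order, ok) run format is an assumed recording shape.

def ordering_stability(runs: list) -> dict:
    orders = Counter(order for order, _ok in runs)
    modal_share = orders.most_common(1)[0][1] / len(runs)
    outcome_rate = sum(ok for _order, ok in runs) / len(runs)
    return {
        "order_agreement":   modal_share,   # high = stable
        "outcome_agreement": outcome_rate,  # high = correct
    }
```

Robust is low `order_agreement` with high `outcome_agreement`; stable-and-fragile is high `order_agreement` with an `outcome_agreement` that collapses the first time the modal order shifts.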
What This Looks Like in the Architecture
The architectural realization underneath all of this: tool composition is a partial order, not a set. A set treats {A, B, C} as interchangeable; a partial order has edges like A → C and B → C while leaving A and B unordered. Most production tool sets are partial orders the team never wrote down.
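The A → C, B → C example above is small enough to enumerate. Every linearization that respects the edges is a correct plan; everything else is the bug surface:

```python
from itertools import permutations

# Enumerate the valid linearizations of the partial order
# {A -> C, B -> C} from the text.

EDGES = [("A", "C"), ("B", "C")]

def linearizations(nodes: set) -> list:
    valid = []
    for order in permutations(sorted(nodes)):
        pos = {n: i for i, n in enumerate(order)}
        if all(pos[a] < pos[b] for a, b in EDGES):
            valid.append(order)
    return valid
```

Two of the six total orders survive; a planner sampling from all six is wrong four times out of six unless the harness filters for it.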
Concretely, this implies four design decisions worth making explicitly rather than by accident:
- Declare dependencies at the tool layer. A depends_on field in the tool definition, enforced by the harness, with a static check at registry-build time that the graph is a DAG.
- Make commutative tools genuinely commutative. Idempotency keys, content-addressed storage, or natural idempotency at the database layer — not docstring assertions.
- Eval the partial order, not just the happy path. Adversarial orderings as first-class eval cases. Permutation fuzzing on tool addition.
- Accept that the planner's preferred order is a load-bearing artifact. It's a sample from a distribution that shifts when the prompt or model shifts. Treat it like a configuration value: pin it, version it, rebaseline it on upgrade.
Teams that do this end up with agents whose ordering behavior is boring — the harness enforces what must be true, the planner is free to optimize within those constraints, and a model upgrade shifts the order without shifting correctness. Teams that don't end up debugging a flaky-looking integration test for a week and discovering it was a hidden topological violation all along.
The bug surface is real. The model's prompt sensitivity will eventually find it. The only question is whether your eval finds it first.
- https://arxiv.org/html/2604.11378v1
- https://www.huuphan.com/2026/04/deterministic-agentic-ai-architecture.html
- https://www.buildmvpfast.com/blog/idempotent-ai-agent-retry-safe-patterns-production-workflow-2026
- https://news.ycombinator.com/item?id=46933954
- https://www.philschmid.de/agent-harness-2026
- https://addyosmani.com/blog/agent-harness-engineering/
- https://arxiv.org/html/2502.04134v2
- https://medium.com/@abhaychougule0907/underlying-factors-behind-inconsistency-in-llm-responses-with-multi-tool-calling-628ce7b4de76
- https://agent-patterns.readthedocs.io/en/stable/patterns/llm-compiler.html
- https://arxiv.org/html/2508.01249v3
- https://arxiv.org/abs/2602.16708v2
