The Deadlock Your Agent Can't See: Circular Tool Dependencies in Generated Plans
A planner agent emits seven steps. Each looks reasonable. The orchestrator dispatches them, the first three return values, the fourth waits on the fifth, the fifth waits on the seventh, and the seventh — buried three lines deep in the planner's prose — quietly waits on the fourth. Nothing is locked. No `EDEADLK` ever fires. The agent burns 40,000 tokens reasoning about why the fourth step "is taking longer than expected" and ultimately gives up with a soft, plausible apology to the user.
This is the deadlock your agent can't see. It is not the textbook deadlock from operating systems class — there are no mutexes, no resource graphs the kernel can introspect, no holders or waiters anyone in your stack would recognize. The dependencies live inside English sentences that the planner produced, the cycles form in latent semantics rather than in any data structure, and the failure mode looks indistinguishable from "the model is thinking hard." Classic deadlock detection is useless here, but the cost is identical: the workflow halts, tokens evaporate, and your trace tells you nothing.
The good news is that this category of failure is mechanically detectable — once you stop expecting the operating system to find it for you and start treating the plan itself as a graph that needs validation before any tool ever runs.
Why LLM-Authored Plans Produce Cycles
Plan-and-execute architectures, including LLMCompiler-style planners that stream a DAG of tasks with explicit `dependencies` fields, all share a quiet assumption: that the planner can correctly enumerate, in advance, what each step actually needs from the others. In practice, planners hallucinate dependencies the same way they hallucinate citations.
A few common patterns produce cycles:
- Bidirectional fact dependency. "Geocode the address to a country, then look up the address format for that country, then validate the address." The planner notes that step 3 depends on step 2, and step 2 on step 1 — but the prompt to step 1 says "use the canonical format for that country," which makes step 1 depend on step 2 in flat contradiction with what the dependency field says. Each step is locally coherent. The graph is not.
- Self-referential aggregation. "Summarize the findings from steps 4–9, then use that summary to choose which sub-questions to investigate in step 5." A human reader would catch it. The planner does not, because it generated the plan in linear order and never re-checked the back-edges.
- Implicit shared state. Two steps both write to a "scratchpad" memory and both read from it. Step A's tool description says it reads the scratchpad first; step B's says it must run after A but writes prerequisite values to the scratchpad before A runs. The dependency field declares A → B, but the actual data flow is B → A → B.
- Repair loops. A validator step is added "to retry if the previous step failed." When the previous step's failure mode is producing input the validator rejects, you have an unbounded cycle dressed as defensive engineering.
Recent research on tabular data generation makes this concrete: when LLMs are asked to extract feature dependencies and emit a graph for downstream topological traversal, the resulting graph routinely contains cycles. The example given is mundane — latitude and longitude determine country, but country constrains the valid ranges of latitude and longitude. Both directions are true, and the planner faithfully encoded both.
The pattern is the same in agent planning. Whenever the planner emits a graph, you should expect cycles, not be surprised by them.
Why Classic Deadlock Detection Doesn't Apply
If you've worked on databases or kernels, your reflex is to reach for cycle detection in a wait-for graph. That mental model breaks here for three reasons.
First, no resource is ever held. A real deadlock requires mutual exclusion plus hold-and-wait. The fourth tool call hasn't acquired anything; it's just sitting in the orchestrator's queue waiting for an upstream value that will never arrive. Your runtime sees a pending future, not a held lock.
Second, the dependency edges are not in the graph the runtime sees. The orchestrator's DAG says step 4 depends on step 5. It does not say step 7 also depends on step 4, because that edge exists only in the natural-language argument template inside step 7. Until step 7 actually runs and the LLM resolves its template, the runtime cannot see the back-edge. By the time the back-edge becomes visible, the agent has already been spinning for minutes.
Third, the symptom looks like slow progress, not a halt. A real deadlocked process is suspended; you can `ps` it and see it blocked on a futex. A "deadlocked" agent is busy. It is reasoning, calling its planner again, emitting tentative tool calls, generating thoughtful prose about how to proceed. The token meter spins. The trace fills with content. From a monitoring perspective, this is the worst possible failure shape: a high-cost, high-activity, zero-progress state that defeats every alarm tuned to "stuck" or "errored."
This is why the Multi-Agent System Failure Taxonomy lists "step repetition" as a distinct failure class. It is not an error. It is the absence of progress dressed in the costume of work.
A Static Plan-Graph Pass That Catches It Before the First Tool Call
The most effective defense is the cheapest one: extract the plan into a graph and check it for cycles before dispatching anything. The planner has already done the hard work of writing the steps down. You just have to read them more rigorously than the planner did.
A practical pass looks like this (a code sketch follows the list):
- Normalize the plan into nodes and edges. Each step becomes a node. For each step, parse out two kinds of edges: declared dependencies (the explicit `depends_on` field, if your planner emits one) and referenced dependencies (any `${step_3.output}`, `<step3>`, or named placeholder mentioned in the step's arguments or prompt). The referenced edges are where the back-edges hide.
- Resolve aliases. Planners refer to the same step five different ways: by index, by tool name, by a variable name they invented, by paraphrase. A simple alias table built once per plan, plus a quick LLM-as-extractor pass to canonicalize references, removes 90% of the noise.
- Run a depth-first cycle detection. Tarjan's algorithm or a plain DFS with a recursion stack will find the strongly connected components in milliseconds. Any SCC larger than one node is a cycle. Report it with the path.
- Reject or repair. If a cycle is found, return the plan to the planner with the offending path quoted back: "Step 4 depends on step 5; step 5 depends on step 7; step 7 references step 4's output. Please revise." Most planners will fix it on the first retry, because the cycle is genuinely a mistake, not a deliberate choice.
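A minimal sketch of the first three steps, assuming (purely for illustration) that the planner emits steps with an `id`, a `depends_on` list, and a prompt that references other steps as `${step_N.output}`; a real pipeline would run the alias-resolution pass first:

```python
import re

# Hypothetical plan shape: each step has an "id", a declared "depends_on"
# list, and a "prompt" whose ${step_N.output} placeholders are the
# *referenced* dependencies, where the hidden back-edges live.
STEP_REF = re.compile(r"\$\{step_(\d+)\.output\}")

def extract_edges(steps: list[dict]) -> dict[int, set[int]]:
    """Adjacency map: step id -> ids of the steps it depends on."""
    edges: dict[int, set[int]] = {}
    for step in steps:
        declared = set(step.get("depends_on", []))
        referenced = {int(m) for m in STEP_REF.findall(step.get("prompt", ""))}
        edges[step["id"]] = declared | referenced  # union catches hidden edges
    return edges

def find_cycle(edges: dict[int, set[int]]) -> list[int] | None:
    """Plain DFS with a recursion stack; returns one cycle path, if any."""
    WHITE, GRAY, BLACK = 0, 1, 2
    color = {n: WHITE for n in edges}
    stack: list[int] = []

    def dfs(node: int) -> list[int] | None:
        color[node] = GRAY
        stack.append(node)
        for dep in edges[node]:
            if dep not in color:
                continue  # reference to a nonexistent step: a separate lint error
            if color[dep] == GRAY:  # back-edge into the recursion stack: cycle
                return stack[stack.index(dep):] + [dep]
            if color[dep] == WHITE and (cycle := dfs(dep)) is not None:
                return cycle
        color[node] = BLACK
        stack.pop()
        return None

    for node in edges:
        if color[node] == WHITE and (cycle := dfs(node)) is not None:
            return cycle
    return None

# The deadlock from the opening example: 4 waits on 5, 5 waits on 7, and
# step 7's prompt quietly references step 4's output.
plan = [
    {"id": 4, "depends_on": [5], "prompt": "Validate using ${step_5.output}."},
    {"id": 5, "depends_on": [7], "prompt": "Derive rules from ${step_7.output}."},
    {"id": 7, "depends_on": [], "prompt": "Summarize ${step_4.output} first."},
]
print(find_cycle(extract_edges(plan)))  # [4, 5, 7, 4]
```

The printed cycle path is exactly what the repair step quotes back to the planner.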
The economics of this are absurdly favorable. A cycle-detection pass over a 20-step plan costs roughly nothing. A single execution of a deadlocked plan can cost tens of dollars in token charges before the orchestrator's wall-clock budget kills it. One static check pays for itself thousands of times over.
The same idea generalizes to other plan-level invariants worth checking before dispatch: unreachable steps, steps whose inputs are never produced, output types that don't match input types, tools called with required arguments missing. Treat the plan as code. Lint it.
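Two of those adjacent invariants fit in a few more lines over the same hypothetical plan shape: references to outputs no step produces, and orphan steps whose output nothing consumes (flagged rather than rejected, since a terminal answer step is legitimately unconsumed):

```python
def lint_plan(steps: list[dict]) -> list[str]:
    """Cheap structural lint over the plan graph built by extract_edges."""
    problems: list[str] = []
    ids = {step["id"] for step in steps}
    edges = extract_edges(steps)

    consumed: set[int] = set()
    for step_id, deps in edges.items():
        for dep in deps:
            if dep not in ids:
                problems.append(
                    f"step {step_id} needs step {dep}, which no step produces"
                )
        consumed |= deps

    # Unconsumed steps are either the final answer or dead weight; surface
    # them and let the planner (or a human) decide which.
    for orphan in sorted(ids - consumed):
        problems.append(f"step {orphan}'s output is never consumed")
    return problems
```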
A Runtime Watchdog for the Cycles That Slip Through
Static analysis catches the cycles encoded in the plan. It does not catch the cycles that emerge dynamically — the planner-replanner loop where each repair generates a new plan that fails in nearly the same way, or the validator-repair loop that keeps retrying a step whose output the validator will never accept.
For these, you need a runtime watchdog whose definition of "stuck" is more sophisticated than "no tool call in N seconds." Because remember: a cycling agent is not silent. It is busily spending money.
A watchdog that actually works combines several signals:
- Step count and token budget hard caps. Every agent run gets a fixed maximum number of LLM calls and a fixed token budget. When either is hit, the run dies. This is non-negotiable infrastructure, the agent equivalent of a per-request timeout. Reports of agents burning $50/hour while looping are routine; without a hard cap, your worst-case cost is unbounded.
- Structural repetition detection. Hash each tool call by `(tool_name, normalized_arguments)`. If the same hash appears more than three times in a window, the agent is cycling at the call level. Kill it. The hashes don't need to be exact — argument normalization (lowercased, whitespace-collapsed, with timestamps and request IDs stripped) catches the common case where the model varies trivial fields each iteration.
- Semantic repetition detection. Embed each reasoning step or message and compare against the recent window. When cosine similarity exceeds a threshold for several consecutive turns, the agent is paraphrasing itself rather than making progress. This catches the replanner loop that varies surface form across attempts but produces near-identical plans.
- Plan adherence drift. If the agent has a current plan and its latest tool call is not in the plan and not a recognized repair action, something is wrong. Either the plan is being abandoned silently (a separate failure mode) or the agent is wandering in cycles outside the planned graph.
The point of layering these signals is that no single one is reliable. Step-count caps stop runaway loops but accept enormous waste before firing. Structural hashing catches identical retries but misses paraphrased ones. Semantic similarity catches paraphrased loops but generates false positives on legitimate iteration. Combined, they build a picture: this agent is busy, but it is not making progress.
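A minimal sketch of such a watchdog, combining the hard caps with structural repetition hashing; the thresholds, the volatile-field list, and the normalization rules are all assumptions to tune per deployment:

```python
import hashlib
import json
import re
from collections import deque

class AgentWatchdog:
    """Kills a run on budget exhaustion or call-level cycling. The caps and
    the repeat window here are illustrative defaults, not recommendations."""

    def __init__(self, max_calls: int = 50, max_tokens: int = 200_000,
                 window: int = 12, max_repeats: int = 3):
        self.max_calls, self.max_tokens = max_calls, max_tokens
        self.max_repeats = max_repeats
        self.calls = 0
        self.tokens = 0
        self.recent: deque[str] = deque(maxlen=window)

    @staticmethod
    def _normalize(args: dict) -> dict:
        # Strip fields the model tends to vary trivially between iterations,
        # then collapse whitespace and case so near-identical calls collide.
        volatile = {"timestamp", "request_id", "trace_id"}
        return {k: re.sub(r"\s+", " ", str(v)).strip().lower()
                for k, v in sorted(args.items()) if k not in volatile}

    def check(self, tool_name: str, args: dict, tokens_used: int) -> str | None:
        """Returns a kill reason, or None if the run may continue."""
        self.calls += 1
        self.tokens += tokens_used
        if self.calls > self.max_calls:
            return f"step cap exceeded ({self.max_calls} calls)"
        if self.tokens > self.max_tokens:
            return f"token budget exceeded ({self.max_tokens})"

        digest = hashlib.sha256(
            json.dumps([tool_name, self._normalize(args)]).encode()
        ).hexdigest()
        self.recent.append(digest)
        if self.recent.count(digest) > self.max_repeats:
            return f"structural loop: {tool_name} repeated with equivalent args"
        return None
```

The semantic signal slots in beside the hash with whatever embedding model you already run: keep the recent message vectors and treat several consecutive turns above a tuned cosine-similarity threshold as the same kill condition.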
Treating the Plan as a First-Class Artifact
The deeper shift here is to stop thinking of the plan as scaffolding the agent uses internally and start treating it as a first-class artifact your infrastructure inspects, validates, and constrains. Planners are not authoritative just because they happen first. They are LLM outputs, with all the unreliability that implies.
Three operational habits help:
- Persist the full plan, not just the trace of executed steps. When something goes wrong, you want to be able to point at the original DAG, the cycles in it, and the moments where the runtime tried to repair them. Most agent observability tools today log the trace and lose the plan; this asymmetry is exactly backwards for diagnosing planning failures.
- Make plan validation a separate stage with its own metrics. "Plans rejected for cycles," "plans rejected for unreachable steps," "plans accepted on first try" should be charts on the dashboard you actually look at. If those numbers are zero, your validator is broken or your planner is suspiciously well-behaved. If they are high, your planner needs work but at least your validator is paying for itself.
- Constrain the planner's vocabulary. A planner that emits free-form prose dependencies will produce more cycles than one constrained to a typed schema with explicit `depends_on: [step_id]` lists. The schema doesn't eliminate cycles, but it makes them mechanically extractable, which is the prerequisite for catching them. Teams shipping planners that emit unconstrained natural language are accepting a class of bug they then cannot detect. A schema sketch follows this list.
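One way to impose that constraint, sketched here with Pydantic; any structured-output or constrained-decoding mechanism that enforces a JSON schema serves the same purpose, and the field names are illustrative rather than any framework's standard:

```python
from pydantic import BaseModel, Field

class PlanStep(BaseModel):
    """One node of the plan graph. Dependencies are step ids only,
    never prose references, so the extractor never has to guess."""
    id: int
    tool: str
    arguments: dict[str, str] = Field(default_factory=dict)
    depends_on: list[int] = Field(default_factory=list)

class Plan(BaseModel):
    steps: list[PlanStep]

# Validating the planner's raw output turns latent prose dependencies into
# explicit, checkable edges, or rejects the plan when they can't be expressed.
raw = '{"steps": [{"id": 1, "tool": "geocode", "depends_on": []}]}'
plan = Plan.model_validate_json(raw)  # raises ValidationError on schema drift
```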
The unifying observation is that an LLM-generated plan deserves the same skepticism as any other LLM-generated artifact. We have learned, painfully, to validate model outputs that face the user. The output that drives the next twenty tool calls deserves at least as much scrutiny.
What to Build This Quarter
If your agent today dispatches plans without a validation pass, the highest-leverage thing you can ship this quarter is a 200-line cycle detector that runs between the planner and the orchestrator. It will pay for itself the first time it catches a cycle, and the structural metrics it produces will start telling you things about your planner that the trace alone never could.
Pair that with a watchdog that watches for token budget exhaustion, repeated-call hashes, and reasoning-step similarity, and you will have moved this entire class of failure from "invisible cost we eat in the long tail" to "alarm with a stack trace pointing at the cycle." That's the difference between an agent system you can operate and one that occasionally bills you four figures while accomplishing nothing.
The deadlock your agent can't see is the one your infrastructure isn't looking for. Start looking.
- https://arxiv.org/html/2511.10650
- https://blog.langchain.com/planning-agents/
- https://langchain-ai.github.io/langgraph/tutorials/llm-compiler/LLMCompiler/
- https://arxiv.org/html/2603.20356v1
- https://www.agentpatterns.tech/en/failures/infinite-loop
- https://medium.com/@instatunnel/agentic-resource-exhaustion-the-infinite-loop-attack-of-the-ai-era-76a3f58c62e3
- https://arxiv.org/html/2507.19334
