Tool Reentrancy Is the Bug Class Your Function-Calling Layer Doesn't Know Exists
The agent spent four hundred milliseconds on a simple question, then crashed with a recursion-limit error. The trace showed twenty-five tool calls. Reading the trace top-to-bottom, an engineer would conclude the agent was confused — calling the same handful of tools in slightly different orders, never converging. That conclusion is wrong. The agent wasn't confused. It was stuck in a cycle: tool A invoked the model, the model picked tool B, tool B's implementation invoked the model again to format its output, and the formatter chose tool A. The trace UI rendered four nested calls as four sibling calls in a flat list, and the cycle was invisible to the only human who could have caught it.
This is tool reentrancy, and it's a bug class your function-calling layer almost certainly doesn't model. Concurrency-safe code has decades of primitives for it: reentrant mutexes that count nested acquisitions by the same thread, recursion limits at the language level, stack inspection APIs, and a cultural understanding that any function that calls back into the runtime needs a clear contract about what re-entry is allowed. Tool-calling layers default to fire-and-forget. There is no call stack the runtime can inspect, no cycle detector before dispatch, no reentrancy attribute on the tool definition, and the trace UI is shaped like a log, not a graph. The result is that every tool catalog past about a dozen entries silently becomes a source of recursion the framework can't see.
The Failure Mode Hides Inside "Successful" Tool Calls
The way teams notice tool reentrancy in production is rarely "the agent looped." That's the failure case obvious enough to land in a postmortem. The far more common case: requests that complete successfully but cost three or four times what the cost model predicted, and an eval suite that scores fine because the final output is correct. The cycle ran, the model burned tokens discovering it, the loop broke when the outer turn budget hit, and the answer came out the other side because the model is good at recovering from its own confusion. The bill is the only place the bug shows.
LangGraph's default recursion limit of twenty-five was chosen specifically to catch this — not because twenty-five is a reasonable number of supersteps for a complex workflow, but because it's a backstop against agents that can't stop. The error message tells you the loop hit the wall. It does not tell you why. Production teams see GraphRecursionError and the first instinct is to raise the limit. That converts a noisy bug into a quiet one. The cycle keeps happening; it just runs longer before it gives up. DeepAgents users have reported that subagents silently inherit the default limit regardless of parent configuration, so a parent graph configured for fifty supersteps spawns children that hit twenty-five and crash without surfacing the right cause to the caller. The cycle and the limit are in different places. So is the diagnosis.
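The framework's error stops at "limit hit"; an application-layer wrapper can at least attach the tail of the call history so the diagnosis ships with the crash instead of being reconstructed later from the trace. A toy sketch — the loop and `step` function here are stand-ins for illustration, not LangGraph's API:

```python
class RecursionLimitError(RuntimeError):
    pass

def run_with_limit(step, state, limit=25):
    """Run an agent loop; on hitting the limit, report the recent
    (tool, args) tail so the error shows the cycle, not just the wall."""
    history = []
    for _ in range(limit):
        call, state, done = step(state)
        history.append(call)
        if done:
            return state
    tail = " -> ".join(f"{tool}({args})" for tool, args in history[-6:])
    raise RecursionLimitError(f"hit limit {limit}; last calls: {tail}")

# A toy two-tool cycle that never converges: A(q) -> B(q) -> A(q) -> ...
def step(state):
    tool = "A" if state % 2 == 0 else "B"
    return (tool, "q"), state + 1, False

try:
    run_with_limit(step, 0)
except RecursionLimitError as e:
    print(e)
```

The point is not the wrapper itself but where the information lives: the call history is in scope at the moment the limit fires, and that is the cheapest place to surface it.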
The recent literature has started to call out an even uglier variant: hidden cycles, where the trajectory looks structurally non-cyclic — different tools, different arguments, different retrieved documents — but the agent is in fact revisiting the same logical state. Structural analysis of the call stack misses these entirely. They only show up if you can compare the semantic content of consecutive states, which most observability tooling does not do.
Reentrancy Is the Right Word, Not Recursion
Calling this "recursion" lumps it together with the case where an agent legitimately decomposes a task into sub-tasks and asks itself for help. That recursion is fine. ReDel and the broader recursive-language-model literature show useful patterns where a parent agent spawns child agents to handle complex sub-problems, and the call tree is well-formed and bounded. The pathological cases are not recursion in the well-formed sense. They are reentrancy.
Reentrancy is a property from systems programming: the question is whether a function can be safely re-entered before its previous invocation finishes. A reentrant function is one that can. A non-reentrant function called reentrantly is the bug. The systems-programming world solved this by giving every function a clear answer to "what happens if I'm called again while my previous call is still on the stack?" Some functions are safe (pure functions, well-designed library code). Some functions need a reentrant mutex that counts nested acquisitions by the same caller. Some functions are explicitly non-reentrant and the runtime is supposed to catch the violation.
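Python's standard library makes the distinction concrete: `threading.RLock` counts nested acquisitions by the same thread, while a plain `threading.Lock` refuses re-entry. A small probe shows both behaviors:

```python
import threading

def nested_acquire(lock) -> bool:
    """Try to re-enter `lock` from the same thread; True if re-entry succeeds."""
    with lock:
        # Non-blocking second acquire: an RLock counts the nesting and
        # succeeds; a plain Lock refuses (returns False) rather than deadlock.
        ok = lock.acquire(blocking=False)
        if ok:
            lock.release()
        return ok

print(nested_acquire(threading.RLock()))  # True: reentrant, nesting is counted
print(nested_acquire(threading.Lock()))   # False: non-reentrant, re-entry refused
```

This is exactly the property tool definitions lack: the lock knows whether re-entry by the current holder is allowed; a tool does not.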
Tools in agent frameworks don't have this property declared anywhere. The framework doesn't know whether it's safe for the model, mid-execution of tool A, to choose tool A again. There's no annotation, no enforcement, and the most common runtime behavior is to dispatch the call as if it were the first one. If tool A holds external state (an open transaction, a paginated cursor, a partially-written file), calling it reentrantly is the kind of bug whose stack trace is on a different machine three hops away.
The Trace UI Is Lying By Omission
Most agent observability platforms render tool calls as a hierarchical tree of spans, and they do this well for the simple case where the agent is a flat loop of model-then-tool-then-model. In that case the hierarchy degenerates to a near-flat tree, and the timeline view tells you everything you need.
The case it does not handle: a tool whose implementation invokes the model again, which then chooses another tool. That nested model call is logically a child of the tool, and the second tool is a grandchild of the first. Most tracing libraries either flatten the whole thing into siblings of the outer agent loop (because the inner model call uses the same client and the tracing context didn't propagate) or render the parent-child relationship technically correctly but in a UI optimized for shallow trees, where deep nesting wraps off the right side of the screen and gets visually lost.
The cycle is invisible in both cases. A flat list of [outer_call_to_A, inner_call_to_B, inner_call_to_A, outer_call_to_B] reads like the agent picked four tools in sequence. The bug is that the second inner_call_to_A is the same logical call as the first outer_call_to_A — the model re-entered A. A graph view would show a cycle. A list view shows four sequential picks.
What the Discipline Looks Like
The fix is not a single feature. It is a stack of small changes that collectively give the runtime the same inspection capability a programming language gives a debugger.
A per-request tool call graph the runtime maintains and inspects before dispatching. Every tool invocation carries an edge to its parent in the call graph. Before dispatching a new call, the runtime walks the ancestors and checks whether the same (tool_name, normalized_arguments) pair already appears. If it does, the runtime has the data to decide whether to permit the re-entry, surface a warning, or break the cycle with a structured error. The model can recover from a structured error far better than it can recover from running out of turns.
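A sketch of the ancestor walk, under the assumption that arguments can be canonicalized as sorted JSON; the names (`CallNode`, `check_dispatch`) are illustrative, not any framework's API:

```python
import json
from dataclasses import dataclass
from typing import Any, Optional

def normalize(args: dict[str, Any]) -> str:
    # Canonical JSON so {"a": 1, "b": 2} and {"b": 2, "a": 1} compare equal.
    return json.dumps(args, sort_keys=True)

@dataclass
class CallNode:
    tool: str
    args_key: str
    parent: Optional["CallNode"] = None

class CycleError(Exception):
    pass

def check_dispatch(parent: Optional[CallNode], tool: str, args: dict) -> CallNode:
    """Walk ancestors before dispatch; refuse re-entry of the same (tool, args)."""
    key = normalize(args)
    node, depth = parent, 0
    while node is not None:
        if (node.tool, node.args_key) == (tool, key):
            raise CycleError(
                f"re-entry: {tool}({key}) is already on the call stack, "
                f"{depth + 1} frame(s) up"
            )
        node, depth = node.parent, depth + 1
    return CallNode(tool, key, parent)

# The shape of the opening anecdote: A -> B -> A with the same arguments.
a = check_dispatch(None, "search", {"q": "x"})
b = check_dispatch(a, "format", {"style": "short"})
try:
    check_dispatch(b, "search", {"q": "x"})
except CycleError as e:
    print(e)
```

The walk is O(stack depth) per dispatch, which is trivially cheap at the depths real agents reach; the cost of the feature is bookkeeping, not computation.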
A cycle-detection rule that breaks the loop with a structured error rather than letting the model burn through retries discovering it. When the runtime sees a re-entry it cannot justify, it should not silently dispatch. It should return a tool result that says "this would create a cycle: tool A is already on your call stack with arguments X." The model is excellent at reading that and producing a different plan. Hidden cycles — where the arguments are different but the semantic state is the same — need a richer detector, but the structural case alone catches most production occurrences and is cheap to implement.
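A sketch of what such a structured refusal could look like, assuming the runtime returns it as an ordinary tool result; the field names (`ok`, `error`, `stack`, `hint`) are illustrative, not any framework's schema:

```python
def cycle_break_result(tool: str, args: dict, stack: list[tuple[str, dict]]) -> dict:
    """Return a tool-result payload instead of dispatching, so the model
    reads a concrete reason and replans rather than burning retries."""
    return {
        "ok": False,
        "error": "cycle_detected",
        "message": (
            f"Calling {tool}({args}) again would create a cycle: it is "
            f"already on your call stack with the same arguments."
        ),
        "stack": [{"tool": t, "args": a} for t, a in stack],
        "hint": "Choose a different tool or different arguments, or answer directly.",
    }

result = cycle_break_result(
    "search", {"q": "x"},
    stack=[("search", {"q": "x"}), ("format", {"style": "short"})],
)
print(result["message"])
```

The key design choice is that this is a result, not an exception: the model sees it in-band, exactly where a tool output would appear, and can act on it in the next turn.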
A stack-depth budget for nested model-in-tool-in-model patterns, separate from the outer turn limit. The outer turn limit ("how many times can the agent loop?") is not the same constraint as the inner stack depth ("how deep can a tool recurse into the model?"). Conflating them is why deep agents and subagents both inherit the same recursion_limit of twenty-five and both run out of headroom at the same time. Treat the two as orthogonal: an agent should be allowed thirty turns of progress with a stack depth of three, or three turns of progress with a stack depth of ten, depending on the workload.
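A sketch of keeping the two budgets orthogonal; the names (`Budgets`, `check_budgets`) are illustrative:

```python
from dataclasses import dataclass

@dataclass
class Budgets:
    max_turns: int        # outer loop: model -> tool -> model iterations
    max_stack_depth: int  # inner nesting: tool -> model -> tool recursion depth

class BudgetExceeded(Exception):
    pass

def check_budgets(budgets: Budgets, turns_used: int, stack_depth: int) -> None:
    """Enforce the two limits independently; conflating them is the bug."""
    if turns_used >= budgets.max_turns:
        raise BudgetExceeded(f"turn budget exhausted ({budgets.max_turns})")
    if stack_depth > budgets.max_stack_depth:
        raise BudgetExceeded(
            f"stack depth {stack_depth} exceeds limit {budgets.max_stack_depth}"
        )

b = Budgets(max_turns=30, max_stack_depth=3)
check_budgets(b, turns_used=29, stack_depth=3)      # long, shallow run: fine
try:
    check_budgets(b, turns_used=1, stack_depth=10)  # short, deep run: refused
except BudgetExceeded as e:
    print(e)
```

With a single shared limit, the deep run above would have been indistinguishable from legitimate long progress until it hit the wall.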
An explicit reentrancy attribute on tool definitions. The tool author knows whether their tool is safe to call reentrantly. A database.transaction.begin tool is not. A text.summarize tool is. A filesystem.read tool is fine to call reentrantly with different paths but not with the same path within an in-progress write. Encode this. The runtime cannot guess. Default to "non-reentrant" so the cost of getting the annotation wrong is a noisy error rather than a silent corruption.
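A minimal sketch of the annotation, assuming a hypothetical `tool` decorator and registry (neither is any framework's API); the default is non-reentrant so a missing annotation fails loudly rather than corrupting silently:

```python
from functools import wraps

_REENTRANT: dict[str, bool] = {}  # hypothetical registry: tool name -> declared safety

def tool(name: str, *, reentrant: bool = False):
    """Register a tool with its declared reentrancy; default to non-reentrant."""
    def deco(fn):
        _REENTRANT[name] = reentrant
        @wraps(fn)
        def wrapper(*args, **kwargs):
            return fn(*args, **kwargs)
        wrapper.tool_name = name
        return wrapper
    return deco

@tool("text.summarize", reentrant=True)   # pure transformation: safe to re-enter
def summarize(text: str) -> str:
    return text[:40]

@tool("db.transaction.begin")             # no annotation: defaults to unsafe
def begin_txn() -> str:
    return "txn-1"

def allow_reentry(name: str) -> bool:
    """What the dispatcher consults before permitting a re-entry."""
    return _REENTRANT.get(name, False)

print(allow_reentry("text.summarize"))      # True
print(allow_reentry("db.transaction.begin"))  # False
```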
A trace UI that renders the call graph as a graph, not a flat list. The cycle becomes visible the moment the visualization is shaped like the underlying structure. Datadog and Langfuse have started shipping flow-graph views alongside their timeline views; the flow-graph view is the one where reentrancy bugs are obvious. Use it. If your observability vendor only offers a timeline, post-process traces into a graph layout for the production debugger; even a static SVG generated nightly is better than a flat list of spans for this class of bug.
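One cheap version of that post-processing is emitting Graphviz DOT from the span parent-child pairs; the back-edge that closes a cycle is immediately visible in the rendered graph. A sketch, assuming the spans have already been reduced to `(parent, child)` name pairs:

```python
def to_dot(edges: list[tuple[str, str]]) -> str:
    """Emit a Graphviz DOT digraph from (parent, child) span pairs; a back-edge
    to an ancestor renders as a visible cycle instead of another list row."""
    lines = ["digraph calls {"]
    for parent, child in edges:
        lines.append(f'  "{parent}" -> "{child}";')
    lines.append("}")
    return "\n".join(lines)

# The flat-list trace from earlier, as edges: the B -> A back-edge closes the cycle.
print(to_dot([
    ("agent", "tool_A"),
    ("tool_A", "model"),
    ("model", "tool_B"),
    ("tool_B", "tool_A"),
]))
```

Pipe the output through `dot -Tsvg` and the cycle is a loop on the page; the same four spans as a list are just four rows.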
A Tool-Using Agent Is a Recursive Evaluator
The architectural realization that makes the rest of these disciplines obvious: a tool-using agent is a recursive evaluator. It is not a state machine, not a workflow, not a pipeline. It is, structurally, an interpreter for a language whose programs are written in natural language and whose values are tool results. The model is the eval function. The tool catalog is the standard library. The conversation history is the heap.
Every concurrency bug fifty years of programming-language research has names for is reachable in this system. Reentrancy. Stack overflow. Deadlock (when two tools each await a resource the other holds). Livelock (the most common production occurrence — two tools that keep handing control back and forth without progress). Race conditions (parallel tool calls that read and write shared state). The tooling that programming languages built for these — recursion limits, stack inspection, monitor objects, lock orderings, deadlock detectors — has direct analogs in agent runtimes. Most of those analogs are missing.
Teams building agent systems often act like the field is unprecedented and every bug needs a new name. The bugs are not new. The names exist. Reading the literature on reentrant mutexes from 1980 will give a working agent engineer more leverage than reading another blog post about prompt patterns. The pattern that has to land in agent runtimes is the one programming languages already adopted: the runtime is responsible for knowing what's on the stack, the developer is responsible for declaring what's safe, and the type system enforces the contract. None of that exists yet for tools. Building it is the work that turns "the agent occasionally loops" into a bug class that gets caught at dispatch instead of at the recursion-limit wall.
The Honest Cost Line
The reason this matters more than another correctness concern is the bill. Reentrancy bugs do not show up as failures in the eval suite (the answer is correct, the cycle just took longer). They do not show up as elevated error rates (the recursion limit catches the worst case and the model recovers from the rest). They show up as a per-request token count that drifts upward over months as the tool catalog grows, with no clear cause and no obvious fix. The cost-attribution dashboard says the agent is "more expensive than it used to be," and the only way to recover the lost margin is to catch the cycles at dispatch rather than after they've burned tokens.
A tool-using agent without cycle detection is paying for a stochastic loop it never priced. The fix is not exotic. It is the same fix every interpreter eventually adopts: model the call stack explicitly, give the runtime the authority to refuse a dispatch, declare the contracts on the things being dispatched, and visualize the structure as a graph so humans can see what the runtime sees. Until the function-calling layer treats itself as the recursive evaluator it actually is, every team building on it will rediscover the same bugs the systems-programming world named decades ago — at a hundred times the per-occurrence cost.
- https://en.wikipedia.org/wiki/Reentrant_mutex
- https://en.wikipedia.org/wiki/Reentrancy_(computing)
- https://docs.langchain.com/oss/python/langgraph/errors/GRAPH_RECURSION_LIMIT
- https://github.com/langchain-ai/langgraph/discussions/1725
- https://github.com/langchain-ai/langgraph/issues/6731
- https://github.com/langchain-ai/deepagents/issues/1698
- https://arxiv.org/html/2511.10650
- https://arxiv.org/html/2408.02248v1
- https://www.braintrust.dev/articles/agent-observability-tracing-tool-calls-memory
- https://www.langchain.com/articles/agent-observability
