The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles
A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade — it was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One agent's "queue transferred" metric and the other's "task received" metric rose in lockstep, and both dashboards looked healthy.
This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.
It's the failure mode you don't catch in pre-launch evals because per-agent quality looks great in isolation. Your support agent answers support questions correctly. Your billing agent answers billing questions correctly. The bug is in the seam between them — and the seam isn't owned by either team.
How the Cycle Forms
The setup is mundane. You have a support agent and a billing agent. The support agent's routing prompt says: if this is a billing question, hand off to billing. The billing agent's routing prompt says: if this is a service question, hand off to support. Each prompt was written by a different sub-team, reviewed in isolation, and shipped behind the same supervisor.
Then a real user shows up with a request that genuinely lives on the seam — "I was charged for support I never received and I want my access restored." The support agent reads "charged" and routes to billing. The billing agent reads "access restored" and routes back. The supervisor, which has no global view of the routing graph, faithfully delivers each message. The conversation bounces forty times until either the per-conversation token budget trips or a turn cap fires, and the user gets a generic fallback that says "let me connect you to a human" — after the system has already spent more compute on this single request than a human agent would cost in a week.
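The dynamics are mechanical enough to reproduce in a few lines. The sketch below is a toy, not any framework's code: two keyword routers standing in for the routing prompts, a supervisor loop that faithfully delivers every handoff, and a turn cap standing in for the token budget.

```python
# Toy reproduction of the closed loop. Agent names, keywords, and the
# turn cap are illustrative, not taken from any real system.

def route_support(message: str) -> str:
    # Support's local rule: anything money-related belongs to billing.
    return "billing" if "charged" in message else "support"

def route_billing(message: str) -> str:
    # Billing's local rule: anything access-related belongs to support.
    return "support" if "access" in message else "billing"

ROUTERS = {"support": route_support, "billing": route_billing}

def run(message: str, start: str = "support", turn_cap: int = 40) -> list[str]:
    """Deliver the message until an agent keeps it or the turn cap fires."""
    path, current = [start], start
    for _ in range(turn_cap):
        nxt = ROUTERS[current](message)
        if nxt == current:      # the agent accepted the request
            return path
        path.append(nxt)
        current = nxt
    return path                 # cap fired: the loop never terminated

path = run("I was charged for support I never received and I want my access restored")
# A request matching both rules ping-pongs until the cap fires.
```

Neither router is wrong by its own spec; the bug only exists in their composition.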
Berkeley researchers who systematically annotated 150 production traces across seven popular multi-agent frameworks placed this category under "inter-agent misalignment" in their MAST taxonomy: agents that operate on incorrect assumptions about who owns what, ignore peer signals, or fail to verify whether anyone is making progress. The three top-level failure categories — specification and system design failures, inter-agent misalignment, and task verification and termination — each contribute to the same end state: a loop that consumes resources without producing an answer.
The structural reason this is so easy to ship is that the routing logic is local. Each agent decides where to hand off based only on what it sees in front of it. Distributed systems engineers spent decades on this exact class of problem. Spanning-tree, BGP, IS-IS — every routing protocol in the layer-2 and layer-3 stack exists in part because local routing decisions can compose into global cycles, and the way you stop that is not by writing better local heuristics. Agent frameworks that hand off via tool calls and free-text routing prompts are essentially building a routing fabric without the loop-detection primitives the routing community made standard equipment in the 1980s.
Why Per-Agent Evals Don't See It
Most teams evaluate agents the way they evaluate any other model output: feed in a prompt, score the response, repeat. Each specialist gets graded on its own benchmark — does the billing agent answer billing questions, does the support agent answer service questions. Both pass. The eval slice that would actually catch this bug — adversarial intents that genuinely span both domains — is the one nobody writes because it sits in the gap between two teams' eval suites.
Even worse, the existing dashboards on each side often both show green during a cycling conversation. The support team's dashboard shows "agent didn't help, transferred to billing" — clean handoff, success per their definition. The billing team's dashboard shows "wrong queue, transferred to support" — also clean per their definition. Neither team owns the seam. Neither team's metrics surface that the same conversation ID appears as a "transfer out" event on both dashboards within the same minute, twenty times in a row.
A recent line of academic work on unsupervised cycle detection in agentic applications quantifies how invisible this is to standard observability. On 1,575 trajectories from a LangGraph stock-market application, structural-only call-stack analysis detected cycles with an F1 of 0.08, and semantic-only redundancy detection managed an F1 of 0.28. Only a hybrid that combined both reached an F1 of 0.72, with recall of 0.86. In other words: if your platform is just looking at "did the agent return," you will catch almost none of the cycles in your traffic. You need to look at the shape of the call graph and the semantic redundancy of the messages together.
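The hybrid idea is simple to sketch, even though the paper's actual detector is more sophisticated. This toy flags a trajectory only when an agent is structurally re-entered and the repeated message adds nothing new, using word-overlap Jaccard as a crude stand-in for embedding-based similarity:

```python
# Toy hybrid cycle detector: structural re-entry AND semantic redundancy.
# The similarity threshold and Jaccard measure are illustrative stand-ins.

def jaccard(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def is_cycling(trace: list[tuple[str, str]], sim_threshold: float = 0.8) -> bool:
    """trace is an ordered list of (agent, message) events."""
    seen: dict[str, list[str]] = {}      # agent -> messages it has emitted
    for agent, message in trace:
        for earlier in seen.get(agent, []):
            # Structural signal: this agent was visited before.
            # Semantic signal: the new message is redundant with the old one.
            if jaccard(earlier, message) >= sim_threshold:
                return True
        seen.setdefault(agent, []).append(message)
    return False
```

A purely structural check would also flag legitimate re-entries (an agent called twice with genuinely new work); the semantic gate is what keeps precision usable.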
The Discipline That Has to Land
Treating multi-agent handoffs as a routing protocol — not as a domain abstraction — pays for itself the first time you avoid an incident. The primitives that have to exist are not exotic; they're the load-bearing parts that production routing protocols figured out a long time ago.
A global hand-off ledger. Every handoff event for a conversation gets logged into a per-conversation ledger that the supervisor (or a router-side middleware) reads on every routing decision. The rule is simple: a conversation cannot re-enter an agent it just left without an explicit, supervisor-acknowledged justification. The ledger lives outside any single agent's context window because no agent can be trusted to remember its own routing history reliably across long conversations.
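A minimal ledger might look like the following sketch. The class shape, the exact re-entry rule, and the assumption that a supervisor calls `check()` before every routing decision are all illustrative:

```python
# Minimal per-conversation handoff ledger (a sketch, assuming supervisor-side
# middleware). The ledger lives outside any agent's context window.

from collections import defaultdict

class HandoffLedger:
    def __init__(self) -> None:
        self._history: dict[str, list[str]] = defaultdict(list)

    def check(self, conversation_id: str, target: str) -> bool:
        """Return True if routing to `target` is allowed without an explicit,
        supervisor-acknowledged justification."""
        history = self._history[conversation_id]
        # The rule: a conversation cannot re-enter the agent it just left.
        if len(history) >= 2 and history[-2] == target:
            return False
        return True

    def record(self, conversation_id: str, target: str) -> None:
        self._history[conversation_id].append(target)
```

The point is not the fifteen lines of code; it's that the state they hold belongs to the routing fabric, not to any specialist.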
An explicit "I disagree with your routing" signal. When agent B receives a handoff from agent A and concludes the conversation belongs to A, it should be able to say so — emit a structured "reject-and-explain" event — rather than silently re-routing back. The supervisor sees the disagreement and decides what to do, the same way an L3 routing protocol carries route-rejection signaling rather than letting endpoints silently NAT packets back at each other. LangChain's published handoff guidance frames this as deliberate target declaration: each agent declares the agents it can hand off to, and the framework enforces those paths rather than letting handoffs be inferred from free-text generation.
A hand-off budget separate from the token budget. Most teams cap conversations on token spend. That's necessary but not sufficient: by the time the token cap trips on a cycling conversation, you've already burned the budget for fifty good conversations. A small integer cap — three or five hand-offs per conversation, depending on your topology — fires earlier and gives you a clean signal at the source. OpenAI's Agents SDK exposes this kind of bounded-loop primitive as part of its handoff abstraction; the team writing the agent should pick the bound deliberately rather than inheriting the framework's default.
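Enforced supervisor-side, the budget is a few lines. The cap of 5 in this sketch is an example value; pick it from the longest legitimate path in your own topology:

```python
# A handoff budget enforced at the supervisor, separate from the token
# budget. Class name and cap are illustrative.

from collections import Counter

class HandoffBudget:
    def __init__(self, max_handoffs: int = 5) -> None:
        self.max_handoffs = max_handoffs
        self._counts: Counter[str] = Counter()

    def consume(self, conversation_id: str) -> bool:
        """Return True if the handoff may proceed. False means fall back now,
        before the token cap burns fifty conversations' worth of spend."""
        self._counts[conversation_id] += 1
        return self._counts[conversation_id] <= self.max_handoffs
```

A denied `consume()` is also the clean alert source: it fires on the exact conversation, at the exact handoff, where the topology broke.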
An adversarial-seam eval slice. The eval that finds this bug is not "does each agent answer questions in its domain." It's "construct prompts whose intent genuinely spans both domains, run them through the live multi-agent system, and assert that the conversation reaches a terminal state in a bounded number of hand-offs." The acceptance criterion is termination, not correctness — though both should be measured. This slice has to be owned by someone whose job is the seam, because the per-agent teams will keep optimizing in their own silos.
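A seam eval can be a handful of lines around whatever entry point your system exposes. In this sketch, `run_system` is a hypothetical stand-in that returns the handoff path taken, ending in a `"done"` marker when the conversation reaches a terminal state; the prompts and the bound are illustrative:

```python
# Seam-eval sketch: cross-domain prompts, asserted to terminate within a
# bounded number of handoffs. `run_system` is a hypothetical entry point.

SEAM_PROMPTS = [
    "I was charged for support I never received and I want my access restored",
    "Cancel my plan but keep my open support ticket alive",
]

MAX_HANDOFFS = 5

def eval_seam(run_system) -> list[str]:
    """Return the prompts that failed to terminate within the bound."""
    failures = []
    for prompt in SEAM_PROMPTS:
        path = run_system(prompt)        # e.g. ["support", "billing", "done"]
        handoffs = max(len(path) - 1, 0)
        terminated = bool(path) and path[-1] == "done"
        if not terminated or handoffs > MAX_HANDOFFS:
            failures.append(prompt)
    return failures
```

Run it in CI against a staging deployment of the full multi-agent system, not against any single agent; that's the whole point of the slice.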
The Observability That Surfaces It
Even with the protocol-level primitives in place, you need to see what's happening in production. Three signals are cheap to compute and easy to alert on:
Hand-off depth per conversation. Plot the histogram of how many hand-offs each conversation accumulates. Healthy distributions are heavy on the left — most conversations terminate after zero or one hand-off — with a thin tail. A bimodal distribution, or a sudden lift in the tail, is a cycling-conversation signature. Set a page on "p99 hand-off depth crosses 5."
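Computing the histogram and the alert from a stream of per-conversation depths is cheap. The nearest-rank p99 and the synthetic traffic below are illustrative:

```python
# Handoff-depth histogram and p99 page, from (depth per conversation) data.
# The depths list is synthetic: a healthy left-heavy body plus a cycling tail.

from collections import Counter

def p99(depths: list[int]) -> int:
    ordered = sorted(depths)
    # Nearest-rank p99: the value at the 99th-percentile position.
    return ordered[max(0, int(len(ordered) * 0.99) - 1)] if ordered else 0

depths = [0] * 900 + [1] * 85 + [8] * 15   # 1.5% of traffic is cycling
histogram = Counter(depths)                # left-heavy body, anomalous bump at 8
should_page = p99(depths) > 5              # fires on the cycling tail
```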
An inter-agent transition heatmap. Build a square matrix where rows are source agents, columns are destination agents, and cells are the count of hand-offs from row to column over the last hour. The diagonal should be light (agents rarely hand off to themselves) and the off-diagonal should be sparse and asymmetric. If you see a symmetric pair of bright cells — A→B and B→A roughly equal and large — you're looking at a closed loop. This is the kind of dashboard the first incident usually justifies; once it exists, every subsequent topology change can be visually sanity-checked before launch.
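Detecting the symmetric-bright-cell signature can be automated on the same data that feeds the heatmap. The count and symmetry thresholds in this sketch are illustrative:

```python
# Flag symmetric bright cell pairs in the transition matrix: both directions
# large and roughly equal is the closed-loop signature. Thresholds are
# illustrative and should be tuned to your traffic volume.

from collections import Counter
from itertools import combinations

def find_closed_loops(handoffs: list[tuple[str, str]],
                      min_count: int = 10,
                      symmetry: float = 0.5) -> list[tuple[str, str]]:
    """handoffs: (source, destination) events from the last hour."""
    matrix = Counter(handoffs)
    agents = {a for pair in handoffs for a in pair}
    loops = []
    for a, b in combinations(sorted(agents), 2):
        ab, ba = matrix[(a, b)], matrix[(b, a)]
        # min_count filters noise; the ratio test demands rough symmetry.
        if min(ab, ba) >= min_count and min(ab, ba) / max(ab, ba) >= symmetry:
            loops.append((a, b))
    return loops
```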
A per-pair re-entry counter. For every ordered pair (X, Y) of agents, count how often a conversation that handed off X→Y subsequently hands off Y→X within N turns. This should round to zero for healthy traffic. Anything above zero is a candidate cycle and worth alerting on per pair, because the failure is per-pair: it's the contract between two specific agents that broke, not the system as a whole.
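The counter itself is a short scan over each conversation's path; the window of 4 turns here is an example value:

```python
# Per-ordered-pair re-entry counter: how often an X->Y handoff bounces back
# to X within `window` turns. Healthy traffic rounds to zero on every pair.

from collections import Counter

def reentry_counts(paths: dict[str, list[str]], window: int = 4) -> Counter:
    """paths maps conversation_id -> sequence of agents visited, in order."""
    counts: Counter[tuple[str, str]] = Counter()
    for path in paths.values():
        for i in range(len(path) - 1):
            x, y = path[i], path[i + 1]
            # Did the conversation return to x within `window` turns of leaving?
            if x in path[i + 2 : i + 2 + window]:
                counts[(x, y)] += 1
    return counts
```

Because the output is keyed per ordered pair, the alert names the broken contract directly: "support→billing re-entries are nonzero" points at exactly two prompt files to diff.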
The Maxim AI guidance on multi-agent reliability puts this in protocol terms: model your coordination as a directed acyclic graph with bounded depth, and instrument every handoff so cycles and unbounded chains can be detected before they compound into cost or latency incidents. The DAG framing matters because it forces an explicit answer to the question "is this a hand-off or a re-entry?" — a question free-text routing prompts cannot answer.
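Validating the declared topology as a DAG before launch is a standard cycle check. The sketch below runs Kahn's algorithm over each agent's declared handoff targets, assuming the declarations are available as a simple adjacency mapping; any nodes left after the topological peel sit on or feed a cycle:

```python
# Pre-launch DAG validation of the declared handoff topology via Kahn's
# algorithm. The adjacency-dict input format is an assumption.

from collections import deque

def topology_cycle(targets: dict[str, list[str]]) -> set[str]:
    """Return the agents involved in a routing cycle (empty set if a DAG)."""
    indegree = {agent: 0 for agent in targets}
    for outs in targets.values():
        for dest in outs:
            indegree[dest] = indegree.get(dest, 0) + 1
    queue = deque(agent for agent, d in indegree.items() if d == 0)
    removed: set[str] = set()
    while queue:
        node = queue.popleft()
        removed.add(node)
        for dest in targets.get(node, []):
            indegree[dest] -= 1
            if indegree[dest] == 0:
                queue.append(dest)
    # Nodes that were never peeled have a cycle somewhere upstream of them.
    return set(indegree) - removed
```

Wired into CI, this turns "did someone just close a loop?" from a production incident into a failed build when a new specialist registers its handoff targets.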
Owning the Seam
The deeper organizational fix is that somebody has to own the seam. The support team owns its agent. The billing team owns its agent. The handoff between them is owned by neither, and that's exactly why the cycle ships.
The teams that handle this well usually create a small platform group — sometimes a single engineer — whose remit is the routing fabric: the supervisor logic, the hand-off ledger, the inter-agent eval slice, the topology dashboards. This group doesn't write specialist agents. It writes the contracts between them. It ships the framework that makes adding a new specialist agent safe — declaring its hand-off targets, registering it in the topology graph, getting it onto the heatmap before it goes live to real traffic.
This is the same pattern that microservices teams discovered fifteen years ago: when you decompose a monolith into services owned by different teams, you don't actually decompose the operational responsibility — you just hide the seams. The platform group exists to make those seams visible and manageable. Multi-agent systems are repeating the discovery.
What This Costs to Skip
The architectural realization is not that multi-agent systems are bad. They are often the right answer; specialist agents with focused prompts and tools genuinely outperform a single overloaded generalist for complex domains. The realization is that the moment you compose two agents into a system, you have a distributed system, and distributed systems have pathologies that no amount of per-component quality will protect you against.
Spanning-tree wasn't invented because Ethernet engineers couldn't write good local switching logic. It was invented because local switching logic, composed at scale, will form loops, and once it does, the network melts. The multi-agent version of that meltdown is quieter — it shows up as cost drift, latency tails, and frustrated-user complaints that don't map cleanly to any single agent's quality metric — but it's the same shape of bug. The team that designed hand-offs as a domain abstraction without the loop-detection primitives is rebuilding routing from first principles, and is going to learn the same lessons in the same order, except the bill will arrive in inference dollars instead of dropped packets.
The fix is not to write smarter agents. The fix is to treat the routing fabric as a first-class system, instrument it like one, and put someone in charge of the seam.
- https://arxiv.org/abs/2503.13657
- https://arxiv.org/abs/2511.10650
- https://galileo.ai/blog/multi-agent-llm-systems-fail
- https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/
- https://docs.langchain.com/oss/python/langchain/multi-agent/handoffs
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns
- https://community.openai.com/t/agents-sdk-looping-handoffs/1256231
- https://towardsdatascience.com/how-agent-handoffs-work-in-multi-agent-systems/
