
The Closed-Loop Escalation Bug: When Your Specialist Agents Route in Circles

11 min read
Tian Pan
Software Engineer

A multi-agent system for market data research quietly burned through $47,000 in inference cost over four weeks before anyone noticed. The original weekly bill was $127. The cause wasn't a traffic spike or a model upgrade. It was two agents passing the same conversation back and forth for eleven days, each one confident the other was the right place for the request to live. Nothing errored. No alarm fired. One bot's "queue transferred" metric and the other bot's "task received" metric went up in lockstep, and both dashboards looked healthy.

This is the closed-loop escalation bug. It is the multi-agent version of two helpful colleagues each insisting "no, you take it," except neither of them ever gets bored and walks away. The architecture diagram you drew at design time has each specialist owning a clean slice of the problem. The architecture the runtime actually executes has a routing cycle nobody in the room can see.

It's the failure mode you don't catch in pre-launch evals because per-agent quality looks great in isolation. Your support agent answers support questions correctly. Your billing agent answers billing questions correctly. The bug is in the seam between them — and the seam isn't owned by either team.

How the Cycle Forms

The setup is mundane. You have a support agent and a billing agent. The support agent's routing prompt says: if this is a billing question, hand off to billing. The billing agent's routing prompt says: if this is a service question, hand off to support. Each prompt was written by a different sub-team, reviewed in isolation, and shipped behind the same supervisor.

Then a real user shows up with a request that genuinely lives on the seam — "I was charged for support I never received and I want my access restored." The support agent reads "charged" and routes to billing. The billing agent reads "access restored" and routes back. The supervisor, which has no global view of the routing graph, faithfully delivers each message. The conversation bounces forty times until either the per-conversation token budget trips or a turn cap fires, and the user gets a generic fallback that says "let me connect you to a human" — after the system has already spent more compute on this single request than a human agent would cost in a week.
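The bounce is easy to reproduce in a toy model. The sketch below is illustrative only: the keyword routers stand in for each agent's routing prompt, and `run_supervisor`, `ROUTERS`, and the 40-turn cap are hypothetical names and values chosen to mirror the scenario above, not any framework's API.

```python
from typing import Optional

# Hypothetical keyword routers standing in for each agent's routing prompt.
def support_router(message: str) -> Optional[str]:
    """Support hands off anything that mentions a charge."""
    return "billing" if "charged" in message.lower() else None

def billing_router(message: str) -> Optional[str]:
    """Billing hands off anything that mentions access."""
    return "support" if "access" in message.lower() else None

ROUTERS = {"support": support_router, "billing": billing_router}

def run_supervisor(message: str, start: str, turn_cap: int = 40) -> list[str]:
    """Deliver handoffs with no global view; return the agents visited in order."""
    path = [start]
    current = start
    for _ in range(turn_cap):
        target = ROUTERS[current](message)
        if target is None:  # the current agent kept the request
            break
        path.append(target)
        current = target
    return path

seam_request = (
    "I was charged for support I never received and I want my access restored."
)
path = run_supervisor(seam_request, start="support")
print(len(path) - 1, "handoffs:", " -> ".join(path[:5]), "...")
```

Each router is correct by its own local rule, and the only thing that ends the run is the turn cap. A request that sits cleanly on one side ("I was charged twice") stops after a single handoff.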

Berkeley researchers who systematically annotated 150 production traces across seven popular multi-agent frameworks placed this category under "inter-agent misalignment" in their MAST taxonomy: agents that operate on incorrect assumptions about who owns what, ignore peer signals, or fail to verify whether anyone is making progress. The three top-level failure categories — specification and system design failures, inter-agent misalignment, and task verification and termination — each contribute to the same end state: a loop that consumes resources without producing an answer.

The structural reason this is so easy to ship is that the routing logic is local. Each agent decides where to hand off based only on what it sees in front of it. Distributed systems engineers spent decades on this exact class of problem. Spanning-tree, BGP, IS-IS — every routing protocol in the layer-2 and layer-3 stack exists in part because local routing decisions can compose into global cycles, and the way you stop that is not by writing better local heuristics. Agent frameworks that hand off via tool calls and free-text routing prompts are essentially building a routing fabric without the loop-detection primitives the routing community made standard equipment in the 1980s.
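Two of those primitives translate almost directly: a hop limit, in the spirit of the IP TTL field, and a path carried with the message, in the spirit of BGP's AS_PATH check. The sketch below is one possible shape under stated assumptions. `HandoffEnvelope`, `forward`, and the default of four hops are hypothetical, and the no-revisit rule is deliberately strict; a real system may need to allow justified re-entry.

```python
from dataclasses import dataclass, field

class RoutingError(Exception):
    """Base class for handoffs the fabric refuses to deliver."""

class LoopDetected(RoutingError):
    pass

class HopLimitExceeded(RoutingError):
    pass

@dataclass(frozen=True)
class HandoffEnvelope:
    """Metadata that travels with the conversation, outside any agent's prompt."""
    conversation_id: str
    hops_remaining: int = 4                        # TTL-style budget (assumed value)
    path: tuple[str, ...] = field(default=())      # agents that have held the request

def forward(envelope: HandoffEnvelope, to_agent: str) -> HandoffEnvelope:
    """Return the envelope for the next hop, or raise instead of delivering."""
    if to_agent in envelope.path:
        # Path-vector style check: the target already appears in the path.
        raise LoopDetected(f"{to_agent} already handled {envelope.conversation_id}")
    if envelope.hops_remaining <= 0:
        raise HopLimitExceeded(f"hop budget spent for {envelope.conversation_id}")
    return HandoffEnvelope(
        conversation_id=envelope.conversation_id,
        hops_remaining=envelope.hops_remaining - 1,
        path=envelope.path + (to_agent,),
    )

env = HandoffEnvelope("conv-1", path=("support",))
env = forward(env, "billing")      # allowed: billing has not seen it yet
try:
    forward(env, "support")        # refused: support is already on the path
except LoopDetected as exc:
    print("escalate to human:", exc)
```

The point of the design is that neither check depends on what any agent believes about ownership. The decision is made from the envelope alone.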

Why Per-Agent Evals Don't See It

Most teams evaluate agents the way they evaluate any other model output: feed in a prompt, score the response, repeat. Each specialist gets graded on its own benchmark — does the billing agent answer billing questions, does the support agent answer service questions. Both pass. The eval slice that would actually catch this bug — adversarial intents that genuinely span both domains — is the one nobody writes because it sits in the gap between two teams' eval suites.

Even worse, the existing dashboards on each side often both show green during a cycling conversation. The support team's dashboard shows "agent didn't help, transferred to billing" — clean handoff, success per their definition. The billing team's dashboard shows "wrong queue, transferred to support" — also clean per their definition. Neither team owns the seam. Neither team's metrics surface that the same conversation ID appears as a "transfer out" event on both dashboards within the same minute, twenty times in a row.
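The signal both dashboards miss only appears when the two teams' transfer events are joined on conversation ID. A minimal sketch of that join, assuming transfer events can be exported as simple records; the `Transfer` fields, `find_ping_pong`, the 60-second window, and the threshold of three reversals are all illustrative choices, not values from any real system.

```python
from typing import NamedTuple

class Transfer(NamedTuple):
    conversation_id: str
    source: str
    target: str
    ts: float  # seconds since some epoch

def find_ping_pong(
    events: list[Transfer], window_s: float = 60.0, min_reversals: int = 3
) -> dict[str, int]:
    """Map conversation ID to the number of immediate reversals (A->B then B->A)."""
    by_conversation: dict[str, list[Transfer]] = {}
    for event in sorted(events, key=lambda e: e.ts):
        by_conversation.setdefault(event.conversation_id, []).append(event)

    flagged: dict[str, int] = {}
    for conversation_id, transfers in by_conversation.items():
        reversals = 0
        for prev, cur in zip(transfers, transfers[1:]):
            is_reverse = (cur.source, cur.target) == (prev.target, prev.source)
            if is_reverse and cur.ts - prev.ts <= window_s:
                reversals += 1
        if reversals >= min_reversals:
            flagged[conversation_id] = reversals
    return flagged

events = [
    Transfer("c1", "support", "billing", 0.0),
    Transfer("c1", "billing", "support", 5.0),
    Transfer("c1", "support", "billing", 10.0),
    Transfer("c1", "billing", "support", 15.0),
    Transfer("c2", "support", "billing", 0.0),  # a single clean handoff
]
print(find_ping_pong(events))
```

Run over the merged event stream, a query like this is a cheap first alarm: it needs no model calls and no access to message content.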

A recent line of academic work on unsupervised cycle detection in agentic applications quantifies how invisible this is to standard observability. On 1,575 trajectories from a LangGraph stock-market application, structural-only call-stack analysis detected cycles with an F1 of 0.08, and semantic-only redundancy detection reached an F1 of 0.28. Only a hybrid that combined both reached an F1 of 0.72, with recall of 0.86. In other words: if your platform is just looking at "did the agent return," you will catch almost none of the cycles in your traffic. You need to look at the shape of the call graph and the semantic redundancy of the messages together.
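To make "both signals together" concrete, here is a deliberately small sketch. It is not the method from that research: the structural signal is reduced to "an agent is revisited," and the semantic signal uses `difflib.SequenceMatcher` from the Python standard library as a crude stand-in for embedding similarity. The function name `looks_like_cycle` and the 0.9 threshold are assumptions.

```python
from difflib import SequenceMatcher

def looks_like_cycle(
    steps: list[tuple[str, str]], threshold: float = 0.9
) -> bool:
    """steps is an ordered list of (agent, message_received) pairs.

    Flag a cycle only when both signals agree: an agent is revisited
    (structural) and what it receives is nearly identical to what it
    received on its previous visit (semantic).
    """
    last_seen: dict[str, str] = {}
    for agent, message in steps:
        previous = last_seen.get(agent)
        if previous is not None:
            similarity = SequenceMatcher(None, previous, message).ratio()
            if similarity >= threshold:
                return True
        last_seen[agent] = message
    return False

stuck = [
    ("support", "charged for support, restore access"),
    ("billing", "charged for support, restore access"),
    ("support", "charged for support, restore access"),
]
progressing = [
    ("support", "I was charged twice"),
    ("billing", "I was charged twice"),
    ("support", "Refund issued; user now asks how to reset their password"),
]
print(looks_like_cycle(stuck), looks_like_cycle(progressing))
```

Requiring both signals is what keeps legitimate revisits quiet: the second trace returns to support, but with new information, so it is not flagged.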

The Discipline That Has to Land

Treating multi-agent handoffs as a routing protocol — not as a domain abstraction — pays for itself the first time you avoid an incident. The primitives that have to exist are not exotic; they're the load-bearing parts that production routing protocols figured out a long time ago.

A global hand-off ledger. Every handoff event for a conversation gets logged into a per-conversation ledger that the supervisor (or a router-side middleware) reads on every routing decision. The rule is simple: a conversation cannot re-enter an agent it just left without an explicit, supervisor-acknowledged justification. The ledger lives outside any single agent's context window because no agent can be trusted to remember its own routing history reliably across long conversations.
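A minimal in-memory sketch of such a ledger follows. Everything here is hypothetical: `HandoffLedger`, `request`, and `ReentryDenied` are illustrative names, a production version would live in durable storage, and "supervisor-acknowledged justification" is simplified to any non-empty string.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

class ReentryDenied(Exception):
    """Raised when a conversation would bounce straight back without a reason."""

@dataclass(frozen=True)
class Handoff:
    source: str
    target: str
    justification: Optional[str] = None

class HandoffLedger:
    """Per-conversation handoff history, kept outside any agent's context."""

    def __init__(self) -> None:
        self._entries: dict[str, list[Handoff]] = defaultdict(list)

    def request(
        self,
        conversation_id: str,
        source: str,
        target: str,
        justification: Optional[str] = None,
    ) -> Handoff:
        """Record the handoff if allowed; refuse unjustified immediate re-entry."""
        history = self._entries[conversation_id]
        just_left = history[-1].source if history else None
        if target == just_left and not justification:
            raise ReentryDenied(
                f"{conversation_id}: {target} just handed this off; "
                "a justification is required to send it back"
            )
        handoff = Handoff(source, target, justification)
        history.append(handoff)
        return handoff

    def history(self, conversation_id: str) -> list[Handoff]:
        return list(self._entries.get(conversation_id, []))

ledger = HandoffLedger()
ledger.request("conv-1", "support", "billing")
try:
    ledger.request("conv-1", "billing", "support")  # bounce-back, no reason given
except ReentryDenied as exc:
    print("blocked:", exc)
ledger.request("conv-1", "billing", "support", justification="refund issued")
```

The supervisor calls `request` before delivering any handoff, so the check runs on every routing decision and a refused handoff never reaches the target agent.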
