The Multi-Agent Deadlock That Hangs on Two Calendars
Agent A asks Agent B for a piece of data it needs to finish its task. Agent B, before answering, asks Agent A for a piece of context it needs to produce that data. Both requests cross a "human review required" boundary on the way out. The first request lands in a Slack approval channel watched by Priya. The second lands in a Jira queue watched by Marcus. Priya is at lunch. Marcus is in a customer call. Neither knows the other exists. The workflow hangs for nineteen hours, and nobody notices until a customer escalation forces somebody to ask why the rollup never landed.
This is not a novel failure. It is the oldest failure in distributed systems, wearing a new costume. The Coffman conditions (mutual exclusion, hold and wait, no preemption, circular wait) were named in 1971. A multi-agent system with human-in-the-loop approval queues satisfies the first three by default, and the fourth as soon as two pending requests depend on each other. The new wrinkle is that one of the "resources" in the deadlock is a person's attention, which means your liveness guarantee is now bounded by how quickly two humans who don't know they're paired can independently context-switch.
If you are building agent workflows where any tool call can route to human approval, you are running a scheduler whose latency is non-deterministic and whose deadlock detector is "someone asks a question at standup." That is not a strategy. It is a bet that you will get lucky every day.
The Four Conditions, Reincarnated
Walk through the Coffman conditions with an agent system in mind. Each one shows up in your stack whether you designed for it or not.
Mutual exclusion. An agent session holds open state — a thread ID, an in-flight tool lease, a partially-written memory entry, a checkpointer row in your persistence layer. Another agent cannot resume that session in parallel; the framework's persistence model assumes one writer per thread. The session is a resource. It is held exclusively.
Hold and wait. The agent is paused on interrupt() waiting for a human reply, but it is still holding the session, the checkpoint, the open file handle to whatever it was producing, the conversation context, and whatever upstream resources it grabbed before the pause. LangGraph's persistence layer is explicit about this: when a thread is interrupted, the checkpointer holds the frozen state indefinitely, with no built-in TTL.
No preemption. The approval semantics don't allow the system to take the request back. You cannot tell Priya "actually, we changed our mind, abandon the review." Even if you could, the agent would have to roll back partial side effects it has already committed: a row written to a database, a calendar event tentatively scheduled, a draft email queued for send. Most agent frameworks have no notion of a transactional sub-graph that can be cleanly aborted.
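One way to approximate preemption is to record a compensating action for every side effect as it is committed, so that an abort can undo them in reverse order. The following is a minimal sketch of that idea, assuming each side effect has a meaningful undo (some, such as an email already sent, do not). `CompensationLog` and the sample side effects are hypothetical names for illustration and are not part of any agent framework.

```python
from typing import Callable, List, Tuple


class CompensationLog:
    """Records an undo callback for each committed side effect (saga-style)."""

    def __init__(self) -> None:
        self._undo_stack: List[Tuple[str, Callable[[], object]]] = []

    def record(self, label: str, undo: Callable[[], object]) -> None:
        # Call this right after the side effect commits.
        self._undo_stack.append((label, undo))

    def abort(self) -> List[str]:
        """Run undo callbacks newest-first; return the labels that were undone."""
        undone: List[str] = []
        while self._undo_stack:
            label, undo = self._undo_stack.pop()
            undo()
            undone.append(label)
        return undone


# Stand-ins for real side effects: a database row and a tentative calendar hold.
db_rows = {"rollup-42": "draft"}
calendar = ["tentative: review sync"]

log = CompensationLog()
log.record("db row", lambda: db_rows.pop("rollup-42", None))
log.record("calendar hold", lambda: calendar.clear())

# Withdrawing the approval request would trigger this.
undone = log.abort()
print(undone)
```

Undoing newest-first matters because later side effects may depend on earlier ones. This does not make the approval request itself revocable; it only makes the agent's half of the abort cheap enough to attempt.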
Circular wait. This is the one that bites. Agent A's request is in Priya's queue; Priya's review depends on a fact only Agent B could produce; Agent B's request is in Marcus's queue; Marcus's review depends on a fact only Agent A could produce. The cycle is across two humans, two agents, and two queues. None of the four parties has a complete view of the graph. The cycle is invisible.
The classical literature solves this by either preventing one of the four conditions or detecting the cycle and breaking it. Most agent frameworks ship with neither primitive in the box. The graph forms; nothing watches for cycles; the workflow hangs; eventually a human notices the silence.
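The detection half can be sketched as a wait-for graph: nodes are agents and reviewers, and an edge from X to Y means X cannot proceed until Y acts. The sketch below runs a depth-first search for a back edge. The function name `find_cycle` and the edge dictionary are hypothetical; no framework is assumed to expose this graph, so in practice it would have to be assembled from each approval queue's metadata.

```python
from typing import Dict, List, Optional

# Hypothetical wait-for graph: an edge X -> Y means "X is blocked until Y acts".
WaitForGraph = Dict[str, List[str]]


def find_cycle(graph: WaitForGraph) -> Optional[List[str]]:
    """Return one cycle as a node list (first node repeated at the end), or None."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {node: WHITE for node in graph}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for nxt in graph.get(node, []):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Back edge: nxt is already on the path, so the loop is closed.
                return path[path.index(nxt):] + [nxt]
            if state == WHITE:
                found = visit(nxt)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for start in list(graph):
        if color[start] == WHITE:
            found = visit(start)
            if found:
                return found
    return None


# The scenario from the text, with each wait recorded as an edge.
waits: WaitForGraph = {
    "agent_a": ["priya"],   # A's request sits in Priya's queue
    "priya": ["agent_b"],   # Priya needs a fact only B can produce
    "agent_b": ["marcus"],  # B's request sits in Marcus's queue
    "marcus": ["agent_a"],  # Marcus needs a fact only A can produce
}
print(find_cycle(waits))
```

The hard part is not the search but recording the human-to-agent edges at all: "Priya's review depends on Agent B" lives in Priya's head unless the approval request captures it.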
Why the Human Is Not "Just Another Tool"
The tempting framing is that a human reviewer is a tool with high latency — call it like any other tool, await the response, move on. This framing breaks down quickly.
A tool call has a service-level contract. You know the rough p50 and p99 latency. You can set a timeout. You can retry. The tool either returns or fails within a bounded window.
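For contrast with what follows, here is roughly what that bounded contract looks like in code: a minimal sketch using `asyncio.wait_for` for the timeout and a simple retry loop. The helper `call_with_bounds`, the stand-in `flaky_tool`, and the specific timeout and backoff values are illustrative assumptions.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_bounds(
    tool: Callable[[], Awaitable[T]],
    timeout_s: float,
    max_attempts: int,
) -> T:
    """Call `tool`, retrying on timeout; re-raise after the final attempt."""
    last_error: Exception = TimeoutError("tool was never attempted")
    for attempt in range(1, max_attempts + 1):
        try:
            return await asyncio.wait_for(tool(), timeout=timeout_s)
        except asyncio.TimeoutError as exc:
            last_error = exc
            await asyncio.sleep(0.01 * attempt)  # small linear backoff for the sketch
    raise last_error


calls = {"n": 0}


async def flaky_tool() -> str:
    calls["n"] += 1
    # The first call stalls past the timeout; later calls answer immediately.
    await asyncio.sleep(1.0 if calls["n"] == 1 else 0.0)
    return "ok"


result = asyncio.run(call_with_bounds(flaky_tool, timeout_s=0.05, max_attempts=3))
print(result, calls["n"])
```

Every branch here terminates within a known window: at most `max_attempts` timeouts plus the backoff sleeps.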
A human approver has none of this. A human's response latency is a function of their calendar, their timezone, their phone battery, whether they noticed the notification, whether they understood the request, and whether they decided your request was more important than the seventeen other things on their plate. There is no SLA. There are no retry semantics. There is no liveness guarantee.
What you have is a scheduler whose dispatching policy is opaque to you, whose queue depth you cannot read, and whose worker pool is humans who are not paid to be your runtime. Treating that as "a slow tool" is how you ship a system whose tail latency is measured in days.
The right framing borrows from distributed systems: a human approver is a non-deterministic external service with no published SLO, no health endpoint, and no guarantee of forward progress. Every primitive you would build around such a service — circuit breakers, timeouts with escalation, dead-letter queues, observability into queue depth — applies directly. Most teams build none of them, because the agent framework's tutorial showed interrupt() as a one-line addition and made it look free.
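Two of those primitives, timeouts with escalation and a dead-letter queue, can be sketched as a periodic sweep over pending approvals. This is a minimal in-memory sketch: `PendingApproval`, `sweep`, the thresholds, and the backup reviewer in the escalation chain are all hypothetical, and a real system would persist this state and actually notify people.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PendingApproval:
    request_id: str
    escalation_chain: List[str]  # reviewers to try, in order (hypothetical)
    assigned_at: float           # seconds, on whatever clock the caller uses
    level: int = 0               # index into escalation_chain

    @property
    def reviewer(self) -> str:
        return self.escalation_chain[self.level]


def sweep(
    pending: List[PendingApproval],
    now: float,
    timeout_s: float,
) -> Tuple[List[PendingApproval], List[PendingApproval]]:
    """Escalate overdue requests; dead-letter those with nobody left to ask.

    Returns (still_pending, dead_letters).
    """
    still_pending: List[PendingApproval] = []
    dead_letters: List[PendingApproval] = []
    for req in pending:
        if now - req.assigned_at < timeout_s:
            still_pending.append(req)      # within its window, leave it alone
        elif req.level + 1 < len(req.escalation_chain):
            req.level += 1                 # hand to the next reviewer
            req.assigned_at = now          # restart the clock for them
            still_pending.append(req)
        else:
            dead_letters.append(req)       # chain exhausted: surface loudly
    return still_pending, dead_letters


queue = [
    PendingApproval("req-a", ["priya", "priya-backup"], assigned_at=0.0),
    PendingApproval("req-b", ["marcus"], assigned_at=0.0),
    PendingApproval("req-c", ["priya"], assigned_at=3000.0),
]
queue, dead = sweep(queue, now=3600.0, timeout_s=1800.0)
print([(r.request_id, r.reviewer) for r in queue])
print([r.request_id for r in dead])
print("queue depth:", len(queue))
```

Passing `now` in explicitly keeps the sweep deterministic and testable. The length of `queue` after each sweep is the queue-depth signal that the paragraph above says most teams never expose.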
- https://docs.langchain.com/oss/python/langgraph/interrupts
- https://www.abstractalgorithms.dev/langgraph-human-in-the-loop
- https://en.wikipedia.org/wiki/Deadlock_(computer_science)
- https://www.geeksforgeeks.org/computer-networks/conditions-for-deadlock-in-distributed-system/
- https://www.geeksforgeeks.org/computer-networks/wait-for-graph-deadlock-detection-in-distributed-system/
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://altaflow.com/blog/manual-approval-bottlenecks-causes-and-fixes
- https://aws.amazon.com/blogs/apn/how-temporal-uses-amazon-bedrock-agentcore-to-create-robust-ai-systems/
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns
