
Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd

12 min read
Tian Pan
Software Engineer

Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.

Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer exercise. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.

None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.

The Fleet Is Not the Agent

A single agent is a language model in a loop. It gets a prompt, it calls tools, it calls tools again, it stops. You can reason about its latency as a sum of token generation plus tool time. You can reason about its failure modes as "the model got confused" or "the tool timed out." You can debug it with a transcript.

A fleet of agents is not that. It is thirty agents contending for a shared rate budget, a shared pool of database connections, a shared filesystem, and a shared downstream service that does not care how clever your prompting is. The fleet has emergent behaviors that no single agent can produce on its own: convoy effects, priority inversion, livelock, herd synchronization, and fair-share starvation. The agents are not the interesting part. The contention is.

This reframing matters because it tells you where to put engineering effort. If you treat the fleet as "thirty copies of one agent," you will spend your time tuning prompts and wondering why reliability does not scale past ten concurrent workers. If you treat the fleet as a scheduler problem, you will spend your time on admission control, backpressure, and coordination primitives — and those are the things that actually move the needle.

The Thundering Herd, Rediscovered

A thundering herd is not a traffic problem; it is a synchronization problem. It is what happens when a large number of workers wake up at the same instant and all try to use a shared resource. The classic examples — a cache entry expiring, a failover promoting a new leader, a cron job firing — are exactly the patterns that agent fleets reproduce.

The 02:00 cron trigger is the obvious one. Less obvious is the implicit herd that appears at the end of a common step. Every agent in a fleet tends to finish step N and start step N+1 at roughly the same time, because their runtimes are dominated by the same model with roughly the same latency distribution. So even if you staggered their start, they resynchronize at each tool boundary. This is the same reason parallel HTTP fetches bunch up at TCP slow-start: they all cross the same bottleneck together.

The mitigations are not new. Add jitter to every timer. Stagger start times deliberately rather than accidentally. Use exponential backoff with randomness, not fixed backoff. Cap the number of agents that can be in the "calling the model" phase at any one time, separately from the number that can exist in the fleet. A fleet of 100 agents where only 20 can be mid-flight against the provider at once will outperform a fleet of 50 where all 50 fire at once, because the first design amortizes the rate budget across time while the second produces a sawtooth of 429s and retries.
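
A minimal sketch of those mitigations in Python with asyncio, with illustrative constants and a placeholder `call_model` client (both assumptions, not a prescribed setup): a fleet-wide semaphore caps how many agents are mid-flight against the provider at once, and every start time and retry carries jitter so failed agents do not resynchronize.

```python
import asyncio
import random

# Illustrative constants, not provider-specific values.
MODEL_CONCURRENCY = asyncio.Semaphore(20)   # agents allowed mid-flight at once
MAX_RETRIES = 5
BASE_DELAY = 1.0                            # seconds
MAX_DELAY = 60.0

async def call_model_with_backoff(call_model, prompt):
    """Full-jitter exponential backoff around a provider call.

    `call_model` stands in for whatever async client the fleet uses; it is
    assumed to raise on 429s and 5xx responses.
    """
    for attempt in range(MAX_RETRIES):
        async with MODEL_CONCURRENCY:       # cap the "calling the model" phase
            try:
                return await call_model(prompt)
            except Exception:
                if attempt == MAX_RETRIES - 1:
                    raise
        # Full jitter: sleep anywhere in [0, min(cap, base * 2^attempt)],
        # outside the semaphore so a backing-off agent does not hold a slot.
        await asyncio.sleep(random.uniform(0, min(MAX_DELAY, BASE_DELAY * 2 ** attempt)))

async def start_agent(agent_fn, max_stagger: float = 30.0):
    """Deliberately stagger wake-up so a cron trigger does not become a herd."""
    await asyncio.sleep(random.uniform(0, max_stagger))
    return await agent_fn()
```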

Shared-Tool Rate Limit Contention

The nastiest contention is not over CPU or memory — those are cheap and local. It is over the rate budget on a shared downstream, which is a global resource that no single agent can observe. Provider rate limits are usually expressed as three dimensions at once: requests per minute, tokens per minute, and concurrent in-flight requests. Each dimension throttles independently. You can be well under your RPM and still get 429s because you exhausted the concurrent-request cap.

This is where the OS analogy pays off. An OS solves this problem with admission control: the scheduler does not let a process run until it has the resources to complete a quantum. For an agent fleet, the analogue is a centralized limiter that every agent consults before making a model call. The limiter holds the shared budget and returns either "go" or "wait this many milliseconds." Implemented as a token bucket with replenishment tuned to the provider's declared RPM, and a separate semaphore for concurrent in-flight calls, it eliminates the two most common failure modes — the herd on wake-up and the concurrent-request cliff.
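
A minimal version of that limiter, assuming a single-process asyncio fleet and illustrative `rpm` and `max_in_flight` values (the class and names are a sketch, not a specific library's API):

```python
import asyncio
import time

class AdmissionController:
    """Token bucket for requests per minute plus a cap on in-flight calls.

    `rpm` and `max_in_flight` are illustrative; set them below the provider's
    declared limits to leave headroom for other users of the same API key.
    """

    def __init__(self, rpm: int, max_in_flight: int):
        self.capacity = float(rpm)
        self.tokens = float(rpm)
        self.refill_rate = rpm / 60.0          # tokens per second
        self.last_refill = time.monotonic()
        self.in_flight = asyncio.Semaphore(max_in_flight)
        self.lock = asyncio.Lock()

    async def acquire(self) -> None:
        """Block until both a rate token and an in-flight slot are available."""
        while True:
            async with self.lock:
                now = time.monotonic()
                self.tokens = min(self.capacity,
                                  self.tokens + (now - self.last_refill) * self.refill_rate)
                self.last_refill = now
                if self.tokens >= 1:
                    self.tokens -= 1
                    break
                wait = (1 - self.tokens) / self.refill_rate
            await asyncio.sleep(wait)          # the "wait this long" answer
        await self.in_flight.acquire()

    def release(self) -> None:
        self.in_flight.release()
```

Every agent calls `await limiter.acquire()` before the model call and `limiter.release()` in a `finally` block; the budget lives in one place and the agents stay ignorant of it.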

A few things make this harder than it sounds. First, provider rate limits are approximate; they are enforced with some slack and some burst, and the limits you read in the documentation are a lower bound on what you get, not an upper bound. Second, the rate budget is not yours alone; other applications under the same API key share it, and your limiter needs a safety margin. Third, the limiter itself must not become the new single point of failure — a crashed limiter that blocks the whole fleet is worse than the herd it was preventing.

The workable answer is an adaptive limiter: start conservative, add capacity linearly when calls succeed, and cut capacity multiplicatively when they fail. This is AIMD — the same algorithm TCP uses for congestion control — and it converges to the provider's actual current capacity without needing to know it in advance. When the provider has a noisy neighbor or a regional incident, AIMD backs off in seconds. When conditions improve, it ramps back up over the next minute. You do not need to tune it per provider; you tune it per observable signal (error rate, latency p99, explicit 429s).
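
A sketch of that adaptive limiter, with illustrative thresholds: wire `on_success` and `on_failure` to whichever signal you trust (explicit 429s, error rate, p99 latency) and treat `limit` as the ceiling the admission layer enforces at any given moment.

```python
class AIMDLimiter:
    """Adaptive concurrency limit: additive increase, multiplicative decrease.

    Numbers are illustrative; the output `limit` is the ceiling the admission
    layer should enforce right now.
    """

    def __init__(self, initial: int = 4, floor: int = 1, ceiling: int = 64,
                 increase_step: int = 1, decrease_factor: float = 0.5):
        self.limit = initial
        self.floor = floor
        self.ceiling = ceiling
        self.increase_step = increase_step
        self.decrease_factor = decrease_factor
        self._successes = 0

    def on_success(self) -> None:
        # Additive increase: one extra slot per `limit` successes, so the
        # ramp-up is gradual (roughly one "window" per step, as in TCP).
        self._successes += 1
        if self._successes >= self.limit:
            self._successes = 0
            self.limit = min(self.ceiling, self.limit + self.increase_step)

    def on_failure(self) -> None:
        # Multiplicative decrease: back off hard when the provider pushes back.
        self._successes = 0
        self.limit = max(self.floor, int(self.limit * self.decrease_factor))
```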

Work Stealing, Dedicated Queues, and the Scheduling Seam

Once admission control is in place, the next question is how work gets assigned to agents. Two patterns dominate, and they have different tradeoffs.

Dedicated queues give each agent its own inbox. A dispatcher pre-assigns work items to agents at enqueue time. This is simple to reason about, easy to debug (every item has a known owner), and plays well with sticky state — if agent 3 already loaded a large context for customer X, send more of customer X's work to agent 3. The cost is that load imbalance is invisible to the system: if agent 7 gets five hard tasks in a row and agent 2 gets five trivial ones, agent 2 sits idle while agent 7 thrashes.
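
The pre-assignment itself can be as small as a hash of the sticky key; the customer-keyed routing below is an illustration, not the only choice.

```python
import hashlib

def route_to_agent(customer_id: str, n_agents: int) -> int:
    """Pre-assign a task to an agent queue at enqueue time by sticky key.

    A stable hash (not Python's randomized built-in hash()) keeps all of
    customer X's work on the same agent so warm context stays warm; the cost
    is that assignment ignores current queue depth.
    """
    digest = hashlib.sha256(customer_id.encode()).digest()
    return int.from_bytes(digest[:8], "big") % n_agents
```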

Work stealing gives the fleet a shared pool of ready tasks. Each agent pulls from its own local queue first; when local is empty, it steals from a random peer's queue. This is what modern language runtimes use because it self-balances load without any central coordinator. The cost is that it breaks locality: any agent can end up doing any task, so state that was warm in one agent is cold in another.

For agent fleets specifically, dedicated queues work well when tasks are heterogeneous but predictable — "process customer X's data" belongs to the agent that has customer X's context. Work stealing works well when tasks are homogeneous and short — a large batch of similar summarization jobs, for example, where no agent has a locality advantage. A common mistake is to pick one pattern on day one and never revisit it; the right pattern depends on what the tasks look like, and the task distribution is usually not what you predicted.

There is a third option that practitioners often miss: dedicated queues with a steal threshold. Agents work their own queue by default, but if an agent's queue depth exceeds some threshold and a peer's is empty, the peer can steal. This captures most of the balance benefits without the locality loss, and it is closer to how production workflow systems like Temporal and Zeebe actually behave.
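
A sketch of that hybrid, in-process and illustrative only (real fleets usually push this into a workflow engine or shared queue service): each agent drains its own deque, and an idle agent may steal from a peer only once that peer's depth crosses the threshold.

```python
import collections
import random
import threading

class StealableQueue:
    """Per-agent queue that peers may steal from only past a depth threshold."""

    def __init__(self, steal_threshold: int = 4):
        self.items = collections.deque()
        self.steal_threshold = steal_threshold
        self.lock = threading.Lock()

    def push(self, task) -> None:
        with self.lock:
            self.items.append(task)

    def pop_local(self):
        # Owner drains its own queue in FIFO order.
        with self.lock:
            return self.items.popleft() if self.items else None

    def steal(self):
        # Thieves take from the tail, and only when the owner is clearly backed up.
        with self.lock:
            if len(self.items) > self.steal_threshold:
                return self.items.pop()
            return None

def next_task(my_queue: StealableQueue, peers: list[StealableQueue]):
    """Work the dedicated queue first; steal from a random overloaded peer only when idle."""
    task = my_queue.pop_local()
    if task is not None:
        return task
    for peer in random.sample(peers, k=len(peers)):
        task = peer.steal()
        if task is not None:
            return task
    return None
```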

Coordination Through External State

The biggest source of deadlock in multi-agent systems is also the most obvious in hindsight: agents trying to coordinate by talking to each other. A recent benchmark adapts the Dining Philosophers problem to LLM agents and finds that three agents making simultaneous decisions about shared resources deadlock up to 95–100% of the time. The root cause is convergent reasoning — independently smart agents pick the same "reasonable" action, and that same action is exactly the one that conflicts.

The fix is to stop trying to coordinate through the agents. Coordinate through external state. A shared database row with a version number, a distributed lock with a lease, a leader election backed by a real consensus system — these work because they move the coordination out of the nondeterministic part of the system and into the deterministic part. The agents then make local decisions that respect an authoritative external state, and the coordination problem becomes a well-understood distributed-systems problem rather than an open research question in multi-agent AI.

Concretely, this means a few patterns are worth internalizing. If two agents might touch the same record, put an advisory lock on it and have the second agent wait. If you need a single agent to own a long-running task, elect a leader and give it a lease that expires if the leader dies. If you need to prevent a herd on wake-up, have agents register their intent in a shared table and only let N of them proceed at a time. These are boring, old, and reliable. They beat anything that requires agents to negotiate with each other.
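
As one concrete example, the lease pattern reduces to a single atomic UPDATE against a shared table. The `db.execute` helper and the `leases` schema below are assumptions made for the sketch; the point is that the race is resolved by the database, not by agent negotiation.

```python
import time
import uuid

LEASE_SECONDS = 30
WORKER_ID = str(uuid.uuid4())   # stable per process in a real deployment

def try_acquire_lease(db, name: str) -> bool:
    """Take or renew a named lease with one atomic UPDATE.

    Assumes a hypothetical `db.execute(sql, params)` helper that returns the
    affected row count, and a `leases` table seeded once per lease name:
        CREATE TABLE leases (name TEXT PRIMARY KEY,
                             owner TEXT,
                             expires_at DOUBLE PRECISION);
    Because the UPDATE is a single statement, two agents racing for the same
    lease cannot both see a row count of 1.
    """
    now = time.time()
    updated = db.execute(
        """
        UPDATE leases
           SET owner = %(owner)s, expires_at = %(expires)s
         WHERE name = %(name)s
           AND (owner = %(owner)s OR expires_at < %(now)s)
        """,
        {"owner": WORKER_ID, "name": name,
         "expires": now + LEASE_SECONDS, "now": now},
    )
    return updated == 1

# The leader renews well before expiry; everyone else retries after a jittered
# sleep, and a dead leader's lease simply lapses.
```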

Operational Primitives You Should Already Have

Three pieces of infrastructure belong in any fleet that runs more than a handful of agents.

A circuit breaker per dependency. When a downstream provider starts failing, the entire fleet needs to react identically and immediately. A per-agent retry loop is exactly the wrong response: it multiplies load against a dependency that is already unhealthy. A shared circuit breaker cuts the load to zero, gives the dependency room to recover, and lets one probe request through periodically to detect recovery. For LLM-specific circuit breakers, you also need to handle the weird case where the request "succeeds" but the output is garbage — a signal that the provider has degraded quality without degrading availability. That is hard to detect and worth building detection for separately.
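
A minimal shared breaker, with illustrative thresholds and without the quality-degradation detection mentioned above, might look like this:

```python
import time

class CircuitBreaker:
    """One shared breaker per downstream dependency; thresholds are illustrative.

    closed: traffic flows.  open: fail fast.  half_open: one probe is allowed
    through to test recovery.  (Single-threaded / asyncio use is assumed.)
    """

    def __init__(self, failure_threshold: int = 5, recovery_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.recovery_timeout = recovery_timeout
        self.failures = 0
        self.state = "closed"
        self.opened_at = 0.0

    def allow_request(self) -> bool:
        if self.state == "closed":
            return True
        if self.state == "open" and time.monotonic() - self.opened_at >= self.recovery_timeout:
            self.state = "half_open"      # exactly one probe goes out
            return True
        return False                      # still open, or the probe is already in flight

    def record_success(self) -> None:
        self.failures = 0
        self.state = "closed"

    def record_failure(self) -> None:
        self.failures += 1
        if self.state == "half_open" or self.failures >= self.failure_threshold:
            self.state = "open"
            self.opened_at = time.monotonic()
```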

Bounded concurrency at the layer that owns the shared resource. A semaphore around the provider call is non-negotiable. A semaphore around database writes is usually non-negotiable. A semaphore around expensive tools (search APIs, code execution sandboxes) is often overlooked and always necessary. The right place for these is not inside the agent loop — agents are optimized for reasoning, not for resource management — but in the layer that wraps the tool.
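
One way to keep that bound in the tool-wrapping layer is a decorator that owns the semaphore, as in this sketch (the tool names are hypothetical):

```python
import asyncio
import functools

def bounded(max_concurrent: int):
    """Wrap a tool coroutine so at most `max_concurrent` calls run fleet-wide.

    The semaphore lives in the wrapper, not in the agent loop, so every agent
    shares the same bound without having to know it exists.
    """
    semaphore = asyncio.Semaphore(max_concurrent)

    def decorator(tool):
        @functools.wraps(tool)
        async def wrapper(*args, **kwargs):
            async with semaphore:
                return await tool(*args, **kwargs)
        return wrapper
    return decorator

# Hypothetical tools; the decorator is the point, not the bodies.
@bounded(5)
async def run_search(query: str): ...

@bounded(2)
async def run_sandbox(code: str): ...
```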

Idempotent task IDs end-to-end. Every task that enters the fleet should carry an ID that is preserved through every retry, every compensation, and every checkpoint. This is the single cheapest piece of infrastructure you can add and it pays for itself the first time you need to debug a fleet-wide incident. Without it, you cannot tell whether a task ran once, twice, or zero times, and you cannot tell whether the duplicate result in your database is a bug or an artifact of a retry.
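
A sketch of the discipline, assuming a Postgres-style results table keyed by `task_id` and the same hypothetical `db.execute` helper as above: the ID is minted once when the task enters the fleet, and results are upserted so a retry overwrites rather than duplicates.

```python
import json
import uuid
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Task:
    """A fleet task whose ID is minted once and survives retries and checkpoints."""
    payload: dict
    task_id: str = field(default_factory=lambda: str(uuid.uuid4()))

def record_result(db, task: Task, result: dict) -> None:
    """Upsert keyed by task_id so a retried task overwrites its own result.

    Assumes a hypothetical `db.execute(sql, params)` helper and a Postgres-style
    `results(task_id TEXT PRIMARY KEY, result JSONB)` table.
    """
    db.execute(
        """
        INSERT INTO results (task_id, result)
        VALUES (%(id)s, %(result)s)
        ON CONFLICT (task_id) DO UPDATE SET result = EXCLUDED.result
        """,
        {"id": task.task_id, "result": json.dumps(result)},
    )
```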

The Pattern Underneath

The uncomfortable truth is that running a fleet of agents is not an AI problem. It is an operating systems problem dressed in AI clothing. The primitives that matter — admission control, bounded concurrency, backpressure, circuit breakers, leader election, idempotent identifiers — were all worked out decades before transformers existed. The surprising part is that the AI community is rediscovering them one at a time, usually after a production incident.

The upside is that once you see the pattern, the mitigations are concrete and boring. They do not require novel research. They require the same discipline that any competent distributed system requires, applied to the part of the stack where the language model lives. Build the scheduler layer first, and the agent layer on top of it becomes much easier to reason about — because the agent is doing what it does best (reasoning about a task) while the scheduler does what it does best (making sure the fleet does not eat itself).

The agents will get smarter. The scheduling problems will not go away. Invest in the layer that will still be there when the model names change.
