Agent Fleet Concurrency: Coordinating Dozens of Agents Without Deadlock or the Thundering Herd
Eleven agents started at the same second. Three died before the first tool call returned. That 27% fatality rate was not a model problem, a prompt problem, or a tool problem. It was a scheduling problem — the same kind of problem an operating system solves when fifty processes wake up at once and fight over a single CPU. The difference is that the OS has forty years of accumulated wisdom and the agent runtime has about two.
Anyone who has wired up more than a handful of concurrent LLM workers has seen some version of this. You kick off a scheduled job at 02:00, thirty agents spin up, they all hit the same provider within 200 ms of each other, and most of them fail with a mix of 429s, 502s, and connection resets. The survivors get half the rate budget they were promised because the provider's fair-share logic has already started throttling your API key. By 02:05 the surviving agents finish and your dashboard shows a completion rate that would embarrass a first-year CS student writing their first producer-consumer. Your on-call rotation debates whether to add retries, add a queue, or just run fewer of them.
None of those are the right answer by themselves. The right answer is that a fleet of agents is a small distributed system and needs to be designed like one.
The Fleet Is Not the Agent
A single agent is a language model in a loop. It gets a prompt, it calls tools, it calls tools again, it stops. You can reason about its latency as a sum of token generation plus tool time. You can reason about its failure modes as "the model got confused" or "the tool timed out." You can debug it with a transcript.
A fleet of agents is not that. It is thirty agents contending for a shared rate budget, a shared pool of database connections, a shared filesystem, and a shared downstream service that does not care how clever your prompting is. The fleet has emergent behaviors that no single agent can produce on its own: convoy effects, priority inversion, livelock, herd synchronization, and fair-share starvation. The agents are not the interesting part. The contention is.
This reframing matters because it tells you where to put engineering effort. If you treat the fleet as "thirty copies of one agent," you will spend your time tuning prompts and wondering why reliability does not scale past ten concurrent workers. If you treat the fleet as a scheduler problem, you will spend your time on admission control, backpressure, and coordination primitives — and those are the things that actually move the needle.
The Thundering Herd, Rediscovered
A thundering herd is not a traffic problem; it is a synchronization problem. It is what happens when a large number of workers wake up at the same instant and all try to use a shared resource. The classic examples — a cache entry expiring, a failover promoting a new leader, a cron job firing — are exactly the patterns that agent fleets reproduce.
The 02:00 cron trigger is the obvious one. Less obvious is the implicit herd that appears at the end of a common step. Every agent in a fleet tends to finish step N and start step N+1 at roughly the same time, because their runtimes are dominated by the same model with roughly the same latency distribution. So even if you staggered their start, they resynchronize at each tool boundary, the same way parallel HTTP fetches that start together all ramp through TCP slow-start together and hit the shared bottleneck at the same moment.
The mitigations are not new. Add jitter to every timer. Stagger start times deliberately rather than accidentally. Use exponential backoff with randomness, not fixed backoff. Cap the number of agents that can be in the "calling the model" phase at any one time, separately from the number that can exist in the fleet. A fleet of 100 agents where only 20 can be mid-flight against the provider at once will outperform a fleet of 50 where all 50 fire at once, because the first design amortizes the rate budget across time while the second produces a sawtooth of 429s and retries.
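A minimal sketch of those mitigations, assuming an asyncio runtime. The fleet size and the in-flight cap are separate knobs, starts are staggered with jitter, and retries use exponential backoff with full jitter. The constants, `fake_provider_call`, and the error type are all placeholders for illustration, not any particular SDK.

```python
import asyncio
import random

# Illustrative knobs: the fleet can hold 100 agents, but only 20 may be
# mid-flight against the provider at any one moment.
FLEET_SIZE = 100
MAX_IN_FLIGHT = 20
model_slots = asyncio.Semaphore(MAX_IN_FLIGHT)


class TransientProviderError(Exception):
    """Stand-in for a 429, 502, or connection reset from the provider."""


async def fake_provider_call(agent_id: int) -> str:
    # Placeholder for the real model call; fails randomly to exercise retries.
    await asyncio.sleep(0.05)
    if random.random() < 0.2:
        raise TransientProviderError()
    return f"agent {agent_id}: ok"


async def call_model(agent_id: int, max_attempts: int = 5) -> str:
    """In-flight cap plus full-jitter exponential backoff around one model call."""
    for attempt in range(max_attempts):
        async with model_slots:  # only MAX_IN_FLIGHT agents past this point at once
            try:
                return await fake_provider_call(agent_id)
            except TransientProviderError:
                pass
        # Full jitter: sleep a random amount up to the exponential cap, so the
        # retrying agents do not re-form the herd on the same schedule.
        await asyncio.sleep(random.uniform(0, min(30.0, 2.0 ** attempt)))
    raise RuntimeError(f"agent {agent_id}: exhausted {max_attempts} attempts")


async def run_agent(agent_id: int) -> None:
    # Deliberate stagger: spread starts across a window instead of one instant.
    await asyncio.sleep(random.uniform(0, 5.0))
    print(await call_model(agent_id))


async def main() -> None:
    await asyncio.gather(*(run_agent(i) for i in range(FLEET_SIZE)))


if __name__ == "__main__":
    asyncio.run(main())
```

The point of the two separate knobs is visible in the sketch: the fleet can be as large as the work demands, while the semaphore alone decides how hard the provider gets hit.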
Shared-Tool Rate Limit Contention
The nastiest contention is not over CPU or memory — those are cheap and local. It is over the rate budget on a shared downstream, which is a global resource that no single agent can observe. Provider rate limits are usually expressed as three dimensions at once: requests per minute, tokens per minute, and concurrent in-flight requests. Each dimension throttles independently. You can be well under your RPM and still get 429s because you exhausted the concurrent-request cap.
This is where the OS analogy pays off. An OS solves this problem with admission control: the scheduler does not let a process run until it has the resources to complete a quantum. For an agent fleet, the analogue is a centralized limiter that every agent consults before making a model call. The limiter holds the shared budget and returns either "go" or "wait this many milliseconds." Implemented as a token bucket with replenishment tuned to the provider's declared RPM, and a separate semaphore for concurrent in-flight calls, it eliminates the two most common failure modes — the herd on wake-up and the concurrent-request cliff.
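One way to sketch that limiter, again assuming an asyncio fleet: a token bucket paced at the declared RPM plus a semaphore for concurrent in-flight calls. The class name and defaults are illustrative, and a tokens-per-minute budget would be a second bucket charged with an estimated token count, omitted here to keep the sketch short.

```python
import asyncio
import time


class FleetLimiter:
    """Central admission control consulted before every model call.

    One token bucket replenished at the provider's declared requests-per-minute
    plus one semaphore capping concurrent in-flight calls. Defaults are
    illustrative, not any provider's published limits.
    """

    def __init__(self, rpm: int = 300, max_in_flight: int = 20) -> None:
        self._capacity = float(rpm)
        self._tokens = float(rpm)
        self._refill_per_sec = rpm / 60.0
        self._last_refill = time.monotonic()
        self._lock = asyncio.Lock()
        self._in_flight = asyncio.Semaphore(max_in_flight)

    async def _acquire_request_token(self) -> None:
        while True:
            async with self._lock:
                now = time.monotonic()
                elapsed = now - self._last_refill
                self._tokens = min(self._capacity,
                                   self._tokens + elapsed * self._refill_per_sec)
                self._last_refill = now
                if self._tokens >= 1.0:
                    self._tokens -= 1.0
                    return
                # "Wait this many milliseconds": time until one whole token exists.
                wait = (1.0 - self._tokens) / self._refill_per_sec
            await asyncio.sleep(wait)

    async def run(self, call):
        """Admit one call: wait for a request token, then hold an in-flight slot."""
        await self._acquire_request_token()
        async with self._in_flight:
            return await call()
```

A single shared instance sits in front of the whole fleet; each agent wraps its provider call as something like `await limiter.run(lambda: client.chat(prompt))`, where `client.chat` stands in for whatever SDK call you actually make. The budget lives in one place instead of being rediscovered thirty times.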
A few things make this harder than it sounds. First, provider rate limits are approximate; they are enforced with some slack and some burst, so the number in the documentation is a nominal target, not an exact point at which any individual call will be rejected. Second, the rate budget is not yours alone; other applications under the same API key share it, and your limiter needs a safety margin. Third, the limiter itself must not become the new single point of failure: a crashed limiter that blocks the whole fleet is worse than the herd it was preventing.
The workable answer is an adaptive limiter: start conservative, add capacity linearly when calls succeed, and cut capacity multiplicatively when they fail. This is AIMD — the same algorithm TCP uses for congestion control — and it converges to the provider's actual current capacity without needing to know it in advance. When the provider has a noisy neighbor or a regional incident, AIMD backs off in seconds. When conditions improve, it ramps back up over the next minute. You do not need to tune it per provider; you tune it per observable signal (error rate, latency p99, explicit 429s).
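A sketch of that adaptive loop, with illustrative constants rather than tuned values:

```python
class AIMDLimit:
    """Adaptive concurrency limit: additive increase, multiplicative decrease.

    Start conservative; grow the limit slowly while calls succeed, and halve
    it immediately on any throttle signal (429, timeout, latency p99 breach).
    """

    def __init__(self, floor: int = 2, ceiling: int = 64) -> None:
        self.floor = floor
        self.ceiling = ceiling
        self.limit = floor            # current allowed in-flight calls
        self._successes = 0

    def on_success(self) -> None:
        self._successes += 1
        # Additive increase: one extra slot per `limit` clean calls,
        # so growth gets slower as the limit gets larger.
        if self._successes >= self.limit:
            self._successes = 0
            self.limit = min(self.ceiling, self.limit + 1)

    def on_throttle(self) -> None:
        # Multiplicative decrease: back off hard and immediately.
        self._successes = 0
        self.limit = max(self.floor, self.limit // 2)
```

The current `limit` feeds whatever gate controls in-flight calls (the fixed semaphore in the earlier sketch would need to be resizable); every clean response reports `on_success`, and every 429, timeout, or latency-budget breach reports `on_throttle`. The limit then tracks whatever capacity the provider is actually granting right now, not the number printed in the documentation.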
Work Stealing, Dedicated Queues, and the Scheduling Seam
