
The Tool That Worked Until Two Agents Called It At Once

· 9 min read
Tian Pan
Software Engineer

A tool passes its tests. You called it from one agent, watched it read a record, transform it, write it back, and return a clean result. It did exactly that, every time, for weeks. Then you scaled the agent fleet from one worker to twelve, and a customer reported that their subscription got upgraded twice in the same minute. The tool did not change. The number of things calling it did.

This is the failure mode that single-agent testing cannot catch, because single-agent testing never produces the condition that triggers it. One caller is, by construction, a serial workload. Every concurrency assumption your tool quietly relies on — that nobody else is mid-write when it reads, that a counter it increments is its own, that the draft it is editing will still be there when it saves — holds trivially when there is exactly one caller. The tool is not correct. It is untested. Those are different things, and the difference stays invisible until a second agent shows up.

The reason this catches teams off guard is that the agent layer hides the concurrency from the people writing tools. A backend engineer who builds an update_account endpoint knows it will be hammered by many clients and designs for it. But a tool exposed to an agent often starts life as a thin wrapper someone wrote in an afternoon to let the model "do the thing." It feels like a function call. It looks like a function call in the trace. And functions, in the mental model most of us carry, do not have other functions running inside them at the same time. The agent framework makes a distributed-systems problem look like a local one.

A Single Caller Certifies Nothing

Consider what your eval suite actually exercised. It sent a prompt, the agent picked a tool, the tool ran to completion, and the result came back. Even if you ran a hundred eval cases, you ran them as a hundred serial episodes. The tool never observed a world where its own preconditions were being mutated by a peer.

Concurrency bugs are not rare events that you got lucky enough to miss. They are a different category of behavior that your test setup is structurally incapable of producing. You can increase eval coverage from 100 cases to 10,000 and still have zero coverage of the two-callers-at-once case, because coverage of failure modes is not the same as coverage of inputs. A serial test harness has exactly one sample of the concurrency dimension, and that sample is "concurrency = 1."

The practical consequence: "the tool passes" is a statement about a workload you will not run in production. The moment an orchestrator fans out subtasks to parallel workers, or you run two agents for two different users that happen to touch the same shared resource, or a single agent's retry overlaps with its own original call, you are in untested territory. Race conditions in multi-agent systems are a predictable consequence of parallel execution, not edge cases — and the system will not raise an error when it corrupts state. It will just return a plausible-looking result built on a read that was already stale.
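One way to get a second sample of the concurrency dimension is a harness that releases several callers into the tool at the same instant and then checks an invariant. The sketch below is illustrative only: run_concurrently and SubscriptionService are hypothetical names, and the service is an in-memory stand-in for whatever the real tool touches. A harness like this can expose a race but cannot prove that none exists.

```python
import threading


def run_concurrently(tool, n_callers):
    """Call tool(i) from n_callers threads, released together by a barrier."""
    barrier = threading.Barrier(n_callers)
    results = [None] * n_callers

    def worker(i):
        barrier.wait()  # hold every caller until all of them are ready
        results[i] = tool(i)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n_callers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


class SubscriptionService:
    """Hypothetical stand-in for the tool's backing state."""

    def __init__(self):
        self._lock = threading.Lock()
        self.plan = "free"
        self.upgrade_count = 0

    def upgrade(self, caller_id):
        # The check and the write happen under one lock, so only one caller wins.
        with self._lock:
            if self.plan == "pro":
                return False
            self.plan = "pro"
            self.upgrade_count += 1
            return True


service = SubscriptionService()
outcomes = run_concurrently(service.upgrade, n_callers=12)
# Invariant worth asserting: twelve simultaneous callers, exactly one upgrade.
applied = outcomes.count(True)
```

The useful part is the invariant, not the thread count: "exactly one upgrade per customer" is the kind of statement a serial eval suite never gets asked to check.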

Where Agents Collide

It helps to be concrete about what "shared state" means here, because the contention points are not always the obvious database.

Mutable records. Two agents both run read account → modify → write account on the same customer. They both read version 5. Agent A writes version 6 with its change. Agent B writes version 6 with its change, overwriting A. This is the classic lost update, and nothing logs an error — both writes "succeeded."
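The interleaving above can be replayed deterministically in a few lines. This is a minimal sketch, assuming an in-memory store with a version field; AccountStore, blind_write, and write_if_version are hypothetical names, and the version check shown is one common mitigation (optimistic concurrency), not something the tool in this story necessarily uses.

```python
import threading
from dataclasses import dataclass


@dataclass
class Versioned:
    version: int
    data: dict


class AccountStore:
    """Hypothetical in-memory stand-in for a record with a version column."""

    def __init__(self, data, version=5):
        self._lock = threading.Lock()
        self._record = Versioned(version, dict(data))

    def read(self):
        with self._lock:
            return Versioned(self._record.version, dict(self._record.data))

    def blind_write(self, record):
        # Accepts whatever the caller computed, with no check on what it read.
        with self._lock:
            self._record = Versioned(record.version, dict(record.data))

    def write_if_version(self, expected_version, data):
        # Compare-and-set: rejects the write if the record moved since the read.
        with self._lock:
            if self._record.version != expected_version:
                return False
            self._record = Versioned(expected_version + 1, dict(data))
            return True


def lost_update():
    store = AccountStore({"plan": "free", "seats": 1})
    a, b = store.read(), store.read()  # both agents read version 5
    store.blind_write(Versioned(a.version + 1, {**a.data, "plan": "pro"}))
    store.blind_write(Versioned(b.version + 1, {**b.data, "seats": 5}))
    return store.read()  # version 6, but agent A's change is gone


def guarded_update():
    store = AccountStore({"plan": "free", "seats": 1})
    a, b = store.read(), store.read()
    a_ok = store.write_if_version(a.version, {**a.data, "plan": "pro"})
    b_ok = store.write_if_version(b.version, {**b.data, "seats": 5})
    return a_ok, b_ok, store.read()  # B is told it lost and can re-read
```

With the guarded write, agent B gets an explicit rejection instead of a silent success, which gives the tool something to retry on or report back to the model.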

Counters and quotas. A tool that decrements a credit balance, increments a usage counter, or claims a slot from a fixed pool is a race waiting to happen. Two agents read "1 slot left," both decide they can proceed, both decrement, and now you have allocated two things into one slot.
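A small sketch of the check-then-act race on a pool with a single remaining slot, and of an atomic claim that closes it. SlotPool and the helper functions are hypothetical; the barrier exists only to force the bad interleaving so the example behaves the same way on every run.

```python
import threading


class SlotPool:
    """Hypothetical fixed pool; 'remaining' is the shared counter."""

    def __init__(self, remaining):
        self.remaining = remaining
        self._lock = threading.Lock()

    def decrement(self):
        # Locks the write only. The caller's earlier check is not protected.
        with self._lock:
            self.remaining -= 1

    def claim(self):
        # Check and decrement under one lock: the whole decision is atomic.
        with self._lock:
            if self.remaining > 0:
                self.remaining -= 1
                return True
            return False


def _run(agent, n_agents):
    threads = [threading.Thread(target=agent, args=(i,)) for i in range(n_agents)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


def racy_claims(pool, n_agents):
    barrier = threading.Barrier(n_agents)
    granted = []

    def agent(i):
        seen = pool.remaining  # read
        barrier.wait()         # every agent has read before any agent acts
        if seen > 0:           # decision based on a stale value
            pool.decrement()
            granted.append(i)

    _run(agent, n_agents)
    return granted


def safe_claims(pool, n_agents):
    granted = []

    def agent(i):
        if pool.claim():
            granted.append(i)

    _run(agent, n_agents)
    return granted
```

Note that the racy version already locks the decrement. The lock does not help, because the decision to proceed was made on a value read before the lock was taken.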

Shared API credentials. Many agents call the same downstream API with the same key, against the same provider rate limit. A tool tested with one caller never approaches the quota. Run twelve in parallel and they collectively blow through it — each agent's individual behavior is reasonable, but the aggregate is a self-inflicted denial of service. This is the noisy-neighbor problem moved inside your own system: one fleet of agents starves itself because the rate limit is a shared resource nobody is accounting for per-caller.
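One way to make the shared quota visible is to route every agent through a single limiter that also records who consumed what. The following is a rough token-bucket sketch under stated assumptions: SharedRateLimiter is a hypothetical class, it only coordinates threads inside one process, and the rate and capacity values are placeholders rather than any provider's real limits. Agents spread across processes or machines would need the bucket in a shared store instead.

```python
import threading
import time


class SharedRateLimiter:
    """Token bucket shared by every agent that uses the same API key."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._rate = rate_per_sec
        self._capacity = capacity
        self._tokens = float(capacity)
        self._clock = clock  # injectable so the limiter can be tested without sleeping
        self._last = clock()
        self._lock = threading.Lock()
        self._granted = {}

    def try_acquire(self, caller_id):
        """Return True if this caller may make one downstream request now."""
        with self._lock:
            now = self._clock()
            refill = (now - self._last) * self._rate
            self._tokens = min(self._capacity, self._tokens + refill)
            self._last = now
            if self._tokens >= 1:
                self._tokens -= 1
                self._granted[caller_id] = self._granted.get(caller_id, 0) + 1
                return True
            return False

    def granted_by_caller(self):
        """Per-caller accounting of how much of the shared budget each agent used."""
        with self._lock:
            return dict(self._granted)


# Twelve agents, a budget of ten, and no refill during the burst.
limiter = SharedRateLimiter(rate_per_sec=0.0, capacity=10, clock=lambda: 0.0)
burst = [limiter.try_acquire(f"agent-{i}") for i in range(12)]
```

A denied caller can back off or queue instead of sending a request the provider would reject anyway, and the per-caller counts show which agents are consuming the budget.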

Draft and scratch state. An agent that builds up a document, a cart, or a plan in a shared scratch location assumes the scratch is its own. Two agents pointed at the same key clobber each other's intermediate work, and the final artifact is an incoherent merge neither of them intended.
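A minimal illustration of the clobbering, and of one possible fix: scoping scratch keys to the run that owns them. The dict stands in for whatever shared scratch location the agents use, and scratch_key, append_step, and the run IDs are hypothetical. In practice a run ID might come from the orchestrator or from uuid.uuid4().

```python
def scratch_key(artifact, run_id):
    """Scope a scratch location to a single agent run."""
    return f"{artifact}:{run_id}"


def append_step(store, key, step):
    """Add one intermediate step to the draft stored under key."""
    store.setdefault(key, []).append(step)


def shared_key_drafts():
    store = {}
    # Two agents interleave writes to the same key.
    append_step(store, "plan", "A: book flight")
    append_step(store, "plan", "B: cancel order")
    append_step(store, "plan", "A: reserve hotel")
    append_step(store, "plan", "B: issue refund")
    return store


def scoped_key_drafts():
    store = {}
    key_a = scratch_key("plan", "run-a")
    key_b = scratch_key("plan", "run-b")
    append_step(store, key_a, "A: book flight")
    append_step(store, key_b, "B: cancel order")
    append_step(store, key_a, "A: reserve hotel")
    append_step(store, key_b, "B: issue refund")
    return store
```

With a shared key, the stored plan is a four-step mix of two unrelated tasks. With scoped keys, each run reads back only the steps it wrote.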
