Your Eval Harness Runs Single-User. Your Agents Don't.
Your agent passes 92% of your eval suite. You ship it. Within an hour of real traffic, failures that never appeared in any trace start showing up: agents stall on rate-limit retry storms, a customer sees another customer's draft email in a tool response, and your provider connection pool sits at 100% utilization while CPU is idle. None of these failures live in the model. They live in the gap between how you tested and how production runs.
The gap has a single shape. Your eval harness loops one agent at a time through a fixed dataset. Your production loops many agents at once through shared infrastructure. Sequential evaluation hides every bug whose precondition is "two things touching the same resource." Until you build adversarial concurrency into the harness itself, those bugs will only surface as on-call pages.
The Bugs a Sequential Harness Cannot Surface
There are four categories of failure that the standard "for case in dataset: run_agent(case)" loop is structurally incapable of catching. Each one has a benign staging-friendly façade and an ugly production face.
Per-tenant tool rate-limit exhaustion. A single agent run uses well under any reasonable per-minute quota. Nine agents from the same tenant launched in the same window do not. One report from a multi-agent system showed nine concurrent agents exhausting GitHub's 5,000-per-hour quota in 22 minutes, with synchronized retries producing 60-plus chained 429s in a single incident. The eval pass rate was 100% — every agent could complete a PR in isolation. The integration was the bug.
Scratchpad and memory key collisions. Frameworks lean heavily on string-keyed shared stores: agent:{run_id}, memory:{user_id}:notes, cache:{hash(query)}. Single-user tests with deterministic IDs cannot collide. Production with concurrent runs, missing tenant prefixes, or hash collisions on similar inputs produces silent overwrites where Agent B reads what Agent A wrote three milliseconds ago. This is the same shape as a classic race on a shared counter, except the corrupted state is a JSON blob the next agent treats as authoritative context.
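A minimal sketch of the key shape that bites, using an in-process dict as a stand-in for the shared store; the helper names are hypothetical, but the bug is just the missing run scope in the key:

```python
# Toy stand-in for a shared string-keyed memory store (Redis, a document DB, etc.).
store: dict[str, str] = {}

def save_notes(user_id: str, run_id: str, draft: str) -> None:
    # BUG: the key is scoped to the user, not the run. Two concurrent runs for
    # the same user write the same slot, and whichever writes last wins silently.
    store[f"memory:{user_id}:notes"] = draft

def load_notes(user_id: str, run_id: str) -> str:
    # Under concurrency this may return another run's context, written
    # milliseconds ago, and the agent will treat it as authoritative.
    return store[f"memory:{user_id}:notes"]

def save_notes_scoped(user_id: str, run_id: str, draft: str) -> None:
    # One fix: include the run in the key (or use a versioned compare-and-set).
    store[f"memory:{user_id}:{run_id}:notes"] = draft
```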
Semantic-cache cross-tenant bleed. Semantic caching trades exact-match for embedding similarity. A tenant-aware key prevents leakage; a tenant-blind key turns the cache into a side channel. Recent work demonstrated that adversarial queries against shared semantic caches achieve 90.6% hit rates on entries belonging to other users — meaning a cleverly phrased question gets back an answer computed for someone else. Sequential single-tenant evals never inject a second tenant, so the test never sees what an attacker (or a buggy hash function) would see.
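A self-contained sketch of where the tenant ID has to live, with a toy word-overlap similarity standing in for real embeddings; the class and threshold are illustrative, not any particular library's API:

```python
from dataclasses import dataclass, field

def embed(text: str) -> set[str]:
    # Toy stand-in for an embedding: a bag of lowercase words.
    return set(text.lower().split())

def similarity(a: set[str], b: set[str]) -> float:
    # Jaccard overlap standing in for cosine similarity.
    return len(a & b) / len(a | b) if a | b else 0.0

@dataclass
class SemanticCache:
    threshold: float = 0.6
    entries: list[tuple[str, set[str], str]] = field(default_factory=list)  # (tenant, vector, answer)

    def put(self, tenant: str, query: str, answer: str) -> None:
        self.entries.append((tenant, embed(query), answer))

    def get_tenant_blind(self, query: str) -> str | None:
        # BUG: ignores tenancy, so a similar phrasing from any tenant can hit this entry.
        vec = embed(query)
        best = max(((similarity(vec, v), a) for _, v, a in self.entries), default=(0.0, None))
        return best[1] if best[0] >= self.threshold else None

    def get_tenant_aware(self, tenant: str, query: str) -> str | None:
        # Fix: partition by tenant before any similarity matching happens.
        vec = embed(query)
        best = max(((similarity(vec, v), a) for t, v, a in self.entries if t == tenant),
                   default=(0.0, None))
        return best[1] if best[0] >= self.threshold else None
```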
Provider connection-pool starvation. Streaming completions hold a connection for the full duration of generation. A 30-second response is 30 seconds of pool occupancy spent mostly idle waiting for tokens. In a thread-per-job model, a handful of long-running agents can pin every connection in your pool while CPU sits at 5%. The single-user benchmark says the system is fast and the pool is healthy. The production graph says the pool is full and new requests are queued behind streams that will not finish for half a minute.
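The occupancy math is easy to see with a toy semaphore standing in for the connection pool (durations scaled down from the 30-second case so the demo runs quickly):

```python
import asyncio
import time

POOL = asyncio.Semaphore(10)  # stand-in for a 10-connection provider pool

async def completion(duration_s: float) -> None:
    # A streaming completion holds its connection for the whole generation,
    # even though almost all of that time is idle waiting for tokens.
    async with POOL:
        await asyncio.sleep(duration_s)

async def main() -> None:
    start = time.monotonic()
    # Ten long streams pin every connection in the pool...
    long_streams = [asyncio.create_task(completion(3.0)) for _ in range(10)]
    await asyncio.sleep(0.1)
    # ...so a request that needs the pool for 0.2s still waits ~3s to start.
    await completion(0.2)
    print(f"short request finished after {time.monotonic() - start:.1f}s")
    await asyncio.gather(*long_streams)

asyncio.run(main())
```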
Notice the pattern. Each category fails in a way that looks like infrastructure, not intelligence. Each category requires at least two simultaneous agents to reproduce. Each one is invisible to your golden dataset.
Why "It Worked in Staging" Is the Wrong Sentence
When an engineer says staging passed, they almost always mean one of two things: the offline eval suite scored above threshold, or a manual smoke test in a staging environment returned the right answer. Both are single-actor experiments. Staging environments typically have one engineer poking at them at a time, often with a fresh database and warm caches. Even staging deployed behind a load test usually replays uniformly distributed synthetic traffic — the well-behaved cousin of the actual workload.
Real workloads are bursty, multi-tenant, and adversarial in distribution even when no attacker is involved. A single customer's batch job produces correlated calls. A cron schedule causes synchronized traffic from every account. A retry policy turns one slow upstream into amplified load. None of these patterns exist in your harness because you never ran two cases at the same time.
Anthropic's own engineering writing on agent evals notes that shared state between trials can both mask failures and inflate performance — they observed Claude gaining an unfair advantage on some tasks by examining git history left over from previous trials. That is the friendly version. The hostile version is two production agents racing on the same git working copy and corrupting it.
The reframe is uncomfortable but accurate: a sequential harness is not measuring your agent. It is measuring your agent under the most charitable possible scheduling assumption. Production removes the charity.
Designing a Multi-User Eval Harness
The fix is not exotic. It is the discipline of building, at evaluation time, the same conditions your production environment imposes. Four design moves separate a single-user harness from a multi-user one.
Run cases in parallel, not just batched. Batching means submitting many cases for sequential processing. Parallelism means launching N agents that share resources at the same wall-clock time. The implementation is a thread pool or async fan-out around your existing case runner — but with one critical addition: the resources (tools, caches, memory stores, model client pools) must be the same instances across runs, not per-case fixtures. The whole point is contention. Per-case isolation defeats the test.
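A sketch of the fan-out, assuming you already have a per-case runner; the SharedResources container and the run_case signature are placeholders for whatever your harness uses. The essential detail is that every concurrent case receives the same instances:

```python
import asyncio
from dataclasses import dataclass
from typing import Any

@dataclass
class SharedResources:
    # One instance of each resource for the whole suite, shared the same way
    # production shares them, rather than a fresh fixture per case.
    tool_client: Any
    semantic_cache: Any
    scratchpad: Any

async def run_case(case: dict, resources: SharedResources) -> dict:
    ...  # your existing single-case runner, now taking the shared resources

async def run_suite(cases: list[dict], resources: SharedResources,
                    concurrency: int = 50) -> list[dict]:
    sem = asyncio.Semaphore(concurrency)

    async def bounded(case: dict) -> dict:
        async with sem:
            return await run_case(case, resources)

    # Every case runs against the SAME resource instances; contention is the point.
    return await asyncio.gather(*(bounded(c) for c in cases))
```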
Inject burst patterns explicitly. Uniform parallelism (50 agents at constant rate) catches some bugs but not all. The pathological case is co-launch: 50 agents starting in the same 100-millisecond window. Synchronized retries, cache stampedes, and connection-pool exhaustion need this initial impulse to manifest. Make co-launch a first-class scenario in the harness. Also include staggered ramps and slow drains — the failure modes differ.
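Co-launch is a few lines on top of the fan-out above; the sketch below holds every agent behind a gate and releases them into the same window (agents here are any async callables):

```python
import asyncio
import random

async def co_launch(agents, window_ms: float = 100.0):
    gate = asyncio.Event()

    async def held(agent):
        await gate.wait()
        # Jitter inside the window, rather than a perfect single instant,
        # mirrors how correlated real traffic actually arrives.
        await asyncio.sleep(random.uniform(0, window_ms / 1000))
        return await agent()

    tasks = [asyncio.create_task(held(a)) for a in agents]
    await asyncio.sleep(0)  # let every task park on the gate
    gate.set()              # release them all into the same 100ms window
    return await asyncio.gather(*tasks, return_exceptions=True)
```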
Mix tenants and adversarial probes. A multi-user harness that runs 50 copies of the same tenant misses the cross-tenant bugs. Allocate explicit tenant IDs across runs, then add probe cases whose job is to ask, "could I see another tenant's data?" The probe is a query phrased to maximize semantic similarity to a sensitive query from a different tenant, run after the first one populates the cache. If the probe ever returns the other tenant's response, you have a bug. This kind of red-team-during-eval is what cross-session leak detection actually requires.
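In harness terms the probe is a two-step test; the agent signature below is hypothetical and the queries are illustrative, but the assertion is the point:

```python
async def test_cross_tenant_cache_probe(agent) -> None:
    # Tenant A runs first and populates the shared semantic cache with its answer.
    answer_a = await agent(tenant="tenant_a",
                           query="Summarize the Acme Corp acquisition terms")
    # Tenant B then asks a question phrased to land as close as possible in
    # embedding space to A's query.
    answer_b = await agent(tenant="tenant_b",
                           query="Summarize the acquisition terms for Acme Corp")
    # Any verbatim reuse of A's output in B's response is a leak,
    # whatever the accuracy score says.
    assert answer_a not in answer_b, "cross-tenant cache bleed"
```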
Score on contention metrics, not just task success. The single-user harness reports task accuracy. The multi-user harness needs to report the same accuracy under load, plus contention signals: tool 429 rate, cache hit-but-wrong-tenant rate, scratchpad write conflicts, p99 connection-pool wait time. A regression where accuracy stays at 92% but tool 429 rate triples should fail the gate. The accuracy number, on its own, hides the regression.
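Concretely, the report the contention suite emits might look like the sketch below; the thresholds are illustrative and should come from your own SLOs:

```python
from dataclasses import dataclass

@dataclass
class ContentionReport:
    task_accuracy: float        # same metric the sequential suite reports
    tool_429_rate: float        # 429s per tool call under burst
    cross_tenant_hits: int      # cache hits that returned another tenant's entry
    scratchpad_conflicts: int   # writes that clobbered a concurrent run's state
    p99_pool_wait_ms: float     # time spent queued for a provider connection

def release_gate(report: ContentionReport) -> bool:
    # Accuracy alone does not pass the gate; any contention signal can block a release.
    return (report.task_accuracy >= 0.90
            and report.tool_429_rate <= 0.02
            and report.cross_tenant_hits == 0
            and report.scratchpad_conflicts == 0
            and report.p99_pool_wait_ms <= 500)
```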
Adversarial Concurrency Patterns to Steal
You do not need to invent the adversarial scenarios from scratch. The same property-based and chaos-engineering toolkits that distributed systems engineers have used for two decades apply directly. A short list of patterns worth porting into your harness:
- Burst co-launch. N identical agents starting within a 100ms window, sharing one tool quota. Asserts: no agent fails; tail latency stays bounded; 429 retry storms do not amplify.
- Interleaved tool schedules. Two agents whose plans require the same tool in inverted order. Asserts: no deadlock when the tool serializes; both agents complete.
- Tenant probe pairs. Tenant A populates the cache with a sensitive output. Tenant B issues a similar query. Asserts: B never sees A's cached output verbatim. Run thousands of these with mutated phrasings.
- Long-stream + burst-short. One agent holds a 30-second streaming completion; 20 short-completion agents try to start. Asserts: short agents are not starved by the long one's pool occupancy.
- Crash mid-write. Kill an agent process holding a scratchpad lock. Asserts: subsequent agents see consistent state, not a half-written blob, and the lock TTL releases.
- Out-of-order messages. When using a queue, deliver tool results to an agent in an order different from the one in which the calls were issued. Asserts: the agent rejects the stale result instead of treating it as the latest.
Each of these is one to two days of work to add. Each closes off a category of production incident that no amount of single-user accuracy improvement would have prevented.
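As one example of how small these tests are, the out-of-order pattern reduces to a sequence-number check; the state shape below is a hypothetical minimal agent state:

```python
def apply_tool_result(state: dict, result: dict) -> bool:
    # Each tool call is issued with a monotonically increasing sequence number.
    # A result older than the most recently issued call is stale and must be
    # dropped, not merged into the agent's context as the latest observation.
    if result["seq"] < state["latest_issued_seq"]:
        return False
    state["observations"].append(result)
    return True

def test_out_of_order_delivery() -> None:
    state = {"latest_issued_seq": 2, "observations": []}
    assert apply_tool_result(state, {"seq": 2, "output": "fresh"})
    assert not apply_tool_result(state, {"seq": 1, "output": "stale"})
    assert state["observations"] == [{"seq": 2, "output": "fresh"}]
```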
What Changes in Your CI Pipeline
A multi-user harness is more expensive to run than a sequential one, and the additional cost shows up in two places: real provider quota consumed during the burst tests, and longer wall-clock time for the suite. Both are worth paying for, but they need to be budgeted explicitly.
Run the sequential accuracy suite on every PR. Run the contention suite on a merge gate or nightly schedule, not on every push, so the cost stays bounded. Treat the contention metrics as separate SLIs from accuracy — you want a dashboard that shows accuracy, 429 rate under burst, cross-tenant leak rate, and pool wait time as four separate trend lines. A regression in any one is a release blocker.
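If your suites live in pytest, one low-friction way to keep the split is a marker, so the PR gate and the nightly job select different subsets (the marker name is arbitrary):

```python
# conftest.py: register the marker once so both suites live in the same repo.
def pytest_configure(config):
    config.addinivalue_line("markers", "contention: multi-user contention scenarios")

# test_contention.py
import pytest

@pytest.mark.contention
def test_burst_co_launch_does_not_amplify_429s():
    ...  # co-launch scenario built from the harness sketches above

# PR gate:            pytest -m "not contention"   (fast sequential accuracy suite)
# Merge gate/nightly: pytest -m "contention"       (quota-hungry contention suite)
```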
The contention suite also wants its own staging environment, ideally one that mirrors production's connection limits, cache topology, and tool quotas. A test pool of 10 connections will never reproduce a starvation bug that needs 50. If your real connection cap is 100, your harness needs to be capable of saturating something close to that — which usually means the harness lives outside your dev laptop.
The Hidden Win
There is a second-order benefit to running adversarial concurrency in eval. It forces conversations across team boundaries that would otherwise stay deferred. Who owns the tenant key on the cache? What is the retry policy when the tool returns 429? What is the SLA for scratchpad consistency under crash? Sequential evals let everyone postpone these answers. Concurrent evals demand them, because the failure modes are concrete and reproducible at test time instead of vague and rare in production.
Treat your eval harness as a model of your production environment. If the model is missing concurrency, the eval is measuring something else — a clean, charitable, single-user version of a system that does not exist outside your laptop. Build the harness that matches the workload. The bugs you find will be the bugs you would have shipped.
- https://machinelearningmastery.com/handling-race-conditions-in-multi-agent-orchestration/
- https://www.tamirdresher.com/blog/2026/03/21/rate-limiting-multi-agent
- https://www.giskard.ai/knowledge/cross-session-leak-when-your-ai-assistant-becomes-a-data-breach
- https://arxiv.org/html/2509.17360v1
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
- https://blog.langchain.com/agent-evaluation-readiness-checklist/
- https://atlan.com/know/agent-harness-failures-anti-patterns/
- https://agentgateway.dev/blog/2025-11-02-rate-limit-quota-llm/
