Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination
Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.
What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.
This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflated metrics. The cause is treating evaluation infrastructure like production infrastructure: optimized for sharing and efficiency rather than isolation and fidelity.
Three Ways Shared Eval Infrastructure Poisons Your Results
Channel 1: Cached Completion Leakage
Modern LLM serving systems cache the key-value (KV) states of common prefixes to reduce inference latency. This is correct behavior in production. In shared eval infrastructure, it becomes a liability.
When Team A runs an evaluation, their prompts get cached. When Team B's evaluation follows on the same serving instance, it may silently hit cached states for prefix-overlapping prompts. Depending on your caching strategy, Team B's results are now partially determined by Team A's eval run — not by the model's fresh inference.
This gets worse with timing. Research on KV-cache sharing across tenants shows that cache hit-or-miss patterns alone can be exploited to reconstruct other teams' prompts with high accuracy. Even without adversarial intent, overlapping prompt prefixes across teams (common system prompt templates, shared instruction formats, reused few-shot examples) will cause state to bleed silently through the cache.
The practical consequence: you cannot trust absolute scores generated on shared serving infrastructure without knowing what cache state existed when the evaluation ran. A metric that looks like a 3% improvement might be a 1.5% improvement and a 1.5% cache artifact.
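The bleed mechanism is easy to see in miniature. The toy class below simulates a prefix cache shared across eval tenants (all names here are hypothetical; real serving stacks implement this at the KV-block level, not as a string-keyed dict), showing how two teams that share a system-prompt template silently share cached state:

```python
class SharedPrefixCache:
    """Toy model of a prefix cache shared across eval tenants.
    Hypothetical illustration, not a real serving-stack API."""

    def __init__(self, prefix_len: int = 16):
        self.prefix_len = prefix_len
        self._cache = {}  # prompt prefix -> cached "state"

    def lookup(self, prompt: str):
        # A hit means inference reuses state created by whoever ran first.
        return self._cache.get(prompt[: self.prefix_len])

    def insert(self, prompt: str, state: str):
        self._cache[prompt[: self.prefix_len]] = state


cache = SharedPrefixCache()

# Team A's eval run populates the cache via its system-prompt template.
team_a_prompt = "You are a helpful assistant. Rate this answer: ..."
cache.insert(team_a_prompt, state="warm-state-from-team-A")

# Team B uses the same shared template with a different task suffix...
team_b_prompt = "You are a helpful assistant. Classify this text: ..."
hit = cache.lookup(team_b_prompt)
print(hit)  # "warm-state-from-team-A": Team B silently reuses Team A's state
```

The fix in real systems is per-run cache isolation (or disabling prefix caching entirely for eval traffic), at the cost of losing the warm-cache latency benefit.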
Channel 2: Sequential Run Pollution
The second contamination channel is subtler. It doesn't require caching — it requires time.
When eval runs execute sequentially on shared infrastructure, artifacts from one run persist into the next: leftover files, accumulated memory state, warm-started model processes carrying residual internal state, and file system entries that affect subsequent reads. These artifacts are usually invisible in the eval results themselves, but they undermine reproducibility.
There's also the determinism illusion. Teams often assume that setting a fixed random seed guarantees reproducible outputs across runs. It doesn't. At the sampling layer, operations like torch.multinomial are inherently nondeterministic under different batch sizes — the GPU scheduling order changes the effective randomness. Research shows that simply changing the number of GPUs or the batch size during evaluation can shift accuracy by as much as 9%, even with the seed held constant.
This means that two teams running the "same" evaluation with "the same" seed on different days, with different queue depths and batch configurations, will get systematically different numbers. Neither result is wrong per se — but they are not comparable, and treating them as comparable is where the contamination enters.
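One low-cost mitigation is to record the execution context alongside every score and refuse to compare scores across mismatched contexts. A minimal sketch, with hypothetical names (`RunConfig`, `comparable`) standing in for whatever your harness uses:

```python
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class RunConfig:
    """Execution context that affects sampling behavior.
    Two eval scores are comparable only if ALL of these match."""
    seed: int
    batch_size: int
    num_gpus: int
    model_checkpoint: str


def comparable(a: RunConfig, b: RunConfig) -> bool:
    # An identical seed is necessary but not sufficient: batch size and
    # GPU count change reduction order and the effective randomness.
    return asdict(a) == asdict(b)


monday = RunConfig(seed=0, batch_size=32, num_gpus=8, model_checkpoint="ckpt-1234")
friday = RunConfig(seed=0, batch_size=16, num_gpus=4, model_checkpoint="ckpt-1234")
print(comparable(monday, friday))  # False: same seed, yet not the "same" eval
```

A harness that raises on incomparable configs turns a silent systematic error into a loud one.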
Channel 3: Prompt-State Bleedover
The third channel is organizational, not technical. It happens when the shared resources feeding your eval harness — prompt libraries, dataset registries, example stores — change between runs without explicit versioning.
The red team scenario from the opening is the dramatic version. Less dramatic versions are everywhere:
- A developer adds adversarial examples to the "training examples" section of a shared prompt template while debugging, then forgets to revert.
- A baseline prompt gets iteratively improved over several weeks by different team members, none of whom leave a paper trail.
- A shared dataset registry has a new version silently deployed because someone fixed a labeling error — which also happens to remove examples where the model was weak.
In each case, the eval harness is evaluating something slightly different from what it was last week, but the version label hasn't changed. The metrics improve. The contamination is invisible.
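Content-addressing is the standard defense: hash the shared resources at run start, so a silent edit produces a visibly different version even when nobody bumps a label. A minimal sketch (the `fingerprint` helper and the prompt contents are illustrative, not from any particular framework):

```python
import hashlib
import json


def fingerprint(resources: dict) -> str:
    """Content-address shared eval inputs (prompts, few-shot examples,
    dataset rows) so any edit changes the recorded version."""
    blob = json.dumps(resources, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()[:12]


v1 = fingerprint({"baseline_prompt": "Rate the answer from 1-5.", "examples": ["..."]})
# A teammate "improves" the prompt without bumping any version label.
v2 = fingerprint({"baseline_prompt": "Rate the answer from 1-5. Be strict.", "examples": ["..."]})
print(v1 == v2)  # False: the content hash surfaces the silent change
```

Stored next to the metrics, these hashes make week-over-week comparisons auditable: if the fingerprint moved, the eval moved.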
Why "Fixed Seeds" and "Isolated Namespaces" Aren't Enough
Teams that recognize this problem often implement two partial mitigations: deterministic seeds and infrastructure isolation (separate Kubernetes namespaces, separate queues). Neither is sufficient on its own.
Fixed seeds fail because determinism is only guaranteed at the level of a single process, single batch, single GPU configuration. Change any of those — which happens constantly in shared infrastructure as load varies — and your seed produces different samples. Temperature zero doesn't help either: even with no sampling randomness, model outputs depend on floating point accumulation order, which varies with hardware and batch size. The "deterministic" run is reproducible only in the narrow sense of same binary, same hardware, same batch size, same day.
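The floating point part needs no GPU to demonstrate: addition is not associative, so summing the same values in a different order (as happens when a changed batch size reorders reductions) gives bitwise-different results even with zero sampling randomness:

```python
# Floating-point addition is not associative. The same three values summed
# in forward vs. reverse order disagree in the last bits -- the same effect
# batch-size changes have on GPU reduction order, writ small.
values = [0.1, 0.2, 0.3]

forward = 0.0
for x in values:
    forward += x  # 0.1 + 0.2 + 0.3

backward = 0.0
for x in reversed(values):
    backward += x  # 0.3 + 0.2 + 0.1

print(forward == backward)  # False: 0.6000000000000001 vs 0.6
```

Scaled up to millions of accumulations inside a matmul, these last-bit disagreements can flip argmax decisions at logit ties, which is why even greedy decoding is not batch-size-invariant.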
Isolated namespaces fail because they isolate compute, not data. If Team A and Team B share a prompt registry, a dataset store, or even a common evaluation framework configuration, namespace isolation still allows prompt-state contamination to flow through those shared data layers.
What's required is isolation at every layer where state can be shared: compute, cache, and data.
The Isolation Primitives That Actually Work
Hermetic Eval Environments
A hermetic eval environment runs with zero shared external state. In practice, this means:
- Fresh container per eval run: each evaluation starts from a known-good image with no carryover state from previous runs. This means no warm-up caching benefit, but it also means no contamination.
- In-memory mock services for anything that would otherwise hit a shared database or model registry during the run.
- Explicit environment snapshots: the exact configuration — model checkpoint, prompt versions, dataset hashes — is captured at run start and stored with the results.
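The snapshot bullet above can be sketched concretely. This is one plausible schema, not a standard format; the field names (`model_checkpoint`, `prompt_hashes`, `dataset_hash`) are hypothetical:

```python
import hashlib
import json
import time


def snapshot_environment(model_checkpoint: str, prompt_files: dict, dataset_hash: str) -> dict:
    """Capture the exact eval configuration at run start so results can be
    diffed against any later run. Hypothetical schema for illustration."""
    return {
        "model_checkpoint": model_checkpoint,
        "prompt_hashes": {
            name: hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
            for name, text in prompt_files.items()
        },
        "dataset_hash": dataset_hash,
        "captured_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }


snap = snapshot_environment(
    model_checkpoint="ckpt-1234",
    prompt_files={"system": "You are a grader.", "fewshot": "Q: ... A: ..."},
    dataset_hash="sha256:ab12...",
)
# Store the snapshot next to the metrics, not in a separate system.
print(json.dumps(snap, indent=2))
```

When two runs disagree, diffing their snapshots answers "what changed?" in seconds instead of a week of archaeology.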
References
- https://github.com/LiveBench/LiveBench
- https://arxiv.org/abs/2603.10726
- https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- https://aclanthology.org/2025.eval4nlp-1.12.pdf
- https://docs.vllm.ai/en/latest/usage/reproducibility/
- https://testdriver.ai/articles/understanding-hermetic-testing-a-comprehensive-guide
- https://arxiv.org/abs/2403.04960
- https://arxiv.org/html/2406.04244v1
- https://www.lmsys.org/blog/2024-03-01-policy/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
