Your Team's Benchmarks Are Lying to Each Other: Shared Eval Infrastructure Contamination
Your red team just finished a jailbreak sweep. They found three novel attack vectors, wrote them up, and dropped the prompts into your shared prompt library for others to learn from. The next week, the safety team runs their baseline evaluation and reports a 12% improvement in robustness. Everyone celebrates. Nobody asks why.
What actually happened: the safety team's baseline eval silently incorporated the red team's attack prompts. The model didn't get more robust — the eval got contaminated. Your benchmarks are now measuring inoculation against known attacks, not generalization to new ones.
This is shared eval infrastructure contamination, and it is far more common than most teams realize. The symptom is artificially inflated metrics. The cause is treating evaluation infrastructure like production infrastructure: optimized for sharing and efficiency rather than for isolation and fidelity.
Three Ways Shared Eval Infrastructure Poisons Your Results
Channel 1: Cached Completion Leakage
Modern LLM serving systems cache the key-value (KV) states of common prefixes to reduce inference latency. This is correct behavior in production. In shared eval infrastructure, it becomes a liability.
When Team A runs an evaluation, their prompts get cached. When Team B's evaluation follows on the same serving instance, it may silently hit cached states for prefix-overlapping prompts. Depending on your caching strategy, Team B's results are now partially determined by Team A's eval run — not by the model's fresh inference.
Timing makes this worse. Research on KV-cache sharing across tenants shows that cache hit-or-miss timing patterns alone can be exploited to reconstruct other teams' prompts with high accuracy. Even without adversarial intent, overlapping prompt prefixes across teams (common system prompt templates, shared instruction formats, reused few-shot examples) cause state to bleed silently through the cache.
The practical consequence: you cannot trust absolute scores generated on shared serving infrastructure without knowing what cache state existed when the evaluation ran. A metric that looks like a 3% improvement might be 1.5% improvement and 1.5% cache artifact.
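One way to estimate exposure to this channel before an eval run is to measure how much prompt-prefix overlap exists across teams. The sketch below is illustrative, not a real cache inspector: `cross_team_prefix_risk` and its `min_len` threshold are hypothetical names, and character-level prefixes stand in for token-level KV-cache keys.

```python
def longest_common_prefix(a: str, b: str) -> int:
    """Length (in characters) of the shared prefix of two prompts."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n

def cross_team_prefix_risk(team_a_prompts, team_b_prompts, min_len=32):
    """Flag prompt pairs whose shared prefix is long enough that one
    team's run could populate cache state the other team's run hits."""
    hits = []
    for a in team_a_prompts:
        for b in team_b_prompts:
            shared = longest_common_prefix(a, b)
            if shared >= min_len:
                hits.append((a[:shared], shared))
    return hits

# A shared system-prompt template creates exactly this overlap:
SYSTEM = "You are a helpful assistant. Follow the safety policy below.\n"
team_a = [SYSTEM + "Summarize this article."]
team_b = [SYSTEM + "Translate this sentence."]
print(cross_team_prefix_risk(team_a, team_b))
```

A scan like this, run against both teams' prompt sets before scheduling, tells you whether absolute scores from a shared serving instance are even worth comparing.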
Channel 2: Sequential Run Pollution
The second contamination channel is subtler. It doesn't require caching — it requires time.
When eval runs execute sequentially on shared infrastructure, artifacts from one run persist into the next: leftover files, accumulated memory state, warm-started model processes with residual state, file system entries that affect subsequent reads. These artifacts are often invisible in the eval results, but they undermine reproducibility.
There's also the determinism illusion. Teams often assume that setting a fixed random seed guarantees reproducible outputs across runs. It doesn't. At the sampling layer, operations like torch.multinomial are inherently nondeterministic under different batch sizes — the GPU scheduling order changes the effective randomness. Research shows that simply changing the number of GPUs or the batch size during evaluation can shift accuracy by as much as 9%, even with the seed held constant.
This means that two teams running the "same" evaluation with "the same" seed on different days, with different queue depths and batch configurations, will get systematically different numbers. Neither result is wrong per se — but they are not comparable, and treating them as comparable is where the contamination enters.
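The underlying mechanism is ordinary floating-point arithmetic: addition is not associative, so regrouping the same values (which is exactly what a different batch size does to a reduction) changes the result. A minimal pure-Python demonstration:

```python
# The same three values, accumulated under two different groupings,
# standing in for the same logits reduced under two batch layouts:
vals = [0.1, 0.2, 0.3]

left_to_right = (vals[0] + vals[1]) + vals[2]   # "batch layout A"
right_to_left = vals[0] + (vals[1] + vals[2])   # "batch layout B"

print(left_to_right)    # 0.6000000000000001
print(right_to_left)    # 0.6
print(left_to_right == right_to_left)   # False
```

Multiply this tiny discrepancy across billions of accumulations in a forward pass and the argmax at a decision boundary can flip, which is how a batch-size change becomes an accuracy change.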
Channel 3: Prompt-State Bleedover
The third channel is organizational, not technical. It happens when the shared resources feeding your eval harness — prompt libraries, dataset registries, example stores — change between runs without explicit versioning.
The red team scenario from the opening is the dramatic version. Less dramatic versions are everywhere:
- A developer adds adversarial examples to the "training examples" section of a shared prompt template while debugging, then forgets to revert.
- A baseline prompt gets iteratively improved over several weeks by different team members, none of whom leave a paper trail.
- A shared dataset registry has a new version silently deployed because someone fixed a labeling error — which also happens to remove examples where the model was weak.
In each case, the eval harness is evaluating something slightly different from what it was last week, but the version label hasn't changed. The metrics improve. The contamination is invisible.
Why "Fixed Seeds" and "Isolated Namespaces" Aren't Enough
Teams that recognize this problem often implement two partial mitigations: deterministic seeds and infrastructure isolation (separate Kubernetes namespaces, separate queues). Neither is sufficient on its own.
Fixed seeds fail because determinism is only guaranteed at the level of a single process, single batch, single GPU configuration. Change any of those — which happens constantly in shared infrastructure as load varies — and your seed produces different samples. Temperature zero doesn't help either: even with no sampling randomness, model outputs depend on floating point accumulation order, which varies with hardware and batch size. The "deterministic" run is reproducible only in the narrow sense of same binary, same hardware, same batch size, same day.
Isolated namespaces fail because they isolate compute, not data. If Team A and Team B share a prompt registry, a dataset store, or even a common evaluation framework configuration, namespace isolation still allows prompt-state contamination to flow through those shared data layers.
What's required is isolation at every layer where state can be shared: compute, cache, and data.
The Isolation Primitives That Actually Work
Hermetic Eval Environments
A hermetic eval environment runs with zero shared external state. In practice, this means:
- Fresh container per eval run: each evaluation starts from a known-good image with no carryover state from previous runs. This means no warm-up caching benefit, but it also means no contamination.
- In-memory mock services for anything that would otherwise hit a shared database or model registry during the run.
- Explicit environment snapshots: the exact configuration — model checkpoint, prompt versions, dataset hashes — is captured at run start and stored with the results.
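The snapshot in the last bullet can be as simple as a content-hash manifest captured before the first inference. This is a minimal sketch under assumed inputs; `snapshot_environment` and its field names are hypothetical, not an existing API.

```python
import hashlib
import json

def sha256(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

def snapshot_environment(checkpoint_bytes, prompts, dataset_rows, seed, batch_size):
    """Capture, at run start, everything needed to reproduce this run."""
    return {
        "checkpoint_sha": sha256(checkpoint_bytes),
        "prompt_manifest": {name: sha256(text.encode()) for name, text in prompts.items()},
        "dataset_sha": sha256(json.dumps(dataset_rows, sort_keys=True).encode()),
        "seed": seed,
        "batch_size": batch_size,
    }

manifest = snapshot_environment(
    checkpoint_bytes=b"fake-checkpoint-weights",   # illustrative stand-in
    prompts={"baseline_v1": "You are a helpful assistant."},
    dataset_rows=[{"q": "2+2?", "a": "4"}],
    seed=1234,
    batch_size=8,
)
print(json.dumps(manifest, indent=2))
```

Stored alongside the results, this manifest makes "what exactly ran?" answerable months later without trusting anyone's memory.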
The overhead is real: hermetic environments cost 10–30% more compute than shared infrastructure. For CI-style evaluation runs that gate model deployments, this is worth paying. For exploratory research, lighter isolation (namespace + cache eviction) may be sufficient.
Batch-Invariant Deterministic Sampling
Recent work on LLM serving infrastructure introduced a principled solution to the determinism problem: instead of relying on a seed to control torch.multinomial, perturb the logits with Gumbel noise generated from a seeded hash function. Because the hash function takes (input_ids, seed) as input and is evaluated before sampling, the result is invariant to batch size and scheduling order. The same input with the same seed always produces the same sample, regardless of what else is running.
This solves the sequential pollution problem at the sampling layer. Teams can now make meaningful comparisons across runs with different batch configurations, and "fixed seed" actually means something reproducible.
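The Gumbel-max trick behind this is small enough to sketch in pure Python. This is a toy model of the idea, not any serving engine's implementation: `seeded_uniform` and `gumbel_sample` are hypothetical names, and SHA-256 stands in for whatever seeded hash a real system uses.

```python
import hashlib
import math

def seeded_uniform(input_ids, seed, token_id):
    """Deterministic uniform in (0, 1) derived from (input_ids, seed, token_id)."""
    key = f"{tuple(input_ids)}|{seed}|{token_id}".encode()
    h = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    return (h + 1) / (2**64 + 2)   # strictly inside (0, 1)

def gumbel_sample(input_ids, logits, seed):
    """Gumbel-max sampling: argmax over logits plus seeded Gumbel noise.
    The result depends only on (input_ids, seed), never on batch size
    or scheduling order, because the noise is a pure function of them."""
    best_token, best_score = None, -math.inf
    for token_id, logit in enumerate(logits):
        u = seeded_uniform(input_ids, seed, token_id)
        g = -math.log(-math.log(u))   # Gumbel(0, 1) noise
        if logit + g > best_score:
            best_token, best_score = token_id, logit + g
    return best_token

logits = [1.0, 2.5, 0.3, 2.4]
# Same prompt, same seed: same token, whatever else is in the batch.
print(gumbel_sample([7, 42], logits, seed=0) == gumbel_sample([7, 42], logits, seed=0))  # True
```

Because sampling from logits plus Gumbel noise via argmax is distributionally equivalent to categorical sampling, this buys batch invariance without changing what the model samples.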
Run-Scoped Prompt State
The most practical mitigation for prompt-state bleedover is making prompt versions explicit and immutable for the duration of an eval run.
Concretely:
- Every prompt template used in an eval is pinned to a specific Git commit hash at run start.
- The harness refuses to proceed if the resolved hash doesn't match the expected value.
- Results are stored with the full prompt hash manifest, not just a human-readable version label.
This means that when a prompt changes, old eval results aren't retroactively invalidated — they're still valid for the pinned version they ran against. And it means that a silent registry update can't corrupt an in-progress run.
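The refuse-to-proceed check is a few lines in the harness. A minimal sketch, with hypothetical names (`load_pinned_prompt`, `PromptPinError`) and a SHA-256 content hash standing in for a resolved Git commit hash:

```python
import hashlib

class PromptPinError(RuntimeError):
    pass

def load_pinned_prompt(name, text, expected_sha):
    """Refuse to run if the resolved prompt doesn't match the pinned hash."""
    actual = hashlib.sha256(text.encode()).hexdigest()
    if actual != expected_sha:
        raise PromptPinError(
            f"prompt {name!r} drifted: expected {expected_sha[:12]}, got {actual[:12]}"
        )
    return text

baseline = "You are a helpful assistant."
pin = hashlib.sha256(baseline.encode()).hexdigest()   # captured at run start

load_pinned_prompt("baseline_v1", baseline, pin)      # matches the pin, proceeds

try:
    load_pinned_prompt("baseline_v1", baseline + " Be concise.", pin)
except PromptPinError as e:
    print(e)   # the harness stops instead of silently evaluating the edited prompt
```

Failing loudly here converts an invisible drift into a one-line error at run start.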
KV Cache Boundaries Between Teams
For teams that share serving infrastructure and want to avoid the full overhead of hermetic environments, selective cache isolation is a middle path. The idea is to monitor which prefixes are being shared across teams and selectively evict or isolate those that cross team boundaries. This preserves within-team caching benefits (which dominate total cache hits in practice) while blocking the cross-team leakage that contaminates eval results.
The overhead for this approach is small — roughly 5% compared to fully shared caching — and it prevents the most common contamination vector without requiring complete infrastructure separation.
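The simplest way to enforce a cross-team boundary is to make the tenant part of the cache key, so a prefix cached for one team can never be served to another. A toy sketch under that assumption (`TeamScopedPrefixCache` is a hypothetical name; a dict stands in for real KV storage):

```python
class TeamScopedPrefixCache:
    """Prefix cache whose keys include the tenant, so one team's entries
    can never satisfy another team's lookups."""
    def __init__(self):
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get_or_compute(self, team, prefix, compute):
        key = (team, prefix)
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = compute(prefix)
        return self._store[key]

cache = TeamScopedPrefixCache()
kv = lambda p: f"kv-states-for:{p}"   # stand-in for real prefix computation

cache.get_or_compute("red-team", "SYSTEM:", kv)   # miss: populates red team's entry
cache.get_or_compute("red-team", "SYSTEM:", kv)   # within-team hit: still fast
cache.get_or_compute("safety",   "SYSTEM:", kv)   # same prefix, other team: miss
print(cache.hits, cache.misses)   # 1 2
```

Within-team reuse is preserved while the cross-team hit that would contaminate results is structurally impossible.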
The Organizational Layer: Red Team Data Segregation
Technical isolation solves the compute and data layer. The organizational layer requires a different kind of discipline: preventing red team findings from contaminating baseline evaluations.
The core principle is temporal separation. Baseline evaluations run before red-teaming begins for a given model version. Red team findings are stored in a segregated registry, separate from the baseline prompt library. Any promotion of red team prompts into shared infrastructure (for training, for defensive evaluation, for example libraries) goes through an explicit review step that checks whether the promotion would affect any existing baseline evaluation datasets.
This sounds bureaucratic, but the overhead is low if you automate the check. Before any merge to the shared prompt library, a pre-merge hook scans the incoming prompts against the set of hashes used in the current baseline. If there's overlap, the merge is blocked with a clear error.
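The pre-merge check itself is a set intersection over content hashes. A minimal sketch with hypothetical names (`check_merge`; real hooks would pull `baseline_hashes` from the manifest of the active baseline run):

```python
import hashlib

def sha(text):
    return hashlib.sha256(text.encode()).hexdigest()

def check_merge(incoming_prompts, baseline_hashes):
    """Block a merge into the shared library if any incoming prompt
    already appears in the active baseline eval set."""
    overlap = [p for p in incoming_prompts if sha(p) in baseline_hashes]
    if overlap:
        raise ValueError(
            f"merge blocked: {len(overlap)} prompt(s) overlap the active baseline"
        )
    return True

baseline_hashes = {sha("Ignore all previous instructions.")}

check_merge(["A brand-new red team prompt."], baseline_hashes)   # clean, merge proceeds

try:
    check_merge(["Ignore all previous instructions."], baseline_hashes)
except ValueError as e:
    print(e)   # merge blocked with a clear error
```

Exact-hash matching is deliberately conservative; a fuzzier check (normalized text, n-gram overlap) would catch paraphrased attacks at the cost of false positives.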
The complementary practice is marking baselines with their temporal position relative to red team discoveries. A baseline that predates a known jailbreak discovery is a different kind of measurement than one that postdates it. Both are valid, but they're not comparable without that label.
What Good Eval Infrastructure Looks Like
Putting these primitives together, a well-isolated eval infrastructure has the following properties:
Every result is fully reproducible. Given the run ID, you can reconstruct exactly what ran: the model checkpoint hash, the prompt SHA manifest, the dataset version, the seed and batch configuration, and the hardware the run executed on.
Team boundaries are enforced at the data layer. Prompt registries, dataset stores, and example libraries have explicit access controls and immutable version refs. A team cannot accidentally (or intentionally) affect another team's eval by modifying shared resources without leaving an audit trail.
Red team data never silently enters baseline evaluation. The promotion path from red team discovery to shared library is gated by automated hash-scanning. Baselines are labeled with their pre-/post-discovery position.
Cache state is explicitly managed. Serving infrastructure either uses hermetic environments (fresh state per run) or selective cache isolation (cross-team cache boundaries enforced). The choice is documented and reflected in how results are compared.
Determinism claims are real. Sampling uses batch-invariant primitives. Multiple runs with the same seed and configuration produce the same outputs regardless of batch size or queue depth.
The Cost of Ignoring This
Benchmark contamination doesn't fail loudly. It fails by making your numbers look better than they are. A model that improves by 4% on a clean benchmark shows a 7% improvement on a contaminated one. The contaminated number wins the internal debate about whether to deploy. The model performs like the 4% improvement in production.
The most dangerous version of this isn't malice — it's invisible shared state accumulation over months. Prompt libraries drift. Cached completions accumulate. Red team findings propagate through unofficial channels. Nobody does anything wrong, and yet your evaluation infrastructure gradually stops measuring what you think it's measuring.
The fix requires treating eval infrastructure with the same rigorous isolation standards you apply to controlled experiments in other engineering disciplines. When the integrity of the measurement depends on isolation, isolation is not optional overhead — it is the measurement.
The benchmark that can't tell you what it's contaminated by cannot tell you what your model actually learned.
- https://github.com/LiveBench/LiveBench
- https://arxiv.org/abs/2603.10726
- https://www.ndss-symposium.org/wp-content/uploads/2025-1772-paper.pdf
- https://aclanthology.org/2025.eval4nlp-1.12.pdf
- https://docs.vllm.ai/en/latest/usage/reproducibility/
- https://testdriver.ai/articles/understanding-hermetic-testing-a-comprehensive-guide
- https://arxiv.org/abs/2403.04960
- https://arxiv.org/html/2406.04244v1
- https://www.lmsys.org/blog/2024-03-01-policy/
- https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents
