The Composition Testing Gap: Why Your Agents Pass Every Test but Fail Together
Your planner agent passes its eval suite at 94%. Your researcher agent scores even higher. Your synthesizer agent nails every benchmark you throw at it. You compose them into a pipeline, deploy to production, and watch it produce confidently wrong answers that no individual agent would ever generate on its own.
This is the composition testing gap — the systematic blind spot where individually validated agents fail in ways that no single-agent analysis can predict. Research on multi-agent LLM systems shows that 67% of production failures stem from inter-agent interactions rather than individual agent defects. You're testing the atoms but shipping the molecule, and molecular behavior is not the sum of atomic properties.
The Combinatorial Explosion Nobody Budgets For
The math works against you fast. Two agents interacting require testing one communication path. Five agents create ten potential interaction paths. Ten agents create forty-five. But it's worse than simple combinatorics because each interaction path carries non-deterministic behavior — the same input can trigger different agent choreography depending on context, prior conversation state, and the particular way the upstream agent chose to phrase its output.
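The path counts above follow from the number of unordered agent pairs, n(n-1)/2. A minimal sketch of that arithmetic, assuming every pair of agents can potentially interact (the function name `pairwise_paths` is illustrative, not from any framework):

```python
from math import comb


def pairwise_paths(n_agents: int) -> int:
    """Count unordered agent pairs, i.e. n * (n - 1) / 2."""
    return comb(n_agents, 2)


for n in (2, 5, 10):
    print(f"{n} agents -> {pairwise_paths(n)} interaction paths")
```

This counts only pairs. It does not capture the non-determinism within each path, which is what makes the real test surface larger still.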
Recent research from the MAESTRO evaluation framework quantified this: multi-agent systems show approximately 35% variance in execution order across identical runs, even when the structural interaction pattern (which agents talk to which) remains stable with 86% consistency. The agents call each other in the same pattern, but the sequence shifts enough to produce meaningfully different outcomes.
This means your integration test that passed on Tuesday might fail on Wednesday with identical inputs, not because anything changed but because the system explored a different path through its own state space.
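One practical consequence is that a single passing run says little. A hedged sketch of treating an integration test as a sampled pass rate rather than a single verdict follows; `flaky_pipeline` is a hypothetical stand-in that succeeds roughly 90% of the time, not a real agent pipeline:

```python
import random
from typing import Callable


def pass_rate(run_once: Callable[[random.Random], bool], trials: int, seed: int = 0) -> float:
    """Run the same scenario many times and return the fraction that passed."""
    rng = random.Random(seed)
    passes = sum(1 for _ in range(trials) if run_once(rng))
    return passes / trials


def flaky_pipeline(rng: random.Random) -> bool:
    # Hypothetical stand-in: identical input, stochastic outcome.
    return rng.random() < 0.9


rate = pass_rate(flaky_pipeline, trials=1000)
print(f"observed pass rate: {rate:.3f}")
```

In a real harness, `run_once` would invoke the composed pipeline on a fixed input, and the test would assert on a pass-rate threshold rather than on one run.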
The Three Composition Failures That Unit Tests Cannot Catch
Individual agent evals test capability in isolation. Composition failures emerge from the spaces between agents. They cluster into three categories that require fundamentally different testing approaches.
Information Loss at Handoff Boundaries
When Agent A passes results to Agent B, something always gets lost. Agent A's internal reasoning — why it chose this particular framing, what alternatives it considered, what caveats it silently dropped — doesn't transfer. Agent B receives a flattened text artifact and builds its own mental model from scratch.
In multi-agent code generation systems, researchers found that semantically equivalent inputs cause performance drops of 7.9% to 83.3% depending on how the planner agent chose to phrase its task decomposition. The coder agent was fully capable. The planner agent was fully capable. But the translation layer between them introduced fragility that neither agent's individual eval would surface.
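One way to probe this kind of handoff fragility is a paraphrase check: feed the downstream agent several semantically equivalent phrasings and measure how often its result agrees. The sketch below is illustrative only. `brittle_coder` is a hypothetical stand-in for a downstream agent, and exact string equality is an assumed (and crude) notion of agreement:

```python
from typing import Callable, Sequence


def handoff_consistency(downstream: Callable[[str], str], paraphrases: Sequence[str]) -> float:
    """Fraction of paraphrases whose downstream result matches the most common result."""
    results = [downstream(p) for p in paraphrases]
    most_common = max(set(results), key=results.count)
    return results.count(most_common) / len(results)


def brittle_coder(task: str) -> str:
    # Hypothetical stand-in: reacts to one keyword instead of the meaning.
    return "sort_asc" if "ascending" in task else "unknown"


paraphrases = [
    "Sort the list in ascending order.",
    "Order the list from smallest to largest.",
    "Arrange the items in ascending order.",
]
score = handoff_consistency(brittle_coder, paraphrases)
print(f"consistency across paraphrases: {score:.2f}")
```

A score well below 1.0 on equivalent phrasings points at the boundary between the agents rather than at either agent's capability.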
Conflicting Implicit Assumptions
Each agent develops implicit assumptions about the world based on its system prompt, its tool access, and its training. When agents compose, these assumptions collide silently.
A researcher agent might assume that "recent" means "last 30 days" while the synthesizer agent interprets "recent" as "mentioned in the provided context." Neither agent is wrong in isolation. But when the researcher retrieves information and labels it "recent findings," the synthesizer may treat outdated cached results as current simply because they appeared in its input window.
These assumption mismatches are invisible to capability-based evals because each agent handles the concept correctly within its own frame of reference. The failure only materializes at the composition boundary.
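One possible mitigation is to make the assumption explicit in the handoff itself, so that "recent" travels as a timestamp rather than as a word. A minimal sketch, assuming a hypothetical `Finding` record and a 30-day window taken from the example above:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class Finding:
    text: str
    published_at: datetime  # explicit timestamp instead of the label "recent"


def is_recent(finding: Finding, now: datetime, window: timedelta = timedelta(days=30)) -> bool:
    """True when the finding was published within the window ending at `now`."""
    return timedelta(0) <= now - finding.published_at <= window


def filter_recent(findings: list[Finding], now: datetime) -> list[Finding]:
    return [f for f in findings if is_recent(f, now)]


now = datetime(2025, 6, 1, tzinfo=timezone.utc)
findings = [
    Finding("fresh result", now - timedelta(days=3)),
    Finding("stale cached result", now - timedelta(days=400)),
]
kept = filter_recent(findings, now)
print([f.text for f in kept])
```

The downstream agent then no longer has to infer recency from the fact that something appeared in its input window.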
Emergent Resource Contention
Multi-agent systems share resources that single-agent tests never stress: API rate limits, context windows, token budgets, and execution time. In isolation, each agent consumes resources reasonably. Composed, they compete.
Production data shows that single-agent systems achieve 99.5% success rates while equivalent multi-agent implementations reach only 97%. The gap sounds small until you realize that the failure rate rises from 0.5% to 3%, a sixfold increase. Much of this comes from resource contention: agents exhausting shared rate limits, context windows overflowing because multiple agents stuffed their outputs into a shared thread, or cascading timeouts when one agent's slow response blocks the entire pipeline.
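To make the contention concrete, here is a small sketch of a shared token budget in which each agent's request is reasonable alone but the combination overflows. The class `SharedBudget`, the agent names, and the numbers are all hypothetical:

```python
class SharedBudget:
    """Tracks token usage per agent against one shared limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used: dict[str, int] = {}

    def total(self) -> int:
        return sum(self.used.values())

    def reserve(self, agent: str, tokens: int) -> bool:
        """Reserve tokens for an agent; refuse if the shared limit would be exceeded."""
        if self.total() + tokens > self.limit:
            return False
        self.used[agent] = self.used.get(agent, 0) + tokens
        return True


budget = SharedBudget(limit=8000)
outcomes = {
    "planner": budget.reserve("planner", 3000),
    "researcher": budget.reserve("researcher", 4000),
    "synthesizer": budget.reserve("synthesizer", 2000),  # fits alone, not together
}
print(outcomes, "total used:", budget.total())
```

A single-agent test of the synthesizer would never see the refusal, because the budget is only exhausted once the other agents have run.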
Why Distributed Systems Testing Applies (and Where It Breaks Down)
The composition testing problem is not new. Distributed systems engineers have spent decades developing techniques for testing emergent behavior in composed services. Property-based testing, chaos engineering, contract testing — all of these have analogs in multi-agent systems. But the analogy breaks down in important ways.
In microservices, the contract between services is explicit: an API schema, a message format, a documented SLA. In multi-agent systems, the "contract" is natural language — inherently ambiguous, context-dependent, and impossible to validate with schema checks. You can verify that Agent A produced valid JSON, but you cannot verify that Agent A communicated the right nuance for Agent B to make a correct decision.
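The limit of schema checks can be shown in a few lines. In this sketch, both handoff messages pass a structural check, yet only one carries the caveat the downstream agent would need. The required fields and the payloads are invented for illustration:

```python
import json

REQUIRED_FIELDS = {"task", "summary"}  # hypothetical handoff schema


def is_structurally_valid(raw: str) -> bool:
    """Check that the message is a JSON object containing the required fields."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_FIELDS.issubset(payload)


with_caveat = json.dumps(
    {"task": "summarize", "summary": "Revenue grew 12%, based on unaudited figures."}
)
without_caveat = json.dumps({"task": "summarize", "summary": "Revenue grew 12%."})

# Both pass: the schema cannot tell that the second message dropped the caveat.
print(is_structurally_valid(with_caveat), is_structurally_valid(without_caveat))
```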
Additionally, microservices are typically deterministic within a request: given the same input and state, a service produces the same output. LLM-based agents are stochastic. MAESTRO's finding of structurally stable but temporally variable execution has no clean analog in traditional distributed systems. Your chaos engineering playbook assumes that injecting the same fault produces the same failure mode. With LLM agents, it might produce a completely different one.
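In practice this suggests recording a distribution of outcomes per injected fault instead of a single expected failure mode. A hedged sketch, where `run_with_fault` is a hypothetical stand-in for a pipeline run with one injected fault and the outcome labels are made up:

```python
import random
from collections import Counter


def run_with_fault(rng: random.Random) -> str:
    # Hypothetical stand-in: the same injected fault leads to different outcomes.
    return rng.choice(["retried_ok", "hallucinated_result", "aborted"])


def failure_mode_distribution(trials: int, seed: int = 0) -> Counter:
    """Inject the same fault repeatedly and tally the observed outcomes."""
    rng = random.Random(seed)
    return Counter(run_with_fault(rng) for _ in range(trials))


dist = failure_mode_distribution(trials=300)
for mode, count in dist.most_common():
    print(f"{mode}: {count}")
```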
This means you need to borrow the frameworks but rebuild the assumptions.
Building a Composition Test Harness That Actually Works
Effective composition testing requires accepting that you cannot enumerate all possible interaction states. Instead, you test for properties that should hold regardless of which path the system takes.
Property-Based Invariants Over Point Assertions
Rather than asserting that the system produces a specific output for a specific input, define invariants that must hold across all compositions:
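The right invariants depend on the system. As a hedged illustration only, the sketch below checks two plausible ones across many randomized runs: the answer is non-empty, and the synthesizer cites only sources the researcher actually retrieved. `Run`, `fake_pipeline`, and both invariants are assumptions made for this example. It uses plain randomized sampling from the standard library rather than a dedicated property-based testing library:

```python
import random
from typing import Callable, NamedTuple


class Run(NamedTuple):
    retrieved_sources: frozenset[str]
    cited_sources: frozenset[str]
    answer: str


def answer_nonempty(run: Run) -> bool:
    return bool(run.answer.strip())


def cites_only_retrieved(run: Run) -> bool:
    return run.cited_sources <= run.retrieved_sources


INVARIANTS = [answer_nonempty, cites_only_retrieved]


def fake_pipeline(rng: random.Random) -> Run:
    # Hypothetical stand-in for one execution of the composed pipeline.
    pool = [f"doc{i}" for i in range(10)]
    retrieved = rng.sample(pool, k=rng.randint(1, 5))
    cited = rng.sample(retrieved, k=rng.randint(1, len(retrieved)))
    return Run(frozenset(retrieved), frozenset(cited), f"Answer citing {len(cited)} sources")


def check_invariants(pipeline: Callable[[random.Random], Run], trials: int, seed: int = 0) -> list[str]:
    """Run the pipeline repeatedly and report every invariant violation."""
    rng = random.Random(seed)
    violations = []
    for i in range(trials):
        run = pipeline(rng)
        for invariant in INVARIANTS:
            if not invariant(run):
                violations.append(f"trial {i}: {invariant.__name__}")
    return violations


print(check_invariants(fake_pipeline, trials=200))
```

The point of the design is that the check never mentions a specific expected output, so it stays valid whichever path the agents take on a given run.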
References
- https://arxiv.org/html/2601.00481v1
- https://arxiv.org/pdf/2503.13657
- https://www.getmaxim.ai/articles/multi-agent-system-reliability-failure-patterns-root-causes-and-production-validation-strategies/
- https://galileo.ai/blog/multi-agent-ai-failures-prevention
- https://virtuslab.com/blog/ai/testing-evaluating-agentic-systems
- https://arxiv.org/abs/2510.10460
