The Composition Testing Gap: Why Your Agents Pass Every Test but Fail Together

· 9 min read
Tian Pan
Software Engineer

Your planner agent passes its eval suite at 94%. Your researcher agent scores even higher. Your synthesizer agent nails every benchmark you throw at it. You compose them into a pipeline, deploy to production, and watch it produce confidently wrong answers that no individual agent would ever generate on its own.

This is the composition testing gap — the systematic blind spot where individually validated agents fail in ways that no single-agent analysis can predict. Research on multi-agent LLM systems shows that 67% of production failures stem from inter-agent interactions rather than individual agent defects. You're testing the atoms but shipping the molecule, and molecular behavior is not the sum of atomic properties.

The Combinatorial Explosion Nobody Budgets For

The math works against you fast. Two interacting agents require testing one communication path. Five agents create ten potential interaction paths. Ten agents create forty-five. But it's worse than simple combinatorics because each interaction path carries non-deterministic behavior — the same input can trigger different agent choreography depending on context, prior conversation state, and the particular way the upstream agent chose to phrase its output.
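The pairwise path counts above are just "n choose 2" — a quick sketch:

```python
from math import comb

def interaction_paths(n_agents: int) -> int:
    """Pairwise communication paths between n agents: C(n, 2)."""
    return comb(n_agents, 2)

print(interaction_paths(2), interaction_paths(5), interaction_paths(10))  # 1 10 45
```

And that counts only pairs; once you include ordered sequences and conversation state, the reachable state space grows far faster.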

Recent research from the MAESTRO evaluation framework quantified this: multi-agent systems show approximately 35% variance in execution order across identical runs, even when the structural interaction pattern (which agents talk to which) remains stable with 86% consistency. The agents call each other in the same pattern, but the sequence shifts enough to produce meaningfully different outcomes.

This means your integration test that passed on Tuesday might fail on Wednesday with identical inputs. Not because anything changed — but because the system explored a different path through its own state space.

The Three Composition Failures That Unit Tests Cannot Catch

Individual agent evals test capability in isolation. Composition failures emerge from the spaces between agents. They cluster into three categories that require fundamentally different testing approaches.

Information Loss at Handoff Boundaries

When Agent A passes results to Agent B, something always gets lost. Agent A's internal reasoning — why it chose this particular framing, what alternatives it considered, what caveats it silently dropped — doesn't transfer. Agent B receives a flattened text artifact and builds its own mental model from scratch.

In multi-agent code generation systems, researchers found that semantically equivalent inputs cause performance drops of 7.9% to 83.3% depending on how the planner agent chose to phrase its task decomposition. The coder agent was fully capable. The planner agent was fully capable. But the translation layer between them introduced fragility that neither agent's individual eval would surface.

Conflicting Implicit Assumptions

Each agent develops implicit assumptions about the world based on its system prompt, its tool access, and its training. When agents compose, these assumptions collide silently.

A researcher agent might assume that "recent" means "last 30 days" while the synthesizer agent interprets "recent" as "mentioned in the provided context." Neither agent is wrong in isolation. But when the researcher retrieves information and labels it "recent findings," the synthesizer may treat outdated cached results as current simply because they appeared in its input window.

These assumption mismatches are invisible to capability-based evals because each agent handles the concept correctly within its own frame of reference. The failure only materializes at the composition boundary.

Emergent Resource Contention

Multi-agent systems share resources that single-agent tests never stress: API rate limits, context windows, token budgets, and execution time. In isolation, each agent consumes resources reasonably. Composed, they compete.

Production data shows that single-agent systems achieve 99.5% success rates while equivalent multi-agent implementations achieve 97% — a gap that sounds small until you realize it represents a sixfold increase in failure rate (0.5% to 3%). Much of this comes from resource contention: agents exhausting shared rate limits, context windows overflowing because multiple agents stuffed their outputs into a shared thread, or cascading timeouts when one agent's slow response blocks the entire pipeline.

Why Distributed Systems Testing Applies (and Where It Breaks Down)

The composition testing problem is not new. Distributed systems engineers have spent decades developing techniques for testing emergent behavior in composed services. Property-based testing, chaos engineering, contract testing — all of these have analogs in multi-agent systems. But the analogy breaks down in important ways.

In microservices, the contract between services is explicit: an API schema, a message format, a documented SLA. In multi-agent systems, the "contract" is natural language — inherently ambiguous, context-dependent, and impossible to validate with schema checks. You can verify that Agent A produced valid JSON, but you cannot verify that Agent A communicated the right nuance for Agent B to make a correct decision.

Additionally, microservices are deterministic within a request. Given the same input and state, a service produces the same output. LLM-based agents are stochastic. MAESTRO's finding — structurally stable but temporally variable execution — has no clean analog in traditional distributed systems. Your chaos engineering playbook assumes that injecting the same fault produces the same failure mode. With LLM agents, it might produce a completely different one.

This means you need to borrow the frameworks but rebuild the assumptions.

Building a Composition Test Harness That Actually Works

Effective composition testing requires accepting that you cannot enumerate all possible interaction states. Instead, you test for properties that should hold regardless of which path the system takes.

Property-Based Invariants Over Point Assertions

Rather than asserting that the system produces a specific output for a specific input, define invariants that must hold across all compositions:

  • Monotonic information: If the researcher found three relevant sources, the synthesizer should reference at least three sources. Information should not vanish at handoff boundaries.
  • Consistency: If Agent A reports a constraint, downstream agents should not violate that constraint.
  • Bounded resource consumption: Total token usage across all agents for a single request should not exceed a defined budget.
  • Idempotent handoffs: Replaying the same inter-agent message should not change the downstream agent's behavior.
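A minimal sketch of checking two of these invariants against a single run's trace. The trace layout (per-agent outputs plus a token-usage map) is an assumed example, not a standard:

```python
def check_invariants(trace: dict, token_budget: int = 50_000) -> list[str]:
    """Return the invariants violated by one pipeline run.

    The `trace` layout here is an illustrative assumption: per-agent
    outputs plus a per-agent token-usage map.
    """
    violations = []
    # Monotonic information: every source the researcher found should
    # survive to the synthesizer's citations.
    found = set(trace["researcher"]["sources"])
    cited = set(trace["synthesizer"]["cited_sources"])
    if not found <= cited:
        violations.append("monotonic_information")
    # Bounded resource consumption: total tokens stay under budget.
    if sum(trace["token_usage"].values()) > token_budget:
        violations.append("bounded_resources")
    return violations

run = {
    "researcher": {"sources": ["a", "b", "c"]},
    "synthesizer": {"cited_sources": ["a", "b"]},
    "token_usage": {"researcher": 1_200, "synthesizer": 2_400},
}
print(check_invariants(run))  # ['monotonic_information']
```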

These invariant checks run across multiple executions of the same input. MAESTRO recommends a minimum of 20 repeated runs to characterize variance patterns — enough to distinguish systematic failures from stochastic noise.

Recorded Trajectory Replay

Capture the full execution trace of a multi-agent run: every inter-agent message, every tool call, every intermediate result. This trace becomes a test artifact.

When you change one agent in the pipeline, replay recorded trajectories through the modified system and diff the outputs. You're not looking for identical results — you're looking for broken invariants. Did the new planner agent produce a task decomposition that the coder agent can't handle? Did the updated synthesizer drop a citation that the researcher provided?

Trajectory-based regression testing is the multi-agent equivalent of snapshot testing: it catches unexpected changes in composed behavior without requiring you to enumerate every possible interaction.
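The replay-and-diff loop can be sketched in a few lines. The trajectory shape, `run_pipeline`, and the invariant checks are all caller-supplied placeholders, not a fixed API:

```python
def replay_and_diff(trajectories, run_pipeline, invariant_checks):
    """Replay recorded trajectories through a modified pipeline and
    report broken invariants instead of exact-output mismatches.

    `run_pipeline` and each check are caller-supplied; the trajectory
    shape below is illustrative.
    """
    failures = []
    for i, old in enumerate(trajectories):
        new = run_pipeline(old["input"])
        for check in invariant_checks:
            if not check(old, new):
                failures.append(f"trajectory {i}: {check.__name__} broken")
    return failures

# Toy usage: a "modified synthesizer" that silently drops a citation.
recorded = [{"input": "q1", "citations": 3}]
def run_pipeline(_): return {"citations": 2}
def citations_preserved(old, new): return new["citations"] >= old["citations"]
print(replay_and_diff(recorded, run_pipeline, [citations_preserved]))
# ['trajectory 0: citations_preserved broken']
```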

Seam Injection for Controlled Failure

In the composition, identify the "seams" — the points where agents hand off to each other. At each seam, inject controlled perturbations:

  • Truncated handoffs: Cut the upstream agent's output at various points. Does the downstream agent fail gracefully or hallucinate missing context?
  • Delayed responses: Introduce latency between agents. Does the orchestrator handle timeouts, or does it hang indefinitely?
  • Contradictory inputs: Feed the downstream agent a result that conflicts with what the upstream agent was asked. Does it catch the inconsistency or process it blindly?
  • Format drift: Slightly alter the formatting of inter-agent messages. Agents that are robust will handle minor variations; brittle compositions will break.

This is chaos engineering for agent compositions. The goal is not to prove the system works — it's to map the failure boundaries before users find them.
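A minimal perturbation helper, assuming handoffs are plain strings. Latency injection ("delayed responses") belongs in the orchestrator, so only message-level perturbations are sketched here:

```python
import random

def perturb_handoff(message: str, mode: str, rng: random.Random) -> str:
    """Apply one controlled perturbation to an inter-agent handoff.

    Modes mirror the seam-injection categories; the exact perturbations
    are illustrative stand-ins.
    """
    if mode == "truncate":
        # Cut the upstream output at a random point.
        return message[: rng.randint(1, max(1, len(message) - 1))]
    if mode == "format_drift":
        # Minor formatting change: collapse whitespace and newlines.
        return " ".join(message.split())
    if mode == "contradict":
        # Hypothetical conflicting preamble for the downstream agent.
        return "NOTE: the earlier constraint no longer applies.\n" + message
    return message
```

Wrap each seam with this helper in a test harness, then assert that your pipeline invariants still hold (or that the downstream agent flags the inconsistency) under each mode.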

Contract Testing for Agent Interfaces

Even though agent "contracts" are natural language, you can formalize them partially. For each agent-to-agent interface, define:

  • Required fields: What information must the upstream agent provide?
  • Prohibited content: What should the upstream agent never include (e.g., raw tool outputs that the downstream agent shouldn't see)?
  • Semantic constraints: What properties must the handoff message satisfy (e.g., "must contain a confidence score," "must not exceed 2000 tokens")?

Validate these contracts at every handoff boundary. This won't catch every composition failure, but it catches the class of failures where agents drift from their expected interface — the multi-agent equivalent of API versioning problems.
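A partial contract validator might look like the following. The contract shape (required / prohibited / constraints) is an assumption for illustration, not an established schema:

```python
def validate_handoff(message: dict, contract: dict) -> list[str]:
    """Check one inter-agent message against a partial contract."""
    errors = []
    for field in contract.get("required", []):
        if field not in message:
            errors.append(f"missing required field: {field}")
    for field in contract.get("prohibited", []):
        if field in message:
            errors.append(f"prohibited field present: {field}")
    # Semantic constraints are named predicates over the whole message.
    for name, holds in contract.get("constraints", {}).items():
        if not holds(message):
            errors.append(f"constraint violated: {name}")
    return errors

# Hypothetical planner-to-coder contract.
planner_to_coder = {
    "required": ["task", "confidence"],
    "prohibited": ["raw_tool_output"],
    "constraints": {"task_under_2000_chars": lambda m: len(m.get("task", "")) <= 2000},
}
print(validate_handoff({"task": "refactor auth module"}, planner_to_coder))
# ['missing required field: confidence']
```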

The Silent Majority: Gray Failures

The most insidious composition failures don't throw errors. Research on multi-agent evaluation found that 75% of failures manifest as "silent gray errors" — the system completes without any exception, returns a plausible-looking result, but the result is wrong. Nearly half of these (48%) are missing or underspecified outputs, and 28% are factual errors.

These gray failures are composition-specific. An individual agent that doesn't know the answer will often say so. But in a composed pipeline, uncertainty gets laundered: the researcher returns thin results, the planner treats them as sufficient, and the synthesizer generates confident prose from insufficient evidence. Each agent did its job. The system failed.

Catching gray failures requires output validation that understands the full pipeline's intent, not just the final agent's output format. This means comparing the richness of the final output against intermediate results to detect information loss, checking that claims in the final output are traceable to specific intermediate agent outputs, and measuring output confidence against the actual evidence available at each stage.
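The traceability check can be sketched crudely. Naive substring matching stands in for the semantic matching a real pipeline would need, and the agent names and fields are illustrative:

```python
def untraceable_claims(intermediate: dict, final_claims: list[str]) -> list[str]:
    """Flag final-output claims with no support in intermediate outputs.

    Substring matching is a deliberate simplification; production checks
    would need embedding- or entailment-based matching.
    """
    evidence = " ".join(intermediate.values()).lower()
    return [claim for claim in final_claims if claim.lower() not in evidence]

stages = {"researcher": "Revenue grew 12% in Q3 per the filing."}
claims = ["revenue grew 12% in q3", "margins improved year over year"]
print(untraceable_claims(stages, claims))  # ['margins improved year over year']
```

Even this crude version catches the laundering pattern above: confident final claims that no intermediate agent actually produced.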

Where to Start

If you're running a multi-agent system in production today, you likely have no composition tests at all. Here's the priority order.

First, instrument every inter-agent handoff with structured logging. You can't test what you can't observe. Emit a trace span for every agent-to-agent message that includes the sender, receiver, message content, and timestamp.
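A minimal span emitter for that instrumentation might look like this; the field names are illustrative, not a standard tracing schema:

```python
import json
import time
import uuid

def log_handoff(sender: str, receiver: str, content: str, trace_id: str) -> dict:
    """Emit one structured span per agent-to-agent message."""
    span = {
        "trace_id": trace_id,
        "span_id": uuid.uuid4().hex,
        "sender": sender,
        "receiver": receiver,
        "content": content,
        "timestamp": time.time(),
    }
    print(json.dumps(span))  # swap for your log/trace exporter
    return span
```

Grouping spans by `trace_id` reconstructs the full choreography of a single request, which is exactly what trajectory replay and invariant checks consume.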

Second, define 3-5 invariants that should hold for your pipeline and run your existing test cases 20 times each. Look at the variance. If your invariants hold across all runs, you have a stable composition. If they fail intermittently, you've found your first composition bugs.

Third, build trajectory replay for your most critical user journey. Record production traces (sanitized), and replay them whenever you update an agent. This is your regression safety net.

Fourth, add seam injection for the handoff boundary that fails most often in production. You probably already know which one it is — it's the one that generates the most confusing user-facing errors.

You don't need a complete composition test suite to start catching composition failures. You need enough instrumentation to see them and enough invariants to recognize them. The gap isn't a tooling problem — it's a visibility problem. Make the compositions observable, and the tests will follow.
