Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines
The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been stuck at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them, the one place for which no dashboard panel exists.
Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't the tidy 1 - (0.96 × 0.95) ≈ 8.8% you get by multiplying success rates against B's narrow definition of a "valid call." It's the full Cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Every one of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it, because each tool is, by its own standards, fine.
This compounding is severe at agentic depth. Industry analyses of multi-step pipelines find that a 20-step chain at 95% per-step reliability succeeds end-to-end only about 36% of the time. Even at 99% per step, ten hops drop you to roughly 90%. And these numbers assume the standard independent-failure model, which is the optimistic case. Real tool chains correlate failures: the same upstream API outage cascades into the same downstream null-handling bug every time. The real curve is worse than the multiplicative one.
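The arithmetic behind those figures is just per-step reliability raised to the chain depth. A minimal sketch of the independent-failure baseline (the math only, not anyone's production data):

```python
# End-to-end success under the independent-failure model: every step must succeed.
# This is the optimistic baseline; correlated failures make the real curve worse.

def chain_success(p_step: float, n_steps: int) -> float:
    """Probability that all n_steps independent steps succeed."""
    return p_step ** n_steps

print(f"{chain_success(0.95, 20):.1%}")  # ~35.8%, the 'about 36%' for 20 steps at 95%
print(f"{chain_success(0.99, 10):.1%}")  # ~90.4%, ten hops at 99% per step
```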
The Failure Taxonomy That Per-Tool Metrics Cannot See
When you look at composition failures specifically — not single-tool failures — they cluster into shapes that don't show up in unit tests of either tool in isolation:
Lossy contract translation. Tool A returns a richer object than tool B accepts. The glue code (or the LLM, acting as glue) drops fields. Sometimes the dropped field was load-bearing for B's behavior in edge cases nobody enumerated. The classic version is a permissions/scope token that A returns and B silently doesn't propagate, so B operates as a less-privileged identity downstream and fails on records A could see but B can't.
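One cheap check for this class lives at the seam rather than in either tool's unit tests: diff what A returned against what the glue actually forwarded. A minimal sketch, where scope is the invented load-bearing field from the example above:

```python
# Compare A's output fields with what actually reached B; flag any load-bearing field
# that the glue (code or LLM) silently dropped on the way through.

def dropped_load_bearing(a_output: dict, b_input: dict, load_bearing: set[str]) -> set[str]:
    return (set(a_output) - set(b_input)) & load_bearing

# 'scope' is hypothetical: the permissions token A returns and B needs to keep acting
# as the same identity downstream.
assert dropped_load_bearing(
    {"record_id": "r-42", "scope": "read:billing"},
    {"record_id": "r-42"},
    load_bearing={"scope"},
) == {"scope"}
```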
Implicit assumption mismatch. A returns "the customer's most recent address." B expects "the customer's billing address." Most of the time they're the same record, so testing in staging shows green. The 4% of customers where they differ become a long-tail failure mode that looks random because it correlates with a customer attribute neither tool's owners thought to log.
Unit and encoding drift. Currency in cents vs. dollars. Timestamps in UTC vs. local. UUIDs as strings vs. binary. Identifiers as 64-bit ints in A vs. strings in B (with all the ways JSON loses precision on the way through). These are the bugs that ship to production because every developer who looks at the trace mentally normalizes the value before reading it.
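A seam-level normalizer can make these ambiguities fail loudly instead of silently. A sketch with invented field names (price, updated_at, id), not any real tool's schema:

```python
# Force A's output into explicit units and encodings before B ever sees it:
# integer cents with a declared unit, timezone-aware timestamps, string IDs.
from datetime import datetime, timezone

def normalize_for_b(raw: dict) -> dict:
    amount, unit = raw["price"]["amount"], raw["price"]["unit"]
    if unit not in ("cents", "dollars"):
        raise ValueError(f"unknown currency unit: {unit!r}")
    price_cents = amount if unit == "cents" else round(amount * 100)

    updated_at = datetime.fromisoformat(raw["updated_at"])
    if updated_at.tzinfo is None:
        raise ValueError("updated_at is timezone-naive; refusing to guess UTC vs. local")

    return {
        "id": str(raw["id"]),  # string form survives any JSON hop that mangles 64-bit ints
        "price_cents": price_cents,
        "updated_at": updated_at.astimezone(timezone.utc).isoformat(),
    }
```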
Pagination and cursor semantics. A returns the first 50 of 312 items with a next_cursor. B is given the list and proceeds as if it's complete. The agent reasons over a partial picture and produces a confidently wrong answer. The trace shows a clean call to A and a clean call to B, with no error in either. Pagination off-by-ones, "is this cursor exhausted or just empty for this page" ambiguity, and "what happens when the underlying data mutates between page fetches" are all unique to composition; you can't catch them by testing A or B alone.
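A guard at the seam can refuse to hand B an unfinished list. A sketch assuming a hypothetical list_items(cursor=...) wrapper that returns {"items": [...], "next_cursor": ...}:

```python
# Drain the cursor (or fail loudly) before the downstream tool reasons over the list.

def fetch_complete(list_items, max_pages: int = 100) -> list:
    items, cursor = [], None
    for _ in range(max_pages):
        page = list_items(cursor=cursor)
        items.extend(page["items"])
        cursor = page.get("next_cursor")
        if not cursor:  # exhausted: the only state in which the list is actually complete
            return items
    raise RuntimeError("cursor not exhausted after max_pages; refusing to pass a partial list to B")
```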
Error messages swallowed in translation. A returns a structured error object with an explanatory message in a non-standard field. The glue code maps it to a generic 500 before handing it to B. B's retry logic kicks in, retries succeed because the underlying state already changed, and the original error is lost from both the trace and the logs. The team sees an "intermittent" failure that is actually a deterministic contract-translation bug.
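The glue can still map A's error to whatever generic shape B expects without discarding the original payload. A sketch; the detail field name is invented:

```python
# Keep A's structured error attached to the generic error the rest of the chain sees,
# so the trace retains the real cause even if a later retry happens to succeed.

class SeamError(Exception):
    def __init__(self, status: int, upstream: dict):
        detail = upstream.get("detail", "no detail field present")
        super().__init__(f"upstream failure (status {status}): {detail}")
        self.status = status
        self.upstream = upstream  # preserved for logging and tracing

def translate_error(a_error: dict) -> SeamError:
    return SeamError(status=500, upstream=a_error)
```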
Stale identifier across the seam. A returns an entity ID. The agent does some reasoning, asks a clarifying question, and eventually passes the ID to B several turns later. In the meantime, an upstream service rotated the ID format (or the entity was archived). B fails with "not found." This is composition over time as well as over function — and it's an underappreciated source of production failures because reproducing it requires replaying a stale snapshot of the world.
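One mitigation is to treat every ID that crosses the seam as perishable: stamp it with when it was fetched and re-resolve it before calling B if too much time has passed. A sketch with an arbitrary TTL:

```python
# Tag IDs from A with a fetch time; refuse to pass stale ones to B without re-resolving.
from dataclasses import dataclass
from time import monotonic

@dataclass
class FetchedId:
    value: str
    fetched_at: float  # seconds, from time.monotonic()

def fresh_or_raise(ref: FetchedId, ttl_seconds: float = 300.0) -> str:
    age = monotonic() - ref.fetched_at
    if age > ttl_seconds:
        raise ValueError(f"id {ref.value!r} is {age:.0f}s old; re-resolve before calling B")
    return ref.value
```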
Why Per-Tool Dashboards Hide This
Look at how observability for tool-using agents is usually instrumented: a span per tool call, with input args, output, latency, and success/failure. Each tool gets a panel. The panel shows the call's own definition of success — usually "did the tool return a 2xx-equivalent without raising."
This is precisely the wrong unit of measurement for a composition problem. The tool returned successfully. The composition still failed. The right unit is the end-to-end task, which is the only level at which "B got bad inputs from A" surfaces as a failure. Most teams don't measure that, or measure it only on a thin offline eval set that doesn't reflect the long tail of inputs that exercise the composition's edge cases.
The second instrumentation gap: the per-tool panel doesn't show you the distribution of inputs B is receiving. So when A starts returning more records of a previously-rare type (because an upstream catalog grew, say), B's failure rate ticks up, but the panel for B still shows "no change in tool implementation, no incident." The change isn't in B; it's in the joint distribution of A's outputs and B's inputs, which nothing in the per-tool view captures.
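Closing that gap means instrumenting the seam itself: record what B actually receives and compare its distribution against a baseline window, independent of whether B's own calls "succeeded." A rough sketch over one categorical field; the 10-point threshold is arbitrary:

```python
# Track the distribution of a categorical field in B's inputs at the seam and flag drift.
from collections import Counter

def field_shares(inputs: list[dict], field: str) -> dict[str, float]:
    counts = Counter(str(item.get(field, "<missing>")) for item in inputs)
    total = sum(counts.values()) or 1
    return {key: n / total for key, n in counts.items()}

def distribution_shifted(baseline: dict[str, float], current: dict[str, float], tol: float = 0.10) -> bool:
    keys = set(baseline) | set(current)
    return any(abs(baseline.get(k, 0.0) - current.get(k, 0.0)) > tol for k in keys)
```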
Contract Tests at the Seam, Not the Tool
The fix is conceptually simple and operationally hard: you need contract tests that exercise the seam between two tools as a unit, with explicit examples of A's full output range fed into B. The pattern looks like this:
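(A minimal sketch, assuming pytest. to_booking_request stands in for whatever glue, code or prompt template, translates search_listings output into book_appointment input; the payloads are invented for illustration.)

```python
# Seam contract test: feed explicit examples of A's output range, including its awkward
# corners, through the glue and assert what B's contract actually requires.
from datetime import datetime, timezone
import pytest

def to_booking_request(listing: dict) -> dict:
    """Glue under test: normalize search_listings output into book_appointment input."""
    slot = datetime.fromisoformat(str(listing["slot"]))
    if slot.tzinfo is None:
        raise ValueError("slot is timezone-naive")
    if not isinstance(listing["price_cents"], int):
        raise ValueError("price must be integer cents")
    return {
        "listing_id": str(listing["listing_id"]),
        "slot": slot.astimezone(timezone.utc).isoformat(),
        "price_cents": listing["price_cents"],
    }

# Representative outputs of A, deliberately including the ambiguous shapes it can emit.
WELL_FORMED  = {"listing_id": 9007199254740993, "slot": "2024-07-01T09:00:00+00:00", "price_cents": 12900}
NAIVE_TS     = {"listing_id": "abc-123", "slot": "2024-07-01T09:00:00", "price_cents": 12900}
DOLLAR_FLOAT = {"listing_id": "abc-123", "slot": "2024-07-01T09:00:00+00:00", "price_cents": 129.00}

def test_well_formed_output_translates_cleanly():
    req = to_booking_request(WELL_FORMED)
    assert req["listing_id"] == "9007199254740993"
    assert req["slot"].endswith("+00:00")

@pytest.mark.parametrize("bad", [NAIVE_TS, DOLLAR_FLOAT])
def test_ambiguous_outputs_fail_loudly_at_the_seam(bad):
    # Anything B would silently misread must be rejected here, not three hops downstream.
    with pytest.raises(ValueError):
        to_booking_request(bad)
```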
Sources
- https://redis.io/blog/why-multi-agent-llm-systems-fail/
- https://www.mindstudio.ai/blog/reliability-compounding-problem-ai-agent-stacks
- https://arxiv.org/pdf/2503.13657
- https://stackoverflow.blog/2025/06/30/reliability-for-unreliable-llms/
- https://orq.ai/blog/why-do-multi-agent-llm-systems-fail
- https://galileo.ai/blog/why-multi-agent-systems-fail
- https://docs.bswen.com/blog/2026-03-20-llm-tool-call-contracts/
- https://circleci.com/blog/building-llm-agents-to-validate-tool-use-and-structured-api/
- https://agenta.ai/blog/the-guide-to-structured-outputs-and-function-calling-with-llms
