Two-Hop Tool Chains: Why 95% Tools Compose Into 80% Pipelines
The per-tool dashboard in your observability stack tells a comforting lie. search_listings is green at 96%. book_appointment is green at 95%. The agent that uses them back-to-back has been at 78% for three weeks and nobody can explain why. The reason isn't in either tool. It's in the seam between them — the place no dashboard panel exists.
Composition is not addition. When tool A's output flows into tool B's input, the failure surface isn't 1 - (0.96 × 0.95) against B's narrow definition of "valid call." It's the full cartesian product of every way A can be subtly off by B's standards: a date string in MM/DD/YYYY when B expects ISO 8601, a price returned in cents when B parses dollars, a paginated cursor that points one item past the last result, an entity ID that was renamed on the upstream service yesterday. Any of these passes A's own contract tests cleanly. Each one breaks B. The team's per-tool reliability metrics never see it because each tool is, by its own standards, fine.
