Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget
Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.
This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. If failures are independent, end-to-end success is the product of the per-step rates: five components at 99% reliability give you about 95% end-to-end, and ten give you about 90%. A 20-step process at 95% per step succeeds only 36% of the time, so nearly two-thirds of operations fail before completion. By the time a workflow chains 50 components (not unusual once an enterprise agent starts calling sub-agents that call tool agents), a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
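The compounding is easy to check by hand. A minimal sketch, assuming every step fails independently and at the same rate (real chains rarely satisfy either assumption exactly); the function name is illustrative:

```python
def end_to_end_success(per_step: float, steps: int) -> float:
    """Success probability of a chain whose steps all succeed independently."""
    return per_step ** steps


# The four chains discussed in the text: (per-step success, number of steps).
for per_step, steps in [(0.99, 5), (0.99, 10), (0.95, 20), (0.99, 50)]:
    rate = end_to_end_success(per_step, steps)
    print(f"{steps:>2} steps at {per_step:.0%} each -> {rate:.1%} end-to-end")
```

Correlated failures (a shared model outage, say) would make the real number better than this product, and input drift between hops would make it worse, so treat the product as a planning baseline rather than a prediction.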
Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions. Separately, unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.
The single-hop fallacy
When microservice meshes were the new shape, every team learned the same lesson the same way: a 200ms p95 latency target at the edge means each downstream hop has to fit inside a slice of that budget, and the team that owns the user-facing endpoint is the one that has to do the slicing. A practical decomposition might allocate 20ms for TLS and edge, 70ms for app logic, 80ms for cache and DB, and 30ms of slack for serialization and GC. The SRE community spent a decade building dashboards, tracing, and alerting that surfaced the per-hop contribution to an end-user metric — partial-SLO methods that mathematically derive each downstream's allocation, stacked-bar dashboards that show which hop just grew, distributed traces that name the slow span.
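The slicing itself is small enough to write down. A rough sketch using the example split above; the hop names and the `over_budget` helper are hypothetical, and treating percentile budgets as strictly additive is a simplification that real tracing data will not honor exactly:

```python
# Example per-hop p95 budgets in milliseconds, from the split described above.
EDGE_TARGET_MS = 200
HOP_BUDGET_MS = {"tls_edge": 20, "app_logic": 70, "cache_db": 80, "slack": 30}

# The slices have to add up to the edge target, or the budget is fiction.
assert sum(HOP_BUDGET_MS.values()) == EDGE_TARGET_MS


def over_budget(observed_ms: dict[str, float]) -> dict[str, float]:
    """Return each hop that exceeded its slice, with the overage in ms."""
    return {
        hop: observed_ms[hop] - budget
        for hop, budget in HOP_BUDGET_MS.items()
        if observed_ms.get(hop, 0.0) > budget
    }


# A trace where only the app-logic hop grew past its slice.
print(over_budget({"tls_edge": 18, "app_logic": 95, "cache_db": 80, "slack": 10}))
```

The point of the per-hop view is that the answer names a hop, not just the edge metric that regressed.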
Agent meshes have inherited the same shape with none of the decade of tooling. The 99% number on the agent's spec sheet is almost always measured solo: a benchmark run, a curated eval set, a friendly distribution that looks nothing like the prompts the upstream caller will actually send. When the joint workflow stops working, the upstream team opens a ticket against the downstream team. The downstream team runs their solo eval, sees the 99% number is still 99%, and closes the ticket. The customer is still seeing the bug.
The fix is not "demand a higher SLO from the downstream agent." The fix is the same one microservices learned: give each hop a budget, denominated in a metric the joint workflow actually cares about, with a dashboard that shows the per-hop contribution, and an alert that fires on the composition before it fires on any single hop.
A per-hop reliability budget
Borrow the microservice playbook directly. Define the joint workflow's reliability target (say, 95% end-to-end success) and decompose it across the agent chain. A four-hop chain that needs 95% can be sliced evenly, so that each hop is allowed a failure rate of about 1.27% and must hit roughly 98.7%, or weighted toward the hops where the team has more headroom. A hop that the org can only get to 96% becomes a known constraint that tightens the budget for everyone else. A hop that reliably runs at 99.5% releases budget for harder hops elsewhere in the chain.
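Both slicing strategies reduce to a few lines of arithmetic. A sketch under the same independence assumption as before; the function names are illustrative, not from any SLO tooling:

```python
import math


def even_per_hop_target(composite_target: float, hops: int) -> float:
    """Per-hop success rate needed when the budget is split evenly."""
    return composite_target ** (1.0 / hops)


def remaining_per_hop_target(
    composite_target: float, pinned_hops: list[float], free_hops: int
) -> float:
    """Per-hop rate the unconstrained hops must hit once some hops are pinned."""
    pinned_product = math.prod(pinned_hops)
    if pinned_product < composite_target:
        raise ValueError("pinned hops alone already miss the composite target")
    return (composite_target / pinned_product) ** (1.0 / free_hops)


print(f"even split, 4 hops:         {even_per_hop_target(0.95, 4):.2%}")
print(f"one hop stuck at 96%:       {remaining_per_hop_target(0.95, [0.96], 3):.2%}")
print(f"one hop reliably at 99.5%:  {remaining_per_hop_target(0.95, [0.995], 3):.2%}")
```

With one hop stuck at 96%, the other three each need roughly 99.65%, well above the 98.7% of the even split. With one hop at 99.5%, the other three can relax to about 98.5%.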
The discipline that has to land is not the spreadsheet — it's the enforcement. The budget is real only if it shows up in three places:
- The composite SLO has its own dashboard and alert, separate from any individual agent's SLO.
- Each per-hop budget shows up on the upstream caller's dashboard, not just on the downstream owner's.
- Burn-rate alerts on the composite fire fast enough to start an incident before the per-hop owners notice their own number is fine.
Without these three, the per-hop budget is a Notion page nobody opens. With them, the joint workflow's owner has the same situational awareness an SRE team has when the checkout endpoint slows down.
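The composite alert can follow the familiar multi-window burn-rate pattern. A minimal sketch, assuming the joint workflow already records end-to-end success and failure counts per window; the `Window` type, the window pairing, and the threshold of 2.0 are illustrative choices, not recommended values:

```python
from dataclasses import dataclass


@dataclass
class Window:
    total: int   # joint-workflow runs observed in the window
    failed: int  # runs that failed end-to-end, regardless of which hop failed


def burn_rate(window: Window, slo_target: float) -> float:
    """How fast the error budget is being spent (1.0 = exactly on budget)."""
    if window.total == 0:
        return 0.0
    error_rate = window.failed / window.total
    return error_rate / (1.0 - slo_target)


def composite_alert(
    long_window: Window,
    short_window: Window,
    slo_target: float = 0.95,
    threshold: float = 2.0,  # illustrative; tune to the budget period
) -> bool:
    """Fire only if both windows burn fast, so a brief blip does not page."""
    return (
        burn_rate(long_window, slo_target) >= threshold
        and burn_rate(short_window, slo_target) >= threshold
    )


# 12% failures over the long window and 15% over the short one, against a 5% budget.
print(composite_alert(Window(1000, 120), Window(100, 15)))
```

Note that the inputs are composite outcomes only. In this example each hop could still be within a few percent of its own target while the joint workflow burns budget at more than twice the sustainable rate.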
Naming the failure modes the caller can route around
Microservices have HTTP status codes. A 503 means try again, a 422 means stop and surface the validation error, a 401 means refresh credentials. The caller can write code against these because the contract is closed.
Agent calls typically have none of this. The downstream agent might refuse the request (with prose), throw a tool-execution error (with a stack trace nobody parses), time out (silently, after the upstream's own deadline has fired), or return malformed structured output that crashes the upstream parser two stack frames later. The upstream caller has no clean way to distinguish "transient, retry" from "your input is wrong, don't retry" from "I refused on policy grounds, you'll never succeed." So the upstream just retries everything, which both inflates the bill and makes the failure mode worse, a recurring finding in production multi-agent post-mortems.
The contract layer between agents has to name these failure modes explicitly. Concretely:
- A typed error envelope (refusal, tool_error, timeout, schema_violation, policy_block) that the downstream agent emits structurally instead of mixing into a free-form string.
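One way such an envelope could look on the caller's side is sketched below. The five kinds come from the list above; the `ErrorEnvelope` shape, the `parse_envelope` helper, the payload field names, and the choice of which kinds count as retryable are all assumptions for illustration, not part of any agent protocol:

```python
from dataclasses import dataclass
from enum import Enum


class FailureKind(str, Enum):
    REFUSAL = "refusal"
    TOOL_ERROR = "tool_error"
    TIMEOUT = "timeout"
    SCHEMA_VIOLATION = "schema_violation"
    POLICY_BLOCK = "policy_block"


@dataclass(frozen=True)
class ErrorEnvelope:
    kind: FailureKind
    detail: str = ""  # human-readable context; the caller never parses it


# Assumption: only these kinds are treated as transient. A real contract
# would have the downstream owner declare this mapping explicitly.
RETRYABLE = {FailureKind.TOOL_ERROR, FailureKind.TIMEOUT}


def parse_envelope(payload: dict) -> ErrorEnvelope:
    """Build an envelope from a structured payload; unknown kinds raise ValueError."""
    return ErrorEnvelope(FailureKind(payload["kind"]), payload.get("detail", ""))


def should_retry(envelope: ErrorEnvelope, attempt: int, max_attempts: int = 3) -> bool:
    """Retry only transient failures, and only while attempts remain."""
    return envelope.kind in RETRYABLE and attempt < max_attempts


print(should_retry(parse_envelope({"kind": "timeout"}), attempt=1))
print(should_retry(parse_envelope({"kind": "policy_block"}), attempt=1))
```

Because the set of kinds is closed, the caller can branch on it the way HTTP clients branch on status codes, and a policy block stops costing three retries.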
