
Cross-Team Agent SLAs Don't Compose: The 99% Math Your Org Forgot to Budget

11 min read
Tian Pan
Software Engineer

Team A's agent advertises a 99% success rate. Team B's agent advertises 99%. The new joint workflow that calls both lands at 98% on a good day, 96% on a bad one — and the team that owns the joint workflow is now the de facto SRE for two systems they don't own, can't reproduce locally, and didn't write the eval set for. Each upstream team is hitting its SLO. The composite product is missing its SLO. Nobody's pager is ringing on the right side of the boundary.

This is the math of independent failure rates, and it has been hiding in plain sight ever since the org started letting agents call each other. Five components at 99% reliability give you 95% end-to-end. Ten components give you 90%. A 20-step process at 95% per-step succeeds 36% of the time — nearly two-thirds of runs fail before completion. By the time a workflow chains 50 components — not unusual once an enterprise agent starts calling sub-agents that call tool agents — a system where every individual piece is "99% reliable" will fail roughly four out of ten requests.
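For anyone who wants to check the arithmetic, the compounding fits in a few lines of Python, assuming independent per-step failures:

```python
# Composite success of a chain of independent steps: p ** n.
def composite_success(per_step: float, steps: int) -> float:
    return per_step ** steps

for p, n in [(0.99, 5), (0.99, 10), (0.95, 20), (0.99, 50)]:
    print(f"{n} steps at {p:.0%} per step -> {composite_success(p, n):.1%} end-to-end")
# 5 steps at 99% per step -> 95.1% end-to-end
# 10 steps at 99% per step -> 90.4% end-to-end
# 20 steps at 95% per step -> 35.8% end-to-end
# 50 steps at 99% per step -> 60.5% end-to-end
```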

Researchers analyzing five popular multi-agent frameworks across more than 150 tasks identified failure rates between 41% and 87%, with the top three failures being step repetition, reasoning–action mismatch, and unawareness of termination conditions — and unstructured multi-agent networks have been observed to amplify errors up to 17× compared to single-agent baselines. The math isn't subtle. The problem is that the org's SLO sheets, dashboards, on-call rotations, and PRDs are still scoped one agent at a time.

The single-hop fallacy

When microservice meshes were the new shape, every team learned the same lesson the same way: a 200ms p95 latency target at the edge means each downstream hop has to fit inside a slice of that budget, and the team that owns the user-facing endpoint is the one that has to do the slicing. A practical decomposition might allocate 20ms for TLS and edge, 70ms for app logic, 80ms for cache and DB, and 30ms of slack for serialization and GC. The SRE community spent a decade building dashboards, tracing, and alerting that surfaced the per-hop contribution to an end-user metric — partial-SLO methods that mathematically derive each downstream's allocation, stacked-bar dashboards that show which hop just grew, distributed traces that name the slow span.

Agent meshes have inherited the same shape with none of the decade of tooling. The 99% number on the agent's spec sheet is almost always measured solo: a benchmark run, a curated eval set, a friendly distribution that looks nothing like the prompts the upstream caller will actually send. When the joint workflow stops working, the upstream team opens a ticket against the downstream team. The downstream team runs their solo eval, sees the 99% number is still 99%, and closes the ticket. The customer is still seeing the bug.

The fix is not "demand a higher SLO from the downstream agent." The fix is the same one microservices learned: give each hop a budget, denominated in a metric the joint workflow actually cares about, with a dashboard that shows the per-hop contribution, and an alert that fires on the composition before it fires on any single hop.

A per-hop reliability budget

Borrow the microservice playbook directly. Define the joint workflow's reliability target — say, 95% end-to-end success — and decompose it across the agent chain. A four-hop chain that needs 95% can either be evenly sliced (each hop carries 1.27% of the failure budget, hitting ~98.7% per-hop) or weighted by where the team has more headroom. A hop that the org can only get to 96% becomes a known constraint that tightens the budget for everyone else. A hop that can comfortably run at 99.5% frees budget for the harder hops downstream.
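A minimal sketch of that decomposition, assuming independent hop failures; the hop names and the pinned 96% are made up for the example:

```python
# Sketch: slicing an end-to-end reliability target across hops.
# Assumes independent hop failures, so per-hop targets multiply up to
# the composite. Hop names and the pinned 96% are illustrative.
END_TO_END_TARGET = 0.95
hops = ["router", "retrieval_agent", "planner_agent", "tool_agent"]

# Even slice: every hop carries the same share of the failure budget.
even_target = END_TO_END_TARGET ** (1 / len(hops))
print(f"even slice: {even_target:.2%} per hop")              # ~98.73%

# Weighted slice: a constrained hop pins its number, the rest absorb the gap.
pinned = {"tool_agent": 0.96}                                 # best this hop can do today
remaining = END_TO_END_TARGET
for target in pinned.values():
    remaining /= target
free_hops = [h for h in hops if h not in pinned]
free_target = remaining ** (1 / len(free_hops))
print(f"with tool_agent pinned at 96%, the others need {free_target:.2%} each")  # ~99.65%
```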

The discipline that has to land is not the spreadsheet — it's the enforcement. The budget is real only if it shows up in three places:

  • The composite SLO has its own dashboard and alert, separate from any individual agent's SLO.
  • Each per-hop budget shows up on the upstream caller's dashboard, not just on the downstream owner's.
  • Burn-rate alerts on the composite fire fast enough to start an incident before the per-hop owners notice their own number is fine.

Without these three, the per-hop budget is a Notion page nobody opens. With them, the joint workflow's owner has the same situational awareness an SRE team has when the checkout endpoint slows down.
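The third bullet is the one teams skip because it feels like SRE overhead. Here is a rough sketch of a composite burn-rate check using the common multi-window pattern; the 14× threshold and the 5m/1h windows are illustrative defaults, not prescriptions:

```python
# Sketch of a composite burn-rate check (multi-window pattern).
# The error rates come from whatever metrics backend the org runs;
# the 14x threshold and the 5m/1h windows are illustrative defaults.
COMPOSITE_SLO = 0.95
ERROR_BUDGET = 1 - COMPOSITE_SLO        # 5% of joint-workflow requests may fail

def burn_rate(error_rate: float) -> float:
    """How fast the composite is consuming its budget; 1.0 means exactly on budget."""
    return error_rate / ERROR_BUDGET

def should_page(error_rate_5m: float, error_rate_1h: float) -> bool:
    # Page only when both windows agree: a fast burn that is not a blip.
    return burn_rate(error_rate_5m) > 14 and burn_rate(error_rate_1h) > 14

print(should_page(error_rate_5m=0.08, error_rate_1h=0.06))  # False: degraded, not paging yet
print(should_page(error_rate_5m=0.80, error_rate_1h=0.75))  # True: the composite is on fire
```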

Naming the failure modes the caller can route around

Microservices have HTTP status codes. A 503 means try again, a 422 means stop and surface the validation error, a 401 means refresh credentials. The caller can write code against these because the contract is closed.

Agent calls have, on average, none of this. The downstream agent might refuse the request (with prose), throw a tool-execution error (with a stack trace nobody parses), time out (silently, after the upstream's own deadline has fired), or return malformed structured output that crashes the upstream parser two stack frames later. The upstream caller has no clean way to distinguish "transient, retry" from "your input is wrong, don't retry" from "I refused on policy grounds, you'll never succeed." So the upstream just retries everything, which both inflates the bill and makes the failure mode worse — a recurring finding in production multi-agent post-mortems.

The contract layer between agents has to name these failure modes explicitly. Concretely:

  • A typed error envelope — refusal, tool_error, timeout, schema_violation, policy_block — that the downstream agent emits structurally instead of mixing into a free-form string (a minimal sketch follows this list).
  • A retry-policy hint per error type, owned by the downstream agent, so the upstream caller doesn't have to reverse-engineer it.
  • A per-edge runbook documenting what each error type means and what the caller should do — the agent equivalent of "what to do when checkout returns 503."
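What the envelope and the retry hints could look like, as a minimal Python sketch; the field and enum names are illustrative, not a standard:

```python
# Minimal sketch of a typed error envelope between agents.
# Field and enum names are illustrative, not a standard.
from dataclasses import dataclass
from enum import Enum

class AgentErrorType(Enum):
    REFUSAL = "refusal"                    # policy/judgment refusal: retrying won't help
    TOOL_ERROR = "tool_error"              # a downstream tool failed: may be transient
    TIMEOUT = "timeout"                    # deadline exceeded: usually retryable
    SCHEMA_VIOLATION = "schema_violation"  # output failed validation: fix the input or the contract
    POLICY_BLOCK = "policy_block"          # blocked by a guardrail: never retryable

# Retry hints owned by the downstream agent, so the caller never guesses.
RETRYABLE = {
    AgentErrorType.TOOL_ERROR: True,
    AgentErrorType.TIMEOUT: True,
    AgentErrorType.REFUSAL: False,
    AgentErrorType.SCHEMA_VIOLATION: False,
    AgentErrorType.POLICY_BLOCK: False,
}

@dataclass
class AgentError:
    type: AgentErrorType
    message: str             # human-readable detail, never parsed for control flow
    retryable: bool          # copied from the downstream's own retry policy
    contract_version: str    # lets the caller correlate failures with schema bumps
```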

The A2A protocol that emerged in 2025 starts to formalize some of this — typed task states, structured artifacts, well-known agent-discovery endpoints, JSON-RPC over HTTPS — and orgs adopting it are realizing that most of the value isn't the wire format. It's that someone, somewhere, finally wrote down what the failure modes are.

A per-edge eval, not just per-agent

The most common eval pattern for multi-agent systems is also the most misleading one: each team runs their own eval against their own agent, posts the score, and assumes the joint workflow is the AND of those scores. It isn't. The joint workflow's distribution is whatever the upstream agent actually sends, which is almost never the distribution the downstream agent's eval set was drawn from. The downstream team built their eval against a clean distribution of intent-classified queries; the upstream agent feeds them the messy, partially-reformulated, schema-occasionally-violated outputs of its own LLM.

The per-edge eval grades the handshake: under the upstream agent's actual output distribution, does the downstream agent still hit its target? This is harder to set up than a per-agent eval, because it requires capturing real upstream traffic and replaying it against the downstream — but it's the only eval that catches the most expensive bug class in agent meshes, where the downstream is fine in isolation and broken in composition.
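One way to structure the replay harness, as a sketch; the capture source, the downstream call, and the grader are injected here because every org's plumbing differs, and nothing in this snippet is a real library:

```python
from typing import Callable, Iterable

# Sketch of a per-edge eval: replay captured upstream outputs against the
# downstream agent and grade the handshake, not either agent in isolation.
def per_edge_success_rate(
    captured_upstream_outputs: Iterable[dict],    # real upstream traffic, not the curated eval set
    call_downstream: Callable[[dict], dict],      # invokes the downstream agent on one payload
    grade: Callable[[dict, dict], bool],          # did the response honor the edge's contract?
) -> float:
    results = [grade(sample, call_downstream(sample)) for sample in captured_upstream_outputs]
    return sum(results) / len(results)
```

Gate the edge on that number the same way a per-agent eval gates the agent: the build fails when it drops below the hop's slice of the composite budget.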

The version-matrix testing pattern is the natural extension. Any combination of agent versions that has not been explicitly tested together is untested. When team A bumps their model from a snapshot slated for deprecation to a new one, the per-edge eval has to re-run against every downstream that consumes their output — even though team A's solo eval is green, even though team B hasn't shipped anything. The breakage isn't in either agent. It's in the seam.
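The matrix is usually small enough to enumerate explicitly; a sketch, with made-up version names:

```python
from itertools import product

# Sketch: enumerate the version matrix for one edge. Any pair not explicitly
# in TESTED is untested, and a model bump on either side re-opens the matrix.
# Version strings are made up for the example.
upstream_versions = ["planner-v12", "planner-v13"]    # v13 is the new model snapshot
downstream_versions = ["toolbot-v7", "toolbot-v8"]

TESTED = {("planner-v12", "toolbot-v7"), ("planner-v12", "toolbot-v8")}

for pair in product(upstream_versions, downstream_versions):
    status = "tested" if pair in TESTED else "UNTESTED: run the per-edge eval before routing traffic"
    print(pair, "->", status)
```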

Schema versioning as the breaking-vs-additive boundary

The same versioning hygiene that REST APIs learned twenty years ago applies to agent output schemas, and orgs are mostly relearning it the hard way. A semver-style classification of changes — breaking versus additive — has to be a first-class part of the agent's deployment process:

  • Breaking: rename a field, remove a field, change a type, remove or rename a permitted enum value, tighten a validation rule.
  • Additive: add an optional field, expand an enum, loosen a validation rule.

Breaking changes carry a deprecation window and a contract_version bump that downstream consumers can pin against. Additive changes ship freely. The CI gate that fails when a breaking schema change ships without a deprecation entry is a five-line lint — and it is the difference between "the upstream caller's parser broke at 3am" and "the upstream caller had two weeks to migrate."
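The gate really can be that small. A sketch, assuming a schema represented as a field-to-type map and a deprecation log the team already maintains; both formats are illustrative:

```python
import sys

# Sketch of the CI gate: fail the build when a breaking schema change lands
# without a deprecation entry. The schema shape (a field-to-type map) and the
# deprecation-log format are illustrative; the shape of the check is the point.
def check_schema_change(old_fields: dict, new_fields: dict, deprecation_log: list) -> None:
    removed = set(old_fields) - set(new_fields)               # removed or renamed fields: breaking
    retyped = {f for f in set(old_fields) & set(new_fields)
               if old_fields[f] != new_fields[f]}             # type changes: breaking
    undeclared = [f for f in removed | retyped
                  if not any(entry.get("field") == f for entry in deprecation_log)]
    if undeclared:
        sys.exit(f"breaking schema change without a deprecation entry: {sorted(undeclared)}")
```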

The architectural realization is small but load-bearing: the agent's structured output is a public API. It has the same contract-stability problems any public API has. Treat it that way.

A shared incident channel and time-to-owner

When the joint workflow degrades, two things have to happen fast: somebody has to know it's degraded, and the right somebody has to know which hop is to blame. Time-to-owner — the wall-clock time from "incident opened" to "the team that owns the broken thing is in the room" — is the underrated metric here. In cross-team agent dependencies, time-to-owner can stretch into hours because the on-call rotations are scoped per agent and nobody is on-call for the seam.

The fix is operational, not architectural. A shared incident channel for cross-team agent dependencies, with the upstream and downstream owners both subscribed. A service catalog entry for the edge between the two agents, with a named owner. A standing review where both teams look at the per-edge eval and the per-edge SLO together, monthly, before either side ships a model bump.

This sounds heavy until you've watched a 72-hour incident bounce between three teams who each ran their solo eval, each saw their own number was green, and each closed the ticket. The escalation delay isn't a people problem. It's a systems problem disguised as a people problem — the teams don't share a current view of the dependency path, the per-edge contract, or the recent changes that touched the seam.

The architectural realization

Agent-to-agent composition has all the consistency, latency, and reliability tax of microservice meshes — without the decade of tooling that microservices got. The org that hasn't built the SLA layer between its agents is going to discover the math of independent failure rates the hard way, usually three months into a flagship workflow's rollout, when the composite success rate has quietly drifted from "we hit the target on demo day" to "we are missing the target every day and nobody's solo dashboard knows."

The four pieces that have to be in place — a per-hop reliability budget with a composite alert, a typed contract layer that names failure modes, a per-edge eval that grades the handshake, and a versioned schema with breaking-vs-additive discipline — are not exotic. They are the same four pieces every microservices org built between 2014 and 2020. The only thing new is that they apply to agents now, and the orgs shipping agent meshes today are mostly operating like it's 2014: every team confident in their own number, the composite quietly bleeding, and nobody looking at the seam until the customer complaint thread crosses a hundred replies.

The teams that get ahead of this are doing one specific thing. They are treating the edge between two agents as an artifact a person owns — not a passing reference in someone else's PRD, not an implicit assumption in someone's eval, not a Notion doc nobody opens. An owner, a budget, an alert, a contract, an eval. Five things. Once those five exist, the math stops being a surprise and starts being a budget item, which is where it should have lived from the start.
