
Fourth-Party Risk: When Your Vendor's Vendor Owns Your Customer's Incident

· 11 min read
Tian Pan
Software Engineer

Your contract is with the model provider. Your runbook handles the case where that provider is degraded. Your status page subscription pages you when their dashboard turns yellow. You feel covered. Then one Wednesday afternoon the underlying cloud region your provider runs in starts browning out. Your provider's failover region is also affected because they consolidated capacity to control unit economics, and your product is half-down for ninety minutes because of a vendor decision two layers upstream from any contract you signed.

The customer postmortem request lands in your inbox the next morning. They want a root cause. The root cause lives in a layer your status page cannot see and your contract does not let you compel. That layer is what fourth-party risk actually is — not a procurement checkbox, but a silent dependency tier that propagates failures upward with attenuation but not absorption.

In financial services and healthcare, fourth-party risk has been a regulated discipline for years. In LLM-powered products, it is mostly an unmodeled assumption that the vendor you call is the bottom of the stack. It is not. Your inference provider runs on someone else's GPUs, in someone else's data centers, behind someone else's network backbone, often within a handful of regions that the entire industry shares. The visible layer you contracted with is the top of a stack that owns your availability profile whether you priced it in or not.

The Stack Below Your Contract

A typical LLM-powered product has at least four risk tiers stacked on top of each other. Most teams model only the first two.

Tier one: your code. What your service deploys, owns, and operates. This is the layer with the most observability and the least excuse for failure.

Tier two: your direct provider. The model API, the embedding service, the inference gateway. You have a contract here, a status page, and usually a procurement record. Your SRE team can negotiate SLAs and credits with this layer.

Tier three: your provider's infrastructure. The cloud region your provider runs on. The accelerator supply they bought. The network transit they pay for. You did not sign a contract with this layer. Your provider did, and they are not obligated to share its posture with you.

Tier four and below: the upstream substrate. The power grid the data center sits on. The single fiber path between two metros that your "geo-redundant" providers both happen to use. The DNS authority that resolves multiple "independent" endpoints. The CDN that fronts unrelated services because the discount was good. This tier is where industry concentration silently becomes correlated risk.

When tier four fails, tier three feels it acutely, tier two feels it noisily, and tier one — your product — gets paged. The failure mode is not that the bottom layer broke. The failure mode is that the bottom layer broke and the middle layers were not contractually obligated to tell you why, when, or for how much longer.

Multi-Provider Redundancy That Is Not

The instinctive response to a vendor outage is to add a second vendor. You sign a contract with a different inference provider, write a fallback router, and feel covered. You are not, and the reason is structural: independence at the contract layer does not give you independence at the substrate layer.

Two providers can resolve to the same cloud region. Two cloud regions can resolve to the same metro area's power grid. Two metro areas can resolve to the same long-haul fiber corridor. The CrowdStrike incident in 2024 was a textbook example of horizontal propagation through what looked like an independent vendor ecosystem — a single fourth-party update disabled services across dozens of vendors simultaneously, because the vendors had silently converged on the same kernel-level agent. Most teams running "redundant" LLM providers have a similar convergence at one or more layers below their contract.
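Once even a rough dependency map exists, this kind of convergence can be checked mechanically. The sketch below is a minimal illustration, not a description of any real vendor: every name and edge in `DEPENDS_ON` is hypothetical, and in practice the map would be pieced together from whatever your providers are willing to disclose.

```python
from collections import deque

# Hypothetical dependency edges: each key depends on every item in its list.
# None of these names describe a real vendor relationship.
DEPENDS_ON = {
    "provider-a": ["cloud-x/region-1"],
    "provider-b": ["cloud-y/region-7"],
    "cloud-x/region-1": ["metro-east-grid", "fiber-corridor-1"],
    "cloud-y/region-7": ["metro-south-grid", "fiber-corridor-1"],
}


def substrate(node: str) -> set[str]:
    """Return everything `node` depends on, directly or transitively."""
    seen: set[str] = set()
    queue = deque(DEPENDS_ON.get(node, []))
    while queue:
        dep = queue.popleft()
        if dep not in seen:
            seen.add(dep)
            queue.extend(DEPENDS_ON.get(dep, []))
    return seen


def shared_substrate(a: str, b: str) -> set[str]:
    """Dependencies that two supposedly independent providers have in common."""
    return substrate(a) & substrate(b)


print(shared_substrate("provider-a", "provider-b"))
```

In this toy map the two providers share nothing at the cloud layer, yet both sit on the same fiber corridor two hops down. An empty result only means no overlap was found in the layers you managed to map, not that none exists.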

The pattern shows up in three concrete ways.

Identity federation deadlock. Both providers authenticate against the same SSO tenant. The SSO tenant is degraded. Neither provider is technically down, but neither will accept your traffic, and your "failover" is paralyzed at the permissions layer.

Rate-limit storm during failover. The primary provider is at 60% error rate. The router shifts 100% of traffic to the secondary. The secondary's per-tenant rate limit was sized for your steady-state share, not for the sudden surge, and it starts rejecting too. The shift that was supposed to be a recovery becomes a second outage.
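The arithmetic behind this failure is simple enough to check before the incident. The following is a rough sketch under stated assumptions: the function name, the dataclass, and all request rates are hypothetical planning numbers, not figures from any provider's rate-limit documentation.

```python
from dataclasses import dataclass


@dataclass
class FailoverPlan:
    shiftable_rps: int  # primary traffic the secondary can absorb
    rejected_rps: int   # primary traffic that would hit the rate limit


def plan_full_shift(primary_rps: int, secondary_rps: int,
                    secondary_limit_rps: int) -> FailoverPlan:
    """Estimate what happens if all primary traffic moves to the secondary.

    Assumes the secondary's per-tenant limit is a hard ceiling and that its
    existing steady-state traffic keeps flowing during the shift.
    """
    headroom = max(0, secondary_limit_rps - secondary_rps)
    shiftable = min(primary_rps, headroom)
    return FailoverPlan(shiftable, primary_rps - shiftable)


# Hypothetical numbers: 400 rps on the primary, 100 rps on the secondary,
# and a secondary limit sized with 50% headroom over its steady-state share.
plan = plan_full_shift(primary_rps=400, secondary_rps=100, secondary_limit_rps=150)
print(plan)
```

With these made-up numbers the secondary can take only 50 of the 400 shifted requests per second. A limit that looks generous against steady-state load is still far short of what a full shift requires.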

Retry amplification into shared tail latency. Both providers route through the same Tier-1 ISP at a peering point that is mildly degraded. Both providers' tails get longer. Your retry policy, written for independent failures, multiplies the load. What was a one-second p99 becomes a forty-second hang on both providers simultaneously.
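A back-of-the-envelope model shows how a retry policy multiplies both load and latency. This is an illustrative sketch, not a reconstruction of any particular incident: the function names and parameters are hypothetical, and the model assumes each attempt fails independently with the same probability, which is exactly the assumption that breaks when providers share a degraded path.

```python
def expected_attempts(failure_rate: float, max_attempts: int) -> float:
    """Expected attempts per request when every failure is retried.

    Attempt k+1 happens only if the first k attempts all failed, so the
    expectation is 1 + p + p^2 + ... up to max_attempts terms.
    """
    return sum(failure_rate ** k for k in range(max_attempts))


def worst_case_hang_s(timeout_s: float, max_attempts: int,
                      backoff_s: float = 0.0) -> float:
    """Longest a caller can wait if every attempt runs to its timeout."""
    return max_attempts * timeout_s + (max_attempts - 1) * backoff_s


# Hypothetical policy: up to 4 attempts in total, 10-second timeout each.
print(expected_attempts(failure_rate=0.5, max_attempts=3))
print(worst_case_hang_s(timeout_s=10.0, max_attempts=4))
```

The model ignores the feedback loop in which the extra load raises the failure rate further, so it likely understates the problem during a correlated degradation.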

The architectural diagnosis is the same in every case. Your redundancy strategy was modeled in the contract layer and tested at the API layer, but the failure correlated at a layer below both. Adding a second vendor without mapping the substrate below it buys you procurement comfort and very little real availability.

Mapping the Layers You Do Not Own

The fix starts with a dependency map that extends past the first contractual layer. You cannot demand a complete map from your vendor — they will not share it — but you can extract enough signal to identify the major concentration risks.

The cheapest entry point is a SOC 2 Type II report. The "subservice organizations" section names the vendors your provider depends on for critical services. Read it. Catalog the named cloud, identity, observability, and CDN providers. Then ask your other "redundant" providers for their SOC 2 reports and compare. If the same name appears in the subservice section of three of your four "independent" vendors, you do not have four-way redundancy. You have one shared dependency wearing three logos, and at best one vendor that stands apart.
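The comparison itself is a small set operation once the names are cataloged. The sketch below assumes you have transcribed each report's subservice section by hand into a dictionary. The vendor and subservice names are hypothetical placeholders, and `shared_dependencies` is an illustrative helper, not part of any tool.

```python
from collections import defaultdict

# Hypothetical subservice organizations transcribed from each vendor's report.
SUBSERVICE_ORGS = {
    "vendor-1": {"cloud-x", "idp-one", "cdn-alpha"},
    "vendor-2": {"cloud-x", "idp-two"},
    "vendor-3": {"cloud-x", "cdn-alpha"},
    "vendor-4": {"cloud-y", "idp-two"},
}


def shared_dependencies(reports: dict[str, set[str]],
                        min_vendors: int = 2) -> dict[str, list[str]]:
    """Map each subservice org to the vendors naming it, if enough of them do."""
    users: dict[str, set[str]] = defaultdict(set)
    for vendor, orgs in reports.items():
        for org in orgs:
            users[org].add(vendor)
    return {
        org: sorted(vendors)
        for org, vendors in users.items()
        if len(vendors) >= min_vendors
    }


for org, vendors in sorted(shared_dependencies(SUBSERVICE_ORGS).items()):
    print(f"{org}: shared by {', '.join(vendors)}")
```

Names that differ only in spelling or legal entity will not match automatically, so some manual normalization is likely needed before the output can be trusted.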

Beyond SOC 2, contract clauses can compel disclosure that procurement is usually willing to negotiate but engineering rarely asks for. The clauses that matter:

  • A subcontractor disclosure requirement at onboarding and annually thereafter, naming both the entity and the service it provides.
  • A notification obligation when the provider adds, removes, or materially changes a fourth-party relationship that touches your traffic.
  • A "same standard" clause that requires the provider to flow your security and availability requirements down to their critical subcontractors.
  • A breach notification window for fourth-party incidents, scoped to incidents that affect your data or your availability.

These clauses do not eliminate the risk. They make the risk legible, which is the precondition for everything else.

SLOs That Survive the Failure They Were Designed For

The standard provider SLO measures one thing well — was the vendor's endpoint up — and almost nothing about whether your redundancy actually works. The team that bought a second provider needs a different metric: an outage diversity SLO that measures whether the redundancy strategy survives the failures it was designed to survive, not just the failures the vendor reports.

The shape of this SLO is simple. For each defined failure scenario (primary region brownout, identity provider degradation, transit-path congestion, retry-amplified tail latency), you record what fraction of traffic your routing layer would have shifted successfully under that scenario, and you count the scenario as survived when that fraction clears a threshold you set in advance. The numerator is "scenarios survived." The denominator is "scenarios designed for." A 99.99% endpoint SLO with a 40% outage diversity SLO means you bought a provider that is usually up but a redundancy strategy that usually does not save you.
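As a rough sketch, the metric can be computed from game-day or incident records like this. The `survive_threshold` cutoff is an assumption added here to decide when a scenario counts as survived. The scenario results are invented for illustration, and the names `ScenarioResult` and `outage_diversity` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    name: str
    shifted_fraction: float  # share of traffic the router shifted successfully


def outage_diversity(results: list[ScenarioResult],
                     survive_threshold: float = 0.95) -> float:
    """Scenarios survived divided by scenarios designed for."""
    if not results:
        raise ValueError("at least one failure scenario is required")
    survived = sum(1 for r in results if r.shifted_fraction >= survive_threshold)
    return survived / len(results)


# Invented results for the four scenarios named in the text.
RESULTS = [
    ScenarioResult("primary region brownout", 0.99),
    ScenarioResult("identity provider degradation", 0.10),
    ScenarioResult("transit-path congestion", 0.97),
    ScenarioResult("retry-amplified tail latency", 0.40),
]

print(f"outage diversity: {outage_diversity(RESULTS):.0%}")
```

Counting scenarios equally is a simplification. A team might reasonably weight them by likelihood or blast radius instead.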

This metric requires two operational disciplines that most teams skip.

First, the routing layer's health probes have to detect correlated degradation, not just per-provider availability. A probe that returns 200 from both providers tells you nothing about whether their shared transit is congested. The probe needs to measure latency distribution and error rate as a joint signal, not as two independent ones. When the joint signal degrades while the individual signals look fine, that is the substrate failing in a way the providers' dashboards will not show for another fifteen minutes.
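One way to read "joint signal" is sketched below: each provider's tail latency is compared to its own baseline, and a separate alarm fires when both drift together, even though neither crosses its individual alert threshold. The thresholds, baselines, and function names are hypothetical. The sketch covers latency only, and error rate could be folded in the same way.

```python
def p95(samples: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty sample window."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer math
    return ordered[max(0, rank - 1)]


def classify(window_a: list[float], window_b: list[float],
             baseline_a: float, baseline_b: float,
             solo_ratio: float = 3.0, joint_ratio: float = 1.5) -> str:
    """Classify two providers' latency windows against their own baselines.

    solo_ratio:  p95 inflation at which a single provider is called degraded.
    joint_ratio: lower inflation that matters only when both show it at once.
    """
    ratio_a = p95(window_a) / baseline_a
    ratio_b = p95(window_b) / baseline_b
    if ratio_a >= solo_ratio or ratio_b >= solo_ratio:
        return "provider_degraded"
    if ratio_a >= joint_ratio and ratio_b >= joint_ratio:
        return "correlated_degradation"
    return "healthy"


# Hypothetical windows (seconds): both tails are ~1.8x baseline, neither is 3x.
window_a = [1.0] * 18 + [1.8, 1.9]
window_b = [0.5] * 18 + [0.9, 1.0]
print(classify(window_a, window_b, baseline_a=1.0, baseline_b=0.5))
```

Neither provider would trip a per-provider alarm in this example, which is the point: the joint rule catches the shared-substrate case that two independent checks would miss.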

Second, the failure scenarios have to be exercised. Game-day exercises that simulate a metro-level fiber cut, a regional cloud brownout, or an identity-provider degradation are what convert paper redundancy into observed redundancy. The team that has never run the exercise has subscribed to an availability profile they cannot read off the contract.

The Postmortem You Cannot Fully Write

When the correlated outage finally lands — and for any product running on shared substrate, it will — the customer-facing postmortem is the hardest artifact to produce. You have an incident you do not own. You have a root cause you cannot verify. You have a vendor who is still investigating and will not let you quote their internal analysis. The customer wants a name and a fix.

The discipline that holds up under these conditions is to write the postmortem in two layers, both honest.

The outer layer describes what you observed and what you did. Your error rate, your traffic shift behavior, the moment your routing layer detected correlated degradation, the moment you escalated to your providers, the time-to-customer-recovery. This layer is fully under your control and fully verifiable.

The inner layer describes the upstream cause to the level you can verify from your providers' public communications, no further. "The provider's status page confirmed elevated error rates beginning at 14:02 UTC, attributed by their public update to a regional infrastructure incident at their underlying cloud provider." This sentence is true, sourced, and bounded. It does not claim to know what the cloud provider's root cause was, because you do not.

The customer wants a deeper answer. The deeper answer is what the contract did not buy you. Naming that gap in the postmortem — without blaming a vendor you cannot compel — is the most honest forward-looking artifact you can produce, and the one that justifies the next budget cycle's procurement conversation.

The Cost of Real Redundancy

Genuine redundancy across substrate layers is more expensive than the team will budget for until the first correlated outage forces the question. The cost shows up in three places.

Procurement cost. A second provider whose substrate is genuinely independent of the first is harder to find than a second logo, and the price difference is real. Cross-region or cross-cloud transit is not free. The second-best provider is usually the second-best provider for a reason.

Engineering cost. A router that handles correlated degradation, a health probe that measures joint signals, and a game-day program that exercises substrate failures are months of engineering work that produce no visible feature.

Operational cost. Running both providers warm enough to absorb a full traffic shift, rather than cold enough to be cheap, means paying for capacity you usually do not use. The economics are similar to insurance: the premium is paid in steady state, the payoff comes during the incident.

The team that did not pay these costs learns the price during a customer escalation that includes the procurement team and the legal team. The team that did pay them gets to write a shorter postmortem.

The Architectural Realization

The provider you contracted with is the visible layer of a stack you do not own. The contract describes one tier. The substrate describes the actual availability profile. Most teams optimize the tier they can sign and ignore the tier that determines whether the signing was meaningful.

Fourth-party risk is what happens when the substrate's failure modes propagate to the surface and your contract has nothing to say about it. The fix is not more vendors. The fix is more visibility — into the layers below the contract, into the correlations between supposedly independent providers, into the failure scenarios your redundancy strategy was actually designed for. The teams that build that visibility will write the shorter postmortems. The teams that do not will keep being surprised by an availability profile they never priced in.

The next time a customer asks "who is your model provider," the more useful answer is the one that names the layers beneath them, and which of those layers your team has actually modeled. That answer is harder to give. It is also the only one that survives the incident the contract did not predict.
