
The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

· 11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.

The Fleet You Did Not Mean to Build

A year ago, a typical production AI stack was one API key and one provider. Today, it is routinely a combination of: a primary frontier model for complex reasoning, a cheaper fast model for classification and routing, a fine-tuned endpoint for domain-specific tasks, a self-hosted open-weight model for data-sensitive workloads, and a fallback provider for when the primary hits rate limits or goes down.

The number of distinct models available in production grew from 253 in January 2025 to over 650 by the end of that year. Inference providers more than tripled, from 27 to 90, in the same period. Teams did not plan for this proliferation; they accumulated it one integration at a time, usually under deadline pressure.

The result is an inference fleet that has all the operational complexity of a microservices architecture, but none of the tooling, discipline, or ownership structures that make microservices manageable. The fleet exists. The SRE playbook for it mostly does not.

Why LLM Providers Are Not Like Your Other Dependencies

When you add a new microservice dependency, you get crash semantics you can reason about. The service either responds or it does not. When it fails, it fails loudly — error codes, timeouts, stack traces.

LLM providers fail differently. The documented failure modes from recent incidents include: a model update that reached 180 million users and began systematically endorsing bad decisions for three days before rollback; a load balancing change that silently routed 16% of requests to servers with a smaller effective context window; a TPU configuration error that inflated the probability of rare tokens, degrading output quality for weeks without triggering any error metric.

None of these were outages in the traditional sense. Uptime was 100%. Error rates stayed flat. The API responded with 200 OK. Monitoring saw nothing wrong because monitoring was measuring availability, not behavior.

There are also the straightforward outages. LLM provider API uptime fell from 99.66% in Q1 2024 to 99.46% in Q1 2025. Downtime therefore rose from 0.34% to 0.54% of the quarter, roughly 60% more year-over-year, as demand outpaced infrastructure scaling. That is two to four times worse than most cloud infrastructure components. Unlike cloud outages, which tend to be regional and brief, LLM provider degradations can be global, ambiguous, and long-lasting.
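The "roughly 60% more downtime" figure follows from the two uptime percentages alone. A minimal sketch of the arithmetic, assuming a 90-day quarter (the quarter length and the function names are illustrative assumptions, not from the source data):

```python
def downtime_fraction(uptime_percent: float) -> float:
    """Convert an uptime percentage into the fraction of time spent down."""
    return (100.0 - uptime_percent) / 100.0


def downtime_hours(uptime_percent: float, period_hours: float) -> float:
    """Expected hours of downtime over a period of the given length."""
    return downtime_fraction(uptime_percent) * period_hours


QUARTER_HOURS = 90 * 24  # assumption: a 90-day quarter

before = downtime_hours(99.66, QUARTER_HOURS)  # Q1 2024 uptime
after = downtime_hours(99.46, QUARTER_HOURS)   # Q1 2025 uptime
increase = after / before - 1.0                # relative growth in downtime

print(f"Q1 2024: {before:.1f} h down, Q1 2025: {after:.1f} h down")
print(f"relative increase in downtime: {increase:.0%}")
```

Under that assumption, a 0.2-point drop in uptime is about 7.3 hours of downtime per quarter becoming about 11.7 hours, an increase of about 59%.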

The combination of behavioral degradation and below-SLA availability means that the standard production engineering assumption — "if the dependency is up and returning 200, it is working" — does not hold for LLM providers.

The Three Gaps SRE Discipline Fills

Running an inference fleet without SRE practices creates three distinct operational gaps that compound into reliability problems.

Gap 1: Nobody owns the secondary provider. When a team has one LLM provider, ownership is obvious. When it has four, each picked by a different engineer for a different use case at a different time, nobody explicitly owns provider-level observability. The triage process when something degrades becomes a hunt: which provider, which endpoint, which feature team's integration?

Meanwhile the degradation continues. One August 2025 incident lasted weeks before detection precisely because no team was watching the right provider. The fix is a service catalog entry per provider — documented owner, on-call assignment, health check definition, and failure escalation path. It takes an hour to create and prevents incidents that take days to detect.
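One way to make that catalog entry concrete is a small typed record plus an audit that flags providers in production with no entry or with an empty field. This is a sketch under stated assumptions: the class, field names, and provider names below are hypothetical, chosen to mirror the four items listed above, and are not a standard schema.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ProviderCatalogEntry:
    """Hypothetical catalog record: one per LLM provider or endpoint."""
    provider: str
    owner_team: str
    oncall_rotation: str
    health_check: str
    escalation_path: tuple[str, ...]


def incomplete_fields(entry: ProviderCatalogEntry) -> list[str]:
    """Names of fields left empty (empty string or empty tuple)."""
    return [f.name for f in fields(entry) if not getattr(entry, f.name)]


def audit(catalog: dict[str, ProviderCatalogEntry],
          in_production: set[str]) -> dict[str, list[str]]:
    """Map each problematic production provider to its catalog gaps."""
    problems: dict[str, list[str]] = {}
    for name in sorted(in_production):
        entry = catalog.get(name)
        if entry is None:
            problems[name] = ["no catalog entry"]
            continue
        gaps = incomplete_fields(entry)
        if gaps:
            problems[name] = [f"missing {gap}" for gap in gaps]
    return problems


# Illustrative data only; all names are made up.
catalog = {
    "primary-frontier": ProviderCatalogEntry(
        provider="primary-frontier",
        owner_team="platform-ai",
        oncall_rotation="ai-platform-primary",
        health_check="synthetic eval every 15 minutes",
        escalation_path=("on-call engineer", "platform lead"),
    ),
    "fallback-provider": ProviderCatalogEntry(
        provider="fallback-provider",
        owner_team="platform-ai",
        oncall_rotation="",  # nobody is paged for this provider
        health_check="synthetic eval every 15 minutes",
        escalation_path=("platform lead",),
    ),
}
in_production = {"primary-frontier", "fallback-provider", "self-hosted-open-weight"}

report = audit(catalog, in_production)
print(report)
```

Running the audit in CI or on a schedule turns "nobody owns the secondary provider" from something discovered during an incident into a failing check.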

Gap 2: Capacity is measured in the wrong units. Traditional rate limiting is requests-per-second. LLM capacity is tokens-per-minute, and the relationship between requests and tokens is neither constant nor predictable. A single agentic workflow can consume 5 to 30 times more tokens than a chatbot interaction.

Multi-agent orchestration compounds this: each agent spawns sub-calls, each sub-call consumes a context window, and the total token consumption for a user action that looks like "one request" can be hundreds of thousands of tokens across dozens of API calls. Teams that plan capacity in requests regularly discover their agent-heavy workloads hit token rate limits they did not know existed, usually at peak traffic. Token budgeting — per task, per user session, per agent loop — is the capacity primitive that matters, and it requires instrumentation that most teams have not yet built.
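A token budget can be sketched as a counter keyed by scope, charged on every model call. The example below is a minimal in-memory illustration, not a production rate limiter: the class names, scope names, and limits are assumptions, and a real deployment would likely need shared storage and the provider's own token counts.

```python
class TokenBudgetExceeded(RuntimeError):
    """Raised when a charge would push a scope past its token limit."""


class TokenBudget:
    def __init__(self, limits: dict[str, int]):
        # limits maps a scope kind (e.g. "task") to its token ceiling
        self.limits = dict(limits)
        self.used: dict[tuple[str, str], int] = {}

    def charge(self, scopes: dict[str, str], tokens: int) -> None:
        """Charge tokens against every scope, e.g. {"session": "s-1"}.

        All scopes are checked before any is updated, so a rejected
        call consumes nothing.
        """
        for kind, scope_id in scopes.items():
            limit = self.limits.get(kind)
            spent = self.used.get((kind, scope_id), 0)
            if limit is not None and spent + tokens > limit:
                raise TokenBudgetExceeded(
                    f"{kind} {scope_id!r}: {spent} + {tokens} exceeds {limit}"
                )
        for kind, scope_id in scopes.items():
            key = (kind, scope_id)
            self.used[key] = self.used.get(key, 0) + tokens

    def remaining(self, kind: str, scope_id: str) -> int:
        """Tokens left in one scope."""
        return self.limits[kind] - self.used.get((kind, scope_id), 0)


# Illustrative limits; real values depend on provider quotas and cost targets.
budget = TokenBudget({"task": 50_000, "session": 200_000, "agent_loop": 20_000})
scopes = {"task": "t-1", "session": "s-1", "agent_loop": "loop-1"}

for step in range(3):
    try:
        budget.charge(scopes, 8_000)  # assumed cost of one agent sub-call
    except TokenBudgetExceeded as exc:
        print(f"step {step} rejected: {exc}")
        break

print(budget.remaining("agent_loop", "loop-1"))
```

Here the third sub-call is rejected because the agent loop would exceed its own ceiling, even though the task and session budgets still have room. That is the point of budgeting per scope: a runaway loop is stopped before it drains the shared provider quota.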

Gap 3: Behavior drift has no alert. Forty percent of production agent failures are attributed to model drift: the model's output distribution changed, not catastrophically but enough to break downstream parsing, downstream logic, or user expectations. Tool versioning issues account for the other 60%. Neither of these is an API error. Neither triggers an existing alert.

The only way to catch them is evaluation: automated checks that run representative prompts through each provider endpoint on a schedule and compare outputs against expected characteristics. This is the LLM equivalent of synthetic monitoring, and like synthetic monitoring, it feels optional until you have your first undetected drift incident.
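A scheduled evaluation of this kind can start very small. The sketch below assumes each provider is wrapped in a plain function that takes a prompt and returns text (`call_model` is a stand-in, not a real SDK call), and it checks only structural characteristics: output length, valid JSON, and required keys. The probe definitions are hypothetical.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Probe:
    """A representative prompt plus the characteristics its output must have."""
    name: str
    prompt: str
    required_keys: frozenset[str]
    max_chars: int


def check_output(probe: Probe, output: str) -> list[str]:
    """Return a list of failed expectations; empty means the output passed."""
    failures = []
    if len(output) > probe.max_chars:
        failures.append(f"output longer than {probe.max_chars} chars")
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        failures.append("output is not valid JSON")
        return failures
    if not isinstance(parsed, dict):
        failures.append("output is not a JSON object")
        return failures
    missing = probe.required_keys - set(parsed)
    if missing:
        failures.append(f"missing keys: {sorted(missing)}")
    return failures


def run_probes(call_model: Callable[[str], str],
               probes: list[Probe]) -> dict[str, list[str]]:
    """Run every probe against one endpoint; return only the failing probes."""
    results: dict[str, list[str]] = {}
    for probe in probes:
        try:
            output = call_model(probe.prompt)
        except Exception as exc:  # a failed call counts as a probe failure
            results[probe.name] = [f"call failed: {exc!r}"]
            continue
        failures = check_output(probe, output)
        if failures:
            results[probe.name] = failures
    return results


probes = [
    Probe(
        name="classify-spam",
        prompt="Classify this message as spam or ham. Reply with JSON only.",
        required_keys=frozenset({"label", "confidence"}),
        max_chars=200,
    )
]


# Stand-ins for real provider calls: one healthy, one that has drifted.
def healthy_provider(prompt: str) -> str:
    return '{"label": "spam", "confidence": 0.9}'


def drifted_provider(prompt: str) -> str:
    return 'Sure! Here is the JSON: {"label": "spam", "confidence": 0.9}'


print(run_probes(healthy_provider, probes))
print(run_probes(drifted_provider, probes))
```

Both stand-in providers would return 200 OK in production, but only the second one breaks a downstream JSON parser. Running `run_probes` per endpoint from a scheduler and alerting on any non-empty result gives drift the alert it otherwise lacks.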

Building the Service Catalog

The service catalog is the starting point for turning an inference fleet from an implicit dependency graph into a managed infrastructure. For each LLM provider or endpoint in production, the catalog should contain:
