
The Inference Fleet: Applying SRE Discipline to Multi-Provider LLM Dependencies

11 min read
Tian Pan
Software Engineer

Here is a failure mode that does not show up on any dashboard until it is too late: your production system is silently degrading because a secondary LLM provider started returning malformed responses three days ago, nobody owns that provider in your on-call rotation, and the only signal is a slow uptick in user-reported errors that your support team has not yet escalated. You find out when a customer cancels.

This is not a model quality problem. It is an operational discipline problem. And it is becoming more common as production AI stacks grow from a single OpenAI integration into a multi-provider, multi-endpoint sprawl that nobody designed as a fleet — but that is what it has become.

The Fleet You Did Not Mean to Build

A year ago, a typical production AI stack was one API key and one provider. Today, it is routinely a combination of: a primary frontier model for complex reasoning, a cheaper fast model for classification and routing, a fine-tuned endpoint for domain-specific tasks, a self-hosted open-weight model for data-sensitive workloads, and a fallback provider for when the primary hits rate limits or goes down.

The number of distinct models available in production has grown from 253 in January 2025 to over 650 by the end of that year. Inference providers tripled from 27 to 90 in the same period. Teams did not plan for this proliferation — they accumulated it one integration at a time, usually under deadline pressure.

The result is an inference fleet that has all the operational complexity of a microservices architecture, but none of the tooling, discipline, or ownership structures that make microservices manageable. The fleet exists. The SRE playbook for it mostly does not.

Why LLM Providers Are Not Like Your Other Dependencies

When you add a new microservice dependency, you get crash semantics you can reason about. The service either responds or it does not. When it fails, it fails loudly — error codes, timeouts, stack traces.

LLM providers fail differently. The documented failure modes from recent incidents include: a model update that reached 180 million users and began systematically endorsing bad decisions for three days before rollback; a load balancing change that silently routed 16% of requests to servers with a smaller effective context window; a TPU configuration error that inflated the probability of rare tokens, degrading output quality for weeks without triggering any error metric.

None of these were outages in the traditional sense. Uptime was 100%. Error rates stayed flat. The API responded with 200 OK. Monitoring saw nothing wrong because monitoring was measuring availability, not behavior.

There are also the straightforward outages. LLM provider API uptime fell from 99.66% in Q1 2024 to 99.46% in Q1 2025, representing roughly 60% more downtime year-over-year as demand outpaced infrastructure scaling. That is still two to four times worse than most cloud infrastructure components. Unlike cloud outages, which tend to be regional and brief, LLM provider degradations can be global, ambiguous, and long-lasting.

The combination of behavioral degradation and below-SLA availability means that the standard production engineering assumption — "if the dependency is up and returning 200, it is working" — does not hold for LLM providers.

The Three Gaps SRE Discipline Fills

Running an inference fleet without SRE practices creates three distinct operational gaps that compound into reliability problems.

Gap 1: Nobody owns the secondary provider. When a team has one LLM provider, ownership is obvious. When it has four, each picked by a different engineer for a different use case at a different time, nobody explicitly owns provider-level observability. The triage process when something degrades becomes a hunt: which provider, which endpoint, which feature team's integration?

Meanwhile the degradation continues. One August 2025 incident lasted weeks before detection precisely because no team was watching the right provider. The fix is a service catalog entry per provider — documented owner, on-call assignment, health check definition, and failure escalation path. It takes an hour to create and prevents incidents that take days to detect.

Gap 2: Capacity is measured in the wrong units. Traditional rate limiting is requests-per-second. LLM capacity is tokens-per-minute, and the relationship between requests and tokens is neither constant nor predictable. A single agentic workflow can consume 5 to 30 times more tokens than a chatbot interaction.

Multi-agent orchestration compounds this: each agent spawns sub-calls, each sub-call consumes a context window, and the total token consumption for a user action that looks like "one request" can be hundreds of thousands of tokens across dozens of API calls. Teams that plan capacity in requests regularly discover their agent-heavy workloads hit token rate limits they did not know existed, usually at peak traffic. Token budgeting — per task, per user session, per agent loop — is the capacity primitive that matters, and it requires instrumentation that most teams have not yet built.
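
To make that concrete, here is a minimal sketch of per-task token budgeting in Python. The TokenBudget class, its limits, and the call sites are illustrative rather than a specific library; the point is that every sub-call in an agent loop charges against one shared ceiling for the user action.

```python
from dataclasses import dataclass, field

@dataclass
class TokenBudget:
    """Tracks token spend for one logical task (user action, session, agent loop)."""
    limit: int                          # hard ceiling for the whole task, in tokens
    spent: int = 0
    calls: list = field(default_factory=list)

    def charge(self, provider: str, prompt_tokens: int, completion_tokens: int) -> None:
        used = prompt_tokens + completion_tokens
        self.spent += used
        self.calls.append((provider, used))
        if self.spent > self.limit:
            # Fail the task loudly instead of letting an agent loop run away.
            raise RuntimeError(
                f"token budget exceeded: {self.spent}/{self.limit} "
                f"across {len(self.calls)} calls"
            )

# Usage: one budget object per user action, threaded through every sub-call.
budget = TokenBudget(limit=200_000)     # illustrative ceiling for an agentic task
budget.charge("primary", prompt_tokens=3_200, completion_tokens=850)
```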

Gap 3: Behavior drift has no alert. Forty percent of production agent failures are attributed to model drift: the model's output distribution changed, not catastrophically but enough to break downstream parsing, downstream logic, or user expectations. Tool versioning issues cause another 60%. Neither of these is an API error. Neither triggers an existing alert.

The only way to catch them is evaluation: automated checks that run representative prompts through each provider endpoint on a schedule and compare outputs against expected characteristics. This is the LLM equivalent of synthetic monitoring, and like synthetic monitoring, it feels optional until you have your first undetected drift incident.
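
As a sketch, the check can be as simple as a scheduled job that replays a fixed set of prompts and asserts on coarse output characteristics. The call_provider function, the golden prompts, and the pass-rate threshold below are placeholders, not a specific framework; real evaluation suites are larger, but the shape is the same.

```python
import json

# Representative prompts with cheap, deterministic expectations.
GOLDEN_PROMPTS = [
    {"prompt": "Classify the sentiment of 'great service' as JSON {\"label\": ...}",
     "check": lambda out: json.loads(out)["label"].lower() == "positive"},
    {"prompt": "Extract the year from: 'Founded in 1998.' Reply with digits only.",
     "check": lambda out: out.strip() == "1998"},
]

def run_drift_check(call_provider, endpoint: str, min_pass_rate: float = 0.9) -> bool:
    """Replay golden prompts against one endpoint; return False if quality drifted."""
    passed = 0
    for case in GOLDEN_PROMPTS:
        try:
            output = call_provider(endpoint, case["prompt"])
            if case["check"](output):
                passed += 1
        except Exception:
            pass  # a failed call counts as a failed check
    pass_rate = passed / len(GOLDEN_PROMPTS)
    return pass_rate >= min_pass_rate    # feed this into alerting, not just logs
```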

Building the Service Catalog

The service catalog is the starting point for turning an inference fleet from an implicit dependency graph into managed infrastructure. For each LLM provider or endpoint in production, the catalog should contain the following (a minimal sketch of one entry follows the list):

  • Endpoint identity: the specific model version pinned (not "gpt-4" but "gpt-4-2024-11-20"), the provider, the use case it serves, and the team that owns it.
  • SLO definition: a per-provider target for availability, p95 latency, and error rate, derived from observed historical behavior since providers rarely publish formal SLAs. Leading teams set internal SLOs 1-2% tighter than observed uptime to trigger alerts before customer-facing SLAs break.
  • Failure mode documentation: what this provider does when it degrades (silent quality drop, hard timeout, rate limit error), which downstream features depend on it, and what the fallback path is.
  • Cost envelope: expected token consumption per task, daily and monthly budget ceilings, and who gets paged when spend spikes.
  • Deprecation registry: the model's expected end-of-life date, the migration target, and the testing checklist required before cutover.
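
Here is a minimal sketch of one catalog entry as structured data. The field names and values are illustrative; a YAML file or an existing internal service catalog tool works just as well, as long as every production endpoint has an entry.

```python
from dataclasses import dataclass

@dataclass
class ProviderCatalogEntry:
    # Endpoint identity
    model_version: str            # pinned snapshot, not a floating alias
    provider: str
    use_case: str
    owning_team: str
    # SLO definition (internal targets derived from observed behavior)
    availability_slo: float       # e.g. 0.995
    p95_latency_ms: int
    max_error_rate: float
    # Failure modes and fallback
    degradation_modes: list[str]
    fallback_endpoint: str | None
    # Cost envelope
    expected_tokens_per_task: int
    monthly_budget_usd: int
    cost_pager: str
    # Deprecation registry
    end_of_life: str | None       # ISO date if announced
    migration_target: str | None

entry = ProviderCatalogEntry(
    model_version="gpt-4-2024-11-20", provider="openai",
    use_case="complex reasoning", owning_team="platform-ai",
    availability_slo=0.995, p95_latency_ms=2000, max_error_rate=0.02,
    degradation_modes=["silent quality drop", "rate limit"],
    fallback_endpoint="claude-fallback",
    expected_tokens_per_task=8_000, monthly_budget_usd=12_000,
    cost_pager="platform-ai-oncall",
    end_of_life=None, migration_target=None,
)
```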

The last item matters more than most teams realize. Provider deprecation notices typically arrive with 14 days' warning. For a team with a dozen services depending on a model, 14 days is not enough to test, stage, and migrate safely. Teams with a deprecation registry know which services depend on which model versions and can start migration the day the notice arrives rather than the day panic sets in.

The Gateway Layer

The routing and fallback layer is where operational discipline becomes concrete code. The pattern that has emerged in production is an LLM API gateway — a proxy that sits between application code and provider APIs, handling routing, retries, fallbacks, rate limiting, and observability.

Several tools implement this pattern: LiteLLM provides a self-hosted proxy with a unified OpenAI-compatible interface across 100+ providers, configurable fallback chains, and token budget controls. OpenRouter provides a managed API-key-level interface across 300+ models with automatic transparent fallback. Helicone adds health-aware circuit breaking. The tools differ in hosting model, governance requirements, and routing sophistication, but they share a common architectural premise: the application does not talk to providers directly; it talks to the gateway, and the gateway manages fleet-level concerns.

One operational risk worth noting: vendor consolidation in this space has been rapid. Portkey, a popular routing platform that many teams adopted in 2024 and 2025, was acquired by Palo Alto Networks and repurposed as a security gateway — a different product with different priorities. Teams that had built their fleet management strategy around a specific standalone vendor found themselves replanning on short notice. The mitigation is treating the gateway layer as infrastructure you own, not a SaaS dependency you consume. Whether you run LiteLLM or build a thin routing layer yourself, the business logic for fallback chains and SLO enforcement should be in your codebase, not in a vendor's managed service.
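
As a sketch of what that thin, self-owned routing layer might look like, here is a health-aware fallback chain in a few dozen lines of Python. The provider callables and thresholds are placeholders; production versions add retries, token budgets, and streaming on top of this core loop.

```python
import time

class ProviderRouter:
    """Routes each call down an ordered fallback chain, skipping unhealthy providers."""

    def __init__(self, chain, error_rate_threshold=0.02, window=50):
        self.chain = chain                      # list of (name, callable) in priority order
        self.threshold = error_rate_threshold
        self.window = window
        self.history = {name: [] for name, _ in chain}   # recent success/failure booleans

    def _healthy(self, name):
        recent = self.history[name][-self.window:]
        if len(recent) < 10:
            return True                          # not enough data to distrust a provider
        error_rate = 1 - (sum(recent) / len(recent))
        return error_rate <= self.threshold

    def complete(self, prompt):
        for name, call in self.chain:
            if not self._healthy(name):
                continue                         # circuit open: skip straight to fallback
            start = time.monotonic()
            try:
                result = call(prompt)
                self.history[name].append(True)
                return name, result, time.monotonic() - start
            except Exception:
                self.history[name].append(False)
        raise RuntimeError("all providers in the fallback chain are unavailable")
```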

SLO Tracking Without Provider SLAs

The most common objection to SRE practices for LLM fleets is: "How do we set SLOs when our providers don't publish SLAs?" The answer is that SLOs are internal targets, not contractual promises. You are measuring your system's reliability as experienced by your users, not holding a vendor to a contract.

Practical SLO construction for multi-provider setups starts with measurement: instrument every provider call with latency, token consumption, error code, and model version. Aggregate these into per-provider dashboards covering availability (requests succeeding), latency (p95 and p99), and quality (evaluation pass rate from synthetic monitoring). Set SLOs based on 90-day historical baselines, then tighten them incrementally.
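
A minimal sketch of that measurement step, assuming a generic provider client that returns a dict-like response and an emit_metric hook into whatever metrics pipeline already exists; both are placeholders, not a specific SDK.

```python
import time

def instrumented_call(provider_name, client, prompt, emit_metric):
    """Wrap a provider call so every request records latency, tokens, status, and model version."""
    start = time.monotonic()
    tags = {"provider": provider_name}
    try:
        response = client(prompt)                          # assumed dict-like response
        tags["model_version"] = response.get("model", "unknown")
        tags["status"] = "ok"
        usage = response.get("usage", {})
        emit_metric("llm.tokens", usage.get("total_tokens", 0), tags)
        return response
    except Exception as exc:
        tags["status"] = type(exc).__name__
        raise
    finally:
        emit_metric("llm.latency_ms", (time.monotonic() - start) * 1000, tags)
        emit_metric("llm.requests", 1, tags)
```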

The three-tier SLO structure that works in production: a fleet SLO that accounts for fallback behavior ("95% of requests complete within 2 seconds, whether on the primary provider or fallback"), per-provider SLOs that trigger routing decisions ("if provider A exceeds 2% error rate over 5 minutes, route to provider B"), and cost SLOs that enforce budget discipline ("task cost does not exceed 3x the expected token budget"). The third tier is novel compared to traditional SRE practice, but it is the one that prevents runaway-cost incidents — one documented case escalated from $127 to $47,000 per week over 11 days before anyone noticed.
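
The cost tier is straightforward to enforce once token consumption is instrumented. A sketch follows, using the 3x multiplier from above; the task names, prices, and alert hook are illustrative.

```python
def check_cost_slo(task_name, tokens_spent, expected_tokens, price_per_1k_usd,
                   multiplier=3.0, alert=print):
    """Flag any task whose realized spend exceeds the budgeted token envelope."""
    actual_cost = tokens_spent / 1000 * price_per_1k_usd
    budget_cost = expected_tokens / 1000 * price_per_1k_usd * multiplier
    if actual_cost > budget_cost:
        alert(f"{task_name}: ${actual_cost:.2f} spent vs ${budget_cost:.2f} budget "
              f"({tokens_spent} tokens): page the cost owner")
        return False
    return True

# Example: an agent loop that burned 90k tokens against an 8k-token expectation.
check_cost_slo("report-summarizer", tokens_spent=90_000, expected_tokens=8_000,
               price_per_1k_usd=0.01)
```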

The Operational Cadence

Mature inference fleet management requires a regular operational cadence that most teams do not yet have.

Weekly: Review per-provider error rates and latency trends. Check for model version changes in provider changelogs. Scan cost dashboards for token consumption anomalies.

Monthly: Run evaluation benchmarks on all production endpoints to detect quality drift. Review the deprecation registry for approaching end-of-life dates. Audit on-call ownership for newly added providers that may have been added without explicit assignment.

On model update: Any provider that silently updates a model version (which some do without announcement) should trigger a synthetic evaluation run. Pin to explicit model version identifiers rather than semantic aliases like "latest" or "turbo" — aliases shift under you without notice.
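
A small sketch of enforcing that rule mechanically: compare the model identifier echoed back in each response against the catalog's pinned version, and kick off the synthetic evaluation run when they diverge. The function names and version strings below are placeholders, not real releases.

```python
def verify_pinned_version(response_model: str, pinned_model: str,
                          trigger_eval_run, alert) -> None:
    """Detect silent model swaps by checking the version echoed back by the provider."""
    if response_model != pinned_model:
        alert(f"model version changed: expected {pinned_model}, got {response_model}")
        trigger_eval_run(response_model)   # run the golden-prompt suite before trusting it

# Example: a provider that starts answering as a newer snapshot than the one pinned.
verify_pinned_version(
    response_model="gpt-4-2025-01-15",     # illustrative identifiers
    pinned_model="gpt-4-2024-11-20",
    trigger_eval_run=lambda m: print(f"queueing eval for {m}"),
    alert=print,
)
```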

On provider incident: After any provider degradation event, run a postmortem that asks not just "what failed?" but "how long before we detected it, and what would have caught it faster?" Accumulate these into your monitoring strategy.

The cadence is not particularly burdensome. The monitoring setup is an afternoon of work. The service catalog is a living document, not a project. The discipline required is the same discipline that makes traditional infrastructure reliable: ownership, observability, and the habit of treating dependencies as things that will fail, not things that might.

Where This Is Headed

The inference fleet problem is not going away. Model proliferation continues. Multi-agent architectures are pushing token consumption higher and making capacity planning harder. Provider competition means capabilities shift faster than most teams can track.

The teams that navigate this well are not the ones that picked the best providers or the smartest model routing algorithms. They are the ones that applied operational discipline early: service catalogs before the fleet became unmanageable, SLOs before the first missed incident, deprecation registries before the first 14-day notice arrived. The technical complexity of multi-provider AI infrastructure is largely solved by existing tools. The operational complexity requires the same habits that made distributed systems reliable — documented ownership, clear failure modes, and monitoring that reflects what users actually experience.

The inference fleet is infrastructure. Treat it like infrastructure, before an incident teaches you to.
