
The Fallback That Became the Default: Why Your Tier Mix Needs an SLO

Tian Pan · Software Engineer · 11 min read

The dashboard says the fallback fires on 0.5% of requests. The dashboard has been saying that for six months. Then someone re-runs telemetry from scratch and finds the secondary model is serving 38% of traffic and the canned-response tier is serving another 9%. The frontier-model "primary path" the team has been talking about in roadmap reviews is, in fact, the minority experience. Nobody noticed because no single alert ever fired — every demotion was a small, well-justified, locally correct decision, and the cumulative drift never crossed any threshold someone had thought to set.

This is the failure mode I want to name: the fallback that became the default. It is not an outage. It is not a regression in any single component. It is a slow rotation of the product surface where the degraded path stops being a safety net and starts being the experience. The team's mental model and production reality drift apart, and the gap is invisible because the only meters in place are designed to detect failure, not to detect mix.

I'll claim something stronger: if your AI feature has more than two tiers of service, your tier mix is itself an SLO, and if you aren't measuring it, you don't actually know what you ship.

How the Default Quietly Moves

The mechanism is rarely a single bad decision. It is a sequence of locally reasonable ones, each with a writeup, each with a cost-savings or reliability rationale, each shipped with some level of approval. Each leaves a permanent demotion in the routing layer that nobody re-evaluates.

A common shape (a code sketch follows the list):

  • A "smart router" lands. It demotes any prompt below some complexity score to the cheaper secondary model under the rationale "save cost when we can." The launch is gated to 5% to "start safe," then ramped to 50% during a week the on-call rotation is busy with an unrelated incident, then never moved off because the cost dashboard looks great.
  • A circuit breaker tripped during an incident in March. The incident was real; the breaker did its job. The post-incident remediation closed the underlying issue but never re-armed the breaker, because the breaker stopped paging once the dashboard was green and nobody owns the half-open probe.
  • A token-spend cost guard was added that drops to canned responses when daily spend crosses a threshold. The threshold has been hit for sixty consecutive days. The alert was originally noisy, so someone routed it to a low-priority channel, where it now scrolls by unread.
  • A response cache was promoted from "advisory" to "preferred" during a capacity scare. The scare ended. The promotion did not.
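To make the accumulation concrete, here is a minimal sketch of what a routing layer looks like after a couple of years of these decisions. Everything here is hypothetical (the switch names, the thresholds, the ordering), but the shape is the point: every branch is defensible on its own, and nothing records what fraction of traffic each branch ends up serving.

```python
from dataclasses import dataclass

@dataclass
class RoutingState:
    cache_preferred: bool = True            # promoted during a capacity scare, never demoted
    cost_guard_enabled: bool = True         # added after a spend spike
    breaker_open: bool = True               # tripped in March, never re-armed
    complexity_router_enabled: bool = True  # the "smart router" launch
    daily_budget: float = 5_000.0           # made-up threshold, hit daily
    complexity_floor: float = 0.6           # made-up demotion cutoff

def choose_tier(prompt_complexity: float, cache_hit: bool,
                spend_today: float, st: RoutingState) -> str:
    """Every branch below was a locally reasonable decision with a writeup.
    Nothing in this function measures the mix it produces."""
    if st.cache_preferred and cache_hit:
        return "cache"
    if st.cost_guard_enabled and spend_today > st.daily_budget:
        return "canned"     # trips late in the day, every day, for sixty days
    if st.breaker_open:
        return "secondary"  # the half-open probe nobody owns
    if st.complexity_router_enabled and prompt_complexity < st.complexity_floor:
        return "secondary"  # "save cost when we can"
    return "primary"        # the path the roadmap reviews talk about
```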

Each of these is a feature flag that became part of the steady state. Stale flags don't fail loudly on the day they become unnecessary; they keep two realities alive at once — the path you think is current, and the path your software is actually executing. Knight Capital lost $460 million in 45 minutes when a repurposed flag reactivated an obsolete code path. AI fallbacks have the same shape, just with quieter consequences: the user gets a worse answer, not a wrong trade.

Why Latency and Error SLOs Don't Catch It

A team that has done the SRE basics well still misses this. The reason is simple: every standard SLO is a per-request meter. p95 latency on the secondary model can be better than the primary. Error rate on canned responses is by definition zero. If you serve every request from the canned tier, your error SLO will be the healthiest it has ever been. The meters are blind to the question that actually matters: which tier did we serve from?

The standard observability checklist for AI workloads — token throughput, latency, error rate, retry count, fallback firings — emits the right signals to detect an event but not a mix. Fallback firing rate is the closest, but most teams measure it as "did we fall back this request" rather than "what fraction of requests got served by the path the product is supposed to deliver." Those are different questions, and the second one is the one users are scoring you on.
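The difference is easy to see in code. A sketch, with a hypothetical request log: the event counter most dashboards carry only increments when a runtime fallback handler fires mid-request, so demotions made at admission time (the smart router, the cost guard) never touch it. The served-tier share counts every request.

```python
from collections import Counter

# Hypothetical request records: the tier that actually served the answer,
# plus whether the runtime fallback handler fired during the request.
requests = [
    {"served_tier": "secondary", "runtime_fallback_fired": False},  # router demotion
    {"served_tier": "secondary", "runtime_fallback_fired": False},  # router demotion
    {"served_tier": "secondary", "runtime_fallback_fired": False},  # router demotion
    {"served_tier": "canned",    "runtime_fallback_fired": False},  # cost guard at admission
    {"served_tier": "primary",   "runtime_fallback_fired": False},
]

# The metric most teams have: "did we fall back this request?"
event_rate = sum(r["runtime_fallback_fired"] for r in requests) / len(requests)

# The metric users score you on: "what fraction came from the promised path?"
primary_share = Counter(r["served_tier"] for r in requests)["primary"] / len(requests)

print(f"fallback event rate: {event_rate:.0%}")    # 0% here: the dashboard looks perfect
print(f"primary-tier share:  {primary_share:.0%}") # 20% here: the number nobody plotted
```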

This is a reframing that takes some unlearning. SREs are trained to alert on threshold breaches: latency over X, errors over Y. Tier mix doesn't breach a threshold so much as it slides. A 1% per-week drift in primary-tier share is invisible to any conventional alert and devastating over a quarter. You need a meter whose entire job is to notice slow movement.

Tier Mix as a First-Class SLO

The fix is to treat tier mix as a service-level objective with the same standing as latency and error rate. Concretely:

  • Define a primary-tier share target. "97% of requests served by the primary tier on a 7-day rolling window" is a perfectly good SLO. Pick the number that matches your product position; if the team has been telling customers "powered by frontier models," 97% is generous. If the product narrative is "intelligent routing across tiers," the number might be lower, but it must exist and it must be defended.
  • Burn-rate alert on the share, not just the events. Borrow the multi-window burn-rate alert pattern from latency SLOs. A slow drift over weeks should trigger as surely as a fast drop over hours. The point is to catch the rotation that no single event would surface. (A sketch follows this list.)
  • Plot the mix over time as a stacked area chart, not a number. Stacked area is the visualization that surfaces drift; a single percentage hides it. A quarterly review where someone scrolls back six months and sees the green band shrinking is worth more than ten dashboards that report "fallback rate: nominal."
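Here is what the burn-rate half of that looks like as a sketch. The target, the window pairs, and the burn thresholds below are placeholders to tune for your traffic; the structure (a fast pair that pages on a sudden rotation, a slow pair that tickets on a drift) is the part borrowed from the standard multi-window pattern.

```python
SLO_TARGET = 0.97                # primary-tier share, 7-day rolling window
ERROR_BUDGET = 1.0 - SLO_TARGET  # 3% of requests may be served off-primary

def burn_rate(non_primary_share: float) -> float:
    """How fast the mix consumes the error budget; 1.0 means exactly on budget."""
    return non_primary_share / ERROR_BUDGET

def mix_alert(share_1h: float, share_6h: float, share_3d: float) -> str | None:
    # Fast burn: a breaker trips or a cost guard sticks open.
    if burn_rate(1 - share_1h) > 14 and burn_rate(1 - share_6h) > 14:
        return "page: primary share collapsing"
    # Slow burn: the 1%-per-week drift no event-based alert sees.
    if burn_rate(1 - share_6h) > 2 and burn_rate(1 - share_3d) > 2:
        return "ticket: primary share drifting"
    return None

print(mix_alert(share_1h=0.96, share_6h=0.93, share_3d=0.92))
# prints "ticket: primary share drifting"; no single hour looked alarming
```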

The reason this works is that it inverts the failure mode. The current default is "no event fires unless something looks broken," and what's broken here is that nothing looks broken. A mix-share meter is designed to fire when the steady state itself shifts. It is the meter you need precisely because the failure has no spike.

Three Disciplines That Keep the Mix Honest

Naming the SLO is necessary but not sufficient. Three operational disciplines have to land beside it.

Every fallback path is born with a metric and a default-on alert. This is a cultural rule, not a tooling one. When an engineer adds a new degradation switch, the launch checklist requires a meter for "what fraction of traffic are we serving from this tier" plus an alert at a threshold the team commits to. No new fallback can ship without both. This sounds bureaucratic; in practice it takes ten minutes per launch and saves the post-incident archaeology of asking "when did this start serving so much traffic?"
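One way to make the rule self-enforcing is to put it in the registration path. A sketch, with a hypothetical registry (this is not any particular metrics library's API): a fallback tier that arrives without a mix metric and an alert threshold simply fails to register.

```python
class FallbackRegistry:
    """Hypothetical launch-checklist gate: no mix metric, no fallback."""

    def __init__(self) -> None:
        self.tiers: dict[str, dict] = {}

    def register(self, tier: str, owner: str,
                 mix_metric: str | None, alert_threshold: float | None) -> None:
        if mix_metric is None or alert_threshold is None:
            raise ValueError(
                f"fallback '{tier}' cannot ship without a mix metric and a "
                f"default-on alert; got metric={mix_metric}, threshold={alert_threshold}")
        self.tiers[tier] = {"owner": owner, "metric": mix_metric,
                            "alert_threshold": alert_threshold}

registry = FallbackRegistry()
registry.register("canned", owner="platform-oncall",
                  mix_metric="served_tier_share{tier='canned'}",
                  alert_threshold=0.02)  # page if canned serves >2% of traffic
```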

Periodic re-arm exercises for every degradation switch. Borrowing from chaos-engineering practice: every quarter, the team walks the list of every circuit breaker, cost guard, complexity-score router, and cache-promotion lever. Each one gets a yes/no answer to two questions — is this still needed, and is its trip condition still calibrated? Anything answered "I don't know" becomes a ticket. The point is not to remove fallbacks; it is to reset the assumption that they are temporary state. Without this exercise, every emergency lever from the past two years is permanently engaged.

Owner-of-mix as a named role. Every team that adds a fallback owns its addition; nobody owns the mix. This is the org failure that lets the drift happen. The mix is a cross-cutting product surface, and someone — usually the on-call lead or the platform team — needs to be the named owner of the question "what tier does the average user actually get?" Their job during the quarterly review is to defend the chart.

The Cost and Trust Frames

Two framings help when the SLO conversation hits resistance from leadership.

The cost frame is sharp: when 38% of traffic is silently routed to the secondary tier, the team is paying for the primary tier's reserved capacity and serving the secondary tier's quality. It is the worst of both worlds — the capacity reservation does not benefit the user, and the quality the user receives does not match what the team is paying to enable. Naming this in dollars during a quarterly business review tends to surface the SLO faster than any reliability argument.
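The arithmetic that makes the frame land is short. With made-up numbers matching the mix above:

```python
# Hypothetical QBR math: primary capacity is reserved for ~all traffic,
# but only 53% of requests land there (100% - 38% secondary - 9% canned).
monthly_primary_reservation = 120_000.0  # $ committed to primary capacity (made up)
primary_share = 0.53
idle_spend = monthly_primary_reservation * (1 - primary_share)
print(f"reserved-but-unused primary spend: ${idle_spend:,.0f}/month")  # $56,400
```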

The trust frame is harder but more important. The user has been receiving the canned response for two months and has concluded the feature is broken. They are not wrong; they are measuring something the team is not. Their internal counter is "did this product help me," and a long enough run of canned responses pushes that counter into the territory where the user disengages and tells their colleagues the feature is bad. The team's dashboards say uptime is 99.95%. Both numbers are true. The user's number is the one that determines whether the product survives.

The Fallback Path Is a Product Surface

The architectural realization that resolves this is that the fallback path is not infrastructure. It is a product surface. Every tier in your degradation hierarchy is a version of your product, served to some fraction of users, and the mix of tiers is the actual product you ship. The team that designs the primary tier with care and ships the secondary tier as "good enough for fallback" will, over time, ship the secondary tier as the user's experience without ever deciding to.

Two consequences follow.

First, the secondary tier deserves its own quality work. If 38% of traffic is going to land there, "an okay 7B model behind a flag" is no longer an acceptable answer. The team that owns the primary path should own the cliff between primary and secondary as a product spec — what specifically degrades, how the user is informed (or not), what the answer looks like compared to the primary path. The cliff is currently a wishful "small drop in helpfulness" and is in practice a step change the user notices.

Second, the canned-response tier needs an honest audit. Most canned-response tiers were written for a 0.5% case, and they show. They are stilted, they refuse to engage with the actual question, they say variations of "I'm having trouble right now." For the 0.5% case those answers are fine. For the 9% case they are a liability, because the same user keeps hitting them and concludes the feature does not work. Either invest in canned-tier quality or invest in not serving so much traffic from it. Continuing to claim the tier is rare while it serves nearly a tenth of traffic is a posture, not an engineering decision.

What to Do This Week

If this post describes your system and you don't yet know it, three concrete things produce signal fast.

Run a one-shot query against your routing layer's logs for the last 30 days that reports request count grouped by served-tier. Plot it as a stacked area chart with day on the x-axis. The chart is either flat (you're fine) or sloping (you have a story to tell). The query takes an afternoon; the chart will reorganize the next sprint.
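A sketch of that query, assuming the routing logs land somewhere a dataframe can read and carry a timestamp and a served-tier column (adjust the names to your schema):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical export of the last 30 days of routing decisions.
logs = pd.read_parquet("routing_logs_last_30d.parquet")

# Requests per day per tier, then normalize each day to shares that sum to 1.0.
daily_mix = (
    logs.assign(day=pd.to_datetime(logs["ts"]).dt.date)
        .groupby(["day", "served_tier"]).size()
        .unstack(fill_value=0)
)
daily_share = daily_mix.div(daily_mix.sum(axis=1), axis=0)

daily_share.plot.area(figsize=(10, 4), title="Served-tier mix, last 30 days")
plt.ylabel("share of requests")
plt.savefig("tier_mix.png")
```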

List every degradation switch in the system — every circuit breaker, cost guard, complexity-score router, response-cache promotion, fallback model selector. For each, write down who owns it, when it last changed state, and what its trip condition is. Half of them will not have an owner. That list is your re-arm backlog.
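The list is worth keeping as structured data rather than a wiki page, because the blanks are the output. A sketch, with hypothetical fields and one made-up entry; the two quarterly re-arm questions from the previous section show up here as nullable fields:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class DegradationSwitch:
    name: str
    kind: str             # breaker | cost_guard | router | cache_promotion
    owner: str | None     # None means nobody owns it: that is a ticket, not a shrug
    last_state_change: date | None
    trip_condition: str   # "daily spend > $X", "5xx rate > Y for 5m", ...
    still_needed: bool | None = None           # re-arm question 1, answered quarterly
    trip_still_calibrated: bool | None = None  # re-arm question 2, answered quarterly

inventory = [
    DegradationSwitch("march-breaker", "breaker", owner=None,
                      last_state_change=date(2024, 3, 14),
                      trip_condition="5xx rate > 10% for 5m"),
]

# Anything unowned or unanswered goes straight onto the re-arm backlog.
rearm_backlog = [
    s for s in inventory
    if s.owner is None or s.still_needed is None or s.trip_still_calibrated is None
]
```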

Pick a target primary-tier share number — 95%, 97%, whatever defends your product narrative — and write it into the team's SLO doc this quarter. Make it as load-bearing as latency. The number will be wrong; that's fine. The act of committing to a number is what creates the pressure to measure and defend it.

The fallback that became the default is not a bug in any single component. It is what happens when no one is responsible for the steady state. The team that adds an SLO for tier mix, names an owner, and re-arms its degradation switches on a cadence is choosing what most users see. The team that doesn't is choosing too — they're just letting the choice be made by the accumulation of unreviewed decisions, and finding out months later that the product they ship and the product they think they ship are different products.
