The Brownout Pattern: When Your LLM Provider Is Slow but Not Down
The pager that wakes you at 3 a.m. for an outage is the easy one. The provider returned 503 for forty minutes, your fallback kicked in, your runbook fired, your post-mortem writes itself. The pager that does not wake you — the one that lets your support queue fill up over six hours while every dashboard stays green — is the brownout. The provider's API still answers. The status page still says "operational." Your p99 latency has quietly drifted from 2.1 seconds to 14 seconds, your error rate from 0.1% to 4%, and the only people who noticed are the users who already left.
Provider availability is not binary. The fallback story most teams write — "if provider is down, switch to backup" — is a state machine with two states for a continuous variable, and it does not fire when the provider is sad rather than dead. Building for brownouts is a different design problem than building for outages, and almost every production agent harness I have seen ships without solving it.
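To make the "two states for a continuous variable" point concrete, here is a minimal Python sketch of a health check with a third state between healthy and down. Everything in it is a hypothetical illustration rather than any provider's or library's API: the names `Health` and `ProviderHealth`, the window size, and all three thresholds are assumptions you would tune against your own baseline.

```python
from collections import deque
from enum import Enum


class Health(Enum):
    HEALTHY = "healthy"
    DEGRADED = "degraded"  # the brownout state a binary up/down check cannot express
    DOWN = "down"


class ProviderHealth:
    """Classifies a provider from a rolling window of recent calls.

    All thresholds are illustrative assumptions, not recommendations.
    """

    def __init__(self, window=200, p99_limit_s=6.0,
                 error_limit=0.02, down_error_limit=0.5):
        self.samples = deque(maxlen=window)  # (latency_seconds, succeeded) pairs
        self.p99_limit_s = p99_limit_s
        self.error_limit = error_limit
        self.down_error_limit = down_error_limit

    def record(self, latency_s, ok):
        """Store the outcome of one provider call."""
        self.samples.append((latency_s, ok))

    def p99(self):
        """Nearest-rank 99th percentile latency over the window."""
        latencies = sorted(lat for lat, _ in self.samples)
        if not latencies:
            return 0.0
        n = len(latencies)
        rank = (99 * n + 99) // 100  # integer ceil(0.99 * n)
        return latencies[rank - 1]

    def error_rate(self):
        """Fraction of calls in the window that failed."""
        if not self.samples:
            return 0.0
        failures = sum(1 for _, ok in self.samples if not ok)
        return failures / len(self.samples)

    def state(self):
        """Map the continuous signals onto three states instead of two."""
        errors = self.error_rate()
        if errors >= self.down_error_limit:
            return Health.DOWN
        if errors > self.error_limit or self.p99() > self.p99_limit_s:
            return Health.DEGRADED
        return Health.HEALTHY
```

With these assumed thresholds, a window matching the numbers above (a 14-second p99 and a 4% error rate) classifies as `DEGRADED`, while a fallback keyed only on "is the provider down" would never trigger. What to do in the degraded state is a separate decision from detecting it.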
