Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding
A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.
Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.
This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
Why a 503 Means Something Different Now
The status codes themselves are specified, but what a client should do after receiving one was never fully written down. That part is a contract between server and client, and the contract held because the clients were mostly written by humans who shared a mental model: 429 means slow down, 503 means we're full, Retry-After is a hint you should respect. Browsers, CDNs, and well-behaved SDKs internalized this contract. A retry storm was a bug, and bugs were rare enough that ops teams could name them.
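As a minimal sketch of what "respecting the hint" can look like in a tool layer, the helper below reads a Retry-After value in either of its two standard forms (a number of seconds or an HTTP-date). The function name `retry_after_seconds` and the choice to return `None` for unusable values are assumptions made for illustration, not part of any particular SDK.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def retry_after_seconds(header_value: Optional[str],
                        now: Optional[datetime] = None) -> Optional[float]:
    """Return how long the server asked us to wait, or None if there is no usable hint."""
    if not header_value:
        return None
    value = header_value.strip()
    # Form 1: a non-negative integer number of seconds, e.g. "120".
    if value.isascii() and value.isdigit():
        return float(value)
    # Form 2: an HTTP-date, e.g. "Wed, 21 Oct 2015 07:28:00 GMT".
    try:
        when = parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None
    if when.tzinfo is None:
        # Treat a zone-less date as UTC rather than guessing a local zone.
        when = when.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    # A date in the past means "retry is allowed now", not a negative wait.
    return max(0.0, (when - now).total_seconds())
```

A caller would typically wait at least this long before the next attempt, and fall back to its own backoff policy when the result is `None`.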
Agents don't share the contract. The tool layer of a typical agent treats every non-2xx response as a transient blip to retry around, because the LLM that designed the loop was trained on Stack Overflow snippets where retrying with backoff is universal good advice. Worse, when retries finally exhaust, the planner doesn't go to lunch — it asks the model what to do, the model proposes a different tool that gets a similar answer, and the agent now hits an entirely different dependency that was never load-tested under correlated demand.
The end state is one most platform teams have already seen at least once: an arriving agent fleet is indistinguishable from a coordinated DDoS, except that the operator's intent is friendly. The traffic shape is uniform, the timing is correlated, the request shapes don't vary the way human-driven sessions do, and the back-pressure signals get ignored. By every signature your WAF was tuned on, the agents are the attack.
The Three Amplifiers
There are three multiplicative factors that turn a small upstream wobble into an outage when agents are involved.
Faster retries. Human request loops have a built-in cooldown: the human reads the error, switches tabs, refreshes after a minute. Agent loops do not. A naive retry policy on a tool call will start the second attempt within a second of the first and the third within four seconds, and it will burn half a dozen retries before any human-scale system has even noticed the upstream is unhealthy. Exponential backoff helps, but only if it's actually wired in; in practice many agent frameworks ship retry primitives whose defaults are tuned for human latency budgets, not for upstream survival.
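To make the difference concrete, here is a small sketch comparing a fixed-delay policy with a capped exponential one. The function name `backoff_schedule` and all of the numbers (a 250 ms fixed delay, a 1 s base, a 30 s cap, six retries) are illustrative assumptions, not defaults taken from any framework.

```python
from typing import List


def backoff_schedule(base: float, factor: float, cap: float, max_retries: int) -> List[float]:
    """Delay before each retry: base, base*factor, base*factor**2, ... capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_retries)]


# Fixed 250 ms delay: six retries are over in a second and a half.
naive = backoff_schedule(base=0.25, factor=1.0, cap=0.25, max_retries=6)
print(sum(naive))            # 1.5

# Capped exponential: the same six retries give the upstream about a minute.
capped = backoff_schedule(base=1.0, factor=2.0, cap=30.0, max_retries=6)
print(capped, sum(capped))   # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0] 61.0
```

The total at the end is the useful number: it is the recovery window the client grants the upstream before giving up.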
Correlated fan-out. When ten thousand human users hit a 503, they each react on their own clock. When ten thousand agent threads hit a 503, they retry on the same clock, because they're running the same retry policy with the same base delay. Jitter is the standard fix, but jitter is a feature you have to remember to add — and it's not in the default retry decorator most engineers paste in. Worse, when agents are deployed as a managed pool (think a backend service that runs N agent workers per node), the retry storm clusters at the node level, so a single upstream blip turns into a synchronized wave from a single source IP, which trips your rate limiter at the wrong layer and looks like a single misbehaving client instead of a fleet.
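One common way to break the shared clock is to randomize each delay over the whole backoff window, a variant often called "full jitter". The sketch below simulates a fleet of workers that all failed at the same instant; the worker count, base delay, and cap are assumptions chosen for illustration.

```python
import random


def full_jitter_delay(attempt: int, base: float, cap: float, rng: random.Random) -> float:
    """Pick a delay uniformly from [0, min(cap, base * 2**attempt)]."""
    ceiling = min(cap, base * 2 ** attempt)
    return rng.uniform(0.0, ceiling)


# 10,000 workers, each with its own random source, all on retry number 3.
workers = [random.Random(seed) for seed in range(10_000)]
delays = [full_jitter_delay(3, base=1.0, cap=30.0, rng=w) for w in workers]

# Without jitter every retry would land at the same instant (8 s after failure).
# With jitter, count arrivals per one-second bucket across the 8 s window.
buckets = [0] * 8
for d in delays:
    buckets[min(int(d), 7)] += 1
print(max(buckets))  # size of the busiest one-second bucket
```

The point of the bucket count is that the peak arrival rate drops from the whole fleet at once to roughly the fleet size divided by the window length.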
Re-planning around failure. This is the one that human-era load shedding has no defense against. When tool A fails, the planner asks the model what to do. The model — helpfully, fluently — proposes that tool B can probably accomplish a similar goal. Tool B has its own dependency graph, its own rate limit, its own quota pool. Now your shedding has caused the load to move, not to disappear. The user-visible failure rate stays low (the agent eventually succeeds), but the cost-to-serve doubles, and the dependency that was supposed to absorb the spillover quietly becomes the next thing to fail. Worst case, the planner discovers that calling tool A and tool B in parallel is "more robust" — and now every retry is two requests instead of one.
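One possible counter at the agent side, sketched under assumptions, is a single call budget per user task that is charged for every outbound request no matter which tool the planner picks. The class name `TaskAttemptBudget` and the limit of four calls are hypothetical; the idea is only that a retry of tool A and a fallback to tool B draw from the same pool, so replanning cannot multiply load without bound.

```python
class TaskAttemptBudget:
    """One budget per user task, charged for every outbound call regardless of tool."""

    def __init__(self, max_calls: int) -> None:
        self.remaining = max_calls

    def try_spend(self, calls: int = 1) -> bool:
        """Reserve `calls` outbound requests; refuse if the task cannot afford them."""
        if calls > self.remaining:
            return False
        self.remaining -= calls
        return True


budget = TaskAttemptBudget(max_calls=4)
log = []
for tool in ["tool_a", "tool_a", "tool_a", "tool_b", "tool_b"]:
    # Retries of tool_a and the fallback to tool_b share one pool.
    log.append((tool, budget.try_spend()))
print(log)
# [('tool_a', True), ('tool_a', True), ('tool_a', True), ('tool_b', True), ('tool_b', False)]
```

Calling tool A and tool B in parallel would be charged as `try_spend(2)`, which makes the "more robust" plan pay its real price.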
Why the Standard Primitives Don't Match Anymore
Most load-shedding implementations classify requests into priority bands. Netflix, AWS, and the SRE canon all describe the same shape: paying customers in the top tier, logged-in users next, anonymous users below that, known crawlers at the bottom. When the system is overloaded, you shed from the bottom up.
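The bottom-up shape can be sketched in a few lines. The band names follow the tiers listed above; the utilization thresholds are made-up values for illustration, not numbers from any of the systems mentioned.

```python
from enum import IntEnum


class Band(IntEnum):
    # Lower value = shed first.
    KNOWN_CRAWLER = 0
    ANONYMOUS = 1
    LOGGED_IN = 2
    PAYING = 3


# Hypothetical thresholds: utilization at or above which each band is dropped.
SHED_AT = {
    Band.KNOWN_CRAWLER: 0.70,
    Band.ANONYMOUS: 0.80,
    Band.LOGGED_IN: 0.90,
    Band.PAYING: 0.98,
}


def should_shed(band: Band, utilization: float) -> bool:
    """True when the service is loaded enough that this band is being dropped."""
    return utilization >= SHED_AT[band]


print([band.name for band in Band if should_shed(band, 0.85)])
# ['KNOWN_CRAWLER', 'ANONYMOUS']
```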
The shape works when the principal is stable. It falls apart when the principal is "an agent acting on behalf of a user." Whose priority does the agent inherit? The user's, presumably — but the agent might be a background task the user isn't watching, where a 30-second delay costs nothing, or it might be the foreground assistant the user is staring at, where a one-second delay is unacceptable. The same user identity covers both cases. Rate-limit-by-principal can't tell them apart.
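One way to separate the two cases, sketched here under assumptions, is to have the agent declare whether a human is waiting and to fold that into the shed order alongside the user's tier. The `AgentRequest` fields and the ordering rule (tier first, then background before foreground within a tier) are hypothetical; a real service might weigh them differently.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AgentRequest:
    label: str
    user_tier: int        # hypothetical: 0 = anonymous ... 3 = paying
    human_waiting: bool   # hypothetical hint declared by the agent runtime


def shed_order_key(req: AgentRequest) -> Tuple[int, bool]:
    """Sort key: requests with smaller keys are shed first."""
    # Within one tier, background work (human_waiting=False) sheds first.
    return (req.user_tier, req.human_waiting)


requests = [
    AgentRequest("paying, foreground assistant", 3, True),
    AgentRequest("paying, background task", 3, False),
    AgentRequest("anonymous, foreground", 0, True),
]
for req in sorted(requests, key=shed_order_key):
    print(req.label)
# anonymous, foreground
# paying, background task
# paying, foreground assistant
```

With the hint in place, the same user identity yields two different positions in the shed order, which is the distinction rate-limit-by-principal cannot make.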
- https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
- https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
- https://cloud.google.com/blog/products/gcp/using-load-shedding-to-survive-a-success-disaster-cre-life-lessons
- https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/
- https://blog.mads-hartmann.com/sre/2021/05/14/thundering-herd.html
- https://encore.dev/blog/thundering-herd-problem
- https://www.codereliant.io/p/retries-backoff-jitter
- https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://blog.meganova.ai/circuit-breakers-in-ai-agent-systems-reliability-at-scale/
- https://cordum.io/blog/ai-agent-circuit-breaker-pattern
- https://nordicapis.com/how-ai-agents-are-changing-api-rate-limit-approaches/
- https://fast.io/resources/ai-agent-rate-limiting/
- https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway
- https://developers.openai.com/api/docs/guides/rate-limits
