Load Shedding Was Built for Humans. Agents Amplify the Storm You're Shedding
A 503 to a human is a "try again later" page and a coffee break. A 503 to an agent is a 250-millisecond setback before retry one of seven, and the planner is already asking the LLM whether a different tool can sneak around the failed dependency. The first behavior gives an overloaded service room to recover. The second behavior is what an overloaded service has nightmares about: thousands of correlated retries, each one cheaper and faster than a human's, half of them fanning out into the next dependency over because the planner decided that was a creative workaround.
Load shedding — the discipline of dropping low-priority work to keep the high-priority path alive — was designed in an era when the principal sending traffic was a human at a keyboard or a well-behaved service with a hand-tuned retry policy. Both of those assumptions break the moment a fleet of agents shows up. The agent retries faster, retries from more places at once, replans around the failure, and treats your 503 as a load-balancing hint instead of as the cooperative back-pressure signal you meant it to be.
This piece is about why the standard load-shedding playbook doesn't survive contact with agentic clients, what primitives the upstream service needs in order to actually shed agent traffic, and what the agent itself has to do — at the tool layer and at the planner — to stop being the hostile traffic in someone else's incident report.
Why a 503 Means Something Different Now
The semantics of an HTTP error were never fully written down. They are a contract between server and client, and the contract held because the clients were mostly written by humans who shared a mental model: 429 means slow down, 503 means we're full, Retry-After is a hint you should respect. Browsers, CDNs, and well-behaved SDKs internalized this contract. A retry storm was a bug, and bugs were rare enough that ops teams could name them.
Agents don't share the contract. The tool layer of a typical agent treats every non-2xx response as a transient blip to retry around, because the LLM that designed the loop was trained on Stack Overflow snippets where retrying with backoff is universally good advice. Worse, when the retries are finally exhausted, the planner doesn't go to lunch — it asks the model what to do, the model proposes a different tool that gets a similar answer, and the agent now hits an entirely different dependency that was never load-tested under correlated demand.
The end state is one most platform teams have already seen at least once: an agent fleet showing up looks indistinguishable from a coordinated DDoS, except that the operator's intent is friendly. The traffic shape is uniform, the timing is correlated, the request shapes don't vary the way human-driven sessions do, and the back-pressure signals get ignored. By every signature your WAF was tuned on, the agents are the attack.
The Three Amplifiers
There are three multiplicative factors that turn a small upstream wobble into an outage when agents are involved.
Faster retries. Human request loops have a built-in cooldown — the human reads the error, switches tabs, refreshes after a minute. Agent loops do not. A naive retry policy on a tool call will start the second attempt within a second of the first, the third within four, and burn a half-dozen retries before any human-scale system has even noticed the upstream is unhealthy. Exponential backoff helps, but only if it's actually wired in; in practice many agent frameworks ship retry primitives whose defaults are tuned for human latency budgets, not for upstream survival.
Correlated fan-out. When ten thousand human users hit a 503, they each react on their own clock. When ten thousand agent threads hit a 503, they retry on the same clock, because they're running the same retry policy with the same base delay. Jitter is the standard fix, but jitter is a feature you have to remember to add — and it's not in the default retry decorator most engineers paste in. Worse, when agents are deployed as a managed pool (think a backend service that runs N agent workers per node), the retry storm clusters at the node level, so a single upstream blip turns into a synchronized wave from a single source IP, which trips your rate limiter at the wrong layer and looks like a single misbehaving client instead of a fleet.
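To make the correlation concrete, here is a minimal sketch, with illustrative delay constants rather than any particular framework's defaults: plain exponential backoff hands every worker the identical schedule, while full jitter spreads the same workers across the window.

```python
import random

BASE_DELAY = 1.0    # seconds; illustrative, not a framework default
MAX_DELAY = 30.0

def default_backoff(attempt: int) -> float:
    """Plain exponential backoff: every worker computes the identical schedule."""
    return min(MAX_DELAY, BASE_DELAY * (2 ** attempt))

def full_jitter_backoff(attempt: int) -> float:
    """Full jitter: sleep a random amount in [0, cap], so workers diverge."""
    return random.uniform(0.0, min(MAX_DELAY, BASE_DELAY * (2 ** attempt)))

if __name__ == "__main__":
    # Ten workers that all saw the same 503 at t=0, now computing their third attempt.
    print([default_backoff(2) for _ in range(10)])                   # ten identical delays
    print([round(full_jitter_backoff(2), 2) for _ in range(10)])     # spread across [0, 4]
```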
Re-planning around failure. This is the one that human-era load shedding has no defense against. When tool A fails, the planner asks the model what to do. The model — helpfully, fluently — proposes that tool B can probably accomplish a similar goal. Tool B has its own dependency graph, its own rate limit, its own quota pool. Now your shedding has caused the load to move, not to disappear. The user-visible failure rate stays low (the agent eventually succeeds), but the cost-to-serve doubles, and the dependency that was supposed to absorb the spillover quietly becomes the next thing to fail. Worst case, the planner discovers that calling tool A and tool B in parallel is "more robust" — and now every retry is two requests instead of one.
Why the Standard Primitives Don't Match Anymore
Most load-shedding implementations classify requests into priority bands. Netflix, AWS, and the SRE canon all describe the same shape: paying customers in the top tier, logged-in users next, anonymous users below that, known crawlers at the bottom. When the system is overloaded, you shed from the bottom up.
The shape works when the principal is stable. It falls apart when the principal is "an agent acting on behalf of a user." Whose priority does the agent inherit? The user's, presumably — but the agent might be a background task the user isn't watching, where a 30-second delay costs nothing, or it might be the foreground assistant the user is staring at, where a one-second delay is unacceptable. The same user identity covers both cases. Rate-limit-by-principal can't tell them apart.
The same problem shows up in Retry-After. The header was designed for a single client, where "wait 30 seconds" is a coherent instruction. For an agent, "wait 30 seconds" is ambiguous: should the planner pause the entire workflow, or only this tool path, or attempt the same tool with a different argument? Most agent frameworks today don't propagate Retry-After to the planner at all — the tool layer respects it locally, then immediately fails into a re-planning step that ignores the upstream's advice and tries something adjacent.
And quotas, designed to protect the upstream from any single tenant, miscount when one prompt fans out into multiple tool calls. A user who issues one prompt and triggers four tool calls plus three retries is, by quota arithmetic, seven users' worth of load. The pricing model and the quota model and the user's mental model all disagree, which is the kind of disagreement that becomes a billing dispute exactly when the platform is most under pressure.
What the Upstream Needs to Build
The first changes have to happen at the upstream service, because the agent client has neither the visibility nor the authority to fix this on its own. Three primitives are roughly mandatory.
A traffic-class label that distinguishes agent-driven from human-driven requests. This is more than a header convention; it has to be carried through the auth layer so the rate limiter and the load shedder can see it. Some platforms have started using a request attribute like client-mode: interactive | batch | agent, scoped to the API key. The shedder treats agent-mode traffic as lower priority during incidents, even when the underlying user is premium, on the theory that agent-mode work tolerates retry-later semantics better than a human staring at a spinner.
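A minimal sketch of how the shedder might use that label, assuming hypothetical tier names and priority bands; the only load-bearing idea is that the client-mode attribute demotes agent traffic while the service is overloaded.

```python
from dataclasses import dataclass

# Hypothetical tier names and numeric bands; only the demotion logic matters.
TIER_PRIORITY = {"premium": 3, "logged_in": 2, "anonymous": 1}

@dataclass
class Request:
    user_tier: str     # "premium" | "logged_in" | "anonymous"
    client_mode: str   # "interactive" | "batch" | "agent", carried through the auth layer

def shed_priority(req: Request, overloaded: bool) -> int:
    """Higher number survives longer. During an incident, agent-mode traffic
    is demoted below the user's own tier."""
    base = TIER_PRIORITY.get(req.user_tier, 0)
    if overloaded and req.client_mode == "agent":
        return max(0, base - 2)   # a premium user's agent traffic lands in the anonymous band
    return base

def should_shed(req: Request, shed_level: int, overloaded: bool) -> bool:
    # The shedder raises shed_level as overload deepens; anything at or below it is dropped.
    return shed_priority(req, overloaded) <= shed_level
```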
Cooperative shedding signals that the agent layer can act on, not just the tool layer. The standard Retry-After header tells the immediate client to wait. What's missing is a signal that says "stop planning new work" — a hint that propagates up the agent stack so the planner pauses its loop, not just the HTTP retry. A few platforms have begun experimenting with structured response bodies on 429 that include a recommended-action: pause-workflow | switch-region | abandon field. It is early, the conventions aren't standardized, and most clients ignore it, but it's the right shape: the upstream knows more about its own state than the client does, and giving the agent a richer back-pressure vocabulary lets it choose to abandon a plan rather than retry around it.
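On the client side, the handling might look like the sketch below. The recommended-action values follow the convention described above, and the planner methods (pause, abandon_goal, prefer_alternate_region, schedule_retry) are hypothetical hooks, not any framework's API.

```python
import json

def on_rate_limited(headers: dict, body: bytes, planner) -> None:
    """Act on a 429 whose body carries a recommended action. Falls back to the
    old behavior (a local retry) when the body is absent or unstructured."""
    retry_after = float(headers.get("Retry-After", "1"))   # assumes delta-seconds form
    try:
        action = json.loads(body).get("recommended-action", "retry")
    except (ValueError, AttributeError):
        action = "retry"   # legacy upstream: plain 429, nothing structured

    if action == "pause-workflow":
        planner.pause(seconds=retry_after)          # stop planning new work, not just this call
    elif action == "abandon":
        planner.abandon_goal(reason="upstream shed this workflow")
    elif action == "switch-region":
        planner.prefer_alternate_region()
    else:
        planner.schedule_retry(after=retry_after)   # tool-layer retry only
```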
Rate-limit-by-intent, not just by principal. The agent fleet is going to issue hundreds of correlated requests per user prompt; the limiter that matters is the one that asks "is this fleet collectively trying to do too much?" In practice this means tagging requests with a workflow ID that survives across tool calls, rate-limiting at the workflow level, and exposing a quota that the agent can check before it starts a new step. The token-bucket primitive is fine; what changes is what the bucket counts.
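A sketch of a workflow-scoped token bucket, with illustrative capacity and refill numbers; the interesting part is only that every tool call and retry spawned by one prompt draws from, and can query, the same bucket.

```python
import time
from collections import defaultdict

class WorkflowRateLimiter:
    """Token bucket keyed by workflow ID: every tool call and retry spawned by
    one user prompt draws from the same bucket."""

    def __init__(self, capacity: float = 20.0, refill_per_sec: float = 2.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self._tokens = defaultdict(lambda: capacity)   # workflow_id -> available tokens
        self._last = defaultdict(time.monotonic)       # workflow_id -> last refill time

    def _refill(self, workflow_id: str) -> None:
        now = time.monotonic()
        elapsed = now - self._last[workflow_id]
        self._last[workflow_id] = now
        self._tokens[workflow_id] = min(
            self.capacity, self._tokens[workflow_id] + elapsed * self.refill_per_sec
        )

    def remaining(self, workflow_id: str) -> float:
        """The quota an agent can check before it starts a new step."""
        self._refill(workflow_id)
        return self._tokens[workflow_id]

    def allow(self, workflow_id: str, cost: float = 1.0) -> bool:
        self._refill(workflow_id)
        if self._tokens[workflow_id] >= cost:
            self._tokens[workflow_id] -= cost
            return True
        return False
```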
What the Agent Has to Do
Even with all of those upstream primitives, the agent layer carries half the responsibility. There are a few disciplines that distinguish a well-behaved agent from a load-amplifying one.
The tool layer must respect 429 and 503 as cooperative signals, not transient errors. This sounds obvious; in practice, a stunning fraction of production agent code treats every non-2xx as something to retry blindly. A ReAct-style loop will burn most of its retry budget on errors that can never succeed in the current window. The fix is to classify failures: retryable-after-delay (429 with Retry-After), retryable-after-jitter (503 without a specific hint), non-retryable (4xx that aren't rate limits), and fatal-for-this-tool (the tool is down, abandon it, don't suggest it again this run). Each class needs a different policy, and the policy must be enforced at the framework level — not left to the model to reinvent every turn.
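A sketch of that taxonomy enforced at the tool layer: the class names mirror the prose above, while the status-code mapping and the consecutive-failure threshold are assumptions.

```python
from enum import Enum

class FailureClass(Enum):
    RETRYABLE_AFTER_DELAY = "retryable-after-delay"    # 429 with Retry-After
    RETRYABLE_AFTER_JITTER = "retryable-after-jitter"  # 503 / 5xx without a specific hint
    NON_RETRYABLE = "non-retryable"                    # other 4xx: bad request, auth, etc.
    FATAL_FOR_THIS_TOOL = "fatal-for-this-tool"        # tool is down; abandon it for this run

def classify(status: int, headers: dict, consecutive_failures: int) -> FailureClass:
    if status == 429 and "Retry-After" in headers:
        return FailureClass.RETRYABLE_AFTER_DELAY
    if status in (429, 502, 503, 504):
        if consecutive_failures >= 3:                  # threshold is an assumption
            return FailureClass.FATAL_FOR_THIS_TOOL    # stop offering this tool to the planner
        return FailureClass.RETRYABLE_AFTER_JITTER
    if 400 <= status < 500:
        return FailureClass.NON_RETRYABLE
    return FailureClass.RETRYABLE_AFTER_JITTER
```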
The planner has to know when to abandon, not just when to retry. This is the harder discipline. A planner that always re-routes around failure is a planner that consumes infinite cost in pursuit of a goal the user might no longer want. A circuit breaker at the tool level helps: when one tool has tripped open, the planner sees it as unavailable for the rest of this run, instead of being asked to "creatively work around" a known-bad dependency. Per-tool circuit breakers also stop one degraded service from draining the retry budget that other tools needed.
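A minimal per-tool breaker along those lines; the thresholds and cooldown are illustrative, and the property that matters is that a tripped tool simply disappears from the list of tools the planner is offered.

```python
import time

class ToolBreaker:
    def __init__(self, failure_threshold: int = 3, cooldown_s: float = 60.0):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self.failures: dict[str, int] = {}
        self.opened_at: dict[str, float] = {}

    def record_failure(self, tool: str) -> None:
        self.failures[tool] = self.failures.get(tool, 0) + 1
        if self.failures[tool] >= self.failure_threshold:
            self.opened_at[tool] = time.monotonic()    # trip open

    def record_success(self, tool: str) -> None:
        self.failures.pop(tool, None)
        self.opened_at.pop(tool, None)

    def available(self, tool: str) -> bool:
        opened = self.opened_at.get(tool)
        if opened is None:
            return True
        if time.monotonic() - opened >= self.cooldown_s:
            return True    # half-open: allow one probe; the next failure re-trips it
        return False

# The planner is only ever shown tools for which breaker.available(name) is True,
# so it is never asked to "work around" a known-bad dependency.
```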
Cost budgets must include wasted retries, not just successful work. Most agent cost dashboards count tokens, sometimes count tool calls, almost never count the work done in service of failed attempts. That accounting hole is exactly where retry storms hide. Treating each retry as a budgeted operation, with the budget visible to the planner, gives the system a reason to give up before the cost-to-serve spirals — and gives the operator a legible signal that the upstream is in trouble, rather than burying it in an opaque token spike.
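A sketch of what retry-aware budgeting can look like, assuming an operator-chosen per-run cap measured in operations; failed attempts are charged like successes, and both the remaining budget and the waste ratio are visible signals.

```python
class RunBudget:
    def __init__(self, max_operations: int = 50):   # cap is an operator choice
        self.max_operations = max_operations
        self.successful = 0
        self.wasted = 0          # retries and failed attempts are charged too

    def charge(self, succeeded: bool) -> None:
        if succeeded:
            self.successful += 1
        else:
            self.wasted += 1

    @property
    def remaining(self) -> int:
        return self.max_operations - (self.successful + self.wasted)

    def exhausted(self) -> bool:
        return self.remaining <= 0   # the planner should abandon, not re-plan

    def waste_ratio(self) -> float:
        total = self.successful + self.wasted
        return self.wasted / total if total else 0.0   # an operator-legible distress signal
```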
Jitter is a default, not an option. Every retry path the agent owns should have full jitter applied at the tool layer, ideally with the jitter window keyed to the tool ID so that two agents retrying the same tool diverge but two agents retrying different tools don't accidentally collide. This is the cheapest fix on the list and the one most often missed.
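A sketch of full jitter keyed to the tool ID; the hash-derived offset is one possible way to give each tool its own window, not an established convention.

```python
import hashlib
import random

BASE_DELAY = 0.5   # seconds; illustrative
MAX_CAP = 20.0

def retry_delay(tool_id: str, attempt: int) -> float:
    cap = min(MAX_CAP, BASE_DELAY * (2 ** attempt))
    # Per-tool offset shifts the window, so retries against different tools
    # don't cluster on one schedule when several dependencies wobble at once.
    offset = (int(hashlib.sha256(tool_id.encode()).hexdigest(), 16) % 1000) / 1000.0 * cap
    # Full jitter inside the tool's window: two agents retrying the same tool
    # still pick different points and therefore diverge.
    return offset + random.uniform(0.0, cap)
```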
The Shape of What's Missing
The deeper issue is that the load-shedding primitives we have were designed for a one-to-one relationship between a request and a principal. Agents have shattered that. A single user prompt becomes a tree of tool calls; a single tool call becomes a fan-out across regions; a single failure becomes a re-plan that doubles the load. The primitives that match this world — workflow-scoped quotas, cooperative shedding signals that propagate to the planner, traffic-class labels that survive across hops, intent-aware rate limiters — are not standardized, not implemented in most gateways, and not consistently respected by most agent frameworks.
The architectural conclusion is the uncomfortable one: rate-limit-by-principal is the human-era primitive. Rate-limit-by-intent is the agent-era one nobody has fully built yet. Until it exists, the work falls to the agent operator and the platform team to negotiate by hand: tag your traffic, respect the headers, jitter your retries, abandon paths early, and stop letting the LLM treat a 503 as a creative writing prompt. The upstream services will eventually grow the primitives that match the new world. The agents that survive the transition will be the ones that didn't earn a seat on the WAF blocklist while waiting.
- https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/
- https://netflixtechblog.com/keeping-netflix-reliable-using-prioritized-load-shedding-6cc827b02f94
- https://cloud.google.com/blog/products/gcp/using-load-shedding-to-survive-a-success-disaster-cre-life-lessons
- https://learn.microsoft.com/en-us/azure/architecture/antipatterns/retry-storm/
- https://blog.mads-hartmann.com/sre/2021/05/14/thundering-herd.html
- https://encore.dev/blog/thundering-herd-problem
- https://www.codereliant.io/p/retries-backoff-jitter
- https://cloud.google.com/blog/products/ai-machine-learning/learn-how-to-handle-429-resource-exhaustion-errors-in-your-llms
- https://towardsdatascience.com/your-react-agent-is-wasting-90-of-its-retries-heres-how-to-stop-it/
- https://blog.meganova.ai/circuit-breakers-in-ai-agent-systems-reliability-at-scale/
- https://cordum.io/blog/ai-agent-circuit-breaker-pattern
- https://nordicapis.com/how-ai-agents-are-changing-api-rate-limit-approaches/
- https://fast.io/resources/ai-agent-rate-limiting/
- https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway
- https://developers.openai.com/api/docs/guides/rate-limits
