3 posts tagged with "backpressure"

Rate Limit Hierarchy Collapse: When Your Agent Loop DoSes Itself

12 min read
Tian Pan
Software Engineer

The bug report says the service is slow. The dashboard says the service is healthy. Token-per-minute usage is at 62% of the tier cap, well inside the green band. Then you open the traces and see the shape: one user request spawned a planner step, which emitted eleven parallel tool calls, four of which were search fan-outs that each triggered sub-agents, which each called three tools in parallel — and that single "request" is now pounding your own token bucket from forty-seven different workers at the same time. The other ninety-nine users of your product are stuck behind it, getting 429s they never earned. Your agent is DoSing itself, and the rate limiter is doing exactly what you told it to.

This is rate limit hierarchy collapse. You bought a perimeter defense designed for HTTP APIs where one request equals one unit of work, then wired it in front of a system where one request means a tree of unknown depth and unbounded branching factor. The single-bucket model doesn't just fail to protect — it fails invisibly, because your aggregate numbers never breach anything. The damage happens in the tails, in correlated bursts, and to the innocent users who happen to be adjacent in time to a heavy one.
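One fix the excerpt points toward is giving each request tree its own budget nested under the shared one. Below is a minimal sketch (all names hypothetical, not from the post): a classic token bucket, plus a wrapper that requires every token spend to clear both a small per-request bucket and the global bucket, so one forty-seven-worker fan-out exhausts its own allowance instead of everyone's.

```python
import threading
import time


class TokenBucket:
    """Plain token bucket: `capacity` tokens, refilled at `rate` per second."""

    def __init__(self, rate: float, capacity: float):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic()
        self.lock = threading.Lock()

    def try_acquire(self, n: float) -> bool:
        with self.lock:
            now = time.monotonic()
            # Lazy refill based on elapsed time since the last acquire.
            self.tokens = min(self.capacity,
                              self.tokens + (now - self.updated) * self.rate)
            self.updated = now
            if self.tokens >= n:
                self.tokens -= n
                return True
            return False


class HierarchicalLimiter:
    """Global bucket plus a smaller per-request bucket, so one request's
    fan-out tree cannot drain the shared quota on its own."""

    def __init__(self, global_bucket: TokenBucket,
                 per_request_rate: float, per_request_cap: float):
        self.global_bucket = global_bucket
        self.per_request_rate = per_request_rate
        self.per_request_cap = per_request_cap
        self.request_buckets: dict[str, TokenBucket] = {}
        self.lock = threading.Lock()

    def try_acquire(self, request_id: str, tokens: float) -> bool:
        with self.lock:
            bucket = self.request_buckets.setdefault(
                request_id,
                TokenBucket(self.per_request_rate, self.per_request_cap))
        # Both the request's own budget AND the global budget must admit it.
        if not bucket.try_acquire(tokens):
            return False
        if not self.global_bucket.try_acquire(tokens):
            # Global quota exhausted: hand the per-request tokens back.
            with bucket.lock:
                bucket.tokens = min(bucket.capacity, bucket.tokens + tokens)
            return False
        return True
```

With a global capacity of 1000 tokens and a per-request cap of 100, a single runaway request tree stops at 10% of the shared quota while other request IDs keep flowing — the hierarchy fails loudly for one user instead of silently for all of them.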

Backpressure for LLM Pipelines: Queue Theory Applied to Token-Based Services

11 min read
Tian Pan
Software Engineer

A retry storm at 3 a.m. usually starts the same way: a brief provider hiccup pushes a few requests over the rate limit, your client library retries them, those retries land on a still-recovering endpoint, more requests fail, and within ninety seconds your queue depth has gone vertical while your provider dashboard shows you sitting at 100% of your tokens-per-minute quota with a backlog measured in five-figure dollars. The post-mortem will say "thundering herd." The honest answer is that you built a fixed-throughput retry policy on top of a variable-capacity downstream and forgot that queue theory has opinions about that.

Most of the well-known service resilience patterns were written for downstreams whose throughput is a wall: a database with a connection pool, a microservice with a known concurrency limit. LLM providers are not that. Your effective throughput is a moving target shaped by your tier, the model you picked, the size of the prompt, the size of the response, the time of day, and whether someone else on the same provider is fine-tuning a frontier model right now. Treating it like a fixed pipe is the root cause of most of the LLM outages I've seen this year.
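When the downstream's capacity is a moving target, a fixed concurrency limit is always either too timid or too aggressive. One way to track it, sketched below under my own naming (not the post's), is additive-increase / multiplicative-decrease: treat each 429 the way TCP treats packet loss, halving the in-flight limit, and creep it back up one slot per success.

```python
class AIMDLimit:
    """Additive-increase / multiplicative-decrease concurrency limit.
    Successes grow the limit linearly; throttles (429s) cut it in half,
    so the limit converges toward whatever the provider will bear."""

    def __init__(self, initial: int = 4, min_limit: int = 1, max_limit: int = 64):
        self.limit = initial
        self.min_limit = min_limit
        self.max_limit = max_limit

    def on_success(self) -> None:
        self.limit = min(self.max_limit, self.limit + 1)

    def on_throttle(self) -> None:
        self.limit = max(self.min_limit, self.limit // 2)
```

A dispatcher would consult `limit` before admitting a new request and feed every response outcome back in; the limiter then rides the provider's actual capacity curve through tier changes, prompt-size shifts, and noisy-neighbor hours instead of assuming a fixed pipe.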

Backpressure in Agent Pipelines: When AI Generates Work Faster Than It Can Execute

9 min read
Tian Pan
Software Engineer

A multi-agent research tool built on a popular open-source stack slipped into a recursive loop and ran for 11 days before anyone noticed. The bill: $47,000. Two agents had been talking to each other non-stop, burning tokens while the team assumed the system was working normally. This is what happens when an agent pipeline has no backpressure.

The problem is structural. When an orchestrator agent decomposes a task into sub-tasks and spawns sub-agents to handle each one, and those sub-agents can themselves spawn further sub-agents or fan out across multiple tool calls, you get exponential work generation. The pipeline produces work faster than it can execute, finish, or even account for. This is the same problem that reactive systems, streaming architectures, and network protocols solved decades ago — and the same solutions apply.
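The structural fix is to bound both dimensions of the explosion: a depth cap stops unbounded recursion, and a shared semaphore makes spawning itself block when the pipeline is saturated, which is backpressure in its literal form. A minimal, self-contained sketch with a stub planner standing in for the LLM (everything here is hypothetical, for illustration only):

```python
import asyncio

MAX_DEPTH = 2        # orchestrator -> sub-agent -> leaf; deeper spawns are refused
MAX_CONCURRENCY = 4  # global cap on in-flight agent steps

sem = asyncio.Semaphore(MAX_CONCURRENCY)


async def plan(task: str, depth: int) -> list[str]:
    """Stand-in for an LLM planner call: fan out into two subtasks
    until the depth cap, then stop decomposing."""
    if depth >= MAX_DEPTH:
        return []
    return [f"{task}.{i}" for i in range(2)]


async def run_agent(task: str, depth: int = 0) -> list[str]:
    # Acquiring the semaphore *before* doing work is the backpressure
    # point: when all slots are busy, new spawns wait here instead of
    # piling up as unbounded pending work.
    async with sem:
        subtasks = await plan(task, depth)
    if not subtasks:
        return [task]  # leaf: the actual unit of executable work
    child_results = await asyncio.gather(
        *(run_agent(sub, depth + 1) for sub in subtasks))
    return [leaf for result in child_results for leaf in result]


leaves = asyncio.run(run_agent("root"))
```

With the depth cap at 2 and a branching factor of 2, the tree bottoms out at exactly four leaves no matter what the planner wants; an 11-day recursive loop is impossible by construction, and the semaphore keeps the live fan-out from ever exceeding four concurrent steps.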