Backpressure in Agent Pipelines: When AI Generates Work Faster Than It Can Execute

· 9 min read
Tian Pan
Software Engineer

A multi-agent research tool built on a popular open-source stack slipped into a recursive loop and ran for 11 days before anyone noticed. The bill: $47,000. Two agents had been talking to each other non-stop, burning tokens while the team assumed the system was working normally. This is what happens when an agent pipeline has no backpressure.

The problem is structural. When an orchestrator agent decomposes a task into sub-tasks and spawns sub-agents to handle each one, and those sub-agents can themselves spawn further sub-agents or fan out across multiple tool calls, you get exponential work generation. The pipeline produces work faster than it can execute, finish, or even account for. This is the same problem that reactive systems, streaming architectures, and network protocols solved decades ago — and the same solutions apply.

The Unbounded Work Queue Problem

In traditional software, a producer-consumer system without flow control will eventually crash. A fast producer fills memory until the process dies with an OOM error. The failure is obvious and immediate. Agent pipelines fail more insidiously.

Consider a research agent tasked with analyzing competitors. It decomposes the task into five sub-agents, one per competitor. Each sub-agent decides it needs to search the web, read financial reports, and summarize news articles — spawning three to five tool calls each.

The orchestrator's context grows with every response. Five sub-agents producing 2,000 tokens each means 10,000 tokens fed back to the orchestrator per cycle. Over ten cycles, that is 100,000 tokens of accumulated context, plus the original task, plus intermediate reasoning. The context window fills, responses degrade, and the agent either crashes or starts hallucinating because it can no longer attend to its own instructions.

But context exhaustion is only one failure mode. The others are just as damaging:

  • Rate limit cascades. Twenty concurrent tool calls from spawned sub-agents hit your LLM provider's rate limit. Retries from each sub-agent compound the problem. A system that was processing five requests per second is now generating fifty retry requests per second.
  • Budget blowouts. Token costs in multi-agent systems do not scale linearly — they compound. Each tool call adds context. Each sub-agent response feeds back into the orchestrator. Without deliberate budget management, a single runaway job burns through your monthly API allocation in hours.
  • Recursive loops. Two agents with slightly misaligned directives — say, an editor enforcing "professional tone" and a writer instructed to keep things "casual and relatable" — can enter an infinite revision loop. Each agent rejects the other's output and requests another round. The system looks busy. It is accomplishing nothing.

The MAST study (Multi-Agent System Failure Taxonomy), published in March 2025, analyzed over 1,600 execution traces across seven open-source agent frameworks. Failure rates ranged from 41% to 87%. Unstructured multi-agent networks amplified errors up to 17x compared to single-agent baselines. Most of these failures were not model failures. They were orchestration failures — the kind that backpressure prevents.

What Backpressure Actually Means for Agents

In reactive systems, backpressure is the mechanism by which a slow consumer signals a fast producer to reduce its output rate. TCP flow control uses sliding windows. Kafka throttles producers when consumers fall behind. Node.js streams pause readable sources when writable destinations cannot keep up.

For agent pipelines, the analogy maps cleanly. The "producer" is the planning or decomposition step that generates sub-tasks. The "consumer" is the execution layer — tool calls, LLM inference, external API requests — that actually does the work. Backpressure means the execution layer can tell the planning layer: stop generating new work until I catch up.

Without this signal, the planning layer will keep decomposing, keep spawning, keep fanning out. It has no reason not to. The LLM generating the plan does not know that the execution queue is 200 items deep or that you have hit 80% of your hourly token budget. You have to build that feedback loop explicitly.

The practical implementation has three components:

  1. Bounded work queues. Every buffer between planning and execution must have a fixed capacity. When the queue is full, the planner blocks. An unbounded queue is just a memory leak with extra steps — it hides the problem by growing until something more catastrophic fails.

  2. Budget-aware planning. Before the orchestrator spawns a new sub-agent or approves a tool call, it checks the remaining token budget, the current rate limit headroom, and the depth of pending work. If resources are tight, it consolidates sub-tasks or defers low-priority ones.

  3. Execution feedback. Completed and failed tasks report their actual token consumption and wall-clock time back to the orchestrator. This lets the planner adjust its concurrency — spawning fewer sub-agents when execution is slow, or batching tool calls when rate limits are tight.
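The three components can be sketched together in a few lines. This is a minimal illustration, not a reference implementation: the class name, queue size, budget, and 30-second timeout are all assumptions chosen for the example.

```python
import queue

class BoundedPlanner:
    """Sketch of the three components above; all limits are illustrative."""

    def __init__(self, max_pending=8, token_budget=500_000):
        # 1. Bounded work queue: put() blocks when the queue is full --
        #    that blocking IS the backpressure signal to the planning layer.
        self.pending = queue.Queue(maxsize=max_pending)
        self.tokens_remaining = token_budget

    def submit(self, task, est_tokens):
        # 2. Budget-aware planning: refuse new work when the budget is gone,
        #    so the planner must consolidate or defer instead of spawning.
        if est_tokens > self.tokens_remaining:
            raise RuntimeError("token budget exhausted; consolidate or defer")
        self.pending.put(task, timeout=30)

    def report(self, actual_tokens):
        # 3. Execution feedback: deduct what the task actually consumed,
        #    not what the planner estimated.
        self.tokens_remaining -= actual_tokens
```

The key design choice is that `submit` can fail or block: the planner is forced to feel the state of the execution layer instead of decomposing blindly.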

Five Patterns That Actually Work

Pattern 1: Adaptive Concurrency Limits

Instead of hardcoding "max 5 sub-agents," measure the actual throughput of your execution layer and adjust dynamically. Start with a concurrency of 2. If tasks complete within latency targets and without rate limit errors, increase to 3. If you start seeing 429s or context growth exceeding your budget, drop back to 1.

This is the AIMD (Additive Increase, Multiplicative Decrease) algorithm that TCP uses for congestion control, applied to agent orchestration. It converges on the right throughput without requiring you to predict it in advance.
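An AIMD controller for sub-agent concurrency fits in a dozen lines. The starting value, floor, and ceiling below are illustrative assumptions:

```python
class AIMDConcurrency:
    """AIMD concurrency controller, TCP-style, applied to sub-agent spawning."""

    def __init__(self, start=2, floor=1, ceiling=16):
        self.limit = start      # max sub-agents allowed in flight right now
        self.floor = floor
        self.ceiling = ceiling

    def on_healthy_cycle(self):
        # Additive increase: one more concurrent sub-agent after a cycle
        # that met latency targets with no rate-limit errors.
        self.limit = min(self.limit + 1, self.ceiling)

    def on_throttle(self):
        # Multiplicative decrease: halve on a 429 or a context-budget overshoot.
        self.limit = max(self.limit // 2, self.floor)
```

The orchestrator consults `limit` before each spawn and calls one of the two hooks after each execution cycle; the value converges on whatever throughput the execution layer can actually sustain.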

Pattern 2: Hierarchical Budget Allocation

Assign token budgets at three levels:

  • Job level. The entire task gets a ceiling — say, 500,000 tokens. This is the hard stop that prevents runaway spending.
  • Agent level. Each sub-agent gets a fraction of the job budget, proportional to its expected complexity. A simple lookup agent might get 10,000 tokens. A complex analysis agent gets 50,000.
  • Function level. Individual tool calls get the strictest limits. A web search might be capped at 5,000 tokens of context. A code execution call might get 20,000.

When any level exhausts its budget, execution stops at that level — it does not cascade upward to consume the parent's allocation. This is the same principle as Linux cgroups limiting resource consumption per process group.
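One way to sketch this nesting is a budget object that charges its parent on every spend, so a child can never exceed its own cap and its exhaustion stops at its own level. The class and numbers below are illustrative:

```python
class Budget:
    """Hierarchical token budget: job -> agent -> function call."""

    def __init__(self, cap, parent=None):
        self.cap = cap
        self.spent = 0
        self.parent = parent

    def charge(self, tokens):
        """Try to spend `tokens`. Returns False instead of cascading upward."""
        if self.spent + tokens > self.cap:
            return False                    # exhaustion stops at this level
        if self.parent and not self.parent.charge(tokens):
            return False                    # parent ceiling is the hard stop
        self.spent += tokens
        return True
```

A `False` return is the signal for the orchestrator to consolidate or defer; the parent's remaining allocation stays intact for sibling agents.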

Pattern 3: Circuit Breakers on Tool Calls

If a particular tool or external API starts failing or responding slowly, stop calling it. A circuit breaker tracks the error rate over a rolling window. When failures exceed a threshold — say, 50% over the last 10 calls — the circuit opens and all subsequent calls to that tool return immediately with a cached fallback or an explicit "unavailable" signal.

This prevents the retry storms that amplify load on already-struggling services. The agent receives a clear signal that the tool is down and can adjust its plan accordingly, rather than burning tokens on retries that will not succeed.
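A rolling-window breaker needs little more than a fixed-size deque. The window size and threshold below mirror the numbers in the text but are, as there, just example values:

```python
from collections import deque

class ToolCircuitBreaker:
    """Circuit breaker for one tool; opens when recent failures pass a threshold."""

    def __init__(self, window=10, threshold=0.5):
        self.results = deque(maxlen=window)   # True = success, False = failure
        self.threshold = threshold

    def record(self, ok):
        self.results.append(ok)

    def is_open(self):
        # Only judge once a full window of calls has been observed.
        if len(self.results) < self.results.maxlen:
            return False
        return self.results.count(False) / len(self.results) > self.threshold

    def call(self, tool, *args):
        if self.is_open():
            return "unavailable"              # explicit signal; no retry storm
        try:
            out = tool(*args)
            self.record(True)
            return out
        except Exception:
            self.record(False)
            raise
```

A production breaker would also add a half-open state that probes the tool after a cooldown; this sketch shows only the open/closed decision.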

Pattern 4: Depth-Limited Decomposition

Set a hard limit on how many levels of sub-agent spawning are allowed. An orchestrator can spawn sub-agents (depth 1). Those sub-agents can make tool calls but cannot spawn their own sub-agents (depth 2 is disallowed). This prevents the recursive decomposition that turns a simple task into an exponentially expanding tree of agents.

If a sub-agent determines it needs further decomposition, it returns control to the orchestrator with a request, rather than spawning independently. The orchestrator can then decide — given the current queue depth, budget, and progress — whether to approve the additional decomposition or consolidate.
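The depth guard is a single comparison at the spawn point. The dict-based task shape below is a hypothetical representation chosen for the example:

```python
MAX_DEPTH = 1  # depth 0 = orchestrator; depth 1 = sub-agents; no deeper spawning

def run(task, depth=0):
    """Run a task tree; past MAX_DEPTH, return a request instead of spawning."""
    subtasks = task.get("subtasks", [])
    if subtasks and depth >= MAX_DEPTH:
        # Return control to the orchestrator with a request -- it decides
        # whether to approve further decomposition or consolidate.
        return {"task": task["name"], "status": "decomposition_requested"}
    return {
        "task": task["name"],
        "status": "done",
        "children": [run(sub, depth + 1) for sub in subtasks],
    }
```

Because the check sits inside the recursive call itself, no prompt engineering is needed: a sub-agent physically cannot expand the tree past the limit.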

Pattern 5: Load Shedding with Priority Queues

Not all agent work is equally important. When the system is under pressure — high queue depth, approaching budget limits, elevated error rates — shed low-priority work first. A priority system might look like:

  • P0 (never shed): User-facing responses, safety checks, final output assembly.
  • P1 (shed under heavy load): Enrichment steps, additional context gathering, cross-referencing.
  • P2 (shed aggressively): Nice-to-have elaboration, redundant verification, cosmetic improvements.

The orchestrator checks system pressure before scheduling each task. Under normal conditions, everything runs. Under load, P2 tasks are dropped. Under extreme load, P1 tasks are deferred until pressure subsides.
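This shedding policy reduces to a filter over a priority-tagged task list. The pressure thresholds of 0.5 and 0.8 are illustrative cutoffs, not recommendations:

```python
def schedule(tasks, pressure):
    """Return the tasks to run, P0 first, shedding by priority under pressure.

    tasks:    list of (priority, name) pairs, priority 0 (P0) through 2 (P2)
    pressure: 0.0 (idle) to 1.0 (saturated)
    """
    kept = []
    for priority, name in tasks:
        if priority == 2 and pressure > 0.5:
            continue                     # shed P2 aggressively under load
        if priority == 1 and pressure > 0.8:
            continue                     # defer P1 only under extreme load
        kept.append((priority, name))
    return [name for _, name in sorted(kept)]  # P0 runs before P1 and P2
```

P0 tasks pass through unconditionally, which preserves the invariant that user-facing output and safety checks are never shed.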

Instrumentation: Detecting Runaway Expansion Before It Hits Your Bill

You cannot apply backpressure if you cannot see the pressure building. The minimum instrumentation for an agent pipeline includes four metrics:

  • Queue depth over time. A steadily growing queue means your consumer cannot keep up. This is the earliest signal of a backpressure problem.
  • Token consumption rate. Track tokens per minute, not per request. A few large requests can exhaust your budget while staying under request-per-minute limits. Token-based monitoring catches what request counting misses.
  • Agent spawn depth. How many levels of sub-agent nesting currently exist? If this number keeps growing, you have recursive decomposition without termination.
  • Task completion ratio. The ratio of tasks completed to tasks created, measured over a rolling window. A healthy pipeline stays near 1.0. A ratio dropping toward 0.5 means you are creating work twice as fast as you are finishing it — the definition of a system that needs backpressure.

Set alerts on these metrics. A queue depth that doubles in five minutes, a token rate exceeding 60% of your hourly budget in the first ten minutes, or a spawn depth exceeding 3 — any of these should trigger automatic throttling before a human even needs to look at the dashboard.
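Wiring the four metrics into an automatic throttle can be as simple as the sketch below; the thresholds mirror the alert examples above and are illustrative, as is the class itself:

```python
class PipelineMonitor:
    """Minimal metrics snapshot for an agent pipeline (illustrative thresholds)."""

    def __init__(self):
        self.tasks_created = 0
        self.tasks_completed = 0
        self.spawn_depth = 0      # deepest sub-agent nesting currently live
        self.queue_depth = 0      # pending items between planner and executor

    def completion_ratio(self):
        # Near 1.0 is healthy; 0.5 means work is created twice as fast
        # as it is finished.
        if not self.tasks_created:
            return 1.0
        return self.tasks_completed / self.tasks_created

    def should_throttle(self):
        # Any single metric crossing its threshold triggers throttling
        # before a human looks at the dashboard.
        return (self.completion_ratio() < 0.5
                or self.spawn_depth > 3
                or self.queue_depth > 200)
```

The orchestrator polls `should_throttle()` before each planning step; a `True` result pauses decomposition rather than merely paging someone.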

The Fundamental Tradeoff: Throughput vs. Safety

There is a cost to backpressure. A system that limits concurrency, enforces budgets, and sheds load will produce less total output than one running at full speed with no limits. The agent with no guardrails will, on average, explore more possibilities, gather more context, and attempt more sophisticated plans.

It will also, occasionally, spend $47,000 talking to itself for 11 days.

The engineering discipline here is the same as in any distributed system. You do not run your database at 100% CPU utilization because the first traffic spike will tip it over. You do not fill your message queue to 95% capacity because one slow consumer will cause cascading failures. And you do not let your agent pipeline generate unbounded work because the one time it enters a recursive loop, you will not know until the invoice arrives.

Backpressure is not an optimization. It is a survival mechanism. The teams that ship reliable agent systems are not the ones with the most sophisticated models — they are the ones that treat their agent pipeline with the same engineering rigor they would apply to any distributed system handling production traffic. Bounded queues, budget enforcement, circuit breakers, load shedding, and adaptive concurrency are not new ideas. They are proven ideas, waiting to be applied to the newest class of systems that desperately need them.
