Sequential Tool Call Waterfalls: The Hidden Latency Tax in Agent Loops
If you've profiled an AI agent that felt inexplicably slow, chances are you found a waterfall. The agent called tool A, waited, then called tool B, waited, then called tool C — even though B and C had no dependency on A's result. You just paid 3× the latency for 1× the work.
This pattern is not an edge case. It's the default behavior of virtually every agent framework. The model returns multiple tool calls in a single response, and the execution loop runs them one at a time, in order. Fixing it isn't complicated, but first you need a reliable way to identify which calls are actually independent.
The problem is structurally identical to the N+1 query problem in ORMs. In traditional backends, an inexperienced developer fetches a list of 100 orders, then makes 100 individual database queries to get each order's customer — when a single JOIN would have done the job. The code looks correct; the performance is a disaster. In agent loops, the equivalent is an LLM that generates 5 independent search() calls to answer a comparison question, and your orchestrator dutifully executes them in a chain instead of a fan-out.
Why Agent Frameworks Default to Sequential Execution
The mechanics are straightforward. When a model returns a tool use response containing multiple tool calls, the simplest correct implementation is a for loop: iterate over the list, execute each one, collect results, continue. This is what most frameworks ship by default, and it works — in the sense that it produces correct results.
The problem is that "correct" and "fast" are not the same thing when tool calls are I/O-bound. Each tool call that hits an external API, a database, or a vector store spends most of its time waiting on the network. A sequential loop serializes that wait time:
- Tool A: 300ms of network I/O
- Tool B: 200ms of network I/O
- Tool C: 400ms of network I/O
- Sequential total: 900ms
- Parallel total: 400ms (time of the slowest call)
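The arithmetic is easy to demonstrate with asyncio. In this sketch, asyncio.sleep stands in for network-bound tool calls with the delays above (the tool names and delays are illustrative, not from any particular framework):

```python
import asyncio
import time

async def call_tool(name: str, delay: float) -> str:
    # Stand-in for an I/O-bound tool call (API, DB, vector store).
    await asyncio.sleep(delay)
    return f"{name} done"

async def sequential() -> float:
    start = time.perf_counter()
    for name, delay in [("A", 0.3), ("B", 0.2), ("C", 0.4)]:
        await call_tool(name, delay)
    return time.perf_counter() - start

async def parallel() -> float:
    start = time.perf_counter()
    await asyncio.gather(
        call_tool("A", 0.3), call_tool("B", 0.2), call_tool("C", 0.4)
    )
    return time.perf_counter() - start

seq = asyncio.run(sequential())  # ~0.9s: the waits are serialized
par = asyncio.run(parallel())    # ~0.4s: bounded by the slowest call
```

The parallel version is not doing less work; it is overlapping the waiting.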
The real-world impact compounds in multi-turn agent loops where the model makes tool calls across several reasoning steps. Production traces from agents doing research-style tasks routinely show 60–70% of total latency attributable to serialized I/O that could have been parallelized.
Anthropic's internal research on multi-agent systems found up to 90% reductions in task completion time on complex research queries when parallel subagents with concurrent tool execution replaced a single sequential agent. The LLMCompiler architecture, a well-benchmarked academic approach to this problem, shows 1.4× to 3.7× latency improvements across standard benchmarks and, more notably, a 4.65× reduction in cost, because you make fewer LLM round-trips when you batch more work into each one.
The Dependency Graph: Your Diagnostic Tool
Before you can parallelize tool calls, you need to know which ones are safe to run concurrently. The answer comes from a simple question about each pair of tool calls in a batch: does the input to call B depend on the output of call A?
If yes, they must run sequentially. If no, they can run in parallel. Represent this as a directed acyclic graph (DAG), where each node is a tool call and each edge represents a dependency. Nodes with no incoming edges (no dependencies) can all execute at the same time. After that batch completes, you execute the next layer of the graph — nodes whose dependencies are now satisfied — and so on.
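A layering pass over such a graph fits in a few lines. This is a sketch, assuming the graph arrives as a dict mapping each call to the set of calls whose output it needs:

```python
def execution_layers(deps: dict[str, set[str]]) -> list[set[str]]:
    """Group tool calls into batches that can run concurrently.

    Each returned layer contains only calls whose dependencies
    were satisfied by earlier layers.
    """
    layers: list[set[str]] = []
    done: set[str] = set()
    remaining = dict(deps)
    while remaining:
        # Ready = no incoming edges from calls that haven't run yet.
        ready = {n for n, d in remaining.items() if d <= done}
        if not ready:
            raise ValueError("dependency cycle detected")
        layers.append(ready)
        done |= ready
        for n in ready:
            del remaining[n]
    return layers

# One root with three dependents collapses into two batches:
graph = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"A"}}
layers = execution_layers(graph)
# layers == [{"A"}, {"B", "C", "D"}]
```

Each layer can then be handed to a concurrent executor; the loop over layers preserves ordering where it matters.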
In practice, constructing this graph doesn't require heavy machinery. Most agentic tasks fall into one of three shapes:
Pure fan-out: Multiple independent calls with no shared dependencies. "Compare weather in five cities" generates five get_weather calls, none of which depend on each other. Run them all at once.
Linear chain: Each call depends on the previous result. "Get the user's account ID, then fetch their order history, then recommend products" is inherently sequential — you can't look up orders without an account ID. No parallelism is available here, and that's correct.
Mixed DAG: A combination of both. "Get company data, then in parallel fetch the CEO profile, revenue figures, and patent filings, then combine for a competitive analysis." The first call is sequential because everything else depends on it. The next three are parallelizable. The final synthesis step is sequential because it depends on all three.
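Expressed as dependency maps (call → the calls it waits on), the three shapes are easy to tell apart at a glance. The call names below mirror the examples above and are illustrative:

```python
# Pure fan-out: no edges, everything runs in one parallel batch.
fan_out = {f"get_weather_{i}": set() for i in range(5)}

# Linear chain: each call waits on the previous one; no parallelism.
chain = {
    "get_account_id": set(),
    "get_order_history": {"get_account_id"},
    "recommend_products": {"get_order_history"},
}

# Mixed DAG: one root, a parallel middle layer, one joining step.
mixed = {
    "get_company": set(),
    "ceo_profile": {"get_company"},
    "revenue": {"get_company"},
    "patents": {"get_company"},
    "analysis": {"ceo_profile", "revenue", "patents"},
}
```

A layering pass over these yields one batch for the fan-out, three single-call batches for the chain, and three batches for the mixed DAG with a three-wide middle layer, which is exactly the available parallelism.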
The most important diagnostic habit is learning to distinguish the mixed DAG case from the false linear chain. Agents frequently generate what looks like a linear chain but is actually a mixed DAG in disguise — the LLM just happened to list the calls in dependency order without signaling which ones could overlap.
Identifying Dependencies in Practice
For simple tool calls, dependency analysis reduces to inspecting parameter values. If tool call B uses a literal parameter that was hardcoded in the prompt or context, it's independent. If tool call B uses $result_of_A as a parameter — a placeholder referencing a prior call's output — it's dependent.
The LLMCompiler architecture formalizes this with a three-phase design:
- Planner phase: The model outputs a structured plan where tool calls explicitly reference prior results using variable syntax (e.g., search(query=$1.output)). This makes dependencies machine-readable.
- Task fetching unit: A topological sort of the dependency graph. At each step, it identifies all nodes whose dependencies are already satisfied and schedules them for parallel execution.
- Joiner: After each execution batch, it feeds results back into the context, resolves placeholder variables, and either returns the final answer or triggers a second planning cycle for cases where the original plan was underspecified.
You don't need to implement LLMCompiler in full to benefit from the core insight. Even a simple check — "does this tool call's parameter list contain any reference to a prior tool call's output?" — is enough to split a batch of calls into independent and dependent groups and execute the independent group concurrently.
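A minimal version of that check, assuming tool calls arrive as name-plus-arguments dicts and dependent arguments use a `$<id>` placeholder convention (both are assumptions for illustration, not any particular framework's wire format):

```python
import re

# Matches placeholder references like "$1" or "$1.output".
PLACEHOLDER = re.compile(r"\$\w+")

def split_batch(calls: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split a batch into calls runnable now vs. calls that must wait.

    A call is dependent if any string argument references a prior
    call's output via a $<id> placeholder.
    """
    independent, dependent = [], []
    for call in calls:
        args = call.get("args", {})
        refs = any(
            PLACEHOLDER.search(v)
            for v in args.values()
            if isinstance(v, str)
        )
        (dependent if refs else independent).append(call)
    return independent, dependent

batch = [
    {"name": "search", "args": {"query": "acme corp revenue"}},
    {"name": "search", "args": {"query": "acme corp CEO"}},
    {"name": "summarize", "args": {"text": "$1.output"}},
]
ready, blocked = split_batch(batch)
# ready holds the two searches; blocked holds the summarize call
```

The two searches can go to a concurrent executor immediately; the summarize call waits for placeholder resolution.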
Modern framework support has also improved considerably. OpenAI's Agents SDK exposes parallel_tool_calls=True in model settings. AutoGen runs tool calls in parallel by default when the model generates multiple in one response. LangGraph supports parallel node execution for graph branches with no shared state. Google's Agent Development Kit ships first-class parallel agent primitives. If you're using any of these frameworks, the capability exists — you may just need to enable it explicitly.
The Coordination Overhead Math
Parallelism isn't free. There are two concrete costs that determine whether it pays off: coordination overhead and resource contention.
Coordination overhead is the work your orchestrator does to fan out calls, track completion, and reassemble results. For a pure fan-out of 5 simple API calls, this overhead is negligible — microseconds of async bookkeeping against hundreds of milliseconds of I/O. The math strongly favors parallelism.
The calculus shifts when you scale to many agents or tools. An orchestrator managing 50 concurrent operations must track status for all of them, route new tasks as dependencies resolve, and maintain a coherent workspace model. At some agent count, the orchestrator itself becomes the bottleneck. Production teams working with multi-agent systems consistently report that the sweet spot for initial deployments is 3–5 concurrent agents. Teams that start with 20+ agents routinely find that the coordination overhead erases much of the theoretical speedup.
The practical takeaway: start with parallel execution at the tool-call level within a single agent before reaching for multi-agent orchestration. A single agent executing 5 tools in parallel is simpler to reason about, cheaper to run, and often just as fast as 5 agents running one tool each.
Resource contention is the other failure mode, and it's subtler. If your 5 parallel tool calls all hit the same rate-limited external API, you've turned a latency optimization into a throttling trigger. Sequential execution at 1 req/s is fine; 5 simultaneous requests at the same API may blow past your rate limit, trigger exponential backoff, and leave you with worse latency than you started with.
The fix is to check whether parallel calls target independent or shared rate-limited resources. Independent resources — different APIs, a vector store and a SQL database, a web search and a file read — parallelize cleanly. Calls to the same rate-limited endpoint need a concurrency cap or request staggering.
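A concurrency cap is a few lines with asyncio.Semaphore. This sketch limits in-flight requests to a shared endpoint to two at a time; the limit, endpoint name, and fake fetch are all illustrative:

```python
import asyncio

async def fetch(endpoint: str, item: str) -> str:
    # Stand-in for a call to a rate-limited external API.
    await asyncio.sleep(0.1)
    return f"{endpoint}/{item}"

async def capped_fan_out(items: list[str], limit: int = 2) -> list[str]:
    sem = asyncio.Semaphore(limit)

    async def one(item: str) -> str:
        async with sem:  # at most `limit` requests in flight
            return await fetch("api.example.com", item)

    # gather preserves input order regardless of completion order.
    return await asyncio.gather(*(one(i) for i in items))

results = asyncio.run(capped_fan_out(["a", "b", "c", "d", "e"]))
```

You keep most of the latency win while never exceeding the cap; tuning the limit against the provider's rate limit replaces all-or-nothing serialization.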
When Sequential Is Correct
Not everything is a waterfall. There's a failure mode in the opposite direction: forcing parallelism onto inherently sequential workflows because it feels like an optimization.
The clearest signal that sequential execution is correct: the task is a chain of reasoning where each step materially changes what you'd do in the next step. If you're debugging an error, you look at the stack trace, form a hypothesis, check the relevant code, update your hypothesis based on what you see, then decide what to look at next. Each tool call shapes the intent of the following one in a way that a static dependency graph can't capture. Trying to pre-plan this as a DAG produces brittle plans that break when reality doesn't match the upfront assumptions.
A secondary signal: very tight latency requirements with lightweight tools. If each tool call takes under 20ms (in-process function calls, local cache lookups), the async coordination overhead of parallelism can actually exceed the time saved. Measure before you optimize.
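When the calls are this cheap, the fastest way to settle the question is to time both paths directly. A throwaway benchmark like this (the no-op tool stands in for an in-process call) often shows the async machinery costing more than it saves:

```python
import asyncio
import time

async def cheap_tool(x: int) -> int:
    return x * 2  # in-process work, no I/O to overlap

def timed(coro) -> float:
    start = time.perf_counter()
    asyncio.run(coro)
    return time.perf_counter() - start

async def sequential(n: int) -> list[int]:
    return [await cheap_tool(i) for i in range(n)]

async def fanned_out(n: int) -> list[int]:
    return await asyncio.gather(*(cheap_tool(i) for i in range(n)))

seq_t = timed(sequential(1000))
par_t = timed(fanned_out(1000))
# With no I/O to overlap, gather adds task-scheduling overhead
# instead of hiding latency; compare seq_t and par_t before
# committing to the parallel path.
```

The relative numbers depend on your runtime, which is exactly the point: measure, don't assume.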
A Decision Framework
When you encounter a batch of tool calls in an agent loop:
- Check parameter dependencies. Does any call's input reference another call's output? If yes, those two calls must be ordered. If no, they're candidates for parallelism.
- Check resource overlap. Do the parallel candidates target the same rate-limited API or shared mutable state? If yes, apply a concurrency cap. If no, run them fully in parallel.
- Check reasoning dependence. Even without hard parameter dependencies, would seeing the result of call A change what you'd want from call B? If yes, run sequentially and let the model replan. This applies most to exploratory or debugging workflows.
- Measure the shape of your calls. Pure fan-outs get the biggest wins from parallelism. Mixed DAGs get partial wins at each parallel phase. Linear chains get nothing — and that's the correct answer, not a failure.
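The first three checks can be sketched as a single classification step per call. Everything here — the field names, the resource tags, the reasoning_dependent flag, the cap of 3 — is an illustrative convention for the sketch, not a framework API:

```python
from collections import Counter

def plan_batch(calls: list[dict], cap: int = 3) -> list[dict]:
    """Annotate each call with how it should be scheduled.

    Each call dict carries: name, depends_on (set of call names),
    resource (rate-limit domain), optional reasoning_dependent flag.
    """
    counts = Counter(c["resource"] for c in calls)
    plan = []
    for c in calls:
        if c["depends_on"] or c.get("reasoning_dependent"):
            mode = "sequential"       # checks 1 and 3
        elif counts[c["resource"]] > cap:
            mode = "parallel-capped"  # check 2
        else:
            mode = "parallel"
        plan.append({**c, "mode": mode})
    return plan

calls = [
    {"name": "s1", "depends_on": set(), "resource": "search"},
    {"name": "s2", "depends_on": set(), "resource": "search"},
    {"name": "join", "depends_on": {"s1", "s2"}, "resource": "llm"},
]
plan = plan_batch(calls)
# s1 and s2 -> "parallel"; join -> "sequential"
```

The fourth check is a measurement, not a rule, so it stays outside the classifier: run the annotated plan and compare against the sequential baseline.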
The N+1 query problem persisted in backend systems for years because the waterfall pattern looked correct in code review and the performance only became obvious under load. Agent tool call waterfalls have the same property: they pass all functional tests, they produce correct answers, and they silently charge you 3× the latency on every request. The fix — dependency analysis and concurrent execution — is not complicated. The prerequisite is recognizing the pattern when you see it in your traces.
References
- https://www.codeant.ai/blogs/parallel-tool-calling
- https://google.github.io/adk-docs/agents/workflow-agents/parallel-agents/
- https://learn.microsoft.com/en-us/azure/architecture/ai-ml/guide/ai-agent-design-patterns
- https://www.anthropic.com/engineering/multi-agent-research-system
- https://arxiv.org/pdf/2312.04511
- https://agent-patterns.readthedocs.io/en/stable/patterns/llm-compiler.html
- https://airbyte.com/agentic-data/parallel-tool-calls-llm
- https://continue.ghost.io/parallel-tool-calling/
- https://www.anthropic.com/engineering/advanced-tool-use
- https://cookbook.openai.com/examples/agents_sdk/parallel_agents
- https://towardsdatascience.com/why-your-multi-agent-system-is-failing-escaping-the-17x-error-trap-of-the-bag-of-agents/
