Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think
A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour. Traffic that should have used 40 percent of their reserved capacity saturated all of it, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.
This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.
The pattern repeats across every team that grafts an agent onto a request-shaped capacity model. Provisioned throughput contracts (PTU on Azure, GSU on Vertex, provisioned-throughput endpoints on Bedrock and Databricks) are denominated in tokens per second, requests per minute, or some hybrid; capacity engineering on those platforms is mature and well documented. What is not well documented is the multiplier between the user-facing event and the meter-facing event when an agent loop is in the middle. That multiplier is the whole story.
The fan-out factor is not optional: it is the metric
A bare chat call is roughly a one-to-one mapping: one user request, one model call, predictable token counts on both sides. An agent loop is not. A single user request fires a planning step, then a tool call, then a model call to interpret the tool result, then often another planning step, and frequently a final formatting or self-critique pass. Format-correction retries — the model emitted invalid JSON or named a tool that does not exist — add another round trip per retry. Multi-agent handoffs add even more.
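To make the anatomy concrete, here is a toy sketch of that loop instrumented to count model calls. `call_model` and `run_tool` are simulated stand-ins, and the outcome probabilities are illustrative assumptions rather than measurements; the placement of the counter is the point, because every planning step, interpretation step, and format-correction retry increments it.

```python
import random

# Toy stand-ins for a model client and tool runner. The outcome
# probabilities are illustrative assumptions, not measurements.
def call_model(prompt: str) -> dict:
    return {"final": random.random() < 0.35,   # ~3 planning draws on average
            "valid": random.random() < 0.85}   # ~15% invalid-format rate

def run_tool(step: dict) -> str:
    return "tool-result"                       # tool execution: no model call

MAX_FORMAT_RETRIES = 2

def handle_request(user_query: str) -> int:
    """Run one agent loop; return the model calls it consumed (loop depth)."""
    step = call_model(f"plan: {user_query}")            # planning step
    calls = 1
    while not step["final"]:
        result = run_tool(step)
        step = call_model(f"interpret: {result}")       # read tool output
        calls += 1
        retries = 0
        while not step["valid"] and retries < MAX_FORMAT_RETRIES:
            step = call_model("fix the output format")  # each retry is a
            calls += 1                                  # full round trip
            retries += 1
    call_model(f"format final answer for: {user_query}")  # final pass
    return calls + 1

depths = [handle_request("q") for _ in range(10_000)]
print(f"mean loop depth: {sum(depths) / len(depths):.1f}")
```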
The token-cost ratios have been measured. Anthropic's own framing puts a standard chat at 1x token cost, a single-agent loop at roughly 4x, and a multi-agent system around 15x. Loop depth (the count of model calls per user request) is the cleanest single number to put on this. In practice I see production agents land between 5 and 20 model calls per request, with 8 to 12 being typical for retrieval-and-tool-use designs.
Two consequences fall out immediately:
- Provisioned token throughput dimensioned on user QPS over-promises capacity by exactly the loop fan-out factor. If your average loop depth is 11, the throughput you reserved is one-eleventh of the throughput you actually need at the same user load.
- Cost per user request scales with loop depth as well. A feature that pencils out at break-even when you assumed three calls becomes a margin disaster at eleven, and product teams that didn't put loop depth in the unit-economics model find this out in the next billing cycle.
The first consequence is a reliability problem. The second is a P&L problem. They share a root cause: the agent loop's fan-out is the unit of capacity and cost, and treating user requests as the unit hides the actual workload.
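Both consequences are two lines of arithmetic. A back-of-envelope sketch, with the reservation size and per-call cost as illustrative placeholders:

```python
loop_depth = 11                      # avg model calls per user request

# Consequence 1: effective capacity in user-facing terms.
provisioned_tps = 100_000            # reserved tokens/sec, illustrative
effective_tps = provisioned_tps / loop_depth
print(f"user-facing capacity: {effective_tps:,.0f} tokens/sec")     # ~9,091

# Consequence 2: unit economics.
cost_per_call = 0.004                # $/model call, illustrative
print(f"modeled cost/request: ${cost_per_call * 3:.3f}")      # assumed 3 calls
print(f"actual  cost/request: ${cost_per_call * loop_depth:.3f}")   # 3.7x worse
```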
Little's Law, applied where it actually matters
The right way to size an agent system is to apply Little's Law to model calls, not user requests. Little's Law says the average number of items in a stable system equals the arrival rate times the average time each item spends in the system: L = λW. For a chat product, the arrival rate is user QPS and the time-in-system is the model latency. For an agent product, the arrival rate at the model endpoint is user QPS multiplied by loop depth, and the time-in-system is the per-call model latency, not the end-to-end agent latency.
Concretely: if a chat product handles 10 user QPS at 2 seconds of model latency, Little's Law tells you to plan for 20 concurrent in-flight model calls and size memory, batch capacity, and reserved tokens accordingly. If an agent product handles 10 user QPS with a loop depth of 11 and 1.5 seconds of per-call model latency, the corresponding number is 10 × 11 × 1.5 = 165 concurrent in-flight model calls. That is the figure that has to fit inside your provisioned throughput envelope, not 20.
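As a reusable check, the arithmetic fits in one function; the token figures below are illustrative assumptions:

```python
def in_flight_model_calls(user_qps: float, loop_depth: float,
                          per_call_latency_s: float) -> float:
    """Little's Law, L = lambda * W, applied at the model endpoint:
    arrival rate is user QPS x loop depth; W is per-call latency."""
    return user_qps * loop_depth * per_call_latency_s

print(in_flight_model_calls(10, 1, 2.0))     # chat:   20 concurrent calls
print(in_flight_model_calls(10, 11, 1.5))    # agent: 165 concurrent calls

# The same multiplier governs token budgets: reserved tokens/sec must
# cover user QPS x loop depth x average tokens per call.
tokens_per_call = 2_000                      # illustrative assumption
print(10 * 11 * tokens_per_call)             # 220,000 tokens/sec to reserve
```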
This is the simplest possible correction and it changes provisioning by an order of magnitude. The same correction shows up wherever you do throughput math: tokens-per-second budgets are user-token-budget multiplied by loop depth; KV-cache memory pressure scales with concurrent calls, not concurrent users; batch-size requirements track in-flight calls, not in-flight requests.
The capacity-planning question to put on every AI feature spec is not "what is our peak user QPS" but "what is our peak model-call rate," and that requires the team to commit to a loop depth distribution before the feature ships, with both a mean and a tail (P95 loop depth is often 2–3x the mean because of retry tails).
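A sketch of that commitment in code, assuming loop depths pulled from your tracing system; the sample values here are made up:

```python
import statistics

depths = [5, 5, 6, 7, 8, 8, 9, 11, 12, 14, 18, 23]   # from agent traces

mean_depth = statistics.mean(depths)
p95_depth = statistics.quantiles(depths, n=20)[-1]    # 95th percentile

peak_user_qps = 10
print(f"steady-state model-call rate: {peak_user_qps * mean_depth:.0f}/sec")
print(f"tail model-call rate (P95):   {peak_user_qps * p95_depth:.0f}/sec")
```

Where to size the reservation between those two numbers is a judgment call; sizing it on user QPS alone is the mistake this article is about.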
Burstiness is worse than the average suggests
Average loop depth is the friendly number. The mean determines steady-state cost. The tail determines whether the system stays up.
Agent loops are bursty in two compounding ways. First, an individual request's depth is variable: a happy path completes in 5 calls, a path with a retry burns 8, and a path that hits a degenerate planner state burns 20 or more before a wall-clock cap kicks in. Second, the calls within a single request arrive at the model endpoint near-simultaneously when the agent fans out tool calls in parallel, and those bursts alternate with quiet windows during tool execution. The shape at the inference endpoint is not a smooth flow but a sequence of micro-bursts.
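A toy simulation makes the shape visible. Every number here is an illustrative assumption; the point is that peak per-second arrivals at the endpoint run far ahead of the average:

```python
import random
from collections import Counter

random.seed(0)
arrivals = Counter()                            # second -> model calls arriving

for _ in range(600):                            # 600 requests in a 60s window
    t = random.uniform(0, 60)
    for _phase in range(random.randint(2, 5)):  # loop iterations per request
        for _ in range(random.randint(1, 4)):   # parallel fan-out: calls land
            arrivals[int(t)] += 1               #   in the same second
        t += random.uniform(1.0, 3.0)           # quiet tool-execution gap

window = max(arrivals) + 1                      # seconds observed
mean_rate = sum(arrivals.values()) / window
print(f"average arrival rate: {mean_rate:.0f} calls/sec")
print(f"peak 1s arrival rate: {max(arrivals.values())} calls/sec")
```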
Provisioned throughput windows partially absorb this. Vertex AI's dynamic-window enforcement, for example, lets a 70k-tokens-per-second burst through if the 120-second average stays under quota. Azure's leaky-bucket variant does similar smoothing. But the smoothing window is finite, and a fleet of agents that all enter their planning phase at the same time can spike beyond the window's ability to absorb.
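Here is a minimal sketch of the smoothing idea: a trailing-window budget that admits bursts as long as the window average stays under the provisioned rate. It illustrates the mechanism only; the window length and token accounting are assumptions, not Vertex's or Azure's actual enforcement algorithm.

```python
from collections import deque

class SlidingWindowQuota:
    """Trailing-window token budget: admit a burst if the window average
    stays under the provisioned rate. Window length and accounting are
    assumptions for illustration, not a provider's real algorithm."""

    def __init__(self, tokens_per_sec: float, window_s: float = 120.0):
        self.budget = tokens_per_sec * window_s          # tokens per window
        self.window_s = window_s
        self.events: deque[tuple[float, int]] = deque()  # (time, tokens)
        self.spent = 0

    def admit(self, now: float, tokens: int) -> bool:
        while self.events and self.events[0][0] <= now - self.window_s:
            self.spent -= self.events.popleft()[1]       # age out old spend
        if self.spent + tokens > self.budget:
            return False                                 # over window avg: 429
        self.events.append((now, tokens))
        self.spent += tokens
        return True

quota = SlidingWindowQuota(tokens_per_sec=50_000)
print(quota.admit(now=0.0, tokens=70_000))   # True: the window absorbs it
```

The failure mode follows directly: an isolated burst rides comfortably on the reservation, but a fleet of agents entering their planning phase together can spend the entire window budget in a few seconds, after which every call 429s until old spend ages out.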
Sources
- https://blog.danslimmon.com/2022/06/07/using-littles-law-to-scale-applications/
- https://learn.microsoft.com/en-us/azure/foundry/openai/concepts/provisioned-throughput
- https://docs.cloud.google.com/vertex-ai/generative-ai/docs/provisioned-throughput/measure-provisioned-throughput
- https://www.systemoverflow.com/learn/design-fundamentals/latency-throughput/failure-modes-tail-latency-amplification-queuing-collapse-and-retry-storms
- https://paulcavallaro.com/blog/fanouts-and-percentiles/
- https://agentnativedev.medium.com/the-p99-problem-designing-llm-inference-for-real-users-11deb35bb8d4
- https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
- https://arxiv.org/html/2407.05347v1
- https://acecloud.ai/blog/agentic-ai-load-balancing/
- https://dev.to/mostafa_ibrahim_774fe947b/what-is-agent-observability-traces-loop-rate-tool-errors-and-cost-per-successful-task-bl5
