
Capacity Planning When Every Request Thinks a Different Amount

10 min read
Tian Pan
Software Engineer

Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.

Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by three orders of magnitude, and nothing in the request told you which one you were about to get.

This is not noise you can average away. It is a structural property of the workload, and it means your p50 tells you almost nothing about your p99. If you provision for the median and the tail decides your bill, you will be wrong in both directions at once: over-provisioned for the common case, under-provisioned for the case that actually hurts.

The Unit of Work Is No Longer the Request

For a deterministic service, "one request" is a meaningful capacity unit because its compute is bounded by code you wrote. The branches are finite, the loops terminate, the database query touches a known index. You can reason about the worst case by reading the source.

An agent request has no such bound. The amount of compute it consumes is decided at runtime by the model, based on the task, the tools available, and the path it happens to take through its own reasoning. Industry measurements put agentic workloads at 5–30x the token consumption of a standard chat turn, and that multiplier is itself a distribution, not a constant. The same task, run twice, can differ by 10x in tokens because the model took a longer route the second time.

So the request stops being a useful unit. What you actually consume is tokens and tool calls, and those are the units you have to forecast in. A capacity plan denominated in "requests per second" is measuring the wrong thing: it is counting envelopes without weighing them.

The practical move is to stop forecasting request counts and start forecasting two distributions: tokens per request and tool calls per request. Not the means of those distributions — the full shape, especially the upper percentiles. Your infrastructure bill is a sum over the token distribution, and the latency your users feel is governed by the tool-call distribution. Both are fat-tailed, and a fat tail is exactly the regime where the mean is a liar.

Why Averaging Hides the Number That Matters

Take a workload where 95% of requests cost 5,000 tokens and 5% cost 800,000 tokens. The mean is about 45,000 tokens. If you plan around that mean, you have built a number that describes almost none of your actual requests — the bulk are 9x cheaper, the tail is 18x more expensive — and the tail, despite being one request in twenty, accounts for roughly 90% of total token spend.
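The arithmetic behind those figures can be checked in a few lines. This is a sketch that uses only the numbers from the example above; the variable names are illustrative.

```python
# Two-population workload from the example: share of requests and tokens per request.
bulk_share, bulk_tokens = 0.95, 5_000
tail_share, tail_tokens = 0.05, 800_000

# Mean tokens per request across both populations.
mean_tokens = bulk_share * bulk_tokens + tail_share * tail_tokens

# Fraction of all tokens consumed by the tail population.
tail_spend_share = (tail_share * tail_tokens) / mean_tokens

print(f"mean tokens per request: {mean_tokens:,.0f}")
print(f"bulk requests are {mean_tokens / bulk_tokens:.1f}x cheaper than the mean")
print(f"tail requests are {tail_tokens / mean_tokens:.1f}x the mean")
print(f"tail share of total token spend: {tail_spend_share:.1%}")
```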

This is the trap. The average sits in an empty valley between two populations. Provision compute for the average and you under-serve the tail badly enough that those requests queue, time out, or get throttled. Budget from an average that under-weights the tail, as a median or a short sample window easily does, and your actual invoice arrives at double or triple your forecast, because the tail you discounted is where the money went.

The fix is to refuse to collapse the distribution to a single number too early. Track p50, p90, p99, and p99.9 of tokens-per-request and tool-calls-per-request as first-class metrics, the same way a latency-sensitive service tracks tail latency. The gap between p50 and p99 is not a measurement artifact to be smoothed; it is the single most important fact about the workload. A 4x gap and a 400x gap call for completely different architectures, yet two such workloads can share the same mean.
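One way to make those percentiles concrete is a small summary function over logged per-request token counts. This is a minimal sketch: the nearest-rank percentile is one common definition among several, and `tail_summary` and the sample log are hypothetical, not taken from any particular metrics library.

```python
import math

def percentile(sorted_values, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not sorted_values:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p / 100 * len(sorted_values)))
    return sorted_values[rank - 1]

def tail_summary(samples, levels=(50, 90, 99, 99.9)):
    """Report the upper percentiles plus the p99/p50 gap for one metric."""
    ordered = sorted(samples)
    summary = {f"p{level}": percentile(ordered, level) for level in levels}
    summary["p99_over_p50"] = summary["p99"] / summary["p50"]
    return summary

# Hypothetical tokens-per-request log: 950 cheap requests, 50 expensive ones.
tokens_per_request = [5_000] * 950 + [800_000] * 50
print(tail_summary(tokens_per_request))
```

The same function applies unchanged to tool-calls-per-request. Note that p50 and p90 here both report the cheap population; only the upper percentiles reveal the tail.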

There is a second-order effect worth naming. The tail is not just expensive; it is correlated with load. When the system is busy, continuous batching packs more sequences onto each GPU, KV-cache pressure rises, and the long-running tail requests are precisely the ones holding cache slots the longest. So the tail gets slower exactly when you have the most of it. The p99 you measured at low load is optimistic about the p99 you will see at peak.

Forecast in Tokens and Tool Calls, Not Requests

Concretely, rebuild the capacity model around two inputs.

The token distribution per feature. Different product surfaces have wildly different shapes. A "summarize this thread" feature might be tight — p50 and p99 within 3x. An open-ended "research and draft" agent might span 100x. Aggregate them and you get a meaningless blended curve. Measure each surface separately, because each one scales differently and a single runaway feature can dominate the whole bill.

The tool-call distribution per feature. Tool calls drive wall-clock latency and concurrency pressure more than raw tokens do, because each call is a round trip — network, an external API, often a queue. An agent averaging three tool calls with a p99 of sixty is a concurrency problem disguised as a token problem. Those sixty-call requests hold a worker, a context window, and a connection for minutes.

With those two distributions per feature, capacity planning becomes a Monte Carlo exercise rather than a multiplication. Sample from the real per-feature distributions at your projected request mix, run it a few thousand times, and read off the resulting distribution of total tokens, peak concurrent tool calls, and GPU-seconds. The output is a range with confidence bounds — which is honest, because the workload genuinely is a range. A point estimate was always a fiction; it just used to be a close enough fiction.
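The sampling loop described above can be sketched with the standard library alone. Everything here is an assumption for illustration: the feature names, the sample values, the request mix, and the window size are invented, and the sketch forecasts only total tokens, not peak concurrent tool calls or GPU-seconds.

```python
import random

def simulate_window(feature_mix, requests_per_window, token_samples, rng):
    """Total tokens for one simulated window, resampling measured per-feature costs."""
    features = list(feature_mix)
    weights = [feature_mix[name] for name in features]
    total = 0
    for feature in rng.choices(features, weights=weights, k=requests_per_window):
        total += rng.choice(token_samples[feature])
    return total

def forecast(feature_mix, requests_per_window, token_samples, runs=2_000, seed=0):
    """Run many simulated windows and report percentiles of total token demand."""
    rng = random.Random(seed)
    totals = sorted(
        simulate_window(feature_mix, requests_per_window, token_samples, rng)
        for _ in range(runs)
    )

    def pick(q):
        return totals[min(runs - 1, int(q * runs))]

    return {"p50": pick(0.50), "p95": pick(0.95), "p99": pick(0.99)}

# Hypothetical measured samples (tokens per request) and projected request mix.
token_samples = {
    "summarize": [3_000, 4_000, 5_000, 6_000, 9_000],
    "research": [20_000] * 18 + [400_000, 1_200_000],
}
feature_mix = {"summarize": 0.8, "research": 0.2}

print(forecast(feature_mix, requests_per_window=200, token_samples=token_samples))
```

Resampling from the logged values, rather than fitting a parametric curve, keeps the tail exactly as fat as it was measured. The cost is that the model cannot produce a request more expensive than any it has already seen.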

This also reframes autoscaling. CPU utilization is useless here, and GPU utilization is nearly as bad: it pegs near 100% during active batching regardless of real headroom. The signals that actually track serving pressure are queue depth, queue wait time, in-flight sequence count, KV-cache occupancy, and time-to-first-token. Recent serving research points the same way: scale on queue and cache pressure, not on hardware-busy percentages, because the hardware looks equally busy whether you are comfortable or about to fall over.
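A scaling rule driven by those signals might look like the following sketch. The threshold values, the one-replica step, and the names `ServingSignals`, `Thresholds`, and `desired_replicas` are all hypothetical; real values would come from load testing, and in-flight sequence count is omitted for brevity.

```python
from dataclasses import dataclass

@dataclass
class ServingSignals:
    queue_depth: int           # requests waiting to be scheduled
    queue_wait_s: float        # recent queue wait time, in seconds
    kv_cache_occupancy: float  # fraction of KV-cache in use, 0.0 to 1.0
    ttft_s: float              # recent time-to-first-token, in seconds

@dataclass(frozen=True)
class Thresholds:
    # Hypothetical limits; tune against your own latency targets.
    max_queue_wait_s: float = 2.0
    max_kv_cache_occupancy: float = 0.85
    max_ttft_s: float = 1.5
    idle_kv_cache_occupancy: float = 0.40

def desired_replicas(current, signals, limits=Thresholds()):
    """Scale on queue and cache pressure; GPU utilization is deliberately ignored."""
    pressured = (
        signals.queue_wait_s > limits.max_queue_wait_s
        or signals.kv_cache_occupancy > limits.max_kv_cache_occupancy
        or signals.ttft_s > limits.max_ttft_s
    )
    idle = (
        signals.queue_depth == 0
        and signals.kv_cache_occupancy < limits.idle_kv_cache_occupancy
    )
    if pressured:
        return current + 1
    if idle and current > 1:
        return current - 1
    return current

print(desired_replicas(3, ServingSignals(12, 5.0, 0.60, 0.4)))
```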

The Runaway Request Is a Capacity Event, Not Just a Bug

Every team running agents in production eventually meets the request that does not stop. The agent retries a dead endpoint in a loop. It re-plans, re-reads its inflating context, re-plans again. Publicly shared incidents are not subtle: a retry loop that quietly ran up $47,000 over eleven days, an agent loop that burned $30,000, hundreds of thousands of API calls with not a single alert firing.

It is tempting to file this under "bugs." But from a capacity standpoint, the runaway request is the natural conclusion of a workload whose per-request cost is unbounded. If a single request can in principle consume a million tokens, then a single buggy request can consume ten million, and the only thing standing between you and that outcome is an explicit limit. Capacity planning for this workload is not complete until you have decided what the ceiling is — because if you do not set one, the model sets it for you, and the model does not know your budget.

That ceiling has to be enforced outside the agent. An agent asked to respect its own token budget is one bad reasoning step away from skipping the check; the buggy path and the budget-checking path are the same untrusted code. The limit belongs at the gateway — the layer that forwards LLM calls. Cap tokens per request, tool calls per request, wall-clock per request, and consecutive identical errors. When a request crosses a line, the gateway kills it. The agent cannot make a call that violates the policy because the call never leaves.
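The four limits named above can be sketched as a per-request budget object that the gateway consults before forwarding each call. This is an illustration under stated assumptions, not a reference implementation: `RequestBudget`, `BudgetExceeded`, and the default limits are hypothetical, and a real gateway would also need to cancel work already in flight.

```python
import time
from dataclasses import dataclass, field

class BudgetExceeded(Exception):
    """Raised by the gateway when a request crosses one of its limits."""

@dataclass
class RequestBudget:
    # Hypothetical defaults; each one is a capacity parameter you choose.
    max_tokens: int = 200_000
    max_tool_calls: int = 40
    max_wall_clock_s: float = 300.0
    max_identical_errors: int = 3
    tokens_used: int = 0
    tool_calls_used: int = 0
    started_at: float = field(default_factory=time.monotonic)
    last_error: str = ""
    error_streak: int = 0

    def authorize(self, estimated_tokens=0, is_tool_call=False):
        """Call before forwarding; raising here means the call never leaves the gateway."""
        if time.monotonic() - self.started_at > self.max_wall_clock_s:
            raise BudgetExceeded("wall-clock limit reached")
        if self.tokens_used + estimated_tokens > self.max_tokens:
            raise BudgetExceeded("token limit reached")
        if is_tool_call:
            if self.tool_calls_used + 1 > self.max_tool_calls:
                raise BudgetExceeded("tool-call limit reached")
            self.tool_calls_used += 1

    def record(self, tokens=0, error=None):
        """Call after each forwarded call with actual token usage and any error message."""
        self.tokens_used += tokens
        if error is None:
            self.last_error, self.error_streak = "", 0
            return
        self.error_streak = self.error_streak + 1 if error == self.last_error else 1
        self.last_error = error
        if self.error_streak >= self.max_identical_errors:
            raise BudgetExceeded("repeated identical error")
```

The budget lives in the gateway process and is keyed by request, so the agent never holds a reference to it and cannot skip the check.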

Treat the ceiling as a real capacity parameter, not a panic button. Your worst-case-per-request limit, multiplied by your worst-case concurrent request count, is your actual provisioning ceiling — the number you size the cluster and the monthly budget against. Without an enforced limit, that product is infinity, and you cannot plan against infinity. With one, the fat tail becomes a fat-but-bounded tail, which is a thing a Monte Carlo model can handle. The limit is what makes the workload plannable at all.
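The product described above is short enough to write down directly. In this sketch the function name and the example limits are hypothetical, and a missing limit is modelled as `None`.

```python
import math

def worst_case_tokens_in_flight(max_tokens_per_request, max_concurrent_requests):
    """Upper bound on tokens committed at once; None means no enforced limit."""
    if max_tokens_per_request is None or max_concurrent_requests is None:
        return math.inf
    return max_tokens_per_request * max_concurrent_requests

print(worst_case_tokens_in_flight(200_000, 50))  # 10000000
print(worst_case_tokens_in_flight(None, 50))     # inf
```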

Designing for a Cost You Cannot Know in Advance

The deepest adjustment is accepting that you cannot know a request's cost until it finishes. Deterministic capacity planning leaned hard on foreknowledge — you knew the work before doing it. That lever is gone, and pretending otherwise produces plans that are precise and wrong.

What you can do is design the system to degrade in a controlled way as cost reveals itself:

  • Classify before you commit. A cheap, fast triage step — a small model or a heuristic — can sort incoming requests into "likely short" and "likely long" lanes and route them to separate pools. You will be wrong sometimes, but even a rough split lets the short lane keep its tight latency instead of queueing behind a tail request.
  • Meter as you go. Track token and tool-call spend live, mid-request. A request that has used 80% of its budget and is not converging is information you can act on before the remaining 20% is gone.
  • Make the tail preemptible. Long-running agent requests should be the ones you can pause, checkpoint, or reschedule when a burst of cheap interactive requests arrives. Runtime rescheduling across instances is becoming standard in serving systems precisely because the tail needs to yield to the median under load.
  • Budget per feature, alert per feature. A blended cost dashboard hides the surface that is drifting. Per-feature token and tool-call budgets turn slow creep into an early alert instead of a quarter-end surprise.
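The "meter as you go" item can be sketched as a decision made after every agent step. The 80% figure comes from the text above; the stall counter, the action names, and `next_action` itself are assumptions, since how you detect that a request "is not converging" depends on the agent.

```python
from dataclasses import dataclass

@dataclass
class MeterReading:
    tokens_used: int
    token_budget: int
    steps_since_progress: int  # hypothetical convergence signal supplied by the caller

def next_action(reading, warn_fraction=0.8, stall_steps=3):
    """Decide mid-request whether to continue, defer, or stop."""
    used = reading.tokens_used / reading.token_budget
    if used >= 1.0:
        return "stop"
    if used >= warn_fraction and reading.steps_since_progress >= stall_steps:
        # Most of the budget is spent and the request has stalled: act now,
        # while there is still budget left to checkpoint the work.
        return "checkpoint_and_defer"
    return "continue"

print(next_action(MeterReading(85_000, 100_000, steps_since_progress=4)))
```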

None of this makes the workload deterministic. It makes it observable and bounded — which is the realistic goal. You are not going to predict the cost of the next request. You are going to provision for a distribution you have actually measured, cap the tail so it cannot run to infinity, and build the system to shed or defer expensive work gracefully when the tail shows up all at once.

The mindset shift is the whole job. Stop planning for the average request, because there is no average request — there is a short one and a long one wearing the same URL. Plan for the distribution, instrument the percentiles, enforce the ceiling, and let the median and the tail live in the separate worlds they were always in. The teams that get surprised by their agent infrastructure bill are not bad at math. They are doing correct arithmetic on a number — the mean — that was never describing their workload in the first place.
