GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct, applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff, and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight, under a contract you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change (enabling tool use, lengthening context, switching to a larger model) can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that no two of those three cliffs land in the same week.
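To see how little room that budget leaves, here is a minimal sketch of the headroom arithmetic. The function names and the "ordinary drift" multipliers are illustrative assumptions, not part of any real planning tool; the 10x figure is the onboarding shift mentioned above.

```python
import math


def headroom_multiple(utilization: float) -> float:
    """Factor by which demand can grow before reaching 100% of provisioned capacity."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def survives(utilization: float, demand_multipliers: list[float]) -> bool:
    """True if the combined demand multipliers still fit inside the headroom."""
    return math.prod(demand_multipliers) <= headroom_multiple(utilization)


# Running at 75% utilization leaves roughly 1.33x of room.
print(round(headroom_multiple(0.75), 2))

# Ordinary drift fits: 5% more users and 20% more tasks per user (hypothetical numbers).
print(survives(0.75, [1.05, 1.20]))

# A single 10x enterprise onboarding does not.
print(survives(0.75, [10.0]))
```

The multipliers are multiplied rather than added because the cliffs compound: more users, each running more calls per task, each call carrying more tokens.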

Why QPS Stops Being The Right Unit

Capacity planning for stateless web services is QPS-shaped because the cost per request is roughly constant. A search query is a search query. The variance is small enough that the smooth curve is a real thing.

LLM inference does not work like that. The cost per request is dominated by tokens — input tokens determine prefill compute, output tokens determine decode compute, and an autoregressive decoder spends roughly 10x more time per output token than per input token at typical context lengths. A single agentic task can consume between 50,000 and 500,000 tokens across dozens of inference calls, depending on how deep the tool-call tree gets. Two requests routed to the same endpoint can differ in cost by two orders of magnitude.
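A rough way to make that spread concrete is to weight output tokens more heavily than input tokens and compare two requests. The sketch below is an illustration only: the 10x decode weight is the rough figure quoted above, not a measured constant, and both example requests are hypothetical.

```python
from dataclasses import dataclass

# Assumption: one output token costs about 10x one input token (the rough figure above).
DECODE_WEIGHT = 10.0


@dataclass(frozen=True)
class Request:
    input_tokens: int
    output_tokens: int


def relative_cost(req: Request, decode_weight: float = DECODE_WEIGHT) -> float:
    """Cost in input-token equivalents: prefill tokens plus weighted decode tokens."""
    return req.input_tokens + decode_weight * req.output_tokens


# Two hypothetical requests hitting the same endpoint.
short_chat = Request(input_tokens=500, output_tokens=200)
long_context_call = Request(input_tokens=60_000, output_tokens=20_000)

ratio = relative_cost(long_context_call) / relative_cost(short_chat)
print(f"the long-context call costs about {ratio:.0f}x the short chat turn")
```

Both requests count as one query in a QPS graph, which is exactly why that graph hides the difference.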

This means the right unit is tokens per second per GPU on the supply side and tokens per task × tasks per user × DAU on the demand side. A capacity model that says "we serve 200 QPS at p95 of 800ms" tells you almost nothing about whether you can survive next quarter, because it does not say how many tokens those queries consumed.

A useful working formula:

Required tokens/sec = (input_tokens_per_task + output_tokens_per_task) × tasks_per_user_per_day × DAU / seconds_in_day × peak_factor × fanout_multiplier

Every term in that formula moves independently. Per-task tokens move when product changes. Tasks per user move when behavior shifts. DAU moves with sales. Peak factor moves with geography. Fanout moves with prompt edits. A QPS forecast collapses all of those into one number and loses the structure.
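The formula translates directly into a few lines of code, which makes it easy to move one term at a time and watch the result. This is a sketch under stated assumptions: it treats the per-task token figures as measured before loop amplification, so that the fanout multiplier is not double-counted (one reading of the formula, not something stated above). The `gpus_needed` helper, the per-GPU throughput, and every example number are hypothetical.

```python
import math

SECONDS_PER_DAY = 86_400


def required_tokens_per_sec(
    input_tokens_per_task: float,
    output_tokens_per_task: float,
    tasks_per_user_per_day: float,
    dau: float,
    peak_factor: float,
    fanout_multiplier: float,
) -> float:
    """Demand side of the working formula, in tokens per second at peak."""
    tokens_per_task = input_tokens_per_task + output_tokens_per_task
    daily_tokens = tokens_per_task * tasks_per_user_per_day * dau
    return daily_tokens / SECONDS_PER_DAY * peak_factor * fanout_multiplier


def gpus_needed(
    required_tps: float,
    tokens_per_sec_per_gpu: float,
    target_utilization: float = 0.75,
) -> int:
    """Supply side: GPUs required to serve the demand at a target utilization."""
    return math.ceil(required_tps / (tokens_per_sec_per_gpu * target_utilization))


# Hypothetical baseline workload.
baseline = required_tokens_per_sec(
    input_tokens_per_task=4_000,
    output_tokens_per_task=1_000,
    tasks_per_user_per_day=10,
    dau=50_000,
    peak_factor=3.0,
    fanout_multiplier=8.0,
)

# Same users, same behavior; only the fanout term moves (8 -> 22).
after_prompt_edit = required_tokens_per_sec(4_000, 1_000, 10, 50_000, 3.0, 22.0)

print(gpus_needed(baseline, tokens_per_sec_per_gpu=2_500.0))
print(gpus_needed(after_prompt_edit, tokens_per_sec_per_gpu=2_500.0))
```

Keeping the terms as separate arguments is the point: a forecast review can ask which term is expected to move next quarter, rather than arguing about a single QPS number.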

The Fanout Multiplier Is A First-Class Capacity Input

The single most expensive thing teams forget to track is the fanout multiplier — how many model calls each user-visible task generates inside the agent loop.

Anthropic's own data has shown agents consume roughly 4x more tokens than chat, with multi-agent systems pushing that to about 15x. Independent research on cyclic tool composition has measured token consumption rising up to 14.6x for fixed queries, just from how the loop walks the tool graph. None of this shows up on the user-facing dashboard. A user kicks off "summarize this document" once. Inside the agent: a planner call, six retrieval calls, three tool-arg-formation calls, two re-plan calls when a tool returns malformed output, a verifier call, and a summarizer call. Fourteen calls in all. Six retrievals. Twelve tool-result payloads injected back into context. Tens of thousands of tokens.

The dangerous property of the fanout multiplier is that it is a function of the prompt, not the user. A senior engineer tightens the planner prompt to "consider three alternatives before tool selection" because it improved the eval score by two points. Fanout goes from 8 to 22 across the fleet. The eval is green. The capacity model has not moved. Three weeks later something else nudges DAU up 5% and the inference tier tips over into errors.

The discipline that catches this is treating fanout as a first-class metric:

  • Measure mean and p95 fanout per agent class, not just per workload.
  • Re-measure on every prompt or tool-registry change, before the change merges to production.
  • Track regressions in fanout the same way you track regressions in latency — with an alert and a budget.
  • Make per-task token consumption a release-gate metric, not a dashboard you check after the fact.
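The checklist above can be sketched as a small release gate that runs against eval traces before a prompt or tool-registry change merges. Everything here is a hypothetical illustration: the `FanoutBudget` type, the `check_fanout` function, the agent class name, and the budget values are assumptions, and the p95 uses a simple nearest-rank definition.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class FanoutBudget:
    max_mean: float
    max_p95: float


def p95(values: list[int]) -> int:
    """Nearest-rank 95th percentile, using integer arithmetic for the rank."""
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n)
    return ordered[rank - 1]


def check_fanout(
    samples_by_class: dict[str, list[int]],
    budgets: dict[str, FanoutBudget],
) -> list[str]:
    """Return one message per agent class that is missing data or over budget."""
    violations = []
    for agent_class, samples in samples_by_class.items():
        budget = budgets.get(agent_class)
        if budget is None or not samples:
            violations.append(f"{agent_class}: no budget or no samples")
            continue
        observed_mean, observed_p95 = mean(samples), p95(samples)
        if observed_mean > budget.max_mean or observed_p95 > budget.max_p95:
            violations.append(
                f"{agent_class}: mean={observed_mean:.1f} p95={observed_p95} "
                f"(budget mean<={budget.max_mean}, p95<={budget.max_p95})"
            )
    return violations


# Model calls per task, counted from eval traces (hypothetical numbers).
budgets = {"doc_summarizer": FanoutBudget(max_mean=10.0, max_p95=15.0)}
before_change = {"doc_summarizer": [8] * 19 + [12]}
after_change = {"doc_summarizer": [22] * 20}

print(check_fanout(before_change, budgets))  # empty list: within budget
print(check_fanout(after_change, budgets))   # one violation: blocks the merge
```

A gate like this would fail the "consider three alternatives" prompt edit at review time, even though the eval score improved.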

If the only signal you have for "the agent is doing more work" is your inference bill, you find out three weeks late and after capacity has already cliffed.
