
GPU Capacity Planning When Demand Is a Cliff, Not a Curve

10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct; it was simply applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve, where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff, and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight, on contract terms you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change — enabling tool use, lengthening context, switching to a larger model — can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands on the same week.

Why QPS Stops Being The Right Unit

Capacity planning for stateless web services is QPS-shaped because the cost per request is roughly constant. A search query is a search query. The variance is small enough that the smooth curve is a real thing.

LLM inference does not work like that. The cost per request is dominated by tokens — input tokens determine prefill compute, output tokens determine decode compute, and an autoregressive decoder spends roughly 10x more time per output token than per input token at typical context lengths. A single agentic task can consume between 50,000 and 500,000 tokens across dozens of inference calls, depending on how deep the tool-call tree gets. Two requests routed to the same endpoint can differ in cost by two orders of magnitude.
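
To make that spread concrete, here is a minimal Python sketch; the per-token timings and the token counts are illustrative assumptions chosen to reflect the rough 10x prefill/decode gap, not measurements from any particular model.

```python
# Rough per-request GPU-time model. The ~0.2 ms prefill and ~2 ms decode per token
# are illustrative stand-ins for the ~10x prefill/decode gap described above.
def request_gpu_seconds(input_tokens: int, output_tokens: int,
                        prefill_ms_per_token: float = 0.2,
                        decode_ms_per_token: float = 2.0) -> float:
    return (input_tokens * prefill_ms_per_token
            + output_tokens * decode_ms_per_token) / 1000

chat_turn = request_gpu_seconds(800, 300)            # ~0.76 s of GPU time
agent_task = request_gpu_seconds(150_000, 20_000)    # ~70 s across the task's calls
print(f"~{agent_task / chat_turn:.0f}x")             # roughly two orders of magnitude apart
```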

This means the right unit is tokens per second per GPU on the supply side and tokens per task × tasks per user × DAU on the demand side. A capacity model that says "we serve 200 QPS at p95 of 800ms" tells you almost nothing about whether you can survive next quarter, because it does not say what tokens those queries consumed.

A useful working formula:

Required tokens/sec = (input_tokens_per_task + output_tokens_per_task) × tasks_per_user_per_day × DAU / seconds_in_day × peak_factor × fanout_multiplier

Every term in that formula moves independently. Per-task tokens move when product changes. Tasks per user move when behavior shifts. DAU moves with sales. Peak factor moves with geography. Fanout moves with prompt edits. A QPS forecast collapses all of those into one number and loses the structure.
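
A direct translation of that formula, keeping each term as a separate input so it can be forecast independently; all numbers in the usage example, including the tokens-per-second-per-GPU figure, are illustrative assumptions.

```python
SECONDS_IN_DAY = 86_400

def required_tokens_per_sec(input_tokens_per_task: float, output_tokens_per_task: float,
                            tasks_per_user_per_day: float, dau: float,
                            peak_factor: float, fanout_multiplier: float) -> float:
    tokens_per_task = input_tokens_per_task + output_tokens_per_task
    steady_state = tokens_per_task * tasks_per_user_per_day * dau / SECONDS_IN_DAY
    return steady_state * peak_factor * fanout_multiplier

# Illustrative numbers only: 2,500 user-visible tokens/task, 3 tasks/user/day,
# 50k DAU, 2.5x peak-to-mean, 12x fanout inside the agent loop.
demand = required_tokens_per_sec(2_000, 500, 3, 50_000, peak_factor=2.5, fanout_multiplier=12)
gpus_needed = demand / 2_500      # assuming ~2,500 tokens/sec/GPU on the supply side
print(f"{demand:,.0f} tokens/sec ≈ {gpus_needed:.0f} GPUs")
```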

The Fanout Multiplier Is A First-Class Capacity Input

The single most expensive thing teams forget to track is the fanout multiplier — how many model calls each user-visible task generates inside the agent loop.

Anthropic's own data has shown agents consume roughly 4x more tokens than chat, with multi-agent systems pushing that to about 15x. Independent research on cyclic tool composition has measured token consumption rising up to 14.6x for fixed queries, just from how the loop walks the tool graph. None of this shows up on the user-facing dashboard. A user kicks off "summarize this document" once. Inside the agent: a planner call, six retrieval calls, three tool-arg-formation calls, two re-plan calls when a tool returns malformed output, a verifier call, and a summarizer call. Thirteen model calls. Six retrievals. Twelve tool-result payloads injected back into context. Tens of thousands of tokens.

The dangerous property of the fanout multiplier is that it is a function of the prompt, not the user. A senior engineer tightens the planner prompt to "consider three alternatives before tool selection" because it improved the eval score by two points. Fanout goes from 8 to 22 across the fleet. The eval is green. The capacity model has not moved. Three weeks later something else nudges DAU up 5% and the inference tier tips over into errors.

The discipline that catches this is treating fanout as a first-class metric:

  • Measure mean and p95 fanout per agent class, not just per workload.
  • Re-measure on every prompt or tool-registry change, before the change merges to production.
  • Track regressions in fanout the same way you track regressions in latency — with an alert and a budget.
  • Make per-task token consumption a release-gate metric, not a dashboard you check after the fact (a minimal sketch of that gate follows this list).
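
A minimal sketch of that release gate, assuming a fixed eval suite and a committed baseline; the record fields, the 15% regression threshold, and the p95 shortcut are illustrative, not a specific CI integration.

```python
import statistics, sys

def fanout_gate(baseline: dict, runs: list[dict], max_regression: float = 0.15) -> bool:
    """runs: one record per eval task, e.g. {"model_calls": 13, "tokens": 48_000}."""
    calls = sorted(r["model_calls"] for r in runs)
    tokens = [r["tokens"] for r in runs]
    measured = {
        "fanout_mean": statistics.mean(calls),
        "fanout_p95": calls[int(0.95 * (len(calls) - 1))],
        "tokens_per_task_mean": statistics.mean(tokens),
    }
    ok = all(measured[k] <= baseline[k] * (1 + max_regression) for k in baseline)
    if not ok:
        print(f"fanout/token budget exceeded: {measured} vs {baseline}", file=sys.stderr)
    return ok

# In CI: fail the build on regression unless the change carries an explicit sign-off.
# sys.exit(0 if fanout_gate(baseline, runs) else 1)
```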

If the only signal you have for "the agent is doing more work" is your inference bill, you find out three weeks late and after capacity has already cliffed.

Three Cliffs That Aren't On The Forecast

Three distinct cliff shapes show up in agent infrastructure. Each needs different machinery.

Onboarding cliffs. A new enterprise customer signs in March, goes live in May, and brings 40,000 daily users that didn't exist in your forecast. The contract gave you two weeks of operational notice. Sales knew in November. The capacity team found out from a Slack message. The fix here is contractual and organizational, not technical: the customer onboarding workflow needs to write expected DAU, expected tasks-per-user, and go-live date into the capacity forecasting system as soon as the deal closes. If your finance team can build the revenue forecast from the same signal, your infra team can build the load forecast.
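
One way to make that instrumentation concrete is a small record the deal-close workflow writes into the capacity forecast; the field names below are assumptions about what such a record would carry, not an existing schema.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class OnboardingForecast:
    customer: str
    expected_dau: int
    expected_tasks_per_user_per_day: float
    go_live: date

# Written at deal-close, months before go-live, so the load forecast and the
# revenue forecast are built from the same signal.
capacity_forecast: list[OnboardingForecast] = []
capacity_forecast.append(OnboardingForecast("acme-corp", 40_000, 3.0, date(2026, 5, 1)))
```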

Fanout cliffs. A prompt edit, a tool added to the registry, a model change, or a context-length increase moves per-task token consumption by 2–10x. These are silent in user-visible metrics. They show up only in tokens-per-task and tokens-per-second-per-GPU. The mitigation is a continuous integration check that fails the build if the fanout multiplier on a fixed eval suite moves more than a configured threshold without sign-off.

Behavioral cliffs. A new feature ships and the users who used the agent twice a week start using it twelve times a day. The DAU graph barely moves. Tasks per user moves 6x. The capacity that was sized for 2.4 tasks/user/day is now serving 14.4 tasks/user/day. Behavioral cliffs are the hardest to predict because they are downstream of product decisions that don't always cross the infrastructure team's desk. The defensive move is keeping tasks-per-user as a separately tracked, separately alerted metric — and treating any double-digit week-over-week change as a planning event, not just a positive product signal.
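
A sketch of that alert, assuming weekly aggregates of tasks per user per day; the 10% threshold stands in for the double-digit week-over-week change described above.

```python
def tasks_per_user_planning_event(this_week: float, last_week: float,
                                  threshold: float = 0.10) -> bool:
    """True when the week-over-week change in tasks/user/day warrants a capacity review."""
    return abs(this_week - last_week) / last_week >= threshold

# The behavioral cliff above (2.4 -> 14.4 tasks/user/day) trips it immediately.
assert tasks_per_user_planning_event(14.4, 2.4)
```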

Headroom Is The Wrong Question

Traditional capacity planning frames headroom as a utilization percentage: "we run at 70%, we have 30% headroom." That's adequate when the worst case is a normal peak twice the normal trough. It's the wrong frame for cliff risk, because the cliff is not bounded by historical traffic.

The right frame is cliffs-survivable. How many of the three cliffs can your current capacity absorb simultaneously without paging? An onboarding cliff and a fanout cliff in the same week is the realistic failure mode, not "Tuesday at 3pm we briefly hit 85% utilization." Buying smooth-curve utilization on a cliff workload is exactly what produces the launch-day outage — because every dollar you save by running hotter is a dollar you've spent on the assumption that the cliff doesn't land this week.
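
One way to report cliffs-survivable as a number: express each realistic cliff as a load multiplier and count how many can land simultaneously without exceeding capacity. The multipliers below are illustrative assumptions, and treating simultaneous cliffs as multiplicative is itself a modeling choice.

```python
from itertools import combinations
from math import prod

def cliffs_survivable(current_tps: float, capacity_tps: float, cliffs: dict) -> int:
    """Largest k such that every combination of k cliffs still fits inside capacity."""
    survivable = 0
    for k in range(1, len(cliffs) + 1):
        if all(current_tps * prod(combo) <= capacity_tps
               for combo in combinations(cliffs.values(), k)):
            survivable = k
    return survivable

cliffs = {"onboarding": 1.6, "fanout": 2.5, "behavioral": 3.0}   # illustrative multipliers
print(cliffs_survivable(current_tps=40_000, capacity_tps=180_000, cliffs=cliffs))  # -> 1
```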

There is a counterintuitive corollary: low GPU utilization on cliff-shaped workloads is often a feature, not a bug. The team that runs at 45% utilization with a 2x burst budget on a secondary provider survives a 10x cliff with degraded but available service. The team that has been "optimizing utilization to 85%" goes down. Tooling and dashboards that frame utilization as something to maximize are fighting the workload.

This shows up explicitly in finance conversations. The CFO sees underutilized hardware and asks why. The honest answer is that the utilization number is the wrong objective on a workload where the variance dominates the mean — the right objective is what fraction of the realistic cliff scenarios you can absorb without an outage. Reframe the metric and the capacity decisions become legible to people outside the team that has to answer the page.

A Reservation Strategy Built For Cliffs

Once you accept the workload shape, a specific reservation strategy falls out.

Reserve the floor. Long-term committed capacity sized for the steady-state minimum, not the average. Reserved plans typically cut 30–50% off on-demand pricing, with multi-year commits going further. This is the cheapest tier and you should run it close to fully utilized — but only for the part of the load that you are highly confident will be there in twelve months.

Burst on a secondary provider. A second provider or a second region with on-demand capacity, accessed through a routing layer. The price per token is higher. That's fine — you're amortizing the higher price only against tail traffic. The math is: on-demand at 2.5x the reserved price, used for 8% of total tokens, increases blended cost by 12% versus reserved-only. That 12% is your insurance premium against an outage during a cliff. If the cliff lands once a quarter, the premium pays for itself the first time.
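
Writing that arithmetic out, with prices normalized to the reserved rate:

```python
reserved_price, on_demand_price = 1.0, 2.5   # on-demand at 2.5x the reserved rate
burst_share = 0.08                           # 8% of total tokens served from burst
blended = (1 - burst_share) * reserved_price + burst_share * on_demand_price
print(f"blended = {blended:.2f}x reserved  ->  {blended - 1:.0%} insurance premium")  # 1.12x, 12%
```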

Downgrade as a deliberate strategy, not a fallback. When burst pricing crosses a threshold, route non-critical paths to a smaller model. The customer-support summarizer can run on a cheaper model during a surge. The high-stakes contract-analysis agent stays on the flagship. This is not graceful degradation in the abstract — it's a routing rule with a price-per-token threshold and a per-route quality budget.
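
A minimal sketch of that routing rule; the route names, model names, and the price threshold are illustrative assumptions.

```python
# Per-route policy: which paths may fall back to a smaller model during a surge.
# A per-route quality budget (an eval-score floor) would gate the fallback as well.
ROUTES = {
    "support-summarizer": {"downgrade_ok": True,  "fallback": "small-fast-model"},
    "contract-analysis":  {"downgrade_ok": False, "fallback": None},
}

def pick_model(route: str, burst_price_per_mtok: float,
               price_threshold_per_mtok: float = 6.0) -> str:
    cfg = ROUTES[route]
    if burst_price_per_mtok > price_threshold_per_mtok and cfg["downgrade_ok"]:
        return cfg["fallback"]
    return "flagship-model"
```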

Circuit breakers, not retries. When a primary provider returns 429s or 5xxs above a threshold (one common rule: if 40% of traffic fails within a 60-second window), bypass retries and route all traffic to the equivalent secondary model for a cooldown window — say 20 minutes — before retesting the primary. Retry storms during a cliff are a way to turn one outage into two.
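
A sketch of that breaker using the example thresholds above (40% failures in a 60-second window, 20-minute cooldown); the minimum-call count is an added assumption so a handful of requests can't trip it.

```python
import time
from collections import deque

class ProviderBreaker:
    def __init__(self, window_s=60, fail_ratio=0.40, cooldown_s=20 * 60, min_calls=20):
        self.window_s, self.fail_ratio = window_s, fail_ratio
        self.cooldown_s, self.min_calls = cooldown_s, min_calls
        self.calls = deque()        # (timestamp, failed) pairs inside the sliding window
        self.tripped_at = None

    def record(self, failed: bool, now: float | None = None) -> None:
        """Record one call outcome; failed=True for 429s and 5xxs."""
        now = time.time() if now is None else now
        self.calls.append((now, failed))
        while self.calls and self.calls[0][0] < now - self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, f in self.calls if f)
        if len(self.calls) >= self.min_calls and failures / len(self.calls) >= self.fail_ratio:
            self.tripped_at = now   # trip: no retries, everything goes to the secondary

    def route(self, now: float | None = None) -> str:
        now = time.time() if now is None else now
        if self.tripped_at is not None and now - self.tripped_at < self.cooldown_s:
            return "secondary"
        self.tripped_at = None      # cooldown elapsed: probe the primary again
        return "primary"
```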

The shape of this stack — committed floor, on-demand burst, model downgrade, circuit-broken failover — is not a decision tree you build in an incident. It's wired in advance, and the routing layer that implements it is a first-class production service, not a config file.

What This Looks Like When It Lands

A capacity-mature agent platform looks like this from the inside. The capacity model is denominated in tokens, not QPS. The fanout multiplier is a tracked metric per agent class, with a CI check that fails on regression. The sales pipeline writes onboarding load into the capacity forecast at deal-close, not at go-live. The reservation strategy has a committed floor sized for confidence, an on-demand burst tier sized for cliff scenarios, and a routing layer that downgrades non-critical paths when burst pricing crosses a budget. Headroom is reported as cliffs-survivable, not utilization-percent. The dashboards that the on-call engineer watches and the dashboards that the CFO watches have a shared vocabulary.

The architectural realization underneath all of this is that GPU capacity for agent workloads is not a cloud-cost optimization problem. It is a reliability problem that lives next to your error budget. The teams that treat it as a finance problem buy unaffordable headroom or accept surprise outages on the days that matter most. The teams that treat it as a reliability problem do the more boring work: token-shaped capacity math, fanout as a first-class metric, contractual instrumentation of enterprise demand, and a reservation strategy that assumes the cliff will land. They are the ones still serving customers when it does.
