GPU Capacity Planning When Demand Is a Cliff, Not a Curve

· 10 min read
Tian Pan
Software Engineer

The first time an agent platform falls over, the postmortem usually contains a sentence that reads something like: "We had eight weeks of headroom on Friday. By Monday afternoon, we were at 140% of provisioned capacity." Nobody is lying. The capacity model was correct, applied to a workload it was never designed for. Classical capacity planning assumes demand grows along a smooth curve where weekly seasonality is the dominant signal and the worst case is a Black Friday you can plan against six months out. Agent workloads break that assumption hard.

The shape of agent demand is not a curve. It is a cliff. Three things produce the cliff, and they compound. A single enterprise customer onboarding can shift baseline by 10x overnight, on a notice period fixed by a contract you've already signed. An agent loop can amplify a tiny increase in user activity into a fanout-multiplied surge that hits inference 30x harder than the user-facing graph suggests. A single product change (enabling tool use, lengthening context, switching to a larger model) can move per-task token consumption by an order of magnitude with no change in user count.

If your capacity planning is in QPS and your headroom budget is "75% utilization is healthy," you are not planning. You are gambling that none of those three cliffs lands in the same week.
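To make the headroom arithmetic concrete, here is a minimal sketch in Python. It assumes inference demand can be approximated as task rate times model calls per task times tokens per call. That multiplicative model, the `Workload` type, and every baseline number are illustrative assumptions, not measurements from a real platform. Only the 10x multipliers and the 75% utilization target come from the text above.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Workload:
    """Hypothetical demand model: all three factors multiply."""
    tasks_per_second: float  # user-facing task rate (what a QPS graph shows)
    calls_per_task: float    # agent-loop fanout: model calls per task
    tokens_per_call: float   # prompt + completion tokens per model call

    def tokens_per_second(self) -> float:
        return self.tasks_per_second * self.calls_per_task * self.tokens_per_call


def utilization(w: Workload, capacity_tokens_per_second: float) -> float:
    """Fraction of provisioned token throughput consumed (1.0 == 100%)."""
    return w.tokens_per_second() / capacity_tokens_per_second


# Illustrative baseline, provisioned so that it sits at 75% utilization.
baseline = Workload(tasks_per_second=10.0, calls_per_task=4.0, tokens_per_call=2_000.0)
capacity = baseline.tokens_per_second() / 0.75

# Cliff 1: an enterprise onboarding multiplies the task rate by 10.
onboarded = replace(baseline, tasks_per_second=baseline.tasks_per_second * 10)

# Cliff 3: a product change multiplies tokens per call by 10; the QPS graph does not move.
longer_context = replace(baseline, tokens_per_call=baseline.tokens_per_call * 10)

# Both cliffs in the same week: the multipliers compound.
both = replace(onboarded, tokens_per_call=onboarded.tokens_per_call * 10)

for name, w in [("baseline", baseline), ("onboarding", onboarded),
                ("longer context", longer_context), ("both", both)]:
    print(f"{name:>15}: {w.tasks_per_second:6.0f} tasks/s, "
          f"utilization {utilization(w, capacity):.0%}")
```

Under these assumptions, the 25% of headroom that "75% utilization is healthy" leaves absorbs a multiplier of about 1.33x. The `longer_context` case is the one a QPS dashboard never shows: the task rate is identical to baseline while token demand is ten times higher.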