Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource
Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy, yet users are seeing 8-second time-to-first-token. The on-call engineer checks memory: fine. Compute: fine. And still the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.
The root problem is that LLMs consume a fundamentally different kind of resource. CPU-bound services are budgeted in compute and memory. LLM services are budgeted in tokens, and tokens don't behave like requests.
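The gap is easy to see with a back-of-the-envelope calculation. The sketch below compares two workloads with identical request rates but different token footprints; all request rates, token counts, and the capacity figure are illustrative assumptions, not measurements from any real cluster.

```python
# A minimal sketch, assuming made-up numbers: two workloads with the same
# request rate can place wildly different demand on an inference server
# once you count tokens instead of requests.

def token_demand(requests_per_sec: float, prompt_tokens: int, output_tokens: int) -> float:
    """Tokens the server must process per second for this workload."""
    return requests_per_sec * (prompt_tokens + output_tokens)

# Hypothetical serving capacity in tokens per second.
CAPACITY_TOKENS_PER_SEC = 20_000

workloads = {
    "short chat turns":        token_demand(requests_per_sec=10, prompt_tokens=300,  output_tokens=150),
    "long-document summaries": token_demand(requests_per_sec=10, prompt_tokens=6000, output_tokens=800),
}

for name, demand in workloads.items():
    utilization = demand / CAPACITY_TOKENS_PER_SEC
    print(f"{name}: {demand:,.0f} tokens/s -> {utilization:.0%} of token capacity")

# Both workloads arrive at 10 requests/s, but one fits comfortably while the
# other exceeds capacity entirely; a requests-per-second dashboard cannot
# tell them apart.
```

The same reasoning is why a per-request SLO or a per-request autoscaling rule breaks down here: the unit of work the hardware actually sees is the token, not the request.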
