Capacity Planning for AI Workloads: Why the Math Breaks When Tokens Are Your Resource

11 min read
Tian Pan
Software Engineer

Your GPU dashboard is lying to you. At 60% utilization, your inference cluster looks healthy. Users are experiencing 8-second time-to-first-token. The on-call engineer checks memory — also fine. Compute — fine. And yet the queue is growing and latency is spiking. This is what happens when you apply traditional capacity planning to LLM workloads: the metrics you trust point to the wrong places, and the actual bottleneck stays invisible until users start complaining.

The root problem is that LLMs consume a fundamentally different kind of resource. CPU services trade compute and memory. LLM services trade tokens — and tokens don't behave like requests.

Why Tokens Are Not Requests

In a conventional web service, you provision based on concurrent requests and response time. Requests are roughly uniform: a typical API call takes tens to hundreds of milliseconds, consumes a bounded slice of CPU, and exits cleanly. Load balancers distribute evenly. Autoscaling on CPU utilization or request rate works.

LLM inference violates all of these assumptions.

First, token consumption is non-linear with input length. A request with a 500-token context is not merely 100x cheaper than one with a 50,000-token context — the gap is wider than the token ratio suggests, because attention cost grows quadratically with context length during the prefill phase. When context windows routinely reach 200K tokens (some models support 1M+), a single long-context request can monopolize a GPU for seconds.

Second, output is generated one token at a time. Unlike a traditional response that flushes a complete payload, LLM output is streamed token-by-token during the decode phase. A user requesting a 2,000-token essay holds a GPU slot for the entire generation duration. The GPU cannot work on another request with that slot. You're not managing request concurrency — you're managing token slot occupancy.

Third, workloads are genuinely bursty in ways that differ from typical web traffic. A batch of document summarization jobs submitted by an automated pipeline can inject 10M tokens into your queue in seconds. An agentic system running multi-step reasoning loops consumes 5–30x more tokens per task than a simple chat completion. You cannot model this with a Poisson distribution and a safety margin.

The practical consequence: token-per-minute quotas enforced by LLM APIs exist for a reason. When you build your own inference infrastructure, you inherit the same capacity problem without the guardrails.

The KV Cache: Where Capacity Planning Actually Lives

Every engineer planning LLM capacity focuses on compute (FLOPS) and misses the real bottleneck: memory bandwidth and KV cache pressure.

During inference, the model maintains a key-value cache for the attention mechanism — one key and one value entry per token per layer. For a Llama-70B-class model (80 layers, grouped-query attention with 8 KV heads of dimension 128) at 16-bit precision, each token of context costs roughly 320KB, so a 10K-token context holds about 3GB of VRAM. With 32 concurrent requests each holding 10K of context, that's roughly 100GB — just for the cache — before model weights and activations.

The KV cache scales linearly with context length and linearly with concurrent requests. At some point, it exceeds available GPU memory. When that happens, you don't get graceful degradation — you get OOM errors or new requests silently stalling in the queue.
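The arithmetic is mechanical enough to script. A minimal sketch — the layer and head counts below are the commonly published Llama-70B attention configuration; substitute your model's values:

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int,
                   context_tokens: int, bytes_per_elem: int = 2) -> int:
    """K and V tensors: 2 entries per token per layer, each of shape
    [num_kv_heads, head_dim], at bytes_per_elem (2 for fp16/bf16)."""
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem * context_tokens

# Llama-70B-style attention config: 80 layers, 8 KV heads (GQA), head_dim 128.
per_10k = kv_cache_bytes(80, 8, 128, 10_000)
print(f"{per_10k / 2**30:.1f} GiB per 10K-token context")
```

Multiply by expected concurrency at p95 context length and the binding constraint usually announces itself.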

But there's a subtler problem that shows up first: memory bandwidth saturation. During the decode phase, every generated token requires reading the entire KV cache from DRAM. For large contexts or large batches, this saturates memory bandwidth before GPU compute reaches 100%. Research measuring large-batch inference found over 50% of attention kernel cycles stalling on DRAM access even when GPU utilization metrics appeared healthy. That's the dashboard lying to you.

This is why GPU utilization is the wrong primary capacity signal for LLM workloads. A server at 60% GPU utilization might be 95% memory-bandwidth-bound. Traditional monitoring frameworks have no concept of this distinction.

How Traditional Provisioning Models Break

Web service capacity planning assumes:

  • Resources (CPU, memory) map linearly to throughput
  • Horizontal scaling distributes load proportionally
  • Utilization percentage is a reliable proxy for remaining headroom

None of these hold for LLM inference.

Non-linear scaling: A 7B-parameter model on one A100 handles a certain request volume. Two A100s don't give you 2x the throughput — the relationship depends on model architecture, batch size, and whether you're parallelizing tensor operations or running separate model replicas. With tensor parallelism, adding GPUs improves per-request latency but not necessarily throughput-per-dollar.

CPU-side bottlenecks: As context lengths grow and requests include tool invocations, the CPU-side work — tokenization, runtime orchestration, and inter-service communication — becomes a significant contributor to latency. Studies of multi-GPU inference clusters found CPU-induced slowdowns becoming material as agentic workloads increased — a failure mode that CPU utilization metrics don't surface until it's severe.

Headroom math fails at p99: If your average request takes 2 seconds and you're at 70% utilization, traditional math says you have 30% headroom. But LLM latency is highly right-tailed. A request that lands at the back of a queue behind several long-context jobs experiences delays that are multiplicative, not additive. p99 latency for LLM serving is routinely 5–10x the median. Planning to 70% average utilization leaves almost no buffer at the tail.
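That tail behavior is easy to reproduce. A minimal single-server FIFO simulation with heavy-tailed service times — all parameters illustrative, the lognormal standing in for a mix of short and long-context requests:

```python
import random
import statistics

random.seed(0)

# Heavy-tailed service times stand in for mixed short/long-context
# requests; the parameters are illustrative, not from a real trace.
N = 20_000
svc = [random.lognormvariate(0.0, 1.2) for _ in range(N)]
mean_svc = statistics.mean(svc)

# Poisson arrivals tuned so the single server sits at ~70% utilization.
rho = 0.7
arrival_rate = rho / mean_svc
t = 0.0
arrivals = []
for _ in range(N):
    t += random.expovariate(arrival_rate)
    arrivals.append(t)

# FIFO service: each request waits behind everything ahead of it.
finish = 0.0
latencies = []
for a, s in zip(arrivals, svc):
    start = max(a, finish)
    finish = start + s
    latencies.append(finish - a)

latencies.sort()
median, p99 = latencies[N // 2], latencies[int(N * 0.99)]
print(f"median={median:.1f}s  p99={p99:.1f}s  ratio={p99 / median:.1f}x")
```

With high-variance service times the p99/median ratio lands far above what the same utilization would produce under uniform request costs.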

The Correct Capacity Signals

The metrics that actually predict when an LLM serving cluster is running out of headroom:

Queue depth is the most reliable leading indicator. When requests begin queuing, the cluster has already exhausted its concurrent serving capacity. A threshold of 3–5 queued requests per replica should trigger scale-out. This is the signal that Kubernetes KEDA-based autoscalers use with vLLM's vllm:num_requests_waiting metric — not CPU %, not GPU %, not request rate.
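As a sketch of that decision logic — the metric name is vLLM's, but the endpoint handling, thresholds, and replica-count math here are illustrative; in practice KEDA expresses this declaratively:

```python
import math

def parse_waiting(metrics_text: str) -> int:
    """Pull vllm:num_requests_waiting out of one replica's Prometheus
    /metrics payload (a line like 'vllm:num_requests_waiting 7.0')."""
    for line in metrics_text.splitlines():
        if line.startswith("vllm:num_requests_waiting"):
            return int(float(line.split()[-1]))
    return 0

def desired_replicas(queued: list[int], threshold: int = 4,
                     max_replicas: int = 16) -> int:
    """Scale out when pending requests per replica cross the 3-5
    threshold; otherwise keep the current count."""
    replicas = max(len(queued), 1)
    total = sum(queued)
    if total / replicas <= threshold:
        return replicas
    # Enough replicas to bring queue-per-replica back under threshold.
    return min(max(math.ceil(total / threshold), replicas + 1), max_replicas)
```

For example, three replicas each holding six queued requests would scale to five replicas under the default threshold.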

KV cache utilization (gpu_cache_usage_perc in vLLM) is the physical capacity signal. When this exceeds 80–85%, new long-context requests will either be rejected or will preempt running requests. Monitor this per replica, not in aggregate.

TTFT and ITL are the latency signals that map directly to user experience. TTFT (time to first token) is dominated by prefill throughput. ITL (inter-token latency) is dominated by decode memory bandwidth. They degrade in different ways under different load patterns. A spike in TTFT under constant ITL signals prefill saturation — you're running too many prompt-heavy requests simultaneously. A spike in ITL under stable TTFT signals decode saturation — you have too many long-running generation jobs holding slots.

Throughput in tokens/second (not requests/second) is the right denominator for capacity. Two requests at 1K tokens and one request at 2K tokens represent equivalent load. Request-rate-based autoscaling will systematically underestimate load from long-context workloads.
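A toy comparison makes the point — the per-request profiles here are invented, but the shape is typical:

```python
# Per-request (prompt_tokens, output_tokens) profiles, e.g. from logs.
chat = [(800, 200)] * 100          # 100 chat completions
summarize = [(50_000, 500)] * 4    # 4 document-summarization jobs

def token_load(requests: list[tuple[int, int]]) -> int:
    """Load in tokens — the right denominator for capacity."""
    return sum(p + o for p, o in requests)

# Request-rate autoscaling sees 100 requests vs 4 and scales for chat;
# token accounting shows the 4 summarization jobs are the heavier load.
print(token_load(chat), token_load(summarize))
```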

Prefill-Decode Disaggregation: The Architecture That Changes the Math

One reason capacity planning was so difficult historically is that prefill and decode compete for the same resources while having opposite optimization requirements:

  • Prefill is compute-bound: it benefits from large batches, high compute utilization, parallel processing.
  • Decode is memory-bandwidth-bound: it benefits from small batches, fast memory access, low concurrency per slot.

When both phases run on the same GPU, you're permanently in a compromise state: prefill batches saturate compute and starve in-flight decodes of memory bandwidth, while decode's small-batch cadence leaves prefill's compute underutilized.

Prefill-decode disaggregation solves this by running separate GPU pools for each phase. A request enters a prefill pool, its context is processed in parallel, the resulting KV cache is transferred to a decode pool, and generation proceeds. Meta, Perplexity, and LinkedIn have deployed this in production, reporting 2–7x improvements in combined throughput versus unified architectures.

The capacity planning implication is significant: you can now size your prefill and decode pools independently. A workload with long prompts but short outputs (summarization, classification) needs a larger prefill pool. A workload with short prompts but long outputs (creative writing, code generation) needs a larger decode pool. This unlocks per-phase cost optimization that is impossible with a unified architecture.
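A rough pool-sizing sketch under that model — the per-GPU throughput figures are placeholders, not benchmarks; measure your own with a load test:

```python
import math

def pool_sizes(prompt_tok_s: float, output_tok_s: float,
               prefill_gpu_tok_s: float = 8000.0,  # assumed per-GPU prefill rate
               decode_gpu_tok_s: float = 1500.0,   # assumed per-GPU decode rate
               headroom: float = 0.7) -> tuple[int, int]:
    """Size prefill and decode pools independently, targeting
    `headroom` utilization on each pool."""
    prefill = math.ceil(prompt_tok_s / (prefill_gpu_tok_s * headroom))
    decode = math.ceil(output_tok_s / (decode_gpu_tok_s * headroom))
    return prefill, decode

# Summarization-heavy workload: long prompts, short outputs.
print(pool_sizes(prompt_tok_s=50_000, output_tok_s=3_000))
# Code-generation workload: short prompts, long outputs.
print(pool_sizes(prompt_tok_s=5_000, output_tok_s=20_000))
```

The same aggregate token demand produces a prefill-heavy pool in the first case and a decode-heavy pool in the second — exactly the per-phase optimization a unified architecture can't express.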

Forecasting Token Demand

Estimating future capacity requires a different forecasting methodology than conventional services.

Start with token consumption baselines by workload type. A customer support chatbot might average 800 input tokens and 200 output tokens per conversation. A document analysis pipeline might submit 50K input tokens per document. An agentic coding assistant might generate 10 tool calls per task, each consuming 2K tokens. These profiles are distinct and must be tracked separately — aggregating them into a single "request" metric loses the information you need.

Model burstiness explicitly. LLM token consumption follows distributions with high variance, not Poisson. Document processing pipelines often submit work in batches, creating sharp spikes. Agentic workloads generate recursive token consumption that compounds during peak usage. The ratio of peak-to-average token throughput is typically 3–8x for production workloads, not the 1.5–2x that web services exhibit.

Plan for KV cache, not just compute. For each concurrent user tier you want to support, calculate the VRAM required for KV cache at median and p95 context lengths. Work backward from GPU memory capacity to maximum concurrent requests. This gives you the binding constraint — most clusters hit memory limits before compute limits.
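A back-of-envelope version of that calculation — the 8B-class model config and memory figures below are illustrative:

```python
def max_concurrent(vram_gib: float, weights_gib: float,
                   kv_bytes_per_token: int, p95_context_tokens: int,
                   reserve_frac: float = 0.1) -> int:
    """Work backward from GPU memory to the number of requests that can
    hold a p95-length KV cache simultaneously, keeping a safety reserve."""
    usable_gib = vram_gib * (1 - reserve_frac) - weights_gib
    per_request_gib = kv_bytes_per_token * p95_context_tokens / 2**30
    return max(int(usable_gib // per_request_gib), 0)

# 80 GiB GPU, ~16 GiB of fp16 weights for an 8B-class model,
# ~128 KiB of KV cache per token (32 layers, 8 KV heads, head_dim 128, fp16).
print(max_concurrent(80, 16, 131_072, 8_000))
```

The result is the cluster's real concurrency ceiling; compute utilization rarely enters the equation.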

Account for GPU scaling lag. Unlike CPU auto-scaling that provisions new instances in seconds, GPU instances take minutes to initialize. More extreme scale-up events involving multi-GPU model loading can take 10–15 minutes. This means reactive scaling is insufficient for handling sharp traffic spikes. Build your forecasting to detect trend inflection points 15–30 minutes before projected saturation, not after.

A Practical Autoscaling Architecture

Given these constraints, a working autoscaling design for LLM inference combines three signals:

  • Queue depth as the trigger: scale out when pending requests exceed threshold (fast, reactive)
  • TTFT at p90 as the SLA enforcer: scale out when latency enters the danger zone even without queuing (catches memory bandwidth saturation)
  • KV cache utilization as the hard ceiling: block new requests or trigger emergency scale-out above 85% (prevents OOM cascades)
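The three signals combine into a small decision function — thresholds illustrative, ordering deliberate (hard ceiling checked first):

```python
from dataclasses import dataclass

@dataclass
class ReplicaStats:
    queued: int            # pending requests
    ttft_p90_s: float      # time to first token, p90
    kv_cache_util: float   # 0.0 - 1.0

def scaling_decision(stats: list[ReplicaStats],
                     queue_threshold: int = 4,
                     ttft_slo_s: float = 2.0,
                     kv_ceiling: float = 0.85) -> str:
    """Combine the three signals into one action."""
    if any(r.kv_cache_util > kv_ceiling for r in stats):
        return "emergency-scale-out"   # hard ceiling: prevent OOM cascade
    if sum(r.queued for r in stats) / len(stats) > queue_threshold:
        return "scale-out"             # leading indicator: queue building
    if max(r.ttft_p90_s for r in stats) > ttft_slo_s:
        return "scale-out"             # SLA enforcer: bandwidth saturation
    return "hold"
```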

Scale-in should be conservative. A cluster that just served a burst of long-context requests will have fragmented KV cache state and elevated decode time for in-flight requests. Scaling in too aggressively during the tail of a burst extends latency for users still receiving generation output.

For multi-tier serving (different SLA tiers for different user segments), request prioritization at the queue layer is more effective than SLA enforcement at the model layer. Route premium-tier requests to dedicated replicas with reserved headroom rather than trying to enforce latency guarantees through model-level scheduling — the latency variance of LLM generation makes the latter unreliable.

What This Means for Team Practices

LLM capacity planning needs to be treated as a first-class engineering practice, not an operations afterthought. That means:

  • Running load tests with realistic token distributions, not uniform synthetic requests. GuideLLM and similar tools simulate production traffic profiles — use them before deployment, not after an incident.
  • Tracking token consumption per feature, per model, per user tier in your billing and observability systems. Aggregate request counts tell you almost nothing useful.
  • Building token budgets into product design. When a feature ships without a maximum context constraint, users and automated pipelines will push token consumption to the available limit. Uncontrolled input length is a denial-of-service vector.
  • Treating KV cache sizing as a first-class infrastructure decision, equivalent to database connection pool sizing or message queue depth. It belongs in your capacity review, not in your incident postmortems.

The engineers who get LLM capacity right are the ones who internalize that they are operating memory-bandwidth-constrained, token-denominated systems with high output variance and slow scaling mechanisms. The mental models from web service infrastructure get you partway there — but the edges where they fail are exactly where your users experience outages.

Conclusion

GPU utilization, request rate, and response time SLOs — the standard toolbox of web service capacity planning — are necessary but not sufficient for LLM inference. The true capacity constraints are KV cache VRAM, memory bandwidth saturation, and token slot occupancy. Queue depth is your leading indicator; TTFT and ITL are your SLA instruments.

As agentic workloads grow in prevalence and context windows expand toward millions of tokens, the gap between traditional provisioning models and what LLM infrastructure actually requires will widen. Teams that build their capacity planning around token consumption profiles, disaggregated phase architectures, and multi-signal autoscaling will operate at lower cost and higher reliability than teams running the same cluster with web-service assumptions baked in.

The math isn't hard — but only once you're measuring the right things.
