Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache
Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.
Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.
AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.
Why LLMs Aren't Just Slow APIs
The fundamental difference is that LLM inference is stateful in GPU memory during generation. Every active request holds a live Key-Value (KV) cache — a per-request memory structure that stores the intermediate attention computations for every token generated so far. For a 70B parameter model handling a 4,096-token context, this KV cache consumes several gigabytes of GPU memory per request.
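To make the footprint concrete, here is a minimal sketch of the standard KV-cache size arithmetic. The configuration values (80 layers, 64 attention heads of dimension 128, fp16 entries) are illustrative assumptions for a 70B-class model, not published specs; models using grouped-query attention carry far fewer KV heads and shrink this number accordingly.

```python
def kv_cache_bytes_per_token(n_layers: int, n_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V tensor per layer."""
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

# Assumed, illustrative 70B-class configuration: 80 layers, 64 KV heads
# of dimension 128, fp16 cache entries.
per_token = kv_cache_bytes_per_token(n_layers=80, n_kv_heads=64, head_dim=128)
context_len = 4096
print(f"{per_token / 1e6:.2f} MB per token, "
      f"{per_token * context_len / 1e9:.1f} GB for a {context_len}-token request")
# ~2.6 MB/token and ~10.7 GB per request with full multi-head attention;
# grouped-query attention (e.g. 8 KV heads instead of 64) cuts this roughly 8x.
```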
This has a consequence that conventional capacity planning ignores: you can run out of memory while compute is still 40% idle.
With a stateless API, you add instances and requests distribute across them. With LLM inference, new instances don't help in-flight requests. Each active generation holds memory that cannot be migrated, shared, or compressed without interrupting the request. When a burst of long-context requests arrives simultaneously, they collectively fill GPU memory before filling GPU compute. New requests queue not because the GPU is busy, but because there is nowhere to put their KV caches.
The auto-scaling math that worked for your REST API assumes request independence and stateless compute. Neither is true here.
The KV Cache Is Your Real Bottleneck
To plan for bursts correctly, you need to think in memory, not request counts.
Each concurrent request holds memory proportional to its sequence length and the model's attention configuration (layer count, KV heads, head dimension). For a 13B parameter model, a single 8K-token context might require 1.5–3 GB of KV cache memory. A GPU with 80 GB of VRAM, with model weights loaded, might have 30–40 GB left for KV caches, enough for 10–20 long-context requests at a time, regardless of how many spare CUDA cores exist.
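The concurrency ceiling is therefore a simple subtraction and division. The figures below are assumptions chosen to land in the range described above, not measurements from any particular deployment:

```python
def max_concurrent_requests(gpu_vram_gb: float, weights_gb: float,
                            overhead_gb: float, kv_gb_per_request: float) -> int:
    """Concurrency ceiling set by KV cache memory alone, ignoring spare compute."""
    kv_budget_gb = gpu_vram_gb - weights_gb - overhead_gb
    return int(kv_budget_gb // kv_gb_per_request)

# Assumed, illustrative figures for one 80 GB GPU:
# 35 GB weights and activations, 5 GB runtime overhead, 2.5 GB KV per 8K-token request.
print(max_concurrent_requests(80, 35, 5, 2.5))  # -> 16 concurrent long-context requests
```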
Under a traffic spike, this means:
- Request concurrency is bounded by memory, not by FLOPS.
- Adding more GPU compute to existing instances does nothing.
- The queue fills with requests waiting for memory to free up, not for compute to become available.
- Latency climbs because queued requests wait for existing generations to complete and release their KV cache allocations.
The input/output token counts have enormous variance in practice. A user chat message might be 50 tokens; a document summarization task might be 12,000 tokens. A single long-context request can consume the memory budget of 80 short ones. This makes burst capacity planning probabilistic rather than deterministic — you are modeling a distribution, not a fixed cost.
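One way to treat that distribution explicitly is to sample context lengths and plan against percentiles of aggregate KV demand rather than an average. The lognormal parameters and per-token footprint below are placeholders; in practice you would substitute the empirical distribution from your own traffic logs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder traffic model: mostly short chat turns with a heavy tail of
# long document tasks. Replace with the distribution observed in your logs.
def sample_context_lengths(n_requests: int) -> np.ndarray:
    return np.clip(rng.lognormal(mean=6.0, sigma=1.2, size=n_requests), 50, 16_000)

KV_MB_PER_TOKEN = 0.33      # assumed per-token KV footprint (model-dependent)
burst_concurrency = 500     # requests in flight at the peak of the spike

trials = np.array([
    sample_context_lengths(burst_concurrency).sum() * KV_MB_PER_TOKEN / 1024
    for _ in range(1000)
])                          # aggregate KV demand in GB, one value per trial
print(f"median {np.median(trials):.0f} GB, p95 {np.percentile(trials, 95):.0f} GB")
```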
What "Capacity" Actually Means for LLM Serving
When planning for spikes, you need to model three separate ceilings:
Memory capacity: Total KV cache memory available across your serving fleet. This determines maximum concurrent request depth at your typical context length distribution.
Compute throughput: Tokens per second the fleet can generate. This determines how quickly in-flight requests complete and release their KV caches, which affects queue drain rate.
Cold-start latency: How long it takes a new instance to become request-ready. For LLMs, this includes container image pull (minutes), model weight loading (minutes), and runtime initialization. Contrast this with a stateless service where a new container is ready in seconds.
The interaction between these three is what makes spikes dangerous. A spike increases queue depth (bounded by memory capacity), slows drain (because queued requests also compete for compute), and triggers autoscaling — which doesn't deliver capacity for four to eight minutes. The system can dig itself into a deeper hole before new capacity arrives.
Contrast this with conventional services: when a stateless API autoscales, new instances are ready in seconds, the queue drains, and the system recovers quickly. With LLM inference, the recovery latency is an order of magnitude longer.
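A minimal sketch of that interaction, with every rate below assumed purely for illustration: a step increase in arrivals, a fixed drain rate from the existing fleet, and extra drain capacity that only appears after the cold-start delay.

```python
def queue_depth_over_time(arrival_rps: float, base_drain_rps: float,
                          added_drain_rps: float, cold_start_s: int,
                          horizon_s: int) -> list[float]:
    """Per-second queue depth when autoscaled capacity arrives after a cold start."""
    depth, history = 0.0, []
    for t in range(horizon_s):
        drain = base_drain_rps + (added_drain_rps if t >= cold_start_s else 0.0)
        depth = max(0.0, depth + arrival_rps - drain)
        history.append(depth)
    return history

# Assumed figures: a spike of 40 req/s against a fleet draining 25 req/s,
# with autoscaled capacity worth another 30 req/s landing after a 300 s cold start.
depths = queue_depth_over_time(40, 25, 30, cold_start_s=300, horizon_s=600)
print(f"peak queue depth: {max(depths):.0f} requests at t={depths.index(max(depths))} s")
```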
Capacity Planning Math for Predictable Bursts
For predictable traffic events — scheduled batch jobs, product launches, marketing campaigns — you can apply deterministic capacity planning before the event.
Start with your expected token budget. Estimate the burst request volume, multiply by your p95 context length, and compute total tokens to be processed during the burst window. Divide by your GPU fleet's measured tokens-per-second throughput. Add enough headroom for queueing dynamics, typically 1.5–2x to account for uneven arrival patterns and long-tail context lengths.
A worked example: you expect 500 concurrent users during a 15-minute burst window, each sending a 1,000-token prompt and receiving a 500-token response. That's 750,000 total tokens. Your serving fleet produces 20,000 tokens per second across all GPUs, so the naive calculation says you need about 37 seconds of sustained generation. But that assumes perfectly batched requests; in practice, queuing and memory pressure stretch the window. Plan for 2–3x headroom: size your fleet for 40,000–60,000 tokens per second during the burst window, then scale back after.
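The same arithmetic as a small helper, so the headroom factor stays explicit instead of buried in prose. The inputs come straight from the worked example above; the function itself is only a sketch of the calculation, not a sizing tool.

```python
def burst_fleet_throughput(concurrent_users: int, tokens_per_request: int,
                           baseline_tps: float, headroom: float) -> dict:
    """Token budget for a burst window and the fleet throughput to plan for."""
    total_tokens = concurrent_users * tokens_per_request
    return {
        "total_tokens": total_tokens,
        "naive_seconds_at_baseline": round(total_tokens / baseline_tps, 1),
        "target_tps": baseline_tps * headroom,
    }

# 500 users, 1,000-token prompt plus 500-token response, fleet at 20,000 tok/s.
for headroom in (2.0, 3.0):
    print(burst_fleet_throughput(500, 1500, 20_000, headroom))
# -> 750,000 tokens, ~37.5 s naive, target 40,000-60,000 tok/s with 2-3x headroom.
```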
