Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 10 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.

Why LLMs Aren't Just Slow APIs

The fundamental difference is that LLM inference is stateful in GPU memory during generation. Every active request holds a live Key-Value (KV) cache — a per-request memory structure that stores the intermediate attention computations for every token generated so far. For a 70B parameter model handling a 4,096-token context, this KV cache consumes several gigabytes of GPU memory per request.

This has a consequence that conventional capacity planning ignores: you can run out of memory while compute is still 40% idle.

With a stateless API, you add instances and requests distribute across them. With LLM inference, new instances don't help in-flight requests. Each active generation holds memory that cannot be migrated, shared, or compressed without interrupting the request. When a burst of long-context requests arrives simultaneously, they collectively fill GPU memory before filling GPU compute. New requests queue not because the GPU is busy, but because there is nowhere to put their KV caches.

The auto-scaling math that worked for your REST API assumes request independence and stateless compute. Neither is true here.

The KV Cache is Your Real Bottleneck

To plan for bursts correctly, you need to think in memory, not request counts.

Each concurrent request holds memory proportional to its sequence length and the model's layer count. For a 13B parameter model, a single 8K-token context might require 1.5–3 GB of KV cache memory. A GPU with 80 GB of VRAM, with model weights loaded, might have 30–40 GB available for KV caches — enough for 10–20 long-context requests simultaneously, regardless of how many spare CUDA cores exist.
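
The arithmetic above can be sketched as a quick estimator. The model dimensions below (layer count, KV heads, head size, weight footprint) are illustrative assumptions for a ~13B model with grouped-query attention, not measurements of any specific checkpoint:

```python
def kv_cache_bytes_per_request(seq_len, n_layers, n_kv_heads, head_dim,
                               bytes_per_elem=2):
    """Estimate per-request KV cache size: keys + values, for every
    layer, at every token position, in fp16 (2 bytes per element)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative dimensions for a ~13B model using grouped-query attention.
per_request = kv_cache_bytes_per_request(
    seq_len=8192, n_layers=40, n_kv_heads=10, head_dim=128)
print(f"KV cache per 8K-token request: {per_request / 2**30:.1f} GiB")

# Concurrency ceiling: memory left after weights, divided by the
# per-request footprint. 50 GiB of weights is an assumed placeholder.
vram_gib, weights_gib = 80, 50
headroom = (vram_gib - weights_gib) * 2**30
print(f"Max concurrent long-context requests: {headroom // per_request}")
```

Note that the ceiling falls out of memory alone; nothing in the calculation references compute at all.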

Under a traffic spike, this means:

  • Request concurrency is bounded by memory, not by FLOPS.
  • Adding more GPU compute to existing instances does nothing.
  • The queue fills with requests waiting for memory to free up, not for a CPU to become available.
  • Latency climbs because queued requests wait for existing generations to complete and release their KV cache allocations.

The input/output token counts have enormous variance in practice. A user chat message might be 50 tokens; a document summarization task might be 12,000 tokens. A single long-context request can consume the memory budget of 80 short ones. This makes burst capacity planning probabilistic rather than deterministic — you are modeling a distribution, not a fixed cost.
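
Because context length is a distribution, memory demand is best estimated by sampling rather than by a single multiplication. A toy Monte Carlo sketch — the lognormal parameters and per-token cost are made-up illustrative values, not measurements:

```python
import random

def simulate_memory_demand(n_requests, bytes_per_token, trials=2000, seed=7):
    """Sample total KV cache demand for a burst of n_requests whose
    context lengths follow a heavy-tailed (lognormal) distribution,
    and return the p50 and p95 of total demand across trials."""
    rng = random.Random(seed)
    totals = []
    for _ in range(trials):
        lengths = [min(int(rng.lognormvariate(6.5, 1.2)), 16384)
                   for _ in range(n_requests)]
        totals.append(sum(lengths) * bytes_per_token)
    totals.sort()
    return totals[int(0.50 * trials)], totals[int(0.95 * trials)]

p50, p95 = simulate_memory_demand(n_requests=50, bytes_per_token=200_000)
print(f"p50 demand: {p50 / 2**30:.0f} GiB, p95 demand: {p95 / 2**30:.0f} GiB")
```

The gap between p50 and p95 is the point: sizing for the median burst leaves you exposed to the long tail of context lengths.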

What "Capacity" Actually Means for LLM Serving

When planning for spikes, you need to model three separate ceilings:

Memory capacity: Total KV cache memory available across your serving fleet. This determines maximum concurrent request depth at your typical context length distribution.

Compute throughput: Tokens per second the fleet can generate. This determines how quickly in-flight requests complete and release their KV caches, which affects queue drain rate.

Cold-start latency: How long it takes a new instance to become request-ready. For LLMs, this includes container image pull (minutes), model weight loading (minutes), and runtime initialization. Contrast this with a stateless service where a new container is ready in seconds.

The interaction between these three is what makes spikes dangerous. A spike increases queue depth (bounded by memory capacity), slows drain (because queued requests also compete for compute), and triggers autoscaling — which doesn't deliver capacity for four to eight minutes. The system can dig itself into a deeper hole before new capacity arrives.

Contrast this with conventional services: when a stateless API autoscales, new instances are ready in seconds, the queue drains, and the system recovers quickly. With LLM inference, the recovery latency is an order of magnitude longer.

Capacity Planning Math for Predictable Bursts

For predictable traffic events — scheduled batch jobs, product launches, marketing campaigns — you can apply deterministic capacity planning before the event.

Start with your expected token budget. Estimate the burst request volume, multiply by your p95 context length, and compute total tokens to be processed during the burst window. Divide by your GPU fleet's measured tokens-per-second throughput. Add enough headroom for queueing dynamics, typically 1.5–2x to account for uneven arrival patterns and long-tail context lengths.

A worked example: You expect 500 concurrent users during a 15-minute burst window, each generating a 1,000-token prompt and receiving a 500-token response. That's 750,000 total tokens. Your serving fleet produces 20,000 tokens per second across all GPUs. Naive calculation says you need about 38 seconds of sustained generation — but that assumes perfectly batched requests. In practice, queuing and memory pressure stretch the window. Plan for 2–3x headroom: size your fleet for 40,000–60,000 tokens per second during the burst window, then scale back after.

For KV cache memory planning: multiply your maximum concurrent request target by your p95 per-request KV cache footprint. This gives you the minimum GPU memory you need dedicated to KV cache. Subtract model weight memory from your total GPU memory, and that's your concurrency ceiling.
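
The worked example above, in code. The inputs are the article's illustrative numbers; the concurrency target and p95 footprint in the memory calculation are assumed placeholders:

```python
def size_burst_fleet(users, tokens_per_user, measured_tps, headroom=3.0):
    """Token-budget sizing: total burst tokens, naive drain time at the
    fleet's measured throughput, and a headroom-adjusted target rate."""
    total_tokens = users * tokens_per_user
    naive_drain_sec = total_tokens / measured_tps
    target_tps = measured_tps * headroom
    return total_tokens, naive_drain_sec, target_tps

total, drain, target = size_burst_fleet(
    users=500, tokens_per_user=1500, measured_tps=20_000, headroom=3.0)
print(total, round(drain, 1), target)   # 750000 37.5 60000.0

# KV memory planning: concurrency target x p95 per-request footprint.
concurrency_target, p95_kv_gib = 40, 2.5
print(f"Minimum KV cache memory: {concurrency_target * p95_kv_gib:.0f} GiB")
```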

Pre-Warming: The Strategy Teams Skip

The most reliable way to survive a known burst is to have capacity warm and ready before requests arrive.

Pre-warm KV caches for common system prompts. If your application uses a fixed or templated system prompt, you can process it ahead of time and cache the resulting KV pairs. Incoming requests with the same system prompt prefix skip re-computation and go directly to generating output. This significantly reduces time-to-first-token under load because the expensive prefill phase is already done.
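
The idea can be illustrated with a toy cache keyed by a hash of the system prompt. Production serving stacks (vLLM's prefix caching, for example) do this at the KV-block level in GPU memory; this sketch only shows the lookup logic, with `fake_prefill` standing in for the expensive prefill phase:

```python
import hashlib

class PrefixCache:
    """Toy cache: map a system-prompt hash to its (pretend) prefilled
    KV state so repeated prompts skip re-computation."""
    def __init__(self):
        self._store = {}
        self.hits = self.misses = 0

    def get_or_prefill(self, system_prompt, prefill_fn):
        key = hashlib.sha256(system_prompt.encode()).hexdigest()
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = prefill_fn(system_prompt)  # expensive path
        return self._store[key]

def fake_prefill(prompt):
    # Placeholder for the real prefill computation.
    return f"kv-state-for-{len(prompt)}-chars"

cache = PrefixCache()
cache.get_or_prefill("You are a helpful assistant.", fake_prefill)
cache.get_or_prefill("You are a helpful assistant.", fake_prefill)
print(cache.hits, cache.misses)   # 1 1 — second call skips prefill
```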

Reserve capacity from your provider. Major cloud providers offer inference capacity reservations for predictable workloads. Rather than relying on the spot market — which dries up precisely when you need burst capacity — a reserved block guarantees GPU availability at the moment you need it. The economic trade-off is worthwhile for predictable burst events with known revenue impact.

Stage scaled-out instances before load arrives. Because cold-start latency is minutes rather than seconds, your autoscaling trigger needs to fire well before the burst hits peak. Model your expected traffic ramp and set scale-out thresholds conservatively early. An instance that comes online with two minutes to spare is useful; one that comes online two minutes after peak provides zero benefit.
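
The timing arithmetic is simple but worth making explicit: given a forecast time-to-peak and a measured cold start, there is a hard deadline for firing the scale-out trigger. The numbers below are placeholders:

```python
def scale_out_deadline(peak_eta_sec, cold_start_sec, safety_margin_sec=120):
    """Latest moment (seconds from now) to trigger scale-out so that
    new instances are request-ready before the forecast peak."""
    deadline = peak_eta_sec - cold_start_sec - safety_margin_sec
    if deadline <= 0:
        raise RuntimeError("Too late: trigger scale-out immediately")
    return deadline

# Peak forecast in 10 min; instances take ~5 min to become request-ready.
print(scale_out_deadline(peak_eta_sec=600, cold_start_sec=300))  # 180
```

With a five-minute cold start, a trigger that fires three minutes into a ten-minute ramp is already the last safe moment — a reactive threshold would fire far too late.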

Pre-load containers on warm hosts. AWS SageMaker's Fast Model Loader introduced weight streaming to reduce this latency, but even with improvements, model loading takes non-trivial time. Container image pre-loading on standby hosts reduces the wall-clock delta from "autoscale trigger" to "serving traffic."

Graceful Degradation When You Run Out

Even well-planned systems hit unexpected spikes. The difference between a graceful degradation and an outage is what you do when your capacity ceiling is reached.

Implement admission control at the gateway. Rather than allowing all requests to queue indefinitely (which causes cascading latency for everyone), set explicit queue depth limits. When the queue is full, return a 503 with a Retry-After header rather than holding connections open. Clients can retry with exponential backoff; this prevents the queue from growing without bound while keeping response times predictable for the requests that do get admitted.
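
A minimal sketch of the gateway-side check, with the depth limit and retry hint as illustrative values:

```python
class AdmissionController:
    """Gateway-side admission control: reject with 503 + Retry-After
    once the inference queue reaches its depth limit."""
    def __init__(self, max_queue_depth=50, retry_after_sec=10):
        self.max_queue_depth = max_queue_depth
        self.retry_after_sec = retry_after_sec
        self.queue = []

    def admit(self, request):
        if len(self.queue) >= self.max_queue_depth:
            # Fail fast instead of letting latency cascade for everyone.
            return 503, {"Retry-After": str(self.retry_after_sec)}
        self.queue.append(request)
        return 202, {}

gate = AdmissionController(max_queue_depth=2)
print([gate.admit(f"req-{i}")[0] for i in range(3)])  # [202, 202, 503]
```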

Route by query complexity. Not all requests need your largest model. Under capacity pressure, implement a routing layer that classifies incoming queries and directs shorter, simpler requests to a smaller (cheaper, faster) model tier. A question with a 200-token prompt and expected 100-token response puts almost no KV cache pressure; it can run on a smaller model without users noticing. Reserve your large-model capacity for requests that genuinely require it.
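
A routing layer like this can be as simple as a token-count threshold; the 512-token cutoff below is an illustrative assumption, and real routers may also classify on query type:

```python
def route_request(prompt_tokens, expected_output_tokens,
                  small_model_limit=512):
    """Under capacity pressure, send short, simple requests to a
    smaller model tier; reserve the large model for the rest."""
    if prompt_tokens + expected_output_tokens <= small_model_limit:
        return "small-model-pool"
    return "large-model-pool"

print(route_request(200, 100))     # small-model-pool
print(route_request(12000, 800))   # large-model-pool
```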

Implement request priority lanes. Separate your inference traffic into lanes by SLO tier. Interactive user requests in one queue; background batch processing in another. Under capacity pressure, the batch queue sheds load first. Users get degraded throughput; background jobs catch up later. This prevents low-value batch traffic from competing with high-value interactive requests during spikes.

Set maximum sequence length limits during degradation. When you detect that KV cache memory utilization is approaching your ceiling, temporarily lower the maximum context length you'll accept for new requests. Long-context requests are disproportionate consumers of KV cache memory. Capping them during a spike frees memory for more short-context requests and keeps overall throughput higher.
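
The cap can be driven directly off the KV cache utilization metric. The threshold and limits below are illustrative, not recommendations:

```python
def max_context_for_pressure(kv_utilization, normal_limit=32768,
                             degraded_limit=4096, threshold=0.85):
    """Temporarily cap the accepted context length when KV cache
    memory utilization approaches the ceiling."""
    return degraded_limit if kv_utilization >= threshold else normal_limit

print(max_context_for_pressure(0.60))  # 32768 — normal operation
print(max_context_for_pressure(0.92))  # 4096  — degraded mode
```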

Prefill-Decode Disaggregation as Architectural Foundation

The shift that changed production LLM serving most significantly — adopted across Meta, LinkedIn, Hugging Face, and every major open-source framework by mid-2025 — is separating the prefill phase from the decode phase onto different GPU pools.

Prefill is compute-intensive and parallelizable: you process all input tokens simultaneously. Decode is memory-bandwidth-intensive: you generate one token at a time, reading the full KV cache each step.

These have different hardware characteristics and burst behavior. A traffic spike increases both prefill demand and decode demand, but not in lockstep. By disaggregating them, you can independently scale each pool to match its actual bottleneck. DistServe (OSDI 2024) demonstrated 7.4x higher throughput or 12.6x tighter SLO adherence compared to co-located systems.

For burst planning, disaggregation means you can identify which pool is becoming the bottleneck during a spike and scale it independently. A burst of short-prompt, long-output requests saturates the decode pool; a burst of long-document summarization requests saturates the prefill pool. Treating them as a single monolithic service masks the actual constraint and leads to over-provisioning the wrong resource.

Measuring What Matters

Most teams instrument tokens per second at the aggregate system level, which is necessary but not sufficient. For burst capacity planning, you need visibility into:

  • KV cache memory utilization per GPU, not just overall GPU memory. The delta between total memory and model weights gives you your concurrency headroom.
  • Queue depth and queue wait time as leading indicators, not lagging symptoms. A queue depth above 10 is a warning; above 50 is an emergency.
  • Time-to-first-token (TTFT) distribution, especially p95 and p99. TTFT spikes before overall throughput degrades — it's your earliest signal that a burst is stressing the system.
  • Per-request context length distribution. A sudden shift toward longer contexts eats your KV cache budget faster than a request volume increase at constant length.

Build your dashboards around these, not aggregate RPS or CPU utilization. The metrics that predicted capacity issues for stateless services are the wrong ones for LLM inference.

The Pre-Scale Mindset

The core adjustment burst capacity planning requires is abandoning the reactive autoscaling mindset. Systems that auto-scale on CPU utilization hitting 80% work because the scale-out latency (seconds) is much smaller than the spike duration (minutes). For LLM inference, scale-out latency is minutes, and your spike may be over before new capacity is ready.

Effective LLM capacity planning is pre-emptive and predictive. Use traffic forecasting to project load windows. Pre-warm capacity before you need it. Size for your p99 event, not your p50. And treat graceful degradation — admission control, query routing, context length limits — as first-class operational concerns, not afterthoughts for when things go wrong.

The teams that survive the LLM equivalent of Black Friday are the ones who planned for it in August, not the ones who fired up the autoscaler at midnight on Thanksgiving.
