Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache
Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.
Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute doesn't help, because the bottleneck isn't where you assumed it was.
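To make the arithmetic concrete, here is a minimal back-of-envelope sketch in Python. The only figure taken from the scenario above is the four-minute weight load; the request rates, existing capacity, and boot delay are illustrative assumptions, not measurements.

```python
# Back-of-envelope sketch: how deep does the request queue get while a
# reactive autoscaler waits out a GPU cold start? All defaults below are
# assumed values for illustration, except the four-minute weight load.

def queue_depth_during_spike(
    baseline_rps: float = 100.0,     # steady-state request rate (assumed)
    spike_multiplier: float = 3.0,   # Black Friday: 3x traffic (assumed)
    capacity_rps: float = 120.0,     # what the existing GPUs can serve (assumed)
    detect_and_boot_s: float = 60.0, # autoscaler reaction + instance boot (assumed)
    weight_load_s: float = 240.0,    # "four minutes of model weight loading"
) -> float:
    """Requests queued by the time the new GPUs are actually serving traffic."""
    spike_rps = baseline_rps * spike_multiplier
    overload_rps = max(spike_rps - capacity_rps, 0.0)
    time_to_usable_capacity_s = detect_and_boot_s + weight_load_s
    return overload_rps * time_to_usable_capacity_s


if __name__ == "__main__":
    backlog = queue_depth_during_spike()
    print(f"~{backlog:,.0f} requests queued before new capacity serves traffic")
    # With these assumptions: (300 - 120) req/s * 300 s = 54,000 queued requests.
```

Under these assumed numbers, the backlog reaches tens of thousands of requests before the new instances serve a single token; the same scale-up event for a stateless container fleet would have closed the gap in about a minute.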
AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.
