Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference
Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.
Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears — against roughly 30 milliseconds per token once the model is warm. That is a thousand-fold latency gap between the cold path and the warm path, and it is entirely a function of how idle your traffic happens to be.
This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.
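A minimal sketch of how that hiding works, assuming a hypothetical traffic mix in which 2% of requests land on a cold worker. The 0.2 s warm time-to-first-token and the 2% cold share are illustrative assumptions, not measurements; the 40 s cold figure is the one quoted above.

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of values at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]

# Hypothetical mix: 1,000 requests, 20 of which (2%) hit a cold worker.
WARM_TTFT_S = 0.2   # assumed warm time-to-first-token
COLD_TTFT_S = 40.0  # cold-start figure quoted in the text
latencies = [WARM_TTFT_S] * 980 + [COLD_TTFT_S] * 20

p50 = percentile(latencies, 50)
p95 = percentile(latencies, 95)
p99 = percentile(latencies, 99)
mean = sum(latencies) / len(latencies)

print(f"p50={p50}s p95={p95}s p99={p99}s mean={mean:.3f}s")
```

Under these assumptions the median and p95 are indistinguishable from an always-warm service, while p99 is the full cold start. A dashboard that stops at p95 would report nothing wrong.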
The reason this is worse than the serverless cold start you already know is structural. A stateless function cold start pays for a container pull and a runtime boot — a few hundred milliseconds to a couple of seconds. An LLM cold start pays for all of that plus moving the model weights into VRAM, plus CUDA context initialization, plus kernel compilation and CUDA graph capture, plus KV-cache allocation. The model weights are not a side effect of the cold state. The model weights are the cold state. You cannot lazy-load your way out of a 16 GB checkpoint, because the checkpoint is the thing the request needs.
Anatomy of a Cold Start
When a serverless platform scales a GPU worker up from zero, the user's request is blocked behind a pipeline of sequential stages:
- GPU allocation and scheduling. The platform finds a physical GPU, attaches it, and schedules your container onto it. On a busy region this alone can queue.
- Container pull and boot. Your image — frequently several gigabytes once CUDA, PyTorch, and the inference server are baked in — is fetched and started.
- Weight fetch. The model checkpoint is read from object storage or a network volume into host memory. For a 70B-class model in half precision (2 bytes per parameter) this is roughly 140 GB of transfer.
- Weight load into VRAM. Host memory is copied across the PCIe bus into the GPU.
- Warm-up. CUDA context creation, torch.compile if you use it, kernel autotuning, CUDA graph capture, and KV-cache pre-allocation.
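The stages above can be turned into a rough back-of-envelope budget. The sketch below is illustrative only: every bandwidth and fixed-time figure in `ColdStartAssumptions` is a hypothetical placeholder, to be replaced with numbers measured on your own platform.

```python
from dataclasses import dataclass

@dataclass
class ColdStartAssumptions:
    # All values are hypothetical placeholders, not benchmarks.
    params_billions: float = 70.0   # model size
    bytes_per_param: int = 2        # half precision
    image_gb: float = 8.0           # container image size
    network_gb_per_s: float = 2.0   # storage/registry -> host throughput
    pcie_gb_per_s: float = 25.0     # host -> GPU effective throughput
    scheduling_s: float = 2.0       # GPU allocation and scheduling
    boot_s: float = 3.0             # container start after pull
    warmup_s: float = 10.0          # CUDA context, graph capture, KV-cache

def cold_start_budget(a: ColdStartAssumptions) -> dict:
    """Estimate seconds spent in each sequential cold-start stage."""
    weights_gb = a.params_billions * a.bytes_per_param  # 1e9 params * bytes = GB
    return {
        "gpu_allocation": a.scheduling_s,
        "container_pull_and_boot": a.image_gb / a.network_gb_per_s + a.boot_s,
        "weight_fetch": weights_gb / a.network_gb_per_s,
        "weight_load_to_vram": weights_gb / a.pcie_gb_per_s,
        "warm_up": a.warmup_s,
    }

budget = cold_start_budget(ColdStartAssumptions())
total_s = sum(budget.values())
weight_share = (budget["weight_fetch"] + budget["weight_load_to_vram"]) / total_s

for stage, seconds in budget.items():
    print(f"{stage:>24}: {seconds:6.1f} s")
print(f"{'total':>24}: {total_s:6.1f} s (weight movement {weight_share:.0%})")
```

With these placeholder numbers, weight fetch and load account for most of the total. Rerunning with your own measurements shows whether the same holds for your model and storage path.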
The important property here is that these stages are mostly sequential and dominated by one term. For large models, weight movement swamps everything else. That is good news and bad news. Bad news: you cannot optimize the small stages and expect a meaningful win. Good news: there is exactly one bottleneck worth attacking, and the entire mitigation toolkit is aimed at it.
The Cost-Versus-Latency Frontier Nobody Models Honestly
Before reaching for mitigations, do the arithmetic, because the honest version of it kills a lot of scale-to-zero deployments outright.
Scale-to-zero is a bet that idle cost saved exceeds latency cost incurred. The idle cost is easy to see: it is the GPU-hour price multiplied by the hours you would otherwise sit unused. The latency cost is harder, because it is paid by users, not by your finance team, and it shows up as abandoned sessions and bad reviews rather than a line on the invoice.
There is a break-even traffic rate, and most teams never compute it. The logic is simple. If your requests arrive far enough apart that the platform scales down between them, every single request is a cold start. Consider a workload where actual inference takes five seconds but the cold start takes thirty: 30 of every 35 seconds, about 86% of the wall-clock time and of the metered GPU seconds you pay for, goes to waiting rather than processing. You are not saving money. You are paying a premium for the privilege of being slow.
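The arithmetic can be sketched in a few lines. The function names are hypothetical, and the second function assumes Poisson arrivals and a fixed idle timeout before scale-down. It also ignores service time, so treat it as a rough estimate, not a model of any particular platform.

```python
import math

def cold_start_share(cold_start_s: float, inference_s: float,
                     requests_per_warm_period: int = 1) -> float:
    """Fraction of billed wall-clock time spent on cold start rather than inference."""
    busy_s = inference_s * requests_per_warm_period
    return cold_start_s / (cold_start_s + busy_s)

def cold_request_probability(rate_per_s: float, idle_timeout_s: float) -> float:
    """Chance a request finds the worker scaled down.

    Assumes Poisson arrivals: the gap to the previous request exceeds the
    idle timeout with probability exp(-rate * timeout).
    """
    return math.exp(-rate_per_s * idle_timeout_s)

# The example from the text: 30 s cold start, 5 s of inference, one request per wake-up.
every_request_cold = cold_start_share(30, 5)
# If ten requests share one warm period, the overhead is amortized.
ten_per_wakeup = cold_start_share(30, 5, requests_per_warm_period=10)
# Hypothetical traffic: one request every 10 minutes on average, 5-minute idle timeout.
p_cold = cold_request_probability(rate_per_s=1 / 600, idle_timeout_s=300)

print(f"{every_request_cold:.1%} {ten_per_wakeup:.1%} {p_cold:.1%}")
```

The break-even question then becomes concrete: at what arrival rate and idle timeout does the cold-request probability fall low enough that the amortized overhead is acceptable?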
The general guidance that has emerged from 2025–2026 production experience is blunt: if your average GPU utilization exceeds roughly 40–50%, dedicated always-on capacity is cheaper than serverless, not just faster. Serverless wins for genuinely bursty or sporadic workloads — internal tools, batch jobs, a new feature still hunting for product-market fit — where idle hours dominate and the occasional cold start is tolerable. It loses badly for steady, latency-sensitive traffic, which is exactly the traffic a user-facing chat or voice product generates.
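One way to see where a utilization threshold like that comes from is a simple price-ratio model. This is a sketch under stated assumptions: the hourly prices below are hypothetical, serverless is assumed to bill only for busy time, and cold-start overhead (which would push the break-even lower) is ignored.

```python
def break_even_utilization(dedicated_per_hour: float, serverless_per_hour: float) -> float:
    """Utilization above which an always-on GPU costs less than per-second serverless billing."""
    if serverless_per_hour <= 0:
        raise ValueError("serverless_per_hour must be positive")
    return min(1.0, dedicated_per_hour / serverless_per_hour)

def monthly_cost(utilization: float, dedicated_per_hour: float,
                 serverless_per_hour: float, hours: float = 730) -> dict:
    """Compare monthly cost: dedicated bills every hour, serverless only busy hours."""
    return {
        "dedicated": dedicated_per_hour * hours,
        "serverless": serverless_per_hour * hours * utilization,
    }

# Hypothetical prices: $2.00/h reserved versus $4.50/h metered for the same GPU class.
DEDICATED, SERVERLESS = 2.00, 4.50
threshold = break_even_utilization(DEDICATED, SERVERLESS)
sporadic = monthly_cost(0.20, DEDICATED, SERVERLESS)
steady = monthly_cost(0.60, DEDICATED, SERVERLESS)

print(f"break-even utilization: {threshold:.0%}")
print(f"20% utilization: {sporadic}")
print(f"60% utilization: {steady}")
```

With these assumed prices the crossover sits in the mid-40% range. The exact figure depends entirely on the premium your provider charges for metered capacity, so plug in real quotes before drawing conclusions.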
- https://acecloud.ai/blog/cold-start-latency-llm-inference/
- https://regolo.ai/scale-to-zero-cold-start-latency-why-serverless-gpu-breaks-real-time-ai-and-how-to-fix-it/
- https://arxiv.org/html/2411.15664v1
- https://arxiv.org/pdf/2502.15524
- https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/
- https://modal.com/blog/gpu-mem-snapshots
- https://blog.vllm.ai/2025/10/26/sleep-mode.html
- https://www.spheron.network/blog/ai-inference-cost-economics-2026/
- https://www.clarifai.com/blog/serverless-vs-dedicated-gpu
- https://machinelearningatscale.substack.com/p/tackling-the-llm-cold-start-problem
