
2 posts tagged with "serverless"


Warm Pools and Cold Truths: The Hidden Latency Floor of Serverless LLM Inference

· 9 min read
Tian Pan
Software Engineer

Autoscaling your GPU inference to zero looks like obvious cost discipline. The GPU is the most expensive line item on the bill, traffic is bursty, and the idle hours are pure waste. So you turn on scale-to-zero, watch the cloud invoice drop, and congratulate yourself.

Then a user shows up after a quiet stretch, and their first request takes sixty seconds to return a single token. Production deployments running serverless LLM inference routinely report cold starts exceeding 40 seconds before the first token appears, against roughly 30 milliseconds per token once the model is warm. That is a gap of more than a thousand-fold between the cold path and the warm path, and whether a given request pays it depends entirely on how idle your traffic happens to be.

This is the trade nobody puts on the slide. Scale-to-zero does not eliminate cost; it converts a steady dollar cost into a spiky latency cost, and then hides that latency cost in the p99 tail where the dashboard rarely looks.
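To make the "hidden in the tail" point concrete, here is a minimal back-of-the-envelope sketch. It rests on assumptions that are not from the post: a single replica, Poisson request arrivals, and a hypothetical idle timeout after which the replica scales to zero. Under those assumptions, a request is cold when the gap before it exceeds the timeout, which happens with probability exp(-rate × timeout). The 40-second and 30-millisecond figures are reused from above as placeholders, and treating the warm per-token time as the warm first-token latency is a simplification.

```python
import math


def cold_request_fraction(requests_per_hour: float, idle_timeout_s: float) -> float:
    """Fraction of requests that arrive after an idle gap longer than the timeout.

    Assumes Poisson arrivals and one replica that scales to zero after
    idle_timeout_s seconds without traffic (both are illustrative assumptions).
    """
    rate_per_s = requests_per_hour / 3600.0
    return math.exp(-rate_per_s * idle_timeout_s)


def mean_first_token_latency_s(
    cold_fraction: float, cold_s: float = 40.0, warm_s: float = 0.03
) -> float:
    """Blend cold and warm first-token latency by the share of cold requests."""
    return cold_fraction * cold_s + (1.0 - cold_fraction) * warm_s


# Hypothetical traffic levels with a hypothetical 5-minute idle timeout.
for rph in (1, 10, 60, 600):
    frac = cold_request_fraction(rph, idle_timeout_s=300)
    mean_s = mean_first_token_latency_s(frac)
    print(f"{rph:>4} req/h: {frac:8.3%} cold, mean first token {mean_s:7.3f} s")
```

Under this model, at 60 requests per hour fewer than 1% of requests are cold, so a p99 chart can look healthy while a small share of users still waits through the full cold start.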

The Cold Start Tax on Serverless AI Agents

· 11 min read
Tian Pan
Software Engineer

A standard Lambda function with a thin Python handler cold-starts in about 250ms. Your AI agent, running on the same runtime with a few SDK imports added, cold-starts in 8–12 seconds. Add local model inference and you're at 40–120 seconds. The first user to hit a scaled-down deployment waits the length of a TV commercial before the agent responds. That gap (not latency per inference token, not throughput, but the initial startup cost) is where most serverless AI deployments quietly fail their users.

The problem isn't unique to serverless, but serverless makes it visible. When you run agents on always-on infrastructure, you pay for idle capacity and cold starts never happen. When you embrace scale-to-zero to cut costs, every period of low traffic becomes a trap waiting for the next request.
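One way to see how much of a cold start comes from imports alone is to time them individually. The sketch below is illustrative only: `time_import` is a hypothetical helper, and the standard-library modules are stand-ins for whatever SDKs an agent actually loads.

```python
import importlib
import sys
import time


def time_import(module_name: str) -> float:
    """Return the seconds spent importing module_name.

    A module already present in sys.modules returns almost instantly,
    so run this in a fresh interpreter to approximate a cold start.
    """
    start = time.perf_counter()
    importlib.import_module(module_name)
    return time.perf_counter() - start


# Stand-in modules; replace with the SDKs the handler really imports.
for name in ("json", "decimal", "email.mime.text"):
    already_loaded = name in sys.modules
    elapsed_ms = time_import(name) * 1000.0
    print(f"{name:<16} {elapsed_ms:8.2f} ms (already loaded: {already_loaded})")
```

For a per-module breakdown without writing any code, CPython's `-X importtime` flag prints cumulative import times to stderr.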