The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part
A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster TTFT — and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry against a JSON schema validator that was loading on first use.
This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.
The pattern is consistent enough that it deserves its own discipline. "Agent latency" decomposes into far more components than the model call, and each one has a cold-start cost that compounds with the others. The fix isn't a single optimization; it's a set of architectural choices made before the first production cold start, because 3am in production is the wrong time to discover you need them.
The Anatomy of an Agent Cold Start
A containerized agent on a fresh node walks through a startup sequence that no individual engineer designed end-to-end. Each component is reasonable in isolation; the composition is what hurts.
The container image pulls. For a typical agent image with embedded model weights, embedding libraries, and framework dependencies, this can be 8–15GB. Standard container registries weren't designed for this kind of throughput; pull-bound cold starts often dominate the budget, and the pull phase alone can be 30–60 seconds before the runtime starts.
The runtime initializes. The model client connects to the provider, negotiates auth, fetches available models, and warms its HTTP connection pool. If the agent uses a local embedding model, that loads into memory or VRAM — for a small embedding model, 2–5 seconds; for a larger reranker, 15–30 seconds. None of this shows up in tracing tools that start measuring at the first user request.
The tool registry materializes. Modern agents discover tools at startup: an MCP server connection, a function-calling registry, a Spring Bean registration pass. Each tool comes with a JSON schema that gets parsed and validated. Recent benchmarks of MCP servers in production found that schema validation issues account for 38% of all reliability failures, and that the validation work itself is non-trivial — median tool-call latency is 320ms, but P95 jumps to 1,840ms and P99 to 6,200ms. A registry of 50 tools that validates serially on first request can spend 10–15 seconds before the agent can even consider what to call.
The vector store handle establishes. The first query against pgvector with HNSW indexing runs in 5–8ms once warm — but the first connection negotiates TLS, authenticates, and on managed services may trigger a serverless warmup of its own that adds 200–800ms. Pinecone adds 10–20ms of network latency per query in steady state, but cold connections see 1–3 seconds. Connection pools that idle out under low traffic re-pay this cost on the next request.
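One way to keep that cost out of the first user request is to pay it during startup, where it can be counted as cold-start time. A minimal sketch, assuming pgvector behind psycopg2 and a connection string supplied through an illustrative PGVECTOR_DSN environment variable:

```python
import os
import time

import psycopg2  # assumes psycopg2 and a pgvector-backed Postgres; names here are illustrative


def warm_vector_store(dsn: str = os.environ["PGVECTOR_DSN"]) -> float:
    """Open the connection and run a trivial query at startup, so TLS negotiation,
    auth, and any managed-service wake-up happen before the first user request."""
    start = time.monotonic()
    conn = psycopg2.connect(dsn)        # TLS handshake + auth happen here
    with conn.cursor() as cur:
        cur.execute("SELECT 1")         # forces the first round trip (and serverless warmup)
        cur.fetchone()
    return time.monotonic() - start     # report this as cold-start time, not request latency
```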
The prompt cache primes. Anthropic's prompt cache has a 5-minute TTL by default and a 60-minute extended TTL. Cache creation for a large system prompt takes 2–4 seconds and is billed at 1.25× the base input-token rate. A naive implementation that fires off 10 parallel requests against a fresh container produces 10 cache writes, 0 cache reads, and a bill 5–10× what was modeled. Hit rates as low as 4% have been measured for teams that parallelize without warming first.
Add it up. A poorly-tuned agent on a fresh container can spend 60–120 seconds before the first user request gets a response, during which none of the tracing showed anything because tracing started after initialization completed.
Why Profiling the LLM Misses Most of It
The instrumentation defaults built into popular agent frameworks measure model latency, tool-call latency, and end-to-end request latency. They do not, by default, measure the time between container start and first-request-ready. They do not measure schema validation time, embedding model load time, or first-connection latency for downstream services. They start the clock when the request arrives.
This is a measurement artifact with operational consequences. The dashboards that engineering reviews in the daily standup show the steady-state latency — which is fine, because steady-state latency is what most users see. But "most users" isn't the same as "the user who tweeted the screenshot." The cold-start tail lives in a different distribution: it's not a fat tail of the request latency histogram, it's a separate mode that only fires under specific conditions (a deploy, a scale-up event, a traffic gap that exceeded the cache TTL).
The fix isn't more aggressive aggregation; it's a separate metric. Treat cold-start latency — defined as the wall clock from container start to first-request-ready — as a first-class SLO. Page on it. Dashboard it next to TTFT. Run it in CI on a fresh container, not against a warm staging environment that has been up for hours.
A useful decomposition for this metric: image pull time, runtime init time, framework init time, downstream warmup time (embedding model, vector store, prompt cache), and time-to-first-ready. Each of these has a different fix, and lumping them into a single number guarantees the wrong fix gets prioritized.
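A minimal sketch of that decomposition for a Python agent process; the phase names mirror the list above, the print call stands in for whatever metrics backend is already in use, and image pull time has to come from the orchestrator because the process can't observe its own pull:

```python
import time
from contextlib import contextmanager

# Record each boot phase separately so a single "cold start was slow" number
# never hides which phase regressed.
PHASES: dict[str, float] = {}
BOOT_START = time.monotonic()


@contextmanager
def boot_phase(name: str):
    t0 = time.monotonic()
    try:
        yield
    finally:
        PHASES[name] = time.monotonic() - t0


def mark_first_request_ready() -> None:
    PHASES["time_to_first_ready"] = time.monotonic() - BOOT_START
    for phase, seconds in PHASES.items():
        print(f"cold_start.{phase}_seconds={seconds:.2f}")  # swap for a real gauge/histogram


# Usage during startup (function names are illustrative):
# with boot_phase("runtime_init"):      init_model_client()
# with boot_phase("framework_init"):    register_tools()
# with boot_phase("downstream_warmup"): warm_vector_store(); prewarm_cache()
# mark_first_request_ready()
```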
The Patterns That Hide the Latency
Once the metric is in place, the engineering work is recognizable. None of it is novel — the patterns come from a decade of serverless and FaaS work — but the LLM ecosystem has been slow to apply them.
Warm-pool sizing tied to traffic. The cheapest fix to cold starts is to not have them. Maintain a pool of pre-warmed replicas sized to the 95th-percentile traffic burst rather than the median. The cost is straightforward to model: marginal idle compute vs. the user-visible latency of cold-starting under load. For a chat product, the calculus almost always favors the warm pool. For a batch agent that runs every six hours, scale-to-zero is fine.
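The shape of that model fits in a few lines. Every number below is illustrative rather than measured:

```python
# Back-of-envelope warm-pool calculus: hourly cost of idle warm replicas versus the
# user-visible cold-start seconds the pool absorbs. All numbers are placeholders.
warm_replicas       = 4      # sized to the p95 burst, not the median
idle_cost_per_hour  = 0.50   # $ per replica-hour of idle compute
cold_starts_per_day = 20     # deploys + scale-ups + cache-TTL expiries the pool absorbs
cold_start_seconds  = 75     # measured cold-start latency without the pool

daily_idle_cost    = warm_replicas * idle_cost_per_hour * 24
user_seconds_saved = cold_starts_per_day * cold_start_seconds

print(f"${daily_idle_cost:.2f}/day of idle compute avoids "
      f"{user_seconds_saved} user-visible cold-start seconds/day")
# For a chat product the left side is usually small against the right;
# for an agent that runs every six hours, four cold starts a day make scale-to-zero the obvious call.
```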
Lazy tool registration with first-use materialization. A tool registry that validates 50 schemas serially at startup is paying for capability the agent may never use in this request. Lazy registration parses tool definitions on demand, with the first call paying the validation cost and subsequent calls hitting the cache. The trade-off is that the first call to each tool is slower; the win is that the cold start completes in seconds instead of minutes.
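A minimal sketch of lazy registration, assuming tool definitions carry JSON Schema and using the jsonschema package; the tool names and schemas are placeholders:

```python
from functools import lru_cache

import jsonschema  # pip install jsonschema

# Raw tool definitions as loaded from disk or an MCP listing: cheap to hold, unparsed.
RAW_TOOL_SCHEMAS: dict[str, dict] = {
    "search_docs": {
        "type": "object",
        "properties": {"query": {"type": "string"}},
        "required": ["query"],
    },
    # ... 49 more
}


@lru_cache(maxsize=None)
def get_validator(tool_name: str) -> jsonschema.Draft202012Validator:
    """First call pays schema checking and validator construction; later calls hit the cache."""
    schema = RAW_TOOL_SCHEMAS[tool_name]
    jsonschema.Draft202012Validator.check_schema(schema)  # validation cost moved off the boot path
    return jsonschema.Draft202012Validator(schema)


def call_tool(tool_name: str, arguments: dict) -> None:
    get_validator(tool_name).validate(arguments)  # only tools actually used ever get parsed
    # ... dispatch to the real tool here
```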
Snapshot-and-restore container images. This is the technique that has the largest single impact and is the least adopted. CRIU (Checkpoint/Restore In Userspace) lets you snapshot a container that has already loaded its model, established its connections, and primed its caches — then restore from the snapshot on the next cold start. ServerlessLLM has measured this approach at 6× faster than PyTorch and 3.6× faster than safetensors for OPT-2.7B, and 8.2× faster for LLaMA-2-70B. For framework-heavy agents whose startup cost is dominated by Python imports and JSON schema parsing, the same idea delivers similar wins without any GPU-specific machinery.
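CRIU operates at the container level. For the framework-heavy Python case, a rough analog of the same idea is to do the slow work once at image-build time and restore a serialized snapshot at boot; the paths below are illustrative, and pickled state carries its own caveats (version skew, objects such as open connections that don't serialize):

```python
import json
import pathlib
import pickle

SNAPSHOT = pathlib.Path("/app/warm_state.pkl")  # baked into the image; path illustrative
TOOL_DIR = pathlib.Path("/app/tools")           # directory of tool-schema JSON files (illustrative)


def build_warm_state() -> dict:
    """Run once at image-build time (a 'bake' step in CI): do the slow parsing and
    validation work and capture the result as plain data."""
    registry = {p.stem: json.loads(p.read_text()) for p in TOOL_DIR.glob("*.json")}
    # ... validate each schema, resolve references, etc.: the serial cost described above
    return {"tool_registry": registry}


def snapshot() -> None:
    SNAPSHOT.write_bytes(pickle.dumps(build_warm_state()))


def restore() -> dict:
    """Run at container start: one deserialize instead of redoing the parse/validate work."""
    return pickle.loads(SNAPSHOT.read_bytes())
```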
Sticky routing so warm replicas keep serving. Once a replica is warm, the cheapest next request is the one routed back to it. Prefix-cache aware routing — as implemented in projects like llm-d v0.5 — keeps the request on the replica that already has the prompt prefix in its KV cache, achieving order-of-magnitude TTFT reduction over round-robin. The same logic applies one layer up: route the user's follow-up requests back to the replica that already has their session context warm, even if it's slightly more loaded than a cold alternative.
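A minimal sketch of that session-sticky layer, using rendezvous hashing so a session stays pinned to the same replica as long as it remains warm; the replica names and warm-set bookkeeping are illustrative:

```python
import hashlib

REPLICAS = ["agent-0", "agent-1", "agent-2"]
WARM: set[str] = {"agent-0", "agent-2"}  # replicas that have served traffic recently


def pick_replica(session_id: str) -> str:
    # Rendezvous (highest-random-weight) hashing keeps a session on the same replica
    # for as long as that replica stays in the candidate set, so its context stays warm.
    candidates = [r for r in REPLICAS if r in WARM] or REPLICAS  # fall back to cold only if needed
    return max(candidates,
               key=lambda r: hashlib.sha256(f"{session_id}:{r}".encode()).hexdigest())
```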
Dedicated cache warmup before parallel work. For prompt caching specifically, the rule is "one warmup, then the parallel fan-out." A prewarm_cache() helper that sends a single minimal request (max_tokens=1) carrying the static prefix establishes the cache entry; subsequent requests read it. Without that warmup, parallel requests race to write the same cache entry, every request pays for cache creation, and hit rates collapse to near zero in the first minute after deploy.
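A minimal sketch of that warmup against the Anthropic Messages API; the model name and prefix are placeholders, and the follow-up requests must carry the same cache_control marker on the same prefix to read the entry the warmup created:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
STATIC_PREFIX = "...large, stable system prompt shared by every request..."


def prewarm_cache(model: str = "claude-sonnet-4-5") -> None:
    """Send one minimal request before the parallel fan-out so the cache entry exists."""
    client.messages.create(
        model=model,                      # placeholder; use your deployed model
        max_tokens=1,                     # the API requires at least one output token
        system=[{"type": "text", "text": STATIC_PREFIX,
                 "cache_control": {"type": "ephemeral"}}],  # marks the prefix as cacheable
        messages=[{"role": "user", "content": "warmup"}],
    )


# Call prewarm_cache() once per fresh container, wait for it to return, then fan out:
# parallel requests repeating the same prefix read the entry instead of racing to write it.
```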
The Eval Discipline That Catches It
Evals have absorbed the lesson that quality regressions need pre-deploy gates. Cold-start regressions need the same treatment, and almost no team runs cold-start as part of CI.
A useful CI gate looks like this: spin up a fresh container in the test environment, send a single representative request, measure wall clock from container start to response complete, and fail the build if it exceeds a threshold. Not P95 over a load test, but a single cold-start measurement. The measurement will be noisy, so use a multi-run median with a threshold wide enough to absorb the noise but strict enough to catch a 30-second regression.
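A minimal version of that gate, assuming a Docker image that exposes an illustrative /agent/run endpoint; the image name, port, run count, and threshold are all placeholders to tune per service:

```python
import statistics
import subprocess
import sys
import time
import urllib.request

IMAGE, PORT, RUNS, THRESHOLD_S = "my-agent:candidate", 8080, 3, 45.0  # placeholders


def one_cold_start() -> float:
    """Wall clock from `docker run` on a fresh container to a completed representative request."""
    t0 = time.monotonic()
    cid = subprocess.check_output(
        ["docker", "run", "-d", "--rm", "-p", f"{PORT}:{PORT}", IMAGE], text=True).strip()
    try:
        while True:  # poll until the first representative request completes
            try:
                urllib.request.urlopen(f"http://localhost:{PORT}/agent/run?q=ping", timeout=5)
                return time.monotonic() - t0
            except Exception:
                if time.monotonic() - t0 > 300:
                    raise RuntimeError("container never became ready")
                time.sleep(1)
    finally:
        subprocess.run(["docker", "rm", "-f", cid], check=False,
                       stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)


median = statistics.median(one_cold_start() for _ in range(RUNS))
print(f"cold-start median over {RUNS} runs: {median:.1f}s (threshold {THRESHOLD_S}s)")
sys.exit(1 if median > THRESHOLD_S else 0)
```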
The regressions this eval catches are exactly the ones that production tracing misses: a new dependency that adds 8 seconds of import time, a tool registry that grew from 30 to 80 entries and now takes proportionally longer to validate, a vector store client that switched to a TLS configuration that takes 600ms to negotiate, an embedding model that got upgraded to a larger variant. Each of these is invisible at steady state and obvious on a fresh container.
The same eval should run after every dependency upgrade, not just code changes. Dependency upgrades are where most cold-start regressions originate, because most dependencies don't advertise their startup cost — and the team that upgrades a Python framework rarely measures whether the framework's import-time cost grew by 5 seconds in the new minor version.
What Changes When You Treat Cold Start as a First-Class Concern
The architectural realization that follows from all of this is that the agent runtime is an operating system, not a function. It has a boot sequence, a steady state, and a shutdown. The boot sequence is the part the user feels on the first request after any disruption — which, for a service with normal traffic patterns, happens many times a day.
Teams that have internalized this stop talking about agent latency as a single number. They have a steady-state TTFT (the model vendor's number, plus a few hundred milliseconds of routing). They have a cold-start latency (the wall clock for a fresh container to serve the first request). They have a recovery latency (the time to restore service after a downstream blip). Each gets its own SLO, its own dashboard, and its own engineering owner.
The vendors building agent infrastructure are starting to ship the right primitives — checkpoint/restore at the framework level, prefix-cache aware routing at the router level, scale-to-zero with warm-pool fallback at the orchestration level. But the discipline of measuring cold start, naming it as a problem, and gating on it in CI is something each team has to build for itself. The team that ships a feature without that discipline isn't shipping a slow agent — they're shipping an agent whose worst minute is invisible until a user finds it.
The model is rarely the slow part. The team that only profiles the LLM is optimizing the wrong system.
- https://techcommunity.microsoft.com/blog/linuxandopensourceblog/dissecting-llm-container-cold-start-where-the-time-actually-goes/4508831
- https://krylox.ai/blog/llm-cold-start-optimization
- https://developer.nvidia.com/blog/reducing-cold-start-latency-for-llm-inference-with-nvidia-runai-model-streamer/
- https://www.usenix.org/system/files/osdi24-fu.pdf
- https://arxiv.org/html/2502.15524v2
- https://www.digitalocean.com/community/conceptual-articles/hidden-cost-cold-starts-serverless-ai-workloads
- https://www.digitalapplied.com/blog/mcp-server-reliability-100-server-stress-test-study
- https://oneuptime.com/blog/post/2026-03-26-how-to-instrument-mcp-servers-with-opentelemetry/view
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- https://llm-d.ai/blog/llm-d-v0.3-expanded-hardware-faster-perf-and-igw-ga
- https://developer.nvidia.com/blog/deploying-disaggregated-llm-inference-workloads-on-kubernetes/
