
The 90-Second Cold Start for Production Agents: When the LLM Isn't the Slow Part

10 min read
Tian Pan
Software Engineer

A user clicks the button. Ninety seconds later they get their first token. The team's response, almost reflexively, is to ask the model vendor for a faster time to first token (TTFT), and the vendor's TTFT is 800 milliseconds. The model was never the slow part. The request waited 30 seconds for a tool registry to load, 20 seconds for a vector store client to negotiate its first connection, 15 seconds for the prompt cache to prime on a fresh container, and another 10 seconds for an agent framework to validate every tool schema in its registry with a JSON Schema validator that was itself being loaded on first use. Those four waits alone account for 75 of the 90 seconds.

This is the agent cold start, and it has almost nothing to do with the model. Teams that profile only the LLM call are optimizing the part of their request that wasn't slow. Worse, the cold start is invisible in steady state — load tests against a warm pool look great, dashboards plotted on the median look great, and the people who notice are the users who hit the first request after a deploy, an autoscaling event, or a low-traffic stretch where everything got recycled.

The pattern is consistent enough that it deserves its own discipline. "Agent latency" decomposes into far more components than the model call, and each one has a cold-start cost that compounds with the others. The fix isn't a single optimization — it's a set of architectural choices made before the first cold start, because the time to discover them is not in production at 3am.

The Anatomy of an Agent Cold Start

A containerized agent on a fresh node walks through a startup sequence that no individual engineer designed end-to-end. Each component is reasonable in isolation; the composition is what hurts.

The container image pulls. A typical agent image with embedded model weights, embedding libraries, and framework dependencies can be 8–15 GB. Standard container registries weren't designed for this kind of throughput; pull-bound cold starts often dominate the budget, and the pull phase alone can take 30–60 seconds before the runtime starts.

The runtime initializes. The model client connects to the provider, negotiates auth, fetches available models, and warms its HTTP connection pool. If the agent uses a local embedding model, that loads into memory or VRAM — for a small embedding model, 2–5 seconds; for a larger reranker, 15–30 seconds. None of this shows up in tracing tools that start measuring at the first user request.

The tool registry materializes. Modern agents discover tools at startup: an MCP server connection, a function-calling registry, a Spring Bean registration pass. Each tool comes with a JSON schema that gets parsed and validated. Recent benchmarks of MCP servers in production found that schema validation issues account for 38% of all reliability failures, and that the validation work itself is non-trivial — median tool-call latency is 320ms, but P95 jumps to 1,840ms and P99 to 6,200ms. A registry of 50 tools that validates serially on first request can spend 10–15 seconds before the agent can even consider what to call.
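One way to keep that validation off the first request is to run it during startup, before the container reports ready. The following is a minimal sketch, not any framework's actual API: `prevalidate_tools` and its `validate` argument are hypothetical names, and `validate` stands in for whatever schema checker the agent already uses. The thread pool only helps when validation waits on I/O (for example, resolving remote references); for a CPU-bound pure-Python validator, the gain comes from moving the work ahead of readiness rather than from concurrency.

```python
from concurrent.futures import ThreadPoolExecutor


def prevalidate_tools(tools, validate, max_workers=8):
    """Validate every tool schema before readiness.

    tools:    dict mapping tool name -> schema (hypothetical shape)
    validate: callable that raises on an invalid schema (assumed interface)
    Returns (valid_tools, errors) so one bad schema does not block the rest.
    """

    def check(item):
        name, schema = item
        try:
            validate(schema)
            return name, None
        except Exception as exc:  # collect the failure instead of aborting
            return name, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(check, tools.items()))

    valid = {name: tools[name] for name, err in results if err is None}
    errors = {name: err for name, err in results if err is not None}
    return valid, errors
```

Returning the errors instead of raising lets the startup path decide whether a broken tool should fail the deploy or simply be left out of the registry.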

The vector store handle establishes. The first query against pgvector with HNSW indexing runs in 5–8ms once warm — but the first connection negotiates TLS, authenticates, and on managed services may trigger a serverless warmup of its own that adds 200–800ms. Pinecone adds 10–20ms of network latency per query in steady state, but cold connections see 1–3 seconds. Connection pools that idle out under low traffic re-pay this cost on the next request.
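A common mitigation for pools that idle out is a background keepalive that touches the connection more often than the idle timeout. The sketch below assumes an asyncio-based service; `keep_warm` is a hypothetical helper, and `ping` is a placeholder for whatever cheap call the vector store client offers (a trivial query, for instance). The interval is an assumption to tune against the actual idle timeout of the pool or managed service.

```python
import asyncio


async def keep_warm(ping, interval_s, stop):
    """Call `ping` every `interval_s` seconds until the `stop` event is set.

    ping:       async callable that issues a cheap request (assumed interface)
    interval_s: should be shorter than the connection's idle timeout
    stop:       asyncio.Event used to shut the loop down cleanly
    """
    while not stop.is_set():
        try:
            await ping()
        except Exception:
            pass  # a failed ping must not kill the loop; the next tick retries
        try:
            # Sleep for the interval, but wake immediately if stop is set.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            continue
```

Started as a task during startup (`asyncio.create_task(keep_warm(...))`), this trades a small amount of steady background traffic for not re-paying the connection setup after a quiet stretch.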

The prompt cache primes. Anthropic's prompt cache has a 5-minute TTL by default and a 60-minute extended TTL. Cache creation for large system prompts takes 2–4 seconds, and cache-write tokens are billed at 1.25× the base input-token price at the default TTL. A cache entry only becomes readable once the first response has started, so a naive implementation that fires off 10 parallel requests against a fresh container produces 10 cache writes, 0 cache reads, and a bill 5–10× what was modeled. Hit rates as low as 4% have been measured for teams that parallelize without warming first.
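The warm-first ordering can be sketched in a few lines. This is an illustration under assumptions, not a provider SDK call: `warm_then_fan_out` is a hypothetical helper, and `call_model(shared_prefix, question)` stands in for whatever async function sends a request whose cacheable prefix is `shared_prefix`.

```python
import asyncio


async def warm_then_fan_out(call_model, shared_prefix, questions):
    """Send one request to create the cache entry, then run the rest concurrently.

    call_model: async callable (prefix, question) -> response (assumed interface)
    """
    if not questions:
        return []
    # The first request pays the cache write once and is awaited to completion.
    first = await call_model(shared_prefix, questions[0])
    # The remaining requests run in parallel and can read the warmed prefix.
    rest = await asyncio.gather(
        *(call_model(shared_prefix, q) for q in questions[1:])
    )
    return [first, *rest]
```

The cost is one request's worth of serial latency before the parallel batch; in exchange, a batch of 10 becomes one write and nine reads instead of ten writes.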

Add it up. A poorly tuned agent on a fresh container can spend 60–120 seconds before the first user request gets a response, and tracing shows none of it, because tracing starts only after initialization completes.

Why Profiling the LLM Misses Most of It

The instrumentation defaults built into popular agent frameworks measure model latency, tool-call latency, and end-to-end request latency. They do not, by default, measure the time between container start and first-request-ready. They do not measure schema validation time, embedding model load time, or first-connection latency for downstream services. They start the clock when the request arrives.

This is a measurement artifact with operational consequences. The dashboards that engineering reviews in the daily standup show the steady-state latency — which is fine, because steady-state latency is what most users see. But "most users" isn't the same as "the user who tweeted the screenshot." The cold-start tail lives in a different distribution: it's not a fat tail of the request latency histogram, it's a separate mode that only fires under specific conditions (a deploy, a scale-up event, a traffic gap that exceeded the cache TTL).

The fix isn't more aggressive aggregation; it's a separate metric. Treat cold-start latency — defined as the wall clock from container start to first-request-ready — as a first-class SLO. Page on it. Dashboard it next to TTFT. Run it in CI on a fresh container, not against a warm staging environment that has been up for hours.

A useful decomposition for this metric: image pull time, runtime init time, framework init time, downstream warmup time (embedding model, vector store, prompt cache), and time-to-first-ready. Each of these has a different fix, and lumping them into a single number guarantees the wrong fix gets prioritized.
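The in-process part of that decomposition can be captured with a small phase timer. The sketch below is one possible shape, with hypothetical names (`ColdStartTimer`, the phase labels). It assumes the timer is created as early as possible in the process, and it cannot see image pull time, which happens before the process exists and has to come from the orchestrator's events instead.

```python
import time
from contextlib import contextmanager


class ColdStartTimer:
    """Records the wall-clock duration of each named startup phase."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._start = clock()  # process-level start of the cold start window
        self.phases = {}

    @contextmanager
    def phase(self, name):
        began = self._clock()
        try:
            yield
        finally:
            # Record the duration even if the phase raised.
            self.phases[name] = self._clock() - began

    def report(self):
        """Return per-phase durations plus total time to first-request-ready."""
        total = self._clock() - self._start
        return {"phases": dict(self.phases), "time_to_first_ready": total}
```

In use, each startup step is wrapped in `with timer.phase("framework_init"): ...`, and `timer.report()` is emitted as a metric at the moment the readiness check first passes, so each phase can be charted and alerted on separately.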
