Hot-Path vs. Cold-Path AI: The Architectural Decision That Decides Your p99
Every AI feature you ship makes an architectural choice before it makes a product one: does this model call live inside the user's request, or does it run somewhere the user isn't waiting for it? The choice is usually made by whoever writes the first prototype, never revisited, and silently determines your p99 latency for the rest of the feature's life. When the post-mortem asks why a shipping dashboard became unusable at 10 a.m. every Monday, the answer is almost always that something which should have been cold-path got welded into the hot path — and a model that is fine at p50 becomes catastrophic at p99 when traffic fans out.
The hot-path / cold-path distinction is older than LLMs. CQRS, streaming architectures, lambda architectures — they all draw the same line between "must respond now" and "can arrive eventually." What's different about AI workloads is that the cost of crossing the line in the wrong direction is an order of magnitude higher than it used to be. A synchronous database query that takes 50 ms turning into 200 ms is a regression. A synchronous LLM call that takes 1.2 s at p50 turning into 11 s at p99 is a business decision you didn't know you made.
This post is about making that decision on purpose. Four pieces: the framework for placing any given AI interaction on the hot/cold spectrum, the starvation failure mode where cold work creeps into the hot path, the migration playbook for moving across the boundary when traffic or product demands change, and the monitoring that keeps you honest after you've made the call.
What actually makes something hot or cold
The framework is four questions, answered honestly for each interaction:
- User-perceived urgency. Is the user actively staring at the spinner, or have they moved on? A chat reply is hot — leaving the user at a blank bubble for 4 s destroys the experience. A weekly summary email is cold — nobody is timing it.
- Tolerable staleness. If the answer is 30 seconds out of date, does it still work? Search suggestions tolerate staleness well because the user types again. A fraud decision during checkout tolerates nothing because the transaction is atomic.
- Determinism requirements. Hot paths want tight output-shape contracts. If your caller needs a strict JSON schema for downstream routing, LLM variance in the hot path is a bug-producing machine. Cold paths can afford validator loops, rejects-and-retries, and human review.
- Cost sensitivity under burst. At 100× traffic, does this call still need to happen every time? Cold paths can drop on overload. Hot paths have to serve something — even if it's a degraded response.
The useful sharpening: a call is hot only if all four answers push it there. If any dimension tolerates delay, you have options. Most teams skip this check and put everything in the hot path because it's easier to prototype that way. The tax comes due months later when a competitor launches a feature that your latency budget cannot afford.
A concrete walk-through: an AI "explain this error" feature in a developer tool. Urgency is high — the developer is stuck. Staleness is fine — the error was logged 5 minutes ago and explaining it at 500 ms vs. 3 s won't change the diagnosis. Determinism is loose — the output is narrative text. Cost sensitivity under burst is moderate. Three of four push toward cold. So the right architecture is probably: fire the explanation request immediately, render a skeleton in the UI, stream the result when it arrives, and accept that some users will switch tabs before it returns. That's not "cold path" in the classic batch sense — it's the async interaction pattern that hot-path thinking would have missed.
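The four-question check is mechanical enough to write down. A minimal sketch — the field names, thresholds, and the "async-hot" label are my invented shorthand, not a real library:

```python
from dataclasses import dataclass

# Hypothetical sketch of the four-question placement check.
# Field names and the decision rule are assumptions for illustration.
@dataclass
class PlacementCheck:
    user_is_waiting: bool        # user-perceived urgency
    tolerates_staleness: bool    # is a 30-second-old answer still useful?
    needs_strict_schema: bool    # determinism requirements downstream
    must_serve_under_burst: bool # does this call happen every time at 100x?

    def placement(self) -> str:
        # A call is hot only if all four dimensions push it there.
        hot_votes = [
            self.user_is_waiting,
            not self.tolerates_staleness,
            self.needs_strict_schema,
            self.must_serve_under_burst,
        ]
        if all(hot_votes):
            return "hot"
        if not any(hot_votes):
            return "cold"
        return "async-hot"  # fire immediately, render skeleton, stream result

# The "explain this error" walk-through: urgent, staleness-tolerant,
# loose output shape, moderate burst sensitivity.
explain_error = PlacementCheck(
    user_is_waiting=True,
    tolerates_staleness=True,
    needs_strict_schema=False,
    must_serve_under_burst=False,
)
print(explain_error.placement())  # -> async-hot
```

The point of writing it down is not automation — it's that the split verdict ("async-hot") falls out naturally instead of being collapsed into "hot" by whoever prototypes first.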
The hot-path starvation pattern
The most expensive failure mode in AI architecture isn't a slow call in the cold path. It's a cold-path workflow that leaks into the hot path without anyone noticing, and then drags every unrelated user request down with it.
Here's how it happens. Version 1 of the feature has a clean separation: chat responses go through a small fast model synchronously, document summarization runs nightly in a batch queue. A PM asks for "summarize this document when the user opens it." The engineer, reasonably, reuses the chat endpoint — it already has the streaming plumbing, the auth, the rate limiter. A quarter later, the chat endpoint is serving both 2 KB chat turns and 200 KB document summaries. The long summaries hold worker slots. The rate limiter fills with heavyweight requests. And the p99 on chat — which has nothing to do with summarization — starts climbing.
The mechanism is queue pressure. LLM gateways and inference servers serve concurrent requests from a shared pool, and when a long-tail request arrives, it occupies a slot that short requests cannot use. Tail latency in a mixed workload is dominated by the longest request in the batch, not the median. vLLM operators know this pattern: when you monitor vllm:num_requests_waiting, you see the queue building before you see latency rise. By the time p99 alerts fire, the outage has already propagated.
The fix is boring and unambiguous: do not mix workload shapes on the same inference pool. Hot-path traffic gets its own dedicated pool — sized for short requests, with aggressive queue timeouts. Cold-path traffic goes to a separate pool optimized for throughput. If the same model backs both, you still want two separate client pools with two separate concurrency budgets. The lift is a few hours of plumbing. The payoff is that a product decision from a neighboring team cannot eat your latency budget.
The organizational version of this problem is worse. If a shared infrastructure team owns a single LLM gateway and every product team drops any workload onto it, the gateway becomes a noisy-neighbor tragedy. The usual remedy is per-feature rate-limit buckets with priority queues, but at a certain scale the real answer is a platform team that owns workload isolation as a product — not a side responsibility.
The migration playbook: moving across the boundary
Traffic shape changes. A feature that was cold-path when it launched to 1,000 beta users is hot-path at 1,000,000. A feature that was hot-path at launch might need to move cold when you discover that 80% of requests could tolerate async completion. The migration is always harder than a greenfield build because it lives inside production.
A migration that actually works looks like this:
- Instrument the current state. Before moving anything, capture p50/p95/p99 latency, per-user cost, cache hit rate, and user abandonment rate for the feature as it exists. These numbers are the pre/post comparison that decides whether the migration succeeded.
- Introduce an async wrapper alongside the sync call. Don't flip the default yet. Build the cold-path execution — typically a durable workflow on something like Temporal, Inngest, or a roll-your-own queue — and have it produce results to a store the UI can poll or subscribe to. The existing sync path keeps running.
- Dual-write and compare. For a sample of traffic, invoke both paths. The async path is discarded initially; you just want to know that it produces equivalent outputs and that the infrastructure doesn't fall over under real traffic. This step catches the surprises: ordering issues, idempotency bugs, timeouts that only surface under real concurrency.
- Shift traffic by slice. Route a small cohort — typically internal users, then a 1% cohort, then 10% — to the async path. Watch the abandonment rate and support tickets, not just latency dashboards. If the UX changes in a way users dislike (a spinner replaced with a banner they dismiss), you'll see it in downstream conversion metrics before you see it in your traces.
- Decommission with a kill switch. Leave the sync path in place behind a flag for at least one full incident cycle. Cold-path infrastructure introduces new failure modes — queue backlog, consumer lag, poison-message storms. The kill switch is the emergency brake that gets you back to a known-good state while you diagnose.
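The traffic-shift and kill-switch steps above reduce to deterministic cohort routing. A sketch — the flag store, cohort list, and percentages are assumptions:

```python
import hashlib

# Sketch of cohort-based traffic shifting with a kill switch.
# In practice these values would live in a feature-flag service.
ASYNC_ROLLOUT_PERCENT = 10          # current slice: 10% of users
KILL_SWITCH_SYNC_ONLY = False       # flip to route everyone back to sync
INTERNAL_USERS = {"alice@corp", "bob@corp"}

def bucket(user_id: str) -> int:
    # Stable hash -> 0..99, so a user stays in the same cohort
    # across requests and across deploys.
    digest = hashlib.sha256(user_id.encode()).digest()
    return int.from_bytes(digest[:4], "big") % 100

def route(user_id: str) -> str:
    if KILL_SWITCH_SYNC_ONLY:
        return "sync"               # emergency brake: known-good path
    if user_id in INTERNAL_USERS:
        return "async"              # dogfood internally first
    return "async" if bucket(user_id) < ASYNC_ROLLOUT_PERCENT else "sync"

print(route("alice@corp"))  # -> async
```

The stable hash is the load-bearing detail: random per-request routing would bounce the same user between UX patterns, which contaminates exactly the abandonment and conversion metrics the rollout is supposed to measure.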
The reverse migration — moving something from cold to hot — is rarer but worth naming. The trigger is usually that product discovers the feature works better when it's immediate. Reverse migration is easier than forward because you're shrinking the state space, but the trap is that the cold-path model may have been prompted for batch-friendly outputs. A prompt that says "return a JSON array of findings" works fine in cold processing but creates a painful UI when a user is waiting character-by-character for that array to stream. Reshape the prompt for the new path; don't assume the old one generalizes.
The monitoring that tells you whether you got it right
Hot/cold placement is not a decision you verify in code review. You verify it in production telemetry. The four metrics that actually tell you whether the architecture is holding:
- Hot-path p99 decoupled from feature shipment rate. If every new feature ships by adding one more synchronous model call to the request path, your hot-path p99 is a function of your product team's velocity. That's the wrong coupling. A well-architected hot path has a latency budget enforced by infrastructure, not negotiated by PMs.
- Cold-path completion-time percentile. For async work, p99 of time-to-completion from request-enqueued to result-available is the metric that predicts user satisfaction. If p50 is 3 s and p99 is 180 s, you have a queue starvation problem, not an average problem.
- Abandonment rate by latency bucket. This is the metric nobody tracks and everybody should. For each 500 ms bucket of response time, what fraction of users abandoned the interaction? The cliff is almost never where teams assume it is. Product decisions about hot/cold boundaries should be made against this curve, not against intuition.
- Retry rate per hop. Each additional hop in a chained agent workflow multiplies the tail. If a failed step forces a rerun of the chain, a 3-step workflow with a 5% per-step retry rate burns roughly 17% extra tokens; a 7-step workflow over 40%. If your cold path is silently eating twice the compute you modeled because of retry amplification, your cost/unit is lying to you.
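The amplification figures depend on the retry model. Under one pessimistic but simple assumption — any step failure reruns the whole chain, and a failed attempt is counted as a full run — the expected extra-token fraction follows directly from the geometric distribution:

```python
# Expected token amplification when any step failure reruns the whole chain.
# Model assumptions (not tied to any specific framework): each of n steps
# fails independently with probability p; a failure restarts the workflow;
# a failed attempt is counted as a full run (ignores partial-attempt savings).
def extra_token_fraction(steps: int, retry_rate: float) -> float:
    clean = (1 - retry_rate) ** steps  # chance the chain completes cleanly
    expected_runs = 1 / clean          # geometric: mean attempts to succeed
    return expected_runs - 1           # fraction of extra work

for n in (3, 7):
    print(f"{n} steps: {extra_token_fraction(n, 0.05):.0%} extra")
# -> 3 steps: 17% extra
# -> 7 steps: 43% extra
```

A retry model that reruns only the failing step amplifies less; one that restarts the chain amplifies more, and superlinearly in depth — which is why per-hop retry rate deserves its own dashboard panel.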
Together these turn the hot/cold decision from a one-shot architectural call into an observable property of the system. If a team later migrates something across the boundary, the same dashboards tell you whether the migration earned the move or introduced a regression.
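Of the four, the abandonment curve is the cheapest to bootstrap: it needs only raw request logs. A sketch, assuming each log row carries a response latency and an abandoned flag (the log format is an assumption):

```python
from collections import defaultdict

# Bucket requests into 500 ms latency bins and compute the abandonment
# rate per bin. Input rows are (latency_ms, was_abandoned) pairs.
def abandonment_curve(rows, bucket_ms=500):
    totals = defaultdict(int)
    abandoned = defaultdict(int)
    for latency_ms, was_abandoned in rows:
        b = int(latency_ms // bucket_ms) * bucket_ms
        totals[b] += 1
        abandoned[b] += was_abandoned
    return {b: abandoned[b] / totals[b] for b in sorted(totals)}

rows = [(200, 0), (450, 0), (700, 0), (900, 1), (1600, 1), (1900, 1)]
print(abandonment_curve(rows))  # -> {0: 0.0, 500: 0.5, 1500: 1.0}
```

Plot that dictionary and the cliff — the bucket where abandonment jumps — is the empirical latency budget the hot/cold decision should be made against.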
The forward-looking takeaway
The next wave of agent frameworks is collapsing the operational distinction between hot and cold paths by making durable workflows feel as cheap as function calls. That's good — durable execution platforms like Temporal, Inngest, DBOS, and Cloudflare Workflows remove most of the boilerplate that made async AI workflows painful, and they make suspend-and-resume cheap enough that "wait 10 seconds for the user to confirm, then continue" stops being an infrastructure project. But they don't remove the architectural decision. They make the decision easier to implement — not easier to make correctly.
The durable part of your system is not the right place for work that the user is waiting on. And the synchronous part is not the right place for anything that isn't. Internalize the framework, instrument the boundary, and your p99 stops being a thing that happens to you.
The AI features that win over the next few years will not be the ones with the best prompts or the fanciest models. They'll be the ones whose architects drew the hot/cold line carefully the first time — and who had the discipline to redraw it when traffic shape changed.
Further reading
- https://blog.langchain.com/how-do-i-speed-up-my-agent/
- https://agentnativedev.medium.com/the-p99-problem-designing-llm-inference-for-real-users-11deb35bb8d4
- https://martinfowler.com/bliki/CQRS.html
- https://learn.microsoft.com/en-us/azure/architecture/patterns/cqrs
- https://temporal.io/solutions/ai
- https://www.inngest.com/blog/durable-execution-key-to-harnessing-ai-agents
- https://render.com/articles/durable-workflow-platforms-ai-agents-llm-workloads
- https://www.dbos.dev/blog/durable-execution-crashproof-ai-agents
- https://blog.cloudflare.com/workflows-v2/
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://particula.tech/blog/fix-slow-llm-latency-production-apps
