GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint
Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.
This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.
What makes this so subtle is that none of your standard metrics show the problem. GPU utilization is high, which feels like a good thing. Tokens-per-second across the endpoint is respectable. No single request is "slow" — the reasoning prompt finishes in the time it would have taken alone. It is the short requests that suffer, waiting behind the giant, and your metrics were never broken down by request class to show you that.
Why Continuous Batching Trades Head-of-Line Blocking for Throughput
Modern LLM servers like vLLM, TGI, and SGLang run a scheduler loop that picks a working set of sequences, runs one token-generation iteration across the entire batch on the GPU, then immediately reconsiders. Finished sequences leave the batch, new ones join, and the GPU never idles waiting for the slowest member to finish. This is called continuous or iteration-level batching, and it is the single biggest reason inference throughput has climbed so fast since 2023.
The catch is that throughput is not latency. Continuous batching assumes the sequences in a batch have roughly comparable shapes. When one sequence in the batch has a 30,000-token prompt that must be prefilled before it can start generating, every other sequence in that batch waits for that prefill to finish. The prefill phase is compute-bound and can consume the entire GPU for hundreds of milliseconds at a time — long enough to eat your p99 budget before a token of response comes back to anyone.
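To make the blocking concrete, here is a toy cost model of an iteration-level scheduler. It is a sketch for intuition, not any real runtime's code: the class names, the cost formula, and the traffic mix at the bottom are all illustrative, and real schedulers track far more state.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Seq:
    prompt: int        # prompt tokens still to prefill
    out_budget: int    # decode tokens still to generate


def step_cost(batch: list) -> int:
    # Rough cost model for one scheduler iteration: pay for every prompt token
    # being prefilled plus one decode token per already-prefilled sequence.
    return sum(s.prompt for s in batch) + sum(1 for s in batch if s.prompt == 0)


def scheduler_loop(waiting: deque, max_running: int = 16) -> list:
    running, costs = [], []
    while waiting or running:
        # Iteration-level batching: reconsider batch membership before every step.
        while waiting and len(running) < max_running:
            running.append(waiting.popleft())
        costs.append(step_cost(running))
        for s in running:
            if s.prompt > 0:
                s.prompt = 0          # monolithic prefill, all in one step
            else:
                s.out_budget -= 1
        # Finished sequences leave immediately; the GPU never waits for stragglers.
        running = [s for s in running if s.out_budget > 0]
    return costs


# Ten short chat calls and one 28k-token reasoning prompt share the pool.
mix = deque(Seq(prompt=200, out_budget=20) for _ in range(10))
mix.append(Seq(prompt=28_000, out_budget=2_000))
costs = scheduler_loop(mix)
print(max(costs), sorted(costs)[len(costs) // 2])  # one step dwarfs the median
```

The admitting step is dominated by the giant prefill, so every short call's first token waits behind it, which is exactly the pattern the traces show.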
Reasoning models make this worse in two ways. Their prompts tend to be long because users paste in code, documents, or prior conversation turns. And their outputs are long too — a chain-of-thought trace of several thousand tokens is normal. A single reasoning request occupies KV-cache slots and scheduler attention for an order of magnitude longer than a typical chat completion. Batch it in with a stream of short classification calls and you get the worst of both: the short calls get held up by prefill, and the long call keeps its KV-cache pinned so nothing can preempt it cleanly.
The Diagnostic Signal Nobody Watches
Teams running into this first notice it as user-reported jank. By the time they open dashboards, the averages have already smoothed over the spike. The signal to watch is not average latency or utilization — it is the joint distribution of request shape and p99 latency, bucketed by request class.
Concretely: split your traffic into size buckets (prompt length, expected output length, max-tokens setting) and plot p99 latency per bucket over time. A healthy endpoint shows stable p99 in the short bucket regardless of what the long bucket is doing. An unhealthy one shows the short bucket's p99 tracking the long bucket — short requests paying for the long ones they happened to batch with. That correlation is the fingerprint of head-of-line blocking, and it will never show up in a single aggregate latency chart.
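A minimal sketch of that bucketing, assuming you can pull (prompt length, latency) pairs out of your request logs for a time window; the bucket edges and labels are placeholders to tune for your own traffic mix.

```python
import numpy as np


def p99_by_bucket(records, edges=(512, 4096, 32768)):
    """records: iterable of (prompt_tokens, latency_seconds) for one window."""
    labels = ("short", "medium", "long", "huge")
    buckets = {name: [] for name in labels}
    for prompt_tokens, latency_s in records:
        idx = sum(prompt_tokens > e for e in edges)  # which size bucket
        buckets[labels[idx]].append(latency_s)
    return {name: float(np.percentile(vals, 99))
            for name, vals in buckets.items() if vals}


# Compute this per 5-minute window and plot the series. Healthy: the "short"
# bucket's p99 stays flat while "huge" is busy. Unhealthy: they move together.
```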
Two more signals help confirm the diagnosis. Queue wait time at the scheduler, separated from on-GPU time, tells you whether requests are stuck in admission or stuck in execution. And KV-cache occupancy as a percentage of capacity tells you whether you are preemption-bound — when occupancy pegs near 100%, the server is thrashing, evicting and recomputing prefixes, and latency spikes are about to go nonlinear.
Four Mitigations, in Order of Invasiveness
Once you see the pattern, the fix depends on how much of the stack you control. In rough order of how much disruption they cause:
- Enable chunked prefill. Most modern runtimes support splitting long prompts into fixed-size chunks that interleave with decode steps for other requests in the batch. The long prompt takes slightly longer to produce its first token, but short requests in the queue stop being blocked. This is usually a one-flag change on vLLM and gives you most of the win for short requests at the cost of slightly worse time to first token (TTFT) on the long ones; a minimal config sketch follows this list.
- Add priority-aware scheduling. Most runtimes let you tag requests with a priority and either preempt lower-priority decode or deprioritize lower-priority admission. Tiered SaaS products use this to protect paid traffic from free-tier bursts. The failure mode to watch for is priority inversion: a low-priority request already holding a large KV-cache slot can still block high-priority admissions even if you think you preempted it. A toy admission queue follows this list.
- Impose per-tenant caps on in-flight tokens. Rate limiting by requests per second is almost useless for LLM inference because one reasoning request can consume what a thousand short requests would. Cap tenants by concurrent tokens in flight, both prompt tokens being prefilled and decode tokens being generated, and you get a fairness model that actually matches execution cost. Newer inference platforms call this a "token pool" or "token-budget" model; a sketch of the admission gate follows this list.
- Isolate request classes onto separate pools. At some traffic scale, the cross-subsidy stops pencilling out. Route reasoning traffic and long-context traffic to a dedicated pool (same model, separate replicas) and keep short, latency-sensitive traffic on a pool sized for its shape. You pay for more idle capacity in exchange for bounded blast radius. Most teams resist this because it seems wasteful, but the math flips the moment the cost of p99 violations exceeds the cost of a second pool.
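For the first mitigation, the one-flag change looks roughly like this on vLLM's offline engine. The argument names (enable_chunked_prefill, max_num_batched_tokens) match recent vLLM releases but can differ by version, and the model name is only a placeholder; check your runtime's docs before copying this.

```python
from vllm import LLM

# Hedged sketch of turning on chunked prefill; flag names and values are
# version-dependent and the model is a placeholder, not a recommendation.
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    enable_chunked_prefill=True,     # split long prompts into chunks
    max_num_batched_tokens=2048,     # per-iteration token budget shared by prefill and decode
)
```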
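For priority-aware scheduling, if your runtime does not expose it natively, the admission side can be approximated with an ordinary priority queue in front of the server. This toy deliberately does nothing about KV cache already held by a running low-priority request, which is exactly the inversion described above; class and method names are illustrative.

```python
import heapq
import itertools


class PriorityAdmission:
    """Toy priority-aware admission order: lower tier number is admitted first,
    FIFO within a tier. It does NOT reclaim KV cache held by running requests."""

    def __init__(self) -> None:
        self._heap = []
        self._tie = itertools.count()  # keeps FIFO order within a tier

    def submit(self, request: dict, tier: int) -> None:
        # tier 0 = paid, tier 1 = free, and so on (illustrative tiers).
        heapq.heappush(self._heap, (tier, next(self._tie), request))

    def next_to_admit(self):
        return heapq.heappop(self._heap)[2] if self._heap else None
```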
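For per-tenant caps, here is a sketch of a token-budget admission gate. The class name, the cap, and the cost model (prompt tokens plus worst-case decode budget) are illustrative, not any platform's API.

```python
import threading


class TokenBudget:
    """Toy per-tenant cap on concurrent in-flight tokens, not requests per second."""

    def __init__(self, max_inflight_tokens: int) -> None:
        self._cap = max_inflight_tokens
        self._inflight = {}            # tenant -> tokens currently in flight
        self._lock = threading.Lock()

    def try_admit(self, tenant: str, prompt_tokens: int, max_new_tokens: int) -> bool:
        cost = prompt_tokens + max_new_tokens   # charge for worst-case decode
        with self._lock:
            used = self._inflight.get(tenant, 0)
            if used + cost > self._cap:
                return False                     # shed or queue at the edge
            self._inflight[tenant] = used + cost
            return True

    def release(self, tenant: str, prompt_tokens: int, max_new_tokens: int) -> None:
        cost = prompt_tokens + max_new_tokens
        with self._lock:
            self._inflight[tenant] = max(0, self._inflight.get(tenant, 0) - cost)


# One 28k-token reasoning request costs what roughly a hundred short calls
# would, and the budget charges it accordingly.
budget = TokenBudget(max_inflight_tokens=60_000)
print(budget.try_admit("tenant-a", prompt_tokens=28_000, max_new_tokens=4_000))
```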
A more invasive option worth mentioning: prefill/decode disaggregation. Research systems like DistServe and production stacks at the largest labs separate the prefill phase onto dedicated workers that hand off KV cache to decode workers over a fast interconnect. This gives you physical isolation between the two phases and eliminates the worst form of head-of-line blocking — long prompts in the prefill queue dragging down active decoders. It is architecturally expensive and probably not where a small team should start, but it is where the road leads if you grow into genuinely heavy reasoning traffic.
When "Shared Inference Is Cheaper" Stops Being True
The pull toward shared, multi-tenant inference is strong and mostly correct. GPUs are expensive, models are large, cold starts are painful, and you want as few replicas as possible keeping weights hot. For homogeneous traffic, the sharing works beautifully.
The inflection point is request-shape variance. Once your traffic mix includes a long-context workload or a reasoning workload alongside your normal traffic, the fairness guarantees of a shared pool erode. You start paying for the shared-pool savings in the currency of p99 latency, and the users footing that bill have no way to tell you — they just leave.
A useful rule of thumb: if the ratio of max request size to median request size in a pool exceeds about 10x, start planning to split. The exact number depends on your SLO tolerance and how much spare capacity you run, but 10x is where continuous batching's assumptions start breaking for most workloads I have seen. Above 50x and you are effectively running two different services on one pool and pretending it is one.
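The check itself is cheap to automate. A sketch, with thresholds taken from the rule of thumb above; the sample window is purely illustrative.

```python
import statistics


def shape_variance_ratio(prompt_lengths: list) -> float:
    # Ratio of max to median request size over a recent traffic window.
    return max(prompt_lengths) / statistics.median(prompt_lengths)


window = [180, 220, 240, 250, 260, 300, 28_000]   # illustrative sample
ratio = shape_variance_ratio(window)
if ratio > 50:
    print("effectively two services on one pool")
elif ratio > 10:
    print("start planning to split the pool")
```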
Build the Metrics Before You Build the Mitigations
The instinct when a team first hits this is to reach for a mitigation — flip on chunked prefill, add a priority tier, cap some tenant. That can work, but it is also how you ship a half-fix and congratulate yourself while the real problem hides behind an unmeasured bucket.
The durable fix is to make request-class-aware latency a first-class metric before you tune anything. Every request gets tagged at admission with its size class, its tenant, and its priority tier. Every completion records the request's on-GPU time, queue wait, and TTFT separately. Dashboards show p99 per class, not just p99 overall. Alerts fire when a class's p99 crosses a threshold even if the aggregate is healthy.
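A sketch of the per-request record that makes this possible. The field and class names are illustrative, and the derived timings assume every timestamp has been stamped by the time the request completes.

```python
import time
from dataclasses import dataclass, field


@dataclass
class RequestRecord:
    tenant: str
    size_class: str          # assigned at admission, e.g. "short" / "long"
    priority: str
    admitted_at: float = field(default_factory=time.monotonic)
    started_at: float = None      # scheduler handed the request to the GPU
    first_token_at: float = None
    finished_at: float = None

    @property
    def queue_wait(self) -> float:
        return self.started_at - self.admitted_at

    @property
    def ttft(self) -> float:
        return self.first_token_at - self.admitted_at

    @property
    def on_gpu_time(self) -> float:
        return self.finished_at - self.started_at


# Emit queue_wait, ttft, and on_gpu_time per completion, labeled by
# (size_class, tenant, priority), and alert on per-class p99, not the aggregate.
```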
With that instrumentation in place, the mitigations are no longer guesses. You can tell whether chunked prefill helped by watching the short-class p99 decouple from the long-class p99. You can tell whether a priority tier is working by watching that tier's p99 stay flat during a noisy-neighbor event. You can tell when it is time to split pools by watching the cross-class correlation stop responding to tuning.
The Decision You Will Eventually Make
The trajectory of every serving platform looks the same. First the team ships on a shared pool because it is simple and cheap. Traffic grows and a reasoning workload lands on top of the chat workload. The first p99 incident arrives. The team turns on chunked prefill and buys themselves six months. Then a customer with a long-context workload onboards and the same incident returns at a new scale. The team adds per-tenant token pools and buys themselves another six months. Then the traffic mix changes again, and this time the only answer is to split the pool.
You can skip to the end of this trajectory by front-loading the instrumentation and designing admission around request class from day one. That does not mean running separate pools on day one — it means being ready to split them with a config change when the signal tells you to. The teams that suffer most are the ones whose admission path assumes homogeneity so deeply that splitting requires a ground-up rewrite.
Shared inference is cheaper when shapes are similar. Request-class isolation is cheaper when they are not. The mistake is not picking the wrong answer — it is refusing to notice that the right answer changed.
- https://arxiv.org/html/2504.20828v2
- https://docs.vllm.ai/en/stable/configuration/optimization/
- https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
- https://huggingface.co/blog/continuous_batching
- https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests
- https://huggingface.co/blog/tngtech/llm-performance-request-queueing
- https://www.usenix.org/system/files/osdi24-zhong-yinmin.pdf
- https://arxiv.org/pdf/2511.04791
- https://www.usenix.org/system/files/osdi24-fu.pdf
- https://ennanzhai.github.io/pub/sosp25-aegaeon.pdf
- https://arxiv.org/html/2602.16603
