
GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.

What makes this so subtle is that none of your standard metrics show the problem. GPU utilization is high, which feels like a good thing. Tokens-per-second across the endpoint is respectable. No single request is "slow" — the reasoning prompt finishes in the time it would have taken alone. It is the short requests that suffer, waiting behind the giant, and your metrics were never broken down by request class to show you that.

Why Continuous Batching Trades Head-of-Line Blocking for Throughput

Modern LLM servers like vLLM, TGI, and SGLang run a scheduler loop that picks a working set of sequences, runs one token-generation iteration across the entire batch on the GPU, then immediately reconsiders. Finished sequences leave the batch, new ones join, and the GPU never idles waiting for the slowest member to finish. This is called continuous or iteration-level batching, and it is the single biggest reason inference throughput has climbed so fast since 2023.
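
To make the loop shape concrete, here is a toy simulation of iteration-level batching. It is a minimal sketch, not the scheduler of vLLM, TGI, or SGLang; the class, sizes, and batch limit are all illustrative assumptions.

```python
from collections import deque
from dataclasses import dataclass

# Toy sketch of iteration-level (continuous) batching. Real servers are far
# more sophisticated; this only illustrates the shape of the loop: re-plan the
# working set before every iteration instead of waiting for the slowest member
# of a fixed batch to finish.

@dataclass
class Seq:
    remaining_prefill: int   # prompt tokens not yet processed
    remaining_decode: int    # output tokens not yet generated

    def finished(self) -> bool:
        return self.remaining_prefill == 0 and self.remaining_decode <= 0

waiting = deque(Seq(remaining_prefill=200, remaining_decode=30) for _ in range(16))
running: list[Seq] = []
MAX_RUNNING = 8
iterations = 0

while waiting or running:
    # Admission happens every iteration: finished sequences free their slot
    # immediately and new requests join without waiting for a "batch" to drain.
    while waiting and len(running) < MAX_RUNNING:
        running.append(waiting.popleft())

    for seq in running:                      # one GPU pass over the working set
        if seq.remaining_prefill > 0:
            seq.remaining_prefill = 0        # toy: prefill the whole prompt at once
        else:
            seq.remaining_decode -= 1        # decode: one token per sequence per step

    running = [s for s in running if not s.finished()]
    iterations += 1

print(f"all sequences drained after {iterations} iterations")
```

The key property is in the comments: nothing ever waits for a whole batch to finish, which is great for throughput and exactly why a single oversized sequence can quietly tax every iteration it participates in.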

The catch is that throughput is not latency. Continuous batching assumes the sequences in a batch have roughly comparable shapes. When one sequence in the batch has a 30,000-token prompt that must be prefilled before it can start generating, every other sequence in that batch waits for that prefill to finish. The prefill phase is compute-bound and can consume the entire GPU for hundreds of milliseconds at a time — long enough to eat your p99 budget before a token of response comes back to anyone.
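
A rough back-of-envelope calculation shows why. Every number below is an assumption picked for illustration (a hypothetical dense 13B-parameter model and ~400 TFLOPS of sustained throughput on one GPU), not a measurement of any particular setup.

```python
# Back-of-envelope cost of prefilling one long prompt. All numbers are
# assumptions for illustration, not measurements of a real deployment.

model_params    = 13e9      # hypothetical dense 13B-parameter model
prompt_tokens   = 30_000    # the long reasoning prompt
sustained_flops = 400e12    # assume ~400 TFLOPS of effective throughput

# A forward pass costs roughly 2 * params FLOPs per token processed.
prefill_flops = 2 * model_params * prompt_tokens        # ~7.8e14 FLOPs

print(f"~{prefill_flops / sustained_flops:.1f}s of GPU compute before this "
      f"request produces its first output token")       # roughly 2 seconds
```

Whether the server runs that prefill as one long iteration or chunks it into slices of a few hundred milliseconds each, it is compute that the decode steps of every other sequence in the batch have to share the GPU with or wait behind.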

Reasoning models make this worse in two ways. Their prompts tend to be long because users paste in code, documents, or prior conversation turns. And their outputs are long too — a chain-of-thought trace of several thousand tokens is normal. A single reasoning request occupies KV-cache slots and scheduler attention for an order of magnitude longer than a typical chat completion. Batch it in with a stream of short classification calls and you get the worst of both: the short calls get held up by prefill, and the long call keeps its KV-cache pinned so nothing can preempt it cleanly.
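
The cache pinning is easy to underestimate. Here is a sketch of the arithmetic, assuming a hypothetical 7B-class model with full multi-head attention and an fp16 KV cache; models with grouped-query attention store far less per token, but the linear scaling with sequence length is the same.

```python
# KV-cache footprint of a single long reasoning request. The model shape and
# sequence length below are assumptions for illustration, not measurements.

num_layers   = 32
num_kv_heads = 32        # full multi-head attention, no GQA, in this example
head_dim     = 128
bytes_per_el = 2         # fp16 cache

# Keys and values are both cached, hence the leading factor of 2.
kv_bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_el   # 512 KiB

seq_len = 28_000 + 4_000   # long prompt plus a long chain-of-thought output
footprint_gib = kv_bytes_per_token * seq_len / 2**30

print(f"{kv_bytes_per_token // 1024} KiB per token, "
      f"~{footprint_gib:.1f} GiB pinned until this one request finishes")      # ~15.6 GiB
```

Fifteen-plus gigabytes held by one tenant's request is capacity that short requests cannot use, and evicting it mid-generation means recomputing the prefill later, which only moves the pain around.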

The Diagnostic Signal Nobody Watches

Teams running into this first notice it as user-reported jank. By the time they open dashboards, the averages have already smoothed over the spike. The signal to watch is not average latency or utilization — it is the joint distribution of request shape and p99 latency, bucketed by request class.

Concretely: split your traffic into size buckets (prompt length, expected output length, max-tokens setting) and plot p99 latency per bucket over time. A healthy endpoint shows stable p99 in the short bucket regardless of what the long bucket is doing. An unhealthy one shows the short bucket's p99 tracking the long bucket — short requests paying for the long ones they happened to batch with. That correlation is the fingerprint of head-of-line blocking, and it will never show up in a single aggregate latency chart.
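
A minimal sketch of that view, assuming a request log with columns named prompt_tokens, latency_ms, and a datetime column ts; the column names, file path, and bucket edges are placeholders for whatever your own telemetry records.

```python
import numpy as np
import pandas as pd

# p99 latency per prompt-length bucket, per minute. Column names, the file
# path, and the bucket edges are illustrative assumptions.

df = pd.read_parquet("request_log.parquet")

buckets = pd.cut(
    df["prompt_tokens"],
    bins=[0, 512, 2048, 8192, np.inf],
    labels=["short", "medium", "long", "huge"],
)

p99_by_bucket = (
    df.assign(bucket=buckets)
      .groupby([pd.Grouper(key="ts", freq="1min"), "bucket"], observed=True)["latency_ms"]
      .quantile(0.99)
      .unstack("bucket")
)

# Healthy: the "short" column stays flat no matter what "huge" is doing.
# Head-of-line blocking: the "short" column tracks the "huge" column.
print(p99_by_bucket.tail())
```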

Two more signals help confirm the diagnosis. Queue wait time at the scheduler, separated from on-GPU time, tells you whether requests are stuck in admission or stuck in execution. And KV-cache occupancy as a percentage of capacity tells you whether you are preemption-bound — when occupancy pegs near 100%, the server is thrashing, evicting and recomputing prefixes, and latency spikes are about to go nonlinear.
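
Separating those two phases is straightforward if your gateway records when a request was accepted, when the server started executing it, and when it finished. A sketch, again with hypothetical field names:

```python
import pandas as pd

# Split total latency into queue wait vs on-GPU time, assuming the gateway
# logs three timestamps per request: enqueued_at, scheduled_at, finished_at.
# Field names and the file path are illustrative assumptions.

log = pd.read_parquet("request_log.parquet")

queue_wait_ms = (log["scheduled_at"] - log["enqueued_at"]).dt.total_seconds() * 1000
on_gpu_ms     = (log["finished_at"]  - log["scheduled_at"]).dt.total_seconds() * 1000

print(f"p99 queue wait: {queue_wait_ms.quantile(0.99):.0f} ms")
print(f"p99 on-GPU:     {on_gpu_ms.quantile(0.99):.0f} ms")

# Queue wait climbing while on-GPU time stays flat points at admission and
# scheduling; both climbing together points at a saturated or thrashing GPU.
```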

Four Mitigations, in Order of Invasiveness

Once you see the pattern, the fix depends on how much of the stack you control. In rough order of how much disruption they cause:
