Skip to main content

2 posts tagged with "serving"

View all tags

GPU Starvation: How One Tenant's Reasoning Prompt Stalls Your Shared Inference Endpoint

· 9 min read
Tian Pan
Software Engineer

Your dashboard says the GPU is healthy. Utilization hovers around 80%, throughput in tokens-per-second looks fine, cold starts are rare, and the model is the one you asked for. Yet your pager is going off because p99 latency has tripled, a handful of users are timing out, and support tickets all describe the same thing: "the app froze for twenty seconds, then came back." You pull a trace and find an unrelated customer's 28,000-token reasoning request sitting in the same batch as every stalled call. One tenant's deep-think prompt just ate everyone else's turn.

This is head-of-line blocking, and it is the failure mode that ruins shared LLM inference the moment reasoning models enter the traffic mix. The pattern is not new — storage systems and network stacks have fought it for decades — but it takes a specific shape on GPUs because of how continuous batching and KV-cache pinning work. Most teams design for average load and discover too late that "shared inference is cheaper" stops being true the instant request sizes stop being similar.

Continuous Batching: The Single Biggest GPU Utilization Unlock for LLM Serving

· 11 min read
Tian Pan
Software Engineer

Most LLM serving infrastructure failures in production aren't model failures—they're scheduling failures. Teams stand up a capable model, load test it, and discover they're burning expensive GPU time at 35% utilization while users wait. The culprit is almost always static batching: a default inherited from conventional deep learning that fundamentally doesn't fit how language models generate text.

Continuous batching—also called iteration-level scheduling or in-flight batching—is the mechanism that fixes this. It's not a tuning knob; it's an architectural change to how the serving loop runs. The difference between a system using it and one that isn't can be 4–8x in throughput for the same hardware.