Your P99 Is Following a Stranger's Traffic: The Noisy-Neighbor Tax in Hosted LLM Inference

10 min read
Tian Pan
Software Engineer

Your dashboards are clean. The deployment from yesterday rolled back cleanly. The model version is pinned. The prompt didn't change. But your TTFT p99 just doubled, your customer success channel is on fire, and the only honest answer you can give is "it's the provider." That answer feels small — like a shrug — and it usually leads to a follow-up question that nobody on your team can answer: prove it.

This is the part of hosted LLM inference that the marketing pages do not discuss. When you call a frontier model API, you are sharing a GPU, a PCIe fabric, a continuous batch, and a KV-cache budget with workloads you cannot see. Your p99 is a function of their bursts. The economics of large-scale inference depend on multiplexing tenants tightly enough that hardware utilization stays north of 60-70%, which means your tail latency is structurally coupled to the largest, jankiest, lumpiest tenant on the same shard. You are not buying capacity; you are buying a slice of a queue that someone else is also standing in.

The frustrating part is that the median experience is fine. P50 latencies on hosted Claude, GPT, and Gemini APIs are competitive and often impressive. The problem is that the gap between p50 and p99 is where production lives. Independent measurements over the last twelve months show p95/p50 ratios ranging from about 1.8x for the best-behaved providers up to 4-5x at peak hours for shared serverless GPU platforms. A 200ms p50 with a 1,400ms p99 is not the same product as a 200ms p50 with a 350ms p99 — even though the marketing chart will display them with the same dot.

Why Hosted Inference Shares More Than You Think

The mental model most engineers carry into hosted inference comes from cloud databases or CDNs: shared on the data plane, isolated on the per-request path. LLM serving breaks that model in three places.

The first break is continuous batching. Modern inference servers like vLLM, TGI, and the proprietary stacks behind frontier APIs do not run requests one at a time. They pack requests into rolling micro-batches that share a forward pass on the GPU. Your decode step sits on the same kernel launch as a stranger's. If their context is long, the kernel takes longer, and your token-per-second drops. There is no fair-share scheduler on the inside of a transformer block — there is just whatever else got bin-packed into your batch.
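To make that coupling concrete, here is a toy model of one shared decode step. The coefficients are invented for illustration, not taken from any provider's scheduler; the point is only the shape of the dependency, where per-step latency grows with the total context the batched attention kernels have to read.

```python
# Toy model of one shared decode step under continuous batching.
# The coefficients are invented for illustration; this is not any
# provider's scheduler, just the shape of the coupling.

def decode_step_ms(context_lens, base_ms=4.0, us_per_ctx_token=0.1):
    """Latency of the single forward pass every request in the batch shares."""
    return base_ms + us_per_ctx_token / 1000.0 * sum(context_lens)

quiet_batch = [800, 1_200, 900, 1_000]            # your traffic: short contexts
noisy_batch = quiet_batch + [120_000, 110_000]    # plus two long-context neighbors

print(f"quiet batch: {decode_step_ms(quiet_batch):.1f} ms between tokens")
print(f"noisy batch: {decode_step_ms(noisy_batch):.1f} ms between tokens")
# Every request in the batch pays the inflated step time, including yours.
```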

The second break is KV-cache contention. Every request in flight occupies KV-cache memory proportional to its context length. When the cache fills, the scheduler has to evict, preempt, or refuse — and the policy chosen by the provider determines whether your request waits or someone else's gets recomputed. Long-context tenants on the same shard are not just using more memory; they are reshaping the queueing dynamics for everybody. Recent research on KV-cache-aware routing reports cache hit rates of around 87% for properly partitioned workloads — implying that absent that work, a meaningful fraction of latency comes from cache thrash you did not cause.
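A rough budget calculation shows why a handful of long-context neighbors can reshape the queue for everyone. The model dimensions below are assumptions roughly in the shape of a large dense transformer, not any provider's actual configuration:

```python
# Back-of-envelope KV-cache budget. The dimensions are assumptions in the
# shape of a ~70B dense transformer, not any provider's real configuration.

def kv_bytes_per_token(layers=80, kv_heads=8, head_dim=128, dtype_bytes=2):
    # 2 = one K tensor + one V tensor per layer
    return 2 * layers * kv_heads * head_dim * dtype_bytes

per_token = kv_bytes_per_token()                # ~320 KiB per cached token
budget_gib = 40                                 # KV budget left on the shard
budget_tokens = budget_gib * 2**30 // per_token

print(f"{per_token / 1024:.0f} KiB of KV cache per token")
print(f"{budget_tokens:,} cached tokens fit in {budget_gib} GiB")
print(f"= {budget_tokens // 2_000:,} concurrent 2k-token chats")
print(f"= {budget_tokens // 128_000:,} concurrent 128k-token documents")
# A couple of long-context neighbors occupy the budget that would otherwise
# hold dozens of short requests, and the scheduler starts preempting or
# queueing everyone else.
```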

The third break is the PCIe fabric and memory bandwidth beneath the GPU. Multi-tenant GPU platforms — even those using NVIDIA MIG to slice a card — share the host PCIe lanes for weight loading, KV transfers, and inter-GPU collectives. A noisy neighbor doing aggressive prefill while you are doing slow streaming decode can starve your bandwidth in ways that show up as inflated TTFT with no obvious upstream cause. Recent fabric-aware scheduling work shows TTFT p99 improvements of 10-15% just from PCIe-aware placement — meaning the original baseline was burning 10-15% of tail latency on bus contention alone.

These three sources of contention compose multiplicatively, not additively. A bad neighbor that does long-context, high-prefill, throughput-greedy generation hits all three at once. That is the workload class — coding agents, long-document RAG, multi-turn voice — that has been growing fastest. Your noisy neighbor is getting noisier.

"It Was Fine Yesterday" Is The Most Expensive Sentence In Inference Ops

The reason this category of incident is so painful is not the magnitude of the latency spike. A 2x p99 regression on a 400ms baseline is rarely a system-down event. The pain comes from the inability to attribute it.

When your own service degrades, you have logs, traces, deploy markers, and the option to run a bisect. When a provider's shared infrastructure degrades because somebody else launched a batch job, you have nothing local to look at. The provider's status page is either green ("no incident") or so coarse-grained that "elevated latency on chat completions" is the only signal — true for ten minutes per week and useless for proving causation in your specific incident window.

Without attribution, the conversations that follow are predictable and unproductive. Product asks engineering why the feature got slower. Engineering points at the API. Procurement asks for evidence so they can escalate. Engineering does not have evidence. Two weeks later the team is rewriting prompts and "optimizing" code paths that were never the problem, while the actual fix — calling a different shard, switching to provisioned throughput, or routing around the affected region — never happens because nobody could prove that was the right move.

The fix for this is not hope. It is the provider-health observability layer — a small, continuous experiment running alongside your production traffic.

The Observability Pattern That Proves It Wasn't You

A working provider-health layer has three pieces. None of them are exotic, but I rarely see all three deployed together.

Synthetic probes against the provider, on a fixed prompt, at a fixed cadence. The prompt should be small, deterministic, and identical across runs — five fixed inputs, fixed temperature, fixed model version. Run it every 15-30 seconds from at least two regions and one out-of-band network path (so you can distinguish provider degradation from your own egress). Record TTFT, TBT (time-between-tokens), full latency, and any provider-side metadata returned in headers. The probe is not allowed to share rate-limit buckets with your production traffic — use a separate API key. Plot the synthetic probe latency on the same dashboard as your production p95/p99. If the probe spikes when your prod spikes, the provider is the cause. If only your prod spikes, you have an internal problem.
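A minimal probe sketch follows, assuming an OpenAI-compatible streaming chat endpoint. The URL, model name, environment variable, and request-ID header name are placeholders you would swap for your provider's SDK and documented headers.

```python
# Minimal synthetic probe (sketch). Assumes an OpenAI-compatible streaming
# chat endpoint; the URL, model name, and header name are placeholders.
# Run it on a dedicated API key so the probe never shares a rate-limit
# bucket with production traffic.
import json
import os
import time

import requests

PROBE_URL = "https://api.example-provider.com/v1/chat/completions"  # placeholder
PROBE_PROMPT = "Reply with the single word: pong"  # fixed, tiny, deterministic

def run_probe() -> dict:
    start = time.monotonic()
    token_times = []
    resp = requests.post(
        PROBE_URL,
        headers={"Authorization": f"Bearer {os.environ['PROBE_API_KEY']}"},
        json={
            "model": "pinned-model-version",   # placeholder: pin the exact version
            "messages": [{"role": "user", "content": PROBE_PROMPT}],
            "temperature": 0,
            "max_tokens": 8,
            "stream": True,
        },
        stream=True,
        timeout=30,
    )
    for line in resp.iter_lines():
        if line.startswith(b"data: ") and line != b"data: [DONE]":
            token_times.append(time.monotonic())
    if not token_times:
        raise RuntimeError("probe streamed no tokens")
    ttft = token_times[0] - start
    tbt = (token_times[-1] - token_times[0]) / max(len(token_times) - 1, 1)
    return {
        "ts": time.time(),
        "ttft_ms": round(ttft * 1000, 1),
        "tbt_ms": round(tbt * 1000, 2),
        "total_ms": round((token_times[-1] - start) * 1000, 1),
        "provider_request_id": resp.headers.get("x-request-id"),  # header name varies by provider
    }

if __name__ == "__main__":
    print(json.dumps(run_probe()))  # ship this to the same dashboard as prod p95/p99
```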

Cohort-stratified latency. A single global p99 number aggregates across model, region, prompt length, output length, customer tier, and feature path. That aggregation hides the noisy-neighbor signal. Stratify your latency dashboards by at least: model+version, region, prompt-length bucket, output-length bucket, and feature surface. Noisy-neighbor degradations almost always concentrate in one or two cohorts (usually the region+model that happens to share a shard with the offending workload). A flat global p99 graph might be hiding a region-specific 5x spike that is invisible until you slice it.
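A small sketch of the stratification, assuming your request log already carries model, region, token counts, latency, and feature surface as fields:

```python
# Sketch: stratified tail latency from request logs instead of one global p99.
# Field names (model, region, prompt_tokens, ...) are assumptions about what
# your own request log carries.
from collections import defaultdict

def bucket(tokens: int) -> str:
    for limit in (1_000, 4_000, 16_000, 64_000):
        if tokens <= limit:
            return f"<={limit}"
    return ">64000"

def p99(values):
    values = sorted(values)
    return values[min(int(0.99 * len(values)), len(values) - 1)]

def stratified_p99(records):
    cohorts = defaultdict(list)
    for r in records:
        key = (r["model"], r["region"], bucket(r["prompt_tokens"]),
               bucket(r["output_tokens"]), r["feature"])
        cohorts[key].append(r["latency_ms"])
    return {key: p99(vals) for key, vals in cohorts.items()}

# Sort the result by value: a flat global p99 can hide a 5x spike that lives
# entirely in one (model, region) cohort.
```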

Request-ID correlation with provider events. Every provider returns a request ID in headers — x-request-id, anthropic-request-id, etc. Log it on every call alongside your own trace ID. When the provider posts a status incident, you want to be able to ask "which of our requests landed during the incident window, and what was their request-ID range?" — because that is the only artifact that survives in the provider's logs. Without that correlation, "we were affected" is a guess. With it, you have a list of N affected user sessions you can show to procurement, customer success, or the provider's account team.
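A sketch of the correlation query, assuming each call was logged with your trace ID, the provider's request-ID header, and the user session it served:

```python
# Sketch: turn a provider incident window into a concrete list of affected
# sessions. Assumes each API call was logged with your trace ID and the
# provider's request-ID header (the exact header name varies by provider).
from datetime import datetime, timezone

def affected_requests(call_log, incident_start, incident_end, model, region):
    """call_log rows: {ts, trace_id, provider_request_id, model, region, user_session}."""
    return [
        row for row in call_log
        if incident_start <= row["ts"] <= incident_end
        and row["model"] == model and row["region"] == region
    ]

# hits = affected_requests(call_log,
#                          datetime(2025, 6, 3, 14, 5, tzinfo=timezone.utc),
#                          datetime(2025, 6, 3, 14, 40, tzinfo=timezone.utc),
#                          model="pinned-model-version", region="us-east")
# The provider_request_id values in `hits` are the artifact the provider's
# account team can actually look up on their side.
```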

A team with these three pieces in place can answer "is it us or them?" inside the first ten minutes of an incident, instead of the first ten hours. That speed-of-attribution is the actual product of the observability work — not the dashboards themselves.

Mitigations: From Cheap To Structural

Once you can see the noisy-neighbor problem, you have a graded set of responses, ordered by cost:

  • Retry with jitter to a different region or shard. Cheap, requires only client-side work. Effective when the contention is regional rather than global. Watch out: blind retries against an overloaded provider can amplify the problem (the retry-amplification failure mode), so use exponential backoff and a circuit breaker (sketched after this list).
  • Degrade gracefully when probes flag the provider. Use the synthetic probe as a feature flag — when probe latency exceeds a threshold for N consecutive samples, switch to a smaller model, a cached response, or a degraded UI ("results may take longer than usual"). This buys headroom for users without forcing a full failover.
  • Multi-provider routing for the affected request class. If the noisy-neighbor problem concentrates in one model on one provider, you can shift that traffic to a comparable model on another provider. The cost here is real — multi-provider reliability costs more than 2x because of per-provider prompt tuning, eval calibration, and quirk-handling — so reserve this for traffic where the SLO actually pays for the engineering.
  • Provisioned throughput / dedicated capacity. AWS Bedrock's Provisioned Throughput, Azure's PTUs, and the equivalent dedicated-capacity tiers from frontier API providers eliminate the noisy-neighbor problem by giving you reserved model units with no shared rate limits and predictable latency. The catch is the price: provisioned throughput is significantly more expensive than on-demand and requires a 1-month or 6-month commitment. The math works when sustained utilization stays above ~65-70% of the reserved capacity, or when the SLO violation cost (lost revenue, contractual penalties) exceeds the premium.
  • Dedicated GPU inference, self-managed. The endgame for teams whose latency requirements justify it. H100 cloud prices fell into the $2-4/hour range through 2025, which made the math close to viable for medium-traffic workloads — but you also pay in operational headcount: vLLM tuning, KV-cache budget management, autoscaling, model upgrades, eval infrastructure. Most teams underestimate this cost by 3-5x in the first year.
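
Here is a minimal sketch of the first two items combined: retry with jittered exponential backoff, gated by a circuit breaker driven by the synthetic probe from the previous section. The thresholds, region list, error type, and degraded-response placeholder are all illustrative.

```python
# Sketch: retry with jittered backoff plus a probe-gated circuit breaker so
# retries stop amplifying load when the provider region is already degraded.
# Thresholds, regions, and error/response types are illustrative placeholders.
import random
import time

class TransientProviderError(Exception):
    """Placeholder for whatever timeout / 429 / 5xx your client raises."""

def degraded_response():
    return {"degraded": True}  # placeholder: cached answer, smaller model, or softer UI

class ProbeBreaker:
    """Opens after N consecutive synthetic probes exceed the latency threshold."""
    def __init__(self, threshold_ms=1_500, trip_after=3):
        self.threshold_ms, self.trip_after, self.bad_streak = threshold_ms, trip_after, 0

    def record_probe(self, ttft_ms: float):
        self.bad_streak = self.bad_streak + 1 if ttft_ms > self.threshold_ms else 0

    @property
    def open(self) -> bool:
        return self.bad_streak >= self.trip_after

def call_with_retry(call, breaker, regions=("us-east", "us-west"), attempts=4):
    for attempt in range(attempts):
        if breaker.open:
            return degraded_response()              # stop hammering a degraded provider
        region = regions[attempt % len(regions)]    # hop shards/regions between tries
        try:
            return call(region)
        except TransientProviderError:
            time.sleep(random.uniform(0, min(8, 0.25 * 2 ** attempt)))  # full jitter
    return degraded_response()
```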

The honest decision tree is: synthetic probes and cohort dashboards first (they pay for themselves the first time you have an incident), graceful degradation second (it protects revenue and is mostly client-side), and the structural moves — multi-provider, provisioned, or self-hosted — only when the steady-state p99 violation rate is high enough to financially justify them. Most teams jump to "we should self-host" before they have any data on whether the contention is even consistent enough to be worth eliminating.
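The "financially justify" test fits on one screen. Every number below is an illustrative placeholder you would replace with your own spend, your provider's quote, and the violation-hour count the cohort dashboards give you:

```python
# Back-of-envelope break-even for provisioned throughput. All numbers are
# illustrative placeholders, not any provider's pricing.
on_demand_monthly   = 18_000   # current on-demand token spend, $/month
violation_cost      = 250      # estimated revenue/penalty cost per SLO-violating hour, $
violation_hours     = 40       # p99-violating hours per month, from the cohort dashboards
provisioned_monthly = 27_000   # committed-capacity quote, $/month

effective_on_demand = on_demand_monthly + violation_cost * violation_hours
print(f"on-demand + violation cost: ${effective_on_demand:,}/month")
print(f"provisioned commitment:     ${provisioned_monthly:,}/month")
print("provisioned pays off" if provisioned_monthly < effective_on_demand
      else "stay on-demand and keep measuring")
```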

Stop Treating Latency As Your Problem Alone

The deeper shift required is conceptual. Hosted inference latency is not a property of your code. It is a property of a shared queue, observed through a thin API layer, in which your traffic is one workload among many. Treating tail latency as something you debug entirely from inside your codebase is the same mistake as treating database performance as a property of your ORM — true at the surface, wrong underneath.

The teams that handle this well do three things differently. They invest in provider-health observability before they need it, so attribution is a five-minute lookup instead of a two-week investigation. They keep an explicit decision threshold for when to escalate from on-demand to dedicated capacity, expressed in dollars-per-SLO-violation rather than vague "performance" goals. And they accept that the answer to many incidents is "the provider is having a bad afternoon" — and have the data to say so confidently, plan around it, and move on without a witch hunt.

"It was fine yesterday" is the most expensive sentence in inference ops because it ends investigations that should be starting. The fix is to make sure that "fine yesterday, broken today" is a sentence you can immediately decompose into "fine for whom, broken for whom, and against which provider region" — without having to file a ticket to find out.
