LLM Latency in Production: What Actually Moves the Needle
Most LLM latency advice falls into one of two failure modes: it focuses on the wrong metric, or it recommends optimizations that are too hardware-specific to apply unless you're running your own inference cluster. If you're building on top of a hosted API or a managed inference provider, a lot of that advice is noise.
This post focuses on what actually moves the needle — techniques that apply whether you control the stack or not, grounded in production data rather than benchmark lab conditions.
Latency Has Two Very Different Problems
Before optimizing anything, you need to understand that LLM inference has two distinct phases with fundamentally different bottlenecks.
The prefill phase processes your input prompt. The model reads every token you send and builds the context it needs to start generating. This phase is compute-bound — GPUs are doing heavy matrix multiplications in parallel, and it completes in a single forward pass. Its duration maps directly to Time to First Token (TTFT): how long users wait before they see anything at all.
The decode phase generates output tokens one at a time. Each step has to read the model's weights, plus the growing key-value cache, from GPU memory. This is memory-bandwidth-bound, not compute-bound, which is why even the fastest GPUs can only push so many tokens per second. An H100 SXM5 has 3.35 TB/s of memory bandwidth; an A6000 has 768 GB/s — that difference dominates decode speed far more than raw compute specs.
These are different problems. Optimizing one doesn't automatically help the other. Teams often pour effort into reducing TTFT while ignoring inter-token latency, then wonder why long-form outputs still feel sluggish.
Streaming Is the Cheapest Win You're Not Taking
If your application currently waits for the full response before displaying anything, fixing that is the single highest-leverage change you can make. Streaming outputs tokens incrementally, and the effect on perceived latency is dramatic — often a 10-100x improvement in how responsive the system feels.
The reason is psychological, not technical: humans process text as they read it. A response that starts appearing in 400ms and streams for 3 seconds feels dramatically faster than one that delivers the complete text 3.4 seconds later. Code completion tools set an even tighter bar — under 100ms TTFT is the threshold where suggestions feel native to the editor rather than disruptive.
Target TTFT benchmarks by use case:
- Chatbots: under 500ms
- Code completion: under 100ms
- Batch pipelines: up to 30 seconds is acceptable
Streaming doesn't reduce total generation time — in fact it adds a tiny overhead. But for any interactive use case, it's the first optimization to implement, and it requires nothing beyond enabling the streaming flag in your API call.
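Streaming also changes what you can measure on the client side: TTFT only exists as a number if you consume tokens incrementally. A minimal sketch, where `fake_stream` is a stand-in for a real streaming API call (real SDKs differ only in how you obtain the token iterator):

```python
import time

def consume_stream(token_iter):
    """Measure TTFT and total latency while streaming tokens to the user."""
    start = time.perf_counter()
    ttft = None
    tokens = []
    for tok in token_iter:
        if ttft is None:
            # First token arrives: this is what the user actually feels.
            ttft = time.perf_counter() - start
        tokens.append(tok)
    total = time.perf_counter() - start
    return "".join(tokens), ttft, total

def fake_stream():
    # Stand-in for a real stream (e.g. stream=True on a chat completion call).
    for tok in ["Hello", ",", " world", "!"]:
        time.sleep(0.01)  # simulated inter-token latency
        yield tok

text, ttft, total = consume_stream(fake_stream())
print(f"TTFT={ttft*1000:.0f}ms total={total*1000:.0f}ms text={text!r}")
```

Note that TTFT and total latency diverge by design: the user starts reading at `ttft`, long before `total` elapses.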
KV Cache: The Optimization with Multiple Layers
The key-value (KV) cache stores intermediate attention computations so the model doesn't recompute them on every token generation step. This is what makes decode feasible at all. But there are several distinct caching strategies layered on top of each other, and confusing them leads to missed optimization opportunities.
Per-request KV cache is the baseline — every inference engine does this. It keeps attention keys and values in GPU memory across the decode phase for a single request. The memory footprint is substantial: for Llama-3-70B at standard precision, a single 4,096-token context consumes roughly 1.3 GB of VRAM.
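The 1.3 GB figure falls out of simple arithmetic over the model's published architecture (80 layers, 8 KV heads via grouped-query attention, head dimension 128, 2 bytes per value at FP16/BF16):

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Per-request KV cache size: keys + values at every layer and position."""
    # The factor of 2 covers the separate key and value tensors.
    return seq_len * n_layers * n_kv_heads * head_dim * 2 * dtype_bytes

# Llama-3-70B: 80 layers, 8 KV heads (grouped-query attention), head dim 128
gb = kv_cache_bytes(4096, 80, 8, 128) / 1e9
print(f"{gb:.2f} GB")  # ~1.34 GB for a single 4,096-token context
```

The same arithmetic shows why long contexts are expensive: the cache grows linearly with sequence length, so a 128K-token context is roughly 32x larger.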
Prefix caching extends this across requests. When multiple requests share a common prompt prefix — like all calls to the same system prompt — the engine can compute that prefix once and reuse its KV states across all requests. In practice, this reduces TTFT by 60-80% for requests that hit the cache. In agentic workflows where hundreds of requests start with the same tool definitions and system context, this is significant.
If you're using a hosted API rather than running your own inference engine, you can still benefit from prefix caching through prompt caching at the API level. The key is structuring your requests so static content (system instructions, background context, tool definitions) comes before dynamic content (user messages), and placing cache breakpoints at the boundary between static and dynamic content. The common mistake is marking the wrong block — if your cache breakpoint moves with every request, you never get a hit.
A useful rule for placement: the cache breakpoint should mark the end of content that stays identical across requests, not the end of the last message. Monitor cache_read_input_tokens in the response to confirm you're actually hitting the cache.
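As a sketch of that structure: field names here follow the Anthropic Messages API, which exposes explicit cache breakpoints; the model name and helper function are placeholders, and other providers use different syntax for the same idea.

```python
# Hedged sketch of request structuring for prompt caching. Field names follow
# the Anthropic Messages API (cache_control breakpoints); other providers
# differ. The model name is a placeholder.
def build_request(system_prompt: str, tool_defs: str, user_message: str) -> dict:
    """Static content first; cache breakpoint at the static/dynamic boundary."""
    return {
        "model": "claude-sonnet-4-5",
        "max_tokens": 1024,
        "system": [
            # Everything identical across requests sits before the breakpoint.
            {
                "type": "text",
                "text": system_prompt + "\n\n" + tool_defs,
                # The breakpoint marks the end of the stable prefix,
                # so it never moves between requests.
                "cache_control": {"type": "ephemeral"},
            },
        ],
        # Dynamic content comes after the breakpoint and never invalidates
        # the cached prefix.
        "messages": [{"role": "user", "content": user_message}],
    }

req = build_request("You are a support agent.", "Tool definitions here.",
                    "How do I return an item?")
```

Every request built this way shares the same cached prefix; only the `messages` block varies.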
Semantic caching is a different layer entirely. Rather than caching exact prompt prefixes, it embeds queries and matches semantically similar requests against a vector cache. If someone asks "what's the return policy?" and another user asks "how do I return an item?" — same intent, different phrasing — semantic caching can serve both from the same cached response. Studies have shown up to 90% compute reduction in production deployments with repetitive query patterns. This is most valuable in customer-facing applications where many users ask variations of the same questions.
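The mechanics fit in a few lines. In the sketch below, `toy_embed` is a deliberately crude stand-in for a real embedding model, and the 0.9 similarity threshold is an assumption you would tune against your false-hit tolerance:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def toy_embed(text):
    # Crude stand-in for a real embedding model: stem counts over a tiny vocab.
    vocab = ["return", "policy", "item", "refund", "ship"]
    words = text.lower().split()
    return [sum(w.startswith(v) for w in words) for v in vocab]

class SemanticCache:
    """Serve a stored response when a new query embeds close to a cached one."""
    def __init__(self, embed_fn, threshold=0.9):
        self.embed = embed_fn
        self.threshold = threshold  # assumption: tune against false-hit rate
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query):
        q = self.embed(query)
        for emb, response in self.entries:
            if cosine(q, emb) >= self.threshold:
                return response     # hit: skip the LLM call entirely
        return None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

cache = SemanticCache(toy_embed)
cache.put("what is the return policy", "You have 30 days to return any item.")
print(cache.get("what's the return policy?"))  # same intent -> cache hit
print(cache.get("how do I ship a package"))    # different intent -> None
```

A production version would use a real embedding model and an approximate-nearest-neighbor index instead of the linear scan, but the hit/miss logic is the same.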
Speculative Decoding: Worth It for Long Outputs
Speculative decoding addresses the fundamental bottleneck of sequential token generation. A small, fast "draft" model generates several candidate tokens (typically 3-12). The larger target model then verifies all candidates in a single forward pass — accepting correct ones in bulk and discarding everything from the first wrong token onward. When the draft is right, you get multiple tokens for approximately the cost of one target model step.
Real-world speedups of 2-3x are well-documented on generation-heavy workloads. The catch is that acceptance rate — how often the target model agrees with the draft — varies significantly. Domain-specific tasks where the draft model has seen similar text tend to achieve 70-90% acceptance rates. Generic or highly variable tasks score lower.
If you're running your own inference stack, monitor spec_decode_draft_acceptance_length in your serving engine. An average of fewer than 0.5 accepted draft tokens per step is a sign of poor draft-target pairing, and speculative decoding may not be worth the added complexity. If you're using a managed inference provider, this is often handled automatically for supported model pairs.
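For intuition, here is the greedy-verification variant as a toy function. Real implementations verify against probability distributions with rejection sampling rather than exact token matches, so treat this as a sketch of the accept/reject structure only:

```python
def verify(draft, target):
    """One greedy speculative-decoding step.

    `draft` holds k candidate tokens from the small model; `target` holds the
    big model's greedy choice at each of the k+1 positions, all computed in
    one forward pass. Accept draft tokens until the first mismatch, then emit
    the target's token there (or the bonus token if everything was accepted).
    """
    out = []
    for i, d in enumerate(draft):
        if d == target[i]:
            out.append(d)          # draft agreed: a token accepted "for free"
        else:
            out.append(target[i])  # first disagreement: take target's token
            return out
    out.append(target[len(draft)]) # all accepted: bonus token from the pass
    return out

# Draft proposes 4 tokens; the target agrees on the first two.
print(verify(["the", "cat", "sat", "on"],
             ["the", "cat", "slept", "on", "the"]))
# -> ['the', 'cat', 'slept']
```

The payoff is visible in the shape of the return value: each call emits between 1 and k+1 tokens for roughly one target-model forward pass, which is where the 2-3x speedups come from.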
Quantization: The Infrastructure Unlock
Model quantization — reducing weight precision from 16-bit to 8-bit or 4-bit — is less about latency and more about infrastructure economics that indirectly enable latency wins.
The numbers are striking. Llama-3-70B in BF16 occupies roughly 140GB of VRAM, requiring two H100 80GB GPUs at around $2.69/hour each. The same model in 4-bit AWQ fits on dual RTX A6000s with 96GB total VRAM at about $0.49/hour per GPU — over 80% cost savings with minimal quality degradation for most tasks.
The latency connection: fitting a model on fewer, cheaper GPUs means you can run more replicas for the same budget, reducing queuing time. FP8 quantization on H100s provides roughly 50% VRAM reduction with up to 1.6x throughput improvement on generation-heavy workloads. With more VRAM freed up, you can also hold larger KV caches, which improves cache hit rates.
For production deployments: quantize first, before spending on more expensive hardware. The quality tradeoff is usually acceptable for inference tasks, and the infrastructure flexibility is significant.
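The VRAM numbers above are just weights-only arithmetic (parameters times bits per weight), which you can sanity-check yourself. KV cache, activations, and quantization scales add real-world overhead on top:

```python
def weight_gb(n_params_billion, bits):
    """Weights-only memory footprint; KV cache and activations come on top."""
    return n_params_billion * bits / 8  # bits -> bytes, in GB per billion params

for name, bits in [("BF16", 16), ("FP8", 8), ("AWQ 4-bit", 4)]:
    print(f"{name}: ~{weight_gb(70, bits):.0f} GB for a 70B model")
# BF16: ~140 GB -- needs two 80GB H100s
# FP8:  ~70 GB  -- fits a single H100
# 4-bit: ~35 GB -- fits dual 48GB A6000s with headroom for KV cache
```

The headroom comment is the key point: the gap between 35 GB of weights and 96 GB of VRAM is what lets the quantized deployment hold much larger KV caches.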
Batching: Where Single-Tenant Thinking Kills Performance
Static batching — waiting until you have a full batch before processing — introduces unnecessary head-of-line blocking. A long request stalls all short requests behind it. Throughput goes up but tail latency skyrockets.
Continuous batching (also called in-flight batching) solves this by inserting new requests into the decode loop as soon as processing slots open. When a sequence finishes generating, a new one takes its place immediately rather than waiting for the entire batch to complete. Production data shows GPU utilization of 60-85% under steady traffic with continuous batching, versus the low utilization common with naive static serving.
The batch size impact on latency is non-trivial: one benchmark shows latency dropping from 976ms at batch size 1 to 126ms at batch size 8 due to better GPU utilization. At very high batch sizes, latency climbs again as queuing pressure builds. Dynamic batch sizing — scaling based on current queue length and latency targets — is the production pattern rather than a fixed setting.
If you're not running your own serving stack, this is handled by your inference provider. The relevant question becomes whether the provider uses continuous batching and how they handle queue management — this affects your P99 latency significantly under load.
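A toy simulation makes the head-of-line blocking concrete. Time is measured in decode steps, and the scheduler below is a simplification of what real serving engines do:

```python
import heapq

def static_finish_times(lengths, slots):
    """Static batching: fixed batches of `slots` requests; a batch runs until
    its longest member finishes, then the next batch starts."""
    finish, t = [], 0
    for i in range(0, len(lengths), slots):
        batch = lengths[i:i + slots]
        t += max(batch)                # head-of-line blocking: max of the batch
        finish.extend([t] * len(batch))
    return finish

def continuous_finish_times(lengths, slots):
    """Continuous batching: a freed slot is reused immediately by the next
    waiting request. Returns per-request finish times in decode steps."""
    free = [0] * slots                 # time at which each slot becomes free
    heapq.heapify(free)
    finish = []
    for ln in lengths:
        start = heapq.heappop(free)    # earliest slot to open up
        done = start + ln
        finish.append(done)
        heapq.heappush(free, done)
    return finish

lengths = [200, 10, 10, 10]            # one long request, three short ones
print(static_finish_times(lengths, 2))      # [200, 200, 210, 210]
print(continuous_finish_times(lengths, 2))  # [200, 10, 20, 30]
```

The short requests finish in 10-30 steps under continuous batching instead of waiting 200+ steps behind the long one — exactly the tail-latency difference the prose describes.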
What to Measure (and What to Ignore)
Average latency is the least useful metric in production. It masks the tail latency that determines whether your P95 or P99 users are having a miserable experience. A system averaging 300ms TTFT can have P99 latency of 4 seconds — the kind of outlier that triggers support tickets and churn.
Track these instead:
- P50 TTFT: What typical users experience
- P95 TTFT: Near-worst-case for latency SLOs
- P99 TTFT: True tail latency — the number your SLA is tested against
- Inter-token latency: Especially important for long-form generation
- Time per output token (TPOT): (Total latency - TTFT) / (output tokens - 1), the decode-phase metric
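A sketch of the percentile and TPOT math, using nearest-rank percentiles. The sample TTFT values are made up to show how a single outlier separates the mean from the P50:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the smallest value >= p% of samples."""
    s = sorted(samples)
    return s[max(0, math.ceil(p / 100 * len(s)) - 1)]

def tpot(total_latency, ttft, output_tokens):
    """Time per output token: decode-phase speed, excluding prefill."""
    return (total_latency - ttft) / (output_tokens - 1)

# Nine healthy requests and one tail outlier (seconds)
ttfts = [0.28, 0.31, 0.30, 0.29, 0.27, 0.33, 0.30, 0.29, 0.31, 4.1]
print(f"mean={sum(ttfts)/len(ttfts):.2f}s "
      f"p50={percentile(ttfts, 50):.2f}s "
      f"p99={percentile(ttfts, 99):.2f}s")
# The mean (0.68s) sits nowhere near either the typical (0.30s) or the
# tail (4.1s) experience -- which is why averages mislead.

print(f"TPOT={tpot(3.4, 0.4, 101)*1000:.0f}ms")  # 3.0s decode / 100 tokens = 30ms
```

In production you'd feed these functions from request logs or a metrics backend rather than a list, but the arithmetic is the same.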
If you're self-hosting with vLLM, monitor vllm:gpu_cache_usage_perc and vllm:num_requests_waiting to distinguish whether you're hitting KV cache limits (memory-constrained) or compute limits. These point to different fixes: cache pressure calls for quantization or KV cache optimization, while compute pressure calls for better batching or additional replicas.
Avoid optimizing against synthetic benchmarks with fixed prompt/output lengths. Production traffic has high variance in both dimensions, and fixed-length benchmarks systematically underestimate tail latency.
The Practical Priority Order
Given limited engineering bandwidth, here's the order that tends to deliver the most impact for the effort:
1. Enable streaming — immediate UX improvement, minimal code change
2. Implement prefix/prompt caching correctly — structure your prompts to maximize cache hits on static content
3. Quantize your models if self-hosting — unlocks infrastructure flexibility before buying more GPUs
4. Set up percentile-based monitoring — you can't improve what you're not measuring correctly
5. Evaluate continuous batching in your serving engine — SGLang for agentic/structured output workloads, vLLM for general chat
6. Add speculative decoding — high payoff on generation-heavy workloads, requires careful draft model selection
Speculative decoding and advanced batching tend to be last because they require more infrastructure investment and tuning. Streaming and caching are effective regardless of whether you control the inference stack.
Conclusion
LLM latency isn't a single number to optimize — it's two separate problems (TTFT and decode) that require different interventions, measured in ways that most monitoring setups handle poorly. The teams that consistently ship fast AI applications aren't necessarily running the most exotic inference infrastructure. They've enabled streaming, structured their prompts to hit cache, and built monitoring that surfaces tail latency rather than averages. That foundation handles most of what users perceive as "fast" or "slow" before you ever need to think about speculative decoding or custom serving engines.
Sources
- https://www.clarifai.com/blog/llm-inference-optimization/
- https://www.runpod.io/blog/llm-inference-optimization-techniques-reduce-latency-cost
- https://bentoml.com/llm/inference-optimization/llm-inference-metrics
- https://www.baseten.co/blog/how-we-built-production-ready-speculative-decoding-with-tensorrt-llm/
- https://platform.claude.com/docs/en/docs/build-with-claude/prompt-caching
- https://research.aimultiple.com/llm-latency-benchmark/
- https://mljourney.com/latency-optimization-techniques-for-real-time-llm-inference/
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
