LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems
Most engineers building on LLMs treat latency as a single dial. They tune something — a batch size, a quantization level, an instance type — observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT looks fine while your p99 is over 3 seconds, or that the optimization that doubled your throughput somehow made individual users feel the system got slower.
TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.
The Two Phases Hidden Inside Every LLM Request
Every LLM request executes in two sequential phases, and they could not be more different computationally.
The prefill phase processes the entire input prompt in a single forward pass. The model reads every token you sent, computes attention over all of them, and builds a KV (key-value) cache that will be used during generation. This phase is compute-bound: it is dominated by matrix multiplications over the prompt, and attention cost grows roughly as O(n²) in prompt length. More available FLOPs mean faster prefill. Time-to-first-token (TTFT) is largely determined here, because the model cannot emit a single output token until this pass completes.
The decode phase is where generation happens. The model produces one token at a time, autoregressively. Each step loads all model weights from GPU memory to compute a single token. This phase is memory-bandwidth-bound: the bottleneck is how fast you can read gigabytes of weights from HBM, not how many FLOPs you can execute. Throughput — how many tokens the system generates per second — lives here.
This is the core tension: prefill wants compute, decode wants memory bandwidth. They compete for the same GPU resources. Every major serving optimization in 2024–2025 — chunked prefill, prefill-decode disaggregation, speculative decoding — is an attempt to resolve this competition without sacrificing both metrics simultaneously.
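The two regimes can be made concrete with a back-of-envelope roofline estimate. The hardware figures below are assumed H100-class numbers, not measurements, and the 2-FLOPs-per-parameter-per-token rule is the usual matmul approximation:

```python
# Back-of-envelope roofline for the two phases. Prefill is limited by peak
# FLOPs; decode is limited by how fast weights stream out of HBM.
# All constants are assumptions to illustrate the shape of the math.

PARAMS = 7e9             # 7B-parameter model
BYTES_PER_PARAM = 2      # FP16 weights
PEAK_FLOPS = 989e12      # assumed H100 dense BF16 peak, FLOP/s
HBM_BANDWIDTH = 3.35e12  # assumed H100 HBM3 bandwidth, bytes/s

def prefill_time_s(prompt_tokens: int) -> float:
    # ~2 FLOPs per parameter per token for the matmuls, one pass over the prompt
    return 2 * PARAMS * prompt_tokens / PEAK_FLOPS

def decode_tpot_floor_s() -> float:
    # every decode step must stream all weights from HBM at least once
    return PARAMS * BYTES_PER_PARAM / HBM_BANDWIDTH

print(f"prefill, 10k-token prompt: {prefill_time_s(10_000) * 1e3:.0f} ms")
print(f"decode TPOT floor:         {decode_tpot_floor_s() * 1e3:.1f} ms/token")
```

The point is not the exact numbers but the different denominators: making prefill faster means more FLOPs, making decode faster means fewer bytes moved per step.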
The total latency formula is:
End-to-End Latency = TTFT + (TPOT × output_tokens)
Where TPOT (time per output token) is the per-step decode latency. TTFT and TPOT have different causes and different fixes.
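As a trivial function (the example numbers are arbitrary):

```python
# End-to-End Latency = TTFT + (TPOT x output_tokens), straight from the formula.
def end_to_end_latency_s(ttft_s: float, tpot_s: float, output_tokens: int) -> float:
    return ttft_s + tpot_s * output_tokens

# 400 ms TTFT, 30 ms TPOT, 500 output tokens:
print(f"{end_to_end_latency_s(0.4, 0.03, 500):.1f} s")
```

Note the asymmetry: for a 500-token answer at these numbers, decode accounts for over 97% of total latency, while for a 20-token answer TTFT dominates. Which term you should attack depends on your output-length distribution.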
When Each Metric Actually Matters
Before optimizing anything, get clear on which metric your users actually experience.
Optimize for TTFT when:
- Users are watching a cursor or blank screen — chatbots, coding assistants, voice agents
- You're running agentic loops where each tool-call response adds its latency to the total task time, compounding across every step of the loop
- Latency directly maps to money, as in financial or trading applications
Human perception research gives a clear threshold: TTFT under 500ms feels responsive; above 1,000ms users notice; above 2,000ms frustration sets in. MLCommons codified this in their MLPerf 5.1 interactive scenario: TTFT ≤ 500ms, TPOT ≤ 30ms (roughly 33 tokens/second, which matches typical reading speed). Below 30ms TPOT, further decode speed improvements are imperceptible — users cannot read faster than the tokens arrive anyway.
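Those thresholds reduce to a simple predicate; this sketch uses the MLPerf 5.1 interactive numbers quoted above:

```python
# Interactive SLO check using the MLPerf 5.1 interactive-scenario targets
# cited in the text: TTFT <= 500 ms and TPOT <= 30 ms.
TTFT_SLO_MS, TPOT_SLO_MS = 500, 30

def meets_interactive_slo(ttft_ms: float, tpot_ms: float) -> bool:
    return ttft_ms <= TTFT_SLO_MS and tpot_ms <= TPOT_SLO_MS

print(meets_interactive_slo(450, 28))  # responsive on both axes
print(meets_interactive_slo(450, 45))  # fast first token, generation lags reading speed
```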
Optimize for throughput when:
- No user is watching: document summarization pipelines, data labeling, nightly report generation
- Cost per token drives the business case — maximize GPU utilization to minimize spend
- You're generating synthetic training data or running bulk embedding jobs
In batch workloads, accepting a TTFT of several seconds is often fine. The optimization objective shifts entirely to maximizing tokens per dollar.
The mistake most teams make is running interactive-feeling products on infrastructure tuned for batch throughput, and vice versa.
The Metrics You Should Actually Track in Production
Never use averages for TTFT SLO management. A system with a 200ms average TTFT can simultaneously have a p99 of 3,000ms — meaning 1% of your users wait 15× longer than the average suggests. Always track p50, p95, and p99 separately. A degrading p99 while p50 holds steady is an early warning for queue buildup or memory pressure.
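A synthetic distribution makes the point concrete. This sketch (stdlib only, nearest-rank percentiles, invented numbers) models a service where 2% of requests hit a slow path:

```python
# Why averages hide the tail: 98% of requests are fast, 2% hit queue spikes.
import random

def percentile(samples, q):
    # nearest-rank percentile, q in [0, 100]
    s = sorted(samples)
    return s[min(len(s) - 1, round(q / 100 * (len(s) - 1)))]

random.seed(0)
ttfts = ([random.uniform(100, 250) for _ in range(980)]      # warm path, ms
         + [random.uniform(2000, 4000) for _ in range(20)])  # queue spikes, ms

print(f"mean: {sum(ttfts) / len(ttfts):.0f} ms")
for q in (50, 95, 99):
    print(f"p{q}:  {percentile(ttfts, q):.0f} ms")
```

The mean and p50 both sit comfortably in the low hundreds of milliseconds while p99 lands in the multi-second range, which is exactly the dashboard that fools teams tracking only averages.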
The second measurement failure mode is testing at the wrong concurrency level. TTFT at 1 concurrent user is essentially just network latency plus a single forward pass — it tells you almost nothing about production behavior. Test at your expected p95 concurrent request count. Artificial Analysis, which does independent API benchmarking, switched to 10k-token default prompts specifically because 1k-token prompts masked real-world behavior.
The third metric worth tracking is goodput: requests per second where both TTFT and TPOT meet your SLO targets. A system achieving 100 req/s where 70% of requests fail the TTFT SLO has an effective goodput of 30 req/s. Raw throughput numbers look great in press releases; goodput tells you what users actually experience.
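Goodput is cheap to compute once you log per-request TTFT and TPOT. A minimal sketch, reusing the article's 100 req/s example where 70% of requests miss the TTFT SLO:

```python
# Goodput: only requests meeting BOTH latency SLOs count. SLO values here
# are the interactive targets quoted earlier; tune them to your product.
TTFT_SLO_MS, TPOT_SLO_MS = 500, 30

def goodput(requests, window_s: float) -> float:
    """requests: iterable of (ttft_ms, tpot_ms) completed in the window."""
    ok = sum(1 for ttft, tpot in requests
             if ttft <= TTFT_SLO_MS and tpot <= TPOT_SLO_MS)
    return ok / window_s

reqs = [(300, 25)] * 30 + [(1200, 25)] * 70  # 70% blow the TTFT SLO
print(f"raw throughput: {len(reqs) / 1.0:.0f} req/s")
print(f"goodput:        {goodput(reqs, window_s=1.0):.0f} req/s")
```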
Optimization Techniques That Target TTFT
KV cache prefix reuse is the single highest-leverage TTFT optimization for most production workloads. If your system prompt is 10,000 tokens, every request that hits a warm cache avoids the entire compute cost of processing those tokens. Glean's production system saw TTFT drop from 4.3s to 0.6s on cache-warm requests — a 7× improvement with no model changes. LMCache demonstrated 6.7× faster TTFT (1.2s → 0.18s) alongside 80% higher throughput from better cache utilization. The key is instrumenting your cache hit rates; teams that deploy prefix caching without measuring hit rates capture a fraction of the potential benefit.
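The instrumentation that paragraph argues for can be very small. A sketch, measuring hit rate in tokens rather than requests because prefill cost scales with tokens (the class and field names are illustrative, not any framework's API):

```python
# Token-level prefix-cache hit-rate tracking. A request-level hit rate can
# look healthy while most prompt tokens still miss, so count tokens.
class PrefixCacheStats:
    def __init__(self):
        self.cached_tokens = 0
        self.total_prompt_tokens = 0

    def record(self, prompt_tokens: int, cached_prefix_tokens: int):
        self.total_prompt_tokens += prompt_tokens
        self.cached_tokens += cached_prefix_tokens

    @property
    def token_hit_rate(self) -> float:
        return self.cached_tokens / max(1, self.total_prompt_tokens)

stats = PrefixCacheStats()
stats.record(prompt_tokens=12_000, cached_prefix_tokens=10_000)  # warm system prompt
stats.record(prompt_tokens=12_000, cached_prefix_tokens=0)       # cold request
print(f"token hit rate: {stats.token_hit_rate:.0%}")
```

A token hit rate well below your share of repeated system-prompt tokens is the signal that cache eviction or routing is eating the benefit.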
Prefill-decode (PD) disaggregation runs prefill and decode on separate GPU pools, eliminating the phase interference at the root of the problem. Decode GPUs never stall waiting for a prefill to finish. DistServe, published at OSDI 2024, demonstrated up to 7.66× TTFT improvement and 7× QPS improvement on Llama-3.1-405B. By mid-2025, essentially every major framework — vLLM, SGLang, NVIDIA Dynamo, LMCache — supports PD disaggregation for large-scale deployments. One caveat: the optimal ratio of prefill to decode workers is workload-dependent. Default configurations can produce a 20–30% performance regression; you need to tune resource allocation to your actual input/output length distribution.
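One way to get a first-cut worker split is to compare GPU-seconds spent per request in each phase. This is a rough sizing sketch; the per-token costs are placeholders you would measure on your own hardware, not constants from any paper:

```python
# First-cut prefill:decode worker ratio from measured per-phase costs.
# Both constants below are assumptions standing in for your own profiling.
PREFILL_S_PER_1K_TOKENS = 0.05  # measured prefill GPU-seconds per 1k prompt tokens
DECODE_S_PER_TOKEN = 0.02       # measured per-step decode latency at target batch size

def worker_ratio(avg_input_tokens: float, avg_output_tokens: float) -> float:
    prefill_s = avg_input_tokens / 1000 * PREFILL_S_PER_1K_TOKENS
    decode_s = avg_output_tokens * DECODE_S_PER_TOKEN
    return prefill_s / decode_s  # prefill GPUs needed per decode GPU

# Long-prompt RAG and chatty agents want very different splits:
print(f"RAG  (20k in / 300 out): {worker_ratio(20_000, 300):.2f}")
print(f"chat (1k in / 500 out):  {worker_ratio(1_000, 500):.3f}")
```

This is exactly why the default configurations mentioned above can regress: a ratio tuned for one input/output distribution is badly wrong for another.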
Chunked prefill, introduced in the Sarathi-Serve paper at OSDI 2024, splits large prompts into 256–512 token chunks and interleaves them with decode iterations. This eliminates "decode stalls" — where a single large prefill blocks ongoing decode steps for seconds. The tradeoff is slightly higher average TTFT in exchange for dramatically lower tail latency and higher throughput. Sarathi-Serve demonstrated 2.6× higher serving capacity for Mistral-7B and 3.7× for Yi-34B versus baseline vLLM.
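The scheduling idea can be shown with a toy interleaver. This is purely illustrative, not Sarathi-Serve's actual scheduler: one prefill chunk and one pending decode step alternate, so no decode request waits for the whole prompt:

```python
# Toy chunked-prefill schedule: a large prompt is split into fixed-size
# chunks and interleaved with queued decode steps.
from collections import deque

def schedule(prompt_tokens: int, decode_queue: deque, chunk: int = 512):
    order = []
    chunks = deque(range(0, prompt_tokens, chunk))
    while chunks or decode_queue:
        if chunks:  # one chunk of the big prefill per iteration
            start = chunks.popleft()
            order.append(("prefill", start, min(start + chunk, prompt_tokens)))
        if decode_queue:  # decode work proceeds between chunks
            order.append(("decode", decode_queue.popleft()))
    return order

for step in schedule(1500, deque(["req-A", "req-B"])):
    print(step)
```

With a monolithic prefill, req-A and req-B would stall behind all 1,500 prompt tokens; interleaved, they each wait at most one chunk.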
Prompt length reduction is the unsexy but often overlooked lever. Shorter prompts directly reduce prefill compute. In RAG workloads, improving retrieval quality — returning fewer, more relevant chunks — has a direct TTFT impact before you touch a single infrastructure knob.
Optimization Techniques That Target Throughput
Continuous batching (the Orca algorithm, now universal in production serving) dynamically adds newly arrived requests to in-flight batches as generation slots free up, rather than waiting for a full static batch to complete. Combined with PagedAttention — which manages KV cache in non-contiguous memory blocks to eliminate fragmentation — this is the baseline configuration every team should be running. vLLM popularized both; they're now table stakes.
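The core loop is simple enough to show in miniature. A toy sketch (not Orca's or vLLM's actual scheduler): after every decode step, finished requests free their slots and queued requests join immediately:

```python
# Toy continuous-batching loop: admission happens every step, not only
# when the whole batch drains.
from collections import deque

def serve(arrivals, max_batch: int = 2) -> int:
    queue = deque(arrivals)  # (request_id, tokens_to_generate)
    batch, steps = {}, 0
    while queue or batch:
        while queue and len(batch) < max_batch:  # fill freed slots each step
            rid, n = queue.popleft()
            batch[rid] = n
        steps += 1                               # one decode step for the batch
        batch = {rid: n - 1 for rid, n in batch.items() if n > 1}
    return steps

print(serve([("A", 3), ("B", 1), ("C", 2)]))
```

Here three requests finish in 3 decode steps; a static batch of size 2 would need 5 (the first batch runs to its longest member before C can start). The gap widens as output lengths get more skewed.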
Quantization trades a small amount of model quality for large throughput gains. FP8 weights and activations deliver roughly 2× throughput on H100/Blackwell with minimal quality loss. INT4/GPTQ pushes 3–4× gains with more quality sensitivity. NVIDIA's NVFP4 KV cache format, introduced in 2025, achieves 20% higher cache hit rates and 3× lower latency at large batch sizes compared to FP8 KV cache.
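The mechanism behind those gains is plain arithmetic: decode streams the weights every step, so halving bytes per weight roughly halves the memory traffic on a bandwidth-bound phase. A back-of-envelope sketch for a 70B model:

```python
# Bytes streamed from HBM per decode step at different weight precisions.
# On a bandwidth-bound decode, this traffic is the throughput ceiling.
PARAMS = 70e9  # 70B-parameter model

for name, bits in [("FP16", 16), ("FP8", 8), ("INT4", 4)]:
    gb = PARAMS * bits / 8 / 1e9
    print(f"{name}: {gb:5.0f} GB of weights streamed per decode step")
```

The smaller footprint also frees HBM for KV cache, which is where the larger-batch throughput gains come from.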
Speculative decoding exploits idle compute at low-to-moderate concurrency by running a small draft model (1–7B parameters) that generates 3–12 candidate tokens per step. The target model verifies them in a single parallel pass, yielding multiple tokens for the cost of one decode step when the draft is correct. TensorRT-LLM reports up to 3.6× throughput improvement; vLLM with DeepSeek models shows up to 50% per-token latency reduction. But there's a critical inversion: at high concurrency, the draft model competes with real inference requests for GPU resources, and speculative decoding hurts throughput. It's primarily a low-concurrency latency optimization. The concurrency threshold at which it flips is model- and hardware-specific — measure it for your workload.
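The expected payoff per verification step has a closed form under the standard simplifying assumption (used in the original speculative-sampling analysis) that each drafted token is accepted independently with probability a:

```python
# Expected tokens emitted per target-model verification step with a draft
# of length k and per-token acceptance probability a:
#   E[tokens] = (1 - a^(k+1)) / (1 - a)
# The +1 in the exponent is the token the target model contributes itself.
def expected_tokens_per_step(a: float, draft_len: int) -> float:
    return (1 - a ** (draft_len + 1)) / (1 - a)

for a in (0.6, 0.8, 0.9):
    print(f"acceptance {a:.0%}, draft 5: "
          f"{expected_tokens_per_step(a, 5):.2f} tokens/step")
```

The curve saturates quickly in draft length, which is why drafts stay short, and it collapses toward 1 as acceptance drops, which is why a poorly matched draft model gives you the verification overhead without the speedup.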
Data parallelism (replicating the model across GPU instances behind a load balancer) is the blunt but effective strategy for scaling throughput horizontally. It scales linearly with GPU count and costs linearly — there's no free lunch, but it works.
Framework and Provider Selection
The choice of serving framework has a larger impact on TTFT versus throughput tradeoffs than most teams realize.
vLLM remains the de facto standard for self-hosted serving: broad model support, continuous batching and PagedAttention by default, active community. It's the sensible baseline. HuggingFace TGI v3 often beats vLLM on TTFT at low concurrency — up to 13× faster on long-context workloads with prefix caching enabled — but vLLM generally wins at high concurrency throughput (2–24× advantage in some benchmarks). TensorRT-LLM delivers the highest raw throughput on NVIDIA hardware but requires model compilation and is NVIDIA-only. SGLang has emerged as the best option for reasoning models and complex agentic workflows, particularly for MoE architectures like DeepSeek.
For cloud APIs, the range is wide. Groq's LPU-based inference consistently benchmarks with sub-100ms TTFT and 750–1,580 tokens/second on open-weight models — roughly 20× faster throughput and 3–4× lower TTFT than GPU-based providers. The constraint is model coverage: open-weight only, no fine-tuning. GPU cloud providers on H100 typically land at 200–800ms TTFT for 10k-token prompts at low concurrency. Independent benchmarking (Artificial Analysis) puts the range across providers at 0.27s to 4.5s TTFT and 313 to 900+ TPS for output speed — an order of magnitude variation.
The Decision Matrix
Here is the practical decision framework:
If you're building an interactive product — chat, coding assistant, voice agent, agentic loop — your primary target is TTFT < 500ms at your p95 concurrency level. Start with prefix caching hit rates; fix those first. Then benchmark at realistic load to find your actual bottleneck (queue depth, prefill cost, or decode stalls). PD disaggregation and chunked prefill are the architectural answers when you need high serving throughput and interactive latency at the same time.
If you're running a batch workload — pipelines, annotation, synthesis — maximize GPU utilization and output tokens per dollar. Use continuous batching with large batch sizes, quantization, and an ample KV cache budget. Accept higher TTFT. Without a latency SLO, goodput collapses into raw throughput, so optimize tokens per dollar directly.
If you're buying from a cloud API provider, TTFT under your expected prompt lengths is something you can benchmark before committing. Artificial Analysis and similar services publish independent measurements; use them.
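Measuring TTFT and TPOT yourself only takes a stopwatch around a streaming response. A minimal harness sketch; `fake_stream` is a stand-in for whatever token iterator your provider's SDK returns, and the sleep durations are invented:

```python
# Timing harness for any streaming client: pass in an iterable of tokens,
# get back (TTFT, TPOT) in seconds.
import time

def measure(stream):
    t0 = time.perf_counter()
    ttft, count, last = None, 0, t0
    for _token in stream:
        now = time.perf_counter()
        if ttft is None:           # first token marks end of prefill + queueing
            ttft = now - t0
        count += 1
        last = now
    if ttft is None:
        return None, None
    tpot = (last - t0 - ttft) / max(1, count - 1)  # average inter-token gap
    return ttft, tpot

def fake_stream():                 # stand-in for a real provider SDK stream
    time.sleep(0.05)               # simulated prefill
    for _ in range(20):
        time.sleep(0.005)          # simulated decode steps
        yield "tok"

ttft, tpot = measure(fake_stream())
print(f"TTFT {ttft * 1e3:.0f} ms, TPOT {tpot * 1e3:.1f} ms")
```

Run it against your real prompt lengths at your real concurrency; a single-request number against a short prompt will flatter every provider equally.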
What Most Teams Get Wrong
The mistakes fall into patterns. Optimizing average latency instead of p99 is the most common — the system looks fine until a large prompt or a queue spike reveals the actual tail. Benchmarking at one concurrent user is the second: the number is useless unless your production traffic is literally one user.
Treating raw throughput as a proxy for user experience is subtler. A system achieving 1,000 tokens/second where most requests violate the TTFT SLO is not performing well for users. Goodput is the right metric; track it explicitly.
Hardware selection often gets cargo-culted. H100 is not always the right answer. For smaller models at moderate concurrency, A10G or L4 instances frequently deliver better cost efficiency. H100's HBM3 bandwidth advantage matters for memory-bandwidth-bound decode; for compute-bound prefill, the cost premium often isn't worth it.
Finally: enable speculative decoding, observe TTFT improve, declare success — then miss that you've quietly degraded throughput at peak load. The draft model doesn't run for free.
The Core Insight
Prefill is compute-bound. Decode is memory-bandwidth-bound. These facts are immutable, given current GPU architectures. Every optimization you apply either reduces the work in one phase, separates the phases to prevent interference, or exploits idle resources in one phase to accelerate the other.
Understanding which phase is your bottleneck, at your actual production concurrency, tells you which class of optimization to reach for. Reach for the wrong class and you will spend engineering time on improvements that don't materialize in production — or worse, that trade the metric users feel for one that only looks good in a benchmark.
- https://docs.anyscale.com/llm/serving/benchmarking/metrics
- https://www.ibm.com/think/topics/time-to-first-token
- https://bentoml.com/llm/inference-optimization/llm-inference-metrics
- https://bentoml.com/llm/inference-optimization/prefill-decode-disaggregation
- https://www.clarifai.com/blog/ttft-vs-throughput
- https://arxiv.org/pdf/2401.09670
- https://www.usenix.org/system/files/osdi24-agrawal.pdf
- https://developer.nvidia.com/blog/tensorrt-llm-speculative-decoding-boosts-inference-throughput-by-up-to-3-6x/
- https://www.glean.com/blog/glean-kv-caches-llm-latency
- https://blog.lmcache.ai/2025-03-31-eurosys/
- https://modal.com/blog/vllm-vs-tgi-article
- https://arxiv.org/html/2410.14257v1
- https://groq.com/blog/artificialanalysis-ai-llm-benchmark-doubles-axis-to-fit-new-groq-lpu-inference-engine-performance-results
- https://llm-d.ai/blog/kvcache-wins-you-can-see
- https://www.runpod.io/articles/guides/llm-inference-optimization-playbook
