LLM Latency Decomposition: Why TTFT and Throughput Are Different Problems
Most engineers building on LLMs treat latency as a single dial. They tune something (a batch size, a quantization level, an instance type), observe whether "it got faster," and call it done. This works until you hit production and discover that your p50 TTFT (time to first token) looks fine while your p99 is over 3 seconds, or that the optimization that doubled your aggregate throughput somehow made individual users feel the system got slower.
TTFT and throughput are not two ends of the same slider. They are caused by fundamentally different physics, degraded by different bottlenecks, and fixed by different techniques: TTFT is dominated by prefill, where the whole prompt is processed in one compute-bound pass, while steady-state throughput is dominated by decode, which emits one token per step and is bound by memory bandwidth. Treating them as interchangeable is the root cause of most LLM inference incidents I've seen in production.
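Before going further, it's worth pinning down what each number actually measures. Below is a minimal sketch of instrumenting the two separately. `stream_tokens` is a hypothetical stand-in for any streaming inference endpoint (its delay values are invented for illustration); TTFT is the wall-clock time to the first emitted token, and decode throughput is computed only over the tokens after the first, so a slow prefill can't distort it.

```python
import random
import statistics
import time

def stream_tokens(n_tokens: int):
    """Hypothetical stand-in for a streaming LLM endpoint.

    Simulates a variable-length prefill (queueing + prompt processing)
    followed by steady per-token decode steps. The sleep durations are
    made up for illustration only.
    """
    time.sleep(random.uniform(0.01, 0.15))  # prefill: dominates TTFT
    for _ in range(n_tokens):
        time.sleep(0.003)                   # decode: dominates throughput
        yield "tok"

def measure(n_tokens: int = 50) -> tuple[float, float]:
    """Return (TTFT in seconds, decode tokens/sec) for one request."""
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in stream_tokens(n_tokens):
        if ttft is None:
            ttft = time.monotonic() - start  # first token arrived
        count += 1
    total = time.monotonic() - start
    # Throughput over the decode phase only: tokens after the first,
    # divided by the time elapsed after the first token appeared.
    tps = (count - 1) / (total - ttft) if count > 1 else 0.0
    return ttft, tps

if __name__ == "__main__":
    samples = [measure() for _ in range(100)]
    ttfts = [t for t, _ in samples]
    cuts = statistics.quantiles(ttfts, n=100)  # 99 percentile cut points
    print(f"TTFT p50={cuts[49] * 1000:.0f} ms  p99={cuts[98] * 1000:.0f} ms")
    print(f"median decode rate={statistics.median(r for _, r in samples):.1f} tok/s")
```

Measured this way, the two numbers can and do move in opposite directions: batching more requests together typically raises aggregate decode throughput while lengthening the queue in front of prefill, which is exactly how a throughput win turns into a p99 TTFT regression.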
