2 posts tagged with "kv-cache"

What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor

11 min read
Tian Pan
Software Engineer

You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.

Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.

This post explains what's actually happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can actually do about it.
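To see why the floor exists at all, a back-of-envelope estimate is enough. The sketch below is not any provider's actual scheduler: it assumes a 70B-parameter model served in FP16 on an A100-class GPU (roughly 312 TFLOPS dense FP16, roughly 2 TB/s of HBM bandwidth), and it ignores queue wait and KV-cache reads entirely.

```python
# Back-of-envelope latency floor, not any provider's actual scheduler.
# Assumed hardware/model: 70B parameters in FP16 on an A100-class GPU
# (~312 TFLOPS dense FP16, ~2.0 TB/s HBM bandwidth). All numbers are rough.

PARAMS = 70e9                # model parameters
BYTES_PER_PARAM = 2          # FP16
GPU_FLOPS = 312e12           # peak FP16 throughput
GPU_BANDWIDTH = 2.0e12       # HBM bytes per second

def prefill_seconds(prompt_tokens: int) -> float:
    """Compute-bound floor: a forward pass costs roughly 2 FLOPs per parameter per token."""
    return 2 * PARAMS * prompt_tokens / GPU_FLOPS

def decode_seconds_per_token(batch_size: int = 1) -> float:
    """Bandwidth-bound floor: each decode step streams the weights from HBM once,
    amortized over whatever batch the provider formed (KV-cache reads ignored)."""
    return PARAMS * BYTES_PER_PARAM / GPU_BANDWIDTH / batch_size

prompt_tokens, output_tokens = 4000, 300
ttft = prefill_seconds(prompt_tokens)                       # excludes queue wait
total = ttft + output_tokens * decode_seconds_per_token(8)  # assume you share a batch of 8

print(f"prefill floor:    {ttft * 1000:.0f} ms")
print(f"per-token floor:  {decode_seconds_per_token(8) * 1000:.2f} ms at batch=8")
print(f"end-to-end floor: {total:.2f} s before any queueing")
```

With these assumptions, a 4,000-token prompt and 300 output tokens already land in the neighborhood of a 4-second response, before the scheduler, the queue, or the eviction policy have done anything to you.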

GPU Memory Math for Multi-Model Serving: Why Most Teams Over-Provision by 3x

9 min read
Tian Pan
Software Engineer

Most teams running LLM inference treat GPU provisioning like a guessing game. They see a model needs "140 GB at FP16," panic, requisition four A100-80GB cards, and call it done. What they don't calculate is how KV cache, concurrency, and quantization interact to determine the actual memory footprint — and that miscalculation typically means they're paying 3x more than necessary.

The math isn't complicated. But almost nobody does it before signing the cloud contract. This article walks through the exact formulas, shows where the hidden memory sinks live, and explains the bin-packing strategies that let you serve four models on hardware budgeted for one.
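As a taste of that math, here is the standard accounting in a few lines of Python. The model shape is an assumption chosen for concreteness (a Llama-2-70B-like config: 80 layers, 8 KV heads via grouped-query attention, head dimension 128); substitute your own model's numbers.

```python
# Weight and KV-cache memory for a single model, using the standard transformer
# accounting. The config below is Llama-2-70B-like and is only an example.

def weight_bytes(params: float, bytes_per_param: float) -> float:
    return params * bytes_per_param

def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       bytes_per_elem: float) -> float:
    # 2x for the K and V tensors, stored for every layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

weights_fp16 = weight_bytes(70e9, 2)            # ~140 GB of FP16 weights
weights_int4 = weight_bytes(70e9, 0.5)          # ~35 GB if quantized to 4-bit
per_token = kv_bytes_per_token(80, 8, 128, 2)   # ~0.33 MB of KV cache per token

concurrency, ctx = 50, 4096                     # 50 live requests, 4k tokens each
kv_total = concurrency * ctx * per_token        # ~67 GB of KV cache

print(f"weights (FP16): {weights_fp16 / 1e9:.0f} GB")
print(f"weights (INT4): {weights_int4 / 1e9:.0f} GB")
print(f"KV per token:   {per_token / 1e6:.2f} MB")
print(f"KV total:       {kv_total / 1e9:.0f} GB at {concurrency} x {ctx}-token sequences")
print(f"footprint:      {(weights_fp16 + kv_total) / 1e9:.0f} GB before activations and fragmentation")
```

The headline "140 GB" is only the weights. At realistic concurrency the KV cache can rival or exceed it, and quantization moves the weight term by 4x on its own, which is exactly where both the over-provisioning and the bin-packing opportunity come from.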