What Your Inference Provider Is Hiding From You: KV Cache, Batching, and the Latency Floor
You're running an LLM-powered application and your p99 latency is 4 seconds. You've tuned your prompts, reduced output length, and switched to streaming. The number barely moves. The problem is not your code — it's physics and queuing theory operating inside a black box you don't own.
Every inference provider makes dozens of architectural decisions that determine your application's performance ceiling before your first API call. KV cache eviction policy, continuous batching schedules, chunked prefill chunk size — none of this is in the docs, none of it is configurable by you, and all of it shapes the latency and cost curve you're stuck with.
This post explains what's actually happening inside inference infrastructure, why it creates an unavoidable latency floor, and the handful of things you can actually do about it.
What the KV Cache Is and Why It's Eating Your Budget
When a transformer model generates the next token, it computes attention over every prior token in the context. Without caching, generating a 2,000-token response over a 10,000-token prompt would require recomputing attention for those 10,000 tokens for each of the 2,000 decode steps — a quadratic blowup that would make LLMs practically unusable.
The KV cache solves this by storing the key and value tensors from the attention layers after each token is processed. On subsequent decode steps, the model reads from cache rather than recomputing. This is why generation is fast once it starts: each new token only requires computing attention for itself against the cached sequence.
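A toy operation count makes the savings concrete, using the 10,000-token prompt and 2,000-token response from above. This is pure arithmetic, not a real model, and the per-step costs are deliberately rough:

```python
def decode_ops(prompt_len, gen_len, use_kv_cache):
    """Roughly count attention-related work across all decode steps."""
    ops = 0
    for step in range(gen_len):
        seq_len = prompt_len + step + 1  # sequence length at this step
        if use_kv_cache:
            # only the new token's K/V is computed; it attends over the cache
            ops += seq_len
        else:
            # K/V and attention for the entire sequence are redone every step
            ops += seq_len * seq_len
    return ops

cached = decode_ops(10_000, 2_000, use_kv_cache=True)
uncached = decode_ops(10_000, 2_000, use_kv_cache=False)
print(f"work ratio without cache: ~{uncached / cached:,.0f}x")
```

The ratio lands around the current sequence length, which is exactly the quadratic-vs-linear gap the cache eliminates.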
The catch is memory. A single cached sequence for a 70B parameter model with a 128K context consumes roughly 40GB of high-bandwidth GPU memory (HBM). HBM is the fastest and most limited memory on a GPU — there isn't much of it, and it's shared across all concurrent requests. When the cache fills up, something has to be evicted. What gets evicted, and when, determines whether your next request starts with a warm cache or a cold one.
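The 40GB figure can be sanity-checked with back-of-the-envelope math. The shape below is an assumed Llama-70B-class configuration (80 layers, 8 grouped-query KV heads, head dimension 128, FP16); check your model's config for the real values:

```python
def kv_cache_gb(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # 2x for keys and values; FP16/BF16 is 2 bytes per element
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len
    return total_bytes / 1e9

# Assumed 70B-class shape with grouped-query attention
size = kv_cache_gb(n_layers=80, n_kv_heads=8, head_dim=128, seq_len=128_000)
print(f"{size:.1f} GB for one 128K-context sequence")
```

That is roughly 320 KB of cache per token, which is why a handful of long-context requests can exhaust an entire GPU's HBM.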
How providers decide what to evict is where things get interesting. Basic LRU (least recently used) is the floor — evict whatever was accessed least recently. But production systems layer priority scores, estimated reuse probability, and tail-latency optimization on top. Research on tail-optimized eviction policies shows up to 38.9% reduction in SLO violations compared to standard LRU. The algorithm your provider uses directly affects your p95 and p99 numbers.
Prefix Caching: The 90% Cost Reduction You're Probably Not Getting
The productized version of KV cache reuse is called prefix caching or prompt caching. Instead of only caching within a single request, the provider can cache the KV state for a shared prefix across multiple requests. If your application sends the same 5,000-token system prompt on every API call, a smart provider only has to process those 5,000 tokens once — subsequent requests that share that prefix reuse the cached state.
The economics are dramatic. On Anthropic's API, cache reads cost $0.30 per million tokens while fresh input processing costs $3.00 per million — a 90% discount. OpenAI offers 50% savings with automatic caching on prompts over 1,024 tokens. A chat application with a stable system prompt and per-session document context can cache 70%+ of its input tokens, cutting costs by more than half overnight.
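A quick sketch of what those rates mean for a hypothetical workload, using the Anthropic figures above. The request counts are made up, and this ignores the cache-write premium charged on the first request that populates a cache entry:

```python
def monthly_input_cost(requests, prompt_tokens, cached_fraction,
                       fresh_per_mtok=3.00, cache_read_per_mtok=0.30):
    """Input-token cost in dollars, given a fraction served from cache."""
    total = requests * prompt_tokens
    fresh = total * (1 - cached_fraction)
    cached = total * cached_fraction
    return (fresh * fresh_per_mtok + cached * cache_read_per_mtok) / 1e6

# 1M requests/month, 6,000 input tokens each
no_cache = monthly_input_cost(1_000_000, 6_000, cached_fraction=0.0)
with_cache = monthly_input_cost(1_000_000, 6_000, cached_fraction=0.8)
print(f"${no_cache:,.0f} vs ${with_cache:,.0f} per month")
```

At an 80% cache hit rate the input bill drops from $18,000 to about $5,000, a 72% reduction from prompt structure alone.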
Production data bears this out. Google Vertex AI reported a 40% reduction in time-to-first-token and 43% improvement in p50 latency from scheduling optimizations that treat cache awareness as a first-class constraint. Anthropic customers report 85% latency reduction for long prompts when cache hits occur.
But prefix caching has a structure requirement most developers violate without knowing it: the prefix must be identical, byte-for-byte, from the start of the prompt. Any variation early in the sequence busts the cache for everything that follows. This is why putting a timestamp, user ID, or session token at the beginning of your system prompt can silently cost you 10-50x more than a properly structured prompt.
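You can see the failure mode by checking where two consecutive prompts diverge: the cacheable prefix ends at the first differing byte. A minimal illustration with a stand-in system prompt:

```python
def shared_prefix_chars(a: str, b: str) -> int:
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            return n
        n += 1
    return n

SYSTEM = "You are a support assistant. " * 200  # stand-in for a long system prompt

# Timestamp first: the shared prefix ends after a handful of characters
bad_a = f"[2025-06-01T12:00:00Z] {SYSTEM} User: reset my password"
bad_b = f"[2025-06-01T12:00:05Z] {SYSTEM} User: update my email"

# Static content first: the entire system prompt is a shared, cacheable prefix
good_a = f"{SYSTEM} [2025-06-01T12:00:00Z] User: reset my password"
good_b = f"{SYSTEM} [2025-06-01T12:00:05Z] User: update my email"

print(shared_prefix_chars(bad_a, bad_b), shared_prefix_chars(good_a, good_b))
```

Same content, same token count, wildly different cache behavior: the first layout caches almost nothing, the second caches the whole system prompt.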
The Three Batching Modes and How They Affect Your Requests
Your inference request does not execute alone. Providers batch multiple requests together to saturate GPU compute — the economics of LLM inference only work at high utilization. How batching is implemented has direct consequences for your latency.
Static batching is the naive approach: collect a batch of requests, run them all in parallel, wait for every request in the batch to finish before starting the next batch. It's simple but wasteful. If your batch contains one 50-token request and one 2,000-token request, the GPU sits partially idle for the 1,950 tokens of difference.
Continuous batching (also called in-flight batching) fixes this by operating at the token level. When any request in a batch generates its final token, the scheduler immediately evicts it and slots in a waiting request for the next decode step. GPU utilization stays high regardless of output length variance. vLLM demonstrated 10-20x throughput improvements over static batching with this approach; Anyscale reported 23x in their benchmarks.
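An idealized simulation shows why. It counts decode steps only, ignoring prefill and memory limits: with static batching every batch occupies the GPU as long as its longest member, while continuous batching refills freed slots every step:

```python
import math

def static_batch_steps(output_lengths, slots):
    # each batch runs until its longest request finishes
    steps = 0
    for i in range(0, len(output_lengths), slots):
        steps += max(output_lengths[i:i + slots])
    return steps

def continuous_batch_steps(output_lengths, slots):
    # freed slots are backfilled immediately, so steps ~= total tokens / slots
    return math.ceil(sum(output_lengths) / slots)

lengths = [50, 2000, 100, 1800, 60, 1500]  # output lengths with high variance
print(static_batch_steps(lengths, slots=2), continuous_batch_steps(lengths, slots=2))
```

With high output-length variance the static scheduler spends most of its steps with idle slots; the token-level scheduler runs in roughly half the time in this toy case, and the gap widens with variance and batch size.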
Chunked prefill addresses a different problem. Processing a long prompt (the "prefill" phase) is compute-intensive and blocks all decode operations for other requests until it finishes. A 100K-token prefill can stall shorter requests for hundreds of milliseconds. Chunked prefill splits long prompts into fixed-size chunks and interleaves them with decode steps for other requests. Your long-prompt request takes slightly longer to start generating, but short requests in the queue stop getting blocked.
The tradeoff: chunked prefill improves overall fairness and p99 latency at the cost of higher time-to-first-token for the chunked request itself. There's no free lunch. If your application requires both sub-500ms TTFT and high throughput on large contexts, the only real solution is prefill-decode disaggregation — running prefill and decode on separate hardware pools. This is what providers like Together AI and large-scale deployments implement, but it's invisible to you as an API caller.
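A toy queueing model of the head-of-line blocking, with an assumed per-token prefill cost, makes the tradeoff concrete. A short request arriving just behind a 100K-token prefill waits for the whole thing without chunking, but only for the in-flight chunk with it:

```python
def short_request_wait_ms(long_prompt_tokens, chunk_tokens, us_per_token=5):
    """Toy model: queueing delay for a short request behind a long prefill."""
    # monolithic prefill: the short request waits for the entire prompt
    monolithic_ms = long_prompt_tokens * us_per_token / 1000
    # chunked prefill: the scheduler slots the short request in after one chunk
    chunked_ms = min(chunk_tokens, long_prompt_tokens) * us_per_token / 1000
    return monolithic_ms, chunked_ms

blocked_ms, chunked_ms = short_request_wait_ms(100_000, chunk_tokens=2_048)
print(f"monolithic: {blocked_ms:.0f} ms, chunked: {chunked_ms:.1f} ms")
```

At 5 microseconds per prefill token (an assumption, not a measured number), the short request's wait drops from 500ms to about 10ms, while the long request pays a small scheduling tax spread across its chunks.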
The Latency Floor You Cannot Code Your Way Out Of
The fundamental constraint is memory bandwidth, not compute. During the decode phase — generating tokens one by one — the GPU must read the entire KV cache from HBM for every single token it generates. This is memory-bound, not compute-bound. Adding more GPU cores doesn't help because the bottleneck is how fast you can shuttle data from memory, not how fast you can multiply matrices.
This creates several hard floors:
- Attention compute scales quadratically with sequence length; KV cache memory scales linearly. Doubling context length doubles the cache footprint but roughly quadruples prefill compute, because every token attends to every prior token. This is why long-context requests are dramatically more expensive and why providers charge more for them.
- Quantization doesn't rescue the KV cache. You can quantize model weights to reduce memory footprint, but the KV cache stores attention activations, not weights. INT4-quantized models face the same KV cache scaling as FP16 models.
- Batch size trades latency for throughput. Larger batches improve GPU utilization and reduce cost per token, but they increase per-request latency because each request waits longer for its slot. The provider's batch size decision reflects their optimization target — which may not match yours.
- Queue depth drives tail latency. When the provider is under load, requests queue. A single long request ahead of yours can add seconds to your TTFT. This is not something you can predict or route around at the API level.
Typical production TTFT numbers from 2025 benchmarks: GPT-4o mini averages 0.7–1.4 seconds, Gemini 2.5 Flash around 0.3 seconds, Llama 70B on optimized infrastructure 180–350ms. For Llama 405B, providers often set p99 TTFT targets around 6 seconds. These numbers reflect hardware, model size, and load — not your application code.
The Mistakes That Silently Destroy Your Cache Hit Rate
Research on production LLM applications found that busting the prefix cache is one of the most common and costly deployment mistakes — silently increasing costs by 10-50x in the worst cases.
Dynamic content early in the prompt. Your cache works left-to-right. Anything that changes request-to-request invalidates the cache for everything that follows it. A current timestamp in position one of your system prompt means zero cache hits, ever. The fix: structure prompts so all static content precedes all dynamic content. System prompt first, stable context second, dynamic user content last.
Per-user system prompt personalization. Including a user's name, preferences, or personalized instructions in the system prompt prevents cross-user cache sharing. Each user gets their own uncached prefix. For applications with millions of users, this compounds into enormous unnecessary compute.
Forgetting to mark cache breakpoints on Anthropic. Unlike OpenAI's automatic caching, Anthropic requires explicit cache_control markers on the content blocks you want cached; the prefix up to and including each marked block becomes a cache checkpoint. Without them, no prefix caching occurs at all. The cacheable prefix must be at least 1,024 tokens (more for some models), and you can set up to four checkpoints per request.
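The marker lives inside the request body. A sketch of the shape, following Anthropic's documented prompt-caching format but constructed as a plain dict rather than sent; the model id is a placeholder, so check the current docs before using:

```python
STABLE_SYSTEM_PROMPT = (
    "You are a claims-processing assistant. " * 100  # must exceed the minimum cacheable length
)

payload = {
    "model": "claude-sonnet-example",  # placeholder, not a real model id
    "max_tokens": 1024,
    "system": [
        {
            "type": "text",
            "text": STABLE_SYSTEM_PROMPT,
            # everything up to and including this block becomes a cache checkpoint
            "cache_control": {"type": "ephemeral"},
        }
    ],
    "messages": [{"role": "user", "content": "Summarize claim #1234."}],
}
```

Everything after the last marked block (here, the user message) is processed fresh on every request, which is exactly where the dynamic content belongs.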
Prompt template version drift. Even small wording changes to a system prompt bust the cache for every in-flight request that hasn't yet completed its 5-minute cache window. Treat prompt templates like versioned code artifacts and avoid changes during high-traffic periods.
The 62% problem. Studies on production deployments found that most teams don't realize they're reprocessing identical system prompts thousands of times per day. The cost appears as normal token usage — there's no "you're doing it wrong" signal in provider dashboards unless you explicitly track cache hit rates.
The Knobs You Actually Control
The honest list is shorter than you'd like:
Prompt structure. Arrange content to maximize prefix stability — static before dynamic, stable before variable. This is the highest-leverage change available to you with no API cost.
Explicit cache markers. On providers that require them, mark cache breakpoints wherever you have a stable, long-running prefix. This is table stakes for any application sending more than a few hundred requests per day.
Streaming vs. non-streaming. Streaming doesn't change inference latency — the tokens generate at the same rate. But perceived latency drops dramatically because users see output within the TTFT window rather than waiting for the full response. For interactive applications, enabling streaming is one of the highest-ROI changes available. The difference between "watching it think" and "watching it type" is the difference between a 4-second wait and an acceptable experience.
Output length constraints. Shorter outputs mean fewer decode steps, shorter KV cache growth, and faster completion. max_tokens is a ceiling, not a target — if you're consistently hitting it, you're generating more tokens than you likely need.
Provider selection. Different providers run different hardware, different scheduling algorithms, and different batch size defaults. For latency-critical workloads, benchmark rather than assume. The same model on different infrastructure can vary 5-10x in p99 TTFT under load.
Request timeout and retry strategy. A request stuck in a long queue will eventually time out. Aggressive timeouts with retry-on-timeout (not retry-on-everything) can surface queue depth problems as latency spikes rather than silent hangs. Some providers expose queue depth or estimated wait time in response headers — use them.
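A minimal sketch of the retry-on-timeout discipline. The provider call is a stand-in function here; real code would also cap total elapsed time and add jitter to the backoff:

```python
import time

def call_with_timeout_retry(make_request, timeout_s=10.0, max_retries=2, backoff_s=0.5):
    """Retry only on timeouts (a queue-depth symptom), not on every error."""
    for attempt in range(max_retries + 1):
        try:
            return make_request(timeout=timeout_s)
        except TimeoutError:
            if attempt == max_retries:
                raise  # surface the queue problem instead of hanging silently
            time.sleep(backoff_s * (attempt + 1))

# Stand-in provider call that times out twice, then succeeds
attempts = {"n": 0}
def flaky(timeout):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise TimeoutError("queued too long")
    return "ok"

print(call_with_timeout_retry(flaky, backoff_s=0.01))
```

Note what is not retried: model errors, bad requests, and rate limits all propagate immediately, so retries only absorb the transient queueing failures they can actually help with.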
What This Means for Architecture
The practical implication is that your inference provider is a co-designer of your application's performance profile. Understanding their infrastructure lets you design with, not against, their scheduling model.
A few principles that follow from this:
Treat your system prompt as infrastructure, not configuration. It should be stable, versioned, and structured to maximize cache hit rate. Changes should be deployed with the same care as code deploys.
Separate request types by latency profile. A 100K-context batch summarization job and a sub-second user-facing completion request should go to different endpoints, different providers, or at least different queue priorities. Most provider APIs offer batch endpoints with different SLA targets and pricing.
Monitor cache hit rate as a first-class metric. Your cost and latency both depend on it. If your provider exposes cache hit statistics, alert on drops. If they don't, you can infer it from per-token cost trends.
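If your provider returns cached-token counts in its usage block, the hit rate falls out directly. The field names below follow Anthropic's usage object, where cache_read_input_tokens is reported alongside uncached input_tokens; adjust the keys for your provider:

```python
def prompt_cache_hit_rate(usage_records):
    """Fraction of input tokens served from cache across a window of responses."""
    cached = sum(u.get("cache_read_input_tokens", 0) for u in usage_records)
    fresh = sum(u.get("input_tokens", 0) for u in usage_records)
    total = cached + fresh
    return cached / total if total else 0.0

window = [
    {"input_tokens": 500, "cache_read_input_tokens": 4500},
    {"input_tokens": 600, "cache_read_input_tokens": 4500},
    {"input_tokens": 5100, "cache_read_input_tokens": 0},  # cold start / cache miss
]
print(f"cache hit rate: {prompt_cache_hit_rate(window):.0%}")
```

Aggregate this per prompt-template version and alert on drops: a template change that busts the cache shows up here hours before it shows up on the invoice.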
Measure latency under load, not at zero concurrency. Most developers benchmark their application in isolation. Production adds queue depth, batching variance, and cache eviction pressure. p99 latency under 10x load may be 5x worse than your benchmark number — find this out before users do.
The latency floor is real, and it's not yours to lower. But the ceiling — how much worse than the floor you perform — is largely determined by how well your application cooperates with the infrastructure it runs on.
The underlying mechanisms described here — KV cache, continuous batching, chunked prefill — are implemented in open-source systems like vLLM and TensorRT-LLM. Running your own inference gives you full control over these knobs at the cost of operational complexity. For most teams, the right move is to understand the constraints, design around them, and pick providers whose optimization targets match your latency requirements.
- https://magazine.sebastianraschka.com/p/coding-the-kv-cache-in-llms
- https://docs.vllm.ai/en/stable/design/prefix_caching/
- https://arxiv.org/html/2601.06007
- https://www.anyscale.com/blog/continuous-batching-llm-inference
- https://developer.nvidia.com/blog/streamlining-ai-inference-performance-and-deployment-with-nvidia-tensorrt-llm-chunked-prefill/
- https://haoailab.com/blogs/distserve/
- https://nvidia.github.io/TensorRT-LLM/latest/features/kvcache.html
- https://arxiv.org/html/2510.15152v1
- https://platform.claude.com/docs/en/build-with-claude/prompt-caching
- https://developers.openai.com/api/docs/guides/prompt-caching
- https://blog.hathora.dev/a-deep-dive-into-llm-inference-latencies/
- https://introl.com/blog/prompt-caching-infrastructure-llm-cost-latency-reduction-guide-2025
- https://llm-d.ai/blog/predicted-latency-based-scheduling-for-llms
- https://blog.vllm.ai/2024/09/05/perf-update.html
- https://www.together.ai/guides/best-practices-to-accelerate-inference-for-large-scale-production-workloads
