Inference Is Faster Than Your Database Now
Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.
Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.
This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature-store call, the auth middleware, and the Postgres lookup that was always that slow; nobody cared while the model was taking nine-tenths of the budget.
How the ordering flipped
Three forces compounded in the last eighteen months, and any one of them would have moved the needle. Together they collapsed the model's share of end-to-end latency by more than half for most interactive workloads.
Prompt caching went mainstream. Cached prefill drops time-to-first-token by up to 80% compared with cold inference. A 10,000-token prompt that took 4.3 seconds to prefill on a cold path now returns the first token in 600ms. Provider-side caching is table stakes; every major inference stack — vLLM, SGLang, TensorRT-LLM — ships prefix-cache support, and llm-d-style distributed KV caching is pulling the same gains into multi-replica deployments.
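To make that concrete, here's a minimal sketch of cache-friendly prompt assembly. The constant and helper names are illustrative, not any provider's API; the one rule every prefix cache shares is that it matches on an exact token prefix, so stable content has to precede volatile content.

```python
# Sketch: ordering a prompt for prefix-cache hits. Provider-side caches and
# vLLM/SGLang-style prefix caching match on an exact token prefix, so
# anything that changes per request must come after everything that doesn't.
# SYSTEM_PROMPT, TOOL_SCHEMAS, ACCOUNT_CONTEXT, and build_messages are
# invented names for illustration.

SYSTEM_PROMPT = "You are a support assistant for AcmeCo..."  # stable across all users
TOOL_SCHEMAS = "..."      # stable across all users
ACCOUNT_CONTEXT = "..."   # stable per session, changes rarely

def build_messages(history: list[dict], user_message: str) -> list[dict]:
    """Stable content first, volatile content last.

    Cache hit depth equals the longest unchanged prefix, so a timestamp
    or request ID near the top would zero out every hit.
    """
    return [
        {"role": "system", "content": SYSTEM_PROMPT + "\n" + TOOL_SCHEMAS},
        {"role": "system", "content": ACCOUNT_CONTEXT},
        *history,                                   # append-only: old turns stay byte-identical
        {"role": "user", "content": user_message},  # the only part that changes every turn
    ]
```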
Speculative decoding moved from paper to production. Draft-model speculative decoding now ships as a default feature in vLLM and delivers 1.4–1.6x speedup on general workloads and far more on narrow ones. On gpt-oss-class models, speculative decoding has reduced per-token latency by roughly 40% in benchmark runs; specialized code-merging models reach 10,000+ tokens per second. Decode, once the inescapable serial bottleneck, is now a lane where three tokens emerge in the time it used to take one.
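The mechanism is worth seeing in miniature. Below is a toy greedy-decoding version of the draft-and-verify loop; `draft_next` and `target_next` are stand-in callables, not any library's API, and a production implementation verifies all draft positions in a single batched target forward pass rather than one call per position.

```python
from typing import Callable, List

Token = int
NextFn = Callable[[List[Token]], Token]  # greedy "next token" given a sequence

def speculative_decode(prompt: List[Token], draft_next: NextFn,
                       target_next: NextFn, k: int = 4,
                       max_new: int = 64) -> List[Token]:
    """Greedy speculative decoding: output matches pure target decoding."""
    seq = list(prompt)
    produced = 0
    while produced < max_new:
        # 1. The cheap draft model proposes k tokens serially.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft_next(seq + proposal))
        # 2. The target verifies: accept the longest prefix where its own
        #    greedy choice agrees with the draft. (Shown per position here;
        #    in practice all k positions are scored in one forward pass,
        #    which is where the speedup comes from.)
        accepted = 0
        while accepted < k and target_next(seq + proposal[:accepted]) == proposal[accepted]:
            accepted += 1
        seq += proposal[:accepted]
        produced += accepted
        # 3. On the first disagreement, emit the target's own token, so
        #    every iteration yields at least one target-quality token.
        if accepted < k:
            seq.append(target_next(seq))
            produced += 1
    return seq
```

When draft and target agree often, each loop emits several tokens for roughly one target pass, which is exactly where the 1.4–1.6x general-workload gains (and the much larger narrow-workload ones) come from.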
API pricing collapsed, and providers competed on latency. API prices dropped approximately 80% between 2024 and 2026. Providers stopped competing purely on model quality and started competing on TTFT. Across major providers today, TTFT for the same model class varies by 3–5x under similar conditions — which means a provider swap can shave a full second off the user-visible floor without touching your own code.
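TTFT is also cheap to verify yourself rather than trusting a leaderboard. A rough sketch against any OpenAI-compatible streaming endpoint; the model name and prompt are placeholders, and `base_url` should point at the provider under test:

```python
import time
from openai import OpenAI  # works against any OpenAI-compatible endpoint

client = OpenAI()  # set base_url / api_key for the provider under test

def measure_ttft(model: str, prompt: str) -> float:
    """Seconds from request send until the first streamed content token."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        # Skip role/metadata-only chunks; the first real content marks TTFT.
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start
    raise RuntimeError("stream ended without content")
```

Take dozens of samples per provider and compare p50 and p95; a single measurement mostly measures your network and their cache state.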
The practical consequence: the LLM call's share of the trace has shrunk from 70–90% of end-to-end latency to something closer to 30–50% for cached interactive workloads. Everything else held still.
The new long tail
Pull the latency stack on an actual interactive AI feature today and the composition looks nothing like the screenshot on the infra team's dashboard from last year.
A vector search against a managed retriever adds 50–300ms to the response pipeline. A feature-store lookup against a provisioned DynamoDB table runs 15–40ms per feature at p50, and p99 can spike to hundreds of milliseconds under throttling. A Postgres query that joins three tables because the product surface asked for denormalized output can sit at 80ms p50 and 600ms p99. Auth middleware, the kind that validates a JWT, checks a session cache, and refreshes a profile, adds its own 50–150ms on a cache miss. Then the model runs.
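Add those up and the inversion is plain arithmetic. A back-of-envelope check using rough midpoints of the p50 ranges above (illustrative numbers, not a benchmark):

```python
# Serial p50 budget to first token for the pipeline above, using rough
# midpoints of the ranges quoted in the text. Purely illustrative.
p50_ms = {
    "auth_middleware": 100,   # JWT + session check, cache-miss path
    "feature_store":    25,   # per-feature DynamoDB read
    "vector_search":   150,   # managed retriever
    "postgres_join":    80,   # three-table join
    "model_ttft":      180,   # cached prefill, first token
}
total = sum(p50_ms.values())
model_share = p50_ms["model_ttft"] / total
print(f"total to first token: {total} ms, model share: {model_share:.0%}")
# -> total to first token: 535 ms, model share: 34%
```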
None of these components got slower. They stayed flat while the model shrank around them. The architecture they sit inside was designed assuming the model was the expensive part, and it treated everything else as free. That assumption is no longer true, and every place it's encoded — the serial ordering, the lack of parallelism, the pooled resources sized for the old profile — is now a latency bug.
The failure mode is cultural as much as technical. Dashboards still show "model: 90%" because they were built three architectures ago and nobody has re-run the analysis. On-call engineers still reach for "the model got slow" as the default hypothesis when a p99 alarm fires. The team lead still says "we can't make it faster, the model is what it is" — while the model is the one part of the stack that got 2x faster this year.
What stops parallelizing when the model shrinks
The most important architectural consequence of the shift is that many AI features have a request graph whose serial shape was the right call under the old latency profile and is a latency bug under the new one.
Here's the classic shape. The request enters; auth and session lookup happen serially; a retrieval step runs against a vector store; the retrieved context is assembled into a prompt; the model call fires; the response streams back. That DAG is correct but needlessly serial. Nobody minded when the model took 80% of the wall clock, because parallelizing a 20% segment doesn't move the needle; Amdahl wins, nobody bothers, life goes on.
Flip the model's share to 30% and the math changes. Parallelizing auth and retrieval, which don't depend on each other, shaves 100–300ms off a request where the model only takes 400ms. Prefetching retrieval based on the last assistant turn, so it's already warm when the user sends the next message, wins another 100–300ms. And promoting frequently read context inside the prompt-cache boundary, so the cached prefill stays warm across turns, converts what used to be a cold-path penalty into a steady-state free lunch. The sketch below wires up the first two.
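A minimal sketch of the reshaped graph, with stub services standing in for auth, retrieval, and the model call; the function names and sleep durations are invented for illustration:

```python
import asyncio

# Hypothetical stubs for real services; only the wiring matters here.
async def authenticate(token: str) -> dict:
    await asyncio.sleep(0.10)   # ~100ms auth middleware
    return {"user": "u_123"}

async def retrieve(query: str) -> list[str]:
    await asyncio.sleep(0.15)   # ~150ms vector search
    return [f"doc for {query!r}"]

async def call_model(prompt: str) -> str:
    await asyncio.sleep(0.40)   # ~400ms cached-prefill model call
    return "response"

async def handle_turn(token: str, message: str,
                      prefetched: asyncio.Task | None = None) -> str:
    # Old shape: auth -> retrieve -> model, strictly serial (~650ms here).
    # New shape: auth and retrieval don't depend on each other, so overlap
    # them (~550ms); with a warm prefetch, retrieval is free (~500ms).
    retrieval = prefetched or asyncio.create_task(retrieve(message))
    session, docs = await asyncio.gather(authenticate(token), retrieval)
    return await call_model("\n".join(docs) + "\n" + message)

async def main() -> None:
    answer = await handle_turn("jwt...", "first question")
    # Speculatively retrieve for the next turn, keyed off the last
    # assistant message, while the user is still reading and typing.
    warm = asyncio.create_task(retrieve(answer))
    print(await handle_turn("jwt...", "follow-up", prefetched=warm))

asyncio.run(main())
```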
None of these are new techniques. They were discussed, ignored, and implemented halfheartedly for years because the model was the dominant cost and these optimizations were rounding error. Now they're the whole game. The teams who win the latency race in 2026 are the ones willing to treat their request graph as an optimization problem again, the way backend engineers did in 2018 before everyone collectively forgot.
- https://www.morphllm.com/llm-inference
- https://www.morphllm.com/tokens-per-second
- https://artificialanalysis.ai/methodology/performance-benchmarking
- https://www.databricks.com/blog/llm-inference-performance-engineering-best-practices
- https://dev.to/lukas_brunner/the-rise-of-inference-optimization-the-real-llm-infra-trend-shaping-2026-4e4o
- https://arxiv.org/html/2412.11854v1
- https://dl.acm.org/doi/full/10.1145/3695053.3731093
- https://huggingface.co/blog/tngtech/llm-performance-prefill-decode-concurrent-requests
- https://developers.openai.com/cookbook/examples/prompt_caching_201
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://llm-d.ai/blog/kvcache-wins-you-can-see
- https://aws.amazon.com/blogs/database/build-an-ultra-low-latency-online-feature-store-for-real-time-inferencing-using-amazon-elasticache-for-redis/
- https://last9.io/blog/postgresql-performance/
- https://redis.io/blog/p99-latency/
- https://developers.redhat.com/articles/2026/04/16/performance-improvements-speculative-decoding-vllm-gpt-oss
- https://www.tecton.ai/blog/combining-online-stores-for-real-time-serving/
- https://milvus.io/ai-quick-reference/what-is-an-acceptable-latency-for-a-rag-system-in-an-interactive-setting-eg-a-chatbot-and-how-do-we-ensure-both-retrieval-and-generation-phases-meet-this-target
