
Inference Is Faster Than Your Database Now

10 min read
Tian Pan
Software Engineer

Open any 2024-era AI feature's trace and the model call is the whale. Eight hundred milliseconds of generation surrounded by a thin crust of retrieval, auth, and a database lookup rounding to nothing. Every architecture decision that year — the caching, the prefetching, the streaming UX — was designed around hiding that whale.

Now pull the same trace for the same feature running on a 2026 inference stack. The whale is a dolphin. A cached prefill returns the first token in 180ms. Decode streams at 120 tokens per second. The model is no longer the slow node. Your own infrastructure is, and most of it hasn't noticed.

This reordering is the most important performance shift of the year, and it's the one teams keep under-reacting to. The p99 floor on an AI request is now set by the feature-store call, the auth middleware, and the Postgres lookup that was always that slow; nobody cared while the model was taking nine-tenths of the budget.

How the ordering flipped

Three forces compounded in the last eighteen months, and any one of them would have moved the needle. Together they collapsed the model's share of end-to-end latency by more than half for most interactive workloads.

Prompt caching went mainstream. Cached prefill drops time-to-first-token by up to 80% compared with cold inference. A 10,000-token prompt that took 4.3 seconds to prefill on a cold path now returns the first token in 600ms. Provider-side caching is table stakes; every major inference stack — vLLM, SGLang, TensorRT-LLM — ships prefix-cache support, and llm-d-style distributed KV caching is pulling the same gains into multi-replica deployments.
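
If you self-host, turning this on is close to a one-line change. Here's a minimal sketch using vLLM's prefix-caching flag; the model name is a placeholder, and the long, stable prefix has to come first in the prompt for the cache to do anything:

```python
# Sketch: automatic prefix caching in a self-hosted vLLM deployment.
# Model name is a placeholder; the stable system prompt goes first so
# every request shares the same cached prefix.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    enable_prefix_caching=True,                # reuse KV cache across shared prefixes
)

stable_prefix = open("system_prompt.txt").read()  # long, rarely-changing context
params = SamplingParams(max_tokens=256, temperature=0.2)

for user_msg in ["Summarize my last order", "What changed in the new plan?"]:
    # Only the user-specific suffix pays the prefill cost after the first call.
    out = llm.generate([f"{stable_prefix}\n\nUser: {user_msg}\nAssistant:"], params)
    print(out[0].outputs[0].text)
```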

Speculative decoding moved from paper to production. Draft-model speculative decoding now ships as a first-class feature in vLLM and delivers 1.4–1.6x speedups on general workloads and far more on narrow ones. On gpt-oss-class models, speculative decoding has reduced per-token latency by roughly 40% in benchmark runs; specialized code-merging models reach 10,000+ tokens per second. Decode, once the inescapable serial bottleneck, is now a lane where three tokens emerge in the time it used to take to produce one.
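
For the self-hosted path, here's a hedged sketch of turning it on in vLLM. The configuration shape has shifted across releases, so treat the argument names below as illustrative rather than canonical, and both model names as placeholders:

```python
# Sketch: draft-model speculative decoding in vLLM. The argument shape has
# changed across releases; this follows the dict-style speculative_config,
# and both model names are placeholders.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",        # target model (placeholder)
    speculative_config={
        "model": "meta-llama/Llama-3.1-8B-Instruct",  # small draft model (placeholder)
        "num_speculative_tokens": 4,                  # draft tokens proposed per step
    },
)

out = llm.generate(
    ["Rewrite this function to use a context manager:\n..."],
    SamplingParams(max_tokens=512),
)
print(out[0].outputs[0].text)
```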

API pricing collapsed, and providers competed on latency. API prices dropped approximately 80% between 2024 and 2026. Providers stopped competing purely on model quality and started competing on TTFT. Across major providers today, TTFT for the same model class varies by 3–5x under similar conditions — which means a provider swap can shave a full second off the user-visible floor without touching your own code.
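
Measuring that variance yourself is cheap. Here's a sketch of a TTFT comparison against OpenAI-compatible streaming endpoints; the base URLs, keys, and model names are placeholders:

```python
# Sketch: comparing time-to-first-token across OpenAI-compatible streaming
# endpoints. Base URLs, API keys, and model names are placeholders.
import time
from openai import OpenAI

ENDPOINTS = {
    "provider_a": ("https://api.provider-a.example/v1", "model-a"),
    "provider_b": ("https://api.provider-b.example/v1", "model-b"),
}

def measure_ttft(base_url: str, model: str, prompt: str) -> float:
    client = OpenAI(base_url=base_url, api_key="...")  # key elided
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - start  # first content token arrived
    return float("nan")

for name, (url, model) in ENDPOINTS.items():
    print(name, f"{measure_ttft(url, model, 'ping'):.3f}s")
```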

The practical consequence: the LLM call's share of the trace has shrunk from 70–90% of end-to-end latency to something closer to 30–50% for cached interactive workloads. Everything else held still.

The new long tail

Pull the latency stack on an actual interactive AI feature today and the composition looks nothing like the screenshot on the infra team's dashboard from last year.

A vector search against a managed retriever adds 50–300ms to the response pipeline. A feature-store lookup against a provisioned DynamoDB table runs 15–40ms per feature at p50, and p99 can spike to hundreds of milliseconds under throttling. A Postgres query that joins three tables because the product surface asked for denormalized output can sit at 80ms p50 and 600ms p99. Auth middleware — the kind that validates a JWT, checks a session cache, and refreshes a profile — adds its own 50–150ms if the cache missed. Then the model runs.
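
A back-of-envelope sum makes the new shape concrete. The numbers below are illustrative midpoints of the ranges above, not measurements from any particular system:

```python
# Back-of-envelope with illustrative p50 midpoints from the ranges above,
# assuming the components run serially, as most current stacks still do.
components_ms = {
    "auth_middleware": 100,  # JWT check + session cache miss + profile refresh
    "feature_store":    90,  # ~3 features at ~30ms each
    "vector_search":   150,
    "postgres_join":    80,
    "model_cached":    400,  # cached prefill + a short streamed decode
}
total = sum(components_ms.values())
print(f"total ~{total}ms, model share ~{components_ms['model_cached'] / total:.0%}")
# total ~820ms, model share ~49% -- half the trace is yours, and that's the p50 path.
```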

None of these components got slower. They stayed flat while the model shrank around them. The architecture they sit inside was designed assuming the model was the expensive part, and it treated everything else as free. That assumption is no longer true, and every place it's encoded — the serial ordering, the lack of parallelism, the pooled resources sized for the old profile — is now a latency bug.

The failure mode is cultural as much as technical. Dashboards still show "model: 90%" because they were built three architectures ago and nobody has re-run the analysis. On-call engineers still reach for "the model got slow" as the default hypothesis when a p99 alarm fires. The team lead still says "we can't make it faster, the model is what it is" — while the model is the one part of the stack that got 2x faster this year.

What becomes worth parallelizing when the model shrinks

The most important architectural consequence of the shift is that many AI features have a latency graph that was topologically correct for the old model and topologically broken for the new one.

Here's the classic shape. The request enters; auth and session lookup happen serially; a retrieval step runs against a vector store; the retrieved context is assembled into a prompt; the model call fires; the response streams back. That DAG is legal but not efficient. It was efficient when the model took 80% of the wall clock, because parallelizing a 20% segment doesn't move the needle — Amdahl wins, nobody bothers, life goes on.

Flip the model's share to 30% and the math changes. Parallelizing auth and retrieval against each other (both are independent of the user's message for most queries) shaves 100–300ms off a request where the model only takes 400ms. Prefetching retrieval based on the last assistant turn, so it's already warm when the user sends the next message, wins another 100–300ms. And promoting frequently-read context inside the prompt-cache boundary, so the cached prefill stays warm across turns, converts what used to be a cold-path penalty into a steady-state free lunch.
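
Here's a sketch of the reshaped graph, with stub functions standing in for the real session check and retriever. The point is only that the model call starts after max(auth, retrieval) rather than their sum:

```python
# Sketch: auth and retrieval run concurrently instead of serially. The stubs
# simulate latency with sleeps; swap in your real session check and retriever.
import asyncio
import time

async def verify_session(token: str) -> str:
    await asyncio.sleep(0.10)  # auth middleware, cache-miss path
    return "user-123"

async def vector_search(query: str, top_k: int = 8) -> list[str]:
    await asyncio.sleep(0.20)  # managed retriever round trip
    return [f"passage-{i}" for i in range(top_k)]

async def handle_turn(token: str, message: str) -> None:
    start = time.perf_counter()
    # Neither call depends on the other, so don't serialize them.
    user, passages = await asyncio.gather(
        verify_session(token),
        vector_search(message),
    )
    # Prompt assembly and the model call start after max(auth, retrieval),
    # not auth + retrieval: ~200ms here instead of ~300ms.
    elapsed = time.perf_counter() - start
    print(f"ready for the model call after {elapsed * 1000:.0f}ms "
          f"(user={user}, passages={len(passages)})")

asyncio.run(handle_turn("jwt...", "What changed in the new plan?"))
```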

None of these are new techniques. They were discussed, ignored, and implemented halfheartedly for years because the model was the dominant cost and these optimizations were rounding error. Now they're the whole game. The teams who win the latency race in 2026 are the ones willing to treat their request graph as an optimization problem again, the way backend engineers did in 2018 before everyone collectively forgot.

The metrics your dashboards are missing

If the team's observability still rolls LLM calls into one opaque span with a duration attribute, the post-mortem will go in a circle. The interesting variance lives inside the span, and none of it surfaces.

The tracing surface that needs to exist now, at minimum (a span-level sketch follows the list):

  • Prefill and decode as separate events. Prefill is compute-bound and cacheable. Decode is bandwidth-bound and serial. Treating them as one number obscures every interesting fact about a slow request, including whether the slowness came from a cold cache or a bad batch position.
  • Cache-hit percentage on every span. If it's not there, the team can't distinguish a latency regression caused by a cache invalidation from one caused by a provider-side capacity event. Those require entirely different responses; mis-routing between them costs days of investigation.
  • Batch position and queue wait as attributes. On a shared endpoint, the user's request is competing for GPU time against other tenants. A p99 spike that correlates with batch position is a provisioning signal; one that doesn't is a prompt-shape signal. The dashboard has to show both.
  • Upstream provider request IDs. When the slowness is on the provider's side, the only way to file a useful support ticket is to reconcile your trace ID with theirs. Propagating their request ID through the span costs a few minutes to set up and saves hours every incident.
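
A minimal sketch of that span shape using the OpenTelemetry Python API. The attribute names are conventions to adopt, not an existing standard, and the fields on the result object are whatever your inference client actually exposes:

```python
# Sketch: prefill/decode as events plus cache, queue, batch, and provider-ID
# attributes on one span. The inference client and its result fields are
# placeholders; the attribute names are a convention, not a standard.
from opentelemetry import trace

tracer = trace.get_tracer("ai-feature")

def traced_generate(client, prompt: str):
    with tracer.start_as_current_span("llm.generate") as span:
        result = client.generate(prompt)  # placeholder inference client

        # Prefill and decode as separate events, not one opaque duration.
        span.add_event("prefill.complete", {"duration_ms": result.prefill_ms})
        span.add_event("decode.complete", {
            "duration_ms": result.decode_ms,
            "completion_tokens": result.completion_tokens,
        })

        # Cache hit ratio, queue wait, batch position, and the provider's ID.
        span.set_attribute("llm.cache.hit_ratio",
                           result.cached_tokens / max(result.prompt_tokens, 1))
        span.set_attribute("llm.queue.wait_ms", result.queue_wait_ms)
        span.set_attribute("llm.batch.position", result.batch_position)
        span.set_attribute("llm.provider.request_id", result.provider_request_id)
        return result
```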

Teams who stood up these primitives in 2025 are now routinely finding two-thirds of their AI-feature latency debt outside the model — in places that never showed up when the model dominated the trace. Teams that haven't are still spending three days debugging "the model got slow" when the model, again, didn't move at all.

The architecture-team conversation

There's a specific conversation that most AI platform teams need to have with their infra team in 2026, and most haven't.

The screenshot on the infra wiki showing "LLM inference: 90% of request budget" was taken in 2024 and hasn't been re-run. Multiple planning decisions are still being made against that screenshot — which services get optimization headcount, which components are treated as "worth caching," whether the prompt-construction path gets re-architected or left alone. The current distribution looks radically different, and the org hasn't internalized it.

The argument to present, concretely:

  1. Run the stacked latency bar over the last 18 months. Show the model's share shrinking and everything else holding flat. This is the single most effective chart for shifting the internal narrative; until the team sees it, they will keep optimizing the part that already got optimized for them. A sketch of how to produce it follows this list.
  2. Re-scope the optimization backlog. The feature-store latency work that was deprioritized because "we'd only save 20ms" is now saving 20ms out of a 400ms budget instead of 20ms out of a 4000ms budget. Relative impact went up 10x without any code change.
  3. Challenge the assumption that the model is the variance source. When a p99 alarm fires, the first hypothesis should no longer be "the provider is having a bad minute." It should be "something in our own stack got slower." The default hypothesis has to change or the time-to-resolution will keep blowing through SLOs.
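
Assuming per-component span durations get exported somewhere queryable, the chart from item 1 is a few lines of pandas. The column names and file path below are assumptions, not a required schema:

```python
# Sketch: stacked monthly share of end-to-end latency by component, assuming
# spans exported to a dataframe with (month, component, duration_ms) columns.
import pandas as pd
import matplotlib.pyplot as plt

spans = pd.read_parquet("trace_spans.parquet")  # placeholder export path

monthly = (
    spans.groupby(["month", "component"])["duration_ms"]
    .median()
    .unstack("component")
)
share = monthly.div(monthly.sum(axis=1), axis=0)  # each component's share per month

share.plot(kind="bar", stacked=True, figsize=(10, 4))
plt.ylabel("share of end-to-end p50")
plt.title("Where the request budget goes, by month")
plt.tight_layout()
plt.savefig("latency_share.png")
```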

That conversation goes poorly the first time. It gets labeled as the AI team sandbagging, or as the infra team trying to take credit for the provider's work. The team that pushes through anyway — ideally with data — gets the next year's latency budget back.

The shift is permanent, and it's going to keep moving

The trendline isn't stopping here. Inference providers are iterating on speculative decoding, distributed KV caching, and prefix-sharing faster than your database team is iterating on query plans. The gap between model latency and infra latency is going to keep widening until your own stack becomes the overwhelming majority of the budget.

The architectural discipline to adopt now:

  • Treat retrieval, auth, and feature lookups as the hot path, not the garnish. They are the hot path. They always were; now the numbers force the team to admit it.
  • Parallelize aggressively. Everything that doesn't depend on the user's current message can run before the message arrives or alongside auth. Prefetching is no longer premature optimization; a sketch of the prefetch pattern follows this list.
  • Measure by component, not by vendor. "The model is slow today" is an outdated excuse. The real question is which component of the request graph regressed, and the answer is almost never the component the team was paying most attention to.
  • Rebuild the mental model annually. The distribution will change again next year. Teams who re-run their latency analysis every two quarters will catch the next shift before it becomes an incident. Teams who don't will spend a quarter chasing ghost regressions.
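
Here's a sketch of that prefetch pattern, with an in-process dict standing in for whatever cache layer you actually run and a stub retriever in place of the real vector store:

```python
# Sketch: warming retrieval off the last assistant turn so context is ready
# before the next user message arrives. The dict stands in for a real cache
# layer, and the stub retriever for your actual vector store.
import asyncio

_prefetch_cache: dict[str, list[str]] = {}

async def vector_search(query: str) -> list[str]:
    await asyncio.sleep(0.2)  # stand-in for the managed retriever
    return [f"passage for: {query}"]

async def prefetch_for_next_turn(conversation_id: str, last_assistant_turn: str) -> None:
    # Fire-and-forget once the assistant's reply has streamed; by the time the
    # user responds, the likely context is already sitting in the cache.
    _prefetch_cache[conversation_id] = await vector_search(last_assistant_turn)

async def retrieve(conversation_id: str, user_message: str) -> list[str]:
    # Use the warm results if the prefetch guessed well; fall back to a live
    # query (or re-rank the warm set against the actual message) otherwise.
    warm = _prefetch_cache.pop(conversation_id, None)
    return warm if warm is not None else await vector_search(user_message)
```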

The model stopped being the bottleneck while the dashboards weren't looking. The teams who notice first get a free round of latency wins. The teams who don't will spend 2026 explaining why their AI features got faster on paper but slower in the product.
