Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.

Why LLM Latency Breaks Your Existing Intuitions

In a typical service, latency variance comes from queueing, GC pauses, and the occasional slow disk. The distribution is roughly log-normal; your p95 is maybe 3–5x your median. You can set a timeout, watch your percentiles, and page when the p99 drifts.

LLM latency doesn't work this way. Two factors make it structurally different:

Output length is an input to the latency calculation, not a constant. A request that generates 50 tokens finishes in a fraction of the time of one generating 800 tokens, and you often don't know which it will be until generation is complete. The resulting distribution isn't just wide; it is often bimodal. Short queries cluster at low latency, long-form generations cluster at high latency, and the shape of your distribution shifts with what users are actually asking.
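To make the shape concrete, here is a small simulation sketch. Every number in it is an assumption chosen for illustration (a fixed 0.3s TTFT, 30 tokens/second decoding, 20% of requests producing long outputs around 800 tokens); none of it is measured data. It shows how a mean and a median can both look tolerable while the p95 sits in the long-output cluster.

```python
import random
import statistics

def simulate_latencies(n=10_000, long_fraction=0.2, seed=0):
    """Simulate end-to-end latency for a mix of short and long generations.

    All parameters are illustrative assumptions, not measurements.
    """
    rng = random.Random(seed)
    ttft_s = 0.3          # assumed constant time to first token
    tokens_per_s = 30.0   # assumed decode rate
    latencies = []
    for _ in range(n):
        if rng.random() < long_fraction:
            tokens = rng.gauss(800, 80)   # long-form generations
        else:
            tokens = rng.gauss(50, 10)    # short answers
        tokens = max(1.0, tokens)
        latencies.append(ttft_s + tokens / tokens_per_s)
    return latencies

latencies = simulate_latencies()
mean_s = statistics.mean(latencies)
p50_s = statistics.median(latencies)
# quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
p95_s = statistics.quantiles(latencies, n=100)[94]

print(f"mean={mean_s:.1f}s  p50={p50_s:.1f}s  p95={p95_s:.1f}s")
```

Changing `long_fraction` moves the p95 far more than it moves the median, which is the practical meaning of "the shape shifts with what users are asking".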

You're renting capacity from an upstream provider. Unlike a self-hosted database, your LLM inference capacity is shared with other tenants. Cold queues, burst traffic, and provider incidents introduce latency spikes that are correlated — when the provider is slow, everyone's slow at once, exactly when you need reliability most. A system averaging 300ms TTFT under normal load can hit 4 seconds during peak periods, not because your code changed, but because the queue depth changed.

The practical consequence: if you set your SLO based on your median, you are making a promise to the median user while silently abandoning the rest. And unlike traditional services, you can't simply over-provision your way out of this — you don't own the inference compute.

Measure First: Breaking Down the Latency Stack

Before setting an SLO, you need to decompose what you're actually measuring. LLM latency has three distinct components:

Time to First Token (TTFT) — the delay from when you send the request until the first token streams back. This is what determines whether the UI feels responsive. For conversational interfaces, users notice TTFT above 200–300ms. For non-interactive features (background summarization, async classification), it matters much less. TTFT is driven by prompt processing, queue depth, and model prefill time.

Time Per Output Token (TPOT) — the average time to generate each token after the first; its inverse is the generation rate. For long outputs, this dominates total generation time. A model generating at 30 tokens/second will spend 10 seconds producing a 300-token response no matter how fast TTFT was. TPOT is largely determined by model architecture and GPU memory bandwidth.

End-to-end latency — the full round trip including your application server, any RAG retrieval, tool calls, post-processing, and network overhead. This is what your user actually experiences.

Each component needs its own SLO, and they need to be set independently. A blanket "p95 < 2s" target is almost certainly wrong for at least one of them and gives you no signal about which part of your stack to fix.
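One way to collect these numbers separately is to timestamp a token stream as it arrives. The sketch below assumes only that your client exposes the response as a Python iterable of tokens or chunks; `time_stream` and `StreamTiming` are hypothetical names, not part of any provider SDK. The clock is injectable so the function can be tested without real delays.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class StreamTiming:
    ttft_s: float        # request start -> first token
    tpot_s: float        # mean seconds per token after the first
    total_s: float       # request start -> last token
    output_tokens: int

def time_stream(
    tokens: Iterable[str],
    clock: Callable[[], float] = time.perf_counter,
) -> StreamTiming:
    """Consume a token stream and record TTFT, TPOT, and total time."""
    start = clock()
    first_at = None
    last_at = start
    count = 0
    for _ in tokens:
        now = clock()
        if first_at is None:
            first_at = now
        last_at = now
        count += 1
    if first_at is None:
        raise ValueError("stream produced no tokens")
    decode_s = last_at - first_at
    # TPOT is undefined for a single-token response; report 0.0 in that case.
    tpot_s = decode_s / (count - 1) if count > 1 else 0.0
    return StreamTiming(
        ttft_s=first_at - start,
        tpot_s=tpot_s,
        total_s=last_at - start,
        output_tokens=count,
    )
```

Call `time_stream` around the streaming iterator as close to the provider call as possible, and time retrieval and post-processing separately, so that end-to-end latency can be reconstructed from its parts rather than measured as a single opaque number.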

A practical starting point:

  • TTFT p95 < 500ms for user-facing conversational features
  • TTFT p95 < 2s for non-conversational features (document analysis, report generation)
  • End-to-end p95 < 5s for complex agent tasks with tool use
  • Batch/async features: throughput SLO, not latency SLO

These numbers need calibration against your actual usage data. Measure your real distribution for two weeks before committing to an SLO.
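Once you have samples, checking them against the targets is mechanical. The following is a minimal sketch that uses a nearest-rank percentile; the metric names are hypothetical labels, and the targets are simply the starting points listed above expressed in seconds.

```python
import math

# p95 targets in seconds, taken from the starting points above.
# The metric names are hypothetical labels for this sketch.
P95_TARGETS_S = {
    "chat_ttft": 0.5,
    "document_ttft": 2.0,
    "agent_end_to_end": 5.0,
}

def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]

def evaluate_slos(samples_by_metric, targets=P95_TARGETS_S):
    """Return {metric: (observed_p95_s, meets_target)} for metrics with data."""
    report = {}
    for metric, target_s in targets.items():
        samples = samples_by_metric.get(metric)
        if not samples:
            continue  # no data yet: report nothing rather than a false pass
        observed = percentile(samples, 95)
        report[metric] = (observed, observed < target_s)
    return report
```

With only 20 samples, a single slow request does not move the nearest-rank p95 but two do, which is one reason a short measurement window can give a misleadingly clean result.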

Setting the Latency Budget: Who Owns What

Once you have separate SLOs for TTFT and end-to-end, you need to assign a budget to each layer in your stack. This is where teams usually go wrong — they treat the provider's latency as the entire budget and forget that their application code, orchestration overhead, and retrieval steps all consume time too.

A realistic budget for a RAG-based chat feature hitting an SLO of p95 < 1.5s end-to-end might look like:

  • Network round trip (client to server): 50ms
  • Application routing and auth: 30ms
  • Embedding + vector search: 150ms
  • Context assembly and prompt construction: 50ms
  • LLM TTFT: 400ms
  • Remaining generation (streaming to client): unbounded, but first content visible by 680ms
  • Post-processing and response formatting: 20ms
  • Total budget to first meaningful content: ~700ms p95

The moment you lay this out, two things become obvious. First, there is no slack. Every layer needs to hit its sub-budget or the top-level SLO fails even if the model is fast. Second, the model is not the only variable — embedding latency, vector search at p95, and your application server's cold paths all matter and are individually measurable.
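The budget is easier to audit as a running total. This sketch encodes the bounded stages from the list above, in request order, and computes the cumulative time after each one. The unbounded streaming step is left out, and the stage labels are shortened for readability. Note that adding per-stage p95 values is a planning approximation: percentiles are not additive, so the real end-to-end p95 has to be measured directly.

```python
from itertools import accumulate

# (stage, p95 budget in ms), in request order, from the budget above.
BUDGET_MS = [
    ("network round trip", 50),
    ("routing and auth", 30),
    ("embedding + vector search", 150),
    ("context assembly", 50),
    ("LLM TTFT", 400),
    ("post-processing", 20),
]

def timeline(budget):
    """Return (stage, cumulative_ms) pairs in request order."""
    totals = accumulate(ms for _, ms in budget)
    return [(stage, total) for (stage, _), total in zip(budget, totals)]

steps = timeline(BUDGET_MS)
total_ms = steps[-1][1]

for stage, cumulative_ms in steps:
    print(f"{stage:<28}{cumulative_ms:>5} ms")
```

Keeping the budget as data rather than as a slide makes it cheap to rerun the arithmetic whenever a stage is added or renegotiated.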

Treat this budget as a contract between teams. The platform team owns the LLM TTFT budget. The application team owns the RAG and orchestration budgets. When you violate the top-level SLO, you know exactly which component burned the budget.
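A sketch of what "knowing which component burned the budget" can look like in code. The stage keys and the `budget_violations` helper are hypothetical. The ownership of the LLM TTFT and RAG stages follows the split described above, while assigning network, routing, and post-processing to the application team is an assumption made for this example.

```python
# Per-stage p95 budgets in ms (from the budget above) and the owning team.
BUDGET_MS = {
    "network": 50,
    "routing_auth": 30,
    "retrieval": 150,
    "prompt_assembly": 50,
    "llm_ttft": 400,
    "post_processing": 20,
}
OWNERS = {
    "network": "application",          # assumption
    "routing_auth": "application",     # assumption
    "retrieval": "application",
    "prompt_assembly": "application",
    "llm_ttft": "platform",
    "post_processing": "application",  # assumption
}

def budget_violations(observed_p95_ms):
    """Return (stage, owner, overage_ms) for stages over budget, worst first."""
    over = []
    for stage, limit_ms in BUDGET_MS.items():
        observed = observed_p95_ms.get(stage)
        if observed is not None and observed > limit_ms:
            over.append((stage, OWNERS[stage], observed - limit_ms))
    return sorted(over, key=lambda v: v[2], reverse=True)

# Hypothetical measurements for one reporting window.
observed = {
    "network": 45, "routing_auth": 30, "retrieval": 240,
    "prompt_assembly": 40, "llm_ttft": 420, "post_processing": 10,
}
violations = budget_violations(observed)
```

In this made-up window the retrieval stage is the largest overage, so the conversation starts with the application team even though the model was also slightly over its share.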

Strategies That Actually Work

Once you have the SLO and the budget, you have three categories of tactics: reduce expected latency, reduce tail latency, and hide latency from users.
