Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic
Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.
This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.
Why LLM Latency Breaks Your Existing Intuitions
In a typical service, latency variance comes from queueing, GC pauses, and the occasional slow disk. The distribution is roughly log-normal; your p95 is maybe 3–5x your median. You can set a timeout, watch your percentiles, and page when the p99 drifts.
LLM latency doesn't work this way. Two factors make it structurally different:
Output length is input to the latency calculation, not a constant. A request that generates 50 tokens finishes in a fraction of the time of one generating 800 tokens — and you often don't know which it will be until generation is complete. This creates a distribution that isn't just wide; it's bimodal. Short queries cluster at low latency, long-form generations cluster at high latency, and the shape of your distribution shifts based on what users are actually asking.
You're renting capacity from an upstream provider. Unlike a self-hosted database, your LLM inference capacity is shared with other tenants. Cold queues, burst traffic, and provider incidents introduce latency spikes that are correlated — when the provider is slow, everyone's slow at once, exactly when you need reliability most. A system averaging 300ms TTFT under normal load can hit 4 seconds during peak periods, not because your code changed, but because the queue depth changed.
The practical consequence: if you set your SLO based on your median, you are making a promise to the median user while silently abandoning the rest. And unlike traditional services, you can't simply over-provision your way out of this — you don't own the inference compute.
Measure First: Breaking Down the Latency Stack
Before setting an SLO, you need to decompose what you're actually measuring. LLM latency has three distinct components:
Time to First Token (TTFT) — the delay from when you send the request until the first token streams back. This is what determines whether the UI feels responsive. For conversational interfaces, users notice TTFT above 200–300ms. For non-interactive features (background summarization, async classification), it matters much less. TTFT is driven by prompt processing, queue depth, and model prefill time.
Time Per Output Token (TPOT) — the token generation rate after the first token. For long outputs, this dominates total generation time. A model generating at 30 tokens/second will take 10 seconds for a 300-token response regardless of how fast TTFT was. TPOT is largely determined by model architecture and GPU memory bandwidth.
End-to-end latency — the full round trip including your application server, any RAG retrieval, tool calls, post-processing, and network overhead. This is what your user actually experiences.
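To make the decomposition concrete, here's a minimal sketch that measures TTFT and average TPOT from any streamed token iterator. The `token_iter` argument stands in for your real client's streaming response; it's an assumption, not a specific SDK.

```python
import time

def measure_stream(token_iter):
    """Measure TTFT and average TPOT from an iterable of streamed tokens.

    `token_iter` is assumed to yield tokens as they arrive from the
    provider; the name is illustrative, not a real SDK object.
    """
    start = time.monotonic()
    ttft = None
    count = 0
    for _ in token_iter:
        now = time.monotonic()
        if ttft is None:
            ttft = now - start  # time to first token
        count += 1
    total = time.monotonic() - start
    # TPOT averages generation time over the tokens after the first.
    tpot = (total - ttft) / (count - 1) if count > 1 else 0.0
    return ttft, tpot, total
```

Wiring this around your client in production gives you the two signals you need to alert on separately, without waiting on the provider to expose them.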
Each component needs its own SLO, and they need to be set independently. A blanket "p95 < 2s" target is almost certainly wrong for at least one of them and gives you no signal about which part of your stack to fix.
A practical starting point:
- TTFT p95 < 500ms for user-facing conversational features
- TTFT p95 < 2s for non-conversational features (document analysis, report generation)
- End-to-end p95 < 5s for complex agent tasks with tool use
- Batch/async features: throughput SLO, not latency SLO
These numbers need calibration against your actual usage data. Measure your real distribution for two weeks before committing to an SLO.
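The calibration step itself is a few lines once you've logged per-request latencies. A sketch, assuming you've exported end-to-end latencies in milliseconds; the synthetic log-normal-plus-spikes distribution below is illustrative only:

```python
import random
import statistics

# Simulated per-request end-to-end latencies in ms. Real data would
# come from your tracing backend; this mix (log-normal body plus a
# heavy tail of provider-queue spikes) is an assumption for illustration.
random.seed(7)
latencies = [random.lognormvariate(6.0, 0.6) for _ in range(10_000)]
latencies += [random.uniform(4_000, 12_000) for _ in range(300)]

# quantiles(n=100) returns the 99 percentile cut points p1..p99.
qs = statistics.quantiles(latencies, n=100)
p50, p95, p99 = qs[49], qs[94], qs[98]
print(f"p50={p50:.0f}ms  p95={p95:.0f}ms  p99={p99:.0f}ms")
```

Running this against two weeks of real traffic, segmented by feature, is what turns the starting-point numbers above into an SLO you can defend.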
Setting the Latency Budget: Who Owns What
Once you have separate SLOs for TTFT and end-to-end, you need to assign a budget to each layer in your stack. This is where teams usually go wrong — they treat the provider's latency as the entire budget and forget that their application code, orchestration overhead, and retrieval steps all consume time too.
A realistic budget for a RAG-based chat feature hitting an SLO of p95 < 1.5s end-to-end might look like:
- Network round trip (client to server): 50ms
- Application routing and auth: 30ms
- Embedding + vector search: 150ms
- Context assembly and prompt construction: 50ms
- LLM TTFT: 400ms
- Remaining generation (streaming to client): unbounded, but first content visible by ~680ms
- Post-processing and response formatting: 20ms
- Total budget to first meaningful content: ~700ms p95
The moment you lay this out, two things become obvious. First, there is no slack. Every layer needs to hit its sub-budget or the top-level SLO fails even if the model is fast. Second, the model is not the only variable — embedding latency, vector search at p95, and your application server's cold paths all matter and are individually measurable.
Treat this budget as a contract between teams. The platform team owns the LLM TTFT budget. The application team owns the RAG and orchestration budgets. When you violate the top-level SLO, you know exactly which component burned the budget.
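One way to make that contract enforceable is to keep the budget as data and check it in CI. The component names below mirror the example budget; the helper is a sketch, not a standard tool:

```python
# Latency budget as data, so the cross-team contract is checkable.
# Numbers mirror the example RAG budget above (all in ms, at p95).
BUDGET_MS = {
    "network_rtt": 50,
    "routing_auth": 30,
    "embed_vector_search": 150,
    "context_assembly": 50,
    "llm_ttft": 400,
    "post_processing": 20,
}
TARGET_FIRST_CONTENT_MS = 700

def check_budget(budget, target):
    """Fail loudly if the sub-budgets no longer fit inside the SLO."""
    total = sum(budget.values())
    assert total <= target, f"budget over by {total - target}ms"
    return total

total = check_budget(BUDGET_MS, TARGET_FIRST_CONTENT_MS)
print(f"allocated {total}ms of {TARGET_FIRST_CONTENT_MS}ms")
```

The same dict doubles as the source of truth for per-component alerts: each key gets its own p95 monitor against its own number.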
Strategies That Actually Work
Once you have the SLO and the budget, you have three categories of tactics: reduce expected latency, reduce tail latency, and hide latency from users.
Reduce Expected Latency
Prompt caching is the highest-leverage optimization most teams under-use. If your system prompt is long and static, caching the KV representation at the inference layer can reduce TTFT by 40–60%. This doesn't require code changes — it requires understanding that prompt caching is a feature you need to opt into and verify is working.
Output length control is the most underrated lever. Every additional output token costs latency. Explicit constraints on response format, maximum token limits set on the API call, and output templates that front-load the critical information all reduce total generation time without sacrificing quality for the actual task. Don't leave max_tokens at its default.
Smaller models for classification steps — if your pipeline has any step that makes a binary decision (route this request, check this condition, classify this sentiment), run that step on a small fast model. A classification step that routes to a large model for the actual task doesn't need GPT-4-class reasoning. The latency reduction is often 5–10x for these sub-tasks.
Reduce Tail Latency
Hedged requests are the single most effective technique for reducing p99 without reducing mean latency proportionally. The pattern, based on Google's "Tail at Scale" work, works like this: send your request, and if it hasn't responded by the time you hit your p95 threshold for that type of request, send a second request to an alternate replica or provider. Return whichever response arrives first, cancel the other. Production deployments report p99 latency reductions of 75–96% with only 5–10% additional cost overhead from the duplicate requests.
For LLMs specifically, hedging works well when:
- You have access to multiple inference endpoints (different regions, providers, or replicas)
- The request is idempotent (same prompt, same result expected)
- You don't have stateful tool calls that would be dangerous to duplicate
For agent tasks with tool calls that have side effects, hedging is not safe. Know which of your request types are idempotent before wiring this up.
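A minimal asyncio sketch of the hedging pattern, assuming `send` is your (idempotent) client call and the replica names are placeholders:

```python
import asyncio

async def hedged_request(send, hedge_after: float):
    """Send to a primary endpoint; if no response within `hedge_after`
    seconds (your p95 for this request type), fire a duplicate and keep
    whichever finishes first. `send(replica)` is a placeholder for your
    real client call and is assumed idempotent.
    """
    primary = asyncio.create_task(send("primary"))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after)
    if done:
        return primary.result()

    backup = asyncio.create_task(send("backup"))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    # Cancel the loser. Note this cancels the local task only; the
    # provider-side request keeps running unless your client propagates
    # cancellation upstream.
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    return done.pop().result()
```

Setting `hedge_after` at your per-route p95 is what keeps the duplicate-request overhead in the 5–10% range: only the slowest 5% of requests ever trigger a hedge.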
Speculative execution takes hedging further: instead of sending duplicate requests reactively, you precompute likely paths proactively. If your flow has a classification step followed by one of two branches, start both branches simultaneously and discard the one that loses. This works particularly well when one branch is dramatically more likely — if 90% of requests go to the fast path, the wasted compute on the 10% you discard is small relative to the latency you save for all users.
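The same idea as a sketch, where `classify`, `fast_path`, and `slow_path` are placeholders for your real pipeline stages:

```python
import asyncio

async def speculative(classify, fast_path, slow_path):
    """Start both branches while the classifier runs; once the route is
    known, cancel the branch that lost. All three callables are
    placeholders for your pipeline stages and are assumed side-effect-free
    until the winner is committed.
    """
    fast = asyncio.create_task(fast_path())
    slow = asyncio.create_task(slow_path())
    route = await classify()
    winner, loser = (fast, slow) if route == "fast" else (slow, fast)
    loser.cancel()  # discard the speculative branch that lost
    return await winner
```

The classifier's latency is fully hidden: by the time it returns, the winning branch has already been running for that long.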
Timeout tiering replaces the single static timeout that teams typically set. Instead of "fail if no response in 10 seconds," set separate timeouts for each latency tier:
- A "fast timeout" that triggers hedging (sending the duplicate request)
- A "slow timeout" that returns a degraded response (show what you have, or fall back to a simpler answer)
- A "final timeout" that hard-fails with an error
This keeps the system responsive even when the model is slow. Users see partial or degraded responses rather than a spinner followed by an error.
Hide Latency from Users
Streaming is the most effective single technique for perceived latency reduction, even when it doesn't reduce actual end-to-end latency at all. Sending the first tokens to the client as soon as they're generated means users see a response appearing within your TTFT budget, even if the full response takes several more seconds. For most conversational features, this cuts perceived latency to under a second.
Front-loading the answer in your prompt design complements streaming. If you instruct the model to put the most important information first and the supporting detail second, users who read the stream as it appears get value immediately. The model naturally tends toward elaborate setup before conclusion — you need to explicitly prompt against this.
Progress indicators for non-streaming contexts — for features where streaming isn't practical (batch exports, document analysis, complex report generation), show meaningful progress signals. "Analyzing 47 data points" is better than a spinner. Accurate progress signals change the user's experience even when the underlying latency is unchanged.
Operating Under These Constraints
Setting an SLO is a one-time exercise. Operating against it is a continuous one.
Track burn rate, not just current p95. A burn rate alert tells you that you're consuming your error budget at 14x the sustainable rate (fast enough to exhaust a 30-day budget in about two days) before your current p95 ever crosses the threshold. Traditional latency alerts fire when you're already in violation; burn rate alerts fire when you're heading toward one.
Separate real-time and batch SLOs. Batch workloads don't need the same latency targets as real-time user interactions, and conflating them inflates your apparent p95. Track them independently and set distinct objectives. This also lets you make deliberate quality/latency tradeoffs: route non-urgent requests to cheaper, slower endpoints when real-time capacity is constrained.
Freeze deployments when error budgets drop below 10%. This isn't a policy you'll want in every org, but for teams that have experienced cascading failures triggered by deployments during high-error-budget-burn periods, it's a forcing function that prevents bad timing from turning an incident into a production crisis.
Monitor TTFT and TPOT separately in production. Alerts on end-to-end latency alone don't tell you whether the problem is prefill (prompt too long, provider queue backed up) or generation (model running slow, output unexpectedly long). The fix is different in each case, and you need the signal to route to the right team.
The Uncomfortable Admission
The techniques above will meaningfully improve your p95. They will not make your p99 look like a deterministic service's p99. There is irreducible variance in LLM latency that no application-layer technique fully eliminates — provider incidents, long outputs, cold starts, and model updates all introduce spikes that are difficult to fully hedge away.
The honest engineering approach is to acknowledge this in your SLO tier design. Some features genuinely cannot offer a p95 < 1s guarantee with current technology, and forcing them to do so means either degrading quality (smaller models, shorter outputs) or lying (calling a p80 your p95). Better to define a two-tier SLO — a tighter one for the common case and an honest tail budget for the rest — and design your degradation strategy explicitly.
Users who understand that a complex task takes up to 10 seconds are better served than users who were promised "instant" and get a 30-second timeout error. Set the SLO you can actually hit, instrument everything, and iterate from a basis of truth.
The hard part of AI performance engineering isn't the optimization techniques — it's building the measurement infrastructure to know whether you're winning. Start there. Everything else follows from having an honest view of your actual distribution.
