
Latency Budgets for AI Features: How to Set and Hit p95 SLOs When Your Core Component Is Stochastic

· 11 min read
Tian Pan
Software Engineer

Your system averages 400ms end-to-end. Your p95 is 4.2 seconds. Your p99 is 11 seconds. You committed to a "sub-second" experience in the product spec. Every metric in your dashboard looks fine until someone asks what happened to 5% of users — and suddenly the average you've been celebrating is the thing burying you.

This is the latency budget problem for AI features, and it's categorically different from what you've solved before. When your core component is a database query or a microservice call, p95 latency is roughly predictable and amenable to standard SRE techniques. When your core component is an LLM, the distribution of response times is heavy-tailed, input-dependent, and partially driven by conditions you don't control. You need a different mental model before you can set an honest SLO — let alone hit it.
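To see how a heavy tail produces numbers like the ones above, here is a minimal sketch, assuming a lognormal latency distribution; the `mu`/`sigma` parameters are illustrative choices, not measured values from any real system. A mean near 400 ms can coexist with a p95 several times larger:

```python
import random
import statistics

# Hypothetical sketch: a lognormal latency distribution whose mean sits
# near 400 ms still places its p95/p99 far above the mean -- the
# heavy-tailed pattern described above for LLM-backed features.
random.seed(7)
samples_ms = [random.lognormvariate(5.5, 1.0) for _ in range(100_000)]

mean_ms = statistics.mean(samples_ms)           # ~exp(5.5 + 0.5) ~= 400 ms
cuts = statistics.quantiles(samples_ms, n=100)  # 99 percentile cut points
p95_ms, p99_ms = cuts[94], cuts[98]

print(f"mean={mean_ms:.0f}ms  p95={p95_ms:.0f}ms  p99={p99_ms:.0f}ms")
```

The point of the sketch: averaging over a heavy-tailed distribution tells you almost nothing about the tail, so an SLO stated as a mean will look healthy while a meaningful fraction of requests blow the budget.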