Load Testing LLM Applications: Why k6 and Locust Lie to You
You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?
The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammer-tests your /chat/completions endpoint will produce numbers that look like performance data but contain almost no signal about what production actually looks like.
The Fundamental Mismatch
Classic load testing has one job: ramp up concurrent requests until the server breaks, then record latency percentiles and error rates. The mental model is a web server returning HTML — fast, stateless, response size measured in kilobytes.
LLMs break every assumption in this model.
Responses are not atomic. A single request streams hundreds or thousands of tokens over several seconds. A test tool that records "request duration" is recording wall time from first byte to last byte. That number conflates two completely different performance characteristics: how long it took the model to start generating (Time to First Token, TTFT) and how fast it generated once it started (tokens per second). A system with a slow TTFT but fast generation feels broken to a user typing in a chat interface. A system with fast TTFT but slow generation is fine for conversational use but unusable for generating long documents. Average latency captures neither.
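The conflation is easy to see with a small sketch: given only the arrival timestamps of streamed tokens, the same wall time can decompose into opposite user experiences. The numbers below are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StreamTiming:
    ttft: float            # seconds from request start to first token
    tokens_per_sec: float  # generation rate after the first token
    wall_time: float       # request start to last token

def decompose(start: float, token_arrivals: list[float]) -> StreamTiming:
    """Split one streamed response's duration into TTFT and generation rate."""
    ttft = token_arrivals[0] - start
    wall_time = token_arrivals[-1] - start
    gen_window = wall_time - ttft
    tps = (len(token_arrivals) - 1) / gen_window if gen_window > 0 else float("inf")
    return StreamTiming(ttft, tps, wall_time)

# Two hypothetical 101-token responses, both exactly 5.0 s of wall time:
slow_start = decompose(0.0, [3.0 + 2.0 * i / 100 for i in range(101)])  # 3.0 s TTFT, 50 tok/s
fast_start = decompose(0.0, [0.5 + 4.5 * i / 100 for i in range(101)])  # 0.5 s TTFT, ~22 tok/s
```

A tool that records only request duration reports 5.0 s for both, yet the first feels broken in a chat UI and the second feels responsive.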
Request cost is not uniform. On a conventional API, a "request" is roughly a request. On an LLM API, one request might consume 150 tokens and another 4,000 — nearly a 27× difference in GPU time, memory pressure, and API cost. If your load test sends the same short greeting to the same endpoint repeatedly, you are testing a workload that does not exist in production. Real traffic has a distribution: short queries, long context retrievals, multi-turn sessions with accumulated history. Testing with uniform synthetic prompts produces numbers that only apply to that specific synthetic workload.
The bottleneck is GPU memory, not CPU threads. Concurrency in traditional load testing means threads or connections competing for CPU and network. In LLM inference, concurrency means KV cache slots competing for GPU memory. As concurrency climbs, throughput in tokens per second grows until the GPU's KV cache fills, then degrades sharply rather than gradually. The saturation curve is steep and nonlinear — 40 concurrent users might be fine, 45 might be catastrophic. Standard tools show you where the error rate increases; they do not show you where GPU memory pressure starts building five minutes before the errors appear.
Cache state determines everything. A warm prefix cache can cut TTFT by 85% and reduce cost by 90% for long shared prompts (system prompts, document context, few-shot examples). A cold cache test looks like a completely different system from a warm cache test. If your load test sends fresh requests from the start, you are testing cold-cache behavior only. If you have any prefix caching enabled in production, your test results and your production behavior will diverge significantly.
The Metrics That Actually Matter
Before discussing tooling, establish what you are measuring.
Time to First Token (TTFT) is the primary user experience metric for interactive applications. It measures the gap between sending the request and receiving the first streaming token. For a chat interface, target under 500ms; a coding copilot providing real-time completions typically needs under 200ms. This metric reflects prefill latency — how long the model took to process your input before it started generating.
Inter-Token Latency (ITL) measures the time between consecutive tokens in a streaming response. Above 50ms is noticeable to attentive users; above 100ms creates visible stuttering. ITL reflects decoding speed and, under load, queue depth. When a server is overloaded, ITL spikes before TTFT does — making it an early warning indicator.
Tokens per second (TPS) is your throughput metric, not requests per second. Because response length varies, RPS is a misleading proxy. A system serving 1,000 RPS of 10-token responses is very different from one serving 1,000 RPS of 1,000-token responses. Track total TPS across the cluster and per-request TPS to understand generation speed.
Goodput is the fraction of requests that complete within your SLO latency targets. A system at 95% goodput under load is meaningfully different from one at 99% goodput. This is the metric you should be tracking in your capacity planning model, not average latency.
Cost-per-request distribution belongs in your load test output, not just in your billing dashboard. Under load, retries, timeouts, and abandoned requests consume tokens without producing value. A load test that ignores token economics will produce throughput numbers that look healthy while hiding the fact that 20% of tokens are being wasted on requests that time out before users see a response.
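A minimal sketch of how goodput and wasted-token fraction might be computed from per-request records. The field names and SLO thresholds here are illustrative assumptions, not a standard schema.

```python
from dataclasses import dataclass

@dataclass
class RequestRecord:
    ttft: float        # seconds to first token
    total_time: float  # seconds to last token
    tokens: int        # tokens billed for this request
    completed: bool    # False if it timed out or was abandoned

def goodput_and_waste(records, ttft_slo=0.5, total_slo=8.0):
    """Goodput: fraction of requests completing within both SLO targets.
    Waste: fraction of billed tokens spent on requests that produced no value."""
    good = sum(1 for r in records
               if r.completed and r.ttft <= ttft_slo and r.total_time <= total_slo)
    wasted = sum(r.tokens for r in records if not r.completed)
    total = sum(r.tokens for r in records)
    return good / len(records), wasted / total

records = [
    RequestRecord(0.3, 4.0, 800, True),
    RequestRecord(0.4, 6.0, 1200, True),
    RequestRecord(2.5, 9.0, 400, True),     # completed, but violated both SLOs
    RequestRecord(0.6, 12.0, 1600, False),  # timed out: billed, zero user value
]
goodput, waste = goodput_and_waste(records)  # 0.5 goodput, 0.4 of tokens wasted
```

Note how an error-rate view of this sample reports zero failures, while the goodput view reports half the requests missing SLO and 40% of spend producing nothing.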
Why Conventional Tools Fall Short
k6 is excellent for what it was built to do. Its Go-based concurrency handles high request volumes efficiently, and it integrates cleanly into CI pipelines. But it treats each request as a unit, records total response time, and has no native understanding of streaming. A k6 script measuring LLM latency is measuring the time from request to final byte — which tells you total generation time but nothing about the shape of the user experience.
Locust has an additional problem specific to LLM workloads: Python's Global Interpreter Lock. Accurately measuring token-level performance requires tokenization (or at least byte-stream analysis) of streaming responses. This is CPU-intensive work that runs under the GIL, so tokenization competes with request generation for the same interpreter. Under heavy concurrency, the tokenization backlog skews measurements — your benchmark starts reporting artificially inflated inter-token latency not because the server is slow but because your test client is bottlenecked.
Both tools compound the problem with prompt uniformity. It is easy to write a script that sends the same prompt in a loop. Real production traffic is diverse: different prompt lengths, different context sizes, different output length distributions. A test with uniform prompts tests a single point on the token distribution, not the full workload.
Building an LLM Load Test That Actually Tells You Something
Use a realistic prompt corpus. Sample from your actual production logs (after scrubbing PII) to build a test dataset that reflects the true distribution of input lengths and types. If production logs are not available, build a synthetic dataset with representative variance: short queries (50-100 tokens), medium context requests (500-1000 tokens), and long context requests (2000+ tokens) in proportions matching your expected workload.
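When logs are unavailable, a weighted-bucket sampler like the following can approximate the mix. The bucket boundaries, proportions, and the 4,000-token cap on the long bucket are placeholder assumptions to replace with your own workload estimates.

```python
import random

# Assumed token-length buckets and mix; derive these from scrubbed
# production logs whenever you have them.
BUCKETS = [
    ((50, 100), 0.50),     # short queries
    ((500, 1000), 0.35),   # medium context requests
    ((2000, 4000), 0.15),  # long context requests
]

def synth_corpus(n: int, seed: int = 7) -> list[str]:
    """Build n prompts whose lengths follow the bucket mix, using a filler
    word as a rough one-token-per-word approximation."""
    rng = random.Random(seed)
    ranges = [r for r, _ in BUCKETS]
    weights = [w for _, w in BUCKETS]
    prompts = []
    for _ in range(n):
        lo, hi = rng.choices(ranges, weights=weights)[0]
        prompts.append(" ".join(["token"] * rng.randint(lo, hi)))
    return prompts
```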
Separate TTFT and generation throughput measurements. Instrument your test client to record the timestamp of the first byte received separately from the final byte. You need both, and they tell you different things about different parts of the system.
Test warm and cold cache separately, then together. Run a dedicated warm-up phase that primes prefix caches before your measurement window starts. Then run a cold-cache test, a warm-cache test, and a mixed test where you control cache hit rate explicitly. If there is a large gap between warm and cold performance, your production behavior will depend heavily on traffic patterns that vary by time of day, query diversity, and caching configuration — and your capacity plan needs to account for all of them.
Run concurrency sweeps with fine granularity around the saturation point. The KV cache saturation curve is steep. Do not jump from 10 to 50 to 100 concurrent users. Step through 10, 20, 30, 40, 45, 50 — the difference between 40 and 45 may be the difference between 400ms TTFT and 4000ms TTFT. Once you find the saturation point, test sustained load at 80% of it to verify it is a stable operating point.
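A sweep harness might look like the sketch below, where send_request is a stand-in for whatever issues one streaming call and returns its TTFT; the 3× degradation threshold for knee detection is an arbitrary assumption to tune against your SLOs.

```python
import statistics
from concurrent.futures import ThreadPoolExecutor

def sweep(send_request, levels, requests_per_level=50):
    """Median TTFT at each concurrency level. send_request(concurrency)
    issues one call and returns its TTFT in seconds (stubbed out here)."""
    results = {}
    for c in levels:
        with ThreadPoolExecutor(max_workers=c) as pool:
            ttfts = list(pool.map(lambda _: send_request(c), range(requests_per_level)))
        results[c] = statistics.median(ttfts)
    return results

def find_knee(results, degradation=3.0):
    """First concurrency level whose median TTFT exceeds degradation x the
    lowest level's median; None if the sweep never saturates."""
    levels = sorted(results)
    baseline = results[levels[0]]
    return next((c for c in levels if results[c] > degradation * baseline), None)
```

Fine-grained levels around the suspected knee are what make this useful; a sweep of [10, 50, 100] would report the same knee as "somewhere between 10 and 50".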
Run soak tests at least 4 hours long. Gradual memory leaks, KV cache fragmentation, and connection pool exhaustion all manifest over time. A 10-minute load test will not catch them. Periodic TTFT spikes and slowly growing inter-token latency are the canonical symptoms.
The Capacity Planning Math
Once you have your metrics, the planning model is straightforward but frequently skipped.
Take your peak TPS target (total tokens per second across all users at peak load), divide by the TPS your system delivers at 80% of its saturation concurrency, and that gives you the number of replicas or GPU instances you need for peak. Add a headroom multiplier (typically 1.5–2×) for burst handling. This number is your minimum provisioned capacity.
The part that trips teams up is the difference between input tokens and output tokens. Input token processing (prefill) is parallel and fast; output token generation (decoding) is sequential and slow. They have different throughput characteristics on the same hardware. Your capacity model needs separate estimates for prefill throughput (input tokens per second) and decode throughput (output tokens per second), and you need to know which is the binding constraint for your workload. Document-heavy workloads with long inputs are often prefill-bound; chat applications with short inputs and long outputs are often decode-bound.
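Sizing both constraints and taking the binding one can be sketched as follows; every throughput number in the example is hypothetical.

```python
import math

def replicas_needed(peak_input_tps, peak_output_tps,
                    prefill_tps_per_replica, decode_tps_per_replica,
                    headroom=1.5):
    """Size prefill and decode separately, take the binding (larger)
    constraint, then apply a burst headroom multiplier. Per-replica
    throughput should be measured at ~80% of saturation concurrency."""
    for_prefill = peak_input_tps / prefill_tps_per_replica
    for_decode = peak_output_tps / decode_tps_per_replica
    return math.ceil(max(for_prefill, for_decode) * headroom)

# Hypothetical chat workload: short inputs, long outputs, so decode-bound.
n = replicas_needed(peak_input_tps=20_000, peak_output_tps=9_000,
                    prefill_tps_per_replica=12_000, decode_tps_per_replica=1_500)
# decode needs 6 replicas; with 1.5x headroom that becomes 9
```

Sizing against prefill alone here would have provisioned 3 replicas and saturated decode immediately, which is exactly the mistake the separate estimates prevent.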
Rate limits from API providers add a third variable. Tokens-per-minute (TPM) and requests-per-minute (RPM) quotas operate independently and can both be hit simultaneously. Your load test should validate not just that your system behaves correctly under self-imposed load, but that your retry and backoff logic handles provider rate limit responses (HTTP 429) gracefully. Naive retry logic that immediately retries on 429 can turn a temporary rate limit into a sustained overload that exhausts quota faster.
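A backoff sketch with full jitter, assuming a RateLimited exception stands in for the provider's HTTP 429 response; the sleep parameter is injected so the logic is testable without waiting. If the provider sends a Retry-After header, honoring it should take precedence over the computed delay.

```python
import random
import time

class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 response."""

def call_with_backoff(call, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Retry `call` on rate limiting with exponential backoff and full jitter."""
    for attempt in range(max_retries + 1):
        try:
            return call()
        except RateLimited:
            if attempt == max_retries:
                raise
            # Full jitter spreads retries so synchronized clients don't
            # re-stampede the provider the moment the limit window resets.
            sleep(random.uniform(0, base_delay * 2 ** attempt))
```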
The Cost of Load Testing Itself
There is a practical problem that teams discover late: load testing LLM APIs against live endpoints is expensive. A realistic soak test at moderate concurrency can consume millions of tokens. At standard API pricing, a thorough load testing program can cost tens of thousands of dollars per month — and much of that spending produces no user value.
The answer is mock LLM services that simulate realistic latency distributions without consuming tokens. A well-designed mock returns responses with configurable TTFT and ITL drawn from distributions that match your production observations, supports streaming, and can inject failure scenarios (rate limits, timeouts, partial responses) on demand. You validate your application's behavior under load against the mock, and reserve live provider testing for final validation of capacity headroom.
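A minimal mock along these lines might look like the following; the exponential latency distributions are a placeholder assumption, and in practice you would fit the shapes to your production traces.

```python
import random
import time
from typing import Iterator

def mock_stream(n_tokens=100, ttft_mean=0.3, itl_mean=0.03,
                fail_rate=0.0, rng=None, sleep=time.sleep) -> Iterator[str]:
    """Stream synthetic tokens with configurable TTFT and inter-token
    latency, optionally injecting failures on demand."""
    rng = rng or random.Random()
    if rng.random() < fail_rate:
        raise TimeoutError("injected failure")  # could also model 429s, partials
    sleep(rng.expovariate(1 / ttft_mean))       # prefill delay before first token
    for i in range(n_tokens):
        if i:
            sleep(rng.expovariate(1 / itl_mean))  # decode delay between tokens
        yield f"tok{i} "
```

Passing a no-op sleep and a seeded rng gives fully deterministic runs, which is precisely the reproducibility property the live provider cannot offer.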
This is not just cost avoidance. Live provider testing introduces environmental noise from shared infrastructure, regional traffic variations, and provider-side changes that make it impossible to isolate your application's behavior from external factors. A mock gives you deterministic, reproducible results that let you confidently attribute performance changes to your own code changes.
When Load Testing Breaks Down
No load test perfectly predicts production. Two failure modes are common.
First, your test traffic does not match production traffic in ways that matter for caching. If your test uses diverse random prompts, cache hit rates will be near zero. If production has common patterns (shared system prompts, repeated queries), cache hit rates may be 60–80%. The result is that production performance significantly exceeds your load test predictions — which sounds good but means your capacity estimates are conservatively wrong in a way that costs money.
Second, your test does not capture multi-turn conversation state. Load tests typically test individual requests. Multi-turn sessions accumulate context across turns, meaning token count per request grows as the conversation continues. A test that measures fresh single-turn requests will not catch the TTFT degradation that appears when a user is 10 turns into a conversation and the accumulated context is 8,000 tokens.
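The growth is easy to quantify with a toy model, assuming the client re-sends the full history every turn; the per-turn token counts below are illustrative.

```python
def simulate_session(turns, user_tokens=60, reply_tokens=300, system_tokens=400):
    """Prompt tokens sent on each turn of a chat session, assuming the full
    history (system prompt + all prior turns) is re-sent every time."""
    per_turn = []
    context = system_tokens
    for _ in range(turns):
        context += user_tokens       # the new user message joins the prompt
        per_turn.append(context)     # prompt size for this request
        context += reply_tokens      # the model's reply joins future context
    return per_turn

sizes = simulate_session(10)
# sizes[0] == 460 and sizes[-1] == 3700: the "same" request costs ~8x more
# prefill work at turn 10 than at turn 1.
```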
Both failures come from the same root cause: load testing tells you how your system behaves under your test workload, not under your production workload. The closer you can make those two things match — through realistic prompt corpora, accurate cache state modeling, and conversation simulation — the more useful your test results are.
Load testing LLM applications is not fundamentally harder than load testing conventional applications. It just requires measuring different things, modeling different constraints, and resisting the temptation to trust a tool that reports clean numbers when the underlying workload it is modeling bears little resemblance to what users actually do.
