Load Testing LLM Applications: Why k6 and Locust Lie to You
You ran your load test. k6 reported 200ms average latency, 99th percentile under 800ms, zero errors at 50 concurrent users. You shipped to production. Within a week, users were reporting 8-second hangs, dropped connections, and token budget exhaustion mid-stream. What happened?
The test passed because you measured the wrong things. Conventional load testing tools were designed for stateless HTTP endpoints that return a complete response in milliseconds. LLM APIs behave like nothing those tools were built to model: they stream tokens over seconds, charge by the token rather than the request, saturate GPU memory rather than CPU threads, and respond completely differently depending on whether a cache is warm. A k6 script that hammers your /chat/completions endpoint will produce numbers that resemble performance data but carry almost no signal about how the system behaves in production.
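To make the streaming point concrete, here is a minimal Python sketch of the measurements a streamed response calls for instead of a single latency figure: time to first token, total stream duration, and the longest stall between chunks. The `StreamTimings` and `summarize_stream` names and the timestamps are hypothetical, invented for illustration; in a real harness the chunk times would be recorded with a monotonic clock as each chunk arrives.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class StreamTimings:
    ttft_s: float     # time from request start to the first chunk
    total_s: float    # time from request start to the last chunk
    max_gap_s: float  # longest pause between two consecutive chunks


def summarize_stream(request_start: float, chunk_times: Sequence[float]) -> StreamTimings:
    """Summarize one streamed response from its chunk arrival timestamps.

    Both arguments are in seconds on the same clock.
    """
    if not chunk_times:
        raise ValueError("no chunks received")
    # Gaps between consecutive chunks; empty when only one chunk arrived.
    gaps = [later - earlier for earlier, later in zip(chunk_times, chunk_times[1:])]
    return StreamTimings(
        ttft_s=chunk_times[0] - request_start,
        total_s=chunk_times[-1] - request_start,
        max_gap_s=max(gaps, default=0.0),
    )


# Hypothetical request: the first token arrives quickly, then the stream stalls.
timings = summarize_stream(0.0, [0.2, 0.25, 0.3, 8.3, 8.35])
print(timings)
```

In this made-up example the first token arrives after roughly 0.2 seconds, yet the user waits more than 8 seconds for the full answer because of one long mid-stream stall. A single per-request latency number cannot distinguish that request from a healthy one.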
