Capacity Planning When Every Request Thinks a Different Amount
Classic capacity planning rests on a quiet assumption: requests are roughly interchangeable. A web server handles a login, a search, a checkout — and while those differ, they differ within a band. You measure requests per second, watch p50 and p99 latency, multiply by a safety factor, and provision. The model works because the unit of work — one request — has a stable cost.
Agent workloads break that assumption at the root. One query to your agent resolves in a single short completion: 300 tokens in, 200 out, done in two seconds. The next query, superficially identical, spawns a planning step, fans out to forty tool calls, re-reads its own growing context on every turn, and burns 1.2 million tokens over four minutes. Same endpoint. Same user. Same code path. The cost per request varied by three orders of magnitude, and nothing in the request told you which one you were about to get.
