
6 posts tagged with "capacity-planning"


GPU Capacity Is a Roadmap Constraint: The 18-Month Contract That Decided Q3

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, fourteen months ago, a finance director and a platform lead signed a multi-year accelerator commitment. They built a peak-load model from the prior quarter's telemetry, negotiated a discount of 40 to 70 percent off on-demand pricing, and locked in the cluster shape that your product roadmap now has to fit inside. Nobody on the product team was in the room. Nobody on the application engineering team saw the spreadsheet. The contract is binding, the discount only applies if the commitment is honored, and the capacity envelope it bought is now the silent ceiling on every Q3 feature your PMs are scoping.

The gap most teams don't notice until the second year: capacity contracts are roadmap decisions, but they're being made by people who don't see the roadmap, using inputs that don't include the roadmap. The product trio thinks it's choosing features from a clean priority backlog. Finance thinks it's optimizing a fixed envelope. Both are right inside their own frame, and the collision shows up in a planning meeting where an architect proposes a 70B-parameter model for the new assistant feature and the platform lead says, quietly, that the cluster is already at 85 percent and that model doesn't fit without crowding out something else.

Capacity Math for Agent Loops: Why Your Provisioned Throughput Is Half of What You Think

· 11 min read
Tian Pan
Software Engineer

A team I worked with launched what they called a "modest" feature: an internal research assistant for a few hundred analysts. Their capacity model said one user request equals one model call, so they sized provisioned throughput against peak user QPS with the standard 30 percent burst headroom. On launch day they hit 429s within an hour. Traffic that should have used 40 percent of their reserved capacity saturated 100 percent, and the postmortem revealed a number nobody had multiplied in: the average request triggered 11 model calls, not one.

This is the most common capacity miss I see in agent rollouts. The math is not subtle and the failure mode is not exotic. The team asked the wrong unit question — they planned in user requests when the meter ticks in model calls — and the reservation they paid real money for evaporated under a load they would have called light if it had been a chat product.
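
The unit mismatch is easy to see once it is written down. Here is a minimal sketch, assuming the reservation is metered in model calls per second. The function name and the absolute QPS figures are hypothetical, chosen only so that the naive estimate lands on the 40 percent from the story. Many providers meter tokens rather than calls, in which case the same multiplication applies with tokens per user request as the fan-out factor.

```python
def reserved_utilization(peak_user_qps: float,
                         model_calls_per_request: float,
                         reserved_model_qps: float) -> float:
    """Fraction of reserved throughput consumed, counted in model calls."""
    return peak_user_qps * model_calls_per_request / reserved_model_qps


# Hypothetical sizing: 4 user requests/s at peak against a reservation
# of 10 model calls/s.
naive = reserved_utilization(4.0, 1, 10.0)    # plan assumed 1 call per request
actual = reserved_utilization(4.0, 11, 10.0)  # postmortem found 11 calls per request

print(f"planned: {naive:.0%}, observed demand: {actual:.0%}")
```

Anything above 100 percent in the second figure is demand the reservation cannot serve, which is what shows up as 429s.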

The Monday Morning AI Degradation Your Dashboard Treats As Noise

· 10 min read
Tian Pan
Software Engineer

Pull up your AI feature's latency and quality dashboards and squint. The line is mostly flat with occasional spikes your team has been calling "noise" or "provider weirdness" for months. Now break that same data out by hour-of-day and day-of-week. The noise resolves into a face: every Monday between 9 and 11am Eastern, your p95 latency is 30–60% worse than it is on a Saturday night, your cache hit rate dips 10–20 points, your retry rate doubles, and your token spend per task quietly climbs. The dashboard wasn't lying. It was averaging.

Most teams discover this pattern the way you discover a slow leak: by tracing the cost back from a quarterly bill nobody can explain. The instinct is to call it provider flakiness, file a ticket with the inference vendor, and move on. But the pattern isn't really about your LLM provider. It's about the fact that your AI feature now sits on top of a stack of shared, time-of-day-sensitive systems — the model API, the embedding API, the dependent SaaS tools your agent calls, the customer's own infrastructure on the receiving end of webhooks — and the cyclic load patterns of every one of them compose. You inherited the diurnal curve of an entire dependency chain, and your dashboard is showing you the average of all of them.
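
Breaking the data out by hour-of-day and day-of-week takes only a few lines. The sketch below is one way to do it with the standard library, assuming you can export latency samples as `(timestamp, latency_ms)` pairs with timestamps already converted to the timezone you care about. The function name and input shape are assumptions, not anything your dashboard vendor provides.

```python
from collections import defaultdict
from datetime import datetime
from statistics import quantiles


def p95_by_weekday_hour(samples):
    """Group (datetime, latency_ms) samples by (weekday, hour) and return p95 per bucket.

    weekday follows datetime.weekday(): Monday is 0, Sunday is 6.
    """
    buckets = defaultdict(list)
    for ts, latency_ms in samples:
        buckets[(ts.weekday(), ts.hour)].append(latency_ms)

    result = {}
    for key, values in buckets.items():
        if len(values) < 2:
            result[key] = values[0]  # too few points for a percentile
        else:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            result[key] = quantiles(values, n=20, method="inclusive")[-1]
    return result
```

Comparing the `(0, 9)` and `(0, 10)` buckets against a quiet bucket such as `(5, 22)` is enough to tell whether the "noise" has a weekly shape.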

Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write

· 12 min read
Tian Pan
Software Engineer

Your finance team will ask for a capacity plan you cannot write. Not because you're inexperienced or because the model is new, but because the two assumptions classical capacity planning rests on — a workload distribution you can measure, and a unit cost stable on a quarter timescale — are both violated by AI workloads. The number you hand them will be wrong on day one, and when the variance hits, the conversation that follows will not be about the bill.

The 2026 State of FinOps report named AI as the fastest-growing new spend category, with a majority of respondents reporting that AI costs exceeded original budget projections — for many enterprises, inference now consumes the bulk of the AI bill. The instinct to manage this with a SaaS-style capacity plan — pick a peak QPS, multiply by a unit cost, add 30% buffer — produces a number with the texture of a forecast and the predictive power of a horoscope. The capacity plan you actually need looks more like a FinOps scenario model than a procurement spreadsheet, and the engineering work to produce it is platform work that competes with feature work until the day finance loses patience.
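
To make the contrast concrete, here is a rough sketch of the smallest possible scenario model: instead of one peak number with a buffer, it sweeps a few plausible values for each uncertain input and reports the range. Every figure in it is a hypothetical placeholder, and the three inputs (request volume, tokens per request, price per million tokens) are an assumed decomposition rather than a standard one.

```python
from itertools import product


def monthly_cost(requests_per_month, tokens_per_request, usd_per_million_tokens):
    """Monthly inference spend in USD for one combination of inputs."""
    return requests_per_month * tokens_per_request * usd_per_million_tokens / 1_000_000


def scenario_range(requests, tokens, prices):
    """Return (lowest, highest) monthly cost across every combination of inputs."""
    costs = [monthly_cost(r, t, p) for r, t, p in product(requests, tokens, prices)]
    return min(costs), max(costs)


# Hypothetical low/high guesses for each input.
low, high = scenario_range(
    requests=[1_000_000, 2_000_000],
    tokens=[2_000, 6_000],
    prices=[1.0, 3.0],
)
print(f"monthly spend between ${low:,.0f} and ${high:,.0f}")
```

With these placeholder inputs the high case is 18 times the low case, a spread that a single 30% buffer cannot express.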

Burst Capacity Planning for AI Inference: When Black Friday Meets Your KV Cache

· 11 min read
Tian Pan
Software Engineer

Your Black Friday traffic spike arrives. Conventional API services respond by spinning up more containers. Within 60 seconds, you have three times the capacity. The autoscaler does what it always does, and you sleep through the night.

Run an LLM behind that same autoscaler, and you get a different outcome. The new GPU instances come online after four minutes of model weight loading. By then, your request queues are full, your existing GPUs are thrashing under memory pressure from half-completed generations, and users are staring at spinners. Adding more compute didn't help — the bottleneck isn't where you assumed it was.

AI inference workloads violate most of the assumptions that make reactive autoscaling work for conventional services. Understanding why is the prerequisite to building systems that survive traffic spikes.
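
A back-of-the-envelope sketch shows why the cold-start time matters so much. It assumes a constant arrival rate during the spike and a fixed serving rate until new capacity comes online. The request rates are hypothetical. The 60-second and four-minute figures come from the scenario above.

```python
def backlog_at_scale_up(arrival_rps: float, serving_rps: float, cold_start_s: float) -> float:
    """Requests queued by the time new capacity starts serving.

    Assumes constant arrival and serving rates for the whole cold-start window.
    """
    return max(0.0, arrival_rps - serving_rps) * cold_start_s


# Hypothetical spike: traffic triples from 10 to 30 requests/s while the
# existing fleet can still serve only 10 requests/s.
container_backlog = backlog_at_scale_up(30.0, 10.0, 60.0)   # containers ready in 60 s
gpu_backlog = backlog_at_scale_up(30.0, 10.0, 240.0)        # weights loaded after 4 min

print(container_backlog, gpu_backlog)
```

The model is deliberately optimistic for the GPU case: it ignores the throughput loss from memory pressure on the existing instances, which makes the real backlog larger.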

LLM Queuing Theory: Why Your Load Balancer Thinks in Requests While Your GPU Thinks in Tokens

· 11 min read
Tian Pan
Software Engineer

Your load balancer distributes requests evenly across your GPU fleet. Each instance gets roughly the same number of concurrent requests. Everything looks balanced. Yet one instance is crawling at 40 tokens per second while another hums along at 200. The dashboard shows equal request counts, but your users are experiencing wildly different latencies.

The problem is fundamental: traditional load balancing operates at the request level, but LLM inference costs scale with tokens. A single request asking for a 4,000-token essay consumes roughly 50x the decode time of a request generating an 80-token classification. Treating them as equivalent units is like a highway toll booth counting vehicles without distinguishing motorcycles from 18-wheelers.

This mismatch between request-level thinking and token-level reality is where classical queuing theory meets its most interesting modern challenge.
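
One way to picture the token-level alternative is a balancer that tracks outstanding tokens per instance instead of outstanding requests. The sketch below is illustrative only: the class and method names are hypothetical, and it assumes the router has some per-request token estimate (for example the request's output-token cap), which real systems can only approximate.

```python
class TokenAwareBalancer:
    """Route each request to the instance with the fewest outstanding estimated tokens."""

    def __init__(self, instance_ids):
        self.outstanding = {instance_id: 0 for instance_id in instance_ids}

    def route(self, estimated_tokens: int) -> str:
        # Pick the least-loaded instance by tokens, not by request count.
        target = min(self.outstanding, key=self.outstanding.get)
        self.outstanding[target] += estimated_tokens
        return target

    def complete(self, instance_id: str, estimated_tokens: int) -> None:
        # Release the estimate once the request finishes.
        self.outstanding[instance_id] -= estimated_tokens


balancer = TokenAwareBalancer(["gpu-a", "gpu-b"])
first = balancer.route(4000)                         # long essay
rest = [balancer.route(80) for _ in range(3)]        # short classifications
```

A request-count balancer would alternate between the two instances. Here the three short requests all go to the instance that is not busy with the essay.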