GPU Capacity Is a Roadmap Constraint: The 18-Month Contract That Decided Q3
Somewhere in your company, fourteen months ago, a finance director and a platform lead signed a multi-year accelerator commitment. They built a peak-load model from the prior quarter's telemetry, negotiated a discount of 40 to 70 percent off on-demand pricing, and locked in the cluster shape that your product roadmap now has to fit inside. Nobody on the product team was in the room. Nobody on the application engineering team saw the spreadsheet. The contract is binding, the discount only applies if the commitment is honored, and the capacity envelope it bought is now the silent ceiling on every Q3 feature your PMs are scoping.
The gap most teams don't notice until the second year: capacity contracts are roadmap decisions, but they're being made by people who don't see the roadmap, using inputs that don't include the roadmap. The product trio thinks it's choosing features from a clean priority backlog. Finance thinks it's optimizing a fixed envelope. Both are right inside their own frame, and the collision shows up in a planning meeting where an architect proposes a 70B-parameter model for the new assistant feature and the platform lead says, quietly, that the cluster is already at 85 percent and that model doesn't fit without crowding out something else.
The Two Planning Tracks That Don't Touch
Most companies running AI features at scale end up with two parallel planning artifacts that never reconcile.
The first is the capacity contract. Finance and platform negotiate it on a 12-to-36-month horizon. The inputs are usage telemetry from the prior quarter, a peak-load multiplier, a growth assumption, and a price ceiling. The output is a fixed pool of accelerator-hours, often expressed as a number of GPUs reserved with AWS, GCP, Azure, or a colo provider. The discount only materializes if utilization stays above the commitment floor for the contract term. Below the floor, the company pays for capacity it isn't using. Above the ceiling, the company pays burst rates or, worse, can't get capacity at all because the burst pool is shared and oversubscribed.
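To make the floor-and-ceiling arithmetic concrete, here is a minimal sketch in Python, with purely illustrative pool sizes and rates, of how the blended cost per GPU-hour moves as steady-state usage drifts away from the commitment:

```python
# Minimal sketch of reserved-versus-burst GPU economics; every number is illustrative.
RESERVED_GPUS = 512       # contracted pool
RESERVED_RATE = 1.50      # $/GPU-hour at the committed discount
ON_DEMAND_RATE = 4.00     # $/GPU-hour for burst capacity
HOURS_PER_MONTH = 24 * 30

def effective_rate(avg_gpus_used: float) -> float:
    """Blended $/GPU-hour actually paid: the reserved pool is billed in full
    regardless of use, and anything above it is bought at burst rates."""
    reserved_cost = RESERVED_GPUS * RESERVED_RATE * HOURS_PER_MONTH
    burst_cost = max(0.0, avg_gpus_used - RESERVED_GPUS) * ON_DEMAND_RATE * HOURS_PER_MONTH
    return (reserved_cost + burst_cost) / (avg_gpus_used * HOURS_PER_MONTH)

for used in (250, 450, 512, 700):
    utilization = used / RESERVED_GPUS
    print(f"{used:>3} GPUs in steady use ({utilization:4.0%} of commitment): "
          f"effective ${effective_rate(used):.2f}/GPU-hour")
```

With these made-up numbers, running at half the commitment roughly doubles the effective rate over the committed price, which is the stranded-capacity problem expressed in dollars; running above it starts blending in the burst premium.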
The second is the product roadmap. Product and engineering scope features on a quarterly cadence. The inputs are user research, business priorities, technical debt, and engineering capacity expressed in headcount. The output is a feature list with rough ETAs. Nowhere in this artifact does the phrase "accelerator-hours per feature" appear. Nowhere is there a column that says "this feature requires the H200 pool, the H100 pool can't host it." The model choice — Sonnet versus Opus, a 7B local model versus a 70B hosted one, RAG versus fine-tune — gets made by the engineer building the feature, weeks after the roadmap was approved, with no visibility into whether that choice fits the contracted envelope.
Six months in, the gap becomes a daily friction. The new feature the product team prototyped requires more memory than the H100 pool can hold without batch-size cuts that murder latency. The proposed capability uplift would push p99 past the SLA because the cluster is already at 85 percent and the queueing is nonlinear past that point. The finance team is silently denying internal capacity requests because the burst pool is the only way to absorb seasonal growth and they're saving it for the Black Friday or quarter-close window. The eng team thinks it's scoping from product priorities. Finance is rationing those priorities by what fits on the cluster it bought fourteen months ago.
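The "nonlinear past that point" claim is worth making concrete. A rough single-queue illustration, not a model of any particular serving stack (real GPU serving batches and replicates), shows the shape of the curve:

```python
# Rough illustration of why queueing delay is nonlinear in utilization. This is
# the single-server M/M/1 waiting factor rho / (1 - rho); the exact numbers are
# not the point, the shape is.
def relative_queue_delay(rho: float) -> float:
    """Mean time spent waiting in queue, in multiples of the mean service time."""
    assert 0.0 <= rho < 1.0, "utilization must be below 100%"
    return rho / (1.0 - rho)

for rho in (0.50, 0.70, 0.85, 0.90, 0.95):
    print(f"utilization {rho:.0%}: waiting ~{relative_queue_delay(rho):.1f}x the service time")
```

Between 85 and 95 percent utilization the waiting factor more than triples, which is why the platform lead's quiet objection in the planning meeting is not conservatism.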
What Last Quarter's Peak Doesn't Tell You About Next Year's Features
The cleanest way to see the failure mode is to look at the inputs to the capacity contract.
Last quarter's telemetry captures the workloads that existed last quarter. It does not capture the workloads the roadmap is about to add. If the product team is planning to ship a long-context assistant in Q3, the capacity model built in Q1 has no signal for the per-request token count that feature will require, the concurrency it will drive, or the model class it will run on. The peak-load multiplier is a guess — usually 2x or 3x the prior quarter's peak — and a guess is fine when the next quarter looks like the last one. It is not fine when the roadmap explicitly intends to change the workload shape.
The growth assumption is also fragile. Most contracts assume linear or modest exponential growth in token volume. They do not model the discontinuities that ship feature-by-feature: enabling RAG citations on every response can double output tokens overnight; turning on tool use can triple per-request inference depth; rolling out a long-context window can quadruple the memory footprint per active session. Each of these is a roadmap decision. Each is invisible to the capacity model.
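A hedged back-of-envelope comparison, with hypothetical multipliers, shows how far a single feature launch can move the number relative to a smooth-growth assumption:

```python
# Illustrative-only numbers: how one feature launch compares to a smooth-growth plan.
baseline_tokens_per_request = 900        # prompt + completion, prior quarter's average
baseline_requests_per_day = 2_000_000

# Capacity-plan assumption: 30% annual volume growth, workload shape unchanged.
planned_q3_tokens = (baseline_tokens_per_request * baseline_requests_per_day
                     * 1.30 ** 0.5)      # half a year of that growth

# Roadmap reality (hypothetical multipliers): RAG citations lengthen every
# response, tool use adds extra model calls per user request.
rag_output_multiplier = 1.8
tool_calls_per_request = 2.5
organic_growth = 1.15
actual_q3_tokens = (baseline_tokens_per_request * rag_output_multiplier
                    * tool_calls_per_request * baseline_requests_per_day
                    * organic_growth)

print(f"planned: {planned_q3_tokens / 1e9:.1f}B tokens/day")
print(f"actual:  {actual_q3_tokens / 1e9:.1f}B tokens/day "
      f"({actual_q3_tokens / planned_q3_tokens:.1f}x the plan)")
```

With these stand-in values the roadmap lands at roughly four and a half times the plan, and nothing in the contract's inputs could have seen it coming.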
The result is predictable. Six to nine months into the contract, utilization either undershoots (and the company pays for stranded capacity) or overshoots (and the company pays burst rates or queues features). Reports of 5 percent average utilization across enterprise GPU fleets, and individual companies discovering seven-figure stranded spends, are the public form of this private misalignment. The contract didn't fail because the GPUs were bad. It failed because the spreadsheet that priced them didn't know what the product was going to do.
The Discipline That Has to Land
The fix isn't a new tool. It's a planning artifact that lives in both worlds.
Per-feature accelerator-hour budgets, approved alongside headcount. When a product trio proposes a feature, the scoping artifact includes an estimated accelerator-hour cost — derived from the model class, expected per-request inference depth, projected concurrency, and rollout curve. The number doesn't have to be precise. It has to exist. A feature that doesn't have one is a feature that hasn't been scoped against the capacity envelope, the same way a feature without a headcount estimate hasn't been scoped against engineering capacity. Both estimates get revised as the feature gets specified; both inform whether the feature can ship in the proposed quarter.
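What that estimate might look like in practice: a minimal sketch, with hypothetical traffic and throughput numbers, of the accelerator-hour line a scoping artifact could carry:

```python
from dataclasses import dataclass

@dataclass
class FeatureCapacityEstimate:
    """Back-of-envelope accelerator-hour budget for one roadmap feature.
    Every field is a scoping-time estimate, revised as the spec firms up."""
    name: str
    requests_per_day: float               # projected at full rollout
    tokens_per_request: float             # prompt + completion, including tool calls
    model_tokens_per_gpu_second: float    # measured or vendor-published throughput
    rollout_fraction_by_month: list       # e.g. [0.1, 0.5, 1.0] for a three-month ramp
    peak_to_average: float = 2.0          # diurnal peak multiplier

    def gpu_hours_per_month(self) -> list:
        budget = []
        for fraction in self.rollout_fraction_by_month:
            tokens_per_day = self.requests_per_day * fraction * self.tokens_per_request
            gpu_seconds_per_day = tokens_per_day / self.model_tokens_per_gpu_second
            # Size to the peak, not the average: the pool has to absorb the peak.
            budget.append(gpu_seconds_per_day / 3600 * 30 * self.peak_to_average)
        return budget

assistant = FeatureCapacityEstimate(
    name="long-context assistant",
    requests_per_day=400_000,
    tokens_per_request=6_000,
    model_tokens_per_gpu_second=450,
    rollout_fraction_by_month=[0.1, 0.5, 1.0],
)
print(assistant.name, [f"{h:,.0f} GPU-h" for h in assistant.gpu_hours_per_month()])
```

The number will be wrong; that's expected. The point is that it exists, attached to the feature, where the quarterly review can see it.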
A capacity dashboard with the same fidelity as the headcount roster. Every engineering manager can see the current org chart, the open reqs, and the planned hires for the next two quarters. Almost no engineering manager can see the current cluster utilization, the contracted ceiling, the burst headroom, and the model-mix sensitivity. The capacity model exists inside finance's planning tool; the engineering team sees, at best, a Grafana panel for live utilization. The gap is closeable. Publish the same view of capacity that you publish for headcount, with the same monthly rhythm, the same trend lines, and the same ownership.
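A sketch of what that published view might contain, with hypothetical pool names, dates, and figures; the point is the fields, not the format:

```python
# A sketch of the monthly capacity view, published alongside the headcount roster.
# Pool names, fields, and figures are hypothetical, not any vendor's API.
capacity_report = {
    "month": "2026-07",
    "pools": {
        "h100-80gb": {
            "contracted_gpus": 512,
            "avg_utilization": 0.85,
            "p95_utilization": 0.93,
            "burst_headroom_gpus": 64,
            "contract_renewal": "2027-01",
        },
    },
    # Estimated change in steady-state utilization if a planned launch or
    # migration lands, taken from the roadmap's accelerator-hour estimates.
    "model_mix_sensitivity": {
        "assistant-launch": +0.11,
        "summarizer-7b-to-3b-migration": -0.04,
    },
}
```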
A quarterly roadmap-versus-capacity review. Pair the quarterly product review with a capacity review run by the same forum. The output is a list of features that fit, a list of features that don't fit unless something gets retired, and a list of features that would require the next contract amendment. The first list goes into the next quarter's commitment plan. The second forces a real prioritization conversation: what gets cut to make room for what the team actually wants to ship? The third feeds the negotiation window for the next contract cycle, which is the only point at which the envelope itself is renegotiable.
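A sketch of the triage that review produces, using the per-feature estimates from scoping; the thresholds and feature names here are stand-ins:

```python
# Sketch of the roadmap-versus-capacity triage; all numbers are illustrative.
monthly_pool_gpu_hours = 512 * 24 * 30                 # contracted pool
committed_gpu_hours = 0.85 * monthly_pool_gpu_hours    # existing workloads
retirable_gpu_hours = 40_000   # workloads the team could plausibly sunset this quarter

proposals = {"assistant-launch": 70_000, "rag-citations": 21_000, "offline-eval-suite": 150_000}

fits, needs_retirement, needs_amendment = [], [], []
for feature, estimate in sorted(proposals.items(), key=lambda kv: kv[1]):
    headroom = monthly_pool_gpu_hours - committed_gpu_hours
    if estimate <= headroom:
        fits.append(feature)
        committed_gpu_hours += estimate
    elif estimate <= headroom + retirable_gpu_hours:
        needs_retirement.append(feature)   # fits only if something else is cut
    else:
        needs_amendment.append(feature)    # feeds the next contract negotiation

print("fits:", fits)
print("needs retirement:", needs_retirement)
print("needs contract amendment:", needs_amendment)
```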
Roadmap inputs into the next contract negotiation. The reason last quarter's telemetry dominates the capacity model is that nobody hands the negotiation team a credible forward roadmap. The fix is procedural: when the contract is up for renewal, the inputs include the next four quarters of planned features with their accelerator-hour estimates, the planned model migrations, and the planned context-window changes. The negotiation team can then size the commitment to the future product, not the past one. The discount stays. The envelope fits.
The Failure Mode Nobody Wants on the Post-Mortem
When the discipline doesn't land, the failure shape is familiar. A team ships a feature into a saturated cluster. P99 latency degrades not just for the new feature but for every workload sharing the pool, because queue delays on a saturated pool spike for every tenant at once. The on-call rotation gets paged for a week. The post-mortem starts by examining the feature launch and the code path. It does not, usually, name the actual root cause: a capacity decision made fourteen months ago, by people who didn't know the feature was coming, using inputs that didn't include the feature, locked into a contract that still has months left to run.
The remediation in those post-mortems is usually a request for more GPUs, which the finance team can't grant without amending a contract, which takes a full quarter to negotiate, during which the team works around the constraint with degraded models or paused rollouts. The lesson written into the post-mortem template is "test capacity before launching features." The lesson actually being learned by the org is harder to write down: planning artifacts that live in separate forums end up making decisions for each other without consent.
Capacity Is a Product Constraint
The architectural takeaway is the one that's hardest to translate into a backlog ticket: GPU capacity is not infrastructure. It is a product constraint that has to live in the same planning artifact as the feature backlog. Treating it as infrastructure — something the platform team handles, something finance manages — is what produces the silent binding between fourteen-month-old contracts and this quarter's feature decisions.
The companies that get this right will look, structurally, like the companies that figured out the same lesson for headcount fifteen years ago. Headcount used to be a back-office line item until the realization landed that hiring plans and feature plans are the same plan viewed from two sides. The same thing is happening, slowly, with capacity. The product manager and the capacity owner are going to end up in the same forum, looking at the same model-mix sensitivity chart, because that is the only way to stop shipping roadmaps that finance has already silently redacted.
For now, most companies are still in the earlier phase, where the redaction happens quietly and the team only notices when a feature fails to ship for a reason that nobody on the feature team can name. The fix starts whenever an engineering leader asks, in a roadmap review, the question that the capacity contract was built to answer fourteen months ago: which of these features actually fits, and what would have to change if we wanted them to?
- https://www.oreilly.com/radar/why-capacity-planning-is-back/
- https://www.finops.org/wg/finops-for-ai-overview/
- https://nstarxinc.com/blog/gpu-capacity-planning-cost-control-avoiding-stranded-spend-and-failed-reservations/
- https://compute.exchange/blogs/reserved-vs.-on-demand-gpu-in-2026
- https://winbuzzer.com/2026/05/11/enterprises-face-underused-gpu-fleets-as-ai-costs-rise-xcxwbn/
- https://introl.com/blog/gpu-procurement-strategies-leasing-buying-reserved-capacity-2025
- https://www.spheron.network/blog/ai-inference-cost-economics-2026/
- https://www.spheron.network/blog/gpu-shortage-2026/
- https://www.flexera.com/blog/finops/finops-for-ai-governing-the-unique-economics-of-intelligent-workloads/
- https://introl.com/blog/ai-infrastructure-capacity-planning-forecasting-gpu-2025-2030
