Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write
Your finance team will ask for a capacity plan you cannot write. Not because you're inexperienced or because the model is new, but because the two assumptions classical capacity planning rests on — a workload distribution you can measure, and a unit cost stable on a quarter timescale — are both violated by AI workloads. The number you hand them will be wrong on day one, and when the variance hits, the conversation that follows will not be about the bill.
The 2026 State of FinOps report named AI as the fastest-growing new spend category, with a majority of respondents reporting that AI costs exceeded original budget projections — for many enterprises, inference now consumes the bulk of the AI bill. The instinct to manage this with a SaaS-style capacity plan — pick a peak QPS, multiply by a unit cost, add 30% buffer — produces a number with the texture of a forecast and the predictive power of a horoscope. The capacity plan you actually need looks more like a FinOps scenario model than a procurement spreadsheet, and the engineering work to produce it is platform work that competes with feature work until the day finance loses patience.
This post is about why the SaaS-shaped capacity plan keeps failing, what the AI-shaped version looks like, and the four disciplines a team has to build before finance stops calling every quarter for a re-baseline.
Why Classical Capacity Planning Breaks
Classical capacity planning works because two things are roughly stable. The workload distribution — requests per second, payload size, peak-to-trough ratio — has variance you can measure historically and project forward. The unit cost — dollars per request, dollars per CPU-hour — is stable on a quarter timescale because vendor pricing changes with announcement notice and your stack doesn't restructure between board meetings.
AI workloads break both. Token-per-request variance spans three or more orders of magnitude across users on the same feature. A summarization endpoint serving a casual user processing a 200-word email and a power user processing a 50-page contract sees per-call costs that differ by roughly 250x, and the user mix at any given hour is not a property your historical traffic data captured, because last quarter's user mix was nothing like next quarter's. The "average" request is a statistical fiction: the cost distribution is heavy-tailed, and the tail is where the budget actually lives.
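A toy simulation makes the shape of the problem concrete. The lognormal parameters and the flat $3-per-million-token price below are illustrative assumptions, not measurements of any real workload or provider; what matters is the result they produce, a mean driven by the tail.

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Assumed (not measured) input-size distribution: a heavy-tailed lognormal,
# spanning short emails up through long contracts.
tokens = rng.lognormal(mean=7.0, sigma=1.8, size=100_000)

PRICE_PER_MTOK = 3.00  # hypothetical flat input price, USD per million tokens
cost = tokens * PRICE_PER_MTOK / 1e6

median = np.median(cost)
mean = cost.mean()
p99 = np.percentile(cost, 99)
tail_share = cost[cost >= p99].sum() / cost.sum()

print(f"median cost/request: ${median:.5f}")
print(f"mean cost/request:   ${mean:.5f}  ({mean / median:.1f}x the median)")
print(f"share of total spend in the top 1% of requests: {tail_share:.0%}")
```

In this toy distribution roughly a third of the spend comes from the top 1% of requests, and a plan built on the median request underestimates by about 5x. Your real distribution will differ; the shape of the error will not.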
Unit cost is worse. LLM API prices fell roughly 80% between early 2025 and early 2026, but "fell" is the wrong frame for a budget owner — what actually happens is that your stack silently restructures inside budget cycles. A model upgrade ships, the team flips the default tier to a newer model with different per-token pricing and different token efficiency, and the unit-cost line in the forecast is now describing a model you no longer use. The pricing landscape that produced your forecast in January is not the pricing landscape your bill is computed against in April.
And then there's amplification. A single feature change — adding a tool call to the loop, expanding the system prompt by a paragraph, switching from "tools in prompt" to "tools in code" — can amplify per-task token consumption by an order of magnitude or more. Agentic workloads consume 4–15x more tokens than chat interactions, and reflexion-style loops can push that to 50x. A four-turn conversation doesn't cost 4x a single turn; it costs roughly 10x because each turn re-sends accumulated context. The product manager who shipped "let the agent retry up to three times" did not file a budget change request, and the finance team that signed off on the quarterly number did not see the line item that just doubled.
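The multi-turn arithmetic is easy to sketch. A minimal back-of-envelope model, assuming every turn re-sends the full prior transcript, with made-up token counts and no caching or output-token pricing:

```python
def conversation_input_tokens(turns: int, tokens_per_turn: int = 1_000,
                              system_prompt: int = 200) -> int:
    """Total input tokens billed across a conversation when each turn re-sends
    the system prompt plus every prior turn as accumulated context."""
    return sum(system_prompt + k * tokens_per_turn for k in range(1, turns + 1))

one_turn = conversation_input_tokens(1)
four_turns = conversation_input_tokens(4)
print(f"1 turn:  {one_turn:,} input tokens")
print(f"4 turns: {four_turns:,} input tokens ({four_turns / one_turn:.0f}x, not 4x)")
```

The exact multiplier moves with prompt size, caching, and output length; the quadratic growth in re-sent context is what a linear forecast misses.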
This is not a forecasting problem you can solve by adding buffer. A 30% buffer on a metric that varies by 3x to 10x within the budget cycle is decoration. The forecasting framework itself has to change.
Scenario Bands, Not Point Forecasts
The first discipline is to stop handing finance a single number. The unit of forecast for AI workloads is a scenario band — a plausible range tied explicitly to feature decisions, model choices, and usage assumptions — not a point estimate.
A scenario band has at minimum three branches. A baseline ("current feature set, current model mix, current usage trajectory"). A growth-case ("planned features ship on schedule, user growth hits the product roadmap target"). A stretch-case ("the agentic feature in design lands and shifts cost amplification by N× on the cohort that adopts it"). Each branch names the assumptions that produced the number — what model is doing what work, what the cache hit rate is assumed to be, what fraction of requests are hitting the long-context path.
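Kept as data rather than as a spreadsheet cell, a band might look like the sketch below. Every number and field name is an invented placeholder, and the cache discount is applied to the whole blended price as a simplification; the point is that each branch carries its assumptions with it.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    monthly_tasks: int              # assumed usage volume
    tokens_per_task: int            # includes amplification from agents and retries
    blended_price_per_mtok: float   # USD, weighted across the assumed model mix
    cache_hit_rate: float           # assumed fraction of tokens served at cache pricing
    assumptions: tuple[str, ...]    # the feature and model decisions this branch depends on

    def monthly_cost(self, cache_discount: float = 0.10) -> float:
        # Simplification: applies the cache discount to the full blended price.
        effective = self.blended_price_per_mtok * (
            (1 - self.cache_hit_rate) + self.cache_hit_rate * cache_discount)
        return self.monthly_tasks * self.tokens_per_task * effective / 1e6

band = [
    Scenario("baseline", 2_000_000, 3_000, 4.0, 0.30,
             ("current feature set", "current model mix", "current usage trajectory")),
    Scenario("growth", 3_500_000, 3_500, 4.0, 0.30,
             ("roadmap features ship on schedule", "user growth hits target")),
    Scenario("stretch", 3_500_000, 12_000, 5.5, 0.45,
             ("agentic feature lands", "4x amplification on the adopting cohort")),
]

for s in band:
    print(f"{s.name:>8}: ${s.monthly_cost():>10,.0f}/mo  assumes: {'; '.join(s.assumptions)}")
```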
This sounds like extra work, and it is. But it converts a single brittle prediction into a structured object finance can actually reason about. When the number moves, the conversation is no longer "you were wrong" — it's "which assumption changed, and was that assumption a feature decision somebody owned." The band gives finance and engineering a shared vocabulary for variance, and it forces the engineering side to articulate the cost-relevant decisions that would otherwise be invisible.
The cadence has to change too. Quarterly forecasts work for stable workloads; AI workloads need a weekly or monthly rolling forecast that catches runaway spend before it becomes a board-level surprise. The rolling cadence is not optional — it's the only mechanism that catches a 10× amplification event in time to do anything about it.
Model Mix as a First-Class Budget Lever
The second discipline is to treat the model mix as a budget lever the forecast names explicitly. The gap between flagship and budget tiers is roughly 20×, not 2×. A model choice that "doesn't matter for the prototype" is a six-figure line item at scale, and the team that hasn't named which features run on which tier has implicitly committed every feature to the most expensive default.
The forecast should answer four questions for every revenue-relevant surface. Which model tier is this feature running on today? What's the eval-quality delta if it moves down a tier? What's the cost delta? And what's the routing logic — is the entire feature on one tier, or is it cascading from cheap to expensive based on a confidence signal?
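One way to keep those four answers auditable is to carry them as a structured record per feature that the forecast consumes directly. The tiers, deltas, and routing descriptions below are hypothetical placeholders, not recommendations.

```python
# Hypothetical per-feature model-mix records; every value is a placeholder.
model_mix = [
    {"feature": "summarization",
     "current_tier": "mid-tier",
     "downgrade_eval_delta": -0.02,   # eval pass-rate change if moved down one tier
     "downgrade_cost_delta": -0.55,   # fractional per-task cost change for the same move
     "routing": "single tier"},
    {"feature": "agent_with_tools",
     "current_tier": "flagship",
     "downgrade_eval_delta": -0.11,
     "downgrade_cost_delta": -0.80,
     "routing": "cascade; escalate when the cheap tier's confidence is below 0.8"},
]

for row in model_mix:
    print(f"{row['feature']}: on {row['current_tier']}; one tier down = "
          f"{row['downgrade_eval_delta']:+.0%} quality, {row['downgrade_cost_delta']:+.0%} cost; "
          f"routing: {row['routing']}")
```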
This is where eval infrastructure stops being a quality concern and becomes a budget concern. Without an eval that grades quality at each tier, "downgrade this feature to the cheaper model" is a guess. With one, it's a budget input — the trade-off between "ship cheaper model and lose 2% quality" and "stay on premium and burn 40% more" becomes a decision the finance and product owners can make jointly, with evidence, instead of a discovery the engineering team has at quarter-end.
Routing matters here as much as raw tier selection. Cascading systems that route easy requests to a cheap model and defer hard ones to a flagship can deliver substantial cost reductions while preserving most of the quality, and the routing-penalty parameter is itself a tunable knob. The forecast that names the routing strategy and its quality-cost operating point is a forecast that can move when you want it to. The forecast that bundles all model spend into a single line item is one finance has to take on faith.
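From the budget side, the thing to name about a cascade is its escalation rate, because that single number sets the blended cost per task. A sketch with invented per-task costs:

```python
def blended_cost_per_task(escalation_rate: float,
                          cheap_cost: float = 0.002,     # hypothetical $/task, cheap tier
                          flagship_cost: float = 0.040,  # hypothetical $/task, flagship
                          ) -> float:
    """Every task pays for the cheap-tier attempt; escalated tasks also pay flagship."""
    return cheap_cost + escalation_rate * flagship_cost

for rate in (0.05, 0.15, 0.30, 1.00):
    cost = blended_cost_per_task(rate)
    print(f"escalation rate {rate:>4.0%}: ${cost:.4f}/task "
          f"({cost / 0.040:.0%} of all-flagship cost)")
```

The escalation rate is set by the routing threshold, which is exactly why the quality-cost operating point belongs in the forecast: move the threshold and both the eval number and the blended cost move with it.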
Per-Feature Unit Costs, Not One Inference Line Item
The third discipline is decomposing the inference bill by feature, not by model. A single inference line that says "$X this month for LLM API calls" is the level of granularity that lets a CFO ask "is this growing linearly with revenue?" and gives the engineering team no defensible answer.
Per-feature unit cost tracking joins inference logs to user-action telemetry so the bill decomposes into a cost-per-task for each user-visible feature. The summarization feature has a unit cost. The agent-with-tools feature has one. The chat feature has one. Each one trends differently with usage, with feature changes, and with model swaps, and only at this granularity can ROI conversations actually happen.
The shift here is from cost-per-token to cost-per-task as the primary efficiency metric. Token usage is an input measure; cost-per-task is an outcome measure that includes the LLM calls, the tool executions, the retries, and the human escalations triggered by agent failures. A feature where cost-per-task is rising while user satisfaction is flat is a feature where amplification has snuck in — and unit-cost-per-task is the only metric that surfaces this in time.
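Once inference logs carry a feature tag and a task id propagated from product telemetry, the aggregation itself is small. The log rows below are invented, and the join key and field names are assumptions about instrumentation you would have to build.

```python
from collections import defaultdict
from statistics import mean

# Invented joined records: (feature, task_id, cost_usd), where cost_usd covers
# each LLM call, tool execution, or retry attributed to that task.
joined_log = [
    ("summarization", "t-101", 0.004),
    ("summarization", "t-102", 0.310),   # the 50-page-contract tail
    ("agent_tools",   "t-201", 0.020),
    ("agent_tools",   "t-201", 0.026),   # retry re-sends accumulated context
    ("agent_tools",   "t-201", 0.018),   # tool-result round trip
    ("chat",          "t-301", 0.006),
]

cost_per_task = defaultdict(float)
for feature, task_id, cost in joined_log:
    cost_per_task[(feature, task_id)] += cost

per_feature = defaultdict(list)
for (feature, _), cost in cost_per_task.items():
    per_feature[feature].append(cost)

for feature, costs in sorted(per_feature.items()):
    print(f"{feature:>13}: {len(costs)} task(s), mean ${mean(costs):.3f}/task, max ${max(costs):.3f}")
```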
The instrumentation work to do this is real. It requires a request-level join between billing data and product analytics, retention of token counts at the feature level, and a definition of "task" that's stable across feature evolutions. Most teams don't have this in 2026. The teams that do are the ones whose forecasts hold.
Quality-Cost Curves as Budget Inputs
The fourth discipline is to make the eval-quality-vs-cost curve an input to the budget process, not an output of an engineering side project. Every model decision the team makes — tier selection, routing thresholds, prompt structure, cache strategy — is implicitly a point on a curve that trades quality for cost. The team that hasn't measured the curve is making the trade-off by accident.
A quality-cost curve for a feature plots eval pass-rate against per-task cost across the available operating points: each model tier, each routing aggressiveness, each prompt length. The curve usually has a knee — a point past which further quality improvement costs disproportionately more, or further cost reduction collapses quality disproportionately. The knee is where the feature should sit by default, and "let's move the operating point" is a budget conversation that produces a defensible answer.
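A crude way to find the knee is to walk the operating points in cost order and watch the marginal quality per extra dollar collapse. The points below are hypothetical eval results, not benchmarks of any real model.

```python
# (label, cost_per_task_usd, eval_pass_rate) -- all values invented for illustration
operating_points = [
    ("budget tier, short prompt", 0.002, 0.71),
    ("budget tier, full prompt",  0.004, 0.78),
    ("mid tier, cascade routing", 0.009, 0.88),
    ("mid tier, no routing",      0.015, 0.90),
    ("flagship, cascade routing", 0.024, 0.93),
    ("flagship, full prompt",     0.045, 0.94),
]

points = sorted(operating_points, key=lambda p: p[1])  # ascending cost
print("marginal quality per extra dollar, moving up the curve:")
for (a, a_cost, a_q), (b, b_cost, b_q) in zip(points, points[1:]):
    pts_per_dollar = (b_q - a_q) * 100 / (b_cost - a_cost)
    print(f"  {a} -> {b}: {pts_per_dollar:,.0f} pass-rate points per extra $")
```

In this made-up curve the knee sits at the mid-tier cascade: each step past it buys far less quality per dollar than the step before it.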
This is also where prompt caching becomes a budget input rather than a performance optimization. Cache hits typically cost 10% of standard input pricing, but the discount only applies to content the team has structured to be cacheable, and the cache-write premium means the savings only materialize past a hit-rate threshold. A forecast that assumes a cache hit rate the system isn't structured to deliver is a forecast that misses by the discount factor times the volume.
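The break-even arithmetic is worth writing down. The 10% read price and the 25% write premium below are assumptions that vary by provider and cache TTL; the structure of the calculation is the point.

```python
READ_MULT = 0.10    # assumed: cache hit billed at 10% of the standard input price
WRITE_MULT = 1.25   # assumed: cache write billed at a 25% premium over standard

def effective_input_multiplier(hit_rate: float) -> float:
    """Blended price of a cacheable token, as a multiple of the standard input price."""
    return hit_rate * READ_MULT + (1 - hit_rate) * WRITE_MULT

break_even = (WRITE_MULT - 1) / (WRITE_MULT - READ_MULT)
print(f"break-even hit rate: {break_even:.0%}")   # below this, caching costs you money
for h in (0.20, 0.50, 0.80, 0.95):
    print(f"hit rate {h:.0%}: {effective_input_multiplier(h):.2f}x standard input price")
```

Under these assumptions the cache is a net loss below roughly a 22% hit rate, and the headline 90% discount only appears as the hit rate approaches 1.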
The Org Failure Mode and Why It's Structural
The pattern that produces quarterly re-baseline meetings is structural, not individual. Finance models AI as a single line item that grows linearly with users, because that's what every other software cost they've ever managed has done. The AI team builds features that grow super-linearly with token amplification effects, because that's what the technology rewards. The variance shows up as a quarterly reforecast nobody trusts, and the cost nobody surfaces isn't the bill: it's the chilling effect on roadmap decisions when finance can't tell the difference between a feature that adds 10% to the bill and one that adds 10×.
When that gap exists, the conservative play for finance is to slow every AI feature decision until they understand it. The conservative play for engineering is to under-promise on cost so they don't get blamed for the next overrun. Both of these are rational individual responses that produce a globally bad outcome — the team ships fewer ambitious AI features than the technology would actually support, and the ones they do ship are over-engineered for cost predictability rather than user value.
The unlock isn't a better model or a better dashboard. It's a shared vocabulary — scenario bands, model-mix levers, per-feature unit costs, quality-cost curves — that lets finance and engineering reason about the same trade-offs. Without that vocabulary, every AI cost conversation is a translation problem, and translations lose information.
AI Infrastructure Economics Is FinOps, Not SaaS
The architectural realization is that AI infrastructure economics is closer to cloud-native FinOps than to SaaS unit economics. SaaS-style forecasting works when unit costs are stable and workloads are predictable. FinOps-style forecasting assumes neither — it assumes the unit cost is a moving target and the workload is the team's own evolving choice, and it builds a discipline around scenario modeling, granular attribution, and rolling forecasts because those are the only tools that produce defensible numbers in that regime.
The team that hands finance a SaaS-style forecast is going to be in a re-baseline meeting every quarter until somebody invests in the AI-aware version. The investment isn't optional, and it isn't a tooling purchase — it's an instrumentation and process change that changes who is allowed to ship what without budget review. The good news is that the FinOps community has done much of the foundational work; the bad news is that AI workloads are weird enough that the foundation only gets you partway, and the rest is a per-team build.
If you can't write the capacity plan finance is asking for, the right response is not to write a worse one. The right response is to explain what's missing, what would have to be built to produce a defensible answer, and what scenario bands you can offer in the meantime. Finance teams adapt faster than engineering teams give them credit for, but only when the engineering side stops handing them numbers it doesn't believe.
Sources

- https://www.spheron.network/blog/ai-inference-cost-economics-2026/
- https://www.finops.org/wg/cost-estimation-of-ai-workloads/
- https://www.finops.org/wg/how-to-forecast-ai-services-costs-in-cloud/
- https://www.finops.org/wg/effect-of-optimization-on-ai-forecasting/
- https://data.finops.org/
- https://oplexa.com/ai-inference-cost-crisis-2026/
- https://www.silicondata.com/blog/llm-cost-per-token
- https://pricepertoken.com/trends
- https://introl.com/blog/inference-unit-economics-true-cost-per-million-tokens-guide
- https://artificialanalysis.ai/models/caching
- https://www.clarifai.com/blog/ai-cost-controls
- https://www.tntra.io/blog/finops-for-ai-cost-optimization-strategies/
- https://medium.com/@klaushofenbitzer/token-cost-trap-why-your-ai-agents-roi-breaks-at-scale-and-how-to-fix-it-4e4a9f6f5b9a
- https://online.stevens.edu/blog/hidden-economics-ai-agents-token-costs-latency/
- https://medium.com/@yugank.aman/the-true-cost-of-enterprise-ai-agents-a-complete-tco-framework-e3b6228857e7
- https://router.orq.ai/blog/auto-router-intelligent-llm-routing
- https://zilliz.com/learn/routellm-open-source-framework-for-navigate-cost-quality-trade-offs-in-llm-deployment
- https://www.cleveroad.com/amp/blog/claude-api-cost-optimization-enterprise/
