Building on our FinOps discussion, I want to dig into something that’s become a critical challenge in 2026: How do you implement FinOps for AI inference costs when usage patterns are fundamentally unpredictable?
The New Frontier
Traditional FinOps was built for compute, storage, and network—resources that scale relatively predictably. You can model EC2 costs based on instance types and hours. You can project S3 costs based on data volume. The math is straightforward.
AI inference costs are different. Token consumption can spike 10x overnight if a feature goes viral, if users discover a new use case, or if someone accidentally creates a loop that hammers your LLM endpoint.
My Experience: 40% of Infrastructure Budget in One Month
Last quarter, we shipped an AI-powered feature to our enterprise customers. We did our homework—projected token usage based on beta testing, set what we thought were conservative estimates, got finance approval.
Within 30 days, that single feature consumed 40% of our total infrastructure budget.
What happened? A few power users discovered they could use it for a workflow we hadn’t anticipated. Their usage was legitimate—not abuse, just a use case we didn’t predict. But each interaction cost 5x what we’d modeled.
We didn’t have circuit breakers. We didn’t have per-user limits. We just watched the spend climb and scrambled to implement controls retroactively.
The Challenge: Unknowable Patterns
Unlike traditional compute, where you can A/B test and measure costs in staging, AI usage patterns often don’t emerge until production at scale. Your beta users might not represent your production users. Edge cases that rarely happened in testing become common in production.
This makes pre-deployment cost modeling—the “shift-left FinOps” we discussed in the deployment blocking thread—much harder. How do you set a cost threshold for something you can’t accurately predict?
Approaches We’re Considering
Here’s what we’re exploring:
1. Per-User Token Limits
Give each user a daily or monthly token budget. When a user exceeds it, their requests are throttled or billed as overage (for enterprise customers). But how do you set fair limits without hurting legitimate power users?
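To make the idea concrete, here's a minimal sketch of a per-user daily token budget with a soft-throttle tier before the hard cap. Everything here is illustrative: the class name, the 80% soft limit, and the in-memory counter are assumptions, not a description of any real system (production would use Redis or similar for the counters).

```python
from dataclasses import dataclass, field
from datetime import date

@dataclass
class TokenBudget:
    daily_limit: int                           # hard cap on tokens per user per day
    soft_limit_pct: float = 0.8                # start degrading at 80% of the cap
    used: dict = field(default_factory=dict)   # (user_id, date) -> tokens consumed

    def check_request(self, user_id: str, tokens: int) -> str:
        """Return 'allow', 'throttle', or 'deny' for a proposed request."""
        key = (user_id, date.today())
        current = self.used.get(key, 0)
        if current + tokens > self.daily_limit:
            return "deny"        # over the hard cap: block, or bill as overage
        self.used[key] = current + tokens
        if self.used[key] > self.daily_limit * self.soft_limit_pct:
            return "throttle"    # near the cap: e.g. route to a cheaper model
        return "allow"

budget = TokenBudget(daily_limit=100_000)
print(budget.check_request("alice", 50_000))   # allow: well under the cap
print(budget.check_request("alice", 40_000))   # throttle: crosses the 80% soft limit
print(budget.check_request("alice", 20_000))   # deny: would exceed the hard cap
```

The soft tier is the part that matters for power users: instead of a hard cutoff, they get a degraded-but-working experience while you decide whether to raise their limit.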
2. Circuit Breakers at Cost Thresholds
If a feature hits $X in a day, automatically throttle it down to slow mode. Problem: This creates terrible UX if you don’t design the failure mode well.
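A rough sketch of what that could look like, assuming a per-feature daily spend counter. The thresholds and the three-state design (normal, slow, open) are made up for illustration; the point is that "slow mode" is a designed intermediate state, not an outage.

```python
class CostCircuitBreaker:
    """Track a feature's daily spend and report which serving mode to use."""

    def __init__(self, daily_limit_usd: float, slow_mode_at: float = 0.75):
        self.daily_limit = daily_limit_usd
        self.slow_mode_at = slow_mode_at   # fraction of the limit where we degrade
        self.spent_today = 0.0

    def record_cost(self, usd: float) -> None:
        self.spent_today += usd

    def mode(self) -> str:
        """'normal' = full model; 'slow' = cheaper model + queueing; 'open' = feature paused."""
        if self.spent_today >= self.daily_limit:
            return "open"
        if self.spent_today >= self.daily_limit * self.slow_mode_at:
            return "slow"
        return "normal"

breaker = CostCircuitBreaker(daily_limit_usd=500.0)
breaker.record_cost(300.0)
print(breaker.mode())   # normal: $300 is under the $375 slow-mode threshold
breaker.record_cost(100.0)
print(breaker.mode())   # slow: $400 has crossed 75% of the limit
breaker.record_cost(150.0)
print(breaker.mode())   # open: $550 exceeds the $500 daily limit
```

The UX problem lives in what each mode shows the user: "slow" can be invisible (cheaper model, longer latency), but "open" needs an honest message rather than a generic error.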
3. Separate AI Budgets from Core Infrastructure
Treat AI as experimental with higher tolerance for overages, funded separately from core services. But this only works if finance agrees to open-ended experimental budgets (good luck with that in 2026).
4. Cost-Per-Interaction Caps
Block individual API calls that would exceed a per-request cost limit (e.g., “this prompt would cost $2 to process, are you sure?”). Feels user-hostile but might be necessary for worst cases.
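The precheck itself is simple if you can count input tokens and cap output tokens before the call. The prices below are placeholders, not real rates, and the worst-case assumption (the model emits the full output budget) is a deliberate overestimate.

```python
PRICE_PER_1K_INPUT = 0.01    # assumed $ per 1K input tokens (placeholder)
PRICE_PER_1K_OUTPUT = 0.03   # assumed $ per 1K output tokens (placeholder)

def estimate_cost_usd(input_tokens: int, max_output_tokens: int) -> float:
    """Worst-case estimate: assume the model uses its full output budget."""
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT \
         + (max_output_tokens / 1000) * PRICE_PER_1K_OUTPUT

def precheck(input_tokens: int, max_output_tokens: int, cap_usd: float) -> bool:
    """True if the call is allowed under the per-request cost cap."""
    return estimate_cost_usd(input_tokens, max_output_tokens) <= cap_usd

print(precheck(input_tokens=2_000, max_output_tokens=1_000, cap_usd=0.10))   # True: ~$0.05
print(precheck(input_tokens=50_000, max_output_tokens=8_000, cap_usd=0.10))  # False: ~$0.74
```

A gentler variant of "are you sure?" is to auto-shrink the output budget or truncate the context to fit under the cap, and only prompt the user when that's impossible.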
The Real Question
According to Platform Engineering 2026 predictions, AI-specific budgets for token and inference costs are becoming standard. But “having a budget” and “enforcing it without killing the feature” are very different problems.
Has anyone successfully implemented FinOps for LLM costs? What actually works in production when patterns are unpredictable?
I’m specifically curious:
- How do you set initial budgets when you don’t have usage data?
- What’s the right balance between protecting budget and allowing experimentation?
- Have you implemented real-time circuit breakers, and if so, what failure modes do you show users?
- Are you using different policies for different customer tiers (free vs. enterprise)?
The stakes feel higher here than in traditional FinOps because AI costs can spiral so much faster than traditional compute. We need preventive controls, but I'm not sure which patterns actually work.
What are you all seeing?