Token Budgets Are the New Internal IAM
The first time your AI bill clears seven figures in a month, the budget meeting changes shape. Until then, the question is "can we afford this." After that, the question is "who gets how much" — and most engineering orgs discover, in real time, that they have no policy framework for answering it. The team that shipped the loudest demo holds the highest quota by accident. Finance pushes for flat per-headcount caps that starve the team doing the highest-leverage work. Security gets cut out of the conversation entirely until somebody notices that the eval team has been pulling production traffic through their personal token allowance for six months.
The reason this conversation always feels like a cloud-cost argument is that it almost is one — but not quite. With cloud, the unit of waste is a forgotten EC2 instance and the worst case is a 3x bill. With token quotas, the unit of waste is a runaway agent loop, and the unit of access is a user-facing capability: whoever holds the budget can ship the feature. That second property is what makes token allocation rhyme with capability-based security instead of with cloud FinOps. The quota is not just a spending cap. It is the right to make a class of inferences happen.
Treating it as a finance line item works until it doesn't. The day a product manager realizes they need to file a ticket with the platform team to raise their per-feature token cap so they can ship a demo on Monday is the day the quota system has, structurally, become an authorization system — and it inherits all the failure modes of one designed by people who weren't trying to design one.
The political problem dressed as a technical one
The State of FinOps survey has tracked AI cost management as a category for three years; in 2024, 31% of respondents reported actively governing AI spend, and by 2026 that figure is 98%. The growth came from organizations crossing two thresholds at once. The first is dollar size — average monthly AI budgets rose 36% in 2025 to roughly $85,000 per team, and the largest enterprises are already in the seven- and eight-figure monthly range. The second threshold is variance. Inference cost is genuinely lumpy in a way that engineering organizations weren't built to manage. Salaries are fixed. Cloud spend grows with traffic. Token spend can 5x in a quarter because one team turned on a new agent product and the new prompt happened to be three times longer than the one it replaced. None of the existing budget muscles are calibrated for that.
What snaps once teams cross both thresholds is the implicit assumption that quota is a finance concern. Up to that point, the platform team has typically been operating a single shared API key with a global rate limit, watching the dashboard, and sending the occasional Slack message when somebody's experiment runs hot. Past that point, the quota becomes the gating resource for product velocity. Whichever team has more inference budget can run more experiments, ship more features, and serve more users. And because nobody designed the budget system as an authorization system, the allocation logic that emerges is whatever path of least resistance shows up first — usually "the squeakiest team gets the highest cap, and security finds out later."
This is not a hypothetical. Shadow AI surveys consistently report that more than 90% of employees use AI tools at work but only ~40% of organizations have sanctioned enterprise subscriptions, and the majority of sensitive AI interactions originate from personal accounts. In the same surveys, two-thirds of executives admit to being comfortable with unsanctioned AI use because it preserves speed. Read those two facts together and the conclusion is uncomfortable: the absence of a real allocation policy is not an oversight. It's the equilibrium that emerges when leadership values speed and nobody owns the tradeoff between speed and control.
What the IAM analogy actually buys you
The reason "token budgets are the new internal IAM" is more than a slogan is that capability-based security has already worked through most of the design problems that token-budget systems are now blundering into. A capability, in the classical formulation, is an unforgeable token of authority that designates both the resource and the permitted action. Three properties make a capability system work in practice: unforgeability (users can't mint capabilities out of thin air), transferability (capabilities can be delegated to subsystems), and revocability (the issuer can take them back). Map that onto inference quotas and the parallels are exact. A token budget is unforgeable because it's enforced at the gateway. It is transferable because subsystems and agents can run on tokens granted to a parent service. And it should be revocable — though in most current setups, it isn't.
The discipline that ports cleanly from IAM is grant-shaped allocation. Instead of a global pool that teams draw down against on a first-come basis, every grant is a policy with four named fields: an owner (the human accountable for the spend), a scope (a feature, a tenant, an environment, an experiment), a renewal date (the grant expires; you have to ask for it again), and a class (experimentation versus production traffic, which should be capped under different policies because their risk profiles diverge). When you write quota allocations in this shape, the political conversation gets a vocabulary it didn't have before. "The recommendations team has $40k/month for production traffic, $5k/month for experimentation, expires June 30, owner is Priya" is something a CFO and a head of engineering can actually negotiate over. "The recommendations team has been spending $45k/month, can you cut it down" is not.
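The grant shape described above can be written down as data rather than as tribal knowledge. A minimal sketch in Python, independent of any particular gateway — the field names and the expiry year are illustrative, not a schema any tool mandates:

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

class GrantClass(Enum):
    PRODUCTION = "production"
    EXPERIMENTATION = "experimentation"

@dataclass(frozen=True)
class TokenGrant:
    """One quota grant: a policy with a named human owner, not a bare number."""
    owner: str            # the human accountable for the spend
    scope: str            # a feature, a tenant, an environment, an experiment
    monthly_usd: float    # spend ceiling for this grant
    grant_class: GrantClass
    expires: date         # unrenewed grants are expired by default

    def is_active(self, today: date) -> bool:
        return today <= self.expires

# The example grant from the text, written as data (year assumed):
grant = TokenGrant(
    owner="Priya",
    scope="recommendations/production",
    monthly_usd=40_000,
    grant_class=GrantClass.PRODUCTION,
    expires=date(2026, 6, 30),
)
```

The point of the record is the negotiation it enables: every field is something a CFO and a head of engineering can argue about by name, and the expiry date forces the conversation to recur.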
The class distinction matters more than it sounds like it should. Production traffic has tight latency requirements, predictable token shapes, and is metered against user-facing SLOs. Experimentation traffic is exploratory by definition — a single bad eval run can chew through a quarter's allocation in an afternoon. Pooling them makes both capacity planning and incident response harder. When experimentation hits its cap, the right behavior is to throttle the eval and page a human; when production hits its cap, the right behavior is to fail open or page differently because user traffic is on the line. A grant model that doesn't distinguish the two is a grant model that will, in some incident, fail in exactly the wrong direction.
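The asymmetry between the two classes can be made explicit in the enforcement path itself. A sketch under the assumptions in the text — the paging targets are hypothetical names, and a real system would route them to an actual alerting hook:

```python
def on_cap_exceeded(grant_class: str) -> dict:
    """Decide what happens when a grant hits its spend ceiling.

    Experimentation fails closed: throttle the eval and page the grant
    owner. Production fails open: let traffic through and page the
    on-call, because user-facing requests are on the line.
    """
    if grant_class == "experimentation":
        return {"allow_request": False, "page": "grant-owner"}
    if grant_class == "production":
        return {"allow_request": True, "page": "on-call"}
    raise ValueError(f"unknown grant class: {grant_class}")

# A runaway eval gets stopped; production keeps flowing while humans look.
eval_decision = on_cap_exceeded("experimentation")
prod_decision = on_cap_exceeded("production")
```

A grant model that pools both classes has to pick one of these two behaviors for all traffic, which is exactly the wrong-direction failure the text describes.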
The two failure modes nobody escapes by accident
The two natural shapes of a token-allocation policy are the two failure modes. The first is the centralized chokepoint: a platform team owns the budget and every other team files tickets against it. This solves visibility but kills experimentation. The platform team is by construction the slowest place to get a quota change approved, because they bear the risk if the request turns out to be wasteful. New product ideas die in the queue, and the teams most affected are the ones doing the most exploratory work — exactly the work that should be cheapest to fund at the margin.
The second is the decentralized credit pool: every team holds its own credits, often via virtual keys against a shared provider account. This preserves velocity but destroys aggregate visibility. Nobody can answer "how much are we spending on AI in total this quarter" without doing a reconciliation across N spreadsheets, and the answer always turns out larger than anyone projected. Worse, in the decentralized world, the security and compliance team has no choke point. There is no natural place to enforce "no production traffic with PII goes to vendor X under contract terms Y" because the requests don't pass through a layer that knows about either constraint.
The way out, borrowed again from IAM, is hierarchy with delegation. The org sets a top-level budget; that budget is split into team grants with stated owners; team grants are further split into per-feature or per-environment grants whose owners are the teams themselves. Audit and policy enforcement happen at the gateway — that piece is non-negotiable, because it's the only point where every request can be authoritatively counted and inspected. But the actual allocation decisions cascade downward, the same way IAM lets a project owner mint scoped service accounts without going back to the central IAM admin. Most of the LLM gateways shipping in 2026 — Portkey, LiteLLM, Bifrost, Kong's AI Gateway — already model some version of this hierarchy. What they don't ship is the organizational discipline to use it correctly.
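The cascade can be sketched as a tree in which each owner delegates from their own ceiling without going back up the hierarchy. The org-level number and the split below it are illustrative; the invariant that matters is that a child grant can never exceed what its parent has left:

```python
from dataclasses import dataclass, field

@dataclass
class BudgetNode:
    """A node in the grant hierarchy: org -> team -> feature/environment."""
    name: str
    owner: str
    monthly_usd: float
    children: list["BudgetNode"] = field(default_factory=list)

    def delegate(self, name: str, owner: str, monthly_usd: float) -> "BudgetNode":
        """Split off a child grant. The parent's owner decides; no ticket
        to a central admin, mirroring how IAM lets a project owner mint
        scoped service accounts."""
        allocated = sum(c.monthly_usd for c in self.children)
        if allocated + monthly_usd > self.monthly_usd:
            raise ValueError(f"{self.name}: delegation exceeds parent ceiling")
        child = BudgetNode(name, owner, monthly_usd)
        self.children.append(child)
        return child

# The example grants from the text, assuming a $200k org-level budget:
org = BudgetNode("org", owner="platform-eng-lead", monthly_usd=200_000)
recs = org.delegate("recommendations", owner="Priya", monthly_usd=45_000)
recs.delegate("production", owner="Priya", monthly_usd=40_000)
recs.delegate("experimentation", owner="Priya", monthly_usd=5_000)
```

The gateway still counts every request at the single enforcement point; only the allocation decisions move down the tree.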
The tooling layer has the buttons; the org doesn't have the policy
If you read the LiteLLM docs, you will find a clean four-level virtual-key hierarchy: customer, team, virtual key, provider, each with its own spend ceiling and reset schedule. If you read the Portkey docs, you find the same idea with a different vocabulary. The buttons exist. What does not exist, in most organizations adopting these tools, is the meta-process for deciding what numbers to put in those buttons.
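From the application side, pressing those buttons looks roughly like this. The field names (`team_id`, `max_budget`, `budget_duration`) follow LiteLLM's documented `/key/generate` endpoint, but verify them against the proxy version you actually run before relying on the shape:

```python
import json

def key_grant_payload(team_id: str, max_budget_usd: float,
                      budget_duration: str = "30d") -> str:
    """Build the request body for minting a team-scoped virtual key.

    Field names follow LiteLLM's /key/generate endpoint as documented;
    the numbers you put in them are the policy the gateway cannot
    decide for you.
    """
    return json.dumps({
        "team_id": team_id,
        "max_budget": max_budget_usd,        # spend ceiling in USD
        "budget_duration": budget_duration,  # ceiling resets on this schedule
    })

# A key for the production grant from the text:
payload = key_grant_payload("recommendations", 40_000, "30d")
```

The gateway enforces whatever numbers arrive in that payload; deciding what the numbers should be is the meta-process the tooling does not ship.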
This shows up in the bug tracker. There is an open issue against LiteLLM where user-level budgets aren't enforced when the virtual key belongs to a team — a subtle hierarchy bug that lets teams quietly bypass per-user caps just by routing through a team key. The bug is mundane on its face. The reason it matters is that the orgs filing the bug had assumed the policy was being enforced. They had written down the user caps. They had configured the gateway. They had told their finance team that the budget was under control. None of that was true. This is the same class of failure mode that IAM systems hit in the early 2010s, when AWS users discovered that policy was actually a hard, uncomposable thing and that the dashboard saying "this user has limited permissions" wasn't the same as the user actually being limited. Token quota tooling is going through that same maturity arc, accelerated.
The right response is not to wait for the gateway to ship the perfect feature set. It's to treat the gateway as a partial enforcement layer and add the missing pieces in your own application: tag every request with the owning team, the feature, the experiment ID, and the environment; emit those tags into a chargeback ledger that engineering, finance, and security all read; reconcile the ledger against vendor invoices monthly so drift surfaces fast; and build the grant-renewal workflow as a real ticket queue with names attached, not as a recurring Slack message that gets ignored. The FinOps Foundation's GenAI cost-and-usage tracker working group has been documenting these patterns for two years now, and the pieces that survive contact with reality are the boring ones: tagging discipline at the application layer, propagation through to billing, and human owners on every grant.
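The tagging discipline above is mostly a matter of deciding the schema and refusing requests that skip it. A minimal sketch of one chargeback-ledger entry, using the four tags named in the text — the JSON-lines shape and field names are illustrative, not a standard:

```python
import json
from datetime import datetime, timezone
from typing import Optional

def ledger_entry(team: str, feature: str, environment: str,
                 experiment_id: Optional[str],
                 prompt_tokens: int, completion_tokens: int,
                 usd_cost: float) -> str:
    """Emit one chargeback-ledger line per request.

    The gateway counts the tokens; the application supplies the tags.
    JSON lines keep the ledger greppable by engineering, finance, and
    security alike, and reconcilable against vendor invoices monthly.
    """
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "team": team,
        "feature": feature,
        "environment": environment,
        "experiment_id": experiment_id,   # None for production traffic
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "usd_cost": usd_cost,
    })

line = ledger_entry("recommendations", "similar-items", "prod",
                    None, 1200, 350, 0.0041)
```

Everything downstream — per-grant dashboards, invoice reconciliation, the cost-per-outcome question — is aggregation over lines like this one, which is why the tagging has to happen at the application layer where the team and feature are actually known.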
The architectural realization
The thing that takes longest to internalize is that token budgets are not a finance line item. They are an authorization surface. Every quota you grant is a capability — the holder can cause inference to happen at a certain rate, in a certain shape, with whatever behaviors the model exhibits along the way. Every quota you withhold is a feature you've prevented from shipping. The CFO will eventually ask you to pull the lever; if you've designed the lever as a dashboard slider with no policy attached, you'll discover that pulling it breaks things you didn't know it was attached to.
The teams that get this right tend to do three things. They name an owner for inference budget at the org level — usually a platform-eng leader who reports to both a VP of engineering and to finance, so the political tension lives inside one person's job rather than as a permanent fight between departments. They write grants down with the IAM-style fields above and treat unrenewed grants as expired by default, the same way an SRE team treats unrenewed certificates. And they invest early in the chargeback ledger, because the moment the conversation shifts from "how much are we spending" to "what cost-per-outcome are we getting," every team needs to be able to answer the question for their own grant without involving finance.
If you are still in the world where one shared key serves the whole org, you have time. The day you cross seven figures, you will not. The decision you make in that window — whether token budgets are a finance problem to be optimized or an authorization surface to be designed — will shape what your AI org can do for the next five years. The organizations that treat it as the latter will have a vocabulary for the political fight when it comes. The ones that don't will discover, in the middle of an incident, that the lever they thought they had isn't connected to anything.
- https://portkey.ai/blog/tracking-llm-token-usage-across-providers-teams-and-workloads/
- https://docs.litellm.ai/docs/proxy/virtual_keys
- https://github.com/BerriAI/litellm/issues/12905
- https://www.finops.org/wg/how-to-build-a-generative-ai-cost-and-usage-tracker/
- https://data.finops.org/
- https://www.token.security/blog/the-shift-from-credentials-to-capabilities-in-ai-access-control-systems
- https://www.stackspend.app/resources/blog/managing-llm-spend-2026-approaches-pros-cons
- https://www.truefoundry.com/blog/rate-limiting-in-llm-gateway
- https://www.cio.com/article/4083473/shadow-ai-the-hidden-agents-beyond-traditional-governance.html
- https://docs.cloud.google.com/apigee/docs/api-platform/tutorials/using-ai-token-policies
