AI Agent Cost Explosions and How We Fixed Them with Quota Governance

I want to tell you about a Friday afternoon that cost us $47,000.

A data pipeline agent - nothing exotic, just an automated workflow that enriched customer records by calling an external firmographic API - got into a loop. The enrichment API started returning a specific error code for records it could not match. Our agent was designed to retry on error, and that error code happened to be one the agent interpreted as “try again with slightly different parameters.” So it did. Endlessly - at full speed for six hours, then at whatever pace the external API allowed.

By the time our on-call engineer noticed the anomaly (Sunday morning, actually - it ran through the weekend), the agent had made 2.3 million API calls. The external API billed us for every call. $47,312.40, invoiced on Monday.

There was no quota enforcement. There were no circuit breakers. There were no alerts triggered by abnormal API call volume. There was no human-in-the-loop trigger. The agent just ran until the external API’s own rate limiting finally slowed it to a trickle.

Here is how we rebuilt our quota governance after that incident.

The Four Dimensions of Agent Quotas

We had been thinking about quotas one-dimensionally: how many times can this agent run per day? That is necessary but wildly insufficient. We now enforce quotas across four dimensions:

1. Token consumption per run and per day
LLM tokens are not free. An agent that is supposed to summarize 10 documents but somehow ends up processing 10,000 needs a hard stop, not just a warning. We set per-run token budgets based on the agent’s intended workload, with a 2x buffer. Exceeding 80% triggers a warning log. Exceeding 100% kills the run with a specific error code that pages the owning team.
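A minimal sketch of what this per-run budget check could look like. The class and method names here are illustrative, not our production API; the real system emits structured logs and pages instead of setting a flag.

```python
class TokenBudgetExceeded(Exception):
    """Raised when a run crosses 100% of its per-run token budget."""

class TokenBudget:
    def __init__(self, limit: int):
        self.limit = limit    # hard per-run budget (intended workload x2)
        self.used = 0
        self.warned = False

    def record(self, tokens: int) -> None:
        """Account for tokens just consumed; enforce the 80%/100% tiers."""
        self.used += tokens
        if self.used >= self.limit:
            # 100%: kill the run with a specific error; caller pages the team
            raise TokenBudgetExceeded(
                f"token quota exceeded: {self.used}/{self.limit}")
        if not self.warned and self.used >= 0.8 * self.limit:
            self.warned = True   # 80%: warning log (stand-in: a flag here)
```

The important property is that `record` is called by the runtime after every LLM call, so the agent itself cannot skip the check.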

2. External API calls per time window
This is the lesson from the $47k incident. Every external API call the agent is authorized to make has a rate limit stored in the quota system. The quota enforcer sits between the agent’s tool execution layer and the actual API calls. The agent cannot bypass it.
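A sliding-window limiter is one way to implement that enforcer; this is a hypothetical sketch of the mechanism, not our actual code. The tool execution layer calls `check` before every outbound request, so a retry loop hits the quota instead of the vendor's invoice.

```python
import time
from collections import deque

class RateLimitExceeded(Exception):
    """Raised when an external API call would exceed its window quota."""

class ApiCallQuota:
    """Sits between the agent's tool execution layer and the HTTP client."""
    def __init__(self, max_calls: int, window_s: float):
        self.max_calls = max_calls
        self.window_s = window_s
        self._calls = deque()  # timestamps of recent calls

    def check(self, now=None) -> None:
        """Call before every outbound request; raises instead of calling."""
        now = time.monotonic() if now is None else now
        # drop timestamps that have aged out of the window
        while self._calls and now - self._calls[0] >= self.window_s:
            self._calls.popleft()
        if len(self._calls) >= self.max_calls:
            raise RateLimitExceeded(
                f"{self.max_calls} calls / {self.window_s}s exceeded")
        self._calls.append(now)
```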

3. Compute time and parallel execution limits
Some of our agents can spawn parallel sub-tasks. Without limits, one misbehaving agent can consume all available compute and starve other agents. We model this the same way Kubernetes models CPU and memory: requests (what the agent typically needs) and limits (what it is never allowed to exceed).
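The requests/limits split can be enforced with something as simple as a bounded semaphore; a sketch under assumed names (`request` is what the scheduler plans for, `limit` is the hard cap):

```python
import threading

class ComputeLimits:
    """Kubernetes-style requests/limits for agent parallelism."""
    def __init__(self, request: int, limit: int):
        assert request <= limit
        self.request = request               # typical concurrency needed
        self.limit = limit                   # never allowed to exceed
        self._slots = threading.BoundedSemaphore(limit)

    def try_spawn(self) -> bool:
        """Non-blocking: returns False when the agent is at its limit."""
        return self._slots.acquire(blocking=False)

    def finished(self) -> None:
        """Release a slot when a sub-task completes."""
        self._slots.release()
```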

4. Human escalation triggers
These are not exactly quotas but they belong in the same governance layer. Certain conditions must trigger human review before the agent continues: writing more than N records in a single run, accessing a sensitive data tier, calling an external service that was not in the original approved tool list.

The Soft/Hard Limit Pattern

Binary quotas (you are either within limits or you are not) create bad failure modes - you go from full speed to complete stop. We use a graduated response:

  • 60% of limit: Debug-level log entry only
  • 80% of limit: Warning log + metric recorded in our observability stack
  • 95% of limit: Throttled - add exponential backoff between calls, notify the owning team via Slack
  • 100% of limit: Hard stop. The agent run terminates with a quota-exceeded exit code. The owning team is paged. A summary of what the agent did up to this point is automatically written to our incident log.

The 95% throttle is the most important tier. It gives the owning team a chance to intervene before hard stop, and it slows the agent enough that the blast radius of a runaway loop is bounded even if nobody responds immediately.
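The graduated tiers reduce to a small utilization lookup. The action names below are illustrative labels for the responses described above, not a real API:

```python
def quota_action(used: float, limit: float) -> str:
    """Map quota utilization to the graduated response tiers."""
    pct = used / limit
    if pct >= 1.0:
        return "hard_stop"   # terminate run, page the owning team
    if pct >= 0.95:
        return "throttle"    # exponential backoff + Slack notify
    if pct >= 0.80:
        return "warn"        # warning log + metric
    if pct >= 0.60:
        return "debug_log"   # debug-level log entry only
    return "ok"
```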

Showback: Making Teams Own Their Costs

The organizational challenge is getting teams to actually define quotas before they deploy an agent, and to actually review quota consumption regularly afterward.

We solved this with showback reports. Every Monday morning, each engineering team gets an automated Slack message showing:

  • Agent runs last week: N
  • Token consumption: X tokens (estimated cost: $Y)
  • Top 3 agents by cost
  • Any quota threshold events

The dollar figure is the key. Engineers respond to “your data enrichment agent consumed $340 in LLM tokens last week” in a way they do not respond to “your agent made 18,000 API calls.” Cost visibility drives quota hygiene.

We also discovered agents that were running way more than anyone realized. One agent was configured to run every 15 minutes but its use case only required daily runs. The owning team had no idea - it had been set up by someone who left the company. Showback surfaced it within a week.

The Quota Review Gate

We now require a quota plan before any agent can be promoted to production. The plan is a simple document (one page max) that answers:

  1. What is the expected number of runs per day, and what is your confidence level?
  2. What is the expected token consumption per run?
  3. What external APIs does this agent call and at what rate?
  4. What is the worst-case cost if this agent runs at 10x expected volume?
  5. What human will be paged if the agent hits quota limits at 3am?

That last question is critical. You need a named human, not a team alias. Someone whose name is on the page is much more motivated to configure quota limits correctly.

The $47k incident paid for our entire quota governance system several times over. But more importantly, we have not had a runaway agent cost incident since. That is the real return on investment.

Luis, excellent war story. The $47k incident is exactly the kind of thing that gets executive attention, which is both good (you get resources to fix it) and bad (you are now fixing it under pressure).

One technical addition: token-per-minute rate limiting at the LLM gateway layer is separate from and complementary to your application-level quota enforcement, and most teams only implement one of them.

Your quota system enforces “this agent is allowed to use X tokens per run.” That is the right application-level control. But you also need enforcement at the LLM gateway layer - the proxy or API management layer that sits between your agents and the LLM API - that enforces “no single agent can consume more than Y tokens per minute, regardless of what it is authorized to do.”

Here is why both are needed: your quota system might allow an agent 100k tokens per run. If the agent hits a pathological input and tries to burn all 100k tokens in 30 seconds, it will hammer the LLM API, potentially affecting other agents that share the same API key, and may trigger the LLM provider’s own rate limiting, which causes cascading failures.

The gateway layer enforces a smooth consumption curve. It sees: “this agent has consumed 50% of its per-minute token budget in 10 seconds - apply backpressure.” The application quota enforces the ceiling; the gateway enforces the rate of climb toward that ceiling.
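A token bucket is the standard way to enforce that rate of climb. This is a sketch of the mechanism, not how Portkey or any particular gateway implements it; the clock is passed in explicitly to keep the example deterministic.

```python
class TokenBucket:
    """Per-agent tokens-per-minute limiter with smooth refill."""
    def __init__(self, tokens_per_minute: float):
        self.rate = tokens_per_minute / 60.0   # refill per second
        self.capacity = tokens_per_minute
        self.available = self.capacity
        self.last = 0.0

    def delay_for(self, tokens: float, now: float) -> float:
        """Return 0.0 and spend the tokens if the request fits; otherwise
        return the seconds of backpressure to apply (tokens not spent)."""
        self.available = min(self.capacity,
                             self.available + (now - self.last) * self.rate)
        self.last = now
        if tokens <= self.available:
            self.available -= tokens
            return 0.0
        return (tokens - self.available) / self.rate
```

The gateway delays the request by the returned amount instead of rejecting it, which is exactly the "apply backpressure" behavior described above.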

We use an LLM gateway (Portkey, though any will work) with per-agent token-per-minute limits configured. The gateway sees all LLM traffic, tracks consumption by agent identity (we pass agent ID as a custom header), and applies backpressure before requests hit the API. Combined with application-level quotas, we have not had a runaway cost event in 14 months.

The other thing I would add: alert on the RATE OF QUOTA CONSUMPTION, not just the level. An agent consuming quota at 10x its historical rate is interesting even if it has not hit its absolute limit yet. Anomalous rate is often an earlier signal than hitting the ceiling.
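In its simplest form, that rate alert is just a comparison against a historical baseline; a hedged sketch with an assumed anomaly factor:

```python
def rate_anomaly(current_rate: float, history: list[float],
                 factor: float = 10.0) -> bool:
    """True when the consumption rate is `factor`x the historical mean,
    even if the absolute quota has not been hit yet."""
    baseline = sum(history) / len(history)
    return current_rate >= factor * baseline
```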

This is a great governance story, Luis. The statistical angle I want to add: quota setting is hard without usage baselines, and most teams set quotas either too tight (breaking legitimate workloads) or too loose (not catching the runaway scenarios you describe).

Here is the approach we use at Anthropic for ML pipeline quota planning:

Phase 1: Baseline collection (first 2 weeks)
New agents run without hard quotas but with comprehensive consumption logging. We collect: tokens per run, run duration, external API call counts, time distribution of runs. This builds a statistical baseline.

Phase 2: Statistical quota setting
We set quotas at p99 of observed usage plus a 50% buffer for the “normal operations” quota tier. The 50% buffer accounts for legitimate usage spikes - larger input batches, more complex documents, seasonal patterns. If an agent consistently runs at 80% of its quota, that is normal. If it hits 100%, something unusual happened.

For the “alert” threshold (your 95% equivalent), we set it at 2x the p99 observed usage. An agent running at twice its historical maximum is almost certainly doing something unexpected.
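The two thresholds fall out of the baseline directly. A sketch using nearest-rank p99 (the function name and return shape are illustrative):

```python
import math

def quota_from_baseline(samples: list[float]) -> dict[str, float]:
    """Nearest-rank p99 over baseline samples, then the two tiers:
    normal-operations quota = p99 * 1.5, alert threshold = p99 * 2."""
    xs = sorted(samples)
    p99 = xs[min(len(xs) - 1, math.ceil(0.99 * len(xs)) - 1)]
    return {"p99": p99, "quota": p99 * 1.5, "alert": p99 * 2.0}
```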

Phase 3: Regular quota reviews
Quotas are not set-and-forget. We review agent quota utilization quarterly. Agents whose actual usage has grown significantly (feature expansion, more users, more data) get quota increases through a review process. Agents that are consistently using less than 10% of their quota get quotas tightened - both to prevent runaway scenarios and to give other agents more headroom.

One finding that surprised us: about 30% of our agents were running significantly more than their creators anticipated after three months in production, because they were being applied to use cases the creators had not foreseen. Regular quota reviews catch this before it becomes an incident.

The showback report Luis describes is excellent. I would add: include p99 usage from the last 30 days alongside the current quota, so teams can see how much headroom they have. “Your agent averaged 340 tokens per run against a per-run quota of 100,000” gives teams much better context than the dollar figure alone.

Luis, this is the FinOps problem applied to AI, and I want to name that explicitly because it changes how you staff and organize for it.

We went through the cloud cost governance journey in 2018-2020. Same pattern: engineers deployed things without understanding cost implications, a few runaway workloads produced shocking bills, leadership demanded controls, we built quota and alerting systems. The tooling is different but the organizational dynamics are identical.

What we learned from cloud FinOps that applies directly to AI agent governance:

You need a dedicated function, not just tooling. Cloud FinOps became a recognized discipline with dedicated practitioners. AI agent cost governance needs the same. We hired our first “AI FinOps Engineer” three months ago. Her job is to understand agent cost patterns across the company, set sensible default quotas, build the dashboards that make costs visible, and partner with teams when agents behave unexpectedly. She has prevented three potential incidents in her first quarter by noticing anomalous patterns in the monitoring data.

Chargeback changes behavior more than showback. Your showback reports are valuable - they create awareness. But we found that actual internal cost allocation (teams’ budget is debited for their agent costs) drove a 40% reduction in unnecessary agent runs in the first quarter. When it is real money from a real budget, the quota review becomes a real conversation. When it is a dashboard number, it competes with everything else for attention.

Executive visibility is essential. Our CTO (me) reviews a weekly AI cost summary alongside our cloud cost summary. When executives treat agent costs as a real operational metric, teams treat it the same way. When it is buried in an engineering metric nobody senior reads, it gets deprioritized.

The $47k incident you described would have been caught much earlier with this structure in place. The AI FinOps Engineer would have flagged the anomalous call pattern within the first hour.

Luis, I appreciate the honesty of the war story and I agree the quota governance is necessary. But I want to push back on one aspect: overly conservative quotas are actively strangling experimentation at my company, and I do not think that cost is being accounted for.

We run an EdTech platform and our data science team has been building learning analytics agents - things that analyze student performance patterns and generate personalized recommendations. These agents have highly variable workloads. Some analyses are lightweight; others require processing months of interaction history for a district with 50,000 students.

When we applied strict per-run token quotas, about 30% of legitimate analyses started failing mid-run because they exceeded quotas that were set based on “average” use cases. The data scientists spent two weeks debugging what they thought were model issues before realizing the quota system was silently killing their runs. That is two weeks of wasted engineering time.

My recommendation: separate quota policies for development, staging, and production environments.

  • Dev: No hard quotas, soft limits with alerts only. Engineers need to understand actual agent consumption before they can set sensible production quotas. Constraining them in dev means they set production quotas wrong.
  • Staging: Moderate quotas at 2-3x expected production usage. Catch egregious misconfigurations before prod.
  • Production: Graduated quotas as Luis describes, but set based on MEASURED staging data, not estimates.
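As a config fragment, the three-tier policy might look like this; the field names are hypothetical:

```python
QUOTA_POLICIES = {
    # dev: soft limits only, so engineers can observe real consumption
    "dev":        {"hard_limits": False, "alerts": True, "multiplier": None},
    # staging: moderate quotas at 2-3x expected production usage
    "staging":    {"hard_limits": True,  "alerts": True, "multiplier": 3.0},
    # production: graduated quotas set from measured staging data
    "production": {"hard_limits": True,  "alerts": True, "multiplier": 1.0},
}
```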

The other thing: the soft throttle at 95% can cause real problems for latency-sensitive agents. If a customer-facing agent starts throttling at 95% of its token budget, the user experience degrades noticeably. For real-time agents, I would rather have a hard cutoff at 100% with a graceful error than a degraded-performance state that is hard to debug.

The quota framework is right. The specific parameters need to match your agent’s operational requirements, not a one-size-fits-all policy.