I want to tell you about a Friday afternoon that cost us $47,000.
A data pipeline agent - nothing exotic, just an automated workflow that enriched customer records by calling an external firmographic API - got into a loop. The enrichment API started returning a specific error code for records it could not match. Our agent was designed to retry on error. The error code happened to be one the agent interpreted as “try again with slightly different parameters.” So it did. Endlessly. At full speed for the first six hours.
By the time our on-call engineer noticed the anomaly (Sunday morning, actually - it ran through the weekend), the agent had made 2.3 million API calls. The external API billed us for every call. $47,312.40, invoiced on Monday.
There was no quota enforcement. There were no circuit breakers. There were no alerts triggered by abnormal API call volume. There was no human-in-the-loop trigger. The agent just ran until the external API’s own rate limiting finally slowed it to a trickle.
Here is how we rebuilt our quota governance after that incident.
The Four Dimensions of Agent Quotas
We had been thinking about quotas one-dimensionally: how many times can this agent run per day? That is necessary but wildly insufficient. We now enforce quotas across four dimensions:
1. Token consumption per run and per day
LLM tokens are not free. An agent that is supposed to summarize 10 documents but somehow ends up processing 10,000 needs a hard stop, not just a warning. We set per-run token budgets based on the agent’s intended workload, with a 2x buffer. Exceeding 80% triggers a warning log. Exceeding 100% kills the run with a specific error code that pages the owning team.
2. External API calls per time window
This is the lesson from the $47k incident. Every external API call the agent is authorized to make has a rate limit stored in the quota system. The quota enforcer sits between the agent’s tool execution layer and the actual API calls. The agent cannot bypass it.
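A sliding-window counter is one simple way to build that enforcer; the sketch below assumes one counter per (agent, API) pair and hypothetical names throughout:

```python
import time
from collections import deque


class RateLimitExceeded(Exception):
    """The run stops here; the agent never reaches the external API."""


class QuotaEnforcer:
    def __init__(self, max_calls: int, window_seconds: float):
        self.max_calls = max_calls
        self.window = window_seconds
        self._stamps = deque()  # timestamps of recent calls

    def check(self) -> None:
        now = time.monotonic()
        # drop timestamps that have aged out of the window
        while self._stamps and now - self._stamps[0] > self.window:
            self._stamps.popleft()
        if len(self._stamps) >= self.max_calls:
            raise RateLimitExceeded(
                f"{self.max_calls} calls in {self.window}s exceeded")
        self._stamps.append(now)


def guarded_call(enforcer: QuotaEnforcer, api_fn, *args, **kwargs):
    enforcer.check()  # the tool layer routes every external call through here
    return api_fn(*args, **kwargs)
```

The point of `guarded_call` is architectural: the agent's tools hold a reference to it, not to the raw API client, so there is no code path around the quota check.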
3. Compute time and parallel execution limits
Some of our agents can spawn parallel sub-tasks. Without limits, one misbehaving agent can consume all available compute and starve other agents. We model this the same way Kubernetes models CPU and memory: requests (what the agent typically needs) and limits (what it is never allowed to exceed).
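A sketch of the requests/limits shape, with a semaphore gating sub-task spawns - field names are assumptions for illustration, not our actual schema:

```python
import threading
from dataclasses import dataclass


@dataclass(frozen=True)
class ComputeQuota:
    cpu_request: float   # what the agent typically needs (used for scheduling)
    cpu_limit: float     # hard ceiling it is never allowed to exceed
    max_parallel: int    # cap on concurrent sub-tasks


class SubTaskGate:
    """Bounds how many sub-tasks a single agent may run at once."""

    def __init__(self, quota: ComputeQuota):
        self._sem = threading.BoundedSemaphore(quota.max_parallel)

    def try_spawn(self) -> bool:
        # non-blocking: at the cap, the spawn is refused rather than queued
        return self._sem.acquire(blocking=False)

    def done(self) -> None:
        self._sem.release()
```

Refusing the spawn (instead of queueing it) is the conservative choice: a misbehaving agent fails loudly at its own limit instead of silently starving everyone else.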
4. Human escalation triggers
These are not exactly quotas but they belong in the same governance layer. Certain conditions must trigger human review before the agent continues: writing more than N records in a single run, accessing a sensitive data tier, calling an external service that was not in the original approved tool list.
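The checks themselves are trivial predicates; what matters is evaluating them in the governance layer before the run continues. A hedged sketch, with thresholds and tool names invented for illustration:

```python
MAX_RECORDS_PER_RUN = 1000                     # illustrative threshold
APPROVED_TOOLS = {"enrich_api", "crm_read"}    # the agent's approved tool list


def escalation_reasons(records_written: int, data_tier: str,
                       tool_name: str) -> list:
    """Non-empty result means: pause the run and wait for human review."""
    reasons = []
    if records_written > MAX_RECORDS_PER_RUN:
        reasons.append(
            f"wrote {records_written} records (> {MAX_RECORDS_PER_RUN})")
    if data_tier == "sensitive":
        reasons.append("accessed sensitive data tier")
    if tool_name not in APPROVED_TOOLS:
        reasons.append(f"tool {tool_name!r} not on approved list")
    return reasons
```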
The Soft/Hard Limit Pattern
Binary quotas (you are either within limits or you are not) create bad failure modes - you go from full speed to complete stop. We use a graduated response:
- 60% of limit: Debug-level log entry only
- 80% of limit: Warning log + metric recorded in our observability stack
- 95% of limit: Throttled - add exponential backoff between calls, notify the owning team via Slack
- 100% of limit: Hard stop. The agent run terminates with a quota-exceeded exit code. The owning team is paged. A summary of what the agent did up to this point is automatically written to our incident log.
The 95% throttle is the most important tier. It gives the owning team a chance to intervene before hard stop, and it slows the agent enough that the blast radius of a runaway loop is bounded even if nobody responds immediately.
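The tiers above reduce to a ratio-to-action mapping plus a backoff schedule. A minimal sketch, assuming illustrative function names and a hypothetical backoff base:

```python
def tier_action(used: int, limit: int) -> str:
    """Map quota consumption to the graduated response tier."""
    ratio = used / limit
    if ratio >= 1.0:
        return "hard_stop"  # terminate run, page owning team, write incident log
    if ratio >= 0.95:
        return "throttle"   # exponential backoff + Slack notification
    if ratio >= 0.80:
        return "warn"       # warning log + observability metric
    if ratio >= 0.60:
        return "debug"      # debug-level log entry only
    return "ok"


def throttled_delay(consecutive_calls: int, base: float = 0.5) -> float:
    """Backoff inserted between calls once in the throttle tier (base is illustrative)."""
    return base * (2 ** consecutive_calls)
```

The backoff is what bounds the blast radius: even if nobody responds to the Slack ping, the call rate decays instead of holding steady at full speed.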
Showback: Making Teams Own Their Costs
The organizational challenge is making teams actually define quotas before they deploy an agent, and actually review quota consumption regularly.
We solved this with showback reports. Every Monday morning, each engineering team gets an automated Slack message showing:
- Agent runs last week: N
- Token consumption: X tokens (estimated cost: $Y)
- Top 3 agents by cost
- Any quota threshold events
The dollar figure is the key. Engineers respond to “your data enrichment agent consumed $340 in LLM tokens last week” in a way they do not respond to “your agent made 18,000 API calls.” Cost visibility drives quota hygiene.
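Generating that message is the easy part; the sketch below assumes a flat per-1k-token price and invented field names, just to show the shape of the report:

```python
def showback_message(team: str, runs: int, tokens: int,
                     usd_per_1k_tokens: float, top_agents: list,
                     threshold_events: int) -> str:
    """Format the weekly showback Slack message for one team."""
    cost = tokens / 1000 * usd_per_1k_tokens
    lines = [
        f"Showback for {team}:",
        f"- Agent runs last week: {runs}",
        f"- Token consumption: {tokens:,} tokens (estimated cost: ${cost:,.2f})",
        "- Top 3 agents by cost: " + ", ".join(top_agents[:3]),
        f"- Quota threshold events: {threshold_events}",
    ]
    return "\n".join(lines)
```

Even a rough per-token price estimate is enough here; the goal is a dollar figure engineers react to, not penny-accurate accounting.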
We also discovered agents that were running far more often than anyone realized. One agent was configured to run every 15 minutes but its use case only required daily runs. The owning team had no idea - it had been set up by someone who left the company. Showback surfaced it within a week.

The Quota Review Gate
We now require a quota plan before any agent can be promoted to production. The plan is a simple document (one page max) that answers:
- What is the expected number of runs per day, and what is your confidence level?
- What is the expected token consumption per run?
- What external APIs does this agent call and at what rate?
- What is the worst-case cost if this agent runs at 10x expected volume?
- What human will be paged if the agent hits quota limits at 3am?
That last question is critical. You need a named human, not a team alias. Someone whose name is on the page is much more motivated to configure quota limits correctly.
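The promotion gate can enforce the plan mechanically. A hypothetical checklist validator - field names and the team-alias heuristic are both invented for illustration:

```python
REQUIRED_FIELDS = (
    "expected_runs_per_day",
    "expected_tokens_per_run",
    "external_api_rates",
    "worst_case_cost_at_10x",
    "paged_human",
)


def missing_plan_fields(plan: dict) -> set:
    """Return the questions a quota plan still has to answer."""
    missing = {f for f in REQUIRED_FIELDS if not plan.get(f)}
    # crude illustrative check: a team alias does not count as a named human
    human = plan.get("paged_human", "")
    if isinstance(human, str) and human.startswith("team-"):
        missing.add("paged_human")
    return missing
```

A non-empty result blocks promotion; the agent stays in staging until every field is answered.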
The $47k incident paid for our entire quota governance system several times over. But more importantly, we have not had a runaway agent cost incident since. That is the real return on investment.