The Free Trial That Burned Your Quarterly Inference Budget in Eleven Hours

June 1, 2026 · 11 min read

Software Engineer

Your trial offered "100 generations per day." Your pricing team modeled an interested user kicking the tires for a week. The first trialist who points an agent at the endpoint runs through the daily quota in seventy seconds, the weekly quota in nineteen minutes, and the quarterly inference budget by lunch the next day. Nobody alerted, because the only alert wired up was the one that fires when a trial converts.

The trial limits were not wrong when they were written. They were calibrated for a usage distribution that no longer describes the modal user. Somewhere between the pricing review six months ago and the signup that arrived this morning, the population shifted from humans clicking buttons to programs that don't get tired. The numbers on the dashboard stopped meaning what they meant when you set them.

The Unit You Counted Was the Wrong Unit

Almost every trial cap in the industry counts requests, generations, or sessions. Those units made sense when each unit was bounded by a human's patience: a person could only initiate so many actions per hour, and each action had a roughly stable cost on the backend. The unit was a usable proxy for spend because the relationship between a "generation" and a dollar of inference was stable within an order of magnitude.

That stability is gone. A single "generation" can now be a one-shot completion that costs a fraction of a cent, or it can be a fifty-step agent loop that retrieves four documents, calls three tools, generates a thousand-token plan, executes the plan, observes the result, replans, and produces an answer ten thousand tokens long. Both are one generation. Reports from production deployments show that agent paths cost roughly three times more than chat for a five-step loop, and the multiplier exceeds thirty at fifty steps and a hundred at two hundred steps. Your "100 generations per day" trial cap, denominated in the old unit, is now a check that fails to enforce two-orders-of-magnitude swings in actual spend.

The fix is to denominate caps in the unit you actually pay for. Tokens consumed, tool calls invoked, retrieval operations performed, and seconds of compute held are all metered units that move proportionally with cost. The accounting is harder to explain to a non-technical trialist, but the alternative is explaining a budget overrun to a CFO who reads news stories about companies that accidentally spent half a billion dollars on a single vendor because nobody set seat-level limits.

The Failure Mode Is a Distribution Shift Nobody Was Watching

If you ran a histogram of per-trialist spend three years ago, it had a long tail but a recognizable shape: a mode near zero (people who signed up and never came back), a small bump in the middle (active evaluators), and a thin tail of heavy users. The width of the tail set how much you needed to budget per signup. Capacity planning worked because the distribution was stable.

The new distribution has the same mode at zero, the same middle bump, and a tail that has detached from the body. A small number of trialists now consume what your billing model treats as anomalies but what is actually a separate cluster: programmatic users who execute thousands of operations per minute, sustain that rate for hours, and pause only when a quota tells them to. The reports from teams running large fleets are consistent. The top five percent of users routinely consume fifty to seventy percent of inference spend, and that fraction has been growing every quarter without anyone deciding it should.

The danger is not that this distribution exists. The danger is that the safety systems were designed against the old shape. The "per-user soft cap" was sized to accommodate the heavy tail of the human distribution, which is dwarfed by the heavy tail of the programmatic one. The "anomaly detector" was tuned to flag the top one percent, which now means it flags everyone in the programmatic cluster and nobody can triage that many alerts so nobody triages any of them. The fraud team's "abuse" definition predates the existence of agents and does not match what a legitimate but expensive agent user looks like.

Daily Caps Are Not Weekly Caps and Both Are Not the Real Limit

A common trial design pairs a daily cap with no weekly or quarterly ceiling. The implicit assumption is that humans who hit the daily cap will go to bed and not come back until tomorrow, so the daily cap is also approximately the weekly cap. That assumption holds for humans. An agent that hits the daily cap at 7:01 AM will, in many deployments, retry at the top of the next quota window and consume the next day's cap in another seventy seconds. Seven days of that is the weekly budget. Ninety days of that is the quarterly budget. None of it required exotic abuse; it required only a quota window that resets.

The mitigation is layered ceilings: a daily cap that enforces near-term burst control, a weekly cap that is less than seven daily caps because seven consecutive full days is itself an anomaly, and a quarterly anti-abuse ceiling that nobody hits except in the specific failure mode you are now defending against. Each layer catches a different shape of overconsumption. Each one is also easy to forget, because the dashboard that surfaces daily usage rarely surfaces the rolling window that would expose the trialist who is at 95 percent of the weekly ceiling on day three.

A second discipline is to make the cap a token cap, not a request cap, and to count cached tokens, input tokens, output tokens, and tool-call tokens separately so the cap matches the bill. The provider invoice is itemized; the trial cap should be too. Otherwise a trialist consumes a hundred times the expected token volume per "generation" because each generation invoked five tools, and the dashboard shows one hundred generations and the bill shows ten thousand dollars and the only thing that bridges the gap is a post-mortem.

Loading…

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

The Free Trial That Burned Your Quarterly Inference Budget in Eleven Hours

The Unit You Counted Was the Wrong Unit

The Failure Mode Is a Distribution Shift Nobody Was Watching

Daily Caps Are Not Weekly Caps and Both Are Not the Real Limit

Recommended Reading

About Tian Pan

The Unit You Counted Was the Wrong Unit​

The Failure Mode Is a Distribution Shift Nobody Was Watching​

Daily Caps Are Not Weekly Caps and Both Are Not the Real Limit​

Recommended Reading

About Tian Pan

The Unit You Counted Was the Wrong Unit

The Failure Mode Is a Distribution Shift Nobody Was Watching

Daily Caps Are Not Weekly Caps and Both Are Not the Real Limit