Skip to main content

5 posts tagged with "ai-cost"

View all tags

The Cost Forecast Tied to a Pricing Tier You No Longer Qualify For

· 11 min read
Tian Pan
Software Engineer

The usage curve barely moved. The bill went up 38%.

That is the email the finance lead at a mid-sized fintech opened on the first Monday of the quarter. Three months earlier, the engineering org had renegotiated their LLM inference contract and shaved a sizeable percentage off the negotiated unit price by committing to a volume floor. The finance model rolled the new unit price into the FY forecast. Nobody bookmarked the footnote in the pricing schedule that said the discount would lapse if monthly usage fell below the floor for three consecutive months. The seasonal traffic dip in April-May did exactly that. The provider re-tiered the account back to list price. No notification reached engineering, because the notification went to the procurement inbox that nobody had read since the contract was signed.

The Free Trial That Burned Your Quarterly Inference Budget in Eleven Hours

· 11 min read
Tian Pan
Software Engineer

Your trial offered "100 generations per day." Your pricing team modeled an interested user kicking the tires for a week. The first trialist who points an agent at the endpoint runs through the daily quota in seventy seconds, the weekly quota in nineteen minutes, and the quarterly inference budget by lunch the next day. Nobody alerted, because the only alert wired up was the one that fires when a trial converts.

The trial limits were not wrong when they were written. They were calibrated for a usage distribution that no longer describes the modal user. Somewhere between the pricing review six months ago and the signup that arrived this morning, the population shifted from humans clicking buttons to programs that don't get tired. The numbers on the dashboard stopped meaning what they meant when you set them.

Thinking Tokens Are Invisible in Your Logs and Loud on Your Bill

· 9 min read
Tian Pan
Software Engineer

The first person to notice your reasoning-model regression is almost never on the engineering team. It is the finance analyst who pings your manager on a Tuesday afternoon because the previous month's Anthropic invoice came in 2.4x higher than the prior one, and "we didn't ship anything that should have done that." You open the dashboard, look at request volume — flat. Latency p99 — flat. Output tokens per response — flat. Error rate — flat. Every panel you wired up six months ago says the system is healthy. Finance is looking at a different number, and they are right.

The number they are looking at is reasoning tokens, and most observability stacks were built before the field existed.

Per-Customer Cost Concentration: Why AI Cost Dashboards Hide the Power Law

· 12 min read
Tian Pan
Software Engineer

Your AI feature's cost is a distribution, not a number. The dashboard hanging on the wall of the eng-finance war room says $187,000 last month, broken out by feature, by model, and by region. None of those views answers the question the CFO is actually about to ask: "Who is paying us $40 a month and costing us $4,000?" When you sort by customer_id instead of by feature, the line that was a comfortable bar chart becomes a hockey stick, and the team that designed against the average customer discovers it has been quietly underwriting the top of the tail for a quarter.

The pattern is so consistent it deserves to be called a law. Across production LLM workloads, the top 1% of users routinely drive 30–50% of token spend, with similar shapes showing up at the top 0.1% and the top 0.01%. This isn't a quirk of any one product — it's what happens when you ship a feature whose marginal cost is variable and whose pricing is flat. Average-user margins look fine. Median-user margins look great. The integral over the heavy tail is where the quarter goes.

Reasoning-Model Arbitrage: The Slow Expensive Model Is Cheaper on the Hard Prompts

· 10 min read
Tian Pan
Software Engineer

The cheapest line on the pricing page is rarely the cheapest line on the invoice. A team picks the workhorse model — Sonnet, Haiku, Flash, GPT-mini — because the per-token math is friendly, ships a feature, and watches the cost dashboard report a happy unit-economics story for a quarter. Then the long tail catches up: a slice of requests the workhorse can't quite handle starts retrying, then partially answering, then escalating to a human reviewer, and the per-feature P&L stops resembling the per-call dashboard.

The arbitrage is that, on those hard requests, a reasoning model the team would never default to — Opus, o3, the slow expensive one — frequently lands the answer on the first attempt. The all-in cost of one $0.50 reasoning call beats five $0.05 workhorse calls plus the escalation queue and the engineer who debugs the failure on Monday. The procurement question (which model is cheapest per token?) and the architecture question (which model is cheapest per resolved request?) are different questions, and the team that conflates them is paying the difference.