Skip to main content

17 posts tagged with "llm-cost"

View all tags

The Abstention Tax You Didn't Budget For

· 11 min read
Tian Pan
Software Engineer

You taught the agent to say "I don't know" when the context was thin and called it a safety win. The OpenAI bill went down. Everyone agreed it was the responsible move. Three months later your VP of Support is asking why headcount projections are off by 40% and nobody in the AI org has an answer, because the metric you tracked was abstention rate and the metric that moved was tickets-per-week — and nobody owned the line that summed them.

This is the abstention tax. It's not a model cost. It doesn't show up on the inference invoice. It shows up downstream, in the queue depth of the human team that catches every "I cannot answer," in the second model call that runs against the enriched context the human had to assemble, in the customer who churned during the wait. The model-only cost frame quietly hides it. And the org seam where the AI team owns abstention and the ops team owns the queue means nobody is incentivized to see it.

The Model Migration That Broke Your Prompt Cache Without Warning

· 10 min read
Tian Pan
Software Engineer

The migration looked clean. Evals were re-anchored against the new model version. Judge prompts were re-calibrated. Two weeks of shadow traffic showed behavior parity within tolerance. p50 and p99 latency were inside the budget. The rollout call signed off on Thursday afternoon and the team went home.

By Friday morning, the inference bill was 3x normal. Eval scores were still fine. Latency was still fine. No one on the rollout call had thought to instrument the cache hit rate, because the prefix had not changed — the system prompt was byte-identical, the tool definitions were byte-identical, the conversation framing was byte-identical. What had changed was the model version in the request body, and the provider keys its prefix cache on (prefix bytes + model version). Every request after the cutover landed on a cold cache. The warm-up curve took six weeks of organic traffic to recover, and the team paid full input-token rates for every token on every request for the duration.

Inference Billing as a P&L Line Item Nobody Owns

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, four people each believe a fifth person owns the inference bill. Engineering treats it as a cloud line item. The AI team treats it as the price of building. Finance treats it as a variable margin input that someone in engineering must already be managing. Product treats it as overhead that engineering absorbs. The bill keeps growing, and the only thing everyone agrees on is that it isn't theirs.

This is not a budgeting problem. It is an ownership vacuum, and it surfaces the first time the line item gets large enough for a CFO to ask about it on a board call. By then, the answers people improvise — "we'll optimize," "we'll cache more," "we'll switch models" — describe interventions without naming an owner. The conversation that should have happened a year earlier was not about how to lower the bill. It was about whose P&L the bill belonged to in the first place.

The shift is structural. Inference moved from 15% of enterprise AI spend in 2024 to roughly 85% in 2026, and the average enterprise AI budget grew from $1.2M to around $7M over the same window. A line item that was once rounding error is now the kind of number a board notices, and the org chart written before that shift has no row for it.

Your Happy Path Is Your Expensive Path: The Agent That Costs More When It Wins

· 10 min read
Tian Pan
Software Engineer

A failed agent run is cheap. It misroutes a query, hits a dead end, returns "I couldn't help with that," and burns maybe a few hundred tokens doing it. A successful run is the disaster. It retrieves context, reflects on it, calls three tools, reflects again, and stitches together a confident multi-paragraph answer — fifty times the token spend of the failure that cost you nothing.

This is the uncomfortable shape of agent economics: your happy path is your expensive path. The outcome you are selling, the one your marketing page promises, the one users thank you for, is the single most costly thing your system can do. And if you priced the product the way SaaS has been priced for fifteen years — a flat monthly fee per seat — then every time the agent does its job well, it quietly erodes your margin.

Most teams discover this backwards. They watch cost dashboards, see failures are cheap, and conclude that reliability work will save money. It won't. Raising your success rate raises your bill.

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests — two dollars — describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Context Bloat: The AI Memory Leak You Cannot Grep For

· 12 min read
Tian Pan
Software Engineer

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

Your AI Pricing Page Is a Leveraged Bet on Token Economics

· 9 min read
Tian Pan
Software Engineer

When the team published the AI tier at "$X per seat for unlimited AI," nobody on the pricing call thought of it as a derivative position. It looked like a SaaS pricing page — a number, a tier, a CTA. But every dollar of revenue from that page is now exposed to a token-cost curve set by a vendor whose roadmap does not care about your gross margin. You did not write a pricing page. You wrote a naked short on token volatility, and the strike is whatever your vendor charges next quarter.

The math arrives quickly. A handful of power users discover the workflow and start running it on the longest context they can fit. A competitor's UX change re-trains the median user to send queries that are 40% longer. The frontier model your feature is locked to gets a price-per-million bump because the older tier you were on is being deprecated. Any one of these is a margin event you cannot reverse from the pricing page in a single quarter — and they tend to arrive together.

Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write

· 12 min read
Tian Pan
Software Engineer

Your finance team will ask for a capacity plan you cannot write. Not because you're inexperienced or because the model is new, but because the two assumptions classical capacity planning rests on — a workload distribution you can measure, and a unit cost stable on a quarter timescale — are both violated by AI workloads. The number you hand them will be wrong on day one, and when the variance hits, the conversation that follows will not be about the bill.

The 2026 State of FinOps report named AI as the fastest-growing new spend category, with a majority of respondents reporting that AI costs exceeded original budget projections — for many enterprises, inference now consumes the bulk of the AI bill. The instinct to manage this with a SaaS-style capacity plan — pick a peak QPS, multiply by a unit cost, add 30% buffer — produces a number with the texture of a forecast and the predictive power of a horoscope. The capacity plan you actually need looks more like a FinOps scenario model than a procurement spreadsheet, and the engineering work to produce it is platform work that competes with feature work until the day finance loses patience.

The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure

· 10 min read
Tian Pan
Software Engineer

A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.

An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.

That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.