12 posts tagged with "llm-cost"

The MCP Capability Disclosure Tax: When Every Connected Server Bills Your Context Window

· 11 min read
Tian Pan
Software Engineer

Connect a single GitHub MCP server to your agent and you've already spent twelve to forty thousand tokens before the user types a word. Connect a filesystem server, a calendar, a database, an internal CRM, and a third-party tool catalog, and a heavy desktop configuration has been measured at sixty-six thousand tokens of pure tool disclosure — nearly a third of Claude Sonnet's 200K window, paid every single planning turn. The agent hasn't done anything yet. The user hasn't asked anything yet. The bill is already running.

This is the disclosure tax, and it is the most underpriced line item in agentic systems shipping right now. Teams add MCP servers the way teams once added microservices — each integration looks like a free composition primitive, the procurement story writes itself ("more tools = more capability"), and the unit economics dashboard never surfaces the per-server cost because the cost lives inside a token bucket nobody attributes back to the connector. The result is an agent that gets slower, dumber, and more expensive every time someone adds another integration, and a team that explains the regression by re-tuning prompts and chasing the model vendor for a new version.
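The arithmetic behind that figure can be sketched in a few lines. This is a rough illustration, not a measurement: the 66,000-token total and the 200K window come from the text above, while the per-server split, the server names, and the `disclosure_tax` helper are hypothetical. It also assumes every planning turn re-sends the full tool disclosure with no prompt-cache discount.

```python
def disclosure_tax(tool_tokens_by_server: dict[str, int],
                   context_window: int,
                   planning_turns: int) -> dict[str, float]:
    """Estimate what tool disclosure costs before any user input arrives."""
    per_turn = sum(tool_tokens_by_server.values())
    return {
        "per_turn_tokens": per_turn,
        # Share of the context window consumed by tool schemas alone.
        "window_fraction": per_turn / context_window,
        # Assumes the disclosure is re-sent, uncached, on every planning turn.
        "session_tokens": per_turn * planning_turns,
    }


# Hypothetical split; only the 66,000 total is taken from the text.
servers = {
    "github": 26_000,
    "filesystem": 8_000,
    "calendar": 6_000,
    "database": 9_000,
    "crm": 7_000,
    "tool_catalog": 10_000,
}

report = disclosure_tax(servers, context_window=200_000, planning_turns=10)
print(report)
```

Attributing the total back to each connector, as the `servers` mapping does, is what makes the per-server cost visible in the first place.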

The Sliding-Window Tax: Why a 30-Turn Conversation Costs More Than 30x a Single Turn

· 9 min read
Tian Pan
Software Engineer

The conversation looks healthy on the dashboard. Average tokens per call is sane, the p50 input length is comfortably inside the cached prefix, the provider invoice ticks up at the rate finance approved. Then someone exports a single 200-turn coding session and the line item for that one user is larger than the rest of the team's daily traffic combined. The dashboard wasn't lying — it was averaging. The bill comes from the long tail, and the long tail does not scale linearly with turn count.

Every multi-turn AI feature eventually meets this surprise. The per-call token count is the wrong unit of measurement, because the cost of a 30-turn conversation is not 30 times the cost of a single turn — it's something between 50× and 200×, depending on how the history is structured, how the prompt cache decays, and what tier the request lands in once the input crosses 200K tokens. The team that priced the feature off the per-call number is underwriting a tail it never modeled.

Context Bloat: The AI Memory Leak You Cannot Grep For

· 12 min read
Tian Pan
Software Engineer

A long-running agent session that opened with a 2K context is now paying for 40K tokens of mostly-dead state. The retrieval results from turn three, the directory listing the agent already navigated past, the JSON dump from a tool call whose answer was a single integer — all of it is still riding shotgun on every subsequent inference call, billed in full, dragging on attention. The pattern is structurally identical to a memory leak: unbounded growth of unreferenced data. But no profiler will surface it, because the leak does not live in process memory. It lives inside the conversation history, and most agent frameworks ship without a collector.

The cost shows up in two places at once. The token bill grows quadratically — a 20-step loop where each step contributes 1,000 tokens produces roughly 210,000 cumulative input tokens, not 20,000, because every prior turn is rebilled on every subsequent call. And the model itself starts to degrade: by 50K tokens of accumulated noise, even a model with a 1M-token window has already lost double-digit points of accuracy on the actual task. You are paying more, to think worse, about a problem the model was already past three turns ago.
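The 210,000 figure follows from summing the context size at every step. A minimal sketch, assuming each step appends exactly 1,000 tokens and nothing is ever pruned or cached (the function name is illustrative):

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens billed when every call re-sends the full history."""
    total = 0
    context = 0
    for _ in range(steps):
        context += tokens_per_step  # history grows by one step's worth
        total += context            # the whole history is billed again
    return total


naive_estimate = 20 * 1_000                         # 20,000: each step billed once
actual_billed = cumulative_input_tokens(20, 1_000)  # 210,000: history rebilled
print(naive_estimate, actual_billed)
```

The loop is equivalent to the closed form `tokens_per_step * steps * (steps + 1) // 2`, which is where the quadratic growth comes from.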

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

Your AI Pricing Page Is a Leveraged Bet on Token Economics

· 9 min read
Tian Pan
Software Engineer

When the team published the AI tier at "$X per seat for unlimited AI," nobody on the pricing call thought of it as a derivative position. It looked like a SaaS pricing page — a number, a tier, a CTA. But every dollar of revenue from that page is now exposed to a token-cost curve set by a vendor whose roadmap does not care about your gross margin. You did not write a pricing page. You wrote a naked short on token volatility, and the strike is whatever your vendor charges next quarter.

The math arrives quickly. A handful of power users discover the workflow and start running it on the longest context they can fit. A competitor's UX change re-trains the median user to send queries that are 40% longer. The frontier model your feature is locked to gets a price-per-million bump because the older tier you were on is being deprecated. Any one of these is a margin event you cannot reverse from the pricing page in a single quarter — and they tend to arrive together.

Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write

· 12 min read
Tian Pan
Software Engineer

Your finance team will ask for a capacity plan you cannot write. Not because you're inexperienced or because the model is new, but because the two assumptions classical capacity planning rests on — a workload distribution you can measure, and a unit cost stable on a quarter timescale — are both violated by AI workloads. The number you hand them will be wrong on day one, and when the variance hits, the conversation that follows will not be about the bill.

The 2026 State of FinOps report named AI as the fastest-growing new spend category, with a majority of respondents reporting that AI costs exceeded original budget projections — for many enterprises, inference now consumes the bulk of the AI bill. The instinct to manage this with a SaaS-style capacity plan — pick a peak QPS, multiply by a unit cost, add 30% buffer — produces a number with the texture of a forecast and the predictive power of a horoscope. The capacity plan you actually need looks more like a FinOps scenario model than a procurement spreadsheet, and the engineering work to produce it is platform work that competes with feature work until the day finance loses patience.

The Carbon Math of Agent Workflows: A Token Budget Is Now an ESG Disclosure

· 10 min read
Tian Pan
Software Engineer

A stateless chat completion sips electricity. A median Gemini text prompt clocks in at about 0.24 Wh; a short GPT-4o query is around 0.3–0.4 Wh. These numbers are small enough that nobody puts them on a board deck.

An agent task is not a chat completion. A typical "go research this customer and draft a reply" workflow can fan out to 30+ tool calls, 10–15 model invocations, and a context window that grows with every step. The energy cost compounds with the call graph. By the time the agent returns, you have not consumed one unit of inference — you have consumed fifty to two hundred. Suddenly the per-task footprint is in the same order of magnitude as a video stream.

That arithmetic is about to matter outside the engineering org. The EU's CSRD makes Scope 3 emissions disclosure mandatory for in-scope companies, with machine-readable iXBRL reporting required from 2026. The SEC dropped Scope 3 from its final rule, but any multinational with EU operations still has to answer the question. Procurement teams have started adding "what is the carbon footprint per user task of your AI feature?" to vendor questionnaires. Most engineering teams cannot answer it, because nobody instrumented it.

The Tip Jar Problem: When 5% of Your Users Burn 80% of Your Inference Budget

· 11 min read
Tian Pan
Software Engineer

A single developer ran up more than $35,000 in compute under a $200 monthly plan. That is a 175x subsidy on one user, paid for by the casual majority who would have been just as happy on a $19 tier. This is the load-bearing math behind every "Why is our AI margin negative this quarter?" Slack thread. The problem is not that one user; it is that heavy users like this one follow a power law, and a power law plus flat-rate billing plus a real per-unit cost is a structural margin compressor that no amount of growth will fix.

The reflex when this lands on a finance review is to clamp down: hard token caps, "fair-use" language buried in the TOS, weekly throttles, a quietly degraded model for the free tier. These all work in the sense that they cut the bleed. They also alienate the exact users whose evangelism you depend on, because the people who hit your caps are the ones who actually figured out how to extract value from your product. The standard fix is a backwards-compatible apology to the wrong cohort.

The Cancellation Tax: Your Inference Bill After the User Hits Stop

· 9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the model has already been scheduled, KV-cached, and is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.
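One way to put a number on that gap is to compare what the client rendered against what the provider's usage block reported. The sketch below is a hypothetical accounting helper, not any vendor's API: `StreamRecord`, its field names, and `cancellation_tax` are assumptions, and it presumes you log the delivered-token count on the client side.

```python
from dataclasses import dataclass


@dataclass
class StreamRecord:
    delivered_tokens: int  # tokens the client rendered before the user hit stop
    billed_tokens: int     # output tokens reported in the provider's usage block


def cancellation_tax(records: list[StreamRecord]) -> tuple[int, float]:
    """Return (wasted tokens, wasted share of billed output tokens)."""
    wasted = sum(max(r.billed_tokens - r.delivered_tokens, 0) for r in records)
    billed = sum(r.billed_tokens for r in records)
    return wasted, (wasted / billed if billed else 0.0)


# The example from the text: the client saw ten tokens, the log recorded eight hundred.
wasted, share = cancellation_tax([StreamRecord(delivered_tokens=10, billed_tokens=800)])
print(wasted, share)
```

Because observability tools report only the billed side, the delivered side has to come from your own client or gateway logs.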

The Model Bill Is 30% of Your Inference Cost

· 8 min read
Tian Pan
Software Engineer

A finance lead at a mid-sized AI company told me last quarter they had "optimized their LLM spend" by switching their agent backbone from Sonnet to Haiku. The token bill dropped 22%. The total inference cost per resolved ticket went down 4%. When we pulled the full decomposition, the model line item was roughly a third of the per-request cost. Retrieval, reranking, observability, retry amplification, and the human-in-the-loop review queue ate the rest — and none of those got cheaper when they swapped models.

This is the most common accounting error I see in AI teams right now. Token cost is the line item on the invoice you pay every month, so it becomes the number everyone optimizes. But for any non-trivial production system — RAG, agents, anything with tool use or evaluation gates — the model inference is often 30 to 50% of the real unit economics. The rest sits in places your engineering dashboard doesn't surface and your finance team doesn't categorize as "AI spend."

The Planning Tax: Why Your Agent Spends More Tokens Thinking Than Doing

· 10 min read
Tian Pan
Software Engineer

Your agent just spent $6 solving a task that a direct API call could have handled for $0.12. If you've built agentic systems in production, this ratio probably doesn't surprise you. What might surprise you is where those tokens went: not into tool calls, not into generating the final answer, but into the agent reasoning about what to do next. Decomposing the task. Reflecting on intermediate results. Re-planning when an observation didn't match expectations. This is the planning tax, the token overhead your agent pays to think before it acts, and for most agentic architectures it consumes 40–70% of the total token budget before a single useful action fires.

The planning tax isn't a bug. Reasoning is what separates agents from simple prompt-response systems. But when the cost of deciding what to do exceeds the cost of actually doing it, you have an engineering problem that no amount of cheaper inference will solve. Per-token prices have dropped roughly 1,000x since late 2022, yet total agent spending keeps climbing — a textbook Jevons paradox where cheaper tokens just invite more token consumption.

The Reasoning Model Premium in Agent Loops: When Thinking Pays and When It Doesn't

· 10 min read
Tian Pan
Software Engineer

Here is a number that should give you pause before adopting a reasoning model for your agent: a single query that costs 7 tokens with a standard fast model costs 255 tokens with Claude extended thinking and 603 tokens with an aggressively-configured reasoning model. For an isolated chatbot query, that is manageable. But inside an agent loop that calls the model twelve times per task, you are not paying a 10x premium — you are paying a 10x premium times twelve, compounded further by the growing context window that gets re-fed on every turn. Billing surprises have killed agent projects faster than accuracy problems.

The question is not whether reasoning models are better. On hard tasks, they clearly are. The question is whether they are better for your specific workload, at your specific position in the agent loop, and by a margin that justifies the cost. Most teams answer this incorrectly in both directions — they either apply reasoning models uniformly (burning budget on tasks that don't need them) or avoid them entirely (leaving accuracy gains on the table for the tasks that do).