Skip to main content

29 posts tagged with "finops"

View all tags

The Idle Agent Tax: What Your AI Session Costs While the User Is in a Meeting

· 11 min read
Tian Pan
Software Engineer

A developer opens their IDE copilot at 9:00, asks it three questions before standup, and then sits in meetings until 11:30. The chat panel is still open. The conversation is still scrollable. The model hasn't generated a token in two and a half hours. And yet that session — sitting there, attended by nobody — has been quietly accruing cost the entire morning. KV cache pinned. Prompt cache being kept warm by a periodic ping. Conversation state held in a hot store. Trace pipeline writing one row per heartbeat. Concurrency slot reserved on the model provider. Multiply by ten thousand seats and the bill is real.

This is the idle agent tax. It is the part of your inference budget that pays for capacity your users are not using, and it is invisible to most engineering dashboards because the dashboards were built for stateless APIs. A request comes in, a response goes out, the box closes. Done. Agentic products broke that model two years ago and most teams have not yet repriced their architecture around it.

Your Inference Chargeback Is Quietly Taxing Eval Discipline

· 12 min read
Tian Pan
Software Engineer

The FinOps team rolled out chargeback for AI a year ago. The dashboard is gorgeous. Every feature team can see, to the cent, what their inference bill was last month, and the platform PM has slides showing line-of-business attribution at the SKU level. The org has more AI features than it had a year ago. It also has worse AI quality. Nobody has connected the two facts yet, but they are the same fact.

Here is the failure mode in one sentence: chargeback prices the inference token and silently fails to price the eval token, so every PM on the org chart faces an incentive structure that rewards model upgrades and punishes evaluation discipline. Twelve months later, eval coverage is shrinking while the bill is growing — the precise opposite of what the FinOps initiative thought it was incentivizing. This is not a bug in the dashboard. It is the chargeback model functioning exactly as designed, in a domain where the design assumptions from cloud-cost FinOps no longer hold.

Inference Cost Forecasting: The Capacity Plan Your Finance Team Wants and You Can't Write

· 12 min read
Tian Pan
Software Engineer

Your finance team will ask for a capacity plan you cannot write. Not because you're inexperienced or because the model is new, but because the two assumptions classical capacity planning rests on — a workload distribution you can measure, and a unit cost stable on a quarter timescale — are both violated by AI workloads. The number you hand them will be wrong on day one, and when the variance hits, the conversation that follows will not be about the bill.

The 2026 State of FinOps report named AI as the fastest-growing new spend category, with a majority of respondents reporting that AI costs exceeded original budget projections — for many enterprises, inference now consumes the bulk of the AI bill. The instinct to manage this with a SaaS-style capacity plan — pick a peak QPS, multiply by a unit cost, add 30% buffer — produces a number with the texture of a forecast and the predictive power of a horoscope. The capacity plan you actually need looks more like a FinOps scenario model than a procurement spreadsheet, and the engineering work to produce it is platform work that competes with feature work until the day finance loses patience.

Retries Aren't Free: The FinOps Math of LLM Retry Policies

· 11 min read
Tian Pan
Software Engineer

A team I talked to last quarter found a $4,200 line item on their inference invoice that nobody could explain. The dashboard showed normal traffic. The latency graphs were flat. The cause turned out to be a single agent stuck in a polite retry loop for six hours, replaying a 40k-token tool chain with exponential backoff that capped out at thirty seconds and then started over. The retry policy was lifted verbatim from an internal SRE handbook written in 2019 for a JSON-over-HTTP service. It worked perfectly. It worked perfectly for the wrong system.

This is the bill that does not show up in capacity-planning spreadsheets. The retry-policy patterns the industry standardized on for stateless REST APIs assume three things that LLM workloads quietly violate: failures are transient, the cost of one extra attempt is bounded, and a retry has a meaningful chance of succeeding. Each assumption was load-bearing. Each one is now wrong, and the variance the cost model never captured is sitting at the bottom of every monthly invoice.

The teams that have not rebuilt their retry policy for token economics are paying a hidden tax that scales with the difficulty of the queries they were already most worried about — the long ones, the agentic ones, the ones with deep tool chains. The retry budget that classical resilience engineering hands you back as a safety net is, in an LLM stack, the rope.

Token-Per-Watt: The AI Sustainability Metric Your Dashboard Cannot Compute

· 11 min read
Tian Pan
Software Engineer

Your sustainability dashboard reports "AI energy: 2.3 GWh this quarter, down 4% YoY" and the slide gets a polite nod in the ESG review. The CFO walks out of an analyst call six months later and asks the head of platform a question that sounds simple: "What is our token-per-watt, and how does it compare to our competitors?" The dashboard cannot answer. Not because the data is missing — the dashboard is full of data — but because it treats inference as a single line item and tasks as a product concept, and the only honest unit of AI sustainability lives at the intersection.

The mismatch is not a reporting bug. It is a category error that the existing carbon-accounting playbook, perfected for cloud workloads on CPU-hours and kWh per VM, cannot fix on its own. Inference is not a workload with a stable energy profile. The watts per token shift by 30× depending on which model tier served the request, by 4× depending on batch size at the moment of the call, and by another order of magnitude depending on whether the prefix cache hit or missed. Aggregating those into a single GWh number is like reporting "average car fuel economy" across a fleet that includes scooters, sedans, and 18-wheelers — accurate in the most useless sense.

Vendor Benchmarks Are Your Ceiling, Not Your Forecast

· 10 min read
Tian Pan
Software Engineer

The model release announcement lands on Tuesday morning. The blog post leads with a chart: HumanEval up four points, SWE-bench Verified up six, MATH up three, the agent harness du jour up a number that would have been a research paper a year ago. By Tuesday afternoon there is a Slack thread inside your company with screenshots of the chart and a question shaped like a decision: "Should we cut over?" The thread treats the benchmark delta as a forecast — as if those numbers describe what the new model will do for your product, on your prompts, in your tool harness, against your eval rubric. They do not. The vendor's number is the upper bound on what you might see. Your realized lift is somewhere between zero and roughly half of that headline, and you cannot know which without running an eval the vendor did not run.

This is not a complaint about benchmark validity. The benchmarks are real. They are run against real eval suites. The vendor is not lying. The problem is that the vendor's harness is an idealized environment that strips away every variable a production deployment introduces, and a number generated under those conditions is structurally incapable of predicting behavior under yours. Treating it as a prediction is a category error — and it leads to procurement decisions, capacity-planning commitments, and rollout schedules that are calibrated against a fiction.

The Chargeback Ledger for Compound AI Systems

· 10 min read
Tian Pan
Software Engineer

The first time the CFO asks "what does the assistant cost us per month," the engineering team produces a number. The second time, a different team produces a different number. The third time, finance produces a third number, and somebody opens a spreadsheet that re-derives the bill from spans because nobody trusts any of the previous answers. This is the moment a compound AI system stops being an architecture problem and becomes an accounting problem.

The shape of the failure is structural. A single user request to "summarize my last quarter's customer feedback" triggers an agent owned by team A, which calls a retrieval tool maintained by team B, which calls a model hosted by provider X, which streams results back through a re-ranking tool from team C, which calls a different model from provider Y. One click; five owners; two invoices that arrive a month apart. Standard FinOps primitives — cost centers, allocation tags, account-level rollups — were designed to slice infrastructure that already had stable owners. They do not compose cleanly across an internal call graph that crosses team boundaries on every request.

The 2026 State of FinOps report puts 98% of FinOps teams on the hook for AI spend, and the same survey lists real-time visibility into AI costs as the top tooling gap. That gap is not "we cannot see the bill." The gap is "we cannot see who caused what slice of the bill, fast enough that anyone changes their behavior before the bill arrives."

Reasoning-Effort Budgeting: When Thinking Tokens Become a Finance Line Item

· 11 min read
Tian Pan
Software Engineer

The first time your finance team asks why a single user racked up a fifty-cent answer to a one-tenth-of-a-cent question, the call will not be about the model. It will be about the line on the invoice that did not exist twelve months ago: reasoning tokens. They look like output tokens on the bill, they bill at output-token rates on most providers, and they have no natural ceiling. A query that would have produced a four-hundred-token reply on a non-reasoning model can quietly burn eight thousand internal thinking tokens to get there — and the only person who notices is the one reconciling the spend.

For most of the API era, "tokens used" was an honest number. You sent a prompt in, you got a response out, and the bill was a clean function of both. Reasoning models broke that intuition. The model now generates a hidden, billable, internally-only-visible chain of thought before it emits the answer the caller will read, and the size of that chain depends on the model's own assessment of how hard the question was. The user-visible output may be a single sentence. The bill may be for ten pages.

Token Amplification: The Prompt-Injection Attack That Burns Your Bill

· 10 min read
Tian Pan
Software Engineer

A user submits a $0.01 request. Your agent reads a webpage. Forty seconds later, the inference bill for that single turn is $42. The query was technically successful — the agent returned a reasonable answer. It just took three nested sub-agents, a 200K-token document fetch, and a recursive plan refinement loop to get there. None of that fanout was the user's idea. It was a sentence buried in the page the agent read.

This is token amplification: a prompt-injection class that does not exfiltrate data, does not call unauthorized tools, and does not leave a clean security signature. It just sets your bill on fire. The cloud bill is the payload, and the user's request is the carrier.

Token Budgets Are the New Internal IAM

· 11 min read
Tian Pan
Software Engineer

The first time your AI bill clears seven figures in a month, the budget meeting changes shape. Until then, the question is "can we afford this." After that, the question is "who gets how much" — and most engineering orgs discover, in real time, that they have no policy framework for answering it. The team that shipped the loudest demo holds the highest quota by accident. Finance pushes for flat per-headcount caps that starve the team doing the highest-leverage work. Security gets cut out of the conversation entirely until somebody notices that the eval team has been pulling production traffic through their personal token allowance for six months.

The reason this conversation always feels like a cloud-cost argument is that it almost is one — but not quite. With cloud, the unit of waste is a forgotten EC2 instance and the worst case is a 3x bill. With token quotas, the unit of waste is a runaway agent loop, and the unit of access is a user-facing capability: whoever holds the budget can ship the feature. That second property is what makes token allocation rhyme with capability-based security instead of with cloud FinOps. The quota is not just a spending cap. It is the right to make a class of inferences happen.

The Inference Budget Committee: Governance When Token Spend Crosses Seven Figures

· 12 min read
Tian Pan
Software Engineer

At $50,000 a month, the "compute + tokens" line on your infra bill is rounding error. At $5,000,000 a month, it is a CFO question. The transition between those two states is not gradual — it is a phase change in how an organization talks about model spend, and most engineering orgs are unprepared for the social and political work that follows. The bill stays a single line; the conversation around it does not.

What changes is who has standing to ask "why." When three product teams share one API key and one capacity reservation, every quota argument has the same structure: someone is currently winning at the expense of someone else, and there is no neutral party to call it. The first time a team's launch is throttled because another team shipped a chatty agent, the absence of a governance body is felt by the entire engineering org at once. Calling a meeting and inventing a process under pressure is the worst time to design one.

The Cancellation Tax: Your Inference Bill After the User Hits Stop

· 9 min read
Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the model has already been scheduled, KV-cached, and is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that's what the provider's usage block reported.