40 posts tagged with "finops"

The Cost Dashboard Your Finance Team Built That Excluded the Embeddings Re-index

· 10 min read
Tian Pan
Software Engineer

Your finance team built a beautiful AI cost dashboard. Token spend, sliced by feature. Embedding spend, sliced by provider. Every quarter, the per-feature pane gets reviewed in a leadership meeting and somebody asks why the support-chat workflow is up 12%, and a product manager has a defensible answer. Every quarter, the per-provider pane gets reviewed in an infra meeting and somebody asks why OpenAI is up 8%, and a platform engineer has a defensible answer. And every quarter, the line that actually doubles your AI bill — the corpus re-index — lands in a third bucket called "infrastructure" that nobody reviews because nobody owns it.

That bucket is where forty percent of your AI spend goes to die unattributed. The teams who could have optimized it never see it. The teams who see it can't tell you which feature it serves. The dashboard is honest about every cost it can explain and silent about the cost it can't, which is exactly the cost that matters most.

The Streaming Abort Your Provider Billed Anyway: A 14% Gap Hiding in Your Invoice

· 10 min read
Tian Pan
Software Engineer

Your finance team filed a dispute and lost. The line item is "output tokens" and it exceeds your sum-of-delivered-tokens metric by fourteen percent. The provider's support engineer closed the ticket as "expected behavior under streaming cancellation," with a link to a documentation page that says "cancellation stops billing at the last delivered token." Both sentences are true, and the gap between them is the line of code you have not written.

The contract you read says one thing. The inference scheduler does another. The mismatch is not a bug, not a billing error, and not malice — it is a layered system in which the cancellation signal travels through three boundaries (browser, edge, GPU) and the billing meter sits at the third boundary while your "stop generating" button sits at the first. Closing the gap is an engineering project with a finance owner.
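The gap can be measured before it is disputed. Below is a minimal sketch of a per-request reconciliation, assuming you log both the tokens your client actually received and the output tokens the provider reported for the same request. The record fields (`delivered`, `billed`, `aborted`) and the sample numbers are hypothetical, chosen only so the totals reproduce a fourteen percent gap.

```python
def billing_gap(billed_output_tokens: int, delivered_output_tokens: int) -> float:
    """Fractional excess of billed output tokens over delivered output tokens."""
    if delivered_output_tokens <= 0:
        raise ValueError("delivered_output_tokens must be positive")
    return billed_output_tokens / delivered_output_tokens - 1


# Hypothetical request log: one completed stream and two user-aborted streams.
requests = [
    {"delivered": 400, "billed": 400, "aborted": False},
    {"delivered": 350, "billed": 470, "aborted": True},
    {"delivered": 250, "billed": 270, "aborted": True},
]

delivered_total = sum(r["delivered"] for r in requests)
billed_total = sum(r["billed"] for r in requests)
gap = billing_gap(billed_total, delivered_total)

# Tokens billed but never delivered, restricted to aborted streams.
aborted_excess = sum(r["billed"] - r["delivered"] for r in requests if r["aborted"])

print(f"gap: {gap:.0%}, excess on aborted streams: {aborted_excess} tokens")
```

Splitting the excess by the `aborted` flag is the point of the exercise: if the whole gap sits on cancelled streams, the problem is cancellation latency rather than metering.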

The Agent Budget That Approved Cost-Per-Call and Never Measured Cost-Per-Resolved-Task

· 10 min read
Tian Pan
Software Engineer

A quarter into the rollout, the AI team reported a 25% reduction in average cost-per-API-call. The support team reported that average handle time on AI-routed tickets had drifted from four turns to seven. Both numbers were correct. Both teams were measuring the system they had been told to optimize. The finance team, sitting between them, could not reconcile the dashboards because neither one was denominated in the thing the customer was actually paying for: a resolved ticket. The cost-per-call had gone down. The cost-per-resolved-task had gone up 40%. Nobody owned that number, so nobody was watching it move.

This is the most common unit-economics failure I see in agentic deployments, and it is not a measurement bug. It is a definitional one. The vendor's pricing page exposes cost-per-call because that is the unit they bill. The spreadsheet line item inherits that unit because it fits in a cell. The engineering team optimizes against the unit they were given. By the time the gap between API economics and business economics becomes visible, it has been compounding for a quarter, and the agent has been quietly trained on the wrong loss function the entire time.
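The two dashboards reconcile once both are expressed in the same unit. Here is a small worked sketch, assuming one model call per turn and a known resolution rate. The dollar figures and both resolution rates are hypothetical inputs picked to be consistent with the numbers above (cost-per-call down 25%, turns up from four to seven); they are not measurements.

```python
from dataclasses import dataclass


@dataclass
class Period:
    cost_per_call: float     # dollars per API call (hypothetical)
    calls_per_ticket: float  # assumes one model call per turn
    resolution_rate: float   # fraction of tickets resolved (hypothetical)

    def cost_per_resolved_task(self) -> float:
        # Spend on all tickets, divided across only the tickets that resolved.
        return self.cost_per_call * self.calls_per_ticket / self.resolution_rate


before = Period(cost_per_call=0.040, calls_per_ticket=4, resolution_rate=0.80)
after = Period(cost_per_call=0.030, calls_per_ticket=7, resolution_rate=0.75)

call_change = after.cost_per_call / before.cost_per_call - 1
task_change = after.cost_per_resolved_task() / before.cost_per_resolved_task() - 1

print(f"cost per call:          {call_change:+.0%}")
print(f"cost per resolved task: {task_change:+.0%}")
```

With these inputs the per-call number improves by a quarter while the per-resolved-task number worsens by 40%, because the extra turns and the lower resolution rate both land in a denominator that neither team reports.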

The Chargeback Model That Made Every Team Rewrite Their Prompts Overnight

· 10 min read
Tian Pan
Software Engineer

Finance sent a memo on a Monday. By Friday, every product team had shipped a prompt change, and on the following Tuesday the support queue grew by a third. Nobody had touched the model. Nobody had touched the product. The only thing that had changed was that the LLM bill was now flowing back to the teams that issued the calls — and the teams had responded the way any rational cost center responds to a new line item on its P&L. They cut it.

The story that gets told inside the company afterwards is a story about prompt engineering, or about the model regressing, or about a noisy week of user traffic. The truer story is that finance, through a chargeback policy, had quietly become a product manager. The cost-attribution dashboard was a product-quality lever that nobody had reviewed, nobody had instrumented for, and nobody owned. When it moved, every prompt in the company moved with it, and the trade-offs that produced the quality regression were never seen by the people whose job it was to see them.

The Coding Agent CI Bill That Doubled Without a Postmortem

· 10 min read
Tian Pan
Software Engineer

The line item climbed 130% over six weeks and nobody on the engineering team noticed. PRs were landing faster. Per-PR CI cost on the dashboard looked the same as last quarter. The agent's branches went green on the first try more often than the humans' branches did, which actually pulled the median CI duration down. Finance found it during quarterly review, flagged it as an unexplained variance, and asked engineering for the postmortem. Engineering had nothing to write: no incident, no regression, no failed deploy. Just a budget line that had quietly more than doubled while every dashboard reported normal.

That postmortem-shaped hole is the artifact. The cost shifted from a labor-dominant curve to an infrastructure-dominant curve, and the team that owned the labor budget was not the team that owned the infrastructure budget. The agent didn't break anything. It just changed which line on the P&L absorbed the work.

The Cost Forecast Tied to a Pricing Tier You No Longer Qualify For

· 11 min read
Tian Pan
Software Engineer

The usage curve barely moved. The bill went up 38%.

That is the email the finance lead at a mid-sized fintech opened on the first Monday of the quarter. Three months earlier, the engineering org had renegotiated their LLM inference contract and shaved a sizeable percentage off the negotiated unit price by committing to a volume floor. The finance model rolled the new unit price into the FY forecast. Nobody bookmarked the footnote in the pricing schedule that said the discount would lapse if monthly usage fell below the floor for three consecutive months. The seasonal traffic dip in April-May did exactly that. The provider re-tiered the account back to list price. No notification reached engineering, because the notification went to the procurement inbox that nobody had read since the contract was signed.
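The footnote is easy to turn into a check that runs against the usage data engineering already has. A minimal sketch, assuming the clause is exactly "three consecutive months below the floor"; the function name `lapse_month` and the usage figures are hypothetical.

```python
from typing import Optional, Sequence


def lapse_month(usage: Sequence[float], floor: float, window: int = 3) -> Optional[int]:
    """Index of the month that completes `window` consecutive below-floor
    months, or None if the discount survives the whole series."""
    streak = 0
    for i, used in enumerate(usage):
        # Any month at or above the floor resets the count.
        streak = streak + 1 if used < floor else 0
        if streak >= window:
            return i
    return None


# Hypothetical monthly usage in millions of tokens, against a floor of 100.
print(lapse_month([120, 95, 92, 97, 130], floor=100))   # three misses in a row
print(lapse_month([120, 95, 92, 105, 97], floor=100))   # streak broken in time
```

Running the same function with `window=2` gives a warning one month before the lapse, which is the month in which someone can still do something about it.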

The Token Budget You Cannot See Until You Hit It

· 10 min read
Tian Pan
Software Engineer

Your team negotiated a monthly token allocation with your inference provider. The contract specifies the cap. The dashboard in the provider portal shows yesterday's usage with a one-day lag. The API itself returns per-minute rate-limit headers — anthropic-ratelimit-tokens-remaining, x-ratelimit-remaining-requests — and nothing about the monthly bucket you actually have to plan against. And your agent fleet has no mechanism to slow down as the budget depletes, because the only signal that arrives in real time is the 429 — which arrives after the budget is already gone, dressed up as the same transient error your retry logic was tuned to ignore.

This is a different shape of problem than rate limiting. Rate limits are a fast-moving throttle the consumer must react to within seconds; the headers tell you the bucket has a thousand tokens left and refills in forty seconds, and a well-written client backs off and tries again. Monthly quota is a slow-moving budget the consumer must plan against over weeks. The two get confused because they share the failure code and sometimes share the dashboard, but they require different controls — and the gap between what the provider exposes and what the consumer needs is where the worst incident of the month lives.
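Because the monthly bucket is not in any response header, the consumer has to keep its own ledger. The following is a rough sketch of a client-side pacer, assuming a linear even-burn plan across the month. The class name, the 10% throttle floor, and the token counts are all hypothetical choices, not anything a provider exposes.

```python
from dataclasses import dataclass


@dataclass
class MonthlyBudget:
    cap_tokens: int
    days_in_month: int
    used_tokens: int = 0

    def record(self, tokens: int) -> None:
        """Add the tokens consumed by one completed request to the ledger."""
        self.used_tokens += tokens

    def pace(self, day_of_month: int) -> float:
        """Actual spend divided by planned spend so far; above 1.0 means
        the fleet is burning faster than an even schedule allows."""
        planned = self.cap_tokens * day_of_month / self.days_in_month
        return self.used_tokens / planned

    def throttle_factor(self, day_of_month: int) -> float:
        """Fraction of normal concurrency to allow (assumed policy)."""
        p = self.pace(day_of_month)
        if p <= 1.0:
            return 1.0
        return max(0.1, 1.0 / p)  # never drop below 10% of normal


budget = MonthlyBudget(cap_tokens=3_000_000_000, days_in_month=30)
budget.record(1_500_000_000)  # half the cap gone by day 10

print(f"pace on day 10: {budget.pace(10):.2f}")
print(f"throttle factor: {budget.throttle_factor(10):.2f}")
```

The design choice worth noting is that the throttle responds to pace, not to remaining balance, so the fleet slows down weeks before the 429 would have arrived.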

The Model Migration That Broke Your Prompt Cache Without Warning

· 10 min read
Tian Pan
Software Engineer

The migration looked clean. Evals were re-anchored against the new model version. Judge prompts were re-calibrated. Two weeks of shadow traffic showed behavior parity within tolerance. p50 and p99 latency were inside the budget. The rollout call signed off on Thursday afternoon and the team went home.

By Friday morning, the inference bill was 3x normal. Eval scores were still fine. Latency was still fine. No one on the rollout call had thought to instrument the cache hit rate, because the prefix had not changed — the system prompt was byte-identical, the tool definitions were byte-identical, the conversation framing was byte-identical. What had changed was the model version in the request body, and the provider keys its prefix cache on (prefix bytes + model version). Every request after the cutover landed on a cold cache. The warm-up curve took six weeks of organic traffic to recover, and the team paid full input-token rates for every token on every request for the duration.
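The failure is easy to see once the cache key is written down. This is an illustrative sketch only: it assumes the key is a hash over the model version and the prefix bytes, as described above, and does not reflect any provider's actual implementation. The model names and the `cache_hit_rate` helper are hypothetical.

```python
import hashlib


def cache_key(prefix: bytes, model_version: str) -> str:
    """Illustrative key over (prefix bytes + model version)."""
    h = hashlib.sha256()
    h.update(model_version.encode("utf-8"))
    h.update(b"\x00")  # separator so version and prefix cannot run together
    h.update(prefix)
    return h.hexdigest()


def cache_hit_rate(cached_input_tokens: int, total_input_tokens: int) -> float:
    """Share of input tokens served from cache; the metric to alert on."""
    if total_input_tokens == 0:
        return 0.0
    return cached_input_tokens / total_input_tokens


prefix = b"system prompt + tool definitions + conversation framing"
old_key = cache_key(prefix, "model-v1")
new_key = cache_key(prefix, "model-v2")

print("same prefix, same key after migration?", old_key == new_key)
print(f"hit rate before cutover: {cache_hit_rate(9_000, 10_000):.0%}")
print(f"hit rate after cutover:  {cache_hit_rate(0, 10_000):.0%}")
```

A byte-identical prefix is necessary for a cache hit but not sufficient, which is why the hit rate itself, rather than a prefix diff, belongs on the rollout checklist.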

Your Happy Path Is Your Expensive Path: The Agent That Costs More When It Wins

· 10 min read
Tian Pan
Software Engineer

A failed agent run is cheap. It misroutes a query, hits a dead end, returns "I couldn't help with that," and burns maybe a few hundred tokens doing it. A successful run is the disaster. It retrieves context, reflects on it, calls three tools, reflects again, and stitches together a confident multi-paragraph answer — fifty times the token spend of the failure that cost you nothing.

This is the uncomfortable shape of agent economics: your happy path is your expensive path. The outcome you are selling, the one your marketing page promises, the one users thank you for, is the single most costly thing your system can do. And if you priced the product the way SaaS has been priced for fifteen years — a flat monthly fee per seat — then every time the agent does its job well, it quietly erodes your margin.

Most teams discover this backwards. They watch cost dashboards, see failures are cheap, and conclude that reliability work will save money. It won't. Raising your success rate raises your bill.
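The arithmetic behind that claim is a two-term expected value. A small sketch, assuming a successful run costs fifty times a failed one as described above; the absolute token counts and the two success rates are hypothetical.

```python
def expected_tokens_per_run(success_rate: float,
                            success_tokens: int,
                            failure_tokens: int) -> float:
    """Expected token spend per run, weighting each outcome by its probability."""
    return success_rate * success_tokens + (1 - success_rate) * failure_tokens


FAILURE_TOKENS = 300                   # hypothetical cheap dead end
SUCCESS_TOKENS = 50 * FAILURE_TOKENS   # fifty times the failure spend

at_60 = expected_tokens_per_run(0.60, SUCCESS_TOKENS, FAILURE_TOKENS)
at_80 = expected_tokens_per_run(0.80, SUCCESS_TOKENS, FAILURE_TOKENS)

print(f"60% success: {at_60:,.0f} tokens per run")
print(f"80% success: {at_80:,.0f} tokens per run")
print(f"change in spend: {at_80 / at_60 - 1:+.0%}")
```

Under these assumptions, moving the success rate from 60% to 80% raises expected spend per run by roughly a third, with no change in traffic.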

Why You Can't Budget an AI Feature With a Single Number

· 9 min read
Tian Pan
Software Engineer

Finance asks one question about every feature you ship: "What does it cost per user?" For a traditional feature, the answer is a number. A page render, a database query, a push notification — each has a marginal cost that barely moves from one request to the next. You measure it once, multiply by your user count, and the forecast holds.

An AI feature breaks that contract. Ask "what does this agent cost per request" and the honest answer is not a number, it's a histogram. The same agent that resolves one ticket for two cents will burn four dollars on the next one, because that user asked a vague question, the agent looped through eleven tool calls, and each call dragged the entire growing conversation back through the model. The mean of those two requests, about two dollars, describes neither of them, and it definitely doesn't describe the bill.

That is the trap. When you hand finance a single average cost, you are not simplifying a messy reality. You are reporting a number that is wrong in a specific, expensive direction.
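A quick way to see how far the average sits from the typical request is to report a few points of the distribution instead. The sketch below uses a hypothetical sample in which nine requests in ten cost two cents and one in ten costs four dollars; the mix is an assumption for illustration, and the percentile uses a simple nearest-rank rule.

```python
import statistics
from typing import Sequence


def percentile(values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)  # integer ceiling
    return ordered[rank - 1]


# Hypothetical per-request costs in dollars.
costs = [0.02] * 90 + [4.00] * 10

mean_cost = statistics.fmean(costs)
p50 = percentile(costs, 50)
p95 = percentile(costs, 95)
tail_share = sum(c for c in costs if c >= p95) / sum(costs)

print(f"mean ${mean_cost:.3f}, p50 ${p50:.2f}, p95 ${p95:.2f}")
print(f"share of spend from the p95-and-above tail: {tail_share:.0%}")
```

In this sample the mean is about twenty times the median, and the expensive tenth of requests accounts for nearly all of the spend, which is the shape a single number hides.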

Who Owns the Idle Cost of an AI Feature

· 10 min read
Tian Pan
Software Engineer

The pay-per-token mental model has trained a generation of engineers to think AI cost is a function of usage. No requests, no bill. It is a comforting model, and for the API call itself, it is roughly true. But it describes only one layer of a production AI feature, and not the layer that quietly drains the budget.

Provisioned throughput, reserved GPU capacity, warm vector indexes, and standby fine-tuned endpoints all bill on a clock, not a counter. They charge for the right to serve traffic, whether or not traffic arrives. The feature nobody touches on a Saturday still has a meter running. The internal tool used by twelve people during business hours bills for all 168 hours of the week. The launch you provisioned for in March still holds its reservation in May, long after the spike flattened.

This is idle cost, and the reason it grows unchecked is not technical. It is organizational: no single role can see it, and no single role owns it.
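The clock-versus-counter distinction can be put in numbers for a single resource. A minimal sketch, assuming "business hours" means 40 hours a week and using a hypothetical hourly rate; only the 168-hour week comes from the text above.

```python
HOURS_PER_WEEK = 168


def idle_cost(hourly_rate: float, used_hours: float,
              billed_hours: float = HOURS_PER_WEEK) -> float:
    """Dollars paid for hours in which the resource served no traffic."""
    return hourly_rate * (billed_hours - used_hours)


def utilization(used_hours: float, billed_hours: float = HOURS_PER_WEEK) -> float:
    """Fraction of billed hours in which the resource was actually used."""
    return used_hours / billed_hours


HOURLY_RATE = 3.00  # hypothetical dollars per hour for a provisioned endpoint
USED_HOURS = 40     # assumed: 8 hours a day, 5 days a week

weekly_bill = HOURLY_RATE * HOURS_PER_WEEK
print(f"weekly bill: ${weekly_bill:.2f}")
print(f"utilization: {utilization(USED_HOURS):.1%}")
print(f"idle cost:   ${idle_cost(HOURLY_RATE, USED_HOURS):.2f}")
```

At these assumed figures, roughly three quarters of the weekly bill pays for hours nobody used, and none of it would appear on a per-request cost report.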

GPU Capacity Is a Roadmap Constraint: The 18-Month Contract That Decided Q3

· 9 min read
Tian Pan
Software Engineer

Somewhere in your company, fourteen months ago, a finance director and a platform lead signed a multi-year accelerator commitment. They built a peak-load model from the prior quarter's telemetry, negotiated a discount of 40 to 70 percent off on-demand pricing, and locked in the cluster shape that your product roadmap now has to fit inside. Nobody on the product team was in the room. Nobody on the application engineering team saw the spreadsheet. The contract is binding, the discount only applies if the commitment is honored, and the capacity envelope it bought is now the silent ceiling on every Q3 feature your PMs are scoping.

The gap most teams don't notice until the second year: capacity contracts are roadmap decisions, but they're being made by people who don't see the roadmap, using inputs that don't include the roadmap. The product trio thinks it's choosing features from a clean priority backlog. Finance thinks it's optimizing a fixed envelope. Both are right inside their own frame, and the collision shows up in a planning meeting where an architect proposes a 70B-parameter model for the new assistant feature and the platform lead says, quietly, that the cluster is already at 85 percent and that model doesn't fit without crowding out something else.