Skip to main content

7 posts tagged with "unit-economics"

View all tags

The Off-Hours Cost Curve: Why Your AI Feature Spends Differently on Saturday Than on Tuesday

· 10 min read
Tian Pan
Software Engineer

The cost dashboard everyone looks at is a weekly rolling average, and that average is lying to you. Not in the sense that the number is wrong — it's a faithful arithmetic mean of a billing event stream — but in the sense that it is hiding the shape of the cost curve underneath. The hours between Friday evening and Monday morning consume tokens differently from the hours between Tuesday at 10am and Thursday at 4pm. The cohort active on Saturday at 3am is not the cohort active on Tuesday at 11am, and the per-user economics of those cohorts diverge by a factor that nobody writes down because the dashboard averaged it away.

Most teams discover this the first time a weekend automation script melts the budget. A LangChain agent gets into an infinite conversation cycle Friday night, runs for the better part of a week before anyone notices, and produces a five-figure invoice that has to be explained to finance on Monday morning. The post-incident review treats it as a one-off — bad retry logic, missing budget cap, didn't page on-call. But the same dashboard that hid the runaway loop is also hiding the steady-state version of the same phenomenon: a baseline of off-hours traffic whose unit economics are structurally worse than the business-hours baseline, every single week, and which the weekly average smooths into invisibility.

Cost-Per-Conversation as a Product Contract: When Pricing Drives Architecture

· 10 min read
Tian Pan
Software Engineer

The cleanest way to find out your AI feature's pricing model is wrong is to look at which engineer is currently rewriting the truncation logic at midnight. They aren't shipping a capability — they're patching a unit-economics leak that the PRD never named, and the patch is necessarily user-hostile because the product spec told them the budget was infinite. On a flat-fee SaaS plan, every conversation that runs longer than the median pulls margin out of the company in real time. The only real question is whether the product team admits it before finance does.

Traditional SaaS economics rest on near-zero marginal cost per user: once the software is built, serving the next customer barely moves the infrastructure line. AI features break that assumption. Every turn in a conversation consumes inference compute that scales with prompt size, output length, tool-call fan-out, and retrieval volume — and conversations don't have a natural stopping point. A heavy user can consume 50× the median in a billing period without leaving the happy path of the product. Under flat pricing, that user is funded by the rest of the user base, and the company finds out only when COGS reporting catches up a quarter later.

This is why pricing on AI features is not a finance problem to be handled after launch. It is an architecture input that decides what the product is allowed to do, and refusing to make it visible in the spec just means it gets resolved later, in worse ways, by people without product authority.

Token Economics for AI-Powered API Products: Pricing What You Cannot Predict

· 10 min read
Tian Pan
Software Engineer

A team ships a customer-facing AI assistant. They price it at $49/month per seat, targeting 70% gross margins based on a spreadsheet that assumed "average 500 tokens per query." Three months later, finance flags that their heaviest users are consuming 15,000 tokens per session. The pricing model collapses not because the feature failed, but because the product team priced something they didn't yet understand.

This isn't a failure of forecasting. It's a structural problem: the cost basis of an LLM-powered product is fundamentally unlike anything traditional SaaS pricing was designed to handle. Every API call has unpredictable and material token cost. The inputs vary wildly by user, task, and time of day. The outputs compound in ways that only show up weeks later on your cloud bill. And once you layer in agentic patterns — tool calls, multi-turn reasoning, subagent orchestration — a single user interaction can cost $0.02 or $20 depending on what the model decides to do.

Why Token Forecasts Drift After Launch — and How to Catch the Spike Before Finance Does

· 10 min read
Tian Pan
Software Engineer

The pre-launch cost model is a beautiful spreadsheet. It assumes a synthetic traffic mix run through a representative prompt at a tested cache hit rate and a clean tool-call path. The post-launch reality is that none of those assumptions survive the moment the feature actually starts working. The intents your synthetic traffic didn't cover are precisely the ones that stick. The marketing surge from a campaign engineering didn't get the meeting invite for lands on the highest-cost branch in your routing tree. The heavy-user cohort that uses 40× the median doesn't show up until week three.

The industry-wide version of this problem is now well-documented: surveys put the share of enterprises missing their AI cost forecasts by more than 25% at around 80%, and report routine cost increases of 5–10× in the months immediately after a successful launch. The crucial detail in those numbers is the word successful. Failed AI features stay on budget. The drift is driven by the feature working, not by the team doing something wrong. That makes it a planning artifact problem, not an engineering problem — and the planning artifact most teams reach for, the monthly bill, is the worst possible detector.

Your LLM Bill Is Half Your Agent's COGS — The Other Half Is The Part Nobody Is Monitoring

· 10 min read
Tian Pan
Software Engineer

The first time a finance team asks an AI product team to forecast unit economics, the conversation goes the same way. The team pulls up the inference dashboard, points at the monthly token spend, and says "that's our COGS." The CFO multiplies by projected volume, draws a line on a chart, and asks where the gross margin curve crosses 70%. Six weeks later, when the actual P&L lands, the inference number on the dashboard is correct and the gross margin is twenty points lower than the forecast. Nobody is lying. Inference was just half of what the agent actually costs.

The other half is distributed across line items that nobody on the AI team owns. The vector database bill grows quietly because retrieval volume tracks usage and re-indexing costs are billed against compute, not storage. The observability platform's invoice arrives from the platform team's budget. Embedding regeneration shows up as a CI cost. Telemetry storage is filed under data warehouse. Human review is in customer-success headcount. None of these line items is alarming on its own — and that is exactly why the integrated number is the one that surprises everyone.

Cost-Per-Correctness, Not Cost-Per-Token: The Unit Metric Your Bill Won't Tell You

· 11 min read
Tian Pan
Software Engineer

A team I know cut their inference bill 40% last quarter by migrating their support-email triage flow from a frontier model to a mid-tier one. The CFO sent a thank-you note. Six months later, customer support headcount was up two FTEs and average resolution time had risen 35%. Nobody connected the dots, because the dots lived in different dashboards: the inference bill on the platform team's, the support load on the operations team's. The migration looked like a win on the only metric anyone was tracking. The metric was wrong.

This is the cost-per-token trap. Your invoice tells you what you spent on tokens. It cannot tell you what you spent per correct task, because the inference vendor has no idea what "correct" means in your domain. They sold you raw compute. You bought outcomes — or thought you did. The gap between those two units is where AI unit economics quietly comes apart, and the team that doesn't measure the right denominator is running half the equation and shipping the other half blind.

The Tip Jar Problem: When 5% of Your Users Burn 80% of Your Inference Budget

· 11 min read
Tian Pan
Software Engineer

A single developer ran up more than $35,000 in compute under a $200 monthly plan. That is a 175x subsidy on one user — paid for by the casual majority who would have been just as happy on a $19 tier. This is the load-bearing math behind every "Why is our AI margin negative this quarter?" Slack thread. The problem is not that one user; it is that the long tail of one users follows a power law, and a power law plus flat-rate billing plus a real per-unit cost is a structural margin compressor that no amount of growth will fix.

The reflex when this lands on a finance review is to clamp down: hard token caps, "fair-use" language buried in the TOS, weekly throttles, a quietly degraded model for free tier. These all work in the sense that they cut the bleed. They also alienate the exact users whose evangelism you depend on, because the people who hit your caps are the ones who actually figured out how to extract value from your product. The standard fix is a backwards-compatible apology to the wrong cohort.