7 posts tagged with "cost"

The Hidden Tax on Your AI Features: What Your Inference Bill Isn't Telling You

· 10 min read
Tian Pan
Software Engineer

When engineers pitch an AI feature, the cost conversation almost always centers on the inference API. How much per token? What's the monthly estimate at our expected call volume? Can we negotiate a volume discount? This is the wrong conversation — or at least an incomplete one.

In practice, the inference bill accounts for roughly 20-30% of what it actually costs to run a mature AI feature. The rest is distributed across a portfolio of costs that don't show up on your LLM provider's invoice: the vector database your retrieval pipeline depends on, the embedding jobs that populate it, the observability platform catching silent failures, the human reviewers validating model outputs, and the engineers who spend weeks tuning the prompts that make everything work. Teams discover this the hard way, usually six months after launch when they're trying to explain a cost center that's 3-5x higher than projected.
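
To make the portfolio concrete, here is a back-of-envelope monthly cost model. Every line item and dollar figure below is a hypothetical placeholder; substitute your own invoices and fully loaded labor rates.

```python
# A back-of-envelope monthly TCO model for one AI feature.
# Every line item and dollar figure is a hypothetical placeholder.
monthly_costs = {
    "inference_api": 12_000,       # the only line item most teams budget for
    "vector_database": 3_500,      # managed retrieval store
    "embedding_jobs": 2_000,       # re-embedding content as it changes
    "observability": 4_000,        # traces, evals, alerting
    "human_review": 9_000,         # sampled validation of model outputs
    "prompt_engineering": 15_000,  # fraction of engineer time, fully loaded
}

total = sum(monthly_costs.values())
share = monthly_costs["inference_api"] / total
print(f"Total: ${total:,}/month, inference share: {share:.0%}")
# -> Total: $45,500/month, inference share: 26%
```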

Per-User AI Quotas: The UX Layer Your Cost Dashboard Can't See

· 10 min read
Tian Pan
Software Engineer

A user opens your AI feature at 3pm on a Tuesday. They've been using it lightly for three weeks. This time the request hangs for eight seconds and returns a red banner: "Something went wrong. Please try again later." They try again. Same banner. They close the tab and go back to whatever they were doing before, and at standup the next morning they tell their teammates that "the AI thing is broken."

What actually happened: they crossed an invisible per-user quota that your cost team set six months ago to keep a single power user from blowing through the GPU budget. The quota worked. Spend stayed flat. The dashboard is green. The feature is, by every metric your engineering org tracks, healthy. It's also dead, because the user who got that banner is never coming back, and the three teammates they told at standup will never try it.

This is the gap your cost dashboard cannot see. Per-user AI quotas are a product surface. The team that hides them inside an HTTP 429 is letting their cost-control system silently shape user perception of the product, and they will not find out until churn shows up in a quarterly review with no obvious cause.
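A minimal sketch of the alternative, assuming a hypothetical endpoint and a `requests`-based client: treat the 429 from the quota layer as a distinct product state that reaches the UI, instead of collapsing it into a generic error.

```python
# Surfacing a per-user quota as product state rather than a generic failure.
# The endpoint, headers, and copy are hypothetical placeholders.
import requests

def call_ai_feature(user_id: str, payload: dict) -> dict:
    resp = requests.post(
        "https://api.example.com/ai/summarize",  # placeholder endpoint
        json=payload,
        headers={"X-User-Id": user_id},
    )
    if resp.status_code == 429:
        # Quota exhaustion is not an error. Tell the user what happened
        # and when it resets instead of "Something went wrong."
        retry_after = resp.headers.get("Retry-After")
        return {
            "state": "quota_exhausted",
            "message": (f"You've hit today's AI limit. It resets in {retry_after}s."
                        if retry_after else "You've hit today's AI limit for now."),
        }
    resp.raise_for_status()
    return {"state": "ok", "data": resp.json()}
```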

The SIEM Bill Your AI Feature Forgot to Include

· 10 min read
Tian Pan
Software Engineer

The math is simple and nobody did it. Pre-AI, a single user action — "summarize this ticket," "send this email" — produced one application log line. Post-AI, the same action emits a request log, an LLM call trace, a tool-invocation span for each tool the agent called, a retrieval span per chunk it read, a response log, and an eval log if you sample for offline scoring. The fan-out for one user click is now 30 to 50 records landing in your observability pipeline, and that's before retries, before sub-agents, before the planner-executor split that doubles everything again.

You shipped an AI feature in Q1. In Q2, your security director walks into a budget review with a Splunk renewal that's 4x higher than last cycle. Nobody on the AI team is in the room. The conversation that happens next — about who owns the cost, why the threat-detection rules stopped working, and whether legal hold on every conversation is actually mandatory — is a conversation you should have had at design time and didn't, because the cost didn't show up on the LLM invoice. It showed up downstream, in a tool the AI team has never logged into.
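To see the fan-out in dollars-adjacent terms, here is the back-of-envelope version. All record counts, overhead factors, and traffic volumes below are assumptions, not measurements.

```python
# Back-of-envelope observability fan-out for one AI-assisted click.
# All counts and factors are assumptions, not measurements.
records_per_click = {
    "request_log": 1,
    "llm_call_trace": 1,
    "tool_invocation_spans": 4,   # assume the agent calls four tools
    "retrieval_spans": 20,        # assume twenty chunks read
    "response_log": 1,
    "eval_log": 1,                # sampled for offline scoring
}

base = sum(records_per_click.values())   # 28 records before overhead
retry_factor = 1.2                       # assumed 20% retry overhead
planner_executor_factor = 2.0            # the split that doubles everything

per_click = base * retry_factor * planner_executor_factor
clicks_per_month = 2_000_000             # hypothetical traffic
print(f"{per_click:.0f} records/click, "
      f"{per_click * clicks_per_month / 1e6:.0f}M records/month into the SIEM")
# -> 67 records/click, 134M records/month into the SIEM
```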

Tokenizer Drift: Your Local Counter Lies, the Bill Tells the Truth

· 9 min read
Tian Pan
Software Engineer

A team I know spent three weeks chasing a "context truncation" bug that only fired in production for Japanese customers. Their CI fixtures were English. Their tiktoken count said the prompt fit in 8K with a 600-token margin. The provider's invoice said the request had been rejected for exceeding the limit. The two numbers were off by 11%, the safety margin lived inside that 11%, and nobody had ever measured the disagreement on CJK text. The fix wasn't a new model — it was throwing away the local counter as a source of truth.

That's the subtle, expensive shape of tokenizer drift: not a single wrong number, but a class of small systematic errors that accumulate at the boundaries you forgot to test. The local counter in your IDE, the budget calculator in your gateway, the rate-limit estimator in your retry middleware, and the authoritative count the provider charges against — none of these agree, and the gap widens exactly where your users live.
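One way to measure the disagreement, sketched below with tiktoken as the local estimator. The authoritative billed count must come from the usage field of a real API response; here it is a placeholder derived from the 11% gap described above.

```python
# Measuring tokenizer drift: local tiktoken estimate vs the count the
# provider actually bills. Run this per language bucket on real invoices.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def drift(text: str, billed_tokens: int) -> float:
    local = len(enc.encode(text))
    return (billed_tokens - local) / local

sample = "このチケットを要約してください。" * 100   # CJK text, the untested boundary
local_count = len(enc.encode(sample))
billed = int(local_count * 1.11)    # placeholder standing in for the invoice number
print(f"local={local_count}, billed={billed}, drift={drift(sample, billed):+.1%}")
```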

Determinism Budgets: Treat Randomness as a Per-Surface Allocation, Not a Global Knob

· 11 min read
Tian Pan
Software Engineer

The temperature debate is the most religious argument in AI engineering, and one of the least productive. Two camps form on every team: the determinists who want temperature pinned at zero everywhere because they cannot debug a flaky system, and the creatives who want it cranked up because the outputs feel more "alive." Both are wrong, because both are answering the question at the wrong level. Temperature is not a global setting. It is a budget — and like any budget, it should be allocated, not declared.

The productive frame is simple: every model call in your system has a purpose, and randomness either earns its keep at that surface or it does not. A planner deciding which tool to call next has nothing to gain from variation; an off-by-one tool selection is a debugging nightmare and there is no creative upside. A response-synthesis surface that summarizes a search result for ten thousand users gets robotic in a hurry if every user sees the same phrasing — and the SEO team will eventually flag the boilerplate. A brainstorming surface where the model proposes alternatives for a human to pick from is worse at temperature 0; the diversity is the feature.

If you cannot articulate what randomness is for at a given call site, you should not be paying for it.
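A minimal sketch of what an allocated budget can look like in code; the surface names and temperature values are illustrative, not prescriptive.

```python
# A determinism budget: temperature allocated per call surface, never set
# globally. Surface names and values are illustrative.
TEMPERATURE_BUDGET: dict[str, float] = {
    "planner.tool_selection": 0.0,   # no creative upside; must be reproducible
    "synthesis.user_response": 0.7,  # variation earns its keep across users
    "brainstorm.alternatives": 1.0,  # diversity is the feature
}

def temperature_for(surface: str) -> float:
    # Unregistered surfaces default to deterministic: if nobody can say what
    # randomness is for at a call site, nobody should be paying for it.
    return TEMPERATURE_BUDGET.get(surface, 0.0)

# e.g. client.chat.completions.create(..., temperature=temperature_for("planner.tool_selection"))
```

Defaulting unregistered surfaces to zero makes the budget auditable: every nonzero temperature in the system is a deliberate line item someone can defend.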

Tokenizer Blindspots That Break Production LLM Systems

· 10 min read
Tian Pan
Software Engineer

Most engineers who build on LLMs eventually learn the rough conversion: one token is about 0.75 English words, so a 4,000-token context window fits roughly 3,000 words. That number is fine for back-of-napkin estimates when your input is casual English prose. It is quietly wrong everywhere else — and "everywhere else" turns out to be most of the interesting production workloads.

Token miscalculations don't fail loudly. They show up as cost overruns that don't match any line item, as context windows that silently truncate the last few paragraphs of a document, or as multilingual pipelines that work fine in English testing and go 4x over budget the first week they hit real traffic. By the time you trace the issue back to tokenization, the damage is done.
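A quick sketch of measuring the ratio on representative traffic instead of trusting the 0.75 rule. The sample strings are illustrative, and the whitespace word count is deliberately crude, since CJK text has no spaces to split on — which is exactly the point.

```python
# Measuring tokens-per-word on representative samples instead of trusting
# the "1 token ~= 0.75 words" rule. Samples are illustrative.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "english prose": "Summarize this ticket and draft a short reply to the customer.",
    "japanese": "このチケットを要約し、顧客への短い返信を作成してください。",
    "source code": "def f(xs):\n    return [x * 2 for x in xs if x % 2 == 0]",
}

for name, text in samples.items():
    tokens = len(enc.encode(text))
    words = max(len(text.split()), 1)  # crude whitespace word count
    print(f"{name}: {tokens} tokens / {words} words = {tokens / words:.2f} tokens per word")
```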

The Build-vs-Buy LLM Infrastructure Decision Most Teams Get Wrong

· 10 min read
Tian Pan
Software Engineer

A FinTech team built their AI chatbot on GPT-4o. Month one: $15K. Month two: $35K. Month three: $60K. Projecting $700K annually, they panicked and decided to self-host. Six months and one burned-out engineer later, they were spending $85K/month across infrastructure and a part-time DevOps engineer, and had eaten three CUDA incidents that took down production. They eventually landed at $8K/month — but not by self-hosting everything. By routing intelligently.

Both decisions were wrong. The real failure was that they never ran the actual math.
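
The actual math looks something like the sketch below. The rates and token volume are placeholder assumptions chosen to echo the story's figures; plug in your own.

```python
# The break-even math the team never ran, with placeholder numbers.
monthly_tokens = 6_000_000_000          # assumed: 6B tokens/month
api_rate = 0.01                         # assumed blended $/1K tokens

api_monthly = monthly_tokens / 1_000 * api_rate            # $60,000/month

selfhost_infra = 60_000                 # GPUs, networking, amortized (assumed)
selfhost_ops = 25_000                   # part-time DevOps, fully loaded (assumed)
selfhost_monthly = selfhost_infra + selfhost_ops           # $85,000/month

# Routing: the cheap 90% of traffic to a small model, the hard 10% to a
# frontier model. The split and per-model rates are illustrative.
cheap_rate, frontier_rate = 0.0005, 0.01
routed_monthly = monthly_tokens / 1_000 * (0.9 * cheap_rate + 0.1 * frontier_rate)

print(f"API: ${api_monthly:,.0f}  Self-host: ${selfhost_monthly:,.0f}  "
      f"Routed: ${routed_monthly:,.0f}")
# -> API: $60,000  Self-host: $85,000  Routed: $8,700
```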