46 posts tagged with "cost-optimization"

Reasoning Model Economics: When Chain-of-Thought Earns Its Cost

· 9 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company added "let's think step by step" to every prompt after reading a few benchmarks. Their response quality went up measurably — and their LLM bill tripled. When they dug into the logs, they found that most of the extra tokens were being spent on tasks like classifying support tickets and summarizing meeting notes, where the additional reasoning added nothing detectable to output quality.

Extended thinking models are a genuine capability leap for hard problems. They're also a reliable cost trap when applied indiscriminately. The difference between a well-tuned reasoning deployment and an expensive one often comes down to one thing: understanding which tasks actually benefit from chain-of-thought, and which tasks are just paying for elaborate narration of obvious steps.
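To make the distinction concrete, here is a minimal sketch of the kind of gate the post argues for: classify the request first, and only pay for extended reasoning where your own evals have shown a lift. The task labels, token budgets, and request shape below are illustrative assumptions, not recommendations from the post.

```python
# Illustrative sketch: enable extended thinking only for task classes that earn it.
# The task labels and budgets are placeholders; derive yours from eval results.

THINKING_BUDGET = {
    "ticket_classification": 0,     # no measurable lift from CoT in (hypothetical) evals
    "meeting_summary": 0,
    "sql_generation": 2_000,        # moderate lift on multi-join queries
    "root_cause_analysis": 8_000,   # large lift on multi-step debugging
}

def build_request(task_type: str, prompt: str) -> dict:
    """Return request params with extended thinking enabled only where it pays."""
    budget = THINKING_BUDGET.get(task_type, 0)
    params = {"messages": [{"role": "user", "content": prompt}], "max_tokens": 1_024}
    if budget > 0:
        # The exact shape of the 'thinking' field varies by provider; treat as pseudocode.
        params["thinking"] = {"type": "enabled", "budget_tokens": budget}
    return params
```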

The Token Economy of Multi-Turn Tool Use: Why Your Agent Costs 5x More Than You Think

· 10 min read
Tian Pan
Software Engineer

Every team that builds an AI agent does the same back-of-the-envelope math: take the expected number of tool calls, multiply by the per-call cost, add a small buffer. That estimate is wrong before it leaves the whiteboard — not by 10% or 20%, but by 5 to 30 times, depending on agent complexity. Forty percent of agentic AI pilots get cancelled before reaching production, and runaway inference costs are the single most common reason.

The problem is structural. Single-call cost estimates assume each inference is independent. In a multi-turn agent loop, they are not. Every tool call grows the context that every subsequent call must pay for. The result is a quadratic cost curve masquerading as a linear one, and engineers don't discover it until the bill arrives.
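You can see the gap with ten lines of arithmetic. The numbers below (a 2,000-token system prompt plus tool schemas, 1,500 tokens of tool output per call, ten calls) are made up for illustration; substitute your own.

```python
# Naive per-call estimate vs. actual cumulative input tokens in an agent loop.
# All numbers are illustrative.

SYSTEM_TOKENS = 2_000       # system prompt + tool schemas, resent on every call
TOOL_RESULT_TOKENS = 1_500  # average tokens each tool result adds to the history
CALLS = 10

naive = CALLS * SYSTEM_TOKENS  # the "each call costs about one prompt" mental model

actual = 0
history = SYSTEM_TOKENS
for _ in range(CALLS):
    actual += history              # every call re-pays the whole accumulated context
    history += TOOL_RESULT_TOKENS  # and the next call's context is bigger

print(f"naive estimate: {naive:,} input tokens")
print(f"actual        : {actual:,} input tokens ({actual / naive:.1f}x)")
```

With these toy numbers the loop pays for 87,500 input tokens against a naive estimate of 20,000, and the gap widens as agents get longer or tool outputs get fatter.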

Knowledge Distillation Without Fine-Tuning: Extracting Frontier Model Capabilities Into Cheaper Inference Paths

· 10 min read
Tian Pan
Software Engineer

A 770-million-parameter model beating a 540-billion-parameter model at its own task sounds impossible. But that is exactly what distilled T5 models achieved against few-shot PaLM — using only 80% of the training examples, a 700x size reduction, and inference that costs a fraction of a cent per call instead of dollars. The trick wasn't a better architecture or a cleverer training recipe. It was generating labeled data from the big model and training the small one on it.

This is knowledge distillation. And you do not need to fine-tune the teacher to make it work.
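The core loop is small enough to sketch. Below is its shape: have the frontier model label unlabeled examples, write out a synthetic training set, and train the small model on it. The function names and file format are placeholders, not the post's actual pipeline.

```python
# Sketch of teacher -> student distillation without fine-tuning the teacher.
# `call_teacher` is a placeholder for your provider's API.
import json

def call_teacher(text: str) -> str:
    """Ask the frontier model for a label (or label plus rationale). Placeholder."""
    raise NotImplementedError

def distill(unlabeled_examples: list[str], out_path: str = "distill_train.jsonl") -> None:
    with open(out_path, "w") as f:
        for text in unlabeled_examples:
            label = call_teacher(text)  # expensive call, paid once at training time
            f.write(json.dumps({"input": text, "target": label}) + "\n")
    # The resulting file trains a small model (e.g. a T5 variant) with any standard
    # supervised fine-tuning stack; the teacher itself is never updated.
```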

The Quality Tax of Over-Specified System Prompts

· 9 min read
Tian Pan
Software Engineer

Most engineering teams discover the same thing on their first billing spike: their system prompt has quietly grown to 4,000 tokens of carefully reasoned instructions, and the model has quietly started ignoring half of them. The fix is rarely to add more instructions. It's almost always to delete them.

The instinct to be exhaustive is understandable. More constraints feel like more control. But there's a measurable quality degradation that kicks in as system prompts bloat — and it compounds with cost in ways that aren't visible until they hurt. Research consistently finds accuracy drops at around 3,000 tokens of input, well before hitting any nominal context limit. The model doesn't refuse to comply; it just starts underperforming in ways that are hard to pin down.

This post is about making that degradation visible, understanding why it happens, and building a trimming discipline that doesn't require hoping nothing breaks.

Model Routing in Production: When the Router Costs More Than It Saves

· 10 min read
Tian Pan
Software Engineer

A team at a mid-size SaaS company deployed a model router six months ago with a clear goal: stop paying frontier-model prices for the 70% of queries that are simple lookups and reformatting tasks. They ran it for three months before someone did the math. Total inference cost had gone up by 12%.

The router itself was cheap — a lightweight classifier adding about 2ms of overhead per request. But the classifier's decision boundary was miscalibrated. It escalated 60% of queries to the expensive model, not 30%. The 40% it handled locally had worse quality, which increased user retry rates, which increased total request volume. The router's telemetry showed "routing working correctly" because it was routing — it just wasn't routing well.
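The break-even check worth running before shipping a router fits in a few lines. The prices, escalation rates, and volume inflation below are illustrative stand-ins, not the team's actual numbers.

```python
# Blended cost per original request once routing and retry-driven volume are in play.
# All prices and rates are illustrative.

FRONTIER = 0.010   # $ per request on the frontier model
CHEAP    = 0.001   # $ per request on the cheap model

def blended_cost(escalation_rate: float, volume_inflation: float) -> float:
    """Cost per original request when cheap-path quality drives extra retries."""
    per_request = escalation_rate * FRONTIER + (1 - escalation_rate) * CHEAP
    return volume_inflation * per_request

baseline = FRONTIER  # everything on the frontier model, no router

for esc in (0.30, 0.60):
    for infl in (1.0, 1.2, 1.5, 1.8):
        c = blended_cost(esc, infl)
        print(f"escalation {esc:.0%}, volume x{infl:.1f}: "
              f"${c:.4f}/req ({c / baseline - 1:+.0%} vs no router)")
```

The table this prints shows how the savings erode: miscalibrated escalation shrinks the margin, and retry-driven volume growth is what finally flips the sign.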

This failure pattern is more common than the success stories suggest. Here's how to build routing that actually saves money.

The AI-Everywhere Antipattern: When Adding LLMs Makes Your Pipeline Worse

· 9 min read
Tian Pan
Software Engineer

There is a type of architecture that emerges at almost every company that ships an AI feature and then keeps shipping: a pipeline where every transformation, every routing decision, every classification, every formatting step passes through an LLM call. It usually starts with a legitimate use case. The LLM actually helps with one hard problem. Then the team, having internalized the pattern, reaches for it again. And again. Until the whole system is an LLM-to-LLM chain where a string of words flows in at one end and a different string of words comes out the other, with twelve API calls in between and no determinism anywhere.

This is the AI-everywhere antipattern, and it is now one of the most reliable ways to build a production system that is slow, expensive, and impossible to debug.

Prompt Cache Break-Even: The Exact Math on When Provider-Side Prefix Caching Actually Pays Off

· 9 min read
Tian Pan
Software Engineer

Prompt caching sounds like a clear win: Anthropic and OpenAI both advertise a 90% discount on cache hits, and the documentation shows impressive cost reduction charts. Teams implement it, monitor the cache hit rate counter going up, and assume they're saving money. Some of them are paying more than if they hadn't cached anything.

The issue is the write premium. Every time you cache a prefix, you pay a surcharge — 1.25× on a 5-minute cache window, 2× for a 1-hour window. If your hit rate is too low, those write premiums accumulate faster than the read discounts recover them. Caching is not free insurance; it's a bet you place against your own traffic patterns.
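The break-even condition is one line of algebra. Assuming the advertised numbers (cache reads at 10% of base input price, writes at 1.25x or 2x) and assuming every miss pays the write premium, the hit rate has to clear a threshold before caching saves anything:

```python
# Break-even hit rate for provider-side prefix caching.
# Assumes cache reads cost 0.10x base input price and every miss pays the write premium.
# Expected prefix cost per request, relative to not caching (1.0x):
#     cost(h) = (1 - h) * write_premium + h * 0.10
# Break-even where cost(h) = 1.0  =>  h = (write_premium - 1) / (write_premium - 0.10)

READ_PRICE = 0.10  # the advertised 90% discount on cache hits

def breakeven_hit_rate(write_premium: float) -> float:
    return (write_premium - 1.0) / (write_premium - READ_PRICE)

print(f"5-minute window (1.25x writes): {breakeven_hit_rate(1.25):.0%} hit rate to break even")
print(f"1-hour window   (2.00x writes): {breakeven_hit_rate(2.00):.0%} hit rate to break even")
```

Under these assumptions that works out to roughly a 22% hit rate for the 5-minute window and about 53% for the 1-hour window; below those thresholds the write premiums outrun the read discounts.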

The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever

· 11 min read
Tian Pan
Software Engineer

Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.

Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
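One hedged starting point is to derive the cap from what your traffic actually generates rather than from the model's ceiling. The percentile and headroom factor below are arbitrary placeholders, not recommendations.

```python
# Derive a per-task max_tokens cap from observed output lengths instead of the model ceiling.
# The percentile and headroom factor are illustrative starting points.
import math

def suggest_max_tokens(observed_output_tokens: list[int],
                       percentile: float = 0.99,
                       headroom: float = 1.2) -> int:
    """Cap at roughly the p99 of historical output length plus a safety margin."""
    ranked = sorted(observed_output_tokens)
    idx = min(len(ranked) - 1, math.ceil(percentile * len(ranked)) - 1)
    return int(ranked[idx] * headroom)

# Example: a summarization endpoint that historically emits 150-400 tokens
history = [180, 220, 350, 160, 410, 240, 300, 190, 260, 380]
print(suggest_max_tokens(history))  # a few hundred, not 4096
```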

Model Routing Is a System Design Problem, Not a Config Option

· 11 min read
Tian Pan
Software Engineer

Most teams choose their LLM the way they choose a database engine: once, during architecture review, and never again. You pick GPT-4o or Claude 3.5 Sonnet, bake it into your config, and ship. The choice feels irreversible because changing it requires a redeployment, coordination across services, and regression testing against whatever your evals look like this week.

That framing is a mistake. Your traffic is not homogeneous. A "summarize this document" request and a "debug this cryptic stack trace" request hitting the same endpoint at the same time have radically different capability requirements — but with static model selection, they're indistinguishable from your infrastructure's perspective. You're either over-provisioning one or under-serving the other, and you're doing it on every single request.

Model routing treats LLM selection as a runtime dispatch decision. Every incoming query gets evaluated on signals that predict the right model for that specific request, and the call is dispatched accordingly. The routing layer doesn't exist in your config file — it runs in your request path.
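In code, the dispatch layer can start as small as one function in the request path. The model names and the signal heuristics below are placeholders for whatever your evals actually justify.

```python
# Minimal runtime dispatch sketch: pick a model per request, not per deployment.
# Tier names and heuristics are illustrative placeholders.

CHEAP_MODEL = "small-fast-model"
FRONTIER_MODEL = "frontier-model"

def route(query: str) -> str:
    """Return the model to dispatch this request to, based on cheap per-request signals."""
    signals = {
        "length": len(query),
        "has_code": "```" in query or "Traceback" in query,
        "asks_reasoning": any(w in query.lower() for w in ("why", "debug", "prove")),
    }
    if signals["has_code"] or signals["asks_reasoning"] or signals["length"] > 2_000:
        return FRONTIER_MODEL
    return CHEAP_MODEL

# The router lives in the request path, so every call re-evaluates the choice:
#   model = route(user_query)
#   response = client.chat(model=model, ...)
```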

Retry Budgets for LLM Agents: Why 20% Per-Step Failure Doubles Your Token Bill

· 8 min read
Tian Pan
Software Engineer

Most teams discover their retry problem when the invoice shows up. The agent "worked"; latency dashboards stayed green; error rates looked fine. Then finance asks why inference spend doubled this month, and someone finally reads the logs. It turns out that 20% of the tool calls in a 3-step agent were quietly retrying, each retry replayed the full prompt history, and the bill had been ramping for weeks.

The math on this is not mysterious, but it is aggressively counterintuitive. A 20% per-step retry rate sounds tolerable — most engineers would glance at it and move on. The actual token cost, once you factor in how modern agent frameworks retry, lands much closer to 2x than 1.2x. And the failure mode is invisible to every metric teams typically watch.

Retry budgets — an old idea from Google SRE work — are the cleanest fix. But the LLM version of the pattern needs tweaking, because tokens don't behave like RPCs.
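As a preview of the shape of that fix, here is a minimal sketch of a retry budget denominated in tokens rather than request counts. The ratio and the accounting are illustrative assumptions, not the post's exact recommendation.

```python
# Sketch: a retry budget denominated in tokens, not request counts.
# Retries are allowed only while retry spend stays under a fixed fraction of
# first-attempt spend. Names and numbers are illustrative.

class TokenRetryBudget:
    def __init__(self, ratio: float = 0.1):
        self.ratio = ratio            # retries may spend at most 10% of first-attempt tokens
        self.first_attempt_tokens = 0
        self.retry_tokens = 0

    def record_first_attempt(self, tokens: int) -> None:
        self.first_attempt_tokens += tokens

    def can_retry(self, estimated_tokens: int) -> bool:
        # In an agent loop a retry replays the full prompt history, so estimate that
        # cost up front and refuse once the budget is exhausted.
        return self.retry_tokens + estimated_tokens <= self.ratio * self.first_attempt_tokens

    def record_retry(self, tokens: int) -> None:
        self.retry_tokens += tokens
```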

The Good Enough Model Selection Trap: Why Your Team Is Overpaying for AI

· 8 min read
Tian Pan
Software Engineer

Most teams ship their first AI feature on the best model available, because that's what the demo ran on and nobody had time to think harder about it. Then a second feature ships on the same model. Then a third. Six months later, every call across every feature routes to the frontier tier — and the bill is five to ten times higher than it needs to be.

The uncomfortable truth is that 40–60% of the requests your production system processes don't require frontier-level reasoning at all. They require competent text processing. Competent text processing is dramatically cheaper to buy.

The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper

· 10 min read
Tian Pan
Software Engineer

In 2021, GPT-3 cost $60 per million tokens. By early 2026, you could buy equivalent performance for $0.06. That is a 1,000x reduction in five years. During the same period, enterprise AI spending grew 320% — from $11.5 billion to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.

This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.