55 posts tagged with "cost-optimization"

When Thinking Models Actually Help: A Production Decision Framework for Inference-Time Compute

· 10 min read
Tian Pan
Software Engineer

There is a study where researchers asked a reasoning model to compare two numbers: 0.9 and 0.11. One model took 42 seconds to answer. The math took a millisecond. The model spent nearly all of those 42 seconds thinking, and thinking badly. It re-examined its answer, doubted itself, reconsidered, and arrived at the correct conclusion it had already reached in its first three tokens.

This is the overthinking problem, and it is not a corner case. It is what happens when you apply inference-time compute indiscriminately to tasks that don't need it.

The emergence of reasoning models — o1, o3, DeepSeek R1, Claude with extended thinking — represents a genuine capability leap for hard problems. It also introduces a new class of production mistakes: deploying expensive, slow deliberation where fast, cheap generation was perfectly adequate. Getting this decision right is increasingly central to building AI systems that actually work.

LLM Routing and Model Cascades: How to Cut AI Costs Without Sacrificing Quality

· 9 min read
Tian Pan
Software Engineer

Most production AI systems fail at cost management the same way: they ship with a single frontier model handling every request, watch their API bill grow linearly with traffic, and then scramble to add caching or reduce context windows as a band-aid. The actual fix — routing different queries to different models based on what each query actually needs — sounds obvious in retrospect but is rarely implemented well.

The numbers make the case plainly. Current frontier models like Claude Opus cost roughly $5 per million input tokens and $25 per million output tokens. Efficient models in the same family cost $1 and $5 respectively — a 5x ratio. Research using RouteLLM shows that with proper routing, you can maintain 95% of frontier model quality while routing 85% of queries to cheaper models, achieving cost reductions of 45–85% depending on your workload. That's not a marginal improvement; it changes the unit economics of deploying AI at scale.
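As a rough check on those figures, the sketch below computes the blended per-request cost when a fraction of traffic goes to the cheaper model. It uses the prices quoted above. The token counts, the function names, and the simplification that the router itself costs nothing and never escalates a request are assumptions for illustration only.

```python
# Prices in USD per million tokens, taken from the paragraph above.
FRONTIER = (5.0, 25.0)   # (input, output)
EFFICIENT = (1.0, 5.0)   # (input, output)


def request_cost(input_tokens, output_tokens, prices):
    """Cost in USD of one request at the given (input, output) prices."""
    in_price, out_price = prices
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


def routing_savings(input_tokens, output_tokens, cheap_fraction):
    """Fractional cost reduction versus sending everything to the frontier model.

    Assumes routing is free and routed requests are never retried on the
    frontier model, which real systems cannot take for granted.
    """
    frontier = request_cost(input_tokens, output_tokens, FRONTIER)
    efficient = request_cost(input_tokens, output_tokens, EFFICIENT)
    blended = (1 - cheap_fraction) * frontier + cheap_fraction * efficient
    return 1 - blended / frontier


# Hypothetical request shape: 1,000 input tokens and 500 output tokens.
savings = routing_savings(1_000, 500, cheap_fraction=0.85)
print(f"{savings:.0%}")
```

With 85% of requests on the efficient model, this toy model gives a reduction of about 68%, which sits inside the 45–85% range cited above. Escalations and router overhead would pull the real number down.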

Prompt Caching: The Optimization That Cuts LLM Costs by 90%

· 7 min read
Tian Pan
Software Engineer

Most teams building on LLMs are overpaying by 60–90%. Not because they're using the wrong model or prompting inefficiently — but because they're reprocessing the same tokens on every single request. Prompt caching fixes this, and it takes about ten minutes to implement. Yet it remains one of the most underutilized optimizations in production LLM systems.

Here's what's happening: every time you send a request to an LLM API, the model runs attention over every token in your prompt. If your system prompt is 10,000 tokens and you're handling 1,000 requests per day, you're paying to process 10 million tokens daily just for the static part of your prompt — context that never changes. Prompt caching stores the intermediate computation (the key-value attention states) so subsequent requests can skip that work entirely.
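The arithmetic in that example can be sketched as follows. The cache hit rate and the discount applied to cached tokens are hypothetical parameters, not any provider's actual pricing, and the sketch ignores any surcharge a provider may apply when writing to the cache.

```python
def daily_static_tokens(system_prompt_tokens, requests_per_day):
    """Static prompt tokens processed per day with no caching."""
    return system_prompt_tokens * requests_per_day


def billed_equivalent_tokens(system_prompt_tokens, requests_per_day,
                             hit_rate, cached_read_multiplier):
    """Static tokens expressed in full-price token equivalents.

    hit_rate: fraction of requests served from the cache (assumed).
    cached_read_multiplier: price of a cached token relative to an
        uncached one, e.g. 0.1 for a 90% discount (assumed).
    """
    total = daily_static_tokens(system_prompt_tokens, requests_per_day)
    return total * ((1 - hit_rate) + hit_rate * cached_read_multiplier)


uncached = daily_static_tokens(10_000, 1_000)
print(uncached)  # 10000000, the 10 million tokens from the example above

# Hypothetical: every request hits the cache, cached tokens cost 10% of normal.
cached = billed_equivalent_tokens(10_000, 1_000, hit_rate=1.0,
                                  cached_read_multiplier=0.1)
```

In practice the hit rate depends on how stable the prompt prefix is and how long the provider keeps cache entries alive, so the multiplier alone overstates the savings.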

LLM Routing: How to Stop Paying Frontier Model Prices for Simple Queries

· 11 min read
Tian Pan
Software Engineer

Most teams reach the same inflection point: LLM API costs are scaling faster than usage, and every query — whether "summarize this sentence" or "audit this 2,000-line codebase for security vulnerabilities" — hits the same expensive model. The fix isn't squeezing prompts. It's routing.

LLM routing means directing each request to the most appropriate model for that specific task. Not the most capable model. The right model — balancing cost, latency, and quality for what the query actually demands. Done well, routing cuts LLM costs by 50–85% with minimal quality degradation. Done poorly, it creates silent quality regressions you won't detect until users churn.
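To make the idea concrete, here is a deliberately naive router, assuming two hypothetical model tiers named `small-model` and `frontier-model`. The keyword list and length threshold are invented for illustration. Production routers typically rely on trained classifiers or quality estimates rather than string matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    model: str   # hypothetical model tier name
    reason: str  # why this tier was chosen, useful for logging


# Illustrative hints only; a real system would learn these signals.
HARD_TASK_HINTS = ("audit", "security", "prove", "refactor", "debug")


def route_query(query: str, max_cheap_chars: int = 500) -> Route:
    """Pick a model tier from crude surface features of the query."""
    text = query.lower()
    if any(hint in text for hint in HARD_TASK_HINTS):
        return Route("frontier-model", "hard-task keyword")
    if len(query) > max_cheap_chars:
        return Route("frontier-model", "long input")
    return Route("small-model", "short, simple query")


print(route_query("summarize this sentence").model)
print(route_query(
    "audit this 2,000-line codebase for security vulnerabilities").model)
```

Logging the `reason` alongside each decision is one way to make the silent regressions mentioned above easier to trace back to a routing rule.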

This post covers the mechanics, the tradeoffs, and what actually breaks in production.

Token Budget Strategies for Production LLM Applications

· 10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the model returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.
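One common mitigation, sketched here under simplifying assumptions, is to pin the system instructions and keep only the most recent turns that fit a fixed budget. The whitespace-based token count is a stand-in for a real tokenizer, and the function name and budget are hypothetical.

```python
def trim_history(system_prompt, turns, budget,
                 count_tokens=lambda s: len(s.split())):
    """Keep the system prompt plus the newest turns that fit in `budget`.

    count_tokens defaults to a whitespace split, which only approximates
    what a model tokenizer would report.
    """
    remaining = budget - count_tokens(system_prompt)
    kept = []
    # Walk backwards from the newest turn and stop at the first that won't fit.
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if cost > remaining:
            break
        kept.append(turn)
        remaining -= cost
    # Restore chronological order, with the instructions pinned at the front.
    return [system_prompt] + kept[::-1]


history = ["a b c", "d e", "f g h i"]
print(trim_history("sys prompt", history, budget=8))
# ['sys prompt', 'd e', 'f g h i']
```

Dropping the oldest turns is the bluntest option. Summarizing them or re-ranking by relevance keeps more information, at the price of extra model calls.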

Token Economics for AI Agents: Cutting Costs Without Cutting Corners

· 10 min read
Tian Pan
Software Engineer

A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.

AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops, consuming roughly 5x more tokens than a direct chat interaction. A ReAct loop running 10 cycles can consume 50x the tokens of a single pass. At frontier model prices, that math becomes a liability fast.
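One way to see how a loop reaches that order of magnitude is to assume that every cycle resends the whole accumulated context and that each cycle adds about as many tokens as a single pass. That growth model is an assumption made for this sketch, not a measurement.

```python
def react_loop_input_tokens(tokens_per_step, cycles):
    """Total input tokens when cycle k resends k steps' worth of context.

    Assumes context grows by `tokens_per_step` each cycle and nothing is
    trimmed, summarized, or cached between cycles.
    """
    return sum(tokens_per_step * k for k in range(1, cycles + 1))


single_pass = 1_000  # hypothetical size of one pass, in tokens
total = react_loop_input_tokens(single_pass, cycles=10)
print(total // single_pass)  # 55
```

Under these assumptions, 10 cycles cost 55 times a single pass, the same ballpark as the 50x figure, because the total grows quadratically with the number of cycles rather than linearly.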

This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.

The Hidden Costs of Context: Managing Token Budgets in Production LLM Systems

· 9 min read
Tian Pan
Software Engineer

Most teams shipping LLM applications for the first time make the same mistake: they treat context windows as free storage. The model supports 128K tokens? Great, pack it full. The model supports 1M tokens? Even better — dump everything in. What follows is a billing shock that arrives about three weeks before the product actually works well.

Context is not free. It's not even cheap. And beyond cost, blindly filling a context window actively makes your model worse. A focused 300-token context frequently outperforms an unfocused 113,000-token context. This is not an edge case — it's a documented failure mode with a name: "lost in the middle." Managing context well is one of the highest-leverage engineering decisions you'll make on an LLM product.