3 posts tagged with "token-optimization"

Token Budget as Architecture Constraint: Designing Agents That Work Under Hard Ceilings

· 8 min read
Tian Pan
Software Engineer

Your agent works flawlessly in development. It reasons through multi-step tasks, calls tools confidently, and produces polished output. Then you set a cost cap of $0.50 per request, and it falls apart. Not gracefully — catastrophically. It truncates its own reasoning mid-thought, forgets tool results from three steps ago, and confidently delivers wrong answers built on context it silently lost.

This is the gap between abundance-designed agents and production-constrained ones. Most agent architectures are prototyped with unlimited token budgets — long system prompts, verbose tool schemas, full document retrieval, uncompacted conversation history. When you introduce hard ceilings (cost caps, context limits, latency requirements), these agents don't degrade gracefully. They break in ways that are difficult to detect and expensive to debug.
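Graceful degradation under a hard ceiling can be as simple as evicting whole messages instead of truncating mid-thought. A minimal sketch (names and the 4-chars-per-token heuristic are illustrative assumptions, not the post's implementation — a real system would use the model's tokenizer):

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic: ~4 characters per token. Swap in a real
    # tokenizer for production accounting.
    return max(1, len(text) // 4)

def fit_to_budget(system_prompt: str, messages: list[str], budget_tokens: int) -> list[str]:
    """Drop whole oldest messages (never partial ones) until the
    conversation fits, always keeping the system prompt and latest turn."""
    fixed = estimate_tokens(system_prompt)
    kept = list(messages)
    while kept and fixed + sum(estimate_tokens(m) for m in kept) > budget_tokens:
        if len(kept) == 1:
            break  # keep the latest turn even if it alone exceeds budget
        kept.pop(0)  # evict the oldest message whole
    return kept

history = ["step 1 result " * 50, "step 2 result " * 50, "latest user turn"]
trimmed = fit_to_budget("You are a helpful agent.", history, budget_tokens=200)
```

The key property is that context loss is explicit and whole-message: the agent can see that a step was evicted rather than silently reasoning on a truncated fragment.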

The Planning Tax: Why Your Agent Spends More Tokens Thinking Than Doing

· 10 min read
Tian Pan
Software Engineer

Your agent just spent $6 solving a task that a direct API call could have handled for $0.12. If you've built agentic systems in production, this ratio probably doesn't surprise you. What might surprise you is where those tokens went: not into tool calls, not into generating the final answer, but into the agent reasoning about what to do next. Decomposing the task. Reflecting on intermediate results. Re-planning when an observation didn't match expectations. This is the planning tax — the token overhead your agent pays to think before it acts — and for most agentic architectures, it consumes 40–70% of the total token budget before a single useful action fires.

The planning tax isn't a bug. Reasoning is what separates agents from simple prompt-response systems. But when the cost of deciding what to do exceeds the cost of actually doing it, you have an engineering problem that no amount of cheaper inference will solve. Per-token prices have dropped roughly 1,000x since late 2022, yet total agent spending keeps climbing — a textbook Jevons paradox where cheaper tokens just invite more token consumption.
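Measuring the planning tax starts with tagging each step in an agent trace as reasoning or action and computing the share of tokens spent before anything fires. A back-of-the-envelope sketch (the trace format and numbers are illustrative assumptions, not from the posts):

```python
def planning_tax(trace: list[tuple[str, int]]) -> float:
    """Fraction of total tokens spent on 'plan' steps
    (decomposition, reflection, re-planning) vs. 'act' steps."""
    plan = sum(tokens for kind, tokens in trace if kind == "plan")
    total = sum(tokens for _, tokens in trace)
    return plan / total if total else 0.0

trace = [
    ("plan", 900),   # task decomposition
    ("act", 120),    # tool call
    ("plan", 600),   # reflect on the observation
    ("plan", 450),   # re-plan after a mismatch
    ("act", 180),    # generate the final answer
]
share = planning_tax(trace)  # 1950 / 2250 ≈ 0.87
```

Even this toy trace lands well above the 40–70% range the post cites, which is exactly the signal that the planning loop, not the tool calls, is where optimization effort pays off.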

The Hidden Token Tax: Where 30-60% of Your Context Window Disappears Before Users Say a Word

· 8 min read
Tian Pan
Software Engineer

You're paying for a 200K-token context window. Your users get maybe 80K of it. The rest vanishes before their first message arrives — consumed by system prompts, tool definitions, safety preambles, and chat history padding. This is the hidden token tax, and most teams don't realize they're paying it until they hit context limits in production.

The gap between advertised context window and usable context window is one of the most expensive blind spots in production LLM systems. It compounds across multi-turn conversations, inflates latency through attention overhead, and silently degrades output quality as useful information gets pushed into the "lost in the middle" zone where models stop paying attention.
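The tax is easy to see once you subtract the fixed overheads from the advertised window. A quick accounting sketch (the category names and numbers are illustrative assumptions chosen to match the 200K-advertised / ~80K-usable example above, not measurements):

```python
ADVERTISED_WINDOW = 200_000  # tokens, as advertised

# Illustrative per-request overheads consumed before the user's message
overheads = {
    "system_prompt": 8_000,
    "tool_definitions": 25_000,
    "safety_preamble": 3_000,
    "history_padding": 84_000,  # accumulated multi-turn chat history
}

usable = ADVERTISED_WINDOW - sum(overheads.values())
tax_pct = 100 * (ADVERTISED_WINDOW - usable) / ADVERTISED_WINDOW
# usable = 80_000 tokens; tax_pct = 60.0
```

Running this audit against real request payloads, rather than guessing, is usually the first step to reclaiming the window: tool definitions and accumulated history tend to dominate, and both are compressible.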