Token Economics for AI Agents: Cutting Costs Without Cutting Corners
A Shopify-scale merchant assistant handling 10 million conversations per day costs $2.1 million per month without optimization — or $450,000 per month with it. That 78% gap isn't from algorithmic breakthroughs; it's from caching, routing, and a few engineering disciplines that most teams skip until the invoice arrives.
AI agents are not chatbots with extra steps. A single user request triggers planning, tool selection, execution, verification, and often retry loops — consuming roughly 5x the tokens of a direct chat interaction. A ReAct loop running 10 cycles can consume 50x the tokens of a single pass. At frontier model prices, that math becomes a liability fast.
This post covers the mechanics of where agent costs come from and the concrete techniques — with numbers — that actually move the needle.
Why Agent Costs Are Different From Chatbot Costs
The output token premium is the first thing to internalize. Across major providers, output tokens are priced at a 3-8x premium over input tokens, because generation is sequential while input processing parallelizes. For reasoning-heavy models, that ratio reaches 8:1. When your agent produces verbose tool call responses, detailed reasoning traces, or long-form summaries, you're paying for every one of those output tokens at premium rates.
Context length compounds the problem. Attention computation scales quadratically with sequence length, so a 128K-token context requires on the order of 256x the attention compute of an 8K context (16x the length, squared) — and even under linear per-token pricing, each call costs 16x more. Agentic systems naturally accumulate context: system prompts, tool definitions, conversation history, retrieved chunks, tool responses. Every turn, the context grows. Most teams notice this in staging — when the agent that cost $0.05 per task in a short test suddenly costs $1.50 per task against a realistic document corpus.
The spread between the cheapest and most expensive model options now spans roughly 200x: Gemini Flash-Lite at roughly $0.075/$0.30 per million input/output tokens versus frontier reasoning models at $15/$60 per million. This spread is an opportunity — but only if you route deliberately.
Prompt Caching: The Easiest Money on the Table
Prompt caching works by reusing the computed key-value attention tensors from prior requests when the new request shares a common prefix with a previous one. Anthropic offers 90% off cached input tokens ($0.30/M vs $3.00/M), Google offers 75% off, and OpenAI applies 50% discounts automatically on eligible requests.
For agentic systems, the implication is significant: structure your prompt so static content comes first. System prompt, tool definitions, few-shot examples, policy documents — all of this should form the stable prefix. Dynamic content (the actual user message, retrieved context for this turn) goes at the end. This isn't aesthetic; it directly determines whether caching fires.
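As a sketch, here is what that ordering looks like as a request builder. The system prompt, tool definition, and policy text are placeholders for your own static content; the `cache_control` marker follows Anthropic's documented prompt-caching pattern (other providers cache eligible prefixes automatically, so only the ordering matters there):

```python
# Placeholder static content -- in production these would be your real
# system prompt, tool schemas, and policy documents.
SYSTEM_PROMPT = "You are a merchant-assistant agent. Follow store policy."
POLICY_DOC = "Refund policy: refunds allowed within 30 days of delivery."
TOOL_DEFS = [{"name": "lookup_order", "description": "Fetch order status",
              "input_schema": {"type": "object"}}]

def build_request(user_message: str, retrieved_context: str) -> dict:
    """Static content first (cacheable prefix), per-turn content last."""
    return {
        "model": "claude-sonnet-4-5",  # placeholder model id
        "max_tokens": 1024,
        "tools": TOOL_DEFS,
        "system": [
            # Stable prefix: byte-identical on every request, so the
            # provider can reuse the computed KV cache for this span.
            {"type": "text",
             "text": SYSTEM_PROMPT + "\n\n" + POLICY_DOC,
             "cache_control": {"type": "ephemeral"}},
        ],
        "messages": [
            # Dynamic content goes last so it never invalidates the prefix.
            {"role": "user",
             "content": f"{retrieved_context}\n\n{user_message}"},
        ],
    }
```

The key property is that nothing per-turn appears before the cache marker; one interpolated timestamp or user ID at the top of the system prompt is enough to break the prefix match on every request.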
Claude Code achieves a 92% cache hit rate in practice, delivering an 81% reduction in processing costs. A fixed 10,000-token system prompt effectively costs nothing after the first request. A customer support application that moved its product catalog from dynamic insertion to a cached prefix cut its API bill by $12,000/month without changing its output quality.
Beyond cost, caching cuts latency. Average response latency drops from 800ms to 350ms when caching is active on long prefixes, because the model skips recomputing the attention matrices for the stable portion.
The engineering overhead is minimal: cache window TTLs range from 5 minutes (Anthropic) to about an hour (OpenAI). For agents serving repeated user sessions, a warm cache is almost always available. For batch pipelines, structure jobs so requests share prefixes within the batch.
Model Routing and Cascading: Matching Cost to Complexity
Not every query needs a frontier model. The question is how to determine which ones do — and the answer depends on three axes: reasoning complexity, quality sensitivity, and context length.
In typical production agentic workloads, the distribution looks roughly like this:
- 60% of tasks are straightforward: extraction, classification, formatting, templated responses. These run fine on sub-$1/M models.
- 25% require moderate reasoning: multi-hop Q&A, code generation, structured analysis. Mid-tier models ($0.80-$4/M) handle these well.
- 12% involve genuine complexity: ambiguous instructions, long-horizon planning, synthesis across heterogeneous sources. Premium models earn their cost here.
- 3% need frontier reasoning: novel problems, high-stakes decisions, emergent behavior.
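A first pass at routing over buckets like these can be purely rule-based. The keywords, length cutoff, and model names below are illustrative stand-ins, not a tuned classifier:

```python
import re

def route(task: str) -> str:
    """Toy rule-based router: cheap surface signals decide the tier."""
    text = task.lower()
    # Short, mechanical tasks -> cheapest tier (the ~60% bucket).
    if len(task) < 200 and re.search(r"\b(extract|classify|format|summarize)\b", text):
        return "cheap-model"
    # Open-ended reasoning or planning language -> premium tier.
    if re.search(r"\b(plan|design|debug|why|trade-?off)\b", text):
        return "premium-model"
    # Everything else defaults to the mid tier.
    return "mid-model"
```

This is crude by design: it costs nothing per request, and its misroutes tell you where a learned router would earn its keep.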
Well-implemented routing systems achieve 30-60% cost reduction in typical agentic deployments, with best-in-class implementations reaching 87%.
The practical pattern for agent systems is to separate orchestration from execution. Use an expensive model for the planning layer — it's reading relatively short task descriptions and making routing decisions, so its token consumption is bounded. Use cheap models for execution steps: summarization, extraction, format conversion, retrieval ranking. Claude Haiku executing the tool calls while Sonnet or Opus plans the overall strategy is a common and effective split.
Model cascades take this further: start every request at the cheapest tier, score the response against criteria (confidence, format validity, factual grounding if you have a retrieval source), and escalate if the score is below threshold. The added latency of a cascade is usually worth it — most requests complete at the first tier, and escalation only fires for the difficult fraction.
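The cascade loop itself is small. In this sketch, `call_model` stands in for your API client, `score_response` for your quality check, and the tier names and threshold are illustrative:

```python
TIERS = ["cheap-model", "mid-model", "frontier-model"]  # cheapest first

def score_response(response: str) -> float:
    """Placeholder quality check. Real versions validate format,
    confidence, or factual grounding against a retrieval source."""
    return 1.0 if response.strip() else 0.0

def cascade(prompt: str, call_model, threshold: float = 0.8) -> tuple[str, str]:
    """Try tiers cheapest-first; escalate while the score misses threshold."""
    response = ""
    for tier in TIERS:
        response = call_model(tier, prompt)
        if score_response(response) >= threshold:
            return tier, response
    # Exhausted the cascade: return the top tier's answer regardless.
    return TIERS[-1], response
```

Because most requests exit at the first tier, the average cost tracks the cheap model's price while the worst case still gets frontier quality.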
Confidence-based routing requires some calibration. If you're building it yourself, logprob entropy is a usable signal for open-source models. For proprietary APIs, you need a proxy evaluator (typically a smaller, fast model that checks whether the first response meets your quality bar). The added cost of the proxy is usually less than 5% of the savings from routing.
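For the logprob route, one workable signal is the average per-token entropy over the top-k logprobs, which most APIs that expose logprobs return as a per-token mapping of candidate token to log-probability. A sketch, with the 1.0-nat threshold as an illustrative starting point you would calibrate against your own data:

```python
import math

def mean_token_entropy(top_logprobs: list[dict[str, float]]) -> float:
    """Average entropy per generated token, computed over the top-k
    candidates observed at each step (renormalized within the top-k)."""
    entropies = []
    for candidates in top_logprobs:
        probs = [math.exp(lp) for lp in candidates.values()]
        total = sum(probs)
        entropies.append(-sum((p / total) * math.log(p / total) for p in probs))
    return sum(entropies) / len(entropies)

def should_escalate(top_logprobs, threshold: float = 1.0) -> bool:
    """High average entropy means a diffuse next-token distribution,
    i.e. low confidence -> route the request to a stronger model."""
    return mean_token_entropy(top_logprobs) > threshold
```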
Context Compression: Shrinking What Goes In
Every token in context has a direct cost. Context compression is the practice of stripping that context to the minimum necessary for the task.
Rolling summaries are the baseline technique. Instead of passing the full conversation history, summarize every N turns (typically 5-10). The summary passes forward; the full transcript is archived. This keeps the context roughly constant in size — one summary plus the recent window — instead of growing linearly with turn count. The tradeoff is that fine-grained detail from earlier turns is unavailable — acceptable for most use cases, not acceptable for code review agents that need to remember every decision.
Tool output masking is frequently overlooked. When an agent calls a web scraper, an API, or a database query, the raw response often contains headers, metadata, and fields irrelevant to the current task. Stripping these before inserting them into context can reduce tool output tokens by 60-80%. Write post-processors for each tool type that extract only the fields the model actually needs.
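The post-processor can be as simple as a per-tool field whitelist. The tool names and field lists here are illustrative; use whatever your agent actually reads from each response:

```python
# Illustrative whitelists: only these fields survive into context.
TOOL_FIELD_MASKS = {
    "web_search": ["title", "url", "snippet"],
    "order_lookup": ["order_id", "status", "eta"],
}

def mask_tool_output(tool_name: str, raw: dict) -> dict:
    """Strip headers, trace IDs, and other context-irrelevant fields
    from a tool response before it is inserted into the prompt."""
    keep = TOOL_FIELD_MASKS.get(tool_name)
    if keep is None:
        return raw  # no mask registered for this tool: pass through
    return {k: raw[k] for k in keep if k in raw}
```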
Learned compression tools like LLMLingua compress prompts using a smaller model to identify and remove low-information tokens. Customer service prompts reduced from 800 tokens to 40 tokens (a 95% reduction) have been reported, with acceptable accuracy preservation. The catch: compression requires its own LLM call, adding latency and token cost. The math only works when the compressed prompt is reused across many requests, or when the cost of the compressor is much lower than the cost of the main model.
Relevance filtering for retrieval is straightforward: don't pass all retrieved chunks, only those above a cosine similarity threshold. Raising this threshold from 0.7 to 0.8 often cuts retrieved tokens by 40-60% while reducing noise that would otherwise dilute the model's attention.
Semantic Caching: Eliminating Calls Entirely
Semantic caching stores LLM responses indexed by embedding of the input. When a new query arrives, its embedding is compared to cached queries — if similarity exceeds a threshold, the cached response is returned without an API call.
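A toy version of that lookup, using an exact scan and a TTL. Production systems would back this with a vector index (Redis, FAISS, and similar), and the 0.92 threshold is an illustrative starting point:

```python
import math
import time

class SemanticCache:
    """Exact-scan semantic cache sketch: embeddings in, responses out."""

    def __init__(self, threshold: float = 0.92, ttl_s: float = 3600.0):
        self.threshold = threshold
        self.ttl_s = ttl_s
        self.entries = []  # list of (embedding, response, stored_at)

    @staticmethod
    def _cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))

    def get(self, query_emb):
        now = time.time()
        # Evict stale entries, then scan for a sufficiently close match.
        self.entries = [e for e in self.entries if now - e[2] < self.ttl_s]
        for emb, response, _ in self.entries:
            if self._cos(query_emb, emb) >= self.threshold:
                return response  # cache hit: no API call made
        return None

    def put(self, query_emb, response):
        self.entries.append((query_emb, response, time.time()))
```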
Roughly 31% of LLM queries across typical production workloads exhibit semantic similarity high enough to benefit from this. Cache hits return in milliseconds versus seconds, and cost exactly zero in API fees. For support chatbots, FAQ systems, and applications with clustered query distributions, semantic caching can eliminate 20-40% of API calls outright.
The tradeoff is freshness sensitivity. For applications where answers change frequently, staleness is a risk. Configure TTLs based on how quickly your content domain changes. For static knowledge bases, long TTLs are safe. For live data queries, disable semantic caching entirely for those query types.
Hard Limits Are Not Optional
The cheapest optimization is preventing runaway loops. A documented production incident: an agent spent a weekend making 847,000 API calls against a broken data source, accumulating $3,847 in charges before account suspension stopped it. Another: an agent called a scraping tool 400 times in five minutes because the tool returned "more results may be available" — which the agent interpreted as an invitation to keep fetching.
Every agent needs three hard limits set before deployment:
- Maximum iterations per task. Set this at 2-3x the expected average. Most agent frameworks (LangGraph, AutoGen, CrewAI) expose this as a first-class config.
- Maximum token spend per task. Set at 3x the P95 observed spend from staging. Implement this as middleware that checks accumulated cost before each model call.
- Maximum wall-clock time. Catches infinite loops that stay under token budgets by making fast, cheap calls repeatedly.
Ambiguous tool feedback is the most common cause of runaway loops. If a tool can return a signal that could be interpreted as "keep going," the agent will keep going. Be explicit in tool output schemas: include an is_complete boolean or next_action_required field rather than relying on the model to infer termination.
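The three limits fit naturally into one guard object checked before every model call. The limit values below are placeholders; derive yours from staging P95s as described above:

```python
import time

class BudgetExceeded(Exception):
    """Raised when any hard limit would be breached by another call."""

class RunGuard:
    """Pre-call middleware enforcing iteration, spend, and time caps."""

    def __init__(self, max_iterations=30, max_cost_usd=0.50, max_seconds=120):
        self.max_iterations = max_iterations
        self.max_cost_usd = max_cost_usd
        self.max_seconds = max_seconds
        self.iterations = 0
        self.cost_usd = 0.0
        self.started = time.monotonic()

    def check_before_call(self):
        """Call before each model invocation; raises instead of spending."""
        if self.iterations >= self.max_iterations:
            raise BudgetExceeded(f"iteration cap {self.max_iterations} hit")
        if self.cost_usd >= self.max_cost_usd:
            raise BudgetExceeded(f"spend cap ${self.max_cost_usd} hit")
        if time.monotonic() - self.started > self.max_seconds:
            raise BudgetExceeded(f"wall-clock cap {self.max_seconds}s hit")

    def record(self, call_cost_usd: float):
        """Call after each completed model invocation."""
        self.iterations += 1
        self.cost_usd += call_cost_usd
```

The wall-clock check matters even with the other two in place: a loop of fast, cheap calls stays under both the iteration and spend caps far longer than you would expect.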
FinOps: Instrumentation You Need Before You Ship
Cost visibility is what closes the loop. Without it, optimizations are guesses and anomalies are surprises.
The minimum viable instrumentation layer tracks:
- Cost per trace. Every agent run should emit its total cost (input tokens × price + output tokens × price, broken down by model tier) to your observability system.
- Cache hit rate. If this drops below your baseline, something changed in your prompt structure or request patterns.
- Output token ratio. Output tokens / (input + output) tokens. A rising ratio usually means your agent is being too verbose — often fixable by adding "be concise" to the system prompt (which routinely reduces output tokens by 15-25%).
- Steps per completion. Rising step counts indicate either harder tasks or the agent getting stuck. Either warrants investigation.
Tools like Langfuse, Helicone, and Portkey provide per-request cost tracking and budget controls at the API gateway level. For anomaly detection, set spend alerts at 2σ deviation from your rolling baseline — most cost incidents are detectable within minutes if you're watching this signal.
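The cost-per-trace rollup is a few lines once you key prices by model tier. The per-million rates below are illustrative; read them from your provider's current price sheet:

```python
# Illustrative (input $/M, output $/M) rates keyed by tier name.
PRICE_PER_M = {
    "cheap": (0.075, 0.30),
    "mid": (3.00, 15.00),
    "frontier": (15.00, 60.00),
}

def trace_cost(calls: list[dict]) -> float:
    """calls: [{'model': 'mid', 'input_tokens': 1200, 'output_tokens': 300}, ...]
    Returns total USD for the trace; emit this (plus a per-tier
    breakdown) to your observability system on every agent run."""
    total = 0.0
    for c in calls:
        in_price, out_price = PRICE_PER_M[c["model"]]
        total += c["input_tokens"] / 1e6 * in_price
        total += c["output_tokens"] / 1e6 * out_price
    return round(total, 6)
```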
The cost variance between an unoptimized and well-optimized deployment of the same agent can be 30-200x. That is the highest-ROI engineering work available to most AI teams right now.
The Practical Priority Order
If you're starting from zero, apply techniques in this order, stopping when you've hit your cost target:
- Prompt caching. Zero code changes if your framework supports it. Move static content to the prefix. Immediate impact.
- Hard limits. Prevents the tail risk of a runaway incident that makes everything else irrelevant.
- Output token control. Add "respond concisely" to your system prompt. Instrument output token ratio and watch it drop.
- Tool output masking. Write post-processors for your highest-volume tools.
- Model routing. Classify tasks by complexity and route to appropriate tiers. Start with a simple rule-based classifier; upgrade to a learned router if the volume justifies it.
- Context compression. Implement rolling summaries for long-running sessions.
- Semantic caching. Add if your query distribution has enough clustering.
The gap between what agentic systems cost by default and what they cost when engineered well is not marginal. It's the difference between a project that reaches production and one that gets canceled at the budget review.
