The Inference Cost Paradox: Why Your AI Bill Goes Up as Models Get Cheaper
In 2021, GPT-3 cost $60 per million tokens; by 2024, models matching its capability cost $0.06. That is a 1,000x reduction in three years. During the same period, enterprise AI spending grew 320%, to $37 billion. The organizations spending the most on AI are overwhelmingly the ones that benefited most from falling prices.
This is not a contradiction. It is the Jevons Paradox, and it is running your AI budget.
What Victorian Coal Engineers Already Knew
William Stanley Jevons described the mechanism in 1865. As James Watt's steam engine made coal combustion dramatically more efficient, he observed that total UK coal consumption did not fall — it surged. More efficient engines made steam power viable for factories, mines, railways, and ships that previously could not afford it. Falling per-unit cost of mechanical work expanded the set of economically viable use cases faster than efficiency reduced existing consumption. The rebound exceeded 100%.
The same loop is running in LLM inference. Cheaper tokens do not replace existing AI workloads at lower cost. They unlock new workloads that were previously too expensive to build. Those workloads — agent loops, multi-step reasoning chains, multi-model pipelines — consume dramatically more tokens per task than the simpler systems they extend or replace.
When GPT-4 cost $30 per million input tokens, every API call was a budgeting decision. When a capable model costs $0.15 per million tokens, the constraint disappears and architectural patterns change. The question stops being "can we afford to call the API?" and starts being "should we run three parallel agents or five?"
The Token Multiplication Mechanisms
The price drop has changed how AI systems are built, and the new architectures are structurally more expensive per user intent.
Reasoning chains. Models like o3, DeepSeek R1, and extended thinking modes generate thousands of internal reasoning tokens before producing a final answer. An answer that a 2023-era model delivered directly in 7 tokens now costs 255–603 tokens when reasoning mode is active. These are billed as output tokens at premium rates. Teams that default to reasoning models for every task may be paying 10–86x more than necessary.
Agent loops. A ReAct-style agent running a planning-execute-verify cycle generates tokens at every step: decompose the task (tokens), select tools (tokens), format each tool call (tokens), parse each result (tokens), reflect and re-plan if needed (tokens), generate a final response (tokens). A 10-turn agent loop consumes roughly 50x the tokens of a single-pass response to the same question. Frameworks that automatically retry on tool failure can multiply this further.
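The multiplication is easy to see in back-of-the-envelope form. All the per-step token counts below are hypothetical, chosen only to show how a 10-turn loop lands near the 50x figure; real numbers vary widely by task and framework.

```python
# Back-of-the-envelope token accounting for a plan-execute-verify loop.
# All per-step counts are hypothetical; real numbers vary widely by task.

SINGLE_PASS = 900  # prompt + completion for one direct answer

STEP_TOKENS = {
    "decompose_task": 1200,
    "select_tools": 600,
    "format_tool_call": 500,
    "parse_result": 900,
    "reflect_and_replan": 1200,
}

def agent_loop_tokens(turns: int, final_response: int = 900) -> int:
    """Total tokens consumed by an agent loop of `turns` iterations."""
    per_turn = sum(STEP_TOKENS.values())  # 4,400 tokens per turn here
    return turns * per_turn + final_response

total = agent_loop_tokens(turns=10)
print(f"{total:,} tokens, about {total / SINGLE_PASS:.0f}x a single-pass answer")
```

Automatic retries multiply this further: a framework that retries each failed tool call once can push the per-turn figure well past this sketch.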
Context window saturation. As windows expanded from 4K to 128K to 1M tokens, applications started filling them. RAG pipelines inject 20K–100K tokens of retrieved context per query. Code agents load entire repositories. Customer service agents carry full session history across dozens of turns. The marginal cost of including one more document dropped to nearly nothing, so teams include everything.
Multi-agent parallelism. Orchestrator-worker patterns spawn specialized sub-agents simultaneously. Each worker receives its own context window. Coordination messages pass between agents. Final synthesis requires additional LLM calls. The same work that was a single-model query in 2023 is a multi-model orchestration in 2026, with total token consumption multiplying accordingly.
The OpenRouter State of AI 2025 report, analyzing over 100 trillion tokens of real-world usage, documented this shift directly. Average prompt tokens per request grew 4x over 13 months. Average completion tokens nearly tripled. Reasoning models exceeded 50% of total token consumption by mid-2025. The per-token unit cost fell; the tokens-per-task multiplier rose faster.
How the Paradox Compounds Across an Organization
At the individual-team level, the pattern is consistent: successful optimization saves money, the savings justify new features, and the new features ship richer architectures that consume more tokens than the optimization saved. SaaStr documented one example explicitly.
A team built two AI tools. The first — an AI valuation calculator — cost $50 total in 30 days. The second — an AI pitch deck analyzer — cost $80 and climbing.
The team that built the first tool felt smart about cost efficiency. They used those savings to justify building the second. Their total spend went up. This is not failure. This is correct product behavior. The Jevons Paradox does not describe waste — it describes efficiency gains unlocking demand that was previously suppressed by price.
The organizational problem is that most AI budgets are built on the wrong model. CFOs who see "per-token cost dropped 10x" and conclude "we should be spending 10x less" are modeling a world where token consumption is fixed. In practice, the teams spending most aggressively on AI optimization are the ones whose consumption grows fastest, because they are the teams building the most.
At the macro level: the inference market reached $106 billion by 2025, with a $255 billion forecast by 2030. Gartner predicts over 90% cost reduction in frontier model inference by 2030 — and simultaneously predicts that 40% of AI agent projects will be cancelled by 2027 due to cost overruns. Both forecasts are coherent. Cheaper inference enables more ambitious projects; more ambitious projects routinely underestimate their own consumption.
The Failure Patterns Teams Actually Hit
Token tsunamis from invisible loops. The most common post-launch disaster is agent retry logic that multiplies consumption 3–7x beyond what was estimated in development. Error retry loops with no backoff, redundant context reloads, and parallel tool calls that could be sequential combine invisibly. One documented case: an audit found redundant account history calls (+30% tokens) and retry loops (+40% tokens) adding 70% overhead on every agent run. Loop pruning took two weeks and cut monthly spend to $4.5K.
The development cost fallacy. Teams test at 100 queries per day in development, where even an expensive architecture at $0.13 per request (13,500 tokens) costs about $13 per day. The same architecture at 10,000 users doing 10 requests per day costs $780K–$1.2M with retry overhead. The unit economics felt cheap; the product economics were not.
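A sketch of the scale-up arithmetic, using the per-request figure above; the retry multiplier is an assumed value, not a measured one.

```python
# Dev-to-production scale-up for the per-request figure above.
# The retry multiplier is an assumption, not a measured value.

COST_PER_REQUEST = 0.13  # dollars, at roughly 13,500 tokens per request

dev_daily = 100 * COST_PER_REQUEST            # 100 test queries per day
prod_daily = 10_000 * 10 * COST_PER_REQUEST   # 10,000 users x 10 requests/day

RETRY_MULTIPLIER = 1.4  # hypothetical overhead from retries and reloads
prod_daily_with_retries = prod_daily * RETRY_MULTIPLIER

print(f"dev:  ${dev_daily:,.0f}/day")
print(f"prod: ${prod_daily_with_retries:,.0f}/day, "
      f"{prod_daily_with_retries / dev_daily:,.0f}x the dev bill")
```

The multiplier between the two daily bills is pure volume; nothing about the architecture changed, which is exactly why the development numbers feel so reassuring.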
Reasoning model defaults. Once teams experience the quality improvement from o3 or extended thinking mode on hard tasks, they apply it uniformly. Roughly 83% of tasks that were previously routed to standard models do not benefit measurably from full reasoning chains. The developer benchmark that justified the switch was not representative of the production traffic distribution.
Context accumulation without bounds. Naive message history appending — include every previous turn in every new prompt — creates O(n²) token growth with conversation length. A 10-turn conversation costs 5x more than 10 independent queries. Most frameworks default to full history inclusion. The fix is rolling summarization, which captures continuity at 60–80% context reduction with minimal quality impact.
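The quadratic growth is simple to verify. Assuming a hypothetical 500 tokens of new content per turn:

```python
# O(n^2) prompt growth from naive full-history inclusion.
# Assumes a hypothetical 500 tokens of new content per turn.

PER_TURN = 500

def naive_history_tokens(turns: int) -> int:
    """Prompt tokens when every turn re-sends all previous turns."""
    return sum(PER_TURN * n for n in range(1, turns + 1))

def independent_tokens(turns: int) -> int:
    """Prompt tokens if each turn were a standalone query."""
    return PER_TURN * turns

ratio = naive_history_tokens(10) / independent_tokens(10)
print(f"10 turns with full history: {ratio:.1f}x the tokens of 10 standalone queries")
```

A rolling-summary variant would replace the growing history with a bounded summary, keeping per-turn prompt size roughly constant instead of linear in conversation length.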
Flat-rate subscription risk. Providers offer flat-rate pricing to accelerate adoption. This is structurally unsustainable when consumption is elastic. Anthropic's Claude Code Max Unlimited at $200/month failed when power users consumed token volumes that would cost thousands at standard rates. Teams that built products on flat-rate tiers, then saw margins compress when providers repriced, had no architectural recourse.
The Budget Architecture That Survives This
The correct model treats AI inference as infrastructure with elastic demand — not software with a fixed runtime cost. The budget patterns that hold up at scale follow a consistent structure.
Per-feature token budgets. Allocate monthly token budgets to each product feature based on expected volume and acceptable cost-per-interaction ceiling. Budget exhaustion triggers graceful degradation rather than open-ended spending. Research on token-budget-aware reasoning (TALE) shows that giving models explicit token budgets reduces consumption 59–68% with less than 5% accuracy loss — models that know they have a budget become more efficient.
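A minimal sketch of the pattern, with hypothetical limits, thresholds, and mode names:

```python
# Minimal per-feature token budget with graceful degradation.
# The limit, threshold, and mode names are hypothetical.

from dataclasses import dataclass

@dataclass
class FeatureBudget:
    monthly_limit: int       # tokens allocated to this feature per month
    used: int = 0
    degrade_at: float = 0.8  # switch to a cheaper path at 80% of budget

    def record(self, tokens: int) -> str:
        """Record usage and return the serving mode the feature should use."""
        self.used += tokens
        if self.used >= self.monthly_limit:
            return "exhausted"   # e.g. cached or templated responses only
        if self.used >= self.degrade_at * self.monthly_limit:
            return "degraded"    # e.g. route to a smaller model
        return "normal"

budget = FeatureBudget(monthly_limit=1_000_000)
print(budget.record(500_000))  # still under 80% of budget: "normal"
print(budget.record(350_000))  # 85% consumed: "degraded"
print(budget.record(200_000))  # over the limit: "exhausted"
```

The key design choice is that exhaustion changes serving behavior rather than silently continuing to spend.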
Tiered model routing. Route by task complexity rather than defaulting to the most capable model for everything. A typical production distribution — 60% simple tasks, 25% moderate, 12% complex, 3% frontier — yields 80% cost savings versus routing all traffic through frontier models. The routing logic does not need to be sophisticated; a lightweight classifier or rule-based filter is sufficient.
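A sketch of the routing shape; the tier prices and the complexity heuristic are placeholder assumptions, not any provider's published rates.

```python
# Tiered routing sketch. Prices ($ per million input tokens) and the
# complexity heuristic are hypothetical placeholders.

TIER_PRICE = {"simple": 0.15, "moderate": 0.60, "complex": 3.00, "frontier": 15.00}
TRAFFIC_MIX = {"simple": 0.60, "moderate": 0.25, "complex": 0.12, "frontier": 0.03}

def route(task: str) -> str:
    """Toy heuristic; a real system might use a small classifier instead."""
    if len(task) < 200 and "?" in task:
        return "simple"
    if any(k in task.lower() for k in ("prove", "architecture", "multi-step")):
        return "frontier"
    return "moderate"

blended = sum(TIER_PRICE[tier] * share for tier, share in TRAFFIC_MIX.items())
savings = 1 - blended / TIER_PRICE["frontier"]
print(f"blended cost ${blended:.2f}/M tokens, {savings:.0%} vs all-frontier")
```

With these placeholder prices the blended savings exceed the 80% figure; the exact number depends entirely on the price spread between tiers and the real traffic mix.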
Prompt caching as foundational infrastructure. For agent workflows that re-send large system prompts and tool manifests on every call, prompt caching is the highest-ROI single optimization available. Anthropic's prompt cache discount is 90% for cache hits (5-minute TTL); OpenAI offers 50%; Google offers 75%. At 90% cache hit rate with 90% discount, effective input token costs fall 81%. For a coding agent re-sending a 10K-token system prompt on every step, this single change exceeds the cost reduction of any model switch.
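The caching arithmetic is worth writing down explicitly:

```python
# Effective input-cost reduction from prompt caching:
# effective = hit_rate * (1 - discount) + (1 - hit_rate)

def cache_cost_reduction(hit_rate: float, discount: float) -> float:
    """Fractional reduction in input-token cost versus no caching."""
    effective = hit_rate * (1 - discount) + (1 - hit_rate)
    return 1 - effective

# 90% hit rate at a 90% cache-read discount -> 81% cheaper input tokens
print(f"{cache_cost_reduction(0.90, 0.90):.0%}")
# 90% hit rate at a 50% discount -> 45%
print(f"{cache_cost_reduction(0.90, 0.50):.0%}")
```

One caveat: providers typically charge a premium on cache writes, so realized savings depend on the read/write ratio as well as the hit rate.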
Cost circuit breakers. Hard per-session cost limits prevent runaway agent loops from becoming runaway bills. Kill any session where cumulative cost exceeds a threshold within a time window. This is non-negotiable for production agentic systems. Without it, a single stuck agent can cost more than your entire monthly budget in an afternoon.
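A minimal sketch of such a breaker; the cost cap and window size are hypothetical values a team would tune per product.

```python
# Minimal per-session cost circuit breaker. Cap and window are hypothetical.

import time

class CostCircuitBreaker:
    def __init__(self, max_cost: float = 5.0, window_seconds: float = 600.0):
        self.max_cost = max_cost          # dollars allowed per session window
        self.window = window_seconds
        self.start = time.monotonic()
        self.spent = 0.0

    def charge(self, cost: float) -> None:
        """Record the cost of one LLM call; raise if the session blows its cap."""
        now = time.monotonic()
        if now - self.start > self.window:  # window expired: reset the meter
            self.start, self.spent = now, 0.0
        self.spent += cost
        if self.spent > self.max_cost:
            raise RuntimeError(
                f"circuit breaker tripped: ${self.spent:.2f} in one session"
            )

breaker = CostCircuitBreaker(max_cost=1.0)
for _ in range(20):
    try:
        breaker.charge(0.10)  # each agent step costs ~$0.10 in this sketch
    except RuntimeError as err:
        print(err)            # trips once cumulative cost exceeds the cap
        break
```

The raise, not a log line, is the point: a stuck loop must be stopped in-band, before the bill arrives.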
Forecasting with scenario models. The right planning heuristic is not "unit cost × current volume." It is "current spend × growth multiplier," where the multiplier accounts for new use cases, volume growth in existing use cases, and model upgrade cycles. A practical three-scenario approach: Conservative (1.3x current, no new applications), Expected (1.8–2.2x, planned applications launch), Aggressive (2.5–3x, new model generation shifts the cost basis). Budget for Expected; have pre-approved contingency for Aggressive. Teams that budget for Conservative spend the second half of the year in crisis management.
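The three-scenario model reduces to a few lines; the current-spend figure below is hypothetical, and the multipliers follow the ranges above.

```python
# Three-scenario forecast: current spend x growth multiplier.
# The current-spend figure is hypothetical; multipliers follow the text.

CURRENT_MONTHLY_SPEND = 40_000  # dollars

SCENARIOS = {
    "conservative": 1.3,   # no new applications
    "expected": 2.0,       # planned applications launch (midpoint of 1.8-2.2x)
    "aggressive": 2.75,    # new model generation shifts the cost basis
}

forecast = {name: CURRENT_MONTHLY_SPEND * m for name, m in SCENARIOS.items()}
contingency = forecast["aggressive"] - forecast["expected"]

for name, value in forecast.items():
    print(f"{name:>12}: ${value:,.0f}/month")
print(f"pre-approved contingency: ${contingency:,.0f}/month")
```

The gap between Expected and Aggressive is the contingency to pre-approve, so a model-generation shift mid-year is a line item rather than a crisis.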
What the Paradox Actually Means for You
The Jevons Paradox is not a problem to be solved. It is the correct outcome of effective AI investment. Teams whose total AI spend grows as prices fall are teams building things that were previously impossible.
The actionable question is not "how do we spend less?" It is "how do we spend proportionally?" The cost-per-outcome — cost per completed task, cost per successful resolution, cost per unit of user value delivered — should improve even as total spend rises. If it is not improving, you have a consumption problem: architectural waste, token tsunamis, reasoning model defaults on simple tasks.
If it is improving, you have an adoption problem masquerading as a cost problem. The CFO sees the bill going up and concludes inefficiency. The correct conclusion is that the product is working and usage is growing. Those require different interventions.
The teams that get this right instrument both dimensions simultaneously: total spend (the number that matters to finance) and cost per outcome (the number that tells you whether the spend is working). Optimizing one without the other produces either budget crises or under-investment. The organizations that build durable AI infrastructure learn to speak both languages at once.
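Instrumenting the second number is trivial; the discipline is in tracking it at all. A toy illustration with hypothetical quarterly figures:

```python
# Two numbers to instrument side by side; all figures are hypothetical.

def cost_per_outcome(total_spend: float, completed_tasks: int) -> float:
    """Dollars of inference spend per successfully completed task."""
    return total_spend / completed_tasks

q1 = cost_per_outcome(total_spend=30_000, completed_tasks=50_000)
q2 = cost_per_outcome(total_spend=90_000, completed_tasks=200_000)

# Total spend tripled quarter over quarter...
print("spend: $30K -> $90K")
# ...while the number that shows whether the spend is working improved.
print(f"cost per task: ${q1:.2f} -> ${q2:.2f} ({1 - q2 / q1:.0%} better)")
```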
Jevons was watching coal furnaces in Victorian England. He would recognize your Grafana dashboard immediately.
- https://a16z.com/llmflation-llm-inference-cost/
- https://epoch.ai/data-insights/llm-inference-price-trends/
- https://openrouter.ai/state-of-ai
- https://www.ikangai.com/the-llm-cost-paradox-how-cheaper-ai-models-are-breaking-budgets/
- https://www.arturmarkus.com/the-inference-cost-paradox-why-generative-ai-spending-surged-320-in-2025-despite-per-token-costs-dropping-1000x-and-what-it-means-for-your-ai-budget-in-2026/
- https://www.saastr.com/the-great-ai-token-paradox-how-were-simultaneously-driving-costs-down-and-usage-through-the-roof/
- https://www.wwt.com/wwt-research/when-less-means-more-how-jevons-paradox-applies-to-our-post-deepseek-world
- https://labs.adaline.ai/p/token-burnout-why-ai-costs-are-climbing
- https://www.deloitte.com/us/en/insights/topics/emerging-technologies/ai-tokens-how-to-navigate-spend-dynamics.html
- https://arxiv.org/html/2412.18547v1
- https://agentwiki.org/agent_cost_optimization
- https://www.finout.io/blog/finops-in-the-age-of-ai-a-cpos-guide-to-llm-workflows-rag-ai-agents-and-agentic-systems
- https://galileo.ai/blog/hidden-cost-of-agentic-ai
