The Quadratic Cost of a Conversation: Why AI Chat Spend Grows Superlinearly

May 17, 2026 · 8 min read

Software Engineer

A ten-turn conversation does not cost ten times a single turn. It costs closer to fifty-five times. This is the first thing most teams get wrong when they model the unit economics of an AI feature, and it is the reason a product that looks profitable in a spreadsheet bleeds money in production.

The mistake is treating a conversation as a sequence of independent requests. It is not. Because LLM APIs are stateless, every turn re-sends the entire accumulated history. Turn one sends one unit of context. Turn two sends two. Turn ten sends ten. The total tokens billed across the session is the sum 1 + 2 + ... + N, which grows as N²/2 — quadratically — while your product almost certainly charges a flat, linear price.

The users who love your product most are the ones holding the longest conversations. They are also the ones quietly destroying your margins.

The Math Nobody Puts in the Pitch Deck

Start with the formula. A conversation has N turns. Each turn adds roughly S tokens of new content — a user message plus a model response. By turn k, the input you send the model is the system prompt plus all k-1 prior turns plus the new message. Billed input tokens across the whole session are:

Sum over k of (system prompt + (k-1)·S) ≈ N·(system prompt) + S·N(N-1)/2.

That second term is the killer. It is quadratic in N. The first term — the part most people estimate from — is merely linear.

Make it concrete. Say each turn adds 1,000 tokens. A naive per-turn estimate for a 20-turn session predicts 20,000 input tokens. The actual cumulative input is about 210,000 tokens — more than ten times the estimate. Push to 50 turns and you are re-processing a 150K-token history on every call. By turn 100 the running context can exceed 300K tokens, and every one of those tokens is paid for again on turn 101.

This is why "average tokens per request" is one of the most dangerous metrics in an AI cost dashboard. The average is dominated by short sessions, because most sessions are short. But the spend is dominated by the long tail, because cost per session scales with the square of length. A handful of 80-turn power-user sessions can outweigh thousands of 3-turn sessions on the invoice while contributing almost nothing to the average. The metric that looks calm is hiding the metric that matters.

Your Best Users Are Your Worst Customers

Here is the part that should make a product manager uncomfortable. Engagement and cost are the same curve.

In a flat-rate subscription — the dominant pricing model for AI chat products — every user pays the same price regardless of how much they use. The implicit assumption is the gym-membership model: light users subsidize heavy users, and it averages out. That model works when the marginal cost of a heavy user is bounded. With conversational AI, it is not bounded. It is quadratic.

The user who has a deep, 60-turn working session with your product is getting enormous value. They are also the user most likely to renew, to evangelize, to show up in your testimonials. And they cost you not 20x a casual user but something closer to 400x, because their session length sits on the wrong end of an N² curve. Flat pricing turns your most valuable customers into loss leaders, and it does so silently, because nothing in a standard analytics funnel surfaces "cost per session by session length."

It gets worse with model migration. Token prices fall every year, which tempts teams to assume the problem solves itself. It does not. Users migrate to the newest, most capable frontier model the moment it ships, and frontier models are never the cheap ones. Reasoning models compound the issue further: they emit large volumes of intermediate reasoning tokens that join the billable history. The cost-per-token line trends down while the cost-per-session line trends up, because session length and model capability are both climbing faster than price is falling.

Why Prompt Caching Does Not Save You

The standard first reflex is prompt caching, and it is worth understanding precisely why it only goes so far.

Caching works on a stable prefix. Providers let you mark the front of your prompt — typically the system prompt and other fixed boilerplate — so that on a cache hit you pay a steep discount, often around 90% off, for re-reading that prefix. If your system prompt is large and unchanging, caching genuinely helps.

But look back at the cost formula. Caching attacks the linear term — the N·(system prompt) part. It does almost nothing to the quadratic term. The growing conversation history is not a stable prefix: each new user message, tool result, and reasoning trace is unique to that turn and appears for the first time. It has never been cached because it has never existed before. You can cache the conversation so far as a prefix for the next turn, which helps at the margin, but caches expire on short TTLs, and any edit, branch, or retry invalidates the prefix from the edit point forward.

Caching is a real optimization. It is not a structural fix. The N(N-1)/2 term is untouched, and that term is the one that grows without bound.

Treat Session Length as a First-Class Constraint

The fix is not a single trick. It is a decision to treat the conversation history as a managed resource with a budget, the way an operating system treats memory, rather than an append-only log you let grow until it overflows.

A few patterns, roughly in order of how much they cost you to implement:

Sliding-window truncation. Keep only the most recent K turns; drop the oldest as new ones arrive. It is nearly free and it flattens the quadratic into a linear cost, because the billed history stops growing once it hits the window size. The price is amnesia: the model forgets early context and the rationale behind earlier decisions. Fine for shallow support chats, dangerous for long reasoning tasks.
Summarization checkpoints. When the history crosses a threshold — say 70–80% of a target budget — compress the older segment into a compact summary and keep only recent turns verbatim. This preserves the gist of long sessions while capping the token count. The cost is an extra summarization call and some fidelity loss at the seams. It is the workhorse pattern for serious long-running assistants.
Tiered compression cascades. The most robust production systems do not pick one strategy. They compress verbose tool outputs first, apply a sliding window second, and invoke LLM summarization only as a last resort. Each layer is cheaper than the one below it, so the expensive option runs least often.
Hard caps with graceful handoff. When a session genuinely cannot continue affordably, end it deliberately — checkpoint the state, tell the user, offer a fresh thread that carries forward a summary. An honest boundary beats a silent margin leak.

Whatever you choose, the architectural point is the same: context pressure must be observable. Instrument billed input tokens per session against session length. Plot the distribution, not the average. The moment you can see the N² curve, the truncation-versus-quality trade-off becomes a deliberate engineering decision instead of an accident you discover on the invoice.

Pricing Has to Acknowledge the Curve

Engineering can flatten the quadratic. It cannot make it linear for free — every mitigation trades tokens for quality or latency. So the pricing model has to absorb what is left.

Pure flat-rate pricing is a structural mismatch with a quadratic cost base, full stop. The realistic options are usage-aware: credit pools that deplete faster for longer sessions, length-aware tiers, or hybrid plans with a flat base plus metered overage. Each has a trust cost — users dislike meters they cannot predict, and they especially dislike suspecting that a feature was constrained to push them toward overage. That trust cost is real and worth managing carefully. But it is smaller than the cost of an unbounded liability growing on every power user's account.

The honest framing for a product team is this: a conversation is not a sequence of messages, it is a context window that gets re-billed on every turn. The team that internalizes the N²/2 — in its dashboards, its mitigations, and its pricing — builds a feature that survives its own success. The team that estimates from the average ships a feature whose best week is the week the finance team starts asking questions.

References:

Let's stay in touch and Follow me for more thoughts and updates

Twitter LinkedIn Telegram Discord 小红书

The Quadratic Cost of a Conversation: Why AI Chat Spend Grows Superlinearly

The Math Nobody Puts in the Pitch Deck

Your Best Users Are Your Worst Customers

Why Prompt Caching Does Not Save You

Treat Session Length as a First-Class Constraint

Pricing Has to Acknowledge the Curve

Recommended Reading

About Tian Pan

The Math Nobody Puts in the Pitch Deck​

Your Best Users Are Your Worst Customers​

Why Prompt Caching Does Not Save You​

Treat Session Length as a First-Class Constraint​

Pricing Has to Acknowledge the Curve​

Recommended Reading

About Tian Pan

The Math Nobody Puts in the Pitch Deck

Your Best Users Are Your Worst Customers

Why Prompt Caching Does Not Save You

Treat Session Length as a First-Class Constraint

Pricing Has to Acknowledge the Curve