The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever
Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.
Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
That's leaving money on the table. If you measure actual output length distributions across your routes and set max_tokens based on the shape of those distributions rather than paranoid defaults, you can cut output token spend 20–40% on a typical production workload without any observable quality change. This is not a prompt-engineering trick or a model swap. It's a budget calibration exercise, and almost nobody does it.
Why max_tokens Is Treated as a Safety Valve, Not a Budget
The default mental model goes like this: max_tokens prevents runaway generations. If the model starts hallucinating a novel when you asked for a summary, this cap saves you. Set it high enough to never accidentally truncate a legitimate answer, and you're fine. The cost implication doesn't register because most teams think of output cost as "whatever the model generates," which is bounded by the model's natural stopping behavior, not by the cap.
This mental model has two bugs.
First, the cap matters for capacity planning, not just for runaway prevention. Providers reserve KV cache and accounting slots based on your declared max_tokens, and some pricing plans bill against the reservation rather than actual generation. Even where billing is purely based on generated tokens, your rate limit math uses the declared ceiling. A call that asks for 4096 tokens but generates 120 counts against your token-per-minute budget more aggressively on some providers — and it occupies more inference scheduler slots, which hurts tail latency under load.
Second, and more importantly, most people never check what their actual output length distribution looks like. They assume the model is "about as verbose as it needs to be," set max_tokens comfortably above that, and call it done. But if you sample a week of production traffic for a specific route — say, a summarization endpoint or a structured extraction call — the distribution is almost always extremely right-skewed. The p50 is a small fraction of max_tokens. The p95 is still well under it. The p99 might be close to it, but the p99.9 is what gets truncated, and that's an edge case you want to handle differently anyway.
The cap is sitting at 4096. Your actual p99 is 680. You're budgeting for a tail that does not exist.
The Calibration Methodology
The first step is to get out of the one-size-fits-all mindset. max_tokens should be set per route, per task, per call type — not per application. The call that summarizes a support ticket has a completely different output distribution than the call that generates a code review or the call that returns a JSON classification label. Treating them with a single shared default is like putting every endpoint behind the same timeout and wondering why p99 latency is bad.
For each route, collect a representative sample of output token counts. Most serious LLM client libraries surface completion_tokens in the response metadata. Log it alongside your request ID and route identifier. A week of production traffic is usually enough; a day is often enough if the route is high-volume.
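Concretely, the logging can be a thin wrapper around your existing client call. The sketch below assumes an OpenAI-style response shape (usage.completion_tokens, choices[0].finish_reason) and a generic line-oriented sink; the function name and record fields are illustrative, so adapt them to your SDK and logging stack.

```python
import json
import time

def log_completion_usage(route: str, request_id: str, response: dict, sink) -> dict:
    """Record output-token usage and stop reason for one call, tagged by route.

    `response` is assumed to follow the OpenAI chat completions shape;
    adjust the field access for your client library.
    """
    record = {
        "ts": time.time(),
        "route": route,
        "request_id": request_id,
        "completion_tokens": response["usage"]["completion_tokens"],
        "finish_reason": response["choices"][0]["finish_reason"],
    }
    # One JSON line per call keeps the later percentile analysis trivial.
    sink.write(json.dumps(record) + "\n")
    return record
```

The only requirement on the sink is a write() method, so a file handle, a queue wrapper, or an in-memory buffer all work.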
Then look at the distribution:
- p50, p95, p99, p99.9, max. These five numbers tell you almost everything you need.
- Shape. Is it unimodal and tight (structured extraction), or does it have a long tail (open-ended generation)? Tight distributions can be capped aggressively. Long-tailed ones need continuation handling.
- Known ceilings. For structured outputs or classification labels, there's often a theoretical maximum — the longest valid enum, the widest JSON schema. Use that, not a guess.
Set max_tokens to roughly p99 plus a small cushion for most routes. Where the tail truly matters (legal summaries, long-form generation where truncation is unacceptable), either set it higher or invest in a continuation pattern — but don't do both.
The reason this works is that you only pay for the tail when it actually shows up. If the tail represents legitimate generations that need to fit, capping at p99 will cause about 1% of requests to truncate, which you handle with a continuation call at the cost of one extra round trip for that 1%. If the tail is model verbosity that adds no value, capping removes it entirely and saves you the tokens.
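Turning a batch of logged completion_tokens counts into a proposed cap is a few lines. This is a sketch with illustrative choices, not a prescription: the nearest-rank percentile, the 10% cushion, and the rounding to a multiple of 16 are all defaults you should adjust per route.

```python
import math

def percentile(sorted_counts, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    rank = math.ceil(p / 100 * len(sorted_counts))
    return sorted_counts[max(0, rank - 1)]

def summarize_route(counts):
    """The five headline numbers for one route's output-token distribution."""
    s = sorted(counts)
    stats = {p: percentile(s, p) for p in (50, 95, 99, 99.9)}
    stats["max"] = s[-1]
    return stats

def propose_cap(counts, cushion=0.10):
    """Cap at p99 plus a small cushion, rounded up to a multiple of 16."""
    p99 = percentile(sorted(counts), 99)
    return math.ceil(p99 * (1 + cushion) / 16) * 16
```

Run summarize_route first and look at the shape before trusting propose_cap; a bimodal distribution (short answers plus occasional long ones) usually means the route should be split, not capped.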
When Truncation Is a Feature, Not a Bug
The instinct to pad max_tokens comes from a legitimate fear: a truncated response in production is visible to the user, often breaks downstream parsing, and looks broken even when the underlying answer was fine. Nobody wants to ship a chatbot that cuts off mid-sentence.
But truncation is a signal, not a failure. When the API returns finish_reason: "length" (or stop_reason: "max_tokens" on Anthropic), it's telling you precisely which calls need continuation. This is information you can act on.
Three ways to handle it:
Graceful continuation. Keep the conversation context, append the truncated response as an assistant turn, and issue a follow-up prompt like "continue from where you stopped." The model picks up the thread. This works for prose, structured text, and streaming scenarios. Cost-wise, you pay for the truncated tokens once, the continuation tokens once, and some duplicated prompt overhead. If your calibration is right, this happens on ~1% of requests, so the amortized cost of the continuation pattern is tiny compared to the savings from capping the other 99%.
Structural continuation. For JSON or code generation, raw "continue" often breaks schema. Instead, parse the partial output, identify the last complete structural element (closed object, finished function), and re-prompt asking for only the missing pieces. OpenAI's community has documented this pattern extensively for finish_reason: length on structured outputs. More engineering, but robust.
Fallback tier. For routes where continuation is expensive or complicated, detect the truncation and retry on a model with a higher output ceiling, or with an explicitly larger max_tokens. This is rarer than it should be — most teams just pad everything upfront.
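A minimal version of the graceful-continuation pattern looks like the sketch below. It assumes an OpenAI-style message list and a call_model function standing in for your actual SDK call; the continuation prompt wording and the max_rounds bound are illustrative, and a real handler would also cap total spend.

```python
def complete_with_continuation(call_model, messages, max_tokens, max_rounds=3):
    """Call the model; if it stops on the length cap, ask it to continue.

    call_model(messages, max_tokens) must return (text, finish_reason),
    where finish_reason == "length" means the cap truncated the output.
    """
    parts = []
    convo = list(messages)
    for _ in range(max_rounds):
        text, finish_reason = call_model(convo, max_tokens)
        parts.append(text)
        if finish_reason != "length":
            return "".join(parts), finish_reason
        # Feed the truncated output back as an assistant turn and
        # ask the model to pick up the thread.
        convo = convo + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Continue exactly from where you stopped."},
        ]
    # Still truncated after max_rounds: surface that to the caller.
    return "".join(parts), "length"
```

Because the handler only fires on finish_reason == "length", the happy path is a single call with zero overhead; the continuation cost lands only on the calibrated ~1% tail.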
The key insight is that handling truncation gracefully at the 1% tail is nearly always cheaper than budgeting for the tail on 100% of requests. The only time that's not true is when continuation itself is prohibitively expensive (chained tool calls with complex state) or when a single truncated response causes a cascading failure downstream. Both cases are narrow and identifiable — they're exceptions, not defaults.
The Latency Story Nobody Talks About
Output tokens don't just cost more — they also dominate latency. Generation is sequential: each token depends on the one before it, so time per output token (TPOT) multiplied by tokens generated accounts for most of your response time. A route that generates 2000 tokens at 50 tokens per second takes 40 seconds of generation time. The same route calibrated to 600 tokens takes 12.
Cutting output length is the single biggest latency lever in most LLM systems: bigger than prompt compression, bigger than model choice within a tier, bigger than caching. The rule of thumb that cutting 50% of your output tokens cuts roughly 50% of your latency is approximately correct, because TPOT is roughly constant at a given load.
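The arithmetic is worth making explicit. A back-of-the-envelope model of generation time under a roughly constant TPOT (the 50 tokens/second figure is illustrative, not a benchmark):

```python
def generation_seconds(output_tokens, tokens_per_second=50.0):
    """Sequential decode time: tokens generated divided by decode throughput."""
    return output_tokens / tokens_per_second

# The example from the text: a verbose route versus its calibrated version.
assert generation_seconds(2000) == 40.0
assert generation_seconds(600) == 12.0
```

The linearity is the point: halve the tokens, halve the generation time, at any throughput.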
When you calibrate max_tokens aggressively:
- p99 latency drops because the worst-case generation shortens.
- p50 latency can drop too, but only when the cap sits close to the natural length: occasional verbose generations that would have run long now stop at the cap instead.
- Throughput improves because inference schedulers can pack more concurrent requests into the same GPU pool when declared reservations shrink.
The cost savings are the headline number, but the latency improvement is frequently what makes the calibration exercise land with product teams. Users feel latency; finance feels cost. You can sell the work to both.
Where Calibration Breaks Down
This isn't a universal hammer. A few cases where aggressive max_tokens calibration is the wrong move:
Reasoning models. Models with extended thinking (Claude's thinking budget, OpenAI's reasoning tokens) consume hidden tokens before emitting visible output. max_tokens interacts with the thinking budget in provider-specific ways, and capping too tightly can truncate the reasoning phase and produce garbled answers. For these, use the provider's explicit thinking budget parameter and set max_tokens only on the visible portion.
Streaming UIs where the user reads as it generates. If the product pattern is "the model writes a document live," truncation is highly visible. Continuation patterns work but introduce a noticeable pause. For these routes, a looser cap plus explicit length instructions in the system prompt may be better than aggressive calibration.
Agentic tool loops. When the output is a function call, the tokens are cheap and the cost is dominated by the number of iterations, not the length of each. Calibrate the iteration count and tool arguments, not the overall cap.
Cold-start routes. Routes with low volume or changing behavior — new features, A/B test variants — don't have a stable distribution to calibrate against. Use a conservative default and revisit after you have traffic.
For everything else — summarization, extraction, classification, Q&A, batch processing, evaluation pipelines — calibration almost always pays.
How to Ship It Without Breaking Things
The failure mode to avoid is flipping max_tokens from 4096 to 600 across a hundred endpoints in one PR and discovering on Monday that your JSON parser is unhappy on 0.8% of support tickets. A safer rollout:
- Instrument first. Log completion_tokens and finish_reason on every call, tagged by route. Don't change anything yet.
- Compute per-route distributions. Dump the last week into a notebook; look at the five numbers.
- Shadow-test the new cap. Pick one route, set max_tokens to your proposed value, and monitor truncation rate. Expect something near your calibration target (~1% if you picked p99).
- Add a continuation handler before rolling wider. Even if you don't need it yet, have it ready.
- Dashboard the truncation rate as a first-class metric per route. Alert if it drifts above, say, 3%. That drift is a signal that output distribution has shifted — prompt change, model change, user behavior change — and warrants recalibration.
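The dashboard step reduces to a per-route counter and a threshold. A sketch assuming the same finish_reason logging as above; the 3% alert threshold mirrors the level suggested here, and the min_samples guard (an addition, to avoid alerting on noise) is an arbitrary default.

```python
from collections import defaultdict

class TruncationMonitor:
    """Track per-route truncation rate and flag routes drifting past a threshold."""

    def __init__(self, alert_threshold=0.03, min_samples=100):
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples  # don't alert on tiny samples
        self.totals = defaultdict(int)
        self.truncated = defaultdict(int)

    def record(self, route, finish_reason):
        self.totals[route] += 1
        # "length" (OpenAI-style) or "max_tokens" (Anthropic-style) means the
        # cap cut the generation short.
        if finish_reason in ("length", "max_tokens"):
            self.truncated[route] += 1

    def truncation_rate(self, route):
        return self.truncated[route] / max(1, self.totals[route])

    def routes_needing_recalibration(self):
        return [
            r for r, n in self.totals.items()
            if n >= self.min_samples and self.truncation_rate(r) > self.alert_threshold
        ]
```

Wire record() into the same code path as the usage logging and poll routes_needing_recalibration() from whatever alerting you already run; the point is that truncation rate becomes a first-class metric, not something you grep for after an incident.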
The pattern is the same as any tuning exercise: measure, change one thing, verify, then generalize. What's different is the reward. A single afternoon of calibrating the top ten highest-volume routes in a production LLM system can reliably clip 20–40% off output spend and measurably tighten tail latency. There are not many levers that cheap.
The Budget Was Always There
The uncomfortable realization from doing this work is how much slack was hiding in plain sight. Nobody intentionally over-budgeted; they just never budgeted at all. max_tokens got treated as a boundary, not a dial, and the boundary was set by whoever wrote the first integration — often the model default, often nothing to do with the actual use case.
The same dynamic shows up elsewhere in LLM systems: context windows padded with unused history, retrieval returning more chunks than the model ever attends to, embeddings cached at the wrong granularity. Cost engineering for LLMs is mostly the discipline of noticing the budgets that were implicitly set and making them explicit. max_tokens is the easiest one to fix because the distribution is already in your logs — you just have to look.
Next time you're staring at a rising LLM bill, before you reach for prompt compression or model routing, pull the completion_tokens distribution for your top three routes. If the p99 is materially below your current cap, you already know what to do.
- https://www.silicondata.com/blog/llm-cost-per-token
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://www.vellum.ai/llm-parameters/max-tokens
- https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/
- https://developers.openai.com/api/docs/guides/latency-optimization
- https://www.bentoml.com/blog/beyond-tokens-per-second-how-to-balance-speed-cost-and-quality-in-llm-inference
- https://community.openai.com/t/tips-for-handling-finish-reason-length-with-json/806445
- https://medium.com/@ankitmarwaha18/overcoming-response-truncation-in-azure-openai-a-comprehensive-guide-cb85249cf007
- https://arxiv.org/html/2412.18547v5
