The max_tokens Knob Nobody Tunes: Output Truncation as a Cost Lever
Look at the max_tokens parameter on every LLM call in your codebase. If you're like most teams, it's either unset, set to the model's maximum, or set to some round number like 4096 that someone picked six months ago and nobody has touched since. It's the one budget knob in your API request that's staring you in the face, and it's silently paying for slack you never use.
Output tokens cost roughly four times what input tokens cost on the median commercial model, and as much as eight times on the expensive end. The economics of the generation step are completely lopsided: every unused token of headroom you leave in max_tokens is a token you might pay for, and every token you generate extends your p50 latency linearly because decoding is sequential. Yet most production systems treat this parameter as a safety valve — set it high, forget it, move on.
That's leaving money on the table. If you measure actual output length distributions across your routes and set max_tokens based on the shape of those distributions rather than paranoid defaults, you can cut output token spend 20–40% on a typical production workload without any observable quality change. This is not a prompt-engineering trick or a model swap. It's a budget calibration exercise, and almost nobody does it.
Why max_tokens Is Treated as a Safety Valve, Not a Budget
The default mental model goes like this: max_tokens prevents runaway generations. If the model starts hallucinating a novel when you asked for a summary, this cap saves you. Set it high enough to never accidentally truncate a legitimate answer, and you're fine. The cost implication doesn't register because most teams think of output cost as "whatever the model generates," which is bounded by the model's natural stopping behavior, not by the cap.
This mental model has two bugs.
First, the cap matters for capacity planning, not just for runaway prevention. Providers reserve KV cache and accounting slots based on your declared max_tokens, and some pricing plans bill against the reservation rather than actual generation. Even where billing is purely based on generated tokens, your rate limit math uses the declared ceiling. A call that asks for 4096 tokens but generates 120 counts against your token-per-minute budget more aggressively on some providers — and it occupies more inference scheduler slots, which hurts tail latency under load.
Second, and more importantly, most people never check what their actual output length distribution looks like. They assume the model is "about as verbose as it needs to be," set max_tokens comfortably above that, and call it done. But if you sample a week of production traffic for a specific route — say, a summarization endpoint or a structured extraction call — the distribution is almost always extremely right-skewed. The p50 is a small fraction of max_tokens. The p95 is still well under it. The p99 might be close to it, but the p99.9 is what gets truncated, and that's an edge case you want to handle differently anyway.
The cap is sitting at 4096. Your actual p99 is 680. You're budgeting for a tail that does not exist.
The Calibration Methodology
The first step is to get out of the one-size-fits-all mindset. max_tokens should be set per route, per task, per call type — not per application. The call that summarizes a support ticket has a completely different output distribution than the call that generates a code review or the call that returns a JSON classification label. Treating them with a single shared default is like putting every endpoint behind the same timeout and wondering why p99 latency is bad.
For each route, collect a representative sample of output token counts. Most serious LLM client libraries surface completion_tokens in the response metadata. Log it alongside your request ID and route identifier. A week of production traffic is usually enough; a day is often enough if the route is high-volume.
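Concretely, the logging can be a thin wrapper around your existing client call. The sketch below assumes an OpenAI-style response shape (usage.completion_tokens, choices[0].finish_reason) and a generic line-oriented sink; the function name and record fields are illustrative, so adapt them to your SDK and logging stack.

```python
import json
import time

def log_completion_usage(route: str, request_id: str, response: dict, sink) -> dict:
    """Record output-token usage and stop reason for one call, tagged by route.

    `response` is assumed to follow the OpenAI chat completions shape;
    adjust the field access for your client library.
    """
    record = {
        "ts": time.time(),
        "route": route,
        "request_id": request_id,
        "completion_tokens": response["usage"]["completion_tokens"],
        "finish_reason": response["choices"][0]["finish_reason"],
    }
    # One JSON line per call keeps the later percentile analysis trivial.
    sink.write(json.dumps(record) + "\n")
    return record
```

The only requirement on the sink is a write() method, so a file handle, a queue wrapper, or an in-memory buffer all work.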
Then look at the distribution:
- p50, p95, p99, p99.9, max. These five numbers tell you almost everything you need.
- Shape. Is it unimodal and tight (structured extraction), or does it have a long tail (open-ended generation)? Tight distributions can be capped aggressively. Long-tailed ones need continuation handling.
- Known ceilings. For structured outputs or classification labels, there's often a theoretical maximum — the longest valid enum, the widest JSON schema. Use that, not a guess.
Set max_tokens to roughly p99 plus a small cushion for most routes. Where the tail truly matters (legal summaries, long-form generation where truncation is unacceptable), either set it higher or invest in a continuation pattern — but don't do both.
The reason this works is that you only pay for the tail when it actually shows up. If the tail represents legitimate generations that need to fit, capping at p99 will cause about 1% of requests to truncate, which you handle with a continuation call at the cost of one extra round trip for that 1%. If the tail is model verbosity that adds no value, capping removes it entirely and saves you the tokens.
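Turning a batch of logged completion_tokens counts into a proposed cap is a few lines. This is a sketch with illustrative choices, not a prescription: the nearest-rank percentile, the 10% cushion, and the rounding to a multiple of 16 are all defaults you should adjust per route.

```python
import math

def percentile(sorted_counts, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    rank = math.ceil(p / 100 * len(sorted_counts))
    return sorted_counts[max(0, rank - 1)]

def summarize_route(counts):
    """The five headline numbers for one route's output-token distribution."""
    s = sorted(counts)
    stats = {p: percentile(s, p) for p in (50, 95, 99, 99.9)}
    stats["max"] = s[-1]
    return stats

def propose_cap(counts, cushion=0.10):
    """Cap at p99 plus a small cushion, rounded up to a multiple of 16."""
    p99 = percentile(sorted(counts), 99)
    return math.ceil(p99 * (1 + cushion) / 16) * 16
```

Run summarize_route first and look at the shape before trusting propose_cap; a bimodal distribution (short answers plus occasional long ones) usually means the route should be split, not capped.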
When Truncation Is a Feature, Not a Bug
The instinct to pad max_tokens comes from a legitimate fear: a truncated response in production is visible to the user, often breaks downstream parsing, and looks broken even when the underlying answer was fine. Nobody wants to ship a chatbot that cuts off mid-sentence.
But truncation is a signal, not a failure. When the API returns finish_reason: "length" (or stop_reason: "max_tokens" on Anthropic), it's telling you precisely which calls need continuation. This is information you can act on.
Three ways to handle it:
Graceful continuation. Keep the conversation context, append the truncated response as an assistant turn, and issue a follow-up prompt like "continue from where you stopped." The model picks up the thread. This works for prose, structured text, and streaming scenarios. Cost-wise, you pay for the truncated tokens once, the continuation tokens once, and some duplicated prompt overhead. If your calibration is right, this happens on ~1% of requests, so the amortized cost of the continuation pattern is tiny compared to the savings from capping the other 99%.
Structural continuation. For JSON or code generation, raw "continue" often breaks schema. Instead, parse the partial output, identify the last complete structural element (closed object, finished function), and re-prompt asking for only the missing pieces. OpenAI's community has documented this pattern extensively for finish_reason: length on structured outputs. More engineering, but robust.
Fallback tier. For routes where continuation is expensive or complicated, detect the truncation and retry on a model with a higher output ceiling, or with an explicitly larger max_tokens. This is rarer than it should be — most teams just pad everything upfront.
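A minimal version of the graceful-continuation pattern looks like the sketch below. It assumes an OpenAI-style message list and a call_model function standing in for your actual SDK call; the continuation prompt wording and the max_rounds bound are illustrative, and a real handler would also cap total spend.

```python
def complete_with_continuation(call_model, messages, max_tokens, max_rounds=3):
    """Call the model; if it stops on the length cap, ask it to continue.

    call_model(messages, max_tokens) must return (text, finish_reason),
    where finish_reason == "length" means the cap truncated the output.
    """
    parts = []
    convo = list(messages)
    for _ in range(max_rounds):
        text, finish_reason = call_model(convo, max_tokens)
        parts.append(text)
        if finish_reason != "length":
            return "".join(parts), finish_reason
        # Feed the truncated output back as an assistant turn and
        # ask the model to pick up the thread.
        convo = convo + [
            {"role": "assistant", "content": text},
            {"role": "user", "content": "Continue exactly from where you stopped."},
        ]
    # Still truncated after max_rounds: surface that to the caller.
    return "".join(parts), "length"
```

Because the handler only fires on finish_reason == "length", the happy path is a single call with zero overhead; the continuation cost lands only on the calibrated ~1% tail.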
The key insight is that handling truncation gracefully at the 1% tail is nearly always cheaper than budgeting for the tail on 100% of requests. The only time that's not true is when continuation itself is prohibitively expensive (chained tool calls with complex state) or when a single truncated response causes a cascading failure downstream. Both cases are narrow and identifiable — they're exceptions, not defaults.
The Latency Story Nobody Talks About
Output tokens don't just cost more — they also dominate latency. Generation is sequential: each token depends on the one before it, so time per output token (TPOT) multiplied by tokens generated accounts for most of your response time. A route that generates 2000 tokens at 50 tokens per second takes 40 seconds of generation time. The same route calibrated to 600 tokens takes 12.
Cutting output length is the single biggest latency lever in most LLM systems: bigger than prompt compression, bigger than model choice within a tier, bigger than caching. The rule of thumb that cutting 50% of your output tokens cuts roughly 50% of your latency is approximately correct, because TPOT is roughly constant at a given load.
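The arithmetic is worth making explicit. A back-of-the-envelope model of generation time under a roughly constant TPOT (the 50 tokens/second figure is illustrative, not a benchmark):

```python
def generation_seconds(output_tokens, tokens_per_second=50.0):
    """Sequential decode time: tokens generated divided by decode throughput."""
    return output_tokens / tokens_per_second

# The example from the text: a verbose route versus its calibrated version.
assert generation_seconds(2000) == 40.0
assert generation_seconds(600) == 12.0
```

The linearity is the point: halve the tokens, halve the generation time, at any throughput.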
When you calibrate max_tokens aggressively:
- p99 latency drops because the worst-case generation shortens.
- p50 latency can drop too, but only when the cap sits close to the natural length: occasional verbose generations that would have run long now stop at the cap instead.
- Throughput improves because inference schedulers can pack more concurrent requests into the same GPU pool when declared reservations shrink.
The cost savings are the headline number, but the latency improvement is frequently what makes the calibration exercise land with product teams. Users feel latency; finance feels cost. You can sell the work to both.
Where Calibration Breaks Down
This isn't a universal hammer. A few cases where aggressive max_tokens calibration is the wrong move:
Reasoning models. Models with extended thinking (Claude's thinking budget, OpenAI's reasoning tokens) consume hidden tokens before emitting visible output. max_tokens interacts with the thinking budget in provider-specific ways, and capping too tightly can truncate the reasoning phase and produce garbled answers. For these, use the provider's explicit thinking budget parameter and set max_tokens only on the visible portion.
Streaming UIs where the user reads as it generates. If the product pattern is "the model writes a document live," truncation is highly visible. Continuation patterns work but introduce a noticeable pause. For these routes, a looser cap plus explicit length instructions in the system prompt may be better than aggressive calibration.
Agentic tool loops. When the output is a function call, the tokens are cheap and the cost is dominated by the number of iterations, not the length of each. Calibrate the iteration count and tool arguments, not the overall cap.
Cold-start routes. Routes with low volume or changing behavior — new features, A/B test variants — don't have a stable distribution to calibrate against. Use a conservative default and revisit after you have traffic.
For everything else — summarization, extraction, classification, Q&A, batch processing, evaluation pipelines — calibration almost always pays.
How to Ship It Without Breaking Things
The failure mode to avoid is flipping max_tokens from 4096 to 600 across a hundred endpoints in one PR and discovering on Monday that your JSON parser is unhappy on 0.8% of support tickets. A safer rollout:
- Instrument first. Log completion_tokens and finish_reason on every call, tagged by route. Don't change anything yet.
- Compute per-route distributions. Dump the last week into a notebook; look at the five numbers.
- Shadow-test the new cap. Pick one route, set max_tokens to your proposed value, and monitor truncation rate. Expect something near your calibration target (~1% if you picked p99).
- Add a continuation handler before rolling wider. Even if you don't need it yet, have it ready.
- Dashboard the truncation rate as a first-class metric per route. Alert if it drifts above, say, 3%. That drift is a signal that output distribution has shifted — prompt change, model change, user behavior change — and warrants recalibration.
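The dashboard step reduces to a per-route counter and a threshold. A sketch assuming the same finish_reason logging as above; the 3% alert threshold mirrors the level suggested here, and the min_samples guard (an addition, to avoid alerting on noise) is an arbitrary default.

```python
from collections import defaultdict

class TruncationMonitor:
    """Track per-route truncation rate and flag routes drifting past a threshold."""

    def __init__(self, alert_threshold=0.03, min_samples=100):
        self.alert_threshold = alert_threshold
        self.min_samples = min_samples  # don't alert on tiny samples
        self.totals = defaultdict(int)
        self.truncated = defaultdict(int)

    def record(self, route, finish_reason):
        self.totals[route] += 1
        # "length" (OpenAI-style) or "max_tokens" (Anthropic-style) means the
        # cap cut the generation short.
        if finish_reason in ("length", "max_tokens"):
            self.truncated[route] += 1

    def truncation_rate(self, route):
        return self.truncated[route] / max(1, self.totals[route])

    def routes_needing_recalibration(self):
        return [
            r for r, n in self.totals.items()
            if n >= self.min_samples and self.truncation_rate(r) > self.alert_threshold
        ]
```

Wire record() into the same code path as the usage logging and poll routes_needing_recalibration() from whatever alerting you already run; the point is that truncation rate becomes a first-class metric, not something you grep for after an incident.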
The pattern is the same as any tuning exercise: measure, change one thing, verify, then generalize. What's different is the reward. A single afternoon of calibrating the top ten highest-volume routes in a production LLM system can reliably clip 20–40% off output spend and measurably tighten tail latency. There are not many levers that cheap.
The Budget Was Always There
The uncomfortable realization from doing this work is how much slack was hiding in plain sight. Nobody intentionally over-budgeted; they just never budgeted at all. max_tokens got treated as a boundary, not a dial, and the boundary was set by whoever wrote the first integration — often the model default, often nothing to do with the actual use case.
The same dynamic shows up elsewhere in LLM systems: context windows padded with unused history, retrieval returning more chunks than the model ever attends to, embeddings cached at the wrong granularity. Cost engineering for LLMs is mostly the discipline of noticing the budgets that were implicitly set and making them explicit. max_tokens is the easiest one to fix because the distribution is already in your logs — you just have to look.
Next time you're staring at a rising LLM bill, before you reach for prompt compression or model routing, pull the completion_tokens distribution for your top three routes. If the p99 is materially below your current cap, you already know what to do.
- https://www.silicondata.com/blog/llm-cost-per-token
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://www.vellum.ai/llm-parameters/max-tokens
- https://blog.premai.io/llm-cost-optimization-8-strategies-that-cut-api-spend-by-80-2026-guide/
- https://developers.openai.com/api/docs/guides/latency-optimization
- https://www.bentoml.com/blog/beyond-tokens-per-second-how-to-balance-speed-cost-and-quality-in-llm-inference
- https://community.openai.com/t/tips-for-handling-finish-reason-length-with-json/806445
- https://medium.com/@ankitmarwaha18/overcoming-response-truncation-in-azure-openai-a-comprehensive-guide-cb85249cf007
- https://arxiv.org/html/2412.18547v5
