Token Budget Strategies for Production LLM Applications

10 min read
Tian Pan
Software Engineer

Most teams discover their context management problem the same way: a production agent that worked fine in demos starts hallucinating after 15 conversation turns. The logs show valid JSON, the model returned 200, and nobody changed the code. What changed was the accumulation — tool results, retrieved documents, and conversation history quietly filled the context window until the model was reasoning over 80,000 tokens of mixed-relevance content.

Context overflow is the obvious failure mode, but "context rot" is the insidious one. Research shows that LLM performance degrades before you hit the limit. As context grows, models exhibit a lost-in-the-middle effect: attention concentrates at the beginning and end of the input while content in the middle becomes unreliable. Instructions buried at turn 12 of a 30-turn conversation may effectively disappear. The model doesn't error out — it just quietly ignores them.

This isn't a problem you can solve by upgrading to a million-token context window. Large windows delay the problem; they don't eliminate it. At 60-70% capacity utilization, performance starts to degrade even on models that advertise 1M+ token support. Real production systems need explicit token budget strategies.

Why Token Costs Compound Faster Than You Expect

Consider a production support agent with a 3,000-token system prompt. Add 4,000 tokens of retrieved documentation per query, 500 tokens of conversation history (growing each turn), and 200 tokens of tool call results per step. By turn 8, you're at 12,000+ tokens per call.

At $0.05 per 1K input tokens, a million API calls per month means your system prompt alone costs $150K monthly. Add retrieval and history, and you're looking at $500K+ — before output tokens. Teams that don't track this are routinely shocked by their first full-scale invoice.

The accumulation problem is worse in multi-agent workflows. A pipeline with 20 LLM calls, each passing results to the next, can accumulate 50,000+ tokens before producing a final answer. Every intermediate step gets billed. Every intermediate step affects quality.

The fix isn't to cut context arbitrarily — that causes its own failures. The fix is to treat token budget as a first-class engineering concern with deliberate allocation strategies.

Use Model-Specific Tokenizers, Not Approximations

Before you can manage a token budget, you need accurate counts. The "4 characters per token" heuristic is wrong often enough to matter in production.

Different models use different tokenizers:

  • OpenAI: tiktoken — GPT-4 and GPT-3.5 use the cl100k_base encoding; GPT-4o models use o200k_base
  • Anthropic Claude: Use the count_tokens() endpoint — it's free and doesn't consume rate limits
  • Google Gemini: Use the count_tokens() method from the official SDK (SentencePiece-based)
  • Meta Llama: Use the exact HuggingFace tokenizer from Llama's repository

Using tiktoken to count Claude tokens can produce estimates that are off by 30-50%. That margin collapses your safety buffer and causes unexpected truncation.

Accurate counting also requires accounting for message formatting overhead. In the OpenAI Chat Completions API, each message adds ~4 tokens for the ChatML wrapper. A function call schema adds more. Production code should always count exactly:

import tiktoken

def count_chat_tokens(messages: list[dict], model: str = "gpt-4o") -> int:
    enc = tiktoken.encoding_for_model(model)
    total = 3  # completion primer overhead
    for msg in messages:
        total += 4  # per-message wrapper
        for key, value in msg.items():
            total += len(enc.encode(str(value)))
    return total

Reserve output tokens explicitly. If your model can generate up to 2,000 tokens of output, set your context budget to context_limit - 2000. Never let the input consume the entire window.

Allocate Budget by Priority, Not Chronology

The naive approach to context management is chronological: keep everything until you hit the limit, then drop oldest content. This fails because the oldest content often includes your system prompt, initial constraints, and critical instructions. Dropping it silently corrupts agent behavior.

Instead, define explicit budget tiers:

| Tier | Content | Budget |
| --- | --- | --- |
| Protected | System prompt, current query | Always included |
| High priority | Current tool results, latest retrieved docs | ~30% of window |
| Medium priority | Recent conversation history (last 5 turns) | ~25% of window |
| Low priority | Older conversation history | Remainder |

When you're approaching the limit, compress or drop from the bottom up. Never touch protected content.
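The bottom-up rule can be sketched in a few lines. This is a toy illustration: `count` is a whitespace stand-in for a real tokenizer, and the tier names are hypothetical, but the shape — protected content unconditionally included, everything else filled in priority order — is the point:

```python
def trim_to_budget(tiers: list[tuple[str, list[str]]], budget: int,
                   count=lambda s: len(s.split())) -> list[str]:
    """Fill the context tier by tier, highest priority first.

    `tiers` is ordered protected-first. The protected tier (rank 0) is
    always included; lower tiers are dropped item by item once the
    budget is exhausted. `count` is a stand-in tokenizer for the sketch.
    """
    included, used = [], 0
    for rank, (name, items) in enumerate(tiers):
        for item in items:
            cost = count(item)
            if rank == 0 or used + cost <= budget:
                included.append(item)
                used += cost
    return included

tiers = [
    ("protected", ["system prompt", "current query"]),
    ("high", ["tool result A", "retrieved doc B"]),
    ("low", ["old turn 1", "old turn 2"]),
]
print(trim_to_budget(tiers, budget=8))
# → ['system prompt', 'current query', 'tool result A']
```

A real implementation would compress items (summarize, elide) before dropping them outright, but the ordering discipline is the same.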

This tiered approach also reveals where budget is actually going. Most teams, once they instrument this, discover that RAG retrieval is their largest variable cost — not conversation history. Retrieving 8 chunks at 512 tokens each per query adds up faster than multi-turn history in typical chat applications.

Summarize Before You Truncate

When conversation history grows beyond your medium-priority budget, the right move is summarization, not truncation. Truncation is destructive — you lose information that may be referenced later. Summarization compresses it.

The pattern that works in production:

  1. After every N turns (typically 8-12), trigger a background summarization pass
  2. Replace raw history older than N turns with a structured summary
  3. Inject the summary as a system message at the start of the context
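The cycle might look like the following sketch, where `summarize` stands in for the background LLM call (an assumption of this sketch, not a specific API):

```python
def compact_history(history: list[dict], keep_turns: int,
                    summarize) -> list[dict]:
    """Replace messages older than the last `keep_turns` with a single
    structured-summary system message. `summarize` is the stand-in for
    the background summarization LLM call."""
    if len(history) <= keep_turns:
        return history  # nothing old enough to compress yet
    old, recent = history[:-keep_turns], history[-keep_turns:]
    summary = {"role": "system", "content": summarize(old)}
    return [summary] + recent

history = [{"role": "user", "content": f"turn {i}"} for i in range(12)]
compacted = compact_history(
    history, keep_turns=8,
    summarize=lambda msgs: f"{len(msgs)} older messages summarized",
)
print(len(compacted))  # 9: one summary message plus 8 recent turns
```

Running this in the background (rather than inline on the request path) keeps the summarization call off your user-facing latency.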

A structured summary outperforms raw concatenation significantly. Instead of preserving dialogue verbatim, store the semantically meaningful residue:

User context:
- Building a Rust CLI tool for log parsing
- Has ~3 years of Rust experience
- Running on macOS, targets Linux production

Decisions made:
- Using clap for argument parsing (rejected structopt as deprecated)
- Async I/O via tokio, not blocking reads
- Output format: JSON to stdout

Current task: Implementing the --filter flag with regex support

This 80-token summary replaces what was originally 2,000 tokens of dialogue. The model retains everything it needs for coherent continuation. A 2025 analysis of production conversational agents found that structured summaries over 8-12 turns reduce per-turn token usage by 40-60% with negligible accuracy loss.

The tradeoff: one additional LLM call per summarization cycle. For most applications, this is a worthwhile investment — the reduction in per-turn input costs exceeds the summarization overhead after a few turns.

Structure Context to Exploit KV Caching

Modern LLM inference caches the key-value tensors computed for prefixes that repeat across requests. When multiple requests share the same prefix — same system prompt, same few-shot examples — the cache is hit and you avoid recomputing those tokens. This translates directly to lower latency and, on infrastructure you control, lower compute cost.

To exploit this:

  1. Put static content first: System prompt, examples, and boilerplate should appear before dynamic content. The cache key is prefix-based — any change to early tokens invalidates the cache for everything after.

  2. Separate static from dynamic: Don't interleave boilerplate with per-request content. Keep the stable prefix as long as possible.

  3. Route consistently: On self-hosted infrastructure with vLLM or similar, requests sharing a prefix should route to the same GPU worker. Cross-worker cache reuse is limited.

Production teams using prefix-aware routing on read-heavy workloads (document Q&A, customer support) commonly report 60-80% reductions in input token computation. Thomson Reuters' ML team documented 60% cost reduction on legal document summarization by caching the boilerplate legal context prepended to every query.

For hosted APIs (OpenAI, Anthropic), prefix caching is automatic — but you still need to structure prompts so the cacheable prefix is stable. Placing a timestamp or per-request ID early in the prompt defeats caching.
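One way to keep the cacheable prefix stable is to make the ordering explicit in the prompt builder. A minimal sketch (the field names and separator are illustrative):

```python
def build_prompt(static_prefix: str, examples: list[str],
                 dynamic: str, request_id: str) -> str:
    """Order content so the cacheable prefix stays byte-identical across
    requests: stable parts first, per-request parts last."""
    return "\n\n".join([
        static_prefix,                 # system prompt: never changes
        *examples,                     # few-shot examples: never change
        dynamic,                       # user query: changes every request
        f"request-id: {request_id}",   # volatile metadata goes LAST
    ])

p1 = build_prompt("You are a support agent.", ["Q: hi\nA: hello"],
                  "User question one", "req-1")
p2 = build_prompt("You are a support agent.", ["Q: hi\nA: hello"],
                  "User question two", "req-2")
# Both prompts share the same byte-identical prefix, so the KV cache
# computed for p1's static portion is reusable for p2.
```

Had the request ID been placed first, the two prompts would diverge at byte one and the cache would never hit.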

Build a Token Budget Middleware Layer

The strategies above don't work in isolation — they need to be applied systematically. The right architecture is a token budget middleware that intercepts context before it reaches the model:

class TokenBudgetManager:
    def __init__(self, model: str, context_limit: int, output_reserve: int = 2000):
        self.model = model
        self.budget = context_limit - output_reserve

    def prepare_context(
        self,
        system_prompt: str,
        current_query: str,
        history: list[dict],
        retrieved_docs: list[str],
    ) -> list[dict]:
        # count_tokens, count_tokens_for_message, and build_messages are
        # the model-specific helpers sketched earlier

        # Protected: always include
        protected_tokens = count_tokens(system_prompt + current_query, self.model)
        remaining = self.budget - protected_tokens

        # High priority: recent retrieved docs (~30% of remaining budget)
        doc_tokens = 0
        included_docs = []
        for doc in retrieved_docs:
            t = count_tokens(doc, self.model)
            if doc_tokens + t < remaining * 0.30:
                included_docs.append(doc)
                doc_tokens += t

        # Medium priority: recent history (~25% of remaining budget)
        history_tokens = 0
        included_history = []
        for turn in reversed(history[-10:]):  # walk from most recent
            t = count_tokens_for_message(turn, self.model)
            if history_tokens + t < remaining * 0.25:
                included_history.insert(0, turn)  # restore chronological order
                history_tokens += t

        # Low priority: older summarized history
        # (summary was computed in background, inject as system message)

        return build_messages(system_prompt, included_docs, included_history, current_query)

This is roughly 50 lines of infrastructure that prevents the entire class of silent context failures. Production engineering teams consistently report that implementing explicit budget middleware is among the highest-leverage reliability improvements they've made to LLM systems.

When to Retrieve Instead of Accumulate

For knowledge-heavy tasks, the question isn't "how do I fit everything in context" — it's "how do I retrieve only what's relevant." Stuffing large documents into context is expensive, and the lost-in-the-middle effect means retrieval quality degrades as document count increases.

The break-even point, in practice: if the source material is larger than 10-15K tokens, retrieval is almost always more cost-effective and often more accurate than full-context inclusion. For materials under 5K tokens, full inclusion is simpler and avoids retrieval errors.
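Those break-even numbers can be encoded as a simple routing heuristic — a sketch of a rule of thumb, not a hard rule:

```python
def context_strategy(source_tokens: int) -> str:
    """Route between full-context inclusion and retrieval based on the
    break-even points above. The middle band is a judgment call:
    measure both approaches for your workload."""
    if source_tokens < 5_000:
        return "full-context"   # small enough: inclusion is simpler
    if source_tokens > 15_000:
        return "retrieval"      # retrieval is cheaper and usually more accurate
    return "either"             # gray zone: benchmark on your own data
```

In practice you would also weigh query patterns — a corpus queried narrowly favors retrieval even below the threshold, while one where every query touches most of the material favors full inclusion.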

One nuance from production deployments: retrieval and long-context models complement each other. Retrieval narrows to the most relevant chunks; a capable long-context model can then reason over slightly more context than pure RAG would provide, catching nuances that retrieval misses. The combination outperforms either approach alone on complex reasoning tasks.

Monitor and Alert on Budget Utilization

Token budgets need observability. At minimum, track:

  • Input tokens per request: Average and P95, broken down by source (system prompt, retrieval, history, tool results)
  • Context utilization rate: What fraction of your budget is being consumed
  • Summarization trigger rate: How often you're compressing history
  • Cache hit rate: For prefix caching, what fraction of requests are hitting the cache

Alert when:

  • Average input token count exceeds 70% of your budget (approaching context rot territory)
  • P95 input token count exceeds 90% of your budget (approaching overflow)
  • Summarization trigger rate spikes (signals conversation length distribution has shifted)
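The first two alert checks fit in a few lines using Python's standard statistics module (thresholds taken from the list above; the sample window is an assumption):

```python
from statistics import quantiles

def budget_alerts(input_token_samples: list[int], budget: int) -> list[str]:
    """Flag the two threshold conditions: mean above 70% of budget
    (context-rot territory) and P95 above 90% (approaching overflow)."""
    alerts = []
    mean = sum(input_token_samples) / len(input_token_samples)
    p95 = quantiles(input_token_samples, n=100)[94]  # 95th percentile
    if mean > 0.70 * budget:
        alerts.append(f"mean utilization {mean / budget:.0%} exceeds 70%")
    if p95 > 0.90 * budget:
        alerts.append(f"p95 utilization {p95 / budget:.0%} exceeds 90%")
    return alerts
```

In production this would run over a sliding window of recent requests and feed whatever alerting system you already use; the thresholds are the ones from the list, not universal constants.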

These metrics frequently surface surprises. Teams often discover that a single tool — a code execution result, a large retrieved document — is responsible for 40% of their total input token spend. Fixing that one source often has more impact than optimizing the entire conversation history pipeline.

Putting It Together

The practical approach to token budget management:

  1. Instrument first — count tokens accurately, by source, before optimizing
  2. Set explicit budget tiers — protect critical content, prioritize the rest
  3. Summarize at scale — compress conversation history after N turns with structured summaries
  4. Structure for caching — put stable content first, route consistently
  5. Retrieve don't accumulate — use RAG for large knowledge bases
  6. Monitor utilization — alert before you're in the failure zone, not after

The teams building reliable multi-turn agents in production have all converged on some version of this stack. The specific numbers vary — your ideal summarization interval and budget allocation depend on your use case — but the underlying discipline is universal. Context is not free, and treating it as unlimited is how you end up debugging hallucinations that only appear after 15 turns.
