Context Compression Changes What Your Model Actually Sees
When your API costs spike and someone suggests "just compress the context," the pitch sounds clean: feed fewer tokens in, pay less, get equivalent output. LLMLingua benchmarks show 20x compression on math reasoning with only 1.5% accuracy loss. What's not to like?
The problem is that those benchmarks measure what the compressed context scores on clean, curated test sets. They don't measure what happens when your agent quietly drops the constraint it was given three turns ago, or resolves a pronoun to the wrong entity, or confabulates an exact file path because the original tool output was summarized away. Context compression doesn't just reduce tokens — it changes what your model actually sees. And the gaps between the original context and the compressed version are reliably where your system will fail.
How Token Pruning Decides What Stays
The dominant family of prompt compression tools — LLMLingua, LLMLingua-2, and their descendants — uses a small language model to judge which tokens carry information. The basic intuition: if a small model finds a token entirely predictable given what came before, that token is informationally redundant and can be dropped. Tokens that surprise the small model (high self-information score) are retained.
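The scoring idea fits in a few lines. This toy sketch uses smoothed bigram counts as a stand-in for the small LM (real implementations run a GPT-2-class model); each token is scored by how surprising it is given the previous token, and the least surprising tokens are dropped first:

```python
import math
from collections import Counter

def train_counts(corpus_tokens):
    """Bigram statistics standing in for a small LM's predictions."""
    unigrams = Counter(corpus_tokens)
    bigrams = Counter(zip([None] + corpus_tokens[:-1], corpus_tokens))
    return unigrams, bigrams

def surprisal_scores(tokens, unigrams, bigrams, vocab_size):
    """Score each token by -log2 P(token | previous token), add-one smoothed.
    High surprisal = informative; low surprisal = candidate for pruning."""
    scores, prev = [], None
    for tok in tokens:
        num = bigrams[(prev, tok)] + 1
        den = unigrams[prev] + vocab_size
        scores.append(-math.log2(num / den))
        prev = tok
    return scores

def prune(tokens, scores, keep_ratio):
    """Keep the highest-surprisal tokens, preserving original order."""
    k = max(1, int(len(tokens) * keep_ratio))
    cutoff = sorted(scores, reverse=True)[k - 1]
    return [t for t, s in zip(tokens, scores) if s >= cutoff]

corpus = "the service is up the service is up the service is up".split()
unigrams, bigrams = train_counts(corpus)
scores = surprisal_scores("the service crashed".split(),
                          unigrams, bigrams, len(set(corpus)))
pruned = prune("the service crashed".split(), scores, keep_ratio=0.7)
# "crashed" is surprising and survives; "service", highly predictable
# after "the", is pruned first
```

Even this toy exposes the core behavior: predictability, not importance, decides survival, which is exactly what the failure modes below exploit.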
LLMLingua-1 (2023) runs a causal small LM like GPT-2 or LLaMA-7B over the prompt, computing per-token perplexity. Its budget controller allocates different compression ratios to different prompt sections — instructions get 10–20% compression, few-shot examples get 60–80%, and the query itself gets minimal trimming. This structure is sound: not all prompt text carries equal weight.
LLMLingua-2 (2024) reframes the problem as binary token classification rather than perplexity ranking. It uses a bidirectional encoder (XLM-RoBERTa-large) that sees full context in both directions when scoring each token — fixing the fundamental flaw in causal models, which only see left context. The training signal comes from GPT-4-generated compressed texts with strict constraints: only remove words (no rewriting), preserve order, no new content added. The result is a model that's 3–6x faster than LLMLingua-1, uses far less memory, and scores substantially better on QA benchmarks at equivalent compression ratios.
The headline numbers are genuinely impressive. LLMLingua-2 hits 79% exact match on GSM8K math reasoning at 5x compression (uncompressed baseline: 78.85%). It preserves chain-of-thought reasoning at 14x compression with less than 2% accuracy drop. For RAG workflows with multi-document retrieval, the LongLLMLingua variant adds question-aware compression — ranking retrieved passages by joint perplexity with the query rather than compressing uniformly — and reports a 21-point improvement on multi-document QA tasks at 4x compression.
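The question-aware idea is worth sketching, because it differs structurally from uniform compression. This toy ranks passages by lexical overlap with the query, a crude stand-in for LongLLMLingua's query-conditioned perplexity ranking, and keeps the best passages within a token budget instead of thinning every passage equally:

```python
def select_passages(passages, question, budget_tokens):
    """Question-aware selection sketch: rank passages by relevance to the
    query, then keep whole passages greedily within a token budget."""
    q_tokens = set(question.lower().split())

    def relevance(p):
        toks = p.lower().split()
        return sum(t in q_tokens for t in toks) / len(toks)

    kept, used = [], 0
    for p in sorted(passages, key=relevance, reverse=True):
        n = len(p.split())
        if used + n <= budget_tokens:
            kept.append(p)
            used += n
    return kept

docs = ["the database timeout is 30 seconds",
        "the weather in the office was nice today"]
kept = select_passages(docs, "what is the database timeout", budget_tokens=6)
```

The design point is that budget is spent on whole relevant passages rather than spread across all of them, which is why the question-aware variant wins on multi-document QA.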
These are real gains. But the benchmarks measure clean, factual QA with well-formed retrieved documents. They don't surface the failure modes that show up in production.
Three Failure Modes That Benchmarks Don't Catch
Lost Anaphora Chains
Token pruning is agnostic to coreference. When a noun phrase like "the authentication service" appears early and is followed by repeated references to "it" and "the service," the pruner sees those later pronouns as informationally cheap — the small model predicts them with high confidence from the surrounding context. The pronouns stay. The antecedent noun phrase, meanwhile, becomes predictable once the topic is established, so at aggressive ratios or on repeated compression passes it is exactly the kind of span that gets pruned.
The downstream model then encounters "it failed to connect" with no recoverable referent. In a multi-agent or multi-step pipeline where early context establishes which entity is being discussed, this failure silently produces wrong answers that look fully coherent. The model doesn't flag uncertainty — it resolves the pronoun to whatever entity is most salient in the surviving context.
Dropped Constraint References
Behavioral constraints and permissions stated early in system prompts have low perplexity by the standards of a small language model trained on general text. "Do not use library X," "the user is on a free plan," "output must be valid JSON" — these patterns are common in instruction-tuned data, so a small LM finds them predictable. At aggressive compression ratios (8x and above), early constraint statements are among the first candidates for pruning.
The downstream model then operates without the constraint it was given. In a customer-facing agent, this might mean surfacing paid-tier features to free users. In an agentic coding workflow, it might mean importing a forbidden dependency. The output is coherent and confident because the model simply doesn't know the rule was there.
This is especially dangerous in multi-turn sessions where context is compressed iteratively. Constraints present in the original system prompt may survive the first compression pass but get progressively thinned on subsequent passes as conversation history grows.
Tool Output Hallucination
When an agent's tool results are summarized rather than retained verbatim, the summarizing model introduces fabricated specifics. File paths get approximated. Line numbers shift. Error codes get generalized. API response fields get merged or dropped.
This is categorically different from the accuracy loss on benchmark QA tasks. A QA benchmark checks whether the compressed context supports the correct factual answer. It doesn't test whether exact values — strings, identifiers, structured data — survive compression intact. Research evaluating compression strategies for long-running agents found that artifact tracking (knowing which files were modified, which tests passed, which errors occurred) was consistently the weakest dimension across all compression methods, scoring 2.19–2.45 out of 5.0 regardless of whether extractive pruning or abstractive summarization was used.
Abstractive summarization (using an LLM to rewrite and condense rather than prune tokens) makes this worse. It produces fluent output but replaces source facts with the summarizing model's internal knowledge. Each compression pass accumulates small distortions that compound over a long session.
Validating Compression Before It Ships
The validation methodology that catches compression-induced degradation requires testing dimensions that standard NLP metrics miss:
Constraint retention testing: Include behavioral constraints in the prompt under test — explicit rules about output format, feature flags, permissions. After compression, measure whether the model violates constraints it was given. This catches the dropped constraint failure mode directly.
Coreference stress tests: Construct prompts where important entities are introduced early and referenced by pronoun later. After compression, probe whether the model resolves references correctly. Permutation testing — varying which entities appear where — reveals positional sensitivity. A system that gets the right answer when "the authentication service" appears in the first sentence but the wrong answer when it appears in the third is exhibiting positional fragility, not robustness.
Verbatim value preservation: For agentic workflows, inject tool outputs containing exact values — file paths, numbers, API responses, error codes. After compression, query the model about those exact values. Compare against the uncompressed baseline. Any divergence on exact values is a signal that the compression method is inappropriate for that workflow.
Shadow deployment: Run compressed and uncompressed contexts in parallel for a sample of real production queries before full cutover. Compare task completion rate, not just output quality scores. Measure re-fetch frequency — how often does the agent need to re-read a file or re-query a tool it already queried, because the relevant information was compressed away?
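A minimal harness for the first three tests diffs compressed behavior against the uncompressed baseline on targeted probes. In this sketch, `answer` and `compress` are toy stand-ins (the "model" echoes the first matching context line, the "compressor" drops the first line, and the file path is made up); in practice they would be your model call and the compressor under test:

```python
def validate_compression(compress, answer, probes):
    """Run each probe against uncompressed and compressed context; any
    divergence from the uncompressed baseline is recorded as a failure."""
    failures = []
    for ctx, question, kind in probes:
        baseline = answer(ctx, question)
        compressed = answer(compress(ctx), question)
        if compressed != baseline:
            failures.append((kind, question, baseline, compressed))
    return failures

def toy_answer(ctx, question):
    for line in ctx.splitlines():
        if question in line:
            return line
    return "unknown"

def toy_compress(ctx):
    return "\n".join(ctx.splitlines()[1:])  # drops the first context line

ctx = "path: /srv/app/config.yaml\nplan: free\nstatus: ok"
probes = [(ctx, "path", "verbatim-value"),
          (ctx, "plan", "constraint")]
failures = validate_compression(toy_compress, toy_answer, probes)
# the verbatim path probe diverges; the constraint probe survives
```

The key property is that the baseline is the uncompressed system itself, not a gold label, so the harness catches regressions the compressor introduces without needing annotated data.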
The last metric matters because of a counterintuitive finding: the obvious optimization target, tokens per request, is the wrong objective. Research on long-running coding agents found that aggressive compression that forces re-fetching increases total tokens to task completion while appearing to reduce per-request costs. The correct metric is tokens consumed to successfully complete a task.
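The distinction is easy to make concrete. With hypothetical per-request token counts, a compressed strategy that forces one extra re-fetch can cost more end to end despite every individual request looking cheaper:

```python
def tokens_to_completion(requests):
    """Total prompt + completion tokens across every request the agent
    made before the task succeeded -- the metric that actually matters."""
    return sum(p + c for p, c in requests)

# hypothetical episodes: (prompt_tokens, completion_tokens) per request
uncompressed = [(1500, 100)]                        # one big request
compressed = [(500, 100), (500, 100), (600, 100)]   # re-fetches add requests

largest_request = max(p for p, _ in compressed)     # 600 < 1500: looks cheaper
total_compressed = tokens_to_completion(compressed)       # 1900
total_uncompressed = tokens_to_completion(uncompressed)   # 1600
```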
Setting the Compression Budget
The compression budget is the explicit allocation of how aggressively to compress each section of a prompt, with accuracy constraints that trigger when you've compressed too far.
A practical starting point by section type:
- System prompt containing constraints or permissions: 0–15% compression maximum. These are load-bearing. The cost of retaining them verbatim is far lower than the cost of a constraint violation.
- Few-shot examples: 60–80% compression is generally safe. Examples carry structural information (format, reasoning pattern) that survives aggressive pruning.
- RAG-retrieved documents: 3–5x compression for well-formed factual documents. Go to 8x only with careful validation.
- Conversation history with resolved topics: 50–70% compression on turns that don't contain unresolved references. Zero compression on the most recent 2–3 turns.
- Tool outputs containing exact values: Do not summarize. If you need to reduce size, truncate entire tool responses rather than compressing them — retain the ones that matter verbatim, drop the ones that don't.
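One way to encode and enforce these section budgets is a greedy allocator that removes tokens from the most compressible sections first. The section names and a simple cap-based API are illustrative, not from any library; the ratios follow the guidance above:

```python
# max fraction of each section's tokens that may be removed
SECTION_BUDGET = {
    "system_constraints": 0.15,
    "few_shot": 0.80,
    "rag_docs": 0.75,          # ~4x compression
    "resolved_history": 0.70,
    "recent_turns": 0.00,
    "tool_outputs": 0.00,      # never summarized; drop whole responses instead
}

def allocate_removal(section_tokens, target_remove):
    """Greedily remove tokens from the most compressible sections first,
    never exceeding any section's cap. Returns per-section removals and
    any shortfall that could not be met within the budgets."""
    removed, remaining = {}, target_remove
    for name, tokens in sorted(section_tokens.items(),
                               key=lambda kv: SECTION_BUDGET[kv[0]],
                               reverse=True):
        cap = int(tokens * SECTION_BUDGET[name])
        removed[name] = min(cap, remaining)
        remaining -= removed[name]
    return removed, remaining

plan, shortfall = allocate_removal(
    {"system_constraints": 200, "few_shot": 1000,
     "rag_docs": 2000, "tool_outputs": 500},
    target_remove=2000)
```

A nonzero shortfall is the signal to truncate whole tool responses or re-scope the task rather than push any section past its cap.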
Dynamic budget allocation outperforms static budgets. Setting the budget to a fixed ratio regardless of query complexity is both wasteful on simple queries and harmful on complex ones. Research on budget-aware reasoning found that dynamic allocation by query complexity achieves 68% average token reduction with less than 5% accuracy loss, while static low budgets paradoxically increase token usage (the model hedges and generates verbose uncertainty reasoning when the budget is too constrained).
KV Cache Compression: A Different Layer
Token pruning operates on the text before inference. KV cache compression operates inside the attention mechanism, evicting stored key-value pairs during the forward pass. They address related problems through different mechanisms, and the failure modes differ.
SnapKV (2024) selects important key-value positions per attention head without fine-tuning. At 8x KV cache compression on Qwen-7B, it achieves 3.6x faster generation and 8.2x lower memory usage, enabling 380K-token contexts on a single 80GB GPU. ChunkKV extends this to semantic chunks rather than individual tokens, preserving linguistic structures that token-level pruning fragments.
KV cache compression doesn't hallucinate in the same way abstractive summarization does — it's not rewriting text. But it does create positional and attention-pattern artifacts. When attention heads don't see the full KV history, they weight surviving positions more heavily, which can amplify the "lost in the middle" effect: content near the start and end of the surviving cache receives disproportionate attention relative to what was evicted from the middle.
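The selection mechanism behind this family can be sketched abstractly. Following SnapKV's idea in a simplified single-head form (this is an illustration of the voting scheme, not the paper's implementation), recent queries vote for which past key positions matter, and the top-voted prefix positions plus a recent window survive eviction:

```python
def select_kv_positions(attn_weights, window, keep):
    """Keep the `keep` most-attended prefix key positions plus the most
    recent `window` positions. attn_weights[q][k] is the attention weight
    from one of the last `window` queries to past key position k."""
    n_keys = len(attn_weights[0])
    votes = [0.0] * n_keys
    for row in attn_weights:          # accumulate votes across the window
        for k, w in enumerate(row):
            votes[k] += w
    prefix = range(n_keys - window)
    top = sorted(prefix, key=lambda k: votes[k], reverse=True)[:keep]
    recent = range(n_keys - window, n_keys)
    return sorted(set(top) | set(recent))

# two recent queries voting over six key positions
attn = [[0.1, 0.5, 0.1, 0.1, 0.1, 0.1],
        [0.1, 0.4, 0.1, 0.1, 0.2, 0.1]]
kept = select_kv_positions(attn, window=2, keep=1)  # position 1 wins the vote
```

Everything evicted here is gone for all subsequent decoding steps, which is the source of the attention-pattern artifacts described above: surviving positions absorb the weight the evicted middle once held.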
When Not to Compress
Given the failure modes, the clearest guidance is knowing when not to compress at all:
Multi-step reasoning chains: When intermediate steps reference each other by content ("as computed above," "using the result from step 2"), compression breaks the reference chain. The model loses track of its own reasoning.
Security and access control contexts: Any prompt containing access policies, permission checks, or authentication state should be treated as constraint-bearing and compressed minimally. Silent permission drops are harder to detect than factual errors.
Early in a session: Context is still small. The overhead of running a compressor exceeds the token savings. Apply compression only when context genuinely approaches limits or costs materially.
Verbatim-sensitive workflows: Anything involving code, structured data, error messages, API responses, or numerical values is better served by verbatim compaction (retaining some turns intact and dropping others entirely) rather than compressing all turns partially.
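Verbatim compaction is simple to sketch: each turn is kept whole or dropped whole, never partially rewritten. Here load-bearing turns are identified by a caller-supplied predicate (an assumption of this sketch, with a made-up error line as sample data), and recency breaks ties:

```python
def compact_turns(turns, budget_tokens, is_load_bearing):
    """Keep whole turns verbatim or drop them entirely -- never partially
    compress. Load-bearing turns win over recency; recency breaks ties."""
    order = sorted(enumerate(turns),
                   key=lambda it: (is_load_bearing(it[1]), it[0]),
                   reverse=True)
    kept, used = set(), 0
    for i, t in order:
        n = len(t.split())
        if used + n <= budget_tokens:
            kept.add(i)
            used += n
    return [t for i, t in enumerate(turns) if i in kept]

history = ["chat about the weather and other small talk",
           "error: ECONNREFUSED in db.py line 42",
           "ok sounds good"]
compacted = compact_turns(history, budget_tokens=10,
                          is_load_bearing=lambda t: t.startswith("error"))
```

The error line survives byte-for-byte, and the small talk is dropped entirely; no turn passes through a summarizer that could distort its exact values.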
A Practical Architecture for Long-Running Agents
The structure that holds up across long agent sessions separates compressible from non-compressible context:
- Pinned section (never compressed): system prompt, active constraints, current task specification
- Structured ledger (append-only, never summarized): artifact modification log, tool call results with exact values, decisions made
- Compressible history: general conversation turns, resolved sub-tasks, exploration that's been completed
- Retrieved context (compressed with question-awareness): documents and passages from retrieval, compressed relative to the current query
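The separation above can be expressed as a minimal data structure (class and method names are illustrative, as is the sample ledger entry):

```python
from dataclasses import dataclass, field

@dataclass
class AgentContext:
    """Split context: pinned and ledger sections sit outside the
    compression budget; only history is ever handed to a compressor."""
    pinned: list = field(default_factory=list)    # constraints, task spec
    ledger: list = field(default_factory=list)    # append-only, verbatim
    history: list = field(default_factory=list)   # compressible turns

    def record(self, entry):
        """Tool results and decisions land in the ledger, never summarized."""
        self.ledger.append(entry)

    def render(self, compress_history):
        """Assemble the prompt; the compressor only ever sees `history`."""
        return "\n".join(self.pinned + self.ledger
                         + compress_history(self.history))

ctx = AgentContext(pinned=["RULE: output must be valid JSON"])
ctx.record("modified: src/handlers.py; tests: 14 passed")
ctx.history = ["turn 1", "turn 2", "turn 3"]
prompt = ctx.render(lambda h: h[-1:])  # toy compressor: keep last turn only
```

Because the compressor's type signature only admits the history list, constraint drops and ledger distortions become impossible by construction rather than merely unlikely.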
The insight is that not all context is context in the same sense. Constraints and exact values are load-bearing in a way that conversational history is not. A compression system that treats them uniformly will reliably fail on the load-bearing portions while correctly handling the rest.
Research across frontier models (GPT-4.1, Claude Opus, Gemini 2.5) found measurable accuracy degradation in coding agents past the 35-minute mark — not because the context window was exhausted, but because of compounding noise in iterative compression and the "context rot" that accumulates as relevant information drifts toward the middle of a growing window. Structured memory that pins the load-bearing content outside the compression budget is what prevents that degradation curve.
Compression gives you real cost and latency improvements. But the gains come from compressing the right things. The failure modes are predictable, the validation methodology is tractable, and the budget allocation framework isn't complicated — it just requires knowing which parts of your prompt are load-bearing before you start pruning.
