Context Compression Artifacts: What Your Summarization Middleware Is Silently Losing
Your agent said "Do NOT use eval()" at turn three. By turn thirty, it called eval(). Your insurance processor said "Never approve claims without valid ID." After fifteen compression cycles, it approved one. These aren't model failures — they're compression failures. The agent's reasoning was fine. The summarization middleware threw away the one constraint that mattered.
Context compression is now standard infrastructure in long-running agent systems. When conversation history grows too large for the context window, you compress it — roll up older turns into a summary, trim, chunk, or distill. The problem is that modern summarizers don't destroy information randomly. They destroy it predictably, along specific fault lines, and most teams only discover those fault lines in production.
Why Summarization Is a Lossy Transform, Not a Distillation
The framing of "summarization" encourages a false analogy: you go from a longer thing to a shorter thing that captures the same content. In practice, LLM summarizers don't compress; they rewrite. They generate semantically plausible prose from the source content, keeping whatever the model finds salient. And what the model finds salient is optimized for general fluency, not for what your specific downstream task needs to recover.
This creates a critical asymmetry: the summarizer doesn't know what future queries will ask. A summarization pass prioritizes high-frequency mentions and prominent entities — topics that appear often, concepts that take up space. Rare but critical information gets suppressed because it scores low on implicit salience signals.
The multi-pass problem compounds this. Roll up 10 turns into a summary, then roll that summary into another summary a dozen turns later, and you're not running one lossy transform — you're running a cascade of them. Precision degrades toward approximation. "Exactly 512 records" becomes "about 500 records" becomes "several hundred records." Negations don't just degrade — they invert. "User prefers Python but Rust is a backup" becomes "User prefers Python" becomes "User knows Python." After three cycles, the constraint has semantically reversed.
The positional bias of transformer-based summarizers makes this worse. Content at the beginning of context receives more attention weight. Information buried in the middle of a long conversation — the part most likely to contain detailed constraints established early in a task — gets the least weight and drops first.
What Gets Dropped First
Research across production agent traces and compression benchmarks consistently identifies the same categories as the first casualties of summarization:
Negations and constraints. These are sparse by nature: a constraint like "never delete without a backup" might appear once in a long session. Summarizers assign low salience to low-frequency content. The result: the constraint is either dropped entirely or rephrased in a way that loses its mandatory character. "ID verification is important" is not the same as "Do not proceed without verified ID." One is advisory; the other is a hard stop.
Conditional dependencies. "If the API returns a 429, retry with exponential backoff up to 5 attempts, then escalate" requires three facts to survive in the right relationship. A summarizer might preserve all three as separate facts — but lose the conditional structure binding them. The agent then treats them as independent suggestions rather than an ordered decision tree.
Exact numerical values. Approximation is the default loss mode for numbers. Thresholds, limits, and counts are particularly vulnerable: "max 100 records per batch" is operationally different from "batches should be reasonably sized," but the latter is what summarization frequently produces. In rate-limited or quota-sensitive systems, this matters immediately.
Ordering and causality. Task execution sequences flatten into unordered collections of steps. If "step 3 requires step 2 to complete first" was implicit in the original instructions, that dependency rarely survives compression. The agent will rediscover it the hard way.
Tool-output attribution. When an agent queries a database and gets a specific result set, that result gets logged in context. After summarization, it becomes "the agent retrieved customer records." Which records? When? What was the query? That metadata — the chain of evidence connecting observation to source — vanishes. On the next turn, the agent may confabulate specifics, not from hallucination in the usual sense, but because the ground truth was removed from its context.
Rejection history. "User rejected approach A because of latency concerns" is a compound fact: the rejection, the rejected option, and the reason. Summarizers frequently preserve the rejection but drop the reason, or preserve the option name but lose that it was rejected. Either error sends the agent back toward the same dead end.
The 65% Problem
The production impact is significant. Analysis of enterprise AI agent failures finds that roughly 65% are attributable to context drift and memory loss — not raw context exhaustion. Teams run out of context window less often than they run out of accurate context window. The agent has tokens to spare, but those tokens contain degraded information that leads it astray.
In multi-turn task benchmarks, agents achieve around 58% success on single-turn queries and drop to 35% on multi-turn tasks. Adding naive compression improves that somewhat — but introduces a new failure mode: the agent forgets early-session requirements before reaching the goal. The improvement is real but incomplete. And critically, the failure mode shifts from "out of context" to "wrong context," which is harder to detect because the agent continues confidently.
Context loss by itself accounts for 3.33% of traced agent failures in structured analysis of long-horizon task trajectories. But that understates the actual contribution, because context-quality degradation shows up as downstream failures — wrong API calls, constraint violations, repeated work — that get attributed to model errors rather than memory errors.
What Must Survive Verbatim
If you're building or deploying context compression, certain categories of information need to be pinned outside the summarization pass:
Explicit constraints with negations. Any statement that contains "not," "never," "do not," or "must not" should be treated as load-bearing until proven otherwise. The cost of preserving a stale constraint is low. The cost of discarding an active one is catastrophic. Pin these to a dedicated section that compression never touches.
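A minimal sketch of that pinning step, assuming context is held as a plain list of turn strings; the regex and the prompt layout are illustrative, not from any particular framework:

```python
import re

# Any turn containing a negation cue is treated as load-bearing and kept verbatim.
NEGATION_PATTERN = re.compile(r"\b(not|never|don't|mustn't)\b", re.IGNORECASE)

def extract_constraints(turns: list[str]) -> list[str]:
    """Collect turns that look like hard constraints; err on the side of keeping them."""
    return [t for t in turns if NEGATION_PATTERN.search(t)]

def assemble_context(pinned: list[str], summary: str, recent: list[str]) -> str:
    """Pinned constraints live in a dedicated section the summarizer never rewrites."""
    return "\n\n".join([
        "HARD CONSTRAINTS (verbatim, never summarized):",
        *pinned,
        "SESSION SUMMARY:",
        summary,
        "RECENT TURNS:",
        *recent,
    ])
```

The false-positive cost is accepted deliberately: a few extra pinned sentences are cheap compared to losing one active constraint.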
Numerical thresholds and exact quantities. Extract every precise number — counts, limits, percentages, durations — into a structured parameter store. Don't let prose summarization touch these values. A rule that says "max 100" should survive as max_records: 100, not as natural language that can be paraphrased away.
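As a sketch of what that extraction can look like (the regex and the key-naming scheme are illustrative assumptions):

```python
import re

# Pull exact quantities out of prose into structured parameters before any
# summarization pass touches the text.
QUANTITY_PATTERN = re.compile(
    r"(max|min|limit|at most|at least|up to)\s+(\d+)\s+(\w+)", re.IGNORECASE
)

def extract_parameters(text: str) -> dict[str, int]:
    params = {}
    for qualifier, value, unit in QUANTITY_PATTERN.findall(text):
        key = f"{qualifier.lower().replace(' ', '_')}_{unit.lower()}"
        params[key] = int(value)
    return params

extract_parameters("Process max 100 records per batch, retry up to 5 times.")
# -> {'max_records': 100, 'up_to_times': 5}
```

In practice you would likely pair a pattern pass like this with an LLM extraction pass, but the stored result should always be the structured form, never the paraphrase.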
Goal statements and their refinements. The original task definition, and every revision to it made during the session, should be preserved as an ordered chain. Not summarized. Not merged into a single statement. If the user revised the goal at turn 8 and again at turn 21, you need to know that turn 21's version supersedes turn 8's — and why.
Rejection records. Store rejections as structured tuples: (rejected_option, reason, turn_number). These are small and cheap to preserve, and they prevent the agent from re-proposing discarded approaches without the context of why they were discarded.
Tool output attribution. Every observation that came from an external tool should carry a provenance tag: which tool, what query, what timestamp. When that observation gets referenced later, the agent should be able to trace it back to source rather than confabulating.
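Taken together, the last three categories are cheap to keep as small structured records. A minimal sketch, with hypothetical field names:

```python
from dataclasses import dataclass, field

@dataclass
class GoalRevision:
    turn: int
    statement: str
    reason: str                     # why the goal changed at this turn

@dataclass
class Rejection:
    rejected_option: str
    reason: str
    turn: int

@dataclass
class Observation:
    content: str
    tool: str                       # which tool produced it
    query: str                      # the exact query or call issued
    timestamp: str                  # when it ran

@dataclass
class PinnedMemory:
    goal_chain: list[GoalRevision] = field(default_factory=list)   # ordered, never merged
    rejections: list[Rejection] = field(default_factory=list)
    observations: list[Observation] = field(default_factory=list)

    def current_goal(self) -> str:
        """The latest revision supersedes earlier ones; the chain itself is preserved."""
        return self.goal_chain[-1].statement if self.goal_chain else ""
```

None of this is expensive to store; the point is that it lives outside the summarization pass entirely.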
Evaluating Whether Your Compressor Is Failing
Standard compression evaluation metrics — ROUGE scores, BERTScore, embedding similarity — don't tell you what you actually need to know. A summary can score well on all of these while removing the one constraint that prevents a production incident. Lexical overlap and semantic similarity metrics optimize for prose quality, not for operational fidelity.
Probe-based evaluation is more reliable. After compression, ask the agent specific questions that require it to have retained the critical information:
- "What is the maximum batch size?" (tests numerical retention)
- "Can you delete this record?" (tests negation/constraint retention)
- "Why did we reject approach A earlier?" (tests rejection history)
- "Which files have you edited so far?" (tests tool-output attribution)
- "What's the next step in the plan?" (tests sequencing and causality)
If the agent answers these correctly after compression, your compressor preserved what matters. If it answers them with fabricated or degraded information — even if its prose is fluent and confident — you have a compression artifact problem. Task success rate on these probe questions is a better proxy for compression quality than any similarity metric.
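A sketch of how probe scoring can be wired up, assuming an `ask(context, question)` helper that runs your model over a given context; the probes and the substring check are deliberately simple placeholders:

```python
# Each probe pairs a question with a token the correct answer must contain.
PROBES = [
    ("What is the maximum batch size?", "100"),              # numerical retention
    ("Can you delete this record without a backup?", "no"),  # constraint retention
    ("Why did we reject approach A earlier?", "latency"),    # rejection history
]

def probe_score(ask, compressed_context: str) -> float:
    """Fraction of probes the agent still answers correctly after compression."""
    correct = 0
    for question, expected in PROBES:
        answer = ask(compressed_context, question)
        correct += expected.lower() in answer.lower()
    return correct / len(PROBES)
```

Run the same probes against full and compressed context; the gap between the two scores is your compression loss on the dimensions you actually care about.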
The round-trip test is the most rigorous version: run a multi-turn task to completion with full context, then replay the same task with compressed context, and measure how often the compressed version produces identical decisions. Divergences reveal exactly which information loss caused which downstream error.
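A minimal version of that comparison, assuming a `run_task(task, context)` harness that returns the ordered list of decisions the agent made (tool calls, approvals, and so on):

```python
def round_trip_divergence(run_task, task, full_context, compressed_context):
    """Replay the task under both contexts and report the steps where they diverge."""
    full_decisions = run_task(task, full_context)
    compressed_decisions = run_task(task, compressed_context)
    return [
        (step, full, compressed)
        for step, (full, compressed) in enumerate(zip(full_decisions, compressed_decisions))
        if full != compressed
    ]
```

Each returned tuple points at the step where compression changed the agent's behavior, which is exactly the trace you need to work out what was lost.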
Production Patterns That Work
The teams with the best results in production have moved away from treating compression as a single operation applied uniformly to all context. The pattern that consistently outperforms rolling summarization is a structured hybrid:
A stable semantics layer holds goal statements, constraints, parameters, and rejection history. This section is never compressed. It grows slightly over the course of a session, but it's small relative to total context and the information in it is irreplaceable.
A structured long-term memory holds compressed representations of past interactions — but as atomic facts (subject-relation-object triples) rather than prose summaries. Atomic facts preserve the precision that prose summarization loses, and they're independently searchable when the agent needs to recover a specific detail.
A high-fidelity short-term buffer holds the last few turns uncompressed. Recent context is always the most relevant to the current step, and compressing it introduces the most immediate risk of operational error.
The result is that aggressive compression only applies to the middle layer — older interactions that are unlikely to be referenced directly but might contribute general context. Everything critical is either pinned or structured. Frameworks built on this architecture have demonstrated 80-90% token cost reduction with measurable improvements in task accuracy, because they're compressing the right content and protecting the rest.
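A compact sketch of that layering, with illustrative names; `extract_facts` stands in for whatever summarizer or triple extractor you use on the middle layer:

```python
class LayeredContext:
    def __init__(self, short_term_window: int = 6):
        self.stable = []        # goals, constraints, parameters, rejections: never compressed
        self.long_term = []     # atomic (subject, relation, object) facts from older turns
        self.short_term = []    # last few turns, kept verbatim
        self.window = short_term_window

    def add_turn(self, turn: str, extract_facts) -> None:
        self.short_term.append(turn)
        if len(self.short_term) > self.window:
            oldest = self.short_term.pop(0)
            # Only the middle layer is ever compressed, and into triples rather than prose.
            self.long_term.extend(extract_facts(oldest))

    def render(self) -> str:
        return "\n\n".join([
            "PINNED:\n" + "\n".join(self.stable),
            "FACTS:\n" + "\n".join(" | ".join(fact) for fact in self.long_term),
            "RECENT:\n" + "\n".join(self.short_term),
        ])
```

The render order also matters: pinned material goes first, where positional attention bias works in its favor rather than against it.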
The Design Choice You're Making
Every time you run a summarization pass on agent context, you're making an implicit choice about which information is worth keeping. The problem is that implicit choices made by a general-purpose LLM summarizer aren't optimized for your specific task, your specific constraints, or the specific failure modes in your system.
Making that choice explicit — by deciding what gets pinned, what gets structured, and what gets compressed — is the actual design work. Rolling summarization isn't a neutral operation. It's a series of editorial decisions about what your agent is allowed to remember. The question is whether those decisions are being made by design or by default.
The teams that have moved to structured context management report not just better accuracy but better debuggability: when an agent makes an error, you can look at the structured memory and see exactly what information it had and didn't have. That's a much better debugging surface than trying to reverse-engineer what a rolling summarizer threw away seven cycles ago.
Context compression is necessary. Context compression that doesn't account for which information must survive verbatim is a liability that compounds with every session turn.
