The Compaction Strategy That Summarized Away the User's Original Question
A user asked our support agent: "Why was invoice INV-2025-08-44719 charged twice on April 3rd?" Forty-five minutes and eighteen tool calls later, the agent confidently reported back: there was no evidence of any duplicate billing on the account that quarter. The user, understandably, escalated. When we replayed the trace, the answer became obvious. The agent had compacted its conversation at turn nine. The summary said the user was "asking about a duplicate charge in early April." It did not contain the string "INV-2025-08-44719." Every subsequent tool call — the ledger lookup, the chargeback API query, the audit log scan — was issued against a paraphrased intent, not the literal invoice number the user typed.
The bug was not in the tools. It was not in the model's reasoning. It was that our context manager had a contract with every downstream component, and nobody had written it down. The contract said: "I will preserve meaning." The components needed: "I will preserve strings."
This is the failure mode that auto-summarization keeps producing in production agents, and it is more pernicious than the more famous failures of compaction. We hear a lot about summaries that drop important facts or that compress away a reasoning step. What we hear less about is summaries that keep all the facts, preserve the conversational arc beautifully, and silently mutate the exact phrasing that a deterministic downstream lookup was keying on. The agent does not get confused. It just searches for the wrong string with full confidence.
The Two Contracts That Compaction Quietly Breaks
Most compaction strategies are implicitly designed around one assumption: that what flows through the agent's context is meaning, and meaning is what needs to be preserved. The summary captures intent. The agent reasons from intent. The tools serve the reasoning. Compress meaning, lose nothing important. That model is fine when every component downstream of the context manager is itself an LLM, because LLMs are robust to paraphrase. The model can read "asking about a duplicate charge in early April" and produce a query that has roughly the right shape.
But the moment any component in the pipeline does a literal lookup — a SQL WHERE invoice_id =, a hash map lookup, a regex match, an exact-string vector retrieval against a BM25 index, a webhook signature verification, an idempotency-key check — that component has a different contract with the context. It does not need meaning. It needs the exact characters the user typed. And nothing in the summarization prompt told the summarizer which strings were load-bearing.
There are really two contracts in play, and they are usually unstated:
- Semantic contract. "I need to know what the user wanted." Held by the planner, the routing logic, the tone-matching, anything that consumes intent. Robust to paraphrase. Robust to summarization.
- Lexical contract. "I need the exact bytes the user provided." Held by every database lookup, every API call with a fixed schema, every cache key, every fingerprint, every audit identifier. Brittle to paraphrase. Destroyed by summarization.
The disaster is that the lexical contract is invisible at design time. The architect drawing the agent diagram sees boxes labeled "summarizer" and "tool executor" and assumes the summary is good enough input for both. They are wrong about the second one, and they will only discover it when a user follows up with "did you actually look at invoice INV-2025-08-44719?" — a question the agent now also cannot answer, because the invoice number is still not in its context window.
Why Claude Code and Codex Diverged Here
This isn't a hypothetical distinction. The two most widely deployed coding agents handle it in opposite ways, and the divergence is instructive. Codex preserves every user message verbatim through compaction. It deletes assistant replies and tool messages, replaces them with a summary, but the original human turns are physically retained in the prompt. Claude Code's /compact and the broader Anthropic compaction beta default to summarizing everything, including user turns, and require the developer to pass explicit instructions to protect specific kinds of content — code snippets, file paths, identifiers.
The Codex design implicitly recognizes the lexical contract: it assumes that whatever the user typed might be load-bearing for a future tool call, so it never paraphrases it. The cost is token efficiency. Long sessions with chatty users keep paying for every original message indefinitely. The Anthropic design optimizes the other way: it assumes the developer knows which strings are sacred and will tell the system. That's strictly more expressive but it puts a configuration burden on a team that may not have realized the burden existed.
Both are defensible engineering choices. Neither is correct as a default for an agent you didn't build yourself. If you're consuming a managed compaction API and you have any tool that does literal lookups against user-provided values, you need to assume the summary will not preserve those values, and you need to design a separate mechanism that does.
The Pattern That Actually Survives
- https://platform.claude.com/docs/en/build-with-claude/compaction
- https://platform.claude.com/cookbook/tool-use-automatic-context-compaction
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://factory.ai/news/evaluating-compression
- https://www.morphllm.com/compaction-vs-summarization
- https://www.morphllm.com/context-compaction
- https://justin3go.com/en/posts/2026/04/09-context-compaction-in-codex-claude-code-and-opencode
- https://redis.io/blog/context-compaction/
- https://www.dbreunig.com/2025/06/22/how-contexts-fail-and-how-to-fix-them.html
- https://www.philschmid.de/context-engineering-part-2
- https://www.langchain.com/blog/context-management-for-deepagents
