Tool Output Compression: The Injection Decision That Shapes Context Quality
Your agent calls a database tool. The query returns 8,000 tokens of raw JSON — nested objects, null fields, pagination metadata, and a timestamp on every row. Your agent needs three fields from that response. You just paid for 7,900 tokens of noise, and you injected all of them into context where they'll compete for attention against the actual task.
This is the tool output injection problem, and it's the most underrated architectural decision in agent design. Most teams discover it the hard way: the demo works, production degrades, and nobody can explain why the model started hedging answers it used to answer confidently.
The root cause is almost always context pollution from uncompressed tool outputs. Average agent session length tripled from under 2,000 tokens in late 2023 to over 5,400 tokens by late 2025, and the bulk of that growth came from tool results accumulating in context. Poor serialization alone wastes 40–70% of available tokens on formatting overhead — JSON indentation, redundant schema wrappers, metadata that nobody asked for.
The decision you make about how to inject tool outputs is not a detail. It's a load-bearing architectural choice that determines your cost ceiling, your latency floor, and your accuracy curve as sessions get longer.
The Three Injection Strategies
There are exactly three ways to handle a tool result before it enters context, and each one has a distinct cost-quality profile.
Raw injection means you pass the tool output directly to the model without any preprocessing. This is the default for most implementations — it's easy to reason about and preserves complete information. The problem is that it hands the compression problem to the model itself. The model has to read and process every token you inject, including the irrelevant ones. At small scales this is fine. At production scale, a 70B model serving sessions with stuffed versus curated contexts shows a 719% increase in time-to-first-token. The quality hit is subtler but real: models injected with irrelevant context start hedging, producing non-committal answers, and occasionally contradicting themselves. Raw injection is only appropriate for tool results under roughly 500 tokens where the entire output is likely relevant.
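That rough 500-token cutoff can be enforced with a trivial gate. The ~4-characters-per-token ratio and the budget value below are heuristics, not exact figures; tune them against your own tokenizer and tools:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate: ~4 characters per token for English text."""
    return len(text) // 4

def can_inject_raw(tool_output: str, budget: int = 500) -> bool:
    """Raw injection is only safe when the whole output fits the budget
    and is likely to be relevant in its entirety."""
    return estimate_tokens(tool_output) <= budget
```

In practice you'd replace the character heuristic with your provider's tokenizer, but the gate itself stays this simple.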
In-model compression routes the tool output through an LLM summarization step before injecting the result. You ask a model (often a smaller, cheaper one) to extract or summarize the relevant information, then inject the condensed version. This preserves semantic intent better than rule-based extraction and handles diverse or unpredictable output schemas well. The tradeoff is that you're paying twice — once to produce the tool result and once to compress it — and you're adding a latency step. There's also a hallucination risk: compression models can silently drop critical edge cases or smooth over numerical precision. In-model compression works best when tool outputs are large, unstructured, and where you expect the relevant content to be a small fraction of the whole.
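A minimal sketch of the pattern, with the summarization model abstracted behind a `summarize` callable so any cheap LLM client can be plugged in. The prompt wording here is illustrative, not a tested template; the instruction to preserve exact numbers is one way to mitigate the precision-loss risk noted above:

```python
from typing import Callable

COMPRESSION_PROMPT = (
    "Extract only the facts relevant to the task below from the tool output. "
    "Preserve exact numbers and identifiers; do not paraphrase values.\n"
    "Task: {task}\nTool output:\n{output}"
)

def compress_with_model(tool_output: str, task: str,
                        summarize: Callable[[str], str]) -> str:
    """Route a large tool output through a (cheaper) summarization model.
    `summarize` is any callable that sends a prompt to an LLM and returns text."""
    prompt = COMPRESSION_PROMPT.format(task=task, output=tool_output)
    return summarize(prompt)
```

Passing the task alongside the output matters: a task-agnostic summary tends to keep the wrong details.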
Structured field extraction intercepts the tool result before it ever reaches the model, parses it against a known schema, and injects only the fields you specified. This is the lowest-cost, lowest-latency option — rule-based extraction requires no LLM call — and the quality ceiling is high when the schema is complete. The constraint is that you need to know in advance what fields matter. This strategy works for tools with stable, predictable output schemas: SQL queries, REST API calls, structured logs, internal service responses. It fails when tool outputs are genuinely heterogeneous or when the relevant fields vary by query type.
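For a tool with a stable schema, the whole strategy can be a few lines. This sketch assumes the tool wraps results in a `rows` array; that envelope is a stand-in for whatever shape your tool actually returns:

```python
import json

def extract_rows(raw_json: str, fields: list[str]) -> str:
    """Parse a tool result against a known schema and keep only the
    fields the agent needs, serialized compactly for injection."""
    rows = json.loads(raw_json)["rows"]
    kept = [{f: row[f] for f in fields if f in row} for row in rows]
    # Compact separators drop the whitespace overhead of pretty-printed JSON.
    return json.dumps(kept, separators=(",", ":"))
```

Against the 8,000-token database result from the opening example, this is where the nulls, timestamps, and pagination metadata disappear before the model ever sees them.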
The practical heuristic: start with structured extraction for any tool with a known schema, use raw injection only when outputs are already compact, and reserve in-model compression for cases where neither applies.
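The heuristic above reduces to a small dispatcher. The 500-token raw budget and the chars-per-token estimate are the same rough assumptions flagged earlier:

```python
def choose_strategy(tool_output: str, has_schema: bool,
                    raw_budget: int = 500) -> str:
    """Pick an injection strategy: structured extraction when a schema
    exists, raw injection for compact outputs, in-model compression
    as the fallback for large unstructured results."""
    if has_schema:
        return "extract"
    if len(tool_output) // 4 <= raw_budget:  # ~4 chars per token
        return "raw"
    return "compress"
```
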
The Quality-Cost Matrix
Understanding when to switch strategies requires knowing what each costs in concrete terms.
Raw injection has a deceptively low upfront cost — no extra processing step, no additional LLM calls. But the downstream costs compound. Context poisoning degrades accuracy on tasks that require the model to reason across multiple pieces of information. As sessions lengthen, the model's effective attention window shrinks even if the nominal context window is large. You also hit provider limits faster, which causes either truncation (silent information loss) or errors.
In-model compression doubles your per-step token cost in the worst case. A 10,000-token tool result runs through a summarization call that itself generates and consumes tokens. For high-volume pipelines, this can be the dominant cost driver. The quality benefits are real but hard to measure — it's difficult to attribute accuracy improvements to compression rather than other changes.
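The arithmetic is worth making explicit. One compression pass reads the full tool result as input and generates the summary as output; the prices below are hypothetical per-million-token rates, not any provider's actual pricing:

```python
def compression_step_cost(result_tokens: int, summary_tokens: int,
                          input_price: float, output_price: float) -> float:
    """Extra dollar cost of one summarization pass: the tool result is
    consumed once as input, the condensed summary is generated as output.
    Prices are per 1M tokens."""
    return (result_tokens * input_price + summary_tokens * output_price) / 1_000_000

# A 10,000-token result condensed to 500 tokens at $0.15/M in, $0.60/M out
# costs roughly $0.0018 extra per step -- negligible alone, dominant at volume.
extra = compression_step_cost(10_000, 500, input_price=0.15, output_price=0.60)
```
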
Structured field extraction approaches zero marginal cost per invocation once the extraction logic is written. The investment is upfront: schema definition, edge case handling, testing against real outputs. The hidden cost is maintenance: when a tool's output schema changes, extraction silently breaks and you won't notice until accuracy starts declining.
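The silent-breakage risk is avoidable: make extraction fail loudly when the schema drifts instead of injecting partial data. A minimal guard, assuming a flat JSON object:

```python
import json

class SchemaDriftError(Exception):
    """Raised when a tool result no longer matches the expected schema."""

def extract_strict(raw_json: str, required_fields: list[str]) -> dict:
    """Extract required fields, raising immediately if any are missing,
    so a schema change surfaces as an error rather than a slow accuracy decline."""
    record = json.loads(raw_json)
    missing = [f for f in required_fields if f not in record]
    if missing:
        raise SchemaDriftError(f"tool output missing fields: {missing}")
    return {f: record[f] for f in required_fields}
```

An exception at the extraction layer is cheap to alert on; a quality regression weeks later is not.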
ACON, a research framework for context compression in long-horizon agents, achieved 26–54% peak memory reduction while preserving 95%+ task accuracy by combining structured extraction with selective summarization. AutoTool, which dynamically selects tool subsets rather than compressing outputs, reduced per-step context tokens by 95% and cut end-to-end costs by 70%. These results are in controlled settings, but they illustrate the order-of-magnitude gap between naive injection and deliberate context management.
Production Signals That Tell You to Change Strategy
The first sign that your injection strategy is wrong is almost never a crash or an error. It's a drift in quality metrics that's easy to attribute to the wrong cause.
A rising context token count is the leading indicator. If your p95 session length is growing week over week without a corresponding increase in task complexity, tool outputs are accumulating in context. The threshold to act is typically when sessions regularly exceed 80% of your context window — at that point, you're paying for tokens that are likely hurting more than helping.
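That 80% threshold is easy to watch in a metrics pipeline. This sketch uses a simple nearest-rank p95 over recent session token counts; both the percentile method and the 0.8 threshold are the stated heuristics, not tuned constants:

```python
def context_pressure_alert(session_token_counts: list[int],
                           context_window: int, threshold: float = 0.8) -> bool:
    """True when the p95 session token count exceeds the given fraction
    of the context window -- the signal to revisit your injection strategy."""
    counts = sorted(session_token_counts)
    p95 = counts[min(len(counts) - 1, int(0.95 * len(counts)))]  # nearest-rank p95
    return p95 >= threshold * context_window
```
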
Sources
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.blog.langchain.com/context-management-for-deepagents/
- https://arxiv.org/html/2510.00615v1
- https://arxiv.org/html/2601.07190
- https://arxiv.org/html/2512.13278v1
- https://openai.github.io/openai-agents-python/context/
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://openrouter.ai/state-of-ai
