Tool Output Compression: The Injection Decision That Shapes Context Quality
Your agent calls a database tool. The query returns 8,000 tokens of raw JSON — nested objects, null fields, pagination metadata, and a timestamp on every row. Your agent needs three fields from that response. You just paid for 7,900 tokens of noise, and you injected all of them into context where they'll compete for attention against the actual task.
This is the tool output injection problem, and it's the most underrated architectural decision in agent design. Most teams discover it the hard way: the demo works, production degrades, and nobody can explain why the model started hedging answers it used to answer confidently.
The root cause is almost always context pollution from uncompressed tool outputs. Average agent session length tripled from under 2,000 tokens in late 2023 to over 5,400 tokens by late 2025, and the bulk of that growth came from tool results accumulating in context. Poor serialization alone wastes 40–70% of available tokens on formatting overhead — JSON indentation, redundant schema wrappers, metadata that nobody asked for.
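The formatting overhead is easy to see directly. The sketch below (illustrative data, not a real tool response) compares pretty-printed JSON against compact serialization for a payload shaped like the opening example:

```python
import json

# Illustrative only: 50 rows carrying the kind of per-row metadata
# described in the opening example.
record = {"rows": [{"id": i, "status": "ok",
                    "created_at": "2025-01-01T00:00:00Z"} for i in range(50)]}

pretty = json.dumps(record, indent=2)                # typical pretty-printed payload
compact = json.dumps(record, separators=(",", ":"))  # whitespace stripped

overhead = 1 - len(compact) / len(pretty)            # fraction spent on formatting
```

Even before dropping a single field, stripping indentation and spacing recovers a meaningful fraction of the payload; field-level filtering recovers far more.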
The decision you make about how to inject tool outputs is not a detail. It's a load-bearing architectural choice that determines your cost ceiling, your latency floor, and your accuracy curve as sessions get longer.
The Three Injection Strategies
There are exactly three ways to handle a tool result before it enters context, and each one has a distinct cost-quality profile.
Raw injection means you pass the tool output directly to the model without any preprocessing. This is the default for most implementations — it's easy to reason about and preserves complete information. The problem is that it hands the compression problem to the model itself. The model has to read and process every token you inject, including the irrelevant ones. At small scales this is fine. At production scale, a 70B model serving sessions with stuffed versus curated contexts shows a 719% increase in time-to-first-token. The quality hit is subtler but real: models injected with irrelevant context start hedging, producing non-committal answers, and occasionally contradicting themselves. Raw injection is only appropriate for tool results under roughly 500 tokens where the entire output is likely relevant.
In-model compression routes the tool output through an LLM summarization step before injecting the result. You ask a model (often a smaller, cheaper one) to extract or summarize the relevant information, then inject the condensed version. This preserves semantic intent better than rule-based extraction and handles diverse or unpredictable output schemas well. The tradeoff is that you're paying twice — once to produce the tool result and once to compress it — and you're adding a latency step. There's also a hallucination risk: compression models can silently drop critical edge cases or smooth over numerical precision. In-model compression works best when tool outputs are large and unstructured, and when you expect the relevant content to be a small fraction of the whole.
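A minimal sketch of this pattern, with the LLM call stubbed out — `summarize_with_small_model` is a placeholder for a real provider SDK request against a low-cost model, not an actual API:

```python
def summarize_with_small_model(prompt: str) -> str:
    # Placeholder for an actual summarization call to a small, cheap model;
    # the stub just lets the sketch run.
    return prompt[:200]

def compress_tool_output(task: str, raw_output: str,
                         threshold_tokens: int = 500) -> str:
    """Pay for a compression call only when the output is large enough to matter."""
    approx_tokens = len(raw_output) // 4  # rough chars-per-token heuristic
    if approx_tokens <= threshold_tokens:
        return raw_output  # compact results can be injected raw
    prompt = (f"Extract only the information relevant to this task: {task}\n\n"
              f"Tool output:\n{raw_output}")
    return summarize_with_small_model(prompt)
```

The threshold guard matters: without it, you pay the double-cost penalty even on outputs that were already compact.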
Structured field extraction intercepts the tool result before it ever reaches the model, parses it against a known schema, and injects only the fields you specified. This is the lowest-cost, lowest-latency option — rule-based extraction requires no LLM call — and the quality ceiling is high when the schema is complete. The constraint is that you need to know in advance what fields matter. This strategy works for tools with stable, predictable output schemas: SQL queries, REST API calls, structured logs, internal service responses. It fails when tool outputs are genuinely heterogeneous or when the relevant fields vary by query type.
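A sketch of what structured extraction can look like, assuming a hypothetical `query_orders` tool whose rows carry the usual metadata baggage (the tool name, fields, and payload shape are all illustrative):

```python
import json

# Hypothetical config mapping each tool to the fields the agent actually uses.
FIELD_CONFIG = {
    "query_orders": ["order_id", "status", "total"],
}

def extract_fields(tool_name: str, raw: str) -> str:
    """Intercept a structured tool result and inject only the configured fields."""
    fields = FIELD_CONFIG.get(tool_name)
    if fields is None:
        return raw  # unknown schema: fall back to raw injection
    data = json.loads(raw)
    rows = data.get("rows", data if isinstance(data, list) else [data])
    kept = [{f: row[f] for f in fields if f in row} for row in rows]
    return json.dumps(kept, separators=(",", ":"))  # compact serialization

raw = json.dumps({
    "rows": [{"order_id": 1, "status": "shipped", "total": 42.5,
              "created_at": "2025-01-01T00:00:00Z", "meta": {"page": 1}}],
    "pagination": {"next": None},
})
compact = extract_fields("query_orders", raw)
```

Note the explicit fallback to raw injection for unknown tools: that fallback rate is itself worth monitoring, since a rising rate means schemas have drifted.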
The practical heuristic: start with structured extraction for any tool with a known schema, use raw injection only when outputs are already compact, and reserve in-model compression for cases where neither applies.
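The heuristic fits in a few lines; the 500-token cutoff below mirrors the rough threshold above and should be tuned to your own workloads:

```python
def choose_strategy(output_tokens: int, has_stable_schema: bool,
                    raw_threshold: int = 500) -> str:
    """Pick an injection strategy: extraction for known schemas, raw for
    compact outputs, in-model compression as the fallback."""
    if has_stable_schema:
        return "structured_extraction"
    if output_tokens < raw_threshold:
        return "raw_injection"
    return "in_model_compression"
```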
The Quality-Cost Matrix
Understanding when to switch strategies requires knowing what each costs in concrete terms.
Raw injection has a deceptively low upfront cost — no extra processing step, no additional LLM calls. But the downstream costs compound. Context poisoning degrades accuracy on tasks that require the model to reason across multiple pieces of information. As sessions lengthen, the model's effective attention window shrinks even if the nominal context window is large. You also hit provider limits faster, which causes either truncation (silent information loss) or errors.
In-model compression doubles your per-step token cost in the worst case. A 10,000-token tool result runs through a summarization call that itself generates and consumes tokens. For high-volume pipelines, this can be the dominant cost driver. The quality benefits are real but hard to measure — it's difficult to attribute accuracy improvements to compression rather than other changes.
Structured field extraction approaches zero marginal cost per invocation once the extraction logic is written. The investment is upfront: schema definition, edge case handling, testing against real outputs. The hidden cost is maintenance: when a tool's output schema changes, extraction silently breaks and you won't notice until accuracy starts declining.
ACON, a research framework for context compression in long-horizon agents, achieved 26–54% peak memory reduction while preserving 95%+ task accuracy by combining structured extraction with selective summarization. AutoTool, which dynamically selects tool subsets rather than compressing outputs, reduced per-step context tokens by 95% and cut end-to-end costs by 70%. These results are in controlled settings, but they illustrate the order-of-magnitude gap between naive injection and deliberate context management.
Production Signals That Tell You to Change Strategy
The first sign that your injection strategy is wrong is almost never a crash or an error. It's a drift in quality metrics that's easy to attribute to the wrong cause.
Rising context token counts are the leading indicator. If your p95 session length is growing week over week without a corresponding increase in task complexity, tool outputs are accumulating in context. The threshold to act is typically when sessions regularly exceed 80% of your context window — at that point, you're paying for tokens that are likely hurting more than helping.
Declining extraction accuracy shows up as the model hedging answers it used to answer confidently. Phrases like "based on the information provided" or "I'm not certain but" appearing in outputs that previously produced clean, direct answers signal that the model is confused by competing or irrelevant context. This is distinct from genuine model uncertainty — it's a precision problem caused by context noise.
Latency increases correlated with session length rather than query complexity point to context bloat. If your time-to-first-token climbs as sessions get longer, the bottleneck is context processing, not computation.
Budget overruns on specific tool types, visible in your cost monitoring, tell you which tools are the culprits. If one tool produces results averaging 8,000 tokens and you're calling it 20 times per session, that tool needs a compression strategy regardless of how the rest of your pipeline works.
Anti-Patterns That Compound the Problem
Most teams don't have one injection strategy — they have an implicit one that evolved by accident.
Context stuffing is the most common anti-pattern: treating large context windows as permission to inject everything. The 1M-token context window didn't change the model's effective attention distribution. It changed the ceiling on how much noise you can inject before the model breaks entirely. Filling the window with uncompressed tool outputs works in demos because demo queries are direct and the relevant information is obvious. In production, queries are ambiguous, sessions are long, and the model has to find signal in a growing haystack.
Multi-agent context propagation compounds the problem. Agent A accumulates 30,000 tokens across 15 tool calls, spawns Agent B for a subtask, and passes its full context "so B has everything it needs." B does the same to C. Each agent pays for the full inherited context, and each adds more. The fix is explicit context contracts at agent boundaries: define what information each sub-agent needs and pass only that.
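One way to make such a contract concrete (the type and field names here are illustrative, not from any framework) is a small immutable handoff object built from explicitly requested facts:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SubtaskContext:
    """Hypothetical context contract at an agent boundary."""
    goal: str
    relevant_facts: tuple   # curated findings, not raw tool outputs
    constraints: tuple

def hand_off(parent_context: dict, needed_fact_keys: list) -> SubtaskContext:
    """Pass the sub-agent only what it declared it needs, never full history."""
    facts = tuple(parent_context["facts"][k] for k in needed_fact_keys
                  if k in parent_context["facts"])
    return SubtaskContext(goal=parent_context["subtask_goal"],
                          relevant_facts=facts,
                          constraints=tuple(parent_context.get("constraints", ())))
```

The point of the dataclass is that the boundary is now explicit and reviewable: any new field a sub-agent needs has to be added to the contract, not smuggled in via inherited history.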
Indiscriminate observation logging injects every intermediate step — tool call inputs, state mutations, debug metadata — into the agent's context stream. This made early agent architectures easy to debug, but at production scale it means the model spends significant attention reading its own plumbing. Log to external systems; inject only action-result pairs.
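The split can be as simple as one function that writes the full trace to an external logger and appends only the distilled pair to the agent's context (a sketch, not any framework's API):

```python
import logging

logger = logging.getLogger("agent.trace")

def record_step(context: list, tool_name: str, args: dict, result: str) -> None:
    """Log full plumbing externally; inject only the action-result pair."""
    # Full detail goes to the external log, where it costs no attention.
    logger.debug("tool=%s args=%r result=%r", tool_name, args, result)
    # Only the distilled pair enters the model's context stream.
    context.append({"action": tool_name, "result": result})
```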
Fixed serialization format regardless of tool type treats structured database outputs the same as unstructured web search results. Database outputs benefit enormously from field extraction; web search results often need semantic summarization. Using the same injection strategy for both guarantees you're wrong about at least one.
A Tiered Compression Architecture
The most robust production approach treats injection as a tiered decision rather than a single policy.
The first tier is field extraction for all tools with stable schemas. Write extraction configs that map each tool's output schema to the fields your agent actually uses. This handles the majority of tool calls in most pipelines with minimal overhead.
The second tier is selective in-model compression for high-value, unstructured outputs. Not every tool result deserves a compression call — reserve this for cases where the output is large and unstructured, and the relevant content is genuinely hard to identify programmatically.
The third tier is context window management at the session level. When total context exceeds a threshold (typically 80% of your limit), apply summarization to older tool interactions. Preserve recent turns verbatim — the last 10% of the context window is active working memory that should not be compressed. Older interactions can usually be distilled to action-result pairs without losing task-relevant information.
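A sketch of that third tier, assuming each message carries a token count and a precomputed summary (its action-result pair); a real system might generate the summaries with a cheap model instead:

```python
def compact_session(messages, window_limit, occupancy_threshold=0.8,
                    keep_recent=0.1):
    """Distill older turns when context exceeds the occupancy threshold,
    keeping the most recent turns verbatim as active working memory."""
    total = sum(m["tokens"] for m in messages)
    if total <= occupancy_threshold * window_limit:
        return list(messages)  # under budget: leave everything verbatim
    recent_budget = int(keep_recent * window_limit)
    tail, used = [], 0
    # Walk backwards so recent turns stay verbatim (always keep at least one).
    for m in reversed(messages):
        if used + m["tokens"] > recent_budget and tail:
            break
        tail.append(m)
        used += m["tokens"]
    tail.reverse()
    head = messages[: len(messages) - len(tail)]
    compacted = [{"text": m["summary"], "tokens": m["summary_tokens"]}
                 for m in head]
    return compacted + tail
```

The two thresholds (80% occupancy, 10% verbatim tail) match the heuristics above, but both are knobs worth tuning against your own accuracy-by-session-length measurements.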
LangChain's implementation of deep agent context management follows this pattern: offload large tool results to a filesystem reference, summarize older history when context fills, and give the agent a compression tool it can invoke autonomously when it judges more context is needed. The "bitter lesson" insight applies here — giving agents more control over their own compression tends to produce better results than hand-tuned heuristics, because the agent has more information about what's relevant to its current task.
What to Measure
If you implement nothing else from this post, add these three metrics to your agent monitoring:
Context occupancy rate: peak tokens used divided by context window limit, tracked as a time series. A rising trend without rising task complexity is a compression problem.
Extraction coverage: for structured extraction, the fraction of tool responses where extraction succeeded versus fell back to raw injection. A declining rate means your schemas need updates or your extraction logic has gaps.
Accuracy by session length: if your agent's task accuracy degrades as sessions get longer, tool output accumulation is a likely cause. This requires an evaluation framework with tasks of varying session depth, but it's the most direct measurement of whether your compression strategy is working.
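The first two metrics are cheap to instrument; a minimal sketch (names are illustrative, not from any monitoring library):

```python
def context_occupancy(peak_tokens: int, window_limit: int) -> float:
    """Peak tokens used divided by the context window limit."""
    return peak_tokens / window_limit

class ExtractionCoverage:
    """Track how often structured extraction succeeded vs fell back to raw."""
    def __init__(self):
        self.succeeded = 0
        self.fallback = 0

    def record(self, extracted: bool) -> None:
        if extracted:
            self.succeeded += 1
        else:
            self.fallback += 1

    @property
    def rate(self) -> float:
        total = self.succeeded + self.fallback
        return self.succeeded / total if total else 0.0
```

Both are worth tracking as time series: a rising occupancy trend or a declining coverage rate is actionable well before accuracy visibly degrades.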
The teams that get this right don't think of tool output compression as an optimization. They treat it as a first-class design decision made at the same time as tool selection and agent architecture. The injection strategy you commit to early becomes the invisible constraint on everything your agent can do later.
References
- https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
- https://www.blog.langchain.com/context-management-for-deepagents/
- https://arxiv.org/html/2510.00615v1
- https://arxiv.org/html/2601.07190
- https://arxiv.org/html/2512.13278v1
- https://openai.github.io/openai-agents-python/context/
- https://redis.io/blog/llm-token-optimization-speed-up-apps/
- https://openrouter.ai/state-of-ai
