Silent Tool Truncation: The Default Cap Your Agent Reasons Over Without Knowing
A tool call returns a 142 KB JSON blob. Your agent framework drops everything past byte 8,192, hands the prefix to the model, and the model writes a confident answer based on a fragment it never knew was a fragment. Three weeks later a customer escalates. You scroll the trace, see "tool returned successfully," and the post-mortem turns into a hunt for which step "ignored" the evidence — except no step ignored it. The evidence was clipped before it ever reached the reasoner.
This isn't a hypothetical. Codex hardcodes tool output truncation at 10 KiB or 256 lines. Claude Code defaults to 25,000 tokens for tool results, with a separate display-layer cap that briefly clipped MCP responses at around 700 characters in 2025. OpenAI caps tool-output submissions at 512 KB. Each framework picked a number that seemed safe, and for short tool calls it is. The failure mode arrives when a single step's output crosses the line — quietly, without an exception, without a flag the model can see.
The deeper problem is not that limits exist. Limits have to exist; tool outputs can be unbounded and context windows can't be. The problem is that "drop the overflow and continue" is the wrong default for any system whose downstream consumer is a probabilistic reasoner. A deterministic program crashes on truncated JSON. A reasoner doesn't crash. It infers, fills gaps, and answers. The mode of failure shifts from a stack trace to a confidently wrong response — and the absence of a stack trace is exactly what makes the bug hard to find.
How Truncation Becomes Invisible
The path from "your tool returned a useful result" to "your model is hallucinating" runs through several sub-systems, each of which can clip without telling the next.
The framework layer. Most agent frameworks impose a per-result byte or token ceiling. Codex picks 10 KiB or 256 lines, whichever hits first, and preserves head and tail with a marker in the middle. That marker is informational text the model is free to ignore — there is no structural field on the tool response that says truncated: true. When a CSV gets cut through the middle, the model sees malformed rows and either repairs them by guessing or treats the visible rows as the whole dataset.
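The fix at this layer is to make the clip queryable instead of prose-only. A minimal sketch, assuming a framework that lets you wrap tool results before they reach the model; the cap and the field names (truncated, original_bytes, kept_bytes) are illustrative, not any framework's actual schema:

```python
MAX_BYTES = 10 * 1024  # illustrative cap, mirroring the 10 KiB figure above

def clip_tool_result(raw: str, max_bytes: int = MAX_BYTES) -> dict:
    """Head/tail truncation that returns a structured envelope rather than
    burying the truncation notice in free text the model is free to ignore."""
    data = raw.encode("utf-8")
    if len(data) <= max_bytes:
        return {"truncated": False, "content": raw}

    half = max_bytes // 2
    head = data[:half].decode("utf-8", errors="ignore")
    tail = data[-half:].decode("utf-8", errors="ignore")
    return {
        "truncated": True,            # structural flag: queryable by code, evals, dashboards
        "original_bytes": len(data),  # lets monitoring measure how much was lost
        "kept_bytes": max_bytes,
        "content": head + "\n...[clipped]...\n" + tail,
    }
```

The envelope, not the bare string, is what gets serialized into the tool message, so downstream layers can check a field instead of parsing a marker out of prose.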
The transport layer. Even when the framework would accept a larger payload, a Node child-process buffer or a gRPC message limit can fail upstream. The Claude Code issue traced its symptom — 8,000-character MCP responses appearing as 700-character snippets — partly to RangeError [ERR_CHILD_PROCESS_STDIO_MAXBUFFER]: stdout maxBuffer length exceeded. The full response was generated, paid for in inference cost, written to a temp file, and never reached the model. The user could find it on disk. The agent could not.
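One way to make this class of loss loud rather than silent is an integrity check between the tool process and the agent. The sketch below assumes a convention invented purely for illustration: the tool writes its full output to a temp file plus a tiny manifest containing the byte count and a sha256. None of this is part of MCP or any framework; the point is only that a mismatch should raise, not pass a prefix onward.

```python
import hashlib
from pathlib import Path

def verify_delivery(delivered: bytes, manifest_path: str) -> bytes:
    """Raise instead of silently forwarding a prefix when the transport
    dropped bytes between the tool process and the agent."""
    expected_len, expected_sha = Path(manifest_path).read_text().split()
    if len(delivered) != int(expected_len) or hashlib.sha256(delivered).hexdigest() != expected_sha:
        raise RuntimeError(
            f"transport clipped tool output: got {len(delivered)} of {expected_len} bytes"
        )
    return delivered
```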
The display layer. Some clips happen only in the developer-facing transcript. The model still gets the full response, but the engineer debugging a behavior sees a snippet and reasons from it as if it were the model's view, when the clip is a UI artifact and the model saw everything. This direction of asymmetry is rarer but arguably worse: it makes the engineer wrong about what the model knew.
The memory layer. LangGraph users have hit cases where a memory summarizer silently drops ToolMessage entries during summarization, breaking the AIMessage/ToolMessage pairing that the framework requires. The agent's apparent failure surfaces as a runtime error or a confused next step, but the root cause is upstream: the summarizer's truncation policy didn't know about the tool-call protocol's structural constraint.
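The pairing constraint is mechanical enough to enforce in code. A minimal sketch using the langchain_core message classes; the pruning policy itself is a simplification for illustration, not LangGraph's or langmem's actual summarizer:

```python
from langchain_core.messages import AIMessage, BaseMessage, ToolMessage

def prune_preserving_pairs(messages: list[BaseMessage], keep_last: int = 20) -> list[BaseMessage]:
    """Trim history without orphaning tool results: a ToolMessage whose
    originating AIMessage (the one that issued the tool_call) was pruned
    must be dropped too, or the next model call is structurally malformed."""
    kept = messages[-keep_last:]

    # tool_call ids whose issuing AIMessage survived the trim
    live_ids = {
        tc["id"]
        for m in kept
        if isinstance(m, AIMessage)
        for tc in (m.tool_calls or [])
    }

    # drop ToolMessages that no longer have a matching AIMessage in view
    return [
        m for m in kept
        if not (isinstance(m, ToolMessage) and m.tool_call_id not in live_ids)
    ]
```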
Each of these failures is plausible in isolation. What makes them dangerous in production is that they compose. A 142 KB result is clipped by the framework to 25 K tokens, then summarized by a memory layer that drops a ToolMessage, then displayed to a developer in a UI that shows the first 700 characters. The engineer sees a truthful-looking transcript; the model sees a different truthful-looking transcript; neither matches the original tool output.
What the Model Does With Half a Result
Reasoners are good at answering questions with the evidence they have. That's the core skill. It's also the failure mode.
A reasoner handed half an API response will not say "this looks truncated, I should not answer." It will reason from the visible portion, weight that evidence at full confidence, and produce a complete-looking conclusion. If the missing rows happen to contain the answer the user actually needed, the answer comes back wrong with no uncertainty marker attached. Post-mortem traces show the model "ignoring evidence" that, from the model's perspective, was never there.
This is qualitatively different from how deterministic systems fail under truncation. A SQL client given a partial result set raises an exception. A JSON parser fails on an unclosed brace. Python's pandas.read_csv can raise a ParserError when a file is clipped mid-record. Each of these halts before propagating bad data forward. The reasoner has no equivalent halt. Its training objective rewards producing a coherent answer; coherence is cheap when half the constraints are missing.
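The contrast is easy to reproduce. Clip a serialized payload at a byte cap and the deterministic parser refuses to produce anything at all:

```python
import json

full = json.dumps({"rows": [{"id": i, "overdue": i == 812} for i in range(1_000)]})
clipped = full[:8_192]  # simulate a framework cutting at a byte cap

try:
    json.loads(clipped)
except json.JSONDecodeError as exc:
    # The parser halts here; no downstream code ever sees the partial data.
    print(f"refused to parse: {exc}")
```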
The asymmetry shows up in evals too. A test suite that checks "did the agent produce an answer" passes silently. A test suite that checks "did the agent produce the right answer for the full evidence set" requires constructing a case where the right answer depends on data past the truncation boundary — which most teams don't think to test until after the first incident.
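The shape of such a case is simple to construct: put the fact the answer depends on past the clip point, then assert on the content of the answer, not just its existence. A sketch, where run_agent stands in for however your harness invokes the agent and BYTE_CAP for the framework's real limit:

```python
import json

BYTE_CAP = 8_192  # stand-in for the framework limit under test

def make_needle_past_boundary() -> tuple[str, str]:
    """Build a tool payload whose decisive row sits beyond the clip point."""
    rows = [{"invoice": f"INV-{i:05d}", "status": "paid"} for i in range(2_000)]
    rows[-1]["status"] = "overdue"           # the needle
    payload = json.dumps({"rows": rows})
    assert len(payload.encode()) > BYTE_CAP  # guarantee the needle lies past the cap
    return payload, rows[-1]["invoice"]

def test_answer_requires_full_evidence(run_agent):
    payload, needle = make_needle_past_boundary()
    answer = run_agent(question="Which invoice is overdue?", tool_response=payload)
    # Passes only if the agent actually saw past the truncation boundary.
    assert needle in answer
```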
The Detection Discipline That Has to Land
Most teams discover their truncation rate after a customer-visible incident, then find that 12% of tool calls have been silently clipped for months. The failure of the eval to catch it is structural: evals built from production traces over-represent calls whose responses fit, because calls whose responses got clipped failed quietly and never made it into the "interesting" pile.
Three things have to be in place before truncation becomes observable as data rather than as anecdote:
- A structural truncation signal on every tool result, so "was this clipped?" is a field a program can query rather than a marker buried in prose.
- A truncation-rate metric computed over all tool calls, not just the ones that surfaced in an incident review; a minimal sketch follows this list.
- Eval cases constructed so the correct answer depends on data past the truncation boundary, which is the only way an eval distinguishes "produced an answer" from "produced the right answer for the full evidence set."
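For the second item: once tool results carry a structural flag like the truncated field from the earlier envelope sketch, the rate is a trivial aggregation. The class and field names here are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class TruncationStats:
    """Aggregate clip counts per tool so the truncation rate is a dashboard
    number rather than a post-incident discovery."""
    total: int = 0
    clipped: int = 0
    clipped_by_tool: dict[str, int] = field(default_factory=dict)

    def record(self, tool_name: str, envelope: dict) -> None:
        self.total += 1
        if envelope.get("truncated"):  # the structural flag from the earlier sketch
            self.clipped += 1
            self.clipped_by_tool[tool_name] = self.clipped_by_tool.get(tool_name, 0) + 1

    @property
    def rate(self) -> float:
        return self.clipped / self.total if self.total else 0.0
```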
Sources
- https://github.com/anthropics/claude-code/issues/2638
- https://modelcontextprotocol.info/docs/tutorials/writing-effective-tools/
- https://github.com/openai/codex/issues/5913
- https://www.anthropic.com/engineering/code-execution-with-mcp
- https://forum.langchain.com/t/best-way-to-handle-very-large-tool-outputs-in-langgraph-avoid-llm-langsmith-overload/2445
- https://github.com/langchain-ai/langmem/issues/112
- https://community.openai.com/t/submit-tool-output-in-function-call-size-limit/744943
- https://community.openai.com/t/undocumented-truncation-of-function-tool-call-submission-output/611944
- https://github.com/anthropics/claude-code/issues/24785
- https://arxiv.org/html/2511.22729v1
- https://modelcontextprotocol.io/specification/2025-06-18/server/tools
