Streaming Tool Results Break Request-Response Agent Planners
A SQL tool ships rows as they come off the wire. The agent calls it expecting a result. The harness, written a year earlier when every tool was request-response, dutifully buffers the whole stream into a single string before invoking the model. Forty seconds later, the buffer is 200 KB, the context window is half-eaten, and the agent is reasoning about row 47,000 of a query it could have stopped at row 30. Nobody designed this failure — it falls out of treating "the tool returned" as the only event the planner reacts to.
The shift to streaming tools is happening below the planner's awareness. SQL engines emit progressive result sets. Document fetchers yield pages. Search APIs return hits in batches as relevance scores stabilize. MCP's Streamable HTTP transport, the 2025-03-26 spec replacement for HTTP+SSE, makes incremental responses a first-class transport mode rather than an exotic capability. The wire is ready. The planners on top of it are not.
Most agent harnesses still model a tool call as result = tool(args) — a function that blocks and returns one cohesive value. When the underlying tool streams, the harness's only honest options are to wait until EOF, to truncate at some hard byte limit, or to stop reading and discard the rest. None of those lets the model participate in the decision. The interesting move is the one almost no one makes: let the model decide, during the stream, when it has seen enough.
The Buffer-and-Hand-Off Failure Mode
The default integration path turns a streaming tool into a synchronous one. The harness opens the stream, accumulates chunks into a buffer, waits for the close event, then hands the entire payload to the model as the next user message. From the model's perspective, the tool's behavior is indistinguishable from a slow request-response call. From the system's perspective, three things have quietly broken.
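Concretely, the integration looks something like the sketch below. This is a minimal illustration rather than any particular framework's code; model.complete and the chunk iterator are hypothetical stand-ins for whatever client and tool runner a given harness uses.

```python
def buffer_and_hand_off(tool_stream, byte_limit=1_000_000):
    """Drain a streaming tool into one string before the model sees anything."""
    chunks, total = [], 0
    for chunk in tool_stream:           # blocks until the tool closes the stream
        chunks.append(chunk)
        total += len(chunk)
        if total >= byte_limit:         # the only "decision" is a hard byte cap
            chunks.append("\n[truncated]")
            break
    return "".join(chunks)


def agent_step(model, messages, tool_stream):
    # The model is not in the loop until the stream is fully drained (or capped).
    messages.append({"role": "user", "content": buffer_and_hand_off(tool_stream)})
    return model.complete(messages)     # reasoning starts only after EOF
```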
The first break is context-window math. A SQL query that returns 50,000 rows of moderately wide JSON can land north of 800 KB of tool output; at roughly four bytes of JSON per token, that is on the order of 200K tokens, enough to fill a 200K-token context window on its own. Even when the harness clips the buffer to something smaller, a typical agent state already includes system prompts, prior tool calls, and the user's history, so a single greedy buffer can eat a third of the budget before the model has reasoned about the result. The second tool call in the trajectory has less room than the first. The fifth has to evict something — usually the part of the user's request the agent most needs to remember. Recent work on context window overflow in agents identifies tool-output bloat as a top-three failure mode, ahead of long instructions or scratchpad reasoning.
The second break is latency. The harness has no way to ask "is this enough yet?" because the model isn't in the loop until EOF. If the tool is paginating-on-demand — a document fetcher that yields a new page every time the previous one is acknowledged — the buffer-and-hand-off pattern produces a stream that never reaches EOF until some external timeout fires. The agent's decision to stop reading is delegated to a wall-clock deadline rather than to the agent's own judgment about whether row 30 already answered the question.
The third break is silent. Truncation at the harness layer is the standard fallback, and the issue queues of projects like OpenAI Codex track it as a recurring source of agent confusion: the tool result the model sees is a prefix of what the tool actually emitted, with a footer like [truncated] if the team got that part right and no marker at all if they didn't. The model reasons over a partial dataset as though it were complete. The output is confidently wrong in ways the trace doesn't surface, because from the trace's point of view the tool returned a value and the model accepted it.
A Streaming-Tool Contract the Planner Can Reason About
The fix is not "stream into the model token-by-token" — that solves a UX problem (perceived latency on the response) and not the planning problem. The fix is to teach the planner that some tools emit incrementally, and to give it primitives to participate in the stream.
The minimum viable contract has four pieces:
A streaming flag in the tool descriptor. The tool catalog already declares names, parameters, and descriptions; add a boolean (or a richer mode enum: request_response, chunked, paginated, unbounded) so the planner knows ahead of time whether a tool can be expected to return one cohesive result. A planner that sees streaming: true should structure its prompt to model partial results, set up an interrupt path, and never expect a single "the tool returned X" message.
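A sketch of what such a descriptor could carry; the field and enum names are illustrative, not any existing catalog schema.

```python
from dataclasses import dataclass
from enum import Enum


class ToolMode(str, Enum):
    REQUEST_RESPONSE = "request_response"  # one call, one cohesive result
    CHUNKED = "chunked"                    # finite stream of chunks
    PAGINATED = "paginated"                # more data on demand, per acknowledgement
    UNBOUNDED = "unbounded"                # never closes on its own (tails, subscriptions)


@dataclass
class ToolDescriptor:
    name: str
    description: str
    parameters: dict
    mode: ToolMode = ToolMode.REQUEST_RESPONSE


sql_query = ToolDescriptor(
    name="sql_query",
    description="Run a read-only query; rows arrive as a progressive result set.",
    parameters={"query": {"type": "string"}},
    mode=ToolMode.CHUNKED,  # the planner sees this before it decides how to call the tool
)
```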
A running synopsis the model can read mid-stream. As chunks arrive, a wrapper accumulates a running summary — the count of rows seen so far, the schema, a sampled head, a min/max over key columns, an "is the data converging or still expanding the value range" signal. Every N chunks (or N seconds), the wrapper emits the synopsis to the model as a thinking-tier message. The model now has evidence under uncertainty rather than a final value. It can stop early, refine its plan, or commit to waiting longer based on what it has seen.
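A sketch of such a wrapper, assuming rows arrive as dicts and that the harness exposes some hook (post_to_model below, a hypothetical name) for pushing a message into the model's context mid-stream and reading back a stop-or-continue decision:

```python
def stream_with_synopsis(row_stream, post_to_model, every_n=500):
    """Accumulate a running synopsis and surface it to the model every N rows."""
    rows_seen, schema, head = 0, set(), []
    numeric_min, numeric_max = {}, {}

    for row in row_stream:
        rows_seen += 1
        schema.update(row.keys())
        if len(head) < 5:
            head.append(row)                      # sampled head the model can inspect
        for key, value in row.items():
            if isinstance(value, (int, float)):
                numeric_min[key] = min(value, numeric_min.get(key, value))
                numeric_max[key] = max(value, numeric_max.get(key, value))

        if rows_seen % every_n == 0:
            # Emit the synopsis as a thinking-tier message, not the raw rows.
            decision = post_to_model({
                "rows_seen": rows_seen,
                "schema": sorted(schema),
                "head": head,
                "ranges": {k: (numeric_min[k], numeric_max[k]) for k in numeric_min},
            })
            if decision == "stop":                # the model, not a timeout, ends the stream
                break
```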
- https://modelcontextprotocol.io/specification/2025-03-26/basic/transports
- https://docs.langchain.com/oss/python/langchain/streaming
- https://developers.googleblog.com/en/beyond-request-response-architecting-real-time-bidirectional-streaming-multi-agent-system/
- https://arxiv.org/html/2511.22729v1
- https://dev.to/gabrielanhaia/tool-result-truncation-the-silent-bug-that-makes-agents-lie-3epe
- https://dev.to/gantz/handle-mcp-tools-that-return-too-much-data-49h9
- https://www.mindstudio.ai/blog/reduce-token-usage-ai-agents-mcp-optimization
- https://github.com/openai/codex/issues/6426
- https://risingwave.com/blog/mcp-streaming-database-connect-ai-agents-risingwave/
- https://blog.langchain.com/planning-agents/
- https://streamkap.com/resources-and-guides/agent-decision-latency-budget
- https://chatforest.com/guides/mcp-real-time-streaming/
- https://www.codeant.ai/blogs/poor-tool-calling-llm-cost-latency
