
Streaming Structured Output: Why Your Parser Hangs on Token 47

11 min read
Tian Pan
Software Engineer

The first time a team builds a streaming AI feature with structured output, the bug is always the same. The model is generating fine. The chunks are arriving fine. But somewhere around token 47, the parser hangs, the UI freezes, or — worse — a half-formed enum value gets routed to a downstream tool that quietly does the wrong thing. The team adds a try/catch around JSON.parse, considers itself done, and ships. Two weeks later, a sibling team complains that the streaming UI feels janky after the response gets long. A quarter later, an incident review asks why a "Delete" tool call fired on a record that the model was still describing as "DeleteIfEmpty."

The bug is not in any single token. The bug is that token-streaming and structured output are architecturally at odds, and most frameworks paper over the conflict with prayer. A schema says "this is a complete object." A token stream says "here are the bytes one at a time." Every intermediate state between those two endpoints is, by definition, invalid against the schema. The team's job is to decide what to do during those intermediate states — and most teams have not made that decision explicitly.

This post is about the failure modes that show up when streaming and structured output meet: the naive parsers that choke on every partial state, the "is_complete" booleans that lie, the O(n²) retry loops that look like the model got slow, the partial-enum hazards that are silent and dangerous, and the architectural patterns that hold up.

The naive loop and why it stops working

The first version of streaming structured output usually looks like this. The team accumulates chunks into a buffer. On each chunk, they call JSON.parse(buffer). If it throws, they wait for the next chunk. If it succeeds, they hand the result to the UI.
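
In code, the pattern is roughly this sketch (render is a stand-in for whatever updates the UI):

```typescript
// The naive loop: accumulate, reparse everything, swallow the failures.
declare function render(value: unknown): void; // UI hook (assumed)

async function naiveConsume(stream: AsyncIterable<string>): Promise<void> {
  let buffer = "";
  for await (const chunk of stream) {
    buffer += chunk;
    try {
      render(JSON.parse(buffer)); // O(buffer) on every chunk, O(n²) overall
    } catch {
      // Not a complete document yet; wait for the next chunk.
    }
  }
}
```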

This works in development because development inputs are short. It also works in the unit tests, which mock the stream as a single chunk. It breaks in production for two reasons that compound.

The first reason is that JSON.parse is binary: either the buffer is a complete document or it isn't. There is no partial result. So the streaming UI does not stream — it sits silent until the closing brace arrives, then dumps the whole object at once. The user sees a loading spinner for three seconds, then a wall of text. The "streaming" was always an illusion.

The second reason is performance. As the response grows, every new chunk re-parses the entire buffer from scratch. For a 12 KB response delivered in 5-character chunks, that is roughly 15 million characters of work to produce 12,000 characters of output. The wall-clock cost is real: under 3 KB it is invisible, by 5 KB the UI stutters, past 8 KB the parser is the bottleneck and the response feels frozen even though the model is still generating quickly. The team's latency dashboard reads "model is slow." It isn't. The parser is.

The fix the team usually reaches for next is the lie that buys six months of runway: an is_complete boolean.

The "is_complete" boolean is a lie

Once a team realizes that JSON.parse is binary, the natural workaround is to ask the model itself to tell them when the document is done. They add an is_complete: true field to the schema, instruct the model to set it last, and parse only when they see it. This is the "completion sentinel" pattern, and it has one good idea wrapped around two bad ones.

The good idea is that a model can mark its own output as terminal. A reserved key emitted last — "_done": true is the convention — is a useful signal even when the surrounding JSON is fully streamable, because it disambiguates "the model finished generating" from "the connection dropped at byte 4096." For long responses where partial completion is dangerous (a tool call where firing on partial arguments is unsafe), an explicit terminal marker is the right architectural primitive.

The bad ideas are subtle. The first is that an is_complete boolean tells you nothing about the intermediate states. The team still cannot render anything until the last token. They have replaced "wait for the closing brace" with "wait for the closing brace plus a flag," and the user-visible behavior is identical: silent spinner, wall-of-text reveal. The streaming UX did not improve — the team just convinced themselves the model was the reason it couldn't.

The second is that the boolean is a contract the model can violate. Models will, occasionally, emit is_complete: true on a malformed object, or emit it in the middle of a longer document, or hallucinate it onto a field where the schema didn't ask for it. The team is now trusting an unreliable signal to gate a parsing decision, and the failure mode looks like a parser bug rather than a model bug. Whatever signal you use to mark completion has to be enforced — by constrained generation, by structured-output mode, or by validating the shape on the server — not assumed.
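
What enforcement can look like on the server, as a minimal sketch: the flag only counts once the document is structurally complete, and a real gate would run the full schema at the same point.

```typescript
// The model's completion claim never substitutes for the parse.
function sentinelSatisfied(buffer: string): boolean {
  let doc: unknown;
  try {
    doc = JSON.parse(buffer); // structurally complete?
  } catch {
    return false;             // no flag can rescue a partial document
  }
  return typeof doc === "object" && doc !== null &&
    (doc as { _done?: unknown })._done === true;
  // A production gate would also validate the full schema here.
}
```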

The deeper issue is that an is_complete boolean is treating the symptom. The symptom is "I cannot tell if this is a complete document." The disease is "I am using a parser that does not understand partial documents." The fix is a parser that does.

Streaming-tolerant parsing is a state machine

A parser that handles partial JSON without reparsing from scratch is, in the literal sense, a state machine: it remembers where it was when the chunk ended. It tracks the nesting stack (are we inside an object, an array, both?). It tracks token state (mid-string, mid-number, mid-escape sequence, mid-Unicode codepoint). It buffers incomplete tokens and resumes them when the next chunk arrives. When a new chunk lands, it picks up at the saved position and processes only the new characters.

The complexity drop is dramatic. The same 12 KB response that took 16.7 seconds with the naive re-parse loop completes in 43 milliseconds with an incremental parser — almost 400× faster — because the work is now linear in the response size rather than quadratic. More importantly, the parser can emit partial results at every intermediate state. The UI can render a partial object the moment the first key-value pair lands, expand it as more pairs arrive, and finalize it when the outermost closing brace arrives.

The libraries that have hardened around this problem in 2025 — partial-json-parser (Python and JS), PartialJSON (Swift), streamjson, json-river, gjp-4-gpt — share a common shape. They expose an Allow policy that says which kinds of incompleteness the caller is willing to surface (partial strings: usually yes; partial numbers: usually no, because a number like "0.0" split across chunks surfaces as two different floats on the way to the real value; partial booleans: usually no; partial nulls: never). They maintain state across calls. They expose either an event API ("a key landed," "an array element finalized") or a snapshot API ("here's the deepest valid object given what we've seen so far") or both.
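
What that machine looks like is easier to show than to describe. Below is a minimal sketch in TypeScript: it keeps the nesting stack and string state across chunks, scans each character exactly once, and exposes a repair-based snapshot(). The class and its names are illustrative, and the snapshot-by-repair step is a simplification; the libraries above build values incrementally instead of reparsing, and they handle cases this sketch skips (partial \uXXXX escapes, top-level numbers).

```typescript
// Minimal streaming-tolerant JSON scanner (illustrative, not production).
type Frame = { kind: "obj"; expectKey: boolean } | { kind: "arr" };

class PartialJsonScanner {
  private buf = "";
  private stack: Frame[] = [];   // open containers, innermost last
  private inString = false;      // did the last chunk end mid-string?
  private stringIsKey = false;   // is the open string an object key?
  private escaped = false;       // dangling backslash?
  private scanned = 0;           // characters already processed
  private lastSafe = 0;          // cut point where the buffer is repairable

  // Feed one chunk; only the new characters are scanned (O(n) total).
  push(chunk: string): void {
    this.buf += chunk;
    for (; this.scanned < this.buf.length; this.scanned++) {
      const c = this.buf[this.scanned];
      if (this.inString) {
        if (this.escaped) this.escaped = false;
        else if (c === "\\") this.escaped = true;
        else if (c === '"') {
          this.inString = false;
          // A closed *value* string is a safe snapshot point; a key is not.
          if (!this.stringIsKey) this.lastSafe = this.scanned + 1;
        }
        continue;
      }
      const top = this.stack[this.stack.length - 1];
      switch (c) {
        case '"':
          this.inString = true;
          this.stringIsKey = top?.kind === "obj" && top.expectKey;
          break;
        case "{":
          this.stack.push({ kind: "obj", expectKey: true });
          this.lastSafe = this.scanned + 1;
          break;
        case "[":
          this.stack.push({ kind: "arr" });
          this.lastSafe = this.scanned + 1;
          break;
        case "}":
        case "]":
          this.stack.pop();
          this.lastSafe = this.scanned + 1;
          break;
        case ":":
          if (top?.kind === "obj") top.expectKey = false;
          break;
        case ",":
          if (top?.kind === "obj") top.expectKey = true;
          break;
      }
    }
  }

  // True once the outermost container (or a top-level string) has closed.
  isComplete(): boolean {
    return this.lastSafe > 0 && this.stack.length === 0 && !this.inString;
  }

  // Deepest valid snapshot: keep an open *value* string (Allow: partial
  // strings), drop partial keys/numbers/literals back to the last safe
  // cut, then close every open container and parse the repaired copy.
  snapshot(): unknown {
    let base = this.inString && !this.stringIsKey
      ? (this.escaped ? this.buf.slice(0, -1) : this.buf) + '"'
      : this.buf.slice(0, this.lastSafe);
    if (base.trim() === "") return undefined;
    for (let i = this.stack.length - 1; i >= 0; i--) {
      base += this.stack[i].kind === "obj" ? "}" : "]";
    }
    return JSON.parse(base);
  }
}
```

Note the honest caveat: push() does linear work, but calling snapshot() on every chunk reparses the repaired buffer, which reintroduces some quadratic cost. That is exactly why the production libraries build the partial value incrementally instead of repairing and reparsing.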

If you have written a tokenizer in a college compilers class, this is that machine, but with the resume button wired up.

The partial-enum hazard

The most dangerous failure in streaming structured output is also the easiest to miss: partial values that are syntactically valid prefixes of two different complete values. The canonical example is enums. The model is generating "DeleteIfEmpty" for a policy field. At the moment the streaming parser sees "Delete, it has a syntactically valid partial string. If the consuming code matches on prefix — or if the parser surfaces the partial string and a downstream switch statement does a startsWith("Delete") check — the system has already routed to the destructive branch. By the time the rest of the token arrives and the value is "DeleteIfEmpty", the side effect has fired.

This is not hypothetical. It is the structured-output version of the classic streaming bug where a half-loaded URL fires its fetch. The fix is a discipline rather than a library:

  • Never branch on a partial enum value. If the value is a member of a closed set, gate the branch on the _done sentinel for that field, or on the parser emitting the field as "finalized" (most streaming parsers expose this distinction).
  • Make tool-call argument parsing strictly terminal. Tool calls are not display state. They are side effects. The streaming parser should expose arguments only when the closing brace of the arguments object has landed. The OpenAI streaming events distinguish tool_calls.function.arguments.delta (display only) from the finalized arguments — use that distinction.
  • Validate enum values against the schema, not against prefix matches. A partial string is not a member of the enum until the parser confirms it is.

The right mental model: streamed JSON has two zones. The display zone is where partial values are useful and partial-rendering is the whole point. The decision zone is where partial values are catastrophic and the parser must treat anything pre-finalization as if it were missing. Most frameworks do not draw this line for you. You have to draw it yourself, on the consuming side, per field.
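
On the consuming side, the line looks something like this sketch, assuming a parser that reports per-field finalization (the field shape and names here are hypothetical):

```typescript
// Decision-zone gating for a closed enum (field shape is hypothetical).
const POLICIES = ["Keep", "Delete", "DeleteIfEmpty"] as const;
type Policy = (typeof POLICIES)[number];

function routePolicy(field: { value: string; finalized: boolean }): Policy | null {
  // The display zone may render field.value as it streams. The decision
  // zone treats anything pre-finalization as missing.
  if (!field.finalized) return null;
  // Membership check against the closed set, never startsWith().
  return (POLICIES as readonly string[]).includes(field.value)
    ? (field.value as Policy)
    : null;
}
```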

The schema-streaming compromise nobody documents

Provider SDKs have grown streaming primitives — OpenAI's tool_calls.function.arguments.delta, the Anthropic streaming events, the various streamObject / streamText helpers in agent frameworks — and they are useful, but they all dodge the same question: what does it mean to validate a partial document against a schema?

The honest answer is that you cannot, and the SDKs that pretend otherwise are loose with definitions. A partial object cannot satisfy a "required field" check. A partial array cannot satisfy a minItems constraint. A partial string cannot satisfy a regex. The Vercel AI SDK is one of the few that says this in its docs: "Partial outputs streamed via streamText cannot be validated against your provided schema." Most frameworks let validation pass silently on partial state and only error if the final state fails. That is fine when the consumer treats partial state as display-only. It is wrong when the consumer treats partial state as decision input.

The compromise that actually works in production is two parsers in the same code path: a streaming-tolerant parser that emits partial state for the display zone, and a schema-validating parser (the SDK's built-in or a Zod/Pydantic gate) that runs only on the finalized payload and is the gate for the decision zone. The display zone reads from the first parser. The decision zone reads from the second. Tool calls, persisted writes, and any user-visible commitment go through the second.
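
As a sketch, with the scanner from above standing in for the streaming-tolerant parser and Zod as the schema gate (the schema, renderPartial, and executeToolCall are stand-ins):

```typescript
import { z } from "zod";

// Decision-zone schema: runs only on the finalized payload.
const ToolCallSchema = z.object({
  action: z.enum(["Keep", "Delete", "DeleteIfEmpty"]),
  recordId: z.string(),
});

declare function renderPartial(partial: unknown): void;                       // display zone (assumed UI hook)
declare function executeToolCall(args: z.infer<typeof ToolCallSchema>): void; // decision zone (assumed)

async function consume(stream: AsyncIterable<string>): Promise<void> {
  const scanner = new PartialJsonScanner();
  for await (const chunk of stream) {
    scanner.push(chunk);
    renderPartial(scanner.snapshot());  // partial state is display-only
  }
  if (!scanner.isComplete()) throw new Error("stream ended mid-document");
  const gated = ToolCallSchema.safeParse(scanner.snapshot());
  if (!gated.success) throw new Error(`final payload failed schema gate: ${gated.error}`);
  executeToolCall(gated.data);          // side effects fire on finalized state only
}
```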

This sounds obvious written down. In practice, every team that builds a streaming agent rediscovers it the hard way, usually after the partial-enum incident.

What to do on Monday

Three concrete moves.

First, audit any streaming structured-output code path for the two-parser pattern. If the same parsed value is feeding both a streaming UI and a tool call, you have the partial-enum hazard latent in the code, regardless of whether it has fired yet. Split the parsers. Make the display parser tolerant of partial state and the decision parser strict on finalization.

Second, replace any "wait for JSON.parse to succeed in a try/catch loop" code with a real streaming-tolerant parser. The library choice is a detail; the architectural change is moving from O(n²) re-parse-from-zero to O(n) incremental. Measure the streaming latency on a long response (8 KB or more) before and after. If your dashboards do not currently show parse cost separately from inference cost, this is also a sign that the observability layer is not granular enough.
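
A before-and-after measurement can be as simple as this sketch: replay a recorded long response through each consumer in small chunks and compare wall-clock parse cost (payload and chunk size are illustrative):

```typescript
// Replay a recorded response through a parser and report parse cost.
function measureParseCost(
  label: string,
  payload: string,                 // e.g. a recorded 8 KB+ response body
  feed: (chunk: string) => void,   // naive loop or incremental parser
  chunkSize = 5
): void {
  const start = performance.now();
  for (let i = 0; i < payload.length; i += chunkSize) {
    feed(payload.slice(i, i + chunkSize));
  }
  console.log(`${label}: ${(performance.now() - start).toFixed(1)} ms`);
}
```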

Third, pick a completion sentinel convention and stick to it for tool calls and any other terminal-only fields. _done: true as a reserved key the model emits last is a working pattern. The constrained-generation layer should enforce it, not the application layer trust it. The point of the sentinel is to make "the model is finished" a structural fact rather than a heuristic.

Streaming structured output is not magic and it is not impossible. It is a problem with a known shape, a known parser pattern, and a known failure mode that the framework defaults will not catch for you. The teams that get it right treat the parser as a first-class component of the AI feature, on the same architectural level as the prompt and the schema. The teams that get it wrong treat parsing as a JSON.parse call wrapped in a try/catch — and then spend a quarter chasing an incident that the parser, not the model, was always going to cause.
