
Streaming JSON Parsers: The Gap Between Tokens and Typed Objects

Tian Pan · Software Engineer · 11 min read

The model is emitting JSON token by token. Your UI wants to render fields the moment they materialize — a confidence score before the long answer body, the arguments of a tool call as the model fills them in. Then someone wires up JSON.parse on every chunk and the whole thing falls over, because JSON.parse is all-or-nothing. It needs a balanced document to return anything. Until the model emits the closing brace, you have nothing to show.
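The failure mode takes three lines to reproduce. A minimal sketch (the chunk contents are hypothetical, chosen to mimic a token stream):

```typescript
// JSON.parse rejects every prefix of a streamed document until the final
// token lands: partial input is "invalid input", not "not yet".
const chunks = ['{"score": 0.', '87, "answer": "Par', 'is"}'];

let buffer = "";
const parsedAt: number[] = [];
for (let i = 0; i < chunks.length; i++) {
  buffer += chunks[i];
  try {
    JSON.parse(buffer);
    parsedAt.push(i); // only the final chunk ever reaches this line
  } catch {
    // every earlier prefix throws, so the UI has nothing to render
  }
}
console.log(parsedAt); // → [2]
```

Until the closing brace arrives, every iteration lands in the catch block.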

This is not a parser problem you can fix with a try/catch. The standard JSON parser was designed for input with a known end: an HTTP response whose length is declared up front. Partial input is not a state it models — it is "input error." When you treat a token stream as if it were an HTTP body, you inherit thirty years of "the document is either complete or invalid," and your UI pays the bill.

The fix is not to "parse harder." The fix is to recognize that structured output and streaming output are not orthogonal features you compose at runtime. They are a single design problem the model API and the client library have to solve together — and the team that ships first usually does it by writing their own partial parser, then writes it again three months later when they discover their first version was O(n²).

The All-or-Nothing Trap and What It Actually Costs

The first instinct of every team that hits this is the same: "Just buffer until we see a closing brace, then parse." It works in dev. It works in your eval set. It fails in production for two reasons.

The latency tax is bigger than you think. A 12 KB structured response from a frontier model takes seconds to fully emit. If your UI waits for the last token, the user stares at a spinner for the entire duration. Industry write-ups put the perceived-latency improvement from progressive rendering at roughly 60–70% — not because the response is faster, but because the user sees motion. The work happens at the same speed; the waiting changes shape. A field that fills in over five seconds feels alive. A spinner that resolves in five seconds feels broken.

The naive "incremental" fix is worse than the wait. The most common mistake teams make is to keep buffering, but call JSON.parse (or a JSON-repair library) on every new chunk to "see what we have so far." That gives you a parser that re-parses the entire prefix on every token. For a stream that arrives in 5-character chunks, processing a 12 KB response means walking roughly 15 million characters when you only have 12,000 — classic O(n²) behavior. One detailed engineering write-up measured this directly: the naive approach took 16.7 seconds for a 12 KB document and the final chunks were taking 19–20 milliseconds each; a stateful incremental parser ran the same workload in 43 milliseconds, with per-chunk latency under 30 microseconds. That is a 388x speedup, and the naive version became unusable around the 5 KB mark — the "janky zone" where typing visibly stalled.
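The quadratic blow-up falls out of arithmetic alone. A back-of-envelope model (the function is invented for this sketch; numbers match the 12 KB, 5-character-chunk scenario above):

```typescript
// Model of the naive approach: calling a full parse on every new chunk
// re-reads the entire buffered prefix each time, so total work grows as
// roughly n^2 / (2 * chunkSize) characters for an n-byte document.
function charsWalkedByReparse(totalBytes: number, chunkSize: number): number {
  let buffered = 0;
  let walked = 0;
  while (buffered < totalBytes) {
    buffered = Math.min(buffered + chunkSize, totalBytes);
    walked += buffered; // each "parse what we have so far" re-reads the prefix
  }
  return walked;
}

console.log(charsWalkedByReparse(12_000, 5)); // → 14406000, ~14.4M chars for 12 KB
```

A stateful incremental parser walks each of the 12,000 characters exactly once, which is where the three-orders-of-magnitude gap comes from.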

The trap is not that the naive parser is wrong. The trap is that it works fine in your prototype, where the responses are 200 bytes, and degrades silently as your prompts grow. By the time it is bad, you have shipped it.

What a Real Partial Parser Has to Do

A partial JSON parser is not a relaxed JSON parser. It is a state machine that explicitly models the open positions in the document: which strings are mid-character, which arrays are mid-element, which objects have an unbalanced brace. From that state, it has to make decisions the standard parser never had to make.

Incremental tokenization that survives mid-string truncation. When the stream cuts off in the middle of a value — "answer": "Sure, the capi — the parser has to decide what to yield. Library options here include "yield the partial string as-is," "yield the partial string with a marker that it is incomplete," or "do not yield this field until the closing quote arrives." Different applications want different answers. A chat UI probably wants partial strings. A function-call argument probably does not, because executing a tool with a half-typed argument is worse than waiting one more chunk.
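As a toy illustration of the "yield the partial string with a marker" policy, here is a single-field extractor. This is a sketch, not any library's API; the function name, the hard-coded `answer` key, and the `PartialString` shape are all invented:

```typescript
type PartialString = { text: string; complete: boolean };

// Extract the value of "answer" from a possibly-truncated JSON prefix,
// flagging whether the closing quote has arrived yet.
function readAnswer(prefix: string): PartialString | null {
  const key = '"answer": "';
  const start = prefix.indexOf(key);
  if (start < 0) return null; // the field has not begun yet
  let text = "";
  for (let i = start + key.length; i < prefix.length; i++) {
    const c = prefix[i];
    if (c === "\\") { text += prefix[i + 1] ?? ""; i++; continue; } // naive escape skip
    if (c === '"') return { text, complete: true }; // closing quote seen
    text += c;
  }
  return { text, complete: false }; // stream cut mid-string
}
```

A chat UI renders `text` either way; a tool-call executor checks `complete` first.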

In-flight schema validation. If the model emits {"score": "high" and your schema says score is a number, you do not want to wait until the document closes to find out. A schema-aware streaming parser can flag the violation the moment the type is decided — which matters because the model has not yet emitted the rest of the document, and a re-prompt or retry can be cheap. Catching it after the closing brace is a wasted full generation.
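The reason this is possible: a JSON value's type is fixed by its first non-whitespace character, so a prefix-aware checker can reject long before the document closes. A sketch (function name invented):

```typescript
type JsonType = "string" | "number" | "object" | "array" | "boolean" | "null" | "unknown";

// The first character of a JSON value decides its type, which is why a
// mismatch like {"score": "high"... is detectable mid-stream.
function valueTypeFromFirstChar(c: string): JsonType {
  if (c === '"') return "string";
  if (c === "{") return "object";
  if (c === "[") return "array";
  if (c === "t" || c === "f") return "boolean";
  if (c === "n") return "null";
  if (c === "-" || (c >= "0" && c <= "9")) return "number";
  return "unknown";
}

// Schema says score is a number; the stream has emitted {"score": " so far.
const violation = valueTypeFromFirstChar('"') !== "number"; // true: flag now, re-prompt cheaply
```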

Repair strategies for the truncation case. When the upstream connection drops mid-document, you have a choice: discard the partial, return whatever sub-structure parses cleanly, or attempt to close open structures heuristically. The third option is what libraries like best-effort-json-parser and partial-json-parser-js exist to do — return the parsed structure from [1, 2, {"a": "apple"] rather than throwing on the missing brackets. The right default is application-dependent, and it should be a parameter, not a parser-internal assumption.
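A minimal version of the heuristic-close strategy fits in a page. This is a sketch of the idea, not the implementation of either library, and it ignores awkward cuts (for example, truncation right after a colon):

```typescript
// Best-effort repair: track which brackets are still open (and whether the
// cut landed inside a string), close them all, then hand off to JSON.parse.
function bestEffortParse(partial: string): unknown {
  const stack: string[] = [];
  let inString = false;
  for (let i = 0; i < partial.length; i++) {
    const c = partial[i];
    if (inString) {
      if (c === "\\") i++;          // skip the escaped character
      else if (c === '"') inString = false;
    } else if (c === '"') inString = true;
    else if (c === "{") stack.push("}");
    else if (c === "[") stack.push("]");
    else if (c === "}" || c === "]") stack.pop();
  }
  if (inString) partial += '"';            // close a string cut mid-value
  let repaired = partial.replace(/,\s*$/, ""); // drop a dangling comma
  while (stack.length) repaired += stack.pop(); // close open structures, innermost first
  return JSON.parse(repaired);
}

console.log(bestEffortParse('[1, 2, {"a": "apple"')); // the article's example, recovered
```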

Type-aware progressive yield. The parser should emit each field as soon as the field is decided — meaning the value's type is fixed and the value is either complete or growing in a known direction (a string getting longer, an array getting more elements). This is what lets the UI bind to fields individually and re-render as each one settles. The Vercel AI SDK's streamObject and useObject hooks expose a partialObjectStream that does exactly this: each yielded object is the schema-shaped state-so-far, with later fields populated as they arrive.
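The consumer-side contract looks roughly like this. The stream below is mocked for self-containment; in the Vercel AI SDK the equivalent iterable comes from `streamObject(...).partialObjectStream`, and the `Report` shape here is invented:

```typescript
// Each yielded value is the schema-shaped state-so-far; later fields
// populate as the model emits them.
type Report = { score?: number; summary?: string };

async function* mockPartialObjectStream(): AsyncGenerator<Report> {
  yield {};
  yield { score: 0.87 };                                   // score settles first
  yield { score: 0.87, summary: "Par" };                   // summary starts growing
  yield { score: 0.87, summary: "Paris is the capital." }; // summary settles
}

async function render(stream: AsyncIterable<Report>): Promise<Report> {
  let latest: Report = {};
  for await (const partial of stream) {
    latest = partial; // a UI would re-bind each field as it settles
  }
  return latest;
}

render(mockPartialObjectStream()).then((final) => console.log(final));
```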

These are not independent features. A parser that does incremental tokenization but no schema validation is a tokenizer with extra steps. A parser that validates types but cannot yield partial strings is useless for chat UIs. The shape of the parser falls out of the shape of the contract you want to give the application.

Why Existing JSON Tools Were the Wrong Starting Point

Most JSON parsers in production were built for a world where input arrives with a content-length header and a finite end. Their state machines do not model "the input might continue arbitrarily." Their error paths assume that an unbalanced brace at EOF is a bug, not a not-yet. When you ask them to handle streams, they handle streams the same way they handle files: read the whole thing into a buffer, then parse.

This is why nearly every team that ships LLM-driven structured output ends up either writing its own partial parser, vendoring one of the half-dozen open-source ones (partial-json-parser, best-effort-json-parser, incomplete-json-parser, gjp-4-gpt, streamjson, openai-partial-stream), or getting a partial-parser-shaped abstraction for free from a higher-level SDK like Vercel AI's streamObject. The proliferation of these libraries is the diagnostic — if the standard tooling worked, none of them would exist.

Notice what they all have in common: each one is a streaming parser plus a yield contract. The interesting design choices are not in the parsing — context-free grammars for JSON are textbook material. The interesting choices are: when do you yield a partial value, what do you yield it as, how do you communicate "this is still being written," and how do you let the consumer opt into different repair strategies for different field types. Those choices are what differentiate a usable streaming parser from a clever one.

There is a parallel layer of "make the model emit valid JSON in the first place" — grammar-constrained decoding via libraries like XGrammar, llguidance, and Outlines. These mask out invalid tokens at decode time, so the stream is structurally correct by construction. But constrained decoding does not solve the parsing problem. It just guarantees that whatever you have parsed so far would be valid if the document closed here. You still need a streaming parser to extract values incrementally; constrained decoding only spares you the worry that the values are syntactically broken.

The Eval Discipline No One Has

Streaming parsers are subtly broken in ways that do not show up in unit tests. The standard test fixture is "give it a complete JSON document, watch it return the parsed value." The interesting failures live elsewhere.

Cut at every byte boundary. The right eval generates synthetic streams that arrive in chunks of every conceivable size — one byte at a time, two bytes, irregular bursts, the whole document at once. The parser must produce a coherent partial result or a clean error at every cut point, never undefined behavior, never a stale field, never a value that contradicts the next chunk's continuation. If your eval only covers "complete document arrives in one chunk," you have tested the boring case and shipped the bug.
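The harness for this is short. Running plain `JSON.parse` through it (names invented for the sketch) shows exactly why it fails the eval: it succeeds only at the final cut point.

```typescript
// Replay one document cut at every byte boundary and count the cut points
// where the parser-under-test throws instead of yielding a partial result.
function failuresAtEveryCut(
  doc: string,
  parsePartial: (prefix: string) => unknown
): number {
  let failures = 0;
  for (let cut = 1; cut <= doc.length; cut++) {
    try {
      parsePartial(doc.slice(0, cut)); // must not throw at any cut
    } catch {
      failures++;
    }
  }
  return failures;
}

console.log(failuresAtEveryCut('{"a": 1}', JSON.parse)); // → 7: only the full 8-byte document parses
```

A real partial parser should score zero here, and the same loop generalizes to chunk sizes of 2, 3, and irregular bursts.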

Cut inside escape sequences. The fun ones are mid-\u-escape, mid-surrogate-pair, mid-multi-byte-UTF-8. A parser that handles these silently (does not yield the broken character, waits for the next chunk) is correct. A parser that yields a malformed code point and pushes the problem to the renderer is a bug that will show up as squares in someone's chat history.
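One concrete defense, sketched below with an invented function name: compute the longest prefix of a decoded string that is safe to hand to the renderer, holding back a trailing half-escape or unpaired surrogate. (JS strings are UTF-16, so the mid-multi-byte case surfaces as a lone high surrogate.)

```typescript
// Return the length of the longest renderable prefix of s, holding back
// a cut-off escape sequence or an unpaired surrogate at the tail.
function safeYieldLength(s: string): number {
  // An odd-length run of trailing backslashes means the cut landed mid-escape.
  let bs = 0;
  while (bs < s.length && s[s.length - 1 - bs] === "\\") bs++;
  if (bs % 2 === 1) return s.length - 1;
  // An incomplete \uXXXX escape: backslash, u, fewer than 4 hex digits.
  const m = s.match(/\\u[0-9a-fA-F]{0,3}$/);
  if (m) return s.length - m[0].length;
  // A lone high surrogate at the end: the cut landed mid-surrogate-pair.
  const last = s.charCodeAt(s.length - 1);
  if (last >= 0xd800 && last <= 0xdbff) return s.length - 1;
  return s.length; // nothing suspicious at the tail
}
```

The renderer gets `s.slice(0, safeYieldLength(s))`; the held-back tail joins the next chunk.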

Cut at the end of a string with no closing quote. Is "answer": "hello world complete or not? The parser cannot know. The right behavior is configurable — yield it provisionally, yield it as a trailing-truncation hint, or hold it back. The wrong behavior is "yield it as the final value of the field" because then the next chunk will arrive and you will have to retract it, which the consumer probably does not handle gracefully.

Schema violations at every prefix. If the schema says a field must be an enum of three strings, the parser should signal a violation the moment the model emits the fourth byte of a non-matching value, not when the document closes. This requires the schema validator to be prefix-aware — to know which validations can be evaluated on partial input and which require completion.
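Prefix-aware enum validation is a one-liner in spirit (sketch; names invented): a partial string value is still viable only while some allowed value has the emitted bytes as a prefix.

```typescript
// An enum constraint is checkable on partial input: the moment no allowed
// value starts with the bytes emitted so far, the field has already failed.
function enumStillPossible(emittedSoFar: string, allowed: string[]): boolean {
  return allowed.some((v) => v.startsWith(emittedSoFar));
}

console.log(enumStillPossible("hig", ["high", "medium", "low"]));  // → true, keep streaming
console.log(enumStillPossible("higm", ["high", "medium", "low"])); // → false, flag at the 4th byte
```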

The reason most teams skip this eval is that the failures are quiet. A parser that returns wrong intermediate state but right final state passes most acceptance tests. The wrongness shows up as flicker in the UI, as a tool that fires with the wrong argument because the upstream parser yielded a half-typed value as the final one, as a retry loop that triggers because a transient partial state failed validation. None of these debug as "the JSON parser is broken." They debug as a hundred unrelated UI bugs.

The Architectural Realization

The lesson hiding under all of this: structured output and streaming output are not features you compose at the application layer. They are a single contract that has to be designed at the API layer.

Look at where the leverage actually sits. The model API decides the chunking. The model API decides whether to surface delta events for tool-call arguments, or to send them as one block at the end (which is the difference between a usable streaming tool-call UI and a frustrating one — and a difference that varies by provider, by model version, by whether the tool was registered with eager-streaming hints). The SDK decides the parser contract. The application is the consumer of decisions made two layers up.

When the model API and the SDK have not co-designed for this, the application gets bolted-together streams that almost work — tool-call arguments that stream tokens but not parsed deltas, structured outputs that stream characters but not fields, partial documents that the SDK cannot tell you whether to render or wait on. Every one of these gaps becomes a parser the application team has to build, an eval suite they have to write, a class of bugs they have to learn the hard way.

The teams who ship reactive, structured AI UIs are not the ones with the cleverest parsers. They are the ones who picked an SDK whose streamObject (or equivalent) yields type-shaped partial state out of the box, and whose model API exposes the field-level deltas the parser needs to do its job. Everyone else is rebuilding it badly, and discovering — usually under a launch deadline — that "stream tokens" was the easy half of the problem.

The next inflection point will probably not be a better parser. It will be model APIs that yield typed deltas instead of byte deltas — emit "the score field is now 0.87, the summary field is appending these tokens" rather than "here are the next 12 bytes." When the wire protocol speaks the application's vocabulary, the parser problem dissolves. Until then, write the eval that cuts at every byte boundary, and pick a library whose yield contract you can defend.
