
The Streaming UI That Committed a Partial Answer Your Model Never Finished

Tian Pan
Software Engineer

The post-mortem read like a hallucination report. A user had acted on a confidently-worded recommendation that turned out to be wrong in a way the model would not have written if it had finished — except the trace showed the model had not finished. The provider connection dropped at token 412 of an expected 800. The client's error handler logged the failure. The persisted partial message, written to the conversation history as tokens arrived, sat in the user's UI looking exactly like every other complete answer. They acted on it. Support categorized the ticket as a content-quality issue. It took two weeks to route it to the platform team.

Nothing in this chain was a model failure. The model behaved correctly for the 412 tokens it produced. The failure was that the streaming UI and the durable conversation history had quietly disagreed about what counts as a message — and during the exact failure mode that streaming was supposed to make tolerable, the disagreement became the canonical record.

This is the contract between optimistic rendering and durable storage. Most chat products inherit it from a tutorial or a framework without thinking about it as a contract at all, and the gap shows up as a tail of incidents that look like model bugs and aren't.

Streaming Is a UX Optimization Layered Over a Transactional Contract

The reason streaming exists in chat UIs is perceived latency. A model that takes eight seconds to produce eight hundred tokens feels unusable if you wait eight seconds for the whole response, and entirely usable if you see the first token in 300 milliseconds. The optimization is real and the user impact is real. What the optimization does not change is the underlying question of when a message exists.

In a non-streaming world that question is easy. The HTTP call returns a complete response, the client persists it, the UI renders it. Persistence and display are downstream of one event. There is one moment at which a message becomes a message, and the entire system agrees on it.

Streaming pulls those operations apart. Now there is a sequence of events — tokens arrive, the UI updates, the client may or may not persist as it goes, the stream may complete cleanly with a message_stop event that carries a stop_reason, or it may drop. Each of those operations needs a position on the question of message existence, and the temptation is to let each subsystem answer it independently. The renderer says "I have tokens, I will render." The persistence layer says "I have tokens, I will persist." The error handler says "I have an error, I will log." Nobody is the contract owner.

The contract that matters is: a message exists in the conversation history if and only if the model finished producing it. Anything else is an in-flight artifact. The streaming UI is allowed to display in-flight artifacts as a UX courtesy. The history is not allowed to record them as completed messages. Once you write the contract down it is obvious. The reason teams get it wrong is that nobody writes it down, and the framework defaults treat the in-flight buffer and the persisted history as the same store.
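One way to write the contract down is to make it a type boundary. The sketch below is a minimal illustration, not a prescribed implementation: `Draft`, `Message`, `History`, and `IncompleteMessageError` are hypothetical names, and the only rule it encodes is that a draft can become a history entry solely through a commit that carries a stop reason.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Draft:
    """In-flight artifact: safe to render, never part of the history."""
    chunks: List[str] = field(default_factory=list)

    def append(self, text: str) -> None:
        self.chunks.append(text)

    @property
    def text(self) -> str:
        return "".join(self.chunks)


@dataclass(frozen=True)
class Message:
    """A finished message: it cannot be built without a stop_reason."""
    text: str
    stop_reason: str


class IncompleteMessageError(Exception):
    """Raised when something tries to commit a draft that never finished."""


class History:
    def __init__(self) -> None:
        self.messages: List[Message] = []

    def commit(self, draft: Draft, stop_reason: Optional[str]) -> Message:
        # The single place where a draft is allowed to become a message.
        if stop_reason is None:
            raise IncompleteMessageError("stream ended without a terminal event")
        message = Message(text=draft.text, stop_reason=stop_reason)
        self.messages.append(message)
        return message
```

The renderer can read `Draft.text` as often as it likes. Nothing reaches `History.messages` unless the caller can produce the stop reason, which makes the commit path the contract owner.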

How the Optimistic Write Becomes the Canonical Record

The pattern that produces the incident usually starts with a reasonable feature: don't lose the response if the user refreshes the page mid-generation. The implementation that follows is almost always the same. Persist each chunk to local storage or to the server as it arrives. On reload, reconstruct the conversation from whatever was persisted. The user gets back to where they were, and the streaming felt instant.

The trap is in the failure path. The stream completes by sending a terminal event that includes a stop_reason: end_turn for a natural completion, max_tokens for budget exhaustion, tool_use for a tool call boundary, stop_sequence for an explicit stop. A stream that ends without that terminal event has not produced a message; it produced a fragment. But the persistence path that wrote chunks as they arrived has no notion of terminal events. It wrote what it received. The error handler logs the disconnect but does not roll back the writes. The next reload sees a record indistinguishable from a successful generation, and there is nothing in the data model to mark it otherwise.
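The distinction between a message and a fragment can be made explicit at the point where the stream is consumed. The following is a sketch under simplifying assumptions: the event dictionaries (`"delta"` with a `text` field, `"message_stop"` with a `stop_reason` field) are a made-up shape that mirrors the description above, not any provider's actual wire format, and `consume`, `dropped_stream`, and `clean_stream` are hypothetical helpers.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def consume(events: Iterable[dict]) -> Tuple[str, Optional[str]]:
    """Collect streamed text and return (text, stop_reason).

    stop_reason is None when the stream ended, or raised, before the
    terminal event arrived. In that case the text is a fragment.
    """
    parts: List[str] = []
    stop_reason: Optional[str] = None
    try:
        for event in events:
            if event["type"] == "delta":
                parts.append(event["text"])
            elif event["type"] == "message_stop":
                stop_reason = event["stop_reason"]
                break
    except ConnectionError:
        # A dropped connection leaves stop_reason as None on purpose.
        pass
    return "".join(parts), stop_reason


def dropped_stream() -> Iterator[dict]:
    """Simulated stream that dies mid-sentence (assumed shape)."""
    yield {"type": "delta", "text": "the safe choice "}
    yield {"type": "delta", "text": "here is to"}
    raise ConnectionError("provider connection dropped")


def clean_stream() -> Iterator[dict]:
    """Simulated stream that reaches its terminal event (assumed shape)."""
    yield {"type": "delta", "text": "the safe choice here is to wait."}
    yield {"type": "message_stop", "stop_reason": "end_turn"}
```

With this shape, `consume(dropped_stream())` returns the half sentence together with `None`, and the caller has what it needs to refuse the write. A persistence path that only ever sees the text has nothing to refuse on.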

The user-visible failure is the worst possible one. The partial response is rendered with the same styling as a complete one. The conversation continues from it. The user acts on it. If the disconnect cut the model off mid-recommendation — "the safe choice here is to" — the partial reads as authoritative for the first half of a sentence the model never wrote. There is no warning in the UI because the UI had no way to know.

This is not a hypothetical. Bug reports against major chat products and SDKs document the exact pattern, and the operational fix in nearly every case is to add a finalization handshake that nobody had thought to need.
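A finalization handshake can coexist with chunk-by-chunk persistence if each stored record carries an explicit status. The sketch below is one possible shape, assuming a single writer per message; `MessageStore` is a hypothetical in-memory stand-in for local storage or a server table, and the status names are illustrative.

```python
from typing import Dict, List

STREAMING, COMPLETE, INTERRUPTED = "streaming", "complete", "interrupted"


class MessageStore:
    """Hypothetical stand-in for whatever store survives a page reload."""

    def __init__(self) -> None:
        self.rows: Dict[str, dict] = {}

    def append_chunk(self, message_id: str, text: str) -> None:
        # Chunks are still persisted as they arrive, but the row is
        # born in the "streaming" state, never "complete".
        row = self.rows.setdefault(
            message_id, {"text": "", "status": STREAMING, "stop_reason": None}
        )
        row["text"] += text

    def finalize(self, message_id: str, stop_reason: str) -> None:
        # The handshake: only the terminal event may mark a row complete.
        row = self.rows[message_id]
        row["status"] = COMPLETE
        row["stop_reason"] = stop_reason

    def recover_on_reload(self) -> List[str]:
        # Assumes no stream is still running elsewhere: any row left in
        # "streaming" at reload never saw its terminal event.
        interrupted = []
        for message_id, row in self.rows.items():
            if row["status"] == STREAMING:
                row["status"] = INTERRUPTED
                interrupted.append(message_id)
        return interrupted
```

On reload, the UI can render `interrupted` rows with distinct styling or offer a retry, while only `complete` rows are treated as conversation history. The refresh-recovery feature survives, and the record no longer claims a completion that never happened.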

The Draft Slot Pattern
