The Streaming Response That Contradicts Itself
The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no, and here is why." The end state is correct. The user has already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.
This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.
Streaming Is a Read-Uncommitted UI
Databases have a name for this: dirty reads. A transaction is mid-flight, its writes are not yet committed, and a reader sees state that may still roll back. The industry decided long ago that surfacing dirty reads without telling the user is a bug: many widely used databases, PostgreSQL, Oracle, and SQL Server among them, default to read committed, where you see a value only after the transaction that wrote it has committed.
Streaming LLM output is read-uncommitted by default. The model is writing into the visible buffer as it generates, and the reader has no isolation boundary. Every token is presented with the same visual weight as the final answer, because at render time there is no signal of provisionality. The reasoning model that uses the next 800 tokens to revise its position is, from the UI's perspective, just typing more words.
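The difference between the two isolation levels can be sketched in a few lines of Python. This is a toy illustration, not a real streaming client: the function names `read_uncommitted` and `read_committed` and the hard-coded chunks are assumptions made for the example, and "commit" is taken to mean simply "the stream has ended."

```python
from typing import Iterable, Iterator


def read_uncommitted(chunks: Iterable[str]) -> Iterator[str]:
    """Yield what the user sees after each chunk: everything generated so far."""
    visible = ""
    for chunk in chunks:
        visible += chunk
        yield visible  # every partial state is rendered as if it were final


def read_committed(chunks: Iterable[str]) -> Iterator[str]:
    """Yield a single view, only once the whole stream has finished."""
    yield "".join(chunks)


# Hypothetical chunks standing in for a response that reverses itself.
chunks = ["Yes. ", "Actually, on reflection, ", "no."]

uncommitted_views = list(read_uncommitted(chunks))
committed_views = list(read_committed(chunks))

# The first thing a streaming reader sees is the claim the model later retracts.
print(uncommitted_views[0])
# The read-committed reader only ever sees the end state.
print(committed_views[0])
```

Both readers end up with the same final text. The only difference is which intermediate states were ever visible, and that is exactly the part the user may have acted on.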
The original justification for streaming still holds: time-to-first-token (TTFT) can drop from ten seconds to a few hundred milliseconds, and the perceived-latency win is real. A TTFT under one second keeps the user's flow of thought intact; ten seconds is the outer edge of attention. So streaming wins the perception game in exchange for a hidden cost: the user begins reading, and committing to decisions, before the model has finished thinking.
In short responses this is invisible. The first sentence is the only sentence. In any response long enough to contain a revision, the trade reverses. The model spends its later tokens fixing itself, and the user has already moved on.
Where the Contradictions Come From
This is not a streaming bug. The contradictions are baked into how modern reasoning models generate text. Streaming just makes them visible at the wrong moment.
The pattern shows up in several places. Restoration errors are well-documented in chain-of-thought research: a model makes a mistake in step two, realizes it in step four, silently corrects it, and presents the corrected answer as if the error never happened. The reasoning trace is internally inconsistent, even when the final answer is right. Implicit post-hoc rationalization is worse: the model decides on an answer through one path and then constructs a confident-sounding justification that does not match how it actually got there. Self-contradiction within a single response is well-attested in evaluation suites — models will assert and deny logically incompatible claims inside the same paragraph, especially when the question invites hedging.
Reasoning models, specifically, are designed to think out loud. Their training rewards exploration in the visible reasoning, including walking back wrong intermediate positions. That is exactly the behavior you want during private deliberation. It is exactly the behavior that breaks when the deliberation is streamed verbatim into a user-facing interface.
Long-form tasks make it worse. The longer the response, the more the model has to walk back, qualify, or reverse. Summaries that lead with a confident take and then add "however, the data also shows…" three paragraphs later are not a quirky failure mode. They are the modal output for any task that involves weighing evidence.
The Eval Misses It Because the Eval Sees the Whole Response
The standard eval pipeline reads the full response after generation completes and scores against a reference. End-state correctness is the metric. By that metric, a response that says "yes — actually no — here's why no" scores the same as a response that says "no, here's why."
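A minimal sketch of the gap, assuming a yes/no task: the end-state metric scores the full response, while a prefix-aware variant scores what the reader has seen after each paragraph. The `toy_verdict` grader here is a hypothetical stand-in (it takes the last "yes" or "no" in the text as the current verdict); a real pipeline would substitute its own grader, and the function names are illustrative only.

```python
import re
from typing import List, Optional


def toy_verdict(text: str) -> Optional[str]:
    """Hypothetical grader: treat the last 'yes' or 'no' in the text as the verdict."""
    matches = re.findall(r"\b(yes|no)\b", text.lower())
    return matches[-1] if matches else None


def end_state_score(paragraphs: List[str], reference: str) -> bool:
    """What a standard eval sees: the whole response, scored once."""
    return toy_verdict("\n\n".join(paragraphs)) == reference


def prefix_scores(paragraphs: List[str], reference: str) -> List[bool]:
    """What a streaming reader sees: the verdict after each paragraph so far."""
    scores = []
    for i in range(1, len(paragraphs) + 1):
        prefix = "\n\n".join(paragraphs[:i])
        scores.append(toy_verdict(prefix) == reference)
    return scores


# Two hypothetical responses to a question whose reference answer is "no".
revised = ["Yes.", "Actually, no.", "Here is why the answer is no."]
direct = ["No.", "Here is why the answer is no."]

print(end_state_score(revised, "no"), end_state_score(direct, "no"))
print(prefix_scores(revised, "no"))
print(prefix_scores(direct, "no"))
```

Under the end-state metric the two responses are indistinguishable. The per-prefix scores differ only in the first paragraph, which is the one a reader who stops early takes away.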
