The Streaming Response That Contradicts Itself
The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.
This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.
