
Hyrum's Law for Streamed Reasoning: Pacing, Pauses, and Intermediate Tokens Are an Undocumented Contract

· 11 min read
Tian Pan
Software Engineer

A team upgrades from a frontier model to its faster successor. The eval suite is green. Final answers match. Tool-call schemas are identical. The structured outputs validate against the same JSON schema they always did. They ship. Within a day, support tickets pile up: "the assistant feels rushed," "it's not really thinking anymore," "something is off." The product manager pulls telemetry and finds task-completion rates unchanged. The engineering team double-checks the eval and the schema and finds nothing wrong. The complaint is real, but the contract — as the team defined it — is intact.

What changed is the texture of the stream. The old model paused for 800 milliseconds before calling a tool, emitted a "Let me check that..." preamble, and dribbled tokens at roughly 35 per second with natural-feeling clusters around clause boundaries. The new model emits tokens at 90 per second, never pauses, and skips the preamble entirely. None of that was in any documented contract. All of it was load-bearing.

This is Hyrum's law, and streaming makes its surface area enormous. With enough users, every observable behavior of your system will be depended on by somebody — and a streaming AI surface exposes far more observable behavior than the team realizes.

The hidden surface that streaming exposes

Hyrum's law was formulated for traditional APIs, where the observable behavior includes things like response field ordering, error message strings, the precise format of timestamps, and the rate-limiting behavior nobody documented. Teams that maintain widely-used libraries learn this the hard way: a "harmless" change to an undocumented behavior breaks somebody's production.

Streaming AI systems multiply that surface. The protocol-level contract — the SSE event types, the JSON schema of the final structured output, the tool-call signatures — is the part teams version and review. But streaming a response token-by-token exposes layer after layer of behavior the team never wrote down:

  • Token emission rate. The user sees a typewriter effect, and the rate at which characters appear shapes their perception of "thinking." Research shows users perceive streaming interfaces as roughly 40% faster than buffered responses even when total latency is identical, and the pacing is part of why.
  • Pause patterns. A 600-millisecond pause before the model calls a tool reads as deliberation. A 50-millisecond pause reads as instant lookup. The presence, absence, and length of these pauses become felt-meaning that users calibrate against.
  • Order of field materialization in structured output. When a model streams a JSON object, the order in which fields land matters to any consumer that starts rendering before the full payload arrives. A frontend that displays summary first and then details will look different if the model now emits details first.
  • Visible "thinking" text. Models that expose chain-of-thought reasoning emit an intermediate stream — sometimes labeled, sometimes implicit — that users read as the model's process. The phrasing, the hedging, the "wait, let me reconsider" turns are all observable.
  • Interim status messages. "Searching for relevant documents..." "Compiling results..." These are user-facing strings that downstream UIs match on, animate around, or simply trust to mean what they used to mean.
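
These stream-level behaviors are all measurable, even though nobody versions them. As a minimal sketch (the names and event shape are hypothetical, not any particular SDK's API), here is what a "stream texture" fingerprint might look like, computed from a list of `(timestamp, token)` events:

```python
from dataclasses import dataclass

@dataclass
class StreamTexture:
    """Undocumented-but-observable properties of a token stream."""
    ttft: float            # seconds from request to first token
    tokens_per_sec: float  # mean emission rate after the first token
    max_pause: float       # longest inter-token gap, in seconds

def measure_texture(request_time: float,
                    events: list[tuple[float, str]]) -> StreamTexture:
    """events: (timestamp, token) pairs in arrival order, timestamps in seconds."""
    times = [t for t, _ in events]
    ttft = times[0] - request_time
    duration = times[-1] - times[0]
    rate = (len(events) - 1) / duration if duration > 0 else float("inf")
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    return StreamTexture(ttft, rate, max(gaps, default=0.0))
```

Every field in that dataclass is something a user or a downstream consumer can observe and anchor to — and nothing in a typical model-swap checklist asserts on any of them.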

Each of these is a behavior the team would have called an "implementation detail" if anyone had asked. None of them is a detail to a downstream consumer who has come to depend on it.

How the dependency forms

The team that ships a streaming AI feature rarely sets out to depend on these behaviors. The dependency forms in two ways, and both are quiet.

The first is on the user side, and it forms through perception rather than code. A user who interacts with the product daily develops an intuition for what "the assistant is thinking carefully" looks like — and that intuition is anchored to specific observable signals. When the streaming feel changes, the intuition is wrong. The user doesn't know what their intuition was anchored to; they only know something feels off. They report it as "quality regressed" because that is the only vocabulary they have.

The second is on the consumer side, and it forms through code. A frontend animation team times a "loading dots" transition to the gap between the user's submit and the first non-whitespace token. A developer-tools team writes a regex against the intermediate "thinking" stream to extract intermediate tool-call arguments for a debugging panel. A logging pipeline parses status messages to label spans in a trace. An A/B-test analysis tool measures TTFT (time to first token) as a proxy for "model engagement," and a dashboard on someone's wall shows it as a KPI.
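
To make the second kind of dependency concrete, here is a hypothetical version of that debugging-panel regex (the pattern and function are illustrative, not taken from any real codebase). It works only as long as the model phrases its preamble the way it always has:

```python
import re
from typing import Optional

# Hypothetical downstream consumer: a debug panel extracts a tool hint
# from the model's visible "thinking" stream by matching its usual
# preamble phrasing -- an undocumented dependency on stream content.
THINKING_PATTERN = re.compile(r"Let me check (?:the )?(\w+)")

def extract_tool_hint(thinking_chunk: str) -> Optional[str]:
    match = THINKING_PATTERN.search(thinking_chunk)
    return match.group(1) if match else None

# Old model: emits the preamble, so the panel lights up.
assert extract_tool_hint("Let me check the weather for that city...") == "weather"
# New model: skips the preamble entirely; the regex never matches and
# the panel silently goes blank -- no error, no crash, just absence.
assert extract_tool_hint("") is None
```

Nothing here is wrong by the documented contract, which is exactly why the breakage surfaces as a mystery rather than an alert.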

None of these consumers asked the AI team's permission. Most of them never told the AI team they existed. They just observed a behavior and depended on it — exactly as Hyrum's law predicts.

The model swap as a breaking change

When the AI team swaps the underlying model, the contract they think they're preserving is the structured output schema and the tool-call signatures. The contract that actually exists includes everything above. A model swap that preserves the documented surface and changes the undocumented one is, in Hyrum's-law terms, a breaking change — even though the team would call it a non-breaking change in any change-management framework.

The interesting part is that the breakage is rarely loud. The eval suite passes because the eval suite scored final outputs, not pacing. The integration tests pass because the integration tests parsed the final structured output, not the stream-as-it-arrives. The error-rate dashboards stay flat because nothing crashes. What surfaces is a slow trickle of qualitative complaints from users and one or two oddly-broken downstream features whose owners can't immediately tell why their thing stopped working.
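
One way to close that gap is to make the undocumented surface explicit: record the old model's stream texture as a baseline and have the eval suite flag drift. A sketch, with hypothetical property names and an arbitrary 50% tolerance chosen for illustration:

```python
# Hypothetical stream-texture regression check: instead of scoring only
# final answers, compare a candidate model's pacing against the baseline
# observed from the model being replaced.
BASELINE = {"ttft_s": 0.8, "tokens_per_sec": 35.0, "preamble": True}

def texture_regressed(candidate: dict, tolerance: float = 0.5) -> list[str]:
    """Return the stream-texture properties that drifted past tolerance."""
    drifted = []
    for key in ("ttft_s", "tokens_per_sec"):
        if abs(candidate[key] - BASELINE[key]) / BASELINE[key] > tolerance:
            drifted.append(key)
    if candidate["preamble"] != BASELINE["preamble"]:
        drifted.append("preamble")
    return drifted

# The faster successor from the opening anecdote: every property drifts,
# so the check fails even though the final answers still match.
assert texture_regressed(
    {"ttft_s": 0.05, "tokens_per_sec": 90.0, "preamble": False}
) == ["ttft_s", "tokens_per_sec", "preamble"]
```

The point is not that these particular thresholds are right, but that pacing becomes a reviewed, versioned part of the contract instead of an accident of whichever model happens to be behind the endpoint.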
