29 posts tagged with "streaming"

The Streaming Abort Your Provider Billed Anyway: A 14% Gap Hiding in Your Invoice

· 10 min read
Tian Pan
Software Engineer

Your finance team filed a dispute and lost. The line item is "output tokens" and it exceeds your sum-of-delivered-tokens metric by fourteen percent. The provider's support engineer closed the ticket as "expected behavior under streaming cancellation," with a link to a documentation page that says "cancellation stops billing at the last delivered token." Both sentences are true, and the gap between them is the line of code you have not written.

The contract you read says one thing. The inference scheduler does another. The mismatch is not a bug, not a billing error, and not malice — it is a layered system in which the cancellation signal travels through three boundaries (browser, edge, GPU) and the billing meter sits at the third boundary while your "stop generating" button sits at the first. Closing the gap is an engineering project with a finance owner.
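
The gap itself is simple arithmetic once both sides are measured. A minimal sketch, assuming you can log for each request how many tokens the provider generated (what the meter sees) and how many reached the client. The field names and figures below are hypothetical, not any provider's API:

```python
from dataclasses import dataclass


@dataclass
class RequestUsage:
    generated_tokens: int  # tokens produced at the GPU, where the meter sits (assumed field)
    delivered_tokens: int  # tokens that actually reached the client (assumed field)


def billing_gap(usages):
    """Return billed-minus-delivered tokens as a fraction of delivered tokens."""
    billed = sum(u.generated_tokens for u in usages)
    delivered = sum(u.delivered_tokens for u in usages)
    if delivered == 0:
        raise ValueError("no delivered tokens to compare against")
    return (billed - delivered) / delivered


# Illustrative numbers only: one completed request, one cancelled mid-stream.
usages = [
    RequestUsage(generated_tokens=800, delivered_tokens=800),
    RequestUsage(generated_tokens=340, delivered_tokens=200),
]
gap = billing_gap(usages)
```

With these made-up numbers the meter counts 1,140 tokens against 1,000 delivered, which reproduces a fourteen percent gap from a single late cancellation.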

The Streaming UI That Committed a Partial Answer Your Model Never Finished

· 10 min read
Tian Pan
Software Engineer

The post-mortem read like a hallucination report. A user had acted on a confidently-worded recommendation that turned out to be wrong in a way the model would not have written if it had finished — except the trace showed the model had not finished. The provider connection dropped at token 412 of an expected 800. The client's error handler logged the failure. The persisted partial message, written to the conversation history as tokens arrived, sat in the user's UI looking exactly like every other complete answer. They acted on it. Support categorized the ticket as a content-quality issue. It took two weeks to route it to the platform team.

Nothing in this chain was a model failure. The model behaved correctly for the 412 tokens it produced. The failure was that the streaming UI and the durable conversation history had quietly disagreed about what counts as a message — and during the exact failure mode that streaming was supposed to make tolerable, the disagreement became the canonical record.

This is the contract between optimistic rendering and durable storage. Most chat products inherit it from a tutorial or a framework without thinking about it as a contract at all, and the gap shows up as a tail of incidents that look like model bugs and aren't.
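
One way to make that contract explicit is to store a status alongside the text, so a message cut off mid-stream can never look like a finished one. This is a minimal sketch under assumptions: `StoredMessage`, `MessageStatus`, and `persist_stream` are hypothetical names, and a dropped provider connection is assumed to surface as a `ConnectionError`.

```python
from dataclasses import dataclass
from enum import Enum


class MessageStatus(Enum):
    STREAMING = "streaming"
    COMPLETE = "complete"
    INTERRUPTED = "interrupted"


@dataclass
class StoredMessage:
    text: str = ""
    status: MessageStatus = MessageStatus.STREAMING


def persist_stream(chunks):
    """Accumulate chunks as they arrive and record how the stream ended."""
    msg = StoredMessage()
    try:
        for chunk in chunks:
            msg.text += chunk  # optimistic write, as in the incident above
        msg.status = MessageStatus.COMPLETE
    except ConnectionError:
        # The partial text is kept, but it is no longer indistinguishable
        # from a complete answer.
        msg.status = MessageStatus.INTERRUPTED
    return msg


def dropped_stream():
    """Hypothetical provider stream that dies partway through."""
    yield "Use "
    yield "option A"
    raise ConnectionError("provider connection dropped")


partial = persist_stream(dropped_stream())
```

A UI reading `partial.status` can then render the message as interrupted instead of presenting it like every other complete answer.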

The Interrupt UI That Taught Your Users to Never Interrupt the Agent

· 10 min read
Tian Pan
Software Engineer

The interrupt button on your streaming agent has a 0.4% click rate. The product team reads that number and concludes the feature is working as intended — most generations don't need to be interrupted, the implementation is fine, ship it, move on. The actual reading is that the interrupt button taught your users not to press it. Within a week of using the product, they figured out that pressing stop discards the partial response, clears the context, and dumps them back at an empty input box. The lesson they learned is to wait through a bad answer rather than risk losing the thread.

That 0.4% is not a usage signal. It is an aversion signal. Your users are not happy with the answers — they are afraid of the cost of trying to redirect them, and their adaptation is to sit quietly while the agent finishes saying something they already know is wrong. The engineering team treated "stop generation" as a model-call cancellation. The user treated it as "redirect, don't restart." The two definitions never met, and the product shipped a feature that quietly drained user agency from every long-running conversation.

The Streaming Abort That Left the Side Effect Billable

· 11 min read
Tian Pan
Software Engineer

A user is watching your agent stream a response. Two hundred milliseconds in, they hit stop. The UI clears the bubble, the spinner disappears, and the product behaves as if the request never happened. It did happen. The agent already called send_invoice_email. The vendor's mail relay returned 250 OK. The customer received a draft invoice the user never approved. Your billing meter charged the user for the tokens that streamed before the abort. It cannot bill back the email.

This is the failure mode every team with streaming tool use ships at least once, and most teams never even detect. The stream layer reports cancelled. The tool layer reports succeeded. Your customer-facing log picks one of them based on whichever subsystem flushes last, and the two halves of the same request now disagree about whether it occurred.

The Streaming Response Your Backend Infrastructure Was Not Built For

· 12 min read
Tian Pan
Software Engineer

Streaming was a product decision. Somebody on the design team watched a competitor's chat UI tick out tokens like a typewriter, watched a user's shoulders relax when the first character appeared two hundred milliseconds in instead of after a four-second blank stare, and the decision was made: we stream. The pull request changed three files in the API gateway. The model output now flushes incrementally over Server-Sent Events. The launch went out on a Tuesday and the satisfaction score moved up by a measurable amount on a Wednesday. Nobody opened a ticket against infrastructure.

A month later the on-call engineer is staring at three dashboards that no longer agree with each other. The autoscaler is provisioning twice as many pods as the CPU graphs say it should need. The p99 latency dashboard is broken — not malfunctioning, but uninterpretable, because the histogram buckets stop at five seconds and most spans now live in the overflow. The capacity model that priced the previous quarter's bill said the service could handle twelve hundred requests per second per node. The graph in front of the on-call says it is handling four hundred and falling over.

The Streaming Response That Contradicts Itself

· 8 min read
Tian Pan
Software Engineer

The model says "the answer is yes" in the first sentence. By the third paragraph it has walked it back to "actually, on reflection, no — and here is why." The end-state is correct. The user already left. They read the first paragraph, took it as the answer, and acted on it before the model finished revising. Your eval scored the response correct. Your user got the wrong one.

This is the failure mode streaming UX hides. Token-by-token rendering treats every chunk as if it were committed truth, but the model has no notion of commit. There is no boundary between hedge and conclusion, no signal that says "the next two paragraphs are going to overturn what I just said." The interface is shipping partial state as final state, and the longer the response, the worse the gap gets.

The Streaming Response That Committed Before the User Said Yes

· 12 min read
Tian Pan
Software Engineer

The user is reading the agent's reasoning as it streams in. Around token 1200, the model decides to call send_email, then create_ticket, then kick_off_deploy. The user, watching the partial output and realizing the agent has misread the request, hits the stop button half a second too late. The email is already sent. The ticket is already filed. The deploy is already running. The stop button cancelled the next token, not the consequences of the last one.

The bug is not in the cancel handler. The bug is the assumption — borrowed from every other streaming UI on the team's roadmap — that an incrementally rendered output is an incrementally reversible one. Tool calls do not honor that contract. They are point-in-time commits that the streaming layer happily fires while the rest of the response is still being generated, and the cancel button has no way to chase them down the wire.

This is one of those failure modes that nobody owns because it lives in the seam between two teams that each shipped their half cleanly. The UX team shipped streaming because it tested better in user studies. The platform team shipped tool calls because the framework supports them. Neither team had a meeting where someone asked: what is "stop" supposed to mean when the response has already left the building?
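
The asymmetry is easy to reproduce in a toy model. The sketch below assumes a stream is a sequence of `("token", text)` and `("tool_call", name)` events and that tool calls execute the moment they appear. The event shape and `run_stream` are illustrative assumptions, not a real framework's API.

```python
def run_stream(events, cancel_after):
    """Process events in order; stop once `cancel_after` events are handled.

    Returns (rendered_text, executed_tools).
    """
    rendered = []
    executed = []
    for index, (kind, payload) in enumerate(events):
        if index >= cancel_after:
            break  # the stop button: no further events are processed
        if kind == "token":
            rendered.append(payload)
        elif kind == "tool_call":
            executed.append(payload)  # side effect fires immediately
    return "".join(rendered), executed


events = [
    ("token", "Sending the summary now. "),
    ("tool_call", "send_email"),
    ("token", "Filing a ticket. "),
    ("tool_call", "create_ticket"),
]

# The user hits stop after the second event has already been handled.
text, tools = run_stream(events, cancel_after=2)
```

Cancelling prevents `create_ticket`, but `send_email` is already in `tools`: stopping the stream discards what comes next, not what has already run.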

Streamed Tokens Are a Promise You Can't Take Back

· 9 min read
Tian Pan
Software Engineer

The model has streamed seventy percent of a confident-sounding answer to the user's screen. Then the tool call it was about to make returns an error, or no rows, or a 429. You now get to pick between two losses: let the model finish gracefully by inventing the rest, or stop mid-sentence with no clean way to walk it back. Neither is a recovery — both are damage.

This is the part of streaming UX that nobody priced when they turned the feature on. Streaming was framed as a perceived-latency win: time-to-first-token is the metric, the user starts reading sooner, the app feels alive. What the framing leaves out is that every token you stream is a commitment. You have published a draft of an answer that you do not yet know is correct, and the back half of your system has not yet finished running. When it finishes and disagrees, your UI has no native way to retract what it already showed.

The Streaming Rollback Problem: You Can't Un-Say a Token

· 10 min read
Tian Pan
Software Engineer

Watch someone use a chat product for the first time and you'll notice they start reading before the model finishes. That reading-as-it-appears behavior is the entire reason streaming exists: it turns a multi-second wait into something that feels like a conversation. It is also the reason your output guardrails are quietly broken.

Here is the uncomfortable sequence. The model generates token 1, token 2, and on through token 150. Each one is rendered the instant it arrives. At token 200, the model produces a hallucinated dosage, a leaked email address, or a sentence that violates your content policy. Your output-side guardrail fires correctly and immediately. But "immediately" is too late: the user has already read 200 tokens. You cannot un-render them. The guardrail did its job, and the violation still reached a human being.

The Streaming Token the User Acted On Too Soon

· 9 min read
Tian Pan
Software Engineer

A user asked your assistant whether a config change was safe to ship. The model streamed back: "Yes, you can deploy this safely." Three hundred milliseconds later it continued: "— except in the us-east region, where the old connection pool is still draining." But the user had already read the first half, felt the relief of a green light, and clicked deploy. The qualification arrived to an empty room.

Nobody made a mistake here. The model was correct. The user read what was on screen. The renderer faithfully displayed every token the moment it arrived. And yet the outcome was a bad deploy, because streaming turned the model's intermediate state into something the user treated as final.

The Streaming Response That Returns 200 Then Fails: How Mid-Stream Errors Break Your SLOs

· 10 min read
Tian Pan
Software Engineer

Your availability dashboard says 99.95%. Your users say the answer stopped mid-sentence. Both are correct, and that is the problem.

The HTTP-era reliability stack was built on a single assumption: the status code arrives at the end of a request and summarizes its fate. A 200 means success. A 5xx means retry. The load balancer counts the ratio, the SLO dashboard aggregates it, the alerting fires on the burn rate. Every layer of that stack reads the header and trusts it.

Streaming inverts the assumption. The moment your server flushes the first token, it has already committed to a 200. Everything that goes wrong after that — a provider timeout at token 400, a content filter trip mid-paragraph, a dropped TCP connection, a malformed tool-call fragment — happens after the verdict has been rendered and cannot be retracted. The request failed. The status code says it succeeded. And nothing in your reliability tooling is built to notice the difference.
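
The difference becomes visible as soon as the stream's ending is recorded separately from its status code. A minimal sketch, assuming each stream is expected to finish with an explicit terminal event; the `"done"` / `"error"` convention and the `StreamOutcome` type are hypothetical:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StreamOutcome:
    http_status: int               # fixed when the first token is flushed
    terminal_event: Optional[str]  # "done", "error", or None if the connection dropped


def availability_by_status(outcomes):
    """What a status-code-based dashboard would report."""
    ok = sum(1 for o in outcomes if o.http_status == 200)
    return ok / len(outcomes)


def availability_by_outcome(outcomes):
    """Counts a stream as successful only if it also ended cleanly."""
    ok = sum(
        1 for o in outcomes
        if o.http_status == 200 and o.terminal_event == "done"
    )
    return ok / len(outcomes)


# Illustrative sample: every request returned 200, half did not finish.
outcomes = [
    StreamOutcome(200, "done"),
    StreamOutcome(200, "error"),  # e.g. provider timeout mid-stream
    StreamOutcome(200, None),     # dropped connection, no terminal event
    StreamOutcome(200, "done"),
]
```

On this sample the status-code view reports 100% while the outcome view reports 50%, which is the divergence the dashboard cannot see.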

The AI Feature With Two Latencies: You Measure One, Your Users Feel the Other

· 9 min read
Tian Pan
Software Engineer

A traditional HTTP request has one latency that matters: the time from request to response. The p95 of that number is the contract. SRE watches it, the SLO is written against it, and when it regresses someone gets paged. One number, one dashboard, one truth.

A streaming AI feature broke that model the moment the response became a stream, and most teams haven't noticed. There are now two latencies, and they diverge. Time-to-first-token is how long the user stares at a spinner before anything happens. Time-to-completion is how long until the answer is fully written. They are shaped by different forces, fixed by different levers, and felt by the user at completely different emotional weights — and almost every team instruments only the second one, because that's the number the HTTP framework hands them for free.
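
Capturing both numbers takes only a few lines around the chunk loop. The sketch below assumes the response is exposed as an iterable of chunks; `measure_stream` is a hypothetical helper, and the injectable `clock` exists only so the example can be checked deterministically.

```python
import time


def measure_stream(chunks, clock=time.monotonic):
    """Return (time_to_first_token, time_to_completion) in seconds.

    time_to_first_token is None if the stream produced no chunks.
    """
    start = clock()
    first_token_at = None
    for _ in chunks:
        if first_token_at is None:
            first_token_at = clock() - start  # the spinner disappears here
    total = clock() - start  # the number the HTTP framework reports
    return first_token_at, total


# Deterministic example with a fake clock: request starts at 0.0 s,
# first chunk arrives at 0.3 s, stream finishes at 4.0 s.
fake_times = iter([0.0, 0.3, 4.0])
ttft, total = measure_stream(["Yes", ",", " done"], clock=lambda: next(fake_times))
```

Emitting `ttft` and `total` as separate metrics lets each one get its own histogram and its own alert threshold.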