
The Cancellation Tax: Your Inference Bill After the User Hits Stop

Tian Pan
Software Engineer

Your stop button is a lie. When a user clicks it, your UI stops rendering tokens; your provider, in most configurations, keeps generating them. The bytes never reach a browser, but they reach your invoice. The gap between what the user saw and what you paid for is the cancellation tax, and it is the single most under-reported line item on AI cost dashboards.

The reason the tax exists is structural. Autoregressive inference is a GPU-bound pipeline: by the time your client closes the TCP connection, the request has already been scheduled, its prompt prefilled into the KV cache, and the decode loop is emitting tokens at 30–200 per second. Most serving stacks do not check for client liveness between tokens. They finish the job, log the usage, and bill you. The client saw ten tokens; the log recorded eight hundred. Langfuse, Datadog, and every other observability platform will faithfully report the eight hundred, because that is what the provider's usage block reports.

Why stopping the stream doesn't stop the generation

There are three places where a cancellation could happen: in the client, in the proxy, and in the inference engine. Only the third one actually saves money. The first two are theater.

In client-side JavaScript, an AbortController attached to a fetch call will unsubscribe from the SSE event stream, but it does not guarantee a TCP FIN reaches the provider. It certainly does not signal the inference engine to stop the decode loop. Even if the FIN does propagate, the engine has no contract with the HTTP layer about what to do with an in-flight request. vLLM, the most common open-source serving stack, exposes an abort(request_id) method, but whether your gateway actually calls it on client disconnect is a configuration question most teams have never asked. LiteLLM has an open feature request from late 2025 because, by default, "the LLM provider's response is completely tracked despite the fact that I break the iterator and close the stream."
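
The same theater is easy to reproduce outside a browser. A minimal sketch with the openai Python SDK (model name illustrative): breaking out of the streaming iterator is the SDK equivalent of an AbortController firing, and nothing in it tells the serving engine to stop decoding.

```python
# Sketch only: break the stream the way a frontend abort does.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative; any streaming-capable model
    messages=[{"role": "user", "content": "Write a 2,000-word essay on tides."}],
    stream=True,
)

rendered_chunks = 0
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        rendered_chunks += 1
    if rendered_chunks >= 10:     # the user hits stop after ~10 deltas
        stream.close()            # closes the local HTTP response...
        break                     # ...but does not abort the decode loop
# The provider can keep generating until the stop token or the max-tokens cap,
# and the usage it bills reflects that full completion.
```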

Anthropic and OpenAI handle this inconsistently. OpenAI's Responses API, as of its current shape, only supports responses.cancel for background responses; synchronous streaming responses have no cancel endpoint at all, so you disconnect the socket and hope. The official GitHub issue on this is explicit: "if a user cancels a run, we do not want to pay for extra compute on the next ResponseCompletedEvent," followed by an engineer asking whether this already happens automatically. It does not. Anthropic's SSE stream offers no configurable idle timeout and no server-side cancel primitive; the community is still asking for streaming idle timeouts, never mind cancellation semantics.
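
The background path does have a cancel primitive. A minimal sketch with the openai Python SDK, assuming the Responses API's background mode described above; the model name is illustrative.

```python
# Sketch only: background mode gives you a response ID you can actually cancel.
from openai import OpenAI

client = OpenAI()

resp = client.responses.create(
    model="gpt-4o",                      # illustrative
    input="Draft a 2,000-word migration plan.",
    background=True,                     # returns immediately, runs server-side
)

# ...the user hits stop in the UI...
cancelled = client.responses.cancel(resp.id)
print(cancelled.status)                  # "cancelled" unless it already finished
```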

The practical result: when you wire abortController.abort() into your frontend, you can measure its effect in two places. Client-side latency for the cancelled request drops to zero immediately. Your provider bill, measured two weeks later, does not move.

The silent multiplication in your cost dashboard

Most AI cost dashboards are built on the usage object in the provider's response. That object reports total input and output tokens — including the ones generated after the client gave up. Your dashboard's "cost per conversation" line is therefore the cost of the tokens the provider produced, not the cost of the tokens the user read. In a low-abandonment surface (batch summarization, async agents), these are close. In high-abandonment surfaces, they diverge by multiples.

Consider the features where abandonment is structural:

  • Chat with editable input. A user types, hits send, reads the first three sentences, realizes they phrased the question badly, edits, and resends. The first generation ran to completion in the background.
  • Voice interfaces with barge-in. The LLM begins streaming a response. The user interrupts. Your TTS stops, your STT restarts, and a new LLM call kicks off. The interrupted one is still decoding.
  • Agent UIs with a prominent cancel button. The agent starts a plan, the user reads it, kills it, refines the task. If your agent harness doesn't propagate cancellation to the tool-call inference request, the model finished the plan for no one.
  • Search-style UIs with autocomplete-triggered generations. Every keystroke past a threshold fires a request. Most of those requests are superseded before the user ever sees their output.
  • Mobile backgrounding. A user on a flaky 4G connection swipes to another app. The TCP connection drops. The inference keeps running until the model hits the stop token or the max-tokens cap.

The ratio of paid tokens to delivered tokens is the number that tells you how bad this is. Call it the PTDR. If it's 1.0, every token you paid for showed up on the user's screen. If it's 1.3, you are paying for 30 percent more inference than the user ever experienced. In consumer chat surfaces we've seen the ratio drift past 1.5 without anyone noticing, because nobody was calculating it. The cost-per-active-user metric silently inflates; FinOps asks why the LLM line grew faster than traffic; engineering blames the model swap; and nobody suspects the stop button.
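
Computing it takes a few lines once both counts are logged per request; the field names and numbers below are illustrative.

```python
# Sketch only: paid-to-delivered token ratio (PTDR) over a set of requests.
def ptdr(requests: list[dict]) -> float:
    paid = sum(r["tokens_paid"] for r in requests)
    delivered = sum(r["tokens_delivered"] for r in requests)
    return paid / delivered if delivered else float("inf")

sample = [
    {"tokens_paid": 450, "tokens_delivered": 50},   # abandoned at token 50
    {"tokens_paid": 400, "tokens_delivered": 400},  # read to completion
    {"tokens_paid": 350, "tokens_delivered": 350},  # read to completion
]
print(ptdr(sample))  # 1.5 -> paying 50% more than users ever saw
```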

Abort-aware accounting: the contract your pipeline needs

Fixing this requires treating cancellation as a first-class event in your observability, not an edge case in your frontend. Three pieces have to line up.

First, client-side accounting that reconciles with provider-side usage. When the user aborts, log how many tokens your client actually rendered — the "delivered" count. When the provider's final usage event arrives (it usually does, even on aborted streams, if you keep the socket long enough to read it), log that as "paid." Store both. Your dashboards should have three metrics, not one: tokens_delivered, tokens_paid, and the ratio.
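
A minimal sketch of that reconciliation with the openai Python SDK, assuming the provider honors stream_options={"include_usage": True} and appends a final chunk carrying the usage block; the stop-after-ten-deltas condition stands in for the user's stop button.

```python
# Sketch only: log delivered vs. paid tokens for one streamed call.
from openai import OpenAI

client = OpenAI()

stream = client.chat.completions.create(
    model="gpt-4o-mini",                       # illustrative
    messages=[{"role": "user", "content": "Explain KV caching in detail."}],
    stream=True,
    stream_options={"include_usage": True},
)

rendering = True
delivered_deltas = 0
paid_output_tokens = None

for chunk in stream:
    if rendering and chunk.choices and chunk.choices[0].delta.content:
        delivered_deltas += 1
        if delivered_deltas >= 10:             # user hits stop: stop rendering,
            rendering = False                  # but keep the socket open
    if chunk.usage is not None:                # final chunk carries the usage block
        paid_output_tokens = chunk.usage.completion_tokens

print({"tokens_delivered": delivered_deltas, "tokens_paid": paid_output_tokens})
```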

Second, server-side cancellation where your stack supports it. If you self-host on vLLM or SGLang, wire your gateway's on_disconnect handler to the engine's abort(request_id). This is free money, and most teams haven't done it because it requires knowing the request ID at the HTTP layer, which means threading the ID through your proxy. If you use hosted APIs, use background mode for anything the user might cancel: OpenAI's Responses API lets you cancel background responses by ID, something the streaming endpoint does not offer. The latency trade-off is real but smaller than people assume, especially if you retain the ability to stream.
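
A minimal sketch of the self-hosted path, assuming a FastAPI gateway in front of vLLM's AsyncLLMEngine; the model name and route are illustrative. The pattern is simply to check for disconnect between tokens and call abort with the same request ID you handed to generate.

```python
# Sketch only: propagate client disconnects to vLLM's abort(request_id).
import uuid

from fastapi import FastAPI, Request
from fastapi.responses import StreamingResponse
from vllm import AsyncEngineArgs, AsyncLLMEngine, SamplingParams

app = FastAPI()
engine = AsyncLLMEngine.from_engine_args(
    AsyncEngineArgs(model="meta-llama/Llama-3.1-8B-Instruct")  # illustrative
)

@app.post("/generate")
async def generate(request: Request):
    body = await request.json()
    request_id = str(uuid.uuid4())              # the ID you must thread through
    params = SamplingParams(max_tokens=int(body.get("max_tokens", 512)))

    async def token_stream():
        async for output in engine.generate(body["prompt"], params, request_id):
            if await request.is_disconnected():  # user closed the tab or hit stop
                await engine.abort(request_id)   # stop the decode loop now
                return
            yield output.outputs[0].text         # cumulative text so far

    return StreamingResponse(token_stream(), media_type="text/plain")
```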

Third, per-route abandonment-rate dashboards. Abandonment is not a model problem. It is a product surface problem. The chat-with-editable-input surface and the agent-with-cancel-button surface have different abandonment profiles, and they warrant different mitigations. Tag every inference request with its originating route, and track abandonment rate and PTDR per route. When a PM proposes a feature whose interaction pattern invites abandonment, you'll have the data to price the feature correctly.
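
A minimal sketch of that rollup, with illustrative event field names (route, aborted, tokens_paid, tokens_delivered).

```python
# Sketch only: abandonment rate and PTDR bucketed by originating route.
from collections import defaultdict

def rollup_by_route(events: list[dict]) -> dict[str, dict]:
    acc = defaultdict(lambda: {"n": 0, "aborted": 0, "paid": 0, "delivered": 0})
    for e in events:
        a = acc[e["route"]]                      # e.g. "chat", "agent", "autocomplete"
        a["n"] += 1
        a["aborted"] += int(e["aborted"])
        a["paid"] += e["tokens_paid"]
        a["delivered"] += e["tokens_delivered"]
    return {
        route: {
            "abandonment_rate": a["aborted"] / a["n"],
            "ptdr": a["paid"] / a["delivered"] if a["delivered"] else float("inf"),
        }
        for route, a in acc.items()
    }
```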

Product design is where the tax gets paid

Engineering can stanch the bleeding, but the size of the wound is a product decision. The features with the highest cancellation tax are also some of the most important features in a modern AI product. You won't make them go away. You can make them cheaper.

Delay the expensive part of the inference. The first hundred tokens of most responses are cheap. The last two thousand are expensive (output tokens are typically 3–5× the price of input tokens at the flagship tier). If your product lets users preview and commit, bias the commit gate to fire before the expensive generation starts, not after it finishes. A "continue" button that explicitly asks the user to authorize further output is worth ten UX research sessions on why people hit stop.

Cap max-output tokens aggressively on interactive surfaces. If your average response is 400 tokens, setting max_tokens=4096 is an invitation to pay for a 3600-token runaway when the user bails at token 50. Set it to the 90th-percentile response length, not the theoretical upper bound. The tax on an abandoned runaway is disproportionate.
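
A minimal sketch of deriving the cap from logged response lengths; the percentile and the floor are illustrative choices.

```python
# Sketch only: set max_tokens from the 90th-percentile observed response length.
import math

def p90_cap(response_lengths: list[int], floor: int = 64) -> int:
    xs = sorted(response_lengths)
    p90 = xs[min(len(xs) - 1, math.ceil(0.9 * len(xs)) - 1)]
    return max(floor, p90)

print(p90_cap([220, 310, 350, 400, 420, 450, 480, 520, 600, 1900]))  # 600
```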

Prefer non-streaming for short completions. The mental model where "streaming is always better" is outdated. For completions shorter than ~200 tokens, the user experience difference between streamed and batch is negligible. The cost difference under abandonment is significant: a batch request can be cancelled pre-execution via a queue if you see staleness signals (user typed a new prompt, user left the screen) before the request is dequeued. Once streaming has started, those signals come too late.
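
One way to get that pre-execution cancellation is a latest-only queue keyed by session: a newer prompt silently supersedes a stale one before any worker dequeues it, so the superseded request is never sent and never billed. A minimal sketch, with illustrative names.

```python
# Sketch only: keep at most one pending batch request per session.
import asyncio

class LatestOnlyQueue:
    def __init__(self) -> None:
        self._pending: dict[str, str] = {}   # session_id -> latest prompt
        self._ready = asyncio.Event()

    def submit(self, session_id: str, prompt: str) -> None:
        self._pending[session_id] = prompt   # supersedes any stale prompt
        self._ready.set()

    async def next(self) -> tuple[str, str]:
        while not self._pending:
            self._ready.clear()
            await self._ready.wait()
        return self._pending.popitem()       # (session_id, prompt) to dispatch
```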

Treat speculative generation as what it is — speculation with a budget. Products that pre-generate responses on partial input (voice agents listening to a partial transcript, agents running exploratory tool calls) are essentially running a draft-model pattern without the accounting. Most of those generations will be discarded. Track a budget for speculative spend per session, and cap it. The discipline of treating pre-generation as a separate cost bucket is the same discipline that keeps speculative decoding from blowing up serving budgets at the infrastructure layer.
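
A minimal sketch of that budget, with an illustrative per-session cap.

```python
# Sketch only: cap speculative (pre-generated) spend per session.
class SpeculativeBudget:
    def __init__(self, cap_tokens: int = 2_000) -> None:
        self.cap = cap_tokens
        self.spent: dict[str, int] = {}       # session_id -> speculative tokens used

    def allow(self, session_id: str, estimated_tokens: int) -> bool:
        used = self.spent.get(session_id, 0)
        if used + estimated_tokens > self.cap:
            return False                      # skip the speculative call entirely
        self.spent[session_id] = used + estimated_tokens
        return True
```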

The metric that belongs on every AI cost dashboard

The single number worth instrumenting this quarter is the paid-to-delivered token ratio, bucketed by route. Under 1.1, you have a disciplined product. Between 1.1 and 1.3, you have normal levels of abandonment and your mitigation budget should be moderate. Above 1.3, you are funding workflows nobody completes, and the fix is a mix of engineering plumbing (server-side abort propagation, background-mode cancellation) and product redesign (max-token caps, commit gates, non-streaming fallbacks).

The reason this metric isn't already standard is that the industry defaulted to treating the provider's usage block as the ground truth. It isn't. It's the ground truth of what you paid for. The ground truth of what you delivered lives in your client, in the number of tokens that actually got rendered before the user moved on. Those two numbers should be logged side by side. The distance between them is your cancellation tax, and it is a tax you can lower — not by renegotiating with your provider, but by noticing it exists.
