The Streaming Abort Your Provider Billed Anyway: A 14% Gap Hiding in Your Invoice
Your finance team filed a dispute and lost. The line item is "output tokens" and it exceeds your sum-of-delivered-tokens metric by fourteen percent. The provider's support engineer closed the ticket as "expected behavior under streaming cancellation," with a link to a documentation page that says "cancellation stops billing at the last delivered token." Both sentences are true, and the gap between them is the line of code you have not written.
The contract you read says one thing. The inference scheduler does another. The mismatch is not a bug, not a billing error, and not malice — it is a layered system in which the cancellation signal travels through three boundaries (browser, edge, GPU) and the billing meter sits at the third boundary while your "stop generating" button sits at the first. Closing the gap is an engineering project with a finance owner.
This post is about the specific shape of that gap, why the documented contract and the implemented behavior diverge, and the four patterns — server-side cancellation, per-request token caps, reconciliation dashboards, and contract addenda — that move the dispute from "expected behavior" to a refund or a refactor.
The three layers your abort signal travels through
AbortController.abort() does two things in the browser: it rejects the pending fetch promise with an AbortError DOMException, and it errors the response body stream so that any reader stops receiving chunks. That is the entire client-side contract. Whether anything downstream notices is a property of every intermediate layer, none of which is obligated to forward the signal.
Layer one is the TCP connection between the browser and your edge. When abort() fires, the browser closes its socket. Depending on the SSE proxying configuration, the edge may notice quickly (TCP RST or FIN propagates to the upstream socket) or slowly (the proxy holds the upstream connection open and discovers the client disconnect only when it tries to write the next chunk). Most reverse proxies fall in the second category by default. Cloudflare, for example, only forwards a client disconnect once it attempts a write to the closed socket — meaning the upstream connection survives until the next SSE event, which on a slow stream could be hundreds of milliseconds later.
Layer two is the connection from your edge to the provider. Assuming the edge eventually closes that socket, the provider's API gateway receives a TCP FIN. The gateway is now in the same position your edge was a moment ago: it must decide what to do with the upstream inference request. In the open-source case (vLLM, TGI, TensorRT-LLM), the answer is "it depends on whether the request middleware is wired to call abort(request_id) on disconnect." vLLM has open issues from late 2025 documenting cases where request.is_disconnected() returns False even after the client has hung up, because user-installed middleware breaks Starlette's disconnect propagation. The cancellation never reaches the engine.
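A minimal sketch of the wiring that paragraph describes, using only asyncio. FakeEngine, watch_disconnect, and stream_with_watcher are hypothetical names invented for illustration; in a Starlette-based server the is_disconnected callable would presumably be request.is_disconnected, and abort would map to whatever abort(request_id) hook the engine exposes. The point is only that something must poll the transport and call abort explicitly, because nothing does so by default.

```python
import asyncio
from typing import AsyncIterator, Awaitable, Callable


class FakeEngine:
    """Hypothetical stand-in for an inference engine with an abort hook."""

    def __init__(self) -> None:
        self.aborted: set[str] = set()
        self.generated: dict[str, int] = {}

    async def generate(
        self, request_id: str, max_tokens: int, step_s: float = 0.01
    ) -> AsyncIterator[str]:
        # One token per decode step, until aborted or the ceiling is reached.
        for i in range(max_tokens):
            if request_id in self.aborted:
                return
            await asyncio.sleep(step_s)
            self.generated[request_id] = i + 1
            yield f"tok{i}"

    def abort(self, request_id: str) -> None:
        self.aborted.add(request_id)


async def watch_disconnect(
    is_disconnected: Callable[[], Awaitable[bool]],
    engine: FakeEngine,
    request_id: str,
    poll_s: float = 0.005,
) -> None:
    # Poll the transport; once the client has hung up, abort explicitly.
    while not await is_disconnected():
        await asyncio.sleep(poll_s)
    engine.abort(request_id)


async def stream_with_watcher(
    engine: FakeEngine,
    request_id: str,
    max_tokens: int,
    is_disconnected: Callable[[], Awaitable[bool]],
) -> int:
    watcher = asyncio.create_task(
        watch_disconnect(is_disconnected, engine, request_id)
    )
    try:
        async for _tok in engine.generate(request_id, max_tokens):
            pass  # a real handler would write each token to the client here
    finally:
        watcher.cancel()
        await asyncio.gather(watcher, return_exceptions=True)
    # Number of tokens the engine produced (and a meter would count).
    return engine.generated.get(request_id, 0)
```

If the watcher is missing, or middleware makes is_disconnected report False forever, the loop in generate runs to max_tokens. That is the failure mode the open issues describe.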
Layer three is the GPU. Even if the abort signal arrives, the decode loop is running on a pod that processes requests in continuous batches. The scheduler picks up new requests and drops finished ones at iteration boundaries — typically every few tens of milliseconds, but the boundary is not synchronous with your cancellation signal. Worse: under high batch occupancy, the scheduler has no incentive to evict an in-flight request immediately, because doing so would waste the KV cache it has already allocated. Some serving stacks honor cancellation within one iteration; others delay it until the next prefill window; a few simply run the request to its max_tokens ceiling and bill the lot.
The provider's documentation describes the contract at one of these layers. The bill reflects what happened at another.
The numbers behind a 14% gap
The 14% figure is not invented. A team running a chat assistant at moderate scale instrumented their client-side rendered-token counter, set it against the usage block from the provider, and discovered that the per-month divergence ran between 9% and 18% with a median around 14%. The breakdown was instructive:
- About 4 points came from straightforward abandonment: user closes the tab, agent dies, mobile backgrounds. The connection closes without an explicit abort and the provider runs the request to completion.
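- About 6 points came from explicit "stop generating" clicks. The abort fired, the connection closed, and the provider billed an average of 180 tokens of post-abort generation per cancelled request. With a batch-boundary scheduler running at ~50ms iterations and one token decoded per request per iteration, 180 tokens is about 180 scheduling steps, or roughly nine seconds of generation. That is far more than the three or four steps a scheduler needs to evict a request, which suggests most of the delay accrued before the signal reached the engine at all.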
- About 4 points came from a long tail of edge cases: SSE events buffered in the provider's response queue that the client never read but the meter still counted, retries on a transient error where the failed attempt's tokens were billed, and a small population of requests where the scheduler appeared to ignore cancellation entirely and ran to max_tokens.
In each category the provider's behavior matched some interpretation of the documentation. The 14% is not a single bug; it is the sum of every place where the contract was ambiguous and the implementation chose the option that favored the meter.
The team's finance dispute failed not because the numbers were wrong but because no single category exceeded the threshold the support team treated as a billing error. "Expected behavior" is a phrase that survives any individual cancellation event examined in isolation. The aggregated cost only becomes legible if you are the one aggregating it.
Why "stops billing at the last delivered token" is a half-true contract
Provider documentation tends to phrase cancellation as if it were a single instantaneous event. "Cancellation stops billing at the last delivered token." "Closing the connection terminates generation." The phrasing borrows from a synchronous mental model — request-in, response-out — that does not match how streaming inference is actually scheduled.
What the documentation usually means is: at the moment the inference engine observes the cancellation, it stops emitting further tokens and the billing meter freezes there. What it does not say is how long it takes the cancellation to propagate from the layer where you signalled it to the layer where the engine observes it. The phrase "stops billing at the last delivered token" reads as a guarantee about your invoice; it is actually a guarantee about the engine's internal accounting after the cancellation arrives. The propagation latency is the entire problem and the documentation does not name it.
This is the same shape as the durability-vs-availability gap in distributed storage: a system that claims to "stop accepting writes when the replica fails" is making a guarantee about the moment after failure is detected, not about the moment failure occurred. The interesting number — the window during which the guarantee did not hold — is the one nobody writes down.
The fix is to make the propagation latency a contractual term, not a hidden implementation detail. A useful addendum reads: "Cancellation stops billing within N milliseconds of the client signal, measured at the API boundary. Tokens generated after that window are not billed and appear on the invoice as a reconciliation credit." That sentence forces the provider to commit to a number, and gives the customer a metric to dispute against.
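To make the addendum concrete, here is a rough sketch of how the reconciliation credit could be computed for one cancelled request. It assumes per-token emission timestamps and the arrival time of the cancel signal at the API boundary are both available, which a provider would have to expose. CancelledRequest and reconciliation_credit are hypothetical names, and budget_ms stands for the N in the addendum.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CancelledRequest:
    # When the client's cancel signal reached the API boundary (ms).
    cancel_signal_ms: float
    # Emission time (ms) of every output token the provider billed.
    token_times_ms: tuple[float, ...]


def reconciliation_credit(req: CancelledRequest, budget_ms: float) -> int:
    """Count billed tokens emitted after cancel_signal_ms + budget_ms."""
    cutoff = req.cancel_signal_ms + budget_ms
    return sum(1 for t in req.token_times_ms if t > cutoff)


# Example: ten tokens at 50 ms spacing, cancel at 120 ms, 200 ms budget.
example = CancelledRequest(
    cancel_signal_ms=120.0,
    token_times_ms=tuple(50.0 * i for i in range(10)),
)
credit = reconciliation_credit(example, budget_ms=200.0)
```

In this example the cutoff is 320 ms, so the tokens emitted at 350, 400, and 450 ms fall outside the window and would be credited.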
Four patterns that close the gap
The team that wires the stop button to a TCP close has wired user intent to a metric the provider does not commit to. Four patterns, applied together, move the contract from implied to enforced.
Use the explicit cancellation API where it exists. OpenAI exposes responses.cancel for background responses, vLLM exposes abort(request_id), and OpenRouter forwards the abort for supported upstream providers. Synchronous streaming responses on OpenAI's main API have no cancellation endpoint at all, and Anthropic's Messages API offers no server-side cancel primitive. For those paths, TCP close is the only signal available, and the propagation latency is whatever the provider's gateway and scheduler decide. Where a real cancellation endpoint exists, call it from your server, not from the browser; relying on the client's TCP teardown to traverse three layers is the source of most of the gap.
Cap max_tokens per request. This is the cheapest pattern and the most often skipped. If the cancellation does not propagate and the request runs to its ceiling, you want the ceiling to be low. A chat surface where typical responses are 400 tokens and max_tokens is left at the model default of 4096 is paying for a ten-times overrun on every cancellation that fails to land. Set max_tokens to a small multiple of the 95th percentile of your delivered response length; treat the gap between actual and capped as your worst-case cancellation tax.
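One way to derive the cap from observed traffic, as a sketch. The nearest-rank percentile and the 1.5 multiplier are assumptions chosen for illustration (the text only says "a small multiple"), and the function names are hypothetical.

```python
import math


def percentile(values: list[int], pct: float) -> int:
    """Nearest-rank percentile, with pct in (0, 100]."""
    if not values:
        raise ValueError("need at least one observation")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def suggest_max_tokens(delivered_lengths: list[int], multiple: float = 1.5) -> int:
    """Cap at a small multiple of the p95 delivered response length."""
    return math.ceil(multiple * percentile(delivered_lengths, 95))


def worst_case_overrun(cap: int, typical: int) -> float:
    """Cost of a run-to-ceiling request, in multiples of a typical response."""
    return cap / typical
```

With the figures from the paragraph above, worst_case_overrun(4096, 400) is about 10.2, the ten-times overrun. If the p95 delivered length were 600 tokens, a 1.5 multiple would give a cap of 900 and cut the worst case to roughly 2.3 times a typical response.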
Build a finance dashboard that subtracts delivered from billed. Your observability platform almost certainly reports the usage block. That tells you what the provider billed. It does not tell you what the user received. Add a second metric — tokens actually rendered on the client, or actually consumed by the downstream pipeline — and emit it as a counter. The ratio between the two is the metric your finance team needs: it isolates the cancellation gap from raw traffic growth, makes it visible per-feature, and gives you a number to take to the provider when the gap exceeds an agreed threshold.
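The reconciliation metric can be sketched as a small aggregation. RequestRecord and its field names are hypothetical. The billed count is assumed to come from the provider's usage block, and the delivered count from wherever your client or pipeline tallies the tokens it actually consumed.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class RequestRecord(NamedTuple):
    feature: str
    billed_output_tokens: int     # from the provider's usage block
    delivered_output_tokens: int  # counted where the user or pipeline consumed them


def gap_by_feature(records: Iterable[RequestRecord]) -> dict[str, float]:
    """Return (billed - delivered) / delivered for each feature."""
    billed: dict[str, int] = defaultdict(int)
    delivered: dict[str, int] = defaultdict(int)
    for r in records:
        billed[r.feature] += r.billed_output_tokens
        delivered[r.feature] += r.delivered_output_tokens
    # Features with nothing delivered are skipped to avoid dividing by zero.
    return {
        f: (billed[f] - delivered[f]) / delivered[f]
        for f in billed
        if delivered[f] > 0
    }


def over_threshold(gaps: dict[str, float], threshold: float) -> list[str]:
    """Features whose gap exceeds the agreed threshold, sorted by name."""
    return sorted(f for f, g in gaps.items() if g > threshold)
```

Because the ratio is normalized by delivered tokens, traffic growth alone does not move it; only a change in cancellation behavior does.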
Negotiate the addendum before you need it. A finance dispute filed after the invoice has shipped argues from weakness; the same dispute filed as a contract amendment before the contract renews argues from leverage. The addendum should name the propagation latency budget explicitly (e.g. "200ms from API boundary to billing stop"), require the provider to certify it quarterly, and define overruns as automatic credits rather than negotiated refunds. Providers will resist naming the number, which itself tells you what the number is — the resistance is data.
What the architectural realization actually buys you
The architectural realization underneath all of this is that "stop generating" is a request the provider may honor at its own schedule. The button in your UI looks like a switch; it is actually a packet that travels through three layers, any of which may delay or drop it, and the meter sits at the layer furthest from you.
This is unromantic but it has a useful consequence: the cancellation gap is a known, bounded, measurable thing, not a mystery. It is not a function of model temperament or scheduler magic. It is the sum, across cancelled requests, of token generation rate × propagation latency, capped per request by whatever remains of max_tokens. You can instrument it, you can cap it with max_tokens, you can negotiate the latency budget, and you can route around it for providers that offer explicit cancellation endpoints. The team that treats it as a fixed cost of doing business is leaving 5–15% of inference spend on the table. The team that treats it as a soluble engineering problem with a finance owner gets that number back.
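As a back-of-the-envelope model of that quantity, the sketch below assumes a constant decode rate per request and treats the unspent part of max_tokens as a hard ceiling. The function names and the tuple layout are hypothetical.

```python
def post_abort_tokens(
    tokens_per_s: float, propagation_s: float, remaining_budget: int
) -> float:
    """Tokens generated between the cancel signal and the engine honoring it.

    remaining_budget is max_tokens minus the tokens already generated at the
    moment of cancellation; the request cannot bill more than that.
    """
    return min(tokens_per_s * propagation_s, remaining_budget)


def total_gap(cancelled: list[tuple[float, float, int]]) -> float:
    """Sum post-abort tokens over (tokens_per_s, propagation_s, remaining_budget)."""
    return sum(post_abort_tokens(rate, lat, rem) for rate, lat, rem in cancelled)
```

At an assumed 20 tokens per second, a 200ms propagation delay costs about 4 tokens per cancelled request, while a cancellation that never lands costs the whole remaining budget. Both levers in the text show up directly: a latency budget shrinks the first term, and a max_tokens cap shrinks the second.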
The deeper lesson is one that recurs across LLM serving: the contracts you read are written at a layer you do not touch, the bills you pay are computed at the same layer, and the user intent you care about lives at a layer three hops away. Closing the loop between user intent and billing is not a single feature; it is the discipline of treating every hop as a contract surface, with its own latency budget, its own reconciliation metric, and its own dispute path. The stop button is one of those surfaces. There will be others.
- https://community.openai.com/t/api-billing-for-streaming-if-i-close-connection-midway/624323
- https://community.openai.com/t/if-we-stop-streaming-output-stream-before-it-finishes-do-we-still-get-billed-for-the-tokens-that-werent-ouputted/859904
- https://community.openai.com/t/streaming-interruption-billing-clarification-needed/928978
- https://community.openai.com/t/cancel-the-openai-api-request-without-deducting-the-cost-from-the-balance/719556
- https://github.com/openai/openai-agents-js/issues/995
- https://github.com/vllm-project/vllm/issues/20798
- https://github.com/vllm-project/vllm/issues/10806
- https://github.com/vllm-project/vllm/issues/10087
- https://github.com/BerriAI/litellm/issues/17364
- https://github.com/crmne/ruby_llm/issues/607
- https://docs.vllm.ai/en/v0.4.3/dev/engine/async_llm_engine.html
- https://blog.vllm.ai/2025/09/05/anatomy-of-vllm.html
- https://openrouter.ai/docs/api/reference/streaming
- https://platform.claude.com/docs/en/build-with-claude/streaming
- https://www.baseten.co/blog/continuous-vs-dynamic-batching-for-ai-inference/
- https://www.anyscale.com/blog/continuous-batching-llm-inference
