The autoscaler that scaled to zero mid-decode: when inference is treated like stateless web traffic

June 2, 2026 · 12 min read

Software Engineer

The cluster did exactly what we told it to. Traffic dropped to zero for forty-five seconds, the queue-depth metric flatlined, KEDA flipped the replica count from one to zero, and the node autoscaler reclaimed the H100 node ninety seconds later. The graph looked clean. The Slack channel was quiet. The cost dashboard ticked down half a cent.

An hour and twelve minutes later, a customer support ticket arrived: a long-running document-analysis job, a 180k-token reasoning task budgeted for twenty-eight minutes of decode, had vanished. No error in their client SDK. No exception in our application logs. Only a single 499 line buried in the gateway access log, timestamped roughly when the autoscaler had decided the pod was idle and reaped it.

That 499 was the whole story compressed into three digits. NGINX (and most reverse proxies that inherited its convention) writes 499 when the client closes the connection before the upstream finishes responding. From the gateway's perspective, the client gave up. From the customer's perspective, the gateway gave up. From the autoscaler's perspective, nothing was happening on that pod at all — the request had been streamed in seconds ago, no new connections were arriving, and the cost-driven scale-down policy did exactly the job it was hired to do.

The postmortem reframed the incident from "an autoscaler bug" to "we encoded the wrong definition of busy." Everything downstream — graceful shutdown windows, queue metrics, PDBs, the cost model — turned out to be wrong in a coordinated way, because the team had imported assumptions from stateless web serving into a workload whose unit of work is a thirty-minute streaming decode.

"Request rate" and "queue depth" are lies on streaming endpoints

The first thing you discover when you instrument an inference pod honestly is that requests/sec is meaningless. A pod that accepted one long-context request twenty-five minutes ago and is currently producing token number 4,812 of an expected 8,000 has a request rate of zero and a queue depth of zero. The autoscaler reads the metric, sees no demand, and starts the wind-down clock.

The Google Kubernetes Engine inference docs and the SGLang observability guide both say the quiet part out loud: raw GPU utilization percent is a poor primary signal for autoscaling LLM workloads, and so is incoming request count. What you actually want to track is the in-flight state of the engine: how many sequences are currently being decoded, how full the KV cache is, how many requests are queued behind the batch. SGLang exposes new_token_ratio, eviction_duration_seconds, and load_back_duration_seconds precisely so you can tell the difference between "this pod is idle" and "this pod is busy making 8,000 tokens for a single customer." vLLM exposes similar gauges through its Prometheus endpoint.

The mistake is upstream of which metric you scrape. The mistake is using a single scalar — utilization, queue depth, RPS — to summarize a workload whose work-units have unbounded duration. The autoscaler controller is a closed-loop system that needs to know when it is allowed to act, and "no new requests" is not the same as "no work in progress" for any engine that does streaming decode or continuous batching.

A useful operational rule: scale-down decisions should be gated on a "no in-flight sequences AND queue is empty AND has been so for N seconds" predicate, not just a smoothed average of RPS. The cost team will push back because the scalar metric is easier to chart, but the chart hides the failure mode entirely.
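The predicate above can be sketched as a small gate that only returns true once the engine has been continuously idle for a full window. This is a minimal illustration, not a KEDA or HPA integration: `EngineSnapshot`, `ScaleDownGate`, and their field names are hypothetical, and the counts are assumed to come from whatever in-flight and queue gauges your engine exposes.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EngineSnapshot:
    """One sample of engine state. Field names are illustrative."""
    running_sequences: int  # sequences currently being decoded
    queued_requests: int    # requests waiting behind the batch


class ScaleDownGate:
    """Permits scale-down only after a continuous idle window of quiet_seconds."""

    def __init__(self, quiet_seconds: float) -> None:
        self.quiet_seconds = quiet_seconds
        self._idle_since: Optional[float] = None

    def observe(self, snap: EngineSnapshot, now: float) -> bool:
        idle = snap.running_sequences == 0 and snap.queued_requests == 0
        if not idle:
            # Any in-flight or queued work resets the clock entirely.
            self._idle_since = None
            return False
        if self._idle_since is None:
            self._idle_since = now
        return now - self._idle_since >= self.quiet_seconds


gate = ScaleDownGate(quiet_seconds=60)
# One long decode in flight: zero new requests, but not safe to scale down.
print(gate.observe(EngineSnapshot(running_sequences=1, queued_requests=0), now=0.0))
```

The design choice worth noticing is the reset: a single busy sample discards all accumulated idle time, which a smoothed average of RPS cannot express.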

`terminationGracePeriodSeconds` is the SLA you didn't realize you signed

When Kubernetes decides to terminate a pod, the kubelet sends SIGTERM to its containers and starts a timer. When the timer expires, it sends SIGKILL. The timer is terminationGracePeriodSeconds, and its default is thirty seconds. For an HTTP API serving sub-second requests, thirty seconds is luxurious. For a 180k-token reasoning job, thirty seconds is a rounding error against twenty-eight minutes of decode.
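To put numbers on that rounding error, here is a tiny worked check using the figures from the incident. The helper name `killed_before_finish` is made up for illustration, and it assumes the worst case where termination arrives at the very start of the decode.

```python
def killed_before_finish(remaining_decode_s: float, grace_period_s: float) -> bool:
    """True if SIGKILL would arrive before the in-flight decode can finish."""
    return remaining_decode_s > grace_period_s


DEFAULT_GRACE_S = 30      # default terminationGracePeriodSeconds
FULL_DECODE_S = 28 * 60   # the twenty-eight-minute job from the incident

print(killed_before_finish(FULL_DECODE_S, DEFAULT_GRACE_S))
print(f"grace window covers {DEFAULT_GRACE_S / FULL_DECODE_S:.1%} of the decode")
```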

You have three options, and none of them are free:

Set the grace period to the longest tolerable request duration. If your P99 request decodes in twenty minutes, set the grace period to twenty-five minutes. This is the simplest fix and the one most teams end up with. The trade-off is that node drains, OS patches, rolling deploys, and cluster autoscaler reclaims all become twenty-five-minute operations per pod. Nobody on the platform team will be happy. There is also a ceiling: managed platforms cap how long they will wait during node upgrades. Google Kubernetes Engine, for example, honors PodDisruptionBudgets and termination grace periods for only up to an hour during automatic node upgrades, so this approach has an upper bound.

Make the engine cancel in-flight work on SIGTERM and propagate a structured error. This is honest: the client knows the request was killed, the pod terminates promptly, the scheduler stays unblocked. The trade-off is that customer-facing failures now happen on every deploy, and you have to design retries and idempotency around them. Worth noting: there is an outstanding vLLM issue (#24584) where the runtime fails to honor HTTP context cancellation during streaming — the pod keeps generating tokens for three or more minutes after the client disconnects. If your engine has this class of bug, "cancel on SIGTERM" is harder than it sounds.

Make long jobs resumable and treat eviction as an expected event. This is the only option that actually composes with cost-driven autoscaling, but it requires the work to be checkpointable. For document analysis, batch summarization, or any pipeline where the unit of work can be redriven from a work queue, this is the right answer: the worker writes intermediate state to durable storage every N tokens, the queue marks the work as in-progress with a visibility timeout, and a replacement pod picks up from the checkpoint after preemption. Batch-inference platforms have been doing this for years against spot instances — work queues track completed versus pending items, and interruption simply returns unfinished work to the queue.
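A minimal sketch of that resumable pattern, under loud assumptions: `InMemoryQueue` and `run_job` are hypothetical stand-ins for a real durable queue and a real worker, the checkpoint store is a plain dict, and a token counter stands in for the generated text a real worker would persist and re-feed to the engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class InMemoryQueue:
    """Toy work queue with a visibility timeout (stand-in for a durable queue)."""
    visibility_timeout: float
    pending: List[str] = field(default_factory=list)
    invisible_until: Dict[str, float] = field(default_factory=dict)

    def receive(self, now: float) -> Optional[str]:
        for job_id in self.pending:
            if self.invisible_until.get(job_id, 0.0) <= now:
                # Hide the job from other workers while it is in progress.
                self.invisible_until[job_id] = now + self.visibility_timeout
                return job_id
        return None

    def complete(self, job_id: str) -> None:
        self.pending.remove(job_id)
        self.invisible_until.pop(job_id, None)


def run_job(job_id: str, total_tokens: int, checkpoints: Dict[str, int],
            every_n: int, evict_at: Optional[int] = None) -> bool:
    """Decode from the last checkpoint. Returns True only if the job finished."""
    done = checkpoints.get(job_id, 0)
    while done < total_tokens:
        if evict_at is not None and done >= evict_at:
            return False  # simulated preemption: progress since last checkpoint is lost
        done += 1  # stand-in for generating one token
        if done % every_n == 0:
            checkpoints[job_id] = done  # a durable write in a real system
    checkpoints[job_id] = done
    return True


queue = InMemoryQueue(visibility_timeout=30.0, pending=["doc-1"])
store: Dict[str, int] = {}
job = queue.receive(now=0.0)
run_job(job, total_tokens=1000, checkpoints=store, every_n=100, evict_at=250)
# After the visibility timeout lapses, a replacement worker resumes from token 200.
job = queue.receive(now=31.0)
if run_job(job, total_tokens=1000, checkpoints=store, every_n=100):
    queue.complete(job)
```

The checkpoint interval is the knob that trades storage writes against redone work: an eviction costs at most `every_n` tokens of repeated decode.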

The first option is a tax. The second is a contract change with your customers. The third is an architectural commitment that requires rewriting the worker. Picking one is an organizational decision dressed up as a configuration knob.
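For the second option, the contract change is easier to reason about with a concrete shape in hand. The sketch below assumes an asyncio-based server; `stream_tokens`, `install_sigterm_handler`, and the error payload fields are illustrative choices, not any engine's actual API, and a list of strings stands in for real decode output.

```python
import asyncio
import json
import signal
from typing import AsyncIterator, List


def install_sigterm_handler(loop: asyncio.AbstractEventLoop,
                            shutdown: asyncio.Event) -> None:
    """Flip the shutdown flag when the kubelet sends SIGTERM (Unix event loops only)."""
    loop.add_signal_handler(signal.SIGTERM, shutdown.set)


async def stream_tokens(tokens: List[str],
                        shutdown: asyncio.Event) -> AsyncIterator[str]:
    """Yield tokens until shutdown is requested, then emit one structured error."""
    for tok in tokens:
        if shutdown.is_set():
            # The client learns the request was killed and that a retry is safe.
            yield json.dumps({"error": "cancelled",
                              "reason": "pod_terminating",
                              "retryable": True})
            return
        yield tok
        await asyncio.sleep(0)  # give the loop a chance to run the signal handler


async def demo() -> List[str]:
    shutdown = asyncio.Event()
    received: List[str] = []
    async for chunk in stream_tokens(["a", "b", "c", "d"], shutdown):
        received.append(chunk)
        if len(received) == 2:
            shutdown.set()  # stands in for SIGTERM arriving mid-stream
    return received


print(asyncio.run(demo()))
```

The final chunk is what makes the failure honest: the stream ends with an explicit, machine-readable reason instead of a silently dropped connection.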

PodDisruptionBudgets only cover the disruptions the cluster initiates

A common reflex after this kind of incident is "we'll add a PDB." A PDB tells the eviction API and the node drain controller to respect a minimum availability when they initiate voluntary disruptions. It is a real and useful guardrail for rolling upgrades, node maintenance, and cluster autoscaler reclaims.
