The autoscaler that scaled to zero mid-decode: when inference is treated like stateless web traffic

12 min read
Tian Pan
Software Engineer

The cluster did exactly what we told it to. Traffic dropped to zero for forty-five seconds, the queue-depth metric flatlined, KEDA flipped the replica count from one to zero, and the node autoscaler reclaimed the H100 node ninety seconds later. The graph looked clean. The Slack channel was quiet. The cost dashboard ticked down half a cent.

An hour and twelve minutes later, a customer support ticket arrived: a long-running document-analysis job (a 180k-token reasoning task budgeted for twenty-eight minutes of decode) had vanished. No error in their client SDK. No exception in our application logs. Only a single 499 line buried in the gateway access log, timestamped roughly when the autoscaler had decided the pod was idle and reaped it.