4 posts tagged with "kubernetes"

The autoscaler that scaled to zero mid-decode: when inference is treated like stateless web traffic

June 2, 2026 · 12 min read

Tian Pan

Software Engineer

The cluster did exactly what we told it to. Traffic dropped to zero for forty-five seconds, the queue-depth metric flatlined, KEDA flipped the replica count from one to zero, and the node autoscaler reclaimed the H100 node ninety seconds later. The graph looked clean. The Slack channel was quiet. The cost dashboard ticked down half a cent.

An hour and twelve minutes later, a customer support ticket arrived: a long-running document-analysis job (a 180k-token reasoning task budgeted for twenty-eight minutes of decode) had vanished. No error in their client SDK. No exception in our application logs. Only a single 499 line buried in the gateway access log, timestamped roughly when the autoscaler had decided the pod was idle and reaped it.
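The gap between "queue is empty" and "pod is idle" can be shown in a few lines. This is a minimal sketch, not KEDA's API or the post's actual fix: `PodState`, `naive_idle`, and `safe_to_scale_to_zero` are hypothetical names, and it assumes the serving process can report how many requests are still decoding.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    """Hypothetical snapshot of one inference pod."""
    queue_depth: int        # requests waiting to start
    in_flight_decodes: int  # requests currently generating tokens


def naive_idle(state: PodState) -> bool:
    # Web-traffic style check: an empty queue is treated as idle.
    return state.queue_depth == 0


def safe_to_scale_to_zero(state: PodState) -> bool:
    # Inference-aware check: also require that nothing is mid-decode.
    return state.queue_depth == 0 and state.in_flight_decodes == 0


if __name__ == "__main__":
    # One long decode running, nothing waiting behind it.
    state = PodState(queue_depth=0, in_flight_decodes=1)
    print("naive check says idle:", naive_idle(state))
    print("inference-aware check says idle:", safe_to_scale_to_zero(state))
```

With one long decode in progress and nothing queued, the two checks disagree, which is the situation the excerpt describes.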

The GPU Reservation Your Batch Workload Starved Your Real-Time Path On

June 2, 2026 · 9 min read

Tian Pan

Software Engineer

The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.

This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.

Why AI-Generated Terraform and Kubernetes Configs Are Silently Wrong

May 6, 2026 · 11 min read

Tian Pan

Software Engineer

Most platform engineers have a version of the same story: they asked an AI assistant to scaffold a Terraform module or a Kubernetes deployment manifest, it came back looking completely reasonable, the CI pipeline went green, and weeks later something bad happened. An IAM role with wildcard permissions. An S3 bucket that wasn't supposed to be public. A Kubernetes pod running as root because nobody checked the security context.

The core problem isn't that LLMs write bad syntax — they rarely do. The problem is that IaC correctness has almost nothing to do with syntax. A Terraform file that terraform validate accepts can still deploy a security disaster. A Kubernetes manifest that kubectl apply --dry-run=client accepts can still schedule pods with dangerous capabilities. The tools your CI pipeline uses to check the code are mostly checking the wrong things.
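As a rough illustration of a check that looks at meaning rather than syntax, the sketch below walks an already-parsed Deployment manifest and flags a few risky settings. The function name `find_risky_settings` and the `RISKY_CAPS` list are assumptions made for this example; the field names (`securityContext`, `runAsNonRoot`, `privileged`, `capabilities.add`) are standard Kubernetes fields. It is not a substitute for a real policy engine.

```python
# Illustrative, deliberately short list of capabilities to flag.
RISKY_CAPS = {"SYS_ADMIN", "NET_ADMIN", "ALL"}


def find_risky_settings(deployment: dict) -> list[str]:
    """Return human-readable findings for a parsed Deployment manifest."""
    pod_spec = deployment.get("spec", {}).get("template", {}).get("spec", {})
    pod_ctx = pod_spec.get("securityContext") or {}
    findings = []
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}
        # A container-level runAsNonRoot overrides the pod-level value.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot", False))
        if not non_root:
            findings.append(f"{name}: runAsNonRoot is not enforced")
        if ctx.get("privileged", False):
            findings.append(f"{name}: privileged is true")
        added = (ctx.get("capabilities") or {}).get("add") or []
        for cap in added:
            if cap in RISKY_CAPS:
                findings.append(f"{name}: adds capability {cap}")
    return findings


if __name__ == "__main__":
    # Syntactically valid manifest with no security context at all.
    manifest = {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "spec": {"template": {"spec": {
            "containers": [{"name": "web", "image": "example/web:1.0"}],
        }}},
    }
    for finding in find_risky_settings(manifest):
        print(finding)
```

A manifest like the one in the usage example is well-formed, yet the check still reports that nothing prevents the container from running as root.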

GPU Scheduling for Mixed LLM Workloads: The Bin-Packing Problem Nobody Solves Well

April 14, 2026 · 10 min read

Tian Pan

Software Engineer

Most GPU clusters running LLM inference are wasting between 30% and 50% of their available compute. Not because engineers are careless, but because the scheduling problem is genuinely hard—and the tools most teams reach for first were never designed for it.

The standard approach is to stand up Kubernetes, request whole GPUs per pod, and let the scheduler figure it out. This works fine for training jobs. For inference across a heterogeneous set of models, it quietly destroys utilization. A cluster running three different 7B models with sporadic traffic will find each GPU busy less than 15% of the time, while remaining fully "allocated" and refusing to schedule new work.
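The difference between "allocated" and "busy" is simple arithmetic. The sketch below assumes one whole GPU per model and uses 15% as an illustrative upper bound on busy time, taken from the figure above; `pool_summary` is a hypothetical helper, not part of any scheduler.

```python
def pool_summary(busy_fractions: list[float]) -> dict:
    """Summarize a pool where each entry is one whole-GPU allocation.

    busy_fractions[i] is the fraction of wall-clock time GPU i spends
    doing useful work (0.0 to 1.0).
    """
    allocated = len(busy_fractions)
    busy_equivalents = sum(busy_fractions)
    return {
        "allocated_gpus": allocated,
        # How many fully busy GPUs the same work would occupy.
        "busy_gpu_equivalents": busy_equivalents,
        "utilization": busy_equivalents / allocated if allocated else 0.0,
        # Whole-GPU requests leave no schedulable capacity, however idle.
        "schedulable_gpus": 0,
    }


if __name__ == "__main__":
    # Three 7B models, one GPU each, each busy at most 15% of the time.
    summary = pool_summary([0.15, 0.15, 0.15])
    for key, value in summary.items():
        print(f"{key}: {value}")
```

Under these assumptions, three allocated GPUs do less than half a GPU's worth of work in total, while the scheduler still sees no free capacity.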

The root cause is a mismatch between how Kubernetes thinks about GPUs and what LLM inference actually requires.

About Tian Pan