The GPU Reservation Your Batch Workload Starved Your Real-Time Path On
The nightly fine-tune job starts at 02:00 UTC. It walks into the shared GPU pool, takes every slot it can find, and holds them. By 09:30, when the first inference traffic of the business day arrives, the autoscaler tries to claim capacity that has been continuously occupied for seven and a half hours. The first ninety minutes of the morning run at roughly four times the baseline p99 latency. The dashboard reports a "noisy morning tail" that the inference team attributes to user behavior, because the actual contention lives in a job queue nobody on the inference team owns.
This is the GPU-sharing failure mode that the cost-attribution slide in your capacity review does not capture. The sharing was sold as a utilization win — train at night, serve in the day, fill the trough. What actually shipped was a latency tail you cannot escape until the pool is partitioned by latency class, not by team or by clock.
The instinct to share GPUs across workloads is correct on a spreadsheet and structurally wrong in queueing theory. CPUs context-switch in microseconds; the scheduler can interleave bursty workloads at sub-millisecond granularity without anybody noticing. Swapping one tenant for another on a GPU takes seconds, because the working set that has to be evicted is gigabytes of weights, KV cache, and optimizer state living in HBM, and the cost of paging that state out and a new tenant's state in is measured in wall-clock seconds, not in the cycles the CPU world taught us to budget for. The scheduling primitives that worked for stateless heterogeneous CPU workloads do not generalize to accelerators whose context-switch cost dominates the latency budget of the workload trying to run.
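A back-of-envelope sketch of why a tenant swap lands in seconds rather than microseconds. The state sizes and the effective host-link bandwidth are illustrative assumptions, not measurements from any particular cluster, and the function name is hypothetical.

```python
def swap_seconds(evict_gb: float, load_gb: float, link_gb_per_s: float) -> float:
    """Time to page one tenant's state out of HBM and another tenant's state in."""
    if link_gb_per_s <= 0:
        raise ValueError("link bandwidth must be positive")
    return (evict_gb + load_gb) / link_gb_per_s


# Assumed figures, for illustration only.
training_state_gb = 60.0        # weights plus optimizer state
serving_state_gb = 30.0         # weights plus KV cache
effective_link_gb_per_s = 25.0  # effective host-link throughput

swap = swap_seconds(training_state_gb, serving_state_gb, effective_link_gb_per_s)
print(f"tenant swap: {swap:.1f} s")  # prints "tenant swap: 3.6 s"
```

Under these assumptions a single swap costs more than a typical interactive request's entire latency budget, which is the asymmetry the CPU-era scheduling intuition misses.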
Training and Serving Are Not the Same Workload Class
A training job is tolerant of long preemption-free holds. It batches forward and backward passes, expects to occupy the same GPU for minutes to hours, and is willing to absorb queueing delay because the cost of releasing and reacquiring its weights and optimizer state is worse than the cost of waiting. The job's owner measures success in throughput and cost per training token. Its scheduler wants to maximize the time the GPU is doing useful work.
An inference path is the opposite shape. Requests are short, bursty, and latency-sensitive. A recommendation backend can see a ten-times traffic spike inside a minute, and a chatbot user will close the tab before a three-minute scheduling delay finishes. The serving path wants fast eviction, burst headroom, and predictable tail latency. Its scheduler wants the GPU to be ready, not to be busy.
When the two share a reservation, the training scheduler wins admission by default: it asks first, asks for more, and holds longer, so by the time serving asks there is nothing left to contest. The result is not a fair split. It is a structural inversion: the workload that does not need low latency consumes the capacity that the latency-sensitive workload needed reserved.
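A minimal simulation of that inversion, assuming first-come admission on one shared pool with no preemption. The pool size, request sizes, and hold times are made up to mirror the 02:00 and 09:30 scenario; `Request` and `admit_fifo` are hypothetical names, not any scheduler's API.

```python
from dataclasses import dataclass


@dataclass
class Request:
    tenant: str        # "training" or "serving"
    arrival: float     # hour of day, UTC
    slots: int         # GPUs requested
    hold_hours: float  # how long granted GPUs stay occupied


def admit_fifo(pool_slots: int, requests: list[Request]) -> dict[str, int]:
    """First-come admission on one shared pool, with no preemption."""
    granted: dict[str, int] = {}
    holds: list[tuple[float, int]] = []  # (release hour, slots held)
    for req in sorted(requests, key=lambda r: r.arrival):
        # Only holds that end after this arrival still occupy slots.
        busy = sum(slots for release, slots in holds if release > req.arrival)
        take = min(req.slots, pool_slots - busy)
        granted[req.tenant] = granted.get(req.tenant, 0) + take
        if take:
            holds.append((req.arrival + req.hold_hours, take))
    return granted


result = admit_fifo(8, [
    Request("serving", arrival=9.5, slots=4, hold_hours=1.0),
    Request("training", arrival=2.0, slots=8, hold_hours=9.0),
])
print(result)  # prints {'training': 8, 'serving': 0}
```

Nothing in the admission rule is unfair in the usual sense. The serving request simply arrives at a pool that was fully granted seven and a half hours earlier.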
The Kubernetes community spent 2024 and 2025 working through this (Kueue's gang admission, Volcano's queue-based priority, Run:ai-style fractional GPU accounting, and Dynamic Resource Allocation, which was reworked around structured parameters in 1.31 and reached general availability in 1.34), and the consistent lesson is that the partition has to be expressed at admission time, not at scheduling time. Once the batch job is in the cluster holding a GPU, the serving path's eviction options are bounded by the cost of paging out the batch tenant, which is the same number that made you want to share the GPU in the first place.
The Cost-Attribution Trick That Stops Working at Night
The accounting model that justified the shared pool was straightforward: training and serving had complementary diurnal profiles, the reservation was paid for by serving's budget, and training used "idle" capacity for free. The model worked when "idle" was a property of the GPU and the cost of context-switching was approximately zero.
It stops working the moment the morning ramp begins. The batch tenant is not paying for the latency tail it is creating on the serving path, because the latency tail does not show up in the batch tenant's cost line. It shows up in the serving team's p99 dashboard, in their on-call burnout, in their customer churn, none of which the capacity model attributes back to the batch team. The serving team is absorbing an externality that the batch team is not pricing.
The instinct to fix this with a "polite" batch tenant, one that releases capacity before the morning ramp, runs into a second-order problem. The training scheduler does not know when the morning ramp will arrive, because the morning ramp is statistical, not scheduled. Even with a release at 09:00 UTC, the eviction cost is paid synchronously by the next training step that has to reload state, which makes the batch tenant's owner reluctant to release. And even if they do release, the inference autoscaler has to acquire and warm new replicas (model weights, KV cache, JIT-compiled kernels), which takes minutes on a cold GPU and is itself a tail-latency event the dashboard will misattribute.
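To see why a fixed release time does not close the gap, here is a small sketch that treats the ramp's arrival as a random variable. The normal distribution, its mean and spread, and the drain and warm-up durations are all assumptions chosen for illustration; the function name is hypothetical.

```python
from statistics import NormalDist


def p_ramp_before_ready(release_min: float, drain_min: float, warmup_min: float,
                        ramp_mean_min: float, ramp_sd_min: float) -> float:
    """Probability the traffic ramp arrives before serving replicas are warm.

    Times are minutes after midnight UTC. The ramp arrival is modeled as
    normally distributed, which is an assumption, not a measured property.
    """
    ready_at = release_min + drain_min + warmup_min
    return NormalDist(mu=ramp_mean_min, sigma=ramp_sd_min).cdf(ready_at)


p = p_ramp_before_ready(
    release_min=9 * 60,          # batch releases at 09:00
    drain_min=5,                 # assumed time to checkpoint and vacate
    warmup_min=10,               # assumed time to load weights and warm kernels
    ramp_mean_min=9 * 60 + 30,   # ramp arrives around 09:30 on average
    ramp_sd_min=15,              # assumed day-to-day spread
)
print(f"ramp beats warm-up on about {p:.0%} of mornings")
```

With these numbers the replicas are ready at 09:15, one standard deviation ahead of the mean ramp, so roughly one morning in six still starts cold. Moving the release earlier shrinks that share but costs the batch tenant training time every day.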
The Partition Has to Be Physical
The fix that survives contact with production is to stop treating accelerator capacity as a fungible pool. It is not a unified pool with a utilization metric. It is at least two pools with different queueing dynamics, different eviction costs, and different latency contracts. The accounting has to follow.
For serving, the reservation is non-negotiable: batch is forbidden from claiming it regardless of utilization. The metric to watch is not GPU utilization on the serving reservation but headroom against the burst envelope the serving path can absorb without spilling to a slower tier. Sixty percent utilization on a serving pool is not waste — it is the headroom that protects the p99 from the next traffic spike. If the capacity model treats it as waste, it will eventually be reclaimed by a batch tenant, and the next spike will land on a pool that cannot absorb it.
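The headroom argument can be made concrete with a deliberately crude model that assumes load scales linearly with traffic and ignores queueing effects near saturation, so it overstates what the pool absorbs in practice. The function names are hypothetical.

```python
def burst_multiple(utilization: float) -> float:
    """Largest traffic multiple the pool absorbs before saturating.

    Assumes load scales linearly with traffic; real tail latency
    degrades well before 100% utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1.0 / utilization


def headroom_ok(utilization: float, required_burst: float) -> bool:
    """True if the pool can absorb the required burst multiple."""
    return burst_multiple(utilization) >= required_burst


for u in (0.6, 0.9):
    print(f"utilization {u:.0%}: absorbs a {burst_multiple(u):.2f}x spike")
```

At sixty percent utilization the pool absorbs about a 1.67x spike; at ninety percent, about 1.11x. A capacity model that rewards the second number is trading away the first.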
For batch, the reservation is the opposite shape: it bids on spot capacity outside the serving reservation, accepts preemption as a normal operational state, and is priced against checkpoint-recovery overhead rather than against the cost of a fully-reserved GPU. Volcano's queue-based admission and Kueue's cohort borrowing both express this — let the batch workload borrow capacity it can be evicted from, never capacity the serving path's SLO depends on.
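The borrowing rule can be sketched as a toy admission check. This is a simplified model of the idea, not Kueue's or Volcano's actual API; `Pool`, `admit_batch`, and the `lendable` flag are hypothetical names, and the capacities are invented.

```python
from dataclasses import dataclass


@dataclass
class Pool:
    name: str
    capacity: int   # GPUs in this pool
    lendable: bool  # may batch borrow idle GPUs from here?
    used: int = 0

    @property
    def free(self) -> int:
        return self.capacity - self.used


def admit_batch(request: int, own: Pool, others: list[Pool]) -> dict[str, int]:
    """All-or-nothing placement: own pool first, then lendable pools only.

    Returns the placement, or an empty dict if the request does not fit.
    """
    placement: dict[str, int] = {}
    remaining = request
    for pool in [own] + [p for p in others if p.lendable]:
        take = min(remaining, pool.free)
        if take > 0:
            placement[pool.name] = take
            remaining -= take
    if remaining > 0:
        return {}  # never spill into a non-lendable pool
    by_name = {p.name: p for p in [own] + others}
    for name, take in placement.items():
        by_name[name].used += take
    return placement


batch = Pool("batch", capacity=4, lendable=True)
spot = Pool("spot", capacity=4, lendable=True)
serving = Pool("serving", capacity=8, lendable=False, used=3)

first = admit_batch(6, batch, [serving, spot])   # fits on batch plus spot
second = admit_batch(4, batch, [serving, spot])  # only 2 spot GPUs remain
print(first, second, serving.free)
```

The second request is refused even though the serving pool has five idle GPUs. That refusal is the point: the idle serving capacity is headroom, and the admission rule never sees it as available.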
The hardware affordance that makes this enforceable is MIG (Multi-Instance GPU). An H100 partitioned into seven instances gives each instance dedicated streaming multiprocessors, L2 cache, memory channels, and a fault domain. The walls are physical. A training tenant on one slice cannot evict a serving tenant on another, because the eviction primitive does not exist at the hardware layer. You give up the option of one tenant absorbing the other's idle capacity, and you get back a tail latency that does not move when the neighbor moves. For a serving path with a p99 SLO, that is the trade you want.
For workloads that do not need MIG-class isolation — dev clusters, low-criticality batch — time-slicing is the cheaper option. It gives you context-switching at the driver layer with no memory isolation between replicas, which is fine for a dev namespace and catastrophic for a serving pool sharing capacity with a training job. The taxonomy is not "MIG vs. time-slicing vs. whole-GPU." It is "latency class A gets physical isolation, latency class B gets soft sharing, latency class C runs on spot," and the cluster expresses all three as separate device classes the scheduler admits against.
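One way to express that taxonomy is as a mapping from latency class to device class, decided before anything is scheduled. The class labels, device-class strings, and the two-question classifier below are hypothetical, chosen only to illustrate the shape of the decision.

```python
from enum import Enum


class LatencyClass(Enum):
    A = "SLO-bound serving"
    B = "dev and low-criticality work"
    C = "preemptible batch"


# Hypothetical device-class names; a real cluster would define its own.
DEVICE_CLASS = {
    LatencyClass.A: "mig-isolated",  # physical isolation
    LatencyClass.B: "time-sliced",   # soft sharing, no memory isolation
    LatencyClass.C: "spot",          # interruptible capacity
}


def classify(has_p99_slo: bool, tolerates_preemption: bool) -> LatencyClass:
    """Pick a latency class from two workload properties."""
    if has_p99_slo:
        return LatencyClass.A
    if tolerates_preemption:
        return LatencyClass.C
    return LatencyClass.B


def device_class_for(has_p99_slo: bool, tolerates_preemption: bool) -> str:
    return DEVICE_CLASS[classify(has_p99_slo, tolerates_preemption)]


print(device_class_for(has_p99_slo=True, tolerates_preemption=False))
print(device_class_for(has_p99_slo=False, tolerates_preemption=True))
```

The classifier asks about the workload's latency contract, never about which team owns it or what time of day it runs.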
The Capacity Planner Has to Stop Modeling One Pool
The architectural realization is that an AI cluster is not one pool with three workload types. It is three pools with one purchasing department. The capacity planner that models it as a single pool with utilization targets will keep recommending consolidation, because consolidation always looks like a win when the latency cost is paid in someone else's column.
The change that holds the line is to model accelerator capacity as latency-class partitions from the first reservation:
- Serving-class GPUs have a headroom target, not a utilization target. The accounting question is "how much burst can we absorb," not "how full is it."
- Batch-class GPUs have a throughput target and accept preemption. The accounting question is "what does a checkpoint-recovery cost," not "what is the queueing delay."
- Spot or interruptible capacity is a separate class that batch can bid on and serving never touches.
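The batch-class question in the list above, what a checkpoint recovery costs, can be estimated with simple arithmetic. This sketch assumes preemptions land uniformly within a checkpoint interval, so the expected lost work is half an interval; the interval, reload time, and preemption rate are illustrative, and the function name is hypothetical.

```python
def recovery_minutes_per_day(checkpoint_interval_min: float,
                             reload_min: float,
                             preemptions_per_day: float) -> float:
    """Expected minutes per day lost to preemption on a batch-class GPU."""
    # Expected lost work per preemption, assuming uniform timing.
    expected_lost_work = checkpoint_interval_min / 2
    return preemptions_per_day * (expected_lost_work + reload_min)


overhead = recovery_minutes_per_day(
    checkpoint_interval_min=30,  # assumed
    reload_min=4,                # assumed time to reload weights and optimizer state
    preemptions_per_day=3,       # assumed
)
print(f"{overhead:.0f} min/day, {overhead / (24 * 60):.1%} of wall clock")
```

Under these assumptions the job loses 57 minutes a day, about 4% of wall clock. That is the number to weigh against the price difference between reserved and interruptible capacity.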
The cost-attribution model has to follow. The team whose workload causes a tail-latency incident on the serving path is the team whose budget pays for it. If the batch team's incident-cost line item is zero because the serving team's p99 dashboard absorbs it, the batch team will keep asking for more shared capacity, and the serving team will keep paying for it in burnout. Once the externality is priced, the batch team will discover its own preference for partitioned capacity, because the partition is cheaper than the incident bill.
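A minimal sketch of pricing the externality, assuming an incident's cost is split across the tenants that held contended capacity during the incident window, in proportion to GPU-hours held. The proportional rule, the team names, and the dollar figure are assumptions for illustration; the text only requires that the causing team pays.

```python
def attribute_incident(cost: float, gpu_hours_held: dict[str, float]) -> dict[str, float]:
    """Split an incident's cost across tenants by GPU-hours held on contended capacity."""
    total = sum(gpu_hours_held.values())
    if total == 0:
        return {}  # nobody held contended capacity; nothing to attribute
    return {team: cost * hours / total for team, hours in gpu_hours_held.items()}


# Hypothetical incident: a morning tail costed at 12,000 units.
charges = attribute_incident(
    cost=12_000.0,
    gpu_hours_held={"fine-tune": 9.0, "eval-sweep": 3.0},
)
print(charges)  # prints {'fine-tune': 9000.0, 'eval-sweep': 3000.0}
```

Once a line like this appears in the batch team's budget, the comparison between the incident bill and the cost of a partitioned reservation becomes theirs to make.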
The deeper realization is that GPU sharing is not the CPU sharing pattern with a different cost coefficient. It is a different operational discipline. The scheduling primitives that work for stateless CPU workloads — fair share, work-conserving queues, fast preemption — assume context-switch costs that GPUs do not have. The team that priced GPU sharing as a utilization win shipped a latency tail that cannot be optimized away in software. The partition has to be physical, the accounting has to be honest, and the capacity planner has to stop pretending one pool can serve two contracts.
The morning tail will not show up in the post-mortem as a scheduling decision. It will show up as "user behavior" or "intermittent provider degradation" or "an autoscaler tuning gap" — anything other than the architectural choice to share a reservation between two workloads whose queueing dynamics were never compatible. The team that wants the tail to go away has to name the choice, partition the pool, and price the externality. Until then, every morning at 09:30, the serving team will pay for the batch team's overnight run, and the dashboard will not tell them why.
