The Nightly Batch Job That Quietly Became a Latency-Critical Service

· 10 min read
Tian Pan
Software Engineer

It started as a cron job. Every night at 2 a.m., a script woke up, pulled the day's records, ran them through a model, wrote the results to a table, and went back to sleep. It was the simplest possible shape for the problem, and for a year it was exactly the right shape. Nobody thought about it because nobody needed to.

Then someone asked if the results could be ready by 8 a.m. instead of noon. Then someone asked if a user could trigger a run for a single record on demand. Then a product manager asked if it could "feel instant" inside the app. Each request was reasonable. Each change was small. And at no point did anyone open a document titled "Re-architecting the inference pipeline," because at no point did any single change feel like a rewrite.

Eighteen months later you have a latency-critical online service wearing the body of a batch job. It has a p99 nobody measures, a queue nobody drains, and a failure mode where one bad record stalls a user-facing request because the pipeline was built to retry the whole batch. This is one of the most common architectural failures in AI systems, and it almost never shows up as a decision. It shows up as a slow accumulation of reasonable yeses.

The three requests that rewrote your architecture

Architectural drift is the gradual, unintentional deviation of a system from the design it started with. The dangerous part is that it is made of locally rational decisions. The engineer who shortened the cron interval was responding to a real complaint. The engineer who added an HTTP endpoint to trigger a single-record run was unblocking a real feature. Nobody did anything wrong. The architecture changed anyway, because individual decisions compound and no one was measuring their aggregate effect on the shape of the system.

Watch how the slide happens in practice. Request one: "Can we get this sooner?" You change the cron schedule from nightly to hourly. Still batch, still fine. Request two: "Can users trigger it for their own record?" You wrap the batch function in an endpoint that processes a batch of size one. It works in the demo. Request three: "Can it feel instant?" Now you are staring at a function that was tuned to amortize startup cost across ten thousand records, being called with one record while a human watches a spinner.

Each step preserved the previous code. That is exactly the problem. The batch function still loads the entire model into memory on every invocation because that cost used to be divided across ten thousand rows. The pipeline still writes results in one transaction at the end because partial writes never mattered when the consumer was a downstream report. The retry logic still re-runs the whole job on any failure because, for a nightly batch, retrying everything at 2 a.m. costs nothing. Online traffic violates every one of those assumptions, and the assumptions are invisible because they were never written down. They are just the natural shape of code that grew up as a batch job.
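The three inherited assumptions are easier to see in code. The following is a minimal sketch, not the pipeline described here: `FakeModel`, `load_model`, `run_batch`, `handle_request`, and the `Stats` counters are all hypothetical names invented for illustration, and a plain dict stands in for the results table.

```python
from dataclasses import dataclass


@dataclass
class Stats:
    """Counters so the hidden costs are observable (hypothetical)."""
    model_loads: int = 0
    commits: int = 0


class FakeModel:
    """Stand-in for an expensive model (hypothetical)."""

    def predict(self, record: int) -> int:
        if record < 0:
            raise ValueError(f"bad record: {record}")
        return record * 2


def load_model(stats: Stats) -> FakeModel:
    # In a real job this is the slow step: weights, connections, warm-up.
    stats.model_loads += 1
    return FakeModel()


def run_batch(records, table, stats, max_attempts=2) -> bool:
    """The original nightly job, with its batch-era assumptions intact."""
    for _ in range(max_attempts):
        # Assumption 1: load the model on every invocation.
        model = load_model(stats)
        try:
            results = {r: model.predict(r) for r in records}
        except ValueError:
            # Assumption 3: any failure re-runs the whole job.
            continue
        # Assumption 2: write everything once, at the very end.
        table.update(results)
        stats.commits += 1
        return True
    return False


def handle_request(record, table, stats) -> bool:
    """The 'on-demand' endpoint: the same job with a batch of size one."""
    return run_batch([record], table, stats)
```

Calling `handle_request` three times loads the model three times, where one nightly run over the same records would load it once. Calling `run_batch([1, 2, -1], ...)` loads the model twice, writes nothing, and returns `False`: the one bad record takes the good ones down with it.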

Batch and online inference optimize for opposite things

The reason you cannot slide smoothly from one to the other is that they are not two points on a spectrum. They are two designs that optimize for goals in direct tension.

Batch inference optimizes for cost per item. It is allowed to be slow. It can wait until it has accumulated enough work to use a GPU efficiently, run everything in large groups, and amortize every fixed cost — model loading, connection setup, container spin-up — across thousands of records. In LLM serving, grouping requests together can cut per-token cost by roughly 85% while adding only around 20% latency. For a batch job that tradeoff is free money. Throughput is the only number that matters, and a retry is genuinely free because no one is waiting.
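The amortization argument is simple arithmetic: per-item time is the fixed cost divided by the batch size, plus the marginal cost of one item. Here is a small sketch with made-up numbers (a 30-second startup and 50 ms per record are assumptions for illustration, not measurements from any real system):

```python
def per_item_seconds(fixed_s: float, marginal_s: float, n: int) -> float:
    """Fixed cost spread over n items, plus the marginal cost of one item."""
    return fixed_s / n + marginal_s


FIXED_S = 30.0      # hypothetical: model load + connection setup + spin-up
MARGINAL_S = 0.05   # hypothetical: inference time for a single record

nightly = per_item_seconds(FIXED_S, MARGINAL_S, 10_000)
on_demand = per_item_seconds(FIXED_S, MARGINAL_S, 1)

print(f"batch of 10,000: {nightly:.3f} s per record")   # 0.053
print(f"batch of 1:      {on_demand:.3f} s per record")  # 30.050
```

With ten thousand records the startup cost all but disappears. With one record it is the entire request.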

Online inference optimizes for p99 latency. It cannot wait to accumulate work, because the work is one user and the user is already waiting. It has to retrieve context, run the model, validate the output, and return over a network inside a budget measured in hundreds of milliseconds. The hard number is not the average — it is the tail. A latency-first design deliberately leaves GPUs underutilized during quiet periods so that a burst does not blow the p99, and accepts a higher cost per request as the price of predictability.
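To see why the tail is the hard number, here is a short sketch on a synthetic sample: 98 requests at 100 ms and 2 at 2,000 ms. The data is invented for illustration, and `percentile` is a hypothetical helper using the nearest-rank method, not any particular library's definition.

```python
import math
import statistics


def percentile(samples, p: int):
    """Nearest-rank percentile: the smallest value with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Synthetic latencies in milliseconds: mostly fast, with two slow outliers.
latencies_ms = [100] * 98 + [2000] * 2

mean_ms = statistics.mean(latencies_ms)
p50_ms = percentile(latencies_ms, 50)
p99_ms = percentile(latencies_ms, 99)

print(mean_ms, p50_ms, p99_ms)  # 138 100 2000
```

The average and the median both look healthy. The p99 is what the unlucky user actually waits for, and a dashboard that tracks only the mean never shows it.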

These goals genuinely conflict. The batching that makes batch inference cheap is the same batching that destroys tail latency: when requests are grouped, the prefill of one request is interleaved with the decode steps of the others, so a single long prompt can stall every request sharing its batch. That makes it structurally hard to be both fast and high-throughput at once. A retry that is free in batch at least doubles the latency of an online request. The efficiency move and the latency move point in opposite directions. A system that drifted from one to the other without a redesign is now being asked to satisfy both targets with a design that was built to satisfy only one, and it will quietly fail the one it was not built for.

What batch systems assume that online traffic violates

Three assumptions are baked so deeply into batch pipelines that engineers stop seeing them as assumptions. Each one becomes a production incident when real-time traffic arrives.
