
The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture

11 min read
Tian Pan
Software Engineer

The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.

The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.

The Price Tag Is the Easy Part

The headline numbers are real and consistent across providers. OpenAI's Batch API offers a flat 50% discount on every model — the GPT-5 family, mini and nano variants, the o-series reasoning models, and embeddings — in exchange for accepting a 24-hour completion window. Anthropic's Message Batches API offers the same 50% on both input and output tokens, with each batch holding up to ten thousand queries and processed within 24 hours. Both providers note that most batches actually finish in one to six hours, with the 24-hour figure functioning as a worst-case ceiling rather than an expected duration.
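To make the mechanics concrete, here is a minimal sketch of a batch submission against Anthropic's Message Batches API. Method and field names follow the public Python SDK documentation at the time of writing, and the model id and ticket data are placeholders; treat the whole thing as illustrative rather than authoritative.

```python
# Minimal sketch: submitting a batch of classification requests to
# Anthropic's Message Batches API. Method/field names follow the public
# Python SDK docs at the time of writing; verify against current docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

tickets = [
    {"id": "ticket-001", "text": "My invoice shows a duplicate charge."},
    {"id": "ticket-002", "text": "How do I rotate my API key?"},
]

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": t["id"],  # your correlation/idempotency key
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder; use whatever model you run in production
                "max_tokens": 64,
                "messages": [
                    {"role": "user", "content": f"Classify this ticket: {t['text']}"}
                ],
            },
        }
        for t in tickets
    ]
)

# Poll later; results arrive within the 24-hour completion window.
print(batch.id, batch.processing_status)
```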

Stack the batch discount on top of prompt caching and the savings compound. A workload that already benefits from a 90% cache hit rate on its system prompt will see that cached input billed at the batch rate on top of the cache-read discount, so the two discounts multiply. For high-volume enrichment pipelines, the effective input cost can drop by an order of magnitude relative to naïve synchronous calls against the same model. This is the single largest cost lever available to most production teams, and it requires no model swap, no quality regression, and no new capability — just a willingness to wait.
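A back-of-the-envelope calculation shows how the discounts compound. The per-token rates and the cache-read multiplier below are illustrative placeholders, not any provider's actual price list; only the 50% batch multiplier comes from the published tier.

```python
# Back-of-the-envelope: how prompt caching and the batch tier compound.
# All prices below are illustrative placeholders, not a provider's rate card.
base_input_per_mtok = 3.00      # $/M input tokens, synchronous, uncached (placeholder)
cache_read_multiplier = 0.10    # cached input billed at ~10% of base (provider-dependent)
batch_multiplier = 0.50         # batch tier: 50% of the synchronous price

input_mtok = 100                # 100M input tokens/month on an enrichment pipeline
cache_hit_rate = 0.90           # 90% of input tokens hit the prompt cache

naive_sync = input_mtok * base_input_per_mtok

cached_sync = input_mtok * base_input_per_mtok * (
    cache_hit_rate * cache_read_multiplier + (1 - cache_hit_rate)
)
cached_batch = cached_sync * batch_multiplier

print(f"naive sync:     ${naive_sync:,.2f}")    # $300.00
print(f"cached sync:    ${cached_sync:,.2f}")   # $57.00
print(f"cached + batch: ${cached_batch:,.2f}")  # $28.50, roughly 10x cheaper than naive
```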

The catch is that the willingness to wait is not a property of the team. It is a property of the workload, and most teams have not done the work of distinguishing the two.

The Workload Audit Nobody Runs

Walk through the inference call graph of a typical agent product and ask, for each LLM call, what would happen if the result arrived four hours late. Most engineers will instinctively answer "the user is waiting" — and for the agent's primary turn-by-turn loop, that's correct. But the call graph for a mature product is dominated by paths the user is not waiting on:

  • Pre-computed summaries. Daily or weekly digests that the user reads when they open the app, not at the moment they're generated.
  • Classification backfills. Re-tagging a corpus after a taxonomy change, scoring historical content against a new policy, labeling support tickets that arrived overnight.
  • Eval scoring runs. The judge model that grades production traces against quality rubrics, the regression suite that runs against a candidate prompt, the offline calibration job that produces this week's confidence thresholds.
  • Embedding refreshes. Re-embedding a knowledge base after a model upgrade, vectorizing a newly ingested document corpus, running periodic drift checks against an embedding baseline.
  • Content moderation queues. Flagging user-generated content for review where the SLA is "before a human looks at it" rather than "before the user sees their own post."
  • Retrospective enrichment. Pulling structured fields out of historical free-text records, generating alt-text for images that have already been published, normalizing log lines for downstream analytics.

In a typical agent product, these workloads collectively account for a meaningful fraction of total inference spend, and a startling fraction of them were originally implemented as synchronous calls because the developer who shipped the feature reached for the same SDK pattern they used everywhere else. The audit isn't about finding the obvious nightly job. It's about finding the calls that look synchronous in the code but where no human is on the other end of the wire.
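One way to run that audit mechanically is to mine your own inference telemetry for calls whose results nobody read for minutes or hours. The sketch below assumes a log with a call-site identifier, a cost, and timestamps for when the call was made and when the result was first consumed by a human-facing path; those field names are hypothetical and the approach is one option among several, not a prescribed method.

```python
# Sketch of the audit: find call sites that look synchronous in the code
# but whose results nobody reads for minutes or hours. Assumes an inference
# log with (call_site, cost_usd, created_at, first_read_at); field names are
# hypothetical, adapt to your own telemetry.
from collections import defaultdict
from datetime import datetime, timedelta

def audit(log_rows, sync_threshold=timedelta(minutes=1)):
    spend_total = defaultdict(float)
    spend_batchable = defaultdict(float)
    for row in log_rows:
        site = row["call_site"]
        spend_total[site] += row["cost_usd"]
        lag = row["first_read_at"] - row["created_at"]
        if lag > sync_threshold:  # nobody was actually waiting on this call
            spend_batchable[site] += row["cost_usd"]
    return sorted(
        ((site, spend_batchable[site], spend_total[site]) for site in spend_total),
        key=lambda x: -x[1],
    )

rows = [
    {"call_site": "digest.summarize", "cost_usd": 0.12,
     "created_at": datetime(2025, 6, 1, 2, 0), "first_read_at": datetime(2025, 6, 1, 8, 30)},
    {"call_site": "chat.reply", "cost_usd": 0.02,
     "created_at": datetime(2025, 6, 1, 9, 0), "first_read_at": datetime(2025, 6, 1, 9, 0, 2)},
]
for site, batchable, total in audit(rows):
    print(f"{site}: ${batchable:.2f} of ${total:.2f} spent on results nobody waited for")
```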

From Request-Shaped to Job-Shaped Contracts

Once you've identified a workload, migrating it to the batch tier is not a matter of swapping messages.create for batches.create. The contract between caller and callee changes shape, and several engineering disciplines that were optional under the request model become mandatory under the job model.

Idempotency at the job level, not the call level. A synchronous request that fails can be retried in-line and the caller assumes single execution because the response either arrives or it doesn't. A batch job can partially complete, time out, or succeed silently while the caller has crashed and restarted. Every job needs an idempotency key that survives caller restarts, every result write needs to be safe under repeated delivery, and every consumer needs to be able to re-derive its state from the job ledger without double-counting. The teams that skip this step ship duplicate notifications, double-charged enrichment runs, and analytics tables with phantom rows.
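A minimal sketch of what job-level idempotency can look like, using SQLite as the job ledger; the table and column names are illustrative, and any store with an upsert primitive works the same way.

```python
# Sketch: result writes that are safe under repeated delivery. The ledger is
# keyed by (job_id, item_id), so re-delivering a batch result after a caller
# crash overwrites the same row instead of double-counting.
# Table/column names are illustrative.
import sqlite3

conn = sqlite3.connect("job_ledger.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS enrichment_results (
        job_id   TEXT NOT NULL,   -- idempotency key, survives caller restarts
        item_id  TEXT NOT NULL,
        result   TEXT NOT NULL,
        PRIMARY KEY (job_id, item_id)
    )
""")

def write_result(job_id: str, item_id: str, result: str) -> None:
    # Upsert: applying the same batch output twice lands on the same row,
    # so the downstream table never grows phantom duplicates.
    conn.execute(
        "INSERT INTO enrichment_results (job_id, item_id, result) VALUES (?, ?, ?) "
        "ON CONFLICT(job_id, item_id) DO UPDATE SET result = excluded.result",
        (job_id, item_id, result),
    )
    conn.commit()

write_result("backfill-2025-06-01", "ticket-001", "billing")
write_result("backfill-2025-06-01", "ticket-001", "billing")  # safe re-delivery, no duplicate
```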

Observability that handles late results without paging anyone. Your existing alerting almost certainly treats "this call took longer than five seconds" as an incident. A batch job that takes six hours is healthy. The on-call rotation needs new dashboards that distinguish "batch is running long but within SLA" from "batch is stuck and the queue is growing" from "batch finished but the downstream consumer hasn't picked up the result." Per-job timing distributions, queue depth over time, and stale-result-age metrics replace the per-request latency histograms you're used to.
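A sketch of the batch-shaped metrics that replace per-request latency histograms, written with prometheus_client; the metric names and the shape of the pending-job records are made up for illustration.

```python
# Sketch: metrics that distinguish "running long but within SLA" from "stuck"
# from "finished but unconsumed". Metric names are illustrative.
import time
from prometheus_client import Gauge

batch_queue_depth = Gauge(
    "batch_queue_depth", "Jobs submitted but not yet completed", ["workload"]
)
batch_oldest_job_age_seconds = Gauge(
    "batch_oldest_job_age_seconds", "Age of the oldest incomplete job", ["workload"]
)
stale_result_age_seconds = Gauge(
    "stale_result_age_seconds",
    "Age of the newest result the downstream consumer has actually read",
    ["workload"],
)

def report(workload: str, pending_jobs: list[dict], last_consumed_at: float) -> None:
    now = time.time()
    batch_queue_depth.labels(workload).set(len(pending_jobs))
    if pending_jobs:
        batch_oldest_job_age_seconds.labels(workload).set(
            now - min(j["submitted_at"] for j in pending_jobs)
        )
    stale_result_age_seconds.labels(workload).set(now - last_consumed_at)
    # Alert on SLA breach (oldest job age approaching the 24h window) and on
    # stale-result age, not on any single job taking hours; hours is healthy here.
```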

Failure semantics that match the latency tier. A synchronous call has two outcomes: success or error. A batch job has at least four: queued, in-progress, completed-with-partial-failures, and expired-without-completion. The completed-with-partial-failures case is the one most teams forget — a batch of ten thousand queries can return ninety-five hundred successful responses and five hundred errors, and the consumer needs a deterministic policy for how to surface, retry, or drop the failed slice. The Anthropic and OpenAI batch endpoints both surface per-request errors in the result file; the engineering work is in the consumer's policy for what to do with them.
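The consumer-side policy is the part the providers cannot write for you. One way to make it deterministic is sketched below; fetch_batch_results, requeue, and page_oncall are hypothetical helpers standing in for your own result-file parsing, retry queue, and alerting, and the item fields are illustrative.

```python
# Sketch: a deterministic policy for the completed-with-partial-failures case.
# `fetch_batch_results`, `requeue`, and `page_oncall` are hypothetical helpers;
# the item dict fields ("error", "attempt") are illustrative.
MAX_FAILURE_RATE = 0.02   # above this, something systemic is wrong: stop and page
MAX_RETRIES = 2

def consume(batch_id: str, fetch_batch_results, requeue, page_oncall) -> list:
    succeeded, failed = [], []
    for item in fetch_batch_results(batch_id):   # per-request results, including errors
        (succeeded if item["error"] is None else failed).append(item)

    failure_rate = len(failed) / max(len(succeeded) + len(failed), 1)
    if failure_rate > MAX_FAILURE_RATE:
        # Don't silently retry or drop a systemic failure (bad prompt, bad schema).
        page_oncall(f"batch {batch_id}: {failure_rate:.1%} of requests failed")
        return succeeded

    for item in failed:
        if item["attempt"] < MAX_RETRIES:
            requeue(item)   # goes into the next batch submission
        else:
            # Exhausted retries: record the gap explicitly rather than leaving
            # a silent hole in the enriched table.
            succeeded.append({**item, "result": None, "dropped": True})
    return succeeded
```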

Job orchestration that's separate from the request path. Workflow engines like Temporal, Airflow, or even durable Kubernetes Jobs become load-bearing infrastructure rather than nice-to-haves, because the lifecycle of a batch job is now measured in hours and survives process restarts. Embedding the batch submit-and-poll loop inside a request handler is the anti-pattern that turns a 50% discount into a 200% latency regression for the next user who happens to share the worker.
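The safe shape is a submit step and a separate poll step that both live in a durable worker, never in a request handler. A stripped-down sketch follows; submit_batch, get_batch_status, download_results, and deliver are placeholders for whichever provider SDK and workflow engine you use.

```python
# Sketch: the submit-and-poll lifecycle living in a durable worker, never in a
# request handler. `submit_batch`, `get_batch_status`, `download_results`, and
# `deliver` are placeholders for your provider SDK and workflow engine.
import time

POLL_INTERVAL_S = 300        # minutes between polls, not milliseconds
SLA_SECONDS = 24 * 3600      # the provider's worst-case completion window

def run_batch_job(requests, submit_batch, get_batch_status, download_results, deliver):
    batch_id = submit_batch(requests)   # step 1: submit and persist the batch id
    started = time.time()

    while True:                         # step 2: poll on a coarse cadence
        status = get_batch_status(batch_id)
        if status == "completed":
            break
        if status in ("expired", "failed") or time.time() - started > SLA_SECONDS:
            raise RuntimeError(f"batch {batch_id} did not complete within SLA")
        time.sleep(POLL_INTERVAL_S)

    deliver(download_results(batch_id))  # step 3: idempotent result write
```

In a real workflow engine each of those steps would be a separate durable activity so the lifecycle survives worker restarts; the point of the sketch is only that none of it runs on a user request's thread.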

The Freshness Regression Nobody Saw Coming

The most expensive mistake in batch migration is not a cost mistake. It is a correctness mistake disguised as a performance optimization. A team migrates a single workflow — say, a content recommendation scoring job — from synchronous to batch, runs it on a six-hour cadence, and books the savings. Three weeks later, a downstream consumer notices that recommendations are subtly worse. Two weeks after that, someone correlates the regression with the migration. By then the team has shipped four other features on top of the new batch output, and the rollback is no longer a one-line revert.

The failure mode is structural: the upstream pipeline changed its freshness contract from "this result reflects the world as of seconds ago" to "this result reflects the world as of up to six hours ago," but the downstream consumer was never told. The recommendation model that consumed the scores was implicitly assuming a recency that no longer holds. The dashboard that compared today's scores to yesterday's was suddenly comparing two timestamps that overlapped. The user-facing experience that surfaced "recently scored" content silently started surfacing content that was scored before lunch.

The discipline that prevents this is explicit freshness contracts at every workflow boundary. Every result a batch job produces should carry a produced_at and an effective_until timestamp. Every consumer that reads a batch result should declare the maximum staleness it can tolerate, ideally enforced by a check at read time. When the freshness contract changes — and migrating sync to batch is exactly such a change — the consumer's declared tolerance has to be revisited as part of the migration, not discovered six weeks later in a postmortem. This is the kind of work that feels like overhead during the migration and looks like the obvious thing to have done after the regression.
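A freshness contract can be as small as two timestamps on the result and an enforced tolerance at read time. A minimal sketch, with field names mirroring the ones used above:

```python
# Sketch: freshness contracts enforced at the read boundary. produced_at and
# effective_until travel with every batch result; the consumer declares the
# maximum staleness it can tolerate, checked at read time.
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

@dataclass
class BatchResult:
    payload: dict
    produced_at: datetime
    effective_until: datetime

class StaleResultError(Exception):
    pass

def read_result(result: BatchResult, max_staleness: timedelta) -> dict:
    now = datetime.now(timezone.utc)
    if now > result.effective_until or now - result.produced_at > max_staleness:
        # Fail loudly at the boundary instead of letting a downstream model
        # silently consume scores from before lunch.
        raise StaleResultError(
            f"result produced {now - result.produced_at} ago exceeds "
            f"declared tolerance of {max_staleness}"
        )
    return result.payload

# A consumer migrated from sync to batch has to revisit this number explicitly:
some_result = BatchResult(
    payload={"item": "rec-123", "score": 0.87},
    produced_at=datetime.now(timezone.utc) - timedelta(hours=2),
    effective_until=datetime.now(timezone.utc) + timedelta(hours=4),
)
scores = read_result(some_result, max_staleness=timedelta(hours=6))
```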

The Cost-Tier Decision Matrix

Once the audit and the contract work are done, the operating model that emerges is a decision matrix, not a binary. Every agent action gets assigned to one of three tiers based on user-perceived urgency:

  • Synchronous. The user is waiting on this turn-by-turn. Latency budget is sub-second to a few seconds. Pay full price. Examples: the agent's response to the user's current message, an inline tool call the user is watching.
  • Near-real-time. The user expects results in minutes, not seconds, and the system can stage progress. Latency budget is one to fifteen minutes. Pay full price but optimize for throughput, not tail latency. Examples: a research agent that returns a multi-step report, an analysis pipeline the user kicked off and will check back on.
  • Batch. The user is not waiting at all, or the result feeds a process that runs on its own cadence. Latency budget is hours. Pay the batch tier. Examples: nightly enrichment, embedding refreshes, eval grading, retrospective tagging, anything that backfills.

The decision rule is "user-perceived urgency" rather than "developer convenience" because the convenience axis defaults everything to synchronous. The engineer writing the feature picks the SDK pattern that's easiest, the result feels fast in their dev loop, and nothing in the code review surfaces that the user wasn't actually going to look at the output for two hours. Forcing the tier choice to be explicit in the design doc — even as a single field in a feature spec — surfaces the question early enough to architect around the answer.
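Making the tier a required field in code, not just in the design doc, is one way to stop the convenience default. A sketch, with call_sync, enqueue_near_real_time, and enqueue_batch standing in for your own routing layer:

```python
# Sketch: make the tier an explicit, required argument so nothing defaults to
# synchronous out of convenience. The routing helpers are placeholders.
from enum import Enum

class LatencyTier(Enum):
    SYNCHRONOUS = "synchronous"        # user is waiting on this turn
    NEAR_REAL_TIME = "near_real_time"  # minutes; staged progress is fine
    BATCH = "batch"                    # hours; pay the batch tier

def run_inference(request: dict, tier: LatencyTier,
                  call_sync, enqueue_near_real_time, enqueue_batch):
    # No default value for `tier`: the author of every new call site has to
    # answer "is a user actually waiting?" in code review, not in a postmortem.
    if tier is LatencyTier.SYNCHRONOUS:
        return call_sync(request)
    if tier is LatencyTier.NEAR_REAL_TIME:
        return enqueue_near_real_time(request)
    return enqueue_batch(request)
```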

The matrix also clarifies a class of features that don't fit cleanly anywhere: workloads that are user-initiated but have no hard latency requirement, like "summarize my last quarter's notes." For these, the right answer is often a hybrid — submit to the batch tier with a callback or notification when the result is ready — and the product surface has to be designed to support it. "We'll email you when it's ready" is a perfectly good UX for a class of features that today are implemented as synchronous calls because the team didn't ask the question.

Batch Is a Product Surface, Not a Discount

The realization that consolidates all of the above: the batch tier is not a billing line item. It is a different product surface that the model providers built because the underlying compute has different characteristics — idle GPU capacity that can be filled opportunistically without disturbing the latency-sensitive synchronous queue. The 50% price reflects that economic reality, not a marketing promotion. The SLAs, retry semantics, and failure modes that come with the discount are not bugs in the offering; they are honest signals about what the underlying infrastructure can promise.

Treating batch as a product surface means designing for it intentionally — with job orchestration, freshness contracts, observability for long-running work, and a tier-aware decision matrix — rather than retrofitting it into a system designed entirely around synchronous calls. The engineering investment is real, in the range of a few engineer-weeks per major workload to do the audit, build the orchestration, and migrate consumers cleanly. The payoff is a 30–50% reduction in inference spend on the workloads that should have been batch from the start, plus a side benefit: the synchronous tier becomes faster and more reliable because the queue is no longer contending with work that didn't need to be there.

The teams that figure this out in the next year will have a structural cost advantage on every long-running, high-volume LLM workload they ship. The teams that don't will keep paying the synchronous premium on jobs the user was never waiting on, and discover at their next budget review that the line item they thought was a model-choice problem was actually a tier-choice problem all along.
