The Batch-Tier Inference Question: When 50% Off Reshapes Your Architecture
The cheapest inference dollar in your bill is the one you're paying twice. Every major model provider now offers a batch tier at roughly half the price of synchronous inference in exchange for accepting a completion window measured in hours rather than milliseconds. Most engineering organizations either ignore the option entirely, or shove a single nightly cron at it and declare the savings booked. Both responses leave 30–50% of total inference spend on the floor — not because the discount is small, but because batch isn't a coupon. It is a different product surface with its own SLAs, its own retry semantics, and its own failure modes, and the teams that treat it as a billing optimization end up either underusing it or shipping subtle regressions that take weeks to attribute.
The technical question is not "should we use batch?" The technical question is which actions in your system are actually synchronous in the user-perceived sense, which ones the engineering org has accidentally treated as synchronous because the developer experience was easier, and which ones can be re-shaped into jobs without a downstream consumer assuming the result is fresh. Answering that requires a workload audit, an architectural shift from request-shaped to job-shaped contracts, and an honest mapping of every agent action to a latency tier based on user expectation rather than developer convenience.
The Price Tag Is the Easy Part
The headline numbers are real and consistent across providers. OpenAI's Batch API offers a flat 50% discount on every model — the GPT-5 family, mini and nano variants, the o-series reasoning models, and embeddings — in exchange for accepting a 24-hour completion window. Anthropic's Message Batches API offers the same 50% on both input and output tokens, with batches of up to ten thousand queries each processed within 24 hours. Both providers note that most batches actually finish in one to six hours, with the 24-hour figure functioning as a worst-case ceiling rather than an expected duration.
Stack the batch discount on top of prompt caching and the savings compound. A workload that already benefits from a 90% cache hit on its system prompt will see the cached input billed at the batch tier on top of the cache discount. For high-volume enrichment pipelines, the effective input cost can drop by an order of magnitude relative to naïve synchronous calls against the same model. This is the single largest cost lever available to most production teams, and it requires no model swap, no quality regression, and no new capability — just a willingness to wait.
The catch is that the willingness to wait is not a property of the team. It is a property of the workload, and most teams have not done the work of distinguishing the two.
The Workload Audit Nobody Runs
Walk through the inference call graph of a typical agent product and ask, for each LLM call, what would happen if the result arrived four hours late. Most engineers will instinctively answer "the user is waiting" — and for the agent's primary turn-by-turn loop, that's correct. But the call graph for a mature product is dominated by paths the user is not waiting on:
- Pre-computed summaries. Daily or weekly digests that the user reads when they open the app, not at the moment they're generated.
- Classification backfills. Re-tagging a corpus after a taxonomy change, scoring historical content against a new policy, labeling support tickets that arrived overnight.
- Eval scoring runs. The judge model that grades production traces against quality rubrics, the regression suite that runs against a candidate prompt, the offline calibration job that produces this week's confidence thresholds.
- Embedding refreshes. Re-embedding a knowledge base after a model upgrade, vectorizing a newly ingested document corpus, running periodic drift checks against an embedding baseline.
- Content moderation queues. Flagging user-generated content for review where the SLA is "before a human looks at it" rather than "before the user sees their own post."
- Retrospective enrichment. Pulling structured fields out of historical free-text records, generating alt-text for images that have already been published, normalizing log lines for downstream analytics.
In a typical agent product, these workloads collectively account for a meaningful fraction of total inference spend, and a startling fraction of them were originally implemented as synchronous calls because the developer who shipped the feature reached for the same SDK pattern they used everywhere else. The audit isn't about finding the obvious nightly job. It's about finding the calls that look synchronous in the code but where no human is on the other end of the wire.
From Request-Shaped to Job-Shaped Contracts
Once you've identified a workload, migrating it to the batch tier is not a matter of swapping messages.create for batches.create. The contract between caller and callee changes shape, and several engineering disciplines that were optional under the request model become mandatory under the job model.
Idempotency at the job level, not the call level. A synchronous request that fails can be retried in-line and the caller assumes single execution because the response either arrives or it doesn't. A batch job can partially complete, time out, or succeed silently while the caller has crashed and restarted. Every job needs an idempotency key that survives caller restarts, every result write needs to be safe under repeated delivery, and every consumer needs to be able to re-derive its state from the job ledger without double-counting. The teams that skip this step ship duplicate notifications, double-charged enrichment runs, and analytics tables with phantom rows.
- https://openai.com/api/pricing/
- https://www.anthropic.com/news/message-batches-api
- https://platform.claude.com/docs/en/build-with-claude/batch-processing
- https://help.openai.com/en/articles/9197833-batch-api-faq
- https://venturebeat.com/ai/anthropic-challenges-openai-with-affordable-batch-processing
- https://www.cloudzero.com/blog/inference-cost/
- https://reintech.io/blog/llm-batch-processing-handling-large-scale-inference-jobs-efficiently
- https://stevekinney.com/writing/anthropic-batch-api-with-temporal
- https://developers.openai.com/api/docs/pricing
