When to Skip Real-Time LLM Inference: The Production Case for Async Batch Pipelines
There's a team somewhere right now watching their LLM spend grow 10x month-over-month while their p99 latency hovers around four seconds. The engineers added more retries. The retries hit rate limits. The rate limits triggered fallbacks. The fallbacks are also LLM calls. Nobody paused to ask: does this feature actually need to respond in real time?
Most AI product teams architect for the happy path — user sends a message, model responds, user sees it. The synchronous call pattern is what the API SDK demonstrates in its first code sample, and so that's what ships. But a surprisingly large share of production LLM workloads have nothing to do with a user waiting at a keyboard. They're document enrichment jobs, content classification pipelines, embedding generation tasks, nightly digest generation, and background quality scoring. For those workloads, real-time inference is the wrong tool — and the price you pay for using it anyway is real money, cascading failures, and operational complexity you'll spend months untangling.
The Default That Costs You 50%
When you call an LLM API synchronously, you're paying for two things: the compute that processes your tokens, and the guarantee that idle capacity is standing by ready to respond to you within milliseconds. That second part is expensive. Providers maintain headroom to absorb traffic spikes, and the cost of that headroom gets priced into every real-time call you make.
Batch APIs exist precisely because most requests don't need that guarantee. Both Anthropic and OpenAI offer a 50% flat discount on all tokens — input and output — in exchange for a 24-hour completion window. The provider routes your batch through spare capacity during off-peak hours. You get identical output quality, identical model behavior, and half the bill. Anthropic's batch pricing today runs $7.50 per million output tokens for Sonnet versus $15.00 in real time; the discount stacks with prompt caching, bringing costs as low as 5% of a standard non-cached call for cache-heavy workloads.
For a team processing 100,000 documents monthly at 5,000 tokens each, that's roughly $300/month in savings — on Haiku alone. At Sonnet prices the gap widens further. The math is not subtle.
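A quick sketch of that arithmetic. The per-token prices below are Haiku-class figures treated as placeholders, and the input/output split is an assumption; plug in your provider's current price sheet and your own token profile.

```python
# Illustrative cost model for the 100,000-document workload above.
DOCS = 100_000
INPUT_TOKENS_PER_DOC = 4_500   # assumed split of the 5,000-token average
OUTPUT_TOKENS_PER_DOC = 500
PRICE_IN_PER_MTOK = 0.80       # placeholder real-time $/MTok, input
PRICE_OUT_PER_MTOK = 4.00      # placeholder real-time $/MTok, output
BATCH_DISCOUNT = 0.50          # 50% off both input and output on batch

def monthly_cost(discount: float = 0.0) -> float:
    input_cost = DOCS * INPUT_TOKENS_PER_DOC / 1e6 * PRICE_IN_PER_MTOK
    output_cost = DOCS * OUTPUT_TOKENS_PER_DOC / 1e6 * PRICE_OUT_PER_MTOK
    return (input_cost + output_cost) * (1 - discount)

print(f"real-time: ${monthly_cost():,.0f}/mo")
print(f"batch:     ${monthly_cost(BATCH_DISCOUNT):,.0f}/mo")
```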
Where the Break-Even Actually Is
The question teams should ask before every new LLM feature isn't "how do I call the API?" It's "what is the cost of a 15-minute delay in this feature?"
If the answer is "the user sees a spinner too long," you need real-time. If the answer is "the database row gets populated 15 minutes later," you should be on batch. If the answer is "honestly, a 2-hour delay would be fine as long as it completes before morning standup," you should be on batch with the kind of generous time window that lets the provider use their cheapest capacity.
In practice, most batches complete well under an hour despite the 24-hour guarantee — providers don't deliberately hold your jobs until the last minute, they just won't guarantee faster. The distinction between a 10-minute actual completion and a "real-time" call that returns in 4 seconds matters a lot less than engineers assume, especially when the result is going into a database column rather than onto a screen.
The latency tolerance categories that matter in production:
- Under 5 seconds: User is watching. Real-time, no choice.
- 5–60 seconds: User is waiting but will tolerate it. Consider async-with-polling plus streaming to show progress.
- Minutes to an hour: User submitted something and walked away. Async is correct.
- Hours or overnight: Background pipeline. Batch is mandatory; real-time here is waste.
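These tiers are easy to encode as an explicit routing decision so new features don't default to real-time by accident. A minimal sketch; the thresholds and names are illustrative, not a library API:

```python
from enum import Enum

class Mode(Enum):
    REALTIME = "realtime"            # user is watching the response render
    ASYNC_POLLING = "async_polling"  # submit now, poll or stream progress
    BATCH = "batch"                  # 24-hour-window batch API

def choose_mode(tolerable_delay_seconds: float) -> Mode:
    """Map 'what is the cost of a delay?' onto an inference mode.

    Thresholds mirror the latency tiers above; tune them per feature.
    """
    if tolerable_delay_seconds < 5:
        return Mode.REALTIME
    if tolerable_delay_seconds < 60:
        return Mode.ASYNC_POLLING
    return Mode.BATCH

assert choose_mode(2) is Mode.REALTIME
assert choose_mode(30) is Mode.ASYNC_POLLING
assert choose_mode(3_600) is Mode.BATCH
```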
The Feature Categories That Should Never Have Been Synchronous
Analysis of production LLM deployments reveals a consistent pattern: the features where teams most frequently over-engineer real-time infrastructure are the ones whose users don't actually need it.
Document enrichment and extraction: Invoice parsing, contract analysis, structured data extraction from unstructured text. The user uploaded a file. They'll check back in a few minutes. The extraction result goes into a database. This is batch work. Running it synchronously means holding an HTTP connection open for as long as the LLM takes, across however many documents are in the queue, under the assumption that the user is staring at a loading indicator — which they aren't.
Content moderation at scale: ByteDance processes billions of videos daily through multimodal LLM pipelines on batch-oriented infrastructure. Content moderation doesn't need to complete before a video goes live if your architecture queues the video and surfaces it to users after the moderation window — a pattern that eliminates the synchronous dependency entirely rather than engineering around it.
Embedding generation: Building or refreshing a vector index for a RAG system is not a real-time operation. You're processing a corpus, not responding to a user. Embedding 100,000 documents synchronously means either batching them manually in your own code or blowing through your API rate limits. The batch API handles this with a single call and a polling loop.
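Anthropic's batch endpoints cover messages rather than embeddings, so a concrete sketch of the "single call and a polling loop" shape is easier to show with OpenAI's Batch API, which accepts the embeddings endpoint. The model name, file path, and the load_corpus / upsert_vector helpers are placeholders:

```python
import json
import time
from openai import OpenAI

client = OpenAI()

# 1. One JSONL line per document; custom_id ties each result back to a row.
with open("embed_batch.jsonl", "w") as f:
    for doc_id, text in load_corpus():  # load_corpus() is a placeholder
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": text},
        }) + "\n")

# 2. Upload the file and submit the batch with a 24-hour completion window.
batch_file = client.files.create(file=open("embed_batch.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
)

# 3. Poll until the batch reaches a terminal state, then load the results.
while (batch := client.batches.retrieve(batch.id)).status not in ("completed", "failed", "expired", "cancelled"):
    time.sleep(60)

if batch.status == "completed":
    for line in client.files.content(batch.output_file_id).text.splitlines():
        record = json.loads(line)
        vector = record["response"]["body"]["data"][0]["embedding"]
        upsert_vector(record["custom_id"], vector)  # upsert_vector() is a placeholder
```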
Email digests and ambient agents: Any agent that runs on a schedule rather than in response to a user action — email triage every 10 minutes, nightly report generation, weekly analytics summaries — is inherently async. There's no user waiting. The synchronous API adds no value here and costs twice as much.
Model evaluation and offline testing: Running your eval suite against a new prompt variant shouldn't compete with your production traffic for rate limit capacity. Batch evaluation runs allow you to fire off hundreds of test cases and consume results when they're ready, without touching your production throughput headroom.
What Breaks When You Over-Rely on Synchronous Calls
The failure modes of synchronous LLM architectures don't announce themselves until they're expensive.
Retry storms: A naive retry-on-timeout pattern looks harmless in a single-service unit test. In production, three retries at each layer across a five-service request chain means 3^5 = 243 backend calls per original user request during a degraded period. The LLM provider's rate limit is a shared resource; the retry storm from one customer's traffic spike degrades everyone else on the same tier. Teams that discovered this pattern typically found it by examining their spend graph after an outage, not by design.
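One mitigation is a shared retry budget: retries draw from a per-window allowance instead of multiplying independently at every layer. A sketch with illustrative numbers; in a real system the counter would live in shared state such as Redis so all replicas draw from the same budget:

```python
import random
import time

class RetryBudget:
    """Cap total retries per time window so a degraded dependency cannot amplify load."""

    def __init__(self, max_retries_per_minute: int = 10):
        self.max_retries = max_retries_per_minute
        self.window_start = time.monotonic()
        self.used = 0

    def allow_retry(self) -> bool:
        now = time.monotonic()
        if now - self.window_start >= 60:
            self.window_start, self.used = now, 0
        if self.used >= self.max_retries:
            return False  # budget exhausted: fail fast instead of amplifying
        self.used += 1
        return True

budget = RetryBudget()

def call_with_budget(do_call, max_attempts: int = 3):
    """Retry do_call() on timeout, but only while the shared budget allows it."""
    for attempt in range(max_attempts):
        try:
            return do_call()
        except TimeoutError:
            if attempt == max_attempts - 1 or not budget.allow_retry():
                raise  # surface the error instead of joining a retry storm
            time.sleep(2 ** attempt + random.random())  # exponential backoff with jitter
```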
Timeout cascades: LLM calls are slow — often 10–30 seconds for complex generations. HTTP timeouts on upstream proxies, load balancers, and API gateways are often set in the 30–60 second range and were configured when the most latency-sensitive call in the system was a database query. To infrastructure that wasn't designed with LLMs in mind, a call that legitimately takes 25 seconds is indistinguishable from a hung one.
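Giving LLM calls their own explicit timeout and retry settings, rather than inheriting defaults tuned for fast internal services, removes most of that ambiguity. A sketch using the Anthropic Python SDK's client options; the values are illustrative:

```python
import anthropic
import httpx

# A long total timeout with a short connect timeout: a 25-second generation is
# legitimate, but failing to even open a connection should surface quickly.
client = anthropic.Anthropic(
    max_retries=1,  # keep SDK-level retries low; make retry decisions explicit upstream
    timeout=httpx.Timeout(120.0, connect=5.0),
)

# Per-call override for a known-slow path, without touching the shared client.
slow_client = client.with_options(timeout=300.0)
```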
Rate limit blowups: LLM providers enforce both requests-per-minute (RPM) and tokens-per-minute (TPM) limits. Teams that monitor only RPM are surprised when the TPM limit trips under a normal-seeming request volume — because their agents have learned to produce verbose outputs, or a new feature added tool-calling that inflates context size. Batch APIs operate under different, often higher, throughput limits, and the provider controls scheduling, so your jobs don't need to stay within real-time rate ceilings.
Runaway costs without circuit breakers: One documented case involved an agent loop that ran unchecked for 11 days, driving weekly API spend to $47,000. The retries were doing their job faithfully — retrying each timeout — but nobody had set a spend circuit breaker or an iteration budget. Synchronous APIs make this failure mode easy to miss: every call blocks until it completes or times out, which looks like a natural checkpoint where a human might notice; in this case, nobody did for 11 days. Async architectures require you to build monitoring explicitly, but they also make it easy to add budget checks as part of the job submission step.
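A budget check at the point where work is submitted is only a few dozen lines. A sketch; the ceiling, iteration cap, and cost figures are illustrative placeholders, and in production the running total belongs in shared storage with an alert rather than a silent failure:

```python
class SpendCircuitBreaker:
    """Refuse new LLM work once a spend ceiling or iteration cap is hit."""

    def __init__(self, weekly_budget_usd: float, max_iterations: int = 50):
        self.weekly_budget_usd = weekly_budget_usd
        self.max_iterations = max_iterations
        self.spent_usd = 0.0

    def check(self, estimated_cost_usd: float, iteration: int) -> None:
        if iteration >= self.max_iterations:
            raise RuntimeError(f"iteration budget exhausted at step {iteration}")
        if self.spent_usd + estimated_cost_usd > self.weekly_budget_usd:
            raise RuntimeError(f"would exceed ${self.weekly_budget_usd:,.0f} weekly ceiling")

    def record(self, actual_cost_usd: float) -> None:
        self.spent_usd += actual_cost_usd

# Usage at job-submission time (cost estimates are placeholder numbers):
breaker = SpendCircuitBreaker(weekly_budget_usd=2_000)
breaker.check(estimated_cost_usd=0.75, iteration=3)
breaker.record(actual_cost_usd=0.68)
```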
Queue-Backed Batch Architecture in Practice
A production batch pipeline for LLM workloads doesn't require exotic infrastructure. The pattern is:
- Submission: Your application code compiles a batch of requests and submits them via the batch API. Each request gets a custom_id you assign — this is the only reliable key for matching results, since batch results arrive in arbitrary order.
- Polling or webhook: Either poll the batch status endpoint at 30-second intervals or configure a webhook for completion notification. For scripts and notebooks, polling is simpler. For production systems where latency matters within the async window, webhooks with exponential-backoff retry on delivery failure are worth the setup.
- Result processing: When the batch completes, download results, match by custom_id, and write to your database.
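A minimal end-to-end sketch of that pattern with the Anthropic SDK's Message Batches endpoints. The model name, prompt construction, and the save_result / send_to_dlq helpers are placeholders:

```python
import time
import anthropic

client = anthropic.Anthropic()

# 1. Submission: one request per document, keyed by a custom_id you assign.
batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": f"doc-{i}",
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder model name
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": build_prompt(doc)}],  # placeholder helper
            },
        }
        for i, doc in enumerate(documents)  # `documents` supplied by your pipeline
    ]
)

# 2. Polling: check status on a fixed interval until processing ends.
while client.messages.batches.retrieve(batch.id).processing_status != "ended":
    time.sleep(30)

# 3. Result processing: results arrive in arbitrary order, so match on custom_id.
for entry in client.messages.batches.results(batch.id):
    if entry.result.type == "succeeded":
        text = entry.result.message.content[0].text  # assumes a single text block
        save_result(entry.custom_id, text)           # placeholder idempotent writer
    else:
        send_to_dlq(entry.custom_id, entry.result)   # errored / canceled / expired
```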
For orchestration at scale, frameworks like Temporal and Airflow integrate directly with batch APIs, handling batch ID tracking, polling state, and retry management as first-class concerns rather than application-layer plumbing.
Idempotency is mandatory. Batch jobs fail partway through. Retrying a partial batch means some requests will be submitted twice. If your downstream write isn't idempotent — inserting a record, sending a notification, charging a payment — a retry produces duplicate side effects. Use custom_id as your idempotency key and ensure your result-processing code is safe to re-run.
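One low-ceremony way to get that property is an upsert keyed on custom_id, so re-running the result processor overwrites rows instead of duplicating them. A sketch with SQLite; the table and column names are hypothetical, and this is the shape the save_result placeholder in the earlier sketch would take:

```python
import sqlite3

conn = sqlite3.connect("results.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS extractions (
        custom_id TEXT PRIMARY KEY,   -- idempotency key: one row per batch request
        payload   TEXT NOT NULL
    )
""")

def save_result(custom_id: str, payload: str) -> None:
    # Upsert keyed on custom_id: replaying results after a partial failure
    # rewrites the same row instead of inserting a duplicate.
    conn.execute(
        "INSERT INTO extractions (custom_id, payload) VALUES (?, ?) "
        "ON CONFLICT(custom_id) DO UPDATE SET payload = excluded.payload",
        (custom_id, payload),
    )
    conn.commit()
```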
Partial failure handling: Unlike a synchronous call (all-or-nothing), a batch job can complete with some requests errored. Build a dead letter queue for failed items and distinguish between retryable errors (rate limits, transient timeouts) and permanent failures (invalid request format, authentication errors). Re-submitting permanent failures burns quota without recovery.
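A sketch of that routing; the error-type strings follow Anthropic's published error names as an example, and the queue objects are placeholders standing in for whatever queue or table you use:

```python
# Error types that may succeed on re-submission vs. ones that never will.
RETRYABLE = {"rate_limit_error", "overloaded_error", "api_error"}
PERMANENT = {"invalid_request_error", "authentication_error", "permission_error"}

def route_failed_item(custom_id: str, error_type: str,
                      retry_queue: list, dead_letter_queue: list) -> None:
    """Send retryable failures into the next batch; park everything else for review."""
    if error_type in RETRYABLE:
        retry_queue.append(custom_id)  # include in the next batch submission
    else:
        dead_letter_queue.append((custom_id, error_type))  # needs a code or config fix
```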
The Practical Decision
The rule that holds across production deployments is simple: if the result goes into a database rather than a user interface, use batch. If someone is actively waiting on the response, use real-time.
This covers the vast majority of ambiguous cases. Document enrichment goes into the database. Embedding generation goes into the vector store. Moderation scores get written to a content record. Eval results land in a benchmark table. None of these have a human on the other end of a blocking connection.
The teams that get this right early don't optimize for real-time as the default and carve out exceptions. They treat real-time as the expensive special case it is — justified when the UX genuinely demands it, avoided everywhere else. The 50% baseline savings from batch APIs compounds quickly across a large workload, and the operational simplicity of not needing to engineer around synchronous failures compounds even faster.
The batch API is the right tool for a majority of production LLM workloads, and the reason most teams don't use it is that the SDK shows them a synchronous call first. The 50% discount is the obvious argument, but the more durable one is that async architectures eliminate entire categories of failure — retry storms, timeout cascades, runaway cost spirals — by removing the assumption that every LLM call needs to complete before the next line of code runs.
Sources
- https://sutro.sh/blog/no-need-for-speed-why-batch-llm-inference-is-often-the-smarter-choice
- https://www.finout.io/blog/anthropic-api-pricing
- https://www.finout.io/blog/openai-vs-anthropic-api-pricing-comparison
- https://www.anyscale.com/blog/continuous-batching-llm-inference
- https://docs.anyscale.com/llm/batch-inference/llm-batch-inference-basics
- https://www.zenml.io/blog/what-1200-production-deployments-reveal-about-llmops-in-2025
- https://inference.net/blog/batch-vs-real-time-llm-apis-when-to-use-each
- https://platform.claude.com/docs/en/build-with-claude/batch-processing
- https://jangwook.net/en/blog/en/anthropic-message-batches-api-production-guide/
- https://stevekinney.net/writing/anthropic-batch-api-with-temporal
- https://latitude-blog.ghost.io/blog/scaling-llms-with-batch-processing-ultimate-guide/
- https://portkey.ai/blog/retries-fallbacks-and-circuit-breakers-in-llm-apps/
- https://www.together.ai/blog/batch-api
- https://estuary.dev/blog/batch-processing-vs-stream-processing/
