The Batch LLM Pipeline Blind Spot: Offline Processing and the Queue Design Nobody Talks About
Most teams building with LLMs optimize for the wrong workload. They obsess over time-to-first-token, streaming latency, and response speed — then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time. The latency-first mental model that works for chat applications actively sabotages these offline workloads.
The batch LLM pipeline is the unglamorous workhorse of production AI. It's the nightly job that classifies 50,000 support tickets, the weekly pipeline that enriches your CRM with company descriptions, the daily run that generates embeddings for new documents. These workloads have fundamentally different design constraints than real-time serving, and treating them as slow versions of your chat API is where the problems start.
The OLTP/OLAP Split Comes to LLM Infrastructure
Database engineers learned this lesson decades ago: transactional workloads and analytical workloads need different architectures. You don't run your nightly reporting queries against your production OLTP database, and you don't optimize your data warehouse for single-row inserts.
The same split exists in LLM infrastructure, but most teams haven't internalized it yet. Real-time LLM serving optimizes for latency — keeping GPUs warm, maintaining cache hits, minimizing time-to-first-token. Batch LLM processing optimizes for throughput and cost — packing requests densely, tolerating higher latency per request, and exploiting off-peak compute pricing.
When teams apply the real-time mental model to batch workloads, they make predictable mistakes:
- Over-provisioning for latency they don't need. A nightly classification job that runs in 2 hours instead of 20 minutes is perfectly fine if it cuts the bill in half.
- Building synchronous retry loops instead of durable queues. A `for` loop with `try/except` and `time.sleep` is not batch infrastructure.
- Ignoring failure granularity. When item 847 of 10,000 fails, do you retry the entire batch or just that item? Most teams don't have an answer until they hit the problem at 3 AM.
- No cost attribution per job. The monthly LLM bill arrives as one number. Nobody knows if the summarization pipeline costs $200 or $2,000.
Job Sizing: The First Design Decision Most Teams Skip
How big should a batch job be? The answer shapes everything downstream — failure blast radius, checkpoint frequency, memory pressure, and retry cost.
Too small (one LLM call per job) and you drown in queue overhead. Job management, status tracking, and result aggregation consume more engineering effort than the actual inference.
Too large (the entire dataset in one job) and a single failure means reprocessing everything. You also lose the ability to parallelize across workers or providers.
The sweet spot depends on your failure tolerance and provider constraints. Native batch APIs like OpenAI's and Anthropic's accept up to 50,000 and 10,000 requests per batch respectively, but that doesn't mean you should use those as your job sizes. Consider these factors:
- Checkpoint frequency. Save state every N completions. If your process crashes, you lose at most N items of work. A checkpoint interval of 25–100 items balances I/O cost against recovery speed.
- Memory pressure. Accumulating results in memory for 50,000 items with rich outputs will eventually cause problems. Stream results to storage as they complete.
- Provider rate limits. Even batch APIs have throughput limits. Sizing jobs to match your rate allocation prevents the queue from backing up.
- Cost visibility. A job that takes 6 hours to run and costs an unknown amount is operationally invisible. Breaking it into sub-batches with per-batch cost tracking makes the economics legible.
A practical starting point: batch sizes of 500–2,000 items, with checkpointing every 25–50 completions. Adjust based on your per-item latency and failure rate.
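The sizing guidance above can be sketched in a few lines. This is a minimal illustration, not a library API; `BATCH_SIZE` and `CHECKPOINT_EVERY` are the starting-point values from this section, to be tuned against your own latency and failure rate.

```python
from typing import Iterator, List

BATCH_SIZE = 1_000        # items per job; 500-2,000 is the suggested range
CHECKPOINT_EVERY = 50     # persist progress every N completions

def chunk(items: List[str], size: int = BATCH_SIZE) -> Iterator[List[str]]:
    """Split the dataset into fixed-size jobs (the unit of retry and cost tracking)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

items = [f"item-{i}" for i in range(10_000)]
jobs = list(chunk(items))   # 10,000 items -> 10 jobs of 1,000

# Inside a worker, checkpoint on a fixed cadence rather than per item:
for n, item in enumerate(jobs[0], start=1):
    if n % CHECKPOINT_EVERY == 0:
        pass  # persist completed item IDs here (the checkpoint-resume pattern)
```

The failure blast radius falls out of these two constants: a crash costs at most `CHECKPOINT_EVERY` items of rework, and a poisoned input invalidates at most one `BATCH_SIZE`-sized job.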
The Queue Architecture That Actually Works
The canonical "batch processing" implementation in most LLM codebases looks like this: a Python script with a for loop, rotating API keys, exponential backoff, a CSV file for tracking progress, and a prayer that the process doesn't crash overnight. Teams report spending hundreds of engineering hours building and maintaining these fragile scripts.
A proper batch LLM pipeline needs four components:
1. A durable job queue. Redis-backed queues (BullMQ, Celery, RQ) or cloud-native options (SQS, Cloud Tasks) give you persistence, retry semantics, and concurrency control. The key property: if your worker crashes, no work is lost. Jobs remain in the queue and get picked up by another worker.
2. Configurable concurrency and rate limiting. LLM APIs have per-minute token and request limits. Your queue worker needs to respect these limits while maximizing throughput within them. Running 8 concurrent workers with a rate limit of 40 requests per minute is a reasonable starting point for most APIs.
3. Dead letter handling. After exhausting retries, permanently failed items need to go somewhere inspectable. Content policy violations, malformed inputs, and unexpected schema changes all produce non-retryable errors. A dead letter queue separates these from transient failures (rate limits, timeouts, server errors) that will succeed on retry.
4. Result aggregation with exactly-once semantics. Use the job's unique ID as an idempotency key. If the same item gets processed twice due to a retry, the second result should overwrite the first, not create a duplicate. This is especially important for data enrichment pipelines where duplicates corrupt downstream analytics.
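Components 2 and 4 can be combined in a small worker sketch. This is an illustration under stated assumptions, not production code: `call_llm` is a stand-in for the real API call, `RATE_LIMIT_RPM` is set to a demo value, and a plain dict stands in for a durable result store. The pacing math divides the per-minute budget evenly across workers.

```python
import time
from concurrent.futures import ThreadPoolExecutor

CONCURRENCY = 8
RATE_LIMIT_RPM = 6_000  # demo value; use your provider's real per-minute limit

# Result store keyed by item ID: a retried item overwrites, never duplicates.
results: dict[str, str] = {}

def call_llm(text: str) -> str:
    """Stand-in for the real LLM API call."""
    return text.upper()

def process(item_id: str, text: str) -> None:
    # Crude pacing: each of CONCURRENCY workers takes an equal share of the
    # per-minute budget, so total throughput stays near RATE_LIMIT_RPM.
    time.sleep(60.0 * CONCURRENCY / RATE_LIMIT_RPM)
    results[item_id] = call_llm(text)  # idempotent write keyed by item ID

work = [("batch-1:0", "alpha"), ("batch-1:1", "beta"),
        ("batch-1:0", "alpha")]  # last entry simulates a retry of item 0
with ThreadPoolExecutor(max_workers=CONCURRENCY) as pool:
    for item_id, text in work:
        pool.submit(process, item_id, text)
# Exiting the context manager waits for all submitted work to finish.
```

Note that the retried item leaves exactly one row in the store — the overwrite-by-ID property is what makes retries safe for downstream analytics.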
Checkpoint-Resume: The Pattern That Saves Your Night
The most common batch pipeline failure mode: your 10,000-item job crashes at item 6,847 due to a transient API error, and you have no way to resume from where you left off. You reprocess 6,846 items you've already paid for.
Checkpoint-resume is the fix, and it's simpler than most teams assume. The pattern:
- Assign a unique ID to each item before processing begins. A composite key like `batchId:itemIndex` works.
- Persist completed item IDs to durable storage (a database row, a checkpoint file, a Redis set) after each completion or every N completions.
- On startup, load the checkpoint and skip items already marked complete.
- Use atomic writes for checkpoint persistence. Write to a temp file, then rename. This prevents corrupted checkpoints from partial writes during a crash.
The checkpoint file should track: completed item IDs, cumulative token usage, cumulative cost, error counts by type, and a timestamp. This gives you both resume capability and a running cost ledger for the batch.
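A minimal sketch of the whole pattern — atomic write via temp-file-then-rename, resume via skip-if-completed. Filenames and the checkpoint schema are illustrative; the write path relies on `os.replace`, which is atomic on both POSIX and Windows.

```python
import json
import os
import tempfile

CHECKPOINT = "checkpoint.json"

def save_checkpoint(state: dict) -> None:
    """Write to a temp file, then rename: a crash can never half-write the checkpoint."""
    fd, tmp = tempfile.mkstemp(dir=".")
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp, CHECKPOINT)  # atomic rename

def load_checkpoint() -> dict:
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)
    # Fresh start: resume state plus the running cost ledger.
    return {"completed": [], "tokens": 0, "cost_usd": 0.0, "errors": {}}

state = load_checkpoint()
done = set(state["completed"])
for item_id in ["b1:0", "b1:1", "b1:2"]:
    if item_id in done:
        continue                      # resume: skip already-completed items
    # ... call the LLM and persist the result here ...
    state["completed"].append(item_id)
    save_checkpoint(state)            # or checkpoint every N items to cut I/O
```

Rerunning the script after a crash reloads `checkpoint.json` and the loop falls straight through the completed items.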
For teams using native batch APIs (OpenAI Batch API, Anthropic Message Batches), the provider handles checkpointing internally — you submit the batch and poll for completion. But you lose visibility into per-item progress and can't implement custom retry logic for individual failures within the batch.
Cost Attribution: Making the Invisible Visible
Here's a number that surprises most teams: a daily enrichment pipeline processing 10,000 items through Claude Sonnet at ~500 input tokens and ~100 output tokens per item consumes about 5M input and 1M output tokens per run. At $3 per million input tokens and $15 per million output tokens, that's roughly $30 per run at real-time pricing, or $15 with the 50% batch discount. Over a month of daily runs, that's $450–$900.
Now multiply by every batch pipeline in your organization. The content classification job. The document summarization pipeline. The embedding generation run. The synthetic data augmentation job. Each one has a cost, and without per-job attribution, you're flying blind.
Implement cost tracking at three levels:
- Per-item: Log input tokens, output tokens, and model used for each request. This lets you identify items that cost disproportionately more (long documents, complex classifications).
- Per-batch: Aggregate item costs into a batch total. Compare across runs to catch cost drift — a schema change in your source data might double your average token count without anyone noticing.
- Per-pipeline: Track weekly and monthly costs by pipeline. This is what goes into your FinOps dashboard and capacity planning spreadsheets.
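The per-item and per-batch levels reduce to a few lines of arithmetic. The prices below are assumptions for illustration only — check your provider's current rate card before wiring this into a dashboard.

```python
# Assumed prices for illustration only; substitute your provider's real rates.
PRICE_PER_MTOK = {"input": 3.00, "output": 15.00}  # USD per million tokens

def item_cost(input_tokens: int, output_tokens: int) -> float:
    """Per-item cost from the logged token counts of one request."""
    return (input_tokens * PRICE_PER_MTOK["input"]
            + output_tokens * PRICE_PER_MTOK["output"]) / 1_000_000

# Per-batch: aggregate the per-item logs. Comparing this total across runs
# is what catches cost drift from a silent change in input size.
item_logs = [(500, 100), (520, 90), (480, 110)]   # (input, output) tokens
batch_cost = sum(item_cost(i, o) for i, o in item_logs)
```

Per-pipeline totals are then just these batch totals summed by pipeline tag over a week or month.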
Attach metadata tags (pipeline name, team, environment) to every LLM request. Observability platforms like Langfuse, Braintrust, and LangWatch can then break down spend by any dimension. The alternative — trying to reverse-engineer costs from a single monthly API bill — is a problem that gets worse every month as you add more pipelines.
The Failure Taxonomy That Changes Your Retry Strategy
Not all batch failures are equal, and your retry strategy should reflect this. LLM batch pipelines encounter three distinct failure classes:
Transient failures (retry immediately or with backoff): HTTP 429 rate limits, HTTP 5xx server errors, connection timeouts, network resets. These succeed on retry. Use exponential backoff with jitter to avoid thundering herds when the provider recovers.
Deterministic failures (don't retry, route to dead letter): Content policy violations, malformed input that will never parse, schema validation errors. Retrying these wastes money and quota. Detect them by error code and route directly to your dead letter queue.
Quality failures (the dangerous category): The API returns a 200, but the output is garbage — hallucinated classifications, truncated summaries, JSON that parses but contains wrong data. These are invisible to standard retry logic. You need output validation: schema checks for structured output, length bounds for free text, confidence thresholds for classifications.
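The three classes map naturally onto a routing function at the top of the retry logic. A minimal sketch — the status-code sets and the `output_valid` flag (the result of your schema/length/confidence checks) are illustrative assumptions:

```python
from enum import Enum

class Action(Enum):
    OK = "ok"                    # validated output, write to the result store
    RETRY = "retry"              # transient: exponential backoff with jitter
    DEAD_LETTER = "dead_letter"  # deterministic: park for human inspection
    REVALIDATE = "revalidate"    # quality: HTTP 200 but output failed validation

TRANSIENT_STATUSES = {429, 500, 502, 503, 504}

def classify(status: int, output_valid: bool) -> Action:
    if status in TRANSIENT_STATUSES:
        return Action.RETRY
    if status >= 400:            # e.g. policy violation, unparseable input
        return Action.DEAD_LETTER
    if not output_valid:         # parsed fine but failed the validation gate
        return Action.REVALIDATE
    return Action.OK
```

The order matters: checking `output_valid` only after the status checks is what keeps quality failures from being mistaken for transient ones and retried blindly.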
The quality failure category is where batch pipelines diverge most from real-time serving. In a chat application, the user provides quality feedback by rephrasing or expressing confusion. In a batch pipeline, bad outputs flow silently into your database and corrupt downstream systems. A validation gate between the LLM response and your data store is not optional.
When the Latency-First Mental Model Breaks Down
The most expensive mistake in batch pipeline design is optimizing for latency you don't need. Here's how it typically manifests:
Streaming responses for batch jobs. Streaming adds implementation complexity and provides no benefit when the output goes directly to a database. Use non-streaming endpoints for batch workloads — they're simpler and sometimes cheaper.
Keeping prompt caches warm for infrequent jobs. Prompt caching helps when the same prefix is used repeatedly within a short window. A nightly job that runs once every 24 hours won't benefit from a cache with a 5-minute TTL. Don't architect around cache hits you'll never get.
Using the fastest model when a slower one suffices. A classification task that works fine with Haiku doesn't need Sonnet. Batch processing gives you the luxury of running A/B comparisons across models on real data — use that to find the cheapest model that meets your quality bar.
Real-time monitoring for batch jobs. You don't need a Grafana dashboard updating every 5 seconds for a job that runs once a day. A completion notification with summary statistics (items processed, errors, cost, duration) is sufficient. Save the real-time dashboards for your serving infrastructure.
The Architecture That Scales
A production-grade batch LLM pipeline converges on a common shape regardless of the specific queue technology:
A scheduler triggers jobs on a cron schedule or in response to events (new data arriving, upstream pipeline completing). It creates job definitions — the input data, the model configuration, the output destination — and submits them to the queue.
A worker pool pulls jobs from the queue, calls the LLM API with concurrency and rate limiting, validates outputs, and writes results to storage. Workers are stateless; any worker can pick up any job. This enables horizontal scaling and fault tolerance.
A result store accumulates outputs with idempotency guarantees. Whether it's a database, a data lake, or an object store, writes are keyed by item ID so duplicates are impossible.
A monitor tracks batch progress, cost accumulation, error rates, and duration. It sends alerts when error rates exceed thresholds or when costs deviate significantly from historical baselines.
The scheduler and worker should be separate processes. This lets you submit jobs from your application while workers process them independently, and either side can scale or restart without affecting the other. It also prevents the common failure mode where a worker crash takes down the scheduler and no new jobs get created.
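The scheduler/worker handoff boils down to a job definition crossing a durable queue. A sketch of that shape — the field names are illustrative, and Python's in-process `queue.Queue` stands in for a real durable queue (SQS, Celery, BullMQ) purely to show the interface:

```python
from dataclasses import dataclass
import queue

@dataclass(frozen=True)
class JobDefinition:
    batch_id: str
    input_uri: str    # where the worker reads items from
    model: str
    output_uri: str   # where validated results land

# Stand-in for the durable queue between the two processes.
job_queue: "queue.Queue[JobDefinition]" = queue.Queue()

# Scheduler process: build the definition, submit it, exit.
job_queue.put(JobDefinition("2024-06-01", "s3://in/tickets.jsonl",
                            "claude-haiku", "s3://out/labels.jsonl"))

# Worker process: pull a definition and process it; stateless between jobs.
job = job_queue.get()
```

Because everything a worker needs travels inside the `JobDefinition`, any worker can pick up any job, which is what makes the pool horizontally scalable.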
Making the Investment Case
If your team processes more than a few hundred LLM requests per day in any non-interactive context, you have a batch workload. Treating it as one saves money (50% on native batch APIs, more with off-peak self-hosted inference), improves reliability (durable queues beat retry loops), and makes costs visible (per-pipeline attribution beats monthly bill archaeology).
The shift isn't technically difficult. The hard part is organizational: recognizing that the nightly script somebody wrote in a Jupyter notebook six months ago is now a critical data pipeline that deserves queue infrastructure, monitoring, and cost tracking. The batch LLM pipeline is production infrastructure. It's time to treat it that way.