
The Batch LLM Pipeline Blind Spot: Offline Processing and the Queue Design Nobody Talks About

· 11 min read
Tian Pan
Software Engineer

Most teams building with LLMs optimize for the wrong workload. They obsess over time-to-first-token, streaming latency, and response speed — then discover that 60% or more of their LLM API spend goes to nightly summarization jobs, data enrichment pipelines, and classification runs that nobody watches in real time. The latency-first mental model that works for chat applications actively sabotages these offline workloads.

The batch LLM pipeline is the unglamorous workhorse of production AI. It's the nightly job that classifies 50,000 support tickets, the weekly pipeline that enriches your CRM with company descriptions, the daily run that generates embeddings for new documents. These workloads have fundamentally different design constraints than real-time serving, and treating them as slow versions of your chat API is where the problems start.

The OLTP/OLAP Split Comes to LLM Infrastructure

Database engineers learned this lesson decades ago: transactional workloads and analytical workloads need different architectures. You don't run your nightly reporting queries against your production OLTP database, and you don't optimize your data warehouse for single-row inserts.

The same split exists in LLM infrastructure, but most teams haven't internalized it yet. Real-time LLM serving optimizes for latency — keeping GPUs warm, maintaining cache hits, minimizing time-to-first-token. Batch LLM processing optimizes for throughput and cost — packing requests densely, tolerating higher latency per request, and exploiting off-peak compute pricing.

When teams apply the real-time mental model to batch workloads, they make predictable mistakes:

  • Over-provisioning for latency they don't need. A nightly classification job that runs in 2 hours instead of 20 minutes is perfectly fine if it cuts the bill in half.
  • Building synchronous retry loops instead of durable queues. A for loop with try/except and time.sleep is not batch infrastructure.
  • Ignoring failure granularity. When item 847 of 10,000 fails, do you retry the entire batch or just that item? Most teams don't have an answer until they hit the problem at 3 AM.
  • Skipping per-job cost attribution. The monthly LLM bill arrives as one number. Nobody knows if the summarization pipeline costs $200 or $2,000.
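
To make the cost attribution point concrete, here is a minimal sketch in Python. It assumes you record token counts per call along with a job label. The Usage record, the cost_by_job helper, and the prices in PRICE_PER_MTOK are hypothetical placeholders for illustration, not real provider rates.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens; substitute your provider's rates.
PRICE_PER_MTOK = {"input": 1.00, "output": 4.00}


@dataclass
class Usage:
    job: str            # pipeline or job label, e.g. "summarize"
    input_tokens: int
    output_tokens: int


def cost_usd(u: Usage) -> float:
    """Cost of one call under the assumed price table."""
    return (
        u.input_tokens * PRICE_PER_MTOK["input"]
        + u.output_tokens * PRICE_PER_MTOK["output"]
    ) / 1_000_000


def cost_by_job(records: list[Usage]) -> dict[str, float]:
    """Sum per-call costs into one total per job label."""
    totals: dict[str, float] = defaultdict(float)
    for u in records:
        totals[u.job] += cost_usd(u)
    return dict(totals)


records = [
    Usage("summarize", 2_000_000, 500_000),
    Usage("classify", 1_000_000, 0),
    Usage("summarize", 1_000_000, 250_000),
]
report = cost_by_job(records)
for job, usd in sorted(report.items()):
    print(f"{job}: ${usd:.2f}")
```

Tagging each call with a job label when you log it is the important part; the arithmetic is trivial once the label exists.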

Job Sizing: The First Design Decision Most Teams Skip

How big should a batch job be? The answer shapes everything downstream — failure blast radius, checkpoint frequency, memory pressure, and retry cost.

Too small (one LLM call per job) and you drown in queue overhead. Job management, status tracking, and result aggregation consume more engineering effort than the actual inference.

Too large (the entire dataset in one job) and a single failure means reprocessing everything. You also lose the ability to parallelize across workers or providers.

The sweet spot depends on your failure tolerance and provider constraints. Native batch APIs such as OpenAI's and Anthropic's accept tens of thousands of requests per batch (OpenAI documents a 50,000-request cap; provider limits change over time, so check the current documentation), but that doesn't mean you should use those caps as your job sizes. Consider these factors:

  • Checkpoint frequency. Save state every N completions. If your process crashes, you lose at most N items of work. A checkpoint interval of 25–100 items balances I/O cost against recovery speed.
  • Memory pressure. Accumulating results in memory for 50,000 items with rich outputs will eventually cause problems. Stream results to storage as they complete.
  • Provider rate limits. Even batch APIs have throughput limits. Sizing jobs to match your rate allocation prevents the queue from backing up.
  • Cost visibility. A job that takes 6 hours to run and costs an unknown amount is operationally invisible. Breaking it into sub-batches with per-batch cost tracking makes the economics legible.

A practical starting point: batch sizes of 500–2,000 items, with checkpointing every 25–50 completions. Adjust based on your per-item latency and failure rate.
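
As a rough sketch of that starting point, the Python below splits a dataset into sub-batches of 1,000 and appends each result to a JSONL file as it completes, syncing to disk every 25 items. The function names, the file naming scheme, and fake_llm (a stand-in for a real model call) are assumptions for illustration.

```python
import json
import os
import tempfile
from typing import Callable, Iterator


def sub_batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def run_sub_batch(
    batch: list,
    process: Callable[[dict], str],
    out_path: str,
    flush_every: int = 25,
) -> int:
    """Append one JSON line per result instead of accumulating in memory."""
    written = 0
    with open(out_path, "a", encoding="utf-8") as out:
        for item in batch:
            result = process(item)
            out.write(json.dumps({"id": item["id"], "result": result}) + "\n")
            written += 1
            if written % flush_every == 0:
                out.flush()
                os.fsync(out.fileno())  # force buffered lines to disk
    return written


def fake_llm(item: dict) -> str:
    """Hypothetical stand-in for the real LLM call."""
    return item["text"].upper()


items = [{"id": i, "text": f"ticket {i}"} for i in range(2500)]
sizes = [len(b) for b in sub_batches(items, 1000)]

with tempfile.TemporaryDirectory() as tmpdir:
    written = 0
    for n, batch in enumerate(sub_batches(items, 1000)):
        out_path = os.path.join(tmpdir, f"results-{n:04d}.jsonl")
        written += run_sub_batch(batch, fake_llm, out_path)
    files = sorted(os.listdir(tmpdir))

print(sizes, written, files)
```

One output file per sub-batch keeps the failure blast radius small: a bad sub-batch can be deleted and rerun without touching the others.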

The Queue Architecture That Actually Works

The canonical "batch processing" implementation in most LLM codebases looks like this: a Python script with a for loop, rotating API keys, exponential backoff, a CSV file for tracking progress, and a prayer that the process doesn't crash overnight. Teams report spending hundreds of engineering hours building and maintaining these fragile scripts.

A proper batch LLM pipeline needs four components:

1. A durable job queue. Redis-backed queues (BullMQ, Celery, RQ) or cloud-native options (SQS, Cloud Tasks) give you persistence, retry semantics, and concurrency control. The key property: if your worker crashes, no work is lost. Jobs remain in the queue and get picked up by another worker.

2. Configurable concurrency and rate limiting. LLM APIs have per-minute token and request limits. Your queue worker needs to respect these limits while maximizing throughput within them. Running 8 concurrent workers with a rate limit of 40 requests per minute is a reasonable starting point for most APIs.

3. Dead letter handling. After exhausting retries, permanently failed items need to go somewhere inspectable. Content policy violations, malformed inputs, and unexpected schema changes all produce non-retryable errors. A dead letter queue separates these from transient failures (rate limits, timeouts, server errors) that will succeed on retry.

4. Result aggregation with effectively exactly-once semantics. Durable queues typically deliver at least once, so the same item can be processed twice after a retry or a worker crash. Use the job's unique ID as an idempotency key so that the second result overwrites the first instead of creating a duplicate. This is especially important for data enrichment pipelines, where duplicates corrupt downstream analytics.
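
The sketch below shows how three of these pieces fit together: durable jobs, dead letter handling, and idempotent result writes. It uses SQLite from the Python standard library rather than any of the queue products named above. The schema, MAX_ATTEMPTS, NonRetryableError, and the demo handler are all assumptions made for illustration. It is single-worker only and omits concurrency control and rate limiting.

```python
import sqlite3

MAX_ATTEMPTS = 3  # assumed retry budget before a job is dead-lettered


class NonRetryableError(Exception):
    """Hypothetical marker for failures that will never succeed on retry."""


def open_queue(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)  # pass a file path for real durability
    conn.executescript("""
        CREATE TABLE IF NOT EXISTS jobs (
            id TEXT PRIMARY KEY,
            payload TEXT NOT NULL,
            status TEXT NOT NULL DEFAULT 'pending',  -- pending | done | dead
            attempts INTEGER NOT NULL DEFAULT 0,
            last_error TEXT
        );
        CREATE TABLE IF NOT EXISTS results (
            job_id TEXT PRIMARY KEY,
            output TEXT NOT NULL
        );
    """)
    return conn


def enqueue(conn: sqlite3.Connection, job_id: str, payload: str) -> None:
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR IGNORE INTO jobs (id, payload) VALUES (?, ?)",
            (job_id, payload),
        )


def _fail(conn, job_id: str, attempts: int, error: str, dead: bool) -> None:
    with conn:
        conn.execute(
            "UPDATE jobs SET status = ?, attempts = ?, last_error = ? WHERE id = ?",
            ("dead" if dead else "pending", attempts, error, job_id),
        )


def process_next(conn: sqlite3.Connection, handler) -> bool:
    """Run one pending job; return False when nothing is pending."""
    row = conn.execute(
        "SELECT id, payload, attempts FROM jobs "
        "WHERE status = 'pending' ORDER BY attempts, id LIMIT 1"
    ).fetchone()
    if row is None:
        return False
    job_id, payload, attempts = row
    try:
        output = handler(payload)
    except NonRetryableError as exc:
        _fail(conn, job_id, attempts + 1, str(exc), dead=True)
    except Exception as exc:  # treated as transient until the budget runs out
        _fail(conn, job_id, attempts + 1, str(exc), attempts + 1 >= MAX_ATTEMPTS)
    else:
        with conn:
            # The job ID is the idempotency key: a repeat overwrites, never duplicates.
            conn.execute(
                "INSERT OR REPLACE INTO results (job_id, output) VALUES (?, ?)",
                (job_id, output),
            )
            conn.execute(
                "UPDATE jobs SET status = 'done', attempts = ? WHERE id = ?",
                (attempts + 1, job_id),
            )
    return True


calls = {"flaky": 0}


def handler(payload: str) -> str:
    """Demo handler: 'bad' never works, 'flaky' fails once, others succeed."""
    if payload == "bad":
        raise NonRetryableError("content policy violation")
    if payload == "flaky":
        calls["flaky"] += 1
        if calls["flaky"] == 1:
            raise TimeoutError("transient timeout")
    return payload.upper()


conn = open_queue()
for job_id, payload in [("batch1:0", "ok"), ("batch1:1", "flaky"), ("batch1:2", "bad")]:
    enqueue(conn, job_id, payload)
while process_next(conn, handler):
    pass

statuses = dict(conn.execute("SELECT id, status FROM jobs"))
results = dict(conn.execute("SELECT job_id, output FROM results"))
dead_letters = conn.execute(
    "SELECT id, last_error FROM jobs WHERE status = 'dead'"
).fetchall()
```

A job stays pending until its result is committed, so a crash in the middle of a handler call leaves the job in the queue for the next run. With several workers you would also need a lease or visibility timeout so two workers do not claim the same row, which is what the queue products listed above provide.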

Checkpoint-Resume: The Pattern That Saves Your Night

The most common batch pipeline failure mode: your 10,000-item job crashes at item 6,847 due to a transient API error, and you have no way to resume from where you left off. You reprocess 6,846 items you've already paid for.

Checkpoint-resume is the fix, and it's simpler than most teams assume. The pattern:

  1. Assign a unique ID to each item before processing begins. A composite key like batchId:itemIndex works.
  2. Persist completed item IDs to durable storage (a database row, a checkpoint file, a Redis set) after each completion or every N completions.
  3. On startup, load the checkpoint and skip items already marked complete.
  4. Use atomic writes for checkpoint persistence. Write to a temp file, then rename. This prevents corrupted checkpoints from partial writes during a crash.
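
The four steps above can be sketched in a few dozen lines of Python. This is a minimal illustration assuming a JSON checkpoint file on local disk; the function names, the file layout, and the simulated crash are hypothetical. It tracks only which item IDs are complete, so the results themselves still need to be written to durable storage separately.

```python
import json
import os
import tempfile


def load_checkpoint(path: str) -> set[str]:
    """Return the IDs already completed, or an empty set on first run."""
    try:
        with open(path, encoding="utf-8") as f:
            return set(json.load(f))
    except FileNotFoundError:
        return set()


def save_checkpoint(path: str, done: set[str]) -> None:
    """Write to a temp file in the same directory, then rename over the target."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(sorted(done), f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, path)  # readers see the old file or the new one, never half
    except BaseException:
        if os.path.exists(tmp):
            os.remove(tmp)
        raise


def run(batch_id: str, items: list, process, path: str, every: int = 25) -> int:
    """Process items not yet checkpointed; return how many were processed."""
    done = load_checkpoint(path)
    processed = 0
    for index, item in enumerate(items):
        item_id = f"{batch_id}:{index}"  # composite key: batchId:itemIndex
        if item_id in done:
            continue  # already paid for on an earlier run
        process(item)
        done.add(item_id)
        processed += 1
        if processed % every == 0:
            save_checkpoint(path, done)
    save_checkpoint(path, done)
    return processed


def crashes_at_60(item: int) -> None:
    """Simulated worker that dies partway through the batch."""
    if item == 60:
        raise RuntimeError("simulated crash")


with tempfile.TemporaryDirectory() as tmpdir:
    ckpt = os.path.join(tmpdir, "batch42.ckpt.json")
    items = list(range(100))
    try:
        run("batch42", items, crashes_at_60, ckpt)
    except RuntimeError:
        pass
    saved_after_crash = len(load_checkpoint(ckpt))
    resumed = run("batch42", items, lambda item: None, ckpt)
    total_done = len(load_checkpoint(ckpt))

print(saved_after_crash, resumed, total_done)
```

In this run the crash happens after 60 items have been processed, but the last checkpoint covers only 50, so the resume repeats 10 items. That is the trade-off behind the checkpoint interval: a smaller interval means less repeated work and more checkpoint I/O.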