The Preprocessing Bottleneck That Kills AI Pipeline Throughput
A team builds a RAG-backed feature, measures end-to-end latency, finds it unacceptably slow, and immediately starts optimizing the model call. They try a smaller model, batch requests, tune temperature and token limits. After two sprints of work, latency drops by 15%. The feature is still too slow. What they never measured: the 600ms they're spending chunking text and generating embeddings before the LLM ever receives a prompt.
This pattern is common enough that it has a name in distributed systems: optimizing the wrong component. In AI pipelines, the LLM call is visible and easy to measure. Everything before it is invisible until you explicitly instrument it — and that's exactly where throughput dies.
Where Latency Actually Lives
A typical production RAG pipeline looks like this: a query arrives, gets embedded, is used to search a vector store, relevant chunks are assembled into a prompt, and the LLM generates a response. Engineers tend to think of this as "retrieval + generation," and they profile generation.
The actual latency breakdown is less flattering. Query embedding alone runs around 85ms on CPU — that's before the vector search starts. Document ingestion pipelines are worse: PDF parsing with standard tools runs 5–30 seconds per page; even fast specialized parsers consistently take around 6 seconds per page. Chunking adds another 100–500ms per document. Metadata enrichment — extracting tables, headers, source provenance — stacks additional passes on top of that.
In a well-instrumented production system, the distribution often looks something like:
- Query embedding: 15–25% of query-path latency
- Vector DB search: 5–10%
- Prompt assembly: 2–5%
- LLM inference (time to first token): 15–30%
- LLM generation: 10–40%
- Network, orchestration, serialization: the rest
The model call is one contributor among several, and it's frequently not the largest. Yet teams instrument the model and leave the rest dark.
The Three Categories of Preprocessing Bottleneck
Not all preprocessing bottlenecks are the same. They fall into three categories with different solutions.
Ingestion-path bottlenecks are the most severe. These occur during document parsing, chunking, embedding generation, and vector store upserts. A benchmark of 800+ documents found that document type alone causes parser accuracy to vary by 55 percentage points — legal documents parse at ~95% accuracy with the right specialized tool, while academic papers with complex layouts land at 40–60%. The performance implications are equally dramatic: a frontier-LLM-based parser achieves 0.03–0.1 pages per second; a specialized tool like Uni-Parser runs 20 pages per second on modern GPU hardware. Choosing the wrong tool makes the ingestion bottleneck one to two orders of magnitude worse.
Embedding generation is often the rate-limiting step for large-scale ingestion. Real-time embedding APIs impose rate limits that batch processing APIs don't. Teams that use the real-time embedding endpoint for bulk ingestion — because it's the same endpoint they use at query time and it's convenient — end up with 2–3x higher costs and 5–10x slower ingestion than teams using batch APIs.
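As one concrete example, OpenAI's Batch API accepts a JSONL file of embedding requests and processes it asynchronously; the sketch below assumes that API, with an illustrative model name and document set. Other platforms offer analogous batch paths.

```python
# pip install openai
import json

from openai import OpenAI

client = OpenAI()
documents = {"doc-1": "first document text...", "doc-2": "second document text..."}

# One JSONL line per request; custom_id lets results be joined back to documents.
with open("embedding_requests.jsonl", "w") as f:
    for doc_id, text in documents.items():
        f.write(json.dumps({
            "custom_id": doc_id,
            "method": "POST",
            "url": "/v1/embeddings",
            "body": {"model": "text-embedding-3-small", "input": text},
        }) + "\n")

# Upload the file and submit the batch job. Results land asynchronously within
# the completion window, without the real-time endpoint's rate limits.
batch_input = client.files.create(
    file=open("embedding_requests.jsonl", "rb"), purpose="batch"
)
batch = client.batches.create(
    input_file_id=batch_input.id,
    endpoint="/v1/embeddings",
    completion_window="24h",
)
print(batch.id, batch.status)
```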
Query-path bottlenecks appear at query time and directly affect user-facing latency. Query embedding and tokenization run on every request. For most Western-language text, tokenization is fast enough to ignore. For morphologically rich languages or structured data formats like JSON, tokenization latency can add 50–200ms per request — enough to dominate the budget in low-latency APIs. This is the category that surprises teams most because it's not obvious that "text normalization before embedding" is a hot path.
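If you suspect tokenization is a material cost, measure it against real payloads. A quick sketch, using tiktoken as a stand-in for whatever tokenizer your embedding model actually uses:

```python
# pip install tiktoken
import json
import time

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # stand-in; use your model's tokenizer

def encode_latency_ms(text: str, runs: int = 20) -> float:
    # Average over repeated runs; measure with real request payloads.
    start = time.perf_counter()
    for _ in range(runs):
        enc.encode(text)
    return (time.perf_counter() - start) / runs * 1_000

# A JSON-heavy payload of the kind that inflates tokenization cost.
payload = json.dumps([{"id": i, "body": "field " * 200} for i in range(500)])
print(f"{encode_latency_ms(payload):.2f} ms per encode")
```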
Architectural mismatch bottlenecks are the subtlest. These occur when the pipeline architecture was designed for one usage pattern but is deployed for another. The clearest example: building a synchronous, request-blocking ingestion pipeline that works fine for tens of documents but becomes the throughput ceiling at tens of thousands. When document ingestion and query serving compete for the same compute budget, every large ingestion job degrades query latency. This isn't an optimization problem — it's a structural problem.
Profiling First: Finding the Real Bottleneck
The only way to identify which category you're in is to trace every stage explicitly. This sounds obvious but is uncommon in practice. Most teams have observability on the model call (via the provider dashboard or SDK-level logging) and observability on the HTTP layer (via application monitoring). The middle — the preprocessing pipeline — is frequently dark.
The right approach uses OpenTelemetry with per-stage spans. Each major step in the pipeline (parse → chunk → embed → upsert for ingestion; tokenize → embed → search → assemble for query path) gets its own span with timing recorded. Collect 95th-percentile latencies, not averages — preprocessing bottlenecks often show up in tail latency before they appear in the mean.
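Here's a minimal sketch of what that instrumentation looks like in Python, using the OpenTelemetry SDK with a console exporter for illustration; the stage functions are stand-ins for your real pipeline:

```python
# pip install opentelemetry-sdk
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import ConsoleSpanExporter, SimpleSpanProcessor

# Console exporter for illustration; production would point an OTLP exporter
# at your collector instead.
provider = TracerProvider()
provider.add_span_processor(SimpleSpanProcessor(ConsoleSpanExporter()))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("rag.preprocessing")

# Stand-ins for the real pipeline stages; replace with your own functions.
def parse_pdf(doc): return doc
def chunk_text(parsed): return [parsed]
def embed(chunks): return [[0.0]] * len(chunks)
def upsert(vectors): pass

def ingest(document: str) -> None:
    # One span per stage, so parse/chunk/embed/upsert each get their own
    # timing in the trace instead of disappearing into one opaque request.
    with tracer.start_as_current_span("parse"):
        parsed = parse_pdf(document)
    with tracer.start_as_current_span("chunk") as span:
        chunks = chunk_text(parsed)
        span.set_attribute("chunk.count", len(chunks))
    with tracer.start_as_current_span("embed"):
        vectors = embed(chunks)
    with tracer.start_as_current_span("upsert"):
        upsert(vectors)

ingest("example document text")
```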
A few things to watch for when reading your trace data:
- CPU-bound vs I/O-bound stages: If chunking is slow, it's CPU-bound and will benefit from parallelism. If the vector store upsert is slow, it's likely I/O-bound and will benefit from batching.
- Serialization overhead: Converting data to the format a vector database expects (e.g., constructing batch objects for Qdrant) can be slower than the insertion RPC itself. This is a CPU-bound operation and benefits from multiprocessing, not asyncio (see the sketch after this list).
- Per-document vs per-batch timing: Report latency both ways. A stage that looks acceptable per document may be the bottleneck per batch because it doesn't parallelize.
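For the serialization case, here's a sketch of moving batch-object construction into worker processes. The point structure is illustrative, not any particular client's actual types:

```python
from multiprocessing import Pool

def build_point(record: dict) -> dict:
    # CPU-bound work: dict construction and payload serialization.
    return {
        "id": record["id"],
        "vector": record["vector"],
        "payload": {"source": record["source"], "text": record["text"]},
    }

def build_points(records: list[dict], workers: int = 8) -> list[dict]:
    # Worker processes sidestep the GIL for CPU-bound work; asyncio would not
    # help here because nothing is waiting on I/O.
    with Pool(processes=workers) as pool:
        return pool.map(build_point, records, chunksize=256)

if __name__ == "__main__":
    records = [
        {"id": i, "vector": [0.0] * 8, "source": "demo", "text": "..."}
        for i in range(100_000)
    ]
    print(len(build_points(records)))
```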
Don't optimize before you have stage-level data. The cases where the model call was the bottleneck all along are real — but so are the cases where 600ms of query embedding was blamed on "slow inference."
Async Preprocessing: Decoupling Ingestion from Query Serving
The structural fix for ingestion-path bottlenecks is architectural separation. Ingestion and query serving should run as independent services with a queue connecting them.
In a queue-backed ingestion pipeline, documents are pushed to a message queue (Kafka, RabbitMQ, or cloud-native alternatives) as they arrive. A separate worker fleet consumes the queue asynchronously: parse, chunk, embed, upsert — with each step independently scalable. The query service reads only from the already-populated vector store and never touches the ingestion pipeline. There is no resource contention.
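A sketch of the worker side, assuming Kafka via the kafka-python client; the topic name and the process function are hypothetical placeholders:

```python
# pip install kafka-python
import json

from kafka import KafkaConsumer

def process(doc: dict) -> None:
    """Hypothetical pipeline: parse -> chunk -> embed -> upsert."""

consumer = KafkaConsumer(
    "documents.ingest",                       # illustrative topic name
    bootstrap_servers="localhost:9092",
    group_id="ingestion-workers",             # scale by adding consumers
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    enable_auto_commit=False,                 # commit only after success
)

for message in consumer:
    process(message.value)
    consumer.commit()                         # at-least-once delivery
```

Because the offset is committed only after processing succeeds, a crashed worker replays the document instead of losing it — which means upserts should be idempotent, keyed by a stable document ID.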
The benefits compound. Ingestion workers can be scaled up during heavy onboarding or batch import jobs without affecting query latency. Failures in the parsing step don't cascade to query serving. You can redeploy or upgrade the ingestion workers without touching the query path.
The same principle applies to embedding generation at scale. When processing millions of documents, batch embedding services (available via Databricks, SageMaker, and similar platforms) offer 3–5x cost reduction over real-time embedding APIs and remove the rate-limit ceiling. The trade-off is added latency before embeddings are available, which is acceptable for most ingestion workflows and irrelevant for query-path embeddings (which still run in real time).
One non-obvious implication: async ingestion requires that the query service tolerate stale embeddings. A document uploaded 30 seconds ago may not yet be searchable. Whether this is acceptable depends on the use case. For most enterprise knowledge-base applications it is; for real-time document collaboration it isn't. Design the architecture to match the latency requirements of the specific product, not just the general case.
Multi-Level Caching: The Fastest Preprocessing Is No Preprocessing
Once the pipeline is instrumented and the ingestion/query paths are decoupled, caching is the highest-ROI optimization. The general principle: any preprocessing result that is stable over time and expensive to compute should be cached.
Query embedding cache is the easiest and highest-return layer. Users of any production application repeat queries with high frequency — product search terms, common support questions, frequently checked document topics. Caching the embedding vector for a query string (with a TTL of 30–60 minutes) eliminates the embedding call for repeat queries. Cache hit rates of 30–50% are common in production applications with real users. At 85ms per embedding on CPU, a 40% hit rate reduces average query-path preprocessing latency by ~34ms.
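A minimal in-process version, assuming a hypothetical embed_fn for the real model call; production deployments often back this with Redis so the cache is shared across replicas:

```python
# pip install cachetools
import hashlib
from typing import Callable

from cachetools import TTLCache

# ~100k entries with a 45-minute TTL; tune both to your traffic and memory budget.
_embedding_cache: TTLCache = TTLCache(maxsize=100_000, ttl=45 * 60)

def cached_embed(query: str, embed_fn: Callable[[str], list[float]]) -> list[float]:
    # Normalize before hashing so trivially different strings share one entry.
    key = hashlib.sha256(query.strip().lower().encode()).hexdigest()
    if key not in _embedding_cache:
        _embedding_cache[key] = embed_fn(query)  # the ~85ms call runs only on a miss
    return _embedding_cache[key]
```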
Retrieval result cache stores the vector search results for a query embedding. For systems with stable document corpora, this eliminates both the embedding call and the vector DB lookup. Invalidation is the challenge: when documents are added or updated, cached retrieval results become stale. The right approach is per-corpus-hash cache keys: if the corpus hasn't changed since the retrieval was cached, return the cached result.
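One way to construct those keys, sketched here assuming you track a content hash per document:

```python
import hashlib

def corpus_version(doc_hashes: list[str]) -> str:
    # A stable fingerprint of the corpus: it changes whenever any document is
    # added, removed, or updated, invalidating all cached retrieval results.
    h = hashlib.sha256()
    for doc_hash in sorted(doc_hashes):
        h.update(doc_hash.encode())
    return h.hexdigest()[:16]

def retrieval_cache_key(query_key: str, version: str) -> str:
    # Stale entries are never returned; they simply stop being addressed.
    return f"retrieval:{version}:{query_key}"
```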
Parsed document cache stores the intermediate output of the parsing step. Parsing is by far the most expensive preprocessing step per document, and documents rarely change after ingestion. Storing the parsed AST or intermediate representation (not just the final chunks) allows reprocessing (re-chunking with different parameters, re-embedding with a new model) without repeating the parse. This is especially valuable during the iterative experimentation phase when chunking strategy is still being tuned.
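A file-backed sketch, keyed on the raw document bytes; the parse function and its JSON-serializable output are assumptions:

```python
import hashlib
import json
from pathlib import Path
from typing import Callable

CACHE_DIR = Path("parsed_cache")
CACHE_DIR.mkdir(exist_ok=True)

def parse_with_cache(raw: bytes, parse_fn: Callable[[bytes], dict]) -> dict:
    # Key on the raw bytes: any change to the document produces a new entry,
    # while re-chunking or re-embedding experiments reuse the stored parse.
    key = hashlib.sha256(raw).hexdigest()
    path = CACHE_DIR / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text())
    parsed = parse_fn(raw)              # the expensive parse runs once per document
    path.write_text(json.dumps(parsed))
    return parsed
```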
Multi-level caching strategies combining embedding cache, retrieval cache, and reranking cache have been measured at up to 80% latency reduction in workloads with significant query overlap, with typical production impact in the 40–60% range.
Implement in this order: query embedding cache first (low effort, high hit rate), retrieval cache second (moderate effort, depends on corpus stability), parsed document cache third (highest engineering cost, highest value for document-heavy workloads).
The Chunking Strategy Overhead
Chunking strategy deserves specific attention because its impact is often misattributed. Teams that see poor RAG accuracy tend to blame the LLM or the embedding model, and invest in upgrading both. In many cases, the accuracy problem is the chunk strategy, not the model — and the latency problem is that the expensive semantic chunking approach they adopted to fix accuracy is slower than necessary.
The research evidence is clear on defaults: 512 tokens with 10–20% overlap achieves 69% accuracy across a broad corpus of academic documents. This is the right starting point. Semantic chunking — which aligns boundaries to topic transitions rather than token counts — improves accuracy to ~87% in specialized domains like clinical decision support, but adds non-trivial compute cost. A January 2026 analysis found that sentence-level chunking matches semantic chunking in accuracy up to ~5,000 tokens, at a fraction of the cost.
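The baseline takes a few lines. This sketch uses tiktoken as the tokenizer (an assumption; use your embedding model's own), with 77 overlap tokens, roughly 15% of a 512-token window:

```python
# pip install tiktoken
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

def fixed_size_chunks(text: str, size: int = 512, overlap: int = 77) -> list[str]:
    # Slide a 512-token window forward by (size - overlap) tokens per step,
    # so consecutive chunks share ~15% of their tokens.
    tokens = enc.encode(text)
    step = size - overlap
    return [enc.decode(tokens[i:i + size]) for i in range(0, len(tokens), step)]
```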
The failure mode: teams ship semantic chunking in the name of accuracy without profiling its runtime overhead, discover that chunking is now 3–5x slower than their fixed-size baseline, and end up with a pipeline that's more accurate but slower than the requirements allow. Profile chunk strategy latency under production document volume before committing to it.
What to Do This Week
If your AI pipeline has a latency problem you haven't diagnosed:
- Add per-stage tracing using OpenTelemetry spans wrapping each preprocessing step. Collect 95th-percentile timing for at least 100 real requests before drawing conclusions.
- Check your embedding call pattern — are you using the real-time embedding endpoint for bulk ingestion? If so, switch to a batch endpoint.
- Add a query embedding cache — even a simple in-memory LRU cache with a short TTL will surface your actual cache hit rate and demonstrate whether the 85ms embedding call is happening on every request.
- If ingestion and query serving share compute — separate them. A queue-backed ingestion pipeline is a one-time architectural change that removes an entire class of resource contention.
The LLM call is the most expensive thing you can see. That doesn't mean it's the most expensive thing in your pipeline.
The preprocessing bottleneck problem is well-understood in the literature but still surprises teams in practice because the model call is so visible and so easy to measure. The fix requires instrumentation first, then architecture — and the instrumentation is the part most teams skip.
Sources
- https://arxiv.org/html/2412.11854v1
- https://hackernoon.com/designing-production-ready-rag-pipelines-tackling-latency-hallucinations-and-cost-at-scale
- https://www.tribe.ai/applied-ai/reducing-latency-and-cost-at-scale-llm-performance
- https://docs.anyscale.com/rag/production-scalability
- https://people.csail.mit.edu/suvinay/pubs/2025.hermes.arxiv.pdf
- https://arxiv.org/html/2603.06976
- https://weaviate.io/blog/chunking-strategies-for-rag
- https://apxml.com/courses/optimizing-rag-for-production/chapter-4-end-to-end-rag-performance/async-processing-batching-rag
- https://www.databricks.com/blog/decoupled-design-billion-scale-vector-search
- https://opensearch.org/blog/scaling-vector-generation-batch-ml-inference-with-opensearch-ingestion-and-ml-commons
- https://datacream.substack.com/p/beyond-prompt-caching-5-more-things
- https://latitude.so/blog/how-to-detect-latency-bottlenecks-in-llm-workflows
- https://arxiv.org/html/2603.10765v1
- https://www.applied-ai.com/briefings/pdf-parsing-benchmark/
- https://www.snowflake.com/en/engineering-blog/impact-retrieval-chunking-finance-rag/
- https://pmc.ncbi.nlm.nih.gov/articles/PMC12649634/
- https://jina.ai/news/late-chunking-in-long-context-embedding-models/
- https://latitude.so/blog/ultimate-guide-to-preprocessing-pipelines-for-llms
