
Profiling LLM Pipelines: The Bottlenecks That Aren't Inference

8 min read
Tian Pan
Software Engineer

Your team just spent three weeks optimizing inference. You swapped to a quantized model, tuned your batching policy, shaved 12% off time-to-first-token, and shipped it. Then you looked at the actual user-facing latency and it barely moved.

This is the inference trap. It's the most common profiling failure mode in LLM-powered applications, and it happens because engineers measure what's easy to measure — GPU utilization, inference throughput, tokens per second — rather than what's actually slow. In a typical RAG pipeline, inference accounts for around 80% of latency when you include everything that touches the GPU. But that remaining 20% is often distributed across six or seven stages that nobody is tracing. Each one seems small in isolation, but together they dominate the optimization opportunity.

The Anatomy of a Real Pipeline Latency Budget

To understand where time goes, you need an end-to-end trace. Not a GPU profiler. Not a single-stage timer. A distributed trace that captures every stage from when the user's query arrives to when the last token streams back.
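Concretely, that means one span per stage. Here's a minimal sketch using OpenTelemetry's tracing API; the stage functions are placeholder stubs standing in for whatever your embedding service, vector store, and model server actually look like:

```python
from opentelemetry import trace

tracer = trace.get_tracer("rag.pipeline")

# Placeholder stubs; in a real pipeline these call your embedding
# service, vector store, prompt builder, and model server.
def embed(query): ...
def vector_search(vector, k): ...
def build_prompt(query, chunks): ...
def llm_generate(prompt): ...

def handle_query(query: str):
    # One span per stage turns "where did the time go?" into a
    # waterfall you can read stage by stage.
    with tracer.start_as_current_span("embed_query"):
        vector = embed(query)
    with tracer.start_as_current_span("retrieve"):
        chunks = vector_search(vector, k=8)
    with tracer.start_as_current_span("assemble_prompt"):
        prompt = build_prompt(query, chunks)
    with tracer.start_as_current_span("generate"):
        return llm_generate(prompt)
```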

In a representative production RAG system, a latency breakdown looks roughly like this:

  • LLM generation: 650ms (81%)
  • Overhead (context assembly, orchestration, serialization): 70ms (9%)
  • Vector retrieval: 45ms (6%)
  • Embedding the query: 35ms (4%)

At first glance, this seems to confirm the inference-first instinct: 81% of time is in generation, so optimize generation. But look at the absolute numbers. Shaving 10% off generation saves 65ms. Eliminating the 70ms overhead bucket saves more than that, and often takes less engineering effort. Meanwhile, converting the retrieval and embedding stages to run in parallel, which a single async refactor enables, takes the shorter of the two off the critical path entirely.

The real question is not "which stage is biggest?" but "which stages are worth attacking given ROI per engineering week?"

Where the Hidden Bottlenecks Live

Document Preprocessing and Ingestion

Most teams add timing instrumentation to their LLM calls and nothing else. The stages upstream of retrieval — document parsing, chunking, format conversion, deduplication — are invisible. This is where the first surprises appear.

Document preprocessing is CPU-bound and often runs synchronously before anything else in the pipeline can start. If your ingestion pipeline needs to parse PDFs, normalize whitespace, split on paragraph boundaries, and filter noise before generating embeddings, you're looking at tens of milliseconds of serial work happening before any I/O request fires. In a high-throughput system where 50 requests arrive per second, that CPU contention becomes a queue.
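One mitigation, sketched below under the assumption that parsing is pure CPU work: push it into a process pool so it scales across cores instead of serializing on the request thread. parse_pdf and chunk_text are illustrative stand-ins for your actual parsing logic.

```python
from concurrent.futures import ProcessPoolExecutor

# Illustrative stand-ins for your actual parsing and chunking logic.
def parse_pdf(raw: bytes) -> str: ...
def chunk_text(text: str) -> list[str]: ...

pool = ProcessPoolExecutor(max_workers=4)

def preprocess(raw: bytes) -> list[str]:
    # Parsing runs in a worker process, so concurrent requests spread
    # across cores instead of queuing behind one busy serving thread.
    text = pool.submit(parse_pdf, raw).result()
    return chunk_text(text)
```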

The common mistake is loading chunked content into the embedding model one document at a time rather than batching. Batched embedding requests of even moderate size (16–32 documents) reduce per-document overhead dramatically, but teams default to sequential processing because it's simpler to write.
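A sketch of the batched version, assuming your embedding service exposes some bulk endpoint (embed_batch here is a stand-in for it):

```python
# Stand-in for a bulk embedding endpoint.
def embed_batch(texts: list[str]) -> list[list[float]]: ...

def embed_corpus(chunks: list[str], batch_size: int = 32) -> list[list[float]]:
    vectors: list[list[float]] = []
    # One request per 32 chunks instead of one per chunk, so the fixed
    # per-request overhead is paid 32x less often.
    for i in range(0, len(chunks), batch_size):
        vectors.extend(embed_batch(chunks[i : i + batch_size]))
    return vectors
```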

Token Count Estimation Running Twice

One specific anti-pattern shows up repeatedly in production pipelines: tokenizing the same text twice. Teams validate prompt length before calling the model (to avoid exceeding context limits), then the inference library tokenizes again internally. In pipelines where prompts are assembled from many retrieved chunks, this double tokenization adds measurable overhead — and it's entirely wasted work.

The fix is to pass the pre-computed token count to the client library or cache the tokenization result, but you can only make this fix if you know the double-pass is happening. Without per-stage tracing, you never find it.
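A sketch of the tokenize-once version, using tiktoken for illustration and assuming a serving path that can accept pre-tokenized prompts (generate_from_ids is a hypothetical stand-in for that):

```python
from functools import lru_cache
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Hypothetical stand-in for an engine that accepts prompt token IDs.
def generate_from_ids(ids: list[int]) -> str: ...

@lru_cache(maxsize=4096)
def token_ids(text: str) -> tuple[int, ...]:
    # One encode pass, shared by the length check and the model call.
    return tuple(enc.encode(text))

def call_model(prompt: str, limit: int = 8192) -> str:
    ids = token_ids(prompt)
    if len(ids) > limit:
        raise ValueError(f"prompt is {len(ids)} tokens; limit is {limit}")
    return generate_from_ids(list(ids))
```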

Synchronous Retrieval Blocking Generation

The most impactful single fix in most RAG pipelines is parallelizing independent stages. A typical sequential pipeline runs five stages in series: receive the query, embed it, retrieve documents, assemble the prompt, then call the model. When retrieval doesn't consume the query embedding (lexical or metadata-keyed retrieval, for example, with the embedding reserved for reranking), stages two and three are both network I/O and neither depends on the other completing first. They can be dispatched concurrently from the moment the query arrives. Converting them to async parallel dispatch removes the sum of embedding and retrieval latency from the critical path and replaces it with the max.

If embedding takes 35ms and retrieval takes 45ms, sequential execution costs 80ms. Parallel execution costs 45ms. You didn't touch the model, the index, or the embedding service — you just changed the call order.
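A sketch of the parallel dispatch with asyncio; embed_query and search are illustrative async stand-ins for two calls that don't depend on each other:

```python
import asyncio

# Illustrative async stand-ins for two mutually independent I/O calls.
async def embed_query(query: str) -> list[float]: ...
async def search(query: str) -> list[str]: ...

async def fetch_context(query: str):
    # Both requests are in flight at once: the critical path pays
    # max(35ms, 45ms) = 45ms instead of 35ms + 45ms = 80ms.
    vector, chunks = await asyncio.gather(
        embed_query(query),
        search(query),
    )
    return vector, chunks
```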

This is the structural change that delivers the most latency per hour of engineering time, and it's invisible until you have a latency waterfall trace.

Serialization Between Pipeline Stages

Agents and multi-stage pipelines frequently serialize intermediate results to pass between stages — JSON-encoding retrieved chunks, marshaling tool call results, or packaging context for a handoff between sub-agents. These operations are fast in isolation but compound in pipelines with many hops. A five-stage agent that serializes and deserializes at each boundary adds a meaningful fixed tax to every request.

The fix is rarely complex: pass structured objects directly when stages are in the same process, and serialize only at actual service boundaries. But you only know you have a problem when serialization shows up as a span in your traces.
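A sketch of that discipline, with an illustrative Chunk type: stages in the same process hand over objects, and json.dumps appears exactly once, at the network boundary.

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float

def rerank(chunks: list[Chunk]) -> list[Chunk]:
    # In-process handoff: no encode/decode round-trip per hop.
    return sorted(chunks, key=lambda c: c.score, reverse=True)

def send_to_remote_agent(chunks: list[Chunk]) -> str:
    # Serialize once, at the actual service boundary.
    return json.dumps([asdict(c) for c in chunks])
```

Every in-process hop hands over the objects themselves; the only encode/decode cost left is the one the network genuinely requires.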
