Coalesce Before You Call: The LLM Request Batching Pattern That Cuts Costs Without Slowing Users Down
Most teams discover request coalescing the same way: through a surprisingly large invoice. They ship an LLM-backed feature, usage grows, and then the billing dashboard shows they're paying for fifty thousand requests a day when closer examination reveals that roughly thirty thousand of them were asking the same thing in slightly different words. Each paraphrase of "summarize this document" hit the model separately. Each near-duplicate triggered a full inference cycle. The cost scaled with traffic volume, not with the semantic diversity of what users actually wanted.
Request coalescing is the pattern that fixes this. It is not one technique but a layered architecture: in-flight deduplication to prevent concurrent duplicates, exact caching for repeated identical prompts, and semantic batching to catch the paraphrased variations in between. The order matters, the thresholds matter, and understanding where the pattern breaks down — particularly around streaming — is what separates a working implementation from one that saves money on a staging server but causes subtle bugs in production.
The Three Layers, and Why You Need All of Them
Treating request coalescing as a single problem is where most implementations go wrong. There are actually three distinct failure modes, each requiring a different fix.
Layer 1: In-flight deduplication. This is the easiest problem to miss and the most embarrassing when it causes an incident. When multiple concurrent users send identical prompts — say, a product launch causes a spike of "what are the shipping costs?" queries — each request races to the cache, finds nothing, and independently fires off an LLM call. You get fifty identical inference jobs running simultaneously, each paying full price, each returning the same result. Coalescing at this layer means making the first request "own" the pending work while subsequent identical requests wait and then receive the same result. No redundant API calls. The pattern uses a deferred/promise structure: the first caller registers a pending computation, subsequent callers subscribe to that same future, and the LLM gets called exactly once regardless of how many requests arrived in that burst window.
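The deferred/promise structure can be sketched in a few lines with Python's asyncio. This is a minimal illustration, not a production implementation: `fake_llm` is a stand-in for a real provider call, and the class names are invented for the example.

```python
import asyncio
import hashlib

class InFlightDeduplicator:
    """First caller owns the pending LLM call; concurrent duplicates
    subscribe to the same future instead of firing their own call."""

    def __init__(self, call_llm):
        self._call_llm = call_llm
        self._pending = {}  # prompt hash -> asyncio.Future

    async def complete(self, prompt: str) -> str:
        key = hashlib.sha256(prompt.encode("utf-8")).hexdigest()
        if key in self._pending:
            # A duplicate arrived while the first call is in flight:
            # wait for that call's result instead of issuing a new one.
            return await self._pending[key]
        fut = asyncio.get_running_loop().create_future()
        self._pending[key] = fut
        try:
            result = await self._call_llm(prompt)
            fut.set_result(result)
            return result
        except Exception as exc:
            fut.set_exception(exc)  # propagate failure to all waiters
            raise
        finally:
            # Clear the entry so later bursts start a fresh cycle;
            # serving repeats after completion is the cache layers' job.
            del self._pending[key]

async def _demo():
    calls = 0

    async def fake_llm(prompt):
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.05)  # simulate inference latency
        return f"answer to: {prompt}"

    dedup = InFlightDeduplicator(fake_llm)
    # A burst of 50 identical concurrent requests...
    results = await asyncio.gather(
        *(dedup.complete("what are the shipping costs?") for _ in range(50))
    )
    return calls, results

calls, results = asyncio.run(_demo())
# calls == 1: the model ran once; all 50 requests received the same result.
```

Note that the pending entry is removed once the call completes; serving later repeats of the same prompt is the job of the cache layers below, not this one.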
Layer 2: Exact-match caching. After deduplication, the next layer is straightforward: hash the full prompt with SHA-256, store the result, look it up before making any external call. This costs sub-millisecond and has zero false positive risk. In production, exact caching typically handles 15–30% of traffic — users who return to the same document, system prompts that repeat verbatim, or internal tooling that queries the same classification prompt repeatedly. The implementation is trivial; the mistake is skipping it and jumping straight to semantic search.
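As a sketch, the exact-match layer is little more than a hash map keyed by the SHA-256 of the full prompt (the class and method names here are illustrative):

```python
import hashlib

class ExactCache:
    """Exact-match layer: hash the full prompt, look it up before any call."""

    def __init__(self):
        self._store = {}
        self.hits = 0

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        response = self._store.get(self._key(prompt))
        if response is not None:
            self.hits += 1
        return response

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response

cache = ExactCache()
cache.put("Classify sentiment: 'great product'", "positive")
hit = cache.get("Classify sentiment: 'great product'")   # byte-identical: hit
miss = cache.get("classify sentiment: 'great product'")  # one byte differs: miss
```

Because the key is a hash of the raw bytes, any difference at all, even casing or whitespace, is a miss. That gap is exactly what the semantic layer exists to fill.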
Layer 3: Semantic caching. This is where the real cost savings come from, and where the complexity lives. Production traffic studies consistently find that 30–60% of incoming LLM requests are near-duplicates — paraphrased queries that would return the same response if handled by the model. Semantic caching converts each incoming prompt into a vector embedding, computes cosine similarity against previously cached embeddings, and returns the cached response when the similarity exceeds a threshold. At moderate traffic volumes this can translate to on the order of $1,500/month saved, with infrastructure costs staying under 5% of that savings.
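A linear-scan sketch of the lookup is below. A real deployment would use a vector store and a learned embedding model; the `toy_embed` bag-of-words function is a stand-in so the example is self-contained, and it makes the similarity scores artificially clean.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    """Linear scan over cached embeddings; return the best match
    only if it clears the similarity threshold."""

    def __init__(self, embed, threshold=0.92):
        self.embed = embed        # prompt -> vector; a real model in production
        self.threshold = threshold
        self.entries = []         # list of (embedding, response)

    def get(self, prompt):
        q = self.embed(prompt)
        best, best_sim = None, 0.0
        for emb, resp in self.entries:
            sim = cosine(q, emb)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def put(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

VOCAB = ["how", "cancel", "subscription", "account", "shipping", "costs"]

def toy_embed(text):
    # Stand-in for a real embedding model (e.g. a sentence-transformer).
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]

cache = SemanticCache(toy_embed, threshold=0.92)
cache.put("cancel my subscription", "Go to Settings > Billing to cancel.")
hit = cache.get("please cancel my subscription")  # paraphrase: cache hit
miss = cache.get("what are the shipping costs")   # different intent: miss
```

The linear scan is O(n) per lookup; at production scale this becomes an approximate-nearest-neighbor query against a vector index, but the threshold logic is unchanged.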
The Similarity Threshold Is a Business Decision, Not a Technical One
The cosine similarity threshold determines what counts as "close enough" to return a cached response. It is the most consequential configuration in the entire system, and teams routinely get it wrong by treating it as a technical parameter to tune rather than a product decision to make explicitly.
The operational range runs from approximately 0.85 to 0.98:
- At 0.85, the cache is aggressive. Prompts that are topically related but semantically distinct will match. This works well for FAQ systems and support bots where close-enough answers are acceptable, but it will cause silent failures in any context where precise phrasing matters.
- At 0.92, you catch clear rephrasings — "how do I cancel my subscription" and "what's the process to cancel my account" — while rejecting queries that share vocabulary but differ in intent. This is the sweet spot for most production deployments.
- At 0.98, you're effectively doing exact-match caching with a small tolerance for punctuation variation. Useful when you need the behavior of exact caching but want to handle minor normalization differences.
The practical test: hand your support team example pairs of prompts that your current threshold treats as equivalent and ask them to find five where the cached answer would be wrong. If they find such pairs easily, the threshold is too aggressive: raise it. If they can't find any even at a noticeably lower, more aggressive threshold, you may be leaving money on the table.
Two additional complications: thresholds should vary by context length and conversation state. Short prompts are more sensitive to threshold misconfiguration because a single word change has a proportionally larger effect on semantics. Multi-turn conversations should generally skip semantic caching entirely after a few turns, because the accumulated context makes similarity scores unreliable indicators of answer equivalence.
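One way to encode both rules is a small policy function that runs before the similarity lookup. The specific adjustments below are illustrative defaults, not calibrated values:

```python
def semantic_threshold(prompt: str, turn_count: int, base: float = 0.92):
    """Return the similarity threshold to use, or None to signal
    'skip the semantic cache entirely'. Numbers are illustrative."""
    if turn_count >= 3:
        # Deep multi-turn context: similarity scores no longer predict
        # answer equivalence, so bypass semantic caching.
        return None
    if len(prompt.split()) < 8:
        # Short prompts: one changed word shifts meaning a lot,
        # so demand a stricter match.
        return round(min(base + 0.04, 0.98), 2)
    return base

skip = semantic_threshold("anything at all", turn_count=4)        # skip caching
strict = semantic_threshold("cancel subscription", turn_count=0)  # stricter
normal = semantic_threshold(
    "please summarize the attached quarterly report for the finance team",
    turn_count=1,
)                                                                 # base value
```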
The Time Window Trade-off
Request coalescing and semantic caching work on existing traffic patterns — they optimize what's already arriving. The complementary technique is dynamic batching: holding new requests in a queue for a short window (typically 20–100ms) and processing whatever accumulates together as a single inference call.
The economics are compelling. Batching 32 requests together can reduce per-token costs by as much as 85%, because you're amortizing fixed overhead across many more tokens and GPU utilization rises dramatically. Continuous batching — where the model processes a batch and immediately slots in new requests as slots free up — has been shown to maintain 90%+ GPU utilization compared to roughly 40% for static batching.
The trade-off is latency. A 50ms batching window means the first request in any batch waits up to 50ms before inference begins, even if it arrives when the server is completely idle. For most background and asynchronous workflows, this is invisible. For interactive chat interfaces, users notice delays above roughly 100ms in time-to-first-token. The practical threshold: if your feature shows typing indicators or progressive loading, batching windows above 30ms start to feel sluggish. If you're running document processing pipelines or async classification jobs, windows of 200–500ms are unremarkable.
The implementation mirrors a bus schedule: the batch departs either when the window closes or when it fills to capacity, whichever comes first. A half-empty bus still runs on time; a full bus doesn't wait for the next stop.
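The bus-schedule rule translates directly into a collector loop: block on the queue until either the window expires or the batch reaches capacity. A sketch using a plain thread-safe queue, with illustrative numbers:

```python
import queue
import time

def collect_batch(q: "queue.Queue", window_ms: int = 50, max_size: int = 32):
    """Depart when the window closes or the batch fills, whichever is first."""
    batch = []
    deadline = time.monotonic() + window_ms / 1000.0
    while len(batch) < max_size:       # a full bus departs immediately
        remaining = deadline - time.monotonic()
        if remaining <= 0:             # the window closed: depart on time
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break                      # window expired while waiting
    return batch

requests = queue.Queue()
for i in range(40):
    requests.put(f"req-{i}")

first = collect_batch(requests)   # fills to 32 immediately, departs early
second = collect_batch(requests)  # takes the remaining 8, waits out the window
```

Each returned batch would then go to the model as a single inference call; continuous batching replaces this fixed departure rule with per-slot refills, but the window/capacity trade-off is the same.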
Why This Pattern Is Fundamentally Incompatible with Streaming
Streaming is the single largest architectural incompatibility with request coalescing, and it's one that teams frequently discover only after implementing both features separately.
Streaming works by having the model emit tokens as they're generated — the client receives a sequence of small chunks rather than waiting for the complete response. This creates a user experience that feels fast even when total response time is the same. The model starts almost immediately, and users see progress.
Request coalescing requires the opposite approach. To coalesce requests, you need to know whether an incoming request matches something pending or cached before the model starts generating. With in-flight deduplication, the second identical request must wait for the first to complete — it cannot subscribe to a partially-generated token stream in a general way. With semantic caching, you're returning a complete cached response, not a stream. With dynamic batching, you're holding requests in a queue, which means the model doesn't start until the window closes.
The practical resolution: pick one or the other based on the feature's latency requirements.
- High-throughput, latency-tolerant workflows (document processing, classification, async summarization): use coalescing and batching aggressively. Deliver results to the client only after the full response is available; if the client expects a streaming interface, a pseudo-stream that chunks the completed response preserves the contract.
- Interactive, latency-sensitive features (chat interfaces, real-time autocomplete): stream directly and accept the higher cost. Apply only in-flight deduplication and exact-match caching, or semantic caching at a near-exact threshold (0.98+), since the latency cost of waiting for batch windows outweighs the savings.
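For the latency-tolerant path, the pseudo-stream is just the completed (possibly cached) response re-chunked to satisfy a streaming client interface. The chunk size here is arbitrary:

```python
def pseudo_stream(full_response: str, chunk_size: int = 16):
    """Yield a completed response in chunks so a client built for a
    streaming interface still works after coalescing or caching."""
    for start in range(0, len(full_response), chunk_size):
        yield full_response[start:start + chunk_size]

chunks = list(pseudo_stream("x" * 50, chunk_size=16))  # 16+16+16+2 characters
```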
Trying to implement streaming on top of a coalescing layer creates a class of subtle bugs where some users receive cached responses almost instantly while others receive fresh responses at streaming speed: the time-to-first-token distribution becomes bimodal and is difficult to explain to users who notice the inconsistency.
The Organizational Friction Problem
The technical implementation of request coalescing is solvable. The harder problem is organizational: to coalesce requests effectively, you need a shared request pool across features.
Consider a typical product: a search feature, a chat feature, a recommendation engine, and a document summarizer, all using the same LLM provider. Each team built their integration independently. Each has their own caching layer, their own rate limit budget, and their own view of what constitutes a "request." The search team cannot see the chat team's pending requests. Batching opportunities that span features — a user viewing a document while the search bar is processing a similar query — are invisible to either team.
Building a shared LLM gateway that pools requests across features requires teams to give up local autonomy over a system they've already shipped. The search team doesn't want their latency budget affected by the recommendation engine's batch windows. The chat team doesn't want their cache poisoned by the document summarizer's low-quality outputs. These are reasonable objections, and they explain why most organizations end up with per-feature caches that each capture 10–20% of traffic instead of a centralized layer that could capture 40–60%.
The organizational pattern that works: implement the gateway as infrastructure, not as a shared service owned by one team. Make it opt-in at the feature level with per-feature configuration for thresholds and windows. Give each team visibility into their own cache hit rates and cost attribution. Charge back at the API call level so teams have a direct incentive to improve their hit rates. Organizations that run the gateway as platform-owned infrastructure with opt-in adoption usually see uptake; organizations that mandate it as a service product teams are expected to integrate usually see resistance.
Calibrating for Your Traffic Shape
Not all LLM traffic is equally coalesceable, and over-optimizing the wrong features wastes engineering time.
Expected hit rates by feature type:
- FAQ and support bots: 40–60%. Users ask the same questions with minor variation. Semantic caching at threshold 0.88–0.92 captures most of this.
- Classification and routing: 50–70%. The input space is constrained; most real-world inputs resemble something already seen. Exact caching alone handles a significant fraction.
- RAG retrieval augmentation: 15–25%. The retrieved context varies enough per query that most responses aren't reusable, but the retrieval step itself is cacheable separately.
- Open-ended chat: 10–20%. Users in conversation are actively trying to get novel responses; coalescing here mostly catches retry storms and burst traffic.
- Code generation: 5–15%. Prompts vary significantly and small prompt differences produce large output differences.
Start with the features at the top of this list. A support bot hitting 45% cache hit rate on a $3,000/month inference spend saves more in absolute terms than reducing code generation costs by 10%. The ROI math should drive prioritization.
What This Looks Like in Practice
Teams that implement all three layers — in-flight deduplication, exact caching, semantic batching — typically see 40–60% cost reduction on high-traffic features within the first quarter. The first 20% comes quickly from exact caching and deduplication; it's low-risk, requires no threshold calibration, and has no latency impact. The next 20–40% requires getting the similarity threshold right and accepting the latency budget implications of batch windows.
The architectural pattern to implement: build or adopt an LLM gateway that sits between your application code and the provider API. Configure deduplication at the gateway level so it's transparent to application code. Add exact caching as a thin in-memory layer. Add semantic caching as a second-tier lookup against a vector store (Redis with vector search, Qdrant, or Pinecone all work for this). Measure cache hit rates and cost per feature per week; review them as part of your regular infrastructure costs review.
The mistake to avoid: conflating "we have caching" with "we have coalescing." A cache that returns results for repeated prompts does nothing for the burst of fifty concurrent requests that all arrive before the first one completes. In-flight deduplication is a separate mechanism and needs to be implemented explicitly, or you'll see your LLM costs spike in exactly the traffic patterns where you'd expect caching to help most.
