
Compound AI Systems: When Your Pipeline Is Smarter Than Any Single Model

9 min read
Tian Pan
Software Engineer

There is a persistent assumption in AI engineering that the path to better outputs is a better model. Bigger context window, fresher training data, higher benchmark scores. In practice, the teams shipping the most capable AI products are usually doing something different: they are assembling pipelines where multiple specialized components — a retriever, a reranker, a classifier, a code interpreter, and one or more language models — cooperate to handle a task that no single model could do reliably on its own.

This architectural pattern has a name — compound AI systems — and it is now the dominant paradigm for production AI. Understanding how to build these systems correctly, and where they fail when you don't, is one of the most important skills in applied AI engineering today.

Why Single Models Have a Ceiling You Keep Hitting

The intuition behind compound systems is not complicated. A language model is a single, general-purpose component. It has a fixed context window, fixed training data, fixed cost per token, and fixed latency characteristics. For narrow, well-defined tasks, a single model call is often all you need. But most production tasks are not narrow.

Consider a financial document QA system. The user asks a question about a quarterly report. A single LLM call, even with a large context window, faces an immediate practical constraint: you cannot stuff every document in your corpus into a single prompt. So you retrieve. But raw embedding similarity is noisy — semantically similar sentences are not always the most relevant ones. So you rerank. The retrieved context may contain structured tables the LLM misreads without preprocessing. So you normalize. The final answer may need to be grounded to a specific passage to be auditable. So you cite.
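
The retrieve → rerank → generate-with-citations flow above can be sketched as a chain of narrow functions. This is a minimal illustration with stub components — the keyword-overlap retriever and templated answer stand in for a real embedding search and LLM call:

```python
from dataclasses import dataclass

@dataclass
class Passage:
    doc_id: str
    text: str
    score: float

def retrieve(query: str, corpus: dict[str, str]) -> list[Passage]:
    # Stub retriever: naive keyword overlap stands in for embedding similarity.
    return [Passage(doc_id, text, sum(w in text.lower() for w in query.lower().split()))
            for doc_id, text in corpus.items()]

def rerank(passages: list[Passage], top_k: int = 2) -> list[Passage]:
    # Keep only the highest-scoring candidates for the expensive generation step.
    return sorted(passages, key=lambda p: p.score, reverse=True)[:top_k]

def answer_with_citations(query: str, passages: list[Passage]) -> dict:
    # Stub generator: a real system would call an LLM with the reranked context.
    return {"answer": f"Based on {len(passages)} passages.",
            "citations": [p.doc_id for p in passages]}

corpus = {"q3": "Revenue grew 12% in the third quarter.",
          "q2": "Second quarter margins were flat.",
          "hr": "The company hired 40 engineers."}
result = answer_with_citations("third quarter revenue growth",
                               rerank(retrieve("third quarter revenue growth", corpus)))
```

Each stage has one job and a typed output, which is what makes the handoffs between them auditable: the citations in the final answer trace directly back to passage IDs produced upstream.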

At each step you have added a specialized component that does one thing better than a general-purpose model doing everything. The compound system is not just faster or cheaper — it is more correct in ways that a monolithic approach cannot match. FactSet reported formula generation accuracy jumping from 55% to 85% after moving to a compound retrieval architecture. Microsoft's Medprompt, which composes dynamic few-shot retrieval, self-generated chain of thought, and ensemble voting over multiple passes, reached 90.6% accuracy on MultiMedQA — outperforming models that had been fine-tuned specifically on medical data.

The deeper principle here is that component specialization beats parameter scaling when the task has natural decomposition points. AlphaGeometry 2 solved 84% of International Mathematical Olympiad geometry problems by combining a language model with a symbolic reasoning engine — exceeding average gold-medalist human performance. No single model trained on more data was going to close that gap. The gains came from architectural composition, not from scaling.

The Seam Is Where Systems Break

When you move from a single model to a compound system, you introduce seams: handoff points between components where the output of one becomes the input of the next. Each seam is a potential failure point, and failures at seams have a property that makes them particularly dangerous in production: they are often silent.

The most common seam failure is format mismatch. One component outputs YAML; the next expects JSON. One component returns a list of passages; the next assumes a single string. These failures do not always produce exceptions — they produce subtly wrong inputs that flow through the rest of the pipeline and generate plausible-looking but incorrect outputs. By the time the degradation surfaces, it has been amplified by every downstream component.
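
A toy illustration of how such a mismatch degrades silently rather than crashing (hypothetical components; the point is that no exception is raised):

```python
def build_prompt(question: str, context) -> str:
    # Expects a single context string but never checks -- the seam is implicit.
    return f"Context: {context}\n\nQuestion: {question}"

def build_prompt_checked(question: str, context: str) -> str:
    # Same seam made explicit: a wrong type fails loudly at the handoff.
    if not isinstance(context, str):
        raise TypeError(f"expected str context, got {type(context).__name__}")
    return f"Context: {context}\n\nQuestion: {question}"

# Upstream component was changed to return a list of passages:
passages = ["Revenue grew 12%.", "Margins were flat."]

# No exception -- the list's repr is silently embedded in the prompt text,
# and a plausible-looking but degraded answer comes out the other end.
prompt = build_prompt("How did revenue change?", passages)
```

The unchecked version ships the bug to production; the checked version turns it into a stack trace at the exact seam that broke.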

Context window overflow is a related failure that scales with pipeline complexity. A retriever returns 50 passages. The reranker keeps the top 10. Each passage is long. By the time the LLM prompt is assembled, you are at 90,000 tokens for a model with a 100,000-token limit, and the crucial passage gets truncated. One team building a materials science workflow found that a naive pipeline consumed 20 million tokens on a single workflow run and failed. The same workflow, refactored to pass memory pointers instead of full document contents, used 1,234 tokens and succeeded. The component-level logic was identical. The seam design was different.
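
The memory-pointer refactor described above can be sketched as follows — a simplified illustration, not the cited team's actual code. Intermediate stages exchange short document IDs against a shared store, and only the final stage dereferences the few documents it actually needs:

```python
# Shared store: stages pass keys, not contents.
store = {f"doc{i}": "Long document text... " * 500 for i in range(50)}

def select_relevant(doc_ids: list[str]) -> list[str]:
    # Intermediate stage operates on pointers only; its token cost is
    # just the IDs. (Stand-in for a real relevance filter.)
    return doc_ids[:2]

pointers = select_relevant(list(store))                  # cheap: a few short strings
prompt_context = "\n".join(store[p] for p in pointers)   # dereference only at the end

# Data exchanged between stages ~= total length of the IDs,
# not the megabytes of document text behind them.
pointer_bytes = sum(len(p) for p in pointers)
content_bytes = sum(len(v) for v in store.values())
```

The component logic is unchanged; only what crosses the seam is different, which is exactly the distinction in the workflow example above.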

Error propagation compounds through deep pipelines in a way that is structurally similar to vanishing gradients in neural networks. TextGrad, a framework for text-based automatic differentiation, documented exponential signal degradation in long chains — feedback from late stages in the pipeline becomes incoherent by the time it reaches early stages. The fix, borrowed directly from ResNets, is explicit residual connections that preserve signal across multiple hops.

Hallucinated consensus is the most insidious failure mode in multi-agent compound systems. Multiple agents process a query independently and converge on a fabricated data point. Because each agent "agrees," the system reports high confidence. The error is not caught until a human audits the output — which is usually never. This is worse than a visible crash. A crash tells you something is wrong. Hallucinated consensus tells you everything is fine.

Designing the Seams, Not Just the Components

The engineering discipline that separates teams that succeed with compound systems from teams that struggle is explicit seam design. Components are relatively easy to swap out as better models emerge. Seams, once implicit, accumulate technical debt that is hard to pay down.

Define interface contracts at every handoff. What is the schema of the output? What happens when the upstream component returns an empty result, a low-confidence result, or an error? These contracts should be explicit in code, not assumed. Classifiers should return structured metadata, not free-form text. Retrievers should return relevance scores alongside passages. Rerankers should signal when nothing in the candidate set clears a minimum threshold.
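
One way to make such a contract explicit is a typed result object that forces the caller to handle the empty and low-confidence cases. A sketch — the field names and threshold are illustrative, not a prescribed API:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RetrievalResult:
    passages: list[str] = field(default_factory=list)
    scores: list[float] = field(default_factory=list)
    error: Optional[str] = None

    def best_above(self, threshold: float) -> list[str]:
        # Downstream code calls this instead of reading .passages raw,
        # so the low-confidence case cannot be silently ignored.
        return [p for p, s in zip(self.passages, self.scores) if s >= threshold]

result = RetrievalResult(passages=["A", "B", "C"], scores=[0.91, 0.40, 0.85])
usable = result.best_above(0.8)
if not usable:
    # Explicit fallback path, not an implicit empty-prompt failure.
    usable = ["NO_RELEVANT_CONTEXT"]
```

The contract lives in the type: scores travel with passages, errors are a named field, and "nothing cleared the threshold" is a branch the caller must write.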

Add routing logic that matches the uncertainty. Cascade routing — routing queries through progressively more expensive components, stopping when confidence is sufficient — is one of the most underused patterns in production AI. FrugalGPT demonstrated 98% cost reduction on some workloads through cascades that route low-complexity queries to cheaper models. The same principle applies within a pipeline: if a fast classifier can reject 60% of queries before they reach the expensive LLM, you should exploit that.
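
A minimal cascade sketch, assuming each tier returns an answer with a self-reported confidence — the model stand-ins, routing labels, and threshold here are all illustrative:

```python
def cheap_classifier(query: str) -> tuple[str, float]:
    # Stand-in for a fast, inexpensive model.
    if "refund" in query:
        return "route_to_billing", 0.95
    return "unknown", 0.30

def expensive_llm(query: str) -> tuple[str, float]:
    # Stand-in for a slower, costlier model used only on escalation.
    return "detailed_answer", 0.90

def cascade(query: str, threshold: float = 0.8) -> tuple[str, str]:
    for tier_name, tier in [("cheap", cheap_classifier), ("expensive", expensive_llm)]:
        answer, confidence = tier(query)
        if confidence >= threshold:
            return answer, tier_name   # stop as soon as confidence is sufficient
    return answer, tier_name           # no tier was confident: last tier's best effort

answer, tier = cascade("where is my refund")
```

Queries the cheap tier handles confidently never touch the expensive model; everything else escalates exactly one level at a time.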

Build for termination, not just execution. Multi-step pipelines that do not have explicit termination conditions will eventually find themselves in loops. An agent that reruns indefinitely because each iteration's output does not satisfy a vague success criterion is a cost sink. Hard caps on iterations, tokens, and wall-clock time are not premature pessimism — they are engineering discipline. The right abstraction is explicit terminal states: SUCCESS, FAILED, and UNCERTAIN, each with defined downstream behavior.
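
The terminal-state pattern can be as simple as an enum plus hard caps around the retry loop. A sketch — the agent step and success check are stubs:

```python
from enum import Enum

class Terminal(Enum):
    SUCCESS = "success"
    FAILED = "failed"
    UNCERTAIN = "uncertain"

def attempt(task: str, i: int) -> tuple[str, int]:
    return f"draft {i}", 1_000            # stub agent step: (output, token cost)

def meets_criteria(output: str) -> bool:
    return output.endswith("3")           # stub check: passes on the fourth try

def run_agent(task: str, max_iters: int = 5, max_tokens: int = 10_000) -> Terminal:
    tokens_used = 0
    for i in range(max_iters):            # hard cap on iterations
        output, cost = attempt(task, i)
        tokens_used += cost
        if tokens_used > max_tokens:      # hard cap on spend
            return Terminal.FAILED
        if meets_criteria(output):        # explicit, testable success condition
            return Terminal.SUCCESS
    return Terminal.UNCERTAIN             # budget exhausted: escalate, don't loop

state = run_agent("summarize report")
```

Each terminal state maps to a defined downstream behavior — SUCCESS flows on, FAILED alerts, UNCERTAIN escalates to a human — so the pipeline never spins silently.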

Instrument every seam, not just every component. Component-level metrics — the latency of your retriever, the accuracy of your reranker — tell you how each part performs in isolation. They do not tell you whether the pipeline is delivering correct answers end to end. End-to-end evaluation, with ground truth labels for a representative sample of real queries, is the only measurement that matters for production quality.
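
Seam instrumentation can start as small as a decorator that records each handoff's latency and output shape — a standard-library-only sketch of the idea:

```python
import functools
import time

def instrument(seam_log: list):
    """Wrap a pipeline component so every call logs what crossed the seam."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            seam_log.append({
                "component": fn.__name__,
                "latency_s": time.perf_counter() - start,
                "output_type": type(out).__name__,
                "output_size": len(out) if hasattr(out, "__len__") else None,
            })
            return out
        return inner
    return wrap

log: list = []

@instrument(log)
def retrieve(query: str) -> list[str]:
    return ["passage A", "passage B"]   # stub component

retrieve("test query")
```

Logging the output type and size at every seam is what catches the silent format mismatches described earlier: a sudden shift from `str` to `list`, or from 10 passages to 0, shows up in the log before it shows up in user-facing answers.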

End-to-End Latency Is the Only Latency That Counts

Optimizing individual components without profiling the full pipeline is one of the most reliable ways to waste engineering time. Amdahl's Law applies directly: if your retriever accounts for 10% of total pipeline latency and you make it 10x faster, you have reduced overall latency by about 9%. If your LLM generation accounts for 80% of latency and you make it 2x faster, you have cut nearly half of the total.
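
The arithmetic behind those two scenarios is Amdahl's Law — for a component taking fraction `f` of total pipeline time and sped up by factor `s`:

```python
def overall_speedup(f: float, s: float) -> float:
    # Amdahl's Law: overall speedup when fraction f of the pipeline
    # is accelerated by factor s.
    return 1 / ((1 - f) + f / s)

# Retriever is 10% of latency, made 10x faster: ~1.10x overall (~9% reduction).
retriever_case = overall_speedup(0.10, 10)

# LLM generation is 80% of latency, made 2x faster: ~1.67x overall (~40% reduction).
llm_case = overall_speedup(0.80, 2)
```

The 10x retriever win buys almost nothing end to end, while the modest 2x generation win dominates — which is why profiling the whole pipeline must precede optimizing any part of it.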

The corollary is that local optimization can shift bottlenecks rather than eliminating them. Swapping a slower, more accurate reranker for a faster, noisier one reduces the reranker's contribution to latency — but may increase the number of tokens passed to the LLM because more irrelevant passages get through, which increases generation time. The net effect depends on the specific workload, and you cannot measure it without end-to-end profiling.

The other latency trap is premature parallelization. Parallelizing retrieval across multiple indexes is a real optimization, but if you are bottlenecked by the decode phase of your LLM — which is memory-bandwidth-bound for small batch sizes — no amount of retrieval optimization changes the ceiling. Tools like NVIDIA Nsight Systems and PyTorch Profiler can tell you where time is actually spent. The roofline model, which plots achieved compute against hardware limits, will tell you whether you are compute-bound or memory-bound. These are not advanced techniques; they are table stakes for any engineer doing serious AI systems work.

The Framework Question

The proliferation of orchestration frameworks for compound AI systems — DSPy, LangGraph, LlamaIndex, Semantic Kernel — reflects real engineering need. Building robust pipelines from scratch requires solving problems that most teams should not be solving repeatedly: prompt management, retry logic, state serialization, component versioning, evaluation harnesses.

DSPy takes a different conceptual approach from most. Rather than treating prompts as string templates to be manually tuned, it treats the system as a program with typed signatures and optimizes prompts algorithmically using labeled examples. For teams that have evaluation data, this yields measurable improvements; the OPTIMAS framework demonstrated an 11.9% relative improvement across five benchmarks by optimizing component composition rather than individual components.

The important caveat is that frameworks reduce friction for the common case and add it for the uncommon case. A framework that makes two-stage RAG easy will make it difficult to implement the custom verification logic your domain requires. Pick frameworks for the seams where standardization is genuinely helpful — state management, tool calling, evaluation — and build custom for the seams where your domain has specific requirements.

A Systems-First Mental Model

The frame shift required for compound AI systems is moving from "what model do I use?" to "what does each stage of this pipeline need to produce, and what can go wrong at each handoff?"

Start with the task decomposition. What are the natural subtasks, and which components are best suited for each? Retrieve before you generate. Verify before you return. Route on confidence, not just on query category.

Instrument every seam before you optimize anything. You cannot improve what you do not measure, and component-level measurement will mislead you about system-level behavior.

Treat failure modes as design inputs, not afterthoughts. Silent failures — hallucinated consensus, loop-induced context explosion, format mismatches that degrade quietly — are not edge cases. They are the default outcome of seams that were not deliberately engineered.

The practical implication is that the most valuable AI engineering skill is not prompt writing or model selection. It is the systems design discipline to build pipelines where every component understands its role, every seam has an explicit contract, and the whole behaves more reliably than any of the parts.
