
Compound AI Systems: Why Your Best Architecture Uses Three Models, Not One

· 10 min read
Tian Pan
Software Engineer

The instinct is always to reach for the biggest model. GPT-4o, Claude Opus, Gemini Ultra — pick the frontier model, point it at the problem, and hope that raw capability compensates for architectural laziness. It works in demos. It fails in production.

The teams shipping the most reliable AI systems in 2025 and 2026 aren't using one model. They're composing three, four, sometimes five specialized models into pipelines where each component does exactly one thing well. A classifier routes. A generator produces. A verifier checks. The result is a system that outperforms any single model while costing a fraction of what a frontier-model-for-everything approach would.

This isn't theoretical. AlphaCode 2 composes an LLM with a sampling engine and a scoring model to reach the 85th percentile on competitive programming — performance no single model achieves alone. Medprompt combines GPT-4 with nearest-neighbor search and ensembling to exceed GPT-4's own baseline by 9% on medical exams. The pattern is the same: the system is smarter than its smartest component.

The Single-Model Ceiling

Every production team eventually hits the same wall. You've optimized your prompt. You've added few-shot examples. You've tried chain-of-thought. The model still fails on 15% of inputs — and those failures are the ones that matter most, because they're the edge cases your users actually encounter.

The problem is that a single model must simultaneously be good at understanding intent, generating output, and knowing when it's wrong. These are fundamentally different cognitive tasks. Asking one model to handle all three is like asking a single engineer to be the architect, the developer, and the QA lead on the same ticket. It can work on small problems. It collapses at scale.

Compound AI systems solve this by decomposing the problem. Instead of one model doing everything adequately, you get three models each doing one thing exceptionally well. The classifier doesn't need to generate — it just needs to route correctly. The generator doesn't need to self-evaluate — it just needs to produce high-quality output for its specific input type. The verifier doesn't need to create — it just needs to catch errors.

This decomposition creates a reliability multiplier. If your generator is 85% accurate and your verifier catches 80% of errors, your pipeline accuracy jumps to 97%: the 15% error rate times the 20% of errors that slip past the verifier leaves just 3% uncaught (assuming caught errors are recovered by retry or escalation). Two mediocre components compose into something better than a single excellent one.
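The arithmetic behind that multiplier is worth making explicit. A minimal sketch, using the 85%/80% figures above and assuming every caught error is successfully recovered:

```python
def pipeline_accuracy(generator_acc: float, verifier_catch_rate: float) -> float:
    """Share of requests that end in a correct or safely handled result.

    Assumes errors the verifier catches are recovered (retried or escalated);
    only errors that slip past the verifier reach the user.
    """
    uncaught_error_rate = (1.0 - generator_acc) * (1.0 - verifier_catch_rate)
    return 1.0 - uncaught_error_rate

pipeline_accuracy(0.85, 0.80)  # ≈ 0.97
```

The same function shows why verifier quality matters more than generator quality past a point: improving the catch rate from 80% to 95% drops residual errors from 3% to 0.75% without touching the generator.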

The Classifier-Generator-Verifier Pattern

The most common compound architecture in production follows a three-stage pattern that mirrors how engineering teams actually work.

Stage 1: Classification and routing. A small, fast model (often a fine-tuned BERT variant or a lightweight LLM like Haiku) examines the incoming request and decides where it should go. Is this a simple FAQ lookup, a complex reasoning task, or a creative generation request? The classifier doesn't need to be smart — it needs to be fast, cheap, and accurate at categorization. Research from UC Berkeley and Canva shows that intelligent routing delivers 85% cost reduction while maintaining 95% of frontier model performance.

Stage 2: Specialized generation. Based on the classifier's output, the request routes to a specialized handler. Simple queries go to a small model or even a cached response. Complex reasoning goes to a frontier model. Domain-specific tasks go to a fine-tuned specialist. Each handler is optimized for its specific task type, which means better prompts, better few-shot examples, and better output quality than a one-size-fits-all approach.

Stage 3: Verification. A separate model (or set of rules plus a model) evaluates the generator's output. Does the response answer the question? Is it factually grounded in the provided context? Does it conform to the required schema? The verifier can be a different model entirely — often a cheaper one with a focused evaluation prompt. If verification fails, the system can retry with a different generator, escalate to a more capable model, or return a graceful failure.

This pattern is powerful because each stage fails independently and recovers independently. A misclassification sends a query to a suboptimal generator, but the verifier catches the low-quality output. A generator hallucination gets flagged before it reaches the user. The system degrades gracefully instead of failing catastrophically.
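The three stages compose into a short control loop. A minimal sketch — the classifier, handlers, and verifier here are toy placeholders (a word-count heuristic and string stubs) standing in for real model calls, and the escalate-on-failure policy is one of several reasonable retry strategies:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class PipelineResult:
    answer: str
    route: str
    verified: bool

# Stage 1 placeholder: a real system would use a small fine-tuned classifier.
def classify(query: str) -> str:
    return "faq" if len(query.split()) < 8 else "complex"

# Stage 2 placeholders: each route gets its own specialized handler.
HANDLERS: dict[str, Callable[[str], str]] = {
    "faq": lambda q: f"[cached/small-model answer to: {q}]",
    "complex": lambda q: f"[frontier-model answer to: {q}]",
}

# Stage 3 placeholder: a real verifier combines schema checks with a
# focused evaluation prompt on a cheap model.
def verify(query: str, answer: str) -> bool:
    return bool(answer)

def run_pipeline(query: str, max_retries: int = 1) -> PipelineResult:
    route = classify(query)                       # Stage 1: route
    for _ in range(max_retries + 1):
        answer = HANDLERS[route](query)           # Stage 2: generate
        if verify(query, answer):                 # Stage 3: verify
            return PipelineResult(answer, route, True)
        route = "complex"                         # escalate and retry
    return PipelineResult("Sorry, I can't answer that reliably.", route, False)
```

The graceful-failure branch at the bottom is the point: when verification keeps failing, the system returns an honest refusal instead of shipping an unverified answer.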

The Routing Decision Matrix

The hardest engineering problem in compound systems isn't building the components — it's deciding when to route where. Get routing wrong and you're paying frontier-model prices for simple queries while sending complex ones to models that can't handle them.

Effective routing considers three dimensions simultaneously:

Complexity. How many reasoning steps does this query require? Single-hop factual lookups don't need chain-of-thought from a 400B parameter model. Multi-step reasoning with ambiguous constraints does. The classifier needs to estimate cognitive complexity, not just topic classification.

Risk. What's the cost of a wrong answer? A product recommendation can tolerate some inaccuracy. A medical dosage calculation cannot. High-risk queries should route to more capable models with verification, even when the query itself is simple.

Latency budget. Interactive chat needs sub-second responses. Batch processing can afford 30-second generation cycles. The same query might route to different models depending on whether a user is waiting or a background job is running.

The routing classifier itself should be the cheapest component in your system. A fine-tuned DistilBERT running locally adds under 10ms of latency. If your router is an LLM call, you've already lost — you're paying inference costs before you've even started working on the actual task.
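A routing policy over the three dimensions can be stated in a few lines. This is an illustrative sketch, not a tuned policy — the thresholds and model tiers are made up, and a production router would learn them from logged outcomes:

```python
from enum import Enum

class Model(Enum):
    SMALL = "small-local"      # cheap, fast, handles simple lookups
    MID = "mid-tier"           # balanced cost/capability
    FRONTIER = "frontier"      # expensive, most capable

def route(complexity: float, risk: float, latency_budget_ms: int) -> Model:
    """Toy policy: complexity and risk are scores in [0, 1].

    Risk dominates: a high-stakes query goes to the capable model
    even when it looks simple. Latency budget then constrains the choice.
    """
    if risk > 0.8:
        return Model.FRONTIER
    if latency_budget_ms < 1000:   # interactive: user is waiting
        return Model.SMALL if complexity < 0.5 else Model.MID
    return Model.MID if complexity < 0.7 else Model.FRONTIER
```

Note the ordering: risk is checked before latency, encoding the article's point that a simple-looking medical-dosage query still deserves the capable model.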

Coordination Overhead: The Silent Killer

Here's where most compound system designs die: coordination costs scale non-linearly with component count.

Two components require one integration point. Three components require three. Five components require ten. Each integration point is a potential failure mode, a latency addition, and a debugging surface. Handoff latency between components ranges from 100ms to 500ms per interaction. A workflow requiring ten handoffs adds 1-5 seconds of pure coordination overhead.

The research is clear on this. Single-component systems achieve 99.5% success rates. Equivalent multi-component implementations observe 97% success rates due to coordination failures alone — before you even account for individual component errors. That 2.5% gap sounds small until you realize it means 25 additional failures per thousand requests.

The coordination tax follows a rule of thumb: if handoffs add more than 40% to your end-to-end latency, your architecture is too complex. You've crossed the threshold where coordination overhead consumes more value than specialization creates.
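The two numbers above — pairwise integration points and the 40% latency rule — are easy to compute before you commit to an architecture. A small sketch (the handoff costs you plug in are your own measurements):

```python
def integration_points(n_components: int) -> int:
    """Pairwise integration count: n choose 2.

    2 components -> 1, 3 -> 3, 5 -> 10, matching the text.
    """
    return n_components * (n_components - 1) // 2

def within_coordination_budget(handoff_ms: float, n_handoffs: int,
                               base_latency_ms: float) -> bool:
    """Rule of thumb from the text: handoffs should add no more than
    40% on top of the useful (model inference) latency."""
    overhead_ms = handoff_ms * n_handoffs
    return overhead_ms / base_latency_ms <= 0.40

integration_points(5)                            # 10
within_coordination_budget(500, 10, 5000)        # False: 5s overhead on 5s of work
```

Running this before a design review turns "feels too complex" into a number you can argue about.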

This means compound systems have a sweet spot. Two to four components is usually optimal. Beyond that, you're fighting the coordination overhead more than you're benefiting from specialization. The classifier-generator-verifier triad works precisely because it's the minimal decomposition that captures the three distinct cognitive tasks (understanding, producing, evaluating) without introducing unnecessary coordination.

When Compound Systems Lose

Not every problem benefits from decomposition. Compound systems underperform a single good model in several specific scenarios.

Low-volume applications. If you're processing fewer than 500 requests per day, the engineering overhead of maintaining a routing classifier, multiple model deployments, and integration tests exceeds the cost savings. Just use the best model you can afford and move on.

Uniform query distribution. If 95% of your queries are the same type and complexity, routing adds latency without adding value. A well-prompted single model with a verification step (two components, not three) is the right architecture.

Rapid iteration phases. During early product development, when your understanding of user queries is changing weekly, hardcoding routing logic creates technical debt faster than it creates value. Start with a single model, collect distribution data, and introduce routing once you understand your traffic patterns.

Highly creative tasks. Tasks that require sustained coherent generation — long-form writing, complex code architecture, nuanced conversation — often perform worse when decomposed. The classifier-generator-verifier pattern introduces seams that break coherence. A single model maintaining context across a long generation usually outperforms a pipeline that fragments the task.

The honest benchmark: single-agent systems with good tool augmentation handle 80% of use cases better than multi-component alternatives. Compound architectures earn their complexity for the remaining 20% — but that 20% is often where the highest business value concentrates.

Building Compound Systems That Survive Production

If you've decided a compound architecture is right for your use case, four engineering practices determine whether it survives contact with real traffic.

Design for independent deployment. Each component should be deployable, testable, and rollbackable independently. If upgrading your generator requires retraining your classifier, your architecture is coupled wrong. Version your inter-component contracts (input/output schemas) the same way you version APIs.
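Versioning the inter-component contract can be as lightweight as a schema version field that each consumer checks on receipt. A minimal sketch — the field names and the major-version compatibility rule are illustrative, not a standard:

```python
from dataclasses import dataclass

CONTRACT_VERSION = "2.1"  # bump the major version on any breaking change

@dataclass(frozen=True)
class ClassifierOutput:
    """The contract between classifier and generator, versioned like an API."""
    schema_version: str
    route: str          # e.g. "faq" | "complex" | "domain"
    confidence: float   # classifier confidence in [0, 1]

def check_contract(msg: ClassifierOutput) -> None:
    """Reject messages from an incompatible major version of the classifier.

    Lets the classifier and generator deploy independently as long as
    they agree on the major version.
    """
    if msg.schema_version.split(".")[0] != CONTRACT_VERSION.split(".")[0]:
        raise ValueError(f"incompatible contract version: {msg.schema_version}")
```

The check is cheap insurance: a contract mismatch fails loudly at the boundary instead of silently degrading output quality two components downstream.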

Build the monolith first. Start with a single model handling everything. Instrument it heavily — log query types, complexity estimates, failure modes, and latency breakdowns. After you have a month of production data, the routing boundaries will be obvious. Premature decomposition is the most common compound system failure mode because you end up splitting along the wrong boundaries.

Make routing observable. Every routing decision should be logged with the classifier's confidence score, the selected handler, and the downstream outcome. When quality drops, you need to distinguish between routing errors (right model, wrong prompt), classification errors (wrong model), and capability errors (right model, genuinely hard query). Without this observability, debugging compound systems is nearly impossible.
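The log record described above fits in one structured line per decision. A sketch of what that might look like — the field names and outcome labels are placeholders for whatever your log pipeline expects:

```python
import json
import time

def log_routing_decision(query_id: str, route: str, confidence: float,
                         outcome: str) -> str:
    """Emit one structured log line per routing decision.

    `outcome` is what lets you separate the three error classes later:
    e.g. "ok", "verifier_reject", "escalated", "user_thumbs_down".
    """
    record = {
        "ts": time.time(),
        "query_id": query_id,
        "route": route,
        "classifier_confidence": confidence,
        "outcome": outcome,
    }
    line = json.dumps(record)
    print(line)  # real system: ship to your structured-log pipeline
    return line
```

With confidence, route, and outcome joined on `query_id`, the three failure classes fall out of simple queries: low-confidence routes that failed are classification errors; high-confidence routes that failed on easy queries are prompt problems; high-confidence routes that failed on genuinely hard queries are capability limits.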

Set a complexity budget. Before adding a new component, calculate the marginal value it provides against the coordination cost it introduces. If a new verification step catches 3% more errors but adds 400ms of latency and a new failure mode, it probably isn't worth it. The best compound systems are disciplined about staying minimal — not because minimalism is elegant, but because every component is a maintenance burden and a failure surface.
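The budget can be an explicit gate rather than a vibe. A toy sketch — the thresholds here are invented for illustration; the right values depend on your traffic and SLOs:

```python
def passes_complexity_budget(marginal_error_reduction: float,
                             added_latency_ms: float,
                             min_error_reduction: float = 0.05,
                             max_added_latency_ms: float = 300.0) -> bool:
    """Gate a proposed new component on value vs. cost.

    A component must clear BOTH bars: enough quality gain and a
    bounded latency cost. Thresholds are illustrative defaults.
    """
    return (marginal_error_reduction >= min_error_reduction
            and added_latency_ms <= max_added_latency_ms)

# The example from the text: 3% more errors caught, 400ms added.
passes_complexity_budget(0.03, 400)  # False -- fails both bars
```

Writing the gate down forces the conversation: anyone proposing a fourth component has to bring numbers, not enthusiasm.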

The Emerging Pattern: Adaptive Composition

The most sophisticated compound systems in 2026 aren't static pipelines — they're adaptive. The system itself decides how many components to engage based on the query.

A simple query might skip verification entirely: classify, generate, return. A medium-complexity query adds verification. A high-risk query adds generation with self-consistency sampling (multiple generations compared for agreement), verification by a separate model, and human-in-the-loop escalation if confidence is low.

This adaptive approach captures the key insight of compound systems — different tasks need different architectures — and applies it recursively. The system doesn't just route between models; it routes between pipeline configurations. Easy queries get the fast, cheap path. Hard queries get the thorough, expensive path. The pipeline itself becomes a parameter you optimize.
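Routing between pipeline configurations can itself be a small function. A sketch under the same caveats as before — stage names and thresholds are illustrative, and a real system would select among registered, versioned pipeline definitions:

```python
def select_pipeline(complexity: float, risk: float) -> list[str]:
    """Pick a pipeline configuration per query.

    Mirrors the tiers in the text: easy queries skip verification,
    medium ones add it, high-risk ones get self-consistency sampling
    plus verification plus human escalation.
    """
    if risk > 0.8:
        return ["classify", "self_consistency_generate", "verify", "human_review"]
    stages = ["classify", "generate"]
    if complexity > 0.3 or risk > 0.3:
        stages.append("verify")
    return stages

select_pipeline(0.1, 0.1)  # ["classify", "generate"]
```

The return value is just a plan; an executor walks the stage list, which keeps the configuration loggable and diffable — the same observability discipline as routing, one level up.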

DSPy and similar frameworks are pushing this further, automatically tuning pipeline configurations — prompt instructions, few-shot examples, component selection — to maximize end-to-end performance on target metrics. The pipeline becomes a differentiable program, even when individual components (search engines, code interpreters, external APIs) are non-differentiable.

The Three-Model Minimum

If you take one thing from this piece, it's this: the minimum viable compound system has three components, and they map to three distinct cognitive functions.

One model understands. One model produces. One model evaluates.

This is the decomposition that consistently outperforms single-model approaches in production, because it mirrors how reliable systems work in every other domain. Code review exists because the person who writes code is the worst judge of its correctness. QA exists because building and testing require different mindsets. The classifier-generator-verifier pattern is the same principle, applied to AI.

The frontier model arms race will continue. Models will get bigger and more capable. But the systems that deliver the most reliable, cost-effective AI in production will continue to be the ones that compose smaller specialists into architectures where the whole exceeds the sum of its parts. The best model is three models.
