Agentic RAG: When Your Retrieval Pipeline Needs a Brain

· 10 min read
Tian Pan
Software Engineer

Ninety percent of agentic RAG projects failed in production in 2024. Not because the technology was broken, but because engineers wired up vector search, a prompt, and an LLM, called it a retrieval pipeline, and shipped — without accounting for the compounding failure costs at every layer between query and answer.

Classic RAG is a deterministic function: embed query → vector search → stuff context → generate. It runs once, in one direction, with no feedback loop. That works when queries are clean single-hop lookups against a well-chunked corpus. It fails spectacularly when a user asks "compare the liability clauses across these five contracts," or "summarize what's changed in our infra config since the Q3 incident," or any question that requires synthesizing evidence across documents before forming an answer.

Agentic RAG converts that one-shot pipeline into a control loop. The retrieval step becomes a decision: which tool to call, whether the results are good enough, and whether to try again. It's the difference between a function call and a state machine with conditional transitions.

The Five-Component Control Loop

The canonical agentic RAG architecture has five stages that replace the linear pipeline:

Router → Retriever → Grader → Generator → Hallucination Checker

With a feedback arc from Grader back to Retriever on failure.

The router dispatches each query to the right retrieval mode — vector search over a private corpus, live web search, direct LLM response for general knowledge questions, or a structured SQL query when the answer lives in a database. A RouteQuery Pydantic model enforces the output contract. In production, caching router decisions eliminates 30–40% of routing LLM calls on repeated query patterns.
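A minimal sketch of this stage, using a plain dataclass in place of the Pydantic `RouteQuery` model and keyword heuristics in place of the routing LLM call; the `lru_cache` plays the role of the decision cache, and the keywords and rationale strings are illustrative assumptions:

```python
from dataclasses import dataclass
from functools import lru_cache
from typing import Literal

Datasource = Literal["vectorstore", "web_search", "llm_direct", "sql"]

@dataclass(frozen=True)
class RouteQuery:
    datasource: Datasource
    rationale: str

@lru_cache(maxsize=4096)  # cache routing decisions for repeated query patterns
def route(query: str) -> RouteQuery:
    q = query.lower()
    # Keyword stubs standing in for the routing LLM call.
    if any(kw in q for kw in ("latest", "today", "current news")):
        return RouteQuery("web_search", "needs fresh data")
    if any(kw in q for kw in ("average", "count of", "per month")):
        return RouteQuery("sql", "aggregate over structured data")
    if any(kw in q for kw in ("contract", "incident", "our infra")):
        return RouteQuery("vectorstore", "private corpus lookup")
    return RouteQuery("llm_direct", "general knowledge")
```

In a real system the heuristics become a structured-output LLM call, but the cacheable, schema-enforced contract stays the same.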

The retriever fetches candidate documents. The grader — a lightweight LLM call, ideally using a small model like GPT-4o-mini or Claude Haiku to keep costs reasonable — scores each retrieved document for relevance. If documents fail the relevance threshold, the agent rewrites the query and re-retrieves rather than hallucinating over bad context.
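The grading step can be sketched with lexical overlap standing in for the small-model relevance call; the 0.5 threshold and the `GradeResult` shape are illustrative assumptions:

```python
from dataclasses import dataclass

@dataclass
class GradeResult:
    relevant_docs: list
    should_rewrite: bool

def grade_documents(query_terms, docs, threshold=0.5):
    """Lexical-overlap score as a stub for the small-model relevance call."""
    query_set = {t.lower() for t in query_terms}
    kept = []
    for doc in docs:
        overlap = len(query_set & set(doc.lower().split()))
        if overlap / max(len(query_set), 1) >= threshold:
            kept.append(doc)
    # Nothing passed the threshold: tell the loop to rewrite and re-retrieve.
    return GradeResult(relevant_docs=kept, should_rewrite=not kept)
```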

The generator synthesizes the graded context into an answer, and a hallucination checker validates that every factual claim is grounded in a retrieved passage. Ungrounded claims either get stripped or flagged for human review.

This loop is where the intelligence lives. The agent decides at each step whether to continue, retry, escalate, or stop.
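Put together, the loop is a short driver over pluggable stages. This is a sketch, not a production state machine; the callables are stand-ins for the real retriever, grader, generator, and checker, and `max_rewrites` hard-caps the feedback arc:

```python
def run_agentic_rag(query, retrieve, grade, generate, check_grounded,
                    rewrite, max_rewrites=3):
    """Retrieve -> grade -> generate -> ground-check, with a capped
    feedback arc from the grader (and checker) back to retrieval."""
    for _ in range(max_rewrites + 1):
        docs = retrieve(query)
        graded = grade(query, docs)
        if not graded:
            query = rewrite(query)           # grader rejected everything: retry
            continue
        answer = generate(query, graded)
        if check_grounded(answer, graded):
            return answer
        query = rewrite(query)               # checker found ungrounded claims
    return "insufficient information found"  # degrade gracefully past the cap
```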

Query Planning: Four Techniques for Complex Questions

Simple routing handles query dispatch. Query planning handles structural complexity: queries too complex for single-shot retrieval to resolve without decomposition.

Sub-question decomposition breaks a comparative query into independent sub-queries, retrieves against each in parallel, deduplicates, and reranks before synthesis. "Compare X and Y on dimensions A, B, and C" becomes three separate retrieval operations running concurrently. This is the right pattern when you can parallelize without dependencies between sub-queries.
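A sketch of the fan-out, assuming a hypothetical `decompose` that splits on comparison dimensions and an injected `retrieve` callable; a real reranker would replace the order-preserving merge:

```python
from concurrent.futures import ThreadPoolExecutor

def decompose(query, dimensions):
    """Hypothetical decomposer: one sub-query per comparison dimension."""
    return [f"{query} with respect to {d}" for d in dimensions]

def retrieve_parallel(sub_queries, retrieve):
    """Run sub-queries concurrently, then deduplicate preserving order."""
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(retrieve, sub_queries))
    seen, merged = set(), []
    for docs in results:
        for doc in docs:
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged
```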

HyDE (Hypothetical Document Embeddings) addresses embedding space mismatch. Short queries and long documents often sit far apart in embedding space even when the document contains the answer. HyDE generates a plausible hypothetical answer first, then uses that as the retrieval query. The hypothetical answer embeds much closer to the actual answer document than the raw question does.
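The effect can be demonstrated with a toy bag-of-words "embedding"; the hypothetical answer here would come from an LLM call in practice, and the similarity gap is only illustrative:

```python
from collections import Counter
from math import sqrt

def embed(text):
    """Toy bag-of-words vector standing in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(count * b[token] for token, count in a.items())
    norm = lambda v: sqrt(sum(c * c for c in v.values()))
    return dot / (norm(a) * norm(b)) if a and b else 0.0

def hyde_query(question, generate_hypothetical):
    """Embed a hypothetical answer (an LLM call in practice), not the question."""
    return embed(generate_hypothetical(question))
```

Even in this toy space, a drafted answer shares far more vocabulary with the target document than the short question does, which is the whole mechanism.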

Step-Back Prompting handles over-specific queries by extracting higher-level concepts first. When the query is too narrow ("what does config flag --max-retry-count do in service X version 2.3.4?"), a step-back query asks about the broader principle first, retrieves at that level, and then descends to specifics. It handles informational complexity at the conceptual level rather than the syntactic level.
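A sketch of the two-pass pattern, with `abstract` standing in for the LLM call that generalizes the narrow query; the merge order (broad first) and `top_k` are assumptions:

```python
def step_back_retrieve(query, abstract, retrieve, top_k=4):
    """Retrieve at the broader concept first, then the specific query."""
    broad = retrieve(abstract(query))      # step-back pass
    narrow = retrieve(query)               # original, over-specific pass
    merged = []
    for doc in broad + narrow:
        if doc not in merged:
            merged.append(doc)
    return merged[:top_k]
```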

FLARE (Forward-Looking Active Retrieval) operates at generation time rather than pre-retrieval. As the model generates, it monitors token confidence. When predicted tokens fall below a threshold, the model pauses, forms a retrieval query from the anticipated continuation, fetches, and keeps going. This is fine-grained retrieval embedded in generation — the right tool for long-form content where information needs evolve mid-document.
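A toy version of the monitoring loop: `steps` stands in for model continuations with per-span confidence, and this sketch only collects the retrievals rather than regenerating the low-confidence spans as real FLARE does:

```python
def flare_generate(steps, retrieve, threshold=0.7):
    """`steps` yields (span, confidence) pairs standing in for model output
    with token-level confidence. Below the threshold, pause and retrieve."""
    spans, evidence = [], []
    for span, confidence in steps:
        if confidence < threshold:
            # Query formed from the anticipated continuation.
            evidence.append(retrieve(span))
        spans.append(span)
    return " ".join(spans), evidence
```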

Hierarchical Multi-Agent Architectures

For large, heterogeneous corpora, flat retrieval doesn't scale. The solution is a two-tier agent hierarchy: document agents and a meta-agent.

Each document agent is responsible for one document or document cluster. It has semantic search and summarization tools, and it operates under a system prompt that mandates tool use — it cannot answer from parametric memory alone. This constraint matters: without it, document agents hallucinate from training data rather than the corpus.

A top-level meta-agent handles tool retrieval — selecting which document agents to invoke — and coordinates routing across the corpus. The meta-agent reasons with chain-of-thought about which agents have the information the query needs, dispatches concurrently, and synthesizes results with a reranker layer.

This architecture scales horizontally. Adding documents means adding document agents, not rewriting orchestration logic. The tradeoff is latency: coordinating multiple agents adds round trips, and at production scale (1,000+ document agents), orchestration overhead becomes a first-class engineering concern.
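A stripped-down sketch of the two tiers; the lexical `search` and the dispatch-to-every-agent-with-a-hit policy are stand-ins for the real document tools and the meta-agent's chain-of-thought routing:

```python
class DocumentAgent:
    """One agent per document; it may only answer from its own passages."""
    def __init__(self, name, passages):
        self.name = name
        self.passages = passages

    def search(self, query):
        terms = set(query.lower().split())
        return [p for p in self.passages if terms & set(p.lower().split())]

def meta_agent(query, agents):
    """Dispatch to each document agent and pool attributed results."""
    results = []
    for agent in agents:
        for passage in agent.search(query):
            results.append((agent.name, passage))
    return results
```

Note the horizontal-scaling property: adding a document means appending one more `DocumentAgent` to the list, with no change to `meta_agent`.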

When Agentic RAG Costs More Than It's Worth

The capability improvements of agentic RAG come with real costs: 5–15 LLM calls per query vs. one for naive RAG, latency of 4–8 seconds vs. sub-second, and per-query costs of $0.06–$0.31 vs. $0.001–$0.005.

These numbers don't mean agentic RAG is always expensive — they mean the router decision is critical. If a query can be resolved with a single-hop vector lookup, the routing layer should dispatch it there and skip the control loop entirely. The practical compromise is an adaptive tiering model:

  • No retrieval: simple factual questions within LLM training data
  • Naive RAG: single-hop lookups against a clean corpus
  • Agentic RAG: multi-hop, comparative, or analytical queries that require synthesis

Use agentic RAG when queries are genuinely complex (multi-hop, cross-document synthesis), when answers require real-time data alongside a static corpus, when tasks are asynchronous (research, document review, code analysis), or when failure has significant downstream consequences and you need an audit trail. Keep it away from high-volume FAQ workloads with sub-second latency requirements.
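The tiering decision itself can be sketched as a classifier; the keyword heuristics below are a toy stand-in for an LLM-based complexity judgment:

```python
def choose_tier(query):
    """Toy heuristic standing in for an LLM-based complexity classifier."""
    q = query.lower()
    multi_hop = any(kw in q for kw in ("compare", "across", "versus", "changed since"))
    needs_corpus = any(kw in q for kw in ("our ", "contract", "config", "incident"))
    if multi_hop:
        return "agentic_rag"   # full control loop
    if needs_corpus:
        return "naive_rag"     # single-hop vector lookup
    return "no_retrieval"      # answer from the model directly
```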

Four Production Failure Modes

Most production failures fall into one of four patterns:

Retrieval thrash: the agent repeatedly fetches semantically redundant documents without improving answer quality. The grader's relevance threshold is too strict, triggering rewrites when the corpus simply doesn't contain the answer. The diagnostic: if more than 30% of queries trigger re-retrieval, the base retrieval pipeline is the problem, not the agentic layer. Distinguish "retrieval failed" (corpus gap) from "low-confidence retrieval" (rewrite and retry).

Infinite loops: without a rewrite counter in the state machine, agents default to "get more context" when uncertain. Hard-cap at three iterations. Beyond three re-retrievals, degrade gracefully — return "insufficient information found" rather than spinning. Implement this before shipping anything to production.

Context bloat: retrieved documents accumulate across iterations. By the third retry, the context window holds 6–12 passages, much of it redundant. LLM attention degrades on long, noisy contexts. Fix: hash-based deduplication of retrieved passages before each generation call, plus a sliding window of the top-K most relevant documents per iteration (K=3–5 works in practice).
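The fix is a few lines; passage hashing plus the top-K window shown here follow the numbers above (normalization before hashing is an assumption about what counts as a duplicate):

```python
import hashlib

def dedupe_and_window(passages, scores, k=4):
    """Drop hash-duplicate passages, then keep the top-K by relevance score."""
    seen, unique = set(), []
    for passage, score in zip(passages, scores):
        digest = hashlib.sha256(passage.strip().lower().encode()).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append((passage, score))
    unique.sort(key=lambda pair: pair[1], reverse=True)
    return [passage for passage, _ in unique[:k]]
```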

Hallucination amplification: the most insidious failure. If the hallucination checker only tests internal coherence ("does this sound right?") rather than external grounding ("is this claim sourced to a retrieved passage?"), iterative generation amplifies confident-but-wrong narratives. Each re-generation step makes the hallucination more elaborate and harder to catch. Citation gating fixes this: the generator must source every factual claim to a specific retrieved passage. This is non-negotiable for legal, medical, or compliance use cases.
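A crude gate, using key-term containment as a stand-in for real entailment checking between claims and passages:

```python
def gate_claims(claims, passages):
    """A claim survives only if some passage contains all of its key terms."""
    grounded, flagged = [], []
    for claim in claims:
        terms = set(claim.lower().split())
        if any(terms <= set(p.lower().split()) for p in passages):
            grounded.append(claim)
        else:
            flagged.append(claim)       # strip, or route to human review
    return grounded, flagged
```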

Chunking Is Infrastructure, Not a Detail

A 2025 study on policy document RAG found faithfulness scores of 0.47–0.51 with fixed-size chunking and 0.79–0.82 with semantic chunking: nearly double, from nothing more than a chunking strategy change. The research consensus is clear: 80% of RAG failures trace back to chunking decisions, not retrieval algorithms or generation models.

The right chunk size depends on content type. Technical documentation chunks well at 256–512 tokens. Narrative and policy documents need 1,024–2,048 tokens to preserve context. Structured data should chunk at entity or row boundaries, not token boundaries. Always add 20% overlap between adjacent chunks to avoid boundary artifacts where a concept spans the chunk boundary and gets retrieved in neither half.
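The overlap rule can be sketched directly over a token list; chunk size and the 20% ratio are the parameters from above, applied to fixed-size chunking for simplicity:

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.2):
    """Fixed-size chunking with proportional overlap between adjacent chunks."""
    step = max(1, int(chunk_size * (1 - overlap_ratio)))
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break   # final chunk reached the end; stop before an empty tail
    return chunks
```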

Chunking strategy belongs in the infrastructure conversation alongside database schema and indexing. Engineers who treat it as a prompt engineering afterthought consistently rebuild their retrieval pipelines six months later.

Accuracy at the Cost of Latency: Understanding the Tradeoff

On multi-hop question-answering benchmarks, agentic RAG substantially outperforms naive RAG: roughly 95% vs. 73% on HotpotQA-style evaluations, and the gap widens on cross-document synthesis tasks. The accuracy case is genuine.

But the latency profile deserves equal attention. Single-iteration agentic RAG runs 2–4 seconds. Multi-hop with two retrieval iterations runs 4–8 seconds. Re-retrieval strategies with aggressive rewriting can push to 30 seconds — a non-starter for interactive applications.

For applications where users expect sub-second responses, the answer is streaming with progressive disclosure: return an initial answer from the first retrieval pass, then stream refinements as subsequent passes complete. This is standard in production search applications and maps cleanly to the agentic RAG control loop.

Measuring What Matters

Four metrics cover production health for agentic RAG systems:

  • Retrieval recall: percentage of queries where at least one relevant document is returned
  • Grader precision: percentage of documents marked relevant that are actually relevant (grader calibration)
  • Answer faithfulness: are all claims in the answer grounded in retrieved passages? (RAGAS faithfulness metric)
  • Answer relevance: does the answer address actual user intent?

Operational monitoring: maintain a labeled test set of 100 representative queries. Run LLM-as-judge evaluation nightly on a production sample. Alert on retrieval recall drops, re-retrieval rate spikes above 30%, and grader precision drift. These three signals catch the majority of regression events before they become user-visible.
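Two of the four metrics reduce to simple set arithmetic over a labeled sample; the data shapes here are illustrative assumptions:

```python
def retrieval_recall(results):
    """Share of queries where at least one labeled-relevant doc was retrieved.
    `results` maps query -> (retrieved_ids, relevant_ids)."""
    hits = sum(1 for retrieved, relevant in results.values()
               if set(retrieved) & set(relevant))
    return hits / len(results)

def grader_precision(marked_relevant, truly_relevant):
    """Of the docs the grader passed, the fraction actually relevant."""
    marked = set(marked_relevant)
    return len(marked & set(truly_relevant)) / len(marked) if marked else 0.0
```

Faithfulness and answer relevance, by contrast, need an LLM judge (e.g. the RAGAS metrics the list mentions) rather than set arithmetic.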

Semantic caching reduces cost 20–35% on high-repetition workloads — worthwhile to implement once the system is stable and query patterns are understood.

The Practical Path

Agentic RAG is not a drop-in upgrade over naive RAG. It's a different architecture that solves a different class of problems. The engineers who ship it successfully start with the five-component control loop (router, retriever, grader, generator, hallucination checker), hard-cap iteration at three, use hybrid BM25 + vector retrieval as the base layer, and invest in chunking strategy as infrastructure before optimizing anything else.

The engineers who fail ship naive RAG, call it good enough, then bolt on retrieval retries when it breaks — without the grader, without the hallucination checker, without the iteration cap. The resulting system looks agentic from the outside and behaves like an unreliable pipeline from the inside.

The control loop is the architecture. Build it intentionally from the start.
