
Chunking for Agents vs. RAG: Why One Strategy Breaks Both

9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

What RAG Retrieval Actually Needs

For retrieval-augmented generation, retrieval quality depends on embedding similarity — the chunk must embed into a vector that faithfully captures its topic, and that vector must sit close to the query vector in embedding space. This drives everything about how you should chunk for RAG.

The industry default of 400–512 tokens with 10–20% overlap exists for a reason. An NVIDIA benchmark across seven strategies and five datasets found page-level chunking at the top with 0.648 accuracy, and recursive character splitting in the sweet spot for mixed workloads. But chunk size is query-type dependent: factoid queries (names, dates, specific facts) peak at 256–512 tokens, while analytical queries that need broader context require 1024+ tokens. A January 2026 analysis found that semantic chunking — which splits at meaningful linguistic boundaries — produced fragments averaging only 43 tokens, achieving clean semantic coherence but insufficient context for generation, landing at 54% end-to-end accuracy versus 69% for recursive splitting.

The design target for a RAG chunk is semantic completeness: does this chunk describe a coherent topic? Would a similarity search on the right query land here? Adding overlap between chunks helps preserve context across boundaries. Adding metadata (section headers, document titles) helps the embedding capture the chunk's place in the document hierarchy.
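As a concrete sketch of that target, here is what a retrieval-optimized splitter might look like, assuming LangChain's RecursiveCharacterTextSplitter with token-based sizing (the exact numbers and metadata fields are illustrative, not prescriptive):

```python
# Sketch of a retrieval-optimized chunker: ~512-token chunks, ~15% overlap,
# with title/section metadata attached so the embedding carries hierarchy.
# Assumes the langchain-text-splitters package and tiktoken; adapt to your stack.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,    # token budget per chunk (factoid-query sweet spot)
    chunk_overlap=75,  # ~15% overlap to bridge chunk boundaries
)

def rag_chunks(doc_text: str, doc_title: str, section: str) -> list[dict]:
    """Split a document for similarity search, enriching each chunk with metadata."""
    return [
        {"text": chunk, "metadata": {"title": doc_title, "section": section}}
        for chunk in splitter.split_text(doc_text)
    ]
```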

What RAG retrieval is not optimizing for: whether an agent can execute the content inside the chunk without needing anything else.

What Agents Actually Need

When an agent retrieves a chunk to act on, it needs something fundamentally different: procedural completeness. The chunk must be a self-contained unit of work — a complete workflow step, a complete instruction set with its conditions and exceptions, a complete policy with its triggers and outcomes.

Consider a database migration runbook. A RAG-optimized chunk might cleanly capture "the section about locking behavior" — semantically coherent, topically cohesive, great retrieval recall. But the agent needs more than that: it needs the preconditions for when to apply the lock, the post-condition checks, the rollback procedure if those checks fail. Split those across semantic boundaries and the agent receives half an instruction. It doesn't know what it doesn't know, so it executes with what it has.

The failure is subtle because the retrieved text looks reasonable in isolation. The agent doesn't see a retrieval error — it sees plausible content and proceeds. The problem only surfaces when the procedure fails, often in production.

Agent-optimized chunks are larger, variable in size, and bounded by workflow transitions rather than linguistic or topical transitions. A step in a runbook ends when the next action starts, not when the sentence achieves semantic closure.
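A minimal sketch of procedural-boundary splitting, assuming runbook steps are introduced by headings like "## Step 3: ..." — the marker pattern is illustrative; the point is that boundaries come from workflow structure, not token counts:

```python
import re

# Sketch: split a runbook at workflow-step boundaries instead of token counts.
# The heading convention here is an assumption; adapt the pattern to whatever
# structure your runbooks actually use.
STEP_BOUNDARY = re.compile(r"^## Step \d+", re.MULTILINE)

def procedural_chunks(runbook_text: str) -> list[str]:
    """Return one chunk per workflow step: preconditions, action, checks, rollback."""
    starts = [m.start() for m in STEP_BOUNDARY.finditer(runbook_text)]
    if not starts:
        return [runbook_text]  # no recognizable steps: keep the document whole
    bounds = starts + [len(runbook_text)]
    return [runbook_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```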

Three Failure Modes When You Pick One Strategy

Failure 1: RAG chunks are too small for agents. When you tune chunk size for factoid retrieval (256–512 tokens is often optimal), agents receive fragments. A multi-step workflow that spans 1,200 tokens gets split into three chunks. The agent retrieves the most relevant-looking one — usually the first step — and proceeds as if that's the complete instruction. Steps two and three simply don't happen. This isn't retrieval failure; the similarity score was high. It's a chunking mismatch.

Failure 2: Agent chunks are too noisy for retrieval. The inverse problem: if you size chunks for procedural completeness, they run large — often 800–1,500 tokens for a full workflow. Retrieval quality drops because larger chunks introduce multiple topics. A SQuAD-style analysis shows recall dropping 10–15% for factoid queries once chunks exceed 512 tokens. Your vector index starts returning whole workflows when the user just asked a simple factoid question, injecting irrelevant context into the prompt.

Failure 3: Overlap creates redundancy that wastes agent context. Overlap — necessary for RAG to bridge chunk boundaries — means adjacent chunks share 50–100 tokens of identical content. For retrieval, this is largely harmless (duplicates get filtered). For an agent processing multiple retrieved chunks, the repeated content consumes a meaningful fraction of the context window, crowds out other tool calls or reasoning steps, and can cause the agent to double-execute overlapping instructions.

A clinical decision support study quantifies the stakes. Adaptive chunking — aligned to topic boundaries with variable window sizes — achieved 87% accuracy versus 13% for fixed-size chunking (p = 0.001, F1: 0.64 vs. 0.24). The gap isn't marginal. Chunking choice moves the needle more than most teams expect.

The Dual-Index Architecture

The clean solution: maintain two separate indices over the same documents.

Index A: retrieval-optimized. Fixed recursive splitting at 400–512 tokens with 15% overlap. Metadata-enriched (section title, document type, source). This index serves similarity search. Queries that need facts, explanations, or context go here.

Index B: agent-optimized. Procedural boundary splitting — chunks bounded by action boundaries, workflow step transitions, or policy section delimiters. No overlap (each chunk must be self-contained). This index serves agentic execution. Queries that trigger tool use, multi-step operations, or policy enforcement go here.

Same documents. Same embedding model. Different chunking strategy, different chunk sizes, different metadata schemas. The separation is intentional: retrieval quality and execution completeness are genuinely different objectives, and trying to serve both with a single index forces a compromise that weakens both.
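Putting the two together, a minimal ingestion sketch might look like this. It reuses the rag_chunks and procedural_chunks helpers from the earlier sketches; the embedding function and the flat in-memory lists stand in for whatever embedding model and vector store you actually run:

```python
from dataclasses import dataclass, field
from typing import Callable

# Sketch: two indices over the same documents, one chunking strategy each.
# The embed callable and list-based storage are placeholders for your real stack.
@dataclass
class DualIndex:
    embed: Callable[[str], list[float]]                        # same model for both indices
    retrieval_index: list[dict] = field(default_factory=list)  # Index A: ~512 tokens, 15% overlap
    agent_index: list[dict] = field(default_factory=list)      # Index B: procedural steps, no overlap

    def ingest(self, doc_text: str, doc_title: str, section: str) -> None:
        # Same document, same embedding model, different chunk boundaries.
        for c in rag_chunks(doc_text, doc_title, section):
            self.retrieval_index.append({**c, "vector": self.embed(c["text"])})
        for step in procedural_chunks(doc_text):
            self.agent_index.append(
                {"text": step, "metadata": {"title": doc_title}, "vector": self.embed(step)}
            )
```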

LlamaIndex's HierarchicalNodeParser is close to this pattern — it creates multiple chunk sizes simultaneously (coarse sections, medium paragraphs, fine sentences) and lets retrieval select the appropriate granularity at query time. Multi-scale indexing (indexing at 50, 100, 200, 500, and 1000 tokens in parallel) achieves 1–37% improvements over fixed-size approaches by capturing granularity diversity. But the hierarchy is still optimized for retrieval — it doesn't model the procedural boundary concept that agents require.
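For reference, the LlamaIndex pattern looks roughly like this (chunk sizes are illustrative; the API is the llama-index-core node parser as of recent releases):

```python
from llama_index.core import Document
from llama_index.core.node_parser import HierarchicalNodeParser

# Multi-granularity parsing: coarse sections, medium paragraphs, fine chunks.
# All three levels are retrieval-oriented; none is bounded by procedural steps.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents([Document(text="...your document text...")])
```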

The dual-index pattern makes the conceptual split explicit and operational.

Task-Classification Routing

A dual-index architecture only works if the system routes queries to the right index. This is where task classification comes in.

The routing logic doesn't need to be complex. Most compound AI systems can classify incoming queries into two buckets with a lightweight classifier or a short LLM prompt:

  • Retrieval mode: the user or calling code needs information — facts, explanations, comparisons. Route to the retrieval-optimized index. Return the top-k chunks as context.
  • Execution mode: an agent is being dispatched to perform a task — run a procedure, enforce a policy, execute a workflow. Route to the agent-optimized index. Return the procedure chunk(s) that match the action.

In practice, classification signals are often available upstream. The orchestration layer usually knows whether it's populating a generation context (retrieval) or providing instructions to a specialized agent (execution). In those cases, routing can be deterministic rather than learned.

When signals are ambiguous, a lightweight routing prompt works: ask the LLM whether the query requires information synthesis or procedural execution, then route accordingly. Modern agentic RAG frameworks already implement evaluator nodes that assess document relevance and trigger re-routing when confidence is low — the same mechanism extends to index selection.
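A minimal routing sketch, written against a generic LLM callable so it stays provider-agnostic (the prompt wording and labels are illustrative):

```python
from typing import Callable

# Sketch: classify a query as information-seeking vs. procedural, then pick the index.
ROUTING_PROMPT = """Classify the query as RETRIEVAL (needs facts, explanations,
comparisons) or EXECUTION (dispatches an agent to run a procedure, enforce a
policy, or execute a workflow). Answer with exactly one word.

Query: {query}"""

def route(query: str, llm_call: Callable[[str], str], retrieval_index, agent_index):
    """Pick the index whose chunking matches the task; default to retrieval."""
    label = llm_call(ROUTING_PROMPT.format(query=query)).strip().upper()
    return agent_index if label == "EXECUTION" else retrieval_index
```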

The routing overhead is small. An extra LLM call adds latency, but it can overlap with speculative fetches against both indices, with the unused result discarded. The cost is predictable. The alternative — silent degradation when the wrong index handles the query — has unpredictable and harder-to-debug consequences.

Practical Migration Path

If you're running a single-index system and want to move toward this architecture, the migration is incremental:

Start by auditing what's actually being retrieved. Log the queries that produce agent failures and inspect the retrieved chunks. If you see procedurally incomplete chunks (steps without context, instructions without their conditions), you have a confirmed chunking mismatch. That's the evidence you need to prioritize the fix.

Next, classify your document corpus by content type. Runbooks, SOPs, policy documents, and API specifications are strong candidates for the agent-optimized index. Reference docs, FAQs, and explanatory content belong in the retrieval-optimized index. Many corpora naturally separate.

Build the second index without dismantling the first. Run both in parallel, route by content type initially (which is deterministic and requires no classifier). Measure agent failure rates against the agent-optimized index versus the original. This gives you the empirical signal to justify investment in full task-classification routing.
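During that parallel-run phase, the deterministic routing can be as simple as a lookup on document type — no classifier required (the type labels below are illustrative):

```python
# Sketch: route by document type before any learned classifier exists.
# The type taxonomy is an assumption; use whatever labels your corpus carries.
AGENT_DOC_TYPES = {"runbook", "sop", "policy", "api_spec"}

def index_for(doc_type: str) -> str:
    """Runbooks, SOPs, policies, and API specs go to the agent index; the rest to retrieval."""
    return "agent_index" if doc_type.lower() in AGENT_DOC_TYPES else "retrieval_index"
```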

The Underlying Principle

Chunking is not a preprocessing detail — it's a system design decision that determines what information is addressable at runtime. The chunk boundary is the atomic unit of retrieval. Everything above that level is inference; everything below it is invisible.

When RAG and agents share an index, they're sharing an atomic unit that was designed for only one of them. The result is predictable: one of them is running on the wrong granularity. The fact that it often fails silently — high retrieval scores, plausible-looking output — makes it one of the more expensive mistakes in compound AI system design.

The fix is straightforward once you see the problem clearly: two problems, two optimized indices, one classifier to route between them. No exotic architecture required.
