
Chunking for Agents vs. RAG: Why One Strategy Breaks Both

9 min read
Tian Pan
Software Engineer

Most teams pick a chunk size, tune it for retrieval quality, and call it done. Then they build an agent on the same index and wonder why the agent fails in strange ways — it executes half a workflow, ignores conditional logic, or confidently acts on incomplete instructions. The chunk size that maximized your NDCG score is exactly what's making your agent unreliable.

RAG retrieval and agent execution are not the same problem. They have different goals, different failure modes, and fundamentally different definitions of what a "good chunk" looks like. When you optimize chunking for one, you systematically degrade the other. Most teams don't realize this until they've already built on the wrong foundation.

What RAG Retrieval Actually Needs

For retrieval-augmented generation, retrieval quality depends on embedding similarity — the chunk must embed into a vector that faithfully captures its topic, and that vector must sit close to the query vector in embedding space. This drives everything about how you should chunk for RAG.

The industry default of 400–512 tokens with 10–20% overlap exists for a reason. An NVIDIA benchmark across seven strategies and five datasets found page-level chunking at the top with 0.648 accuracy, and recursive character splitting in the sweet spot for mixed workloads. But chunk size is query-type dependent: factoid queries (names, dates, specific facts) peak at 256–512 tokens, while analytical queries that need broader context require 1024+ tokens. A January 2026 analysis found that semantic chunking — which splits at meaningful linguistic boundaries — produced fragments averaging only 43 tokens, achieving clean semantic coherence but insufficient context for generation, landing at 54% end-to-end accuracy versus 69% for recursive splitting.

The design target for a RAG chunk is semantic completeness: does this chunk describe a coherent topic? Would a similarity search on the right query land here? Adding overlap between chunks helps preserve context across boundaries. Adding metadata (section headers, document titles) helps the embedding capture the chunk's place in the document hierarchy.
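As a concrete sketch of that target, here is what a retrieval-optimized splitter might look like, assuming LangChain's RecursiveCharacterTextSplitter with token-based sizing (the exact numbers and metadata fields are illustrative, not prescriptive):

```python
# Sketch of a retrieval-optimized chunker: ~512-token chunks, ~15% overlap,
# with title/section metadata attached so the embedding carries hierarchy.
# Assumes the langchain-text-splitters package and tiktoken; adapt to your stack.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    chunk_size=512,    # token budget per chunk (factoid-query sweet spot)
    chunk_overlap=75,  # ~15% overlap to bridge chunk boundaries
)

def rag_chunks(doc_text: str, doc_title: str, section: str) -> list[dict]:
    """Split a document for similarity search, enriching each chunk with metadata."""
    return [
        {"text": chunk, "metadata": {"title": doc_title, "section": section}}
        for chunk in splitter.split_text(doc_text)
    ]
```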

What RAG retrieval is not optimizing for: whether an agent can execute the content inside the chunk without needing anything else.

What Agents Actually Need

When an agent retrieves a chunk to act on, it needs something fundamentally different: procedural completeness. The chunk must be a self-contained unit of work — a complete workflow step, a complete instruction set with its conditions and exceptions, a complete policy with its triggers and outcomes.

Consider a database migration runbook. A RAG-optimized chunk might cleanly capture "the section about locking behavior" — semantically coherent, topically cohesive, great retrieval recall. But the agent needs more than that: it needs the preconditions for when to apply the lock, the post-condition checks, the rollback procedure if those checks fail. Split those across semantic boundaries and the agent receives half an instruction. It doesn't know what it doesn't know, so it executes with what it has.

The failure is subtle because the retrieved text looks reasonable in isolation. The agent doesn't see a retrieval error — it sees plausible content and proceeds. The problem only surfaces when the procedure fails, often in production.

Agent-optimized chunks are larger, variable in size, and bounded by workflow transitions rather than linguistic or topical transitions. A step in a runbook ends when the next action starts, not when the sentence achieves semantic closure.
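A minimal sketch of procedural-boundary splitting, assuming runbook steps are introduced by headings like "## Step 3: ..." — the marker pattern is illustrative; the point is that boundaries come from workflow structure, not token counts:

```python
import re

# Sketch: split a runbook at workflow-step boundaries instead of token counts.
# The heading convention here is an assumption; adapt the pattern to whatever
# structure your runbooks actually use.
STEP_BOUNDARY = re.compile(r"^## Step \d+", re.MULTILINE)

def procedural_chunks(runbook_text: str) -> list[str]:
    """Return one chunk per workflow step: preconditions, action, checks, rollback."""
    starts = [m.start() for m in STEP_BOUNDARY.finditer(runbook_text)]
    if not starts:
        return [runbook_text]  # no recognizable steps: keep the document whole
    bounds = starts + [len(runbook_text)]
    return [runbook_text[a:b].strip() for a, b in zip(bounds, bounds[1:])]
```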

Three Failure Modes When You Pick One Strategy

Failure 1: RAG chunks are too small for agents. When you tune chunk size for factoid retrieval (256–512 tokens is often optimal), agents receive fragments. A multi-step workflow that spans 1,200 tokens gets split into three chunks. The agent retrieves the most relevant-looking one — usually the first step — and proceeds as if that's the complete instruction. Steps two and three simply don't happen. This isn't retrieval failure; the similarity score was high. It's a chunking mismatch.

Failure 2: Agent chunks are too noisy for retrieval. The inverse problem: if you size chunks for procedural completeness, they run large — often 800–1,500 tokens for a full workflow. Retrieval quality drops because larger chunks introduce multiple topics. A SQuAD-style analysis shows recall dropping 10–15% for factoid queries once chunks exceed 512 tokens. Your vector index starts returning whole workflows when the user just asked a simple factoid question, injecting irrelevant context into the prompt.

Failure 3: Overlap creates redundancy that wastes agent context. Overlap — necessary for RAG to bridge chunk boundaries — means adjacent chunks share 50–100 tokens of identical content. For retrieval, this is largely harmless (duplicates get filtered). For an agent processing multiple retrieved chunks, the repeated content consumes a meaningful fraction of the context window, crowds out other tool calls or reasoning steps, and can cause the agent to double-execute overlapping instructions.

A clinical decision support study quantifies the stakes. Adaptive chunking — aligned to topic boundaries with variable window sizes — achieved 87% accuracy versus 13% for fixed-size chunking (p = 0.001, F1: 0.64 vs. 0.24). The gap isn't marginal. Chunking choice moves the needle more than most teams expect.

The Dual-Index Architecture

The clean solution: maintain two separate indices over the same documents.

Index A: retrieval-optimized. Fixed recursive splitting at 400–512 tokens with 15% overlap. Metadata-enriched (section title, document type, source). This index serves similarity search. Queries that need facts, explanations, or context go here.

Index B: agent-optimized. Procedural boundary splitting — chunks bounded by action boundaries, workflow step transitions, or policy section delimiters. No overlap (each chunk must be self-contained). This index serves agentic execution. Queries that trigger tool use, multi-step operations, or policy enforcement go here.

Same documents. Same embedding model. Different chunking strategy, different chunk sizes, different metadata schemas. The separation is intentional: retrieval quality and execution completeness are genuinely different objectives, and trying to serve both with a single index forces a compromise that weakens both.
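Putting the two together, a minimal ingestion sketch might look like this. It reuses the rag_chunks and procedural_chunks helpers from the earlier sketches; the embedding function and the flat in-memory lists stand in for whatever embedding model and vector store you actually run:

```python
from dataclasses import dataclass, field
from typing import Callable

# Sketch: two indices over the same documents, one chunking strategy each.
# The embed callable and list-based storage are placeholders for your real stack.
@dataclass
class DualIndex:
    embed: Callable[[str], list[float]]                        # same model for both indices
    retrieval_index: list[dict] = field(default_factory=list)  # Index A: ~512 tokens, 15% overlap
    agent_index: list[dict] = field(default_factory=list)      # Index B: procedural steps, no overlap

    def ingest(self, doc_text: str, doc_title: str, section: str) -> None:
        # Same document, same embedding model, different chunk boundaries.
        for c in rag_chunks(doc_text, doc_title, section):
            self.retrieval_index.append({**c, "vector": self.embed(c["text"])})
        for step in procedural_chunks(doc_text):
            self.agent_index.append(
                {"text": step, "metadata": {"title": doc_title}, "vector": self.embed(step)}
            )
```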

LlamaIndex's HierarchicalNodeParser is close to this pattern — it creates multiple chunk sizes simultaneously (coarse sections, medium paragraphs, fine sentences) and lets retrieval select the appropriate granularity at query time. Multi-scale indexing (indexing at 50, 100, 200, 500, and 1000 tokens in parallel) achieves 1–37% improvements over fixed-size approaches by capturing granularity diversity. But the hierarchy is still optimized for retrieval — it doesn't model the procedural boundary concept that agents require.
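For reference, the LlamaIndex pattern looks roughly like this (chunk sizes are illustrative; the API is the llama-index-core node parser as of recent releases):

```python
from llama_index.core import Document
from llama_index.core.node_parser import HierarchicalNodeParser

# Multi-granularity parsing: coarse sections, medium paragraphs, fine chunks.
# All three levels are retrieval-oriented; none is bounded by procedural steps.
parser = HierarchicalNodeParser.from_defaults(chunk_sizes=[2048, 512, 128])
nodes = parser.get_nodes_from_documents([Document(text="...your document text...")])
```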

The dual-index pattern makes the conceptual split explicit and operational.

Task-Classification Routing

A dual-index architecture only works if the system routes queries to the right index. This is where task classification comes in.

The routing logic doesn't need to be complex. Most compound AI systems can classify incoming queries into two buckets with a lightweight classifier or a short LLM prompt:

  • Retrieval mode: the user or calling code needs information — facts, explanations, comparisons. Route to the retrieval-optimized index. Return the top-k chunks as context.
  • Execution mode: an agent is being dispatched to perform a task — run a procedure, enforce a policy, execute a workflow. Route to the agent-optimized index. Return the procedure chunk(s) that match the action.

In practice, classification signals are often available upstream. The orchestration layer usually knows whether it's populating a generation context (retrieval) or providing instructions to a specialized agent (execution). In those cases, routing can be deterministic rather than learned.

When signals are ambiguous, a lightweight routing prompt works: ask the LLM whether the query requires information synthesis or procedural execution, then route accordingly. Modern agentic RAG frameworks already implement evaluator nodes that assess document relevance and trigger re-routing when confidence is low — the same mechanism extends to index selection.
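A minimal routing sketch, written against a generic LLM callable so it stays provider-agnostic (the prompt wording and labels are illustrative):

```python
from typing import Callable

# Sketch: classify a query as information-seeking vs. procedural, then pick the index.
ROUTING_PROMPT = """Classify the query as RETRIEVAL (needs facts, explanations,
comparisons) or EXECUTION (dispatches an agent to run a procedure, enforce a
policy, or execute a workflow). Answer with exactly one word.

Query: {query}"""

def route(query: str, llm_call: Callable[[str], str], retrieval_index, agent_index):
    """Pick the index whose chunking matches the task; default to retrieval."""
    label = llm_call(ROUTING_PROMPT.format(query=query)).strip().upper()
    return agent_index if label == "EXECUTION" else retrieval_index
```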

The routing overhead is small. An extra LLM call adds latency, but it can overlap with speculative fetches against both indices, with the unused result discarded. The cost is predictable. The alternative — silent degradation when the wrong index handles the query — has unpredictable and harder-to-debug consequences.

Practical Migration Path

If you're running a single-index system and want to move toward this architecture, the migration is incremental:

Start by auditing what's actually being retrieved. Log the queries that produce agent failures and inspect the retrieved chunks. If you see procedurally incomplete chunks (steps without context, instructions without their conditions), you have a confirmed chunking mismatch. That's the evidence you need to prioritize the fix.

Next, classify your document corpus by content type. Runbooks, SOPs, policy documents, and API specifications are strong candidates for the agent-optimized index. Reference docs, FAQs, and explanatory content belong in the retrieval-optimized index. Many corpora naturally separate.

Build the second index without dismantling the first. Run both in parallel, route by content type initially (which is deterministic and requires no classifier). Measure agent failure rates against the agent-optimized index versus the original. This gives you the empirical signal to justify investment in full task-classification routing.
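During that parallel-run phase, the deterministic routing can be as simple as a lookup on document type — no classifier required (the type labels below are illustrative):

```python
# Sketch: route by document type before any learned classifier exists.
# The type taxonomy is an assumption; use whatever labels your corpus carries.
AGENT_DOC_TYPES = {"runbook", "sop", "policy", "api_spec"}

def index_for(doc_type: str) -> str:
    """Runbooks, SOPs, policies, and API specs go to the agent index; the rest to retrieval."""
    return "agent_index" if doc_type.lower() in AGENT_DOC_TYPES else "retrieval_index"
```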

The Underlying Principle

Chunking is not a preprocessing detail — it's a system design decision that determines what information is addressable at runtime. The chunk boundary is the atomic unit of retrieval. Everything above that level is inference; everything below it is invisible.

When RAG and agents share an index, they're sharing an atomic unit that was designed for only one of them. The result is predictable: one of them is running on the wrong granularity. The fact that it often fails silently — high retrieval scores, plausible-looking output — makes it one of the more expensive mistakes in compound AI system design.

The fix is straightforward once you see the problem clearly: two problems, two optimized indices, one classifier to route between them. No exotic architecture required.
