
Code-Specific RAG: Why General Retrieval Fails for Codebases

· 10 min read
Tian Pan
Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it ever reaches the embedding step.

What Token-Based Chunking Actually Destroys

When you split a source file every 512 tokens, the split lands wherever the token count runs out. On a natural language document, that's annoying. On code, it's structurally catastrophic.

A function split mid-implementation loses its return type. A class hierarchy fragmented across chunks loses the inheritance chain. A call chain broken at an arbitrary boundary loses the context showing why one function calls another. Type signatures separated from their usage sites become uninterpretable in isolation. A conditional block interrupted mid-branch leaves the model unable to tell whether it resolves to an early return or falls through.

These aren't edge cases — they're the norm in any realistic codebase. A 200-line class with several methods will almost certainly be split. A service function that validates input, calls downstream APIs, and formats a response will cross multiple chunk boundaries. The model receives semantically invalid fragments and tries to reason over them anyway.
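To make the failure concrete, here is a minimal sketch of fixed-window chunking, with characters standing in for tokens and a deliberately small window; PaymentService is a made-up example:

```python
def naive_chunks(source: str, size: int = 512) -> list[str]:
    # Fixed-size windows: each boundary lands wherever the budget
    # runs out, with no regard for function or class structure.
    return [source[i:i + size] for i in range(0, len(source), size)]

sample = """\
class PaymentService:
    def process(self, card, amount):
        if not self.validate(card):
            raise ValueError("invalid card")
        return self.gateway.charge(card, amount)
"""

for i, chunk in enumerate(naive_chunks(sample, size=80)):
    print(f"--- chunk {i} ---\n{chunk}")
```

The second chunk begins mid-statement, inside the if condition: no signature, no class name, no hint of what self.gateway is. Scaled up to a 512-token window, the same thing happens to nearly every nontrivial function.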

The downstream result: a coding assistant that retrieves what looks like relevant context but generates code based on mistaken assumptions about return values, function signatures, and data shapes. The retrieval metrics look fine. The generated code doesn't compile.

AST-Based Chunking: Respecting Syntactic Boundaries

The fix is to let the language's own structure determine chunk boundaries. Abstract Syntax Trees (ASTs) represent source code as a hierarchy of syntactic units — functions, classes, control structures, variable declarations — rather than a flat stream of tokens. Chunking along AST boundaries ensures each chunk remains semantically intact.

Tree-sitter, a parser library with bindings for 40+ languages, makes this practical. It produces concrete syntax trees incrementally and handles partial or malformed files, which matters for real codebases that aren't always in a clean compilable state. A chunker built on tree-sitter can split code at function boundaries, class definitions, and top-level declarations rather than at arbitrary token counts.
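A minimal sketch of the idea using the py-tree-sitter bindings (this assumes the tree-sitter and tree-sitter-python packages and the 0.22+ Python API; older versions construct the Parser differently):

```python
from tree_sitter import Language, Parser
import tree_sitter_python as tspython

PY_LANGUAGE = Language(tspython.language())
parser = Parser(PY_LANGUAGE)

def chunk_top_level(source: bytes) -> list[str]:
    """Split a Python file at top-level syntactic boundaries."""
    tree = parser.parse(source)
    chunks = []
    for node in tree.root_node.children:
        # Each child of the module node is a complete syntactic unit:
        # a function definition, class definition, import, or statement.
        chunks.append(source[node.start_byte:node.end_byte].decode("utf-8"))
    return chunks
```

Every chunk this produces is parseable on its own, which is exactly the property token-window chunking destroys.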

Recent work formalizes this into a recursive split-then-merge algorithm. Large AST nodes (a class with many methods) get recursively decomposed when they exceed a size limit. Adjacent small sibling nodes (a sequence of short utility functions) get merged to maximize information density per chunk. The result is chunks that stay within size constraints without ever splitting across syntactic boundaries.
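In sketch form, the recursion might look like the following: a simplified reading of the split-then-merge idea, not the paper's reference implementation. It assumes tree-sitter nodes as produced above, and measures budgets in characters (see the sizing note below). Joining merged siblings with a newline is a simplification; a production chunker would preserve the original byte spans.

```python
MAX_CHARS = 1600  # size budget, measured in characters

def node_text(source: bytes, node) -> str:
    return source[node.start_byte:node.end_byte].decode("utf-8", errors="replace")

def split_then_merge(source: bytes, node) -> list[str]:
    text = node_text(source, node)
    # Base case: the node already fits the budget, so keep it whole.
    if len(text) <= MAX_CHARS:
        return [text]
    # Oversized leaf (e.g., a huge string literal): hard-split as a last resort.
    if not node.children:
        return [text[i:i + MAX_CHARS] for i in range(0, len(text), MAX_CHARS)]
    # Recursive case: decompose the oversized node into its children,
    # then greedily merge adjacent small siblings to keep chunks dense.
    chunks: list[str] = []
    buffer = ""
    for child in node.children:
        for piece in split_then_merge(source, child):
            if buffer and len(buffer) + len(piece) + 1 > MAX_CHARS:
                chunks.append(buffer)
                buffer = piece
            else:
                buffer = f"{buffer}\n{piece}" if buffer else piece
    if buffer:
        chunks.append(buffer)
    return chunks
```

A class that exceeds the budget gets decomposed into its methods; a run of short utility functions gets packed back into a single dense chunk. No chunk ever crosses a syntactic boundary unless a single leaf is itself over budget.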

The benchmark results are significant. On RepoEval, a repository-level code completion benchmark, AST-aware chunking improves retrieval recall by 4.3 points over line-based chunking and generation pass@1 by 2.67 to 5.5 points. On SWE-bench, which tests natural language to code task resolution, the same approach yields a 2.3 to 2.7 point improvement in pass@1. These aren't marginal gains — they compound across a production system handling thousands of completions.

One implementation detail matters: AST-aware chunkers should measure chunk size in characters rather than tokens. Token counts for the same logical unit vary significantly across languages and tokenizers (Python is dense; C++ header files are sprawling), while a character budget means the same thing everywhere. This avoids hidden inconsistencies in how embedding models see chunks of nominally the same "size."

Call-Graph-Aware Retrieval: Following the Code

Fixing chunking gets you structurally valid chunks. It doesn't solve the harder problem: code that's semantically relevant to a query often isn't textually similar to it.

Consider a function processPayment that calls validateCard, which calls checkFraudScore, which calls fetchUserHistory. A query about payment processing might surface processPayment by semantic similarity, but the actual bug or context lives in fetchUserHistory. Similarity-based retrieval stops at the first hop. Call-graph-aware retrieval follows the chain.

Code Property Graphs (CPGs) model exactly this. A CPG represents a codebase as a graph where nodes are code entities — functions, classes, variables, files — and edges encode structural relationships: CALLS (function invocations), CONTAINS (file containment), INHERITS_FROM (inheritance), HAS_ARGUMENT (parameter relationships), DEPENDS_ON (import dependencies). Rather than matching chunks by embedding distance, a CPG-enabled retriever can traverse: find all functions called by the retrieved function, find all functions that call into it, find all types that flow through its parameters.
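A toy sketch of the traversal step, assuming a call-graph index has already been extracted. The add_call and expand_context helpers here are hypothetical, and a real CPG would carry the other edge types as well; this shows only breadth-first expansion along CALLS edges.

```python
from collections import defaultdict, deque

# Hypothetical in-memory call-graph index. Edges are stored in both
# directions so we can walk callees and callers alike.
calls = defaultdict(set)      # caller -> callees  (CALLS edges)
called_by = defaultdict(set)  # callee -> callers  (reverse CALLS)

def add_call(caller: str, callee: str) -> None:
    calls[caller].add(callee)
    called_by[callee].add(caller)

def expand_context(seeds: set[str], hops: int = 3) -> set[str]:
    """Breadth-first expansion along CALLS edges, starting from the
    functions the embedding retriever surfaced as seeds."""
    seen = set(seeds)
    frontier = deque((s, 0) for s in seeds)
    while frontier:
        name, depth = frontier.popleft()
        if depth == hops:
            continue
        for neighbor in calls[name] | called_by[name]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return seen

# The payment example from above: three hops reach fetchUserHistory.
add_call("processPayment", "validateCard")
add_call("validateCard", "checkFraudScore")
add_call("checkFraudScore", "fetchUserHistory")
print(expand_context({"processPayment"}))
```

The retriever still uses embeddings to find the seeds; the graph does the work of pulling in context that shares no vocabulary with the query.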

The token savings are striking. One open-source implementation reports that traditional retrieval over a 52-file repository consumed roughly 123,000 tokens per query — every potentially relevant chunk passed to the model. With call-graph-guided retrieval, the same repository required about 1,700 tokens per query, a roughly 70-fold reduction, achieved by traversing only the structurally relevant subgraph rather than flooding the context window.

This matters beyond cost. Large context windows help only up to a point. Production data from teams working with large codebases consistently shows that once a codebase exceeds roughly 4MB of source, a bigger context window stops improving retrieval quality — it just adds noise. Smart structural retrieval is the only path forward at that scale.

Test Files as Retrieval Signals
