
Code-Specific RAG: Why General Retrieval Fails for Codebases

· 10 min read
Tian Pan
Software Engineer

Most teams building AI coding assistants reach for the same off-the-shelf RAG pipeline they use for document retrieval: chunk the source files by token count, embed the chunks, store them in a vector database, query by semantic similarity. The pipeline works well enough on prose. On code, it quietly fails — and the failures are hard to see in aggregate metrics, because the retrieved chunks look plausible right up until the model generates code with the wrong return type, calls a function with the wrong signature, or misses a dependency that only exists three hops down the call graph.
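To make the failure concrete, here is a minimal sketch of that token-count chunking step applied to a toy source file. Everything here is illustrative, not from any particular pipeline: the `SOURCE` snippet, the `chunk_by_tokens` helper, and the whitespace split standing in for a real tokenizer are all assumptions for the sake of the example.

```python
# Illustrative only: fixed-size "token" chunking with no knowledge of syntax.
# Whitespace splitting stands in for a real tokenizer; the failure mode is
# the same either way, since neither respects function boundaries.

SOURCE = '''\
def parse_config(path: str) -> dict:
    """Load and validate a config file."""
    raw = open(path).read()
    data = json.loads(raw)
    validate_schema(data)
    return data

def validate_schema(data: dict) -> None:
    for key in ("host", "port"):
        if key not in data:
            raise KeyError(key)
'''

def chunk_by_tokens(text: str, chunk_size: int = 12) -> list[str]:
    # Slice the token stream into fixed-size windows, ignoring structure.
    tokens = text.split()
    return [" ".join(tokens[i:i + chunk_size])
            for i in range(0, len(tokens), chunk_size)]

chunks = chunk_by_tokens(SOURCE)
```

The chunk boundaries land wherever the token count says they do: the first chunk cuts `parse_config` off mid-statement, and the second glues the tail of one function to the signature of another. An embedding of either chunk describes neither function faithfully, which is exactly why retrieval looks plausible while the signatures it returns are wrong.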

The problem isn't the embedding model or the vector database. It's the chunking strategy. Code is not prose. It has structural properties — dependency graphs, call chains, type signatures, scope hierarchies — that token-based chunking destroys before the retriever ever sees them. Fixing this requires rethinking how you decompose code before it reaches the embedding step.
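One structure-aware alternative can be sketched with Python's standard `ast` module: chunk at definition boundaries, so a signature is never separated from its body. The `chunk_by_definition` helper and the toy `SOURCE` file are assumptions for illustration; a production system would use a multi-language parser rather than `ast`, but the principle is the same.

```python
import ast

# Illustrative toy input; any parseable Python source works.
SOURCE = '''\
import json

def parse_config(path: str) -> dict:
    raw = open(path).read()
    data = json.loads(raw)
    validate_schema(data)
    return data

def validate_schema(data: dict) -> None:
    for key in ("host", "port"):
        if key not in data:
            raise KeyError(key)
'''

def chunk_by_definition(source: str) -> list[str]:
    """Emit one chunk per top-level function or class, so signatures,
    docstrings, and bodies always travel together."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef,
                             ast.ClassDef)):
            # lineno/end_lineno (Python 3.8+) give the node's full line span.
            chunks.append("\n".join(lines[node.lineno - 1:node.end_lineno]))
    return chunks

chunks = chunk_by_definition(SOURCE)
```

Each chunk is now a syntactically complete unit: the retriever sees `parse_config` with its return type and its call to `validate_schema` intact, instead of an arbitrary window of tokens.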