RAG for Code Is Not RAG for Documents


Retrieval-Augmented Generation became mainstream in the context of documents. Legal contracts, support tickets, PDF archives — the mental model is straightforward. Chunk the text, embed it, retrieve the relevant pieces, feed them to the model. That model works well enough in text-heavy domains that it became the default template.

Codebases are not documents. Applying document RAG patterns to code retrieval produces systems that are technically functional and practically frustrating.

The differences are specific. They have real consequences. And most teams only discover them after they've already built something that underperforms.


Why Code Breaks the Standard Model

Document RAG is built around semantic similarity between natural language queries and natural language content. Ask a question in plain English, retrieve paragraphs that discuss similar ideas. The signal lives in the words.

Code doesn't work that way. The signal lives in structure.

A function named proc_evt tells you almost nothing semantically. The meaning is in the function that calls it, the interface it satisfies, and the service that depends on its return type. Query for "event processing" and you might miss it entirely, because the abbreviated name never spells out "event" or "processing." The semantic distance between a developer's intent and the actual implementation is often enormous.

This is the first gap standard RAG fails to close: codebases have their own language, and it doesn't resemble the questions people ask about them.

The second gap is structural. Documents are mostly flat. Paragraphs relate to adjacent paragraphs. Sections relate to chapters. The relationship graph is essentially linear. Code is a dependency graph — functions call other functions, modules import other modules, types propagate across files, interfaces span layers. A single class might only make sense in the context of three other files it interacts with. Chunk it in isolation and you've retrieved a piece with no context for what it actually does.

The third gap is temporal. Documents, once written, are largely stable. Codebases change constantly. Embeddings built last sprint may already be stale. A function that moved, a module that was renamed, a service that was split into two — retrieval systems built on outdated indexes return confident results that point to the wrong place.

Three gaps. Most teams are only aware of one, if any.


What Hybrid Search Actually Means for Code

Semantic search finds conceptually similar content. Keyword search finds exact matches. For documents, semantic search is usually the winner because natural language has enough variation that keyword matching misses too much.

For code, the equation shifts.

Identifiers matter. If a developer asks about AuthTokenValidator, they mean AuthTokenValidator — not a conceptually similar function that validates something else. Exact keyword matching on symbol names is often the correct retrieval strategy for that query. Semantic search might surface a password hasher because it's "related to authentication." That's not useful.

On the other hand, a developer asking "where does the system handle expired sessions?" may not know the exact function name. The answer might live in something called check_state_integrity or session_gc. Keyword matching fails here. Semantic search is the right tool.

The practical answer is hybrid retrieval: run both, fuse the results, and let ranking determine what surfaces. Reciprocal Rank Fusion (RRF) is the common mechanism — it combines ranked lists without needing to normalize scores across incompatible embedding and keyword scoring systems. It's not elegant in the way a clean mathematical solution is elegant. It works because it's pragmatic.
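As a rough illustration of the fusion step, here is a minimal RRF sketch. Each result scores the sum of 1 / (k + rank) over the lists it appears in, so only ranks are used and raw scores never need to be comparable. The chunk IDs and the two result lists are hypothetical; k=60 is a commonly used default, not a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one ranking.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Raw retrieval scores are never used.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results for the query "AuthTokenValidator".
keyword_hits = [
    "auth/validator.py::AuthTokenValidator",
    "auth/tokens.py::decode_token",
]
semantic_hits = [
    "auth/passwords.py::hash_password",
    "auth/validator.py::AuthTokenValidator",
]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
```

The validator appears in both lists, so it ends up first in the fused ranking even though semantic search alone ranked the password hasher above it.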

The teams that build code RAG on pure semantic search discover the identifier problem in production. The teams that build on pure keyword matching rediscover why they wanted semantic search in the first place. Hybrid is the uncomfortable middle ground that's actually correct.


The Chunking Problem Nobody Talks About Enough

In document RAG, chunking strategy is a tuning parameter. Try 256 tokens, try 512, try overlapping chunks, measure retrieval quality, adjust. The underlying assumption is that all content is more or less interchangeable paragraphs of text.

For code, chunking is a semantic question.

Functions are the natural unit of executable behavior. Splitting a function across two chunks doesn't just reduce retrieval quality — it produces chunks that are semantically broken. Half a function, retrieved and fed to an LLM, provides context that is wrong in a specific way: it looks complete but isn't.

Classes present a different problem. A class definition might span hundreds of lines. Treating it as a single chunk can exceed the embedding model's token limit. Splitting it arbitrarily loses the relationship between methods. A reasonable answer is to chunk at method boundaries while preserving class-level context in metadata, so each method chunk knows it belongs to a specific class, which implements a specific interface, in a specific module. Retrieval returns the method; the context tells the model where the method lives.
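One way to sketch method-boundary chunking for Python source is with the standard-library ast module (Python 3.9+ for ast.unparse). The function name chunk_methods, the metadata keys, and the sample class are all assumptions made for illustration. Other languages would need their own parser, and this sketch ignores decorators, nested classes, and module-level functions.

```python
import ast


def chunk_methods(source, module_name):
    """Return one chunk per method, tagged with class-level context."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if not isinstance(node, ast.ClassDef):
            continue  # this sketch only handles top-level classes
        bases = [ast.unparse(base) for base in node.bases]
        for item in node.body:
            if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                chunks.append({
                    # Exact source text of the method, never split mid-body.
                    "text": ast.get_source_segment(source, item),
                    "metadata": {
                        "module": module_name,
                        "class": node.name,
                        "bases": bases,
                        "method": item.name,
                        "start_line": item.lineno,
                        "end_line": item.end_lineno,
                    },
                })
    return chunks


# Hypothetical class used only to exercise the chunker.
SAMPLE = '''
class SessionStore(BaseStore):
    def get(self, key):
        return self._data[key]

    def session_gc(self):
        self._data.clear()
'''

chunks = chunk_methods(SAMPLE, "app.sessions")
```

Each chunk carries the method body as its text, while the class name, base classes, and module travel alongside it as metadata that can be prepended to the prompt or used as a filter at query time.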

Comments, imports, type signatures, docstrings — each of these carries different weight depending on the query. A type signature might be irrelevant to a question about business logic. It might be the entire point of a question about interfaces. Chunks that strip this information too aggressively lose signal. Chunks that preserve everything indiscriminately dilute it.

There's no universal answer here. The right chunking strategy depends on the codebase, the team's query patterns, and what the retrieval system is ultimately used for. What's clear is that applying fixed-size text chunking to source code is a mistake with predictable consequences.


Reranking: Where Retrieval Quality Is Actually Decided

Standard RAG retrieves the top-k results and passes them directly to the model. For documents, this works well enough when the embedding model is good and the query is clean.

For code, top-k retrieval is often the beginning of the problem, not the end of it.

The embedding space that represents semantic similarity is not the same as the space that represents "what a developer needs to understand this codebase right now." A query about database connection handling might retrieve eight files that mention databases and two that don't seem related at all. Those two might be the most important: the configuration layer that controls connection pooling, the error handler that processes timeouts.

Reranking adds a second-pass model that evaluates retrieved candidates in the context of the specific query. The retrieval step optimizes for recall — get the relevant content into the candidate set. The reranking step optimizes for precision — put the most useful content first.

This distinction matters because many LLMs show position bias. Content near the start of the context window tends to get more attention than content buried in the middle. Passing a list of retrieved chunks in raw embedding-score order to an LLM means the most relevant chunk might sit at position seven, behind six less useful results. The model sees it. It doesn't weight it the way it would if it appeared first.
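The two-stage shape can be sketched in a few lines. The scorer below, token_overlap, is a deliberately crude stand-in so the example runs without a model; in practice the second pass would typically be a cross-encoder or similar reranking model. The function names, the query, and the candidate strings are all hypothetical.

```python
def rerank(query, candidates, score_fn, top_n=3):
    """Second pass: re-score each candidate against the query, best first."""
    scored = [(score_fn(query, cand), cand) for cand in candidates]
    # Sort on score only; equal scores keep their first-stage order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [cand for _, cand in scored[:top_n]]


def token_overlap(query, candidate):
    """Toy scorer: fraction of query tokens that appear in the candidate."""
    q_tokens = set(query.lower().split())
    c_tokens = set(candidate.lower().replace("_", " ").split())
    return len(q_tokens & c_tokens) / len(q_tokens) if q_tokens else 0.0


# Candidates in the order a hypothetical first-stage retriever returned them.
candidates = [
    "def open_db(): connect to the database",
    "def render_page(): build html",
    "def pool_config(): database connection pool size and timeout handling",
]

top = rerank("database connection handling", candidates, token_overlap, top_n=2)
```

The pooling configuration was last in the first-stage order and comes out first after reranking, which is the position where the model is most likely to use it.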

Reranking is where retrieval quality is actually decided. The teams that skip it are measuring their systems on recall and wondering why the end-to-end experience feels inconsistent.


Graph Context: The Piece Most RAG Systems Skip

Document retrieval doesn't need a graph because documents don't have one. Code does.

When a developer asks about a specific function, often what they actually need is the function plus the functions it calls plus the interfaces it satisfies plus the types it depends on. A flat embedding search returns the function. The context required to actually understand or modify it might be distributed across four other files.

Graph-aware retrieval treats the codebase as what it actually is: a dependency graph. Starting from a retrieved node, it expands outward through import edges and call relationships to bring in connected context. The retrieved function arrives with its callers, its dependencies, its type context — not as isolated text, but as a connected subgraph of meaning.

This is computationally heavier than flat retrieval. It requires maintaining a graph index alongside the vector index. It requires decisions about how far to expand — too shallow and you miss critical context; too deep and you flood the context window with irrelevant code.
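The depth trade-off can be made concrete with a bounded breadth-first expansion. This is a minimal sketch: the graph is a plain dict of outgoing edges (calls and imports) with hypothetical node names, and max_depth is the knob that decides how far to expand. Pulling in callers as well would need a second, reversed edge map.

```python
from collections import deque


def expand_context(graph, start, max_depth=1):
    """Breadth-first expansion from a retrieved node.

    Returns a dict mapping each reached node to its distance from start,
    never following edges beyond max_depth hops.
    """
    seen = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        depth = seen[node]
        if depth == max_depth:
            continue  # do not expand past the depth budget
        for neighbor in graph.get(node, ()):
            if neighbor not in seen:
                seen[neighbor] = depth + 1
                queue.append(neighbor)
    return seen


# Hypothetical dependency edges: node -> things it calls or imports.
DEPS = {
    "handlers.login": ["auth.AuthTokenValidator", "db.get_user"],
    "auth.AuthTokenValidator": ["auth.decode_token"],
    "db.get_user": ["db.pool"],
    "auth.decode_token": ["crypto.verify"],
}

shallow = expand_context(DEPS, "handlers.login", max_depth=1)
deeper = expand_context(DEPS, "handlers.login", max_depth=2)
```

With max_depth=1 the login handler arrives with its two direct dependencies. With max_depth=2 the token decoder and connection pool come along as well, and crypto.verify is still left out.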

It's also the difference between a retrieval system that knows where a function is and one that understands what that function does in the context of the system it belongs to.


Where the Industry Is

The document RAG pattern is well-understood. The literature is extensive, the tooling is mature, the benchmarks are established. Code RAG is earlier.

Most production code retrieval systems are document RAG applied to code, with varying degrees of code-specific modification. The teams that have built these systems have learned the same lessons repeatedly: chunking strategy matters more than the embedding model, hybrid search outperforms either approach alone, and stale indexes quietly erode quality until something breaks in a way that's hard to diagnose.

What's not yet solved cleanly: evaluation. Measuring document RAG quality has a reasonably established playbook — retrieval precision, recall, end-to-end generation quality on benchmark question sets. Code RAG evaluation is less settled. The ground truth for "did this retrieval help the developer?" is hard to define and harder to automate. Most teams are measuring proxy metrics rather than actual developer productivity impact.
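For reference, the proxy metrics in question are usually simple to compute once someone has labeled which chunks are relevant to a query. Below is a minimal sketch of recall@k and reciprocal rank. The chunk IDs and the relevance labels are hypothetical, and neither number says anything about whether the developer was actually helped.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant chunk IDs found in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(set(relevant))


def reciprocal_rank(retrieved, relevant):
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    relevant = set(relevant)
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0


# Hypothetical labeled example: two relevant chunks, one of them retrieved.
retrieved = ["a", "b", "c", "d"]
relevant = {"c", "x"}

r_at_2 = recall_at_k(retrieved, relevant, k=2)
r_at_3 = recall_at_k(retrieved, relevant, k=3)
rr = reciprocal_rank(retrieved, relevant)
```

Averaging reciprocal rank over a query set gives mean reciprocal rank (MRR), a common companion to recall@k.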

What's emerging: retrieval systems that are aware of code change velocity. Not just indexing at a point in time but monitoring the codebase for changes and updating incrementally. The problem of stale embeddings is tractable. The tooling for handling it at scale is still catching up to where it needs to be.
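One simple way to approach incremental updates is to store a content hash per file and re-embed only what changed. The sketch below assumes the index keeps a path-to-hash map; the function names and file contents are hypothetical. A real system would more likely hash per chunk and hook into version control rather than rescan every file.

```python
import hashlib


def content_hash(text):
    """Stable fingerprint of a file's contents."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_reindex(indexed_hashes, current_files):
    """Compare stored hashes against the current file contents.

    Returns (to_embed, to_delete):
      to_embed  - new or changed paths that need fresh embeddings
                  (old vectors for changed paths must be replaced)
      to_delete - paths no longer present, whose vectors should be dropped
    """
    current = {path: content_hash(text) for path, text in current_files.items()}
    to_embed = sorted(p for p, h in current.items() if indexed_hashes.get(p) != h)
    to_delete = sorted(p for p in indexed_hashes if p not in current)
    return to_embed, to_delete


# Hypothetical state: what was indexed last sprint vs. the repo today.
indexed = {
    "a.py": content_hash("x = 1"),
    "b.py": content_hash("y = 2"),
    "c.py": content_hash("z = 3"),
}
today = {"a.py": "x = 1", "b.py": "y = 20", "d.py": "w = 4"}

to_embed, to_delete = plan_reindex(indexed, today)
```

Here a.py is untouched and skipped, b.py changed and d.py is new so both are re-embedded, and c.py was removed so its vectors are deleted.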

The gap between document RAG and code RAG is understood. It's being closed. The teams building on top of generic document retrieval pipelines are going to keep running into the same friction until they acknowledge that the problem is different.

Most of them already know it. Most of them haven't done anything about it yet.


Code retrieval requires code-specific embeddings. The Pyckle Embeddings API — trained on code-to-query pairs across five programming languages — is live at pyckle.co/products.
