RAG for Code Is Not RAG for Documents

There is a common assumption in the AI tooling space that retrieval-augmented generation is retrieval-augmented generation. Feed documents into a vector database, embed queries, return the top-k results, hand them to a language model. Problem solved.

That assumption works reasonably well for enterprise document retrieval. Knowledge bases, support tickets, policy documents—prose that was written to be read, by humans, in sequence.

Code was not written that way. And treating it as if it were explains why so many "context-aware" coding tools feel half-finished.

What RAG Is Solving (and Why the Defaults Break for Code)

The core problem retrieval-augmented generation addresses is context window size. Language models have finite context. When the relevant information lives outside that window—buried in a codebase with hundreds of thousands of lines—you need a way to find what matters and pull it in.

Document RAG handles this by chunking text (usually by token count or paragraph), embedding each chunk, storing it, and retrieving semantically similar chunks at query time. This works when the semantic content of a passage is self-contained. A paragraph about refund policies uses much the same language as a question about getting a refund, so the two land close together in embedding space.

Code does not behave this way. A function definition tells you almost nothing without its call sites. A class definition becomes meaningful when you understand what inherits from it. An interface exists in relationship to its implementations. The semantic content of a code chunk is almost always incomplete without graph context—the web of imports, calls, and dependencies that gives it meaning.

Chunking a function definition and embedding it in isolation is like embedding a single frame from a film and calling it a movie summary. Technically possible. Practically misleading.

What's Actually Different About Code RAG

The units of meaning are structural, not textual.

In prose, a sentence or paragraph is often a complete unit of thought. In code, the meaningful units are structural: functions, classes, modules, interfaces, test cases. These don't align with token counts. A three-line function might be the most critical piece of logic in the system. A 200-line configuration block might be nearly irrelevant.

A naive chunking strategy, splitting by token count, will split functions across chunks, merge unrelated methods, and produce fragments that embed poorly because they don't represent complete ideas. Good code RAG chunks along abstract syntax tree (AST) boundaries, not token or byte offsets.
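As a rough illustration of chunking along syntax boundaries, here is a minimal sketch using Python's standard ast module. It assumes the code being indexed is Python and emits one chunk per top-level function or class. The function name chunk_by_ast and the SAMPLE source are made up for this example; a production chunker would also need to handle nested definitions, very large classes, and other languages.

```python
import ast

def chunk_by_ast(source: str) -> list[dict]:
    """Return one chunk per top-level function or class definition."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the definition, never cut mid-function.
                "text": ast.get_source_segment(source, node),
            })
    return chunks

# Hypothetical sample module, used only to exercise the sketch.
SAMPLE = '''\
import os

def is_admin(user):
    return user.role == "admin"

class Session:
    def __init__(self, user):
        self.user = user

    def close(self):
        self.user = None
'''

chunks = chunk_by_ast(SAMPLE)
for c in chunks:
    print(c["kind"], c["name"], c["start_line"], c["end_line"])
```

Each chunk is a complete definition, so the three-line function and the larger class each become one embeddable unit regardless of their token counts.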

Identifiers are a specialized vocabulary.

Generic embedding models—the kind trained on internet text—have never seen useAuthorizationMiddleware, processIncomingWebhookPayload, or whatever abbreviation system the team settled on three years ago when they were in a hurry. These identifiers carry enormous semantic weight in the codebase. They're the language the team uses to think.

Generic embeddings flatten this. They treat identifiers as unknown tokens or map them to vague approximations based on subword splits. The result is retrieval that finds things that sound related in English but misses things that are actually related in the codebase's own internal language.

Custom embeddings trained on the specific codebase—or at minimum on a broader corpus of code—close this gap. The difference in retrieval precision is not marginal.

Relevance is relational, not just semantic.

When a developer asks why a service is failing, the relevant context might be: the function being called, the interface it's supposed to implement, the module that initializes it, and the test case that was supposed to catch this exact scenario. These four artifacts might not share any textual similarity. They're related by structure, not semantics.

Retrieval that only understands semantic similarity will return functions that sound like they might be relevant. Retrieval that understands the dependency graph will return functions that actually are.
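To make "related by structure" concrete, the sketch below extracts direct call edges from a Python module with the standard ast module and looks up a function's callers and callees. It is a simplification under stated assumptions: it only sees calls to plain names defined at the top level of the same module, and the names build_call_graph, related, and the sample functions are invented for illustration. Real dependency graphs also need imports, methods, and cross-file resolution.

```python
import ast
from collections import defaultdict

def build_call_graph(source: str) -> dict[str, set[str]]:
    """Map each top-level function to the top-level functions it calls directly."""
    tree = ast.parse(source)
    defined = {n.name for n in tree.body if isinstance(n, ast.FunctionDef)}
    calls = defaultdict(set)
    for fn in tree.body:
        if not isinstance(fn, ast.FunctionDef):
            continue
        for node in ast.walk(fn):
            # Only plain-name calls such as foo(...); attribute calls are ignored.
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                if node.func.id in defined:
                    calls[fn.name].add(node.func.id)
    return dict(calls)

def related(graph: dict[str, set[str]], name: str) -> dict[str, list[str]]:
    """Return the direct callers and callees of a function."""
    callees = graph.get(name, set())
    callers = {caller for caller, targets in graph.items() if name in targets}
    return {"callers": sorted(callers), "callees": sorted(callees)}

# Hypothetical module: a service, its helpers, and the test that covers it.
SAMPLE = '''\
def load_config(path):
    return {}

def connect(cfg):
    return cfg

def start_service(path):
    cfg = load_config(path)
    return connect(cfg)

def test_start_service():
    assert start_service("x") is not None
'''

graph = build_call_graph(SAMPLE)
neighbors = related(graph, "start_service")
print(neighbors)
```

None of these four functions share much vocabulary, yet the graph links the failing service to its helpers and to the test that exercises it.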

Hybrid Search: Why Semantic Alone Isn't Enough

A related misconception is that dense vector search (embedding similarity) is all you need once you have good embeddings.

It isn't.

Keyword search (BM25, TF-IDF, and their relatives) has a property that semantic search lacks: exact match reliability. When a developer types the name of a specific function or class, they want that exact thing. Semantic search will find related concepts. It may or may not surface the exact match, depending on where that identifier happens to land in the embedding space.

Hybrid search fuses both approaches, typically using Reciprocal Rank Fusion or learned ranking models to combine semantic and keyword signals. The function that exactly matches the query term and is semantically relevant scores higher than something that's only similar in embedding space.
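Reciprocal Rank Fusion is simple enough to sketch in a few lines. The version below assumes each retriever returns an ordered list of chunk identifiers and uses k=60, a commonly used default for the smoothing constant. The two result lists and their identifiers are invented for illustration.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists; each appearance adds 1 / (k + rank) to an item's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical results for the query "verify_token".
keyword_hits = ["auth.verify_token", "auth.refresh_token", "billing.charge"]
semantic_hits = ["session.validate", "auth.verify_token", "auth.refresh_token"]

fused = reciprocal_rank_fusion([keyword_hits, semantic_hits])
print(fused)
```

The exact match ranks first because it appears near the top of both lists, while session.validate, which only the semantic retriever liked, drops below both auth functions. RRF uses ranks rather than raw scores, so the two retrievers do not need comparable score scales.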

This sounds like an implementation detail. In practice it's the difference between a developer finding what they're looking for and a developer wondering why the tool returned something plausible but wrong.

Where Reranking Fits In

Retrieval returns candidates. Reranking selects which candidates are actually useful.

The standard approach: retrieve a broader set (top-20, top-50) using fast vector search, then apply a more expensive cross-encoder model to score each candidate in the context of the actual query. The cross-encoder sees the query and the candidate together, which gives it information the bi-encoder embedding model never had.

For code, reranking has an additional role: filtering for recency and relevance to the current working context. A function that was heavily modified last week is probably more relevant than a function with a similar name that hasn't been touched in two years. Embeddings don't know this. A reranker with access to metadata can.
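One way to picture a metadata-aware reranker is as a blend of a relevance score and a recency signal. The sketch below is an assumption-heavy illustration, not a recommended formula: the relevance field stands in for whatever a cross-encoder would produce (assumed to be in the range 0 to 1), and the half-life, the weight, and the candidate names are arbitrary values chosen for the example.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    relevance: float          # stand-in for a cross-encoder score, assumed in [0, 1]
    days_since_change: float  # taken from version-control metadata

def rerank(candidates, half_life_days=30.0, recency_weight=0.3, top_n=5):
    """Order candidates by a weighted mix of relevance and exponential recency decay."""
    def final_score(c: Candidate) -> float:
        # Recency is 1.0 for code changed today and halves every half_life_days.
        recency = 0.5 ** (c.days_since_change / half_life_days)
        return (1 - recency_weight) * c.relevance + recency_weight * recency
    return sorted(candidates, key=final_score, reverse=True)[:top_n]

# Hypothetical candidates from a first-stage retriever.
candidates = [
    Candidate("parse_invoice_v1", relevance=0.82, days_since_change=730),
    Candidate("parse_invoice", relevance=0.78, days_since_change=7),
    Candidate("format_invoice", relevance=0.30, days_since_change=3),
]

ranked = rerank(candidates)
print([c.name for c in ranked])
```

Here the recently modified parse_invoice overtakes the slightly better-matching but long-untouched parse_invoice_v1, while a fresh but weakly relevant function still ranks last. How much weight recency deserves depends on the codebase and is worth tuning against real queries.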

The challenge is cost. Cross-encoders are slower. Running one over 50 candidates on every query means scoring 50 query-candidate pairs, which adds latency and compute cost at scale. There are architectural trade-offs between retrieval speed and ranking quality, and different use cases sit at different points on that curve.

What Developers Actually Run Into

The failures are usually not catastrophic. They're the slow erosion of trust.

A developer asks the coding assistant to add error handling to a function. The assistant generates something plausible, but it doesn't match the error patterns the rest of the codebase uses, because those patterns lived in a different module that retrieval didn't surface. The developer accepts it, or rewrites it. Either way, some of the value the assistant was supposed to provide was lost.

This happens dozens of times per day. The assistant is helpful enough that developers keep using it. It's wrong enough that developers learn not to trust it without verification.

The root cause is almost always retrieval. The generation layer—the language model—is typically capable. What it receives is incomplete, and it does the best it can with what it has.

Getting retrieval right closes this gap. Not completely. Enough to matter.

Where the Industry Is

Document RAG is a solved problem in the sense that the basic architecture is well understood and the tooling ecosystem is mature. Chunking, embedding, storing, retrieving—this is commodity infrastructure at this point.

Code RAG is earlier. The problems are well defined. The solutions are still fragmented.

Most teams are running document-style retrieval on code, because the tooling defaults assume text. They're using generic embedding models because fine-tuned code embeddings are harder to integrate. They're getting semantic search without keyword fusion, and keyword fusion without graph-aware retrieval.

The teams that have invested in getting this right—AST-aware chunking, domain-trained embeddings, hybrid search, graph traversal for context expansion—see measurably different results. The retrieval is more precise. The context passed to the model is more relevant. The generations are better.

The gap between teams doing this well and teams doing it at defaults is larger than most people realize. Partly because the failures are quiet and slow rather than loud and obvious.

Partly because measuring retrieval quality is genuinely hard, and most teams aren't doing it.
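A starting point for measurement does not need much machinery. The sketch below computes two standard retrieval metrics, recall at k and mean reciprocal rank, over a small labeled set. It assumes someone has recorded, for a handful of real queries, which chunks were actually needed; the case data here is placeholder input, and building that labeled set is the hard part this code does not solve.

```python
def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, item in enumerate(retrieved, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0

def evaluate(cases: list[tuple[list[str], set[str]]], k: int = 3) -> dict[str, float]:
    """Average both metrics over (retrieved, relevant) pairs."""
    n = len(cases)
    return {
        "recall_at_k": sum(recall_at_k(r, rel, k) for r, rel in cases) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in cases) / n,
    }

# Placeholder evaluation set: what retrieval returned vs. what was needed.
cases = [
    (["a", "b", "c", "d"], {"b", "d"}),
    (["x", "y", "z"], {"q"}),
]

metrics = evaluate(cases, k=3)
print(metrics)
```

Tracking even these two numbers across changes to chunking, embeddings, or fusion turns quiet failures into something visible.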

What's Not Solved

Honest assessment: the hardest problem in code RAG isn't retrieval. It's keeping embeddings current.

Codebases change constantly. Functions get renamed. Modules get reorganized. The logic that was in utils/auth.ts last month is in services/identity/core.ts now. Embeddings that were accurate when indexed become stale. Retrieval that was precise becomes imprecise. Developers notice, usually by finding that the assistant confidently references a function that no longer exists.

Continuous re-indexing is the answer in theory. In practice it requires infrastructure: monitoring file changes, triggering incremental updates, invalidating stale vectors without re-embedding the entire codebase. It is tractable but not trivial.
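The core of incremental updating can be sketched with content hashes: re-embed only chunks whose text changed, and delete vectors whose chunks no longer exist. The sketch below assumes chunks are keyed by a "path::symbol" identifier and that the index stores a hash per chunk. That key scheme, the function names, and the verifyToken and formatDate symbols are assumptions made for this example.

```python
import hashlib

def fingerprint(text: str) -> str:
    """Stable content hash used to detect changed chunks."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed: dict[str, str], current_chunks: dict[str, str]):
    """Compare stored hashes with current chunk text.

    indexed: chunk_id -> hash recorded at last indexing
    current_chunks: chunk_id -> current source text
    Returns (ids to embed, ids whose stale vectors should be deleted).
    """
    current = {cid: fingerprint(text) for cid, text in current_chunks.items()}
    to_embed = sorted(cid for cid, h in current.items() if indexed.get(cid) != h)
    to_delete = sorted(cid for cid in indexed if cid not in current)
    return to_embed, to_delete

# Hypothetical state: a function moved from utils/auth.ts to services/identity/core.ts.
indexed = {
    "utils/auth.ts::verifyToken": fingerprint("function verifyToken() {}"),
    "utils/date.ts::formatDate": fingerprint("function formatDate() {}"),
}
current_chunks = {
    "services/identity/core.ts::verifyToken": "function verifyToken() {}",
    "utils/date.ts::formatDate": "function formatDate() {}",
}

to_embed, to_delete = plan_reindex(indexed, current_chunks)
print("embed:", to_embed)
print("delete:", to_delete)
```

The unchanged chunk is left alone, the moved function is embedded under its new location, and the vector for the old path is marked for deletion so the assistant stops citing it. Because the key includes the path, a pure move still costs one re-embedding; keying by content hash alone would avoid that at the price of more bookkeeping.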

There's also the query understanding problem. Developers don't always query with the right language. "How does billing work?" is a reasonable question. Depending on how the codebase is organized, the relevant code might live in modules with names like transactionProcessor, invoiceService, and paymentGateway—none of which contain the word "billing." Bridging the gap between how developers think and how code is named is partially a retrieval problem and partially a vocabulary problem.
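One low-tech way to narrow the vocabulary gap is query expansion against a glossary that maps domain words to the identifiers the codebase actually uses. The sketch below assumes such a glossary exists and is maintained by hand; that is an assumption for illustration, and in practice the mapping might instead be derived from documentation or commit history. The function name expand_query is hypothetical.

```python
import re

# Hypothetical, hand-maintained mapping from domain terms to code identifiers.
GLOSSARY = {
    "billing": ["transactionProcessor", "invoiceService", "paymentGateway"],
}

def expand_query(query: str, glossary: dict[str, list[str]]) -> list[str]:
    """Return the query's words plus any identifiers the glossary associates with them."""
    terms = re.findall(r"[A-Za-z_]\w*", query)
    expanded = list(terms)
    for term in terms:
        for alias in glossary.get(term.lower(), []):
            if alias not in expanded:
                expanded.append(alias)
    return expanded

expanded = expand_query("How does billing work?", GLOSSARY)
print(expanded)
```

The expanded terms can be handed to the keyword side of a hybrid search, so a question phrased in business language still reaches modules that never mention the word.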

The Direction This Is Going

The next meaningful improvement in code RAG isn't in the embedding models. It's in the retrieval architecture—specifically, in treating code as a graph problem rather than a document problem.

Graph-aware retrieval that traverses import edges, surfaces call hierarchies, and expands context through dependency relationships is where the precision gains are. Language models don't need more tokens. They need the right tokens. Getting to "the right tokens" requires understanding how code is actually structured—which is not the same as how documents are structured.
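The idea of expanding context through dependency edges while staying within a token limit can be sketched as a bounded breadth-first traversal. Everything specific below is an assumption for illustration: the edge map, the per-file token sizes, the hop limit, and the budget are invented, and a real system would likely weight edges and rank neighbors rather than take them in listed order.

```python
from collections import deque

def expand_context(seeds, edges, sizes, token_budget, max_hops=2):
    """Breadth-first expansion from retrieved seeds along dependency edges.

    seeds: node ids returned by first-stage retrieval
    edges: node id -> list of node ids it imports or calls
    sizes: node id -> approximate token count
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicates, keep order
    selected, used = [], 0
    seen = set(seeds)
    queue = deque((s, 0) for s in seeds)
    while queue:
        node, hops = queue.popleft()
        cost = sizes.get(node, 0)
        if used + cost > token_budget:
            continue  # does not fit; skip it and do not explore its neighbors
        selected.append(node)
        used += cost
        if hops < max_hops:
            for nxt in edges.get(node, []):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append((nxt, hops + 1))
    return selected, used

# Hypothetical import graph and token sizes.
edges = {
    "api/handler.py": ["services/orders.py", "lib/logging.py"],
    "services/orders.py": ["models/order.py", "lib/logging.py"],
    "models/order.py": ["lib/db.py"],
}
sizes = {
    "api/handler.py": 300,
    "services/orders.py": 500,
    "lib/logging.py": 900,
    "models/order.py": 200,
    "lib/db.py": 400,
}

selected, used = expand_context(["api/handler.py"], edges, sizes, token_budget=1200)
print(selected, used)
```

With a 1200-token budget the traversal keeps the handler, the service it calls, and the model that service depends on, and it skips the large logging module that would have blown the budget. Fewer tokens, chosen by structure rather than by similarity alone.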

The tooling is catching up to this understanding. Not quickly, but in the right direction.

Teams that start building retrieval infrastructure designed for code—rather than retrofitting document infrastructure—will have a compounding advantage. Better context means better generations. Better generations mean more trust. More trust means the tool actually changes how the team works, rather than existing as a feature nobody quite relies on.

Most teams are still in the retrofitting stage. The ones who aren't are noticeably further along.