```yaml
---
title: "RAG for Code: A Complete Guide"
subtitle: "Retrieval-Augmented Generation for Software Repositories"
author: "David Kelly Price"
version: "1.0"
date: "April 2026"
description: "A practitioner's guide to building, evaluating, and operating RAG systems over code — from chunking strategy to production deployment."
tags: [RAG, retrieval, code-search, embeddings, vector-stores, LLMs]
---
```

# RAG for Code: A Complete Guide

**David Kelly Price**
Version 1.0, April 2026

---

## Table of Contents

- [About This Guide](#about-this-guide)
- [Chapter 1: The Problem With Code Search](#chapter-1-the-problem-with-code-search)
- [Chapter 2: Breaking Code Into Chunks](#chapter-2-breaking-code-into-chunks)
- [Chapter 3: Embedding Code](#chapter-3-embedding-code)
- [Chapter 4: Vector Stores](#chapter-4-vector-stores)
- [Chapter 5: Sparse Retrieval](#chapter-5-sparse-retrieval)
- [Chapter 6: Hybrid Retrieval](#chapter-6-hybrid-retrieval)
- [Chapter 7: Reranking](#chapter-7-reranking)
- [Chapter 8: Evaluation](#chapter-8-evaluation)
- [Chapter 9: Production](#chapter-9-production)
- [Conclusion](#conclusion)
- [Appendix A: Glossary](#appendix-a-glossary)
- [Appendix B: Tools & Resources](#appendix-b-tools--resources)
- [Appendix C: Further Reading](#appendix-c-further-reading)

---

## About This Guide

This guide is for engineers who want to build retrieval systems over code and do it right the first time.

It assumes you know what a large language model is. It assumes you've seen a vector embedding before. What it doesn't assume is that you've built a production RAG pipeline, evaluated one honestly, or thought carefully about why naive approaches fail on code when they work fine on prose.

The code-specific part matters. Text retrieval is a mature field with decades of literature behind it. Code retrieval is its own domain, with different structure, a different vocabulary distribution, and different failure modes. A system that works well on documentation or support tickets will not simply transfer. The chunking is different. The embedding models are different. The evaluation metrics are different.

Each chapter builds on the previous one. The sequence is deliberate: chunking comes before embedding because your chunk boundaries determine what the embedding model ever sees. Sparse retrieval comes before hybrid retrieval because you can't fuse signals you don't understand individually. Evaluation comes late in the sequence but should start early in your build process — these chapters are ordered by dependency, not by priority.

Where choices have to be made, this guide recommends specific tools and approaches. Those recommendations reflect what works at production scale, not what's most popular or most recently marketed. The reasoning is included so you can update the recommendation when the landscape changes.

Code examples are in Python throughout. The principles apply to any stack.

---

## Chapter 1: The Problem With Code Search

### Why Search Fails on Code

Every engineering team has the same experience with code search. You open the search box, type something specific — "how does the auth middleware handle expired tokens" — and get back a list of files containing the words "auth," "middleware," "handle," "expired," or "tokens." The results are not useless. They're just not what you asked.

Traditional search is lexical. It matches terms. Code doesn't work that way. The function that handles expired tokens might be called `validate_session`, might live in a file called `guards.py`, and might not contain the word "expired" anywhere. The concept is there. The vocabulary isn't.

This is the fundamental problem. Code is written for machines to execute, not for keyword indexers to parse. Variable names are chosen for clarity within a codebase, not for retrieval accuracy. Domain-specific naming conventions, abbreviations, and idioms accumulate over time. A four-year-old Python repository has its own internal language that a search engine built for text has no way to understand.

RAG addresses this by separating the retrieval problem from the generation problem. Instead of asking a keyword search engine to answer your question, you ask a vector retrieval system to find the most relevant chunks of code, then pass those chunks as context to an LLM that synthesizes the actual answer. The LLM doesn't need to know where the answer lives — it just needs to see it.

> **Key Insight:** RAG shifts the hard work from generation to retrieval. The LLM's job is to reason over context it's given. Getting the right context is your job. Most quality problems in code RAG systems are retrieval problems, not generation problems.

### What Makes Code Different From Text

Before building anything, understand what makes code retrieval harder than document retrieval.

**Structure is semantic.** In prose, structure is mostly presentational — headers, paragraphs, lists. In code, structure carries meaning. A function definition, a class body, an import block, a decorator — these aren't formatting choices. They're semantic units. Chunking code by character count or line count destroys this structure. A chunk that cuts through the middle of a function captures neither its signature nor its body completely, and the embedding model trained on syntactically valid code may produce a worse representation for a syntactically mangled fragment.

**Token distribution is unusual.** Natural language has a power-law term distribution: a small number of common words appear very frequently, and the long tail is rare. Code mixes this with dense identifier namespaces, where every function name, class name, and variable name is essentially a low-frequency token. BM25 and TF-IDF, which work well for natural language, have trouble here: the tokens that identify what the code does are often unique within the corpus, and they only help when the query happens to contain them verbatim.

**Queries don't match code vocabulary.** A developer asking "how do I authenticate a user?" is using English. The code that performs that operation probably uses terms like `verify_credentials`, `jwt_decode`, `principal`, `claims`, or `401`. The semantic gap between question and answer is larger than in document retrieval.

**Duplicates and near-duplicates are common.** Code is refactored, copied, and adapted. Helper functions get duplicated across modules. Boilerplate appears repeatedly. Retrieval systems that don't account for this will surface the same conceptual chunk multiple times, wasting context window space.

**Code changes constantly.** Documentation updates when someone remembers to update it. Code changes every day. An index built on Friday can be stale by Monday. Staleness in a code RAG system produces a specific failure: confident answers about functionality that no longer exists.
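
The duplicate problem above can be reduced with a cheap exact-match filter before indexing. The sketch below is one possible approach, not a full near-duplicate detector: it hashes whitespace-normalized chunk text, so it only catches copies that differ in formatting. The names `normalized_hash` and `drop_duplicates` are illustrative.

```python
import hashlib
import re


def normalized_hash(code: str) -> str:
    """Hash code after collapsing whitespace so formatting-only copies collide."""
    collapsed = re.sub(r"\s+", " ", code).strip()
    return hashlib.sha256(collapsed.encode("utf-8")).hexdigest()


def drop_duplicates(chunks: list[str]) -> list[str]:
    """Keep the first occurrence of each distinct chunk, preserving order."""
    seen: set[str] = set()
    unique = []
    for chunk in chunks:
        key = normalized_hash(chunk)
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique
```

Copies that were adapted (renamed variables, an extra parameter) pass through this filter; catching those needs a similarity-based method.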

### The RAG Pipeline, Concisely

A code RAG pipeline has five stages:

1. **Chunk** — parse the codebase into semantically meaningful units
2. **Embed** — convert chunks to vectors using a code-aware embedding model
3. **Index** — store vectors in a queryable structure
4. **Retrieve** — given a query, find the most relevant chunks
5. **Generate** — pass retrieved chunks as context to an LLM and produce an answer

Each stage has decisions. Most of those decisions have consequences that compound. A bad chunking strategy produces bad chunks that produce bad embeddings that produce bad retrieval regardless of how good your vector store or LLM is. Garbage in, garbage out is not a cliché here — it's the dominant failure mode.

The rest of this guide works through each stage in detail.

---

### Chapter 1 Takeaways

- Lexical search fails on code because code vocabulary doesn't match query vocabulary.
- Code has structural semantics that text chunking strategies don't respect.
- Most quality problems in production code RAG are retrieval problems, not generation problems.
- Every stage of the pipeline affects every subsequent stage.

### Try This

Before building anything, run a baseline. Take ten representative queries from your target use case — the kinds of questions developers actually ask. Run them against your existing code search (GitHub search, grep, IDE indexer). Record the precision at 5: how many of the top five results are actually relevant? This is your floor. Everything you build should beat it. If it doesn't, you have a retrieval problem and more sophisticated generation won't fix it.
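
Precision at 5 is simple enough to compute by hand, but a small helper keeps the bookkeeping consistent across tools. This is a minimal sketch; it assumes you have labeled, for each query, the set of result identifiers (file paths, for example) that count as relevant. The function names are illustrative.

```python
def precision_at_k(retrieved: list[str], relevant: set[str], k: int = 5) -> float:
    """Fraction of the top-k slots filled by relevant results."""
    hits = sum(1 for item in retrieved[:k] if item in relevant)
    # Dividing by k (not by the number returned) penalizes short result lists.
    return hits / k


def mean_precision_at_k(runs: list[tuple[list[str], set[str]]], k: int = 5) -> float:
    """Average precision@k over (retrieved, relevant) pairs, one per query."""
    return sum(precision_at_k(ret, rel, k) for ret, rel in runs) / len(runs)
```

Run it once per search tool on the same ten queries so the numbers are comparable.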

---

## Chapter 2: Breaking Code Into Chunks

### Why Chunking Strategy Determines Everything Downstream

The chunk is the atomic unit of your retrieval system. Everything downstream — embedding quality, retrieval precision, context coherence — depends on whether your chunks represent meaningful, self-contained semantic units.

The goal is not to break the codebase into small pieces. The goal is to break the codebase into the smallest pieces that still make sense in isolation. A function definition makes sense in isolation. Half a function definition does not. A class with its methods makes sense. A class split across three chunks, with each chunk missing either the class signature or the method bodies, does not.

Three chunking strategies exist for code. They're listed here in increasing order of quality and complexity.

### Fixed-Size Chunking

Take the source text, split it every N characters or lines with some overlap between chunks. No parsing required.

This is the approach that ships fastest and performs worst. It's the right choice for prototyping, for getting something running, for verifying that the rest of your pipeline works before you invest in better chunking. It is not the right choice for production.

The problem is structural blindness. Fixed-size chunking doesn't know where functions begin or end. A chunk might start mid-function and end mid-comment. The embedding model, trained on syntactically valid code, produces a worse representation for a syntactically invalid fragment. Retrieval suffers.
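
For prototyping, the approach can be sketched in a few lines. This version splits by lines rather than characters; the default `size` and `overlap` values are arbitrary placeholders, not recommendations.

```python
def fixed_size_chunks(source: str, size: int = 40, overlap: int = 8) -> list[dict]:
    """Split source into windows of `size` lines that share `overlap` lines."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    lines = source.splitlines()
    step = size - overlap
    chunks = []
    for start in range(0, len(lines), step):
        window = lines[start:start + size]
        chunks.append({
            "content": "\n".join(window),
            "start_line": start + 1,            # 1-indexed, inclusive
            "end_line": start + len(window),
        })
        if start + size >= len(lines):
            break  # the last window already reached the end of the file
    return chunks
```

Nothing in this function looks at the code itself, which is exactly the structural blindness described above.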

> **Warning:** Overlap doesn't fix structural chunking problems — it just means the same broken boundary appears twice. If your chunks are cutting through function definitions, adding a 20% overlap means you're storing 20% more garbage, not less.

### Structural Chunking (Regex-Based)

Identify chunk boundaries using patterns rather than hard character counts. In Python, for example, you can split on `def ` and `class ` prefixes at the module level. This is better than fixed-size because at least your chunks tend to start at meaningful boundaries. It's still not great, because regex can't handle nested structures, decorators, multi-line function signatures, or anything unusual.

Structural chunking with regex is fast to implement and reasonable for single-language, well-structured codebases where you control the code style. It breaks on real-world code.
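
A minimal sketch of the regex approach for Python source, assuming definitions start at column 0. The pattern and function name are illustrative.

```python
import re

# Matches `def`, `async def`, or `class` at column 0 (module level only).
TOP_LEVEL_DEF = re.compile(r"^(?:async\s+def|def|class)\s+\w+", re.MULTILINE)


def regex_chunks(source: str) -> list[str]:
    """Split source at each module-level def/class; the preamble is one chunk."""
    starts = [m.start() for m in TOP_LEVEL_DEF.finditer(source)]
    if not starts:
        return [source] if source.strip() else []
    boundaries = [0] + starts if starts[0] != 0 else starts
    boundaries.append(len(source))
    pieces = [source[a:b] for a, b in zip(boundaries, boundaries[1:])]
    return [p for p in pieces if p.strip()]
```

The weakness shows up immediately: a decorator sits on the line above `def`, so it lands at the end of the preceding chunk instead of with the function it decorates.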

### AST-Based Chunking

Parse the source into an Abstract Syntax Tree and extract nodes that correspond to meaningful semantic units: function definitions, class definitions, module-level assignments, docstrings.

This is the correct approach. Because chunks are extracted as whole nodes from a valid parse tree, each one is a complete syntactic unit and is semantically meaningful by construction. Functions include their signature and body, and their decorators as long as the extractor starts at the first decorator line. Classes include their methods. The embedding model gets well-formed input every time.

For Python, the standard library `ast` module is sufficient and has no dependencies. For anything else, use Tree-Sitter.

```python
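import ast
import textwrap

def extract_chunks(source_code: str, filepath: str) -> list[dict]:
    """Extract every function and class definition as a standalone chunk."""
    tree = ast.parse(source_code)  # raises SyntaxError on unparseable files
    lines = source_code.splitlines()
    chunks = []

    # ast.walk also yields nested definitions, so a method appears both
    # inside its class chunk and as a chunk of its own.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # node.lineno points at the def/class line (Python 3.8+), so start
            # at the first decorator, if any, to keep decorators in the chunk.
            start_line = min([node.lineno] + [d.lineno for d in node.decorator_list])
            # Take whole lines and dedent so that methods parse on their own.
            chunk_source = textwrap.dedent(
                "\n".join(lines[start_line - 1:node.end_lineno])
            )
            if chunk_source:
                chunks.append({
                    "content": chunk_source,
                    "filepath": filepath,
                    "name": node.name,
                    "type": type(node).__name__,
                    "start_line": start_line,
                    "end_line": node.end_lineno,
                })

    return chunks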
```

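This is the basic form. Note that `ast.parse` raises `SyntaxError` on files that don't parse, and `ast.walk` visits nested definitions, so a method is emitted both inside its class chunk and as a chunk of its own. Real implementations add: deduplication, minimum/maximum size filters, metadata enrichment (module docstring, class context for methods), and explicit handling for nested definitions.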

### Multi-Language Chunking With Tree-Sitter

Python's `ast` module covers Python. For a polyglot repository — Python, TypeScript, Go, and Rust — you need something language-agnostic. Tree-Sitter is the answer.

Tree-Sitter is a parsing library with grammar implementations for 50+ languages. It produces concrete syntax trees (CSTs), which include all syntactic tokens, not just semantic nodes. The query interface lets you extract specific node types across any supported language using a consistent API.

```python
from tree_sitter_languages import get_language, get_parser

def extract_chunks_ts(source_code: str, language: str, filepath: str) -> list[dict]:
    source_bytes = source_code.encode("utf-8")
    parser = get_parser(language)
    tree = parser.parse(source_bytes)
    lang = get_language(language)

    # Query for function and class definitions. These node type names come
    # from the Python grammar; other languages need their own query string.
    query_string = """
        (function_definition name: (identifier) @name) @function
        (class_definition name: (identifier) @name) @class
    """
    query = lang.query(query_string)
    # Assumes a py-tree-sitter release where captures() returns
    # (node, capture_name) tuples; newer releases return a dict instead.
    captures = query.captures(tree.root_node)

    chunks = []
    seen_spans = set()

    for node, capture_name in captures:
        span = (node.start_byte, node.end_byte)
        if capture_name in ("function", "class") and span not in seen_spans:
            seen_spans.add(span)
            # Tree-Sitter offsets are byte offsets, so slice the encoded
            # bytes rather than the str (they differ for non-ASCII source).
            chunk_text = source_bytes[node.start_byte:node.end_byte].decode("utf-8")
            chunks.append({
                "content": chunk_text,
                "filepath": filepath,
                "type": capture_name,
                "start_line": node.start_point[0] + 1,
                "end_line": node.end_point[0] + 1,
            })

    return chunks
```

The query strings differ by language, but the parsing API is identical. This is the practical advantage of Tree-Sitter over writing per-language parsers.

> **Key Insight:** Tree-Sitter is resilient to parse errors by design. It produces partial trees for syntactically invalid input rather than failing. This matters for real codebases, which contain work-in-progress files, auto-generated files, and occasionally just broken code.

### Metadata Is Not Optional

Whatever chunking strategy you use, attach metadata to every chunk. At minimum: file path, chunk type (function, class, module), start and end line numbers, and the top-level name. Add the module docstring if available. Add the class name for method chunks so that `authenticate` in `UserManager` can be distinguished from `authenticate` in `OAuthProvider`.

Metadata serves three purposes. First, it supports filtered retrieval — "only search within the authentication module" — which dramatically improves precision for scoped queries. Second, it provides context to the LLM that reads the chunk: "this is the `authenticate` method of the `UserManager` class in `auth/managers.py`" is far more useful than a raw code block. Third, it enables debugging when retrieval produces bad results — you can inspect what was actually retrieved and why.
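
One way to capture class context for methods in Python is an `ast.NodeVisitor` that tracks the enclosing classes as it descends. The sketch below is illustrative: `ContextVisitor`, `describe`, and the `class_name` field are hypothetical names, and the header format is just one option.

```python
import ast


class ContextVisitor(ast.NodeVisitor):
    """Collect function metadata along with the enclosing class names."""

    def __init__(self, filepath: str):
        self.filepath = filepath
        self.class_stack: list[str] = []
        self.chunks: list[dict] = []

    def visit_ClassDef(self, node: ast.ClassDef) -> None:
        self.class_stack.append(node.name)
        self.generic_visit(node)  # descend into the class body
        self.class_stack.pop()

    def visit_FunctionDef(self, node) -> None:
        self.chunks.append({
            "name": node.name,
            "class_name": ".".join(self.class_stack),
            "filepath": self.filepath,
            "start_line": node.lineno,
            "end_line": node.end_lineno,
        })
        # No generic_visit here: nested functions stay part of their parent.

    visit_AsyncFunctionDef = visit_FunctionDef


def describe(chunk: dict) -> str:
    """Build a one-line header to prepend to the chunk shown to the LLM."""
    owner = f"{chunk['class_name']}." if chunk["class_name"] else ""
    return (f"# {owner}{chunk['name']} in {chunk['filepath']} "
            f"(lines {chunk['start_line']}-{chunk['end_line']})")
```

Usage: create `ContextVisitor("auth/managers.py")`, call `visit(ast.parse(source))`, then read `chunks`. Two functions named `authenticate` now differ in their `class_name` field.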

### Chunk Size and Overlap

For AST-based chunking, chunk size is determined by the semantic unit, not an arbitrary limit. Functions are as long as they are. This is correct.

For fixed-size or structural chunking, a common default is 512 tokens with 50–100 tokens of overlap. This is a starting point, not a universal answer. Code is denser than prose — 512 tokens of Python can represent a complex function. Monitor chunk size distributions in your corpus. If the median chunk is 80 tokens and you're capping at 512, your chunks are too small for the limit to matter. If many functions exceed 512 tokens and get cut, you need either a larger size limit or a strategy for handling long functions (truncate to signature + first N lines, or split by logical block).

---

### Chapter 2 Takeaways

- Fixed-size chunking is for prototyping, not production.
- AST-based chunking produces syntactically valid, semantically meaningful chunks by construction.
- Tree-Sitter handles multi-language repositories with a single consistent API.
- Metadata attached at chunking time enables filtered retrieval, better LLM context, and debugging.

### Try This

Run your chunking strategy on a representative sample of your codebase and compute the chunk size distribution. Count: What percentage of chunks are under 50 tokens? What percentage exceed your embedding model's context window? What percentage cut through function boundaries (check by looking for chunks that start with indented code rather than a definition keyword)? These numbers tell you whether your chunking is working before you touch embeddings or retrieval.
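
A rough version of this report fits in a few lines. The sketch below approximates token counts by splitting on whitespace, which undercounts relative to a subword tokenizer; for real measurements, swap in the tokenizer of your embedding model. `size_report` and its thresholds are illustrative.

```python
import statistics


def approx_token_count(text: str) -> int:
    """Crude stand-in for a tokenizer: count whitespace-separated pieces."""
    return len(text.split())


def size_report(chunks: list[str], small: int = 50, limit: int = 512) -> dict:
    """Summarize chunk sizes and flag chunks that start mid-block."""
    if not chunks:
        return {}
    counts = [approx_token_count(c) for c in chunks]
    n = len(chunks)
    return {
        "median": statistics.median(counts),
        "pct_under_small": 100 * sum(c < small for c in counts) / n,
        "pct_over_limit": 100 * sum(c > limit for c in counts) / n,
        # A chunk whose first character is whitespace likely starts mid-block.
        "pct_starting_indented": 100 * sum(c[:1] in (" ", "\t") for c in chunks) / n,
    }
```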

---

## Chapter 3: Embedding Code

### What Embeddings Do and Why Code-Specific Models Matter

An embedding model converts a chunk of code into a dense vector — a list of floating-point numbers, typically 768 or 1024 dimensions — where vectors close together in geometric space correspond to chunks with similar meaning. The retrieval system finds the vectors closest to the query vector.

The quality of everything that follows depends on the quality of these vectors. A poor embedding model produces vectors where "get user by ID" and "fetch_user_by_pk" are far apart in the embedding space, even though they're the same operation. A good code embedding model understands that relationship.

General-purpose text embedding models trained mostly on web text, books, and Wikipedia tend to produce weaker embeddings for code. Constructs like `async def`, `impl Trait for`, or `go func()` are rare in their training data, so the vocabulary and patterns are poorly represented and the resulting vectors capture code semantics less accurately.

Code-specific models are trained on GitHub, StackOverflow, and paired natural-language-code datasets. They understand that a docstring and the function it describes should be near each other in embedding space. They understand that `def authenticate(self, token: str) -> User` and "verify a user's auth token" mean the same thing.

### The Bi-Encoder Architecture

Nearly all embedding models used for first-stage retrieval are bi-encoders. The query and the document are encoded independently, so the model never sees them together, and the relevance score is a vector similarity between the two output vectors, usually cosine similarity or a dot product.

This architecture is what lets retrieval scale to large codebases. You encode the corpus once at indexing time. At query time, you encode the query (a single forward pass) and run a nearest-neighbor search against the pre-computed document vectors. Scoring one candidate costs time proportional to the vector dimension. With exact search the total cost grows linearly with corpus size; an approximate index (Chapter 4) keeps the growth roughly logarithmic.

The downside is that the model never sees the query and document together, so it can't reason about their relationship. A cross-encoder can do this but only at reranking time, after candidates are already retrieved. More on that in Chapter 7.
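
The encode-once, search-many pattern can be illustrated without any model at all. In the sketch below, small hand-written vectors stand in for model output, and the search is exact (brute force) rather than approximate. `build_index` and `search` are illustrative names.

```python
import math


def normalize(vec: list[float]) -> list[float]:
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else vec


def build_index(doc_vecs: list[list[float]]) -> list[list[float]]:
    """Done once at indexing time: store unit-length document vectors."""
    return [normalize(v) for v in doc_vecs]


def search(index: list[list[float]], query_vec: list[float], k: int = 3) -> list[tuple[int, float]]:
    """Done per query: score every stored vector and return the top k."""
    q = normalize(query_vec)
    scored = [(i, sum(a * b for a, b in zip(q, d))) for i, d in enumerate(index)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

The document vectors never depend on the query, which is what makes pre-computation possible and what a cross-encoder gives up.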

### Model Options

**microsoft/unixcoder-base** — The strongest open-source code embedding model for natural-language-to-code retrieval tasks. Pretrained on multiple code-related tasks including code search, code summarization, and code-to-code similarity. The go-to recommendation for most code RAG use cases.

**microsoft/codebert-base** — Earlier and slightly weaker than UniXcoder on most benchmarks, but still competitive for the languages it covers well (Python, Java, JavaScript, PHP, Ruby, Go). If you need to minimize model size or already have it deployed, it's a reasonable choice.

**text-embedding-3-small** (OpenAI) — A general-purpose embedding model that performs surprisingly well on code at scale. It's a managed API, so there's no deployment overhead. If operational simplicity matters more than maximum retrieval quality, this is worth considering. The cost at scale is predictable but non-zero.

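**sentence-transformers** is a library, not a model. It provides a unified Python interface for loading and using embedding models locally. The open-source models above can be loaded through it; hosted models such as `text-embedding-3-small` are called through the provider's API instead.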

```python
from sentence_transformers import SentenceTransformer

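# This checkpoint is not published in sentence-transformers format, so the
# library wraps it with default mean pooling and logs a warning.
model = SentenceTransformer("microsoft/unixcoder-base")

# `extracted_chunks` is the list of chunk dicts produced in Chapter 2.
chunks = [chunk["content"] for chunk in extracted_chunks]
embeddings = model.encode(chunks, batch_size=32, show_progress_bar=True)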
```

Batching is important. On a GPU, encoding chunks one at a time can be an order of magnitude slower than batched encoding. Use the largest batch size that fits in GPU memory.

> **Warning:** Many encoder-style embedding models have a 512-token context window, though limits vary by model, so check the model card. Code chunks that exceed the limit are typically truncated without any error. Check your chunk size distribution against the model's limit, because truncation silently degrades retrieval quality.

### Asymmetric Retrieval

Code search is asymmetric: short natural-language queries retrieve long code chunks. Many embedding models handle this poorly because they're trained on symmetric similarity (text-to-text or code-to-code). UniXcoder was specifically trained on paired natural-language and code samples, which is why it outperforms text-only models on this task.

For query-time encoding, use the query prefix or input format the model expects, if it has one. UniXcoder's reference implementation prepends a mode token such as `<encoder-only>` to the input, and CodeBERT expects its own input formatting. Read the model card; this is not an area where guessing works.

### Query Expansion

When your query vocabulary differs significantly from code vocabulary, embedding alone may not bridge the gap. One technique is HyDE (Hypothetical Document Embeddings): before embedding the query, use an LLM to generate a hypothetical code snippet that would answer the question, then embed that hypothetical code. Because the hypothetical is written in code vocabulary rather than natural-language vocabulary, it tends to retrieve better candidates.

```python
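def hypothetical_document_embedding(query: str, model, llm) -> list[float]:
    # `llm` stands for any client exposing a `complete(prompt) -> str` method;
    # adapt this call to whatever LLM SDK you use.
    prompt = f"""Write a Python function that does the following:
{query}

Return only the function code, no explanation."""

    hypothetical_code = llm.complete(prompt)
    # encode() returns a NumPy array for a single string; convert it to a list.
    return model.encode(hypothetical_code).tolist()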
```

HyDE adds latency (one LLM call per query) and cost. It's worth testing when baseline retrieval quality is poor due to vocabulary mismatch, not as a first-line optimization.

---

### Chapter 3 Takeaways

- General-purpose text embedding models produce poor representations for code; use code-specific models.
- All retrieval-scale embedding uses bi-encoders; cross-encoders are reserved for reranking.
- Batch encoding is mandatory for performance; single-chunk encoding is an order of magnitude slower.
- Embedding model context windows are commonly 512 tokens; chunk sizes must stay within this limit.
- HyDE (query → hypothetical code → embed) can bridge vocabulary gaps but adds latency.

### Try This

Compute embedding cosine similarities for five known-relevant pairs from your codebase — a natural-language description and the function it describes. Then compute similarities for five known-irrelevant pairs. The gap between relevant and irrelevant similarity scores tells you how well the embedding model separates signal from noise. If the gap is less than 0.1, the model is struggling with your code domain. If it's greater than 0.3, you have strong signal to work with.
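
The gap itself is a one-line subtraction once the pair similarities are computed. A minimal sketch, assuming each pair is a `(description_vector, code_vector)` tuple already produced by your embedding model; `similarity_gap` is an illustrative name.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def similarity_gap(relevant_pairs, irrelevant_pairs) -> float:
    """Mean similarity of relevant pairs minus mean similarity of irrelevant pairs."""
    rel = [cosine(q, d) for q, d in relevant_pairs]
    irr = [cosine(q, d) for q, d in irrelevant_pairs]
    return sum(rel) / len(rel) - sum(irr) / len(irr)
```

With only five pairs per group the estimate is noisy, so treat the result as a rough signal and look at the individual scores as well as the mean.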

---

## Chapter 4: Vector Stores

### What a Vector Store Does

A vector store stores high-dimensional vectors and answers nearest-neighbor queries: given a query vector, return the K most similar stored vectors. That's the core operation.

Practical vector stores add a layer on top: they also store the metadata attached to each vector and support filtered queries — nearest neighbors within the subset of vectors where `language == "python"` or `module == "auth"`. For code RAG, metadata filtering is not optional. The ability to scope retrieval to a specific module, file type, or author is what separates a useful tool from a general code search.

The nearest-neighbor search is almost always approximate. Exact nearest-neighbor search over millions of vectors is too slow to be practical. Approximate Nearest Neighbor (ANN) algorithms trade a small amount of recall for large speed improvements. HNSW (Hierarchical Navigable Small World) is the dominant algorithm in production vector stores — it offers strong recall-versus-speed trade-offs and supports online insertion without rebuilding the index.

### Choosing a Vector Store

The choice of vector store is driven by scale, operational preferences, and what infrastructure you're already running. There is no universal best option.

**ChromaDB** is the right choice for development and early production. The Python API is minimal, it runs in-process with no server required, and it supports both persistent and in-memory modes. The time from "I want to try this" to "I have an index" is about ten lines of code. At scale beyond a few million vectors, its performance degrades relative to purpose-built alternatives.

**Qdrant** is the right choice for production at scale. It supports filtering, sharding, quantization, and the HNSW configuration parameters that matter when you're tuning for recall or latency. The Rust implementation is fast. Payload filtering is first-class, not bolted on.

**pgvector** is the right choice if you're already on PostgreSQL and want to avoid adding new infrastructure. The query interface is SQL. If you already operate Postgres, it adds little operational complexity beyond installing the extension. It's slower than Qdrant for pure vector search at large scale, but if your corpus fits in a Postgres instance you're already running, the operational simplicity wins.

**Pinecone** is the managed option. No operational overhead. Scaling is automatic. Cost is per-query and per-vector. If your team doesn't want to operate vector infrastructure, it's a reasonable trade.

**Weaviate** has a strong horizontal scaling story and a GraphQL query interface. Its operational model is more complex than Qdrant's. It's a reasonable choice for teams already familiar with it.

> **Key Insight:** Don't evaluate vector stores based on benchmarks alone. The choice is mostly operational: who deploys this, how, what's already in your infrastructure, and what happens when it goes down. Start with ChromaDB, move to Qdrant or pgvector when you hit production scale.

### Basic ChromaDB Setup

```python
import chromadb

client = chromadb.PersistentClient(path="./code_index")
collection = client.get_or_create_collection(
    name="codebase",
    metadata={"hnsw:space": "cosine"}
)

# Indexing
collection.add(
    documents=[chunk["content"] for chunk in chunks],
    embeddings=embeddings.tolist(),
    metadatas=[{
        "filepath": chunk["filepath"],
        "type": chunk["type"],
        "name": chunk.get("name", ""),
        "start_line": chunk["start_line"],
        "end_line": chunk["end_line"],
    } for chunk in chunks],
    ids=[f"{chunk['filepath']}:{chunk['start_line']}" for chunk in chunks],
)

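# Retrieval
# `query_embedding` must come from the same model used at indexing time,
# e.g. model.encode(query). The "type" value has to match what the chunker
# stored ("function" for the Tree-Sitter chunker, "FunctionDef" for `ast`).
results = collection.query(
    query_embeddings=[query_embedding.tolist()],
    n_results=20,
    where={"type": "function"},  # metadata filter
)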
```

### HNSW Configuration

Most vector stores expose HNSW parameters. The three that matter most:

**`ef_construction`** — Controls index quality at build time. Higher values produce better recall but slower indexing. Default is typically 100–200. For a code index built once and queried many times, err toward higher values.

**`ef_search`** (sometimes `ef`) — Controls recall at query time. Higher values produce better recall but slower queries. This is the parameter to tune if you're trading latency for recall. A value of 50–100 is a reasonable starting point; increase if recall is poor.

**`m`** — The number of connections per node in the graph. Higher M improves recall and increases memory usage. Default of 16 is reasonable for most code indexes.

For a code RAG system, retrieval latency is rarely the binding constraint — LLM generation is orders of magnitude slower. Optimize for recall over latency.
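
To tune for recall you need to measure it. One common approach is to run a sample of queries through both the ANN index and an exact brute-force search over the same vectors, then compare the result sets. The helper below is a sketch of that comparison; `approx_ids` and `exact_ids` are assumed to be ranked lists of chunk IDs from the two searches.

```python
def recall_at_k(approx_ids: list[str], exact_ids: list[str], k: int = 10) -> float:
    """Share of the true top-k neighbors that the approximate search returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)
```

Average this over the sample queries, raise `ef_search`, and repeat until recall stops improving or query latency becomes a problem.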

---

### Chapter 4 Takeaways

- ANN search (not exact nearest-neighbor) is the practical requirement at production scale; HNSW is the dominant algorithm.
- Metadata filtering is essential for code RAG; scope retrieval to modules, file types, or authors.
- ChromaDB for development and early production; Qdrant or pgvector for production at scale.
- HNSW parameters trade recall for speed; for code RAG, optimize for recall.

### Try This

Build an index with ChromaDB using AST-chunked embeddings from a single module of your codebase. Run ten queries and inspect the raw retrieved chunks — the actual code text, not just whether the file path looks right. Ask: does the chunk contain enough context to answer the query? Does it include the function signature? Is the chunk self-contained? This inspection tells you more about chunking and retrieval quality than any automated metric.

---

## Chapter 5: Sparse Retrieval

### BM25 and Why It Still Matters

Dense retrieval is not a replacement for sparse retrieval. It's a complement. Understanding why requires understanding what BM25 does well and where it fails.

BM25 is a term-frequency-based ranking algorithm. Given a query containing terms T, it scores each document based on how often the query terms appear in the document, normalized for document length and adjusted by how common the term is across the corpus. Documents with rare query terms score higher than documents with common query terms. Long documents are penalized relative to short ones with the same term frequency.
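
The scoring described above can be written out directly. This is a sketch of one common BM25 variant (the Lucene-style IDF); libraries differ in the exact IDF formula and defaults, so scores from `rank-bm25` or Elasticsearch will not match these numbers exactly. The `k1` and `b` values are conventional defaults, not tuned for code.

```python
import math
from collections import Counter


def bm25_scores(query: list[str], docs: list[list[str]],
                k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score each tokenized document against the tokenized query."""
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for d in docs for term in set(d))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Rare terms get a larger inverse-document-frequency weight.
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            # k1 saturates repeated terms; b scales the length penalty.
            length_norm = 1 - b + b * len(doc) / avg_len
            score += idf * tf[term] * (k1 + 1) / (tf[term] + k1 * length_norm)
        scores.append(score)
    return scores
```

A short document containing a rare identifier outranks a long document that only shares a common term with the query, which is the behavior that makes BM25 useful for exact identifier lookups.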

For code retrieval, BM25 excels at exact identifier matching. If a developer queries for `verify_jwt_signature`, BM25 will find documents containing that exact string with high precision. A dense embedding model might find semantically related functions, but BM25 will find the exact one. For debugging queries ("why does `parse_config` fail"), exact function name matching is often exactly what you want.

Where BM25 fails is semantic generalization. "Authenticate a user" won't retrieve `verify_credentials` unless the query and function happen to share tokens. The semantic gap is invisible to a term-frequency model.

These failure modes are complementary. BM25 fails on semantic queries; dense retrieval fails on exact identifier queries. Combining them covers both failure modes. This is hybrid retrieval, covered in Chapter 6.

### BM25 in Python

The `rank-bm25` library provides an in-process BM25 implementation whose only dependency is NumPy. For codebases up to ~200,000 chunks, it's fast enough and simple enough to run in-process. For larger corpora, use Elasticsearch or OpenSearch.

```python
from rank_bm25 import BM25Okapi
import re

def tokenize_code(text: str) -> list[str]:
    # Split on whitespace and code delimiters
    # Preserve identifiers as single tokens
    tokens = re.findall(r'[a-zA-Z_][a-zA-Z0-9_]*|[0-9]+', text)
    return [t.lower() for t in tokens]

# Build index
corpus = [chunk["content"] for chunk in chunks]
tokenized_corpus = [tokenize_code(doc) for doc in corpus]
bm25 = BM25Okapi(tokenized_corpus)

# Query
query_tokens = tokenize_code("verify jwt signature")
scores = bm25.get_scores(query_tokens)
top_indices = scores.argsort()[::-1][:20]
```

The tokenization function matters. Code tokenization should split on camelCase and snake_case boundaries, not just whitespace. `verifyJwtSignature` should tokenize to `["verify", "jwt", "signature"]` so that a query for "jwt signature" matches it.

```python
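import re

def tokenize_code_smart(text: str) -> list[str]:
    # Split camelCase: verifyJwt -> verify Jwt
    text = re.sub(r'([a-z])([A-Z])', r'\1 \2', text)
    # Split acronym boundaries: HTTPServer -> HTTP Server
    text = re.sub(r'([A-Z]+)([A-Z][a-z])', r'\1 \2', text)
    # Underscores, whitespace, and punctuation act as separators; digits are
    # kept so that tokens like 401 or base64 stay searchable.
    tokens = re.findall(r'[a-zA-Z0-9]+', text)
    return [t.lower() for t in tokens]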
```

> **Warning:** Do not filter stop words aggressively for code. Words like "get", "set", "is", "has" are meaningful in code contexts (`get_user`, `set_flag`, `is_authenticated`, `has_permission`). Standard stop-word lists built for natural-language processing will strip these and degrade BM25 quality on code.
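To make the warning concrete, the sketch below tokenizes a few identifiers with and without a stop-word filter. The stop-word set is a small hypothetical sample of the kind of words found in natural-language lists, not the contents of any particular library, and `split_identifier` is an illustrative helper.

```python
import re

# Hypothetical sample of a natural-language stop-word list (illustration only)
NL_STOP_WORDS = {"get", "set", "is", "has", "the", "a", "of", "to"}

def split_identifier(text: str) -> list[str]:
    # Split camelCase, then treat underscores and punctuation as separators
    text = re.sub(r"([a-z])([A-Z])", r"\1 \2", text)
    return [t.lower() for t in re.findall(r"[a-zA-Z][a-zA-Z0-9]*", text)]

def tokenize(text: str, remove_stop_words: bool) -> list[str]:
    tokens = split_identifier(text)
    if remove_stop_words:
        tokens = [t for t in tokens if t not in NL_STOP_WORDS]
    return tokens

for name in ["get_user", "set_user", "is_authenticated", "hasPermission"]:
    kept = tokenize(name, remove_stop_words=False)
    stripped = tokenize(name, remove_stop_words=True)
    print(f"{name}: {kept} vs {stripped}")
```

With the filter on, `get_user` and `set_user` both collapse to `["user"]`, so BM25 can no longer tell the getter from the setter.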

### Elasticsearch and OpenSearch for Large Scale

When the corpus exceeds a few hundred thousand chunks, in-process BM25 becomes slow and memory-intensive. Elasticsearch and OpenSearch provide distributed BM25 search with the same BM25 ranking semantics, horizontal scalability, and real-time indexing.

Both use Lucene under the hood and expose nearly identical APIs. Choose based on licensing preferences and your existing infrastructure — if you already run one of them, use it. If starting fresh, both are reasonable.

For code-specific improvements in Elasticsearch, configure a custom analyzer:

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "code_analyzer": {
          "tokenizer": "code_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "code_tokenizer": {
          "type": "pattern",
          "pattern": "([^a-zA-Z0-9])|(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
        }
      }
    }
  }
}
```

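This tokenizes on non-alphanumeric characters and camelCase boundaries, including acronym boundaries such as `HTTPServer` splitting into `HTTP` and `Server`. The resulting tokens are close to what the smart tokenizer above produces.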
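One way to sanity-check the pattern before creating an index is to run the same regular expression locally. The sketch below uses Python's `re.split` as a stand-in for the pattern tokenizer plus the lowercase filter. It assumes Python 3.7 or later (which supports splitting on zero-width matches) and that Java and Python treat these simple lookarounds identically; `simulate_code_analyzer` is a hypothetical helper, and the output is worth confirming against a real cluster.

```python
import re

# Same pattern as the analyzer above, with a non-capturing group so that
# re.split does not return the separator characters themselves
PATTERN = re.compile(
    r"(?:[^a-zA-Z0-9])|(?<=[a-z])(?=[A-Z])|(?<=[A-Z])(?=[A-Z][a-z])"
)

def simulate_code_analyzer(text: str) -> list[str]:
    # Split like the pattern tokenizer, drop empty pieces, then lowercase
    return [piece.lower() for piece in PATTERN.split(text) if piece]

print(simulate_code_analyzer("verifyJwtSignature"))
print(simulate_code_analyzer("parse_config"))
print(simulate_code_analyzer("HTTPServerError"))
```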

---

### Chapter 5 Takeaways

- BM25 excels at exact identifier matching; dense retrieval excels at semantic generalization. Neither alone is sufficient.
- `rank-bm25` is adequate for in-process BM25 on moderate corpora; Elasticsearch or OpenSearch for large scale.
- Tokenization for code should split camelCase and snake_case, and should not strip code-relevant stop words.

### Try This

Run the same ten queries from your Chapter 1 baseline through BM25 alone. Note which queries BM25 handles better than your current search tool (exact function names, specific error types, configuration keys) and which it handles worse (behavioral descriptions, conceptual questions). This tells you what signal BM25 is contributing before you build the hybrid system — if BM25 adds no precision over your baseline, the issue is tokenization.

---

## Chapter 6: Hybrid Retrieval

### Why Hybrid Is the Standard Baseline

Dense retrieval alone misses exact identifier matches. Sparse retrieval alone misses semantic similarity. Combining them consistently outperforms either in isolation across nearly every evaluation dataset. For code RAG specifically — where queries span both exact-match ("find the `parse_config` function") and semantic ("how does configuration loading work?") — hybrid retrieval is the correct baseline, not an advanced optimization.

The combination is typically: BM25 retrieves a candidate set, dense retrieval retrieves a candidate set, the two sets are merged via Reciprocal Rank Fusion, and the merged list is passed to reranking or directly to generation.

> **Key Insight:** Hybrid retrieval is not about covering your bases. It's about recognizing that query intent varies, and different retrieval signals have different strengths at different points in that space. No single signal dominates across all query types.

### Reciprocal Rank Fusion

The challenge in combining retrieval results is that BM25 and dense retrieval produce scores on completely different scales. BM25 scores are term-frequency-derived floats with no universal range. Dense similarity scores are cosine similarities bounded between -1 and 1. Adding them directly is meaningless.

Reciprocal Rank Fusion (RRF) sidesteps the score normalization problem entirely by operating on ranks rather than scores. For each result in each retrieval system, it computes `1 / (k + rank)` where k is a constant (typically 60) that controls how much the formula penalizes lower-ranked results. These values are summed across retrieval systems, and the results are re-sorted by the sum.

```python
def reciprocal_rank_fusion(
    result_sets: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    scores = {}

    for result_list in result_sets:
        for rank, doc_id in enumerate(result_list, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

The k=60 default is from the original RRF paper and holds up well empirically. Results appearing in multiple retrieval systems are rewarded because their RRF scores accumulate. Results in only one system contribute to the merged list but score lower.
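A short worked calculation shows how the accumulation plays out. The helper `rrf_score` below is a hypothetical convenience that scores a single document from the ranks it received; it is the same formula applied to one document rather than whole lists.

```python
def rrf_score(ranks: list[int], k: int = 60) -> float:
    # Sum of 1 / (k + rank) over every list the document appears in
    return sum(1.0 / (k + rank) for rank in ranks)

in_both = rrf_score([5, 5])  # ranked 5th by BM25 and 5th by dense
in_one = rrf_score([1])      # ranked 1st by one system, absent from the other

print(f"rank 5 in both lists: {in_both:.4f}")  # 2/65
print(f"rank 1 in one list:   {in_one:.4f}")   # 1/61
```

With k=60, a chunk that both systems rank fifth (2/65, about 0.0308) outranks a chunk that only one system ranks first (1/61, about 0.0164). Agreement between systems counts for more than a single strong vote.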

RRF has two properties worth emphasizing. First, it's robust across different scoring scales — you can fuse BM25, dense, and any other ranked list regardless of their native scoring conventions. Second, it's simple to debug — a result's position in the merged list depends only on its position in each contributing list, not on opaque score arithmetic.

### Implementing Hybrid Retrieval

```python
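class HybridRetriever:
    def __init__(self, bm25_index, vector_collection, embedding_model, chunk_ids, rrf_k=60):
        self.bm25 = bm25_index
        self.collection = vector_collection
        self.model = embedding_model
        # chunk_ids[i] must be the ID of the i-th document in the BM25 corpus
        self.chunk_ids = chunk_ids
        self.rrf_k = rrf_k

    def retrieve(self, query: str, n_candidates: int = 50, n_results: int = 20) -> list[dict]:
        # Dense retrieval
        query_embedding = self.model.encode(query)
        dense_results = self.collection.query(
            query_embeddings=[query_embedding.tolist()],
            n_results=n_candidates,
        )
        dense_ids = dense_results["ids"][0]

        # Sparse retrieval
        query_tokens = tokenize_code_smart(query)
        bm25_scores = self.bm25.get_scores(query_tokens)
        bm25_top_indices = bm25_scores.argsort()[::-1][:n_candidates]
        bm25_ids = [self.chunk_ids[i] for i in bm25_top_indices]

        # RRF merge
        fused = reciprocal_rank_fusion([dense_ids, bm25_ids], k=self.rrf_k)

        # Fetch top results
        top_ids = [doc_id for doc_id, _ in fused[:n_results]]
        return self.fetch_chunks(top_ids)

    def fetch_chunks(self, ids: list[str]) -> list[dict]:
        # Assumes a ChromaDB-style collection; get() does not promise to
        # preserve the requested order, so restore the fused ranking here
        records = self.collection.get(ids=ids, include=["documents", "metadatas"])
        by_id = {
            doc_id: {"id": doc_id, "content": doc, **(meta or {})}
            for doc_id, doc, meta in zip(
                records["ids"], records["documents"], records["metadatas"]
            )
        }
        return [by_id[doc_id] for doc_id in ids if doc_id in by_id]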

Set `n_candidates` higher than `n_results`. You're retrieving a large candidate pool from each system and then merging — the merge can only work if the relevant results are present in at least one candidate set. Starting with 50 candidates per system and returning 20 final results is a reasonable default.

### Tuning RRF

The k parameter in RRF controls ranking sensitivity. Low k (e.g., 10) puts heavy weight on top-ranked results and penalizes lower-ranked results sharply. High k (e.g., 100) flattens the distribution and treats results at rank 1 and rank 30 more similarly.
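The effect is easy to quantify. As a minimal sketch (the helper name `rank_weight_ratio` is illustrative), the ratio between the contribution of rank 1 and rank 30 shrinks as k grows:

```python
def rank_weight_ratio(k: int, high_rank: int = 1, low_rank: int = 30) -> float:
    # How many times more a result at high_rank contributes than one at low_rank
    return (1.0 / (k + high_rank)) / (1.0 / (k + low_rank))

for k in (10, 60, 100):
    ratio = rank_weight_ratio(k)
    print(f"k={k:>3}: rank 1 counts {ratio:.2f}x as much as rank 30")
```

At k=10 the top result carries about 3.64 times the weight of rank 30; at k=60 about 1.48 times; at k=100 about 1.29 times.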

For most code RAG applications, k=60 requires no tuning. If you want to test it, hold out a labeled evaluation set and sweep k from 20 to 100. The variance is usually small enough that the default is fine.

What matters more than k is the number of candidates retrieved from each system. If BM25 returns 10 candidates and dense returns 50, the dense system dominates the merged list regardless of k. Match candidate counts across systems.

### Metadata Filtering in Hybrid Search

Metadata filtering is easier in dense retrieval (most vector stores support `where` clauses natively) than in BM25. For in-process BM25, apply the filter post-retrieval: retrieve more candidates than needed, then filter by metadata before RRF.

For Elasticsearch, apply filters at query time using bool queries with filter clauses — these don't affect BM25 scoring but exclude non-matching documents from results.
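Both approaches can be sketched briefly. The function below is one possible post-retrieval filter for in-process BM25, and the dictionary shows the general shape of an Elasticsearch bool query with a filter clause. The field names (`content`, `language`), the metadata layout, and the helper name are assumptions for illustration.

```python
def filter_bm25_candidates(
    ranked_ids: list[str],
    metadata_by_id: dict[str, dict],
    required: dict[str, str],
    n_candidates: int,
) -> list[str]:
    # Keep BM25 order, drop chunks whose metadata doesn't match, then truncate
    kept = [
        doc_id
        for doc_id in ranked_ids
        if all(
            metadata_by_id.get(doc_id, {}).get(key) == value
            for key, value in required.items()
        )
    ]
    return kept[:n_candidates]

# Elasticsearch: the filter clause restricts which documents can match
# without contributing to the BM25 score. Field names are hypothetical.
es_query = {
    "query": {
        "bool": {
            "must": [{"match": {"content": "parse config"}}],
            "filter": [{"term": {"language": "python"}}],
        }
    },
    "size": 50,
}

metadata = {
    "a": {"language": "python"},
    "b": {"language": "go"},
    "c": {"language": "python"},
}
print(filter_bm25_candidates(["b", "a", "c"], metadata, {"language": "python"}, 2))
```

Because filtering shrinks the list, over-fetch from BM25 (for example, several times `n_candidates`) so that enough candidates survive to match the dense system's count.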

---

### Chapter 6 Takeaways

- Hybrid retrieval (BM25 + dense) outperforms either system alone for code queries that span exact-match and semantic intent.
- RRF merges ranked lists without score normalization; it's robust and debuggable.
- Match candidate counts across retrieval systems to avoid one system dominating the merge.
- Metadata filtering applies to both sparse and dense retrieval but requires different implementation.

### Try This

Run your evaluation queries through BM25 alone, dense alone, and hybrid. For each system, record precision at 5 (how many of the top five results are relevant). The hybrid system should beat both individually on average. If it doesn't, the most likely cause is either a large mismatch in candidate counts between systems or a tokenization issue in BM25. Check candidate counts first, since that is the quicker of the two to verify.

---

## Chapter 7: Reranking

### The Two-Stage Retrieval Pattern

Hybrid retrieval with RRF produces a ranked list of ~20 candidates. That list is good — better than either system alone — but not perfect. Some irrelevant results will rank in the top 5. Some highly relevant results will rank 15th. The first-stage retrieval is optimized for speed at scale; it can't afford the computation required for truly precise ranking.

Reranking is the second stage. Take the ~20 candidates from first-stage retrieval, re-score each one with a more expensive model, and return the top K results in the new ranking. The context window passed to the LLM receives this reranked list.

The key property of reranking models — specifically, cross-encoders — is that they see the query and the document together. A cross-encoder concatenates the query and document into a single input and produces a relevance score for that specific pair. This is fundamentally more accurate than bi-encoder similarity, which encodes query and document independently. The cross-encoder can reason about the relationship, not just the proximity of their vector representations.

The cost is speed. A cross-encoder must score every candidate individually — 20 forward passes versus 1 for bi-encoder retrieval. For a code RAG system where the bottleneck is LLM generation (seconds), this is almost always acceptable.

### Cross-Encoder Reranking

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[dict], top_k: int = 5) -> list[dict]:
    pairs = [(query, chunk["content"]) for chunk in candidates]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(candidates, scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [chunk for chunk, score in ranked[:top_k]]
```

**cross-encoder/ms-marco-MiniLM-L-6-v2** is a lightweight, fast cross-encoder trained on MS MARCO. It's the go-to baseline for general-purpose reranking. For code-specific queries, **BAAI/bge-reranker-large** is stronger but slower.

The Cohere Rerank API is a managed option. It performs well, integrates simply, and charges per query. For teams that don't want to deploy and maintain a reranking model, it's a reasonable choice.

> **Warning:** Cross-encoders trained on web search data (such as the MS MARCO MiniLM models) have seen very little code in training. They work reasonably well because code reranking is partly about natural-language understanding of the query, but they're not optimal. When reranking precision is critical, fine-tune a cross-encoder on code-specific labeled pairs or use BAAI/bge-reranker-large, which has broader multilingual and domain coverage.

### Maximal Marginal Relevance

Cross-encoder reranking optimizes for relevance. MMR (Maximal Marginal Relevance) optimizes for relevance *and* diversity. It solves a different problem: when the top-ranked candidates are near-duplicates of each other, a pure relevance reranker fills the context window with redundant information. MMR selects candidates that are relevant to the query but different from each other.

The algorithm is iterative. The first selection is the most relevant candidate. Each subsequent selection maximizes a combination of relevance to the query and dissimilarity from already-selected chunks.

```python
import numpy as np

def mmr_rerank(
    query_embedding: np.ndarray,
    candidate_embeddings: np.ndarray,
    candidates: list[dict],
    top_k: int = 5,
    lambda_mult: float = 0.5
) -> list[dict]:
    selected = []
    selected_embeddings = []
    remaining = list(range(len(candidates)))

    for _ in range(min(top_k, len(candidates))):
        if not selected_embeddings:
            # First: pick the most relevant
            scores = candidate_embeddings @ query_embedding
            best = remaining[np.argmax(scores[remaining])]
        else:
            # Subsequently: balance relevance and diversity
            relevance = candidate_embeddings[remaining] @ query_embedding
            diversity = np.max(
                candidate_embeddings[remaining] @ np.array(selected_embeddings).T,
                axis=1
            )
            mmr_scores = lambda_mult * relevance - (1 - lambda_mult) * diversity
            best = remaining[np.argmax(mmr_scores)]

        selected.append(best)
        selected_embeddings.append(candidate_embeddings[best])
        remaining.remove(best)

    return [candidates[i] for i in selected]
```

`lambda_mult` controls the relevance/diversity trade-off. 1.0 is pure relevance (equivalent to top-K retrieval); 0.0 is maximum diversity. 0.5 is a balanced default. Decrease toward 0.3 if you're seeing near-duplicate chunks dominating the context window. The dot products in the code stand in for cosine similarity, so pass L2-normalized embeddings.

> **Key Insight:** Use MMR when your codebase has significant code duplication — copy-pasted helpers, repeated boilerplate, similar utility functions. Use cross-encoder reranking when you need maximum relevance precision and duplication isn't an issue. You can combine both: cross-encode for relevance, then MMR over the top-10 for diversity.

### Context Assembly

After reranking, you have 3–7 high-quality, diverse, relevant chunks. How you assemble them into the LLM context window matters.

Research from Liu et al. (2023) — "Lost in the Middle" — shows that LLMs use information at the beginning and end of long contexts better than information in the middle. Place the most relevant chunks first or last, not buried in the middle.
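One simple way to act on this, sketched below, is to alternate ranked chunks between the front and the back of the context so that the weakest candidates land in the middle. The function name and the alternating scheme are illustrative choices, not something the paper prescribes.

```python
def order_for_long_context(ranked: list) -> list:
    # ranked[0] is the most relevant item. Even positions fill the front,
    # odd positions fill the back (reversed), so relevance decreases toward
    # the middle of the assembled context.
    front, back = [], []
    for position, item in enumerate(ranked):
        (front if position % 2 == 0 else back).append(item)
    return front + back[::-1]

print(order_for_long_context([1, 2, 3, 4, 5]))  # [1, 3, 5, 4, 2]
```

Here the best chunk (1) comes first, the second best (2) comes last, and the least relevant (5) sits in the middle.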

Include file path and context metadata before each chunk:

```
### File: auth/managers.py (lines 45-78)
### Type: function
### Name: UserManager.authenticate

def authenticate(self, token: str) -> User:
    ...
```

This metadata costs tokens but pays for itself in generation quality. The LLM can cite specific locations, understands the organizational context, and can distinguish between two functions named `authenticate` in different modules.
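A minimal formatter for that layout might look like the following. It assumes each chunk is a dict with `filepath`, `start_line`, `end_line`, `type`, `name`, and `content` keys; adjust the keys to whatever your chunker actually emits.

```python
def format_chunk(chunk: dict) -> str:
    # Metadata header first, then a blank line, then the code itself
    header = [
        f"### File: {chunk['filepath']} (lines {chunk['start_line']}-{chunk['end_line']})",
        f"### Type: {chunk['type']}",
        f"### Name: {chunk['name']}",
    ]
    return "\n".join(header) + "\n\n" + chunk["content"]

def assemble_context(chunks: list[dict]) -> str:
    # Separate chunks with a blank line so headers stay visually distinct
    return "\n\n".join(format_chunk(c) for c in chunks)

example = {
    "filepath": "auth/managers.py",
    "start_line": 45,
    "end_line": 78,
    "type": "function",
    "name": "UserManager.authenticate",
    "content": "def authenticate(self, token: str) -> User:\n    ...",
}
print(assemble_context([example]))
```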

---

### Chapter 7 Takeaways

- Cross-encoders see query and document together; they're more accurate than bi-encoders but must be run per-candidate.
- Reranking adds a modest cost (20 forward passes, roughly 200–500ms) that's almost always acceptable when LLM generation takes seconds.
- MMR prevents near-duplicate chunks from dominating the context window by combining relevance with diversity.
- Chunk ordering within the context window affects generation quality; most relevant chunks should go first or last.

### Try This

Take your top 20 hybrid retrieval results for five queries and run them through both cross-encoder reranking and MMR. Compare which gets into your final top-5. Look specifically for cases where a highly relevant chunk ranked 15th in retrieval gets promoted to the top-5 by the cross-encoder — that's reranking working. Look for cases where four of the top five are nearly identical — that's where MMR should replace cross-encoding or follow it.

---

## Chapter 8: Evaluation

### The Evaluation Problem in RAG

RAG evaluation is hard because the system has multiple distinct failure modes, and they don't all produce the same symptom. A system that retrieves good context and generates a bad answer is different from a system that retrieves bad context and generates a confident hallucination. Measuring only the final answer quality conflates these two failure modes.

A complete evaluation covers: retrieval quality (are the right chunks being found?), faithfulness (does the answer reflect the retrieved context?), and answer relevance (does the answer address the question?). Each requires different measurement.

### Retrieval Evaluation

Retrieval evaluation requires a labeled dataset: a set of queries, each with a set of known-relevant documents. For code RAG, this dataset takes effort to build but is the most valuable investment in evaluation.

Build the dataset from two sources:

**Real developer questions** — GitHub issues, Slack conversations, internal Q&A threads. These represent actual query distribution. For each, annotate which functions or files are genuinely relevant.

**Synthetic generation** — Use an LLM to generate questions for each indexed chunk: "What question would this function answer?" This scales to cover more of the corpus but produces a distribution that may not match real queries.

With a labeled dataset, compute:

**Recall at K** — Of all relevant documents, what fraction appear in the top K retrieved results? Recall@10 tells you whether the right code is in the candidate pool at all.

**Precision at K** — Of the top K retrieved results, what fraction are relevant? Precision@5 tells you whether the LLM context is high quality.

**MRR (Mean Reciprocal Rank)** — The average of `1/rank` of the first relevant result. Captures whether the most relevant chunk appears near the top.

```python
def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    top_k = set(retrieved_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids) if relevant_ids else 0.0

def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    top_k = retrieved_ids[:k]
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / k

def mrr(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0
```
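Note that the `mrr` function above returns the reciprocal rank for a single query; the reported MRR is the mean of that value across the whole evaluation set. The self-contained sketch below shows the averaging step on a toy dataset with made-up document IDs.

```python
from statistics import mean

def reciprocal_rank(retrieved_ids: list[str], relevant_ids: set[str]) -> float:
    # 1 / rank of the first relevant result, or 0.0 if none was retrieved
    for rank, doc_id in enumerate(retrieved_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

# Toy evaluation set: (retrieved IDs in ranked order, known-relevant IDs)
eval_set = [
    (["a", "b", "c"], {"a"}),       # first hit at rank 1 -> 1.0
    (["d", "e", "f"], {"e", "f"}),  # first hit at rank 2 -> 0.5
    (["g", "h", "i"], {"z"}),       # no hit              -> 0.0
]

mean_rr = mean(reciprocal_rank(retrieved, relevant) for retrieved, relevant in eval_set)
print(f"MRR over {len(eval_set)} queries: {mean_rr:.2f}")  # 0.50
```

Recall@K and Precision@K are averaged across queries the same way.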

### Faithfulness

Faithfulness measures whether the generated answer is grounded in the retrieved context. An unfaithful answer contains claims that cannot be found in or inferred from the retrieved chunks — hallucinations.

RAGAS implements faithfulness as an LLM-based metric. It decomposes the generated answer into atomic claims and verifies each claim against the retrieved context using an LLM judge.

```python
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

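# Column names follow the ragas 0.1.x dataset schema; newer releases
# rename these fields, so check the docs for your installed version
eval_data = {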
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,  # list of lists of strings
    "ground_truth": reference_answers,
}

result = evaluate(
    Dataset.from_dict(eval_data),
    metrics=[faithfulness, answer_relevancy, context_precision],
)
```

RAGAS faithfulness scores range from 0 to 1. As a rough rule of thumb, a score below 0.7 indicates significant hallucination, though the right threshold depends on your judge model and dataset. For code RAG systems used for debugging or code understanding, faithfulness is the most critical metric: a system that confidently answers questions about code that doesn't exist causes real engineering harm.

> **Warning:** RAGAS uses an LLM as a judge. LLM judges are not perfectly calibrated — they have their own biases and failure modes. Use RAGAS scores as directional indicators, not ground truth. When a score changes significantly (>0.05) between versions, investigate the specific failures rather than trusting the aggregate number.

### Answer Relevance

Answer relevance measures whether the generated answer actually addresses the question asked. A system can be faithful (grounded in context) but irrelevant (the answer doesn't address the question because the wrong context was retrieved).

RAGAS `answer_relevancy` estimates this by prompting an LLM to generate candidate questions from the answer and measuring the similarity of those generated questions to the original question.

High faithfulness with low answer relevance indicates a retrieval problem — the system retrieved coherent context but the wrong context. High answer relevance with low faithfulness indicates a generation problem — the LLM is generating confident but unsupported claims.
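That reading of the two scores can be captured in a small triage helper. This is a sketch under the assumption that a single cutoff separates "high" from "low" for both metrics; the 0.7 default and the function name are illustrative and should be calibrated on your own data.

```python
def diagnose(faithfulness: float, answer_relevance: float, threshold: float = 0.7) -> str:
    # Map the score pattern to the pipeline stage most likely at fault
    faithful = faithfulness >= threshold
    relevant = answer_relevance >= threshold
    if faithful and relevant:
        return "ok"
    if faithful:
        return "retrieval"   # grounded in context, but the wrong context
    if relevant:
        return "generation"  # on topic, but makes unsupported claims
    return "both"

print(diagnose(0.92, 0.41))  # retrieval
print(diagnose(0.38, 0.88))  # generation
```

Running this per query rather than on the aggregate shows which stage to inspect first for each failure.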

### Context Precision

Context precision measures whether the retrieved context is relevant. Specifically: are the retrieved chunks actually useful for answering the question, or is the context window filled with noise?

Low context precision degrades generation quality in two ways. First, the LLM has to reason over irrelevant context, which introduces noise. Second, the irrelevant context displaces relevant context that could have been included (context window is finite).

```python
# Simple context precision: fraction of retrieved chunks rated relevant by an LLM judge
def context_precision_manual(
    query: str, retrieved_chunks: list[dict], llm
) -> float:
    relevant_count = 0
    for chunk in retrieved_chunks:
        prompt = f"""Is the following code chunk relevant for answering the question below?

Question: {query}

Code:
{chunk['content']}

Answer with yes or no."""
        # Match on the leading word so "no, but..." or words containing "yes" don't count
        response = llm.complete(prompt).strip().lower()
        if response.startswith("yes"):
            relevant_count += 1

    return relevant_count / len(retrieved_chunks) if retrieved_chunks else 0.0
```

### Evaluation Pipeline

Run evaluations continuously, not as a one-time check. When you change chunking strategy, embedding model, or retrieval parameters, run the evaluation suite and compare metrics.

Track metrics over time using Weights & Biases, MLflow, or a simple spreadsheet. What matters is that you can see whether a change helped or hurt, across which query types, and by how much.
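If a spreadsheet feels too manual and a tracking service too heavy, a JSON-lines log plus a regression check is enough to start. The sketch below uses only the standard library; the file layout, function names, and the 0.02 tolerance are assumptions to adapt.

```python
import json
from pathlib import Path

def log_run(path: Path, label: str, metrics: dict[str, float]) -> None:
    # Append one JSON object per evaluation run
    with path.open("a", encoding="utf-8") as f:
        f.write(json.dumps({"label": label, "metrics": metrics}) + "\n")

def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    # Metrics that dropped by more than `tolerance`, with the size of the drop
    return {
        name: round(baseline[name] - candidate[name], 4)
        for name in baseline
        if name in candidate and baseline[name] - candidate[name] > tolerance
    }

baseline = {"recall@10": 0.80, "precision@5": 0.60}
candidate = {"recall@10": 0.81, "precision@5": 0.50}
print(find_regressions(baseline, candidate))  # {'precision@5': 0.1}
```

Logging a label per run (for example, the chunking strategy and embedding model name) makes it possible to trace a regression back to the change that caused it.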

> **Key Insight:** Build your evaluation dataset before you build your production pipeline. Knowing your metrics from the start keeps you from optimizing the wrong thing. A retrieval system tuned for recall but evaluated on answer quality will fail in ways that are hard to debug because you don't know which stage broke.

---

### Chapter 8 Takeaways

- RAG evaluation covers retrieval quality (Recall@K, Precision@K, MRR), faithfulness, and answer relevance separately.
- Low faithfulness indicates hallucination; low answer relevance indicates retrieval of wrong context; low context precision indicates noise in retrieved context.
- RAGAS provides LLM-based metrics for faithfulness and relevance; treat scores as directional, not ground truth.
- Build the evaluation dataset before building the pipeline; track metrics continuously across versions.

### Try This

Build a labeled evaluation set of 50 query/relevant-document pairs from your codebase. Score your current pipeline on Recall@10 and Precision@5. Then make one change to the pipeline (different chunking, different embedding model, adding reranking) and compare. You're looking for changes that improve Precision@5 without hurting Recall@10 significantly — that's the reranking effect. Changes that improve Recall@10 with flat Precision@5 benefit from reranking to pull signal out of the larger candidate pool.

---

## Chapter 9: Production

### What Changes in Production

A pipeline that works in a Jupyter notebook fails in production for reasons that have nothing to do with retrieval quality. The issues are operational: stale indexes, embedding model versioning, query latency under load, error handling for malformed code, and monitoring for silent degradation.

None of these are exotic. They're the ordinary problems of running any ML-backed service at scale. The code RAG-specific versions of them are worth understanding before they catch you off-guard.

### Index Freshness

Code changes daily. An index built on Monday is partially stale by Friday. Functions are renamed, deleted, moved. New modules appear. A code RAG system with a stale index produces answers about code that no longer exists — with the same confidence as answers about current code.

Three approaches exist:

**Full rebuild on a schedule.** Simple. The entire codebase is re-indexed nightly or weekly. Appropriate for codebases that change slowly or where same-day freshness isn't required. The downside is the gap: changes made at 10 AM aren't indexed until the overnight job runs.

**Incremental updates.** Track file modification timestamps or use git hooks to trigger re-indexing of changed files. Faster gap closure, more complex implementation. The tricky part is deletion: if a file is removed or a function is renamed, the old chunk must be removed from the index. Most vector stores support delete by ID; make sure your pipeline tracks chunk IDs by file path so you can delete all chunks for a deleted file.

**Event-driven indexing.** On every git push or pull request merge, trigger a re-index job for the changed files. Sub-minute freshness is achievable. Requires infrastructure (a queue, a worker, a git webhook). Appropriate for teams where developer productivity depends on accurate, current code search.

```python
import subprocess
from pathlib import Path

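def get_changed_files_since_last_index(last_indexed_commit: str) -> list[str]:
    # --no-renames reports a rename as delete + add, so the old path is
    # listed too and its stale chunks get removed from the index
    result = subprocess.run(
        ["git", "diff", "--name-only", "--no-renames", last_indexed_commit, "HEAD"],
        capture_output=True, text=True, check=True
    )
    return [f for f in result.stdout.splitlines() if f.endswith(".py")]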

def update_index_incrementally(changed_files: list[str], collection, model):
    for filepath in changed_files:
        # Remove old chunks for this file
        collection.delete(where={"filepath": filepath})

        # Re-chunk and re-embed
        if Path(filepath).exists():
            source = Path(filepath).read_text()
            new_chunks = extract_chunks(source, filepath)
            if new_chunks:
                embeddings = model.encode([c["content"] for c in new_chunks])
                collection.add(
                    documents=[c["content"] for c in new_chunks],
                    embeddings=embeddings.tolist(),
                    metadatas=[{k: v for k, v in c.items() if k != "content"} for c in new_chunks],
                    ids=[f"{c['filepath']}:{c['start_line']}" for c in new_chunks],
                )
```

> **Warning:** If you switch embedding models, you must rebuild the entire index from scratch. Vectors from different models live in different geometric spaces — you cannot mix them in the same collection. Version your index alongside your model, and always rebuild on model changes.

### Embedding Model Versioning

The embedding model is part of your system's dependencies, the same as any library. When it changes, the index must be rebuilt. Track:

- Which model version generated the current index
- When the index was built
- The git commit at index time

Store this as collection metadata or in a simple JSON file alongside the index. This is not optional information — when a retrieval quality regression occurs, the first question is whether the model changed.

If you update the embedding model, rebuild the index in parallel before switching over. Don't update the live index in place during a model change.
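The JSON-file option can be as small as the sketch below. The manifest keys and function names are illustrative assumptions; the point is that the query path checks the recorded model before trusting the index.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def write_index_manifest(path: Path, model_name: str, git_commit: str) -> dict:
    # Record what produced this index so regressions can be traced later
    manifest = {
        "embedding_model": model_name,
        "built_at": datetime.now(timezone.utc).isoformat(),
        "git_commit": git_commit,
    }
    path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest

def index_matches_model(path: Path, model_name: str) -> bool:
    # Refuse to query an index built with a different (or unknown) model
    if not path.exists():
        return False
    manifest = json.loads(path.read_text(encoding="utf-8"))
    return manifest.get("embedding_model") == model_name
```

Calling `index_matches_model` at service startup turns a silent vector-space mismatch into an explicit failure.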

### Query Latency

The latency profile of a code RAG system has a predictable shape: embedding the query is fast (20–50ms), dense retrieval is fast (10–50ms), BM25 retrieval is fast (10–100ms), cross-encoder reranking is slow (200–500ms for 20 candidates), and LLM generation is slowest (1–10 seconds).

LLM generation dominates. Optimize retrieval latency only if it's greater than ~500ms. In most deployments, it won't be.

Cache query embeddings for repeated queries — the same developer asking the same question twice shouldn't trigger two embedding model calls.
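For a single-process deployment, `functools.lru_cache` is often enough. The wrapper below is a sketch: `embed_fn` stands for whatever function calls your embedding model, and the whitespace normalization is an assumption about which queries should count as identical.

```python
from functools import lru_cache
from typing import Callable, Sequence

def make_cached_embedder(
    embed_fn: Callable[[str], Sequence[float]], maxsize: int = 1024
) -> Callable[[str], tuple[float, ...]]:
    @lru_cache(maxsize=maxsize)
    def cached(normalized_query: str) -> tuple[float, ...]:
        # Tuples are immutable, so callers can't corrupt the cached value
        return tuple(embed_fn(normalized_query))

    def embed(query: str) -> tuple[float, ...]:
        # Collapse whitespace so trivially different strings share an entry
        return cached(" ".join(query.split()))

    return embed

# Stand-in for a real model call, used here only to count invocations
calls = []
def fake_embed(query: str) -> list[float]:
    calls.append(query)
    return [float(len(query))]

embed = make_cached_embedder(fake_embed)
first = embed("how does  auth work")
second = embed("how does auth work")
print(len(calls))  # 1
```

Case is deliberately left untouched, since identifiers in code queries are often case-sensitive. A multi-process service would need a shared cache instead.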

For cross-encoder reranking, batching is important. Score all 20 candidates in a single batch call rather than 20 individual calls.

### Monitoring and Alerting

Silent degradation is the hardest failure mode to catch. The system continues to return answers. The answers are plausible. Retrieval quality has declined. No error is raised.

Monitoring should cover:

**Retrieval coverage** — What fraction of queries return fewer than N candidates? A sudden drop suggests an index problem.

**Embedding latency** — Slow embedding suggests model server issues or GPU contention.

**BM25 index staleness** — Track when the BM25 index was last rebuilt relative to the codebase commit.

**LLM faithfulness on a sample** — Run RAGAS faithfulness on a random sample of production queries daily. Alert when the 7-day moving average drops more than 0.1 points.

**Error rates by stage** — Instrument each stage (chunking, embedding, retrieval, reranking, generation) separately so you can identify which stage is failing.

TruLens provides application-level monitoring for RAG systems. Weights & Biases is appropriate for tracking metric trends across pipeline versions. Start with structured logging and add dashboards when you have multiple people watching the system.
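The faithfulness alert can start as a few lines over the daily scores. This sketch assumes "drops" means the latest 7-day average compared with the 7-day average immediately before it; if you prefer a fixed baseline, compare against that instead. The function name is illustrative.

```python
from statistics import mean

def faithfulness_alert(
    daily_scores: list[float], window: int = 7, max_drop: float = 0.1
) -> bool:
    # Compare the latest window's average with the window just before it
    if len(daily_scores) < 2 * window:
        return False  # not enough history yet
    previous = mean(daily_scores[-2 * window:-window])
    current = mean(daily_scores[-window:])
    return previous - current > max_drop

print(faithfulness_alert([0.9] * 7 + [0.75] * 7))  # True
print(faithfulness_alert([0.9] * 7 + [0.85] * 7))  # False
```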

### Deployment Patterns

**In-process** — The entire pipeline runs in the same process as the application. Simple for small teams and moderate scale. No network overhead. The right starting point.

**Microservice** — Retrieval runs as a separate service. Allows independent scaling of retrieval and generation. The right pattern when retrieval load grows independently (many users, high query volume) or when different applications share the same index.

**Serverless retrieval** — Query time is bursty and unpredictable; serverless functions handle embedding and retrieval. Pinecone is designed for this model. Eliminates operational overhead at the cost of cold start latency and per-query cost.

For most engineering teams building internal tooling, in-process with a persistent vector store (ChromaDB or pgvector) and incremental updates via a cron job or git hook is sufficient. Scale the architecture only when there's evidence that the current architecture is the bottleneck.

> **Key Insight:** The production question is not "what architecture scales to a million users?" It's "what's the minimum architecture that serves my actual users without operational pain?" Build for the users you have, not the users you imagine.

---

### Chapter 9 Takeaways

- Index freshness is a code-RAG-specific failure mode; choose a freshness strategy (scheduled, incremental, event-driven) based on how quickly your codebase changes and how current answers need to be.
- Changing embedding models requires a full index rebuild; version the index alongside the model.
- LLM generation dominates latency; optimize retrieval only if it exceeds ~500ms.
- Monitor for silent degradation using daily RAGAS faithfulness sampling on production traffic.
- Build for actual scale, not imagined scale.

### Try This

Simulate a staleness scenario. Take your current index, make 20 significant code changes (rename a function, delete a module, add a new class), and query the system for the changed functionality without updating the index. Count how many answers are confidently wrong about code that no longer exists or doesn't exist yet. This quantifies your staleness risk and helps you decide how urgently you need incremental indexing.

---

## Conclusion

A working code RAG system is not complicated. The core is a few hundred lines of Python: an AST parser for chunking, a sentence-transformer for embedding, a vector store for ANN search, a BM25 index for lexical matching, RRF for fusion, and a cross-encoder for reranking. The principles are well-understood. The libraries are mature.

What makes the difference between a prototype and a production system is the operational layer: fresh indexes, versioned models, monitored quality, and honest evaluation. These require discipline, not ingenuity.

Build the evaluation dataset first. Run baselines honestly before optimizing. Start with the simplest architecture that works. Add complexity only when measurement shows that simplicity is the bottleneck.

Code search has been a persistent pain point for engineering teams for decades. Semantic retrieval backed by good evaluation and sound operations solves most of it. The components in this guide have existed long enough to have known failure modes and known fixes. There's nothing left to invent — only to build carefully.

---

## Appendix A: Glossary

**Cross-Encoder:** A neural model that takes a (query, document) pair as a single input and outputs a relevance score. More accurate than bi-encoder similarity but slower. Used in the reranking stage.

**Dense Retrieval:** Retrieval based on vector similarity between embedded query and embedded documents. Captures semantic similarity rather than lexical overlap.

**Faithfulness:** A RAG quality metric measuring whether generated answers are supported by the retrieved context, as opposed to hallucinated from training data.

**HNSW (Hierarchical Navigable Small World):** The most widely used index algorithm for approximate nearest neighbor (ANN) search. Provides strong recall-versus-speed trade-offs and supports online insertion. Implemented in most production vector stores.

**Hybrid Retrieval:** Combining sparse (BM25) and dense retrieval results, typically via Reciprocal Rank Fusion. The standard baseline for code RAG.

**Index Freshness:** How closely the vector index reflects the current state of the codebase. Stale indexes produce answers about code that no longer exists.

**MMR (Maximal Marginal Relevance):** A reranking strategy that balances relevance to the query against diversity among selected chunks. Prevents near-duplicate chunks from dominating the context window.

**Reranking:** A second-pass scoring step that takes retrieval candidates and re-scores them with a more expensive, more accurate model (cross-encoder or LLM) before final selection.

**RRF (Reciprocal Rank Fusion):** A rank merging formula that combines result sets from multiple retrieval systems. Operates on ranks, not raw scores, making it robust across different scoring scales.

**Sparse Retrieval:** Retrieval based on lexical matching (term frequency, inverted index). BM25 is the standard sparse retrieval algorithm.

**Tree-Sitter:** A multi-language parsing library that produces concrete syntax trees. Used for AST-based chunking across multiple languages with a consistent interface.

**Vector Store:** A database optimized for storing and querying high-dimensional vectors. Examples: ChromaDB, Qdrant, Weaviate, Pinecone, pgvector.
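To make the RRF entry above concrete, here is a minimal sketch of the fusion step. Each document's fused score is the sum of `1 / (k + rank)` over every ranked list it appears in, with ranks starting at 1. The function name and the document IDs are illustrative, and `k = 60` is the constant used in the original RRF paper (Cormack et al., 2009); a given pipeline may tune it differently.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse several ranked lists of document IDs into one scored ranking.

    Only rank positions are used, so the raw scores of the underlying
    retrievers (BM25 scores, cosine similarities) never need to be comparable.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical result lists from a sparse and a dense retriever.
bm25_hits = ["a", "b", "c"]
dense_hits = ["b", "d", "a"]
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

In this example `b` and `a` appear in both lists and therefore outrank `d` and `c`, which each appear in only one. That is the behavior the glossary entry describes: agreement between retrievers is rewarded without any score normalization.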

---

## Appendix B: Tools & Resources

### Parsing and Chunking

- **Python `ast` module** — Standard library AST parser for Python. No dependencies. Use for single-language Python repositories.
- **tree-sitter** — Multi-language parsing. Supports Python, TypeScript, Go, Rust, Java, Ruby, and 50+ others. Fast and resilient to parse errors.
- **tree-sitter-languages** — Pre-built tree-sitter grammars packaged as a single Python wheel. Simplifies multi-language setup.

### Embedding Models

- **microsoft/unixcoder-base** — Strong code embedding model from Microsoft. Good for natural-language-to-code retrieval.
- **microsoft/codebert-base** — Earlier Microsoft code model, still competitive for the languages it covers.
- **flax-sentence-embeddings/st-codesearch-distilroberta-base** — Efficient DistilRoBERTa-based model trained on the CodeSearchNet dataset for finding code from natural-language queries.
- **text-embedding-3-small** (OpenAI) — General-purpose model that performs well on code at scale. Managed API.
- **sentence-transformers** library — Unified interface for loading and using embedding models locally.

### Vector Stores

- **ChromaDB** — Excellent for development and moderate-scale production. Simple Python API. Persistent and in-memory modes.
- **Qdrant** — Strong production-grade vector store. Supports filtering, sharding, and quantization natively.
- **Weaviate** — Good horizontal scaling story. GraphQL query interface.
- **pgvector** — Vector search extension for PostgreSQL. Use if you're already on Postgres and want to avoid a new infrastructure component.
- **Pinecone** — Managed vector database. No operational overhead, cost scales with usage.

### BM25

- **rank-bm25** — Python BM25 implementation. Lightweight (NumPy is its only dependency), suitable for in-process BM25 on moderate-sized corpora.
- **Elasticsearch / OpenSearch** — Full-featured BM25 search for large-scale deployments where in-process BM25 is insufficient.
- **Tantivy** — Rust-based full-text search library with Python bindings. Fast for large corpora.

### Reranking

- **cross-encoder/ms-marco-MiniLM-L-6-v2** — Lightweight, fast cross-encoder. Good general-purpose baseline.
- **BAAI/bge-reranker-large** — Stronger cross-encoder with better multilingual performance.
- **Cohere Rerank API** — Managed reranking API. Strong performance, simple integration, per-query cost.

### Evaluation

- **RAGAS** — Framework for RAG evaluation. Implements faithfulness, answer relevance, and context precision metrics. LLM-based evaluation.
- **TruLens** — LLM application evaluation and monitoring framework.
- **Weights & Biases** — Experiment tracking and evaluation logging. Useful for tracking metric trends across pipeline versions.

---

## Appendix C: Further Reading

### Retrieval Systems

**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** — Lewis et al. (2020). The original RAG paper. Useful for understanding the foundational architecture that later applied work builds on.

**"Precise Zero-Shot Dense Retrieval without Relevance Labels"** — Gao et al. (2022). Introduces HyDE (Hypothetical Document Embeddings) — generating a hypothetical answer and using its embedding for retrieval. Relevant for cases where the query vocabulary is very different from the document vocabulary.

**"A Survey of Transformers"** — Lin et al. (2022). Covers attention mechanisms and the model architectures behind most embedding and generation components. Useful background if you want to understand why context window position matters for generation grounding.

### Code Representation and Retrieval

**"CodeBERT: A Pre-Trained Model for Programming and Natural Languages"** — Feng et al. (2020). The foundational paper on code-specific pretrained models. Explains the training approach and evaluations that motivate using code-specific rather than text-only embedding models.

**"UniXcoder: Unified Cross-Modal Pre-Training for Code Representation"** — Guo et al. (2022). Extends CodeBERT to multiple downstream tasks including code search, code generation, and code summarization within a single model.

**"An Empirical Study of Deep Learning Models for Vulnerability Detection"** — Steenhoek et al. (2023). Examines how well neural models actually understand code semantics. Informative for setting realistic expectations about what embedding models can and cannot capture.

### RAG Architecture and Production

**"Self-RAG: Learning to Retrieve, Generate, and Critique Through Self-Reflection"** — Asai et al. (2023). Introduces adaptive retrieval — systems that decide when to retrieve rather than always retrieving. Relevant for code RAG systems where some queries are answerable from training data and retrieval adds noise.

**"RAGAS: Automated Evaluation of Retrieval Augmented Generation"** — Es et al. (2023). Describes the RAGAS evaluation framework. Useful companion to Chapter 8 of this guide.

**"Lost in the Middle: How Language Models Use Long Contexts"** — Liu et al. (2023). Empirical study showing that LLM performance degrades for information in the middle of long context windows. Directly relevant to context assembly strategy.

### Hybrid Retrieval

**"Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods"** — Cormack et al. (2009). The original RRF paper. Short, clear, worth reading to understand the formula and its properties.

**"BEIR: A Heterogenous Benchmark for Zero-Shot Evaluation of Information Retrieval Models"** — Thakur et al. (2021). Benchmark study showing which retrieval approaches generalize across different domains. Relevant for understanding when dense retrieval underperforms sparse retrieval and why.

---

*RAG for Code: A Complete Guide* — Version 1.0, April 2026
David Kelly Price



