# Building RAG Systems for Codebases: Architecture, Tradeoffs, and Implementation

## Introduction

Retrieval-Augmented Generation (RAG) solves a problem that most LLM-based developer tools fail to solve well: how do you give an AI assistant accurate, specific knowledge about a codebase it has never seen?

Fine-tuning is too slow and expensive for most teams. Stuffing the entire codebase into a context window costs too many tokens and degrades answer quality. RAG is the practical middle path: retrieve the relevant pieces of the codebase at query time, include them in the prompt, and let the model reason over real code.

This guide covers the architecture, the failure modes, and the implementation patterns for building a RAG system over a codebase that works in production — not just in demos.

---

## Part 1: Why Standard RAG Fails on Code

### The Vocabulary Problem

Standard RAG uses a general-purpose embedding model for retrieval. These models are trained on natural language text. They work well for retrieving documentation, blog posts, and prose.

Code has a different structure. Developer queries use natural language descriptions ("where does the payment webhook fire"). Code uses implementation names (`emit_payment_event`, `handle_webhook_callback`). The gap between these vocabularies is the primary failure mode of naive code RAG.

For that query, a general-purpose embedding model tends to rank the candidates like this:
- Documentation about payment webhooks ranks high (natural language match)
- The actual `emit_payment_event` function ranks low (vocabulary mismatch)

The documentation is often less useful than the implementation. But naive retrieval returns the documentation first, because prose about webhooks sits closer to the query under cosine similarity than an identifier-heavy function body does.
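
To make the gap concrete, the sketch below splits identifiers into words and checks which terms a query and an identifier share. It is only an illustration of the mismatch, not a retrieval method from this guide; `split_identifier` and `shared_terms` are hypothetical helper names.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split snake_case / camelCase text into lowercase words."""
    # Insert a space at lower-to-upper boundaries (camelCase),
    # then split on anything that is not a letter or digit.
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name)
    return [w.lower() for w in re.split(r"[^A-Za-z0-9]+", spaced) if w]

def shared_terms(query: str, identifier: str) -> set[str]:
    """Words the query and the identifier have in common after splitting."""
    return set(split_identifier(query)) & set(split_identifier(identifier))

query = "where does the payment webhook fire"
print(shared_terms(query, "emit_payment_event"))     # {'payment'}
print(shared_terms(query, "handleWebhookCallback"))  # {'webhook'}
```

Each function shares a single word with the query, and neither contains "fire", which is the verb the developer actually used.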

### The Granularity Problem

Standard RAG uses fixed-size text chunks, typically 512-1024 tokens with some overlap. This works reasonably well for prose (articles, documentation) where any 512-token window carries useful information.

Code is different. A 512-token chunk that starts in the middle of a function body is nearly meaningless. It lacks the function signature (what are the types?), the docstring (what does this do?), and the broader class context (what is this a method of?). The embedding of that chunk is noise.

Code must be chunked at semantic boundaries — functions, classes, methods — not arbitrary token counts.

### The Context Completeness Problem

Even with good retrieval, a single function is often insufficient context. A question like "how does the password reset flow work end-to-end" requires:
- The route handler
- The service function that generates a reset token
- The email sending function
- The token validation function when the user clicks the link
- The token storage model

Retrieving only the route handler and asking an LLM to explain the full flow tends to produce an answer that sounds plausible but invents the steps it was never shown.

Code RAG needs multi-hop retrieval or dependency-aware retrieval to surface the full relevant context.

---

## Part 2: Architecture

### Component Overview

```
User Query
    |
    v
Query Processing
    |
    +-- Query expansion / rewriting
    +-- Routing (what type of question is this?)
    |
    v
Retrieval
    |
    +-- Semantic search (embedding similarity)
    +-- Keyword search (BM25)
    +-- Graph traversal (dependency expansion)
    |
    v
Context Assembly
    |
    +-- De-duplication
    +-- Re-ranking
    +-- Token budget management
    |
    v
Generation
    |
    +-- Prompt construction
    +-- LLM call
    |
    v
Response + Sources
```
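
The diagram does not say how the routing step decides what type of question it is looking at. One possible heuristic, sketched below under that assumption, looks at surface features of the query: an explicit identifier suggests keyword search, and flow-style wording suggests dependency expansion. The function name `route_query`, the hint phrases, and the returned keys are all hypothetical.

```python
import re

# Matches `backticked` names, snake_case names, and camelCase names.
IDENTIFIER = re.compile(
    r"`[^`]+`"
    r"|\b[a-z]+(?:_[a-z0-9]+)+\b"
    r"|\b[a-z]+(?:[A-Z][a-z0-9]+)+\b"
)

# Illustrative phrases that hint at an end-to-end flow question.
FLOW_HINTS = ("end-to-end", "flow", "what happens when")

def route_query(query: str) -> dict:
    """Pick retrieval settings from surface features of the query."""
    has_identifier = bool(IDENTIFIER.search(query))
    is_flow = any(hint in query.lower() for hint in FLOW_HINTS)
    return {
        "use_keyword": has_identifier,     # exact names benefit from keyword search
        "use_semantic": True,              # always run embedding search
        "graph_depth": 1 if is_flow else 0,  # expand dependencies for flow questions
    }
```

A production router might use a small classifier or an LLM call instead; the heuristic only shows where the decision plugs in.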

### The Retrieval Layer

The retrieval layer is where most RAG systems for codebases fail. It has three sub-components:

**Semantic search** retrieves by meaning. The user's query is embedded and compared against the codebase index. Good for: finding functions that do what you're describing, even when the vocabulary doesn't match.

**Keyword search** retrieves by exact match. Good for: finding specific function names, variable names, imports, and identifiers the user mentions explicitly.

**Graph traversal** retrieves by dependency. Given a retrieved function, return its callers and callees. Good for: understanding end-to-end flows, finding all places a function is used, expanding context for complex queries.

All three are necessary for production quality. A system with only semantic search misses exact-name queries. A system with only keyword search misses descriptive queries. A system without graph traversal misses multi-hop context.

### The Context Assembly Layer

After retrieval, you have a set of candidate chunks. Not all of them fit in the prompt. The context assembly layer decides what to include:

**De-duplication**: If the same function was retrieved by both semantic and keyword search, include it once.

**Re-ranking**: A cross-encoder model (one that processes the query and the document jointly) rescores each retrieved chunk. This is typically more accurate than the initial embedding similarity, at the cost of one forward pass per query-chunk pair.

**Token budget management**: Given a max context budget (e.g., 8K tokens for context), select the highest-ranked chunks that fit. Include file paths and line numbers so the model can reference them.

**Coherence**: If you include a function, consider also including the class definition it belongs to (a few lines) so the model has structural context.
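
The re-ranking step can be sketched as a function that takes any pairwise scorer. The version below is a minimal sketch, not a specific library's API: `rerank` and `overlap_score` are hypothetical names, and `overlap_score` is a toy lexical stand-in used only so the example runs without a model. A real cross-encoder would replace it.

```python
import re
from typing import Callable

def rerank(
    query: str,
    chunks: list[dict],
    score_fn: Callable[[str, str], float],
    top_n: int = 10,
) -> list[dict]:
    """Rescore each chunk against the query and keep the best top_n."""
    scored = [{**c, "rerank_score": score_fn(query, c["content"])} for c in chunks]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]

def overlap_score(query: str, document: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query words in the document."""
    q = set(re.findall(r"[a-z0-9]+", query.lower()))
    d = set(re.findall(r"[a-z0-9]+", document.lower()))
    return len(q & d) / len(q) if q else 0.0

chunks = [
    {"id": "mail.py:1", "content": "def send_email(to, body): ..."},
    {"id": "auth.py:9", "content": "def validate_reset_token(token): ..."},
]
best = rerank("validate reset token", chunks, overlap_score, top_n=1)
print(best[0]["id"])  # auth.py:9
```

Keeping the scorer as a parameter makes it possible to swap models, or to skip re-ranking entirely, without touching the rest of the assembly layer.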

---

## Part 3: Implementation

### Step 1: Index Your Codebase

```python
import ast
import httpx
import asyncio
from pathlib import Path

async def embed_batch(texts: list[str], api_key: str) -> list[list[float]]:
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "https://api.pyckle.co/v1/embeddings",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"input": texts, "model": "pycklelm-1"}
        )
        response.raise_for_status()
        return [item["embedding"] for item in response.json()["data"]]

def extract_chunks(filepath: str) -> list[dict]:
    try:
        source = Path(filepath).read_text(encoding="utf-8")
        tree = ast.parse(source)
    except (SyntaxError, UnicodeDecodeError):
        # Skip files that are not valid UTF-8 Python source.
        return []

    chunks = []
    lines = source.splitlines()

    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue

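        # `lineno` is 1-indexed and points at the `def`/`class` line, so start
        # at the first decorator (if any) to keep decorators in the chunk
        # without pulling in the tail of the previous definition.
        first_line = min([d.lineno for d in node.decorator_list] + [node.lineno])
        start = first_line - 1
        end = node.end_lineno
        content = "\n".join(lines[start:end])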

        header = f"# File: {filepath}\n# {type(node).__name__}: {node.name}\n"

        chunks.append({
            "id": f"{filepath}:{node.lineno}",
            "content": header + content,
            "metadata": {
                "file": filepath,
                "name": node.name,
                "type": type(node).__name__.lower(),
                "line_start": node.lineno,
                "line_end": node.end_lineno,
            }
        })

    return chunks

async def index_codebase(root: str, api_key: str, store) -> int:
    all_chunks = []

    for filepath in Path(root).rglob("*.py"):
        if should_skip(filepath):
            continue
        all_chunks.extend(extract_chunks(str(filepath)))

    batch_size = 50
    for i in range(0, len(all_chunks), batch_size):
        batch = all_chunks[i:i+batch_size]
        embeddings = await embed_batch([c["content"] for c in batch], api_key)

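        # Keep the chunk text inside metadata so the retriever can read it
        # back as match["metadata"]["text"].
        store.upsert([
            {
                "id": c["id"],
                "values": emb,
                "metadata": {**c["metadata"], "text": c["content"]},
            }
            for c, emb in zip(batch, embeddings)
        ])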

        await asyncio.sleep(1.0)

    return len(all_chunks)

def should_skip(path: Path) -> bool:
    skip = {".git", "node_modules", "__pycache__", ".venv", "dist", "build", "vendor"}
    return any(p in skip for p in path.parts)
```

### Step 2: Build the Retrieval Pipeline

```python
class CodebaseRetriever:
    def __init__(self, store, api_key: str, bm25_index=None):
        self.store = store
        self.api_key = api_key
        self.bm25 = bm25_index

    async def retrieve(
        self,
        query: str,
        top_k: int = 20,
        file_filter: list[str] | None = None,
    ) -> list[dict]:
        async with httpx.AsyncClient(timeout=30) as client:
            resp = await client.post(
                "https://api.pyckle.co/v1/embeddings",
                headers={"Authorization": f"Bearer {self.api_key}"},
                json={"input": [query], "model": "pycklelm-1"},
            )
            resp.raise_for_status()
            query_emb = resp.json()["data"][0]["embedding"]

        filter_dict = {"file": {"$in": file_filter}} if file_filter else None
        semantic_results = self.store.query(
            vector=query_emb,
            top_k=top_k,
            filter=filter_dict,
            include_metadata=True,
            include_values=False,
        )

        results = [
            {
                "id": match["id"],
                "content": match["metadata"]["text"],
                "metadata": match["metadata"],
                "score": match["score"],
                "source": "semantic",
            }
            for match in semantic_results["matches"]
        ]

        if self.bm25:
            bm25_results = self.bm25.search(query, top_k=top_k)
            results = self._rrf_merge(results, bm25_results)

        return results

    def _rrf_merge(
        self,
        semantic: list[dict],
        bm25: list[dict],
        k: int = 60
    ) -> list[dict]:
        scores = {}
        docs = {}

        for rank, r in enumerate(semantic):
            scores[r["id"]] = scores.get(r["id"], 0) + 1 / (k + rank + 1)
            docs[r["id"]] = r

        for rank, r in enumerate(bm25):
            scores[r["id"]] = scores.get(r["id"], 0) + 1 / (k + rank + 1)
            if r["id"] not in docs:
                docs[r["id"]] = r

        return sorted(
            [{"rrf_score": s, **docs[id_]} for id_, s in scores.items()],
            key=lambda x: x["rrf_score"],
            reverse=True
        )
```
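
`CodebaseRetriever` accepts a `bm25_index` with a `search(query, top_k)` method, but the guide does not show one. The sketch below is one possible in-memory Okapi BM25 implementation, assuming chunk dicts shaped like the output of `extract_chunks`. The class name `SimpleBM25`, the tokenizer, and the default `k1` and `b` values are illustrative choices rather than part of any library.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric runs. Underscores act as separators, so
    # `emit_payment_event` yields ["emit", "payment", "event"].
    return re.findall(r"[a-z0-9]+", text.lower())

class SimpleBM25:
    """In-memory Okapi BM25 over chunk dicts with "id" and "content" keys."""

    def __init__(self, chunks: list[dict], k1: float = 1.5, b: float = 0.75):
        self.chunks = chunks
        self.k1, self.b = k1, b
        self.term_freqs = [Counter(tokenize(c["content"])) for c in chunks]
        self.doc_lens = [sum(tf.values()) for tf in self.term_freqs]
        self.avg_len = sum(self.doc_lens) / len(chunks) if chunks else 0.0
        # Document frequency: number of chunks that contain each term.
        self.doc_freq = Counter(term for tf in self.term_freqs for term in tf)

    def _idf(self, term: str) -> float:
        n = self.doc_freq.get(term, 0)
        return math.log((len(self.chunks) - n + 0.5) / (n + 0.5) + 1)

    def search(self, query: str, top_k: int = 20) -> list[dict]:
        terms = tokenize(query)
        scored = []
        for chunk, tf, length in zip(self.chunks, self.term_freqs, self.doc_lens):
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f == 0:
                    continue
                # Length-normalised term frequency saturation.
                norm = f + self.k1 * (1 - self.b + self.b * length / self.avg_len)
                score += self._idf(term) * f * (self.k1 + 1) / norm
            if score > 0:
                scored.append({**chunk, "score": score, "source": "keyword"})
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]
```

Because each result keeps the original chunk's `id`, `content`, and `metadata`, the output can be passed to `_rrf_merge` as is. For a large codebase an inverted index or a dedicated search engine would be a better fit than this linear scan.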

### Step 3: Dependency Expansion

```python
import ast
from collections import defaultdict
from pathlib import Path

# Reuses should_skip() from Step 1.

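def build_call_graph(root: str) -> dict[str, set[str]]:
    """Map each function name to the names it calls.

    Names are unqualified, so same-named functions in different modules
    are merged; treat the result as an approximation.
    """
    call_graph = defaultdict(set)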

    for filepath in Path(root).rglob("*.py"):
        if should_skip(filepath):
            continue
        source = filepath.read_text()
        try:
            tree = ast.parse(source)
        except SyntaxError:
            continue

        for node in ast.walk(tree):
            if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                continue

            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    if isinstance(child.func, ast.Name):
                        call_graph[node.name].add(child.func.id)
                    elif isinstance(child.func, ast.Attribute):
                        call_graph[node.name].add(child.func.attr)

    return dict(call_graph)

def expand_with_dependencies(
    retrieved: list[dict],
    call_graph: dict[str, set[str]],
    all_chunks: dict[str, dict],
    depth: int = 1
) -> list[dict]:
    seen_ids = {r["id"] for r in retrieved}
    expanded = list(retrieved)

    frontier = retrieved
    for _ in range(depth):
        next_frontier = []
        for chunk in frontier:
            fn_name = chunk["metadata"].get("name")
            if not fn_name:
                continue

            for called_fn in call_graph.get(fn_name, set()):
                for chunk_id, chunk_data in all_chunks.items():
                    if (chunk_data["metadata"].get("name") == called_fn
                            and chunk_id not in seen_ids):
                        expanded.append({**chunk_data, "source": "dependency"})
                        seen_ids.add(chunk_id)
                        next_frontier.append(chunk_data)

        frontier = next_frontier

    return expanded
```
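
`build_call_graph` maps each function to the functions it calls, so `expand_with_dependencies` only follows callees. The graph traversal described in Part 2 also covers callers. A minimal way to get them, sketched here with a hypothetical helper named `invert_call_graph`, is to reverse the edges:

```python
from collections import defaultdict

def invert_call_graph(call_graph: dict[str, set[str]]) -> dict[str, set[str]]:
    """Map each callee name to the set of function names that call it."""
    callers = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            callers[callee].add(caller)
    return dict(callers)

graph = {"reset_password": {"make_token", "send_email"}, "signup": {"send_email"}}
print(invert_call_graph(graph)["send_email"] == {"reset_password", "signup"})  # True
```

Passing the inverted graph to `expand_with_dependencies` in place of `call_graph` pulls in the functions that call each retrieved chunk, which suits "where is this used" questions.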

### Step 4: Context Assembly

```python
import tiktoken
from pathlib import Path

def assemble_context(
    chunks: list[dict],
    max_tokens: int = 8000,
    model: str = "gpt-4"
) -> tuple[str, list[dict]]:
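    # tiktoken implements OpenAI tokenizers. If generation uses a different
    # model family (as in Step 5), treat these counts as an approximation
    # and leave headroom in max_tokens.
    enc = tiktoken.encoding_for_model(model)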

    context_parts = []
    total_tokens = 0
    included = []

    ranked = sorted(
        chunks,
        key=lambda x: (x.get("source") == "dependency", -x.get("rrf_score", x.get("score", 0)))
    )

    for chunk in ranked:
        content = chunk["content"]
        meta = chunk["metadata"]

        formatted = (
            f"### {meta['file']}:{meta['line_start']}\n"
            f"```{infer_language(meta['file'])}\n"
            f"{content}\n"
            f"```\n\n"
        )

        chunk_tokens = len(enc.encode(formatted))

        if total_tokens + chunk_tokens > max_tokens:
            # Skip a chunk that does not fit; a smaller, lower-ranked one still might.
            continue

        context_parts.append(formatted)
        total_tokens += chunk_tokens
        included.append(chunk)

    return "".join(context_parts), included

def infer_language(filepath: str) -> str:
    ext_map = {".py": "python", ".ts": "typescript", ".tsx": "typescript",
               ".go": "go", ".rs": "rust", ".js": "javascript"}
    ext = Path(filepath).suffix
    return ext_map.get(ext, "")
```

### Step 5: Generation

```python
SYSTEM_PROMPT = """You are a senior software engineer helping a developer understand and work with their codebase.

You have been given relevant code snippets from the codebase as context. When answering:
- Reference specific files and line numbers from the provided context
- If the answer requires code not shown in the context, say so rather than guessing
- Explain data flow, not just what individual functions do
- If you can't find the answer in the provided context, say so explicitly

Context format: Each snippet is labeled with its file path and starting line number."""

async def answer_query(
    query: str,
    retriever: CodebaseRetriever,
    call_graph: dict,
    all_chunks: dict,
    llm_client,
    max_context_tokens: int = 8000,
) -> dict:
    raw_results = await retriever.retrieve(query, top_k=20)
    expanded = expand_with_dependencies(raw_results, call_graph, all_chunks, depth=1)
    context, included_chunks = assemble_context(expanded, max_tokens=max_context_tokens)

    response = await llm_client.messages.create(
        model="claude-sonnet-4-6",
        system=SYSTEM_PROMPT,
        messages=[
            {"role": "user", "content": f"## Codebase Context\n\n{context}\n\n## Question\n\n{query}"}
        ],
        max_tokens=2048,
    )

    return {
        "answer": response.content[0].text,
        "sources": [
            {"file": c["metadata"]["file"], "line": c["metadata"]["line_start"]}
            for c in included_chunks
        ],
        "chunks_retrieved": len(raw_results),
        "chunks_used": len(included_chunks),
    }
```

---

## Part 4: Evaluation

### What to Measure

**Retrieval recall**: For a set of known queries with known correct answers, what fraction of the correct chunks appear in the top-k retrieved results?

**Answer accuracy**: For factual questions about the codebase (e.g., "what parameters does `create_user` take?"), what fraction of generated answers are correct?

**Hallucination rate**: What fraction of answers contain claims that contradict the retrieved context or cannot be verified from it?
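
Measuring hallucination properly needs human or model-based grading, but one cheap proxy can be automated. The sketch below assumes answers cite code as `path:line`, matching the labels used in the context, and flags citations whose file was never in the provided sources. The helper name `unsupported_citations` and the citation regex are assumptions, and the check covers file paths only, not the claims themselves.

```python
import re

# A path-like token with a file extension, followed by ":<line number>".
CITATION = re.compile(r"([\w./-]+\.\w+):(\d+)")

def unsupported_citations(answer: str, sources: list[dict]) -> list[str]:
    """Return `file:line` citations in the answer that match no provided source file."""
    known_files = {s["file"] for s in sources}
    missing = []
    for path, line in CITATION.findall(answer):
        if path not in known_files:
            missing.append(f"{path}:{line}")
    return missing

answer = "Validation happens in src/auth/middleware.py:42 and src/auth/legacy.py:7."
sources = [{"file": "src/auth/middleware.py", "line": 42}]
print(unsupported_citations(answer, sources))  # ['src/auth/legacy.py:7']
```

The `sources` argument has the same shape as the `sources` list returned by `answer_query`, so the check can run on every response.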

### Building a Test Set

```python
test_cases = [
    {
        "query": "where does JWT token validation happen",
        "expected_chunks": ["src/auth/middleware.py:42", "src/auth/tokens.py:15"],
        "top_k": 10,
    },
    {
        "query": "how is a new user created",
        "expected_chunks": ["src/users/service.py:88", "src/users/models.py:12"],
        "top_k": 10,
    },
]

async def evaluate_retrieval(test_cases: list[dict], retriever) -> dict:
    results = []

    for case in test_cases:
        retrieved = await retriever.retrieve(case["query"], top_k=case["top_k"])
        retrieved_ids = {r["id"] for r in retrieved}
        expected_ids = set(case["expected_chunks"])

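        # Guard against test cases that list no expected chunks.
        recall = len(expected_ids & retrieved_ids) / len(expected_ids) if expected_ids else 0.0
        results.append({"query": case["query"], "recall": recall})

    avg_recall = sum(r["recall"] for r in results) / len(results) if results else 0.0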
    return {"mean_recall": avg_recall, "per_query": results}
```

---

## Part 5: Production Checklist

Before shipping a codebase RAG system to developers:

**Indexing:**
- [ ] Function-level (not fixed-size) chunking
- [ ] Metadata includes file path, line numbers, function name
- [ ] Incremental update pipeline (not full re-index on every change)
- [ ] Skip lists exclude generated code, dependencies, build artifacts
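
For the incremental update item, one common approach is to hash file contents and re-index only what changed. The sketch below assumes the previous run's hashes are persisted somewhere as a `{path: digest}` mapping; `file_digest` and `plan_reindex` are hypothetical names, and skip rules are left out for brevity.

```python
import hashlib
from pathlib import Path

def file_digest(path: Path) -> str:
    """SHA-256 of the file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def plan_reindex(
    root: str, previous: dict[str, str]
) -> tuple[list[str], list[str], dict[str, str]]:
    """Compare current file hashes with the previous run.

    Returns (changed_or_new_files, deleted_files, current_hashes).
    """
    current = {str(p): file_digest(p) for p in sorted(Path(root).rglob("*.py"))}
    changed = [f for f, digest in current.items() if previous.get(f) != digest]
    deleted = [f for f in previous if f not in current]
    return changed, deleted, current
```

Chunk IDs in this guide include line numbers, so an edit can shift the IDs of every function below it. A safe pattern is to delete all stored vectors for each changed or deleted file before re-running `extract_chunks` on the changed ones; whether the vector store can delete by file metadata is an assumption to check.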

**Retrieval:**
- [ ] Hybrid search (semantic + keyword) — not semantic-only
- [ ] Code-specific embedding model — not general-purpose
- [ ] Retrieval eval on L1 queries, not just L0

**Context Assembly:**
- [ ] Token budget enforced before prompt construction
- [ ] File + line number labels in context (enables model to cite sources)
- [ ] De-duplication across semantic and keyword results

**Generation:**
- [ ] System prompt instructs model to say "I don't know" rather than hallucinate
- [ ] Sources included in response for user verification
- [ ] Hallucination rate monitored in production

**Operations:**
- [ ] Staleness detection (warn when retrieved chunk was last indexed >N hours ago)
- [ ] Query logging (future training signal)
- [ ] Error handling for embedding API rate limits
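
For the rate-limit item, a fixed `asyncio.sleep(1.0)` between batches does not recover from a rejected request. A minimal sketch of retry with exponential backoff and jitter follows. The helper `with_backoff` and its parameters are hypothetical, and the caller supplies `is_retryable` because which errors are worth retrying depends on the HTTP client and the API.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_backoff(
    call: Callable[[], Awaitable[T]],
    is_retryable: Callable[[Exception], bool],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(), retrying on errors that is_retryable() accepts."""
    for attempt in range(max_attempts):
        try:
            return await call()
        except Exception as exc:
            # Give up on non-retryable errors and on the final attempt.
            if not is_retryable(exc) or attempt == max_attempts - 1:
                raise
            # Exponential backoff plus jitter: roughly 1s, 2s, 4s, ...
            delay = base_delay * 2 ** attempt + random.uniform(0, base_delay)
            await sleep(delay)
    raise ValueError("max_attempts must be at least 1")
```

With `httpx`, `raise_for_status()` raises `httpx.HTTPStatusError`, so a predicate along the lines of `lambda e: isinstance(e, httpx.HTTPStatusError) and e.response.status_code == 429` would retry only rate-limited calls. The `sleep` parameter exists so tests can replace the real delay.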

---

## Conclusion

Building a RAG system that works on a real codebase is more involved than the tutorials suggest. The failure modes are specific to code: vocabulary mismatch between developer queries and implementation names, fixed-size chunking breaking semantic units, and missing multi-hop context for end-to-end flow questions.

The architecture described in this guide addresses all three:
- Code-specific embeddings (Pyckle Embeddings API) close the vocabulary gap
- Function-level chunking preserves semantic units
- Dependency expansion adds multi-hop context

The result is a system that answers "how does the payment webhook flow work" with references to the actual code, not a hallucinated narrative.

---

*The Pyckle Embeddings API provides the retrieval model. Get started at pyckle.co.*