```markdown
---
title: "The Developer's Complete Guide to Code Embeddings"
subtitle: "From Theory to Production: How to Build Search Systems That Actually Understand Your Code"
author: "Kelly Price"
date: "2026-04-21"
description: "The definitive field guide for developers building code search and retrieval systems. Covers embeddings, chunking, vector stores, calibration, hybrid search, and production operations — with specific numbers and working code throughout."
tags: [embeddings, ai, developer-tools, machine-learning]
---

# The Developer's Complete Guide to Code Embeddings
### From Theory to Production: How to Build Search Systems That Actually Understand Your Code

**Kelly Price**

---

## About This Guide

This book exists because most of the writing on embeddings is either too abstract to act on or too narrow to generalize from. You get either the gradient-descent-from-scratch lecture or a vendor tutorial that assumes your codebase fits a specific shape. Neither helps when you're actually shipping a code search system at work.

What follows is a field guide. It's written for developers who are past the "what is a vector?" question and are trying to answer harder ones: Why does my search return the wrong file 40% of the time? Should I use a managed vector database or run pgvector myself? What does it actually cost in latency to add BM25 to my pipeline? Why does my model score an irrelevant file at 0.74 when the right file scores 0.78?

Every number in this book comes from somewhere real — production systems, published benchmarks, or direct experiments. Where I say "6ms embedding latency," that's measured on ONNX Runtime at 384 dimensions, not a theoretical estimate. Where I say "1.4x lower p95 latency than Pinecone," that's pgvector on existing Postgres hardware against a representative workload. I'll tell you the conditions when they matter and skip the asterisks when they don't.

The architecture this book describes — small domain-specific model, hybrid search, calibrated thresholds, adaptive re-indexing — is not the only architecture that works. It is the architecture that consistently outperforms the defaults when you measure what actually matters: retrieval quality, operational cost, and the ability to improve the system over time without rebuilding it.

A note on scope: this book focuses on retrieval, not generation. RAG systems, code completion, and LLM context injection all depend on retrieval as a prerequisite, and this book covers that prerequisite thoroughly. Once your retrieval layer is solid, connecting it to a generation layer is the easy part.

The chapters are designed to be read in order. The first three establish the conceptual foundations you'll need to make good infrastructure decisions in chapters four through six. Chapter seven on calibration is the one most developers skip and then regret — read it. Chapters eight and nine cover what happens after you ship, which is where most of the real learning happens.

By the end, you'll have a working mental model for every layer of a code search system, specific numbers to benchmark against, and runnable code for each component. That's the goal: not to survey the space, but to give you what you need to build something that works.

---

## Table of Contents

1. [What Embeddings Actually Are (Not the Wikipedia Definition)](#chapter-1)
2. [Why Code Embedding Is a Different Problem Than Text Embedding](#chapter-2)
3. [Model Selection: What the Benchmarks Aren't Telling You](#chapter-3)
4. [Chunking Strategies for Code: Function-Level, File-Level, Context-Aware](#chapter-4)
5. [Vector Stores: Infrastructure Decisions That Don't Require a PhD](#chapter-5)
6. [Hybrid Search: Semantic + BM25, When to Use Each](#chapter-6)
7. [Calibration: Thresholds, Margin Loss, and Why Static top_k Is a Trap](#chapter-7)
8. [Production Operations: Latency, Drift, Monitoring, Re-Indexing](#chapter-8)
9. [The Compounding Advantage: Why a Model Trained on Your Codebase Gets Better Over Time](#chapter-9)

**Conclusion**

**Appendix A:** Glossary
**Appendix B:** Tools and Resources
**Appendix C:** Further Reading

---

## Chapter 1: What Embeddings Actually Are (Not the Wikipedia Definition) {#chapter-1}

The Wikipedia definition of an embedding tells you it's a mapping from a high-dimensional discrete space to a lower-dimensional continuous one. That's true and almost entirely useless for building a system.

Here's a more useful frame: an embedding is a function that takes a piece of content — a string, a function, an image, a row in a database — and returns a fixed-length list of floating-point numbers that encodes meaning. Two pieces of content that mean similar things should produce vectors that are close together. Two pieces that mean different things should produce vectors that are far apart. That's the whole contract.

The "meaning" part is what the model learns during training. When you train a model on a corpus, you're teaching it what "similar" means in that domain. This is why the training data is the most consequential decision in the pipeline — not the architecture, not the number of dimensions, not the vector store you choose. Those decisions matter, but they're downstream of what your model learned to consider similar.

### The Geometry Matters More Than You Think

Embeddings live in a metric space. Every vector is a point, and distance between points is computable. The most common metric is cosine similarity: the cosine of the angle between two vectors, ranging from -1 (opposite directions) to 1 (identical direction). Dot product and Euclidean distance are alternatives with different tradeoffs.

Cosine similarity has a useful property for text and code: it's invariant to vector magnitude. A short function and a long one can have similar direction even if their magnitudes differ wildly because one had more tokens to average over. This makes cosine the default choice for most code search applications.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def cosine_similarity_matrix(queries: np.ndarray, corpus: np.ndarray) -> np.ndarray:
    """
    Batch cosine similarity: queries (Q, D) against corpus (N, D).
    Returns (Q, N) similarity matrix.
    """
    queries_norm = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    corpus_norm = corpus / np.linalg.norm(corpus, axis=1, keepdims=True)
    return queries_norm @ corpus_norm.T

# Example: find top-k results for a query vector
def top_k_results(query_vec: np.ndarray, corpus_vecs: np.ndarray, k: int = 10):
    sims = cosine_similarity_matrix(query_vec[np.newaxis, :], corpus_vecs)[0]
    k = min(k, len(sims))  # argpartition raises if k exceeds the corpus size
    top_indices = np.argpartition(sims, -k)[-k:]
    top_indices = top_indices[np.argsort(sims[top_indices])[::-1]]
    return top_indices, sims[top_indices]
```

> **Key Insight**: Cosine similarity measures direction, not magnitude. This means the length of the input text doesn't directly inflate similarity scores — a 5-line function and a 500-line module can still be "close" if they're about the same thing.
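
To make the magnitude-invariance claim concrete, here is a minimal dependency-free sketch. The vectors are made-up toy values, not model output. It scales one vector by 10 and compares how cosine similarity, dot product, and Euclidean distance respond.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    return dot(a, b) / (norm(a) * norm(b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy 3-dimensional vectors (illustrative values, not real embeddings).
query = [0.2, 0.5, 0.1]
doc = [0.3, 0.4, 0.2]
doc_scaled = [10 * x for x in doc]  # same direction, 10x the magnitude

print(f"cosine:    {cosine(query, doc):.4f} vs {cosine(query, doc_scaled):.4f}")
print(f"dot:       {dot(query, doc):.4f} vs {dot(query, doc_scaled):.4f}")
print(f"euclidean: {euclidean(query, doc):.4f} vs {euclidean(query, doc_scaled):.4f}")
```

Only the cosine row stays the same after scaling. The dot product grows tenfold and the Euclidean distance grows with it, which is why those metrics are usually paired with normalized vectors.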

### How Models Produce Embeddings

Most embedding models are transformer-based encoders. The input (a string of tokens) passes through a series of attention layers, and the output is a sequence of per-token representations. To produce a single vector for the whole input, the model applies pooling — most commonly mean pooling, which averages all token representations; sometimes CLS pooling, which uses a special classification token's representation.
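
The two pooling strategies can be sketched without a model. The token vectors and attention mask below are hypothetical stand-ins for an encoder's output; the sketch assumes the CLS token sits at position 0 and that padding positions are marked with 0 in the mask.

```python
def mean_pool(token_vecs, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_vecs[0])
    kept = [vec for vec, m in zip(token_vecs, attention_mask) if m == 1]
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def cls_pool(token_vecs):
    """Use the first token's vector (assumed here to be the CLS token)."""
    return list(token_vecs[0])

# Hypothetical encoder output: 4 positions x 2 dimensions; the last is padding.
token_vecs = [[1.0, 0.0], [0.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
attention_mask = [1, 1, 1, 0]

print(mean_pool(token_vecs, attention_mask))  # [1.0, 2.0]
print(cls_pool(token_vecs))                   # [1.0, 0.0]
```

Skipping the mask is an easy mistake to make in hand-rolled pooling: the padding vector would be averaged in and shift every short input toward the same point.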

The dimension of the output vector is a design choice made during training. 384-dimensional embeddings are common for lightweight models: small enough to be fast, large enough to encode meaningful distinctions. 1536-dimensional vectors (OpenAI's `text-embedding-3-small` output size) carry more capacity but cost proportionally more in storage, compute, and bandwidth.

For code search at production scale, 384 dimensions is usually the right tradeoff. The retrieval quality difference between 384 and 1536 is smaller than you'd expect, and the operational cost difference is substantial.

> **Warning**: Don't conflate model output dimension with model capability. A 384-dimensional model trained specifically on code will outperform a 1536-dimensional general-purpose model on code search tasks. The dimension is the output size, not a proxy for quality.

### Training: What the Model Actually Learns

Embedding models are typically trained with contrastive loss. You present the model with pairs: (anchor, positive) where the positive is semantically similar to the anchor, and (anchor, negative) where the negative is not. The model is trained to minimize the distance to positives and maximize the distance to negatives.

The hard part is constructing negatives. Random negatives are easy to discriminate — the model quickly learns to separate "how to sort a list" from "database migration strategy." What trains the model well is *hard negatives*: pairs that are superficially similar but semantically different. For code, this means two different authentication functions, or two different error handlers that do different things.
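
A minimal sketch of hard-negative selection, assuming you already have embeddings for an anchor and for a pool of candidates known not to match it. The function name `pick_hard_negatives` and the toy vectors are illustrative, not from any library. The idea is simply to keep the negatives that the current model scores as most similar to the anchor.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def pick_hard_negatives(anchor, negatives, n=1):
    """Return the n negative names with the highest similarity to the anchor.

    negatives: dict mapping a name to its embedding; all are known non-matches.
    """
    ranked = sorted(negatives, key=lambda name: cosine(anchor, negatives[name]), reverse=True)
    return ranked[:n]

# Toy vectors standing in for real embeddings.
anchor = [1.0, 0.1, 0.0]                    # e.g. a JWT authentication function
negatives = {
    "check_permissions": [0.9, 0.3, 0.1],   # looks similar, does something else
    "run_db_migration": [0.0, 0.2, 1.0],    # obviously unrelated
    "render_dashboard": [0.1, 1.0, 0.2],
}

print(pick_hard_negatives(anchor, negatives, n=1))  # ['check_permissions']
```

In a real pipeline the candidate pool would come from your own corpus, and the selection would be repeated as the model improves, since what counts as "hard" changes during training.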

Triplet loss and margin ranking loss are common formulations:

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss(
    anchor: torch.Tensor,
    positive: torch.Tensor,
    negative: torch.Tensor,
    margin: float = 0.3
) -> torch.Tensor:
    """
    Forces positive similarity to exceed negative similarity by at least `margin`.
    """
    pos_sim = F.cosine_similarity(anchor, positive)
    neg_sim = F.cosine_similarity(anchor, negative)
    loss = F.relu(margin - (pos_sim - neg_sim))
    return loss.mean()
```

This loss function is a preview of Chapter 7's calibration discussion. The gap between positive and negative scores — not just whether they're ordered correctly — determines whether your retrieval threshold can actually separate relevant from irrelevant results.
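
A worked example of that loss for a single triplet, in plain Python so the arithmetic is visible. The scores 0.78 and 0.74 echo the question raised in About This Guide, and the margin of 0.3 matches the default above.

```python
def margin_loss(pos_sim: float, neg_sim: float, margin: float = 0.3) -> float:
    """Hinge-style loss: zero once pos_sim beats neg_sim by at least `margin`."""
    return max(0.0, margin - (pos_sim - neg_sim))

# Correctly ordered, but the gap (0.04) is far smaller than the margin.
print(round(margin_loss(0.78, 0.74), 2))  # 0.26

# A gap of 0.5 exceeds the margin, so there is nothing left to penalize.
print(margin_loss(0.85, 0.35))            # 0.0
```

The first case is the interesting one: ranking is already correct, yet the loss is non-zero, so training keeps pushing the two scores apart.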

> **Try This**: Take any embedding model you're currently using and run this diagnostic. Generate embeddings for 20 pairs of functions you consider semantically similar (e.g., two different sorting implementations) and 20 pairs you consider different. Plot the score distributions. If the similar and dissimilar distributions overlap significantly, your model has poor discriminative power for code — and no amount of tuning the top_k will fix that.
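
If plotting is inconvenient, the overlap can be summarized as a single number. The sketch below computes the fraction of (similar, different) pairings in which the similar pair scores higher, which is the pairwise form of ROC AUC. The score lists are made-up placeholders; substitute your own measurements.

```python
def separation(similar_scores, different_scores):
    """Fraction of (similar, different) pairings where the similar score wins.

    1.0 means the distributions do not overlap; 0.5 means the model is
    no better than chance at telling the groups apart. Ties count half.
    """
    wins = 0.0
    for s in similar_scores:
        for d in different_scores:
            if s > d:
                wins += 1.0
            elif s == d:
                wins += 0.5
    return wins / (len(similar_scores) * len(different_scores))

# Placeholder scores, not measurements from any real model.
similar = [0.82, 0.79, 0.91, 0.74]
different = [0.76, 0.55, 0.80, 0.61]

print(f"separation = {separation(similar, different):.4f}")
```

With these placeholder values, 13 of the 16 pairings are ordered correctly. The three that are not are exactly the cases a fixed threshold would get wrong.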

### Embeddings Are Not Magic, They're Compressed Knowledge

The most important misconception to shed early: embeddings are not looking up meaning from some universal semantic database. They're compressing input into a point in a learned space. The space was shaped by training data, and its geometry reflects what was in that data.

If your model was trained primarily on English prose, it learned a geometry where "authentication" and "login" are close because they co-occur in prose contexts. It did not necessarily learn that `authenticate_user()` and `verify_credentials()` are close, because function naming conventions aren't prose patterns. This is the domain gap, and it's the subject of the entire next chapter.

---

### Key Takeaways

- An embedding is a fixed-length vector encoding meaning. Similar things should produce nearby vectors; the training data defines what "similar" means.
- Cosine similarity is the default metric because it's magnitude-invariant. Use it unless you have a specific reason not to.
- Output dimension (384 vs 1536) is a tradeoff between capacity and cost, not a direct proxy for quality.
- Training data and loss function shape the embedding space more than architecture choices.
- Hard negatives are what make embedding models useful for fine-grained retrieval.

### Practical Exercise

Run this script with pairs drawn from any codebase you have access to. It uses Sentence Transformers, but any embedding model (OpenAI, Sentence Transformers, or a local model) can be swapped in:

```python
"""
Embedding discriminability diagnostic.
Measures how well your model separates similar-but-different code pairs.
"""
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")  # swap in your model

# Define pairs: (text_a, text_b, label) where label=1 means similar, 0 means different
pairs = [
    ("def authenticate_user(token): ...", "def verify_token(jwt): ...", 1),
    ("def authenticate_user(token): ...", "def render_dashboard(user): ...", 0),
    ("class UserRepository:", "class AccountRepository:", 1),
    ("class UserRepository:", "class ImageProcessor:", 0),
]

scores = []
for a, b, label in pairs:
    va = model.encode(a)
    vb = model.encode(b)
    sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
    scores.append((sim, label))
    print(f"{'similar' if label else 'different':10s}  score={sim:.4f}  |  {a[:40]}")

similar_scores = [s for s, l in scores if l == 1]
diff_scores = [s for s, l in scores if l == 0]
print(f"\nSimilar mean: {np.mean(similar_scores):.4f}")
print(f"Different mean: {np.mean(diff_scores):.4f}")
print(f"Gap: {np.mean(similar_scores) - np.mean(diff_scores):.4f}")
```

A gap below 0.15 indicates poor discriminability. Note your baseline — you'll revisit this after Chapter 3.

---

## Chapter 2: Why Code Embedding Is a Different Problem Than Text Embedding {#chapter-2}

When you take a model trained on natural language and point it at source code, something subtle goes wrong. The model doesn't fail loudly. It produces vectors, scores come back, results appear. But the results are systematically off in ways that are hard to diagnose without understanding why code is fundamentally different from the text the model was trained on.

This chapter is about that difference.

### The Structural Unit Problem

Natural language has implicit structure: sentences, paragraphs, documents. These units are noisy — a paragraph can be anything from one sentence to twenty — but they carry semantic coherence by convention. When you embed a paragraph, you're capturing a unit of thought.

Code has explicit structure: tokens, expressions, statements, blocks, functions, classes, modules, packages. These units are defined by the language grammar, not by the author's stylistic choices. A function is a function. It has a signature, a body, and well-defined boundaries.

This matters for chunking (Chapter 4) and for understanding what "similar" means in code. Two functions that both authenticate users are similar even if they're implemented differently. Two files in the same module are related even if their functions are unrelated. These relationships are structural, and a model that doesn't understand code structure can't capture them.

Embedding a fragment of a function in isolation is like embedding a single frame from a film and calling it a movie summary. The function body without its signature loses argument context. The signature without the docstring loses intent. The function without its module-level imports loses type information. Code embeds well when it's chunked with enough surrounding context to be interpreted.

> **Key Insight**: The right unit of embedding for code is not the file and not the line. It's the semantic unit — typically the function with its signature, decorators, and immediate type context — plus enough surrounding structure to make it interpretable in isolation.
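
One way to extract such a unit from Python source is the standard `ast` module. This is a sketch under assumptions: the helper name `function_chunks` is hypothetical, it handles only Python, and "immediate type context" is approximated by whatever annotations appear in the signature.

```python
import ast

def function_chunks(source: str) -> list[dict]:
    """Return one chunk per function, including decorators and the full body."""
    lines = source.splitlines()
    chunks = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # node.lineno points at the `def` line; decorators sit above it.
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            text = "\n".join(lines[start - 1 : node.end_lineno])
            chunks.append({
                "name": node.name,
                "docstring": ast.get_docstring(node),
                "text": text,
            })
    return chunks

sample = '''import functools

@functools.lru_cache(maxsize=128)
def lookup_user(user_id: int) -> dict:
    """Fetch a user record by id."""
    return {"id": user_id}
'''

for chunk in function_chunks(sample):
    print(chunk["name"], "->", repr(chunk["docstring"]))
    print(chunk["text"])
```

Note that `ast.get_source_segment` on a function node starts at the `def` line and drops decorators, which is why the sketch slices by line number instead.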

### The Identifier Vocabulary Problem

Natural language models learn vocabulary from billions of tokens of human writing. The word "authentication" appears millions of times, in many contexts, with many co-occurrences. A model trained on prose understands its meaning richly.

The identifier `useAuthorizationMiddleware` appears in one codebase. Maybe two. The identifier `validateJwtPayload` appears in perhaps a few thousand codebases across GitHub. Generic models have seen these identifiers some number of times, but not enough to form rich representations — and critically, not enough to distinguish between subtly different variants in your specific codebase.

This is the domain gap. When you embed `useAuthorizationMiddleware` and `useAuthenticationMiddleware` with a generic model, it may produce nearly identical vectors because the model has no strong representation of either identifier and falls back to the components they share ("use", "Auth", "Middleware"). The resulting vectors are hard to tell apart even though the functions do different things.

```python
# Demonstrating domain gap: run this with a generic model
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

# These are different functions — one checks authorization, one checks authentication
auth_fn = """
def useAuthorizationMiddleware(request, resource_id):
    if not request.user.has_permission(resource_id):
        raise PermissionDenied("Insufficient permissions for resource")
    return True
"""

authn_fn = """
def useAuthenticationMiddleware(request):
    token = request.headers.get("Authorization", "").split(" ")[-1]
    if not verify_jwt(token):
        raise AuthenticationError("Invalid or expired token")
    return request.user
"""

va = model.encode(auth_fn)
vb = model.encode(authn_fn)
sim = np.dot(va, vb) / (np.linalg.norm(va) * np.linalg.norm(vb))
print(f"Similarity between authorization and authentication middleware: {sim:.4f}")
# On most generic models, this will be 0.91+ — indistinguishably similar
```

> **Warning**: A high similarity score between two different code objects is not a sign that your retrieval is working well — it may be a sign that your model can't tell them apart. Before trusting similarity scores, verify discriminability on pairs you know should be different.
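
A rough way to see why the two middleware identifiers collapse together is to split them into word-like components and compare the sets. Real tokenizers use learned subword vocabularies rather than camelCase splitting, so treat this as an approximation of the effect, not a description of what any particular model does.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split camelCase / PascalCase / snake_case into lowercase components."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def jaccard(a: list[str], b: list[str]) -> float:
    """Shared components divided by total distinct components."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb)

authz = split_identifier("useAuthorizationMiddleware")
authn = split_identifier("useAuthenticationMiddleware")

print(authz)                  # ['use', 'authorization', 'middleware']
print(authn)                  # ['use', 'authentication', 'middleware']
print(jaccard(authz, authn))  # 0.5
```

Two of three components are identical, and the third pair shares a prefix. A model with no dedicated representation for either identifier has very little left to separate them.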

### The Structural vs. Semantic Duality

Code has two kinds of similarity that matter independently:

**Structural similarity**: Two functions have the same shape — similar argument types, similar control flow, similar error handling patterns. A sorting algorithm in Python and a sorting algorithm in Go might be structurally similar even with different surface syntax.

**Semantic similarity**: Two functions do the same thing at a higher level of abstraction, even if implemented completely differently. A bubble sort and a quicksort are both "sorting" but structurally dissimilar.

Generic models, trained on prose, mostly capture semantic similarity in the natural language sense: "this function is about authentication" is the kind of cluster they form. They're poor at structural similarity because they weren't trained to recognize code patterns as patterns.

Domain-specific code models can capture both, but only if the training data included examples of both. This is why training on triplets — (function, similar function, different function) — is more valuable than pretraining on raw code.
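
Structural similarity can be made tangible with a crude proxy: compare the sequence of AST node types and ignore every name and literal. This only illustrates the distinction; it is not how an embedding model represents structure, and the helper and sample functions are hypothetical.

```python
import ast

def shape(source: str) -> list[str]:
    """Node-type sequence of the parsed code; names and constants are ignored."""
    return [type(node).__name__ for node in ast.walk(ast.parse(source))]

total_price = """
def total_price(items):
    total = 0
    for item in items:
        total += item.price
    return total
"""

word_count = """
def word_count(docs):
    count = 0
    for doc in docs:
        count += doc.length
    return count
"""

newest = """
def newest(items):
    return max(items, key=lambda item: item.created_at)
"""

print(shape(total_price) == shape(word_count))  # True: identical shape, different domain
print(shape(total_price) == shape(newest))      # False
```

The first two functions share a shape while operating on unrelated domains, which is the kind of relationship a prose-trained model has no particular reason to notice.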

### Context Windows and Long Code

Many embedding models cap input at 512 tokens, and some stop earlier: all-MiniLM-L6-v2, used in the examples above, truncates at 256 word pieces by default. A 50-line Python function with docstrings can easily exceed either limit. What happens when input is too long depends on the model implementation: some models truncate silently, some error, some apply sliding-window averaging.

Silent truncation is the dangerous case. The model encodes the first 512 tokens of a long function and ignores the rest. If the critical logic is in the second half, the embedding doesn't represent it at all.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def estimate_token_count(text: str) -> int:
    return len(tokenizer.encode(text, add_special_tokens=True))

def check_truncation_risk(code: str, max_length: int = 512) -> dict:
    token_count = estimate_token_count(code)
    return {
        "token_count": token_count,
        "max_length": max_length,
        "will_truncate": token_count > max_length,
        "truncation_pct": max(0, (token_count - max_length) / token_count * 100)
    }

# Example audit ("your_module.py" is a placeholder path)
with open("your_module.py", encoding="utf-8") as f:
    sample_code = f.read()
result = check_truncation_risk(sample_code)
if result["will_truncate"]:
    print(f"WARNING: {result['truncation_pct']:.1f}% of content will be lost")
```

The fix isn't always to split the function. Sometimes it's to change what you embed: instead of the full function body, embed the signature + docstring + a compressed representation of the implementation. The goal is a meaningful, complete unit, not a maximum-length unit.
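
A sketch of that alternative for Python, using only the standard library. What counts as a "compressed representation of the implementation" is a design choice the text leaves open; here it is assumed to be the sorted list of names the function calls. `compact_repr` is a hypothetical helper, and `ast.unparse` requires Python 3.9 or later.

```python
import ast

def compact_repr(func_source: str) -> str:
    """Signature + docstring + called names, as a short text to embed."""
    func = ast.parse(func_source).body[0]
    if not isinstance(func, (ast.FunctionDef, ast.AsyncFunctionDef)):
        raise ValueError("expected source that starts with a function definition")

    prefix = "async def" if isinstance(func, ast.AsyncFunctionDef) else "def"
    signature = f"{prefix} {func.name}({ast.unparse(func.args)})"
    if func.returns is not None:
        signature += f" -> {ast.unparse(func.returns)}"

    # Names of everything the body calls, e.g. verify_jwt or store.extend.
    calls = sorted({ast.unparse(n.func) for n in ast.walk(func) if isinstance(n, ast.Call)})

    parts = [signature]
    docstring = ast.get_docstring(func)
    if docstring:
        parts.append(docstring)
    if calls:
        parts.append("calls: " + ", ".join(calls))
    return "\n".join(parts)

example = '''
def refresh_session(request, ttl: int = 3600) -> bool:
    """Extend the session if the token is still valid."""
    token = request.headers.get("Authorization", "")
    if not verify_jwt(token):
        return False
    store.extend(request.session_id, ttl)
    return True
'''

print(compact_repr(example))
```

The output is three short lines regardless of how long the body grows, so it stays well under the token limit. The cost is that control flow and literals are discarded, so measure retrieval quality before adopting it wholesale.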

### Why Code-Specific Training Changes the Game

The reason domain-specific models outperform generic ones on code isn't that they're better models in some general sense. It's that they learned a different geometry. When a model is trained on code triplets — where the positive is "another function that does the same thing" and the negative is "a function that looks similar but does something different" — it learns to separate those categories.

A 40M parameter model trained from scratch on code-specific triplets can achieve 0.456 MRR@10 on CodeSearchNet, which is 62% better than GraphCodeBERT. The architecture is smaller. The parameter count is smaller. The key difference is what it was trained on and what loss function shaped its geometry.

> **Try This**: Pick 5 pairs of functions from your codebase that you personally know are doing similar jobs, and 5 pairs you know are doing different jobs. Run them through whatever model you're currently using and record the scores. Do the similar pairs consistently score higher than the different pairs? If the distributions overlap, you have a model calibration problem, and adding more infrastructure around it won't fix the underlying issue.

### The Linked Knowledge Dimension

Code doesn't exist in isolation. It has documentation, issue trackers, commit messages, PR descriptions, and comments that describe intent and behavior in natural language. These linked artifacts are semantically valuable but structurally different from the code they describe.

A model that can bridge this gap — understanding that `validateJwtPayload` is related to the PR titled "Fix JWT expiration edge case" — can serve queries that mix natural language intent with code structure. This is why training on linked knowledge (code + associated documentation + structured metadata) produces better retrieval than training on code alone.

The practical implication: when you're building your indexing pipeline, include docstrings, inline comments, and if accessible, linked ticket/PR descriptions as part of the chunk. Not as separate embeddings — as part of the text that gets embedded alongside the code.
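
A minimal sketch of assembling that text. The `CodeChunk` fields, the section labels, and the `auth/jwt.py` path are assumptions made for illustration. The only point carried over from the text is that code and linked prose end up in one string that is embedded once.

```python
from dataclasses import dataclass, field

@dataclass
class CodeChunk:
    path: str
    code: str  # should already include the docstring and inline comments
    linked_notes: list[str] = field(default_factory=list)  # PR titles, ticket summaries

def build_embedding_text(chunk: CodeChunk, max_notes: int = 3) -> str:
    """Concatenate code and linked prose into a single string to embed."""
    parts = [f"# file: {chunk.path}", chunk.code.strip()]
    notes = [n.strip() for n in chunk.linked_notes if n.strip()][:max_notes]
    if notes:
        parts.append("# linked context:")
        parts.extend(f"# - {n}" for n in notes)
    return "\n".join(parts)

chunk = CodeChunk(
    path="auth/jwt.py",  # hypothetical path
    code='def validateJwtPayload(payload):\n    """Reject expired or malformed payloads."""\n    ...',
    linked_notes=["Fix JWT expiration edge case"],
)
print(build_embedding_text(chunk))
```

Capping the number of notes is deliberate: linked prose competes with the code for the model's token budget, and the truncation risk from earlier in this chapter applies to the combined string.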

---

### Key Takeaways

- Generic models flatten domain-specific identifiers because they lack enough training signal to distinguish them.
- Code has structural units (functions, classes, modules) with explicit boundaries — use them as chunking boundaries.
- Silent truncation at the embedding model's token limit is a common, hard-to-diagnose quality problem.
- Code has two kinds of similarity (structural and semantic) that require different training signals to capture.
- Including linked artifacts (docstrings, comments, PR descriptions) in the embedded text improves cross-modal retrieval.

### Practical Exercise

Run a truncation audit on your codebase to understand how much content is at risk:

```bash
#!/bin/bash
# truncation_audit.sh
# Counts functions by token length to estimate truncation risk

python3 << 'EOF'
import ast
from pathlib import Path
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
MAX_TOKENS = 512
results = {"ok": 0, "at_risk": 0, "truncated": 0}

for py_file in Path(".").rglob("*.py"):
    try:
        source = py_file.read_text()
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                func_src = ast.get_source_segment(source, node) or ""
                tokens = len(tokenizer.encode(func_src))
                if tokens > MAX_TOKENS:
                    results["truncated"] += 1
                elif tokens > MAX_TOKENS * 0.85:
                    results["at_risk"] += 1
                else:
                    results["ok"] += 1
    except Exception:
        continue

total = sum(results.values())
if total == 0:
    # Avoid a ZeroDivisionError when no functions were parsed.
    raise SystemExit("No Python functions found under the current directory.")
print(f"Total functions: {total}")
print(f"OK (<85% limit):    {results['ok']:5d} ({results['ok']/total*100:.1f}%)")
print(f"At risk (85-100%):  {results['at_risk']:5d} ({results['at_risk']/total*100:.1f}%)")
print(f"Will truncate:      {results['truncated']:5d} ({results['truncated']/total*100:.1f}%)")
EOF
```

---

## Chapter 3: Model Selection: What the Benchmarks Aren't Telling You {#chapter-3}

Every embedding model vendor publishes benchmark results. Most benchmark results are not useful for predicting how a model will perform on your codebase. Understanding why requires understanding how benchmarks work and where they fail.

### The Benchmark Trap

CodeSearchNet is the most widely cited benchmark for code retrieval. It tests a model's ability to match natural language queries to the correct code snippet across six programming languages. It's a reasonable proxy for one retrieval task: "I describe what I want in English, find the function that does it."

It's a poor proxy for several other retrieval tasks that matter in production:
- "Find all the places we handle rate limiting" (pattern retrieval)
- "What functions are structurally similar to this one I'm reading?" (structural similarity)
- "Which files are most relevant to this bug report?" (cross-modal retrieval)
- "Find the authentication middleware" (identifier retrieval with disambiguation)

A model that achieves 0.456 MRR@10 on CodeSearchNet is genuinely good at the first task. Whether it's good at the others depends on what it was trained on, and CodeSearchNet doesn't test them.

> **Warning**: When a vendor says "state-of-the-art on CodeSearchNet," that tells you exactly one thing: it's good at natural-language-to-code matching on six languages with clean docstrings. That's useful but not sufficient for most production code search workloads.

### What MRR@10 Actually Measures

Mean Reciprocal Rank at 10 (MRR@10) measures how high the first correct result appears in the top 10 results, averaged across all queries. If the correct result is always rank 1, MRR@10 = 1.0. If it's always rank 10, MRR@10 = 0.1. If it's never in the top 10, MRR@10 = 0.

This is a useful metric because it weights finding the right answer first more heavily than finding it eventually. A system that returns the right function at rank 1 is more useful than one that returns it at rank 7.

What MRR@10 doesn't capture:
- Whether wrong results are close or far in score from the right result (discriminability)
- Whether the system degrades gracefully when the right answer doesn't exist
- How scores behave at distribution shift — when queries are unlike training data

```python
def mrr_at_k(rankings: list[list[int]], relevant_items: list[int], k: int = 10) -> float:
    """
    rankings: list of ordered result lists (each list is one query's results)
    relevant_items: the correct item index for each query
    k: cutoff
    """
    reciprocal_ranks = []
    for ranking, relevant in zip(rankings, relevant_items):
        rr = 0.0
        for i, item in enumerate(ranking[:k]):
            if item == relevant:
                rr = 1.0 / (i + 1)
                break
        reciprocal_ranks.append(rr)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
```
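
To illustrate the discriminability limitation listed above, the sketch below scores two hypothetical systems on the same three queries. Both rank the correct item first every time, so their MRR@10 is identical, but the score margin between the correct item and the best wrong one differs by an order of magnitude. All scores are invented for the example.

```python
def reciprocal_rank(scores: dict[str, float], correct: str, k: int = 10) -> float:
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    return 1.0 / (ranked.index(correct) + 1) if correct in ranked else 0.0

def top_margin(scores: dict[str, float], correct: str) -> float:
    """Score of the correct item minus the best-scoring incorrect item."""
    best_wrong = max(v for name, v in scores.items() if name != correct)
    return scores[correct] - best_wrong

# Each entry: (scores per candidate, correct candidate). Invented values.
system_a = [({"f1": 0.78, "f2": 0.74, "f3": 0.70}, "f1"),
            ({"f1": 0.66, "f2": 0.69, "f3": 0.65}, "f2"),
            ({"f1": 0.71, "f2": 0.70, "f3": 0.73}, "f3")]
system_b = [({"f1": 0.81, "f2": 0.42, "f3": 0.30}, "f1"),
            ({"f1": 0.35, "f2": 0.77, "f3": 0.41}, "f2"),
            ({"f1": 0.28, "f2": 0.39, "f3": 0.84}, "f3")]

for name, system in [("A", system_a), ("B", system_b)]:
    mrr = sum(reciprocal_rank(s, c) for s, c in system) / len(system)
    margin = sum(top_margin(s, c) for s, c in system) / len(system)
    print(f"system {name}: MRR@10={mrr:.2f}  mean margin={margin:.2f}")
```

System A would be far more fragile under a fixed similarity threshold, and MRR@10 alone gives no hint of that.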

### Dimension vs. Quality

There's an intuition that bigger embeddings are better: more dimensions means more capacity, and more capacity means more expressiveness. This is true up to a point and then quickly becomes false.

The relationship between dimension and quality depends on training. A 1536-dimension model that was trained on general text will have 1536 dimensions of general-text geometry. A 384-dimension model trained on code triplets will have 384 dimensions of code geometry. For code retrieval, the latter is better despite being smaller.

The operational cost of higher dimensions:
- Storage: a 1M-vector index at 384 dimensions requires ~1.5GB of float32 storage. At 1536 dimensions, that's ~6GB.
- Query latency: the cost of each cosine similarity computation scales linearly with dimension, so going from 384 to 1536 is roughly a 4x difference per comparison.
- Memory: holding the full index in RAM for fast search requires proportionally more memory.
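
The storage figures follow from simple arithmetic: number of vectors times dimensions times bytes per value. The sketch below reproduces them, assuming raw float32 vectors with no index overhead, metadata, or compression (real vector stores add some of each).

```python
def raw_index_bytes(n_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Raw vector storage only: no index structures, metadata, or compression."""
    return n_vectors * dim * bytes_per_value

for dim in (384, 1536):
    size_gb = raw_index_bytes(1_000_000, dim) / 1e9
    print(f"{dim} dims: {size_gb:.2f} GB")
# 384 dims: 1.54 GB
# 1536 dims: 6.14 GB
```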

For codebases under 100K functions, 384 dimensions is almost always the right choice.

> **Key Insight**: The 40M parameter PyckLM model with 384 dimensions achieves better retrieval on code-specific tasks than models with 10x more parameters and 4x more dimensions, because the geometry was shaped by code-specific training. Parameters encode knowledge, and the right knowledge for code comes from code training data.

### The Architecture That Makes This Work

PyckLM's architecture: 6 transformer layers, 8 attention heads, 384 output dimensions, 40M parameters total. Random initialization — no pretrained language model base. The model was trained from scratch on 968,692 code triplets spanning Python, JavaScript, TypeScript, Go, and Rust, plus linked natural language.

Random initialization matters. Most embedding models start from a pretrained language model (BERT, RoBERTa, CodeBERT) and fine-tune on code. That pretrained initialization carries an inductive bias toward natural language patterns. Training from scratch on code allows the model to learn a geometry that isn't constrained by prose-space priors.

The training cost: 38.8 minutes on an H100. Final cosine accuracy: 0.996. These numbers are useful context for thinking about whether training your own model is feasible — on modern hardware, for a 40M parameter model, it is.

### Evaluating Models for Your Use Case

The right way to select a model is to build a small evaluation set from your actual codebase and run candidate models against it. This doesn't require a full labeled dataset. It requires:

1. 50-100 queries that represent real searches in your system
2. For each query, the correct function/file (ground truth)
3. A script that runs each candidate model and measures MRR or Recall@K

```python
import numpy as np
from typing import Callable

def evaluate_model_on_corpus(
    model_fn: Callable[[list[str]], np.ndarray],
    queries: list[str],
    corpus: list[str],
    ground_truth_indices: list[int],
    k: int = 10
) -> dict:
    """
    model_fn: takes list of strings, returns (N, D) ndarray of embeddings
    queries: natural language queries
    corpus: code chunks to search over
    ground_truth_indices: for each query, index in corpus of correct result
    """
    query_vecs = model_fn(queries)
    corpus_vecs = model_fn(corpus)

    # Normalize out of place so arrays returned by model_fn are not mutated
    query_vecs = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    corpus_vecs = corpus_vecs / np.linalg.norm(corpus_vecs, axis=1, keepdims=True)

    scores_matrix = query_vecs @ corpus_vecs.T  # (Q, N)

    mrr_scores = []
    recall_at_k = []

    for scores, gt in zip(scores_matrix, ground_truth_indices):
        ranked = np.argsort(scores)[::-1]
        rank_of_gt = np.where(ranked == gt)[0][0] + 1
        mrr_scores.append(1.0 / rank_of_gt if rank_of_gt <= k else 0.0)
        recall_at_k.append(1.0 if rank_of_gt <= k else 0.0)

    return {
        "mrr_at_k": np.mean(mrr_scores),
        "recall_at_k": np.mean(recall_at_k),
        "k": k,
        "n_queries": len(queries),
    }
```

> **Try This**: Build a 25-query evaluation set from your codebase today. Write the queries as natural language ("function that validates email format", "error handler for database timeouts"). Have each query point to the function you'd want returned first. Run this against three candidate models. The model that wins on your eval set is the right model — not the one that wins on CodeSearchNet.

### When to Use a Generic Model Anyway

Generic models are the right choice when:
- You're prototyping and need to ship quickly
- Your codebase is small enough that precision doesn't matter much (under 500 functions)
- You're building a cross-modal system that embeds both code and natural language prose in the same space

They're the wrong choice when:
- Precision at the top of the ranking matters (you're injecting results into LLM context)
- Your identifiers are domain-specific
- You need to distinguish between similar but different functions

The operational path looks like: start with a generic model, measure quality on your eval set, and when the quality delta justifies the cost, switch to a domain-specific model or fine-tune on your own data.

---

### Key Takeaways

- Benchmark performance on CodeSearchNet predicts performance on NL-to-code matching, not broader retrieval tasks.
- 384-dimensional embeddings outperform 1536-dimensional ones for code when the 384-dim model was trained on code data.
- Random initialization + code-specific training can outperform pretrained language model fine-tuning.
- Build your own eval set from real queries — 25-50 examples is enough to differentiate models.
- Generic models are fine for prototypes and small corpora; domain-specific models pay off at scale or where precision matters.

### Practical Exercise

Compare three models on the same 25-query eval set you built above:

```bash
pip install sentence-transformers numpy
```

```python
"""
model_comparison.py — run three candidate models against your eval set
"""
import json
import numpy as np
from sentence_transformers import SentenceTransformer

# Load your eval set
# Format: list of {"query": str, "correct_chunk": str, "corpus": [str, ...], "gt_index": int}
with open("eval_set.json") as f:
    eval_set = json.load(f)

models = {
    "all-MiniLM-L6-v2": SentenceTransformer("all-MiniLM-L6-v2"),
    "all-mpnet-base-v2": SentenceTransformer("all-mpnet-base-v2"),
    "code-search-net": SentenceTransformer("flax-sentence-embeddings/st-codesearch-distilroberta-base"),
}

for model_name, model in models.items():
    mrrs = []
    for item in eval_set:
        all_texts = item["corpus"]
        vecs = model.encode(all_texts, normalize_embeddings=True)
        q_vec = model.encode([item["query"]], normalize_embeddings=True)[0]
        scores = vecs @ q_vec
        ranked = np.argsort(scores)[::-1]
        rank = np.where(ranked == item["gt_index"])[0][0] + 1
        mrrs.append(1.0 / rank if rank <= 10 else 0.0)
    print(f"{model_name:50s}  MRR@10={np.mean(mrrs):.4f}")
```

---

## Chapter 4: Chunking Strategies for Code: Function-Level, File-Level, Context-Aware {#chapter-4}

Chunking is the decision you spend the least time thinking about and the one that has the most impact on retrieval quality. The model and vector store are substitutable; a bad chunking strategy corrupts every query.

The core problem: embeddings are fixed-size, but code units vary enormously in size. A one-line utility function and a 400-line class are both valid retrieval targets, but they'll have very different embedding quality under naive chunking.

### The Three Chunking Strategies

There are three practical strategies for code, each with a different tradeoff:

**Function-level chunking**: Each function (including its signature, decorators, and immediate docstring) becomes one chunk. This is the most natural unit for code retrieval because it aligns with how developers think about code: "find the function that does X."

**File-level chunking**: Each file becomes one chunk (or each file is split into overlapping windows if it's large). This captures file-level context — imports, module docstrings, class-level state — but loses function-level resolution.

**Context-aware chunking**: Each function is embedded with surrounding context: the class it belongs to, the module-level imports, and enough of the file structure to make it interpretable in isolation.

For most production systems, function-level chunking is the starting point and context-aware chunking is the upgrade you make when you measure quality and find it lacking.

> **Key Insight**: The right granularity is the granularity that matches what users are searching for. If users search for functions, chunk by function. If users search for files ("which file handles database connection pooling?"), you need file-level chunks too. Most systems need both, served from the same index with different chunk types.

### Implementing Function-Level Chunking

```python
import ast
import textwrap
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator

@dataclass
class CodeChunk:
    file_path: str
    chunk_type: str  # "function", "class", "module"
    name: str
    source: str
    start_line: int
    end_line: int
    parent_class: str | None = None
    module_context: str | None = None  # imports + module docstring

def extract_module_context(source: str, tree: ast.Module) -> str:
    """Extract imports and module-level docstring for context injection."""
    context_lines = []

    # Module docstring (None when the module has no docstring)
    docstring = ast.get_docstring(tree)
    if docstring:
        context_lines.append(f'"""{docstring[:200]}"""')

    # Imports (top of file only)
    for node in tree.body[:20]:  # avoid scanning whole file
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            context_lines.append(ast.get_source_segment(source, node) or "")

    return "\n".join(context_lines[:30])  # cap context size

def chunk_python_file(file_path: str) -> Iterator[CodeChunk]:
    source = Path(file_path).read_text(encoding="utf-8", errors="replace")
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return  # skip files that do not parse

    module_ctx = extract_module_context(source, tree)
    lines = source.splitlines()

    # Record the directly enclosing class of each method in one pass,
    # rather than re-walking the whole tree for every function.
    enclosing_class: dict[ast.AST, str] = {}
    for cls in ast.walk(tree):
        if isinstance(cls, ast.ClassDef):
            for child in cls.body:
                if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    enclosing_class[child] = cls.name

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # node.lineno points at the `def` line, so start at the first
            # decorator to keep decorators inside the chunk
            start = min([d.lineno for d in node.decorator_list] + [node.lineno])
            end = node.end_lineno or node.lineno
            # Dedent so that methods can be re-parsed on their own later
            func_source = textwrap.dedent("\n".join(lines[start - 1:end]))
            if not func_source.strip():
                continue

            yield CodeChunk(
                file_path=file_path,
                chunk_type="function",
                name=node.name,
                source=func_source,
                start_line=start,
                end_line=end,
                parent_class=enclosing_class.get(node),
                module_context=module_ctx,
            )

def build_embedded_text(chunk: CodeChunk) -> str:
    """Construct the text that actually gets embedded."""
    parts = []
    if chunk.module_context:
        parts.append(f"# Module context\n{chunk.module_context}\n")
    if chunk.parent_class:
        parts.append(f"# Class: {chunk.parent_class}")
    parts.append(chunk.source)
    return "\n".join(parts)
```

### Handling Token Limits Without Losing Information

The practical constraint: many encoder-based embedding models cap input at 512 tokens and drop anything past that. The approach isn't to let the model truncate; it's to construct the embedded text so the most important content falls within the budget.

The hierarchy of importance for a function:
1. Function signature (always include, always first)
2. Docstring (high signal, usually short)
3. First ~10 lines of body (function setup, arg validation)
4. Module imports (type context)
5. Remaining function body (include if space allows)

```python
import logging

from transformers import AutoTokenizer

logger = logging.getLogger(__name__)
_tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")

def construct_chunk_text(chunk: CodeChunk, max_tokens: int = 480) -> str:
    """Build embedded text, prioritizing high-value content."""
    sig_and_doc = extract_signature_and_docstring(chunk.source)
    body = extract_body(chunk.source)
    context = chunk.module_context or ""

    # Always include signature
    core = sig_and_doc

    # Add context if it fits
    candidate = f"{context}\n{core}" if context else core
    if len(_tokenizer.encode(candidate)) <= max_tokens:
        core = candidate

    # Add body progressively. Each candidate is built from the fixed base,
    # so lines that were already accepted are not appended a second time.
    base = core
    body_lines = body.splitlines()
    kept = 0
    for i in range(1, len(body_lines) + 1):
        candidate = base + "\n" + "\n".join(body_lines[:i])
        if len(_tokenizer.encode(candidate)) > max_tokens:
            break
        core, kept = candidate, i

    # Never truncate silently: record what was left out
    if kept < len(body_lines):
        logger.warning(
            "%s:%s truncated, kept %d of %d body lines",
            chunk.file_path, chunk.name, kept, len(body_lines),
        )

    return core

def _body_start(func_source: str) -> int:
    """Index of the first body line after the signature and docstring."""
    func = ast.parse(func_source).body[0]
    first = func.body[0]
    has_docstring = (
        isinstance(first, ast.Expr)
        and isinstance(first.value, ast.Constant)
        and isinstance(first.value.value, str)
    )
    if not has_docstring:
        return first.lineno - 1
    if len(func.body) > 1:
        return func.body[1].lineno - 1
    # Docstring-only function: everything belongs to the signature part
    return len(func_source.splitlines())

def extract_signature_and_docstring(func_source: str) -> str:
    lines = func_source.splitlines()
    try:
        return "\n".join(lines[:_body_start(func_source)])
    except Exception:
        return lines[0] if lines else ""  # fallback to the first line

def extract_body(func_source: str) -> str:
    # Starts after the docstring so the docstring is not embedded twice
    try:
        return "\n".join(func_source.splitlines()[_body_start(func_source):])
    except Exception:
        return ""
```

> **Warning**: Never silently truncate. If a chunk exceeds your token budget, log it. You need to know which functions are losing content so you can decide whether to split them, change the priority order, or flag them for review.

### Class-Level and File-Level Chunks

Some queries naturally target larger units. "Find the database connection pool class" requires a class-level chunk. "Which file handles CSV parsing?" requires file-level chunks.

The practical solution: build a multi-granularity index. Index function-level chunks for fine-grained retrieval and file-level chunks for coarse-grained retrieval. When a query comes in, search both and merge results.

```python
def chunk_at_all_levels(file_path: str) -> list[CodeChunk]:
    chunks = list(chunk_python_file(file_path))  # function level

    # Add file-level chunk
    source = Path(file_path).read_text(encoding="utf-8", errors="replace")
    try:
        tree = ast.parse(source)
        # File summary: docstring + all function/class names
        names = [
            node.name for node in ast.walk(tree)
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
        ]
        file_summary = f"# File: {file_path}\n"
        docstring = ast.get_docstring(tree) or ""
        if docstring:
            file_summary += f'"""{docstring[:300]}"""\n'
        file_summary += "# Defines: " + ", ".join(names[:50])

        chunks.append(CodeChunk(
            file_path=file_path,
            chunk_type="file",
            name=Path(file_path).name,
            source=file_summary,
            start_line=1,
            end_line=max(1, len(source.splitlines())),  # correct even without a trailing newline
        ))
    except Exception:
        pass

    return chunks
```
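The code above builds both chunk types but stops before the "search both and merge" step. One possible sketch follows, assuming all chunks live in a single matrix of L2-normalized vectors with a parallel list of chunk types. The function name `search_multi_granularity`, the per-type quota, and the toy vectors are assumptions for illustration.

```python
import numpy as np

def search_multi_granularity(
    query_vec: np.ndarray,
    vectors: np.ndarray,          # (N, D), rows L2-normalized
    chunk_types: list[str],       # e.g. "function" or "file", one per row
    per_type_k: int = 5,
) -> list[tuple[int, str, float]]:
    """Return the top per_type_k rows of each chunk type, best score first."""
    query_vec = query_vec / np.linalg.norm(query_vec)
    scores = vectors @ query_vec
    results = []
    for ctype in sorted(set(chunk_types)):
        # Rank only the rows that belong to this chunk type
        rows = [i for i, t in enumerate(chunk_types) if t == ctype]
        rows.sort(key=lambda i: -scores[i])
        results.extend((i, ctype, float(scores[i])) for i in rows[:per_type_k])
    # Merge both granularities into a single list ordered by similarity
    results.sort(key=lambda r: -r[2])
    return results

# Toy data: three function chunks and one file chunk in 2 dimensions
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8], [0.8, 0.6]])
chunk_types = ["function", "function", "function", "file"]
hits = search_multi_granularity(np.array([1.0, 0.0]), vectors, chunk_types, per_type_k=2)
for row, ctype, score in hits:
    print(row, ctype, round(score, 2))
```

A per-type quota keeps file-level summaries from being crowded out by the far more numerous function chunks. Cosine scores for short file summaries and long function bodies are not necessarily comparable, so treat the final sort as a starting point rather than a calibrated ranking.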

### Overlap and Sliding Windows

For large functions that must be split (beyond your token budget even after prioritization), the standard approach is sliding windows with overlap:

```python
def sliding_window_chunks(
    text: str,
    tokenizer,
    window_tokens: int = 400,
    overlap_tokens: int = 50
) -> list[str]:
    """Split long text into overlapping token windows."""
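    if overlap_tokens >= window_tokens:
        raise ValueError("overlap_tokens must be smaller than window_tokens")
    # Assumes a Hugging Face tokenizer; skip special tokens so that
    # markers like <s> and </s> do not end up inside the decoded windows
    tokens = tokenizer.encode(text, add_special_tokens=False)
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + window_tokens, len(tokens))
        chunk_tokens = tokens[start:end]
        chunks.append(tokenizer.decode(chunk_tokens))
        if end == len(tokens):
            break
        start += window_tokens - overlap_tokens
    return chunks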
```

The overlap ensures that content near the window boundaries appears in two chunks, preventing important content from being split across a boundary and lost.
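The window arithmetic is easy to get wrong by one stride, so here is a sketch of just the boundary calculation, independent of any tokenizer. `window_bounds` is a hypothetical helper; it reproduces the start and end offsets that the sliding-window loop above walks through for a given token count.

```python
def window_bounds(
    n_tokens: int, window_tokens: int = 400, overlap_tokens: int = 50
) -> list[tuple[int, int]]:
    """Return (start, end) token offsets for each sliding window."""
    if overlap_tokens >= window_tokens:
        # With no forward progress the loop below would never terminate
        raise ValueError("overlap_tokens must be smaller than window_tokens")
    stride = window_tokens - overlap_tokens
    bounds = []
    start = 0
    while start < n_tokens:
        end = min(start + window_tokens, n_tokens)
        bounds.append((start, end))
        if end == n_tokens:
            break
        start += stride
    return bounds

bounds = window_bounds(1000)
print(bounds)
```

For a 1,000-token function this yields windows at (0, 400), (350, 750), and (700, 1000). Each window starts 350 tokens after the previous one, so the 50 tokens on either side of a boundary appear in both neighbors.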

> **Try This**: After implementing chunking, run a chunk size histogram on your codebase. Plot the distribution of token counts per chunk. You want a distribution roughly centered around 150-300 tokens, with tails cut off at your model's limit. A large spike near the limit means many chunks are being truncated or are close to it.
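A minimal sketch of that histogram, assuming you already have one token count per chunk (for example from `len(tokenizer.encode(text))`). The bucket width, the limit of 512, and the sample counts are placeholders.

```python
from collections import Counter

def token_histogram(
    token_counts: list[int], bucket: int = 50, limit: int = 512
) -> tuple[dict[int, int], int]:
    """Bucket chunk sizes and count how many exceed the model limit."""
    buckets = Counter((n // bucket) * bucket for n in token_counts)
    over_limit = sum(1 for n in token_counts if n > limit)
    return dict(sorted(buckets.items())), over_limit

BUCKET = 50
# Hypothetical token counts for eight chunks
hist, over_limit = token_histogram([40, 120, 180, 210, 260, 300, 505, 900], bucket=BUCKET)
for start, count in hist.items():
    print(f"{start:4d}-{start + BUCKET - 1:<4d} {'#' * count} ({count})")
print(f"chunks over limit: {over_limit}")
```

The over-limit count is the number to watch: those are the chunks that need splitting or re-prioritizing before they reach the model.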

---

### Key Takeaways

- Function-level chunking is the correct default — it aligns with how developers conceptualize code.
- Context-aware chunking (function + module imports + class context) improves quality at the cost of more tokens per chunk.
- Build a multi-granularity index: function-level for precision, file-level for coarse recall.
- Construct embedded text with a priority order: signature first, docstring second, body third.
- Never silently truncate. Log oversized chunks and handle them explicitly.

### Practical Exercise

Run the chunker against your codebase and build a distribution report:

```bash
python3 << 'EOF'
import ast
from pathlib import Path
from collections import Counter

sizes = Counter()
oversized = []
# Line counts stand in for token counts here; swap in a tokenizer for exact numbers.

for py_file in Path(".").rglob("*.py"):
    try:
        source = py_file.read_text()
        tree = ast.parse(source)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                src = ast.get_source_segment(source, node) or ""
                lines = len(src.splitlines())
                bucket = (lines // 10) * 10
                sizes[bucket] += 1
                if lines > 80:
                    oversized.append((str(py_file), node.name, lines))
    except Exception:
        continue

print("Function size distribution (lines):")
for bucket in sorted(sizes):
    bar = "█" * min(sizes[bucket], 40)
    print(f"  {bucket:4d}-{bucket+9:<4d}: {bar} ({sizes[bucket]})")

print(f"\nLarge functions (>80 lines): {len(oversized)}")
for path, name, lines in sorted(oversized, key=lambda x: -x[2])[:10]:
    print(f"  {lines:4d} lines  {name}  ({path})")
EOF
```

---

## Chapter 5: Vector Stores: Infrastructure Decisions That Don't Require a PhD {#chapter-5}

The vector store decision gets more attention than it deserves. Developers spend weeks evaluating Pinecone vs. Weaviate vs. Qdrant vs. pgvector before they've built their first chunk. The infrastructure decision is important, but it's downstream of chunking and model quality — get those right first.

This chapter gives you the actual decision framework: what matters, what doesn't, and the specific numbers that should anchor your choices.

### The Fundamental Operations

Every vector store does four things:
1. **Insert**: store a vector + associated metadata
2. **Search**: given a query vector, return the top-K nearest vectors
3. **Update/Delete**: modify or remove existing vectors
4. **Filter**: search within a subset defined by metadata predicates

Differences between stores appear in: query latency, index build time, recall at a given query budget, operational complexity, and cost. These are the axes you should evaluate on — not feature checklists or marketing comparisons.

### When You Don't Need a Vector Store

For codebases under 100K functions, in-memory NumPy cosine search is faster and simpler than any dedicated vector store. The math: 100K functions × 384 dimensions × 4 bytes = 153MB of RAM. That fits in memory on any modern server with room to spare.
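The same arithmetic as a small helper, so you can plug in your own chunk count and dimensionality. `index_memory_mb` is a hypothetical name, and it counts only the raw vector matrix, not metadata or Python object overhead.

```python
import numpy as np

def index_memory_mb(n_vectors: int, dim: int, dtype=np.float32) -> float:
    """Raw size of an (n_vectors, dim) matrix in megabytes (10^6 bytes)."""
    return n_vectors * dim * np.dtype(dtype).itemsize / 1e6

print(index_memory_mb(100_000, 384))               # float32
print(index_memory_mb(100_000, 384, np.float64))   # NumPy's default float dtype
print(index_memory_mb(1_000_000, 1536))            # a larger model and corpus
```

The second call is worth noticing: NumPy defaults to float64, which doubles the footprint, so it helps to cast embeddings to float32 before indexing.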

```python
import numpy as np
import json
from pathlib import Path
from dataclasses import dataclass, field

@dataclass
class InMemoryIndex:
    vectors: np.ndarray = field(default_factory=lambda: np.empty((0, 384)))
    metadata: list[dict] = field(default_factory=list)

    def add(self, vector: np.ndarray, meta: dict):
        vector = vector / np.linalg.norm(vector)  # normalize on insert
        self.vectors = np.vstack([self.vectors, vector[np.newaxis, :]]) \
            if len(self.vectors) else vector[np.newaxis, :]
        self.metadata.append(meta)

    def add_batch(self, vectors: np.ndarray, metas: list[dict]):
        norms = np.linalg.norm(vectors, axis=1, keepdims=True)
        vectors = vectors / norms
        self.vectors = np.vstack([self.vectors, vectors]) \
            if len(self.vectors) else vectors
        self.metadata.extend(metas)

    def search(self, query: np.ndarray, k: int = 10) -> list[tuple[dict, float]]:
        query = query / np.linalg.norm(query)
        scores = self.vectors @ query  # cosine sim (already normalized)
        top_k_idx = np.argpartition(scores, -min(k, len(scores)))[-k:]
        top_k_idx = top_k_idx[np.argsort(scores[top_k_idx])[::-1]]
        return [(self.metadata[i], float(scores[i])) for i in top_k_idx]

    def save(self, path: str):
        np.save(f"{path}.npy", self.vectors)
        with open(f"{path}.meta.json", "w") as f:
            json.dump(self.metadata, f)

    @classmethod
    def load(cls, path: str) -> "InMemoryIndex":
        idx = cls()
        idx.vectors = np.load(f"{path}.npy")
        with open(f"{path}.meta.json") as f:
            idx.metadata = json.load(f)
        return idx
```

Query latency for 100K vectors at 384 dimensions: approximately 3-8ms on a single CPU core using NumPy's optimized BLAS routines. That's competitive with dedicated vector stores without the operational overhead.

> **Key Insight**: For codebases under 100K functions, in-memory NumPy cosine search beats dedicated vector databases on both latency and operational cost. The crossover point where a dedicated store becomes worthwhile is around 500K-1M vectors, depending on your latency requirements.

### pgvector: The Practical Choice for Medium Scale

When you cross the in-memory threshold, or when you need transactional semantics (update individual vectors when files change), pgvector is the practical choice. It's a Postgres extension — if you're already running Postgres, pgvector adds negligible operational complexity.

The case for pgvector: self-hosted on existing Postgres hardware, it runs at 1.4x lower p95 latency than Pinecone with 79% lower monthly cost. These aren't cherry-picked numbers — they reflect the fundamental advantage of co-locating the vector index with your application database on hardware you control.

```sql
-- Setup pgvector
CREATE EXTENSION IF NOT EXISTS vector;

-- Schema for code chunks
CREATE TABLE code_chunks (
    id          BIGSERIAL PRIMARY KEY,
    file_path   TEXT NOT NULL,
    chunk_type  TEXT NOT NULL DEFAULT 'function',
    name        TEXT,
    source_text TEXT,
    start_line  INT,
    end_line    INT,
    embedding   vector(384),
    indexed_at  TIMESTAMPTZ DEFAULT NOW(),
    file_hash   TEXT  -- for detecting file changes
);

-- HNSW index for approximate nearest neighbor search
CREATE INDEX ON code_chunks USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64);

-- Search query
CREATE OR REPLACE FUNCTION search_chunks(
    query_embedding vector(384),
    result_limit INT DEFAULT 10,
    similarity_threshold FLOAT DEFAULT 0.5
)
RETURNS TABLE(
    id BIGINT,
    file_path TEXT,
    name TEXT,
    source_text TEXT,
    similarity FLOAT
) AS $$
    SELECT
        id,
        file_path,
        name,
        source_text,
        1 - (embedding <=> query_embedding) AS similarity
    FROM code_chunks
    WHERE 1 - (embedding <=> query_embedding) >= similarity_threshold
    ORDER BY embedding <=> query_embedding
    LIMIT result_limit;
$$ LANGUAGE sql;
```

```python
import psycopg2
import numpy as np
from psycopg2.extras import execute_values

class PgVectorIndex:
    def __init__(self, dsn: str):
        self.conn = psycopg2.connect(dsn)

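    def insert_chunks(self, chunks: list[dict], embeddings: np.ndarray):
        """Batch insert chunks with their embeddings.

        Note: ON CONFLICT DO NOTHING only deduplicates against a unique
        constraint, and the schema above has none beyond the primary key.
        Call delete_by_file() before re-indexing a file to avoid duplicates.
        """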
        rows = [
            (
                c["file_path"], c["chunk_type"], c["name"],
                c["source"], c["start_line"], c["end_line"],
                embedding.tolist(), c.get("file_hash", "")
            )
            for c, embedding in zip(chunks, embeddings)
        ]
        with self.conn.cursor() as cur:
            execute_values(cur, """
                INSERT INTO code_chunks
                    (file_path, chunk_type, name, source_text,
                     start_line, end_line, embedding, file_hash)
                VALUES %s
                ON CONFLICT DO NOTHING
            """, rows)
        self.conn.commit()

    def search(self, query_vec: np.ndarray, k: int = 10, threshold: float = 0.5):
        with self.conn.cursor() as cur:
            cur.execute(
                "SELECT * FROM search_chunks(%s::vector, %s, %s)",
                (query_vec.tolist(), k, threshold)
            )
            return cur.fetchall()

    def delete_by_file(self, file_path: str):
        with self.conn.cursor() as cur:
            cur.execute("DELETE FROM code_chunks WHERE file_path = %s", (file_path,))
        self.conn.commit()
```

### HNSW: The Silent Degradation Problem

HNSW (Hierarchical Navigable Small World) is the approximate nearest neighbor algorithm used by many vector stores, including pgvector, Qdrant, and Weaviate. It trades recall for speed: instead of scanning every vector, it navigates a graph structure to find approximate nearest neighbors quickly.

The problem that developers routinely miss: HNSW degrades silently as the index grows. Query latency stays flat, so nothing looks wrong, but retrieval recall drops because the graph structure built at index creation time doesn't account for later additions.

Specifically: when you insert a vector into an HNSW index, it's connected to existing nodes. If the index was built with certain connectivity parameters (`m`, `ef_construction`) and you add many vectors over time without rebuilding, the graph becomes less navigable and recall drops without any latency signal.

> **Warning**: HNSW performance degradation is silent. Latency stays stable while retrieval quality drops as the index grows. Monitor recall on a fixed evaluation set over time — if recall drops while latency stays constant, your index needs rebuilding. Schedule periodic full index rebuilds (e.g., weekly) rather than relying on incremental inserts indefinitely.

```python
def monitor_hnsw_recall(
    index,
    eval_queries: list[np.ndarray],
    eval_ground_truth: list[list[int]],
    k: int = 10
) -> float:
    """
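    Measure recall@k of the HNSW index against a fixed ground truth.
    eval_ground_truth holds, for each query, the chunk IDs returned by an
    exact (brute-force) search. Assumes index.search returns rows whose
    first field is the chunk ID, as PgVectorIndex.search does.
    If recall drops over time, rebuild the index.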
    """
    hits = 0
    total = 0
    for query, gt_ids in zip(eval_queries, eval_ground_truth):
        results = index.search(query, k=k)
        returned_ids = {r[0] for r in results}
        hits += len(set(gt_ids) & returned_ids)
        total += len(gt_ids)
    return hits / total if total > 0 else 0.0
```
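The function above needs ground truth to compare against. One way to produce it, sketched here under the assumption that your corpus vectors fit in memory as a normalized NumPy matrix, is brute-force search: it is exact by construction, so any gap between it and the index's results is recall lost to approximation. `exact_top_k` and `recall_against_exact` are illustrative names.

```python
import numpy as np

def exact_top_k(query: np.ndarray, corpus: np.ndarray, k: int) -> list[int]:
    """Brute-force top-k row indices by cosine similarity (rows normalized)."""
    scores = corpus @ (query / np.linalg.norm(query))
    return [int(i) for i in np.argsort(-scores)[:k]]

def recall_against_exact(approx_ids: list[int], exact_ids: list[int]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    if not exact_ids:
        return 0.0
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

# Toy corpus of four unit vectors
corpus = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
exact = exact_top_k(np.array([1.0, 0.0]), corpus, k=2)
# Pretend the ANN index returned rows 0 and 2 instead of the true top 2
recall = recall_against_exact([0, 2], exact)
print(exact, recall)
```

Store the exact results once for a fixed query set and re-run only the index side on a schedule. A falling recall number with flat latency is the signal described in the warning above.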

### Managed vs. Self-Hosted

The case for managed (Pinecone, Weaviate Cloud, Qdrant Cloud): zero ops overhead, scales automatically, no index rebuilds to schedule.

The case for self-hosted (pgvector, Qdrant on your own server): significantly lower cost at scale, data stays in your infrastructure, co-location with your application reduces network latency.

For most teams, the crossover point: if you're spending more than $200/month on a managed vector store, run the math on self-hosted pgvector. The 79% cost reduction number reflects a real workload — approximately 5M vectors with 1K queries/day — where the managed tier was $800/month and pgvector on existing Postgres infrastructure was $165/month (incremental compute and storage).
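The percentage follows directly from the two monthly figures. A small sketch you can rerun with your own bills (the function name is illustrative):

```python
def cost_reduction_pct(managed_monthly: float, self_hosted_monthly: float) -> float:
    """Percentage saved by moving from the managed bill to the self-hosted one."""
    return (managed_monthly - self_hosted_monthly) / managed_monthly * 100

saving = cost_reduction_pct(800, 165)
print(f"{saving:.1f}% lower, ${800 - 165}/month saved")
```

Note that the self-hosted figure here is incremental cost on existing Postgres hardware. If you would have to provision a new database server, add that to the second argument before comparing.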

---

### Key Takeaways

- For under 100K functions, in-memory NumPy search is faster and simpler than any dedicated vector store.
- pgvector on existing Postgres offers 1.4x lower p95 latency than managed alternatives at 79% lower cost at medium scale.
- HNSW degradation is silent — monitor recall on a fixed eval set, not just latency.
- The correct vector store decision depends on scale, latency requirements, and operational capacity — not feature comparisons.
- Batch inserts are 10-50x faster than individual inserts; always buffer writes.

### Practical Exercise

Benchmark in-memory search at your corpus size. This gives you the baseline to compare against when you run the same queries through pgvector:

```python
import numpy as np
import time

# Simulate your corpus
N = 50_000  # replace with your actual chunk count
D = 384
corpus = np.random.randn(N, D).astype(np.float32)
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)
query = np.random.randn(D).astype(np.float32)
query /= np.linalg.norm(query)

# Benchmark in-memory cosine search
times = []
for _ in range(100):
    t0 = time.perf_counter()
    scores = corpus @ query
    top_k = np.argpartition(scores, -10)[-10:]
    times.append(time.perf_counter() - t0)

p50 = sorted(times)[50] * 1000
p95 = sorted(times)[95] * 1000
print(f"In-memory search ({N:,} vectors, {D}D)")
print(f"  p50: {p50:.2f}ms  p95: {p95:.2f}ms")
```

---

## Chapter 6: Hybrid Search: Semantic + BM25, When to Use Each {#chapter-6}

The debate between keyword search and semantic search misses the point. They're not competing approaches — they capture different kinds of relevance, and the best production systems use both.

Semantic search finds what something *means*. BM25 finds what something *says*. For code search, both matter. A developer searching for `validateJwtPayload` wants the exact function, which BM25 will find instantly. A developer searching for "function that checks token expiration" wants semantic matching, which BM25 will miss whenever the code doesn't happen to use those words.

### What BM25 Does

BM25 (Best Match 25) is a term frequency-based ranking function. It scores documents by how often query terms appear in them, adjusted for document length and corpus-wide term frequency. It's fast, deterministic, and extremely good at exact and near-exact matching.

BM25 score for document d given query q:

```
score(d, q) = Σ IDF(qi) × (f(qi,d) × (k1+1)) / (f(qi,d) + k1 × (1 - b + b × |d|/avgdl))
```

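Where `f(qi,d)` is the frequency of term `qi` in document `d`, `IDF(qi)` is the inverse document frequency of `qi` (rarer terms score higher), `|d|` is the document length, `avgdl` is the average document length, and `k1` and `b` are tuning parameters (defaults: k1=1.5, b=0.75).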
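To see what `k1` and `b` actually do, the sketch below evaluates the per-term part of the formula in isolation. `bm25_term_score` is an illustrative helper, and the IDF of 2.0 and the document lengths are made-up inputs.

```python
def bm25_term_score(f: int, dl: int, avgdl: float, idf: float,
                    k1: float = 1.5, b: float = 0.75) -> float:
    """Contribution of one query term to the BM25 score of one document."""
    length_norm = 1 - b + b * dl / avgdl
    return idf * (f * (k1 + 1)) / (f + k1 * length_norm)

# Term frequency saturates: each extra occurrence adds less than the last
for f in (1, 2, 5, 20):
    print(f, round(bm25_term_score(f, dl=100, avgdl=100, idf=2.0), 3))

# Length normalization: the same single match counts for less in a longer document
short_doc = bm25_term_score(1, dl=50, avgdl=100, idf=2.0)
long_doc = bm25_term_score(1, dl=400, avgdl=100, idf=2.0)
print(round(short_doc, 3), round(long_doc, 3))
```

With an average-length document the score climbs from 2.0 at one occurrence toward, but never reaches, `idf × (k1 + 1)` = 5.0. That ceiling is why a function that repeats an identifier fifty times does not bury one that uses it twice.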

In practice: BM25 achieved 85% recall with 8 results on identifier-heavy queries against a code corpus. When a developer types a function name or an exact identifier, BM25 returns it in the first 8 results with 85% reliability. Semantic search at comparable recall required 7 results — essentially the same performance on clean, specific queries.

```python
import math
from collections import Counter, defaultdict

class BM25:
    def __init__(self, corpus: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1 = k1
        self.b = b
        self.corpus = corpus
        self.n = len(corpus)

        # Tokenize (simple split; use a proper tokenizer for production)
        self.tokenized = [self._tokenize(doc) for doc in corpus]
        self.avgdl = sum(len(d) for d in self.tokenized) / self.n

        # Build IDF
        df = defaultdict(int)
        for doc in self.tokenized:
            for term in set(doc):
                df[term] += 1
        self.idf = {
            term: math.log((self.n - freq + 0.5) / (freq + 0.5) + 1)
            for term, freq in df.items()
        }

    def _tokenize(self, text: str) -> list[str]:
        # Split on whitespace and common code delimiters
        import re
        tokens = re.split(r'[\s\(\)\[\]\{\},;:\.]+', text)
        # Also split camelCase and snake_case identifiers
        expanded = []
        for tok in tokens:
            if tok:
                # split camelCase before lowercasing, while the case boundary still exists
                parts = re.sub(r'([a-z0-9])([A-Z])', r'\1 \2', tok).split()
                # split snake_case
                parts = [p for part in parts for p in part.split('_')]
                expanded.extend(p.lower() for p in parts if p)
        return expanded

    def score(self, query: str, doc_idx: int) -> float:
        query_terms = self._tokenize(query)
        doc = self.tokenized[doc_idx]
        dl = len(doc)
        tf = Counter(doc)

        score = 0.0
        for term in query_terms:
            if term not in self.idf:
                continue
            f = tf[term]
            score += self.idf[term] * (f * (self.k1 + 1)) / (
                f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            )
        return score

    def search(self, query: str, k: int = 10) -> list[tuple[int, float]]:
        scores = [(i, self.score(query, i)) for i in range(self.n)]
        scores = [(i, s) for i, s in scores if s > 0]
        scores.sort(key=lambda x: -x[1])
        return scores[:k]
```

> **Key Insight**: BM25 and semantic search have approximately equal recall on clean, specific queries. BM25 wins decisively on exact identifier lookups. Semantic search wins decisively on intent-based queries. The domains of excellence don't overlap — they're complementary.

### Reciprocal Rank Fusion

The challenge with combining BM25 and semantic results: they produce scores on different scales. BM25 scores might range from 0 to 15. Cosine similarity scores range from -1 to 1. You can't average them directly.

Reciprocal Rank Fusion (RRF) solves this by converting scores to ranks. Instead of merging the scores, you merge the ranked lists. Each document's combined score is the sum of its reciprocal ranks in each list:

```
RRF_score(d) = Σ_r (1 / (k + rank_r(d)))
```

Where k is a constant (typically 60) that prevents very high-ranked documents from dominating completely, and rank_r(d) is the position of document d in ranking r.

```python
from collections import defaultdict

def reciprocal_rank_fusion(
    *ranked_lists: list[tuple[str, float]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Merge multiple ranked lists using RRF.
    Each list is [(doc_id, score), ...] sorted by decreasing score.
    Returns merged list sorted by RRF score.
    """
    rrf_scores: dict[str, float] = defaultdict(float)

    for ranked_list in ranked_lists:
        for rank, (doc_id, _) in enumerate(ranked_list, start=1):
            rrf_scores[doc_id] += 1.0 / (k + rank)

    merged = sorted(rrf_scores.items(), key=lambda x: -x[1])
    return merged

# Usage
semantic_results = [("func_auth.py:42", 0.91), ("func_verify.py:17", 0.87)]
bm25_results = [("func_verify.py:17", 8.3), ("func_auth.py:42", 7.1)]

merged = reciprocal_rank_fusion(semantic_results, bm25_results)
for doc_id, score in merged[:5]:
    print(f"  {doc_id:40s}  RRF={score:.4f}")
```

RRF is the right fusion method for code search because it doesn't require calibrated scores. You don't need to normalize BM25 and cosine similarity to the same scale — the rank positions do the normalization for you.

> **Warning**: Don't average cosine similarity and BM25 scores directly. Their distributions are different in shape, mean, and variance. A BM25 score of 3.0 is not comparable to a cosine score of 0.75 without calibration. Use RRF or train a reranker — don't eyeball a linear combination.

### Query Classification: When to Weight Each

For most queries, equal weighting of semantic and BM25 results works well. But you can do better with query classification:

- **High specificity** (contains identifiers, function names, exact strings): increase BM25 weight
- **High abstraction** (describes behavior without naming it): increase semantic weight
- **Mixed** (natural language with technical terms): equal weight

A simple classifier based on identifier density:

```python
import re

def estimate_query_type(query: str) -> str:
    """Classify query as 'identifier', 'semantic', or 'mixed'."""
    tokens = query.split()

    # Indicators of identifier-heavy query
    camel_case = sum(1 for t in tokens if re.search(r'[a-z][A-Z]', t))
    snake_case = sum(1 for t in tokens if '_' in t and t.isidentifier())
    quoted = sum(1 for t in tokens if t.startswith('"') or t.startswith("'"))

    identifier_score = camel_case + snake_case + quoted

    if identifier_score >= 2:
        return "identifier"
    elif identifier_score == 0 and len(tokens) >= 5:
        return "semantic"
    else:
        return "mixed"

def weighted_rrf(
    semantic_results: list[tuple[str, float]],
    bm25_results: list[tuple[str, float]],
    query: str,
    k: int = 60
) -> list[tuple[str, float]]:
    query_type = estimate_query_type(query)

    if query_type == "identifier":
        # BM25 dominant
        return reciprocal_rank_fusion(bm25_results, bm25_results, semantic_results, k=k)
    elif query_type == "semantic":
        # Semantic dominant
        return reciprocal_rank_fusion(semantic_results, semantic_results, bm25_results, k=k)
    else:
        return reciprocal_rank_fusion(semantic_results, bm25_results, k=k)
```
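Passing a list twice, as `weighted_rrf` does, gives you integer weight ratios only. If you want finer control, one option is an explicit weight per list. The sketch below is an alternative under stated assumptions, not the implementation used elsewhere in this guide: `weighted_rrf_scores` and its `weights` parameter are hypothetical names, and the 1.5 weight is an arbitrary illustration.

```python
from collections import defaultdict

def weighted_rrf_scores(ranked_lists, weights, k=60):
    """RRF where list i contributes weights[i] / (k + rank) per document."""
    if len(ranked_lists) != len(weights):
        raise ValueError("need exactly one weight per ranked list")
    scores = defaultdict(float)
    for ranked, weight in zip(ranked_lists, weights):
        for rank, (doc_id, _score) in enumerate(ranked, start=1):
            scores[doc_id] += weight / (k + rank)
    return sorted(scores.items(), key=lambda item: -item[1])

semantic = [("auth.py:42", 0.91), ("verify.py:17", 0.87)]
bm25 = [("verify.py:17", 8.3), ("auth.py:42", 7.1)]

# With equal weights these two documents tie; a heavier list breaks the tie
bm25_heavy = weighted_rrf_scores([semantic, bm25], weights=[1.0, 1.5])
semantic_heavy = weighted_rrf_scores([semantic, bm25], weights=[1.5, 1.0])

print(bm25_heavy[0][0])      # verify.py:17
print(semantic_heavy[0][0])  # auth.py:42
```

A weight of 2.0 reproduces the list-duplication trick exactly, so this form is a superset of it. Treat the weights as tunable parameters and pick them on your eval set rather than by intuition.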

### Building the Full Pipeline

```python
class HybridCodeSearch:
    def __init__(self, chunks: list[dict], embed_fn, dim: int = 384):
        texts = [c["text"] for c in chunks]
        self.chunks = chunks
        self.ids = [c["id"] for c in chunks]

        # Build BM25 index
        self.bm25 = BM25(texts)

        # Build vector index
        self.embed_fn = embed_fn
        embeddings = embed_fn(texts)
        embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)
        self.vectors = embeddings

    def search(self, query: str, k: int = 10) -> list[dict]:
        # Over-fetch 2k candidates per retriever, capped at the corpus size
        # (np.argpartition raises if asked for more elements than exist)
        n_candidates = min(k * 2, len(self.ids))

        # BM25 search
        bm25_raw = self.bm25.search(query, k=n_candidates)
        bm25_results = [(self.ids[i], score) for i, score in bm25_raw]

        # Semantic search
        q_vec = self.embed_fn([query])[0]
        q_vec /= np.linalg.norm(q_vec)
        scores = self.vectors @ q_vec
        top_idx = np.argpartition(scores, -n_candidates)[-n_candidates:]
        top_idx = top_idx[np.argsort(scores[top_idx])[::-1]]
        semantic_results = [(self.ids[i], float(scores[i])) for i in top_idx]

        # Merge
        merged = reciprocal_rank_fusion(semantic_results, bm25_results)

        # Return chunk metadata for top-k
        id_to_chunk = {c["id"]: c for c in self.chunks}
        return [id_to_chunk[doc_id] for doc_id, _ in merged[:k] if doc_id in id_to_chunk]
```

> **Try This**: A/B test hybrid vs. pure semantic on your eval set. Run 25 queries through both, measure Recall@5. You'll typically see a 10-20% recall improvement from adding BM25 for identifier-heavy query sets, and little to no improvement for purely intent-based query sets. The data will tell you how much your specific workload benefits from the hybrid approach.

---

### Key Takeaways

- BM25 and semantic search have complementary failure modes: BM25 misses intent, semantics misses exact identifiers.
- Both achieve ~85% recall at similar k values on clean queries; the difference appears on ambiguous or query-type-mismatched searches.
- Use RRF to merge results — it doesn't require calibrated scores and handles score distribution mismatch cleanly.
- Query classification (identifier vs. semantic vs. mixed) allows adaptive weighting that improves hybrid performance by 5-15%.
- Build the BM25 index at chunk ingestion time — it adds minimal overhead and is always available for free.

### Practical Exercise

Measure the recall improvement of hybrid search on your eval set:

```python
# Compare pure semantic vs. hybrid on your 25-query eval set
results = {"semantic_only": [], "hybrid": []}

for item in eval_set:
    corpus_texts = [c["text"] for c in item["corpus_chunks"]]
    gt_id = item["gt_chunk_id"]

    # Pure semantic
    q_vec = embed_fn([item["query"]])[0]
    corpus_vecs = embed_fn(corpus_texts)
    corpus_vecs /= np.linalg.norm(corpus_vecs, axis=1, keepdims=True)
    q_vec /= np.linalg.norm(q_vec)
    scores = corpus_vecs @ q_vec
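    # Compare by chunk id so both methods are scored against the same ground truth
    ids = [c["id"] for c in item["corpus_chunks"]]
    top5_semantic = {ids[i] for i in np.argsort(scores)[::-1][:5]}
    results["semantic_only"].append(gt_id in top5_semantic)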

    # Hybrid
    search = HybridCodeSearch(item["corpus_chunks"], embed_fn)
    top5_hybrid = {c["id"] for c in search.search(item["query"], k=5)}
    results["hybrid"].append(gt_id in top5_hybrid)

for method, hits in results.items():
    print(f"{method:20s}  Recall@5={sum(hits)/len(hits):.3f}")
```

---

## Chapter 7: Calibration: Thresholds, Margin Loss, and Why Static top_k Is a Trap {#chapter-7}

Calibration is the chapter most developers skip. It's also where most of the retrieval quality problems actually live. You can have the right model, the right chunking, the right vector store, and hybrid search — and still inject mostly noise into your LLM context because the scoring isn't calibrated.

### The Mushy Middle Problem

When you train an embedding model with mean squared error (MSE) loss — or with any loss that doesn't explicitly force a gap between positive and negative scores — you end up with a "mushy middle" in your score distributions.

Here's what mushy middle looks like in practice: a query's true positive function scores 0.78. The best-scoring irrelevant functions score 0.74. The gap is 0.04. A fixed threshold of 0.75 separates them on this query, but with only 0.01 of headroom: on a query where scores run a few hundredths higher, the same threshold admits every irrelevant file that lands above 0.75, which could be dozens. If you use top_k=10 instead, you return 10 results, 9 of which are noise that the scores barely distinguish from the true positive.

This isn't a retrieval threshold problem — it's a training problem. The model didn't learn to put true positives *significantly* above hard negatives. It learned to order them slightly, which is insufficient for threshold-based retrieval.

```python
# Diagnostic: score distribution analysis
import numpy as np
import matplotlib.pyplot as plt

def score_distribution_analysis(
    model_fn,
    positive_pairs: list[tuple[str, str]],  # (query, relevant_doc)
    negative_pairs: list[tuple[str, str]],  # (query, irrelevant_doc)
) -> dict:
    """
    Measure the score gap between relevant and irrelevant documents.
    A healthy gap is >0.2. Gap <0.1 indicates mushy middle.
    """
    pos_scores = []
    neg_scores = []

    for query, doc in positive_pairs:
        q_vec = model_fn([query])[0]
        d_vec = model_fn([doc])[0]
        q_vec /= np.linalg.norm(q_vec)
        d_vec /= np.linalg.norm(d_vec)
        pos_scores.append(float(q_vec @ d_vec))

    for query, doc in negative_pairs:
        q_vec = model_fn([query])[0]
        d_vec = model_fn([doc])[0]
        q_vec /= np.linalg.norm(q_vec)
        d_vec /= np.linalg.norm(d_vec)
        neg_scores.append(float(q_vec @ d_vec))

    return {
        "positive_mean": np.mean(pos_scores),
        "negative_mean": np.mean(neg_scores),
        "gap": np.mean(pos_scores) - np.mean(neg_scores),
        "positive_std": np.std(pos_scores),
        "negative_std": np.std(neg_scores),
        "overlap": sum(1 for p in pos_scores
                      for n in neg_scores if p < n) / (len(pos_scores) * len(neg_scores))
    }
```

### Margin Ranking Loss: The Fix

Margin Ranking Loss directly penalizes cases where the score gap is too small. The loss is zero when the positive score exceeds the negative score by at least `margin`. It becomes positive when the gap is smaller than `margin`, and the model is penalized proportionally.

```python
import torch
import torch.nn.functional as F

def margin_ranking_loss_batch(
    anchor_vecs: torch.Tensor,   # (B, D)
    positive_vecs: torch.Tensor, # (B, D)
    negative_vecs: torch.Tensor, # (B, D)
    margin: float = 0.3
) -> torch.Tensor:
    """
    Forces: score(anchor, positive) - score(anchor, negative) >= margin
    Loss is zero when the gap is large enough, positive when it isn't.
    """
    # Normalize
    anchor = F.normalize(anchor_vecs, dim=1)
    positive = F.normalize(positive_vecs, dim=1)
    negative = F.normalize(negative_vecs, dim=1)

    pos_sim = (anchor * positive).sum(dim=1)  # (B,)
    neg_sim = (anchor * negative).sum(dim=1)  # (B,)

    loss = F.relu(margin - (pos_sim - neg_sim))
    return loss.mean()

# Training loop sketch
def train_epoch(model, dataloader, optimizer, margin=0.3):
    model.train()
    total_loss = 0.0
    for batch in dataloader:
        anchors, positives, negatives = batch

        a_vecs = model(anchors)
        p_vecs = model(positives)
        n_vecs = model(negatives)

        loss = margin_ranking_loss_batch(a_vecs, p_vecs, n_vecs, margin)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

    return total_loss / len(dataloader)
```
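To see what the loss does to individual pairs, here is a worked example in plain Python. `margin_loss` is a hypothetical scalar helper written for this illustration; it computes the same `max(0, margin - (pos - neg))` expression as the batched version above, and the similarity values are made up.

```python
def margin_loss(pos_sim: float, neg_sim: float, margin: float = 0.3) -> float:
    """Per-pair loss: zero once the positive leads the negative by `margin`."""
    return max(0.0, margin - (pos_sim - neg_sim))

# (positive similarity, negative similarity) for three illustrative triplets
pairs = [(0.78, 0.74), (0.85, 0.60), (0.90, 0.40)]
losses = [margin_loss(pos, neg) for pos, neg in pairs]

for (pos, neg), loss in zip(pairs, losses):
    print(f"gap={pos - neg:.2f}  loss={loss:.2f}")
# gap=0.04  loss=0.26   mushy pair, penalized heavily
# gap=0.25  loss=0.05   close to the margin, small penalty
# gap=0.50  loss=0.00   already separated, contributes no gradient
```

The third pair is why margin losses reward hard negative mining: pairs that are already separated contribute nothing, so training signal comes only from the pairs the model still confuses.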

The margin parameter (0.3 is a common starting point) controls how much separation is required. A higher margin produces more discriminative models but can underfit if the training data doesn't have enough diversity in hard negatives.

> **Key Insight**: The margin in margin ranking loss is the minimum score gap you're enforcing between true positives and hard negatives. A 0.3 margin means the loss stays above zero for any triplet whose true positive scores less than 0.3 above its hard negative, so training keeps pushing those pairs apart. This is what creates a calibratable score distribution.

### Why Static top_k Is a Trap

The most common retrieval setup: return top_k=10 (or 20) results regardless of score. This has a fundamental problem that gets worse as your corpus grows.

If your corpus has 50,000 functions and 1 is relevant to a given query, top_k=20 returns that 1 relevant function plus 19 noise functions. You've deliberately polluted your LLM context with 95% irrelevant content. At top_k=10, it's 90% noise.

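The numbers for LLM context injection: with static top_k=20 and average 200-token chunks, you're sending approximately 4,000 tokens per query, of which 3,800 are noise. At OpenAI's pricing, taken here as roughly $1 per million input tokens (the rate the per-query figure implies; check current pricing for your model), that's about $0.004/query just for the context, before the response. At scale (10K queries/day), that's $40/day or $14,600/year in context spend, of which 95% (about $13,900) pays for noise.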
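The arithmetic is worth keeping as a function so you can plug in your own numbers. This is a minimal sketch: `context_cost` and its parameter names are hypothetical, and the $1 per million tokens rate is an assumption chosen because it reproduces the $0.004/query figure. Note that $14,600 is the total context spend; the noise share is 95% of it.

```python
def context_cost(
    top_k: int,
    tokens_per_chunk: int,
    relevant_chunks: int,
    usd_per_million_tokens: float,
    queries_per_day: int,
) -> dict:
    """Yearly context spend for static top_k retrieval, split into total and noise."""
    total_tokens = top_k * tokens_per_chunk
    noise_tokens = (top_k - relevant_chunks) * tokens_per_chunk
    usd_per_query = total_tokens * usd_per_million_tokens / 1_000_000
    usd_per_year = usd_per_query * queries_per_day * 365
    return {
        "total_tokens": total_tokens,
        "noise_tokens": noise_tokens,
        "usd_per_query": usd_per_query,
        "usd_per_year": usd_per_year,
        "noise_usd_per_year": usd_per_year * noise_tokens / total_tokens,
    }

cost = context_cost(
    top_k=20,
    tokens_per_chunk=200,
    relevant_chunks=1,
    usd_per_million_tokens=1.0,  # assumed rate, not a quoted price
    queries_per_day=10_000,
)
print(f"{cost['total_tokens']} tokens/query, {cost['noise_tokens']} of them noise")
print(f"${cost['usd_per_year']:,.0f}/year total, ${cost['noise_usd_per_year']:,.0f}/year on noise")
```

Swap in your real input price and query volume. The ratio between the two yearly figures depends only on top_k, which is the point of the section.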

Adaptive thresholding solves this: instead of returning a fixed number of results, return all results above a calibrated score threshold. Queries with a clear best match return 1-3 results. Ambiguous queries return more.

```python
def find_adaptive_threshold(
    model_fn,
    eval_set: list[dict],
    target_precision: float = 0.9
) -> float:
    """
    Find the threshold that achieves target precision on eval set.
    eval_set items: {"query": str, "positive_doc": str, "negative_docs": [str]}
    """
    all_pos_scores = []
    all_neg_scores = []

    for item in eval_set:
        q_vec = model_fn([item["query"]])[0]
        q_vec /= np.linalg.norm(q_vec)

        p_vec = model_fn([item["positive_doc"]])[0]
        p_vec /= np.linalg.norm(p_vec)
        all_pos_scores.append(float(q_vec @ p_vec))

        for neg_doc in item["negative_docs"]:
            n_vec = model_fn([neg_doc])[0]
            n_vec /= np.linalg.norm(n_vec)
            all_neg_scores.append(float(q_vec @ n_vec))

    # Find the threshold that achieves target precision.
    # Linear scan from low to high: the first threshold that meets the
    # precision target is the lowest one, which keeps recall as high as possible.
    thresholds = np.linspace(0.3, 0.99, 200)

    for t in thresholds:
        tp = sum(1 for s in all_pos_scores if s >= t)
        fp = sum(1 for s in all_neg_scores if s >= t)
        if tp + fp == 0:
            continue
        precision = tp / (tp + fp)
        if precision >= target_precision:
            return float(t)

    # No threshold reached the target precision: fall back to the default
    return 0.5

class AdaptiveRetriever:
    def __init__(self, index, threshold: float, min_results: int = 1, max_results: int = 20):
        self.index = index
        self.threshold = threshold
        self.min_results = min_results
        self.max_results = max_results

    def search(self, query_vec: np.ndarray) -> list[tuple[dict, float]]:
        # Get more results than we'll return to have enough candidates
        candidates = self.index.search(query_vec, k=self.max_results * 2)

        # Filter by threshold
        above_threshold = [(meta, score) for meta, score in candidates
                          if score >= self.threshold]

        # Enforce min/max bounds
        if len(above_threshold) < self.min_results:
            return candidates[:self.min_results]
        return above_threshold[:self.max_results]
```

> **Warning**: Calibrate your threshold on a held-out eval set, not on the data you used to train or evaluate your model. Threshold calibration on training data overfits to the training distribution and will underperform on production queries.
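One way to keep the calibration set separate is to assign every query to a split deterministically, by hashing its text. The sketch below is one possible approach, not a prescribed one: `assign_split`, the split names, and the 80/10/10 proportions are all assumptions for illustration.

```python
import hashlib

def assign_split(query: str, train: float = 0.8, calibration: float = 0.1) -> str:
    """Deterministically assign a query to 'train', 'calibration', or 'eval'."""
    digest = hashlib.sha256(query.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    if bucket < train:
        return "train"
    if bucket < train + calibration:
        return "calibration"
    return "eval"

# Synthetic queries, only to show the split proportions
queries = [f"how do we handle case {i}" for i in range(1000)]
counts = {"train": 0, "calibration": 0, "eval": 0}
for q in queries:
    counts[assign_split(q)] += 1

print(counts)
```

Because the assignment depends only on the query text, the same query lands in the same split on every run and on every machine. A query can then never leak from the training set into the set you calibrate the threshold on.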

### The Token Cost Payoff

When calibration is tight, the token cost reduction is dramatic. With a calibrated threshold that achieves 90% precision, a typical production workload returns 1-5 results per query instead of 20. At 200 tokens per chunk, that's 200-1000 tokens of context per query vs. 4,000. The reduction is 75-95%.

The 96% reduction in context delivery tokens (from 2,000 tokens/message to 84 tokens/message) reflects this calibration effect in production. At that level of calibration, you're delivering almost exclusively relevant context — which also improves LLM response quality because the context is signal, not noise.

---

### Key Takeaways

- MSE-trained models produce a "mushy middle" where relevant and irrelevant scores are nearly indistinguishable.
- Margin ranking loss forces a gap of at least `margin` between true positives and hard negatives.
- Static top_k injects a noise fraction of (k-1)/k when there's one correct answer: 90% at k=10, 95% at k=20.
- Adaptive thresholding dramatically reduces token costs: 95% reduction is achievable with tight calibration.
- Always calibrate on a held-out set. Threshold calibration on training data overfits.

### Practical Exercise

Run a calibration analysis on your current model:

```python
# calibration_check.py
# Requires: 20 (query, correct_chunk, wrong_chunks) triples from your codebase,
# plus the same embed_fn you use for indexing
import numpy as np

# Load your eval pairs — format: [(query_text, correct_chunk_text, [wrong_chunk_text, ...])]
eval_pairs = [...]  # fill in from your codebase

pos_scores = []
neg_scores = []

for query, correct, wrongs in eval_pairs:
    q = embed_fn([query])[0]; q /= np.linalg.norm(q)
    c = embed_fn([correct])[0]; c /= np.linalg.norm(c)
    pos_scores.append(float(q @ c))
    for w in wrongs[:5]:
        wv = embed_fn([w])[0]; wv /= np.linalg.norm(wv)
        neg_scores.append(float(q @ wv))

gap = np.mean(pos_scores) - np.mean(neg_scores)
print(f"Positive mean:  {np.mean(pos_scores):.4f} ± {np.std(pos_scores):.4f}")
print(f"Negative mean:  {np.mean(neg_scores):.4f} ± {np.std(neg_scores):.4f}")
print(f"Gap:            {gap:.4f}")
print(f"Status: {'GOOD (>0.2)' if gap > 0.2 else 'MARGINAL (0.1-0.2)' if gap > 0.1 else 'POOR (<0.1) — consider retraining'}")
```

---

## Chapter 8: Production Operations: Latency, Drift, Monitoring, Re-Indexing {#chapter-8}

Shipping a code search system is straightforward. Running it well for six months is where most of the real work happens. This chapter is about the operational patterns that prevent production systems from silently degrading.

### Latency Budget

A complete code search request has four latency components: query embedding, index search, result fetch, and response serialization. Know the budget for each before you optimize.
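Before optimizing anything, measure where the time goes. The sketch below is one way to time each of the four stages; `LatencyBudget` and the stage names are hypothetical, and the `time.sleep` calls are stand-ins for your real embedding, search, fetch, and serialization code.

```python
import time
from contextlib import contextmanager

class LatencyBudget:
    """Accumulates wall-clock time per named stage, in milliseconds."""

    def __init__(self):
        self.stages_ms: dict[str, float] = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000.0
            self.stages_ms[name] = self.stages_ms.get(name, 0.0) + elapsed_ms

    def total_ms(self) -> float:
        return sum(self.stages_ms.values())

budget = LatencyBudget()
with budget.stage("embed"):
    time.sleep(0.002)  # stand-in for query embedding
with budget.stage("search"):
    time.sleep(0.001)  # stand-in for the index lookup
with budget.stage("fetch"):
    time.sleep(0.001)  # stand-in for loading chunk metadata
with budget.stage("serialize"):
    time.sleep(0.001)  # stand-in for building the response

for name, ms in budget.stages_ms.items():
    print(f"{name:10s} {ms:6.2f} ms  ({ms / budget.total_ms():.0%} of total)")
```

Log these per request alongside the query log and look at percentiles, not averages. A stage that is fast at p50 and slow at p99 is a different problem from one that is uniformly slow.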

With ONNX Runtime at 384 dimensions, query embedding takes approximately 6ms. That's the baseline: switching from PyTorch to ONNX Runtime reduces model weights from 496MB to 30MB and full service RSS from 820MB to 197MB, while cutting embedding latency to 6ms. If your embedding step is taking 50ms, the most likely cause is PyTorch running in eager mode without optimization.

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

class OnnxEmbedder:
    def __init__(self, model_path: str, tokenizer_name: str):
        # ONNX Runtime with optimizations
        opts = ort.SessionOptions()
        opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        opts.intra_op_num_threads = 4

        self.session = ort.InferenceSession(
            model_path,
            sess_options=opts,
            providers=["CPUExecutionProvider"]
        )
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_name)

    def embed(self, texts: list[str]) -> np.ndarray:
        encoded = self.tokenizer(
            texts,
            padding=True,
            truncation=True,
            max_length=512,
            return_tensors="np"
        )
        outputs = self.session.run(
            None,
            {
                "input_ids": encoded["input_ids"],
                "attention_mask": encoded["attention_mask"]
            }
        )
        # Mean pooling
        token_embeddings = outputs[0]
        attention_mask = encoded["attention_mask"]
        mask_expanded = attention_mask[:, :, np.newaxis].astype(np.float32)
        embeddings = (token_embeddings * mask_expanded).sum(1) / mask_expanded.sum(1)

        # Normalize
        norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
        return embeddings / norms
```

> **Key Insight**: Switching to ONNX Runtime reduces full service RSS from 820MB to 197MB while cutting embedding latency to 6ms at 384 dimensions. If memory footprint matters (you're running the embedder in-process alongside your application), ONNX is the right deployment format.

### Drift Detection

Embedding drift happens when the code your search system was indexed on diverges significantly from the code it's being searched against. This occurs through:
- Continuous commits that change function names and signatures
- Major refactors that move or consolidate functionality
- Addition of new modules with patterns not in the original index

Drift is invisible to standard latency and error rate monitoring. Queries still return results; the results are just increasingly wrong.

The monitoring signal: track your threshold pass rate. If calibration says relevant chunks should score above 0.75, and the fraction of queries where at least one result exceeds 0.75 drops from 85% to 60%, your index has drifted.

```python
import time
import json
from collections import deque
from pathlib import Path

class DriftMonitor:
    def __init__(
        self,
        threshold: float,
        window_size: int = 1000,
        alert_pct: float = 0.70
    ):
        self.threshold = threshold
        self.window = deque(maxlen=window_size)
        self.alert_pct = alert_pct

    def record_query(self, top_score: float):
        """Record whether the best result for a query exceeded threshold."""
        self.window.append(1 if top_score >= self.threshold else 0)

    def threshold_pass_rate(self) -> float:
        if not self.window:
            return 1.0
        return sum(self.window) / len(self.window)

    def is_drifted(self) -> bool:
        return self.threshold_pass_rate() < self.alert_pct

    def status(self) -> dict:
        rate = self.threshold_pass_rate()
        return {
            "threshold_pass_rate": rate,
            "is_drifted": rate < self.alert_pct,
            "window_size": len(self.window),
            "alert_threshold": self.alert_pct,
        }
```

> **Warning**: Latency and error rate metrics will not detect semantic drift. A system can have perfect uptime and p99 latency of 15ms while returning completely wrong results because the index hasn't been updated. You need a semantic quality signal in your monitoring stack.

### Re-Indexing Strategies

Full re-indexing is the safe choice but expensive for large corpora. The alternative is incremental re-indexing with change detection.

For Git-based codebases, the change detection strategy is straightforward: compare file hashes (or git object hashes) against what's in the index, update only changed files.

```python
import hashlib
import subprocess
from pathlib import Path

def get_git_changed_files(since_commit: str) -> list[str]:
    """Get files changed since a given commit."""
    result = subprocess.run(
        ["git", "diff", "--name-only", since_commit, "HEAD"],
        capture_output=True, text=True,
        check=True,  # raise CalledProcessError instead of silently returning no files
    )
    return [f.strip() for f in result.stdout.strip().split("\n") if f.strip()]

def file_hash(path: str) -> str:
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

class IncrementalIndexer:
    def __init__(self, index, embedder, db_conn):
        self.index = index
        self.embedder = embedder
        self.db = db_conn

    def get_indexed_hashes(self) -> dict[str, str]:
        """file_path -> hash from database."""
        with self.db.cursor() as cur:
            cur.execute("SELECT DISTINCT file_path, file_hash FROM code_chunks")
            return dict(cur.fetchall())

    def update_file(self, file_path: str):
        """Re-index a single file: delete old chunks, insert new ones."""
        # Re-chunk and re-embed first, so a chunking or embedding failure
        # leaves the previously indexed chunks in place
        from .chunker import chunk_at_all_levels, build_embedded_text
        chunks = chunk_at_all_levels(file_path)
        if not chunks:
            # The file no longer yields any chunks: just drop the stale ones
            self.index.delete_by_file(file_path)
            return

        texts = [build_embedded_text(c) for c in chunks]
        embeddings = self.embedder.embed(texts)
        current_hash = file_hash(file_path)  # hash once, not once per chunk

        chunk_dicts = [
            {
                "file_path": c.file_path,
                "chunk_type": c.chunk_type,
                "name": c.name,
                "source": c.source,
                "start_line": c.start_line,
                "end_line": c.end_line,
                "file_hash": current_hash,
            }
            for c in chunks
        ]
        # Swap old chunks for new ones only after embedding succeeded
        self.index.delete_by_file(file_path)
        self.index.insert_chunks(chunk_dicts, embeddings)

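    def run_incremental_update(self, repo_path: str):
        """Update index for all modified files and drop deleted ones."""
        indexed_hashes = self.get_indexed_hashes()

        changed = []
        seen = set()
        for py_file in Path(repo_path).rglob("*.py"):
            path_str = str(py_file)
            seen.add(path_str)
            current_hash = file_hash(path_str)
            if indexed_hashes.get(path_str) != current_hash:
                changed.append(path_str)

        # Files that are indexed but no longer on disk would otherwise keep
        # returning stale chunks. Assumes the index holds only files under
        # repo_path, stored with the same path form as str(py_file).
        removed = [p for p in indexed_hashes if p not in seen]
        for file_path in removed:
            self.index.delete_by_file(file_path)

        print(f"Re-indexing {len(changed)} changed files, removed {len(removed)} deleted files")
        for file_path in changed:
            try:
                self.update_file(file_path)
            except Exception as e:
                print(f"Error indexing {file_path}: {e}")

        print("Incremental update complete")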
```
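The section mentions git object hashes as an alternative to SHA-256 file hashes. If you want hashes that can be compared directly with what git already stores, you can compute the blob ID yourself. This is a sketch for repositories using git's default SHA-1 object format; `git_blob_hash` is a hypothetical helper name.

```python
import hashlib

def git_blob_hash(data: bytes) -> str:
    """Blob ID git assigns to this content: SHA-1 over 'blob <size>\\0' + content."""
    header = b"blob %d\x00" % len(data)
    return hashlib.sha1(header + data).hexdigest()

# The empty blob has a well-known ID, which makes a convenient sanity check
empty_id = git_blob_hash(b"")
print(empty_id)  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The practical benefit is that you can read blob IDs for tracked files from git (for example from `git ls-files -s`) and skip hashing unchanged files yourself. One caveat: git hashes the content as stored in the repository, so if line-ending conversion or other filters are active, the bytes on disk can hash differently from the stored blob.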

### Query Logging as a Monitoring Layer

Every query is an observation about what developers are searching for and whether the system is satisfying those needs. Log queries, top scores, and result counts. Over time, this log tells you:
- Which queries consistently score below threshold (system doesn't know about those topics)
- Which queries return too many results (threshold may need adjustment)
- Query volume patterns (which modules are most actively being navigated)

```python
import json
import logging
from datetime import datetime, timezone

query_logger = logging.getLogger("code_search.queries")

def log_query(
    query: str,
    results: list[dict],
    top_score: float,
    latency_ms: float,
    threshold: float
):
    query_logger.info(json.dumps({
        # datetime.utcnow() is deprecated since Python 3.12; use an aware UTC timestamp
        "ts": datetime.now(timezone.utc).isoformat(),
        "query": query[:200],
        "n_results": len(results),
        # Cast to built-in types so NumPy scalars don't break json.dumps
        "top_score": round(float(top_score), 4),
        "above_threshold": bool(top_score >= threshold),
        "latency_ms": round(float(latency_ms), 2),
    }))
```

> **Try This**: Enable query logging and run it for one week. Then compute: (1) the 10 most common queries that returned zero above-threshold results, and (2) the 10 most common queries. The first list tells you your system's blind spots. The second tells you what to optimize and test first.
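Both lists fall out of a single pass over the log. The sketch below assumes one JSON object per line with the `query` and `above_threshold` fields written by `log_query`; `summarize_query_log` is a hypothetical helper, and lowercasing queries before counting is a normalization choice of this example, not something the logging code does.

```python
import json
from collections import Counter

def summarize_query_log(lines, top_n: int = 10) -> dict:
    """Return the most common queries overall and the most common misses."""
    all_queries = Counter()
    missed_queries = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        query = record["query"].strip().lower()
        all_queries[query] += 1
        # A miss: the best result did not clear the calibrated threshold
        if not record.get("above_threshold", False):
            missed_queries[query] += 1
    return {
        "blind_spots": missed_queries.most_common(top_n),
        "most_common": all_queries.most_common(top_n),
    }

# Made-up log lines for illustration
sample_log = [
    '{"query": "retry backoff", "above_threshold": false}',
    '{"query": "Retry backoff", "above_threshold": false}',
    '{"query": "auth middleware", "above_threshold": true}',
    '{"query": "auth middleware", "above_threshold": true}',
    '{"query": "auth middleware", "above_threshold": false}',
]
summary = summarize_query_log(sample_log)
print(summary["blind_spots"])  # [('retry backoff', 2), ('auth middleware', 1)]
print(summary["most_common"])  # [('auth middleware', 3), ('retry backoff', 2)]
```

In production you would pass an open file handle instead of a list, since the function only iterates over lines.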

### Scheduled Index Maintenance

For production systems, combine incremental updates (on commit) with scheduled full rebuilds (weekly or monthly). The incremental updates handle day-to-day drift; the full rebuild resets HNSW graph quality and catches edge cases where incremental updates missed changes.

```bash
#!/bin/bash
# index_maintenance.sh — run via cron
set -e

REPO_PATH="/path/to/your/codebase"
LOG_FILE="/var/log/code-search/maintenance.log"

echo "[$(date -u)] Starting index maintenance" >> "$LOG_FILE"

# Incremental update
python3 -m code_search.indexer incremental \
    --repo "$REPO_PATH" \
    --dsn "$PGVECTOR_DSN" \
    >> "$LOG_FILE" 2>&1

# Check drift
python3 -m code_search.monitor drift-check \
    --alert-webhook "$SLACK_WEBHOOK" \
    >> "$LOG_FILE" 2>&1

echo "[$(date -u)] Maintenance complete" >> "$LOG_FILE"
```

---

### Key Takeaways

- ONNX Runtime reduces embedding latency to 6ms and full service RSS from 820MB to 197MB.
- Semantic drift is invisible to latency and error rate metrics — monitor threshold pass rate.
- Incremental re-indexing (file hash comparison) handles daily drift; schedule full rebuilds weekly or monthly.
- HNSW recall can degrade silently when the index only ever receives incremental updates, which is why the periodic full rebuild matters.
- Query logs are your primary diagnostic tool — log queries, scores, result counts, and latency.

### Practical Exercise

Set up drift monitoring on your production system:

```bash
# Install the monitoring loop as a cron job
cat > /etc/cron.d/code-search-drift << 'EOF'
*/5 * * * * www-data python3 /app/code_search/monitor.py --check-drift >> /var/log/drift.log 2>&1
EOF

# Create the monitor script that the cron job runs
cat > /app/code_search/monitor.py << 'SCRIPT'
import json
import sys
from pathlib import Path

# Read last 1000 queries from query log
log_path = Path("/var/log/code-search/queries.jsonl")
if not log_path.exists():
    print("No query log found")
    sys.exit(0)

lines = log_path.read_text().strip().split("\n")[-1000:]
records = [json.loads(l) for l in lines if l.strip()]
if not records:
    # Avoid dividing by zero when the log file exists but is empty
    print("Query log is empty")
    sys.exit(0)

pass_rate = sum(1 for r in records if r.get("above_threshold")) / len(records)
print(f"Threshold pass rate (last {len(records)} queries): {pass_rate:.3f}")
if pass_rate < 0.70:
    print("ALERT: Drift detected, pass rate below 70%")
    # Send alert (Slack, PagerDuty, etc.)
SCRIPT
```

---

## Chapter 9: The Compounding Advantage: Why a Model Trained on Your Codebase Gets Better Over Time {#chapter-9}

This final chapter is about the long game. Every other chapter described components of a working system. This one describes why a well-built system becomes harder to replace over time — and why that's a strategic advantage worth designing for deliberately.

### The Query Log as Training Data

Every query your system processes is a signal: what was the developer looking for, and what did the system return? If you've built query logging (Chapter 8) and you have a feedback mechanism (even implicit, like "which result did the developer click?"), you have the raw material for the next model version.

This is the compounding dynamic. The first version of your model was trained on code structure and generic code triplets. The second version can be trained on actual queries from your developers against your actual codebase. It learns what *your team* searches for, in *your team's* vocabulary, against *your team's* code.

No off-the-shelf general-purpose model has been trained on your data. By definition, it was trained on data that isn't yours. This is the moat: the longer you use the system, the more irreplaceable the model trained on it becomes.

> **Key Insight**: "The moat is the corpus. The longer you use Pyckle, the more irreplaceable it becomes." This is true for any code search system with a learning loop. General-purpose models can't be fine-tuned on your private query logs. A system you own can.

### What "Better Over Time" Means Concretely

Let's be specific. "Better over time" doesn't mean the model improves passively. It means you have a pipeline that:

1. Logs queries and results
2. Collects relevance feedback (explicit ratings, click-through, or implicit usage signals)
3. Converts feedback into training triplets
4. Periodically retrains or fine-tunes the model
5. Evaluates the new model against a held-out set
6. Deploys if it improves

Each of these steps is implementable with tools you already have.

```python
class FeedbackCollector:
    """Convert query logs + usage signals into training triplets."""

    def __init__(self, db_conn):
        self.db = db_conn

    def record_click(self, query_id: str, clicked_chunk_id: str):
        """Record that a developer clicked/selected a specific result."""
        with self.db.cursor() as cur:
            cur.execute("""
                INSERT INTO query_feedback (query_id, chunk_id, signal, created_at)
                VALUES (%s, %s, 'click', NOW())
            """, (query_id, clicked_chunk_id))
        self.db.commit()

    def extract_triplets(self, max_queries: int = 500) -> list[dict]:
        """
        Build training triplets from click data.
        Positive: the clicked chunk.
        Negative: other chunks returned in the same query that weren't clicked.
        max_queries caps the number of (query, clicked chunk) rows read (SQL LIMIT).
        """
        with self.db.cursor() as cur:
            cur.execute("""
                SELECT
                    q.query_text,
                    f.chunk_id AS positive_chunk_id,
                    array_agg(r.chunk_id) FILTER (WHERE r.chunk_id != f.chunk_id) AS negative_chunk_ids
                FROM query_log q
                JOIN query_feedback f ON f.query_id = q.id AND f.signal = 'click'
                JOIN query_results r ON r.query_id = q.id
                GROUP BY q.query_text, f.chunk_id
                HAVING count(r.chunk_id) >= 2
                LIMIT %s
            """, (max_queries,))
            rows = cur.fetchall()

        triplets = []
        for query_text, positive_id, negative_ids in rows:
            for neg_id in (negative_ids or [])[:3]:
                triplets.append({
                    "query": query_text,
                    "positive_id": positive_id,
                    "negative_id": neg_id
                })
        return triplets
```
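Steps 5 and 6 of the pipeline (evaluate against a held-out set, deploy if it improves) can be reduced to a small promotion gate. The sketch below is one way to express it; `recall_at_k`, `should_deploy`, and the `min_gain` parameter are hypothetical, and the rankings are toy data.

```python
def recall_at_k(ranked_ids: list[list[str]], relevant_ids: list[str], k: int = 5) -> float:
    """Fraction of queries whose relevant chunk appears in the top k results."""
    if not relevant_ids:
        raise ValueError("held-out set is empty")
    hits = sum(
        1 for ranked, relevant in zip(ranked_ids, relevant_ids)
        if relevant in ranked[:k]
    )
    return hits / len(relevant_ids)

def should_deploy(current: float, candidate: float, min_gain: float = 0.01) -> bool:
    """Promote the candidate only if it beats the current model by min_gain."""
    return candidate - current >= min_gain

# One relevant chunk per held-out query, plus each model's ranked results
relevant = ["c1", "c2", "c3", "c4"]
current_rankings = [["c1", "x"], ["x", "y"], ["c3", "x"], ["y", "x"]]
candidate_rankings = [["c1", "x"], ["c2", "y"], ["c3", "x"], ["y", "x"]]

current_recall = recall_at_k(current_rankings, relevant, k=2)      # 2 of 4 hits
candidate_recall = recall_at_k(candidate_rankings, relevant, k=2)  # 3 of 4 hits
print(current_recall, candidate_recall, should_deploy(current_recall, candidate_recall))
# 0.5 0.75 True
```

The `min_gain` floor matters more than it looks. On a small held-out set a single query flips recall by several points, so requiring a minimum improvement keeps you from deploying on noise.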

### The Continuous Training Pipeline

The training pipeline for model updates is simpler than the initial training because you're fine-tuning on new data rather than training from scratch:

```python
import torch
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModel, AutoTokenizer

class TripletDataset(Dataset):
    def __init__(self, triplets: list[dict], tokenizer, chunks_db):
        self.triplets = triplets
        self.tokenizer = tokenizer
        self.chunks_db = chunks_db

    def __len__(self):
        return len(self.triplets)

    def __getitem__(self, idx):
        t = self.triplets[idx]
        anchor_text = t["query"]
        positive_text = self.chunks_db[t["positive_id"]]["text"]
        negative_text = self.chunks_db[t["negative_id"]]["text"]

        def tokenize(text):
            return self.tokenizer(
                text, padding="max_length", truncation=True,
                max_length=256, return_tensors="pt"
            )

        return {
            "anchor": tokenize(anchor_text),
            "positive": tokenize(positive_text),
            "negative": tokenize(negative_text),
        }

def fine_tune_on_feedback(
    base_model_path: str,
    triplets: list[dict],
    chunks_db: dict,
    output_path: str,
    epochs: int = 3,
    lr: float = 2e-5
):
    tokenizer = AutoTokenizer.from_pretrained(base_model_path)
    model = AutoModel.from_pretrained(base_model_path)

    dataset = TripletDataset(triplets, tokenizer, chunks_db)
    loader = DataLoader(dataset, batch_size=16, shuffle=True)
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

    def mean_pool(output, attention_mask):
        token_embeds = output.last_hidden_state
        mask = attention_mask.unsqueeze(-1).float()
        return (token_embeds * mask).sum(1) / mask.sum(1)

    model.train()  # from_pretrained() returns the model in eval mode (dropout off)
    for epoch in range(epochs):
        total_loss = 0.0
        for batch in loader:
            for key in ["anchor", "positive", "negative"]:
                for k in batch[key]:
                    batch[key][k] = batch[key][k].squeeze(1)

            a_out = model(**batch["anchor"])
            p_out = model(**batch["positive"])
            n_out = model(**batch["negative"])

            a_vecs = mean_pool(a_out, batch["anchor"]["attention_mask"])
            p_vecs = mean_pool(p_out, batch["positive"]["attention_mask"])
            n_vecs = mean_pool(n_out, batch["negative"]["attention_mask"])

            loss = margin_ranking_loss_batch(a_vecs, p_vecs, n_vecs, margin=0.3)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total_loss += loss.item()

        print(f"Epoch {epoch+1}/{epochs}  loss={total_loss/len(loader):.4f}")

    model.save_pretrained(output_path)
    tokenizer.save_pretrained(output_path)
    print(f"Fine-tuned model saved to {output_path}")
```

> **Try This**: Set up the feedback collection infrastructure before you have any feedback: the click-through logging, the query result table, and the feedback schema. When you have 500 queries' worth of feedback (usually 4-8 weeks of production traffic for a team of 10), you'll be ready to train the first updated model immediately rather than waiting to build the infrastructure.

### The 18x Goodput Improvement

Goodput is the measure of useful work delivered per unit of infrastructure. For a code search system that feeds context into an LLM, goodput is the ratio of relevant tokens to total tokens in the context window.

A naive system (retrieve `top_k=20`, no threshold) achieves roughly 5% goodput when there's one correct answer: 1 relevant chunk out of 20 returned, assuming the chunks are of similar size.

A calibrated system with a tight threshold achieves 90%+ goodput: nearly every token in context is relevant.

The 18x goodput improvement and 96% token reduction (84 tokens/message vs. 2,000) reflect the combined effect of:
1. Tight calibration (returning 1-3 results instead of 20)
2. Small, well-chunked units (function-level, 150-300 tokens each)
3. 100% context hit rate (every query returns at least one relevant result)

The 100% context hit rate is worth examining: it means the index is comprehensive enough that every query has a relevant result above threshold. This doesn't happen by accident — it requires aggressive chunking (including file-level chunks for broad queries), coverage monitoring, and a good embedding model.
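
The arithmetic behind these figures is easy to check. The sketch below is illustrative only: the `goodput` helper and the 200-token chunk size are assumptions made for the example, and the 90% calibrated figure is taken as given from the text rather than computed.

```python
def goodput(relevant_tokens: int, total_tokens: int) -> float:
    """Fraction of context-window tokens that are relevant to the query."""
    if total_tokens <= 0:
        return 0.0
    return relevant_tokens / total_tokens


CHUNK_TOKENS = 200  # assumed chunk size, inside the 150-300 token range above

# Naive retrieval: top_k=20 with exactly one relevant chunk.
naive = goodput(1 * CHUNK_TOKENS, 20 * CHUNK_TOKENS)

# Calibrated retrieval: the 90% figure quoted above, taken as given.
calibrated = 0.90

improvement = calibrated / naive
token_reduction = 1 - 84 / 2000

print(f"naive goodput:   {naive:.0%}")
print(f"improvement:     {improvement:.0f}x")
print(f"token reduction: {token_reduction:.0%}")
```

Because goodput is a ratio, the chunk size cancels out when every chunk is the same length. With uneven chunk sizes, the naive figure depends on how large the one relevant chunk is relative to the other 19.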

### Why This Compounds

Each improvement in the system creates conditions for further improvement:

- Better embeddings → tighter calibration → lower noise in context → better LLM responses → developers trust the system more → more usage → more query logs → more training data
- Better training data → better model → higher threshold pass rate → fewer missed queries → more coverage

> **Warning**: The compounding advantage only compounds if you close the loop. Query logs that aren't used for retraining are just storage costs. Feedback mechanisms that aren't wired to training pipelines are just UX features. The feedback-to-training loop is the machinery that drives improvement.

### Designing for the Long Game

Three architectural decisions that enable the compounding loop:

**1. Own your query logs.** Don't route queries through a managed vector store that doesn't expose logs to you. Every query is a data point.

**2. Schema your feedback from day one.** Even if you don't collect explicit feedback initially, design the database schema to accept it. Retrofitting a feedback schema into a production system is painful.

**3. Version your index.** Store the model version that produced each embedding alongside the embedding. When you retrain, you can compare the new model's results directly against the old model's on the same queries.

```sql
-- Schema designed for the long game
ALTER TABLE code_chunks ADD COLUMN model_version TEXT DEFAULT 'v1.0';
ALTER TABLE code_chunks ADD COLUMN indexed_by_commit TEXT;

CREATE TABLE model_versions (
    version         TEXT PRIMARY KEY,
    model_path      TEXT NOT NULL,
    training_triplets INT,
    eval_mrr        FLOAT,
    deployed_at     TIMESTAMPTZ,
    notes           TEXT
);

CREATE TABLE query_log (
    id              BIGSERIAL PRIMARY KEY,
    query_text      TEXT NOT NULL,
    model_version   TEXT REFERENCES model_versions(version),
    n_results       INT,
    top_score       FLOAT,
    latency_ms      FLOAT,
    created_at      TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE query_results (
    query_id        BIGINT REFERENCES query_log(id),
    chunk_id        BIGINT REFERENCES code_chunks(id),
    rank            INT,
    score           FLOAT
);

CREATE TABLE query_feedback (
    query_id        BIGINT REFERENCES query_log(id),
    chunk_id        BIGINT REFERENCES code_chunks(id),
    signal          TEXT CHECK (signal IN ('click', 'accept', 'reject', 'edit')),
    created_at      TIMESTAMPTZ DEFAULT NOW()
);
```

The `model_version` column on `code_chunks` lets you compare retrieval quality between model versions on the same corpus. The `query_feedback` table captures implicit signals (which result the developer actually used) alongside explicit ones. With the full schema in place from day one, you can start training updates as soon as you have enough data.
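
One way to use the version columns is to score two model versions against the same evaluation queries. The sketch below computes MRR@10 over in-memory rows shaped like `query_results`. The `mrr_at_k` helper, the sample rows, and the `relevant` mapping (standing in for your eval set's ground truth) are all hypothetical.

```python
def mrr_at_k(results, relevant, k=10):
    """Mean reciprocal rank of the first relevant chunk within the top k.

    results:  iterable of (query_id, chunk_id, rank) rows, rank starting at 1
    relevant: dict mapping query_id -> set of relevant chunk_ids
    Queries with no relevant chunk in the top k contribute 0.
    """
    best_rank = {}
    for query_id, chunk_id, rank in results:
        if rank <= k and chunk_id in relevant.get(query_id, ()):
            best_rank[query_id] = min(rank, best_rank.get(query_id, rank))
    if not relevant:
        return 0.0
    total = sum(1.0 / best_rank[q] for q in relevant if q in best_rank)
    return total / len(relevant)


# Hypothetical ground truth and per-version result rows for two eval queries.
relevant = {1: {101}, 2: {202}}
results_v1 = [(1, 999, 1), (1, 101, 2), (2, 202, 3)]
results_v2 = [(1, 101, 1), (2, 888, 1), (2, 202, 2)]

mrr_v1 = mrr_at_k(results_v1, relevant)
mrr_v2 = mrr_at_k(results_v2, relevant)
print(f"v1 MRR@10={mrr_v1:.3f}  v2 MRR@10={mrr_v2:.3f}")
```

A value computed this way is the kind of number that would go in `model_versions.eval_mrr`, so each deployed version carries its score on a fixed eval set.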

---

### Key Takeaways

- Every query is a training signal; every click labels a positive, and the results shown but passed over are candidate negatives for a triplet.
- The compounding advantage requires a closed loop: query logs → training data → model update → deployment.
- Fine-tuning on domain-specific feedback data takes hours on modern hardware, not days.
- Own your query logs, schema your feedback infrastructure from day one, version your index.
- The 18x goodput improvement and 96% token reduction come from tight calibration combined with comprehensive indexing.

### Practical Exercise

Set up the feedback loop infrastructure:

```bash
# Create the feedback schema migration
mkdir -p migrations
cat > migrations/001_feedback_schema.sql << 'SQL'
CREATE TABLE IF NOT EXISTS query_log (
    id BIGSERIAL PRIMARY KEY,
    session_id TEXT,
    query_text TEXT NOT NULL,
    n_results INT,
    top_score FLOAT,
    latency_ms FLOAT,
    model_version TEXT DEFAULT 'v1.0',
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE TABLE IF NOT EXISTS query_results (
    query_id BIGINT REFERENCES query_log(id) ON DELETE CASCADE,
    chunk_id BIGINT REFERENCES code_chunks(id),
    rank INT,
    score FLOAT
);

CREATE TABLE IF NOT EXISTS query_feedback (
    id BIGSERIAL PRIMARY KEY,
    query_id BIGINT REFERENCES query_log(id) ON DELETE CASCADE,
    chunk_id BIGINT REFERENCES code_chunks(id),
    signal TEXT CHECK (signal IN ('click', 'accept', 'reject', 'edit')),
    created_at TIMESTAMPTZ DEFAULT NOW()
);

CREATE INDEX IF NOT EXISTS query_log_created_at_idx ON query_log(created_at);
CREATE INDEX IF NOT EXISTS query_feedback_query_id_idx ON query_feedback(query_id);
SQL

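# Stop on the first SQL error, and only report success if psql exits cleanly
psql "$PGVECTOR_DSN" -v ON_ERROR_STOP=1 -f migrations/001_feedback_schema.sql \
  && echo "Feedback schema ready"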
```

After 4 weeks of production traffic, run `FeedbackCollector.extract_triplets()` and see how many triplets you have. If you have more than 200, you have enough to start the first fine-tuning run.
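
To make the triplet count concrete, here is a minimal sketch of how logged rows could be turned into triplets. It is not the `FeedbackCollector` implementation: `build_triplets` is a hypothetical helper, and the rule it uses (treat unchosen chunks ranked above the chosen one as negatives) is one common heuristic, assumed here for illustration. The output keys match what `TripletDataset` reads: `query`, `positive_id`, and `negative_id`.

```python
def build_triplets(queries, results, feedback, positive_signals=("click", "accept")):
    """Pair each positively signalled chunk with unchosen chunks ranked above it.

    queries:  dict mapping query_id -> query_text
    results:  list of (query_id, chunk_id, rank) rows
    feedback: list of (query_id, chunk_id, signal) rows
    """
    positives = {}
    for query_id, chunk_id, signal in feedback:
        if signal in positive_signals:
            positives.setdefault(query_id, set()).add(chunk_id)

    ranks = {}
    for query_id, chunk_id, rank in results:
        ranks.setdefault(query_id, {})[chunk_id] = rank

    triplets = []
    for query_id, pos_ids in positives.items():
        if query_id not in queries:
            continue  # feedback for a query that was never logged
        shown = ranks.get(query_id, {})
        for pos_id in sorted(pos_ids):
            pos_rank = shown.get(pos_id)
            if pos_rank is None:
                continue  # feedback on a chunk that was never logged as shown
            for chunk_id, rank in shown.items():
                if rank < pos_rank and chunk_id not in pos_ids:
                    triplets.append({
                        "query": queries[query_id],
                        "positive_id": pos_id,
                        "negative_id": chunk_id,
                    })
    return triplets


# Hypothetical rows: the developer accepted the third-ranked result.
queries = {1: "parse retry config"}
results = [(1, 10, 1), (1, 11, 2), (1, 12, 3)]
feedback = [(1, 12, "accept")]

triplets = build_triplets(queries, results, feedback)
MIN_TRIPLETS = 200  # the starting threshold suggested above
print(f"{len(triplets)} triplets; enough to fine-tune: {len(triplets) > MIN_TRIPLETS}")
```

Under this heuristic, a click on the top-ranked result produces no triplets, because nothing was passed over. Explicit `reject` signals could be added as a second source of negatives.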

---

## Conclusion

Code search is a narrow problem with wide leverage. When developers can find what they're looking for instantly — the function that handles a specific edge case, the pattern used elsewhere in the codebase, the error handler they need to mirror — they spend less time navigating and more time building.

The systems described in this book aren't theoretical. Every number cited reflects a real measurement: the 6ms embedding latency from ONNX Runtime, the 0.456 MRR@10 from a 40M-parameter model trained from scratch, the 79% cost reduction from self-hosting on pgvector, the 96% token reduction from tight calibration. These numbers aren't presented to impress — they're presented so you have specific benchmarks to work toward and compare against.

The architecture this book describes has a specific property that most alternatives lack: it gets better over time without rebuilding. A static embedding model deployed against a growing codebase will drift. A system with query logging, feedback collection, and a retraining pipeline compounds its advantage with every production query.

### What to Build First

If you're starting from zero, the build order matters:

**Week 1**: Build the chunker. Get function-level chunks out of your codebase and understand the size distribution. Fix truncation issues before you touch embeddings.

**Week 2**: Add an embedding model and an in-memory index. Deploy a basic search endpoint. This is your baseline — measure its quality before improving anything.

**Week 3**: Add BM25 and RRF-based hybrid search. Measure recall improvement on your eval set. If your workload includes identifier searches (it does), you'll see a meaningful improvement.

**Week 4**: Calibrate thresholds using your eval set. Switch from static top_k to adaptive thresholding. Measure token cost reduction.

**Month 2+**: Add query logging, migrate from in-memory to pgvector when scale requires it, set up the feedback collection schema, start planning the first model update.

### The Decision That Matters Most

Of all the decisions in this book — model, dimensions, chunking strategy, vector store, hybrid vs. pure semantic — the one that matters most in the long run is whether you build a learning loop. A system that learns from its own usage is fundamentally different from one that doesn't. The learning loop turns the system from a deployed artifact into a compounding asset.

That loop doesn't require exotic infrastructure. It requires query logs, a feedback schema, a training script, and the discipline to run it periodically. Everything else in this book is more immediately impactful on day one, but the learning loop is what determines where you are in two years.

Build the loop. Even before you have data for it. The schema costs nothing. The habit of logging costs nothing. When you have 500 queries' worth of feedback, you'll be glad the infrastructure was waiting.

The rest is a measurement problem. Build eval sets, measure everything, and let the data tell you what to improve next. The developers who build the best retrieval systems aren't the ones with the deepest theoretical understanding — they're the ones who measure most rigorously and iterate fastest.

---

## Appendix A: Glossary

**BM25** (Best Match 25): A term frequency-based ranking function for text retrieval. Scores documents by how often query terms appear, adjusted for document length and corpus frequency. The industry-standard keyword search algorithm.

**Calibration**: The process of setting retrieval thresholds so that score distributions of relevant and irrelevant results are separated by a meaningful margin. A well-calibrated system has a predictable relationship between score and relevance.

**Chunk**: A unit of code that is embedded and indexed as a single vector. Typically a function with its signature and immediate context, but can be file-level or arbitrary sliding windows.

**Cosine Similarity**: A metric measuring the angle between two vectors. Range: -1 (opposite) to 1 (identical direction). Magnitude-invariant, making it well-suited for text and code embeddings.

**Domain Gap**: The performance reduction that occurs when a model trained on one distribution (e.g., general text) is applied to a different distribution (e.g., domain-specific code). Manifests as poor discriminability between domain-specific identifiers.

**Drift**: The degradation in retrieval quality that occurs when the indexed corpus diverges from its state at index time, typically through code changes. Invisible to latency and error rate monitoring.

**Embedding**: A fixed-length vector representation of content (text, code, image) produced by a neural model. Similar content produces nearby vectors; the model's training data defines what "similar" means.

**Fine-tuning**: Continuing training of a pretrained model on domain-specific data. For code search, this means updating model weights on code-specific triplets, often using margin ranking loss.

**Goodput**: The ratio of useful (relevant) content to total content delivered. In LLM context injection, goodput = relevant tokens / total tokens. Tightly calibrated systems achieve high goodput.

**Hard Negative**: A training example that is superficially similar to the anchor but semantically different. Hard negatives are the primary driver of embedding model discriminability.

**HNSW** (Hierarchical Navigable Small World): The approximate nearest neighbor algorithm used by most vector stores. Fast but subject to silent recall degradation as the index grows.

**Margin Ranking Loss**: A training loss function that penalizes any case where the score gap between a true positive and a hard negative is less than a specified margin. Produces calibratable score distributions.

**Mean Reciprocal Rank (MRR)**: An evaluation metric measuring how high the first correct result appears in ranked results, averaged over queries. MRR@10 measures within the top 10.

**ONNX Runtime**: A framework for running neural network inference with optimized execution. For embedding models, reduces weight size from ~500MB to ~30MB and latency to single-digit milliseconds.

**pgvector**: A Postgres extension that adds vector storage and similarity search. Allows co-location of vector index with relational data on existing Postgres infrastructure.

**Pooling**: The operation that converts per-token embeddings (from a transformer) to a single vector for the input. Mean pooling (averaging all token embeddings) is the most common.

**RRF** (Reciprocal Rank Fusion): A method for merging multiple ranked lists without requiring calibrated scores. Each document's combined score is the sum of 1/(k + rank) across the lists it appears in, where k is a smoothing constant (60 in the original paper).

**Threshold**: A minimum similarity score below which results are excluded. An adaptive threshold returns all results above the threshold rather than a fixed number.

**Triplet**: A training example consisting of (anchor, positive, negative) — the anchor is a query or code unit, positive is a relevant match, negative is an irrelevant one.

**Vector Store**: A database optimized for storing and searching embedding vectors. Options include in-memory NumPy arrays, pgvector, Pinecone, Weaviate, Qdrant, and Chroma.

---

## Appendix B: Tools and Resources

### Embedding Frameworks

**Sentence Transformers** (`sentence-transformers`): The most practical library for working with embedding models in Python. Supports hundreds of pretrained models, easy batch encoding, and fine-tuning utilities.

```bash
pip install sentence-transformers
```

**ONNX Runtime** (`onnxruntime`): For production embedding inference. Reduces memory footprint and latency compared to PyTorch eager mode.

```bash
# Quote the extras so shells like zsh don't treat the brackets as a glob
pip install onnxruntime "optimum[exporters]"
# Export a model to ONNX:
optimum-cli export onnx --model sentence-transformers/all-MiniLM-L6-v2 ./model_onnx/
```

**Hugging Face Transformers**: The underlying library for loading, fine-tuning, and exporting models. Required for training custom models.

```bash
pip install transformers torch
```

### Vector Storage

**pgvector**: Postgres extension for vector similarity search.

```bash
# Install on Ubuntu/Debian
sudo apt install postgresql-16-pgvector
# Or from source: https://github.com/pgvector/pgvector
# Then enable it in each database that needs it: CREATE EXTENSION vector;
```

**ChromaDB**: Lightweight, embeddable vector store for prototyping and small-scale production.

```bash
pip install chromadb
```

**Qdrant**: High-performance vector store with a simple HTTP API. Good self-hosted option for larger scales.

```bash
docker pull qdrant/qdrant
docker run -p 6333:6333 qdrant/qdrant
```

### Search and Indexing

**rank-bm25**: Pure Python BM25 implementation. Good for prototyping; switch to Elasticsearch or Tantivy for production.

```bash
pip install rank-bm25
```

**Tantivy** (via `tantivy-py`): Rust-based full-text search engine, significantly faster than pure Python BM25 at scale.

```bash
pip install tantivy
```

### Evaluation

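**BEIR**: Benchmark suite for zero-shot information retrieval evaluation. CodeSearchNet is not one of its datasets; the closest code-adjacent set is CQADupStack (duplicate-question retrieval over StackExchange forums). Its loaders and evaluation utilities can also be pointed at your own corpus, queries, and relevance judgments.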

```bash
pip install beir
```

**ir-measures**: Unified interface for computing IR metrics (MRR, NDCG, MAP, Recall@K).

```bash
pip install ir-measures
```

### Code Parsing

**tree-sitter**: Language-grammar-based parser for extracting code structure across languages. The production choice for multi-language chunking.

```bash
pip install tree-sitter
# Language grammars:
pip install tree-sitter-python tree-sitter-javascript tree-sitter-typescript
```

**Python `ast` module**: Built-in abstract syntax tree parser for Python. Sufficient for single-language Python codebases without additional dependencies.

### Monitoring

**Prometheus + Grafana**: Standard production monitoring stack. Add custom metrics for threshold pass rate, query volume, and score distributions.

```python
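from prometheus_client import Counter, Gauge, Histogram

# Default Histogram buckets are sized for seconds (0.005 to 10), so a metric
# recorded in milliseconds needs explicit buckets to be useful.
query_latency = Histogram(
    "code_search_latency_ms", "Query latency in ms",
    buckets=(1, 2, 5, 10, 25, 50, 100, 250, 500, 1000),
)
threshold_pass_rate = Gauge("code_search_threshold_pass_rate", "Fraction of queries above threshold")
queries_total = Counter("code_search_queries_total", "Total queries processed")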
```

---

## Appendix C: Further Reading

### Foundational Papers

**"Dense Passage Retrieval for Open-Domain Question Answering"** (Karpukhin et al., 2020): The paper that established bi-encoder dense retrieval. Directly applicable to code search architecture.

**"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"** (Reimers & Gurevych, 2019): The architecture underlying most practical embedding models. Explains the siamese training setup and mean pooling.

**"CodeSearchNet Challenge: Evaluating the State of Semantic Code Search"** (Husain et al., 2019): The benchmark paper. Essential reading for understanding what the benchmark measures and where it falls short.

**"GraphCodeBERT: Pre-training Code Representations with Data Flow"** (Guo et al., 2021): Structural code representations using data flow graphs. Context for understanding what structural code information looks like in neural form.

**"Approximate Nearest Neighbor Search on High Dimensional Data"** (Wang et al., 2021): Survey paper covering HNSW, IVF, and PQ in depth. Useful when you need to understand the tradeoffs in vector index construction.

### Practical Resources

**"Billion-scale similarity search with GPUs"** (Johnson et al., 2017, FAISS): The foundational paper for approximate nearest neighbor search at scale. The FAISS library implements these algorithms and is the backend for many vector stores.

**"Reciprocal Rank Fusion outperforms Condorcet and individual Rank Learning Methods"** (Cormack et al., 2009): The original RRF paper. Short and readable — worth understanding the derivation when you need to explain the approach to teammates.

**"BM25: The Next Generation of Lucene Relevance"** (Doug Turnbull, OpenSource Connections blog, 2015): Accessible explanation of how Lucene, the engine underneath Elasticsearch, implements BM25, with practical tuning advice.

### On Training

**"Training Data is Everything"** (various practitioners): Not a paper, but a recurring empirical observation: training data quality dominates architecture choice. For code-specific models, this means collecting high-quality triplets from your actual codebase.

**"Hard Negative Mining for Retrieval Models"** (various): Multiple papers on this topic; search for it in the context of dense retrieval. Hard negative mining is the single most impactful training technique for retrieval models.

**"Learning to Rank for Information Retrieval"** (Liu, 2009): Survey of learning-to-rank approaches including pointwise, pairwise, and listwise methods. Chapter on pairwise methods directly covers margin ranking loss.

### Code Search Specific

**"IntelliCode"** (Svyatkovskiy et al., 2019): Microsoft's approach to code completion using code context. Useful for understanding how industry-scale code intelligence systems are built.

**"CodeBERT: A Pre-Trained Model for Programming and Natural Languages"** (Feng et al., 2020): The standard pretrained code model baseline. Understanding its architecture and training data explains its strengths and limitations on domain-specific tasks.

**"UniXcoder: Unified Cross-Modal Pre-training for Code Representation"** (Guo et al., 2022): Multi-modal code representation learning. Shows how to bridge code, comments, and AST structures in a single model.

### On Production Systems

**"Building a Retrieval System that Works"** (practical guides from Pinecone, Weaviate, and Cohere): Each vector store vendor publishes practical guides. Even if you're not using their infrastructure, the operational advice generalizes. Read with awareness of the vendor perspective.

**Elastic's "Search Relevance" documentation**: Elasticsearch's production documentation on BM25 tuning, hybrid search, and relevance feedback. One of the best practical resources on production search systems.

**pgvector GitHub repository and documentation**: The issues and discussions in the pgvector repository contain detailed performance characterizations and HNSW tuning advice from real production deployments.
```

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*