```markdown
---
title: "Custom Embedding Models: Fine-Tuning, Evaluation, and Deployment"
subtitle: "When Off-the-Shelf Isn't Enough and What to Do About It"
author: "Kelly Price"
date: "2026-04-21"
description: "A technical deep-dive for teams who need more than a general-purpose embedding model. Covers training data collection, triplet loss, architecture decisions, H100 training, ONNX deployment, monitoring, and building the data flywheel."
tags: [embeddings, ai, developer-tools, machine-learning]
---

# Custom Embedding Models: Fine-Tuning, Evaluation, and Deployment
## When Off-the-Shelf Isn't Enough and What to Do About It

*Kelly Price*

---

## About This Guide

This book started as a collection of notes from a production failure.

We had built a semantic code search tool on top of a general-purpose embedding model — one of the well-regarded ones, trained on hundreds of millions of web documents. It worked fine in demos. It worked fine on toy codebases. The moment we pointed it at a real monorepo with 400,000 files and years of accumulated internal naming conventions, it fell apart. Queries that any developer on the team would have answered immediately — "find the function that maps organization IDs to user lists" — returned results that were technically related to organizations and users, but in entirely the wrong sense. The model had never seen `getUsersByOrganizationId`. It had never seen any of the dozens of internal framework patterns that made up the actual semantic layer of that codebase. Its training corpus was the web. Our codebase was not the web.

The solution was to train a custom embedding model. That process took longer than it should have, not because the technical pieces were hard to find, but because the information was scattered across research papers, blog posts of wildly varying quality, framework documentation, and hard-won lessons that never made it out of private Slack channels. This book is an attempt to put that scattered information in one place.

The intended reader is a software engineer or ML engineer who understands transformers at a conceptual level, has some Python experience, and needs to solve a real retrieval problem that off-the-shelf models are failing at. You do not need to have trained a model from scratch before. You do not need a research background. You do need patience with the details, because the details are where most custom embedding projects succeed or fail.

What you will find here: a direct treatment of when custom embedding is actually warranted (the threshold is higher than you think), how to construct training data that will produce a useful model rather than one that merely looks good on synthetic benchmarks, the loss function choices that matter and the ones that don't, architecture decisions for teams that need full IP ownership, the practical realities of H100 training versus consumer hardware, how to export to ONNX and what will go wrong when you do, how to build evaluation sets that tell you something real, how to monitor a deployed model before it quietly degrades, and how to build the feedback loop that makes the model improve over time.

The production numbers in this book — training run durations, memory footprints, MRR@10 scores — are real. They come from PyckLM, a code and knowledge search model trained from random weight initialization specifically to avoid pretrained licensing constraints. Where those numbers appear, they appear as data points, not as performance claims. Your numbers will differ based on your data, your architecture, and what you're trying to retrieve. The methodology will transfer.

One note on scope: this book covers bi-encoder embedding models for retrieval, not cross-encoders, not generative models, and not contrastive learning for vision tasks. The techniques here are specific to dense retrieval with transformer-based encoders.

---

## Table of Contents

1. [When a General Model Isn't Enough](#chapter-1)
2. [Training Data Collection: What You Need and How to Get It](#chapter-2)
3. [Loss Functions: Triplet Loss, Margin Ranking, and Why MSE Fails](#chapter-3)
4. [Architecture Choices: Parameters, Dimensions, Pooling Strategy](#chapter-4)
5. [Training Infrastructure: H100 vs Consumer GPU, BF16, Mixed Precision](#chapter-5)
6. [ONNX Export and Production Deployment](#chapter-6)
7. [Evaluation: Building Eval Sets, MRR@10, NDCG, Cosine Calibration](#chapter-7)
8. [Monitoring in Production: Drift Detection and Quality Signals](#chapter-8)
9. [The Data Flywheel: Using Production Queries to Improve Training](#chapter-9)
- [Conclusion](#conclusion)
- [Appendix A: Glossary](#appendix-a)
- [Appendix B: Tools and Resources](#appendix-b)
- [Appendix C: Further Reading](#appendix-c)

---

## Chapter 1: When a General Model Isn't Enough {#chapter-1}

The first question to answer honestly is whether you need a custom model at all. Training a custom embedding model is a meaningful investment — weeks of engineering time at minimum, more if you're building the data pipeline from scratch, more still if you discover mid-process that your training data has structural problems. The general-purpose models are good. For many retrieval tasks, they are good enough. Starting with the assumption that you need custom is how teams waste significant time on a problem they didn't actually have.

So let's be precise about where general models break down.

A general embedding model is trained on a broad corpus — typically a mix of web text, Wikipedia, books, code repositories, and sometimes domain-specific data depending on the model. The training objective pushes semantically similar content toward the same region of the embedding space. For the kinds of semantic similarity that appear frequently in web text — documents about similar topics, sentences with similar meanings, code that implements common algorithms — this works well. The model has seen enough examples of each pattern that it develops a reasonable geometric intuition about what belongs near what.

The problem is distributional. When you ask the model to embed `getUsersByOrganizationId`, the tokenizer splits the identifier into subword tokens and the model tries to situate it in an embedding space built around the things it has seen repeatedly. It has seen `get`, `users`, `by`, `organization`, `id`. It has not seen this specific compound as a semantic unit, and more importantly, it has not seen the internal patterns of your codebase that give this function its actual meaning in context. It does not know that in your codebase, functions matching the pattern `get{EntityPlural}By{JoinField}` are always data layer functions that return database results as typed structs. That pattern is meaningful to your team. It is invisible to the general model.

> **Key Insight**
>
> The domain gap problem is not about vocabulary. A general model can tokenize your internal function names. The gap is semantic: the model has no geometric intuition about which of your internal concepts belong near each other, because it has never seen those concepts in relationship.

This matters most in three scenarios. The first is monorepos with deep internal conventions. When your codebase has developed years of consistent naming patterns, architectural layers, and domain-specific abstractions, a general model cannot infer the semantic relationships between those patterns. `OrderFulfillmentOrchestrator` and `ShipmentDispatchCoordinator` might be tightly coupled in your system — sibling classes in the same subsystem — but a general model may place them far apart because the words "fulfillment" and "dispatch" don't co-occur frequently in web text in ways that signal tight coupling.

The second scenario is proprietary or domain-specific languages. If your team has built a DSL, uses an internal query language, or works heavily with a framework that has limited public presence (an internal rules engine, a custom data pipeline definition language), the general model has essentially no signal for it. Every embedding will be a near-random projection based on the surface-form tokens.

The third scenario is knowledge bases with highly specific internal linking. Technical documentation, runbooks, and architectural decision records often develop their own vocabulary. "The NOVA pattern" means something specific to your team. "The staging-fanout issue" refers to a known incident. A retrieval system that can't match on internal reference terms is retrieving over the wrong semantic space.

The inverse is also worth stating clearly. General models are competitive — genuinely, not as a consolation — for codebases under roughly 10,000 files that follow standard naming conventions for the language. If your Python codebase follows PEP 8 and uses idiomatic patterns, `text-embedding-3-small` will likely perform well on it. The training data for most general models includes substantial amounts of Python with standard conventions. The model has seen enough of it to have a reasonable geometric intuition about which functions and classes belong near each other.

> **Warning**
>
> Before committing to custom training, run a rigorous evaluation of the best available general model on a held-out sample of real queries from your use case. If MRR@10 is above 0.35 and your use case can tolerate that, you may not need custom training. Build the eval set first — it's work you'll need anyway, and it may save you weeks of training.

The fine-tuning threshold is worth naming concretely: you need approximately 50,000 high-quality triplets before fine-tuning a general model meaningfully and consistently outperforms a well-calibrated general model on your specific domain. Below that threshold, the improvements are noisy — sometimes the fine-tuned model is better, sometimes it's worse, and you cannot reliably predict which. With 10,000 real hard-negative triplets you can often show improvement, but the variance is high enough that you'll spend significant effort validating that the improvement is real and not an artifact of your eval set construction.

Training from random initialization — as we did with PyckLM — requires more data than fine-tuning but gives you complete weight ownership and no pretrained lineage to navigate. For teams with IP constraints (meaning the model will be used in a commercial product and licensing of the base model matters), full initialization is sometimes the only clean option. The tradeoffs between fine-tuning and full training are covered in depth in Chapter 4.

The decision tree looks roughly like this: if your retrieval failures are primarily about general semantic relevance (user is searching for concepts that exist in web vocabulary), the problem is likely in your search architecture, not the model. If your failures are about domain-specific naming, proprietary patterns, or internal conventions that have no analog in public corpora, you have a domain gap problem that a custom model can address. If your codebase is small and conventional, fix the general model's calibration before training anything. If your codebase is large, heavily patterned, and generating retrieval failures that you cannot attribute to query formulation problems, custom training is the right investment.
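
The decision tree above can be written down as a small function. This is only a sketch: the field names and the returned strings are hypothetical, the inputs are judgments you would make from your own failure analysis, and the 10,000-file cutoff is the rough figure quoted earlier in this chapter rather than a hard rule.

```python
from dataclasses import dataclass

@dataclass
class RetrievalDiagnosis:
    # Hypothetical inputs, filled in from your own failure analysis.
    failures_are_domain_specific: bool   # internal naming, proprietary patterns
    file_count: int
    follows_standard_conventions: bool
    query_formulation_ruled_out: bool    # failures persist with well-formed queries

def recommend(d: RetrievalDiagnosis, small_repo_cutoff: int = 10_000) -> str:
    """Walk the decision tree in the order the text describes it."""
    if not d.failures_are_domain_specific:
        # General-relevance failures point at the search architecture.
        return "fix search architecture"
    if d.file_count < small_repo_cutoff and d.follows_standard_conventions:
        # Small, conventional codebases: calibrate before training anything.
        return "calibrate general model"
    if d.query_formulation_ruled_out:
        return "train custom model"
    return "investigate query formulation"

print(recommend(RetrievalDiagnosis(True, 400_000, False, True)))
```

The order of the checks matters: the cheaper explanations are ruled out before custom training is recommended.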

One final thing to consider before moving forward: the model is only one layer of your retrieval system. Bad chunking strategy, missing metadata, poor query preprocessing, and inadequate reranking can all produce retrieval failures that look like embedding model failures. Before attributing your retrieval problems to the model, eliminate the other variables. Train on a clean problem.

### Key Takeaways

- General embedding models fail primarily on domain-specific naming, proprietary frameworks, and internal conventions — not on general semantic tasks.
- The fine-tuning threshold is ~50K high-quality triplets before consistent improvement over calibrated general models; below that, results are noisy.
- For codebases under 10,000 files with standard naming conventions, general models are competitive and likely sufficient.
- The domain gap is semantic, not lexical — the model can tokenize your identifiers but has no geometric intuition about their relationships.
- Build your evaluation set before deciding to train. It may reveal the problem is elsewhere.
- Full initialization from random weights is the cleanest option for commercial IP ownership; fine-tuning carries pretrained model lineage.

### Practical Exercise

Take 30 real queries from your retrieval system, meaning queries that users have actually issued. For each query, record the top-10 results returned by the general model you're currently using. Then ask two developers familiar with the codebase to rate each result as relevant, somewhat relevant, or not relevant. Compute MRR@10 (detailed in Chapter 7) from those ratings. If MRR@10 > 0.35, hold off on custom training and address other system factors first. If MRR@10 < 0.20, you have a real domain gap problem. This exercise also produces the seed of your eval set.
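
To make the exercise concrete, here is a minimal sketch of the MRR@10 calculation. It assumes each query's ratings have been collapsed into a ranked list of booleans (`True` where the reviewers judged that result relevant); how you collapse the three rating levels is your call, and the function name and input shape are illustrative only.

```python
def mrr_at_k(ranked_relevance: list[list[bool]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant result within the top k."""
    if not ranked_relevance:
        return 0.0
    total = 0.0
    for results in ranked_relevance:
        for rank, is_relevant in enumerate(results[:k], start=1):
            if is_relevant:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_relevance)

# Made-up ratings for three queries, in ranked order.
ratings = [
    [True, False, False],    # first hit at rank 1 -> 1.0
    [False, False, True],    # first hit at rank 3 -> 1/3
    [False, False, False],   # no relevant result  -> 0
]
score = mrr_at_k(ratings)
print(round(score, 3))  # (1 + 1/3 + 0) / 3 = 0.444
```

A query with no relevant result in the top k contributes zero, which is why a handful of complete misses pulls the average down quickly on a 30-query set.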

---

## Chapter 2: Training Data Collection: What You Need and How to Get It {#chapter-2}

The quality of your training data determines your model's ceiling. This is not a platitude — it is the single most important technical decision in the entire process, and it is the one most teams underinvest in. A better architecture trained on poor data will underperform a simpler architecture trained on good data. Before you think about loss functions or hardware, get clear about what constitutes good training data for your use case.

The fundamental unit of training data for a retrieval model is the triplet: (anchor, positive, hard_negative). The anchor is a query. The positive is the correct retrieval target — the document, file, or passage that correctly answers the query. The hard negative is a document that is superficially similar to the positive but is not a correct answer for the anchor. This triplet structure is what allows contrastive training to work: the model is pushed to place the anchor near the positive and away from the hard negative in embedding space.

The most common mistake in triplet construction is using easy negatives — random documents from the corpus. If your corpus has 100,000 files and you randomly sample negatives, the vast majority of them will be clearly, obviously irrelevant to the anchor. The model learns to distinguish your codebase's `authentication middleware` from a random database migration file. This is a trivially easy distinction. The model gets good at it quickly, the loss drops, and training looks like it's going well. What you've actually trained is a model that can separate clearly different things — which a general model already does reasonably well. You haven't trained the model to handle the hard cases: the two authentication middlewares that serve different use cases, the three database utilities with similar names in different modules, the pair of functions that look related but aren't.

Hard negatives are where the discrimination happens. A hard negative is a document that shares vocabulary, structure, and surface-form similarity with the positive, but is not the correct answer for the anchor. For code search: if the anchor is "function that validates JWT tokens for the admin API", the positive is `validate_admin_jwt()`, and the hard negative might be `validate_user_jwt()` — same module, same pattern, wrong context. If the anchor is "database connection pooling setup", the positive is your main connection pool configuration, and the hard negative is a test fixture that also sets up connection pooling but for a different database.

> **Key Insight**
>
> Ten thousand triplets with real hard negatives will outperform one hundred thousand triplets with random negatives. The ratio is not an exaggeration. Hard negatives are the signal. Random negatives are noise with extra steps.

The minimum viable dataset for training with meaningful results is around 10,000 triplets with genuine hard negatives drawn from your actual corpus. For a fine-tuned model to consistently surpass a well-calibrated general baseline, the threshold rises to around 50,000. For training from random initialization on a specialized domain, you want north of 500,000, though the exact number depends heavily on domain complexity, vocabulary size, and how semantically rich your triplets are.

For PyckLM, Run 2 used 968,692 triplets drawn from Python, JavaScript, TypeScript, Go, Rust, and a corpus of Obsidian knowledge notes. The breadth of that dataset was intentional: the model needed to handle both code retrieval and knowledge base retrieval without mode collapse onto one domain. If your use case is narrower — say, exclusively Python code search in a single codebase — you can achieve strong results with far fewer triplets, but they need to be dense with hard negatives from your actual corpus.

### Automated Triplet Generation from Code Structure

For code search specifically, the AST is your best source of automated triplet generation. Functions that call each other are strong positive pairs: if `process_payment()` calls `validate_card_number()`, a query about payment processing should retrieve both. Functions in the same module with similar names are strong hard negative pairs: `get_user_by_id()` and `get_user_by_email()` are similar enough to be confusable but answer different queries.

Here is a concrete implementation using Python's `ast` module:

```python
import ast
import difflib
from dataclasses import dataclass
from pathlib import Path
from typing import Generator

@dataclass
class Triplet:
    anchor: str
    positive: str
    hard_negative: str

def extract_function_calls(source: str) -> dict[str, list[str]]:
    """Map each function to the functions it calls."""
    tree = ast.parse(source)
    call_map: dict[str, list[str]] = {}

    for node in ast.walk(tree):
        # Cover both sync and async definitions
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            calls = []
            for child in ast.walk(node):
                if isinstance(child, ast.Call):
                    if isinstance(child.func, ast.Name):
                        calls.append(child.func.id)      # plain call: foo()
                    elif isinstance(child.func, ast.Attribute):
                        calls.append(child.func.attr)    # attribute call: obj.foo()
            call_map[node.name] = calls
    return call_map

def generate_call_graph_triplets(repo_path: str) -> Generator[Triplet, None, None]:
    """
    Generate triplets from call relationships within each file.
    The caller provides the anchor, a function it calls is the positive,
    and a same-file function it does not call is the hard negative.
    """
    for py_file in Path(repo_path).rglob("*.py"):
        try:
            source = py_file.read_text(encoding="utf-8", errors="ignore")
            call_map = extract_function_calls(source)
        except (SyntaxError, ValueError):
            continue  # skip files that do not parse

        all_funcs = list(call_map.keys())
        for func_name, called_funcs in call_map.items():
            # dict.fromkeys de-duplicates repeated calls while keeping order
            for called in dict.fromkeys(called_funcs):
                # Skip calls defined elsewhere and recursive self-calls
                if called not in call_map or called == func_name:
                    continue
                # Positive: func_name calls called → they're semantically related
                # Hard negative: another function in same file that func_name doesn't call
                hard_neg_candidates = [
                    f for f in all_funcs
                    if f != func_name and f != called and f not in called_funcs
                ]
                if not hard_neg_candidates:
                    continue
                # Prefer the candidate whose name most resembles the positive,
                # so the negative is confusable rather than arbitrary
                hard_neg = max(
                    hard_neg_candidates,
                    key=lambda f: difflib.SequenceMatcher(None, f, called).ratio(),
                )

                # Placeholder text; for real training, substitute each function's
                # source (e.g. via ast.get_source_segment)
                anchor = f"function called {func_name}"
                positive = f"def {called}  # called by {func_name}"
                negative = f"def {hard_neg}  # in same file as {func_name}"

                yield Triplet(anchor=anchor, positive=positive, hard_negative=negative)
```

This gives you a starting corpus, but it's not enough on its own. The anchor texts here are synthetic — they don't reflect how developers actually write queries. The hardest part of training data construction for code search is the query side: you need anchors that sound like real developer queries, not like function name descriptions.

### Collecting Real Developer Queries

Real queries are irreplaceable. Synthetic query templates produce models that are good at answering synthetic queries. If you have an existing search interface — even a basic one — instrument it to collect the queries developers actually issue. Even 500 real queries with manually verified relevant documents are worth more than 50,000 synthetic queries.

If you don't have a search interface yet, the next best option is to collect queries from developer behavior: GitHub issue descriptions that reference specific code, PR descriptions that describe what was changed, code review comments that say "this looks like it should use X instead." These are all natural language queries with implicit positive documents (the files referenced or modified).

```python
import subprocess

def extract_pr_query_pairs(repo_path: str) -> list[dict]:
    """
    Extract (commit message, changed files) pairs as weak positive training signal.
    Commit messages, including squashed PR descriptions, often describe what the
    changed code does, which makes them usable as natural queries.
    """
    # %x1f separates fields and %x1e ends each record, so multi-line bodies (%b)
    # survive parsing; splitting the output on newlines would cut them apart.
    result = subprocess.run(
        ["git", "log", "--format=%H%x1f%s%x1f%b%x1e", "--diff-filter=M", "-n", "500"],
        cwd=repo_path,
        capture_output=True,
        text=True
    )

    pairs = []
    for record in result.stdout.split("\x1e"):
        parts = record.strip().split("\x1f")
        if len(parts) < 3:
            continue  # trailing fragment or malformed record
        commit_hash, subject, body = parts[0], parts[1], parts[2].strip()

        # Get files changed in this commit
        files_result = subprocess.run(
            ["git", "diff-tree", "--no-commit-id", "-r", "--name-only", commit_hash],
            cwd=repo_path,
            capture_output=True,
            text=True
        )
        changed_files = [f for f in files_result.stdout.strip().split("\n") if f.endswith(".py")]

        # Skip commits with no Python changes or a very short subject line
        if changed_files and len(subject) > 20:
            pairs.append({
                "query": f"{subject} {body}".strip(),
                "positive_files": changed_files
            })

    return pairs
```

> **Warning**
>
> Commit messages are noisy. "fix bug", "update", "WIP", and similar low-information messages produce useless training pairs. Filter for messages with at least 30 characters and that don't match a blocklist of low-information patterns. A dirty training set doesn't just fail to help — it actively degrades the model by teaching it wrong associations.
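
The filter described in the warning can be sketched as follows. The blocklist patterns are hypothetical examples, not a vetted list; build yours from the low-information messages that actually appear in your history.

```python
import re

# Hypothetical blocklist; extend it with patterns from your own commit history.
LOW_INFO_PATTERNS = [
    r"^merge (branch|pull request|remote-tracking branch)\b",
    r"^revert\b",
    r"^(wip|fixup!|squash!)",
    r"^(fix|fixes|update|updates|cleanup|misc)\W*$",
]
_LOW_INFO_RE = re.compile(
    "|".join(f"(?:{p})" for p in LOW_INFO_PATTERNS), re.IGNORECASE
)

def is_informative(message: str, min_length: int = 30) -> bool:
    """Keep messages that are long enough and match no blocklist pattern."""
    text = message.strip()
    if len(text) < min_length:
        return False
    return _LOW_INFO_RE.search(text) is None

messages = [
    "fix bug",
    "Merge pull request from team/feature-branch-with-long-name",
    "Add retry with exponential backoff to the payment webhook client",
]
kept = [m for m in messages if is_informative(m)]
print(kept)
```

The length check alone removes most one-word messages; the patterns exist mainly to catch long but uninformative ones such as merge commits.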

### Data from Public Sources

CodeSearchNet is the standard benchmark dataset for code search and also a useful training source. It contains 2 million (docstring, function) pairs across Python, JavaScript, Go, PHP, Java, and Ruby. The docstrings serve as natural language queries; the functions are the positive documents. The dataset is structured for direct use with contrastive training.

The limitation of CodeSearchNet for custom training is that it covers general open-source patterns. If your codebase uses idiomatic patterns specific to your domain or framework, CodeSearchNet will help with the general vocabulary but won't teach the domain-specific distinctions you need. Treat it as a base layer, not a complete solution.

For knowledge base retrieval, if your documents are structured (like Obsidian notes with consistent link patterns), you can generate triplets from the link graph: linked notes are positive pairs, notes in the same directory with similar tags but no direct link are hard negatives.
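
A minimal sketch of that link-graph approach, assuming the notes have already been parsed into simple records. The `Note` class and its fields are hypothetical, "similar tags" is interpreted here as sharing at least one tag with the anchor, and "same directory" is measured against the anchor note; both are assumptions you may want to tighten.

```python
from dataclasses import dataclass, field
from typing import Iterator

@dataclass
class Note:
    name: str
    directory: str
    tags: set[str] = field(default_factory=set)
    links: set[str] = field(default_factory=set)  # names of notes this note links to

def link_graph_triplets(notes: list[Note]) -> Iterator[tuple[str, str, str]]:
    """Yield (anchor, positive, hard_negative) note names."""
    by_name = {n.name: n for n in notes}
    for anchor in notes:
        for target in sorted(anchor.links):
            positive = by_name.get(target)
            if positive is None or positive.name == anchor.name:
                continue  # dangling link or self-link
            for cand in notes:
                if cand.name in (anchor.name, positive.name):
                    continue
                linked = cand.name in anchor.links or anchor.name in cand.links
                same_dir = cand.directory == anchor.directory
                shares_tag = bool(cand.tags & anchor.tags)
                if same_dir and shares_tag and not linked:
                    yield (anchor.name, positive.name, cand.name)

# Made-up notes for illustration.
notes = [
    Note("deploy-runbook", "ops", {"deploy"}, {"rollback-runbook"}),
    Note("rollback-runbook", "ops", {"deploy"}),
    Note("staging-checklist", "ops", {"deploy"}),
    Note("onboarding", "hr", {"people"}),
]
triplets = list(link_graph_triplets(notes))
print(triplets)
```

In this toy vault only `staging-checklist` qualifies as a hard negative: it sits in the same directory and shares a tag with the anchor but has no link to it.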

### Data Quality Over Data Quantity

The temptation, once you have a data generation pipeline, is to run it until you have millions of triplets. Resist this. More triplets from a low-quality generator make your model better at the wrong things. Before scaling your data generation, validate the triplets manually. Sample 200 triplets from your generated set. For each one, ask: is this hard negative actually hard? Would a developer confuse the positive and the hard negative if they were retrieving for the anchor query? If more than 20% of your sampled hard negatives are obviously irrelevant to the positive, your generation strategy needs work before you scale it.

The investment in data quality review is not glamorous work. It is the work that determines whether your training run produces a useful model or a waste of H100 time.

### Key Takeaways

- Hard negatives are the signal in contrastive training; easy (random) negatives are noise. A small dataset of hard negatives outperforms a large dataset of random negatives.
- The minimum viable dataset with real hard negatives is ~10,000 triplets; consistent improvement over general models with fine-tuning requires ~50,000.
- AST-based pair generation (call graphs for positives, same-module similar-name functions for hard negatives) is the most reliable automated approach for code search.
- Real developer queries are irreplaceable — synthetic templates produce models optimized for synthetic queries.
- Sample and manually review 200 triplets before scaling any generation strategy.
- CodeSearchNet provides a strong general-vocabulary foundation but does not capture domain-specific patterns.

### Practical Exercise

Write a script that generates 1,000 triplets from your codebase using the call-graph approach above. Sample 50 of them randomly and evaluate each one manually: is the hard negative actually hard (could a developer confuse the positive and negative for the given anchor), or is it obviously irrelevant? Record the "hard negative quality rate." If it's below 60%, modify your generation strategy before proceeding. If it's above 80%, you have a solid pipeline — scale it.
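
The bookkeeping for this exercise is small enough to sketch. The function names below are illustrative; the review itself stays manual, and the code only draws the sample and applies the 60% and 80% thresholds from the exercise. The text gives no rule for rates between the two thresholds, so the sketch simply reports them as borderline.

```python
import random

def sample_for_review(triplets: list, n: int = 50, seed: int = 0) -> list:
    """Draw a reproducible random sample of triplets for manual review."""
    rng = random.Random(seed)
    return rng.sample(triplets, min(n, len(triplets)))

def quality_verdict(is_hard_labels: list[bool]) -> tuple[float, str]:
    """Turn reviewer labels (True = the negative is genuinely hard) into a verdict."""
    if not is_hard_labels:
        raise ValueError("no labels to evaluate")
    rate = sum(is_hard_labels) / len(is_hard_labels)
    if rate < 0.60:
        return rate, "rework generation strategy"
    if rate > 0.80:
        return rate, "scale the pipeline"
    return rate, "borderline"

# Made-up review outcome: 45 of 50 sampled negatives judged hard.
labels = [True] * 45 + [False] * 5
rate, verdict = quality_verdict(labels)
print(rate, verdict)  # 0.9 scale the pipeline
```

Fixing the seed keeps the sample stable, so a second reviewer can rate the same 50 triplets.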

---

## Chapter 3: Loss Functions: Triplet Loss, Margin Ranking, and Why MSE Fails {#chapter-3}

Loss function selection for embedding training is an area where the literature is dense and the practical advice is thin. Papers introduce new contrastive objectives with impressive-sounding names; framework documentation lists a dozen options without much guidance on which to use when. This chapter cuts through that and focuses on what actually matters for retrieval-oriented embedding models.

Start with the fundamental question: what are we asking the loss function to do? We want the model's embedding space to reflect semantic relationships relevant to our retrieval task. In practice, this means: documents that should be retrieved for the same queries should have high cosine similarity; documents that should not be retrieved together should have low cosine similarity. The loss function is the mechanism that translates training signal (our triplets) into gradient updates that reshape the embedding space toward this goal.

### Why MSE Fails for Retrieval

Mean Squared Error loss, applied to embeddings directly, pushes the model to reproduce a target embedding vector for each input. This requires target vectors, which means you need a reference embedding space to define what "correct" looks like. There are two approaches: use a teacher model's embeddings as targets (knowledge distillation), or define target vectors manually (which is impractical at scale).

Knowledge distillation is sometimes useful — training a small model to approximate the output of a large model — but it has a fundamental limitation for retrieval: it teaches the student to reproduce the teacher's geometry, not to optimize for your retrieval task. If the teacher model has the domain gap problem described in Chapter 1, the student inherits it. You've spent training compute approximating the wrong model.

The deeper problem with MSE for embeddings is that it penalizes any deviation from the target vector, including deviations that are irrelevant to retrieval performance. A student's embeddings can sit far from the target vectors (a rotated copy of the teacher's space, for example) while still perfectly separating relevant from irrelevant documents. MSE doesn't understand retrieval; it understands point-to-point distance in a fixed reference frame.
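
A toy example makes this concrete. The sketch below uses made-up 2-D vectors and a "student" space that is simply the "teacher" space rotated by 90 degrees; none of the numbers come from a real model. Cosine similarities between items are unchanged by the rotation, so retrieval ranking is identical, yet MSE against the teacher vectors is large.

```python
import math

def rotate(v, degrees):
    """Rotate a 2-D vector counter-clockwise."""
    t = math.radians(degrees)
    x, y = v
    return (x * math.cos(t) - y * math.sin(t), x * math.sin(t) + y * math.cos(t))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

# Toy "teacher" embeddings (made-up values for illustration).
teacher = {"query": (1.0, 0.0), "relevant": (0.9, 0.1), "irrelevant": (0.0, 1.0)}
# A "student" whose space is the teacher's, rotated by 90 degrees.
student = {k: rotate(v, 90) for k, v in teacher.items()}

def gap(space):
    """How much closer the relevant document is to the query than the irrelevant one."""
    return cosine(space["query"], space["relevant"]) - cosine(space["query"], space["irrelevant"])

teacher_gap = gap(teacher)
student_gap = gap(student)
student_mse = sum(mse(teacher[k], student[k]) for k in teacher) / len(teacher)

print(f"gap teacher={teacher_gap:.3f} student={student_gap:.3f} mse={student_mse:.2f}")
```

The student separates relevant from irrelevant exactly as well as the teacher, but an MSE objective would treat it as badly wrong.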

> **Key Insight**
>
> MSE loss for retrieval embeddings is like training a sprinter by measuring how closely they match the running form of a reference athlete on video. The form might correlate with speed, but it's a proxy for the actual objective. Contrastive loss measures what you actually care about: relative distance between the things you want to separate.

### Triplet Loss: The Standard for Retrieval

Triplet loss is the workhorse of retrieval embedding training. Given a triplet (anchor $a$, positive $p$, hard negative $n$), the loss pushes the model to satisfy:

$$\text{sim}(a, p) - \text{sim}(a, n) > \text{margin}$$

With cosine similarity, this is:

```python
import torch
import torch.nn.functional as F

def triplet_loss(
    anchor: torch.Tensor,     # [batch_size, embed_dim]
    positive: torch.Tensor,   # [batch_size, embed_dim]
    negative: torch.Tensor,   # [batch_size, embed_dim]
    margin: float = 0.1
) -> torch.Tensor:
    # L2-normalize embeddings before cosine similarity
    anchor = F.normalize(anchor, p=2, dim=1)
    positive = F.normalize(positive, p=2, dim=1)
    negative = F.normalize(negative, p=2, dim=1)

    pos_sim = (anchor * positive).sum(dim=1)   # [batch_size]
    neg_sim = (anchor * negative).sum(dim=1)   # [batch_size]

    # Loss is zero when pos_sim - neg_sim > margin
    loss = F.relu(margin - (pos_sim - neg_sim))
    return loss.mean()
```

The margin hyperparameter deserves attention. With a margin of 0.1, the model receives a gradient signal only when the positive's similarity to the anchor fails to exceed the negative's by at least 0.1, which includes every case where the negative scores higher than the positive. This is a useful inductive bias: once a pair is well-separated, the model isn't pushed to separate it further. The practical effect is that training focuses on the hard cases (the pairs that are close together or wrongly ordered in embedding space), which is exactly where you want the model to improve.
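
A quick worked example of the hinge, using made-up similarity values, shows which triplets contribute to the loss at a margin of 0.1. The scalar `hinge` helper is illustrative and mirrors the per-example term inside `triplet_loss` above.

```python
def hinge(pos_sim: float, neg_sim: float, margin: float = 0.1) -> float:
    """Per-triplet loss: zero once pos_sim exceeds neg_sim by at least the margin."""
    return max(0.0, margin - (pos_sim - neg_sim))

cases = {
    "well separated": (0.80, 0.50),  # gap 0.30 exceeds margin -> no loss
    "too close": (0.60, 0.55),       # gap 0.05 is under margin -> small loss
    "wrong order": (0.40, 0.70),     # negative ranks higher    -> larger loss
}
losses = {name: hinge(p, n) for name, (p, n) in cases.items()}

for name, value in losses.items():
    print(f"{name}: {value:.2f}")
```

Only the second and third triplets produce a gradient; the first is already satisfied and is ignored for that step.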

Setting the margin too high (0.5+) causes training instability: the model keeps receiving gradient signals even for pairs that are already well-separated, which can cause previously good representations to degrade. Setting it too low (0.01) causes undertraining: the model satisfies the loss too easily and stops learning before the embedding space is well-organized. Values in the range 0.05–0.2 work well for most retrieval tasks. For PyckLM, we used the default margin with cosine similarity and found it stable across the full training run.

### In-Batch Negatives: The Efficiency Multiplier

Pure triplet loss uses one negative per triplet. In-batch negative sampling uses all other examples in the batch as negatives for each anchor-positive pair. This dramatically increases the number of learning signals per compute step.

```python
def in_batch_contrastive_loss(
    anchors: torch.Tensor,    # [batch_size, embed_dim]
    positives: torch.Tensor,  # [batch_size, embed_dim]
    temperature: float = 0.07
) -> torch.Tensor:
    anchors = F.normalize(anchors, p=2, dim=1)
    positives = F.normalize(positives, p=2, dim=1)

    # Similarity of every anchor against every positive in batch
    # [batch_size, batch_size]
    sim_matrix = torch.matmul(anchors, positives.T) / temperature

    # The diagonal is the correct (anchor_i, positive_i) pair
    labels = torch.arange(sim_matrix.size(0), device=anchors.device)

    # Cross-entropy: maximize diagonal, minimize off-diagonal
    loss = F.cross_entropy(sim_matrix, labels)
    return loss
```

This is the InfoNCE / NT-Xent objective used in SimCSE and many production retrieval models. The key property is that larger batches provide harder negatives implicitly: with a batch size of 512, every anchor is contrasted against 511 negatives, some of which will inevitably be semantically close. This is a form of automatic hard negative mining.

The temperature parameter controls the sharpness of the contrast. Lower temperatures (0.05–0.1) sharpen the softmax so that the hardest negatives dominate the gradient, which creates sharper separation but can cause training instability. Higher temperatures (0.2–0.5) are more stable but learn more slowly. A common starting point is 0.07; treat it as a default to tune, not a constant.
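
The effect of temperature is easiest to see on a single row of the similarity matrix. The following sketch uses made-up similarity values (one positive at 0.8 and three negatives) and a hypothetical helper, `softmax_with_temperature`, to compare the probability the softmax assigns to the positive at two temperatures.

```python
import math

def softmax_with_temperature(similarities: list[float], temperature: float) -> list[float]:
    """Softmax over cosine similarities divided by a temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)                          # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# One positive (index 0) against three in-batch negatives
sims = [0.8, 0.6, 0.3, 0.1]
sharp = softmax_with_temperature(sims, temperature=0.07)
soft = softmax_with_temperature(sims, temperature=0.5)
```

At 0.07 nearly all of the probability mass lands on the positive and the closest negative, so only the hardest negatives produce meaningful gradient. At 0.5 the mass is spread across all four candidates.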

> **Try This**
>
> Before training on your full dataset, run 1,000 steps with in-batch contrastive loss on a small batch (64 examples) and log the cosine similarity distribution of the positive pairs and the hardest-per-batch negative pairs every 100 steps. If the positive distribution isn't clearly moving higher and the negative distribution lower, you have a data problem, not a loss function problem.

### Margin Ranking Loss

Margin ranking loss is a simpler formulation that operates on pairs of scores (one positive score and one negative score per anchor) rather than on embedding triplets directly:

```python
def margin_ranking_loss(
    anchor: torch.Tensor,
    positive: torch.Tensor,
    negative: torch.Tensor,
    margin: float = 0.1
) -> torch.Tensor:
    anchor = F.normalize(anchor, p=2, dim=1)
    positive = F.normalize(positive, p=2, dim=1)
    negative = F.normalize(negative, p=2, dim=1)

    pos_scores = (anchor * positive).sum(dim=1)
    neg_scores = (anchor * negative).sum(dim=1)

    # target = 1 means pos_scores should rank above neg_scores; functional form of torch.nn.MarginRankingLoss
    labels = torch.ones(anchor.size(0), device=anchor.device)
    return F.margin_ranking_loss(pos_scores, neg_scores, labels, margin=margin)
```

In practice, margin ranking loss and triplet loss produce similar results for retrieval tasks. When triplet loss is written as a hinge on cosine similarities with the same margin, the two reduce to the same expression, `max(0, neg_score - pos_score + margin)`. The choice between them is largely a matter of what your training framework expects. Sentence Transformers ships `TripletLoss` as a built-in objective; if you're using that library, stick with `TripletLoss` rather than reimplementing.
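
The relationship between the two losses can be checked numerically. This sketch assumes the triplet loss is the hinge form on cosine similarities; the random tensors and the margin of 0.1 are placeholders.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
anchor = F.normalize(torch.randn(8, 16), dim=1)
positive = F.normalize(torch.randn(8, 16), dim=1)
negative = F.normalize(torch.randn(8, 16), dim=1)
margin = 0.1

pos_scores = (anchor * positive).sum(dim=1)
neg_scores = (anchor * negative).sum(dim=1)

# Margin ranking loss with target = 1 (first input should rank higher)
ranking = F.margin_ranking_loss(pos_scores, neg_scores, torch.ones(8), margin=margin)

# Hinge triplet loss written directly on the same cosine scores
triplet = torch.relu(neg_scores - pos_scores + margin).mean()
```

Both values agree to floating-point tolerance. A triplet loss built on Euclidean distance would not match, so check which distance your framework's implementation uses before assuming equivalence.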

### Multiple Negatives Ranking Loss

For applications where you have high-quality positive pairs but can't easily generate hard negatives, Multiple Negatives Ranking Loss (MNRL) is worth knowing. It is the Sentence Transformers implementation of the in-batch objective described above: it treats all other positives in the batch as negatives for each anchor, so it requires only (anchor, positive) pairs and no explicit negative mining.

```python
from sentence_transformers import losses

# MNRL from Sentence Transformers
# Only requires (anchor, positive) pairs
# Uses in-batch examples as negatives automatically
train_loss = losses.MultipleNegativesRankingLoss(model)
```

MNRL works well when your batch size is large (256+) and your positive pairs are diverse enough that in-batch negatives are genuinely hard. For smaller batches or datasets with many near-duplicate queries, explicit hard negatives are more reliable.

### Combining Loss Functions

For PyckLM, training used TripletLoss with cosine similarity throughout. For many production models, a combination of in-batch contrastive loss for the main training signal and explicit hard negative triplets for fine-tuning produces the best results: the in-batch objective shapes the overall geometry efficiently, and the hard negative triplets sharpen discrimination at the boundaries.
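
A weighted combination can be sketched in a few lines. The helper below is hypothetical (`combine_losses` is not a library function), and the default weight of 0.1 is a starting-point assumption, not a tuned value. It returns the individual components alongside the total so that each can be logged separately.

```python
def combine_losses(primary, secondary, secondary_weight: float = 0.1):
    """Weighted sum of two scalar losses, plus components for separate logging.

    Works with Python floats or 0-dim torch tensors.
    """
    total = primary + secondary_weight * secondary
    components = {
        "loss/primary": float(primary),
        "loss/secondary": float(secondary),
        "loss/total": float(total),
    }
    return total, components

# Example with plain floats standing in for per-batch loss values
total, components = combine_losses(primary=2.0, secondary=1.0)
```

In a training loop, `total` is the value you would call `.backward()` on, while `components` goes to your logger.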

> **Warning**
>
> If you're combining multiple loss functions, watch for conflicting gradients. Two loss functions that both apply to the same batch can push the model in contradictory directions, causing oscillation rather than convergence. Use one primary loss and introduce secondary objectives only with appropriate weighting (typically 0.1× the primary loss scale). Monitor the per-loss values independently during training.

### Key Takeaways

- MSE loss is inappropriate for retrieval embedding training because it optimizes point-to-point distance to a reference, not relative separation of relevant and irrelevant documents.
- Triplet loss with cosine similarity and a margin of 0.05–0.2 is the standard for retrieval; it focuses gradient signal on the hard cases.
- In-batch negative sampling (InfoNCE / NT-Xent) dramatically increases learning efficiency by using every example in the batch as a negative for every anchor.
- Temperature (0.07 is a common default) controls contrast sharpness; lower values learn faster but destabilize more easily.
- MNRL is appropriate when you have high-quality positive pairs but limited hard negatives and large batch sizes.
- Combining loss functions requires careful weighting; conflicting gradients cause oscillation.

### Practical Exercise

Implement a training loop that logs, every 100 steps: (1) the mean cosine similarity between anchor and positive embeddings, (2) the mean cosine similarity between anchor and hard negative embeddings, and (3) the loss value. After 1,000 steps of training on your data, plot these three curves. The positive similarity should increase, the negative similarity should decrease, and the loss should trend downward (per-batch values are noisy, so look at a smoothed curve instead of expecting a strictly monotonic decrease). If positive and negative similarities are converging toward each other rather than diverging, check whether your hard negatives are mislabeled positives or whether your learning rate is too high.

---

## Chapter 4: Architecture Choices: Parameters, Dimensions, Pooling Strategy {#chapter-4}

Architecture is the decision most teams spend the most time debating and the least time justifying empirically. The transformer literature has explored an enormous range of configurations, and the temptation is to pick the largest architecture that fits your compute budget and assume bigger is better. For retrieval embedding models, this reasoning is often backward.

The central insight is that embedding models for retrieval have a different optimization target than language models or even cross-encoders. A retrieval embedding model needs to produce vectors that efficiently separate relevant from irrelevant documents in cosine space. This requires the model to learn a good low-dimensional representation of semantic content, not to model complex conditional distributions over token sequences. Bigger architectures can learn more expressive representations, but they also overfit more readily on small training sets, run slower in production, and consume more memory — without necessarily producing better retrieval performance on your specific domain.

### The Parameter-Performance Tradeoff

The standard BERT-base architecture has 12 layers, 768 hidden dimensions, and 12 attention heads, for roughly 110M parameters. For embedding-only use (no generation), this is substantial overhead. The representations at layers 10–12 are often rich with language model knowledge that's irrelevant to your retrieval task.

The smaller transformer configurations — 6L × 384H × 8A — sit in a practical sweet spot for production retrieval:

- 40M parameters: small enough to fit multiple instances in GPU memory for batched inference
- 384-dimensional output: enough capacity to represent most retrieval distinctions, small enough for fast cosine search in vector databases
- 8 attention heads with 48 dimensions each: sufficient attention expressivity for code and prose retrieval
- Training speed: roughly 2.5× faster than BERT-base on the same data

PyckLM uses exactly this configuration (6L × 384H × 8A, 40M parameters), and the choice was validated empirically: at the 1M triplet scale, the 40M model outperformed a 110M model fine-tuned from a general checkpoint because the smaller model's regularization pressure was better matched to the training set size.
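
The data-to-capacity ratio behind this comparison is simple arithmetic. The sketch below expresses it as triplets per thousand parameters; the helper name is illustrative and the inputs are the round figures quoted in this chapter.

```python
def triplets_per_thousand_params(num_triplets: int, num_params: int) -> float:
    """Training triplets available per 1,000 model parameters."""
    return 1000 * num_triplets / num_params

small = triplets_per_thousand_params(1_000_000, 40_000_000)    # 25.0
large = triplets_per_thousand_params(1_000_000, 110_000_000)   # about 9.1
```

The larger model has well under half as much training signal per unit of capacity, which is the overfitting pressure discussed in this section.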

> **Key Insight**
>
> For retrieval embedding models specifically, parameter count per training triplet matters more than absolute parameter count. A 40M parameter model trained on 1M triplets has 25 triplets per thousand parameters. A 110M parameter model on the same data has about 9 triplets per thousand parameters, a ratio that invites overfitting on the surface patterns of your training set rather than the underlying semantics.

### Full Initialization vs. Fine-Tuning

The choice between training from random initialization and fine-tuning a pretrained model involves tradeoffs in four dimensions: training data requirements, IP ownership, initial convergence speed, and transfer from a general corpus.

**Fine-tuning** starts from a pretrained checkpoint (BERT, RoBERTa, DistilBERT, or a general embedding model). The pretrained weights encode substantial general language knowledge that transfers to your retrieval task. Fine-tuning typically requires less data to reach a useful performance level — in the 10K–100K triplet range, fine-tuning generally outperforms from-scratch training. The learning rate for fine-tuning should be 10× lower than a scratch training run (1e-5 versus 1e-4) to avoid catastrophic forgetting: the gradient updates should refine the pretrained representations, not destroy them. In PyckLM's fine-tune Run 3, we used exactly 1e-5 for this reason.

**Full initialization** starts from random weights and learns everything from your training data. This requires more data (500K+ triplets for most tasks) and more compute, but gives you complete weight ownership — the model has no pretrained lineage that could create licensing questions. For teams building commercial products, full initialization is often the cleaner legal position. The performance ceiling is higher given enough data: PyckLM at 968,692 triplets from scratch achieved 0.456 MRR@10 on CodeSearchNet, which is 62% better than GraphCodeBERT on the same benchmark. That result required the right data and architecture, not just scale.

```python
from transformers import BertConfig, BertModel
import torch

def create_model_from_scratch(
    num_layers: int = 6,
    hidden_size: int = 384,
    num_attention_heads: int = 8,
    intermediate_size: int = 1536,
    max_position_embeddings: int = 512,
    vocab_size: int = 30522
) -> BertModel:
    config = BertConfig(
        num_hidden_layers=num_layers,
        hidden_size=hidden_size,
        num_attention_heads=num_attention_heads,
        intermediate_size=intermediate_size,
        max_position_embeddings=max_position_embeddings,
        vocab_size=vocab_size,
    )
    # Random initialization — no pretrained weights
    model = BertModel(config)

    param_count = sum(p.numel() for p in model.parameters())
    print(f"Parameters: {param_count:,}")  # about 22.7M with these defaults; varies with vocab_size and intermediate_size
    return model
```
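
It helps to know where the parameters go before committing to a configuration. The sketch below does the arithmetic for a standard `BertModel` layout (embeddings, encoder layers, pooler) without instantiating anything. The function name `bert_param_count` is hypothetical, and the breakdown assumes the stock BERT layer structure with two token types. The number of attention heads does not appear because it only changes how the hidden dimension is split, not how many weights there are.

```python
def bert_param_count(
    num_layers: int = 6,
    hidden_size: int = 384,
    intermediate_size: int = 1536,
    max_position_embeddings: int = 512,
    vocab_size: int = 30522,
    type_vocab_size: int = 2,
) -> dict:
    """Parameter breakdown for a standard BertModel (with pooler)."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h                                    # weight + bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm              # Q, K, V, and output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    encoder = num_layers * (attention + feed_forward)
    pooler = h * h + h                                    # dense layer applied to [CLS]
    return {
        "embeddings": embeddings,
        "encoder": encoder,
        "pooler": pooler,
        "total": embeddings + encoder + pooler,
    }

small = bert_param_count()
base = bert_param_count(num_layers=12, hidden_size=768, intermediate_size=3072)
```

With the defaults shown, the sketch reports about 22.7M parameters, more than half of them in the embedding table, while the BERT-base settings come out near 109.5M. If the count you get differs from the figure you expect for your model, check `vocab_size` and `intermediate_size` first.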

### Embedding Dimensions: 384 vs 768 vs 1536

The output embedding dimension controls the capacity of the vector space your model produces. Higher dimensions can represent finer distinctions, but the benefits plateau quickly for most retrieval tasks, while the costs (storage, search latency, memory) scale linearly.

For most code and knowledge base retrieval tasks, 384 dimensions is sufficient. The cosine search space in 384 dimensions is rich enough to separate the distinctions you care about, and the storage cost is half of 768-dimensional vectors. Vector database operations scale with dimension — query time for approximate nearest neighbor search grows roughly O(d log n) in most implementations — so 384-dimensional vectors are meaningfully faster at query time than 768.

The case for 768 or higher is strong only when you have retrieval tasks requiring very fine-grained semantic distinctions across a very large corpus (millions of documents), or when you're attempting to represent multiple distinct semantic properties in the same vector (e.g., a single embedding that serves both keyword search and conceptual similarity). For most production retrieval systems, 384 works well.
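
The linear storage cost is worth computing for your own corpus size. This sketch assumes uncompressed float32 vectors and ignores index overhead (graph links, metadata), which varies by vector database; `index_size_gb` is an illustrative helper.

```python
def index_size_gb(num_vectors: int, dim: int, bytes_per_value: int = 4) -> float:
    """Raw vector storage in GB (10^9 bytes), excluding index overhead."""
    return num_vectors * dim * bytes_per_value / 1e9

# Raw storage for one million documents at each dimension
sizes = {dim: index_size_gb(1_000_000, dim) for dim in (384, 768, 1536)}
```

For one million documents this gives roughly 1.5 GB at 384 dimensions, 3.1 GB at 768, and 6.1 GB at 1536, before any index structures are added.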

> **Warning**
>
> Do not use 1536-dimensional embeddings in production retrieval unless you have measured the performance delta over 768 and verified it justifies the storage and query latency costs. Many teams adopt high-dimensional embeddings because a provider's API returns them by default, not because the extra dimensions improve their specific retrieval task.

### Pooling Strategy

Transformer models produce one embedding per input token. To get a single vector representing the whole document, you need a pooling strategy. The main options are:

**Mean pooling**: average all token embeddings (excluding padding tokens). This is the standard for Sentence-BERT and most production embedding models. It distributes the representational signal across all tokens, giving equal weight to each non-padding token's hidden state.

**CLS pooling**: use only the embedding of the `[CLS]` token, which in BERT-style models is a special token prepended to the input. In pretrained BERT checkpoints it learns to aggregate sequence information through the next-sentence prediction objective. CLS pooling was dominant in early BERT work but is generally inferior to mean pooling for semantic similarity tasks: in a pretrained checkpoint the CLS token's aggregation behavior comes from the pretraining objective, not a similarity objective, and when training from scratch the model has to learn that aggregation through a single position instead of drawing on every token.

**Max pooling**: take the maximum value across each embedding dimension. This preserves the most salient features per dimension but loses information about the distribution of features across tokens. Rarely used for retrieval.

Mean pooling followed by L2 normalization is the production standard:

```python
import torch
from transformers import AutoTokenizer, AutoModel

def mean_pool_and_normalize(
    model_output: torch.Tensor,   # [batch, seq_len, hidden]
    attention_mask: torch.Tensor  # [batch, seq_len]
) -> torch.Tensor:
    # Mask out padding tokens before averaging
    mask_expanded = attention_mask.unsqueeze(-1).float()
    sum_embeddings = (model_output * mask_expanded).sum(dim=1)
    sum_mask = mask_expanded.sum(dim=1).clamp(min=1e-9)
    mean_embeddings = sum_embeddings / sum_mask

    # L2 normalize to unit sphere for cosine similarity
    return torch.nn.functional.normalize(mean_embeddings, p=2, dim=1)
```
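
For comparison experiments, the other two strategies can be written in the same style. This is a minimal sketch, not the PyckLM code: `cls_pool` and `max_pool` are illustrative names, and `max_pool` assumes every sequence contains at least one non-padding token.

```python
import torch
import torch.nn.functional as F

def cls_pool(hidden: torch.Tensor) -> torch.Tensor:
    """Use the first token's hidden state ([CLS] in BERT-style inputs)."""
    return F.normalize(hidden[:, 0], p=2, dim=1)

def max_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Per-dimension max over non-padding tokens."""
    mask = attention_mask.unsqueeze(-1).bool()            # [batch, seq_len, 1]
    # Padding positions get -inf so they can never win the max
    masked = hidden.masked_fill(~mask, float("-inf"))
    return F.normalize(masked.max(dim=1).values, p=2, dim=1)
```

Both functions take the same `[batch, seq_len, hidden]` tensor as the mean pooling function, so swapping strategies in an experiment is a one-line change.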

For PyckLM, mean pooling + L2 normalization was used throughout training and inference. The L2 normalization step is not optional — it converts the dot product operation in your vector database from inner product to cosine similarity, which is what the training objective (cosine triplet loss) optimized for. If you train with cosine loss and deploy with inner product search, your production similarity scores will not match the geometry your model learned.
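
A toy example shows how the mismatch plays out. The vectors below are invented for illustration: one document is well aligned with the query but has a small norm, the other is less aligned but has a large norm.

```python
import math

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def l2_normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(u: list[float], v: list[float]) -> float:
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

query = [1.0, 0.0]
aligned_doc = [0.9, 0.1]     # close in direction, small norm
long_doc = [3.0, 4.0]        # further in direction, large norm

# Raw inner product ranks the high-norm document first
ip_prefers_long = dot(query, long_doc) > dot(query, aligned_doc)

# Cosine similarity ranks the better-aligned document first
cos_prefers_aligned = cosine(query, aligned_doc) > cosine(query, long_doc)

# After L2 normalization, inner product and cosine agree
ip_after_norm = dot(l2_normalize(query), l2_normalize(long_doc))
```

The two metrics rank the same pair of documents in opposite orders until the vectors are normalized, at which point the inner product is the cosine.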

### Vocabulary and Tokenizer Decisions

For code search specifically, the tokenizer choice matters more than it does for prose retrieval. Standard BPE tokenizers (GPT-2, Llama) and WordPiece tokenizers (BERT) both fragment code identifiers in ways that lose structural information. `getUsersByOrganizationId` becomes something like `get`, `##User`, `##s`, `##By`, `##Organization`, `##Id` — which technically preserves the content but destroys the camelCase structure that encodes semantic information.

If you're training from scratch, consider using a tokenizer trained on a code-heavy corpus, or training a custom BPE tokenizer on your codebase. A tokenizer trained to recognize camelCase and snake_case patterns as meaningful units will produce better code embeddings than one that fragments them aggressively.

For PyckLM, we used the standard BERT tokenizer (WordPiece, 30,522 vocabulary) because the training corpus included both code and prose and a shared vocabulary was required. For pure code search, a code-tuned tokenizer would improve performance on identifier-heavy queries.
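
If retraining a tokenizer is out of scope, one lightweight alternative is to split identifiers on their case and underscore boundaries before tokenization, so the word boundaries survive even with a general vocabulary. This is a sketch of that idea, not something PyckLM used; `split_identifier` and its regular expression are illustrative.

```python
import re

# Uppercase runs (acronyms), capitalized or lowercase words, and digit runs
_IDENTIFIER_PARTS = re.compile(r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+")

def split_identifier(identifier: str) -> list[str]:
    """Split camelCase, PascalCase, and snake_case identifiers into word parts."""
    return _IDENTIFIER_PARTS.findall(identifier)

camel = split_identifier("getUsersByOrganizationId")
snake = split_identifier("user_id_map")
acronym = split_identifier("HTTPServer")
```

The parts can be joined with spaces and appended to the original text, which keeps the raw identifier searchable while giving the tokenizer clean word boundaries.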

### Key Takeaways

- 6L × 384H × 8A (40M parameters) is the practical sweet spot for production retrieval: fast inference, roughly 2.75× fewer parameters than BERT-base, and sufficient capacity for most retrieval tasks.
- Triplets-per-parameter ratio matters: a smaller model trained on the same data typically generalizes better than a larger model in the 10K–1M triplet range.
- Fine-tuning requires ~10× lower learning rate than scratch training (1e-5 versus 1e-4) to avoid catastrophic forgetting.
- Mean pooling + L2 normalization is the standard; CLS pooling is inferior for similarity tasks trained from scratch.
- L2 normalization at inference time is mandatory if you trained with cosine loss; mismatched geometry between training and deployment silently degrades retrieval.
- Custom tokenizers trained on code-heavy corpora improve identifier representation for code search.

### Practical Exercise

Train the same 40M architecture twice on 50,000 of your training triplets: once with CLS pooling and once with mean pooling. Evaluate both on your held-out eval set (30 real queries, manually rated). Report the MRR@10 for each. In most cases, mean pooling will outperform CLS pooling by 5–15%. If CLS pooling trails by much more than that, your training data may be too small for the model to learn effective CLS aggregation, so consider increasing data first. With only 30 queries, treat differences of a few percent as noise.

---

## Chapter 5: Training Infrastructure: H100 vs Consumer GPU, BF16, Mixed Precision {#chapter-5}

The hardware question for embedding model training is more tractable than it might appear. Unlike language model training, which often requires hundreds of GPUs and weeks of compute, a solid embedding model can be trained to production quality on a single GPU in under an hour — if you're using the right hardware and precision settings.

This chapter is organized around the practical decisions: what hardware is actually required, how to set up mixed precision training correctly, and what to watch for during training runs.

### GPU Requirements: H100 vs Consumer Options

For the scales discussed in this book (up to ~1M triplets, 40M parameter model), training is feasible on consumer hardware. An RTX 4090 (24GB VRAM) can handle batch sizes of 256–512 for this architecture with BF16 precision. Training 968,692 triplets on an RTX 4090 would take roughly 3–4 hours. On an H100 80GB, the same run completes in 38.8 minutes — PyckLM Run 2's actual time.

The choice between consumer hardware and cloud H100s is primarily economic and logistical, not technical. If you're iterating rapidly on training data and architecture, the H100's throughput means you get more experiments per day. If you're doing a single production training run after careful data preparation, an RTX 4090 is entirely viable.

The H100's advantages are:

1. **Raw throughput**: roughly 3× the FP16/BF16 TFLOPS of an RTX 4090 in practice for transformer workloads
2. **Memory bandwidth**: 3.35 TB/s (H100 SXM) vs 1.008 TB/s (RTX 4090), which is critical for attention computation
3. **NVLink for multi-GPU**: if you scale to larger runs, H100s can be connected with NVLink for efficient multi-GPU training without the PCIe bottleneck
4. **Reliable BF16 support**: consumer cards support BF16 starting with RTX 30 series, but H100 has first-class hardware support

For teams without on-premise H100 access, Lambda Labs, CoreWeave, and similar cloud providers offer H100 instances at roughly $2–4/hour. A 38-minute training run costs under $3 in cloud compute — the economics heavily favor doing more experiments with clean data over optimizing a single run.
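
The cost arithmetic is easy to rerun for your own run length and provider pricing. The helper below is illustrative; the 38.8-minute duration and the $2–4/hour range are the figures quoted above.

```python
def run_cost_usd(minutes: float, hourly_rate_usd: float) -> float:
    """Cloud cost of a single training run billed by the hour, pro rata."""
    return minutes / 60 * hourly_rate_usd

low = run_cost_usd(38.8, 2.0)     # about 1.29
high = run_cost_usd(38.8, 4.0)    # about 2.59
```

Providers that bill in whole-hour increments will round these figures up, so check the billing granularity before budgeting for many short experiments.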

> **Key Insight**
>
> The total cost of a PyckLM-scale training run (968K triplets, 40M model, 38.8 minutes on H100) in cloud compute is approximately $1.30–$2.60 at $2–4/hour. If you're spending more than 2 hours debugging training infrastructure instead of just running on cloud H100, you're optimizing in the wrong direction.

### BF16 vs FP16 vs FP32

Precision selection is one of the highest-leverage settings in training. FP32 (32-bit float) is numerically stable but wastes memory and compute. FP16 (16-bit float with a 5-bit exponent) is fast but has a narrow dynamic range: small gradient values underflow to zero unless you apply loss scaling. BF16 (bfloat16, 16-bit with an 8-bit exponent, the same exponent width as FP32) covers the same range as FP32 with far less precision (7 mantissa bits versus 23). Crucially, gradient underflow is rarely a problem, and most training procedures work with BF16 at the same hyperparameters as FP32.
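
The range-versus-precision tradeoff can be observed directly by casting a few values. The specific numbers below (`1e-8` and `1.001`) are chosen for illustration only.

```python
import torch

# A magnitude that a small gradient value might plausibly take
tiny = 1e-8
fp16_tiny = torch.tensor(tiny, dtype=torch.float16).item()    # underflows to zero
bf16_tiny = torch.tensor(tiny, dtype=torch.bfloat16).item()   # survives, approximately

# A value just above 1.0, to probe precision
near_one = 1.001
fp16_near = torch.tensor(near_one, dtype=torch.float16).item()   # stays distinct from 1.0
bf16_near = torch.tensor(near_one, dtype=torch.bfloat16).item()  # rounds to 1.0

# Largest representable finite values
fp16_max = torch.finfo(torch.float16).max
bf16_max = torch.finfo(torch.bfloat16).max
```

FP16 loses the tiny value entirely but keeps the small difference near 1.0; BF16 does the opposite. For gradient-based training, losing small values is usually the more damaging failure, which is why FP16 needs loss scaling and BF16 generally does not.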

For modern transformer training on supported hardware, BF16 is the default recommendation:

```python
import torch
from torch.cuda.amp import autocast, GradScaler
from transformers import TrainingArguments

# BF16 training — native BF16, no loss scaling needed
training_args = TrainingArguments(
    output_dir="./model_output",
    per_device_train_batch_size=256,
    learning_rate=1e-4,
    bf16=True,           # BF16 for Ampere/Hopper GPUs (A100, H100, RTX 3090+)
    fp16=False,          # Don't mix with fp16
    num_train_epochs=3,
    warmup_ratio=0.1,
    weight_decay=0.01,
    dataloader_num_workers=4,
    logging_steps=100,
    save_steps=1000,
    evaluation_strategy="steps",  # renamed to eval_strategy in newer transformers releases
    eval_steps=500,
)
```

If you're on an older GPU (V100 or earlier) that doesn't support BF16 natively, use FP16 with gradient scaling:

```python
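import torch
from torch.cuda.amp import autocast, GradScaler  # newer PyTorch exposes these under torch.amp

# FP16 with gradient scaling: use when BF16 is unavailable
scaler = GradScaler()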

def training_step(model, batch, optimizer):
    with autocast(dtype=torch.float16):
        anchor_emb = model(**batch["anchor"])
        pos_emb = model(**batch["positive"])
        neg_emb = model(**batch["negative"])
        loss = triplet_loss(anchor_emb, pos_emb, neg_emb)

    scaler.scale(loss).backward()
    scaler.unscale_(optimizer)
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    scaler.step(optimizer)
    scaler.update()
    optimizer.zero_grad()
    return loss.item()
```

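The gradient clipping (`max_norm=1.0`) is important with FP16 to keep occasional gradient spikes from destabilizing training. Note the order: `scaler.unscale_(optimizer)` runs before clipping so that the threshold applies to the true gradient magnitudes, not the scaled ones. With BF16 on H100, gradient clipping is still good practice but gradient scaling is not needed.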
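
For contrast, here is what the same step might look like with BF16 autocast. This is a sketch under several assumptions: the model maps an input tensor straight to embeddings (unlike the `model(**batch[...])` call above), the loss is a hinge on cosine similarities with a margin of 0.1, and embeddings are cast to FP32 before the loss is computed. The `device_type` parameter exists so the function can also be exercised on CPU.

```python
import torch
import torch.nn.functional as F

def bf16_training_step(model, batch, optimizer, margin: float = 0.1,
                       device_type: str = "cuda") -> float:
    # BF16 autocast: no GradScaler, because BF16 shares FP32's exponent range
    with torch.autocast(device_type=device_type, dtype=torch.bfloat16):
        anchor = model(batch["anchor"])
        positive = model(batch["positive"])
        negative = model(batch["negative"])

    # Compute similarities and loss in FP32 for numerical headroom
    a = F.normalize(anchor.float(), dim=1)
    p = F.normalize(positive.float(), dim=1)
    n = F.normalize(negative.float(), dim=1)
    loss = torch.relu((a * n).sum(dim=1) - (a * p).sum(dim=1) + margin).mean()

    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()
```

Compared with the FP16 version, the scaler calls disappear entirely; the backward pass, clipping, and optimizer step are ordinary PyTorch.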

### Batch Size and Learning Rate Scheduling

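For contrastive training, batch size directly affects training quality through its effect on in-batch negatives. Larger batches provide more diverse negatives per step. The practical guideline: use the largest batch size that fits in VRAM without gradient accumulation. Gradient accumulation increases the number of examples per optimizer step, but each forward pass still only sees the negatives in its own micro-batch, so it does not substitute for a larger physical batch.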

For the 40M architecture with 512-token sequences in BF16 on an H100 80GB, batch sizes of 512–1024 are feasible. For the RTX 4090 with 24GB VRAM, batch sizes of 256–384 work without gradient accumulation.

Linear learning rate warmup followed by cosine decay is the standard schedule:

```python
from transformers import get_cosine_schedule_with_warmup
from torch.optim import AdamW

def build_optimizer_and_scheduler(model, num_training_steps, warmup_ratio=0.1):
    # AdamW is standard for transformer training
    optimizer = AdamW(
        model.parameters(),
        lr=1e-4,       # For scratch training; use 1e-5 for fine-tuning
        weight_decay=0.01,
        eps=1e-8
    )

    num_warmup_steps = int(num_training_steps * warmup_ratio)
    scheduler = get_cosine_schedule_with_warmup(
        optimizer,
        num_warmup_steps=num_warmup_steps,
        num_training_steps=num_training_steps
    )
    return optimizer, scheduler
```

Warmup prevents early training instability: with random initialization, the initial gradient magnitudes are large and can destabilize training if the full learning rate is applied immediately. A 10% warmup over the first epoch is standard.
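
The shape of the schedule is easy to compute by hand. The sketch below is intended to mirror the curve that `get_cosine_schedule_with_warmup` produces with its default settings (linear ramp, then a half-cosine decay to zero); treat it as an illustration, not a reimplementation. `lr_at_step` is a hypothetical helper.

```python
import math

def lr_at_step(step: int, total_steps: int, peak_lr: float = 1e-4,
               warmup_ratio: float = 0.1) -> float:
    """Learning rate under linear warmup followed by cosine decay to zero."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase completed, in [0, 1]
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_step(s, total_steps=1000) for s in (0, 50, 100, 550, 1000)]
```

For a 1,000-step run the rate starts at zero, reaches half the peak at step 50, hits the peak at step 100, falls to half the peak at step 550, and returns to zero at the final step.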

### Monitoring Training Runs

Logging loss alone is insufficient for embedding model training. Log these metrics every 100 steps:

```python
import wandb
import torch
import torch.nn.functional as F

def log_training_metrics(
    step: int,
    loss: float,
    anchor_emb: torch.Tensor,
    pos_emb: torch.Tensor,
    neg_emb: torch.Tensor
):
    with torch.no_grad():
        a = F.normalize(anchor_emb, dim=1)
        p = F.normalize(pos_emb, dim=1)
        n = F.normalize(neg_emb, dim=1)

        pos_sim = (a * p).sum(dim=1).mean().item()
        neg_sim = (a * n).sum(dim=1).mean().item()
        margin = pos_sim - neg_sim

        # Fraction of triplets where the negative outranks the positive (margin not included)
        violation_rate = ((a * n).sum(dim=1) > (a * p).sum(dim=1)).float().mean().item()

    wandb.log({
        "train/loss": loss,
        "train/pos_cosine": pos_sim,
        "train/neg_cosine": neg_sim,
        "train/margin": margin,
        "train/violation_rate": violation_rate,
        "step": step,
    })
```

The violation rate is particularly useful: it's the fraction of triplets in the batch where the model currently ranks the negative above the positive. At the start of training, this will be ~0.5 (random). As training progresses, it should drop toward 0.05–0.15. A violation rate that plateaus above 0.2 usually indicates hard negatives that are too hard (mislabeled data where the "negative" is actually a correct answer), or a learning rate that's too high.
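
It can also be useful to track a stricter, margin-aware version of this metric: the fraction of triplets that still contribute to a hinge loss. The sketch below computes both from precomputed similarities. The helper name and the default margin of 0.1 are assumptions for illustration.

```python
import torch

def violation_rates(pos_sim: torch.Tensor, neg_sim: torch.Tensor,
                    margin: float = 0.1) -> dict:
    """Ranking violations (negative above positive) and margin violations."""
    ranking = (neg_sim > pos_sim).float().mean().item()
    within_margin = (pos_sim - neg_sim < margin).float().mean().item()
    return {
        "ranking_violation_rate": ranking,
        "margin_violation_rate": within_margin,
    }

# Four example triplets: similarities to the positive and to the negative
pos = torch.tensor([0.90, 0.55, 0.30, 0.80])
neg = torch.tensor([0.20, 0.50, 0.60, 0.75])
rates = violation_rates(pos, neg)
```

In this example only one triplet is ranked incorrectly, but three of the four are still inside the margin. The margin-aware rate is always at least as high as the ranking rate, and the gap between them tells you how much of the loss comes from triplets that are ranked correctly but not yet well-separated.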

> **Warning**
>
> Never save a checkpoint based solely on training loss. Always checkpoint based on validation MRR@10 on your held-out eval set. Training loss can decrease while validation retrieval quality plateaus or degrades — the classic sign of overfitting to the surface patterns in your training set.

### Multi-GPU Training

For runs requiring multi-GPU setup (multiple training epochs over millions of triplets), PyTorch's Distributed Data Parallel is the standard approach. With Hugging Face's `Trainer`, it requires minimal configuration:

```bash
# Launch on 4 GPUs with DDP
torchrun --nproc_per_node=4 train_embedding.py \
    --bf16 \
    --per_device_train_batch_size=256 \
    --gradient_accumulation_steps=1 \
    --output_dir ./model_output
```

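The effective batch size for the gradient update with DDP is `per_device_batch_size × num_gpus`. With 4 GPUs and batch size 256 per device, each optimizer step averages gradients over 1024 examples. Note that this does not automatically enlarge the pool of in-batch negatives: by default each process computes its loss on its own 256 examples. To contrast every anchor against all 1024, you need to gather embeddings across processes before computing the loss, and doing so is one of the primary benefits of multi-GPU training for contrastive objectives.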
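
In-batch negatives only span GPUs if the embeddings are shared across processes before the loss is computed. The sketch below shows one common pattern, assuming equal batch sizes on every rank: gather the positives from all ranks, then put the local tensor back in its slot so gradients still flow through it. Both helper names are hypothetical, and in a single-process run the functions fall back to the local batch.

```python
import torch
import torch.distributed as dist

def _distributed() -> bool:
    return dist.is_available() and dist.is_initialized()

def gather_with_local_grad(local: torch.Tensor) -> torch.Tensor:
    """Concatenate embeddings from all ranks; gradients flow through the local shard only."""
    if not _distributed():
        return local                                   # single-process fallback
    gathered = [torch.zeros_like(local) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, local.detach().contiguous())
    gathered[dist.get_rank()] = local                  # re-insert the tensor with autograd history
    return torch.cat(gathered, dim=0)

def global_in_batch_labels(local_batch_size: int, device=None) -> torch.Tensor:
    """Index of each local anchor's positive within the gathered positives."""
    rank = dist.get_rank() if _distributed() else 0
    return torch.arange(local_batch_size, device=device) + rank * local_batch_size
```

The local anchors are then scored against the gathered positives, and the shifted labels feed the same cross-entropy used in the single-GPU in-batch loss.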

### Key Takeaways

- A single H100 can train a 40M parameter model on 1M triplets in under 40 minutes; at $2–4/hour for cloud H100s, the economics strongly favor cloud compute over optimizing consumer GPU training.
- BF16 is the default precision for H100/A100/RTX 30+ series; FP16 with gradient scaling is the fallback for older GPUs.
- Batch size directly affects in-batch negative quality for contrastive training — use the largest batch that fits in VRAM.
- Cosine learning rate schedule with 10% warmup is standard; scratch training uses 1e-4, fine-tuning uses 1e-5.
- Log pos/neg cosine similarity and violation rate in addition to loss; checkpoint on validation MRR@10, not training loss.
- Effective batch size scales linearly with GPU count in DDP training; in-batch negative diversity improves with it only if embeddings are gathered across processes before the loss is computed.

### Practical Exercise

Set up a training run on your triplet dataset with explicit W&B (or TensorBoard) logging of loss, pos_cosine, neg_cosine, and violation_rate. Run for exactly 1,000 steps and plot the four curves. By step 1,000, pos_cosine should be above 0.7, neg_cosine below 0.4, and violation_rate below 0.2. If these targets aren't met, diagnose: is the learning rate appropriate, are your hard negatives genuinely hard, is your batch size large enough to provide useful in-batch signal?

---

## Chapter 6: ONNX Export and Production Deployment {#chapter-6}

Training a model that performs well on eval sets is the midpoint, not the finish line. Getting that model into production with acceptable memory footprint, latency, and operational stability requires a separate set of decisions. This chapter covers the practical path from PyTorch training checkpoint to a production-serving artifact.

The primary deployment decision for embedding models is the serving format. PyTorch native (`.pt` files served with `torch.load()`) is simple but carries the full framework overhead: the PyTorch model weights for a 40M parameter model are approximately 160MB, but the full PyTorch process consuming those weights will use around 820MB of resident memory, because PyTorch loads the full framework, all registered CUDA kernels, and the various runtime components. For a model that will be queried millions of times per day, that memory overhead is real infrastructure cost.

ONNX export reduces this substantially. The PyckLM ONNX model is approximately 30MB on disk and uses 197MB of resident memory in production — a 4x reduction in memory footprint compared to the PyTorch runtime, with no loss in output quality. The inference speed is comparable on CPU and often faster on GPU due to ONNX Runtime's graph optimization passes.

### Basic ONNX Export

```python
import torch
from transformers import AutoTokenizer, AutoModel
import onnx
import onnxruntime as ort
import numpy as np

def export_to_onnx(
    model_path: str,
    output_path: str,
    opset_version: int = 14,
    max_seq_length: int = 512
) -> None:
    tokenizer = AutoTokenizer.from_pretrained(model_path)
    model = AutoModel.from_pretrained(model_path)
    model.eval()

    # Create dummy inputs for tracing
    dummy_text = "example input for export"
    dummy_inputs = tokenizer(
        dummy_text,
        return_tensors="pt",
        max_length=max_seq_length,
        padding="max_length",
        truncation=True
    )

    # Export with dynamic axes for variable sequence length and batch size
    torch.onnx.export(
        model,
        (dummy_inputs["input_ids"], dummy_inputs["attention_mask"]),
        output_path,
        opset_version=opset_version,
        input_names=["input_ids", "attention_mask"],
        output_names=["last_hidden_state", "pooler_output"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence"},
            "attention_mask": {0: "batch_size", 1: "sequence"},
            "last_hidden_state": {0: "batch_size", 1: "sequence"},
            "pooler_output": {0: "batch_size"},
        },
        do_constant_folding=True,
    )

    # Verify the export
    onnx_model = onnx.load(output_path)
    onnx.checker.check_model(onnx_model)
    print(f"ONNX model exported and validated: {output_path}")
```

This basic export handles the transformer backbone, but it doesn't include the mean pooling and L2 normalization steps that convert the per-token hidden states to a single unit-normalized embedding vector. For production use, you want to wrap the full pipeline — tokenization excluded, since that's handled separately — into the exported artifact.

### The TorchScript Export Approach

A more robust approach for production is to use TorchScript to export the full SentenceEmbedder pipeline (model + pooling + normalization) as a single traced module. This is the approach we settled on for PyckLM, specifically because it bypasses a bug in certain versions of the `transformers` library where `masking_utils` interferes with ONNX export through the standard `model.forward()` path:

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel

class SentenceEmbedder(torch.nn.Module):
    """
    Wraps transformer + mean pooling + L2 normalization into a single module.
    TorchScript-compatible.
    """
    def __init__(self, model_path: str):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_path)

    def forward(
        self,
        input_ids: torch.Tensor,
        attention_mask: torch.Tensor
    ) -> torch.Tensor:
        outputs = self.encoder(
            input_ids=input_ids,
            attention_mask=attention_mask
        )
        hidden = outputs.last_hidden_state  # [batch, seq, hidden]

        # Mean pool with mask
        mask = attention_mask.unsqueeze(-1).float()
        summed = (hidden * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        pooled = summed / counts

        # L2 normalize
        return F.normalize(pooled, p=2, dim=1)

def export_sentence_embedder(model_path: str, output_path: str) -> None:
    embedder = SentenceEmbedder(model_path)
    embedder.eval()

    # Trace with representative inputs
    dummy_ids = torch.randint(0, 30522, (1, 128))
    dummy_mask = torch.ones(1, 128, dtype=torch.long)

    traced = torch.jit.trace(embedder, (dummy_ids, dummy_mask))

    # Export traced model to ONNX
    torch.onnx.export(
        traced,
        (dummy_ids, dummy_mask),
        output_path,
        input_names=["input_ids", "attention_mask"],
        output_names=["embeddings"],
        dynamic_axes={
            "input_ids": {0: "batch_size", 1: "sequence"},
            "attention_mask": {0: "batch_size", 1: "sequence"},
            "embeddings": {0: "batch_size"},
        },
        opset_version=14,
    )
    print(f"SentenceEmbedder exported to ONNX: {output_path}")
```

### ONNX Runtime Inference with Graceful CUDA Degradation

For production serving, you want CUDA inference when available and a clean fallback to CPU when it's not — without hard-coding GPU availability assumptions into your deployment:

```python
import onnxruntime as ort
import numpy as np
from transformers import AutoTokenizer

class OnnxEmbedder:
    def __init__(self, model_path: str, tokenizer_path: str, max_length: int = 512):
        self.tokenizer = AutoTokenizer.from_pretrained(tokenizer_path)
        self.max_length = max_length

        # Detect available providers cleanly — no hardcoded CUDA assumption
        available = ort.get_available_providers()
        if "CUDAExecutionProvider" in available:
            providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
        else:
            providers = ["CPUExecutionProvider"]

        sess_options = ort.SessionOptions()
        sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
        sess_options.intra_op_num_threads = 4

        self.session = ort.InferenceSession(
            model_path,
            sess_options=sess_options,
            providers=providers
        )
        active_provider = self.session.get_providers()[0]
        print(f"ONNX Runtime using: {active_provider}")

    def encode(self, texts: list[str], batch_size: int = 64) -> np.ndarray:
        all_embeddings = []

        for i in range(0, len(texts), batch_size):
            batch = texts[i:i + batch_size]
            encoded = self.tokenizer(
                batch,
                max_length=self.max_length,
                padding=True,
                truncation=True,
                return_tensors="np"
            )

            outputs = self.session.run(
                ["embeddings"],
                {
                    "input_ids": encoded["input_ids"].astype(np.int64),
                    "attention_mask": encoded["attention_mask"].astype(np.int64),
                }
            )
            all_embeddings.append(outputs[0])

        return np.vstack(all_embeddings)
```

The `get_available_providers()` call is the correct pattern for CUDA detection in ONNX Runtime. Checking `torch.cuda.is_available()` in an ONNX Runtime serving context is the wrong test: PyTorch and ONNX Runtime are separate frameworks that discover CUDA independently, so one can see the GPU while the other cannot.
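
One gap worth checking at startup: `get_available_providers()` reports what the installed ONNX Runtime build supports, while `session.get_providers()` reports what the session actually activated. If the CUDA libraries fail to load when the session is created, the session may end up on CPU even though CUDA was requested. A minimal sketch of a check for that case, assuming you keep the requested provider list around; the helper name `gpu_fallback_occurred` is hypothetical:

```python
def gpu_fallback_occurred(requested: list[str], active: list[str]) -> bool:
    """Return True when CUDA was requested but the session is not using it."""
    wanted_cuda = "CUDAExecutionProvider" in requested
    got_cuda = "CUDAExecutionProvider" in active
    return wanted_cuda and not got_cuda


# Hypothetical usage with the OnnxEmbedder above:
#   active = embedder.session.get_providers()
#   if gpu_fallback_occurred(providers, active):
#       fail the health check or log a warning, depending on your latency budget
```

Whether a silent CPU fallback should fail the deployment or only raise a warning depends on whether your latency targets can be met on CPU.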

### The LoRA + BnB Export Gotcha

If your training used LoRA adapters on a quantized (4-bit BitsAndBytes) base model, you will encounter a specific failure mode during ONNX export: merging the LoRA adapter while the base model is in 4-bit BnB format produces corrupt weights. The BnB quantization and the LoRA merge operation are not compatible at the bit level.

The correct procedure is to reload the LoRA adapter on a fresh, unquantized (fp16) copy of the base model before exporting:

```python
from peft import PeftModel
from transformers import AutoModel
import torch

def load_for_export(base_model_id: str, adapter_path: str) -> torch.nn.Module:
    """
    Load adapter on fp16 base model — NOT the 4-bit quantized version.
    Merging LoRA on a 4-bit BnB base corrupts weights.
    """
    # Fresh fp16 base — no quantization config
    base = AutoModel.from_pretrained(
        base_model_id,
        torch_dtype=torch.float16,
        device_map="cpu"
    )

    # Load adapter and merge into fp16 base
    model = PeftModel.from_pretrained(base, adapter_path)
    model = model.merge_and_unload()
    model.eval()
    return model
```

> **Warning**
>
> The corrupt-ONNX-from-4bit-BnB-merge failure mode is silent in many cases: the export completes without error, the model loads in ONNX Runtime without error, but inference outputs are nonsense. Always validate ONNX outputs against PyTorch outputs on a representative sample before deploying.

### Validating the Export

Before deploying any ONNX model to production, validate numerical equivalence with the PyTorch reference:

```python
import numpy as np
import torch

def validate_onnx_export(
    pytorch_model: torch.nn.Module,
    onnx_embedder: OnnxEmbedder,
    test_texts: list[str],
    tolerance: float = 1e-4
) -> bool:
    pytorch_model.eval()

    # Get PyTorch embeddings. Reuse the ONNX embedder's tokenizer and max_length
    # so both paths see identical token IDs.
    encoded = onnx_embedder.tokenizer(
        test_texts,
        max_length=onnx_embedder.max_length,
        padding=True,
        truncation=True,
        return_tensors="pt"
    )
    with torch.no_grad():
        pt_emb = pytorch_model(encoded["input_ids"], encoded["attention_mask"])
        # Cast to float32 on CPU so fp16/bf16 outputs convert to NumPy cleanly
        pt_emb = pt_emb.float().cpu().numpy()

    # Get ONNX embeddings
    onnx_emb = onnx_embedder.encode(test_texts)

    max_diff = np.abs(pt_emb - onnx_emb).max()
    mean_diff = np.abs(pt_emb - onnx_emb).mean()

    print(f"Max absolute difference: {max_diff:.6f}")
    print(f"Mean absolute difference: {mean_diff:.6f}")

    return max_diff < tolerance
```

Expect small numerical differences in the comparison: a model trained in BF16 and exported at a different precision will not match bit for bit, and PyTorch and ONNX Runtime use different kernels. A max absolute difference below 1e-4 is acceptable for production use; the cosine similarity between the PyTorch and ONNX embeddings should be above 0.9999 for corresponding inputs.
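
The validation function above checks only the absolute difference. A minimal sketch of the cosine check, written in plain Python so it accepts either nested lists or NumPy arrays; the function names and the `min_cosine` default are illustrative, taken from the 0.9999 figure in the text:

```python
import math


def rowwise_cosine(a, b) -> list[float]:
    """Cosine similarity between corresponding rows of two embedding matrices."""
    sims = []
    for u, v in zip(a, b):
        dot = sum(x * y for x, y in zip(u, v))
        norm_u = math.sqrt(sum(x * x for x in u))
        norm_v = math.sqrt(sum(y * y for y in v))
        # Treat a zero vector as a mismatch rather than dividing by zero
        sims.append(dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0)
    return sims


def embeddings_agree(a, b, min_cosine: float = 0.9999) -> bool:
    """True when every pair of corresponding rows meets the cosine floor."""
    sims = rowwise_cosine(a, b)
    return bool(sims) and min(sims) >= min_cosine
```

Checking the minimum rather than the mean is a deliberate choice here: a single badly mismatched input is exactly the kind of failure an average would hide.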

### Key Takeaways

- ONNX export reduces production memory from ~820MB to ~197MB (4× reduction) for a 40M parameter model with no loss in output quality.
- The TorchScript SentenceEmbedder wrapper (model + mean pooling + L2 normalize) is more robust for ONNX export than using `model.forward()` directly, avoiding `masking_utils` bugs.
- Use `ort.get_available_providers()` for CUDA detection in ONNX Runtime — not `torch.cuda.is_available()`.
- Merging LoRA on a 4-bit BnB base produces corrupt ONNX; always reload adapters on a fresh fp16 base before exporting.
- Validate ONNX outputs against PyTorch reference before deploying; the silent corruption failure mode makes validation mandatory, not optional.
- Dynamic axes in the ONNX export are required for variable batch size and sequence length in production.

### Practical Exercise

Export your trained model to ONNX using the SentenceEmbedder wrapper. Benchmark the inference time for a batch of 64 texts on your production hardware: compare PyTorch inference time vs. ONNX Runtime inference time. Also compare the RSS memory usage of a Python process running just the PyTorch model vs. just the ONNX Runtime session. Report both numbers. On most hardware, ONNX Runtime will be 20–40% faster on CPU and comparable on GPU, with consistently lower memory footprint.

---

## Chapter 7: Evaluation: Building Eval Sets, MRR@10, NDCG, Cosine Calibration {#chapter-7}

Evaluation is the most underinvested component in most custom embedding projects. Teams train models, observe that the loss went down, maybe check that a few hand-picked queries return better results, and call it done. This is how you ship models that look good in demos and degrade silently in production.

Good evaluation for retrieval embedding models requires three things: a representative eval set built from real queries, the right metrics, and an understanding of what the metric values mean in practice. Each of these takes work, and all three are necessary.

### Building the Eval Set

An eval set for retrieval consists of (query, relevant_documents) pairs, where relevance has been manually judged by people who actually understand your domain. The key phrase is "manually judged": automatically generated relevance labels are fine for training data, but using them for evaluation creates circular reasoning. You end up measuring how well the model learned to reproduce your generation heuristics, not how well it retrieves the right answers.

The minimum eval set size for meaningful variance estimation is 30 queries. With fewer than 30, the confidence intervals on your MRR@10 are so wide that differences between model variants may be noise. For stable variance that lets you detect improvements of 0.02–0.05 MRR, aim for 100+ queries.
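
To see how wide the interval actually is for your eval set, a percentile bootstrap over per-query reciprocal ranks is one simple option. The sketch below is an assumption about how you might measure it, not a step from the pipeline described so far; the function name and defaults (2,000 resamples, 95% interval) are illustrative:

```python
import random


def bootstrap_mrr_ci(
    reciprocal_ranks: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile-bootstrap confidence interval for the mean reciprocal rank."""
    if not reciprocal_ranks:
        raise ValueError("need at least one query")
    rng = random.Random(seed)
    n = len(reciprocal_ranks)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement and recompute the mean
        sample = [reciprocal_ranks[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper
```

If the intervals for two model variants overlap heavily, the eval set is too small to separate them, and adding queries is a better use of time than further tuning.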

The queries must come from real developer behavior, not from what you think developers will ask. The distribution of real queries is always surprising: people ask about error messages they saw, about the output of a function they're confused by, about architectural patterns they've heard described but haven't seen implemented. Synthetic query generation produces queries that look like documentation descriptions of functions — cleaner, more structured, and less representative than what developers actually type.

Sources for real queries:
- Log of existing search queries (if you have a search interface)
- GitHub issue titles and descriptions (filtered for ones that reference code)
- Code review comments asking "where does X happen?"
- Onboarding questions from new team members (these are especially valuable — they reveal what the codebase's concepts look like from the outside)

```python
import json
from pathlib import Path

def build_eval_set_from_search_logs(
    log_path: str,
    codebase_path: str,
    min_query_length: int = 20,
    max_queries: int = 200
) -> list[dict]:
    """
    Extract unique queries from search logs; prepare for manual relevance judging.
    """
    logs = json.loads(Path(log_path).read_text())

    # Deduplicate and filter short/trivial queries
    seen = set()
    queries = []
    for entry in logs:
        q = entry.get("query", "").strip()
        if len(q) >= min_query_length and q.lower() not in seen:
            seen.add(q.lower())
            queries.append({
                "query": q,
                "timestamp": entry.get("timestamp"),
                "relevant_documents": [],   # To be filled by human judges
                "judgment_notes": ""
            })
        if len(queries) >= max_queries:
            break

    return queries

def export_for_judging(eval_set: list[dict], output_path: str) -> None:
    """Export eval set as JSON for human annotation."""
    Path(output_path).write_text(
        json.dumps(eval_set, indent=2)
    )
    print(f"Exported {len(eval_set)} queries for judging: {output_path}")
```

For each query in your eval set, have at least two domain experts judge the top-10 results from your retrieval system. For each result, they should mark it as: relevant (3), partially relevant (2), or not relevant (0). This graded relevance supports both MRR (binary) and NDCG (graded) metrics.
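
With two judges and a graded scale, two small helpers are useful before computing metrics: one to surface disagreements for adjudication, and one to collapse grades into the binary relevant set that MRR needs. This is a sketch under assumptions: the helper names are hypothetical, and counting "partially relevant" (2) as relevant for MRR is a choice you should make deliberately for your own domain.

```python
def find_disagreements(judge_a: dict[str, int], judge_b: dict[str, int]) -> list[str]:
    """Doc IDs graded by both judges where the grades differ."""
    shared = judge_a.keys() & judge_b.keys()
    return sorted(doc_id for doc_id in shared if judge_a[doc_id] != judge_b[doc_id])


def binarize_grades(grades: dict[str, int], min_grade: int = 2) -> set[str]:
    """Collapse graded judgments into the binary relevant set used by MRR."""
    return {doc_id for doc_id, grade in grades.items() if grade >= min_grade}


# Illustrative grades for a single query (not real data)
judge_a = {"auth.py": 3, "users.py": 2, "utils.py": 0}
judge_b = {"auth.py": 3, "users.py": 0, "utils.py": 0}
to_adjudicate = find_disagreements(judge_a, judge_b)   # docs to discuss
relevant_for_mrr = binarize_grades(judge_a)            # after adjudication
```

The final graded dictionary can be passed unchanged to an NDCG computation, while the binarized set feeds MRR.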

### MRR@10: The Standard Metric

Mean Reciprocal Rank at 10 (MRR@10) answers a simple question: on average, how high in the top-10 results does the first relevant document appear? It's the standard metric for code search and retrieval systems where the user is looking for a specific answer.

$$\text{MRR@10} = \frac{1}{|Q|} \sum_{q \in Q} \frac{1}{\text{rank}_q}$$

Where $\text{rank}_q$ is the position of the first relevant document in the top-10 results for query $q$. If no relevant document appears in the top 10, the reciprocal rank is 0.

```python
def mean_reciprocal_rank(
    results: list[list[str]],      # For each query: list of retrieved doc IDs in order
    relevant: list[set[str]],      # For each query: set of relevant doc IDs
    k: int = 10
) -> float:
    rr_sum = 0.0
    for query_results, relevant_docs in zip(results, relevant):
        rr = 0.0
        for rank, doc_id in enumerate(query_results[:k], start=1):
            if doc_id in relevant_docs:
                rr = 1.0 / rank
                break
        rr_sum += rr
    # Guard against an empty eval set
    return rr_sum / len(results) if results else 0.0
```

MRR@10 interpretation for code search:
- **> 0.4**: Strong. The first relevant result appears in positions 1–2 on average. PyckLM achieves 0.456 on CodeSearchNet.
- **0.2–0.4**: Usable. The first relevant result typically appears in positions 2–5. Worth deploying with good UI affordances.
- **< 0.2**: Problematic. The first relevant result typically appears at position 5 or worse, or not in the top 10 at all. Training data mismatch is the most likely cause.

The 0.2 boundary is not arbitrary: below that threshold, users typically abandon search-based workflows in favor of manual navigation, because the effort to find the relevant result among the top-10 exceeds the effort of just browsing. Above 0.4, search workflows are genuinely preferred over browsing.

> **Warning**
>
> MRR@10 is not the same as "how often the top result is relevant." A model with MRR@10 = 0.4 might return the correct result first for 40% of queries and miss the top 10 entirely for the other 60%, or it might never return the correct result first and instead place it at rank 2 for 40% of queries and rank 3 for the rest. The 0.4 average obscures this distribution. Always report both MRR@10 and the distribution of first-relevant-rank (what fraction of queries had the relevant result at position 1, 2, 3, etc.) for a complete picture.
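
A minimal sketch of the first-relevant-rank distribution, taking the same `results` and `relevant` inputs as the MRR function above. The function name and the `"miss"` key for queries with no relevant document in the top k are illustrative choices:

```python
from collections import Counter


def first_relevant_rank_distribution(
    results: list[list[str]],
    relevant: list[set[str]],
    k: int = 10,
) -> dict:
    """Fraction of queries whose first relevant result lands at each rank."""
    if not results:
        return {}
    counts = Counter()
    for query_results, relevant_docs in zip(results, relevant):
        rank = next(
            (r for r, doc_id in enumerate(query_results[:k], start=1)
             if doc_id in relevant_docs),
            "miss",  # no relevant document in the top k
        )
        counts[rank] += 1
    total = len(results)
    ordered_keys = [*range(1, k + 1), "miss"]
    return {key: counts[key] / total for key in ordered_keys if counts[key]}


# Illustrative input: four queries that all share one relevant document
dist = first_relevant_rank_distribution(
    [["a", "b"], ["x", "a"], ["x", "y"], ["a"]],
    [{"a"}, {"a"}, {"a"}, {"a"}],
)
```

Reporting the `"miss"` fraction separately is usually the most actionable number: those are the queries where the user gets nothing useful at all.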

### NDCG@10: For Graded Relevance

Normalized Discounted Cumulative Gain at 10 (NDCG@10) handles graded relevance — when some retrieved documents are more relevant than others. If your eval set has "very relevant" (3), "somewhat relevant" (2), and "not relevant" (0) grades, NDCG@10 captures the value of returning highly relevant documents at the top of the list.

```python
import numpy as np

def ndcg_at_k(
    results: list[list[str]],
    grades: list[dict[str, int]],  # For each query: {doc_id: relevance_grade}
    k: int = 10
) -> float:
    def dcg(relevances: list[int]) -> float:
        return sum(
            rel / np.log2(rank + 1)
            for rank, rel in enumerate(relevances[:k], start=1)
        )

    ndcg_sum = 0.0
    for query_results, query_grades in zip(results, grades):
        # Actual DCG
        actual_relevances = [query_grades.get(doc, 0) for doc in query_results[:k]]
        actual_dcg = dcg(actual_relevances)

        # Ideal DCG: sort by relevance descending
        ideal_relevances = sorted(query_grades.values(), reverse=True)
        ideal_dcg = dcg(ideal_relevances)

        ndcg = actual_dcg / ideal_dcg if ideal_dcg > 0 else 0.0
        ndcg_sum += ndcg

    # Guard against an empty eval set
    return ndcg_sum / len(results) if results else 0.0
```

NDCG@10 is more informative than MRR@10 when your use case involves multiple degrees of relevance — for example, when you want to distinguish between "exactly the right function" and "a closely related function in the same subsystem." For pure code search where there's typically one correct answer, MRR@10 is sufficient. For knowledge base retrieval where partial relevance is common, NDCG@10 captures more signal.
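
As a worked example of the same computation, suppose a query's top three results carry grades 2, 3, 0 and those are the only judged documents (illustrative values, not from a real eval set). The ideal ordering is 3, 2, 0:

$$\text{DCG@3} = \frac{2}{\log_2 2} + \frac{3}{\log_2 3} + \frac{0}{\log_2 4} \approx 2 + 1.893 = 3.893$$

$$\text{IDCG@3} = \frac{3}{\log_2 2} + \frac{2}{\log_2 3} + \frac{0}{\log_2 4} \approx 3 + 1.262 = 4.262$$

$$\text{NDCG@3} = \frac{\text{DCG@3}}{\text{IDCG@3}} \approx \frac{3.893}{4.262} \approx 0.913$$

Swapping the top two results costs roughly 9% of the achievable score. Under a binary reading where both documents count as relevant, the reciprocal rank is 1.0 for either ordering, which is exactly the distinction NDCG captures and MRR does not.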

### Cosine Calibration

Cosine calibration is the diagnostic that tells you whether your model's similarity scores are semantically meaningful as thresholds. A well-calibrated model should produce a bimodal cosine similarity distribution: one peak around high similarity (0.8+) for genuinely relevant document pairs, and one peak around low similarity (0.1–0.3) for unrelated pairs. The gap between the peaks is your usable threshold range.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import roc_auc_score

def plot_cosine_calibration(
    embedder,
    relevant_pairs: list[tuple[str, str]],    # (query, relevant_doc) pairs
    irrelevant_pairs: list[tuple[str, str]],  # (query, irrelevant_doc) pairs
) -> float:
    """
    Plot cosine similarity distributions for relevant vs irrelevant pairs.
    Returns AUROC as a single-number calibration quality metric.
    """
    rel_queries, rel_docs = zip(*relevant_pairs)
    irr_queries, irr_docs = zip(*irrelevant_pairs)

    all_queries = list(rel_queries) + list(irr_queries)
    all_docs = list(rel_docs) + list(irr_docs)

    q_emb = embedder.encode(all_queries)
    d_emb = embedder.encode(all_docs)

    n_rel = len(relevant_pairs)
    rel_sims = (q_emb[:n_rel] * d_emb[:n_rel]).sum(axis=1)
    irr_sims = (q_emb[n_rel:] * d_emb[n_rel:]).sum(axis=1)

    labels = np.concatenate([np.ones(n_rel), np.zeros(len(irrelevant_pairs))])
    sims = np.concatenate([rel_sims, irr_sims])
    auroc = roc_auc_score(labels, sims)

    plt.figure(figsize=(10, 4))
    plt.hist(rel_sims, bins=50, alpha=0.6, label=f"Relevant (n={n_rel})", color="green")
    plt.hist(irr_sims, bins=50, alpha=0.6, label=f"Irrelevant (n={len(irrelevant_pairs)})", color="red")
    plt.xlabel("Cosine Similarity")
    plt.ylabel("Count")
    plt.title(f"Cosine Similarity Distribution (AUROC: {auroc:.3f})")
    plt.legend()
    plt.savefig("cosine_calibration.png", dpi=150, bbox_inches="tight")

    return auroc
```

If the distributions overlap substantially (both peaks are in the 0.4–0.7 range with no clear separation), your model's cosine scores are not threshold-stable: any threshold you choose will either include many irrelevant results or exclude many relevant ones. This is a training problem, not a threshold problem, and adding a threshold won't fix it.

> **Key Insight**
>
> A model with good MRR@10 but poor cosine calibration is still useful for retrieval (ranking works fine) but cannot support similarity-threshold filtering. If your downstream application needs to say "only return results with similarity > 0.7", calibration matters as much as ranking quality.
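
To make the threshold trade-off concrete, you can compute precision and recall at a candidate cutoff directly from the two similarity lists used in the calibration plot. A minimal sketch; the function name is hypothetical and the similarity values are made up for illustration:

```python
def precision_recall_at_threshold(
    rel_sims: list[float],
    irr_sims: list[float],
    threshold: float,
) -> tuple[float, float]:
    """Precision and recall if only pairs with similarity >= threshold are kept."""
    true_pos = sum(1 for s in rel_sims if s >= threshold)
    false_pos = sum(1 for s in irr_sims if s >= threshold)
    kept = true_pos + false_pos
    precision = true_pos / kept if kept else 0.0
    recall = true_pos / len(rel_sims) if rel_sims else 0.0
    return precision, recall


# Illustrative similarities (not measured from any model)
precision, recall = precision_recall_at_threshold(
    rel_sims=[0.9, 0.8, 0.6],
    irr_sims=[0.7, 0.2, 0.1],
    threshold=0.7,
)
```

Sweeping the threshold across the range between the two peaks shows how much room you have: with well-separated distributions, precision and recall both stay high over a wide band; with overlapping ones, every cutoff trades one against the other.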

### Key Takeaways

- Eval sets must use manually judged real queries (minimum 30, ideally 100+); synthetic or automatically generated relevance labels create circular evaluation.
- MRR@10 > 0.4 is strong for code search; 0.2–0.4 is usable; < 0.2 most likely indicates a training data mismatch.
- NDCG@10 is appropriate when partial relevance is common; MRR@10 is sufficient for pure code search with single correct answers.
- Cosine calibration (bimodal separation of relevant vs irrelevant distributions) is required for any system that uses similarity thresholds.
- Always report MRR@10 alongside the distribution of first-relevant-rank positions, not just the average.
- Checkpoint and evaluate on MRR@10 during training — training loss alone does not predict retrieval performance.

### Practical Exercise

Take your 30-query eval set from Chapter 1 and compute MRR@10 for both the general model and your custom model. Then plot the cosine similarity distributions for the relevant and irrelevant pairs from each model. Compute the AUROC for each. A well-trained custom model should show AUROC > 0.90 and a clear bimodal separation in the cosine distribution. If the custom model's AUROC is lower than the general model's, you have a training data problem to diagnose before shipping.

---

## Chapter 8: Monitoring in Production: Drift Detection and Quality Signals {#chapter-8}

A deployed embedding model is not a static artifact. The codebase it's serving will change. The queries developers issue will change. The distribution of content in the index will shift as new modules are added and old ones are removed. Without active monitoring, you will not know when your model's performance has degraded — you'll find out when enough developers stop using the search tool that someone files a bug report.

This chapter is about building the monitoring infrastructure that gives you early warning before that happens.

### The Three Failure Modes

Embedding-based retrieval systems degrade in three distinct ways, each requiring a different monitoring signal:

**Index drift**: the codebase changes in ways that move it out of the distribution the model was trained on. New frameworks are adopted, new naming conventions emerge, a significant portion of the codebase is refactored. The model's embedding space was calibrated for the old distribution. The new code still gets embedded without error (the model still runs), but the similarity structure is wrong: things that should be near each other in embedding space aren't, because the model was trained on examples that no longer represent the codebase.

**Model version mixing**: two documents that should be comparable are embedded by different versions of the model and have embeddings in incompatible spaces. This happens when you update the model and incrementally re-index rather than fully re-indexing. The cosine similarity between old and new embeddings is essentially meaningless — the model has changed its geometric interpretation — but the retrieval system doesn't know this and will happily return results mixing embeddings from both spaces.

**Query distribution shift**: the vocabulary of developer queries changes as new teams join, new projects are added, or the use case evolves. A model trained on queries about Python microservices may degrade when the team starts working on a Rust systems component — not because the model is broken, but because its training distribution doesn't cover the new query vocabulary.

### Embedding Drift Detection

The primary signal for index drift is the average cosine similarity between newly indexed documents and the existing index. In a stable distribution, new code is similar to existing code — the average cosine similarity of a new document's top-k nearest neighbors reflects how well-distributed the embedding space is around that document. When this average drops significantly, new code is landing in sparse regions of the embedding space.

```python
import numpy as np
from typing import Optional
import chromadb
from datetime import datetime, timedelta

class EmbeddingDriftMonitor:
    def __init__(
        self,
        collection: chromadb.Collection,
        baseline_window_days: int = 30,
        drift_threshold: float = 0.05
    ):
        self.collection = collection
        self.baseline_window_days = baseline_window_days
        self.drift_threshold = drift_threshold
        self._baseline: Optional[float] = None

    def compute_neighborhood_similarity(
        self,
        embedding: np.ndarray,
        k: int = 10
    ) -> float:
        """
        Average cosine similarity of an embedding to its k nearest neighbors.
        Low values indicate the embedding is in a sparse/unfamiliar region.
        """
        results = self.collection.query(
            query_embeddings=[embedding.tolist()],
            n_results=k + 1,   # +1 because the embedding itself may be in the index
            include=["distances"]
        )
        # Assumes the collection uses Chroma's default "l2" space, which reports
        # SQUARED L2 distance. For a collection created with cosine space the
        # distance is already 1 - cos_sim, so use [1 - d for d in distances] instead.
        distances = results["distances"][0]
        # Filter out self-match (distance ≈ 0)
        distances = [d for d in distances if d > 1e-6][:k]
        # For unit vectors: ||a - b||² = 2 - 2·cos_sim, so cos_sim = 1 - d_squared / 2
        cos_sims = [1 - d / 2 for d in distances]
        return float(np.mean(cos_sims))

    def update_baseline(self, embeddings: list[np.ndarray]) -> None:
        """Compute baseline average neighborhood similarity on stable index."""
        sims = [self.compute_neighborhood_similarity(e) for e in embeddings]
        self._baseline = float(np.mean(sims))
        print(f"Drift baseline set: {self._baseline:.4f}")

    def check_drift(self, new_embeddings: list[np.ndarray]) -> dict:
        if self._baseline is None:
            raise RuntimeError("Call update_baseline() before check_drift()")

        current_sims = [self.compute_neighborhood_similarity(e) for e in new_embeddings]
        current_mean = float(np.mean(current_sims))
        drift = self._baseline - current_mean

        return {
            "baseline": self._baseline,
            "current": current_mean,
            "drift": drift,
            "alert": drift > self.drift_threshold,
            "timestamp": datetime.utcnow().isoformat(),
        }
```

Log this drift metric daily. A drift value above 0.05 (an absolute drop of 0.05 in average neighborhood similarity, not a 5% relative drop) is a soft alert that warrants investigation. A drift above 0.10 is a hard alert: the model's embedding space is significantly misaligned with the current codebase distribution, and retraining should be considered.
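
The two thresholds map naturally onto a small severity function for the daily log. A minimal sketch, assuming the drift value is the baseline-minus-current difference described above; the function name and the string labels are illustrative:

```python
def classify_drift(drift: float, soft: float = 0.05, hard: float = 0.10) -> str:
    """Map an absolute drop in neighborhood similarity to an alert level."""
    if drift > hard:
        return "hard"   # consider retraining
    if drift > soft:
        return "soft"   # investigate
    return "ok"


# Illustrative daily values (not measured)
levels = [classify_drift(d) for d in (0.01, 0.07, 0.12, -0.02)]
```

A negative drift (new documents landing in denser regions than the baseline) is treated as healthy here; if that happens persistently it may simply mean the baseline should be refreshed.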

> **Warning**
>
> Never mix embeddings from different model versions in the same index. If you update the model, re-index everything from scratch. A mixed-version index produces cosine similarity values between embeddings in incompatible spaces. The retrieval system will appear to work (it returns results) but the similarity ordering will be partially or fully meaningless.
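
One way to enforce this is to store the model version alongside every embedding and check it before serving. The sketch below assumes each indexed document carries a `model_version` metadata field; that field name and the helper are hypothetical, not something the index provides on its own:

```python
def assert_single_model_version(
    metadatas: list[dict],
    expected_version: str,
    key: str = "model_version",
) -> None:
    """Raise if any indexed document was embedded by a different model version."""
    found = {meta.get(key) for meta in metadatas}
    unexpected = found - {expected_version}
    if unexpected:
        # None in the set means some documents have no version tag at all
        raise ValueError(
            f"Index mixes model versions: expected {expected_version!r}, "
            f"also found {sorted(map(str, unexpected))}"
        )


# Hypothetical usage: pull metadata from the vector store, then
#   assert_single_model_version(all_metadatas, expected_version="v2")
```

Running this check as part of the deployment health check turns a silent ranking failure into a loud startup failure, which is the better trade in almost every case.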

### Search Miss Rate

The search miss rate is the fraction of queries that return no results above your relevance threshold. A rising miss rate is a direct signal of quality degradation — the model is no longer placing relevant documents in the high-similarity region for incoming queries.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime
import statistics

@dataclass
class SearchEvent:
    query: str
    num_results_above_threshold: int
    top_similarity: float
    timestamp: datetime = field(default_factory=datetime.utcnow)

class SearchMissRateMonitor:
    def __init__(
        self,
        threshold: float = 0.65,
        window_size: int = 1000,
        alert_miss_rate: float = 0.15
    ):
        self.threshold = threshold
        self.window_size = window_size
        self.alert_miss_rate = alert_miss_rate
        self.events: deque[SearchEvent] = deque(maxlen=window_size)

    def record(self, query: str, similarities: list[float]) -> None:
        above = sum(1 for s in similarities if s >= self.threshold)
        top_sim = max(similarities) if similarities else 0.0
        self.events.append(SearchEvent(
            query=query,
            num_results_above_threshold=above,
            top_similarity=top_sim
        ))

    def current_metrics(self) -> dict:
        if not self.events:
            return {}

        misses = sum(1 for e in self.events if e.num_results_above_threshold == 0)
        miss_rate = misses / len(self.events)
        top_sims = [e.top_similarity for e in self.events]

        return {
            "miss_rate": miss_rate,
            "mean_top_similarity": statistics.mean(top_sims),
            "p50_top_similarity": statistics.median(top_sims),
            "alert": miss_rate > self.alert_miss_rate,
            "window_size": len(self.events),
        }
```

Track miss rate over rolling 1,000-query windows. A baseline miss rate below 10% is typical for a well-functioning system. A rate above 15% in a stable window warrants investigation; above 25% is a system-level alert.

> **Try This**
>
> Set up a weekly report that shows: (1) search miss rate over the last 7 days vs the prior 7 days, (2) average cosine similarity of top-1 results vs prior week, (3) number of new files indexed vs total index size. These three numbers tell you whether the model is keeping up with codebase changes, whether retrieval quality is stable, and whether the index is growing at the rate you'd expect. Send this report to whoever is responsible for the retrieval system.
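
A minimal sketch of that weekly comparison, assuming you already aggregate each week's numbers into a plain dictionary. The key names (`miss_rate`, `mean_top1_similarity`, `new_files`, `total_files`) and the function name are hypothetical:

```python
def weekly_quality_report(current: dict, previous: dict) -> dict:
    """Compare this week's monitoring numbers against the prior week's."""
    total = current["total_files"]
    return {
        "miss_rate": current["miss_rate"],
        "miss_rate_change": current["miss_rate"] - previous["miss_rate"],
        "mean_top1_similarity": current["mean_top1_similarity"],
        "top1_similarity_change": (
            current["mean_top1_similarity"] - previous["mean_top1_similarity"]
        ),
        # Share of the index that was added this week
        "new_file_fraction": current["new_files"] / total if total else 0.0,
    }


# Illustrative numbers (not measured)
report = weekly_quality_report(
    current={"miss_rate": 0.25, "mean_top1_similarity": 0.5,
             "new_files": 50, "total_files": 1000},
    previous={"miss_rate": 0.125, "mean_top1_similarity": 0.75,
              "new_files": 40, "total_files": 950},
)
```

In this made-up example the miss rate doubled while top-1 similarity fell, with only 5% of the index new: a pattern that points at query distribution shift more than index drift.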

### Held-Out Eval Set Monitoring

The most reliable long-term quality signal is periodic evaluation on your held-out eval set. Run MRR@10 on your 100-query eval set weekly. A drop of more than 0.05 from your training-time baseline is the trigger for retraining consideration. A drop of more than 0.10 is the trigger for immediate retraining action.

```python
import json
from pathlib import Path
import time

def run_periodic_eval(
    embedder,
    eval_set_path: str,
    retrieval_fn,
    baseline_mrr: float,
    output_log_path: str
) -> dict:
    eval_set = json.loads(Path(eval_set_path).read_text())

    results_list = []
    relevant_list = []

    for item in eval_set:
        query_emb = embedder.encode([item["query"]])[0]
        retrieved_ids = retrieval_fn(query_emb, k=10)
        results_list.append(retrieved_ids)
        # Same key that build_eval_set_from_search_logs writes for the judges to fill in
        relevant_list.append(set(item["relevant_documents"]))

    current_mrr = mean_reciprocal_rank(results_list, relevant_list)
    delta = current_mrr - baseline_mrr

    result = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "mrr_at_10": current_mrr,
        "baseline_mrr": baseline_mrr,
        "delta": delta,
        "soft_alert": delta < -0.05,
        "hard_alert": delta < -0.10,
        "num_queries": len(eval_set),
    }

    # Append to log
    log_path = Path(output_log_path)
    existing = json.loads(log_path.read_text()) if log_path.exists() else []
    existing.append(result)
    log_path.write_text(json.dumps(existing, indent=2))

    return result
```

### Key Takeaways

- Embedding drift (average neighborhood similarity dropping for new documents) is the primary early signal of index-model misalignment; the soft-alert threshold is an absolute drop of 0.05 from baseline.
- Never mix embeddings from different model versions in the same index — the cosine similarity between embeddings in incompatible spaces is meaningless.
- Search miss rate (fraction of queries returning no results above threshold) is the user-facing quality signal; monitor over rolling 1,000-query windows.
- Periodic MRR@10 on a held-out eval set is the most reliable quality signal; a drop of 0.05 triggers retraining consideration, 0.10 triggers immediate action.
- Retraining triggers are data-driven thresholds, not calendar schedules; retrain when quality degrades, not because six months have passed.
- Log all three signals (drift, miss rate, eval MRR) with timestamps to build the quality history that informs future retraining decisions.

### Practical Exercise

Build the monitoring stack: a drift monitor that logs daily average neighborhood similarity for newly indexed documents, a search miss rate monitor that tracks rolling 1,000-query windows, and a weekly eval job that runs MRR@10 against your held-out set. Deploy all three. After two weeks, review the logs. If miss rate is consistently below 10% and MRR@10 is within 0.03 of your training baseline, your model is stable. If either signal shows degradation, use the logs to pinpoint when the degradation began, which will point you toward the cause (a codebase migration, a new team, a new framework).

---

## Chapter 9: The Data Flywheel: Using Production Queries to Improve Training {#chapter-9}

The most sustainable custom embedding model is one that gets better as it runs in production. Every query a developer issues, every result they click, every search they abandon is a signal about the gap between the model's current embedding space and the semantic relationships that matter to your users. Capturing and using these signals closes the loop between deployment and training.

This is the data flywheel: production usage generates training data, which improves the model, which improves production usage, which generates better training data. The wheel spins faster as the model gets better, because better models return more relevant results, which developers interact with, which generates stronger training signal.

Getting this flywheel started requires three components: implicit feedback collection (what users interact with), signal extraction (turning interactions into training triplets), and a training pipeline that can incorporate new triplets without catastrophic forgetting of the old ones.

### Implicit Feedback Collection

Explicit relevance feedback (thumbs up/down buttons, rating widgets) has low adoption in developer tooling. Developers are task-focused — they want to find the code, not evaluate the search tool. Implicit signals are more scalable:

**Click-through**: when a developer clicks a search result, the clicked document is a positive signal for that query. This is weak positive signal — clicking doesn't mean the document was the right answer — but it's signal.

**Dwell time**: if a developer opens a file from a search result and spends more than 30 seconds in it, that's stronger positive signal than a brief glance.

**Search abandonment after engagement**: if a developer issues a query, clicks no results, and then navigates directly to a file (via file tree or another search), the documents they navigated to are likely the correct answers for that query.

**Copy events**: if a developer copies text from a search result, it's a strong positive signal — they found what they needed.

```python
import uuid
import time
import json
from pathlib import Path
from dataclasses import dataclass, field, asdict
from typing import Optional

@dataclass
class SearchInteraction:
    session_id: str
    query: str
    timestamp: float
    results: list[str]                   # Ordered list of result document IDs
    clicked: list[str] = field(default_factory=list)
    dwell_times: dict[str, float] = field(default_factory=dict)  # doc_id → seconds
    copied_from: list[str] = field(default_factory=list)

class FeedbackCollector:
    def __init__(self, log_path: str):
        self.log_path = Path(log_path)
        self.interactions: dict[str, SearchInteraction] = {}

    def start_search(self, query: str, results: list[str]) -> str:
        session_id = str(uuid.uuid4())
        self.interactions[session_id] = SearchInteraction(
            session_id=session_id,
            query=query,
            timestamp=time.time(),
            results=results
        )
        return session_id

    def record_click(self, session_id: str, doc_id: str, dwell_seconds: float) -> None:
        # dwell_seconds is the time spent in the opened document, not a timestamp
        if session_id in self.interactions:
            self.interactions[session_id].clicked.append(doc_id)
            self.interactions[session_id].dwell_times[doc_id] = dwell_seconds

    def record_copy(self, session_id: str, doc_id: str) -> None:
        if session_id in self.interactions:
            self.interactions[session_id].copied_from.append(doc_id)

    def close_session(self, session_id: str) -> Optional[dict]:
        if session_id not in self.interactions:
            return None
        interaction = self.interactions.pop(session_id)
        record = asdict(interaction)

        # Append to JSONL log
        with open(self.log_path, "a") as f:
            f.write(json.dumps(record) + "\n")

        return record
```
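
The collector above records clicks, dwell, and copies, but the abandonment signal needs a second event stream: files opened outside the search UI. One way the attribution might work is sketched below. The `FileOpenEvent` type, the 120-second window, and the choice of the first file opened are all assumptions for illustration, not part of the collector above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FileOpenEvent:
    """Hypothetical editor event: a file opened outside the search UI."""
    doc_id: str
    timestamp: float  # Epoch seconds, same clock as the interaction timestamp


def attribute_abandonment(
    interaction: dict,
    file_opens: list[FileOpenEvent],
    window_seconds: float = 120.0,  # Assumed window; tune against your own logs
) -> Optional[str]:
    """
    Return the doc_id most plausibly found after a zero-click search,
    or None if the search had clicks or nothing was opened in the window.
    """
    if interaction.get("clicked"):
        return None  # Not an abandonment: the user engaged with a result

    start = interaction["timestamp"]
    candidates = [
        e for e in file_opens
        if start < e.timestamp <= start + window_seconds
    ]
    if not candidates:
        return None

    # Assumption: the first file opened after the query is the likely answer.
    return min(candidates, key=lambda e: e.timestamp).doc_id
```

Pairs produced this way are noisier than copy or dwell signals, so it is probably safer to tag them separately and review a sample before training on them.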

### Signal Extraction: From Interactions to Triplets

Raw interaction logs need to be converted to training triplets. The conversion relies on a signal quality hierarchy: strong signals (copies, long dwell) become training triplets, while weak signals (brief clicks) are noisier and may serve better as validation data.

```python
import json
from pathlib import Path

def extract_triplets_from_interactions(
    log_path: str,
    min_dwell_seconds: float = 30.0,
    require_copy_or_dwell: bool = True
) -> list[dict]:
    """
    Convert interaction logs to training triplets.

    Quality hierarchy:
    - Copied doc: strong positive
    - Dwell > 30s: strong positive
    - Clicked but abandoned quickly: weak positive (skip or use carefully)
    - Not clicked despite appearing in results: hard negative candidate
    """
    triplets = []

    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        interaction = json.loads(line)

        query = interaction["query"]
        results = interaction["results"]  # Ordered by rank

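        # Strong positives: copied or high dwell
        strong_positives = set(interaction.get("copied_from", []))
        for doc_id, dwell in interaction.get("dwell_times", {}).items():
            if dwell >= min_dwell_seconds:
                strong_positives.add(doc_id)

        clicked = set(interaction.get("clicked", []))

        # Weak positives (clicked, but no copy or long dwell) are used only
        # when the caller explicitly opts out of the strong-signal requirement.
        if strong_positives:
            positives, quality = strong_positives, "strong"
        elif not require_copy_or_dwell and clicked:
            positives, quality = clicked, "weak"
        else:
            continue

        # Hard negatives: appeared in top-5 but were never clicked or copied.
        # Excluding clicked docs keeps quick-glance clicks out of the negatives.
        top_5 = results[:5]
        hard_neg_candidates = [
            d for d in top_5 if d not in positives and d not in clicked
        ]

        if not hard_neg_candidates:
            continue

        hard_neg = hard_neg_candidates[0]  # Hardest: highest-ranked non-positive
        for positive in sorted(positives):
            triplets.append({
                "anchor": query,
                "positive": positive,
                "hard_negative": hard_neg,
                "signal_quality": quality
            })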

    return triplets
```

The hardest negative from production data is, by definition, very hard: it's a document the model ranked highly but the user found unhelpful. These are exactly the discrimination failures you want to train against. Production-sourced hard negatives are often more valuable than synthetically generated ones, because they reflect actual model failures rather than engineered difficulty.

> **Key Insight**
>
> Production hard negatives are the model's actual mistakes — documents it placed too close to relevant documents in embedding space. Training on these directly repairs the specific failure modes the model exhibits in the wild, rather than the failure modes you anticipated during initial data construction.
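
The triplets extracted from interaction logs hold document IDs, while a training loss needs the document text. A minimal sketch of that resolution step follows. The `load_text` callable is a hypothetical lookup (an index query, a file read, a cache) that you would supply, and dropping unresolved triplets is one reasonable policy rather than the only one.

```python
from typing import Callable, Optional


def resolve_triplet_texts(
    triplets: list[dict],
    load_text: Callable[[str], Optional[str]],
) -> list[dict]:
    """
    Replace document IDs in each triplet with document text.

    `load_text` is a hypothetical lookup that returns None (or an empty
    string) when a document no longer exists.
    """
    resolved = []
    for t in triplets:
        positive = load_text(t["positive"])
        negative = load_text(t["hard_negative"])
        # Drop triplets that reference deleted or moved documents rather
        # than training on empty strings.
        if not positive or not negative:
            continue
        resolved.append({**t, "positive": positive, "hard_negative": negative})
    return resolved
```

Because code moves between the time of the search and the time of training, it may be worth logging how many triplets are dropped here. A high drop rate suggests resolving text closer to the time of the interaction.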

### Incremental Fine-Tuning Without Catastrophic Forgetting

Once you have a batch of production triplets, the next question is how to incorporate them into training. You have two options: retrain from scratch on the combined dataset, or fine-tune the deployed model on the new triplets.

Retraining from scratch is safer — you get a clean model with consistent performance across the full distribution — but it requires the full dataset and full compute. For a production system where new triplets accumulate continuously, retraining from scratch monthly or quarterly is the practical cadence.

Fine-tuning on incremental batches is faster and cheaper but risks catastrophic forgetting: updating the model aggressively on new triplets causes it to "forget" the relationships it learned during full training, degrading performance on the parts of the distribution not represented in the incremental batch.

The mitigations for catastrophic forgetting in incremental fine-tuning are:

1. **Low learning rate**: use 1e-5 or lower (10× below scratch training) to ensure gradients reshape only the boundaries of the embedding space, not the core geometry
2. **Replay buffer**: include a random sample of original training triplets (5–10% of the incremental batch size) to anchor the model to its original distribution
3. **Elastic Weight Consolidation (EWC)**: penalize parameter changes proportional to their importance to the original task. This is more complex to implement but theoretically sound

```python
import torch
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader
import random

def incremental_finetune(
    model_path: str,
    new_triplets: list[dict],
    replay_triplets: list[dict],  # Sample from original training set
    output_path: str,
    learning_rate: float = 1e-5,
    epochs: int = 1,
    batch_size: int = 64
) -> None:
    model = SentenceTransformer(model_path)

    # Combine new triplets with replay sample (prevents forgetting)
    n_replay = max(len(new_triplets) // 10, 100)  # 10% of the new batch, floor of 100
    replay_sample = random.sample(replay_triplets, min(n_replay, len(replay_triplets)))
    all_triplets = new_triplets + replay_sample
    random.shuffle(all_triplets)

    # Each field must hold text at this point. Triplets extracted from
    # interaction logs carry document IDs and need resolving to content first.
    train_examples = [
        InputExample(
            texts=[t["anchor"], t["positive"], t["hard_negative"]]
        )
        for t in all_triplets
    ]

    train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=batch_size)
    train_loss = losses.TripletLoss(model)

    model.fit(
        train_objectives=[(train_dataloader, train_loss)],
        epochs=epochs,
        warmup_steps=50,
        optimizer_params={"lr": learning_rate},
        output_path=output_path,
        show_progress_bar=True,
    )
```

> **Warning**
>
> Without a replay buffer, a model fine-tuned on incremental batches is likely to forget much of the original training distribution, especially when the batches are small (<1,000 triplets) relative to the original training set. A model that degrades across the board after an incremental update is not useful: the update must either improve or hold steady on the full eval set, not just on the new query categories.
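
Of the three mitigations listed above, EWC is the only one the fine-tuning function does not implement. The sketch below shows just the penalty term, written over plain Python floats so the arithmetic is visible. In a real training loop these would be tensors and the penalty would be added to the triplet loss. The function name, the dict-of-lists layout, and the default `ewc_lambda` are assumptions for illustration.

```python
def ewc_penalty(
    params: dict[str, list[float]],
    anchor_params: dict[str, list[float]],
    fisher: dict[str, list[float]],
    ewc_lambda: float = 1.0,  # Assumed strength; needs tuning per model
) -> float:
    """
    Quadratic EWC penalty: (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    params:        current weights during incremental fine-tuning
    anchor_params: weights of the deployed model before the update (theta*)
    fisher:        per-weight importance, usually a diagonal Fisher estimate
    """
    total = 0.0
    for name, values in params.items():
        for theta, theta_star, importance in zip(
            values, anchor_params[name], fisher[name]
        ):
            # Weights that mattered for the original task (large importance)
            # are expensive to move; unimportant weights move freely.
            total += importance * (theta - theta_star) ** 2
    return 0.5 * ewc_lambda * total
```

The diagonal Fisher estimate is typically the mean squared gradient of the loss over a sample of the original training data, computed once before incremental training starts. That extra pass is most of the implementation complexity.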

### The Flywheel Cadence

The practical cadence for the data flywheel in a production code search system:

**Weekly**: review search miss rate and interaction logs. Flag queries with zero clicks for manual review: each one is either a retrieval failure (the right document exists but the search did not surface it) or a coverage gap (the answer is not in the indexed codebase at all).

**Monthly**: extract triplets from the past month's interaction logs. Evaluate the quality of the extracted triplets manually (sample 50). If quality is high, fine-tune the model with replay buffer. Run full eval set before and after — only ship the updated model if MRR@10 holds or improves.

**Quarterly**: retrain from scratch on the combined dataset (original training data + accumulated production triplets). This is the clean-slate reset that prevents incremental fine-tuning drift from accumulating.

```python
import json
from pathlib import Path
from datetime import datetime, timezone

def build_flywheel_report(
    interaction_log_path: str,
    eval_results_path: str,  # Not read in this summary; kept for callers that pass it
    output_path: str,
    lookback_days: int = 30
) -> None:
    """Summarize flywheel health for the past N days."""
    interactions = [
        json.loads(line)
        for line in Path(interaction_log_path).read_text().splitlines()
        if line.strip()
    ]

    # Interaction timestamps come from time.time() (epoch seconds), so the
    # cutoff must come from a timezone-aware "now". A naive
    # datetime.utcnow().timestamp() is shifted by the local UTC offset.
    now = datetime.now(timezone.utc)
    cutoff = now.timestamp() - lookback_days * 86400
    recent = [i for i in interactions if i["timestamp"] > cutoff]

    total_queries = len(recent)
    zero_click = sum(1 for i in recent if not i.get("clicked"))
    strong_signal = sum(
        1 for i in recent
        if i.get("copied_from") or
        any(d >= 30 for d in i.get("dwell_times", {}).values())
    )

    report = {
        "period_days": lookback_days,
        "total_queries": total_queries,
        "zero_click_rate": zero_click / total_queries if total_queries > 0 else 0,
        "strong_signal_rate": strong_signal / total_queries if total_queries > 0 else 0,
        "estimated_new_triplets": strong_signal,
        "generated_at": now.isoformat(),
    }

    Path(output_path).write_text(json.dumps(report, indent=2))
    print(f"Flywheel report: {total_queries} queries, "
          f"{strong_signal} strong-signal interactions, "
          f"zero-click rate {zero_click/max(total_queries,1):.1%}")
```
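
The monthly step ships an updated model only if MRR@10 holds or improves on the full eval set. A minimal sketch of that gate is shown here. The function names, the input layout (query to ranked document IDs, query to relevant IDs), and the `tolerance` parameter are assumptions for illustration.

```python
def mrr_at_k(
    ranked_results: dict[str, list[str]],
    relevant: dict[str, set[str]],
    k: int = 10,
) -> float:
    """Mean reciprocal rank of the first relevant doc within the top k."""
    if not ranked_results:
        return 0.0
    total = 0.0
    for query, ranking in ranked_results.items():
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in relevant.get(query, set()):
                total += 1.0 / rank
                break  # Only the first relevant hit counts toward MRR
    return total / len(ranked_results)


def should_ship(
    baseline_mrr: float,
    candidate_mrr: float,
    tolerance: float = 0.0,  # Assumed: allow no regression by default
) -> bool:
    """Ship only if the candidate holds or improves on the baseline."""
    return candidate_mrr >= baseline_mrr - tolerance
```

Running both models against the same eval queries and comparing the two scores keeps the decision mechanical. A small non-zero `tolerance` might be appropriate if your eval set is small enough that run-to-run noise matters.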

### When the Flywheel Doesn't Spin

Two common failure modes prevent the flywheel from generating useful signal:

**Low search adoption**: if developers don't use the search tool, there's no interaction data. The most common cause is a first-version model that's not good enough to be preferred over manual navigation. The solution is investment in the initial training quality (Chapter 2) rather than relying on the flywheel to improve a poor model. The flywheel accelerates a good model; it cannot rescue a bad one.

**Biased feedback distribution**: if certain query types are never used in the search tool (developers have learned that the tool doesn't handle them well), the feedback distribution will be biased toward the query types the model is already good at. The model improves in its strong areas and stagnates in its weak ones. The mitigation is actively soliciting feedback on known-weak query categories, or using the monitoring signals from Chapter 8 to identify and manually annotate the failing query categories.
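
One way to make this bias visible is to compare each query category's share of logged feedback with the share you would expect, for example from the category mix of your eval set. The sketch below assumes you already have a category label per logged query. How those labels are produced, the `expected_share` values, and the `min_ratio` cutoff are all hypothetical.

```python
from collections import Counter


def underrepresented_categories(
    feedback_categories: list[str],
    expected_share: dict[str, float],
    min_ratio: float = 0.5,  # Assumed cutoff: flag below half the expected share
) -> dict[str, float]:
    """
    Return {category: observed_share / expected_share} for every category
    whose share of logged feedback falls below `min_ratio` of expectation.
    """
    counts = Counter(feedback_categories)
    total = sum(counts.values())
    flagged = {}
    for category, expected in expected_share.items():
        if expected <= 0:
            continue  # No expectation to compare against
        observed = counts.get(category, 0) / total if total else 0.0
        ratio = observed / expected
        if ratio < min_ratio:
            flagged[category] = ratio
    return flagged
```

Categories that show up here are candidates for manual annotation or targeted feedback requests, since the flywheel alone will not generate triplets for them.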

### Key Takeaways

- Implicit feedback signals (clicks, dwell time, copy events) produce more scalable training signal than explicit ratings in developer tooling.
- Production hard negatives — documents the model ranked highly but users found unhelpful — are the most valuable training signal because they directly represent model failure modes.
- Incremental fine-tuning requires a low learning rate (1e-5 or lower) and a replay buffer of original training triplets (roughly 5–10% of the incremental batch size) to prevent catastrophic forgetting.
- The practical cadence: weekly monitoring, monthly incremental fine-tuning, quarterly full retraining.
- The flywheel accelerates good models; low search adoption typically means the initial model is too poor to generate useful signal.
- Biased feedback distributions reinforce existing strengths and ignore existing weaknesses — actively solicit feedback on known-weak query categories.

### Practical Exercise

Build the interaction logging infrastructure described in this chapter. After two weeks of production operation, extract the interaction logs and compute: (1) zero-click rate, (2) number of strong-signal interactions (dwell > 30s or copy), (3) the top-10 most common queries that received zero clicks. Inspect those zero-click queries manually. They are the direct specification for your next round of training data generation.
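
For item (3), a short sketch that reads the JSONL interaction log and counts zero-click queries might look like this. The function name and the lowercase/strip normalization are assumptions. The field names match the log records written by the collector in this chapter.

```python
import json
from collections import Counter
from pathlib import Path


def top_zero_click_queries(log_path: str, n: int = 10) -> list[tuple[str, int]]:
    """Return the n most frequent queries that ended with no clicked result."""
    counts = Counter()
    for line in Path(log_path).read_text().splitlines():
        if not line.strip():
            continue
        interaction = json.loads(line)
        if not interaction.get("clicked"):
            # Light normalization so trivially different spellings group together
            counts[interaction["query"].strip().lower()] += 1
    return counts.most_common(n)
```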

---

## Conclusion {#conclusion}

The arc of this book traces a specific claim: that the gap between what a general embedding model knows and what your codebase means is a tractable engineering problem, not a fundamental limitation. The gap exists, it matters, and it can be closed with the right combination of training data, loss function, architecture, and monitoring.

The most important lesson from building PyckLM and from the production retrieval systems that informed this book is that data quality dominates everything else. A 40M parameter model trained on 968,692 carefully constructed triplets outperforms GraphCodeBERT on CodeSearchNet by 62% in relative terms (0.456 vs 0.281 MRR@10). That result isn't explained by architecture innovation or novel loss functions. It's explained by hard negatives: triplets constructed specifically to make the model discriminate between genuinely related code and superficially similar code that serves different purposes. The model had to learn to tell them apart. So it did.

The corollary is discouraging but useful: if you build the data pipeline carelessly — if you rely on random negatives, synthetic queries, or auto-generated relevance labels — you will produce a model that looks good on the metrics you generated and fails on the ones that matter. The eval set built from real developer queries will expose this. That's why the eval set is the first thing to build, not the last.

The other observation worth carrying forward is the importance of the feedback loop between deployment and training. The best training data for your model is the record of its production failures: the queries it couldn't answer, the documents it ranked too high that developers ignored, the searches that ended in abandonment. This data doesn't exist until you deploy. Deploying a model that's good enough to get real usage, then collecting and learning from the failures, then deploying an improved model — this cycle converges faster than trying to train a perfect model before initial deployment.

On infrastructure: the economics of custom embedding model training have shifted significantly. A meaningful training run — nearly a million triplets, 40M parameters — costs under $5 in cloud compute on an H100. The barrier to iteration is now almost entirely the time cost of data quality work and evaluation, not compute. This means the right investment is in evaluation infrastructure and data pipelines, not in squeezing compute efficiency out of a single expensive run. Build eval first. Build data infrastructure second. Train fast and often.

On monitoring: embedding models degrade silently. Index drift, model version mixing, query distribution shift — none of these generate error logs. They generate slightly wrong results that developers increasingly work around until the search tool is effectively abandoned. The monitoring infrastructure in Chapter 8 is not optional overhead; it's the early warning system that tells you when to retrain before the silent degradation becomes a team productivity problem.

The broader principle is worth stating directly: the retrieval system that compounds improvements over time is not the one with the best initial training run. It's the one with the best feedback loop between production behavior and training data. The data flywheel is the architecture decision that matters most in the long run.

You should now have the technical foundation to: decide whether custom training is warranted, construct training data that will produce a meaningful model, make architecture and infrastructure decisions that reflect your actual requirements, evaluate your model on metrics that matter for your use case, deploy to production with the right serving format and operational footprint, monitor for degradation before it becomes visible to users, and build the feedback loop that compounds improvements over time.

What you do with that foundation depends on your specific domain, your team's capacity, and the quality problems you're actually trying to solve. Start with the eval set. Measure the problem before you train a solution. Then train a solution worth the compute.

---

## Appendix A: Glossary {#appendix-a}

**Anchor**: In a training triplet, the query or reference document around which the training signal is organized. The model is trained to place the anchor near the positive and away from the hard negative in embedding space.

**BF16 (bfloat16)**: A 16-bit floating-point format with the same exponent range as FP32 but 7 bits of mantissa precision instead of 23. Supports the same dynamic range as FP32, making it stable for training without gradient scaling. Native hardware support in H100, A100, and RTX 30-series and later.

**Catastrophic Forgetting**: The phenomenon where training a neural network on new data causes it to lose performance on old data, because gradient updates for the new distribution overwrite parameters that encoded the old distribution. A risk in incremental fine-tuning without replay buffers.

**ChromaDB**: An open-source vector database designed for embedding storage and similarity search. Commonly used for development and small-to-medium production retrieval systems.

**Cosine Calibration**: The property of a well-trained embedding model where cosine similarity scores meaningfully separate relevant from irrelevant document pairs with a clear bimodal distribution. A model with poor cosine calibration cannot be used with similarity thresholds.

**DCG (Discounted Cumulative Gain)**: The cumulative relevance of retrieved documents, discounted by their rank position. Relevant documents at higher ranks contribute more to the DCG score than the same documents at lower ranks. Used to compute NDCG.

**Domain Gap**: The mismatch between the semantic patterns a model learned during pretraining (from its training corpus) and the semantic patterns relevant to a specific deployment domain. The domain gap is the fundamental problem that custom embedding training addresses.

**DDP (Distributed Data Parallel)**: PyTorch's standard mechanism for multi-GPU training that replicates the model on each GPU, distributes batches across GPUs, and synchronizes gradients via all-reduce operations.

**EWC (Elastic Weight Consolidation)**: A continual learning technique that penalizes changes to parameters that were important for previous tasks, mitigating catastrophic forgetting during incremental fine-tuning.

**FP16**: 16-bit floating-point with 5-bit exponent. Narrower dynamic range than BF16; requires gradient scaling to prevent gradient underflow during training.

**Hard Negative**: A training example that is superficially similar to the positive document for a given query but is not actually a correct retrieval result. Hard negatives are the primary source of discriminative signal in contrastive training.

**In-Batch Negative Sampling**: A contrastive training technique that treats all other examples in the training batch as negatives for each anchor-positive pair, dramatically increasing the number of learning signals per compute step.

**InfoNCE**: Information Noise-Contrastive Estimation — the loss function underlying CLIP, SimCSE, and many production retrieval models. Equivalent to cross-entropy over the similarity matrix of anchors and positives with in-batch negatives.

**L2 Normalization**: Dividing an embedding vector by its L2 norm to produce a unit-length vector on the unit hypersphere. Required when training with cosine loss to ensure that dot product and cosine similarity are equivalent.

**LoRA (Low-Rank Adaptation)**: A parameter-efficient fine-tuning technique that adds trainable low-rank matrices to specific layers of a pretrained model, reducing the number of trainable parameters while preserving most of the fine-tuning benefit.

**MRR@10 (Mean Reciprocal Rank at 10)**: The average of the reciprocal rank of the first relevant document in the top-10 retrieved results across all queries. Values above 0.4 are strong for code search; below 0.2 indicates fundamental training data mismatch.

**Mean Pooling**: Averaging the token-level hidden states from a transformer model (excluding padding tokens) to produce a single document-level embedding. The standard pooling strategy for bi-encoder retrieval models.

**NDCG@10 (Normalized Discounted Cumulative Gain at 10)**: A retrieval metric that accounts for graded relevance (documents can be more or less relevant). Normalized to [0, 1] by dividing by the ideal DCG achievable with the given relevance grades.

**ONNX (Open Neural Network Exchange)**: An open format for neural network models that enables interoperability between frameworks and optimized inference via ONNX Runtime, typically with lower memory footprint than native PyTorch.

**ONNX Runtime**: Microsoft's inference engine for ONNX models, supporting CUDA, DirectML, and CPU backends with graph optimization passes that often improve inference speed over native PyTorch.

**Positive**: In a training triplet, the document that is a correct retrieval result for the anchor query. The model is trained to place the positive near the anchor in embedding space.

**Replay Buffer**: A stored sample of original training examples used during incremental fine-tuning to anchor the model to its original distribution and prevent catastrophic forgetting.

**RSS (Resident Set Size)**: The portion of a process's memory that is held in RAM at a given moment. The relevant measure of production memory footprint for a serving process.

**TorchScript**: A way to create serializable and optimizable PyTorch models by tracing or scripting them to a static computation graph. TorchScript models can be exported to ONNX and deployed without a full Python environment.

**Triplet Loss**: A contrastive training loss that operates on (anchor, positive, hard_negative) triplets, pushing the anchor-positive distance below the anchor-negative distance by a specified margin.

**WordPiece**: A subword tokenization algorithm used in BERT-based models that segments text into vocabulary tokens and `##`-prefixed continuation tokens. Fragments compound identifiers like camelCase names, which can lose structural information for code tokenization.

---

## Appendix B: Tools and Resources {#appendix-b}

### Training Frameworks

**Sentence Transformers** (`sentence-transformers`)
The standard Python library for training and evaluating bi-encoder embedding models. Includes `TripletLoss`, `MultipleNegativesRankingLoss`, and `InformationRetrievalEvaluator` out of the box. The most direct path from training data to a production embedding model.
```bash
pip install sentence-transformers
```

**Hugging Face Transformers** (`transformers`)
The underlying model library used by Sentence Transformers. Required for `BertConfig`, `BertModel`, tokenizers, and training with `Trainer`.
```bash
pip install transformers
```

**Hugging Face PEFT** (`peft`)
Parameter-efficient fine-tuning library with LoRA, LoHa, and other adapters. Required for LoRA-based fine-tuning and the adapter-merge-before-ONNX-export workflow.
```bash
pip install peft
```

### Inference and Deployment

**ONNX Runtime** (`onnxruntime`, `onnxruntime-gpu`)
Production inference engine for ONNX models. The GPU package supports CUDA. Install only the GPU package on GPU machines — it includes CPU fallback.
```bash
pip install onnxruntime-gpu  # GPU + CPU fallback
pip install onnxruntime      # CPU only
```

**ONNX** (`onnx`)
The ONNX model format library — required for export validation (`onnx.checker.check_model`).
```bash
pip install onnx
```

### Vector Databases

**ChromaDB** (`chromadb`)
Open-source, Python-native vector database. Easy to embed in development environments; supports persistent storage and metadata filtering.
```bash
pip install chromadb
```

**Qdrant** (`qdrant-client`)
Production-grade vector database with Rust backend. Supports named vectors, payload filtering, and HNSW indexing. Good choice for production retrieval systems over 100K documents.
```bash
pip install qdrant-client
```

**pgvector** (`pgvector`)
PostgreSQL extension that adds vector similarity search to PostgreSQL. Best choice when your application data is already in PostgreSQL and you want to avoid managing a separate vector database.

### Training Data

**CodeSearchNet**
The standard benchmark dataset for code search containing 2M (docstring, function) pairs across Python, JavaScript, Go, PHP, Java, and Ruby.
- Dataset identifier: `code_search_net` (Hugging Face Datasets)
- Primary use: base training data for general code vocabulary
```python
from datasets import load_dataset
dataset = load_dataset("code_search_net", "python")
```

**The Stack** (BigCode)
Open-source code dataset covering 350+ programming languages, curated from permissively licensed repositories.
- Dataset identifier: `bigcode/the-stack`
- Primary use: large-scale code vocabulary; access is gated, so you must accept the dataset's terms of use on Hugging Face before downloading

### Monitoring and Observability

**Weights & Biases** (`wandb`)
The standard experiment tracking platform for ML training runs. Supports custom metrics, embedding projector visualization, and alert rules.
```bash
pip install wandb
```

**Prometheus + Grafana**
Standard stack for production metrics monitoring. For embedding drift and search miss rate monitoring, instrument your serving code to export Prometheus metrics and build Grafana dashboards around them.

### Evaluation

**BEIR** (Benchmarking IR)
A benchmark suite for zero-shot information retrieval across 18 diverse datasets. Useful for evaluating how well a model generalizes beyond its training domain.
```bash
pip install beir
```

**Sentence Transformers InformationRetrievalEvaluator**
Built-in evaluator that computes MRR@k, NDCG@k, MAP@k, and Precision@k during training. Drop-in addition to any Sentence Transformers training loop.
```python
from sentence_transformers.evaluation import InformationRetrievalEvaluator
```

---

## Appendix C: Further Reading {#appendix-c}

### Foundational Papers

**"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"** (Reimers & Gurevych, 2019)
The paper that established bi-encoder embeddings with triplet loss as the practical standard for semantic similarity. The architecture and training approach used throughout this book descends directly from this work.

**"Dense Passage Retrieval for Open-Domain Question Answering"** (Karpukhin et al., 2020)
DPR popularized the in-batch negative approach for retrieval models trained with contrastive loss. The efficiency gains from in-batch negatives described in Chapter 3 come from this work.

**"SimCSE: Simple Contrastive Learning of Sentence Embeddings"** (Gao et al., 2021)
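SimCSE demonstrated that dropout noise alone works as a minimal data augmentation: encoding the same sentence twice with different dropout masks yields a positive pair, with the other sentences in the batch serving as negatives. The supervised variant adds NLI contradiction pairs as hard negatives. Relevant for teams that want contrastive training pairs without labeled data or external mining.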

**"CodeBERT: A Pre-Trained Model for Programming and Natural Languages"** (Feng et al., 2020)
The paper introducing CodeBERT, one of the models PyckLM is compared against. Useful context for understanding what general-purpose code embedding models capture and where they fall short on domain-specific retrieval tasks.

**"GraphCodeBERT: Pre-training Code Representations with Data Flow"** (Guo et al., 2021)
The GraphCodeBERT paper, which introduces data flow graph pretraining for code representations. The 0.456 MRR@10 vs 0.281 MRR@10 comparison referenced in this book uses GraphCodeBERT as the baseline on CodeSearchNet.

**"Scaling Laws for Neural Language Models"** (Kaplan et al., 2020)
While focused on language models rather than embedding models, the scaling law framework is relevant for understanding the relationship between training data size, model size, and performance — the substrate of the "triplets per parameter" reasoning in Chapter 4.

### Practical Guides and Implementations

**Sentence Transformers Documentation** (`www.sbert.net`)
The most complete practical reference for training bi-encoder models with the Sentence Transformers library. Covers training data formats, loss functions, evaluators, and multi-task learning.

**Hugging Face Course: Semantic Search**
A practical walkthrough of bi-encoder retrieval systems, including FAISS indexing and production deployment patterns. Good complement to the training-focused content in this book.

**"MTEB: Massive Text Embedding Benchmark"** (Muennighoff et al., 2022)
The standard benchmark for comparing embedding models, covering 8 task types and 58 datasets in 112 languages. Relevant for understanding where your model sits in the broader embedding model landscape, particularly for multilingual or multi-task requirements.

### On Evaluation

**"Evaluation Measures for Information Retrieval"** (Manning, Raghavan & Schütze)
Chapter 8 of "Introduction to Information Retrieval" (freely available online) provides a rigorous treatment of MRR, NDCG, MAP, and precision/recall metrics. The canonical reference for retrieval evaluation methodology.

**"Beware of Naive Fine-tuning on Retrieval Tasks"** (Various, 2023+)
A collection of empirical observations across multiple papers showing that fine-tuning general models on narrow domains without careful data construction produces models that overfit to surface patterns rather than semantic relationships. The empirical grounding for the 50K triplet threshold claim in Chapter 1.

### On Production Deployment

**"Efficient Transformers: A Survey"** (Tay et al., 2022)
Survey of transformer efficiency techniques including attention approximations, quantization, and distillation. Useful for teams with extreme inference latency constraints that ONNX Runtime alone doesn't satisfy.

**ONNX Runtime Documentation: Performance Tuning**
The official guide to ONNX Runtime session options, execution provider configuration, and graph optimization levels. Essential reading before deploying ONNX models to production at scale.

**"Approximate Nearest Neighbors Oh Yeah"** (Annoy GitHub)
Spotify's ANNOY library documentation provides a clear explanation of the tradeoffs between exact search (reliable but slow for large indices) and approximate nearest neighbor search (fast but inexact). Relevant for teams building retrieval over millions of documents where exact search latency is prohibitive.

---

*Custom Embedding Models: Fine-Tuning, Evaluation, and Deployment*
*Kelly Price — 2026*
```

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*
