We Trained Our Own Code Embedding Model From Scratch. Here's What Happened.

Most teams pull a pre-trained model off Hugging Face and call it done. We went the other direction: random weights, our own training corpus, no borrowed knowledge. This is what we learned.

The Problem With Off-the-Shelf Embeddings

Code search has a specific problem that generic embedding models aren't trained to solve.

When a developer asks "where is the authentication middleware," they don't mean the word "middleware" in a documentation sense — they mean the function that intercepts requests, validates tokens, and forwards or rejects. The vocabulary gap between natural language queries and source code is enormous.

General-purpose embedding models were trained on natural language pairs. They're decent at document similarity. They struggle to understand that def validate_token(request: Request) in auth.py is the answer to "where does token verification happen."

We needed a model that understood code specifically — the patterns, the naming conventions, the relationship between a function's name and what it actually does.

So we trained one from scratch.

Architecture: A Bi-Encoder Built for Code

We chose a BERT-style bi-encoder architecture — the same fundamental design behind most production embedding models. Two encoders that map query and document into the same vector space, optimized so semantically similar pairs land close together.

We landed on a compact architecture: enough layers and hidden dimensions to have real representational capacity, but small enough to run on a single GPU during inference and keep retrieval fast at scale. Smaller models tend to collapse into generic similarity spaces. Larger models slow things down unnecessarily for our use case.

The training objective is TripletLoss with cosine distance, where cos_d(a, b) = 1 - cosine similarity of a and b:

L = max(0, cos_d(query, positive) - cos_d(query, negative) + margin)

For every training example: a query, a code chunk that answers it (positive), and a chunk that doesn't (negative). The model learns to pull queries toward their answers and push them away from irrelevant chunks.
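To make the objective concrete, here is a minimal pure-Python sketch of the loss for a single triplet. The 2-D vectors and the margin of 0.5 are assumptions chosen for illustration; they are not the values used in training.

```python
import math

def cosine_distance(a, b):
    """Cosine distance: 1 - cosine similarity (0 = same direction, 2 = opposite)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def triplet_loss(query, positive, negative, margin=0.5):
    """max(0, d(query, positive) - d(query, negative) + margin)."""
    gap = cosine_distance(query, positive) - cosine_distance(query, negative)
    return max(0.0, gap + margin)

# Toy 2-D embeddings, made up for this example.
query = [1.0, 0.0]
aligned = [1.0, 0.0]      # points the same way as the query
orthogonal = [0.0, 1.0]   # unrelated to the query

# Positive is close and negative is far: the margin is satisfied, loss is zero.
easy = triplet_loss(query, aligned, orthogonal)

# Roles swapped: the "positive" is farther than the "negative", loss is positive.
hard = triplet_loss(query, orthogonal, aligned)
```

Only triplets that violate the margin produce a non-zero loss, so well-separated examples stop contributing gradient once the model has learned them.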

Training Data: Code Across Six Languages, Plus Domain Knowledge

The model only knows what you show it. We built a training corpus from two sources:

1. Open code datasets — function-docstring pairs across Python, JavaScript, TypeScript, Go, Rust, and Java. Each pair is converted into a triplet: the docstring as query, the function body as positive, a random function from the same language as negative.

2. Domain-specific knowledge — tens of thousands of additional triplets generated from internal technical notes, architecture decisions, and debugging logs. This is what closes the gap for domain-specific queries — the kind of questions a developer actually asks after working on a codebase for months, not the sanitized examples in public benchmarks.

Combined, the corpus spans over 57,000 training triplets and six programming languages.
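The docstring-to-triplet conversion for the open datasets can be sketched roughly as follows. The record layout, the `build_triplets` helper, and the sample records are all hypothetical; the only parts taken from the description above are the roles (docstring as query, function body as positive, random same-language function as negative).

```python
import random

def build_triplets(pairs, seed=0):
    """Turn (language, docstring, function_body) records into triplets.

    The negative is a random function body from the same language
    that is not the positive itself.
    """
    rng = random.Random(seed)

    # Group function bodies by language so negatives stay in-language.
    by_language = {}
    for language, _, body in pairs:
        by_language.setdefault(language, []).append(body)

    triplets = []
    for language, docstring, body in pairs:
        candidates = [b for b in by_language[language] if b != body]
        if not candidates:
            continue  # no same-language negative available, skip this pair
        triplets.append({
            "query": docstring,
            "positive": body,
            "negative": rng.choice(candidates),
        })
    return triplets

# Hypothetical records, invented for this example.
pairs = [
    ("python", "Validate a JWT token.", "def validate_token(request): ..."),
    ("python", "Parse a config file.", "def load_config(path): ..."),
    ("go", "Open a database connection.", "func OpenDB(dsn string) (*sql.DB, error) { ... }"),
]
triplets = build_triplets(pairs)
```

Random in-language negatives are cheap to generate, but most of them are easy to tell apart from the positive, which is one reason accuracy on such triplets can run ahead of real-world retrieval quality.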

Training: Under an Hour on a Modern GPU

We trained on a high-end data center GPU, batch size 128, with a conservative learning rate and linear warmup/decay schedule. Gradient clipping is non-negotiable at this scale — we learned that the hard way on a larger experimental model (more on that below).
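For readers unfamiliar with the schedule, a linear warmup/decay curve can be sketched in a few lines. The step counts and peak learning rate below are placeholder assumptions, not our training configuration.

```python
def linear_warmup_decay(step, total_steps, warmup_steps, peak_lr):
    """Learning rate at `step`: linear ramp up to peak_lr, then linear decay to 0."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    remaining = total_steps - step
    return peak_lr * max(0.0, remaining / (total_steps - warmup_steps))

# Assumed values for illustration only.
TOTAL_STEPS = 1000
WARMUP_STEPS = 100
PEAK_LR = 2e-5

schedule = [
    linear_warmup_decay(step, TOTAL_STEPS, WARMUP_STEPS, PEAK_LR)
    for step in range(TOTAL_STEPS + 1)
]
```

The learning rate peaks exactly at the end of warmup, which is also where the largest parameter updates happen.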

The model converged quickly. It hit peak evaluation accuracy early in training and held it through completion.

Final accuracy on held-out triplets: 91.6%.

That number means: given a query, a correct code chunk, and a random negative, the model ranked the correct chunk higher in roughly 11 out of 12 cases.
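As a sketch of how this kind of triplet accuracy can be computed, assuming the query, positive, and negative have already been embedded (the vectors below are toy values, not model output):

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def triplet_accuracy(embedded_triplets):
    """Fraction of triplets where the positive scores higher than the negative."""
    correct = sum(
        1
        for query, positive, negative in embedded_triplets
        if cosine_similarity(query, positive) > cosine_similarity(query, negative)
    )
    return correct / len(embedded_triplets)

# Toy (query, positive, negative) embeddings, made up for this example.
held_out = [
    ([1.0, 0.0], [0.9, 0.1], [0.0, 1.0]),   # positive ranked higher
    ([0.0, 1.0], [0.1, 0.9], [1.0, 0.0]),   # positive ranked higher
    ([1.0, 1.0], [1.0, -1.0], [1.0, 0.9]),  # negative ranked higher (a miss)
    ([1.0, 0.0], [0.8, 0.2], [0.2, 0.8]),   # positive ranked higher
]
accuracy = triplet_accuracy(held_out)
```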

Benchmark Results: Three Difficulty Levels

Accuracy on synthetic triplets is a training metric, not a deployment metric. The real test is whether the model finds the right code when someone uses it on an actual codebase.

We validated against three difficulty levels:

- L0 (exact match): Query directly describes what the code does. Example: "function that validates JWT tokens" → the JWT validation function. Strong hit rate.
- L1 (semantic match): Query uses different vocabulary than the code. Example: "where does session verification happen" → the same JWT function. High hit rate; this is the core retrieval problem.
- L2 (inferential): Query describes a symptom or behavior, not the code. Example: "why would a user get 401 after login" → middleware chain. Lower rates, expected for a first-pass model.
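One way to score a benchmark like this is a per-level hit rate: a query counts as a hit if the expected chunk appears in the top k results. The sketch below assumes that definition; the cutoff `k`, the chunk identifiers, and the result records are hypothetical.

```python
from collections import defaultdict

def hit_rate_by_level(results, k=5):
    """results: iterable of (level, ranked_chunk_ids, expected_chunk_id)."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for level, ranked_ids, expected_id in results:
        totals[level] += 1
        if expected_id in ranked_ids[:k]:
            hits[level] += 1
    return {level: hits[level] / totals[level] for level in totals}

# Hypothetical search results, invented for this example.
results = [
    ("L0", ["auth.validate_jwt", "auth.login"], "auth.validate_jwt"),
    ("L1", ["auth.login", "auth.validate_jwt"], "auth.validate_jwt"),
    ("L2", ["db.connect", "auth.login"], "auth.middleware_chain"),
    ("L2", ["auth.middleware_chain", "auth.login"], "auth.middleware_chain"),
]
rates = hit_rate_by_level(results, k=2)
```

Reporting the rate per level rather than as one blended number keeps the harder inferential queries from being hidden behind the easy exact matches.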

The model also delivers meaningful compression: the top-ranked results are dense enough that a small number of chunks typically contains what's needed to answer the question, versus many more chunks with naive cosine similarity.

What We Got Wrong (And Fixed)

The first lesson from training at this scale: high learning rates break large models without gradient clipping.

We ran an experimental larger "teacher" model as the first step toward knowledge distillation. Training appeared stable early on, with loss dropping and accuracy climbing. Then progress stalled: loss plateaued and accuracy stopped improving.

Root cause: gradient explosions at the boundary of the warmup region, compounded by the larger model's sharper loss landscape. Adding gradient clipping and reducing the learning rate fixed it completely on subsequent runs.
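For context, clipping by global norm rescales all gradients together whenever their combined L2 norm exceeds a threshold. The pure-Python sketch below shows the mechanics only; the threshold of 1.0 and the gradient values are assumptions. In PyTorch the equivalent step is `torch.nn.utils.clip_grad_norm_`, called between `backward()` and the optimizer step.

```python
import math

def clip_by_global_norm(gradients, max_norm):
    """Rescale gradients so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    total_norm = math.sqrt(sum(g * g for grad in gradients for g in grad))
    if total_norm <= max_norm:
        return gradients, total_norm
    scale = max_norm / total_norm
    return [[g * scale for g in grad] for grad in gradients], total_norm

# One "exploding" step: two parameter groups with a global norm of 50.
exploding = [[30.0, 40.0], [0.0]]
clipped, norm_before = clip_by_global_norm(exploding, max_norm=1.0)
```

Clipping preserves the direction of the update and only limits its size, so a single bad batch cannot throw the weights far from where they were.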

The second lesson came from the distillation experiment: some ML library methods run under torch.no_grad() by design. This is correct for inference, but if you're using the output to compute a training loss, the gradient graph is severed and no learning happens. The fix is to call the model's tokenizer and forward pass directly rather than through the high-level convenience method.
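The severed graph is easy to reproduce in isolation. In the sketch below, a small `torch.nn.Linear` stands in for the embedding model and `encode` is a hypothetical convenience wrapper; neither is our actual model or any specific library's method.

```python
import torch

torch.manual_seed(0)
encoder = torch.nn.Linear(4, 2)  # stand-in for an embedding model

@torch.no_grad()
def encode(batch):
    """Hypothetical convenience method: inference-only, records no autograd graph."""
    return encoder(batch)

batch = torch.randn(3, 4)

detached = encode(batch)   # output of the inference-style helper
attached = encoder(batch)  # direct forward pass keeps the graph

# A loss built on the direct output can backpropagate into the encoder weights.
loss = attached.pow(2).mean()
loss.backward()
```

Here `detached.requires_grad` is `False` while `attached.requires_grad` is `True`. Calling `backward()` on a loss built only from `detached` raises a `RuntimeError`; the silent version of the bug appears when that output is mixed with other tensors that do require gradients, so training runs but the encoder never updates.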

Both issues are the kind you only find by running at scale. Synthetic test sets don't catch them.

What's Next: Fine-Tuning and Distillation

The model described here is a pretrained base — strong on general code, but not yet tuned on how people actually search.

The next step is fine-tuning on real query logs: actual searches, graded by whether they returned useful results. This trains the model to rank based on what users found valuable, not just what's syntactically similar.

Beyond that, we're exploring knowledge distillation: train a larger "teacher" model, then compress its learned representations into the smaller deployment model. Teacher-distilled models typically outperform direct scratch training by a meaningful margin, at no increase in inference cost.

The core bet is this: the more query-answer pairs you accumulate from real usage, the more accurate the model becomes for the specific developer using it. That compounding is the thing a generic, static embedding model can never replicate.

We're building persistent memory for developer AI workflows — semantic search that gets more accurate the longer you use it. The Embeddings API is live at pyckle.co/products.