Fine-Tuned vs. General: The Embedding Model Decision You're Getting Wrong

Most teams pick their embedding model the wrong way. They benchmark on public datasets, pick the one with the best MTEB score, and ship it. Then they wonder why their semantic search returns irrelevant results for queries that feel obvious. The model isn't broken. It just doesn't speak your language.

The Question You're Actually Asking

The framing is wrong from the start. Teams debate "fine-tuned vs. general" as if fine-tuning were some exotic, expensive operation reserved for well-funded labs. That framing hides the real question: was your model ever trained on data that looks like yours?

A general-purpose embedding model trained on web crawls has seen a lot of text. Some of that text is code. It's seen GitHub READMEs, Stack Overflow answers, documentation pages. What it hasn't seen is your codebase — your naming conventions, your internal abstractions, your domain terminology, the way your team writes docstrings when they're in a hurry.

That gap matters more than people expect. When a developer on your team searches for "retry logic with exponential backoff," they have a mental model of what that means in your stack. The general model has a mental model built from the entire public internet's interpretation of that phrase. Those are not the same thing. In a small repo with clean, conventional code, the difference is negligible. In a monorepo with five years of accumulated abstractions, it's the difference between finding the right function and getting a list of plausibly related files that wastes ten minutes of someone's time.

What "General" Actually Means

Models like bge-large, text-embedding-3-large, and Codestral Embed are excellent. They generalize well precisely because they've been trained on massive, diverse corpora. That breadth is their strength. It's also their ceiling.

Breadth-trained models learn statistical associations across the entire distribution of human text. They know that async def is related to coroutines, that OAuth is related to authentication, that __init__ is related to class construction. That's useful. But they don't know that in your codebase, functions prefixed with _resolve_ are always part of the dependency injection layer, or that PolicyEngine is not a generic business rules class but a specific abstraction your team built three years ago that every billing-related query should surface.

When you run a semantic search and the top results are consistently "close but wrong," that's not a retrieval problem. That's a representation problem. The model is encoding your query and your code into a vector space where they don't sit as close to each other as they should. You can tune chunk size, adjust similarity thresholds, add keyword boosting — and you'll get marginal improvements. You won't fix the underlying signal.

How PyckLM Was Built — and Why It Matters

PyckLM started from random weights. No pretrained lineage, no transfer from a general language model. It was trained from scratch on code structure: Python, JavaScript, TypeScript, Go, Rust, plus linked Obsidian knowledge graphs. 968,692 triplets. 38.8 minutes on an H100. The final model hits 0.996 cosine accuracy on its validation set.

The triplets are the key. Each one is a (anchor, positive, negative) tuple: a query, a correct code match, and a hard negative — something that looks relevant but isn't. The hard negatives are what force the model to learn fine-grained distinctions rather than broad topic associations. A general model might cluster "authentication middleware" and "JWT validation" close together. A model trained on code-specific triplets learns to separate them when the codebase treats them as distinct concerns.
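To make the triplet idea concrete, here is a minimal sketch in plain Python. It is not PyckLM's training code: the margin value, the toy 2-D vectors, and the function names are all assumptions for illustration. It also assumes "cosine accuracy" means the fraction of triplets where the anchor is more similar to the positive than to the negative, which is the usual reading of that metric.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Margin loss on cosine distance.

    The loss is zero once the positive is closer to the anchor than the
    negative by at least `margin` (0.2 is an illustrative assumption).
    """
    d_pos = 1.0 - cosine_similarity(anchor, positive)
    d_neg = 1.0 - cosine_similarity(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)


def cosine_accuracy(triplets):
    """Fraction of triplets where the anchor is closer to the positive."""
    correct = sum(
        1
        for anchor, positive, negative in triplets
        if cosine_similarity(anchor, positive) > cosine_similarity(anchor, negative)
    )
    return correct / len(triplets)


# Toy embeddings, made up for illustration only.
query = [1.0, 0.0]
match = [0.9, 0.1]
hard_negative = [0.9, 0.3]   # looks relevant, sits close to the query
easy_negative = [-1.0, 0.0]  # obviously unrelated

hard_loss = triplet_loss(query, match, hard_negative)  # positive: still inside the margin
easy_loss = triplet_loss(query, match, easy_negative)  # zero: nothing left to learn
accuracy = cosine_accuracy([
    (query, match, hard_negative),
    (query, match, easy_negative),
])
```

The easy negative contributes no loss, so it produces no gradient. Only the hard negative pushes the model to separate near-neighbors, which is why the choice of negatives matters more than the raw triplet count.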

You can see the difference at query time:


# General model behavior: broad topic clustering
$ pyckle search "retry logic"
> network/client.py        # HTTP client with retries
> utils/resilience.py      # general retry decorator
> services/queue.py        # message queue with dead letter
> docs/api_reference.md    # mentions retry in passing

# Domain-calibrated behavior: structural signal
$ pyckle search "retry logic"
> utils/resilience.py      # retry decorator — direct match
> network/client.py        # uses the decorator — second-order match
> [stops — doesn't surface tangential mentions]

The difference isn't dramatic in a small repo. In a 400K-line codebase with 12 services, it keeps wrong results from compounding across every query.

When Fine-Tuning Is Overkill

If your codebase is under 10,000 files, a well-calibrated general code model will likely be sufficient. Fine-tuning — or training from scratch — needs enough data to matter. If you don't have the query volume to generate meaningful training signal, you're just adding complexity without payoff.

bge-large-en-v1.5 with hybrid search (BM25 + semantic) will usually outperform a badly trained domain model. The infrastructure overhead of maintaining a custom model is real. Don't pay it if you don't need to.
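One common way to combine a BM25 ranking with a semantic ranking is reciprocal rank fusion (RRF). The sketch below assumes you already have two ranked lists of file paths from whatever BM25 and embedding backends you use. The article does not prescribe a fusion method, so RRF, the function name, and the sample rankings are assumptions here; k=60 is a commonly used default for the RRF constant.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each list contributes 1 / (k + rank) per document, so items ranked
    high in more than one list rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings for the query "retry logic".
bm25_ranking = ["utils/resilience.py", "docs/api_reference.md", "network/client.py"]
semantic_ranking = ["network/client.py", "utils/resilience.py", "services/queue.py"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

RRF works on ranks rather than raw scores, so there is no need to normalize BM25 scores against cosine similarities before combining them.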

The calculus changes at scale. Monorepos. Proprietary internal frameworks with names that mean nothing to a general model. Teams that have built their own abstractions over years. Codebases where the naming conventions are so consistent and internal that a general model actively misfires — surfacing results that match the surface-level terminology but are structurally wrong. That's where domain-specific training becomes essential, not optional.

There's also a middle path that most teams skip: fine-tuning an existing code model on your specific codebase rather than training from scratch. If you have 50,000+ files and a few months of query logs, that's a tractable approach. You get domain specificity without the cold-start problem.

The Compounding Argument

Here's the part that doesn't show up in benchmarks. A static model, no matter how good, is a snapshot. It was trained on whatever data existed at training time, and it stays there. Nothing you learn from one query makes the next query any better.

The training data loop is a different kind of asset. Every search query against Pyckle is a signal: what developers are looking for, what they find useful, where the model missed. That signal feeds the next version of PyckLM. The model gets better as the tool gets used. A general embedding model from a major provider has no mechanism for this — it's updated on their schedule, for their reasons, optimized for their benchmark targets.
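As a rough illustration of how search usage could become training signal, the sketch below turns click logs into (anchor, positive, negative) triplets. This is not a description of Pyckle's actual pipeline, which the article does not detail. The SearchEvent record, its fields, and the "results skipped above the click are hard negatives" heuristic are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class SearchEvent:
    """Hypothetical log record for one search."""
    query: str
    shown: Tuple[str, ...]   # results in the order they were displayed
    clicked: Optional[str]   # result the developer opened, if any


def mine_triplets(events):
    """Build (query, positive, negative) triplets from click logs.

    Assumed heuristic: the clicked result is the positive, and every
    result ranked above it that was skipped is a hard negative.
    """
    triplets = []
    for event in events:
        if event.clicked is None or event.clicked not in event.shown:
            continue  # no usable signal in this event
        click_rank = event.shown.index(event.clicked)
        for skipped in event.shown[:click_rank]:
            triplets.append((event.query, event.clicked, skipped))
    return triplets


# Made-up log entries for illustration.
events = [
    SearchEvent(
        query="retry logic",
        shown=("docs/api_reference.md", "services/queue.py", "utils/resilience.py"),
        clicked="utils/resilience.py",
    ),
    SearchEvent(
        query="message queue",
        shown=("services/queue.py", "network/client.py"),
        clicked=None,
    ),
]

triplets = mine_triplets(events)
```

Clicks are a noisy proxy for relevance, so a real loop would likely filter or weight these triplets before training on them.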

This is the moat that matters in the long run. Not the cosine accuracy number at a point in time, but the trajectory. A model that improves as your team uses it is worth more than a static model that's marginally better today. Compounding works on models the same way it works on everything else.

The teams that will get the most out of code search over the next two years aren't the ones who picked the highest MTEB score in 2024. They're the ones building systems where usage generates training signal. That's not a bet on fine-tuning as a technique. It's a bet on feedback loops as infrastructure.