Why General-Purpose Embedding Models Fail on Code

You plug in a general-purpose embedding model, run a few test queries, get results back, and move on. The problem is silent: the results look plausible. They're not hallucinations — they're actual files from your codebase. They're just the wrong ones. And you won't know until your AI assistant confidently suggests a fix based on the wrong auth middleware, or misses the one function that matters.

That's the failure mode nobody warns you about. Not a crash. Not an error. Just quietly wrong retrieval, compounding every downstream decision.

General models learned to read prose, not parse structure

BERT, its successors, and most models on the MTEB leaderboard were trained on natural language. Wikipedia. Books. Web text. The structural unit of natural language is the sentence, then the paragraph. Meaning flows linearly. Context is positional.

Code doesn't work that way. The structural units are functions, classes, modules, and the call graph that connects them. A function's meaning isn't contained in its lines alone: it's defined by what it calls, what calls it, what interface it implements, and what invariants it maintains. A general model has no grammar for any of that. It reads a function definition the way it reads a sentence: as a flat sequence of tokens, with no notion of scope or call relationships.

This isn't a training data volume problem. You can't fix it by showing a general model more code. The architecture and training objectives are built around predicting masked tokens in natural language sequences. Code has fundamentally different distributional properties — token frequency distributions, syntactic constraints, semantic relationships between identifiers — that require purpose-built training objectives to capture.

Your identifiers are noise to a general model

Consider a query like: "find the middleware that validates JWT tokens for admin routes".

Your codebase has two files:

# auth/middleware/admin_jwt_validator.py
def useAdminJwtValidationMiddleware(request, next):
    ...

# auth/middleware/user_jwt_validator.py
def useUserJwtValidationMiddleware(request, next):
    ...

A general model's tokenizer breaks useAdminJwtValidationMiddleware and useUserJwtValidationMiddleware into subword pieces, and the two identifiers share almost all of them. The semantic difference (Admin vs User) is one piece among many, and it gets drowned out by the shared structure. The model's embedding space doesn't have a learned representation for "this identifier refers to a role-scoped middleware variant." It has representations for words it saw in books.

You get retrieval that can't reliably distinguish between these two files. For a two-file example, maybe you notice. In a real codebase with 40 middleware files, you're flying blind.
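To make the overlap concrete, here is a rough sketch that splits both identifiers on camelCase boundaries and measures how many parts they share. This is only an approximation: real subword tokenizers split differently, and the helper names split_identifier and jaccard are invented for this illustration.

```python
import re

def split_identifier(name: str) -> list[str]:
    """Split a camelCase or snake_case identifier into lowercase word parts."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]

def jaccard(a: list[str], b: list[str]) -> float:
    """Share of distinct parts that two identifiers have in common."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 0.0
    return len(sa & sb) / len(sa | sb)

admin = split_identifier("useAdminJwtValidationMiddleware")
user = split_identifier("useUserJwtValidationMiddleware")

print(admin)  # ['use', 'admin', 'jwt', 'validation', 'middleware']
print(user)   # ['use', 'user', 'jwt', 'validation', 'middleware']
print(round(jaccard(admin, user), 3))  # 0.667
```

Four of the five parts are identical. The one part that carries the distinction you care about is easy to outvote.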

Domain-specific naming conventions — Hungarian notation, scoped prefixes, framework-idiomatic patterns like use* hooks in React or *Handler in Go — are entirely opaque to a model that never encountered them during training in a meaningful context. They're not noise in your codebase. They're load-bearing semantic signal. A general model treats them as noise.

Chunking in isolation breaks structural meaning

The standard retrieval pipeline chunks your codebase into pieces and embeds each chunk. For prose, paragraph-level chunking works reasonably well. A paragraph is mostly self-contained.

A function definition is not self-contained. Embedding a function in isolation is like embedding a single frame from a film and calling it a movie summary. The frame might show a character holding a gun. Without the 90 minutes of context, you don't know if this is the inciting incident, the climax, or the red herring.

A 30-line function that calls db.beginTransaction(), delegates to inventoryService.reserve(), and catches a ConcurrencyException means something very specific in the context of your order processing system. Chunked in isolation and embedded with a general model, it's 30 lines of tokens that look vaguely similar to any other database transaction function in any other codebase. The embedding doesn't capture that this is the critical path for your inventory consistency guarantee.
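You can see what isolation throws away by listing the structural facts around a function. The sketch below uses Python's ast module to pull out the enclosing class, the docstring, the calls made, and the exceptions caught. The sample source, the class name OrderProcessor, and the helpers describe_functions and context_header are all hypothetical, and attaching such a header to a chunk is one possible mitigation, not a substitute for a code-native model.

```python
import ast

# Hypothetical sample: the function described above, inside an invented class.
SOURCE = '''
class OrderProcessor:
    def reserve_stock(self, order):
        """Reserve inventory for an order inside one transaction."""
        tx = db.beginTransaction()
        try:
            inventoryService.reserve(order)
            tx.commit()
        except ConcurrencyException:
            tx.rollback()
            raise
'''

def describe_functions(source: str) -> list[dict]:
    """Collect structural facts for each function (nested functions are not split out)."""
    found = []

    def visit(node, owner):
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                visit(child, child.name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                calls = sorted({ast.unparse(n.func) for n in ast.walk(child)
                                if isinstance(n, ast.Call)})
                catches = sorted({ast.unparse(n.type) for n in ast.walk(child)
                                  if isinstance(n, ast.ExceptHandler) and n.type is not None})
                found.append({
                    "owner": owner,
                    "name": child.name,
                    "doc": ast.get_docstring(child) or "",
                    "calls": calls,
                    "catches": catches,
                })
            else:
                visit(child, owner)

    visit(ast.parse(source), None)
    return found

def context_header(info: dict) -> str:
    """Render the facts as comment lines that could be prepended to a chunk."""
    qualified = f"{info['owner']}.{info['name']}" if info["owner"] else info["name"]
    return "\n".join([
        f"# symbol: {qualified}",
        f"# doc: {info['doc']}",
        f"# calls: {', '.join(info['calls'])}",
        f"# catches: {', '.join(info['catches'])}",
    ])

funcs = describe_functions(SOURCE)
print(context_header(funcs[0]))
```

None of these facts survive when the function body is embedded as a bare run of tokens.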

Code-native models are trained to preserve structural relationships — the connection between a class definition and its methods, between a function signature and its docstring, between a module's exports and their callers. General models aren't. They embed whatever falls within the token window, with no structural weighting.

The benchmark you're looking at doesn't measure what you care about

MTEB is the standard leaderboard for embedding model evaluation. Its original English benchmark covers 56 datasets across eight task types, including classification, clustering, and retrieval. It's a rigorous benchmark. It's also almost entirely natural language.

A model that ranks highly on MTEB is genuinely good at retrieving natural language documents. That tells you nothing about whether it can retrieve the right Python file when your developer asks "where do we handle rate limiting for the billing API."

PyckLM achieves 0.456 MRR@10 on CodeSearchNet. GraphCodeBERT, which is purpose-built for code and trained on the same domain, scores 0.281. That's a 62% relative improvement (0.456 / 0.281 ≈ 1.62) on a code-specific retrieval benchmark. But neither of those numbers tells you how either model performs on your codebase, with your naming conventions, your framework choices, your query patterns.

The only evaluation that matters is evaluation on your own retrieval queries against your own codebase. Run 50 queries you actually care about. Check rank 1. Check rank 5. Measure how often the right file appears in the top 10. Most teams never do this. They pick a model, eyeball a few results, and ship it. The silent failure accumulates.
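A minimal sketch of that check, assuming you have already recorded the ranked file list your pipeline returns for each query. The queries, file paths, and the function name hit_rate_at_k are made up for illustration.

```python
def hit_rate_at_k(ranked: dict[str, list[str]], expected: dict[str, str], k: int) -> float:
    """Fraction of queries whose correct file appears in the top k results."""
    if not expected:
        return 0.0
    hits = sum(
        1 for query, target in expected.items()
        if target in ranked.get(query, [])[:k]
    )
    return hits / len(expected)

# Hypothetical labelled queries: query -> the one file that should come back.
EXPECTED = {
    "where do we handle rate limiting for the billing API": "billing/rate_limit.py",
    "find the middleware that validates JWT tokens for admin routes": "auth/middleware/admin_jwt_validator.py",
    "where is inventory reserved during checkout": "orders/reserve.py",
}

# Hypothetical retrieval output: query -> ranked file paths, best first.
RANKED = {
    "where do we handle rate limiting for the billing API": [
        "billing/rate_limit.py", "billing/client.py",
    ],
    "find the middleware that validates JWT tokens for admin routes": [
        "auth/middleware/user_jwt_validator.py", "auth/session.py",
        "auth/middleware/admin_jwt_validator.py",
    ],
    "where is inventory reserved during checkout": [
        "inventory/models.py", "orders/cart.py",
    ],
}

for k in (1, 5, 10):
    print(f"hit@{k}: {hit_rate_at_k(RANKED, EXPECTED, k):.2f}")
```

In this toy data the admin middleware only shows up at rank 3, behind the user variant, and the inventory query never finds its file. Eyeballing the results would not have told you that.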

The compounding problem

Bad retrieval doesn't produce bad AI answers in isolation. It produces bad AI answers that look authoritative.

When your retrieval returns the wrong three files as context, your AI assistant reasons over the wrong code. It might generate a patch that modifies a function that's not actually called on the relevant path. It might miss a constraint that's enforced in a file that didn't make the top 10. It might confidently reference an implementation detail from the wrong middleware.

Developers catch this and stop trusting the tool. The effect compounds: worse embeddings mean more irrelevant tokens in context, which means worse AI output, which means faster trust erosion. Most teams chalk this up to "LLMs hallucinate" and move on. The actual failure was upstream, at retrieval.

There's also a token cost dimension. Every irrelevant file injected as context is tokens you're paying for and tokens the model has to process. With a well-calibrated retrieval threshold, you can surface the right 2 files instead of the wrong 8. That's not just accuracy — it's latency and cost.

What to actually do about it

Use a model trained on code structure, not retrofitted from text. The training objective matters: models trained on code search tasks — mapping natural language queries to code snippets — learn representations that generalize to the retrieval problem you actually have.

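Evaluate on your own codebase before shipping. Build a small evaluation set: 30-50 queries your developers actually ask, with known correct answers. MRR@10 is the right metric: for each query, take the reciprocal of the rank at which the correct file first appears in the top 10 (zero if it doesn't appear), then average across queries. If you can't get above 0.3 on your own queries, your retrieval layer is a liability.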
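MRR@10 is a few lines of code once you have ranked results and known answers. The sketch below assumes one correct file per query. The query ids, file paths, and the function name mrr_at_k are placeholders.

```python
def mrr_at_k(ranked: dict[str, list[str]], expected: dict[str, str], k: int = 10) -> float:
    """Mean reciprocal rank of the correct file, counting only the top k results."""
    if not expected:
        return 0.0
    total = 0.0
    for query, target in expected.items():
        top = ranked.get(query, [])[:k]
        if target in top:
            # Ranks are 1-based: a hit in first place contributes 1.0.
            total += 1.0 / (top.index(target) + 1)
    return total / len(expected)

# Placeholder evaluation set: q1 hits at rank 1, q2 at rank 2, q3 misses.
expected = {"q1": "a.py", "q2": "b.py", "q3": "c.py"}
ranked = {
    "q1": ["a.py", "x.py"],
    "q2": ["y.py", "b.py"],
    "q3": ["z.py", "w.py"],
}

score = mrr_at_k(ranked, expected, k=10)
print(f"MRR@10 = {score:.3f}")  # (1 + 0.5 + 0) / 3 = 0.500
```

If a query can have several acceptable files, a common variant scores the best-ranked of them. That is a design choice to settle before you compare models.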

Calibrate your similarity threshold. Most retrieval pipelines return a fixed top-k regardless of score. A function with cosine similarity 0.4 and a function with similarity 0.9 are not equally relevant — but a miscalibrated pipeline treats them the same. Set a minimum threshold. Return nothing rather than returning confidently wrong context.
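Here is one way the threshold could look, sketched in plain Python with toy two-dimensional vectors. The index contents, the function name retrieve, and the cutoff of 0.6 are all illustrative assumptions. The right cutoff depends on your model and should come from your own evaluation set.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def retrieve(query_vec, index, top_k=5, min_score=0.6):
    """Return up to top_k (score, path) pairs, dropping anything below min_score."""
    scored = [(cosine(query_vec, vec), path) for path, vec in index.items()]
    kept = [pair for pair in scored if pair[0] >= min_score]
    kept.sort(reverse=True)
    return kept[:top_k]

# Toy index: path -> embedding. Real embeddings have hundreds of dimensions.
INDEX = {
    "a": [1.0, 0.0],
    "b": [1.0, 1.0],
    "c": [0.0, 1.0],
}

print(retrieve([1.0, 0.0], INDEX))   # "a" and "b" pass, "c" is dropped
print(retrieve([-1.0, 0.0], INDEX))  # nothing passes: []
```

The second query returns an empty list instead of the three least-bad files. Your downstream prompt has to handle that case explicitly.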

The retrieval layer is where AI-assisted development either works or doesn't. Get the embeddings right first. Everything downstream depends on it.