Why Generic Embeddings Fail at Code Search

OpenAI's embedding models are excellent. For documentation search, customer support tickets, knowledge bases — they're hard to beat. They were trained on enormous corpora of natural language and they understand it well.

For code search, they fail in a specific and predictable way.

The Vocabulary Translation Problem

When a developer searches for "where does session verification happen," they're not looking for a function called session_verification. That function probably doesn't exist. What exists is validate_jwt_token, checkAuthMiddleware, or verifySession.

The search system needs to understand that "session verification" and validate_jwt_token are semantically equivalent — that one describes the behavior and the other implements it.

This is what we call the L1 query problem: the query's vocabulary doesn't match the identifiers in the code, so answering it requires translating between the two.

Generic embedding models don't solve this reliably. They weren't trained on code-to-query pairs, so nothing taught them that useAuthorizationMiddleware is where authentication is enforced. With little surface-token overlap to go on, they tend to place "session verification" far from validate_jwt_token in vector space.
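To make the failure mode concrete, here is a minimal sketch of how a vector search ranks code chunks by cosine similarity. The three-dimensional vectors are hand-picked assumptions for illustration and do not come from any real model, and session_cleanup_job is a hypothetical chunk name. The point is only that when the embedding tracks surface vocabulary, the chunk that shares the word "session" outranks the chunk the developer wanted.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rank_chunks(query_vec, chunk_vecs):
    """Return chunk names sorted from most to least similar to the query."""
    scored = [
        (cosine_similarity(query_vec, vec), name)
        for name, vec in chunk_vecs.items()
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Hypothetical embeddings, chosen by hand to mimic a model that
# mostly tracks surface vocabulary. They are not real model output.
query = [0.9, 0.1, 0.0]  # "where does session verification happen"
chunks = {
    "validate_jwt_token": [0.1, 0.9, 0.2],   # the code the developer wants
    "session_cleanup_job": [0.8, 0.2, 0.1],  # merely shares the word "session"
}

ranking = rank_chunks(query, chunks)
print(ranking)  # the lexically similar chunk comes first
```

A model trained on code-to-query pairs would, ideally, move the validate_jwt_token vector toward the query so that the ranking flips.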

The Training Distribution Matters

An embedding model learns to place semantically similar text close together in vector space. But "semantically similar" depends entirely on what the model was trained to consider similar.

A model trained on Wikipedia learns that "capital of France" is similar to "Paris." A model trained on code-to-query pairs learns that "where does authentication happen" is similar to def validate_jwt(token: str) -> bool.

These are different similarity targets. The first is about factual relationships between natural language concepts. The second is about the mapping between how developers describe code and how that code is actually written.

Generic models learn the first. They don't learn the second.

What Code-Specific Training Looks Like

PyckLM was trained on over 57,000 code-to-query triplets. Each triplet contains a query, a code chunk that answers it, and a hard negative (a chunk that looks similar but isn't the answer).

The training objective is contrastive: pull the query toward the positive chunk, push it away from the negative. At scale, this teaches the model to map developer vocabulary to implementation patterns.
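One common way to express a contrastive objective over triplets is a margin-based triplet loss on cosine similarity. The sketch below is an illustration of that general technique, not PyckLM's actual loss function, which this article doesn't specify. The function name triplet_loss and the margin value of 0.2 are assumptions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, positive, negative, margin=0.2):
    """Margin-based triplet loss on cosine similarity.

    The loss is zero once the positive is more similar to the query
    than the negative by at least `margin`. Otherwise it grows with
    the size of the violation, which is the signal that pulls the
    query toward the positive and pushes it away from the negative.
    The margin value is an assumption for illustration.
    """
    sim_pos = cosine_similarity(query, positive)
    sim_neg = cosine_similarity(query, negative)
    return max(0.0, margin - sim_pos + sim_neg)


# Toy 2-dimensional vectors, not real embeddings.
well_ordered = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])
badly_ordered = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])
print(well_ordered, badly_ordered)
```

In real training the vectors come from the model being trained and the loss is averaged over batches of triplets, with gradients flowing back into the encoder. The sketch only shows the quantity being minimized.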

Examples from the training set:

Query: "where does rate limiting happen" → Positive: class RateLimiter implementation → Negative: class RateLimitExceeded exception definition
Query: "how are new users created" → Positive: def create_user(email, password) → Negative: def update_user(user_id, data)
Query: "what runs on startup" → Positive: def on_startup() hook → Negative: def shutdown() hook
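As a rough sketch, the examples above could be represented like this. The Triplet class and its field names are hypothetical, since the article doesn't describe the dataset's on-disk format, and the code strings are shortened stand-ins for full code chunks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    """One training example. Field names are assumptions."""
    query: str     # how a developer phrases the question
    positive: str  # code chunk that answers the query
    negative: str  # hard negative: looks similar, isn't the answer


# Shortened stand-ins for the three examples listed above.
EXAMPLE_TRIPLETS = [
    Triplet(
        query="where does rate limiting happen",
        positive="class RateLimiter: ...",
        negative="class RateLimitExceeded(Exception): ...",
    ),
    Triplet(
        query="how are new users created",
        positive="def create_user(email, password): ...",
        negative="def update_user(user_id, data): ...",
    ),
    Triplet(
        query="what runs on startup",
        positive="def on_startup(): ...",
        negative="def shutdown(): ...",
    ),
]

for t in EXAMPLE_TRIPLETS:
    print(f"{t.query!r}: {t.positive} (not {t.negative})")
```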

The hard negatives are critical. Without them, the model learns that anything mentioning "rate" is relevant to rate limiting. With them, the model learns the difference between the rate limiter implementation and code that merely mentions rate limits.

The Numbers

On held-out evaluation triplets, PyckLM achieves 91% cosine accuracy: in 91% of triplets, the positive chunk has a higher cosine similarity to the query than the hard negative does. On real codebases:

L0 queries (exact vocabulary match): both generic and code-specific models perform well
L1 queries (vocabulary translation required): code-specific models significantly outperform generic
L2 queries (behavioral/symptomatic descriptions): smaller gap, but code-specific still wins

The gap is largest on L1 queries — the queries that matter most for developer productivity.
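For reference, the cosine accuracy metric quoted above can be computed as follows. This is a minimal sketch over already-embedded triplets. The function name cosine_accuracy and the toy vectors are assumptions, and the toy data is not meant to reproduce the 91% figure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_accuracy(triplets):
    """Fraction of (query, positive, negative) vector triplets in which
    the positive is strictly closer to the query than the negative."""
    if not triplets:
        raise ValueError("need at least one triplet")
    correct = sum(
        1
        for query, positive, negative in triplets
        if cosine_similarity(query, positive) > cosine_similarity(query, negative)
    )
    return correct / len(triplets)


# Toy 2-dimensional triplets, not real embeddings.
toy_triplets = [
    ([1.0, 0.0], [0.9, 0.1], [0.1, 0.9]),   # ranked correctly
    ([0.0, 1.0], [0.2, 0.8], [0.7, 0.3]),   # ranked correctly
    ([1.0, 1.0], [1.0, 0.9], [1.0, -1.0]),  # ranked correctly
    ([1.0, 0.0], [0.0, 1.0], [1.0, 0.1]),   # negative wins: a miss
]

print(cosine_accuracy(toy_triplets))
```

Because the comparison is against a hard negative rather than a random chunk, this metric is stricter than it first looks: the model has to separate the answer from the chunk most likely to be confused with it.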

What This Means for Your Code Search

If you're building code search with text-embedding-3-small or text-embedding-ada-002, you'll get reasonable results on exact-match queries. But when developers ask "where does X happen" or "how does Y work," the results will be noisy.

The fix isn't better prompting or smarter chunking. The fix is an embedding model that was trained on the actual task: mapping developer queries to code.

That's what PyckLM does. If your code search matters, consider switching.
