The Embedding Dimension Tradeoff Nobody Talks About

The number everyone optimizes for is the one that matters least. Dimension count is a proxy metric masquerading as a quality signal — and the entire ML community has been cargo-culting it for three years.

OpenAI released text-embedding-ada-002 at 1536 dimensions, and the assumption that bigger is better took hold. Engineers skip the benchmark, pick the model with the most dimensions, and call it done. Meanwhile, retrieval quality on their actual data is quietly terrible, and nobody connects the dots because the numbers look impressive.

More Dimensions Don't Mean More Information

The intuition behind "more dimensions = more expressive" isn't wrong in principle. Higher-dimensional spaces can represent more complex relationships. That's true. But it only holds when the training process actually uses those dimensions to encode meaningful signal.

Here's what happens in practice: a 1536-dimensional model trained on web crawl text learns to spread general English semantics across 1536 axes. A 384-dimensional model trained on code repositories learns to encode code-specific relationships — variable naming patterns, function call conventions, docstring-to-implementation alignment — across 384 axes. Put both models in front of a code search query and ask which one finds the right function.

The 384-dim model wins. Not because it's smarter, but because its dimensions encode the right things for the domain. The 1536-dim model has more axes, most of which are encoding distinctions that are irrelevant for code — sentiment gradients, named entity patterns, topic clusters from Wikipedia. That's not expressive capacity, it's noise with extra storage.

PyckLM runs at 384 dimensions with 40M parameters. On CodeSearchNet — the standard benchmark for code retrieval — it hits 0.456 MRR@10. GraphCodeBERT, a model purpose-built for code understanding, sits in the same range despite having more parameters and more architectural complexity. What PyckLM has is training data alignment. That's the variable that actually moves the needle.

The Storage Math You're Ignoring

Every float32 embedding value costs 4 bytes. Do the multiplication:

# Storage per embedding
384-dim  → 384 × 4 bytes = 1,536 bytes  (~1.5 KB)
768-dim  → 768 × 4 bytes = 3,072 bytes  (~3.0 KB)
1536-dim → 1536 × 4 bytes = 6,144 bytes (~6.0 KB)

# At 1M embeddings (mid-size codebase index)
384-dim  →  1.5 GB
1536-dim →  6.0 GB

That 4x storage multiplier compounds. Larger indexes mean more RAM pressure, slower cold-start times, and more expensive infrastructure if you're running this in the cloud. For a semantic code search tool that needs to stay resident in memory to hit low latency targets, this isn't a rounding error — it's the difference between fitting on a standard instance type and needing a memory-optimized one.

And this is before you account for index structure overhead. HNSW graphs, the most common ANN index format, store a neighbor list per node. The size of those lists is set by the M parameter rather than by dimension, but higher-dimensional vectors usually need a larger M to hold recall, so link overhead tends to grow alongside the vectors. The raw embedding storage is only part of the picture.
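To make the arithmetic above repeatable, here is a minimal sketch of a storage estimator. The function names are hypothetical, and the link-overhead term is a rough assumption (up to 2 × M four-byte neighbor ids per vector on the base layer, ignoring upper layers and allocator overhead), not a measurement of any particular library.

```python
def embedding_bytes(dim: int, bytes_per_value: int = 4) -> int:
    """Raw storage for one embedding (float32 by default)."""
    return dim * bytes_per_value


def hnsw_link_bytes(m: int, bytes_per_link: int = 4) -> int:
    """Rough base-layer link cost per vector: up to 2 * M neighbor ids.

    Assumption: ignores upper layers and allocator overhead.
    """
    return 2 * m * bytes_per_link


def index_bytes(n_vectors: int, dim: int, m: int = 16) -> int:
    """Estimated total: raw vectors plus base-layer neighbor lists."""
    return n_vectors * (embedding_bytes(dim) + hnsw_link_bytes(m))


# Compare the three dimension counts at 1M vectors.
for dim in (384, 768, 1536):
    total = index_bytes(1_000_000, dim)
    print(f"{dim:>4}-dim: {total / 1e9:.2f} GB")
```

Raising m in the call shows how quickly the link term grows once you start tuning the index for higher-dimensional vectors.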

HNSW Degradation Is Silent

This is the one that burns people. When you scale up dimension count without rescaling your index parameters, HNSW retrieval quality degrades while latency stays inside your alert thresholds. So your dashboards look fine. P95 query time is still under 50ms. But the neighbors being returned are wrong.

HNSW works by building a hierarchical graph where each node connects to its approximate nearest neighbors during construction. "Approximate" is the key word. The algorithm trades exactness for speed. In low-dimensional spaces, the approximation is tight — the graph structure naturally captures true proximity well. As dimensions increase, the curse of dimensionality hits: distances between points compress toward uniformity, and the graph's neighbor structure starts reflecting construction artifacts rather than true semantic proximity.
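The distance compression described above can be seen in a small experiment. This is a sketch under a strong assumption: it uses uniformly random points, which is close to a worst case. Real embeddings have structure and a lower intrinsic dimensionality, so the effect on your data will likely be milder. The function name is hypothetical.

```python
import math
import random


def relative_contrast(dim: int, n_points: int = 200, seed: int = 0) -> float:
    """(farthest - nearest) / nearest distance from one random query."""
    rng = random.Random(seed)
    query = [rng.random() for _ in range(dim)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        dists.append(math.dist(query, point))
    nearest, farthest = min(dists), max(dists)
    return (farthest - nearest) / nearest


# As dim grows, the gap between nearest and farthest shrinks
# relative to the nearest distance.
for dim in (2, 32, 384, 1536):
    print(f"dim={dim:>4}  relative contrast={relative_contrast(dim):.3f}")
```

When the contrast is small, the nearest neighbor is barely nearer than everything else, which is the regime where an approximate graph has the least margin for error.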

The result: recall@10 drops while query time stays within budget. You're getting fast answers, just wrong ones. In a code search context, this means a developer queries "authentication middleware" and gets results that are close in the embedding space but not actually relevant. They assume the tool is broken. The real issue is an index that's structurally misaligned with the dimensionality of the vectors it's storing.
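Because latency dashboards won't surface this, one option is to check recall directly against brute-force results on a sample of queries. The sketch below is index-agnostic: it assumes you can get a list of ids back from whatever ANN index you use, and that vectors are normalized so dot product ranks like cosine similarity. Both helper names are hypothetical.

```python
def exact_top_k(query, vectors, k):
    """Brute-force top-k ids by dot product (assumes normalized vectors)."""
    scores = [
        (sum(q * v for q, v in zip(query, vec)), idx)
        for idx, vec in enumerate(vectors)
    ]
    scores.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scores[:k]]


def recall_at_k(approx_ids, exact_ids, k=10):
    """Fraction of the true top-k that the ANN index returned."""
    truth = set(exact_ids[:k])
    if not truth:
        return 0.0
    return len(truth & set(approx_ids[:k])) / len(truth)


# Toy example: the "index" returned ids 0 and 1, the truth is 0 and 2.
vectors = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)
print(recall_at_k([0, 1], truth, k=2))
```

Running this over a few hundred sampled queries after each index rebuild gives you the metric the latency graph can't.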

Fixing this requires tuning ef_construction and M parameters upward as dimension count grows — which means more memory, slower index builds, and higher query-time traversal cost. You're paying to compensate for a problem you created by choosing the wrong model.

Latency That Actually Matters

Code search latency has a hard UX threshold. If results appear in under 10ms, it feels instant. 10-30ms is acceptable. Above 30ms, developers notice. Above 100ms, they stop using it.

384-dim embeddings with a properly tuned HNSW index hit 6ms on a mid-size codebase. The same query against a 1536-dim index on comparable hardware runs 20-40ms. That's not a benchmark footnote — it's the difference between a tool that feels like part of your workflow and one that feels like waiting.

For real-time code search triggered on keystrokes or file saves, the latency target is unforgiving. You don't get to say "it's only 25ms slower" when the baseline is 6ms. You've made queries roughly three to seven times slower to get worse retrieval from a model that wasn't trained on your domain.

The hybrid search approach — combining dense vector similarity with BM25 keyword scoring via Reciprocal Rank Fusion — adds some overhead, but it also narrows the candidate set faster, which means HNSW traversal depth can stay shallower. At 384 dimensions, the total pipeline (embed query → ANN search → RRF fusion → rerank) runs end-to-end well inside the perceptual threshold. At 1536 dimensions, you're budgeting just for the ANN step.
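For readers who haven't seen Reciprocal Rank Fusion, here is a minimal sketch of the fusion step only. It scores each document as the sum of 1 / (k + rank) across the input rankings. The constant k=60 is a commonly used default and an assumption here, and the file names are hypothetical stand-ins for ANN and BM25 results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked id lists: score(d) = sum over lists of 1 / (k + rank)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


dense_hits = ["auth.py", "session.py", "utils.py"]      # hypothetical ANN results
bm25_hits = ["middleware.py", "auth.py", "routes.py"]   # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, bm25_hits])
print(fused)
```

A document that appears in both lists, like auth.py here, rises to the top even though neither retriever needs to agree on raw scores, which is why RRF is cheap to bolt onto an existing pipeline.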

The Right Question to Ask

When evaluating an embedding model for a specific retrieval task, dimension count should be the last thing you look at. The questions that actually matter:

What was this model trained on? If your corpus is Python and Rust and the training data is Stack Overflow posts and GitHub READMEs, that's meaningful domain overlap. If it's CommonCrawl and Wikipedia, the model has learned to embed Wikipedia, not code.

What's the training objective? Models trained with contrastive learning on query-document pairs from your domain will usually outperform models trained with masked language modeling on adjacent data, regardless of parameter count.

What does MRR@10 look like on a benchmark that resembles your retrieval task? CodeSearchNet is imperfect but it's standardized. A model that scores well there on natural-language-to-code retrieval will likely generalize better to real codebases than a model that scores well only on MTEB's text-heavy benchmarks.
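If you want to run that last check on your own query set rather than trust a leaderboard, MRR@10 is simple to compute. The sketch below assumes you have, for each query, a ranked list of result ids and a set of ids you consider relevant. The function name and the toy ids are hypothetical.

```python
def mrr_at_k(ranked_results, relevant, k=10):
    """Mean reciprocal rank of the first relevant hit within the top k.

    ranked_results: one ranked id list per query.
    relevant: one set of relevant ids per query.
    """
    if not ranked_results:
        return 0.0
    total = 0.0
    for ranking, truth in zip(ranked_results, relevant):
        for rank, doc_id in enumerate(ranking[:k], start=1):
            if doc_id in truth:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(ranked_results)


# Three toy queries: hit at rank 1, hit at rank 2, and a miss.
rankings = [["a", "b"], ["c", "d"], ["x", "y"]]
relevant = [{"a"}, {"d"}, {"z"}]
print(mrr_at_k(rankings, relevant))
```

Computing this for each candidate model on a few hundred queries drawn from your own codebase is a more direct comparison than dimension count or a public score.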

Dimension count is downstream of all of this. It affects storage, latency, and index behavior — real engineering constraints worth understanding. But it tells you nothing about whether the model's learned representations will find what you're looking for.

The 384-dim model that understands your domain will beat the 1536-dim model that doesn't. Every time. The storage savings and the latency improvement are just a bonus for making the right choice.