Vector Databases Are Not the Problem You Think They Are

The vector database debate is a distraction. Pick any of them—Pinecone, ChromaDB, pgvector, Qdrant—and the retrieval quality of your system will be determined by decisions you made three steps earlier. The database is where results come out. It is not where retrieval quality is made.

This matters because teams routinely spend weeks on database selection and days on embedding model selection. The time investment is exactly backwards.

The choice of vector database rarely determines whether a retrieval-augmented generation (RAG) system works. What determines it is almost everything that happens before the query reaches the database.

---

How Vector Databases Actually Work

The core idea is straightforward. Text (a function, a document, a code comment) gets converted into a numerical representation called an embedding. Embeddings are high-dimensional vectors, commonly 384 to 1,536 dimensions depending on the model, in which semantic relationships between concepts are encoded as geometric proximity. Texts with similar meanings end up close together in vector space. Texts with unrelated meanings end up far apart.
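To make "geometric proximity" concrete, here is a minimal sketch using cosine similarity, one common way to measure how close two embeddings are. The four-dimensional vectors and their names are invented for illustration; real embeddings come from a model and have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings", hand-written for this example.
login_handler = [0.9, 0.1, 0.0, 0.2]
auth_middleware = [0.8, 0.2, 0.1, 0.3]
invoice_export = [0.0, 0.9, 0.7, 0.1]

sim_related = cosine_similarity(login_handler, auth_middleware)
sim_unrelated = cosine_similarity(login_handler, invoice_export)

print(f"login vs auth:    {sim_related:.3f}")
print(f"login vs invoice: {sim_unrelated:.3f}")
```

The two authentication-flavored vectors score far higher against each other than either does against the invoice vector. That ordering is all a vector database has to work with.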

When a query arrives, it gets embedded using the same model, and the database returns the stored vectors nearest to it. In practice this is approximate nearest neighbor (ANN) search. The "approximate" matters: most vector databases give up a small amount of recall in exchange for large speed gains, using index structures like HNSW (Hierarchical Navigable Small World) or IVF (inverted file index).
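The recall-for-speed trade can be shown with a toy IVF-style index. This is a sketch under heavy assumptions: the centroids are hand-picked rather than learned (real IVF implementations typically learn them with k-means), the data is two-dimensional, and every function name here is hypothetical. The idea is the same, though: search only the partitions nearest the query and accept that a true neighbor sitting just across a partition boundary may be missed.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_search(vectors, query, k):
    """Brute force: compare the query against every stored vector."""
    return sorted(vectors, key=lambda vid: dist(vectors[vid], query))[:k]

def build_ivf(vectors, centroids):
    """Put each vector id in the inverted list of its nearest centroid."""
    lists = {c: [] for c in range(len(centroids))}
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)), key=lambda c: dist(centroids[c], vec))
        lists[nearest].append(vid)
    return lists

def ivf_search(vectors, centroids, lists, query, k, nprobe=1):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sorted(range(len(centroids)), key=lambda c: dist(centroids[c], query))[:nprobe]
    candidates = [vid for c in probe for vid in lists[c]]
    return sorted(candidates, key=lambda vid: dist(vectors[vid], query))[:k]

def recall(approx, exact):
    """Fraction of the true neighbors that the approximate search found."""
    return len(set(approx) & set(exact)) / len(exact)

# Hand-picked toy data: two partitions along the x axis.
centroids = [(0.0, 0.0), (10.0, 0.0)]
vectors = {
    "a": (1.0, 0.0), "b": (4.0, 0.0), "c": (6.0, 0.0),
    "d": (9.0, 0.0), "e": (0.0, 1.0),
}
lists = build_ivf(vectors, centroids)
query = (4.9, 0.0)

exact = exact_search(vectors, query, k=2)
approx_1 = ivf_search(vectors, centroids, lists, query, k=2, nprobe=1)
approx_2 = ivf_search(vectors, centroids, lists, query, k=2, nprobe=2)

print("exact:   ", exact)
print("nprobe=1:", approx_1, "recall", recall(approx_1, exact))
print("nprobe=2:", approx_2, "recall", recall(approx_2, exact))
```

With one probe, the search skips the partition holding the second-nearest vector. Probing both partitions recovers it at the cost of scanning everything. Production indexes expose the same kind of knob, and tuning it is a latency question, not a relevance question.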

The result is a list of candidate chunks that are semantically related to the query. In a codebase context, this is how you get from "where is authentication handled" to a ranked list of files and functions.

This is table stakes. All of the major options do this.

---

What the Databases Are Actually Optimized For

The meaningful differences between ChromaDB, Pinecone, and pgvector are operational, not algorithmic.

ChromaDB is designed to be easy. It runs in-process, has minimal configuration, and can operate entirely in memory or persist to a local directory. For local development and prototype work, this is genuinely useful—there is no separate service to manage, no cloud credentials, no networking overhead. The tradeoff is that it is not built for high-throughput production workloads, and the ecosystem around it is still maturing.

Pinecone is designed to be a managed service at scale. The infrastructure decisions are abstracted away. It handles replication, redundancy, and query performance without requiring the user to tune HNSW parameters or manage index builds. For teams that want to ship and not operate a database, this is appealing. The tradeoff is vendor lock-in and cost that scales with index size and query volume—which, at sufficient scale, becomes significant.

pgvector is an extension to PostgreSQL. The argument for it is that most production applications already run Postgres, and adding vector search to an existing, well-understood database simplifies the stack. Its metadata filtering and JOIN capabilities are genuinely stronger than those of many purpose-built vector databases, which tend to treat metadata as a second-class citizen. The tradeoff is that Postgres is not optimized for ANN search at extreme scale, and HNSW index builds in pgvector can be slow on large corpora.

There are others worth mentioning, including Weaviate, Qdrant, Milvus, and Redis with RediSearch, each with its own optimization profile. The pattern holds: they differ primarily on operational characteristics, not on the quality of the retrieval math.

---

The Part That Actually Breaks Systems

Here is where most teams lose time: they assume the database is the retrieval system. It is not. The database is the last step in a pipeline that includes decisions which matter considerably more.

Embedding model selection. Generic models like text-embedding-ada-002 or off-the-shelf sentence transformers are trained on general text. Code is not general text. A codebase has naming conventions, domain abbreviations, internal terminology, and structural patterns that general embeddings represent poorly. A function named calcNAVWithFees is unlikely to embed as close to "calculate net asset value" under a general model as it would under a domain-tuned one. The database returns whatever the embeddings tell it to return. Bad embeddings produce bad results regardless of which database stores them.

Chunking strategy. How content is divided before embedding affects retrieval quality more than most developers expect. Semantic chunking—splitting on logical boundaries like function definitions, class declarations, and documentation blocks—consistently outperforms naive approaches. Chunks that are too large dilute the semantic signal; the embedding averages across too much content and loses specificity. Chunks that are too small lose the context needed to make the result useful. Splitting on file size or token count without considering logical structure produces retrieval artifacts that are hard to debug.
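As one illustration of splitting on logical boundaries, the sketch below chunks Python source by top-level function and class definitions using the standard library's ast module. It is deliberately narrow: it handles only Python, it drops imports and other module-level code, and the chunk_by_definition helper and the sample source are made up for this example. Other languages would need their own parser.

```python
import ast

def chunk_by_definition(source):
    """Return one chunk per top-level function or class in Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "start_line": node.lineno,
                "end_line": node.end_lineno,
                # Exact source text of the definition (decorators not included).
                "text": ast.get_source_segment(source, node),
            })
    return chunks

# Hypothetical sample file; it is only parsed, never executed.
SAMPLE = '''\
def charge(card, amount):
    """Charge a card."""
    return gateway.submit(card, amount)

class RetryPolicy:
    max_attempts = 3

    def should_retry(self, attempt):
        return attempt < self.max_attempts
'''

chunks = chunk_by_definition(SAMPLE)
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Each chunk is a complete unit of meaning with a name and a line range, which also makes retrieval results easier to debug than arbitrary 500-token windows.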

Staleness. Codebases change. Embeddings do not update themselves. A retrieval system indexed three months ago is increasingly misrepresenting the current state of the code. The database continues to return results with confidence while the code it describes has been refactored, deleted, or renamed. This is not a database problem. It is an indexing lifecycle problem.
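One way to treat staleness as a lifecycle problem is to store a content hash next to each embedding and compare it against the current code. The sketch below assumes chunks have stable ids (here a hypothetical "file::symbol" scheme) and that the index can return the hashes it stored. plan_reindex is an illustrative helper, not part of any database's API.

```python
import hashlib

def content_hash(text):
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_reindex(indexed_hashes, current_chunks):
    """Compare what the index holds against the current code.

    indexed_hashes: {chunk_id: hash stored at index time}
    current_chunks: {chunk_id: current chunk text}
    Returns (ids to embed again, ids to delete from the index).
    """
    to_embed = sorted(
        cid for cid, text in current_chunks.items()
        if indexed_hashes.get(cid) != content_hash(text)
    )
    to_delete = sorted(cid for cid in indexed_hashes if cid not in current_chunks)
    return to_embed, to_delete

# Hypothetical state: what was indexed months ago vs. what is on main today.
indexed_hashes = {
    "billing.py::charge": content_hash("def charge(card): return submit(card)"),
    "billing.py::refund": content_hash("def refund(tx): return reverse(tx)"),
    "legacy.py::old_sync": content_hash("def old_sync(): pass"),
}
current_chunks = {
    "billing.py::charge": "def charge(card, amount): return submit(card, amount)",
    "billing.py::refund": "def refund(tx): return reverse(tx)",
    "payments.py::retry": "def retry(tx): return charge(tx.card, tx.amount)",
}

to_embed, to_delete = plan_reindex(indexed_hashes, current_chunks)
print("re-embed:", to_embed)
print("delete:  ", to_delete)
```

Only the changed and new chunks get embedded again, and the deleted function is removed instead of lingering as a confident wrong answer.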

Query understanding. The query "how does the payment system handle retries" is not the same as the query "payment retry logic." Both might return relevant results. They might return different results. Whether natural language queries are well-matched to the way the codebase is described in its embeddings depends on how the index was built and whether there has been any calibration to the team's actual search patterns.

These four variables—embedding quality, chunking strategy, index freshness, and query alignment—collectively determine retrieval quality. The vector database influences query latency and operational complexity. It does not fix upstream problems.

---

What This Looks Like in Practice

A team builds a code search tool. They pick Pinecone because the blog posts make it sound production-ready. They use OpenAI's embedding model because it's the default. They split files every 500 tokens because that seems reasonable. They index the codebase once on launch.

The tool works. Results are roughly relevant. Developers use it occasionally. Six months later, it surfaces functions that no longer exist and misses entire services added after the initial index. The team assumes the database is degrading. It isn't. The index is just stale.

Another team picks ChromaDB for a local development tool. They write custom chunking logic that respects function boundaries. They fine-tune their embedding model quarterly against their actual query logs. They set up automated re-indexing on merge to main. Their retrieval quality improves over time. They're running ChromaDB.

The operational discipline made the difference. Not the database.

---

Where the Industry Is Going

The conversation around vector databases is maturing. A few trends are worth tracking.

Hybrid search is becoming standard. Pure vector search has recall gaps—it finds semantically similar content but can miss exact matches, rare identifiers, or domain-specific terms that don't embed well. Hybrid approaches that combine vector search with BM25 keyword search, then use reciprocal rank fusion to merge the results, consistently outperform either method alone. Adding a reranking pass on top of the fused candidates improves precision further. The major platforms are adding this natively. Teams still running pure vector retrieval are leaving quality on the table.
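Reciprocal rank fusion is simple enough to sketch in a few lines. Each result scores the sum of 1 / (k + rank) across the lists it appears in. The constant k = 60 is a commonly used default, and the file paths below are invented. The sketch assumes the vector and keyword searches have already run and each returned an ordered list of ids.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of ids into one ranked list.

    Each id scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by id.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

# Hypothetical results for one query from two retrievers.
vector_hits = ["auth/session.py", "auth/login.py", "users/profile.py"]
keyword_hits = ["auth/login.py", "auth/tokens.py", "auth/session.py"]

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
for doc_id in fused:
    print(doc_id)
```

Results that both retrievers agree on rise to the top, and the fusion needs only ranks, so the two retrievers' raw scores never have to be put on a common scale.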

Serverless and embedded architectures are growing. The operational burden of running a separate vector database service is real. Embedded options like ChromaDB in local mode, or the local mode in Qdrant's Python client, are maturing toward production viability. The category of "things that require a separate infrastructure team to run" is shrinking.

Domain-specific embedding models are beginning to matter. General-purpose embedding models served a generation of retrieval systems that prioritized getting something working. The ceiling on those models is becoming visible. Teams at the frontier are training or fine-tuning on domain data, and the quality delta is measurable. This will trickle into standard practice faster than most people expect.

Stale index management is an unsolved problem at most organizations. The tooling for automated, incremental re-indexing tied to code changes is still nascent. Most teams handle this with cron jobs or not at all. This is a gap that the ecosystem hasn't fully addressed.

---

The Honest Take

The vector database selection debate gets attention out of proportion to its actual impact on outcomes. This is partly because it's a tractable, visible decision: you pick a vendor or a library, and you're done. The variables that actually drive retrieval quality require ongoing attention, domain knowledge, and feedback loops. Those are harder to optimize and harder to hand off to a vendor.

The choice of database matters for operational reasons: cost, latency, maintenance burden, scalability ceilings. It's a real decision worth making carefully. But token efficiency and token cost are downstream of retrieval quality: bad results mean more context stuffed into prompts, more tokens consumed, and less useful output. Teams that spend more time on database selection than on embedding quality, chunking logic, and index lifecycle management are optimizing the wrong layer.

The retrieval system is only as good as what goes into it. The database is where the results come out.
