Vector Databases Are Not Your RAG Bottleneck


Every retrieval-augmented generation tutorial in 2024 followed the same blueprint. Embed documents. Store vectors in a dedicated database. Retrieve semantically similar chunks. Feed them to an LLM. Pinecone, Weaviate, Milvus, Qdrant: pick your flavor. The assumption was universal: semantic search required specialized infrastructure.

That assumption is expensive. And for most use cases, flat-out wrong.

The Complexity Tax

Vector databases exist for a reason. Approximate nearest neighbor search at massive scale, with concurrent updates from millions of users. Recommendation engines. Search platforms spanning petabytes. At that scale, the infrastructure earns its keep.

Here is the disconnect: most RAG systems do not operate anywhere near that scale. Internal documentation. Coding assistants. Domain-specific knowledge bases measured in thousands of documents, tens of thousands at most. Not billions. Not petabytes. Deploying a dedicated vector database for that is a bit like renting a warehouse to store a bookshelf.

Yet teams deploy the same infrastructure anyway. Every query now carries network latency. Serialization overhead. Someone has to own the monitoring, the backups, the re-indexing when documents change. Vendor lock-in if you went managed. Complexity without corresponding benefit.

The 2025 data makes this concrete: 87% of enterprise RAG deployments fail to meet ROI expectations. Eighty-seven percent. Not because embedding models are inadequate. Because over-engineered systems introduce friction at every layer—higher token cost, more latency, lower token efficiency. The infrastructure became the bottleneck it was supposed to eliminate.

What Vector Search Actually Requires

Strip away the branding and vector search is straightforward math. Dot products. Cosine similarities. Finding which stored vectors sit closest to a query vector. NumPy and scikit-learn have handled this efficiently for over a decade.
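To make the "straightforward math" concrete, here is a minimal sketch of exact cosine-similarity search in NumPy. The function names, the random stand-in embeddings, and the 10,000 x 384 corpus size are illustrative assumptions, not taken from any particular system.

```python
import numpy as np

def normalize(matrix: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so a dot product equals cosine similarity."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / np.clip(norms, 1e-12, None)

def top_k(query: np.ndarray, index: np.ndarray, k: int = 5):
    """Return (ids, scores) of the k rows of `index` closest to `query`.

    `index` is assumed to be row-normalized already.
    """
    q = query / max(float(np.linalg.norm(query)), 1e-12)
    scores = index @ q                      # one dot product per stored vector
    k = min(k, scores.shape[0])
    # argpartition selects the k best without sorting everything;
    # only those k candidates are then sorted by score.
    candidates = np.argpartition(-scores, k - 1)[:k]
    order = candidates[np.argsort(-scores[candidates])]
    return order, scores[order]

# Random vectors stand in for real embeddings (assumption for illustration).
rng = np.random.default_rng(0)
embeddings = normalize(rng.standard_normal((10_000, 384)).astype(np.float32))

# Querying with a stored vector should return that vector first.
ids, scores = top_k(embeddings[42], embeddings, k=3)
```

Normalizing once at load time means each query costs a single matrix-vector product plus a partial sort, which is why no specialized index is needed at this size.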

The numbers tell the story: 1 million vectors at 384 dimensions, a common embedding size, fit in about 1.5GB of RAM when stored as 32-bit floats. A standard server handles that easily. A laptop handles that. Most codebases have fewer than 100K files.
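The memory figure is plain arithmetic: vectors times dimensions times bytes per value. A small sketch, assuming float32 storage (4 bytes per value) and, for the codebase line, a hypothetical one vector per file:

```python
def embedding_bytes(num_vectors: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw size of a dense embedding matrix, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value

one_million = embedding_bytes(1_000_000, 384)
print(f"{one_million / 1e9:.2f} GB")        # 1.54 GB

# Assumption: one embedding per file. Real systems usually store several
# chunks per file, so multiply accordingly.
codebase = embedding_bytes(100_000, 384)
print(f"{codebase / 1e6:.1f} MB")           # 153.6 MB
```

Storing the same matrix as 16-bit floats would halve both numbers, at some cost in precision.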

Specialized infrastructure makes sense when real constraints appear: the embedding matrix outgrows available RAM, constant write operations hit individual vectors, or complex filtered queries demand partitioned indexes. Most RAG applications load a document corpus once, embed it, and query a stable index. They never hit those constraints. The specialized infrastructure exists for problems they do not have.

The PostgreSQL Reality Check

PostgreSQL with pgvector has quietly become the pragmatic choice for production RAG. The benchmarks tell the story:

- 1.4x lower p95 latency than Pinecone
- 1.5x higher throughput under load
- 79% lower monthly cost when self-hosted
- Six fewer operational steps and failure points

These numbers come from keeping vector search in your existing database infrastructure instead of adding a specialized service. One database to back up. One database to monitor. One database to reason about during incidents.

For teams already running Postgres, which describes most backend teams, pgvector is an extension install away from production-ready vector search. The backup strategy and operational knowledge carry over unchanged. The simplicity compounds.
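As a rough sketch of what "an extension install away" looks like, the statements below are held as Python strings so they can be handed to whichever Postgres driver is already in use. The table name `doc_chunks`, its columns, and the 384-dimension size are hypothetical, and the `%(name)s` placeholders assume a psycopg-style driver.

```python
# Enable pgvector once per database.
ENABLE_EXTENSION = "CREATE EXTENSION IF NOT EXISTS vector;"

# Hypothetical table: one row per document chunk with its embedding.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS doc_chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(384)
);
"""

# <=> is pgvector's cosine-distance operator; smaller means more similar.
NEAREST_CHUNKS = """
SELECT id, content, embedding <=> %(query)s::vector AS distance
FROM doc_chunks
ORDER BY embedding <=> %(query)s::vector
LIMIT %(k)s;
"""

def to_vector_literal(values) -> str:
    """Format floats as pgvector's text input form, e.g. '[0.25,-0.5,1.0]'."""
    return "[" + ",".join(repr(float(v)) for v in values) + "]"

# Parameters that would accompany NEAREST_CHUNKS in a driver's execute() call.
params = {"query": to_vector_literal([0.25, -0.5, 1.0]), "k": 5}
```

The query embedding travels as an ordinary parameter, so the same connection pool, transactions, and permissions that serve the application also serve vector search.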

BM25 Still Works

The rush toward embedding-based search created a blind spot. Keyword search never stopped being effective. It just stopped being fashionable.

XetHub benchmarks comparing BM25 against OpenAI embeddings showed surprisingly close performance. BM25 achieved 85% recall with 8 retrieved results. OpenAI embeddings achieved that recall with 7 results. The difference is marginal. The infrastructure difference is not.

A hybrid approach—BM25 for exact matches, semantic search for conceptual similarity, with reranking to sort the combined results—often outperforms pure vector search. BM25 requires no embedding infrastructure. Just an inverted index.
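To show how little machinery the keyword side needs, here is a minimal Okapi BM25 scorer in plain Python. It is a sketch, not a production index: the tokenizer, the sample documents, and the default `k1` and `b` values are assumptions, and the IDF uses the common non-negative variant with 1 added inside the logarithm.

```python
import math
import re
from collections import Counter

def tokenize(text: str) -> list[str]:
    """Lowercase and split on anything that is not a letter, digit, or underscore."""
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25:
    def __init__(self, docs: list[str], k1: float = 1.5, b: float = 0.75):
        self.k1, self.b = k1, b
        self.term_counts = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(c.values()) for c in self.term_counts]
        self.avg_len = (sum(self.lengths) / max(len(docs), 1)) or 1.0
        # Document frequency: in how many documents each term appears.
        doc_freq = Counter(t for c in self.term_counts for t in c)
        n = len(docs)
        self.idf = {
            t: math.log((n - df + 0.5) / (df + 0.5) + 1.0)
            for t, df in doc_freq.items()
        }

    def score(self, query: str, doc_id: int) -> float:
        counts = self.term_counts[doc_id]
        # Length normalization: long documents need more matches to score well.
        norm = self.k1 * (1 - self.b + self.b * self.lengths[doc_id] / self.avg_len)
        total = 0.0
        for term in tokenize(query):
            tf = counts.get(term, 0)
            if tf:
                total += self.idf[term] * tf * (self.k1 + 1) / (tf + norm)
        return total

    def search(self, query: str, k: int = 5) -> list[tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.term_counts))]
        matches = [(i, s) for i, s in scored if s > 0]
        return sorted(matches, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical mini-corpus.
docs = [
    "def handleAuthCallback(request): validate the OAuth state and token",
    "Login page layout and credential form styling",
    "Retry policy for background jobs",
]
bm25 = BM25(docs)
hits = bm25.search("handleAuthCallback")   # matches only the first document
```

An exact identifier lookup scores only the document that contains it, with no embedding model involved.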

For codebases, this matters more than people realize. Code has precise identifiers, function names, API calls. A developer searching for handleAuthCallback does not need semantic understanding. They need exact string matching. Fast. Cheap. Done.

The HNSW Problem at Scale

Even when vector databases are appropriate, their default algorithms carry hidden costs.

HNSW (Hierarchical Navigable Small World) powers most vector database implementations. Fast. Also approximate—and the approximation quality degrades silently as the database grows.

The failure mode is insidious. Latency stays stable while retrieval quality drops. Monitoring shows green. Users receive worse results. An index tuned at 100K vectors can miss true nearest neighbors and return less relevant documents at 10M vectors if its parameters are never revisited. Nothing flags the degradation until users complain.

Exact nearest neighbor search—what NumPy does trivially—guarantees correct results. Approximate search trades that guarantee for speed at scale. If you do not need the scale, the trade is not worth making. Green dashboards do not mean good results.
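One way to catch silent degradation is to sample queries periodically and compare the approximate index's results against exact search. The sketch below measures recall@k under stated assumptions: inner product as the similarity, random stand-in vectors, and a deliberately crude "search half the corpus" substitute in place of a real ANN index.

```python
import numpy as np

def exact_top_k(queries: np.ndarray, index: np.ndarray, k: int) -> np.ndarray:
    """Ground truth: ids of the k highest inner-product rows for each query."""
    scores = queries @ index.T
    return np.argsort(-scores, axis=1)[:, :k]

def recall_at_k(approx_ids: np.ndarray, exact_ids: np.ndarray) -> float:
    """Fraction of true nearest neighbors that the approximate search returned."""
    hits = sum(len(set(a) & set(e)) for a, e in zip(approx_ids, exact_ids))
    return hits / exact_ids.size

rng = np.random.default_rng(0)
index = rng.standard_normal((5_000, 64)).astype(np.float32)
queries = rng.standard_normal((20, 64)).astype(np.float32)
truth = exact_top_k(queries, index, k=10)

# Stand-in for an approximate index (assumption): it only sees every other
# vector, so it cannot return neighbors that live in the unseen half.
subset = np.arange(0, len(index), 2)
approx = subset[exact_top_k(queries, index[subset], k=10)]

recall = recall_at_k(approx, truth)
```

Tracking this number alongside latency turns "users complain" into an alert, since recall falls even when response times do not.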

For Coding Assistants: Context Engineering Over Vector Search

Modern AI coding assistants like Cursor and Windsurf demonstrate what actually matters for code retrieval. The challenge is not semantic search across massive corpora. The challenge is fitting the right context into limited token windows.

Context-aware RAG systems achieve accurate answers with up to 85% lower token consumption compared to naive retrieval. The improvement does not come from better vector search. It comes from:

- Chunking strategy that respects code structure: semantic chunking by functions, classes, and modules rather than arbitrary fixed-size splits
- Dependency awareness that includes related code, not just similar code
- Context compression and token budgeting that prioritize the most relevant information within the token limit
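The first point, chunking along code structure, can be sketched for Python sources with the standard library's `ast` module. This is a simplified illustration that handles only top-level functions and classes; the function name, the returned fields, and the sample source are assumptions.

```python
import ast

def chunk_by_definition(source: str) -> list[dict]:
    """Split Python source into one chunk per top-level function or class."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno
            # Keep decorators attached to the definition they modify.
            if node.decorator_list:
                start = min(d.lineno for d in node.decorator_list)
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "start_line": start,
                "end_line": node.end_lineno,
                "text": "\n".join(lines[start - 1:node.end_lineno]),
            })
    return chunks

example = '''import os

def load(path):
    with open(path) as f:
        return f.read()

class Parser:
    def parse(self, text):
        return text.split()
'''

chunks = chunk_by_definition(example)
# Two chunks: the `load` function and the whole `Parser` class.
```

Each chunk is a complete definition, so a retrieved result never ends halfway through a function body the way a fixed-size split can.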

A typical codebase contains fewer than 100K files. That is well within in-memory search territory. The bottleneck is not retrieval speed—it is retrieval quality. Returning the function signature without the implementation. Including a helper file but missing the type definitions it depends on. Filling the context window with syntactically similar but semantically irrelevant code.

These problems do not get solved by switching to a faster vector database. They get solved by understanding what context an LLM actually needs to generate useful completions. Research on the lost in the middle problem confirms this—LLMs struggle with relevant information buried deep in large context windows regardless of how it got there.
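Token budgeting, one of the levers listed earlier, can be as simple as a greedy pass over ranked chunks. The sketch below is an assumption-heavy illustration: `estimate_tokens` uses a rough four-characters-per-token guess rather than a real tokenizer, and the scores are made up.

```python
def estimate_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: roughly four characters per token."""
    return max(1, len(text) // 4)

def pack_context(ranked_chunks: list[tuple[str, float]], budget: int) -> list[str]:
    """Select chunks in descending score order without exceeding the budget."""
    selected, used = [], 0
    for text, _score in sorted(ranked_chunks, key=lambda c: c[1], reverse=True):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # too large for the remaining budget; a smaller chunk may fit
        selected.append(text)
        used += cost
    return selected

# Hypothetical retrieved chunks with relevance scores.
retrieved = [
    ("a" * 400, 0.9),   # about 100 tokens
    ("b" * 800, 0.8),   # about 200 tokens
    ("c" * 200, 0.7),   # about 50 tokens
]
context = pack_context(retrieved, budget=160)
# Keeps the first and third chunks; the second would overflow the budget.
```

The point of the sketch is the constraint, not the heuristic: every chunk admitted into the prompt displaces something else, so selection deserves as much attention as search.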

The Practical Path

Start with the simplest solution that could work:

For prototypes and small corpora: NumPy or scikit-learn in-memory search. Zero infrastructure. Sub-millisecond latency on corpora of a few thousand vectors, growing linearly with corpus size. Works to approximately 1M vectors.

For production with existing Postgres: pgvector. One extension. No new services. Battle-tested database operations you already understand.

For hybrid search needs: Combine BM25 with semantic search. Many retrieval problems benefit from both exact and fuzzy matching.
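One common way to combine the two result lists is reciprocal rank fusion (RRF), which merges rankings without needing comparable scores. It is shown here as one possible fusion step, not as the method behind any benchmark cited in this article; the document ids are hypothetical and `k = 60` is the conventional default constant.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Merge ranked id lists; each appearance contributes 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)

# Hypothetical outputs of a keyword search and a vector search.
bm25_ranking = ["auth.py", "login.py", "session.py"]
vector_ranking = ["login.py", "oauth.md", "auth.py"]

fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
# Documents found by both searches rise to the top:
# login.py, auth.py, oauth.md, session.py
```

Because only ranks are used, BM25 scores and cosine similarities never have to be put on the same scale.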

For embedded/edge applications: SQLite with vector extensions like sqlite-vec. RAG in 30MB of memory.

Graduate to dedicated vector infrastructure when actual constraints force the move—memory limits, high write volume, complex filtered queries. Those constraints are real. They are also far less common than the marketing suggests.

Trade-offs Worth Acknowledging

Simpler solutions are not free solutions.

In-memory search requires loading the full embedding matrix on startup. Cold starts take longer. Memory usage scales linearly with corpus size.

pgvector runs on your database server. Vector queries compete for resources with your application queries. Under load, you might want that isolation.

BM25 misses conceptual connections. A search for "authentication" will not find documents that only mention "login" and "credentials." Semantic search catches those relationships.

Exact nearest neighbor search is O(n) per query. At sufficient scale, approximate algorithms genuinely perform better. The threshold varies by hardware and latency requirements, but it exists.

The question is not whether vector databases have value. They do. The question is whether they have value for your specific use case at your current scale. For most RAG systems, the answer is: not yet. Probably not ever.

The Honest Assessment

Vector databases solve real problems at real scales. Nobody disputes that. But documentation bots, coding assistants, and internal tools running against corpora that fit in RAM? Not those problems.

The 2025 consensus from production teams: start simple, measure actual constraints, add complexity only when simpler solutions demonstrably fail. PostgreSQL with pgvector handles 80% of real-world AI workloads. NumPy handles prototypes and small deployments. BM25 hybrid search outperforms pure vector retrieval in many cases.

The infrastructure choices that matter for RAG quality are not about vector databases. They are about chunking strategies, context selection, and understanding what information an LLM actually needs to answer a given query. Those problems exist regardless of where you store your embeddings.

Solve retrieval quality first. The infrastructure problem can wait until you have proven you have one. Most teams never do.
