Retrieval-augmented generation promised to solve the context problem. Index the codebase, retrieve what is relevant, feed only that to the model. Elegant in theory.
In practice, it introduced a different problem. One that is harder to see and harder to fix. The chunking step—the part that seems boring and mechanical—is usually where things go wrong.
The Chunking Requirement
RAG systems need to break content into pieces. Embeddings work on chunks, not entire repositories. The system divides the codebase into segments, embeds each one, stores them in a vector database, and retrieves the most similar chunks when a query comes in.
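The four steps in that pipeline (divide, embed, store, retrieve) can be sketched in a few lines. This is a toy illustration, not a production design: `embed` is a bag-of-words counter standing in for a real embedding model, the "vector database" is a plain list, and every name here (`chunk_by_lines`, `build_index`, `retrieve`, `SOURCE`) is hypothetical.

```python
import math
import re
from collections import Counter


def chunk_by_lines(text, max_lines):
    """Split text into consecutive chunks of at most max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z_]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def build_index(text, max_lines):
    """Assumed 'vector store': a list of (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunk_by_lines(text, max_lines)]


def retrieve(index, query, k=1):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


# Invented two-function file, three lines per function.
SOURCE = """def login(user, password):
    token = issue_token(user)
    return token
def render_page(template):
    html = template.upper()
    return html
"""

index = build_index(SOURCE, max_lines=3)
top = retrieve(index, "login user password", k=1)
print(top[0])
```

The example only works this neatly because each function happens to be exactly three lines long, so the chunk boundaries coincide with the function boundaries. Real files offer no such guarantee.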
For prose, this works reasonably well. Paragraphs are semi-independent units. A chunk containing three paragraphs about authentication can be retrieved and understood without the surrounding chapter.
Code does not work that way.
What Chunking Destroys
Code is a graph, not a sequence. A function calls other functions. A class inherits from a parent. A variable defined in one file gets imported and used in another. Meaning lives in relationships, not isolated blocks.
Naive chunking ignores this. Most chunking strategies split files by line count or token limit, treating code like prose. Recursive chunking, semantic chunking: none of it matters if the boundaries ignore how code actually connects. A function call lands in chunk 47, its definition lands in chunk 892, and the critical context explaining why it exists is scattered across a dozen fragments.
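A small sketch makes the boundary problem concrete. It assumes a fixed three-line chunk size and an invented two-function file; `chunk_by_lines` and the function names inside `SOURCE` are hypothetical.

```python
def chunk_by_lines(text, max_lines):
    """Split text into consecutive chunks of at most max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


# Invented example file: charge() calls compute_tax(), defined just below it.
SOURCE = """\
def charge(order):
    total = order.subtotal
    total += compute_tax(order)
    return total

def compute_tax(order):
    rate = lookup_rate(order.region)
    return order.subtotal * rate
"""

chunks = chunk_by_lines(SOURCE, max_lines=3)

# Locate the chunk holding the call site, the signature, and the body.
call_chunk = next(
    i for i, c in enumerate(chunks)
    if "compute_tax(order)" in c and "def compute_tax" not in c
)
def_chunk = next(i for i, c in enumerate(chunks) if "def compute_tax" in c)
body_chunk = next(i for i, c in enumerate(chunks) if "lookup_rate" in c)

print(call_chunk, def_chunk, body_chunk)
```

Even in an eight-line file, the call site, the signature of `compute_tax`, and its body end up in three different chunks. Retrieving any one of them gives the model a fragment with no pointer to the other two.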
One researcher described it as "trying to understand a symphony by listening to random 10-second clips." Each clip is real audio. None of it conveys the composition. Adding more clips does not help. It just creates more disconnected noise.
The model receives pieces without connections. It can see that a function exists. It cannot see why it is called, what calls it, or how it fits into the larger flow. So it guesses. Sometimes right. Often subtly wrong. The subtle wrongness is the problem—it looks correct at first glance.
The Multi-Hop Problem
Real questions about code require following chains.
"How does user authentication flow through the system?" Route handler calls middleware. Middleware invokes service. Service queries database. Database returns through a transformer. Five components and four hops, minimum. Five different files, five different contexts, all connected.
Standard retrieval handles one hop. It finds chunks that match the query terms. "Authentication" surfaces the auth middleware. Maybe the user service. Probably not the database adapter or the response transformer—those do not contain the word "authentication" even though they are essential to understanding the flow.
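The miss is easy to reproduce with a toy version of the five-file chain. The sketch below uses plain term overlap as a stand-in for embedding similarity, which is a simplification; the file paths, file contents, and the `one_hop` helper are all invented for illustration.

```python
import re

# Hypothetical five-file authentication chain; contents are invented.
FILES = {
    "routes/login.py": "def login_route(request):\n    return auth_middleware(request)\n",
    "middleware/auth.py": (
        '"""Authentication middleware."""\n'
        "def auth_middleware(request):\n    return verify_user(request)\n"
    ),
    "services/user.py": (
        '"""User service: checks credentials for authentication."""\n'
        "def verify_user(request):\n    return find_user(request.user_id)\n"
    ),
    "db/adapter.py": "def find_user(user_id):\n    return to_response(run_query(user_id))\n",
    "transformers/response.py": 'def to_response(row):\n    return {"id": row[0]}\n',
}


def terms(text):
    """Lowercased alphabetic tokens in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))


def one_hop(query, files):
    """Return the files that share at least one term with the query."""
    wanted = terms(query)
    return [path for path, body in files.items() if wanted & terms(body)]


hits = one_hop("authentication", FILES)
missed = [path for path in FILES if path not in hits]
print("retrieved:", hits)
print("missed:", missed)
```

Two of the five files come back. The database adapter and the response transformer are part of the flow, but nothing in their text ties them to the query, so a single retrieval pass never sees them.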
A co-founder of Cursor put it directly: "The hardest questions and queries in a codebase require several hops. Vanilla retrieval only works for one hop."
This is not a tuning problem. It is architectural. Embedding similarity finds locally relevant chunks. Code understanding requires globally connected reasoning. Different problems. One of them is unsolved.
The Quality vs. Effort Trade-Off
Better chunking strategies exist. AST-aware chunking that respects function and class boundaries. Graph-based approaches that preserve relationships. Hierarchical indexing that maintains multiple levels of abstraction.
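As a minimal sketch of the first of these, AST-aware chunking for Python source can be built on the standard library `ast` module. The helper name `chunk_by_definition` and the sample source are assumptions for illustration; a multi-language system would need a parser per language.

```python
import ast


def chunk_by_definition(source):
    """Return one chunk per top-level function or class, keyed by name."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = {}
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks[node.name] = "\n".join(lines[start - 1:node.end_lineno])
    return chunks


# Invented sample module.
SOURCE = """\
import math

def area(radius):
    return math.pi * radius ** 2

class Circle:
    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return area(self.radius)
"""

chunks = chunk_by_definition(SOURCE)
for name, text in chunks.items():
    print(f"--- {name} ---\n{text}")
```

Each chunk now holds a complete definition, so no function is cut in half. The sketch still drops module-level statements such as the `import math` line, and it records nothing about the fact that `Circle.area` calls `area`. Boundaries are respected; relationships are not yet preserved.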
These help. They do not solve the fundamental tension.
The more sophisticated the chunking, the more compute, maintenance, and domain-specific tuning required. Teams building RAG systems for codebases describe it as a "black hole"—endless resources poured in for incremental improvements. One practitioner observed that if you are optimizing for code quality, RAG will "drain your resources, time, and degrade reasoning."
That is not a condemnation of RAG. It is a recognition that retrieval-augmented generation over code is genuinely hard, and the naive implementations that work for documentation and knowledge bases do not transfer cleanly. Context compression and prompt compression can reduce token cost, but they do not restore the structural relationships that chunking destroyed. What works for text does not automatically work for code. This is a category difference, not a tuning difference.
The Invisible Failure Mode
Poor retrieval does not announce itself. The system returns results. They look plausible. The model generates output based on them. Only when someone reads the output carefully does the gap become apparent—a suggestion that is almost right but misses a crucial dependency, or a refactor that breaks something the retrieval did not surface.
Teams adapt. They stop trusting the tool for complex queries. Use it for simple lookups. Do the hard thinking manually. The AI assistant becomes a fancy grep with a chat interface.
This adaptation masks the cost. The tool technically works. It just does not deliver on the promise. Nobody files a bug report because nothing is technically broken. It is just not as useful as the demo suggested. That kind of failure gets normalized, and normalized failures stick around.
What Actually Works
The teams getting better results share a few patterns:
Smaller, focused indexes. Rather than indexing the entire monorepo, they index specific subsystems relevant to current work. Less noise, more signal.
Metadata enrichment. Chunks include information about relationships—what imports what, what calls what. The retrieval system can use this to pull connected pieces, not just similar ones.
Hybrid search. Embedding similarity handles the initial retrieval. A second pass uses structural analysis to expand the results along dependency lines and rerank them. Two stages, different strengths.
Accepting limitations. For some queries, retrieval is not the right tool. The developers who use AI effectively know when to let the tool retrieve and when to manually assemble context. That judgment comes from understanding what retrieval can and cannot do.
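The metadata-enrichment and two-stage patterns above can be sketched together. The example assumes each chunk carries a `calls` list produced by some earlier static-analysis step, and that a similarity search has already returned the seed chunks. The `Chunk` class, the `expand` helper, and the chunk names are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Chunk:
    name: str
    text: str
    calls: list = field(default_factory=list)  # assumed metadata: names this chunk calls


def expand(seeds, chunks, max_hops):
    """Breadth-first walk along `calls` edges, starting from the similarity hits."""
    seen = list(seeds)  # a list, so retrieval order is preserved
    queue = deque((name, 0) for name in seeds)
    while queue:
        name, depth = queue.popleft()
        if depth == max_hops:
            continue  # hop budget spent on this path
        for callee in chunks[name].calls:
            if callee in chunks and callee not in seen:
                seen.append(callee)
                queue.append((callee, depth + 1))
    return seen


# Invented chunks mirroring an authentication chain.
CHUNKS = {
    c.name: c
    for c in [
        Chunk("login_route", "...", calls=["auth_middleware"]),
        Chunk("auth_middleware", "...", calls=["verify_user"]),
        Chunk("verify_user", "...", calls=["find_user"]),
        Chunk("find_user", "...", calls=["to_response"]),
        Chunk("to_response", "..."),
    ]
}

seeds = ["auth_middleware", "verify_user"]  # assumed output of the similarity stage
result = expand(seeds, CHUNKS, max_hops=2)
print(result)
```

With a two-hop budget the expansion pulls in `find_user` and `to_response`, which similarity alone would miss. This sketch only follows outgoing calls, so `login_route` is still absent; a fuller version would also store and follow caller edges. The hop budget matters too, since unbounded expansion in a large codebase quickly reintroduces the noise that retrieval was meant to remove.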
The Underlying Reality
Chunking is a lossy operation. It discards structure to fit an indexing constraint—a token budget shaped by the context window, not by the code's architecture. Some information survives. Some does not. The question is whether what survives is enough for the task at hand.
For simple queries, it usually is. For the complex questions that actually require AI assistance, the ones a developer could not answer quickly with a text search, it often is not. Larger context windows help, but the lost-in-the-middle problem, where models use information placed in the middle of a long context less reliably than information near the start or end, means more tokens do not automatically mean better results.
Understanding that trade-off is the first step toward working around it.
---
Fourth in a series on context management for AI-assisted development.