Chunking Is Where RAG Goes Wrong

Split a document wrong and the AI retrieves garbage. That's the core problem with chunking—the seemingly simple act of breaking content into smaller pieces for retrieval systems. Get it wrong, and accuracy craters. Get it right, and retrieval works as advertised.

The gap between best and worst chunking strategies? Up to 9% in recall performance. That's not a rounding error. That's the difference between a useful system and one that hallucinates answers.

The Problem No One Explains Clearly

Retrieval-Augmented Generation (RAG) systems need to find relevant content and feed it to a language model. But most content—codebases, documentation, knowledge bases—is too large to feed in whole. So it gets chunked.

Here's where things break down.

Consider this chunk: "Its more than 3.85 million inhabitants make it the European Union's most populous city."

Which city? Which country? The chunk doesn't say. That context existed in the original document but got severed during splitting. Now when someone queries "What is Germany's largest city?", the retriever has little to match on, because the chunk never names the city or the country. And if the chunk does surface, the LLM has to guess what "it" refers to.

This is context severance. It happens constantly with naive chunking approaches.

Three Ways Chunking Fails

1. The Size Mismatch Problem

User queries are typically 10-50 tokens. Chunks range from 100 to 2,000 tokens. When the content being embedded is wildly different in size from the query, similarity scores degrade.

Think about it geometrically. A short query maps to a specific point in embedding space. A large chunk also maps to a single point, but that point has to summarize multiple concepts and multiple potential matches at once. The signal gets diluted.
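A toy illustration of that dilution, assuming plain bag-of-words vectors in place of a real embedding model. Learned embeddings behave differently in detail, so treat the result as directional only. The sample sentences and the `cosine` helper are made up for this sketch.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two texts using word-count vectors."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical sample data, not taken from any benchmark.
query = "berlin population"
focused = "berlin population is 3.85 million"
diluted = focused + " the city hosts museums parks universities and a large startup scene"

focused_score = cosine(query, focused)
diluted_score = cosine(query, diluted)

# The same matching words score lower once unrelated text shares the vector.
print(focused_score > diluted_score)
```

The matching words are identical in both chunks; only the amount of unrelated text around them changes, and that alone is enough to pull the score down.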

2. The Goldilocks Dilemma

Too small (100 tokens): Context fragments across multiple retrievals. The LLM receives orphaned snippets it must somehow synthesize into coherent understanding. Critical relationships between statements get lost.

Too large (2000+ tokens): Noise overwhelms signal. The relevant sentence exists somewhere in that chunk, but it's buried among paragraphs of tangentially related content. Rerankers can only do so much when the haystack is too big.

The research points to 256-512 tokens with 10-20% overlap as the sweet spot for most use cases. But "most" isn't "all."
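A minimal sketch of that baseline: fixed-size windows with a proportional overlap. It assumes the text has already been tokenized into a list; the function name `chunk_with_overlap` and its parameters are illustrative, and a real pipeline would count tokens with the embedding model's own tokenizer.

```python
def chunk_with_overlap(tokens, chunk_size=256, overlap_ratio=0.15):
    """Split a token list into fixed-size windows that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time

    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks
```

With `chunk_size=256` and `overlap_ratio=0.125`, each window repeats the last 32 tokens of the previous one, so a sentence that straddles a boundary appears whole in at least one chunk as long as it is shorter than the overlap.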

3. Broken References

Pronouns create cascading failures. "It," "they," "the company," "this method"—all become meaningless when split from their referents.

Late chunking research shows 10-12% accuracy drops for queries involving pronoun resolution. That's just pronouns. Add in other anaphoric references—"the aforementioned policy," "as described above," "the previous implementation"—and the problem compounds.

Why Code Gets Hit Hardest

Code has structure that naive chunking obliterates.

Split a function definition from its body and both chunks become useless. The signature means nothing without the implementation; the implementation lacks context without the signature.

Imports separated from usage break semantic understanding. Class definitions fragmented from their methods lose architectural meaning. Dependencies, relationships, and hierarchies—all the things that make code code rather than random text—get severed.
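One way to avoid those breaks for Python source, sketched with the standard library's `ast` module: emit one chunk per top-level function or class and prepend the module's imports to each. The function name `chunk_python_source` and the output shape are assumptions for illustration. Other languages would need their own parser.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class, with imports prepended."""
    tree = ast.parse(source)
    imports, definitions = [], []

    for node in tree.body:
        # get_source_segment returns the exact source text for a node.
        # Note: decorators sit above a def's start line and are not included.
        segment = ast.get_source_segment(source, node)
        if isinstance(node, (ast.Import, ast.ImportFrom)):
            imports.append(segment)
        elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            definitions.append((node.name, segment))
        # Other module-level statements are skipped in this sketch.

    header = "\n".join(imports)
    return [
        {"name": name, "text": f"{header}\n\n{body}" if header else body}
        for name, body in definitions
    ]
```

Each chunk keeps a signature together with its body, and a class stays together with its methods. Very large classes would still need a second, method-level pass.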

Traditional AI coding assistants operate within 4,000-8,000 token context windows. That's roughly 3,000-6,000 words of English prose, which forces developers to manually segment large codebases. The result: the AI lacks the architectural context to generate code that actually integrates with existing systems.

A single context-blind suggestion can break integration contracts between services, duplicate business logic across boundaries, or introduce dependency conflicts affecting multiple teams.

The Hallucination Cascade

Poor chunking doesn't just return irrelevant results. It creates conditions for confident wrong answers.

Here's the pattern: The retriever surfaces a paragraph referencing "the new policy" without the surrounding context that explains which policy changed or when. The retrieved text is technically correct but incomplete, so the model infers details that were never provided.

Technically correct. Semantically wrong. And delivered with full confidence because the retrieved chunk did contain relevant keywords.

In procedural content—installation guides, medical protocols, safety instructions—this becomes dangerous. If chunking breaks procedural steps or safety qualifiers across boundaries, the retriever surfaces partial or misleading passages.

What Actually Improves Retrieval

Contextual Embeddings

Anthropic's approach prepends chunk-specific explanatory context before embedding.

Before: "The company's revenue grew by 3% over the previous quarter."

After: "This chunk is from an SEC filing on ACME corp's performance in Q2 2023; the previous quarter's revenue was $314 million. The company's revenue grew by 3% over the previous quarter."

The results Anthropic reports: 35% fewer retrieval failures with contextual embeddings alone. Add contextual BM25 for hybrid search: 49% fewer failures. Add reranking on top: 67% fewer failures.

The one-time cost of generating that context: roughly $1.02 per million document tokens with prompt caching.
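The prepending step can be sketched as follows. This is not Anthropic's code: `describe` is a hypothetical callable standing in for the LLM call that writes the context, and `contextualize_chunks` is an illustrative name.

```python
def contextualize_chunks(document, chunks, describe):
    """Prepend a short, chunk-specific context string to each chunk before embedding.

    `describe(document, chunk)` is a hypothetical stand-in for an LLM call that
    returns one or two sentences situating the chunk within the document.
    """
    contextualized = []
    for chunk in chunks:
        context = describe(document, chunk).strip()
        # Fall back to the bare chunk if no context was produced.
        contextualized.append(f"{context} {chunk}" if context else chunk)
    return contextualized


def stub_describe(document, chunk):
    """Placeholder used only to make this sketch runnable without an LLM."""
    return "This chunk is from an SEC filing on ACME corp's performance in Q2 2023."
```

The contextualized strings, not the raw chunks, are what get embedded and indexed. Since every call sends the same full document alongside a different chunk, the document portion of the prompt is the natural candidate for caching.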

Late Chunking

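Rather than chunk first and embed second, late chunking embeds the entire document at the token level, then segments those token embeddings into chunks and pools each segment into one vector. Because every token embedding was computed with the whole document in view, each chunk vector carries context from beyond its own boundaries.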
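A minimal sketch of the segmentation step, assuming the token-level embeddings have already been produced by a long-context embedding model run over the whole document. Mean pooling is assumed here as the way to turn each segment into one vector; `late_chunk` and the span format are illustrative.

```python
def late_chunk(token_embeddings, spans):
    """Mean-pool token-level embeddings over each (start, end) token span.

    token_embeddings: list of equal-length float vectors, one per token,
                      computed with the full document as context.
    spans: list of (start, end) index pairs, end exclusive.
    """
    chunk_vectors = []
    for start, end in spans:
        window = token_embeddings[start:end]
        if not window:
            raise ValueError(f"empty span: ({start}, {end})")
        dim = len(window[0])
        chunk_vectors.append(
            [sum(vec[i] for vec in window) / len(window) for i in range(dim)]
        )
    return chunk_vectors
```

The chunk boundaries are applied after the model has seen the full text, which is the entire difference from the usual chunk-then-embed order.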

The trade-off: contextual retrieval preserves semantic coherence more effectively but requires greater computational resources. Late chunking offers higher efficiency but tends to sacrifice relevance and completeness.

Semantic Boundary Splitting

Split at semantic boundaries—sections, paragraphs, functions—not arbitrary character positions. This sounds obvious but requires actual understanding of content structure.
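For plain prose, the simplest version of this is to treat blank-line-separated paragraphs as indivisible units and pack them into chunks up to a size budget. A sketch under that assumption, with whitespace-separated words as a rough proxy for tokens and `split_on_paragraphs` as an illustrative name:

```python
def split_on_paragraphs(text, max_tokens=512):
    """Pack whole paragraphs into chunks, starting a new chunk when the budget is hit."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]

    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        n = len(para.split())  # whitespace words as a rough token proxy
        if current and current_len + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        # A single paragraph longer than max_tokens is kept whole, not cut.
        current.append(para)
        current_len += n

    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

No chunk ever begins or ends mid-paragraph. The cost is uneven chunk sizes, and oversized paragraphs need a fallback such as sentence-level splitting.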

Teams using hierarchical chunking, preserved table structure, and semantic boundary validation report 30-40% fewer retrieval calls needed to answer the same question. That's efficiency and accuracy improving simultaneously.

Agentic Chunking

Let AI determine optimal segment boundaries dynamically. The system analyzes content and adjusts chunk boundaries according to meaning and significance rather than following fixed rules.

More expensive computationally. Harder to debug. But for high-value content where retrieval accuracy matters most, the overhead pays off.

The Counterintuitive Finding

Here's what the research reveals: the choice of chunking strategy often has more impact on retrieval performance than the choice of embedding model.

Teams chase better embedding models, bigger context windows, more sophisticated rerankers. They overlook chunk boundaries.

Tuning chunk size, overlap, and hierarchical splitting can deliver 10-15% recall improvements that rival expensive model upgrades. For many systems, fixing chunking is the highest-ROI improvement available.

Practical Implications

Start with 256-512 tokens and 10-20% overlap as a baseline. Split at semantic boundaries, not arbitrary positions. Preserve structured content—tables, code blocks, lists—intact.

Add contextual metadata to each chunk before embedding. Implement hybrid retrieval combining semantic search with keyword matching. Use reranking for final ordering.
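One common way to combine the semantic and keyword result lists is reciprocal rank fusion, sketched below. The text above does not prescribe a fusion method, so this is one option among several; the constant `k=60` is a conventional default and the chunk IDs are made up.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each chunk scores 1 / (k + rank) for every list it appears in, so chunks
    ranked well by both retrievers rise to the top.
    """
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by ID for deterministic output.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


# Hypothetical results from a vector search and a keyword (BM25-style) search.
semantic_hits = ["c1", "c2", "c3"]
keyword_hits = ["c3", "c1", "c4"]
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Fusion works on ranks rather than raw scores, which sidesteps the problem that cosine similarities and keyword scores live on different scales. The fused list is what a reranker would then reorder.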

Most importantly: test empirically against sample queries. No universal solution exists. The optimal configuration depends on content type, query patterns, and accuracy requirements.
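A small harness for that kind of test, assuming a hand-labeled set of queries where the relevant chunk IDs are known. The function names and the data layout are illustrative.

```python
def recall_at_k(retrieved, relevant, k=5):
    """Fraction of relevant chunk IDs that appear in the top-k retrieved IDs."""
    relevant_set = set(relevant)
    if not relevant_set:
        return 0.0
    top_k = set(retrieved[:k])
    return len(top_k & relevant_set) / len(relevant_set)


def mean_recall_at_k(results, k=5):
    """Average recall@k over (retrieved, relevant) pairs, one pair per test query."""
    if not results:
        return 0.0
    return sum(recall_at_k(ret, rel, k) for ret, rel in results) / len(results)
```

Run the same labeled queries against each candidate configuration (chunk size, overlap, splitting rule) and compare the averages. Re-indexing is the expensive part; the scoring itself is trivial.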

Progressive context loading techniques have demonstrated token reductions of about 98% in some implementations (150K tokens down to 2K) by loading relevant code on demand rather than dumping entire repositories into prompts. The architecture of how content gets chunked and retrieved matters more than raw context window size.

The Trade-offs Are Real

Every improvement has costs.

Contextual embeddings require additional LLM calls to generate context. Late chunking demands more compute at indexing time. Agentic chunking adds latency and complexity. Smaller chunks with overlap increase storage requirements and retrieval overhead.

Semantic chunking that understands content structure requires either sophisticated parsing or AI-powered analysis. Both add pipeline complexity.

There's no free lunch. But there are lunches that cost less than the alternative—which is watching retrieval accuracy crater because documents got split at the wrong boundaries.

The question isn't whether to optimize chunking. It's whether to do it thoughtfully or discover the consequences when production queries return irrelevant results and the LLM fills in the gaps with plausible-sounding fiction.