The Search Result Isn't the Answer

There's a paper making the rounds this week — "Hierarchical Abstract Tree for Cross-Document Retrieval-Augmented Generation" — and the title alone is doing a lot of work. Hierarchical. Abstract. Tree. Cross-document. Each word is load-bearing. Together they describe something that a lot of teams are quietly rediscovering after a year or two of RAG disappointment: getting the right chunk isn't the same as getting the right answer.

The paper proposes organizing documents into a tree structure where higher nodes carry abstract summaries and lower nodes carry the actual content. Retrieval walks the tree. You pull from the top when you need conceptual orientation, from the bottom when you need specifics. It's not a new idea in computer science (hierarchical indexing has been around for decades), but applying it to LLM context assembly is a step that most implementations skip entirely.

Most RAG systems are flat. You chunk, you embed, you retrieve the top-k, you hand it to the model. Done.
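As a minimal sketch of that flat pipeline, assume a toy bag-of-words vector standing in for a real embedding model. The names `embed`, `cosine`, and `retrieve_top_k` are hypothetical, chosen for illustration only.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve_top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Score every chunk against the query the same way and keep the best k."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Every query takes the same path through `retrieve_top_k`, whether it asks for an overview or for one specific detail.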

The problem is that flat retrieval assumes every query lives at the same semantic altitude. A question like "what does this system do?" and a question like "what does line 47 of the payment processor return when the idempotency key collides?" are not the same type of question. Giving both queries the same retrieval path produces context that is either too abstract or too specific. Sometimes both.


The Reddit thread this week, the one titled "grep/ls is probably all you need for finance documents", is getting traction for the opposite reason. The argument there is that structured documents with consistent formatting don't need semantic search at all. On that view, BM25 and basic pattern matching outperform vector retrieval when the vocabulary is domain-specific and stable.

The thread is generating the usual responses. Some developers agree enthusiastically. Others cite cases where semantic search caught synonyms that grep would have missed. Both camps are right, depending on the document type.

What neither camp is fully accounting for is the retrieval-to-context pipeline. Even if grep finds the right chunk, the question is what the model receives. Is it one chunk? Three? What's adjacent to it? Does the surrounding context contradict it or support it? The retrieval mechanism is only one part of what determines answer quality.
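To make the "what's adjacent to it?" question concrete, here is a rough sketch of pattern matching that returns each hit together with its neighbouring lines, similar in spirit to grep's context option. The function name `grep_with_context` and its `context` parameter are assumptions for illustration.

```python
import re

def grep_with_context(lines: list[str], pattern: str, context: int = 1) -> list[list[str]]:
    """Return each matching line with up to `context` neighbours on both sides."""
    regex = re.compile(pattern, re.IGNORECASE)
    windows = []
    for i, line in enumerate(lines):
        if regex.search(line):
            start = max(0, i - context)
            windows.append(lines[start : i + context + 1])
    return windows
```

The match itself is identical whether `context` is 0 or 5. What changes is what the model is handed, which is the part of the pipeline the thread mostly leaves out.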


The Hierarchical Abstract Tree (HAT) paper addresses something in between those two positions. The idea is that document structure carries semantic information that flat embeddings lose. When a chunk is embedded in isolation, it loses its relationship to the document's argument. A chunk from section 3.2 of a technical report might be highly relevant to a query, but without understanding that section 3.2 is a response to the limitations described in section 2.4, the retrieved context is incomplete.

Hierarchical retrieval preserves some of that relational structure. The parent node carries the "what this section is about" signal. The child nodes carry the detail. A good retriever pulls from both levels depending on the question type.
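The two-level idea can be sketched as follows. This is not the paper's algorithm, which this article does not spell out. It is a toy greedy descent that assumes word overlap as the relevance score, and the `Node`, `descend`, and `assemble_context` names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    summary: str                 # abstract description of this subtree
    content: str = ""            # raw text; only leaves carry it in this sketch
    children: list["Node"] = field(default_factory=list)

def overlap(query: str, text: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def descend(root: Node, query: str) -> list[Node]:
    """Walk from the root to a leaf, choosing the best-matching child at each level."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: overlap(query, child.summary))
        path.append(node)
    return path

def assemble_context(path: list[Node]) -> str:
    """Parent summaries give orientation; the final node gives the detail."""
    lines = [f"[summary] {n.summary}" for n in path[:-1]]
    lines.append(f"[detail] {path[-1].content or path[-1].summary}")
    return "\n".join(lines)
```

A retriever that adapts to question type could stop the descent early for overview questions and return only summaries. This sketch always goes to a leaf to keep the example short.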

The implementation complexity is real. Building and maintaining a tree index takes ongoing work. When source documents update (a code file changes, a wiki page gets revised), the tree needs to propagate those changes upward through the parent nodes, because every summary above the edited content may now be out of date. Flat indexing sidesteps that maintenance problem entirely by just re-embedding the affected chunks.
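The upward propagation can be sketched as a walk along parent pointers. The `IndexNode` class and `mark_changed` function are hypothetical, and regenerating the summaries themselves is left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class IndexNode:
    name: str
    parent: Optional["IndexNode"] = None
    stale: bool = False  # True means this node's summary must be regenerated

def mark_changed(leaf: IndexNode) -> list[str]:
    """Flag the changed leaf and every ancestor; return the names to rebuild, leaf first."""
    to_rebuild = []
    node: Optional[IndexNode] = leaf
    while node is not None:
        node.stale = True
        to_rebuild.append(node.name)
        node = node.parent
    return to_rebuild
```

One edited chunk invalidates as many nodes as the tree is deep. In a flat index the same edit touches a single embedding.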


There's also a separate paper in this week's crop that's worth noting: "LLM-Oriented Information Retrieval: A Denoising-First Perspective." The frame there is different. The argument is that standard retrieval optimizes for relevance but not for utility. A chunk can be topically relevant and still add noise to the context window — repeated information, adjacent content that pulls the model toward a wrong interpretation, high-relevance but low-information chunks that take up space.

Denoising-first retrieval tries to filter for signal rather than relevance. The distinction matters because relevance is measurable at retrieval time, while utility is only measurable after the model has generated a response. You can score a chunk's cosine similarity. You can't easily score whether including it made the answer better.
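One narrow form of denoising can be shown in a few lines: dropping retrieved chunks that mostly repeat a chunk already kept. This is only a sketch of the "repeated information" case, not the paper's method. It assumes Jaccard word overlap as the similarity measure, and the threshold value is arbitrary.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two strings."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0

def drop_redundant(ranked_chunks: list[str], max_similarity: float = 0.6) -> list[str]:
    """Keep chunks in rank order, skipping any too similar to one already kept."""
    kept: list[str] = []
    for chunk in ranked_chunks:
        if all(jaccard(chunk, prior) <= max_similarity for prior in kept):
            kept.append(chunk)
    return kept
```

A filter like this can run at retrieval time because redundancy, like relevance, is measurable before generation. Whether the remaining chunks actually help the answer is not something it can tell you.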

This is one of the harder unsolved problems in production RAG. Most evaluation frameworks measure retrieval accuracy — did the right chunk appear in the top-k? Very few measure whether the assembled context actually produced a better answer than a different assembly would have.
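The retrieval-accuracy side of that gap is easy to compute, which is part of why it dominates. A minimal sketch of a top-k hit rate, with hypothetical function names and document IDs as plain strings:

```python
def hit_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> bool:
    """True if any relevant document appears in the first k results."""
    return any(doc_id in relevant_ids for doc_id in retrieved_ids[:k])

def hit_rate(results: list[tuple[list[str], set[str]]], k: int) -> float:
    """Fraction of queries with at least one relevant document in the top k."""
    if not results:
        return 0.0
    return sum(hit_at_k(retrieved, relevant, k) for retrieved, relevant in results) / len(results)
```

Nothing in this metric looks at the generated answer. Two context assemblies with the same hit rate can still produce answers of very different quality.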


The practical picture that emerges from this week's research is not "here's the new technique that fixes RAG." It's more specific than that.

Flat retrieval with good embeddings works well when documents are homogeneous, the query space is predictable, and the vocabulary is general enough that semantic similarity is meaningful. That covers a lot of use cases. It doesn't cover all of them.

Cross-document retrieval — pulling from multiple sources to answer a single question — is where flat approaches start to fail in consistent ways. The model receives fragments from different documents with no structural information about how those fragments relate. The HAT paper is attempting to solve that specific failure mode. Whether the tree-building overhead is worth it depends entirely on whether cross-document coherence is a problem in the actual deployment.

The denoising frame is useful regardless of retrieval strategy. Relevant-but-noisy context is a real phenomenon. Most teams encounter it and blame the poor answers on the model rather than on the context it was handed.


The honest summary is that retrieval research is maturing in the right direction. The early phase of RAG adoption was mostly about getting something working. The current phase is about understanding why it fails in specific ways and building targeted fixes.

Hierarchical indexing, denoising filters, cost-aware routing between retrieval strategies — these are all responses to real failure modes that teams encountered in production. None of them are magic. Each adds complexity that has to be justified by the quality improvement it produces.

Most teams are still on the flat retrieval path. That's probably the right place to be until the failure modes become clear enough to diagnose. When the same query reliably produces wrong answers despite high retrieval scores, that's the signal that the retrieval architecture itself needs revisiting.

The search result isn't the answer. It's the input to the answer. The distinction is starting to get the attention it deserves.
