Why Naive Retrieval Breaks at Scale (And What We Built Instead)

At a few hundred files, vector search works fine. At tens of thousands of chunks, it starts returning the wrong things — confidently. Here's the architecture we built to fix that.

The Scale Problem Nobody Talks About

Most RAG benchmarks test against small, clean corpora. A few hundred documents, curated content, obvious queries. In that environment, a single embedding lookup works well enough.

Real codebases don't look like that.

A production monorepo might have 50,000+ indexable chunks after AST-aware splitting. At that scale, the top-k results from a single vector search aren't just noisy — they're systematically biased. The embedding space gets crowded. Popular patterns appear everywhere. Generic terms like "config," "handler," or "initialize" drown out the specific chunk you actually need.

We call this the retrieval cliff: the point where adding more code to the index starts hurting search quality rather than helping it.

The fix isn't a better embedding model. It's a smarter retrieval strategy.

What a SearchPlan Actually Is

The core idea is to treat retrieval as a planning problem, not a lookup.

Before issuing a single vector query, a lightweight orchestration layer analyzes the incoming query and decides how to search — which signals to use, how to weight them, and how many candidates to pull from each stage.

We represent this as a SearchPlan dataclass:

from dataclasses import dataclass

@dataclass
class SearchPlan:
    query: str                    # raw user query
    entity_names: list[str]       # extracted symbol names
    hint_files: list[str]         # explicit file paths in query
    use_hierarchical: bool        # two-stage file → chunk search
    use_graph: bool               # AST define-vs-use signal
    use_bm25: bool                # keyword fusion
    top_k: int                    # final chunk count
    rerank: bool                  # cross-encoder reranking

A query like "where is the token validation logic" produces a different plan than "show me the tests for the auth module" — even though both are about authentication. The first triggers entity extraction and graph traversal. The second triggers file-hint detection and BM25 weighting toward test paths.
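
To make the planning step concrete, here is a minimal sketch of how a plan could be derived from a query. It is not the production logic: the two regexes, the `build_plan` name, the `chunk_count` argument, and the 20,000-chunk threshold are all assumptions made for illustration, and the sketch only recognizes symbol-shaped tokens (CamelCase or snake_case), not natural-language descriptions.

```python
import re
from dataclasses import dataclass

@dataclass
class SearchPlan:
    query: str
    entity_names: list[str]
    hint_files: list[str]
    use_hierarchical: bool
    use_graph: bool
    use_bm25: bool
    top_k: int
    rerank: bool

# Hypothetical patterns: CamelCase or snake_case identifiers, and path-like tokens.
ENTITY_RE = re.compile(
    r"\b(?:[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]+)+|[a-z][a-z0-9]*(?:_[a-z0-9]+)+)\b"
)
PATH_RE = re.compile(r"[\w./-]+\.(?:py|ts|js|go|rs|java)\b")

def build_plan(query: str, chunk_count: int,
               hierarchical_threshold: int = 20_000) -> SearchPlan:
    """Derive a SearchPlan from the query text and the size of the index."""
    hint_files = PATH_RE.findall(query)
    # Strip paths first so "tests/test_auth.py" is not also read as a symbol.
    remainder = PATH_RE.sub(" ", query)
    entity_names = list(dict.fromkeys(ENTITY_RE.findall(remainder)))
    return SearchPlan(
        query=query,
        entity_names=entity_names,
        hint_files=hint_files,
        use_hierarchical=chunk_count >= hierarchical_threshold,
        use_graph=bool(entity_names),          # graph signal needs a symbol
        use_bm25=bool(entity_names or hint_files),
        top_k=10,
        rerank=True,
    )
```

With these assumptions, `build_plan("where is TokenValidator defined", chunk_count=50_000)` turns on the graph and hierarchical stages, while a query that names `tests/test_auth.py` against a small index produces a file hint and leaves both off.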

The Signal Stack

At scale, no single signal is reliable. We fuse five:

1. Semantic similarity — the base embedding score. Good at intent, bad at specificity.

2. BM25 keyword match — exact term overlap. Catches what semantic search misses when the query uses the same words as the code.

3. AST define-vs-use graph — files that define a queried entity get a score boost over files that merely use it. This surfaces the canonical implementation instead of every callsite.

4. Entity chunk filter — if the query contains an extractable symbol name (TokenValidator, validate_token), we filter candidates to chunks that contain that string before scoring. Eliminates most noise.

5. File hint detection — if the query explicitly mentions a file path or module name, we pull extra candidates from that file directly, bypassing the ranking entirely for those chunks.
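
Signals 3 and 4 can be sketched together. The example below is an illustration under stated assumptions, not the actual implementation: it uses Python's standard `ast` module as a stand-in for the define-vs-use graph, and the `Chunk` type, the `filter_and_boost` name, and the 1.5x boost factor are hypothetical.

```python
import ast
from dataclasses import dataclass

@dataclass
class Chunk:
    path: str
    text: str
    score: float  # base score from the earlier signals

def defined_names(source: str) -> set[str]:
    """Names of every function and class defined anywhere in a Python file."""
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    }

def filter_and_boost(chunks: list[Chunk], files: dict[str, str],
                     entity_names: list[str], boost: float = 1.5) -> list[Chunk]:
    """Keep chunks that mention a queried entity; boost files that define it."""
    if not entity_names:
        return sorted(chunks, key=lambda c: c.score, reverse=True)
    wanted = set(entity_names)
    # Files whose AST contains a definition of any queried entity.
    defining = {path for path, src in files.items() if defined_names(src) & wanted}
    # Entity chunk filter: plain substring match on the chunk text.
    kept = [c for c in chunks if any(name in c.text for name in entity_names)]
    boosted = [
        Chunk(c.path, c.text, c.score * boost if c.path in defining else c.score)
        for c in kept
    ]
    return sorted(boosted, key=lambda c: c.score, reverse=True)
```

In this sketch a callsite with a higher base score can still end up below the defining file, which is the behavior the define-vs-use signal is meant to produce.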

Each signal has a weight, and the weights aren't equal: explicit file hints are worth far more than keyword overlap, which is worth more than semantic similarity. Getting these weights right required running against real query logs, not synthetic benchmarks.
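
The pipeline summary later in this post names weighted RRF (reciprocal rank fusion) as the merge step. A minimal sketch of that technique follows. The weight values and signal names are purely illustrative, not the tuned production weights, and `k=60` is the constant commonly used with RRF rather than a value taken from this system.

```python
from collections import defaultdict

def weighted_rrf(rankings: dict[str, list[str]],
                 weights: dict[str, float],
                 k: int = 60) -> list[tuple[str, float]]:
    """Fuse per-signal rankings: score(d) = sum_i w_i / (k + rank_i(d))."""
    fused: dict[str, float] = defaultdict(float)
    for signal, ranked_ids in rankings.items():
        weight = weights.get(signal, 1.0)
        for rank, chunk_id in enumerate(ranked_ids, start=1):
            fused[chunk_id] += weight / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Illustrative weights only: hints > keyword overlap > embedding similarity.
example_weights = {"file_hint": 3.0, "bm25": 1.5, "semantic": 1.0}
example_rankings = {
    "semantic": ["a", "b", "c"],
    "bm25": ["b", "c", "a"],
    "file_hint": ["c"],
}
fused = weighted_rrf(example_rankings, example_weights)
```

Because RRF works on ranks rather than raw scores, signals with very different score scales (BM25 versus cosine similarity) can be combined without normalizing them first.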

Hierarchical Retrieval

For large codebases, we put a file-level search in front of the chunk-level lookup, making retrieval a two-stage process.

Stage 1: File summaries. Each file gets a single summary embedding: a condensed representation of what the file does as a whole, rather than of its individual chunks. We search against these summaries first to identify the 10-15 most relevant files.

Stage 2: Chunk search within those files. The full chunk search runs only against the candidate files from Stage 1, dramatically reducing the search space.

This matters because at 50,000+ chunks, the nearest neighbors in embedding space are often structurally similar but semantically unrelated to what you need. Narrowing to candidate files first means the chunk search runs in a space where everything is at least plausibly relevant.

Hierarchical retrieval activates automatically above a chunk count threshold. Below that threshold, the overhead isn't worth it.
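
The two stages can be sketched in a few lines of dependency-free Python. This is a simplified illustration, assuming in-memory vectors and brute-force cosine similarity. The function names and data shapes are hypothetical, and a real index would use an approximate nearest-neighbor store instead.

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def hierarchical_search(query_vec: list[float],
                        file_summaries: dict[str, list[float]],
                        chunks: list[tuple[str, str, list[float]]],
                        top_files: int = 10,
                        top_k: int = 5) -> list[tuple[str, float]]:
    """chunks holds (chunk_id, file_path, vector) triples."""
    # Stage 1: rank files by summary similarity and keep the best few.
    ranked_files = sorted(
        file_summaries,
        key=lambda path: cosine(query_vec, file_summaries[path]),
        reverse=True,
    )
    candidates = set(ranked_files[:top_files])
    # Stage 2: score only the chunks that live in the candidate files.
    scored = [
        (chunk_id, cosine(query_vec, vec))
        for chunk_id, path, vec in chunks
        if path in candidates
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The cost of this design is that a chunk in a file whose summary scores poorly can never be returned, which is one reason to keep `top_files` generous.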

The Reasoning Layer

Signal fusion handles what to retrieve. The reasoning layer handles which files matter when the query is ambiguous or the codebase is large enough that even well-ranked results need prioritization.

A small fine-tuned model takes the query and the list of candidate files and outputs a structured plan — which files to read first, which are likely definitions versus usages, and whether any files should be skipped entirely.

The model was fine-tuned specifically on multi-repository orchestration patterns: the kind of decisions a developer makes when navigating an unfamiliar large codebase. Not general reasoning. Not code generation. Just: given these files and this question, what's the right reading order?

The output is always structured JSON — a ranked file list with rationale. If the model fails or times out, the system falls back to the heuristic signal scores.
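
The fallback behavior can be sketched as follows. The JSON schema shown here (a `files` list whose entries carry `path` and `rationale`) is an assumed shape for illustration, and `model_call` is a hypothetical callable that returns the model's raw output and raises on failure or timeout.

```python
import json
from typing import Callable

def prioritize_files(query: str,
                     candidates: list[str],
                     heuristic_scores: dict[str, float],
                     model_call: Callable[[str, list[str]], str]) -> list[str]:
    """Return a reading order, preferring the model's plan when it is usable."""
    fallback = sorted(
        candidates, key=lambda path: heuristic_scores.get(path, 0.0), reverse=True
    )
    known = set(candidates)
    try:
        plan = json.loads(model_call(query, candidates))
        # Keep the model's order, drop duplicates and paths it made up.
        ranked = [
            path
            for path in dict.fromkeys(entry["path"] for entry in plan["files"])
            if path in known
        ]
    except Exception:
        # Timeouts, malformed JSON, and unexpected shapes all fall back.
        return fallback
    return ranked or fallback
```

Files the model leaves out of its list are treated as skipped in this sketch. A stricter variant could append them at the end in heuristic order instead.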

This layer activates above a second threshold, higher than hierarchical search. For small codebases it adds latency without benefit. For large ones it's the difference between surfacing the right file on the first try and returning a list of plausible-looking noise.

Putting It Together

The full retrieval path for a large codebase query:

1. Query preprocessing: extract entity names, detect file hints, classify intent

2. Build SearchPlan: decide which signals and stages to activate

3. Stage 1 (if hierarchical): file summary search → candidate file list

4. Stage 2: chunk search, filtered by candidate files and entity name

5. Signal fusion: merge BM25 + semantic + graph + hint scores via weighted RRF

6. Reranking: cross-encoder reranker re-scores top candidates

7. Reasoning layer (if large codebase): model prioritizes files, returns structured plan

8. Return: top-k chunks with ranked file context

Each layer is independently toggleable. On a small codebase, most of it turns off and you get a fast, simple lookup. On a large one, all of it runs and you get retrieval quality that holds up.
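
One way to picture the toggling is a small gating function that decides which stages run for a given index size. The sketch below is hypothetical: the two threshold values are placeholders (the only constraint taken from this post is that the reasoning threshold sits above the hierarchical one), and the stage names are labels for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class StageConfig:
    hierarchical_threshold: int = 20_000   # placeholder value
    reasoning_threshold: int = 100_000     # placeholder, must be the higher one

def active_stages(chunk_count: int, rerank: bool = True,
                  config: StageConfig = StageConfig()) -> list[str]:
    """List the pipeline stages that would run, in order, for this index size."""
    stages = ["preprocess", "build_plan"]
    if chunk_count >= config.hierarchical_threshold:
        stages.append("file_summary_search")
    stages += ["chunk_search", "signal_fusion"]
    if rerank:
        stages.append("rerank")
    if chunk_count >= config.reasoning_threshold:
        stages.append("reasoning_layer")
    stages.append("return_top_k")
    return stages
```

Keeping the gating in one place makes it easy to log which stages ran for a query, which helps when tuning the thresholds against real traffic.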

What the Numbers Look Like

On identifier-heavy queries — function names, class names, specific symbols — hybrid BM25 + semantic fusion improves Recall@10 by 15–25% over pure vector search. The BM25 component catches exact-match cases that embedding similarity consistently misses, and RRF fusion keeps the score distribution stable so threshold calibration doesn't drift.

At the model level, PyckLM achieves 0.456 MRR@10 on CodeSearchNet — 62% higher than GraphCodeBERT on the same benchmark. MRR@10 measures where the first relevant result lands in the ranked list; a score of 1.0 means always first. The semantic search foundation matters: if the embedding model can't rank relevant code above noise, no amount of orchestration recovers that loss.
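
For readers who want to reproduce these metrics on their own query logs, here is a short sketch of MRR@k and Recall@k. The function names and input shapes are illustrative assumptions, not part of any benchmark harness.

```python
def mrr_at_k(rankings: dict[str, list[str]],
             relevant: dict[str, set[str]], k: int = 10) -> float:
    """Mean over queries of 1/rank of the first relevant result in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant.get(query_id, set()):
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)

def recall_at_k(rankings: dict[str, list[str]],
                relevant: dict[str, set[str]], k: int = 10) -> float:
    """Mean over queries of the fraction of relevant items found in the top k."""
    if not rankings:
        return 0.0
    total = 0.0
    for query_id, ranked in rankings.items():
        wanted = relevant.get(query_id, set())
        if wanted:
            total += len(wanted & set(ranked[:k])) / len(wanted)
    return total / len(rankings)
```

A query whose first relevant result lands at rank 3 contributes 1/3 to MRR, and a query with no relevant result in the top k contributes 0 to both metrics.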

Embedding latency at 384 dimensions on ONNX Runtime (CPU) runs at 6ms per query. The full orchestration layer — plan generation, hybrid retrieval, reranking — adds 30–80ms on top of that, depending on codebase size and which signals are active. For most developer workflows, sub-100ms total retrieval time is fast enough to feel instant.

The retrieval cliff is real. The fix is an orchestration layer that treats search as a multi-signal planning problem, not a single lookup.

We're building persistent memory for developer AI workflows — semantic search that gets smarter the larger your codebase gets, not worse.