Precision vs. Recall — The Last Mile of Code Search

In the previous post, we looked at why code search across large codebases is harder than it appears. The short version: a specific function name can appear in dozens of places across a codebase — imports, tests, documentation, other modules that call it — and the search system has to figure out which one is the definition.

Getting the right file into your results is one problem. Getting the right chunk within that file is a different problem, and it's the one that determines whether an AI assistant can actually do useful work.

Why File-Level Recall Isn't Enough

Imagine asking an AI assistant to add a parameter to a specific function. The assistant sends a search query to find that function. The search returns five results — and the correct file is in there. But the specific chunk containing the function definition is buried lower, displaced by chunks containing calls to that function from elsewhere in the same file.

The assistant reads the top results, doesn't see the definition, and either asks you to clarify, makes an edit in the wrong place, or hallucinates context it doesn't have. The file was found. The task still failed.

The distinction that matters: file-level recall tells you whether the right haystack is present. Entity-level precision tells you whether the needle is actually in your hand.
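To make the two measures concrete, here is a minimal sketch of how a single search result set might be scored both ways. The `Chunk` type, the file paths, and the function names are all hypothetical, invented for illustration; this is not Pyckle's evaluation code.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chunk:
    """A retrieved chunk: its source file and the entity it defines, if any."""
    path: str
    defines: Optional[str]  # name of the function/class defined here, else None


def file_hit(results: list[Chunk], target_path: str) -> bool:
    """File-level check: did any returned chunk come from the target file?"""
    return any(c.path == target_path for c in results)


def entity_hit(results: list[Chunk], target_path: str, target_name: str) -> bool:
    """Entity-level check: is the target definition itself among the results?"""
    return any(
        c.path == target_path and c.defines == target_name for c in results
    )


# Hypothetical top-3 results for a query about the function `send_invoice`.
top_k = [
    Chunk("billing/views.py", None),             # a call site in the right file
    Chunk("billing/views.py", "list_invoices"),  # a different definition
    Chunk("billing/tests.py", None),             # a test that calls it
]

print(file_hit(top_k, "billing/views.py"))                    # True
print(entity_hit(top_k, "billing/views.py", "send_invoice"))  # False
```

The same result set counts as a success at the file level and a failure at the entity level, which is exactly the case where the assistant has the haystack but not the needle.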

The Entity Precision Problem

A Django model file might contain 50 methods and properties. A FastAPI router might define 30 endpoints. When a search returns chunks from that file, all of them have a reasonable claim to relevance for a query about that file's code. The search system has to rank the specific definition above all the other plausible matches.

This is not a problem that gets easier with more data or bigger models in isolation. It requires the search system to understand the difference between "this chunk defines the thing" and "this chunk references the thing". That distinction demands reading the query and the code together, not independently.
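As a rough illustration of the defines-versus-references distinction, the sketch below uses Python's standard `ast` module to label a chunk relative to a name taken from the query. It is a purely syntactic stand-in, assuming each chunk parses on its own; the `classify_chunk` helper and the sample chunks are hypothetical, and nothing here describes how Pyckle actually ranks results.

```python
import ast


def classify_chunk(source: str, name: str) -> str:
    """Label a chunk as 'definition', 'reference', or 'unrelated' for `name`."""
    defines = False
    references = False
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            if node.name == name:
                defines = True
        elif isinstance(node, ast.Name) and node.id == name:
            references = True  # bare use, e.g. a call like name(...)
        elif isinstance(node, ast.Attribute) and node.attr == name:
            references = True  # attribute use, e.g. obj.name(...)
        elif isinstance(node, ast.alias) and node.name == name:
            references = True  # import, e.g. from module import name
    if defines:
        return "definition"
    return "reference" if references else "unrelated"


# Hypothetical chunks that all mention, or sit near, `send_invoice`.
definition_chunk = "def send_invoice(order, notify=True):\n    return order.total\n"
call_chunk = "def checkout(order):\n    send_invoice(order)\n"
import_chunk = "from billing.views import send_invoice\n"
other_chunk = "def list_invoices(user):\n    return user.invoices\n"

for chunk in (definition_chunk, call_chunk, import_chunk, other_chunk):
    print(classify_chunk(chunk, "send_invoice"))
```

Note that the label depends on both inputs: the same chunk is a definition for one query name and merely a reference, or unrelated, for another.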

What Pyckle Achieves

We validated Pyckle's search system across three production-scale Python codebases using thousands of entity-level search tasks — queries that require surfacing a specific function or class definition, not just a related file.

93–100%: entity-level recall across all tested codebases
67K: chunks in the largest validated codebase
<1s: average search latency

93–100% entity-level recall means that in more than 9 out of 10 searches, the exact function or class definition is in the returned content: not just the right file, but the right chunk. That holds even in the largest tested codebase, which has 67,000 indexed code chunks.

Why This Matters for AI-Assisted Development

Every time an AI coding assistant searches your codebase, it's making a bet on the results it gets back. If those results contain the right context, the assistant can make accurate edits, give correct explanations, and trace bugs to their source. If the results are close but wrong — the right file but the wrong function, or a usage site instead of the definition — the assistant is working with bad information.

The quality of code search is invisible until it fails. When it works, you don't notice it. When it doesn't, the assistant asks confusing questions, makes edits in the wrong place, or confidently does the wrong thing. The gap between 70% recall and 95% recall is the difference between an assistant that mostly works and one you can actually rely on.
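One way to see why that gap matters is to compound it over a task. The sketch below assumes, purely for illustration, that a task needs five searches and that each one succeeds independently at the stated recall. Both assumptions are ours rather than measured figures, and real searches within a task are unlikely to be fully independent.

```python
def task_success_rate(per_search_recall: float, searches_per_task: int) -> float:
    """Chance that every search in a task surfaces the right definition,
    assuming (hypothetically) that searches succeed independently."""
    return per_search_recall ** searches_per_task


SEARCHES_PER_TASK = 5  # illustrative assumption, not a measured figure

for recall in (0.70, 0.95):
    rate = task_success_rate(recall, SEARCHES_PER_TASK)
    print(f"{recall:.0%} recall over {SEARCHES_PER_TASK} searches: {rate:.0%} of tasks")
```

Under these assumptions, 70% per-search recall leaves about 17% of five-search tasks fully supported by correct context, while 95% leaves about 77%.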

Pyckle's search quality is the foundation everything else is built on. It's why the coding assistant powered by Pyckle can work accurately across large, real-world codebases — not just toy examples.


Previously: Why Code Search Is Harder Than It Looks

Why finding a function in a 67K-chunk codebase is a genuinely difficult problem — and what the failure modes look like.
