The Flat Index Problem Nobody Talks About

You ask your AI coding assistant where authentication is handled. It returns three files: a utility that parses JWT tokens, a migration script that adds columns to the users table, and a test file that mocks the login endpoint. All of them mention authentication. None of them is the actual authentication module.

The underlying problem is not the model. The problem is how the codebase was indexed.

What Flat Actually Means

Most codebase indexing follows a straightforward pattern. Walk the directory tree. Chunk each file into segments small enough to embed. Compute a vector for each chunk. Store everything in a vector database. At query time, find the chunks whose embeddings are closest to the query embedding. Send those chunks to the model.
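The pipeline above can be sketched in a few lines. This is a minimal illustration, not any particular tool's implementation: the bag-of-words `embed` function is a toy stand-in for a real embedding model, an in-memory list stands in for the vector database, and all function names are hypothetical.

```python
import math
import re
from collections import Counter

def chunk_text(text, max_lines=20):
    """Split a file's text into fixed-size windows of lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_flat_index(files):
    """files: dict of path -> source text. Returns a list of (path, chunk, vector)."""
    index = []
    for path, text in files.items():
        for chunk in chunk_text(text):
            index.append((path, chunk, embed(chunk)))
    return index

def query_flat_index(index, question, k=3):
    """Return the k chunks whose vectors are closest to the question's vector."""
    q = embed(question)
    ranked = sorted(index, key=lambda entry: cosine(q, entry[2]), reverse=True)
    return [(path, chunk) for path, chunk, _ in ranked[:k]]
```

Note what each index entry holds: a path, some text, and a vector. Nothing records which files import or call each other.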

This approach treats every chunk as an island. The chunk from auth/login.py has no explicit relationship to the chunk from auth/session.py in the index. They might embed similarly if they share vocabulary, but the index has no representation of the fact that one imports the other, or that they belong to the same module, or that modifying one frequently requires modifying the other.

Similarity is the only relationship the flat index knows. Two chunks are related if their vectors are close. Everything else—import graphs, call chains, inheritance hierarchies, module boundaries—exists in the code but not in the index.

The Three Failures

Flat indexing fails in predictable ways when applied to code.

Retrieval without import context. A function calls a helper defined in another file. The function's chunk embeds well against a query about its behavior. But the helper's implementation—which contains the actual logic—embeds differently because it uses different vocabulary. The assistant receives the caller without the callee. It describes what the function does based on its name and docstring, not based on what actually executes.

Similarity without call graph awareness. Two files both mention "payment processing." One is the production payment module. The other is an abandoned prototype in a subdirectory called experiments/. Nothing in the production code calls or imports the prototype, but a flat index has no record of that, so it cannot tell the two apart. If the prototype happens to embed closer to the query, it wins. The assistant confidently explains code that was never shipped.

Results from conflicting contexts. A query about error handling retrieves chunks from three different modules: the main error handler, a legacy error handler that was replaced but not deleted, and test fixtures that intentionally raise errors. The embeddings are similar. The implementations conflict. The assistant synthesizes advice that mixes patterns from incompatible contexts.

These failures share a root cause. The index lacks the structural information that would distinguish relevant code from code that merely looks similar.

Why Generic Embeddings Make It Worse

Generic embedding models are trained on large corpora. Those corpora are typically dominated by natural language: web pages, articles, documentation. Code appears in the training data, but it is rarely the primary domain.

A model trained on prose learns that similarity is semantic. "Handle the payment" and "process the transaction" are close because they mean the same thing. This is useful for document retrieval.

Code operates differently. handlePayment() and processTransaction() might be semantically similar in natural language terms, but in a codebase they might be entirely unrelated functions in different modules with different signatures and different behaviors. The name similarity is accidental. The structural relationship—or lack of one—is what matters.

Generic embeddings do not distinguish between the roles of the tokens in a chunk. Variable names, comments, string literals, and actual logic all feed into the same vector. A chunk with a detailed docstring might embed closer to a query than a chunk with sparse comments but relevant implementation. The docstring is not the code. A developer asking "where is X implemented" wants the implementation, not the documentation about the implementation.

The gap between prose retrieval and code retrieval is real. Tools that ignore it pay the cost in irrelevant results.

What Structure-Aware Retrieval Looks Like

The alternative to flat indexing is preserving structural relationships in the index itself.

Import edges are explicit in code. When auth/login.py imports auth/session.py, that relationship can be indexed alongside the chunks. A query about login functionality can retrieve not just the login code but also the modules it depends on—not because they embed similarly, but because the code explicitly declares a dependency.
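For Python sources, import edges can be read straight from the syntax tree. The sketch below uses the standard library's `ast` module; the function names are hypothetical, and resolving a module name such as `auth.session` to a file path such as auth/session.py is left out because it depends on the project layout.

```python
import ast

def extract_imports(source):
    """Return the sorted module names that a Python source string imports."""
    modules = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # import a, b.c  ->  "a", "b.c"
            modules.update(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports carry a level (number of leading dots)
            # and may have no module name at all: from . import x
            prefix = "." * node.level
            modules.add(prefix + (node.module or ""))
    return sorted(modules)

def build_import_edges(files):
    """files: dict of path -> source text. Returns dict of path -> imported modules."""
    return {path: extract_imports(source) for path, source in files.items()}
```

Stored next to the chunks, these edges let a retriever follow a declared dependency from a matched file to the modules it relies on.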

Call relationships are similarly extractable. Static analysis identifies which functions call which other functions. When a query matches a caller, the callee becomes a candidate for retrieval even if its embedding is distant. The call graph provides a signal that pure similarity cannot.
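A rough version of that extraction, again using Python's `ast` module, might look like the following. It is a sketch under simplifying assumptions: it only inspects top-level functions, it matches calls by bare name, and it does not resolve methods, aliases, or dynamic dispatch. The function names are hypothetical.

```python
import ast

def extract_calls(source):
    """Map each top-level function name to the names it calls directly."""
    graph = {}
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            callees = set()
            for inner in ast.walk(node):
                if isinstance(inner, ast.Call):
                    func = inner.func
                    if isinstance(func, ast.Name):
                        callees.add(func.id)       # helper(x)
                    elif isinstance(func, ast.Attribute):
                        callees.add(func.attr)     # obj.helper(x)
            graph[node.name] = sorted(callees)
    return graph

def expand_with_callees(matches, graph):
    """Add the direct callees of matched functions as retrieval candidates."""
    expanded = list(matches)
    for name in matches:
        for callee in graph.get(name, []):
            if callee not in expanded:
                expanded.append(callee)
    return expanded
```

With a graph like this, a query that matches `login` can pull in the helper it calls, even when the helper's own text looks nothing like the query.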

Module boundaries provide yet another layer. Files in the same directory or package are more likely to be contextually related than files across the codebase. A query about authentication should prioritize results from auth/ over results from utils/, even if a utility function happens to mention authentication in a comment.
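One simple proxy for module proximity is how many leading directories two paths share. The sketch below assumes an "anchor" path that represents the query context, for example the file open in the editor or the top similarity hit; that anchor, and both function names, are assumptions made for illustration.

```python
from pathlib import PurePosixPath

def shared_prefix_depth(path_a, path_b):
    """Count the leading directory components two file paths have in common."""
    parts_a = PurePosixPath(path_a).parent.parts
    parts_b = PurePosixPath(path_b).parent.parts
    depth = 0
    for a, b in zip(parts_a, parts_b):
        if a != b:
            break
        depth += 1
    return depth

def rank_by_module_proximity(anchor_path, candidate_paths):
    """Order candidates so files nearest the anchor's directory come first.

    sorted() is stable, so candidates at the same depth keep their input order.
    """
    return sorted(
        candidate_paths,
        key=lambda path: shared_prefix_depth(anchor_path, path),
        reverse=True,
    )
```

Directory depth is a crude signal and would normally be one input among several, not a ranking on its own.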

None of this replaces semantic similarity. It augments it. The best results come from combining embedding distance with structural proximity: chunks that are both semantically relevant and structurally connected to the query context.
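One way to picture the combination is a weighted blend of the two signals. The formula, the `alpha` weight, and the use of graph hops as the measure of structural proximity are all illustrative assumptions, not a description of how any specific tool scores results.

```python
def combined_score(similarity, hops, alpha=0.7):
    """Blend embedding similarity with structural proximity.

    similarity: cosine similarity to the query, assumed to lie in [0, 1].
    hops: import or call graph distance from the query context,
          or None when the chunk is not connected to it at all.
    alpha: weight on similarity; 1.0 reduces to a flat index.
    """
    structural = 0.0 if hops is None else 1.0 / (1 + hops)
    return alpha * similarity + (1 - alpha) * structural

def rerank(candidates, alpha=0.7):
    """candidates: list of (path, similarity, hops). Highest blended score first."""
    return sorted(
        candidates,
        key=lambda c: combined_score(c[1], c[2], alpha),
        reverse=True,
    )
```

With these example numbers, a disconnected prototype at similarity 0.82 ranks below a production file at similarity 0.78 that sits one hop from the query context. Setting `alpha=1.0` restores the similarity-only ordering.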

The Personalization Angle

Every codebase has its own vocabulary.

One team calls it handlePaymentFlow. Another calls it processCheckout. A third calls it executeTransaction. These are the same concept under different names in different codebases. A generic embedding model has no way to know which name your codebase uses, so the distance between that name and a query about payments depends on surface wording rather than on how your code uses it.

Domain terms compound the problem. A fintech codebase uses "ledger" to mean something specific. An e-commerce codebase uses "cart" everywhere. A healthcare application has "encounter" as a core concept. Generic embeddings do not know that "ledger" in this codebase is as central as "database" in another.

Personalized embeddings learn your vocabulary. They train on your code, your naming conventions, your patterns. The distance between "ledger" and a query about financial records decreases because the model has seen how your codebase uses that term.

This is not cosmetic. Retrieval quality depends on the embedding space matching the semantics of the target domain. A model adapted to your codebase can represent it more accurately than one that has to generalize across all code everywhere.

The Practical Test

You can audit your current tool with a simple test.

Find a file in your codebase that imports several other modules. Ask your AI assistant a question about that file—specifically, a question whose answer requires understanding what happens in one of the imported modules.

Does the assistant surface the dependency? Does it understand that the imported module is part of the context? Or does it answer based only on the importing file, missing the logic that actually executes?

Try a second test. Ask about a concept that appears in multiple places: production code, test files, deprecated modules. Does the assistant distinguish between them? Does it know that code in tests/ is not production logic? Does it recognize that a file in archived/ might not reflect current behavior?
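The distinction this second test probes can be recorded at index time by tagging each chunk with a role derived from its path. The directory names and role labels below are hypothetical conventions; a real repository would need its own mapping.

```python
from pathlib import PurePosixPath

# Hypothetical directory-name conventions; adjust to the repository's layout.
ROLE_BY_DIRECTORY = {
    "tests": "test",
    "test": "test",
    "archived": "archived",
    "experiments": "experimental",
}

def classify_path(path):
    """Return a coarse role for a file based on the directories it lives under."""
    for part in PurePosixPath(path).parent.parts:
        role = ROLE_BY_DIRECTORY.get(part)
        if role:
            return role
    return "production"

def filter_production(paths):
    """Keep only the paths that are not tagged as test, archived, or experimental."""
    return [path for path in paths if classify_path(path) == "production"]
```

A retriever could use these tags to filter or down-rank non-production chunks unless the query explicitly asks about tests or legacy code.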

These tests reveal whether your indexing captures structure or ignores it.

The answers determine whether you are working with a tool that understands code or a tool that retrieves text. The difference is not academic. It shows up every time the assistant misses a dependency, conflates test code with production code, or confidently describes behavior from a function that was deprecated two years ago.

Flat indexing works for documents. Code is not a document. Code has structure. That structure carries meaning. An index that ignores it loses information that matters.

Ready to try structure-aware retrieval?

Try Pyckle free — model-agnostic context for your editor