---
title: "Polyglot and Multi-Language Codebases with AI"
subtitle: "Search, Retrieval, and Understanding When Your Stack Speaks Four Languages"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Senior engineers and architects working in large multi-language codebases — dealing with Python, TypeScript, Go, Java, SQL, and infrastructure-as-code side by side"
estimated_pages: 70
chapters:
  - "The Polyglot Reality"
  - "Language-Specific Chunking Strategies"
  - "Embedding Across Languages"
  - "Cross-Language Dependency Tracing"
  - "Search and Retrieval in Mixed Codebases"
  - "Index Architecture for Polyglot Systems"
  - "Tooling and Pipeline Design"
  - "Common Failure Modes"
tags:
  - pyckle
  - ebook
  - polyglot
  - multi-language
  - code-search
  - architecture
  - large-codebases
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Polyglot and Multi-Language Codebases with AI

## Search, Retrieval, and Understanding When Your Stack Speaks Four Languages

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: The Polyglot Reality
- Chapter 2: Language-Specific Chunking Strategies
- Chapter 3: Embedding Across Languages
- Chapter 4: Cross-Language Dependency Tracing
- Chapter 5: Search and Retrieval in Mixed Codebases
- Chapter 6: Index Architecture for Polyglot Systems
- Chapter 7: Tooling and Pipeline Design
- Chapter 8: Common Failure Modes
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

Most AI coding tools were built with a single-language repo in mind. They chunk Python files. They index TypeScript. They do fine when the whole codebase speaks one dialect. The moment you add a second language — let alone four or five — the seams start to show.

This guide is for engineers and architects who work inside the kind of system that has evolved over years into something genuinely polyglot: a Python ML pipeline feeding a Go API, which reads from a PostgreSQL schema defined by migration files, deployed by Terraform, and consumed by a TypeScript frontend. These aren't edge cases. They're the norm at any company that's been building for more than three years and has shipped something real.

The specific problem this guide addresses is retrieval: how do you build AI-assisted search, navigation, and understanding over a codebase that speaks four languages? How do you chunk code that doesn't follow the same structure rules? How do you embed across a semantic space that was never designed to be shared? How do you trace a dependency that crosses a language boundary — from a TypeScript API call through a Go handler into a SQL schema — without losing the thread?

These are the questions this guide answers. Not in the abstract. With enough specificity that you can go build something after reading it.

The examples throughout use realistic code. The architectural diagrams describe real tradeoffs. The failure modes are drawn from systems that actually broke.

This guide assumes you're comfortable with vector embeddings, basic retrieval-augmented generation (RAG) concepts, and the general shape of language models. It does not spend time explaining what an embedding is from first principles. It does spend time on what goes wrong when you apply those concepts naively to a polyglot environment — and how to do it better.

---

## Chapter 1: The Polyglot Reality

### How codebases actually grow

Nobody designs a polyglot codebase. They grow into one. The original prototype was Python because the team knew Python. Then Go came in because the API team had latency requirements Python couldn't meet. Then the data warehouse grew its own SQL dialect. Then infrastructure went into Terraform because the DevOps hire came from a shop that used it. Then the frontend arrived in TypeScript because React was the natural choice.

Each of these decisions made sense in isolation. Together they create a system where a single feature touches five languages before it reaches the user. A login flow might start in TypeScript, hit a Go authentication service, call a Python ML model for fraud scoring, log an event via a Kafka schema defined in Avro, and write a record to a PostgreSQL table described by a migration file in plain SQL.

That's not an unusual system. It's a Tuesday at a mid-sized company with a few years of accumulated decisions behind it.

The problem isn't that these systems exist. The problem is that every tool built to help engineers navigate them was built for a simpler world.

### What breaks when AI tools hit polyglot codebases

The most common failure is invisible: the tool works fine for the dominant language, and engineers assume it works equally well across the board. It doesn't. A system that indexes Python beautifully might produce nearly useless results for Go code. The cause isn't that Go is hard to index. The chunking logic was tuned for Python's indentation-based structure, so Go's brace-delimited blocks come out too large, too small, or both, depending on where the splitter happened to break.

SQL is worse. To most text splitters, a migration file that defines a critical table schema is an undifferentiated wall of `CREATE TABLE` statements: the statement terminators that mark its real boundaries mean nothing to a splitter that counts lines or characters. The splitter cuts it arbitrarily. The embedding captures a fragment. Queries about the schema return garbage.

Terraform and other infrastructure-as-code formats present a different problem. The semantic unit in a Terraform file is a resource block — but resource blocks can span dozens of lines, reference variables defined in other files, and have meaning that only becomes clear in the context of the module structure around them. A naive splitter treats a Terraform file like prose and produces chunks that include the first half of one resource and the second half of another.

TypeScript has its own flavor of difficulty. It's not that TypeScript is hard to parse — it's that TypeScript projects tend to have large amounts of generated code, type declaration files, barrel exports, and vendor code mixed in with the actual application logic. Without filtering, these end up in the index and pollute search results.

**Key Insight**: The single biggest mistake in polyglot code indexing is treating all languages with the same chunking and weighting strategy. The structure of meaning differs by language. What constitutes a "unit" of code in Python is structurally different from what constitutes a unit in Go, SQL, or Terraform.

### The retrieval problem is actually three problems

When engineers think about AI-assisted code navigation in a polyglot codebase, they're usually thinking about one of three distinct problems, often without distinguishing between them.

The first is **search**: given a natural language query, find the code that's relevant to it. "Where is the fraud scoring logic?" should return the Python model inference code, not a TypeScript component that mentions fraud in a comment.

The second is **dependency tracing**: given a piece of code, understand what it connects to — both in its own language and across language boundaries. The Go handler that calls the Python model doesn't just have a Go dependency graph. It has a cross-language dependency that standard tools won't surface.

The third is **comprehension assistance**: given a complex piece of code, help an engineer understand it — in context, including the context that exists in other languages. The SQL schema context matters when you're trying to understand a Go query. The Terraform resource definition matters when you're debugging an API timeout.

These three problems have different technical solutions. Conflating them leads to systems that half-solve each problem instead of fully solving any.

### The scale problem compounds everything

Most demonstrations of AI code search work on repositories of a few hundred files. Production polyglot systems often have tens of thousands. At that scale, index quality degrades unless it's actively managed. Dead code accumulates. Generated files inflate the index. Language distribution shifts as the system evolves. What was 60% Python and 20% Go two years ago might now be 40% TypeScript, 30% Python, 15% Go, and 15% everything else.

An index built once and never maintained becomes less useful over time in a way that's hard to notice until it's become actively misleading. Search results that used to be reliable start returning outdated code. Cross-language traces break because the index hasn't been updated to reflect a migration.

**Warning**: Index freshness is an operational concern, not a one-time setup problem. A stale index in a polyglot codebase doesn't just return old results — it returns results from a language distribution that may no longer reflect reality.

### What a good polyglot system looks like

The target state is a retrieval system that:

1. Chunks each language according to its own structural semantics, not a generic text splitter.
2. Embeds code in a way that preserves cross-language semantic similarity — so that a TypeScript `authenticate()` call and the Go `Authenticate()` handler it resolves to end up close together in embedding space.
3. Maintains an explicit cross-language dependency graph, separate from the vector index, that can be traversed when tracing connections across language boundaries.
4. Filters generated, vendored, and boilerplate code from the index so it doesn't dilute results.
5. Updates incrementally, so that changes to the codebase propagate to the index without requiring a full rebuild.
6. Surfaces language context alongside results — not just "here's the code" but "this is Go code, and it's called from this TypeScript path, and it writes to this SQL schema."

None of these are individually hard. The complexity is in getting them all to work together in a way that's maintainable and reliable. The rest of this guide covers each component in depth.

### Key Takeaways

- Polyglot codebases are the product of accumulated rational decisions, not poor planning. The retrieval problem they create is real and non-trivial.
- Most AI code tools fail silently in polyglot environments — they appear to work but return skewed or incomplete results for non-dominant languages.
- The retrieval problem in polyglot codebases is three distinct problems: search, dependency tracing, and comprehension assistance.
- Index freshness is an ongoing operational concern, not a one-time setup task.
- Good polyglot retrieval requires language-aware chunking, cross-language embedding, and explicit dependency graphs — not just a better text splitter.

**Try This**: Pick a feature in your current codebase that touches at least three languages. Trace it manually from the entry point to the data store. Count how many language boundaries you cross. Each boundary is a place where a naive retrieval system can lose the thread.

---

## Chapter 2: Language-Specific Chunking Strategies

### Why chunking is the foundation

Chunking is where most polyglot retrieval systems fail first, and the failure is often never diagnosed. The system gets built, the embeddings get generated, and nobody notices that the chunks are bad because the search results are "pretty good" for the dominant language. The minority languages quietly return mediocre results, engineers lose trust in search for those files, and stop using it there — but nobody files a bug.

The purpose of a chunk is to create a unit of code that is semantically coherent, embeddable, and retrievable. Those three properties are in tension. A coherent unit might be too large to embed well (embeddings lose resolution at long lengths). An embeddable unit might be too small to be retrievable in a useful way (a single line rarely tells you anything). Getting the balance right requires understanding what "semantic coherence" means for each language.

### Python

Python's structure is explicit and regular. Functions and classes are the natural chunk boundaries, and they're easy to identify with an AST parser. The `ast` module in Python's standard library makes this straightforward:

```python
import ast

def extract_chunks(source: str, filepath: str) -> list[dict]:
    tree = ast.parse(source)
    chunks = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk_source = ast.get_source_segment(source, node)
            chunks.append({
                "type": type(node).__name__,
                "name": node.name,
                "lineno": node.lineno,
                "source": chunk_source,
                "filepath": filepath,
                "language": "python"
            })

    return chunks
```

A few Python-specific wrinkles. Decorators are semantically attached to the function or class below them, but since Python 3.8 a node's `lineno` points at the `def` or `class` line, so `ast.get_source_segment(source, node)` leaves decorators out unless you extend the range back to the first entry in `node.decorator_list`. Docstrings are part of the function body and should stay with the chunk, since they carry important semantic content. Module-level code (not inside any function or class) often defines configuration, global state, or important constants that deserve their own chunks.

For classes, you have a choice: chunk the class as a whole, or chunk each method individually. Both are defensible. Chunking methods individually produces smaller, more focused embeddings. Chunking the class as a whole preserves context about what the methods belong to — which matters when someone searches for "authentication class" versus "method that validates a token." A pragmatic approach is to do both: index the class-level docstring and signature as one chunk, and each method as a separate chunk with a reference back to its containing class.

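**Key Insight**: Python's AST is reliable, well-documented, and handles the full grammar including edge cases like walrus operators, match statements, and type aliases, provided the interpreter running the indexer is at least as new as the syntax it parses (`match` needs 3.10, `type` aliases need 3.12). Use it. Don't try to chunk Python with regex or line-counting heuristics, because they will break on real code.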

### TypeScript and JavaScript

TypeScript is structurally richer than Python and correspondingly harder to parse at the AST level. The TypeScript Compiler API exposes a full AST, but it's verbose and stateful. For chunking purposes, tree-sitter is a better choice — it handles TypeScript reliably, is language-agnostic (so the same infrastructure works for Go, Python, and others), and is fast enough to run on large codebases.

The natural chunk boundaries in TypeScript are: function declarations, arrow functions (when assigned to a named variable or exported), class declarations, interface declarations, type aliases, and React components. The last one deserves special attention. A React functional component is syntactically an arrow function or function declaration, but it has different retrieval semantics — someone searching for "user profile component" wants component-level results, not function-level results.

Detecting components isn't hard: look for JSX return statements, look for names that start with a capital letter, look for hooks (`useState`, `useEffect`). A simple heuristic catches most cases:

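```scheme
; tree-sitter query for React components (run against the TSX grammar).
; Matches capitalized names whose function returns JSX, bare or parenthesized.
; Arrow functions with a block body (`() => { return <div />; }`) need a
; third pattern built the same way.
(function_declaration
  name: (identifier) @name
  (#match? @name "^[A-Z]")
  body: (statement_block
    (return_statement
      [(jsx_element)
       (jsx_self_closing_element)
       (parenthesized_expression
         [(jsx_element) (jsx_self_closing_element)])]))) @component

(lexical_declaration
  (variable_declarator
    name: (identifier) @name
    (#match? @name "^[A-Z]")
    value: (arrow_function
      body: [(jsx_element)
             (jsx_self_closing_element)
             (parenthesized_expression
               [(jsx_element) (jsx_self_closing_element)])]))) @component
```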

Generated code is a significant problem in TypeScript projects. `node_modules`, `.d.ts` declaration files, files ending in `.generated.ts`, and GraphQL schema type files tend to be large, repetitive, and useless from a retrieval perspective. They also inflate the index substantially. A TypeScript project with a large generated GraphQL schema can easily have more generated lines than hand-written lines. Filter aggressively.

**Warning**: TypeScript barrel files (`index.ts` that only re-exports from other files) should either be excluded from chunking entirely or chunked in a way that makes clear they're re-export proxies. Treating them as substantive code produces misleading results when someone searches for a specific export.
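A pre-chunking filter for the cases above might look like the following sketch. The directory names, suffixes, and the single-line re-export pattern are assumptions to adapt to your repository; multi-line `export { ... } from` statements are not recognized by this heuristic, so such files would be treated as ordinary code.

```python
import re
from pathlib import PurePosixPath

# Assumed patterns; adjust to your repo's conventions.
EXCLUDED_DIRS = {"node_modules", "dist", "build", "vendor", "__generated__"}
EXCLUDED_SUFFIXES = (".d.ts", ".generated.ts", ".min.js")

# A line that only re-exports: `export * from "./x"` or `export { a } from "./x"`.
REEXPORT_LINE = re.compile(
    r"""^export\s+(type\s+)?(\*(\s+as\s+\w+)?|\{[^}]*\})\s+from\s+['"][^'"]+['"];?$"""
)


def is_excluded_path(path: str) -> bool:
    """True for vendored or generated files that should never be indexed."""
    p = PurePosixPath(path)
    if any(part in EXCLUDED_DIRS for part in p.parts):
        return True
    return p.name.endswith(EXCLUDED_SUFFIXES)


def is_barrel_file(path: str, source: str) -> bool:
    """True when an index.ts/index.tsx file contains nothing but re-exports."""
    if PurePosixPath(path).name not in {"index.ts", "index.tsx"}:
        return False
    lines = [ln.strip() for ln in source.splitlines()]
    code = [ln for ln in lines if ln and not ln.startswith("//")]
    return bool(code) and all(REEXPORT_LINE.match(ln) for ln in code)
```

Files flagged by `is_barrel_file` can be skipped outright or indexed with a chunk type that marks them as re-export proxies.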

### Go

Go's structure is clean and consistent in ways that make chunking straightforward. The standard library handles parsing: `go/parser` builds the syntax tree and `go/ast` defines its node types. Functions, methods, and struct definitions are the natural boundaries. Interface definitions deserve their own chunks: Go's implicit interface system means that an interface definition carries semantic information about what types satisfy it, and that information is often what engineers are searching for.

One Go-specific consideration: error handling is ubiquitous and visually noisy. Almost every function has repeated `if err != nil` blocks. These are not semantically meaningful for retrieval purposes, but they do expand chunk sizes and dilute embedding quality. You can't remove them from the chunk (the code wouldn't make sense without them), but you can be aware that Go chunks will tend to be longer than equivalent Python chunks due to explicit error handling.

Goroutines and channels create a different chunking challenge. A function that launches goroutines may have semantically important logic distributed across several closures. The outer function and its goroutines are related, but they're syntactically separate. Chunking the outer function as a unit (including its closures) usually produces better retrieval results than splitting them:

```go
// This should be one chunk, not three
func ProcessBatch(ctx context.Context, items []Item) error {
    results := make(chan Result, len(items))

    for _, item := range items {
        go func(i Item) {
            result, err := processOne(ctx, i)
            if err != nil {
                results <- Result{Err: err}
                return
            }
            results <- result
        }(item)
    }

    // collect results...
}
```

Go modules and packages provide important metadata that should be attached to every chunk from that package. The package name, module path, and file path together give enough context to reconstruct where a chunk lives without having to retrieve the file.

### SQL

SQL breaks most generic chunking strategies because it has almost no syntactic markers of semantic boundaries in the way procedural languages do. A migration file is a flat sequence of DDL and DML statements. The only structural markers are statement terminators (semicolons), `BEGIN`/`COMMIT` blocks, and sometimes comments.

The right approach is to chunk on statement boundaries rather than line counts:

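```python
import sqlglot
from sqlglot import exp

def chunk_sql(source: str, filepath: str) -> list[dict]:
    # One parsed expression per statement
    statements = sqlglot.parse(source, dialect="postgres")
    chunks = []

    for stmt in statements:
        if stmt is None:
            continue

        stmt_type = type(stmt).__name__

        # Extract the table/view name if available. For CREATE TABLE the
        # target is wrapped in a Schema node, so search for the first Table
        # (breadth-first, so the statement's own target comes before any
        # tables referenced deeper in the tree).
        table = stmt.find(exp.Table)
        name = table.name if table is not None else None

        chunks.append({
            "type": stmt_type,
            "name": name,
            # Regenerate in the same dialect the statement was parsed with
            "source": stmt.sql(dialect="postgres", pretty=True),
            "filepath": filepath,
            "language": "sql"
        })

    return chunks
```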

`sqlglot` handles multiple SQL dialects (PostgreSQL, MySQL, BigQuery, SQLite) and normalizes them to a consistent AST. This matters in polyglot systems where different services may use different database backends.

For schema definitions specifically — `CREATE TABLE`, `CREATE INDEX`, `CREATE VIEW` — the chunk should include the full statement including all column definitions and constraints. Never split a table definition. A half-table definition is useless for retrieval. The full table definition is exactly what someone searching for "the users table schema" needs to see.

Stored procedures and functions in SQL deserve special treatment. They're procedural code embedded in a declarative language. Chunk them as units, include the full body, and tag them with `sql_procedure` or `sql_function` so they can be filtered separately from schema definitions.

**Key Insight**: SQL migration files often contain comments that describe *why* a migration was made — the business reason, the bug being fixed, the feature being shipped. These comments carry retrieval value that the SQL statements themselves don't. Include them in the chunk.

### Terraform and Infrastructure-as-Code

Terraform's HCL syntax organizes code into resource blocks, data blocks, variable declarations, output declarations, and module calls. Each of these is a natural chunk boundary. The `resource` block is the most important: it defines a concrete piece of infrastructure and has a type and a name that together constitute a unique identifier.

The challenge with Terraform is cross-file references. A resource in `main.tf` might reference a variable defined in `variables.tf`, a local defined in `locals.tf`, and a data source defined in `data.tf`. A chunk of the resource block alone is incomplete — it contains references that only resolve in the context of the module as a whole.

There are two reasonable approaches. The first is to chunk resource blocks individually but inject their resolved variable values as metadata at index time. This requires actually parsing the Terraform module and resolving references. `terraform show -json`, run against a plan or state file, emits resolved values, while the `python-hcl2` library (imported as `hcl2`) parses HCL but leaves references such as `var.function_name` unresolved. Either route adds significant complexity to the pipeline.

The second approach, which is simpler and often good enough, is to chunk at the module level rather than the resource level for small modules, and at the resource level with file-path context for large modules. A module with ten resources fits in a single chunk. A module with fifty resources needs to be split, and the resource type and name provide enough context for useful retrieval even without resolved variables.

```hcl
# This should be one chunk
resource "aws_lambda_function" "fraud_scorer" {
  function_name = var.function_name
  role          = aws_iam_role.lambda_exec.arn
  handler       = "main.handler"
  runtime       = "python3.11"

  environment {
    variables = {
      MODEL_BUCKET = var.model_bucket
      THRESHOLD    = var.fraud_threshold
    }
  }

  vpc_config {
    subnet_ids         = var.private_subnet_ids
    security_group_ids = [aws_security_group.lambda.id]
  }
}
```

The resource type (`aws_lambda_function`) and logical name (`fraud_scorer`) should both be metadata attached to the chunk, not just embedded in the text. They enable exact-match lookups that bypass the embedding entirely — which is faster and more reliable for known-name lookups.

### Overlapping and Context Windows

Regardless of language, one structural decision applies universally: whether to use overlapping chunks. Overlapping means each chunk includes some content from the previous chunk — typically a few lines of context. The argument for overlap is that function signatures and class declarations at the start of a chunk provide context that improves embedding quality.

In practice, for well-structured code with meaningful names, the benefit is modest and the cost is index inflation. A better approach is to attach a "context header" to each chunk: the containing class name, the file path, the package or module name. This gives the embedding model the contextual signal it needs without duplicating code.
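A context header can be as simple as a few comment-style lines prepended to the chunk text before embedding. The sketch below assumes chunks are dictionaries like the ones produced earlier in this chapter; the `package` and `parent_class` fields are hypothetical names for whatever your chunker records.

```python
def with_context_header(chunk: dict) -> str:
    """Build the text sent to the embedding model: a short header, then code."""
    header_fields = [
        ("language", chunk.get("language")),
        ("file", chunk.get("filepath")),
        ("package", chunk.get("package")),      # hypothetical field
        ("class", chunk.get("parent_class")),   # hypothetical field
    ]
    # Skip fields the chunker did not populate.
    header = "\n".join(
        f"# {key}: {value}" for key, value in header_fields if value
    )
    return f"{header}\n\n{chunk['source']}" if header else chunk["source"]
```

The header costs a handful of tokens per chunk, which is far cheaper than repeating several lines of neighboring code in every chunk.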

### Key Takeaways

- Chunking must be language-specific. A single splitting strategy applied uniformly will produce poor results for at least some languages in a polyglot stack.
- Use AST-based chunking for Python and Go. Use tree-sitter for TypeScript. Use statement-level parsing (sqlglot) for SQL. Use block-level parsing for Terraform.
- Filter generated code, vendored code, and barrel files before chunking. They inflate the index without adding retrieval value.
- Attach metadata — language, file path, containing class or module, chunk type — to every chunk. This enables both semantic search and structured filtering.
- SQL comments often carry more retrieval-relevant information than the SQL statements themselves. Don't strip them.

**Try This**: Take a representative sample of 20 files from each language in your codebase. Apply your current chunking strategy and inspect the output manually. Look for chunks that cross function boundaries, chunks that are less than 5 lines, and chunks that exceed 200 lines. Each of these is a signal that the chunking strategy needs adjustment for that language.
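A small audit script makes the manual inspection easier to repeat. This sketch assumes each chunk is a dictionary with `language` and `source` keys, as in the examples above, and uses the 5-line and 200-line thresholds from the exercise as defaults. It does not detect chunks that cross function boundaries; that still needs a manual look.

```python
from collections import defaultdict


def audit_chunk_sizes(chunks: list[dict], min_lines: int = 5,
                      max_lines: int = 200) -> dict:
    """Count chunks per language that fall outside the expected size range."""
    report = defaultdict(lambda: {"total": 0, "too_small": 0, "too_large": 0})
    for chunk in chunks:
        n_lines = len(chunk["source"].splitlines())
        stats = report[chunk["language"]]
        stats["total"] += 1
        if n_lines < min_lines:
            stats["too_small"] += 1
        elif n_lines > max_lines:
            stats["too_large"] += 1
    return dict(report)
```

A language where a large share of chunks lands in `too_small` or `too_large` is a likely candidate for a different chunking strategy.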

---

## Chapter 3: Embedding Across Languages

### The semantic alignment problem

Embeddings encode meaning as a position in a high-dimensional vector space. For code, that means similar code — code that does similar things, uses similar patterns, or expresses similar concepts — should end up close together. For a single language, this works reasonably well with modern code-focused embedding models. The problem in a polyglot system is that "close together" loses meaning when the model was trained primarily on one language.

Consider the simplest possible case: the same authentication logic implemented in both Python and Go. A developer who understands both would recognize immediately that these are doing the same thing. A well-trained multilingual code embedding model should place them close together in vector space. But most general-purpose embedding models, trained predominantly on English text and code samples that skew heavily toward Python and JavaScript, will place them further apart than two implementations of different logic in the same language.

This isn't a theoretical concern. It shows up in retrieval: someone searching for "token validation logic" in a Python-dominant index gets Python results even when the most relevant implementation is in Go.

### Choosing an embedding model for polyglot code

The embedding model selection decision has more leverage than almost any other architectural choice in a polyglot retrieval system. The right model dramatically improves cross-language retrieval quality. The wrong model makes cross-language retrieval essentially useless.

The key properties to look for are:

**Multi-language code training data**: The model should have been explicitly trained on code from multiple languages, not just fine-tuned on Python with some Go thrown in. Models trained on The Stack dataset, CodeSearchNet, or similar multi-language corpora tend to generalize better.

**Token vocabulary coverage**: Some embedding models use a vocabulary that's heavily optimized for Python syntax. Go's type assertions (`val.(type)`), TypeScript's angle-bracket generics, or SQL's `FROM...JOIN...ON` patterns may be poorly tokenized, which degrades embedding quality for those languages.

**Context window**: Longer context windows allow larger chunks, which allows more complete semantic units. A model with a 512-token context window will struggle with a 150-line Go function. A model with a 4096-token context window handles it comfortably.

**Cross-lingual capability**: This is distinct from multi-language code training. Cross-lingual capability means the model places semantically equivalent code from different languages close together in vector space. Not all models trained on multi-language code have this property — some learn per-language representations that live in different regions of the embedding space.

Current strong choices as of 2026 include code-specialized models from the voyage-code and jina-code families. General-purpose multilingual models such as multilingual-e5-large are also worth testing: despite not being code-specialized, they can outperform code-specialized models on cross-language retrieval tasks because their training explicitly optimizes for cross-lingual alignment. Keep in mind that multilingual-e5-large truncates input at 512 tokens, so the context-window constraint above applies to it directly.

**Warning**: Benchmark results for code embedding models are usually reported on single-language or code-search tasks. They rarely measure cross-language retrieval alignment directly. Before committing to an embedding model for a polyglot system, run your own benchmark: take 20-30 pairs of semantically equivalent code snippets from different languages in your codebase, embed them all, and measure whether equivalent pairs are closer in vector space than non-equivalent pairs.
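One way to run that benchmark is sketched below. It assumes you have already embedded each pair with the candidate model and takes plain lists of floats, so it is independent of any particular embedding API. The two metrics (`top1_accuracy`, `mean_margin`) are illustrative choices rather than a standard, and the function needs at least two pairs.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def alignment_report(pairs: list[tuple[list[float], list[float]]]) -> dict:
    """pairs[i] = (embedding in language A, embedding of its equivalent in B)."""
    hits, margins = 0, []
    for i, (vec_a, _) in enumerate(pairs):
        # Compare this snippet against every candidate in the other language.
        sims = [cosine(vec_a, vec_b) for _, vec_b in pairs]
        best_wrong = max(s for j, s in enumerate(sims) if j != i)
        hits += sims[i] > best_wrong
        margins.append(sims[i] - best_wrong)
    return {
        # Share of snippets whose true equivalent is the nearest neighbor.
        "top1_accuracy": hits / len(pairs),
        # Average gap between the true pair and the best non-equivalent.
        "mean_margin": sum(margins) / len(margins),
    }
```

A model with high top-1 accuracy but a mean margin near zero is ranking the right answer first only barely, which tends not to survive a larger index.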

### The identifier alignment advantage

One underappreciated property of code is that identifiers — function names, class names, variable names — tend to be consistent across languages. An API response type named `UserProfile` in TypeScript and the Go struct named `UserProfile` that it serializes from are semantically connected, and that connection is partially captured by shared identifiers.

This means that code embeddings for technical codebases often have better cross-language alignment than you'd expect from general cross-lingual benchmarks, because the naming conventions partially override language-specific syntax. A codebase with consistent naming conventions benefits more from this effect than a codebase where the same concept is called `user_profile` in Python, `UserProfile` in Go, and `IUserProfile` in TypeScript.

This has a practical implication: if you're building a polyglot system and have any influence over naming conventions, pushing for consistency across languages will improve retrieval quality. This is a rare case where a code style decision has a direct, measurable impact on AI tooling effectiveness.
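Where renaming isn't an option, part of the overlap can be recovered at index time by reducing identifiers to style-independent tokens for the keyword side of the index. The following is a heuristic sketch, not a complete tokenizer: the function name is hypothetical, and stripping a leading `I` assumes the TypeScript interface-prefix convention from the example above, which will misfire on some real names.

```python
import re


def identifier_tokens(name: str) -> tuple[str, ...]:
    """Split an identifier into lowercase word tokens, ignoring naming style."""
    # Drop a leading interface-style "I" (IUserProfile); an assumed convention.
    if re.match(r"I[A-Z][a-z]", name):
        name = name[1:]
    words: list[str] = []
    # Split on underscores and hyphens, then on camelCase and acronym boundaries.
    for part in re.split(r"[_\-]+", name):
        words += re.findall(
            r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|\d+", part
        )
    return tuple(w.lower() for w in words)
```

With this normalization, `user_profile`, `UserProfile`, and `IUserProfile` all reduce to the same token sequence, so a keyword match on one finds the others.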

### Hybrid search and keyword anchoring

Pure vector search has a well-known weakness: it degrades on rare or technical terms. An exact identifier like `handleFraudScoreCallback` may not surface anywhere near the top of a pure semantic search, because the embedding model has never seen this identifier as a unit and can't align it with anything meaningful.

Hybrid search — combining vector similarity with BM25 keyword scoring — solves this. BM25 handles exact matches, rare identifiers, and technical jargon. The vector component handles semantic similarity and natural language queries. Combining them with reciprocal rank fusion (RRF) or a linear combination produces better results than either alone, particularly for the kind of queries engineers actually ask: sometimes "authentication middleware," sometimes `handleAuthCallback`.

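```python
def hybrid_search(query: str, index, top_k: int = 20, rrf_k: int = 60) -> list[dict]:
    # Vector search. The call shape follows Chroma's collection.query(), which
    # returns a dict of parallel lists with one inner list per query.
    # embed() is a placeholder for your embedding function.
    query_embedding = embed(query)
    vector_results = index.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        include=["documents", "metadatas", "distances"]
    )
    vector_ids = vector_results["ids"][0]

    # BM25 keyword search. bm25_search() is a placeholder for whatever keyword
    # backend runs alongside the vector store; it is assumed to return ranked
    # dicts with an "id" key.
    keyword_ids = [r["id"] for r in index.bm25_search(query, top_k=top_k)]

    # Reciprocal rank fusion: each list contributes 1 / (k + rank), rank from 1
    scores: dict[str, float] = {}
    for ranked_ids in (vector_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (rrf_k + rank)

    # Sort by combined score and return top results.
    # fetch_chunk() is a placeholder that loads a chunk by id.
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [fetch_chunk(doc_id) for doc_id in sorted_ids[:top_k]]
```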

The RRF constant (60 in the formula above) controls how steeply a document's contribution falls off with rank. Lower values weight top results more heavily. 60 is the standard empirical choice; in practice, you rarely need to tune it.
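To get a feel for what the constant does, it helps to compare how much more a first-place result contributes than a tenth-place one at different values. The helper names below are made up for this illustration.

```python
def rrf_contribution(rank: int, k: int = 60) -> float:
    """Score a single ranked list contributes for a document at a 1-based rank."""
    return 1.0 / (k + rank)


def top_to_tenth_ratio(k: int) -> float:
    """How many times more rank 1 contributes than rank 10 for a given k."""
    return rrf_contribution(1, k) / rrf_contribution(10, k)


for k in (1, 10, 60):
    print(f"k={k}: rank 1 counts {top_to_tenth_ratio(k):.2f}x as much as rank 10")
```

At `k=60` the ratio is 70/61, about 1.15, so a document that appears mid-list in both rankings can outscore one that tops a single list. At `k=1` the ratio is 5.5, and a first-place hit in either list dominates.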

### Per-language embedding weighting

In a polyglot index, the distribution of chunks by language is usually uneven. A codebase that's 50% Python and 10% Go by line count might be 70% Python and 5% Go by chunk count (because Python tends to produce more, smaller chunks than Go). Without correction, vector search returns Python results more often than it should, simply because there are more Python embeddings to match against.

This can be partially addressed at query time by applying a per-language score multiplier to results from underrepresented languages. The multiplier is based on the inverse of the language's share of the index:

```python
def rerank_with_language_balance(results: list[dict], language_weights: dict) -> list[dict]:
    for result in results:
        lang = result["metadata"]["language"]
        weight = language_weights.get(lang, 1.0)
        result["score"] *= weight

    return sorted(results, key=lambda x: x["score"], reverse=True)

# Example weights: Go and SQL are underrepresented, boost them
language_weights = {
    "python": 0.8,
    "typescript": 0.9,
    "go": 1.3,
    "sql": 1.4,
    "terraform": 1.5
}
```

This is a blunt instrument and should be tuned empirically based on your codebase's actual retrieval quality across languages. Start with weights that are inverse to index share and adjust based on feedback.
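As a starting point for those weights, here is a minimal sketch that derives them from the index itself. It assumes you can list the language of every chunk, and it defines each weight as the ratio of a uniform share to the language's actual share. The clamping bounds (`floor` and `ceiling`) are an assumption added here to keep rare languages from swamping the results; they are not prescribed values.

```python
from collections import Counter


def inverse_share_weights(
    chunk_languages: list[str],
    floor: float = 0.5,
    ceiling: float = 2.0,
) -> dict[str, float]:
    """Weight each language by (uniform share / actual share), clamped to a range."""
    counts = Counter(chunk_languages)
    if not counts:
        return {}

    total = sum(counts.values())
    uniform_share = 1.0 / len(counts)

    weights = {}
    for language, count in counts.items():
        raw_weight = uniform_share / (count / total)
        # Clamp so that a tiny language cannot dominate and a large one is not buried
        weights[language] = round(min(ceiling, max(floor, raw_weight)), 2)
    return weights


# Illustrative index: 70% Python, 20% TypeScript, 5% Go, 5% SQL by chunk count
languages = ["python"] * 70 + ["typescript"] * 20 + ["go"] * 5 + ["sql"] * 5
print(inverse_share_weights(languages))
```

The raw inverse-share values are usually too aggressive to use directly, which is why the sketch clamps them. Treat the output as an initial guess to be adjusted against per-language retrieval benchmarks.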

**Key Insight**: Language distribution imbalance in the index is a retrieval quality problem that most teams never diagnose because they never measure per-language retrieval quality. Run your search benchmarks segmented by language. You'll almost certainly find that the minority languages have significantly lower precision.

### Handling code that mixes languages

Some files genuinely mix languages: TypeScript files with embedded SQL (via template literals or tagged template functions), Python files with embedded Jinja templates, Go files with embedded HTML templates. A chunk that contains two languages will produce an embedding that represents neither well.

The right approach is to detect and split these at the mixed-language boundary. For SQL-in-TypeScript, parse the TypeScript AST and look for tagged template literals with tags like `sql` or `prisma.$queryRaw`, and extract the template contents as a separate SQL chunk. The same mechanism handles other embedded languages, such as GraphQL under a `gql` tag. For Jinja-in-Python, parse the Python first, then apply a Jinja parser to the string literals that look like templates.

This adds pipeline complexity, but the payoff is significant. The embedded SQL in a TypeScript data layer is often the most important retrieval target for backend engineers, and it's the most likely to be invisible to a naive indexer.
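The sketch below shows the shape of the SQL-in-TypeScript extraction. It deliberately swaps the AST walk for a regular expression so the example stays self-contained; a regex can misfire on backticks inside comments or strings, so treat it as a stand-in for a real TypeScript parser. The `EmbeddedChunk` type and the single `sql` tag are assumptions made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass
class EmbeddedChunk:
    language: str       # language of the embedded code, not of the host file
    source: str         # template body with surrounding whitespace removed
    start_offset: int   # character offset of the template body in the host file


# Matches sql`...` and captures the template body, allowing escaped characters
SQL_TEMPLATE = re.compile(r"\bsql\s*`((?:[^`\\]|\\.)*)`", re.DOTALL)


def extract_sql_templates(ts_source: str) -> list[EmbeddedChunk]:
    """Pull the bodies of sql-tagged template literals out of TypeScript source."""
    return [
        EmbeddedChunk(
            language="sql",
            source=match.group(1).strip(),
            start_offset=match.start(1),
        )
        for match in SQL_TEMPLATE.finditer(ts_source)
    ]


ts_source = """
const rows = await db.query(sql`
  SELECT id, email FROM users WHERE id = ${userId}
`);
const label = `plain text`;
"""

for chunk in extract_sql_templates(ts_source):
    print(chunk.language, "|", chunk.source)
```

The interpolation placeholder (`${userId}`) is left in the SQL chunk as written. Whether to keep it, replace it with a bind-parameter marker, or strip it is a design choice that depends on how your embedding model handles such tokens.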

### Key Takeaways

- Embedding model selection is the highest-leverage architectural decision in a polyglot retrieval system. Choose models with explicit multi-language code training and strong cross-lingual alignment.
- Hybrid search (vector + BM25) is mandatory for production systems. Pure vector search degrades on exact identifiers and rare technical terms.
- Consistent naming conventions across languages improve cross-language retrieval by leveraging identifier alignment.
- Language distribution imbalance in the index causes minority languages to be systematically underrepresented in results. Measure per-language retrieval quality, apply per-language weights, and tune.
- Files that mix languages should be split at language boundaries before embedding. Don't embed multi-language chunks.

**Try This**: Take five cross-language pairs from your codebase where the same logic is implemented in two different languages (a Python model and its Go HTTP wrapper, a TypeScript client and its Go server handler). Embed both sides with your current model and compute cosine similarity. As a baseline, also compute similarity for a few unrelated chunks written in the same language. If the unrelated same-language chunks consistently score higher than the semantically equivalent cross-language pairs, your model has a cross-language alignment problem worth addressing.

---

## Chapter 4: Cross-Language Dependency Tracing

### The gap in standard tools

Every mature language has dependency analysis tooling. Python has `ast` and `import-linter`. TypeScript has the Compiler API. Go has `go list`. Java has classpath analysis. They're all excellent for intra-language dependency graphs. None of them cross language boundaries.

The connection between a TypeScript frontend and the Go API it calls is, from the perspective of standard tooling, invisible. The connection between a Go service and the SQL schema it queries doesn't show up in any dependency graph. The Terraform resource that provisions the Lambda function the Python code runs in — also invisible.

This is a fundamental gap, not an oversight. Language-level dependency tools are designed to track language-level dependencies. They were never meant to track the runtime connection between a fetch call and an HTTP endpoint that happens to live in a different service written in a different language.

Building cross-language dependency tracing requires going outside the language tooling and working at the level of the interfaces between languages: HTTP APIs, database schemas, message queues, and shared configuration.

### HTTP-level dependency tracing

The most common cross-language boundary in a modern codebase is an HTTP API call. TypeScript frontend calls Go backend. Go service calls Python ML API. Python data pipeline calls a Go ETL service. These connections are semantic, not syntactic — they're expressed in strings, not import statements.

Tracing them requires recognizing patterns that express HTTP connections:

```typescript
// TypeScript side — fetch call to a specific endpoint
const response = await fetch(`${API_BASE_URL}/api/v1/users/${userId}/fraud-score`);

// Or with a typed client
const score = await apiClient.fraudScoring.getScore({ userId });
```

```go
// Go side — handler that registers the endpoint
router.GET("/api/v1/users/:userId/fraud-score", h.GetFraudScore)
```

Connecting these requires:
1. Extracting all fetch/axios/httpx call patterns from TypeScript/JavaScript/Python and parsing the URL patterns.
2. Extracting all route registrations from Go/Python/TypeScript server files and parsing the route patterns.
3. Matching URL patterns across the two sets, handling parameterization (`:userId` matches `${userId}`).

This is an imperfect process. Dynamic URLs, URL construction through string concatenation, and routing libraries with custom patterns all create gaps. But even a partial cross-language HTTP graph is enormously valuable — it makes the invisible connections visible.

```python
import re
from dataclasses import dataclass

@dataclass
class HttpDependency:
    source_file: str
    source_language: str
    source_function: str
    http_method: str
    url_pattern: str
    target_language: str | None = None
    target_handler: str | None = None

def extract_go_routes(source: str, filepath: str) -> list[HttpDependency]:
    # Pattern for gin/echo-style registrations: router.GET("/path", h.Handler).
    # The handler group allows dotted names such as h.GetFraudScore.
    pattern = r'router\.(GET|POST|PUT|DELETE|PATCH)\("([^"]+)",\s*([\w.]+)'
    matches = re.finditer(pattern, source)

    routes = []
    for match in matches:
        method, path, handler = match.groups()
        routes.append(HttpDependency(
            source_file=filepath,
            source_language="go",
            source_function=handler,
            http_method=method,
            url_pattern=path
        ))

    return routes

def extract_typescript_calls(source: str, filepath: str) -> list[HttpDependency]:
    # Patterns for template-literal fetch calls and axios method calls.
    # Typed API clients are not covered by these regexes.
    fetch_pattern = r'fetch\(`([^`]+)`'
    axios_pattern = r'axios\.(get|post|put|delete|patch)\([\'"`]([^\'"`]+)[\'"`]'

    calls = []
    for match in re.finditer(fetch_pattern, source):
        calls.append(HttpDependency(
            source_file=filepath,
            source_language="typescript",
            source_function="",
            http_method="GET",  # fetch defaults to GET; an options argument is not inspected
            url_pattern=match.group(1)
        ))

    for match in re.finditer(axios_pattern, source):
        method, url = match.groups()
        calls.append(HttpDependency(
            source_file=filepath,
            source_language="typescript",
            source_function="",
            http_method=method.upper(),
            url_pattern=url
        ))

    return calls
```
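The matching step (step 3 in the list above) can be sketched as a normalization pass followed by a dictionary join. This is a minimal illustration that assumes calls and routes have already been reduced to `(method, url, identifier)` tuples. The function names and the `{}` placeholder convention are choices made for this example, not an established format.

```python
import re


def normalize_url_pattern(url: str) -> str:
    """Reduce caller URLs and route patterns to one comparable form."""
    # Drop a leading base-URL interpolation such as ${API_BASE_URL}
    url = re.sub(r"^\$\{[^}]+\}(?=/)", "", url)
    # Drop scheme and host if the URL is absolute
    url = re.sub(r"^https?://[^/]+", "", url)
    # Drop the query string
    url = url.split("?", 1)[0]
    # Replace ${expr}, :param, and {param} segments with a shared placeholder
    url = re.sub(r"\$\{[^}]+\}|:\w+|\{\w+\}", "{}", url)
    return url.rstrip("/") or "/"


def match_calls_to_routes(
    calls: list[tuple[str, str, str]],   # (method, url, caller identifier)
    routes: list[tuple[str, str, str]],  # (method, route pattern, handler identifier)
) -> list[tuple[str, str]]:
    """Join callers to handlers on (HTTP method, normalized path)."""
    handlers_by_key: dict[tuple[str, str], list[str]] = {}
    for method, pattern, handler in routes:
        key = (method.upper(), normalize_url_pattern(pattern))
        handlers_by_key.setdefault(key, []).append(handler)

    matches = []
    for method, url, caller in calls:
        key = (method.upper(), normalize_url_pattern(url))
        for handler in handlers_by_key.get(key, []):
            matches.append((caller, handler))
    return matches


calls = [("GET", "${API_BASE_URL}/api/v1/users/${userId}/fraud-score", "src/api/fraud.ts")]
routes = [
    ("GET", "/api/v1/users/:userId/fraud-score", "h.GetFraudScore"),
    ("POST", "/api/v1/users", "h.CreateUser"),
]
print(match_calls_to_routes(calls, routes))
```

Exact matching on the normalized path is the strictest option and will miss URLs assembled through string concatenation. Those cases are better recorded as unresolved calls than guessed at.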

### Database schema dependencies

The connection between application code and database schemas is another category of cross-language dependency that standard tools miss. A Go struct that has a `db` tag is connected to a SQL column. An ORM query in Python is connected to a table definition in a migration file. A TypeScript query builder call resolves to a schema.

Tracing these connections requires recognizing ORM and query builder patterns by language:

**Python (SQLAlchemy)**:
```python
class User(Base):
    __tablename__ = "users"
    id = Column(UUID, primary_key=True)
    email = Column(String(255), unique=True, nullable=False)
    fraud_score = Column(Float, nullable=True)
```

**Go (sqlx struct tags)**:
```go
type User struct {
    ID         uuid.UUID  `db:"id"`
    Email      string     `db:"email"`
    FraudScore *float64   `db:"fraud_score"`
}
```

**SQL (migration file)**:
```sql
CREATE TABLE users (
    id          UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    email       VARCHAR(255) UNIQUE NOT NULL,
    fraud_score FLOAT,
    created_at  TIMESTAMP WITH TIME ZONE DEFAULT NOW()
);
```

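What connects these three is the `users` table name and the column names. The Go struct carries only column names in its `db` tags, so tying it to a specific table takes an additional signal, such as the query text that populates it. A cross-language dependency tracer can: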
1. Extract all table references from SQL migration files.
2. Extract all `__tablename__` declarations and column definitions from SQLAlchemy models.
3. Extract all `db:` struct tags from Go structs.
4. Build an explicit mapping from table names to all code that references them across languages.

This produces a graph where a change to the SQL schema immediately surfaces all the code in all languages that touches it — which is exactly the impact analysis an engineer needs before making a schema change.
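A minimal sketch of the extraction steps follows, using regular expressions over file contents. The regexes cover only the simple forms shown above (a plain `CREATE TABLE`, a literal `__tablename__`, and `db:"..."` struct tags), and the function names are illustrative. Linking a Go struct to a specific table is left out, because the struct tags alone do not name the table.

```python
import re
from collections import defaultdict

CREATE_TABLE = re.compile(
    r'CREATE\s+TABLE\s+(?:IF\s+NOT\s+EXISTS\s+)?"?(\w+)"?', re.IGNORECASE
)
TABLENAME = re.compile(r'__tablename__\s*=\s*["\'](\w+)["\']')
DB_TAG = re.compile(r'\bdb:"([^",]+)')


def build_table_references(files: dict[str, str]) -> dict[str, set[str]]:
    """Map each table name to the SQL and Python files that declare it."""
    references: dict[str, set[str]] = defaultdict(set)
    for path, source in files.items():
        if path.endswith(".sql"):
            pattern = CREATE_TABLE
        elif path.endswith(".py"):
            pattern = TABLENAME
        else:
            continue
        for match in pattern.finditer(source):
            references[match.group(1)].add(path)
    return dict(references)


def go_db_columns(go_source: str) -> set[str]:
    """Collect the column names declared in db struct tags."""
    return set(DB_TAG.findall(go_source))


files = {
    "migrations/0001_users.sql": "CREATE TABLE users (\n    id UUID PRIMARY KEY\n);",
    "app/models.py": 'class User(Base):\n    __tablename__ = "users"\n',
}
go_source = 'type User struct {\n    ID uuid.UUID `db:"id"`\n    Email string `db:"email"`\n}'

print(build_table_references(files))
print(go_db_columns(go_source))
```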

**Key Insight**: Cross-language dependency graphs are best built as explicit data structures separate from the vector index, not inferred from retrieval results. They're deterministic relationships — the TypeScript file either calls that Go endpoint or it doesn't — and they should be stored and queried as such.

### Message queue and event-based tracing

In event-driven architectures, services communicate through message queues or event buses rather than direct HTTP calls. The dependency is expressed as a producer publishing to a topic and a consumer subscribing to it. The topic name (or event type) is the bridge.

```python
# Python producer
producer.send("user-fraud-scored", {
    "user_id": user_id,
    "score": fraud_score,
    "timestamp": datetime.now().isoformat()
})
```

```go
// Go consumer
consumer.Subscribe("user-fraud-scored", func(msg Message) error {
    var event FraudScoredEvent
    if err := json.Unmarshal(msg.Value, &event); err != nil {
        return err
    }
    return h.handleFraudScored(ctx, event)
})
```

The bridge is the string `"user-fraud-scored"`. Tracing it requires:
1. Extracting all message producer calls and their topic names.
2. Extracting all consumer subscriptions and their topic names.
3. Joining them on the topic name.

The complication is that topic names are often stored in configuration files or environment variables rather than as inline strings. This requires resolving constants and config references, which adds a layer of indirection. A simpler first approach is to extract literal string arguments to known producer/consumer functions and treat config references as unresolved (but still recorded) dependencies.
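That simpler first approach might look like the sketch below. It assumes the producer and consumer calls are spelled `producer.send(...)` and `consumer.Subscribe(...)` as in the examples above; real codebases will need one pattern per client library. Non-literal first arguments are returned with `resolved=False` so they can be recorded without being joined.

```python
import re
from collections import defaultdict

# First argument is either a quoted literal or a bare/dotted name (config reference)
PRODUCER_CALL = re.compile(r'producer\.send\(\s*(?:["\']([^"\']+)["\']|([A-Za-z_][\w.]*))')
CONSUMER_CALL = re.compile(r'consumer\.Subscribe\(\s*(?:["\']([^"\']+)["\']|([A-Za-z_][\w.]*))')


def extract_topics(source: str, pattern: re.Pattern) -> list[tuple[str, bool]]:
    """Return (topic_or_reference, resolved) pairs for each matching call."""
    topics = []
    for literal, reference in pattern.findall(source):
        if literal:
            topics.append((literal, True))
        else:
            topics.append((reference, False))  # unresolved config or constant
    return topics


def join_on_topic(
    producers: dict[str, str],  # filepath -> source
    consumers: dict[str, str],  # filepath -> source
) -> list[tuple[str, str, str]]:
    """Return (producer_file, topic, consumer_file) for literal topic names."""
    consumers_by_topic: dict[str, list[str]] = defaultdict(list)
    for path, source in consumers.items():
        for topic, resolved in extract_topics(source, CONSUMER_CALL):
            if resolved:
                consumers_by_topic[topic].append(path)

    edges = []
    for path, source in producers.items():
        for topic, resolved in extract_topics(source, PRODUCER_CALL):
            if resolved:
                for consumer_path in consumers_by_topic.get(topic, []):
                    edges.append((path, topic, consumer_path))
    return edges


producers = {"scoring/publish.py": 'producer.send("user-fraud-scored", {"user_id": 1})'}
consumers = {"internal/consumer.go": 'consumer.Subscribe("user-fraud-scored", handler)'}
print(join_on_topic(producers, consumers))
```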

### Representing the cross-language dependency graph

The cross-language dependency graph is best represented as a directed graph where nodes are chunks (identified by filepath + function name + language) and edges are typed by the kind of connection: `http_call`, `http_handler`, `db_read`, `db_write`, `event_publish`, `event_subscribe`, `config_reference`.

A lightweight representation that works well with existing retrieval infrastructure:

```python
@dataclass
class DependencyEdge:
    source_chunk_id: str     # "typescript:src/api/fraud.ts:getFraudScore"
    target_chunk_id: str     # "go:internal/handlers/fraud.go:GetFraudScore"
    edge_type: str           # "http_call"
    confidence: float        # 0.0-1.0 (how certain is this connection)
    metadata: dict           # HTTP method, URL pattern, etc.
```

The confidence score is important. HTTP route matching is imperfect. SQL table name matching can produce false positives for common names like `users`. Recording confidence allows downstream consumers to filter by certainty.

**Warning**: Don't try to build a complete and perfect cross-language dependency graph. The effort required for completeness is enormous, and perfect is the enemy of good. A graph that captures 80% of connections with high confidence is far more useful than a project that aims for 100% and ships nothing.

### Integration with the vector index

The cross-language dependency graph and the vector index serve different purposes and should be queried separately, but their results should be composable. When a user searches for "fraud scoring pipeline," the vector search returns relevant chunks across all languages. The dependency graph can then expand each result to include its cross-language neighbors — not because those neighbors were semantically similar to the query, but because they're structurally connected to the results.

This expansion is what makes the retrieval system genuinely useful for understanding complex cross-language flows. The initial search surfaces the entry point. The graph traversal surfaces the complete path.

### Key Takeaways

- Standard language-level dependency tools don't cross language boundaries. Cross-language dependency tracing requires working at the interface level: HTTP APIs, database schemas, message queues.
- HTTP route matching connects TypeScript/Python callers to Go/Python handlers by pattern-matching URL strings.
- Database schema dependencies connect ORM models and SQL migration files through table and column name matching.
- Build the cross-language dependency graph as an explicit data structure, not as an inference from retrieval results.
- Aim for 80% coverage with high confidence rather than 100% coverage with uncertain completeness.

**Try This**: Pick one cross-language API boundary in your codebase — a TypeScript caller and its server-side handler. Write the code to extract the URL pattern from both sides and determine whether they match. This is the core of HTTP-level dependency tracing. Once you have one pair working, generalize the pattern.

---

## Chapter 5: Search and Retrieval in Mixed Codebases

### What engineers actually search for

Before optimizing a retrieval system, it helps to understand what queries engineers actually run against it. They fall roughly into four categories:

- **Concept queries**: "Where is the rate limiting logic?", "How is authentication handled?" These are the queries that benefit most from semantic search.
- **Identifier queries**: "Where is `UserRepository` defined?", "What calls `processFraudScore`?" These benefit from exact-match lookup and graph traversal.
- **Change-impact queries**: "What will break if I change this function signature?", "What code reads from the `transactions` table?" These require graph traversal, not retrieval.
- **Exploration queries**: "Show me all the handlers in the payments service", "What data models exist in the order processing domain?" These benefit from structured filtering over metadata.

A retrieval system that only does semantic vector search handles the first category well and the rest poorly. Most production systems need a routing layer that dispatches to different retrieval strategies based on query type.

### Query intent classification

The simplest way to route queries to the right strategy is to classify them before retrieval:

```python
import re

def classify_query(query: str) -> str:
    lowered = query.lower()

    # Bare identifier: a single token with no spaces
    # (camelCase, snake_case, or PascalCase)
    if re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', query.strip()):
        return "identifier"

    # Contains backtick-quoted identifier
    if '`' in query:
        return "identifier_with_context"

    # Change-impact pattern
    if any(phrase in lowered for phrase in [
        "what calls", "what uses", "what depends on", "what breaks",
        "impact of", "who reads", "who writes"
    ]):
        return "impact_analysis"

    # Enumeration pattern
    if any(phrase in lowered for phrase in [
        "show me all", "list all", "find all", "what are all"
    ]):
        return "enumeration"

    # Default to semantic
    return "semantic"
```

This classification feeds a routing layer that dispatches to the appropriate retrieval path. It doesn't have to be perfect: queries that match none of the patterns fall through to semantic search, which handles most things reasonably.
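One way to structure that routing layer is a dispatch table keyed by intent label, with semantic search as the fallback. The sketch below is illustrative: the handler functions are stubs standing in for real retrieval paths, and `build_router` is a hypothetical helper name. The intent string would come from a classifier such as `classify_query`.

```python
from typing import Callable

Handler = Callable[[str], list[dict]]


def build_router(
    handlers: dict[str, Handler],
    default_intent: str = "semantic",
) -> Callable[[str, str], list[dict]]:
    """Return a function that dispatches (intent, query) to a retrieval handler."""
    fallback = handlers[default_intent]

    def route(intent: str, query: str) -> list[dict]:
        # Unknown or unhandled intents fall back to semantic search
        handler = handlers.get(intent, fallback)
        return handler(query)

    return route


# Stub handlers standing in for real retrieval strategies
def exact_lookup(query: str) -> list[dict]:
    return [{"strategy": "exact_lookup", "query": query}]


def graph_traversal(query: str) -> list[dict]:
    return [{"strategy": "graph_traversal", "query": query}]


def semantic_search(query: str) -> list[dict]:
    return [{"strategy": "semantic_search", "query": query}]


route = build_router({
    "identifier": exact_lookup,
    "impact_analysis": graph_traversal,
    "semantic": semantic_search,
})

print(route("identifier", "UserRepository"))
print(route("enumeration", "show me all handlers"))  # no handler registered, falls back
```

Keeping the table explicit makes it easy to add a strategy for a new intent without touching the classifier, and it makes the fallback behavior visible in one place.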

### Language-scoped search

One of the most useful features in a polyglot retrieval system, and one of the most underbuilt, is language-scoped search: the ability to restrict results to a specific language or set of languages.

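```python
def search(
    query: str,
    languages: list[str] | None = None,
    chunk_types: list[str] | None = None,
    top_k: int = 20
) -> list[dict]:

    conditions = []
    if languages:
        conditions.append({"language": {"$in": languages}})
    if chunk_types:
        conditions.append({"chunk_type": {"$in": chunk_types}})

    # Chroma-style filters accept a single top-level condition,
    # so multiple conditions are combined under "$and"
    if len(conditions) > 1:
        where_filter = {"$and": conditions}
    elif conditions:
        where_filter = conditions[0]
    else:
        where_filter = None

    results = vector_index.query(
        query_embeddings=[embed(query)],
        n_results=top_k,
        where=where_filter
    )

    return results
```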

Language scoping is particularly valuable when an engineer knows which layer they're working in. "Show me the database query for user lookups" should probably be scoped to SQL and to the language the ORM models are written in, so that it doesn't return TypeScript type definitions named `UserLookupResult`.

**Key Insight**: The query interface matters as much as the retrieval quality. Engineers who can't easily express language constraints will broaden their queries to compensate, and broader queries return noisier results. Build language scoping into the interface from the start.

### Contextual reranking

Initial retrieval returns a ranked list of candidates. Reranking takes that list and reorders it using additional signals that are too expensive, or too dependent on the engineer's current context, to apply during first-pass retrieval. For polyglot codebases, the most valuable reranking signals are:

**Recency**: Code that was modified recently is more likely to be what the engineer is looking for than code that hasn't been touched in two years. Weight recently modified chunks higher.

**Language context**: If the engineer is currently editing a TypeScript file, TypeScript results should rank higher than Python results for ambiguous queries, all else being equal.

**Dependency proximity**: Results that are one hop away in the cross-language dependency graph from code the engineer has recently viewed should rank higher than unconnected results.

**Chunk type relevance**: For a query about "how authentication works," class definitions and complex function implementations should rank higher than type declarations and simple getters.

A simple reranker:

```python
def rerank(
    results: list[dict],
    current_file: str | None = None,
    recently_viewed: list[str] | None = None,
    dependency_graph: DependencyGraph | None = None
) -> list[dict]:

    for result in results:
        score = result["base_score"]

        # Recency boost: files modified in last 7 days
        if result["metadata"].get("days_since_modified", 999) < 7:
            score *= 1.2

        # Same-language boost when current file is known
        if current_file:
            current_lang = detect_language(current_file)
            if result["metadata"]["language"] == current_lang:
                score *= 1.1

        # Dependency proximity boost
        if recently_viewed and dependency_graph:
            for viewed_file in recently_viewed[-5:]:  # last 5 viewed files
                if dependency_graph.are_connected(viewed_file, result["metadata"]["filepath"]):
                    score *= 1.15
                    break

        result["final_score"] = score

    return sorted(results, key=lambda x: x["final_score"], reverse=True)
```

### Result presentation for cross-language queries

When a query returns results from multiple languages, how those results are presented matters significantly. A flat list that interleaves Python, Go, SQL, and TypeScript results is harder to process than a grouped presentation that clusters by language or by logical component.

For engineering tools, grouping by component (frontend, API layer, data layer, infrastructure) is often more useful than grouping by language, because it reflects how engineers think about the system. A query about "user profile" that returns results from TypeScript, Go, and SQL is more useful when those results are presented as "Frontend: src/components/UserProfile.tsx, API Layer: internal/handlers/user.go, Data Layer: migrations/0042_add_user_profile.sql" than as an interleaved list.

Building this kind of component-aware grouping requires two things: a component taxonomy (a mapping from file paths to logical components), and a presentation layer that applies it. The taxonomy can be derived from directory structure, service boundaries, or explicit annotations. Directory structure is the lowest-friction approach and works surprisingly well for well-organized codebases.

**Warning**: Component-grouped results can hide cross-cutting concerns. A security pattern implemented in a shared library that's used across all components should appear in search results as shared infrastructure, not buried under one component's results. Always include an "other" or "shared" bucket in component-grouped results.
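A directory-based taxonomy with a shared bucket can be sketched in a few lines. The prefix-to-component mapping below is hypothetical and would need to reflect your own repository layout. Results are assumed to carry a `filepath` in their metadata, as in the earlier examples.

```python
from collections import defaultdict

# Hypothetical taxonomy: path prefix -> logical component
COMPONENT_PREFIXES = {
    "src/components/": "Frontend",
    "internal/handlers/": "API Layer",
    "migrations/": "Data Layer",
    "terraform/": "Infrastructure",
}
SHARED_BUCKET = "Shared"


def component_for(filepath: str) -> str:
    """Resolve a file path to its component; the longest matching prefix wins."""
    matching = [prefix for prefix in COMPONENT_PREFIXES if filepath.startswith(prefix)]
    if not matching:
        return SHARED_BUCKET
    return COMPONENT_PREFIXES[max(matching, key=len)]


def group_by_component(results: list[dict]) -> dict[str, list[str]]:
    """Group result file paths by component, preserving rank order within each group."""
    grouped: dict[str, list[str]] = defaultdict(list)
    for result in results:
        filepath = result["metadata"]["filepath"]
        grouped[component_for(filepath)].append(filepath)
    return dict(grouped)


results = [
    {"metadata": {"filepath": "src/components/UserProfile.tsx"}},
    {"metadata": {"filepath": "internal/handlers/user.go"}},
    {"metadata": {"filepath": "migrations/0042_add_user_profile.sql"}},
    {"metadata": {"filepath": "pkg/security/token.go"}},
]
for component, paths in group_by_component(results).items():
    print(f"{component}: {', '.join(paths)}")
```

Anything that matches no prefix lands in the shared bucket rather than being dropped, which keeps cross-cutting code visible in the grouped view.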

### Multi-hop retrieval for cross-language flows

For queries that describe a cross-language flow — "how does the TypeScript fraud check end up updating the database?" — single-hop retrieval is insufficient. The answer requires traversing a path that spans multiple files and multiple languages.

Multi-hop retrieval combines an initial semantic search with graph traversal:

1. Run semantic search to find the entry point (the TypeScript function that initiates the fraud check).
2. Traverse the cross-language dependency graph from that entry point to find connected nodes.
3. Include connected nodes in the result set, annotated with their position in the dependency path.
4. Optionally continue traversal until a terminal node (a database write, a queue publish) is found.

This produces a result set that reads as a connected path rather than a set of disconnected fragments. For complex cross-language flows, this is the difference between a tool that helps and a tool that confuses.

```python
def multi_hop_retrieve(
    query: str,
    dependency_graph: DependencyGraph,
    max_hops: int = 3
) -> list[dict]:

    # Initial semantic retrieval
    initial_results = semantic_search(query, top_k=5)

    result_set = []
    visited = set()  # chunk IDs already present in result_set

    # Seed the traversal with the semantic hits (hop 0)
    frontier = []
    for result in initial_results:
        chunk_id = result["metadata"]["chunk_id"]
        if chunk_id in visited:
            continue
        visited.add(chunk_id)
        result["hop"] = 0
        result_set.append(result)
        frontier.append(chunk_id)

    # Breadth-first traversal of the dependency graph
    for hop in range(1, max_hops + 1):
        next_frontier = []
        for chunk_id in frontier:
            for neighbor in dependency_graph.get_neighbors(chunk_id):
                neighbor_id = neighbor["chunk_id"]
                if neighbor_id in visited:
                    continue
                # Mark on discovery so each chunk is added exactly once
                visited.add(neighbor_id)
                neighbor_chunk = fetch_chunk(neighbor_id)
                neighbor_chunk["hop"] = hop
                neighbor_chunk["connection_type"] = neighbor["edge_type"]
                result_set.append(neighbor_chunk)
                next_frontier.append(neighbor_id)

        frontier = next_frontier

    return result_set
```

### Key Takeaways

- Engineer search queries fall into four categories: concept queries, identifier queries, change-impact queries, and exploration queries. Each benefits from a different retrieval strategy.
- Language-scoped search dramatically improves result relevance and should be exposed in the query interface.
- Contextual reranking — using recency, current file language, and dependency proximity — improves result quality significantly over pure vector scoring.
- Multi-hop retrieval combines semantic search with dependency graph traversal to surface complete cross-language flows.
- Result presentation matters. Group results by logical component rather than by language when possible.

**Try This**: Run 10 queries against your current code search tool. Classify each query as concept, identifier, change-impact, or exploration. Note which ones return unsatisfying results. The pattern in the unsatisfying queries points to which retrieval strategy needs the most investment.

---

## Chapter 6: Index Architecture for Polyglot Systems

### Why architecture matters

The difference between a polyglot retrieval system that scales and one that degrades over time is almost entirely architectural. The core retrieval logic — chunking, embedding, searching — is relatively straightforward. The hard part is keeping it maintainable as the codebase grows, the team changes, and the requirements evolve.

The architectural decisions that matter most are: how the index is structured (one index or many), how updates propagate from code changes to index changes, how the cross-language dependency graph is stored and queried, and how the system handles the operational concerns of a production tool.

### Single index versus per-language indexes

The first architectural decision is whether to maintain a single unified vector index or separate indexes per language. Both are defensible, and the right choice depends on your query patterns.

**Single unified index** — all chunks from all languages, embedded with the same model, stored in one index.

Advantages: Simpler operational footprint. Cross-language semantic search works naturally — results from all languages compete on the same score scale. Index management (updates, rebuilds, monitoring) has one control plane.

Disadvantages: Language distribution imbalance is harder to control. Per-language query features (filtering, ranking) require metadata filters rather than index-level isolation. A corrupt or degraded index affects all languages simultaneously.

**Per-language indexes** — separate vector stores for Python, Go, TypeScript, SQL, Terraform.

Advantages: Language-specific embedding models can be used (a SQL-specialized model for SQL, a code-general model for everything else). Per-language indexes can be rebuilt independently. Language distribution is naturally controlled — a Python-heavy codebase doesn't dilute Go results.

Disadvantages: Cross-language queries require fan-out across indexes and result merging. More operational complexity. Score normalization across indexes is non-trivial (a score of 0.85 in the Python index means something different from 0.85 in the SQL index).

For most systems, a single unified index with per-language metadata filtering is the right starting point. The per-language approach is worth considering when the SQL or IaC components have specialized enough characteristics that a general-purpose code embedding model does them a disservice.
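If you do go the per-language route, the fan-out merge has to put scores on a common scale first. The sketch below uses per-index min-max normalization, which is one simple option among several (rank-based fusion is another). It assumes each index returns `(chunk_id, score)` pairs where higher is better, and the function names are illustrative.

```python
def min_max_normalize(results: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Rescale one index's scores to the 0..1 range."""
    if not results:
        return []
    scores = [score for _, score in results]
    lowest, highest = min(scores), max(scores)
    if highest == lowest:
        # All scores equal: treat every result as equally strong
        return [(chunk_id, 1.0) for chunk_id, _ in results]
    return [
        (chunk_id, (score - lowest) / (highest - lowest))
        for chunk_id, score in results
    ]


def merge_across_indexes(
    per_index_results: dict[str, list[tuple[str, float]]],
    top_k: int = 10,
) -> list[tuple[str, str, float]]:
    """Merge per-language result lists into one ranking of (chunk_id, language, score)."""
    merged = []
    for language, results in per_index_results.items():
        for chunk_id, score in min_max_normalize(results):
            merged.append((chunk_id, language, score))
    merged.sort(key=lambda item: item[2], reverse=True)
    return merged[:top_k]


per_index_results = {
    "python": [("py:score_user", 0.85), ("py:load_model", 0.65)],
    "sql": [("sql:users", 0.40), ("sql:audit_log", 0.30)],
}
for chunk_id, language, score in merge_across_indexes(per_index_results, top_k=2):
    print(chunk_id, language, round(score, 2))
```

Min-max normalization has a known drawback: it promotes each index's best hit to the top of the scale even when that hit is weak in absolute terms. That is acceptable when every index is expected to hold something relevant, and misleading when it is not.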

### Incremental update architecture

A full index rebuild from scratch for a large polyglot codebase can take minutes to hours. Doing this on every code change is impractical. The index needs to support incremental updates: when a file changes, only the chunks from that file need to be re-embedded and re-inserted.

Incremental updates require:

1. **File-level tracking**: The index must know which chunks came from which file. This is a metadata requirement, not a retrieval feature.

2. **Chunk identity and change detection**: Each chunk needs a stable ID (for example, language, file path, and symbol name) so the same function maps to the same index entry across runs, plus a hash of its content to detect changes. If the hash differs from the stored one, the chunk needs re-embedding.

3. **Delete stale chunks**: When a file changes, delete every indexed chunk from that file that no longer exists in the new parse, either by clearing the file's chunks before reinserting or by diffing chunk IDs. Without this, you'll accumulate stale chunks.

4. **Change detection**: Watch the file system (or git diff) for changed files. Only process changed files.

```python
import hashlib

def update_index_for_file(filepath: str, index, dependency_graph: DependencyGraph):
    source = read_file(filepath)
    language = detect_language(filepath)

    # Parse new chunks
    new_chunks = chunk_by_language(source, filepath, language)

    # Compute content hashes and store them in the metadata that gets indexed
    for chunk in new_chunks:
        chunk["metadata"]["content_hash"] = hashlib.sha256(
            chunk["source"].encode()
        ).hexdigest()

    # Get existing chunk hashes from index
    # (assumes a Chroma-style get() that returns parallel "ids" and "metadatas" lists)
    existing = index.get(where={"filepath": filepath})
    existing_hashes = {
        chunk_id: metadata.get("content_hash")
        for chunk_id, metadata in zip(existing["ids"], existing["metadatas"])
    }

    new_chunk_ids = {chunk["id"] for chunk in new_chunks}

    # Delete chunks that no longer exist
    stale_ids = [chunk_id for chunk_id in existing_hashes if chunk_id not in new_chunk_ids]
    if stale_ids:
        index.delete(ids=stale_ids)

    # Upsert new chunks and chunks whose content hash changed
    to_upsert = [
        chunk for chunk in new_chunks
        if existing_hashes.get(chunk["id"]) != chunk["metadata"]["content_hash"]
    ]

    if to_upsert:
        embeddings = batch_embed([c["source"] for c in to_upsert])
        index.upsert(
            ids=[c["id"] for c in to_upsert],
            embeddings=embeddings,
            documents=[c["source"] for c in to_upsert],
            metadatas=[c["metadata"] for c in to_upsert]
        )

    # Update dependency graph for this file
    dependency_graph.update_for_file(filepath, source, language)
```

### Git-based change detection

For developer tooling, git is the natural change detection mechanism. Rather than watching the file system, detect changed files by comparing the current HEAD to the last indexed commit:

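```python
import subprocess

def get_changed_files_since(last_indexed_commit: str) -> list[str]:
    # The output includes deleted files, which the caller should remove
    # from the index rather than re-chunk
    result = subprocess.run(
        ["git", "diff", "--name-only", last_indexed_commit, "HEAD"],
        capture_output=True, text=True, check=True  # raise if git fails
    )
    return [f.strip() for f in result.stdout.splitlines() if f.strip()]
```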

This approach is simple and reliable for committed changes (uncommitted edits are not visible to it), and it works well in CI/CD pipelines where the index should be updated as part of the build or post-merge.

**Key Insight**: Track the last-indexed commit hash in persistent storage alongside the index. On every update run, diff against this commit, process changed files, and update the stored commit hash. This gives you a clean incremental update loop with no need for a file watcher or database of file hashes.

### The dependency graph storage question

The cross-language dependency graph and the vector index have different storage requirements and access patterns. The vector index is optimized for approximate nearest-neighbor search. The dependency graph is optimized for graph traversal: "give me all nodes reachable from this node within N hops."

ChromaDB, Pinecone, Weaviate, and other vector stores are poor graph databases. Don't try to store a traversable graph in a vector store. Options for the dependency graph:

**SQLite** — for most use cases, a simple SQLite database with an `edges` table is sufficient. It handles millions of edges easily, supports the join patterns needed for traversal, and has zero operational overhead.

```sql
CREATE TABLE dependency_edges (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_chunk_id TEXT NOT NULL,
    target_chunk_id TEXT NOT NULL,
    edge_type TEXT NOT NULL,  -- 'http_call', 'db_read', etc.
    confidence REAL NOT NULL DEFAULT 1.0,
    metadata TEXT,  -- JSON blob for additional attributes
    UNIQUE(source_chunk_id, target_chunk_id, edge_type)
);

CREATE INDEX idx_source ON dependency_edges(source_chunk_id);
CREATE INDEX idx_target ON dependency_edges(target_chunk_id);
```

**Neo4j or similar graph database** — justified only when the graph traversal patterns become complex enough that SQL joins can't express them efficiently. For most codebases up to a few hundred thousand edges, SQLite is adequate.

**In-memory graph** — suitable for development and testing, and for small codebases where the full graph fits in memory. Not suitable for production unless the codebase is small and the process is long-lived.
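
For the SQLite option, the "all nodes reachable within N hops" query can be expressed with a recursive common table expression. The sketch below assumes the `dependency_edges` table defined above; the function name `reachable_within` is illustrative, not part of any library.

```python
import sqlite3


def reachable_within(
    conn: sqlite3.Connection, start_chunk_id: str, max_hops: int
) -> list[tuple[str, int]]:
    """Return (chunk_id, min_hops) for nodes reachable from start within max_hops."""
    query = """
        WITH RECURSIVE reachable(chunk_id, depth) AS (
            SELECT ?, 0
            UNION
            SELECT e.target_chunk_id, r.depth + 1
            FROM dependency_edges AS e
            JOIN reachable AS r ON e.source_chunk_id = r.chunk_id
            WHERE r.depth < ?
        )
        SELECT chunk_id, MIN(depth) AS hops
        FROM reachable
        WHERE depth > 0 AND chunk_id <> ?
        GROUP BY chunk_id
        ORDER BY hops, chunk_id
    """
    params = (start_chunk_id, max_hops, start_chunk_id)
    return conn.execute(query, params).fetchall()
```

The depth bound guarantees termination even when the graph contains cycles, and `UNION` (rather than `UNION ALL`) discards duplicate rows as the recursion proceeds.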

### Index partitioning for large codebases

Very large codebases — monorepos with hundreds of services, open-source platforms with millions of lines — produce more chunks than a single vector index can serve at reasonable query latency. The solution is partitioning.

Natural partition boundaries for polyglot systems:
- By service or application boundary (most effective for service-oriented architectures)
- By language (maintains language-specific embedding tuning)
- By access frequency (hot partitions for actively developed code, cold partitions for stable legacy code)

Query-time partition selection can be done with a routing layer that determines which partitions to query based on the query content. A query mentioning TypeScript component names routes to the frontend partition. A query about database schemas routes to the SQL and data layer partition.
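
A routing layer can start as something very simple. The sketch below uses keyword overlap to pick partitions and falls back to querying all of them; the partition names and trigger terms are hypothetical, and a real router would be derived from your own service layout or a trained classifier.

```python
# Hypothetical partitions and trigger terms, for illustration only.
PARTITION_KEYWORDS = {
    "frontend": {"component", "props", "tsx", "react", "hook"},
    "data": {"schema", "table", "migration", "sql", "column"},
    "infra": {"terraform", "provider", "bucket", "iam"},
}


def select_partitions(query: str) -> list[str]:
    """Return partitions whose keywords appear in the query, or all partitions."""
    tokens = set(query.lower().split())
    matched = [
        name for name, words in PARTITION_KEYWORDS.items() if tokens & words
    ]
    # No keyword matched: query every partition rather than guess.
    return matched or list(PARTITION_KEYWORDS)
```

Falling back to all partitions trades latency for recall, which is usually the safer default when the router is unsure.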

The complexity cost of partitioning is real. Don't add it until the single-partition approach is clearly insufficient. For most codebases up to 50,000 chunks, a single partition with good hardware will serve queries in under 200ms.

### Operational monitoring

A retrieval index that isn't monitored will silently degrade. The metrics that matter:

- **Index freshness**: How many hours since the last update? Alert when the index is more than N hours stale.
- **Language distribution**: Track the percentage of chunks per language. Alert when distribution shifts significantly (could indicate a language was accidentally excluded from indexing).
- **Query latency**: P50, P95, P99. Alert on P95 > 500ms.
- **Result quality** (harder to automate): Periodically run a fixed set of gold queries and measure whether the expected results appear in the top K. Alert when expected results fall outside the top 10.
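
The freshness and latency checks above are easy to automate. The following is a minimal sketch, assuming you already record the last update time and a window of recent query latencies; the 4-hour staleness default and the alert names are illustrative choices, not recommendations from a specific tool.

```python
import math
from datetime import datetime, timedelta


def percentile(values: list[float], pct: float) -> float:
    """Nearest-rank percentile; `values` must be non-empty."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def index_alerts(
    last_update: datetime,
    now: datetime,
    latencies_ms: list[float],
    max_stale: timedelta = timedelta(hours=4),  # assumed value for "N hours"
    p95_limit_ms: float = 500.0,
) -> list[str]:
    """Return the names of the alerts that should fire right now."""
    alerts = []
    if now - last_update > max_stale:
        alerts.append("index_stale")
    if latencies_ms and percentile(latencies_ms, 95) > p95_limit_ms:
        alerts.append("p95_latency_high")
    return alerts
```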

**Warning**: Query latency can degrade as the index grows without any explicit change — simply because the ANN algorithm's performance degrades with index size. Monitor index size alongside query latency, and plan for the HNSW parameter retuning that will be required at larger scales.

### Key Takeaways

- A single unified index with per-language metadata filtering is the right starting point for most polyglot systems. Per-language indexes add complexity that's only justified when specialized embedding models are needed.
- Incremental updates are mandatory for production systems. Track content hashes at the chunk level and use git diff for change detection.
- Store the cross-language dependency graph in SQLite, not in the vector store. They have fundamentally different access patterns.
- Partition the index by service boundary only when query latency requires it. Premature partitioning adds routing complexity without payoff.
- Monitor index freshness, language distribution, and query latency. Silent degradation is the most common failure mode in production indexes.

**Try This**: Estimate the chunk count for your codebase by taking a representative sample of 100 files, chunking them, and extrapolating. Use this estimate to inform your index architecture decisions — particularly whether single-partition and in-process SQLite are sufficient, or whether you need to plan for partitioning from the start.

---

## Chapter 7: Tooling and Pipeline Design

### The pipeline as infrastructure

A polyglot code retrieval system isn't a piece of code you write once. It's operational infrastructure that needs to be built, deployed, monitored, and maintained. Treating it as infrastructure from the beginning — with the same attention to reliability, observability, and maintainability you'd give any production system — is the difference between a tool that helps for six months and one that helps for years.

The pipeline has three main stages: ingestion (detect changes, chunk, embed, store), serving (receive queries, retrieve, rerank, return), and maintenance (monitor freshness, trigger rebuilds, handle failures). Each stage has its own operational concerns.

### Ingestion pipeline design

The ingestion pipeline is triggered by code changes and must be fast enough to keep the index fresh. For a team that commits frequently, the pipeline needs to handle a new commit every few minutes without falling behind.

The core ingestion loop:

```
git diff → changed files → filter by language → chunk → embed → upsert → update dependency graph
```

Each step has independent failure modes and should be handled separately. A parsing failure on a single malformed file should not abort the entire pipeline run. An embedding API rate limit should trigger retry with backoff, not a pipeline failure. A dependency graph update failure should log and continue — the vector index is more important for immediate usability.

```python
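import asyncio
import logging
from pathlib import Path

logger = logging.getLogger(__name__)

# Placeholder exception types; substitute the ones your parser and
# embedding client actually raise.
class ParseError(Exception):
    pass

class EmbedError(Exception):
    pass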

async def ingest_changed_files(
    changed_files: list[str],
    index,
    dependency_graph,
    embed_fn
) -> dict:
    results = {
        "processed": 0,
        "failed": 0,
        "skipped": 0
    }

    for filepath in changed_files:
        if not should_index(filepath):
            results["skipped"] += 1
            continue

        try:
            await ingest_single_file(filepath, index, dependency_graph, embed_fn)
            results["processed"] += 1
        except ParseError as e:
            logger.warning(f"Parse failure for {filepath}: {e}")
            results["failed"] += 1
        except EmbedError as e:
            logger.error(f"Embed failure for {filepath}: {e}")
            results["failed"] += 1
            # Could retry here or re-queue
        except Exception as e:
            logger.exception(f"Unexpected failure for {filepath}: {e}")
            results["failed"] += 1

    return results

def should_index(filepath: str) -> bool:
    path = Path(filepath)

    # Skip dependency, cache, and build output directories
    if any(part in path.parts for part in [
        "node_modules", ".terraform", "__pycache__", "vendor",
        "dist", "build", ".next"
    ]):
        return False

    # Skip generated file patterns
    if path.name.endswith((".generated.ts", ".pb.go", "_pb2.py")):
        return False

    # Only index supported languages
    language_extensions = {".py", ".go", ".ts", ".tsx", ".js", ".sql", ".tf", ".hcl"}
    return path.suffix in language_extensions
```

### Embedding batch optimization

Embedding is the most expensive step in the ingestion pipeline, both in latency and cost. Batching embedding requests is critical for throughput.

Most embedding APIs support batch inputs. The optimal batch size varies by API, but 50-200 chunks per request is typical. The trade-off is between throughput (larger batches) and error recovery (smaller batches mean partial failures are less costly):

```python
class RateLimitError(Exception):
    """Placeholder; use the rate-limit exception your embedding client raises."""

async def batch_embed(
    texts: list[str],
    embed_fn,
    batch_size: int = 100,
    max_retries: int = 3
) -> list[list[float]]:
    embeddings = []

    for i in range(0, len(texts), batch_size):
        batch = texts[i:i + batch_size]

        for attempt in range(max_retries):
            try:
                batch_embeddings = await embed_fn(batch)
                embeddings.extend(batch_embeddings)
                break
            except RateLimitError:
                if attempt == max_retries - 1:
                    raise  # Out of retries: fail loudly rather than drop the batch
                await asyncio.sleep(2 ** attempt)  # exponential backoff
            except Exception as e:
                if attempt == max_retries - 1:
                    raise
                logger.warning(f"Embed attempt {attempt + 1} failed: {e}, retrying...")

    return embeddings
```

For self-hosted embedding models, batching is still important — GPU utilization is highest when the model processes a full batch rather than single inputs.

### Dependency graph update pipeline

The dependency graph update pipeline runs in parallel with the vector index update pipeline. It's slower (it requires cross-file analysis rather than per-file chunking) and it can tolerate more lag — the dependency graph doesn't need to be perfectly fresh for every query.

```python
import json

def rebuild_dependency_graph_for_service(
    service_path: str,
    db_conn,
    languages: list[str]
) -> int:
    """
    Rebuild the dependency graph for a service directory.
    Returns the number of edges inserted.
    """
    edges = []

    for language in languages:
        files = list_files_by_language(service_path, language)

        if language == "typescript":
            edges.extend(extract_typescript_http_calls(files))
        elif language == "go":
            edges.extend(extract_go_routes(files))
            edges.extend(extract_go_db_queries(files))
        elif language == "python":
            edges.extend(extract_python_orm_mappings(files))
            edges.extend(extract_python_kafka_producers(files))
        elif language == "sql":
            edges.extend(extract_sql_schema_definitions(files))

    # Cross-language matching
    http_edges = match_http_calls_to_routes(
        [e for e in edges if e.edge_type == "http_call"],
        [e for e in edges if e.edge_type == "http_route"]
    )

    edges.extend(http_edges)

    # Persist to SQLite
    with db_conn:
        # Delete existing edges for files in this service. This assumes chunk IDs
        # begin with the file path; the trailing slash keeps "services/api" from
        # also matching "services/api-gateway".
        db_conn.execute(
            "DELETE FROM dependency_edges WHERE source_chunk_id LIKE ?",
            (f"{service_path.rstrip('/')}/%",)
        )

        # Persist every extracted edge, not only the matched HTTP edges
        db_conn.executemany(
            """INSERT OR REPLACE INTO dependency_edges
               (source_chunk_id, target_chunk_id, edge_type, confidence, metadata)
               VALUES (?, ?, ?, ?, ?)""",
            [(e.source_chunk_id, e.target_chunk_id, e.edge_type,
              e.confidence, json.dumps(e.metadata)) for e in edges]
        )

    return len(edges)
```

### CI/CD integration

For teams that use GitHub Actions, GitLab CI, or similar, integrating the index update into the pipeline is straightforward:

```yaml
# .github/workflows/update-code-index.yml
name: Update Code Index

on:
  push:
    branches: [main]
  workflow_dispatch:

jobs:
  update-index:
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 2  # Need previous commit for diff

      - name: Get changed files
        id: changed
        run: |
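          # HEAD~1 covers only the most recent commit; a push containing several
          # commits needs a diff against the last indexed commit instead.
          git diff --name-only HEAD~1 HEAD > changed_files.txt
          echo "count=$(wc -l < changed_files.txt)" >> "$GITHUB_OUTPUT"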

      - name: Update code index
        if: steps.changed.outputs.count > 0
        env:
          EMBED_API_KEY: ${{ secrets.EMBED_API_KEY }}
          INDEX_CONNECTION: ${{ secrets.INDEX_CONNECTION }}
        run: |
          python -m code_indexer.update \
            --changed-files changed_files.txt \
            --index-connection "$INDEX_CONNECTION"
```

For monorepos with many services, the update job should be triggered by service path, not by any change to the repository. A change to the Python ML service shouldn't trigger a reindex of the Go API service.
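
One way to scope updates by service is to map each changed path to the service root that contains it. The sketch below assumes services live under known root directories; the function name and the example roots are hypothetical.

```python
from pathlib import PurePosixPath


def affected_services(changed_files: list[str], service_roots: list[str]) -> set[str]:
    """Return the service roots that contain at least one changed file."""
    affected = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        for root in service_roots:
            # Component-wise comparison, so "services/api" does not
            # match files under "services/api-gateway".
            if path.is_relative_to(PurePosixPath(root)):
                affected.add(root)
    return affected
```

For example, a change to `services/ml/train.py` with roots `services/ml` and `services/api` yields only `services/ml`, so only that service is reindexed.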

**Key Insight**: The index update pipeline should be idempotent. Running it twice for the same change should produce the same result as running it once. This allows safe re-runs after failures without worrying about duplicate chunks or inconsistent state.
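
A common way to get idempotency is to derive chunk IDs deterministically and replace all chunks for a file on every update. The sketch below illustrates the idea with a plain dict standing in for the vector store; the ID scheme (file path plus symbol name) is an assumption, and your chunker may need a different key.

```python
import hashlib


def chunk_id(filepath: str, symbol: str) -> str:
    """Stable ID derived from where the chunk lives, not when it was indexed."""
    return hashlib.sha256(f"{filepath}::{symbol}".encode()).hexdigest()[:16]


def upsert_chunks(index: dict, filepath: str, chunks: dict[str, str]) -> None:
    """Replace every chunk for `filepath`; `chunks` maps symbol name to text."""
    # Remove old chunks first so deleted or renamed symbols don't linger.
    stale = [cid for cid, rec in index.items() if rec["filepath"] == filepath]
    for cid in stale:
        del index[cid]
    for symbol, text in chunks.items():
        index[chunk_id(filepath, symbol)] = {
            "filepath": filepath,
            "symbol": symbol,
            "text": text,
        }
```

Running `upsert_chunks` twice with the same input leaves the index in the same state as running it once, which is the property a safe re-run depends on.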

### Serving layer design

The serving layer receives queries and returns results. For developer tooling, it usually runs as a local process rather than a remote API — latency matters, and a round trip to a remote service adds perceptible lag to interactive use.

The serving layer architecture:

```
Query → Classify intent → Route to retrieval strategy
  → semantic search | exact lookup | impact analysis | enumeration
  → Rerank results
  → Expand with dependency graph
  → Format and return
```

For IDE integration, the serving layer should be a lightweight local process that starts with the IDE and responds to queries in under 200ms at the 90th percentile. This requires:

- The vector index to be in-process or on localhost (not a remote API call).
- The dependency graph SQLite database to have a persistent connection (connection overhead per query is noticeable at scale).
- The embedding for the query to be computed locally or cached (repeated queries benefit from an embedding cache).

```python
from functools import lru_cache

# embed_fn is assumed to be a module-level callable that embeds one string.
@lru_cache(maxsize=1000)
def cached_embed(query: str) -> tuple[float, ...]:
    embedding = embed_fn(query)
    return tuple(embedding)  # Immutable, so callers can't mutate the cached value
```

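The LRU cache on query embeddings is effective because engineers often repeat queries within a session. Note that `lru_cache` keys on the exact query string: running "authentication handler" twice produces a cache hit, but "Where is authentication handled" and "authentication handler" are separate cache entries even though their embeddings are nearly identical. Normalizing case and whitespace before the lookup raises the hit rate for trivially different queries.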

### Handling large-scale rebuilds

Despite incremental updates, there are situations that require a full rebuild: changing the embedding model, changing the chunking strategy, corrupted index state, or a major refactoring that changed a large fraction of files. A full rebuild needs to be fast enough to complete in a reasonable time.

For a codebase of 50,000 files with an average of 5 chunks per file:
- 250,000 chunks to embed
- At 100 chunks per batch and 100ms per embedding API call: ~250 seconds for embedding alone
- At 1,000 upserts per second: ~250 seconds for index insertion

Total: about 500 seconds for these two stages, so roughly 10 minutes for a full rebuild once parsing and chunking time is added. That's acceptable for a scheduled maintenance window. It's not acceptable for on-demand rebuilds during development.
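
The arithmetic above is worth keeping as a small function so you can plug in your own numbers. This is a back-of-the-envelope sketch: it covers only the embedding and insertion stages, and the parameter names are illustrative.

```python
import math


def estimate_rebuild_seconds(
    files: int,
    chunks_per_file: float,
    batch_size: int,
    seconds_per_batch: float,
    upserts_per_second: float,
    embed_workers: int = 1,
) -> dict:
    """Estimate embedding and insertion time; parsing and chunking are excluded."""
    chunks = int(files * chunks_per_file)
    batches = math.ceil(chunks / batch_size)
    embed_seconds = batches * seconds_per_batch / embed_workers
    upsert_seconds = chunks / upserts_per_second
    return {
        "chunks": chunks,
        "embed_seconds": embed_seconds,
        "upsert_seconds": upsert_seconds,
        "total_seconds": embed_seconds + upsert_seconds,
    }
```

Calling `estimate_rebuild_seconds(50_000, 5, 100, 0.1, 1000)` reproduces the figures in the list: 250,000 chunks, about 250 seconds of embedding, and 250 seconds of insertion.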

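Optimize the rebuild with parallelism at the file processing level (parse and chunk many files simultaneously) and at the embedding level (multiple concurrent embedding API calls, each with a full batch). With 8 parallel embedding workers, embedding drops from ~250 seconds to ~30 seconds, provided the API's rate limits allow it. Index insertion then dominates, so the total only approaches ~2 minutes if upserts are batched or parallelized as well.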
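
Concurrent embedding calls can be bounded with a semaphore so the number of in-flight requests stays under your API's limits. The sketch below assumes `embed_fn` is an async callable that takes a batch of strings and returns one vector per string; the function name `embed_all` is illustrative.

```python
import asyncio


async def embed_all(
    texts: list[str],
    embed_fn,
    batch_size: int = 100,
    workers: int = 8,
) -> list[list[float]]:
    """Embed batches concurrently, keeping at most `workers` requests in flight."""
    semaphore = asyncio.Semaphore(workers)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def embed_batch(batch: list[str]) -> list[list[float]]:
        async with semaphore:
            return await embed_fn(batch)

    # gather preserves input order, so results line up with `texts`.
    results = await asyncio.gather(*(embed_batch(b) for b in batches))
    return [vector for batch_result in results for vector in batch_result]
```

This sketch omits retries for brevity; in practice each `embed_batch` call would wrap the same backoff logic used for sequential batching.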

**Warning**: During a full rebuild, the index is in an inconsistent state: some chunks are from the new build, some from the old. Queries during this window return mixed results. Options: run the rebuild on a shadow index and swap atomically at completion, or temporarily take the serving index offline for queries until the rebuild completes. For developer tooling where brief unavailability is acceptable, the simpler approach of building in place with a staleness indicator is usually fine.
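
For a file-based local index, the shadow-and-swap option can be as simple as a pointer file that names the active index directory. This is a minimal sketch under that assumption; the pointer file format and function names are hypothetical, and hosted vector stores typically offer their own alias or collection-swap mechanism instead.

```python
import json
import os
from pathlib import Path


def publish_index(pointer_file: Path, new_index_dir: Path) -> None:
    """Repoint readers at a freshly built index directory in one step."""
    tmp = pointer_file.with_suffix(".tmp")
    tmp.write_text(json.dumps({"active_index": str(new_index_dir)}))
    # os.replace swaps the file in a single rename when both paths are on
    # the same filesystem, so readers see the old or the new pointer, never
    # a half-written one.
    os.replace(tmp, pointer_file)


def active_index_dir(pointer_file: Path) -> Path:
    """Return the directory that queries should read from."""
    return Path(json.loads(pointer_file.read_text())["active_index"])
```

The rebuild writes into a new directory, calls `publish_index` once it finishes, and the old directory can be deleted after in-flight queries drain.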

### Key Takeaways

- Treat the indexing pipeline as production infrastructure from the start. Build in error isolation, retry logic, and monitoring.
- Batch embedding API calls for throughput. 50-200 chunks per request is typical; tune for your specific API.
- Integrate index updates into CI/CD for automatic freshness maintenance after merges to main.
- The serving layer for interactive developer tools should respond in under 200ms at P90. Cache query embeddings for repeated queries, and normalize queries so trivial variations share a cache entry.
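- Full rebuilds require parallelism to complete in acceptable time. Parallel embedding workers speed up the embedding stage roughly in proportion to the worker count, rate limits permitting; index insertion needs its own batching or parallelism.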

**Try This**: Time your current code indexing setup from scratch. Then time an incremental update for a single file change. If the incremental update takes more than 5 seconds, there's optimization headroom that will pay dividends as the codebase grows.

---

## Chapter 8: Common Failure Modes

### Why failure modes deserve their own chapter

Most technical guides cover the happy path in detail and mention failure modes briefly, if at all. This is backwards for infrastructure. Understanding how a system fails is as important as understanding how it works — and for polyglot retrieval systems, the failure modes are specific enough and common enough that they warrant explicit documentation.

What follows is a catalog of the failures most likely to affect a production polyglot retrieval system, organized by when they occur.

### Failure mode 1: Silent language bias

**What it looks like**: Search results are consistently relevant for one or two languages and mediocre for others. Engineers stop trusting search for minority language files and revert to grep.

**Root cause**: The embedding model was trained on a corpus that's heavily skewed toward certain languages (Python, JavaScript). Its representations for Go, SQL, or Terraform are lower quality, producing embeddings that cluster more loosely and return less precise results.

**How to detect it**: Run a benchmark of 20 known-answer queries per language. Measure recall@5 (fraction of queries where the correct answer appears in the top 5 results) for each language separately. A gap of more than 10-15 percentage points between the best and worst language indicates systematic bias.
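
The per-language benchmark can be sketched as below. It assumes each benchmark case is a dict with `language`, `query`, and `expected_chunk_id` keys and that `search_fn` returns a ranked list of chunk IDs; both shapes are illustrative.

```python
from collections import defaultdict


def recall_at_k_by_language(benchmark: list[dict], search_fn, k: int = 5) -> dict:
    """Fraction of queries per language whose expected chunk is in the top k."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for case in benchmark:
        language = case["language"]
        totals[language] += 1
        top_k = search_fn(case["query"])[:k]
        if case["expected_chunk_id"] in top_k:
            hits[language] += 1
    return {language: hits[language] / totals[language] for language in totals}


def recall_gap(recall_by_language: dict) -> float:
    """Difference between the best and worst language, as a fraction."""
    return max(recall_by_language.values()) - min(recall_by_language.values())
```

A `recall_gap` above roughly 0.10 to 0.15 corresponds to the threshold described above.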

**How to fix it**: Switch to a model with stronger multi-language training. Consider per-language embedding models for severely underserved languages. Apply language-specific score boosts in reranking (as described in Chapter 3).

**Warning**: This failure mode is almost never self-reported by users. Engineers assume the tool "doesn't work for Go" and compensate individually. You won't learn about it from support tickets — you'll only find it by measuring.

---

### Failure mode 2: Chunk boundary artifacts

**What it looks like**: Relevant code appears in search results, but the returned chunks are incomplete or confusing. A function appears but its signature is missing. A class method appears without any indication of what class it belongs to.

**Root cause**: The chunking strategy splits code at boundaries that aren't semantically coherent. A fixed-size splitter cuts through a function. A naive line-count splitter separates a class declaration from its methods.

**How to detect it**: Inspect a sample of 50 chunks from each language. Look for chunks that start mid-function, chunks that end before a closing brace, and chunks that reference names without defining them.

**How to fix it**: Switch to AST-based chunking for languages where it's available. For languages without mature parsers, use conservative overlap (each chunk includes the last few lines of the previous chunk). Ensure every chunk includes the containing class or function signature as a metadata header.
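
The metadata header can be added at chunking time. The sketch below prepends the file path and containing signature as comments in the chunk's own language; the header layout and the `COMMENT_PREFIX` table are assumptions for illustration.

```python
from typing import Optional

# Line-comment prefixes per language; extend for the languages you index.
COMMENT_PREFIX = {
    "python": "#",
    "terraform": "#",
    "sql": "--",
    "go": "//",
    "typescript": "//",
}


def with_context_header(
    chunk_text: str,
    filepath: str,
    language: str,
    parent_signature: Optional[str] = None,
) -> str:
    """Prepend file and containing-scope context so the chunk stands alone."""
    prefix = COMMENT_PREFIX.get(language, "#")
    header = [f"{prefix} file: {filepath}"]
    if parent_signature:
        header.append(f"{prefix} in: {parent_signature}")
    return "\n".join(header + [chunk_text])
```

A method chunk from `models/user.py` then carries `# in: class User(Base)` above its body, so both the embedding and the reader see which class it belongs to.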

---

### Failure mode 3: Generated code pollution

**What it looks like**: Searches return large amounts of generated, boilerplate, or vendor code. A search for "user authentication" returns generated TypeScript from a gRPC definition, mock files from a test generator, and type definitions from a third-party library.

**Root cause**: The indexing pipeline lacks adequate filtering for generated and non-application code.

**How to detect it**: Search for concepts that should return a small number of highly relevant results. If the top 10 results include files with paths like `node_modules/`, `vendor/`, `*.generated.ts`, or `*.mock.ts`, the filter is insufficient.

**How to fix it**: Build a comprehensive `should_index` filter that excludes:
- `node_modules/`, `vendor/`, `.terraform/providers/`, `__pycache__/`
- Files matching `*.generated.*`, `*.pb.go`, `*_pb2.py`, `*.d.ts` (unless they're your own type declarations)
- Test fixtures and mock data files
- Binary files and configuration files that aren't IaC

Review and update this filter regularly. New libraries and build tools introduce new patterns of generated code.

**Key Insight**: Generated code in the index is worse than missing code in the index. Generated code returns with high similarity scores (because it's often syntactically identical to code you're searching for) and adds noise that drives down the ranking of real results.

---

### Failure mode 4: Index staleness at the wrong time

**What it looks like**: An engineer searches for something they know was just added and gets no results, or gets results pointing to the old version of code that was just refactored.

**Root cause**: The index update pipeline isn't running, has fallen behind, or had a silent failure that left some files un-indexed.

**How to detect it**: Track the "last indexed commit" and compare to the current HEAD. Alert when the gap exceeds a threshold (e.g., 50 commits or 4 hours, whichever comes first).

**How to fix it**: Add explicit monitoring for pipeline failures. Run the update pipeline as a post-merge hook in addition to (or instead of) a scheduled job. Display an index freshness indicator in the tool's UI so engineers know when results might be stale.

---

### Failure mode 5: Cross-language matching false positives

**What it looks like**: The cross-language dependency graph contains incorrect connections. A TypeScript function is linked to a Go handler that it doesn't actually call. A Python model is shown as reading from a SQL table it has no connection to.

**Root cause**: URL pattern matching is imprecise. Route patterns that are similar but not identical get matched. Table names that are generic (`users`, `events`, `items`) appear in many services that aren't actually connected.

**How to detect it**: Manually inspect a sample of 20 cross-language edges in the dependency graph, prioritizing those with confidence below 0.9. Verify whether they represent real connections.

**How to fix it**: Increase matching specificity. For HTTP routing, require that the full URL path pattern match, not just the prefix. For SQL, require that the table name appears in both a schema definition file AND application code — a reference in comments shouldn't count. Store confidence scores and filter results by a minimum threshold.
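
Full-path matching for HTTP routes can be sketched with a route-to-regex conversion. The example assumes routes use `:param` or `{param}` placeholders and that call paths have already been extracted as concrete strings; both function names are illustrative.

```python
import re


def route_pattern_to_regex(route: str) -> re.Pattern:
    """Convert '/users/:id' or '/users/{id}' into a regex for the whole path."""
    parts = []
    for segment in route.strip("/").split("/"):
        is_param = segment.startswith(":") or (
            segment.startswith("{") and segment.endswith("}")
        )
        # A parameter matches exactly one path segment; literals match verbatim.
        parts.append(r"[^/]+" if is_param else re.escape(segment))
    return re.compile("^/" + "/".join(parts) + "/?$")


def call_matches_route(call_path: str, route: str) -> bool:
    """True only if the entire call path matches the route pattern."""
    path = call_path.split("?", 1)[0]  # ignore the query string
    return route_pattern_to_regex(route).match(path) is not None
```

With this check, `/api/users/42/orders` no longer matches `/api/users/:id`, which a prefix comparison would have accepted.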

---

### Failure mode 6: Embedding model API failures

**What it looks like**: The index update pipeline fails silently. Files that changed are not reindexed. The failure happens during the embedding step and is swallowed by error handling.

**Root cause**: The embedding API returned an error (rate limit, network failure, server error) that was logged but not retried or surfaced in monitoring.

**How to detect it**: Track the success and failure rates of embedding API calls. Track which files were last successfully indexed. Compare "last indexed" timestamps to "last modified" timestamps.

**How to fix it**: Implement a dead letter queue: files that fail to index are added to a retry queue with exponential backoff. After N failed retries, alert and require manual intervention. Never silently swallow embedding failures.
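
An in-memory version of this retry queue is sketched below to show the bookkeeping. The class name, the 30-second base delay, and the use of plain timestamps are assumptions; a production version would persist this state alongside the index.

```python
from dataclasses import dataclass, field


@dataclass
class RetryQueue:
    max_retries: int = 5
    base_delay: float = 30.0  # seconds; assumed starting backoff
    attempts: dict = field(default_factory=dict)  # filepath -> failure count
    next_due: dict = field(default_factory=dict)  # filepath -> earliest retry time
    dead: list = field(default_factory=list)      # files needing manual attention

    def record_failure(self, filepath: str, now: float) -> None:
        """Schedule a retry with exponential backoff, or dead-letter the file."""
        count = self.attempts.get(filepath, 0) + 1
        self.attempts[filepath] = count
        if count > self.max_retries:
            self.next_due.pop(filepath, None)
            if filepath not in self.dead:
                self.dead.append(filepath)  # alert on this list; never drop it
            return
        self.next_due[filepath] = now + self.base_delay * 2 ** (count - 1)

    def record_success(self, filepath: str) -> None:
        """Clear retry state once the file indexes successfully."""
        self.attempts.pop(filepath, None)
        self.next_due.pop(filepath, None)

    def due(self, now: float) -> list[str]:
        """Files whose backoff has elapsed and should be retried now."""
        return sorted(f for f, t in self.next_due.items() if t <= now)
```

The `dead` list is the signal to alert on: a non-empty list means files are missing from the index and will stay missing until someone intervenes.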

---

### Failure mode 7: Reranking that overrides correct results

**What it looks like**: A search query returns relevant results but they're buried in the list behind less relevant results. The engineer expects to find something in the top 3 and has to scroll to find it.

**Root cause**: The reranking logic applies context signals (recency, language proximity) that happen to be strong for wrong results. A file that was recently modified ranks above a file that's the correct answer to the query but hasn't been touched in months.

**How to detect it**: A/B test the reranker against the raw vector search results on a set of known-answer queries. If the reranker consistently worsens the ranking of correct answers for specific query types, it's over-applying its signals.

**How to fix it**: Reranking signals should adjust scores at the margin, not override base retrieval scores. Cap the maximum boost any single signal can apply (e.g., no more than 20% boost from recency). Only apply context signals when the base scores of competing results are close.
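
Both rules, capped boosts and boosts applied only among close competitors, can be sketched in a few lines. The result shape (`base_score`, `signals`) and the 0.05 closeness margin are assumptions for illustration.

```python
def capped_score(base_score: float, signals: dict, max_boost: float = 0.20) -> float:
    """Apply each signal as a multiplicative boost clamped to +/- max_boost."""
    score = base_score
    for boost in signals.values():
        score *= 1.0 + max(-max_boost, min(max_boost, boost))
    return score


def rerank(results: list[dict], close_margin: float = 0.05, max_boost: float = 0.20) -> list[dict]:
    """Boost only results whose base score is within close_margin of the best."""
    if not results:
        return []
    best = max(r["base_score"] for r in results)
    scored = []
    for r in results:
        is_close = best - r["base_score"] <= close_margin
        final = (
            capped_score(r["base_score"], r.get("signals", {}), max_boost)
            if is_close
            else r["base_score"]
        )
        scored.append({**r, "final_score": final})
    return sorted(scored, key=lambda r: r["final_score"], reverse=True)
```

A recently modified file with a base score far below the leader keeps its base score, so recency can break near-ties but cannot lift a weak match over a strong one.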

---

### Failure mode 8: Cold start degradation

**What it looks like**: When a new service is added to a codebase and indexed for the first time, search results for that service are poor until enough queries have been run to provide feedback signals.

**Root cause**: Some reranking signals (query feedback, view frequency) are learned from usage patterns. A service with no usage history has none of these signals and gets ranked down relative to older, more-queried services.

**How to detect it**: Test search quality for newly indexed services before they've been queried. Compare results for "new service" terms to "established service" terms.

**How to fix it**: Initialize all new service content with neutral (not zero) signals. Don't penalize absence of feedback history — only reward presence of positive signals. Run a synthetic "warm-up" query set against new content at index time.

---

### Failure mode 9: Overlong chunks degrading embedding quality

**What it looks like**: Searches for specific functionality return results that contain the target functionality but are so large that the embedding doesn't specifically represent it. A search for "JWT validation logic" returns a 500-line authentication module that contains JWT validation somewhere inside it.

**Root cause**: The chunking strategy produces chunks that are too large. The embedding model compresses the entire chunk into a fixed-size vector, and the specific concept gets diluted by the surrounding code.

**How to detect it**: Measure the distribution of chunk sizes by language. Flag chunks above 300 tokens. Measure whether queries that should return specific functions instead return large file-level chunks.

**How to fix it**: For languages where chunking is already by function or class, this usually indicates a few very large functions or classes that should be split further. For SQL and Terraform, add size-based splitting as a fallback after statement-level splitting.

---

### Failure mode 10: The wrong query reaching the wrong retrieval strategy

**What it looks like**: Query classification routes a query to the wrong strategy. An impact analysis query like "what breaks if I change the User struct" gets routed to semantic search and returns code that's similar to User structs, not code that depends on the specific User struct.

**Root cause**: The query classifier fails to recognize the query's intent.

**How to detect it**: Log the classification decision for every query. Periodically review a sample of classified queries and verify the classification was correct.

**How to fix it**: Add the misclassified query patterns to the classifier's training set or rule set. For impact analysis specifically, make the query syntax explicit and document it for users: "Use `impact:` prefix to trigger dependency analysis."
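
An explicit prefix is straightforward to honor ahead of any learned or rule-based classifier, and logging the decision supports the audit described above. In this sketch the strategy names and the semantic-search fallback are assumptions; a real classifier would replace the fallback branch.

```python
import logging

logger = logging.getLogger(__name__)

IMPACT_PREFIX = "impact:"


def classify_query(query: str) -> tuple[str, str]:
    """Return (strategy, cleaned_query); an explicit prefix always wins."""
    stripped = query.strip()
    if stripped.lower().startswith(IMPACT_PREFIX):
        strategy = "impact_analysis"
        cleaned = stripped[len(IMPACT_PREFIX):].strip()
    else:
        # Placeholder fallback; plug in your intent classifier here.
        strategy = "semantic_search"
        cleaned = stripped
    # Log every decision so classifications can be sampled and audited later.
    logger.info("query=%r strategy=%s", query, strategy)
    return strategy, cleaned
```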

---

### Key Takeaways

- Silent language bias is the most common and least-diagnosed failure mode. Measure per-language retrieval quality explicitly.
- Generated code pollution degrades results more than missing code does. Invest in comprehensive filtering.
- Index staleness is an operational failure, not a retrieval failure. Monitor pipeline health, not just index quality.
- Reranking signals should adjust at the margin, not override base scores. Cap maximum boosts.
- Log query classifications and audit them periodically. Routing errors compound as the system handles more queries.

**Try This**: Pick three failure modes from this list that you haven't checked for in your system. Design a specific, measurable test for each. Run the tests. For any failure you find, fix it before adding new retrieval features — foundation quality matters more than feature breadth.

---

## Conclusion

The polyglot codebase is not a problem to be solved. It's a condition to be managed. Every engineering organization of any size ends up with one, and the engineers who work in it need tools that respect its actual structure rather than pretending it's a single-language system.

What this guide has tried to do is close the gap between the general principles of AI retrieval — chunk, embed, search — and the specifics of applying those principles to code that spans Python, Go, TypeScript, SQL, Terraform, and everything else. General principles don't survive contact with real systems without this kind of specificity.

The key ideas worth carrying forward:

Language-specific chunking is not optional. A single splitting strategy applied uniformly produces bad results for at least some languages. Python wants AST-based function extraction. Go wants similar AST extraction but with awareness of error-handling verbosity. SQL wants statement-level parsing. Terraform wants resource-block isolation. Each language has a natural unit of semantic coherence, and the chunking strategy should respect it.

The embedding model is the highest-leverage choice in the stack. A model with strong multi-language code training and cross-lingual alignment dramatically outperforms a general-purpose model for polyglot retrieval. Before building anything else, validate your embedding model on cross-language pairs from your actual codebase.

The cross-language dependency graph is the missing piece that most retrieval systems never build. Vector search is excellent for concept retrieval. It cannot trace a dependency that crosses a language boundary. The HTTP route graph, the ORM-to-schema mapping, the event producer-consumer graph — these are deterministic relationships that should be stored and queried as graphs, not inferred from retrieval results.

Hybrid search (vector + BM25) is the baseline. Pure semantic search degrades on exact identifiers. Pure keyword search degrades on concept queries. Together, with RRF scoring, they handle the full range of queries engineers actually run.

Index quality is an operational concern. An index that isn't monitored degrades silently. Language distribution shifts. Files accumulate without indexing. Generated code leaks past filters. Query latency climbs as the index grows. These aren't theoretical concerns — they're the observed behavior of production retrieval systems that aren't actively maintained.

Building a polyglot retrieval system that earns daily use from engineers is not a weekend project. It's an investment that compounds over time. A system that surfaces cross-language connections that would otherwise take an engineer 30 minutes to trace manually, running dozens of times per day across a team, is meaningfully faster development. That's the payoff. It's worth the architecture.

---

## Appendix A: Glossary

**ANN (Approximate Nearest Neighbor)**: An algorithm for finding vectors in a high-dimensional space that are approximately closest to a query vector. Used in vector databases for fast retrieval. Trades exactness for speed.

**AST (Abstract Syntax Tree)**: A tree representation of the grammatical structure of source code. Used in language-aware chunking to identify natural boundaries like function and class definitions.

**BM25 (Best Matching 25)**: A ranking function used in information retrieval based on term frequency and inverse document frequency. Complements vector search for exact-match and rare-term queries.

**Chunk**: A unit of code extracted from a source file for embedding and indexing. The fundamental unit of retrieval.

**ChromaDB**: An open-source vector database commonly used for local or self-hosted embedding storage. Supports metadata filtering and hybrid search.

**Cosine Similarity**: A measure of similarity between two vectors based on the angle between them. Standard metric for comparing embeddings.

**Cross-language dependency graph**: An explicit graph structure that captures connections between code written in different languages — HTTP call-to-handler mappings, ORM-to-schema mappings, event producer-consumer connections.

**Embedding**: A numerical vector that encodes the semantic meaning of a text or code chunk. Similar items have embeddings that are close in vector space.

**HCL (HashiCorp Configuration Language)**: The domain-specific language used by Terraform and other HashiCorp tools for defining infrastructure configuration.

**HNSW (Hierarchical Navigable Small World)**: A graph-based data structure for efficient approximate nearest-neighbor search. Used internally by most modern vector databases.

**Hybrid search**: A retrieval approach that combines vector similarity search with keyword search (typically BM25). Uses RRF or linear combination to merge results.

**Incremental update**: An index update strategy that re-indexes only changed files rather than rebuilding the entire index.
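
One common way to decide which files count as "changed" is to compare content hashes against a manifest saved by the previous run. The sketch below assumes that approach; `plan_incremental_update` and the manifest shape (path mapped to SHA-256 digest) are hypothetical names chosen for illustration, not part of any particular tool.

```python
import hashlib
from pathlib import Path
from typing import Dict, Iterable, List, Tuple


def file_digest(path: Path) -> str:
    """Hash file contents so that edits are detected regardless of timestamps."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def plan_incremental_update(
    paths: Iterable[Path], manifest: Dict[str, str]
) -> Tuple[List[str], List[str], Dict[str, str]]:
    """Compare current files against the previous manifest.

    Returns (files to re-index, files to delete from the index, new manifest).
    """
    current = {str(p): file_digest(p) for p in paths}
    to_reindex = sorted(p for p, digest in current.items() if manifest.get(p) != digest)
    to_delete = sorted(p for p in manifest if p not in current)
    return to_reindex, to_delete, current
```

The returned manifest would be persisted after a successful run, so that a failed run does not mark files as indexed.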

**Polyglot codebase**: A codebase that uses multiple programming languages for different components. Distinct from a single-language codebase that happens to contain a few files in other languages.

**RAG (Retrieval-Augmented Generation)**: A technique that enhances language model responses by retrieving relevant context from an external index and including it in the model's input.

**Reciprocal Rank Fusion (RRF)**: A method for combining ranked lists from multiple retrieval sources. Each item's combined score is the sum, over all lists, of 1 / (k + rank), where rank is the item's position in a list and k is a smoothing constant (commonly 60).
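
A minimal sketch of RRF follows, assuming each retriever returns an ordered list of chunk identifiers. The function name, the identifiers in the example, and the alphabetical tie-break are illustrative assumptions. The default `k=60` is the value commonly used in the literature.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of chunk IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based: the top hit in a list contributes 1 / (k + 1).
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties alphabetically for determinism.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))


# Hypothetical results from a vector search and a BM25 search.
vector_hits = ["auth.py::login", "auth.ts::signIn", "users.sql::users"]
bm25_hits = ["auth.ts::signIn", "handler.go::Login", "auth.py::login"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
# fused == ["auth.ts::signIn", "auth.py::login", "handler.go::Login", "users.sql::users"]
```

Because only rank positions are used, the raw cosine and BM25 scores never need to be normalized against each other, which is the main reason RRF is a convenient default for hybrid search.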

**Reranking**: Reordering an initial set of retrieval results using additional signals (recency, context, dependency proximity) that are too expensive to compute for every candidate during initial retrieval.

**sqlglot**: An open-source SQL parser and transpiler that handles multiple SQL dialects. Used for statement-level SQL chunking.

**tree-sitter**: A language-agnostic parsing library that produces concrete syntax trees, used in this guide in the same role as ASTs, for many programming languages through a single API. Used for chunking TypeScript, Go, and other languages that have a maintained grammar.

**Vector database**: A database optimized for storing and querying high-dimensional vectors (embeddings). Examples include ChromaDB, Pinecone, Weaviate, and Qdrant.

**Vector index**: The data structure within a vector database that enables efficient similarity search. HNSW and IVF (Inverted File Index) are common implementations.

---

## Appendix B: Tools & Resources

### Parsing and Chunking

**tree-sitter** — Language-agnostic parser with support for Python, TypeScript, Go, Rust, and dozens of others. The best choice for polyglot chunking infrastructure because the same API works across all supported languages. Available as a Python package (`tree-sitter`) with language-specific grammar packages (`tree-sitter-typescript`, `tree-sitter-go`, etc.).

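**Python `ast` module** — Standard library. Complete Python AST with source segment extraction (`ast.get_source_segment`). Use this over tree-sitter for Python because it tracks the full grammar of the interpreter running it, including recent additions. Two caveats: it only accepts syntax supported by that interpreter version, and it raises `SyntaxError` on malformed files instead of recovering.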
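
As a rough sketch of what source segment extraction looks like in practice, the function below pulls top-level functions and classes out of a Python file as chunks. The function name `chunk_python_source` and the chunk dictionary shape are illustrative assumptions, not a prescribed schema.

```python
import ast
from typing import Dict, List


def chunk_python_source(source: str) -> List[Dict[str, object]]:
    """Extract top-level function and class definitions as retrieval chunks."""
    tree = ast.parse(source)  # raises SyntaxError on malformed input
    chunks: List[Dict[str, object]] = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "name": node.name,
                    "kind": type(node).__name__,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    # Exact source text of the definition, without decorators.
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks
```

This sketch deliberately ignores nested definitions, decorators, and module-level statements; a production chunker would need a policy for each.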

**Go `go/ast` package** — Standard library. Full Go AST with good tooling. Use with `go/parser` for parsing.

**sqlglot** — SQL parser and transpiler supporting PostgreSQL, MySQL, BigQuery, SQLite, and others. The most practical choice for SQL statement parsing in a Python pipeline.

**hcl2** — Python library for parsing HCL (HashiCorp Configuration Language), published on PyPI as `python-hcl2` and imported as `hcl2`. Use for Terraform file parsing.

### Embedding Models

**voyage-code-3** — Code-specialized embedding model with strong multi-language support. Well-suited for retrieval in polyglot codebases. Available via API.

**jina-embeddings-v3** — General-purpose embedding model with competitive code retrieval performance and a long context window (8192 tokens). Useful when chunk sizes are large.

**multilingual-e5-large** — Despite not being code-specialized, often competitive on cross-language retrieval tasks due to its explicit cross-lingual training objective. Available as an open-weight model via Hugging Face.

**Nomic Embed Code** — A newer open-weight code embedding model with competitive performance across multiple languages. Can be self-hosted.

### Vector Databases

**ChromaDB** — Open-source, embeddable vector database. Runs in-process (no server required) or as a standalone service. Ideal for developer tooling where operational simplicity matters. Supports metadata filtering and has a Python-first API.

**Qdrant** — Open-source vector database with strong performance on filtered search. Better than ChromaDB for large-scale production deployments. Has built-in sparse vector support for hybrid search.

**Weaviate** — Open-source vector database with a rich query language and built-in hybrid search. Heavier operationally than ChromaDB but more powerful.

**LanceDB** — Newer open-source vector database with efficient on-disk storage. Fast full-text search built in. A good choice for codebases where the vector index needs to be stored efficiently and queried quickly.

### Dependency Analysis

**Depends** — Multi-language dependency analysis tool. Language coverage varies by release, so confirm that Java, Python, TypeScript, or whichever languages you need are supported in the version you install. Useful starting point for intra-language dependency graphs.

**Sourcegraph** — Commercial code intelligence platform with cross-repository dependency tracking. Expensive but comprehensive for large engineering organizations.

**NetworkX** — Python library for graph construction and analysis. Use for building and querying the cross-language dependency graph when SQLite is insufficient.

### Search and Retrieval Infrastructure

**BM25s** — Pure Python BM25 implementation. Fast enough for most codebase sizes and easy to integrate with a vector index for hybrid search.

**rank-bm25** — Another Python BM25 implementation, offering Okapi BM25, BM25L, and BM25+ variants behind a minimal API. Simple to drop in, but typically slower than BM25s on large corpora.

**LangChain** — Framework for building retrieval pipelines. Useful for rapid prototyping of retrieval chains but adds significant abstraction overhead for custom polyglot systems.

**LlamaIndex** — Similar to LangChain. Has built-in code splitter support but lacks polyglot awareness. Useful as a starting point, not an end state.

### Monitoring and Observability

**Prometheus + Grafana** — Standard stack for metric collection and visualization. Add a Prometheus client to the serving layer to expose query latency, result counts, and index freshness metrics.

**Sentry** — Error tracking. Add to the ingestion pipeline to catch and alert on parsing and embedding failures.

---

## Appendix C: Further Reading

### Retrieval and Search

**"Pretrained Transformers for Text Ranking: BERT and Beyond"** — Lin et al. The technical foundation for understanding why dense retrieval (embedding-based) outperforms sparse retrieval (BM25) for semantic queries and falls short for exact-match queries. Understanding this trade-off informs every hybrid search implementation.

**"BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models"** — Thakur et al. Benchmark paper that demonstrates retrieval model performance varies dramatically across domains. Directly applicable to understanding why a general-purpose embedding model underperforms on code.

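**"Improving Passage Retrieval with Zero-Shot Question Generation"** — Sachan et al. Describes a reranking technique that scores each retrieved passage by how likely a pretrained language model is to generate the query from it, with no task-specific training. Applicable to code chunks as a reranking signal, particularly for natural-language queries against function-level documentation. Generating hypothetical questions for each chunk at indexing time is a related but separate technique, usually called doc2query.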

### Code Representation and Analysis

**"CodeSearchNet Challenge: Evaluating the State of Semantic Code Search"** — Husain et al. The foundational benchmark for code retrieval. Covers six languages (Go, Java, JavaScript, PHP, Python, and Ruby) and remains a widely used benchmark for code embedding models. Understanding the benchmark is essential for evaluating model choices.

**"GraphCodeBERT: Pre-training Code Representations with Data Flow"** — Guo et al. Introduces the use of data flow graphs in code embeddings. Relevant for understanding how structural code information (beyond text) can improve embedding quality.

**"Tree-sitter: A New Parsing System for Programming Tools"** — Max Brunsfeld. A conference talk (Strange Loop 2018) introducing the design of the tree-sitter parsing system. Worth watching before building anything on top of it.

### Polyglot Systems and Architecture

**"Building Microservices"** — Sam Newman. The canonical reference for service-oriented architecture. The dependency tracing challenges described in Chapter 4 are directly related to the microservices communication patterns Newman describes.

**"Designing Data-Intensive Applications"** — Martin Kleppmann. Covers data pipeline architecture, streaming systems, and database internals at a depth that directly applies to building production retrieval infrastructure.

**"Software Architecture: The Hard Parts"** — Ford, Richards, Sadalage, Dehghani. Addresses the architectural decisions in distributed polyglot systems, including service decomposition and cross-service dependency management.

### Practical Code Intelligence

**Language Server Protocol Specification** — Microsoft. The protocol underlying IDE intelligence features. Understanding LSP is valuable for integrating retrieval into IDE tooling and for understanding how language-specific intelligence is exposed across tools.

**Sourcegraph Blog** — The engineering blog from Sourcegraph covers practical code intelligence challenges including cross-repository search, precise code navigation, and scaling issues. Highly relevant to the problems described in this guide, with the additional context of systems operating at much larger scale.

**GitHub Semantic** — The open-source library GitHub built for semantic code analysis. Covers language-aware parsing for multiple languages. Some of the code may be outdated, but the architecture and design decisions documented in the codebase are instructive.

---

*Polyglot and Multi-Language Codebases with AI — Version 1.0 — April 2026*

*David Kelly Price*

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*

---

## Related Blog Posts

- [Your Codebase Has Its Own Language](https://pyckle.co/blog/your-codebase-has-its-own-languageand-your-ai-doesnt-speak-it.html)
- [When Everything Is Flat, Everything Gets Lost](https://pyckle.co/blog/when-everything-is-flat-everything-gets-lost.html)

---

*[Browse all free guides →](https://pyckle.co/books.html)*
