# The Developer's Guide to Semantic Code Search

## Introduction

Code search has two modes: you know what you're looking for, or you know what problem you're trying to solve.

The first mode is well-served. IDE symbol search, `grep`, language server `go-to-definition` — these work. They are fast and accurate when you already know the function name or file.

The second mode is the hard one. "Where does this system validate JWT tokens." "What runs when a background job fails." "How does the API rate limiter work." These queries don't have exact answers to match against. They require the search system to understand what you're describing and find the code that does it.

This guide covers the infrastructure, the tradeoffs, and the implementation patterns for building semantic code search that actually works at the L1 level — queries where the vocabulary doesn't match the code.

---

## Part 1: How Semantic Search Works

### The Core Mechanism

Semantic search converts text into dense vectors — fixed-length arrays of floating-point numbers where semantically similar text ends up in geometrically close positions. At query time, you convert the query into a vector and find the stored vectors closest to it.

The retrieval quality depends entirely on the quality of the embedding model. A model that places "user authentication" near `validate_jwt_token()` in vector space will return useful results. A model that places them far apart will return noise.
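The mechanism can be sketched with a brute-force search over a toy index. The three-dimensional vectors below are hand-made stand-ins for real embeddings, and `top_k` and `toy_index` are illustrative names, not part of any library:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2) -> list[tuple[str, float]]:
    """Score every stored vector against the query and keep the k best."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in index.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

# Assumption: hand-made vectors, not the output of a real embedding model.
toy_index = {
    "validate_jwt_token": [0.9, 0.1, 0.0],
    "send_invoice_email": [0.0, 0.2, 0.9],
    "refresh_session": [0.7, 0.3, 0.1],
}
query_vec = [0.8, 0.2, 0.0]  # pretend embedding of "user authentication"

results = top_k(query_vec, toy_index)
print([name for name, _ in results])
```

With these toy values the two authentication-related functions rank first. A real system replaces the hand-made vectors with model output and the linear scan with a vector database.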

### Embedding Dimensions

Most embedding models produce 384- to 1536-dimensional vectors. Higher dimensionality generally means more expressive representations, at the cost of storage and computation. For code search:

- 384 dimensions: compact, fast, lower quality ceiling
- 768 dimensions: good balance for most codebases
- 1536 dimensions: highest quality, 4x storage vs. 384

The dimension doesn't matter as much as the model's training distribution. A 384-dimension model trained on code will outperform a 1536-dimension model trained on Wikipedia for code retrieval tasks.
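To make the storage cost concrete, here is a back-of-the-envelope calculation. It assumes vectors are stored as 32-bit floats and uses a hypothetical index of 500,000 chunks. Real databases add overhead for the ANN graph, metadata, and document text, so treat these as lower bounds:

```python
BYTES_PER_FLOAT32 = 4  # assumption: vectors stored as float32

def index_size_bytes(num_chunks: int, dimensions: int) -> int:
    """Raw vector storage only: no ANN graph, metadata, or document text."""
    return num_chunks * dimensions * BYTES_PER_FLOAT32

for dims in (384, 768, 1536):
    size_mb = index_size_bytes(500_000, dims) / 1_000_000
    print(f"{dims} dims: {size_mb:,.0f} MB")
# 384 dims: 768 MB
# 768 dims: 1,536 MB
# 1536 dims: 3,072 MB
```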

### Similarity Metrics

**Cosine similarity** is the standard for text embeddings. It is the cosine of the angle between two vectors, so vector magnitude has no effect on the score. Range: -1 to 1. For retrieval, higher is better.

**Dot product** is faster (no normalization) and equivalent to cosine similarity when vectors are unit-normalized. Most embedding models return unit-normalized vectors; check your model's documentation.

**Euclidean distance** is less common for text but used in some approximate nearest neighbor libraries. Lower is better.

For all practical purposes: use cosine similarity unless your library requires something else.
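The relationship between the three metrics on unit-normalized vectors can be checked directly. This is a small sketch with arbitrary example vectors; the helper names are illustrative:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def cosine(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def squared_euclidean(a: list[float], b: list[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = normalize([3.0, 4.0, 0.0])
b = normalize([1.0, 2.0, 2.0])

cos_ab = cosine(a, b)
dot_ab = dot(a, b)
sq_dist = squared_euclidean(a, b)

# On unit vectors, up to float rounding:
#   dot_ab  == cos_ab
#   sq_dist == 2 - 2 * cos_ab
print(cos_ab, dot_ab, sq_dist)
```

The second identity is why a store that reports squared Euclidean distance still ranks unit-normalized vectors in the same order as cosine similarity: smaller distance always means larger cosine.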

---

## Part 2: Chunking Strategy

How you split your codebase into chunks has a larger effect on retrieval quality than most developers expect.

### The Wrong Way: Fixed-Size Text Chunks

```python
# Don't do this for code
def chunk_text(text, size=512):
    return [text[i:i+size] for i in range(0, len(text), size)]
```

Fixed-size chunking cuts across function boundaries, class definitions, and logical units. The resulting chunks lose structural context. A chunk that starts mid-function has no type information, no function name, and often no meaningful semantics.

### Better: Function-Level Chunking

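Parse the AST and extract complete functions and classes as chunks. This preserves structural boundaries. Because `ast.walk` also visits methods and nested functions, a method shows up both inside its class chunk and as a chunk of its own: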

```python
import ast

def extract_functions(source: str, filepath: str) -> list[dict]:
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunk_source = ast.get_source_segment(source, node)
            chunks.append({
                "content": chunk_source,
                "name": node.name,
                "file": filepath,
                "line_start": node.lineno,
                "line_end": node.end_lineno,
            })
    return chunks
```

Function-level chunks work well for:
- Direct function lookups
- Understanding what a specific function does
- Finding similar implementations

They work less well for:
- Cross-function data flow queries
- Understanding module-level patterns
- Queries about how multiple functions interact

### Best: Hybrid Chunking with Overlap

Use function-level chunks as the primary unit, with context windows that include neighboring code:

```python
def extract_with_context(source: str, filepath: str, context_lines: int = 5) -> list[dict]:
    lines = source.splitlines()
    tree = ast.parse(source)
    chunks = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # Include lines above for decorator/comment context
            start = max(0, node.lineno - context_lines - 1)
            end = node.end_lineno

            context_chunk = "\n".join(lines[start:end])
            chunks.append({
                "content": context_chunk,
                "name": node.name,
                "file": filepath,
                "line_start": node.lineno,
                "line_end": node.end_lineno,
                "has_context": True,
            })
    return chunks
```

### Metadata Enrichment

Metadata is not embedded, but it is critical for post-retrieval filtering and display:

```python
chunk = {
    "content": function_source,
    "metadata": {
        "file": filepath,
        "language": "python",
        "function_name": node.name,
        "class_name": parent_class,
        "imports": extract_imports(source),
        "line_start": node.lineno,
        "last_modified": file_mtime,
    }
}
```

Always store:
- File path and line numbers (for "go to definition" links)
- Language
- Function/class name
- Last modified timestamp (for staleness detection)
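The metadata example above references `parent_class` and `extract_imports` without defining them. One possible way to compute both with the standard `ast` module is sketched below. The function names and return shapes are assumptions for illustration, not a fixed interface:

```python
import ast
from typing import Optional

def extract_imports(source: str) -> list[str]:
    """Return the module names imported anywhere in the file."""
    modules = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            modules.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom) and node.module:
            # Relative imports such as `from . import x` have no module name and are skipped
            modules.append(node.module)
    return modules

def functions_with_parent_class(source: str) -> list[tuple[str, Optional[str]]]:
    """Pair each function name with its enclosing class, or None."""
    pairs: list[tuple[str, Optional[str]]] = []

    def visit(node: ast.AST, parent_class: Optional[str]) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, ast.ClassDef):
                visit(child, child.name)
            elif isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                pairs.append((child.name, parent_class))
                visit(child, None)  # functions nested in a method are not methods
            else:
                visit(child, parent_class)

    visit(ast.parse(source), None)
    return pairs

sample = '''
import os
from pathlib import Path

class TokenValidator:
    def validate(self, token):
        return bool(token)

def helper():
    pass
'''

print(extract_imports(sample))
print(functions_with_parent_class(sample))
```

Some vector stores accept only scalar metadata values, in which case the import list may need to be joined into a single string before storing.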

---

## Part 3: Vector Database Selection

### For Prototyping: ChromaDB

ChromaDB is the fastest path from zero to working. It runs in-process with no server required, and persistence is handled automatically.

```python
import chromadb
import httpx
from chromadb import Documents, EmbeddingFunction, Embeddings

class PyckleEmbeddingFunction(EmbeddingFunction):
    def __init__(self, api_key: str):
        self.api_key = api_key

    def __call__(self, input: Documents) -> Embeddings:
        response = httpx.post(
            "https://api.pyckle.co/v1/embeddings",
            headers={"Authorization": f"Bearer {self.api_key}"},
            json={"input": input, "model": "pycklelm-1"},
            timeout=30,
        )
        response.raise_for_status()
        data = response.json()["data"]
        return [item["embedding"] for item in sorted(data, key=lambda x: x["index"])]

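client = chromadb.PersistentClient(path="./codebase_index")
collection = client.get_or_create_collection(
    name="codebase",
    embedding_function=PyckleEmbeddingFunction(api_key="pk_live_..."),
    # Chroma defaults to squared L2 distance; request cosine so that
    # `1 - distance` can be read as cosine similarity at query time.
    metadata={"hnsw:space": "cosine"},
)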
```

ChromaDB limitations: no horizontal scaling, performance degrades above ~1M vectors, limited filtering options.

### For Production: Qdrant or Weaviate

**Qdrant** is fast, runs as a Docker container, and has excellent filtering support:

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient("localhost", port=6333)
client.create_collection(
    collection_name="codebase",
    vectors_config=VectorParams(size=768, distance=Distance.COSINE),
)
```

**Weaviate** has stronger built-in hybrid search (BM25 + semantic fusion) out of the box:

```python
import weaviate
import weaviate.classes as wvc

client = weaviate.connect_to_local()
client.collections.create(
    name="CodeChunk",
    vectorizer_config=wvc.config.Configure.Vectorizer.none(),
    properties=[
        wvc.config.Property(name="content", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="filepath", data_type=wvc.config.DataType.TEXT),
        wvc.config.Property(name="functionName", data_type=wvc.config.DataType.TEXT),
    ]
)
```

### Approximate Nearest Neighbor

At scale (>100K chunks), exact similarity search becomes slow. ANN algorithms trade a small amount of recall for large speedups. HNSW (Hierarchical Navigable Small World) is the standard:

- Qdrant, Weaviate, and ChromaDB all use HNSW internally
- Typical recall@10 is 95-99% relative to exact search, at a 10-100x speedup
- For most code search workloads, this tradeoff is worth it
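Recall@k is easy to measure on your own index: run a sample of queries through both exact and approximate search and compare the returned ids. A minimal sketch, with hypothetical chunk ids:

```python
def recall_at_k(exact_ids: list[str], approx_ids: list[str], k: int = 10) -> float:
    """Fraction of the exact top-k that the approximate top-k also returned."""
    exact_top = set(exact_ids[:k])
    if not exact_top:
        return 0.0
    return len(exact_top & set(approx_ids[:k])) / len(exact_top)

# Hypothetical ids: the ANN index missed one of the four true neighbors.
exact = ["auth.py:10", "auth.py:42", "session.py:7", "tokens.py:3"]
approx = ["auth.py:10", "session.py:7", "auth.py:42", "billing.py:99"]

print(recall_at_k(exact, approx, k=4))  # 0.75
```

Averaging this over a few hundred representative queries gives a more trustworthy number than any published benchmark, since recall depends on the data and the HNSW parameters.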

---

## Part 4: Hybrid Search

Pure semantic search misses exact matches. A developer searching for `useAuthorizationMiddleware` by name gets better results from keyword search. Combining the two covers both kinds of query.

### BM25 + Semantic Fusion (RRF)

Reciprocal Rank Fusion merges results from two ranked lists:

```python
def reciprocal_rank_fusion(
    semantic_results: list[tuple[str, float]],
    bm25_results: list[tuple[str, float]],
    k: int = 60
) -> list[tuple[str, float]]:
    scores = {}

    for rank, (doc_id, _) in enumerate(semantic_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, (doc_id, _) in enumerate(bm25_results):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)
```

RRF needs almost no tuning: its one constant, `k`, is conventionally set to 60 and is robust across most use cases. Because it works on ranks alone, it doesn't require score normalization between the two systems.

### When to Weight Semantic vs. Keyword

- **Short, specific queries** (function names, variable names): weight keyword higher
- **Descriptive queries** ("where does billing webhook fire"): weight semantic higher
- **Mixed queries** ("find rate_limit related to billing"): use balanced RRF

Auto-detection heuristic: if the query contains camelCase or snake_case tokens, assume it's keyword-heavy.
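The weighting rules and the auto-detection heuristic can be combined into a weighted variant of RRF. This is a sketch under stated assumptions: the regular expression, the 50% threshold, and the 0.5/1.5 weights are illustrative choices rather than tuned values, and the function takes plain id lists instead of `(id, score)` tuples:

```python
import re

# camelCase (lowercase or digit followed by uppercase) or snake_case
# (an underscore with a letter or digit on both sides).
IDENTIFIER_PATTERN = re.compile(r"[a-z0-9][A-Z]|[A-Za-z0-9]_[A-Za-z0-9]")

def looks_like_identifier(token: str) -> bool:
    return bool(IDENTIFIER_PATTERN.search(token))

def choose_weights(query: str) -> tuple[float, float]:
    """Return (semantic_weight, keyword_weight) for a query."""
    tokens = query.split()
    if not tokens:
        return 1.0, 1.0
    share = sum(looks_like_identifier(t) for t in tokens) / len(tokens)
    if share >= 0.5:
        return 0.5, 1.5  # mostly identifiers: favor keyword results
    if share == 0:
        return 1.5, 0.5  # purely descriptive: favor semantic results
    return 1.0, 1.0      # mixed: balanced RRF

def weighted_rrf(
    semantic_ids: list[str],
    keyword_ids: list[str],
    semantic_weight: float = 1.0,
    keyword_weight: float = 1.0,
    k: int = 60,
) -> list[tuple[str, float]]:
    """RRF where each list's contribution is scaled by a weight."""
    scores: dict[str, float] = {}
    for weight, ranked in ((semantic_weight, semantic_ids), (keyword_weight, keyword_ids)):
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (k + rank + 1)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

sem_w, kw_w = choose_weights("find rate_limit related to billing")
fused = weighted_rrf(["a", "b"], ["b", "c"], sem_w, kw_w)
print(fused[0][0])  # b
```

With both weights at 1.0 this reduces to standard RRF, so the heuristic can be switched off without changing the fusion code.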

---

## Part 5: Indexing Pipeline

### Initial Index

```python
import asyncio
import httpx
from pathlib import Path

async def index_codebase(
    root: str,
    api_key: str,
    collection,
    batch_size: int = 100
):
    chunks = []

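    for filepath in Path(root).rglob("*.py"):
        if should_skip(filepath):
            continue
        try:
            source = filepath.read_text(encoding="utf-8")
            file_chunks = extract_with_context(source, str(filepath))
        except (UnicodeDecodeError, SyntaxError):
            # Skip files that can't be decoded or parsed rather than abort the run
            continue
        chunks.extend(file_chunks)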

    # Batch embed
    async with httpx.AsyncClient() as client:
        for i in range(0, len(chunks), batch_size):
            batch = chunks[i:i+batch_size]
            texts = [c["content"] for c in batch]

            response = await client.post(
                "https://api.pyckle.co/v1/embeddings",
                headers={"Authorization": f"Bearer {api_key}"},
                json={"input": texts, "model": "pycklelm-1"},
                timeout=30,
            )
            response.raise_for_status()
            data = response.json()["data"]
            embeddings = [item["embedding"] for item in sorted(data, key=lambda x: x["index"])]

            metadatas = [
                {"file": c["file"], "name": c["name"],
                 "line_start": c["line_start"], "line_end": c["line_end"]}
                for c in batch
            ]
            collection.add(
                documents=texts,
                embeddings=embeddings,
                metadatas=metadatas,
                ids=[f"{c['file']}:{c['line_start']}" for c in batch]
            )

    return len(chunks)

def should_skip(filepath: Path) -> bool:
    skip_dirs = {".git", "node_modules", "__pycache__", ".venv", "dist", "build"}
    return any(part in skip_dirs for part in filepath.parts)
```

### Incremental Updates

Don't re-index the entire codebase on every change. Track file modification times:

```python
def get_changed_files(root: str, index_state: dict) -> list[str]:
    changed = []
    for filepath in Path(root).rglob("*.py"):
        if should_skip(filepath):  # same exclusions as the initial index
            continue
        mtime = filepath.stat().st_mtime
        if str(filepath) not in index_state or index_state[str(filepath)] < mtime:
            changed.append(str(filepath))
    return changed

async def update_index(root: str, api_key: str, collection, index_state: dict):
    changed = get_changed_files(root, index_state)

    for filepath in changed:
        # Delete old chunks for this file
        collection.delete(where={"file": filepath})

        # Re-index
        source = Path(filepath).read_text()
        chunks = extract_with_context(source, filepath)
        # ... embed and add new chunks

        index_state[filepath] = Path(filepath).stat().st_mtime

    return len(changed)
```
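The update loop above handles new and modified files, but not deleted ones: their chunks would stay in the index and keep showing up in results. A minimal sketch of a pruning step follows. It assumes `index_state` maps file paths to modification times as above, and that `collection` exposes the same `delete(where=...)` call used in `update_index`; the function names are hypothetical:

```python
from pathlib import Path

def get_deleted_files(index_state: dict[str, float]) -> list[str]:
    """Paths recorded in the index state that no longer exist on disk."""
    return [path for path in index_state if not Path(path).exists()]

def prune_deleted(collection, index_state: dict[str, float]) -> int:
    """Remove chunks and state entries for files that were deleted."""
    deleted = get_deleted_files(index_state)
    for path in deleted:
        collection.delete(where={"file": path})  # same call the update loop uses
        del index_state[path]
    return len(deleted)
```

Running `prune_deleted` before `update_index` keeps the two concerns separate and leaves the change-detection logic untouched.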

### File Watcher Integration

For IDE integrations, trigger incremental updates on file save:

```python
import asyncio

from watchdog.observers import Observer
from watchdog.events import FileSystemEventHandler

class CodebaseWatcher(FileSystemEventHandler):
    def __init__(self, indexer, loop: asyncio.AbstractEventLoop):
        self.indexer = indexer
        self.loop = loop

    def on_modified(self, event):
        if not event.is_directory and event.src_path.endswith(".py"):
            asyncio.run_coroutine_threadsafe(
                self.indexer.update_file(event.src_path), self.loop
            )
```
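A single save can produce several modification events in quick succession, which would trigger redundant re-indexing. One simple mitigation is a leading-edge throttle: handle the first event for a path and ignore repeats inside a short window. The class below is a hypothetical helper, and the 0.5 second window is an assumption to adjust for your editor:

```python
import time
from typing import Optional

class SaveDebouncer:
    """Suppress repeat events for the same path inside a short window."""

    def __init__(self, window_seconds: float = 0.5):
        self.window_seconds = window_seconds
        self._last_seen: dict[str, float] = {}

    def should_process(self, path: str, now: Optional[float] = None) -> bool:
        # `now` is injectable so the behavior can be tested without sleeping
        now = time.monotonic() if now is None else now
        last = self._last_seen.get(path)
        if last is not None and now - last < self.window_seconds:
            return False
        self._last_seen[path] = now
        return True

debouncer = SaveDebouncer()
print(debouncer.should_process("app/auth.py", now=10.0))  # True
print(debouncer.should_process("app/auth.py", now=10.2))  # False
```

In the watcher, the check would go at the top of `on_modified`, before scheduling the coroutine. The tradeoff is that the first event may arrive while the file is still being written; a trailing-edge debounce that waits for events to stop avoids that at the cost of more code.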

---

## Part 6: Query Pipeline

### Basic Query

```python
async def search(
    query: str,
    collection,
    api_key: str,
    top_k: int = 10,
    filter_dict: dict = None
) -> list[dict]:
    # Embed the query
    async with httpx.AsyncClient(timeout=30) as client:
        response = await client.post(
            "https://api.pyckle.co/v1/embeddings",
            headers={"Authorization": f"Bearer {api_key}"},
            json={"input": [query], "model": "pycklelm-1"},
        )
        response.raise_for_status()
        query_embedding = response.json()["data"][0]["embedding"]

    # Retrieve
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=top_k,
        where=filter_dict
    )

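    return [
        {
            "content": doc,
            "metadata": meta,
            # Reads as cosine similarity only if the collection was created with
            # cosine distance; Chroma's default space is squared L2.
            "score": 1 - dist,
        }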
        for doc, meta, dist in zip(
            results["documents"][0],
            results["metadatas"][0],
            results["distances"][0]
        )
    ]
```

### Query Expansion

For short or ambiguous queries, expanding with synonyms or related terms improves recall:

```python
def expand_query(query: str) -> str:
    expansions = {
        "auth": "authentication authorization",
        "db": "database",
        "config": "configuration settings",
        "middleware": "handler interceptor",
    }

    expanded = []
    for token in query.split():
        expanded.append(token)  # keep original casing so identifiers survive
        key = token.lower()
        if key in expansions:
            expanded.append(expansions[key])

    return " ".join(expanded)
```

### Re-ranking

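After retrieval, a cross-encoder re-ranker can improve precision because it scores the query and each candidate together instead of comparing precomputed vectors. The model below was trained on MS MARCO passage ranking rather than code, so treat it as a baseline: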

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, results: list[dict], top_k: int = 5) -> list[dict]:
    pairs = [(query, r["content"]) for r in results]
    scores = reranker.predict(pairs)

    ranked = sorted(
        zip(results, scores),
        key=lambda x: x[1],
        reverse=True
    )

    return [r for r, _ in ranked[:top_k]]
```

---

## Part 7: Production Considerations

### Caching

Query results for common queries are worth caching. Use a short TTL (5-15 minutes) to avoid stale results after indexing updates:

```python
import hashlib
import json

def cache_key(query: str, filters: dict) -> str:
    return hashlib.sha256(
        json.dumps({"query": query, "filters": filters}, sort_keys=True).encode()
    ).hexdigest()
```
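The key function above still needs a store that honors the TTL. A minimal in-memory sketch follows; `TTLCache` is a hypothetical class, the 600 second default is one point in the 5-15 minute range, and a multi-process deployment would likely use a shared cache instead:

```python
import time
from typing import Any, Optional

class TTLCache:
    """In-memory cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds: float = 600.0):
        self.ttl_seconds = ttl_seconds
        self._entries: dict[str, tuple[float, Any]] = {}

    def get(self, key: str, now: Optional[float] = None) -> Optional[Any]:
        now = time.monotonic() if now is None else now
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if now - stored_at > self.ttl_seconds:
            del self._entries[key]  # expired: drop it and report a miss
            return None
        return value

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        self._entries[key] = (time.monotonic() if now is None else now, value)

    def clear(self) -> None:
        """Drop everything, for example right after an index update."""
        self._entries.clear()
```

Calling `clear()` at the end of each indexing run removes the stale-result window entirely, which makes the TTL a fallback rather than the main invalidation mechanism.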

### Staleness Detection

Track when chunks were last indexed by storing an `indexed_at` Unix timestamp in each chunk's metadata at index time. Surface staleness warnings in results:

```python
import time

def is_stale(metadata: dict, max_age_seconds: int = 3600) -> bool:
    indexed_at = metadata.get("indexed_at", 0)
    return time.time() - indexed_at > max_age_seconds
```
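Age alone flags chunks that are old but still accurate. A complementary check compares the file's current modification time against the index timestamp, so only chunks whose source actually changed are flagged. This sketch assumes the metadata carries the `file` path stored by the indexing pipeline and the `indexed_at` timestamp used above; the function name is hypothetical:

```python
import os

def changed_since_indexed(metadata: dict) -> bool:
    """True if the source file was modified after the chunk was indexed."""
    indexed_at = metadata.get("indexed_at", 0)
    try:
        return os.path.getmtime(metadata["file"]) > indexed_at
    except (KeyError, OSError):
        # No path recorded, or the file is gone: treat the chunk as stale
        return True
```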

### Access Control

If your codebase has access controls, enforce them at the retrieval layer:

```python
async def search_with_acl(query: str, user_id: str, collection, api_key: str):
    allowed_files = await get_user_accessible_files(user_id)

    return await search(
        query=query,
        collection=collection,
        api_key=api_key,
        filter_dict={"file": {"$in": allowed_files}}
    )
```
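As a second line of defense, results can be checked again after retrieval, so a misconfigured filter never leaks restricted code. The sketch below assumes results have the shape returned by `search` above, with the path under `metadata["file"]`; `filter_by_acl` is a hypothetical helper:

```python
def filter_by_acl(results: list[dict], allowed_files: set[str]) -> list[dict]:
    """Drop any result whose file is not in the caller's allowed set."""
    return [r for r in results if r["metadata"].get("file") in allowed_files]

# Hypothetical results in the shape returned by `search`
sample_results = [
    {"content": "def login(): ...", "metadata": {"file": "app/auth.py"}, "score": 0.91},
    {"content": "def payout(): ...", "metadata": {"file": "finance/payouts.py"}, "score": 0.88},
]

visible = filter_by_acl(sample_results, {"app/auth.py"})
print([r["metadata"]["file"] for r in visible])  # ['app/auth.py']
```

An empty allowed set yields an empty result list, so a user with no access sees nothing rather than everything.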

---

## Conclusion

Semantic code search that works at L1 — the queries developers actually need, not just the ones that are easy to answer — requires:

1. A retrieval model trained on code-to-query pairs, not general-purpose text
2. Function-level chunking with contextual overlap
3. Hybrid search (semantic + BM25) for both exact and descriptive queries
4. Incremental indexing to stay current with code changes
5. A feedback loop that improves retrieval quality based on real usage

The Pyckle Embeddings API provides the retrieval model. The rest of this guide provides the infrastructure.

---

*The Pyckle Embeddings API is live at pyckle.co. 768-dimensional embeddings, trained specifically for code-to-query retrieval, available at 10 req/min (free) or 60 req/min (Pro).*
