Embed a Codebase and Query It

From raw files to semantic search in one session. Chunking, embedding, storing, and querying.

Prerequisites
  • Pyckle API key
  • Python 3.9+
  • pip install openai pgvector psycopg2-binary chromadb

You have a codebase. You want to ask it questions in plain English. This guide walks you through chunking, embedding, storing, and querying — end to end, with real code.

Step 1: Choose Your Chunking Strategy

Chunking determines what your search returns. Three options exist: file-level, sliding window, and function-level. File-level is too coarse — you'll get back 500-line files when you needed one function. Sliding window works for prose but ignores code structure entirely. Function-level is the right call for code: each chunk is a self-contained unit with a name, a signature, and a purpose.
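To make the contrast concrete, here is a minimal sketch of a sliding-window chunker. The function name and the window and overlap sizes are illustrative assumptions, not part of the pipeline in this guide. Window boundaries fall wherever the line count puts them, so a function can easily be cut in half.

```python
def sliding_window_chunks(source: str, window: int = 40, overlap: int = 10) -> list[dict]:
    """Split source into fixed-size line windows that overlap by `overlap` lines."""
    if overlap >= window:
        raise ValueError("overlap must be smaller than window")
    lines = source.splitlines()
    step = window - overlap
    chunks = []
    for start in range(0, max(len(lines), 1), step):
        piece = lines[start : start + window]
        if not piece:
            break
        chunks.append({
            "text": "\n".join(piece),
            "lines": f"{start + 1}-{start + len(piece)}",
        })
        if start + window >= len(lines):
            break  # this window already reached the end of the file
    return chunks
```

A 100-line file with these defaults yields windows covering lines 1-40, 31-70, and 61-100, regardless of where any function starts or ends.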

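For Python, use ast.parse to extract functions precisely. For JavaScript and TypeScript, a regex on function declarations and arrow functions assigned to const, let, or var covers the most common definitions without pulling in a full parser, though it misses class methods and single-parameter arrows written without parentheses.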

Warning

Exclude node_modules/, .git/, dist/, build/, and any generated files before you chunk. Embedding build artifacts wastes tokens and pollutes your search results with noise you'll never want returned.
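Directory names catch most build output, but generated files also live next to hand-written source. One possible way to spot them is sketched below. The suffixes and header markers are assumed heuristics, and looks_generated is a hypothetical helper, so adjust both to your repository's conventions.

```python
from pathlib import Path

# Assumed heuristics; extend or trim these for your own repo.
GENERATED_SUFFIXES = (".min.js", ".bundle.js", "_pb2.py")
GENERATED_MARKERS = ("@generated", "do not edit", "auto-generated")

def looks_generated(path: Path, head_bytes: int = 2048) -> bool:
    """Return True if the file name or its first bytes suggest generated code."""
    if path.name.endswith(GENERATED_SUFFIXES):
        return True
    try:
        with path.open("rb") as fh:
            head = fh.read(head_bytes).decode("utf-8", errors="ignore")
    except OSError:
        return False  # unreadable files are left for the caller to handle
    head_lower = head.lower()
    return any(marker in head_lower for marker in GENERATED_MARKERS)
```

You could call this on each path before chunking and skip any file for which it returns True.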

Step 2: Chunk Your Codebase Files

Walk the directory, filter to source files, and extract functions. The snippet below parses Python with the AST and splits JS/TS with a regex. Either path falls back to a single whole-file chunk when no functions are found or a Python file fails to parse.

import ast
import os
from pathlib import Path
from typing import Iterator

def iter_source_files(root: str, extensions=(".py", ".js", ".ts")) -> Iterator[Path]:
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip junk directories in place
        dirnames[:] = [
            d for d in dirnames
            if d not in {"node_modules", ".git", "dist", "build", "__pycache__", ".venv"}
        ]
        for fname in filenames:
            if fname.endswith(extensions):
                yield Path(dirpath) / fname

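def chunk_python_file(path: Path) -> list[dict]:
    source = path.read_text(encoding="utf-8", errors="ignore")
    whole_file = [{"text": source, "file": str(path), "name": path.name}]
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable file: keep it searchable as a single chunk
        return whole_file

    lines = source.splitlines()
    chunks = []
    # ast.walk visits nested nodes too, so a class and each of its methods
    # become separate, overlapping chunks.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator so it stays attached to its definition
            first = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append({
                "text": "\n".join(lines[first - 1 : node.end_lineno]),
                "file": str(path),
                "name": node.name,
                "lines": f"{first}-{node.end_lineno}",
            })
    # No functions or classes (for example a constants module): use the whole file
    return chunks or whole_file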

def chunk_js_file(path: Path) -> list[dict]:
    import re
    source = path.read_text(encoding="utf-8", errors="ignore")
    # Split on function declarations and arrow functions assigned to const/let/var
    pattern = re.compile(
        r'((?:export\s+)?(?:async\s+)?function\s+\w+|(?:const|let|var)\s+\w+\s*=\s*(?:async\s+)?\()',
        re.MULTILINE
    )
    positions = [m.start() for m in pattern.finditer(source)]
    if not positions:
        return [{"text": source, "file": str(path), "name": path.name}]

    chunks = []
    for i, pos in enumerate(positions):
        end = positions[i + 1] if i + 1 < len(positions) else len(source)
        chunks.append({"text": source[pos:end].strip(), "file": str(path), "name": f"chunk_{i}"})
    return chunks

def collect_chunks(root: str) -> list[dict]:
    all_chunks = []
    for path in iter_source_files(root):
        if path.suffix == ".py":
            all_chunks.extend(chunk_python_file(path))
        else:
            all_chunks.extend(chunk_js_file(path))
    return all_chunks
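Before spending tokens on embeddings, it can help to sanity-check what the chunker produced. The sketch below is a hypothetical helper (summarize_chunks is not part of the pipeline above) and assumes only that each chunk is a dict with "text" and "file" keys, as produced by collect_chunks.

```python
import os
from collections import Counter

def summarize_chunks(chunks: list[dict], top_n: int = 3) -> dict:
    """Count chunks per file extension and list the largest chunks by character count."""
    by_ext = Counter(os.path.splitext(c["file"])[1] for c in chunks)
    largest = sorted(chunks, key=lambda c: len(c["text"]), reverse=True)[:top_n]
    return {
        "total": len(chunks),
        "by_extension": dict(by_ext),
        "largest": [(c["file"], c.get("name"), len(c["text"])) for c in largest],
    }
```

Very large entries in "largest" usually point at whole-file fallbacks or generated code that slipped through the directory filter.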

Step 3: Embed Chunks in Batches

Pyckle's embedding endpoint follows the OpenAI embeddings API format, so the openai Python client works against it. Point the client at the Pyckle base URL, set your API key, and send chunks in batches of 100. Batching cuts the number of requests, which keeps total latency reasonable and makes rate limits less likely on large codebases.

import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PYCKLE_API_KEY"],
    base_url="https://api.pyckle.dev/v1",
)

EMBED_MODEL = "pyckle-embed-code"
BATCH_SIZE = 100

def embed_chunks(chunks: list[dict]) -> list[dict]:
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        texts = [c["text"] for c in batch]
        response = client.embeddings.create(model=EMBED_MODEL, input=texts)
        for j, item in enumerate(response.data):
            batch[j]["embedding"] = item.embedding
        print(f"Embedded {min(i + BATCH_SIZE, len(chunks))}/{len(chunks)} chunks")
    return chunks
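Batching lowers the odds of a rate-limit error but does not remove them. A retry wrapper with exponential backoff is one common way to ride out a temporary failure. The sketch below is a generic, hypothetical helper. It catches Exception only because the exact error class depends on your client version, so narrow it to the rate-limit error you actually see.

```python
import random
import time

def with_backoff(fn, max_attempts: int = 5, base_delay: float = 1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**attempt plus jitter, then retry."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except Exception:  # assumption: replace with your client's rate-limit error
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

Inside embed_chunks, the request could then be written as with_backoff(lambda: client.embeddings.create(model=EMBED_MODEL, input=texts)).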

Try This

Prepend the filename and function name to the chunk text before embedding: f"# {chunk['file']}::{chunk['name']}\n{chunk['text']}". The model picks up the structural signal and returns more precise results for queries like "where is authentication handled".
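As a small sketch, the prefix can live in a helper so the stored chunk text stays untouched and only the embedded text carries the header. The name with_header is an assumption for illustration.

```python
def with_header(chunk: dict) -> str:
    """Build the text to embed: a one-line location comment followed by the code."""
    return f"# {chunk['file']}::{chunk['name']}\n{chunk['text']}"
```

In embed_chunks you would then build the batch with texts = [with_header(c) for c in batch] and keep chunk["text"] as the content you store and display.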

Step 4: Store Embeddings in pgvector or ChromaDB

Two solid options. pgvector runs on Postgres you already have. ChromaDB is a local vector store that needs no infrastructure. Pick pgvector if you're in production or need filtered queries against other columns. Pick ChromaDB if you want something running in five minutes.

Key Insight

pgvector on an existing Postgres instance is 79% cheaper than a dedicated vector database for most workloads under 10M vectors. If you already run Postgres and the pgvector extension is installed on the server, CREATE EXTENSION vector is all you need.

pgvector

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/codebase_db")
cur = conn.cursor()

# Create the extension first: register_vector fails if the vector type does not exist yet
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
register_vector(conn)
cur.execute("""
    CREATE TABLE IF NOT EXISTS code_chunks (
        id SERIAL PRIMARY KEY,
        file TEXT NOT NULL,
        name TEXT,
        lines TEXT,
        content TEXT NOT NULL,
        embedding vector(384)  -- must equal the embedding model's output dimension
    )
""")
# HNSW index: faster queries than IVFFlat, no need to tune lists parameter
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_chunks_embedding
    ON code_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()

def store_chunks_pg(chunks: list[dict]):
    for chunk in chunks:
        cur.execute(
            "INSERT INTO code_chunks (file, name, lines, content, embedding) VALUES (%s, %s, %s, %s, %s)",
            (chunk["file"], chunk.get("name"), chunk.get("lines"), chunk["text"], chunk["embedding"]),
        )
    conn.commit()
    print(f"Stored {len(chunks)} chunks in pgvector")
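A vector(384) column rejects any vector of a different length, so a mismatch between the model and the schema surfaces as an insert error partway through the load. A quick pre-check, sketched below, catches that earlier. check_dimensions is a hypothetical helper, and 384 is taken from the table definition above.

```python
def check_dimensions(chunks: list[dict], expected_dim: int = 384) -> list[dict]:
    """Return the chunks whose embedding is missing or not expected_dim long."""
    bad = []
    for chunk in chunks:
        embedding = chunk.get("embedding")
        if embedding is None or len(embedding) != expected_dim:
            bad.append(chunk)
    return bad
```

If the returned list is non-empty, fix the column size or the model choice before calling store_chunks_pg.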

ChromaDB alternative

import chromadb

chroma = chromadb.PersistentClient(path="./chroma_codebase")
collection = chroma.get_or_create_collection(
    name="code_chunks",
    metadata={"hnsw:space": "cosine"},
)

def store_chunks_chroma(chunks: list[dict]):
    collection.add(
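        # Include the index: names such as __init__ repeat within a file,
        # and ChromaDB rejects duplicate IDs
        ids=[f"{c['file']}::{c.get('name', '')}::{i}" for i, c in enumerate(chunks)],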
        documents=[c["text"] for c in chunks],
        embeddings=[c["embedding"] for c in chunks],
        metadatas=[{"file": c["file"], "name": c.get("name", "")} for c in chunks],
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB")
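If you plan to re-index the same codebase repeatedly, deterministic IDs let a second run overwrite existing entries instead of failing or duplicating them. The sketch below derives an ID from a chunk's location and content. chunk_id is a hypothetical helper, and the 32-character truncation is an arbitrary choice.

```python
import hashlib

def chunk_id(chunk: dict) -> str:
    """Derive a deterministic ID from the chunk's file, name, line range, and text."""
    key = "\x00".join([
        chunk["file"],
        chunk.get("name") or "",
        chunk.get("lines") or "",
        chunk["text"],
    ])
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:32]
```

Passing ids=[chunk_id(c) for c in chunks] to collection.upsert(...) rather than collection.add(...) is one way to use it. Chunks that are identical in all four fields map to the same ID, so de-duplicate them before a single call.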

Step 5: Write a Semantic Search Query Function

Embed the query string with the same model you used at index time, then run a cosine similarity search against your store. Return the top-k results with scores so you can inspect confidence alongside content.
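To see what the vector store computes on your behalf, here is a plain-Python sketch of cosine similarity and an exact top-k scan. It is only practical for small collections, and the function names are assumptions for illustration. An HNSW index returns approximate results, so a brute-force scan like this can also serve as a correctness check on a sample.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)

def brute_force_top_k(query_vec: list[float], chunks: list[dict], top_k: int = 5) -> list[tuple]:
    """Score every chunk against the query and return the top_k (score, chunk) pairs."""
    scored = [(cosine_similarity(query_vec, c["embedding"]), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]
```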

pgvector query

def search_pg(query: str, top_k: int = 5) -> list[dict]:
    response = client.embeddings.create(model=EMBED_MODEL, input=[query])
    query_vec = response.data[0].embedding

    cur.execute("""
        SELECT file, name, lines, content,
               1 - (embedding <=> %s::vector) AS score
        FROM code_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_vec, query_vec, top_k))

    results = []
    for file, name, lines, content, score in cur.fetchall():
        results.append({"file": file, "name": name, "lines": lines, "content": content, "score": round(score, 4)})
    return results

ChromaDB query

def search_chroma(query: str, top_k: int = 5) -> list[dict]:
    response = client.embeddings.create(model=EMBED_MODEL, input=[query])
    query_vec = response.data[0].embedding

    results = collection.query(
        query_embeddings=[query_vec],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )

    output = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        output.append({
            "file": meta["file"],
            "name": meta["name"],
            "content": doc,
            "score": round(1 - dist, 4),  # cosine distance → similarity
        })
    return output

Step 6: Tune Your Similarity Threshold

Raw top-k results include everything regardless of relevance. Add a threshold filter so callers get nothing back when there's no good match — better than confidently returning the wrong function.

def search(query: str, top_k: int = 5, threshold: float = 0.70) -> list[dict]:
    # Swap search_pg for search_chroma if using ChromaDB
    # Results arrive sorted by descending score, so filtering the top_k is enough
    results = search_pg(query, top_k=top_k)
    return [r for r in results if r["score"] >= threshold]

# Example usage
hits = search("rate limiting middleware", top_k=5, threshold=0.70)
for hit in hits:
    print(f"[{hit['score']}] {hit['file']} :: {hit['name']}")
    print(hit["content"][:200])
    print("---")

Start at 0.70. If results contain loosely related code that doesn't match your intent, raise the threshold to 0.75. If you're missing functions you know exist, lower it to 0.65. The right threshold varies by codebase, so run ten representative queries and adjust until precision and recall both feel right.
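If "feel right" is too loose for you, a handful of labeled queries can put numbers on it. The sketch below assumes you have recorded, for each test query, the raw (name, score) results and the set of names you consider relevant. The data shape and the function name are assumptions for illustration.

```python
def evaluate_threshold(labeled: list[dict], threshold: float) -> dict:
    """Compute precision and recall over labeled queries at one threshold.

    labeled: [{"results": [(name, score), ...], "relevant": {name, ...}}, ...]
    """
    returned = hits = relevant_total = 0
    for item in labeled:
        kept = {name for name, score in item["results"] if score >= threshold}
        returned += len(kept)
        hits += len(kept & item["relevant"])
        relevant_total += len(item["relevant"])
    # With nothing returned there are no false positives, so precision is reported as 1.0
    precision = hits / returned if returned else 1.0
    recall = hits / relevant_total if relevant_total else 1.0
    return {"threshold": threshold, "precision": precision, "recall": recall}
```

Running it at 0.65, 0.70, and 0.75 over the same labeled set shows how precision and recall trade off as the threshold moves.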

Try This

Log query strings and their top result scores to a file for a week. You'll quickly see where the threshold is too tight (queries with zero results for known functions) and where it's too loose (high scores on clearly irrelevant chunks). Real query data beats guessing.
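A minimal sketch of such a log, assuming a JSON Lines file is acceptable. The file name, the field names, and the log_query function are all illustrative choices.

```python
import json
import time
from pathlib import Path

def log_query(query: str, results: list[dict], log_path: str = "search_log.jsonl") -> None:
    """Append one JSON line per query with its result count and top score."""
    record = {
        "ts": time.time(),
        "query": query,
        "num_results": len(results),
        # None when there are no results at all
        "top_score": max((r["score"] for r in results), default=None),
    }
    with Path(log_path).open("a", encoding="utf-8") as fh:
        fh.write(json.dumps(record) + "\n")
```

Log the unfiltered results from search_pg or search_chroma rather than the thresholded ones. That way the top score is still recorded for queries the threshold would have emptied.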
