- Pyckle API key
- Python 3.9+
pip install openai pgvector psycopg2-binary chromadb
You have a codebase. You want to ask it questions in plain English. This guide walks you through chunking, embedding, storing, and querying — end to end, with real code.
Step 1: Choose Your Chunking Strategy
Chunking determines what your search returns. Three options exist: file-level, sliding window, and function-level. File-level is too coarse — you'll get back 500-line files when you needed one function. Sliding window works for prose but ignores code structure entirely. Function-level is the right call for code: each chunk is a self-contained unit with a name, a signature, and a purpose.
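To make the sliding-window drawback concrete, here is a minimal sketch of a line-based window chunker run over two tiny functions. The helper name `sliding_window_chunks`, the sample source, and the window sizes are illustrative assumptions, not part of the pipeline built later in this guide.

```python
def sliding_window_chunks(lines: list[str], size: int, overlap: int) -> list[str]:
    """Split lines into fixed-size windows that overlap by `overlap` lines."""
    step = size - overlap
    if step <= 0:
        raise ValueError("overlap must be smaller than size")
    return [
        "\n".join(lines[start : start + size])
        for start in range(0, len(lines), step)
    ]


# Hypothetical sample source: two short functions separated by a blank line
SAMPLE = """\
def load(path):
    with open(path) as f:
        return f.read()

def parse(text):
    rows = text.splitlines()
    return [r.split(",") for r in rows]
"""

windows = sliding_window_chunks(SAMPLE.splitlines(), size=3, overlap=1)
for i, window in enumerate(windows):
    print(f"--- window {i} ---")
    print(window)
```

The second window holds the last line of `load` and the signature of `parse`, so a search hit on it returns half of two functions and the whole of neither.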
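For Python, use ast.parse to extract functions and classes with exact line ranges. For JavaScript and TypeScript, a regex on function declarations and arrow functions assigned to a variable covers the common cases without pulling in a full parser. It will miss class methods and functions defined inside object literals, so treat it as a pragmatic shortcut rather than a complete solution.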
Exclude node_modules/, .git/, dist/, build/, and any generated files before you chunk. Embedding build artifacts wastes tokens and pollutes your search results with noise you'll never want returned.
Step 2: Chunk Your Codebase Files
Walk the directory, filter to source files, and extract functions. The snippet below handles Python with AST and splits JS/TS files with a regex. Either path falls back to a single whole-file chunk when a file can't be parsed or no functions are found.
import ast
import os
import re
from pathlib import Path
from typing import Iterator


def iter_source_files(root: str, extensions=(".py", ".js", ".ts")) -> Iterator[Path]:
    for dirpath, dirnames, filenames in os.walk(root):
        # Skip junk directories in place so os.walk never descends into them
        dirnames[:] = [
            d for d in dirnames
            if d not in {"node_modules", ".git", "dist", "build", "__pycache__", ".venv"}
        ]
        for fname in filenames:
            if fname.endswith(extensions):
                yield Path(dirpath) / fname


def chunk_python_file(path: Path) -> list[dict]:
    source = path.read_text(encoding="utf-8", errors="ignore")
    chunks = []
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # Unparseable file: fall back to one whole-file chunk
        return [{"text": source, "file": str(path), "name": path.name}]
    lines = source.splitlines()
    # ast.walk also visits nested nodes, so a method is emitted both inside
    # its class chunk and as a chunk of its own
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start = node.lineno - 1
            end = node.end_lineno
            snippet = "\n".join(lines[start:end])
            chunks.append({
                "text": snippet,
                "file": str(path),
                "name": node.name,
                "lines": f"{node.lineno}-{node.end_lineno}",
            })
    return chunks or [{"text": source, "file": str(path), "name": path.name}]


def chunk_js_file(path: Path) -> list[dict]:
    source = path.read_text(encoding="utf-8", errors="ignore")
    # Split on function declarations and arrow functions assigned to const/let/var
    pattern = re.compile(
        r'((?:export\s+)?(?:async\s+)?function\s+\w+|(?:const|let|var)\s+\w+\s*=\s*(?:async\s+)?\()',
        re.MULTILINE
    )
    positions = [m.start() for m in pattern.finditer(source)]
    if not positions:
        return [{"text": source, "file": str(path), "name": path.name}]
    chunks = []
    # Each chunk runs from one match to the next; text before the first match
    # (imports, top-level constants) is not included in any chunk
    for i, pos in enumerate(positions):
        end = positions[i + 1] if i + 1 < len(positions) else len(source)
        chunks.append({"text": source[pos:end].strip(), "file": str(path), "name": f"chunk_{i}"})
    return chunks


def collect_chunks(root: str) -> list[dict]:
    all_chunks = []
    for path in iter_source_files(root):
        if path.suffix == ".py":
            all_chunks.extend(chunk_python_file(path))
        else:
            all_chunks.extend(chunk_js_file(path))
    return all_chunks
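To see what the AST pass produces, the following sketch runs the same node filter over a small inline source string and prints the name and line range of each match. The helper `function_spans` and the sample source are illustrative assumptions. The `lineno` and `end_lineno` attributes are the ones the chunker above already relies on.

```python
import ast

# Hypothetical sample: one class with a method, plus one top-level function
SOURCE = '''\
class Cache:
    def get(self, key):
        return self._data.get(key)

def main():
    return Cache()
'''


def function_spans(source: str) -> list[tuple[str, int, int]]:
    """Return (name, first_line, last_line) for every def and class, in file order."""
    tree = ast.parse(source)
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            spans.append((node.name, node.lineno, node.end_lineno))
    return sorted(spans, key=lambda span: span[1])


for name, start, end in function_spans(SOURCE):
    print(f"{name}: lines {start}-{end}")
```

`Cache` spans lines 1-3 and `get` spans lines 2-3, so the method is indexed twice: once on its own and once inside its class. That overlap is often useful for search, but deduplicate if it crowds your results. Also note that on Python 3.8+ `lineno` points at the `def` or `class` line, so decorators above it fall outside the extracted span.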
Step 3: Embed Chunks in Batches
Pyckle's embedding endpoint follows the OpenAI embeddings API format, so the openai client works against it. Point the client at the Pyckle base URL, set your API key, and send chunks in batches of 100. Batching keeps latency reasonable and avoids hitting rate limits on large codebases.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.environ["PYCKLE_API_KEY"],
    base_url="https://api.pyckle.dev/v1",
)

EMBED_MODEL = "pyckle-embed-code"
BATCH_SIZE = 100


def embed_chunks(chunks: list[dict]) -> list[dict]:
    # Adds an "embedding" key to each chunk dict in place
    for i in range(0, len(chunks), BATCH_SIZE):
        batch = chunks[i : i + BATCH_SIZE]
        texts = [c["text"] for c in batch]
        response = client.embeddings.create(model=EMBED_MODEL, input=texts)
        for j, item in enumerate(response.data):
            batch[j]["embedding"] = item.embedding
        print(f"Embedded {min(i + BATCH_SIZE, len(chunks))}/{len(chunks)} chunks")
    return chunks
Prepend the filename and function name to the chunk text before embedding: f"# {chunk['file']}::{chunk['name']}\n{chunk['text']}". The model picks up the structural signal and returns more precise results for queries like "where is authentication handled".
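One way to apply that tip is a small helper that builds the string sent to the model while leaving the stored chunk text untouched. The name `embedding_text` and the sample chunk are illustrative assumptions.

```python
def embedding_text(chunk: dict) -> str:
    """Build the text to embed: a header line with file and name, then the code."""
    return f"# {chunk['file']}::{chunk['name']}\n{chunk['text']}"


# Hypothetical chunk in the same shape the chunkers in Step 2 produce
example_chunk = {
    "file": "auth/session.py",
    "name": "verify_token",
    "text": "def verify_token(token):\n    ...",
}
print(embedding_text(example_chunk))
```

To use it, replace `texts = [c["text"] for c in batch]` in `embed_chunks` with `texts = [embedding_text(c) for c in batch]`. The raw `chunk["text"]` is still what gets stored and returned, so search results show the code without the synthetic header.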
Step 4: Store Embeddings in pgvector or ChromaDB
Two solid options. pgvector runs on Postgres you already have. ChromaDB is a local vector store that needs no infrastructure. Pick pgvector if you're in production or need filtered queries against other columns. Pick ChromaDB if you want something running in five minutes.
pgvector on an existing Postgres instance is 79% cheaper than a dedicated vector database for most workloads under 10M vectors. If you already run Postgres, CREATE EXTENSION vector is all you need.
pgvector
import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/codebase_db")
cur = conn.cursor()
# Create the extension first: register_vector looks up the vector type and
# fails on a database where the extension does not exist yet
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
conn.commit()
register_vector(conn)

# vector(384) must equal the dimension your embedding model returns
cur.execute("""
    CREATE TABLE IF NOT EXISTS code_chunks (
        id SERIAL PRIMARY KEY,
        file TEXT NOT NULL,
        name TEXT,
        lines TEXT,
        content TEXT NOT NULL,
        embedding vector(384)
    )
""")

# HNSW index: faster queries than IVFFlat, no need to tune lists parameter
cur.execute("""
    CREATE INDEX IF NOT EXISTS idx_chunks_embedding
    ON code_chunks
    USING hnsw (embedding vector_cosine_ops)
    WITH (m = 16, ef_construction = 64)
""")
conn.commit()


def store_chunks_pg(chunks: list[dict]):
    for chunk in chunks:
        cur.execute(
            "INSERT INTO code_chunks (file, name, lines, content, embedding) VALUES (%s, %s, %s, %s, %s)",
            (chunk["file"], chunk.get("name"), chunk.get("lines"), chunk["text"], chunk["embedding"]),
        )
    conn.commit()
    print(f"Stored {len(chunks)} chunks in pgvector")
ChromaDB alternative
import chromadb

chroma = chromadb.PersistentClient(path="./chroma_codebase")
collection = chroma.get_or_create_collection(
    name="code_chunks",
    metadata={"hnsw:space": "cosine"},
)


def store_chunks_chroma(chunks: list[dict]):
    collection.add(
        # Append the index so IDs stay unique when one file defines the same
        # name twice (for example, __init__ in two classes)
        ids=[f"{c['file']}::{c.get('name', '')}::{i}" for i, c in enumerate(chunks)],
        documents=[c["text"] for c in chunks],
        embeddings=[c["embedding"] for c in chunks],
        metadatas=[{"file": c["file"], "name": c.get("name", "")} for c in chunks],
    )
    print(f"Stored {len(chunks)} chunks in ChromaDB")
Step 5: Write a Semantic Search Query Function
Embed the query string with the same model you used at index time, then run a cosine similarity search against your store. Return the top-k results with scores so you can inspect confidence alongside content.
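Both stores report cosine distance, and the query functions in this step convert it to a similarity score as 1 minus the distance. A minimal pure-Python sketch of that relationship follows. The function names and toy vectors are illustrative assumptions, and the stores compute this internally rather than with code like this.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of a and b divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cosine_distance(a: list[float], b: list[float]) -> float:
    """The quantity a cosine-distance search orders by: 1 - similarity."""
    return 1.0 - cosine_similarity(a, b)


# Toy 2-D vectors standing in for real embeddings
query = [1.0, 0.0]
same_direction = [2.0, 0.0]
orthogonal = [0.0, 3.0]

print(cosine_similarity(query, same_direction))  # 1.0
print(cosine_similarity(query, orthogonal))      # 0.0
```

Vector length does not affect the score, only direction does, which is why `same_direction` scores 1.0 despite being twice as long as the query.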
pgvector query
def search_pg(query: str, top_k: int = 5) -> list[dict]:
    response = client.embeddings.create(model=EMBED_MODEL, input=[query])
    query_vec = response.data[0].embedding
    cur.execute("""
        SELECT file, name, lines, content,
               1 - (embedding <=> %s::vector) AS score
        FROM code_chunks
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, (query_vec, query_vec, top_k))
    results = []
    for file, name, lines, content, score in cur.fetchall():
        results.append({"file": file, "name": name, "lines": lines, "content": content, "score": round(score, 4)})
    return results
ChromaDB query
def search_chroma(query: str, top_k: int = 5) -> list[dict]:
    response = client.embeddings.create(model=EMBED_MODEL, input=[query])
    query_vec = response.data[0].embedding
    results = collection.query(
        query_embeddings=[query_vec],
        n_results=top_k,
        include=["documents", "metadatas", "distances"],
    )
    output = []
    for doc, meta, dist in zip(
        results["documents"][0],
        results["metadatas"][0],
        results["distances"][0],
    ):
        output.append({
            "file": meta["file"],
            "name": meta["name"],
            "content": doc,
            "score": round(1 - dist, 4),  # cosine distance → similarity
        })
    return output
Step 6: Tune Your Similarity Threshold
Raw top-k results include everything regardless of relevance. Add a threshold filter so callers get nothing back when there's no good match — better than confidently returning the wrong function.
def search(query: str, top_k: int = 5, threshold: float = 0.70) -> list[dict]:
    # Swap search_pg for search_chroma if using ChromaDB
    results = search_pg(query, top_k=top_k * 2)  # over-fetch, then filter
    filtered = [r for r in results if r["score"] >= threshold]
    return filtered[:top_k]


# Example usage
hits = search("rate limiting middleware", top_k=5, threshold=0.70)
for hit in hits:
    print(f"[{hit['score']}] {hit['file']} :: {hit['name']}")
    print(hit["content"][:200])
    print("---")
Start at 0.70. If results contain loosely related code that doesn't match your intent, raise to 0.75. If you're missing functions you know exist, lower to 0.65. The right threshold varies by codebase, so run ten representative queries and adjust until precision and recall both look acceptable.
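If you hand-label the hits from those representative queries as relevant or not, the trade-off can be measured instead of eyeballed. The sketch below assumes such labels exist. The `LABELLED` data is made up for illustration and `precision_recall` is a hypothetical helper.

```python
def precision_recall(scored: list[tuple[float, bool]], threshold: float) -> tuple[float, float]:
    """Precision and recall when only hits with score >= threshold are kept."""
    kept = [relevant for score, relevant in scored if score >= threshold]
    total_relevant = sum(1 for _, relevant in scored if relevant)
    true_positives = sum(kept)
    precision = true_positives / len(kept) if kept else 1.0
    recall = true_positives / total_relevant if total_relevant else 1.0
    return precision, recall


# Made-up (score, is_relevant) pairs pooled from a handful of test queries
LABELLED = [
    (0.82, True),
    (0.77, True),
    (0.73, False),
    (0.71, True),
    (0.66, False),
    (0.61, False),
]

for threshold in (0.65, 0.70, 0.75):
    p, r = precision_recall(LABELLED, threshold)
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, precision climbs from 0.60 to 0.75 to 1.00 as the threshold rises, while recall holds at 1.00 and then drops to 0.67 at 0.75. A higher threshold removes irrelevant hits and eventually starts removing relevant ones too.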
Log query strings and their top result scores to a file for a week. You'll quickly see where the threshold is too tight (queries with zero results for known functions) and where it's too loose (high scores on clearly irrelevant chunks). Real query data beats guessing.
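A minimal sketch of that logging, assuming one JSON object per line is an acceptable format. The file name `search_queries.jsonl`, the field names, and the `log_query` helper are all illustrative choices.

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("search_queries.jsonl")  # hypothetical log location


def log_query(query: str, results: list[dict], log_path: Path = LOG_PATH) -> None:
    """Append one JSON line recording the query, result count, and top score."""
    record = {
        "ts": time.time(),
        "query": query,
        "num_results": len(results),
        # None marks a query that returned nothing
        "top_score": results[0]["score"] if results else None,
    }
    with log_path.open("a", encoding="utf-8") as f:
        f.write(json.dumps(record) + "\n")
```

Call `log_query(query, results)` on the unfiltered results inside `search`, before the threshold is applied. That way a query that returns nothing to the caller still records how close its best candidate came.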