- Pyckle API key
- An indexed codebase or document set
- Python 3.9+
Vendor benchmarks tell you how a model performs on vendor data. That's not your problem. Your problem is whether developers on your team can find what they're looking for. This guide shows you how to measure that directly — in about 15 minutes.
Step 1: Understand What MRR@10 Actually Measures
Mean Reciprocal Rank at 10 is the standard metric for retrieval quality. For each query in your eval set, you ask: what rank was the first relevant result, within the top 10? You take the reciprocal of that rank — 1 if it was first, 0.5 if second, 0.33 if third — and average across all queries.
A score of 1.0 means every query returned the right result first. A score of 0.5 is what you get when the right result is always second, or when it is first for half the queries and missing from the top 10 for the other half. Anything below 0.2 means the model isn't finding your content reliably. The math is simple on purpose: it rewards getting the right answer high, not just anywhere in the top 10.
# MRR@10 from scratch: no libraries needed
def mrr_at_10(results: list[dict]) -> float:
    """
    results: list of {"query": str, "expected": str, "ranked": list[str]}
    ranked: list of file paths or doc IDs, in order
    """
    reciprocal_ranks = []
    for r in results:
        ranked = r["ranked"][:10]
        expected = r["expected"]
        try:
            rank = ranked.index(expected) + 1  # 1-indexed
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)  # not in top 10
    return sum(reciprocal_ranks) / len(reciprocal_ranks)
MRR penalizes rank 2 significantly (0.5 vs 1.0) but barely distinguishes rank 7 from rank 10 (0.14 vs 0.1). That's intentional — developers don't scroll. If the answer isn't first or second, it may as well not exist.
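To make the rank penalties above concrete, here is a small sketch that prints the reciprocal rank for each position in the top 10 and then averages a toy set of ranks. The helper names and the example ranks are illustrative assumptions, not part of the eval script.

```python
from typing import Optional


def reciprocal_rank(rank: Optional[int]) -> float:
    """Reciprocal of a 1-indexed rank; None means 'not in the top 10'."""
    return 0.0 if rank is None else 1.0 / rank


def mean_reciprocal_rank(ranks: list[Optional[int]]) -> float:
    """Average reciprocal rank over a list of per-query ranks."""
    return sum(reciprocal_rank(r) for r in ranks) / len(ranks)


# How much each position is worth, and how much is lost versus the one above.
for rank in range(1, 11):
    loss = reciprocal_rank(rank - 1) - reciprocal_rank(rank) if rank > 1 else 0.0
    print(f"rank {rank:>2}: {reciprocal_rank(rank):.3f} (loss vs previous: {loss:.3f})")

# Hypothetical ranks for three queries: first, second, and a miss.
example_ranks = [1, 2, None]
print(f"MRR for {example_ranks}: {mean_reciprocal_rank(example_ranks):.3f}")
```

The drop from rank 1 to rank 2 is 0.5, while the drop from rank 9 to rank 10 is about 0.011, which is why the metric is dominated by what happens at the very top.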
Step 2: Build a Minimal Eval Set from Your Own Data
You need 30 to 50 (query, expected_file) pairs drawn from real developer behavior. "Real" means queries that developers actually typed or would type, not queries you invented to make the model look good. If you're not sure what to use, run grep grep ~/.bash_history on a few developers' machines and look at what they have actually been searching for.
Each pair maps a natural language question to the single file or function that best answers it. You don't need exhaustive relevance labels. One ground-truth file per query is enough to compute MRR.
# eval_set.py: start here, fill in your own pairs
EVAL_SET = [
    {
        "query": "where is authentication middleware defined",
        "expected": "src/middleware/auth.py"
    },
    {
        "query": "how does the retry logic work for API calls",
        "expected": "src/utils/retry.py"
    },
    {
        "query": "database connection pool configuration",
        "expected": "src/db/pool.py"
    },
    {
        "query": "user permission checking before route handler",
        "expected": "src/middleware/permissions.py"
    },
    {
        "query": "how are background jobs queued",
        "expected": "src/jobs/queue.py"
    },
    # add 25-45 more from your actual grep history
]
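Before running anything against the API, it can help to sanity-check the eval set itself. The sketch below is one possible check, assuming the {"query", "expected"} shape shown above; validate_eval_set and its min_size default are hypothetical names chosen for illustration.

```python
def validate_eval_set(eval_set: list[dict], min_size: int = 30) -> list[str]:
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    if len(eval_set) < min_size:
        problems.append(f"only {len(eval_set)} pairs; aim for at least {min_size}")

    seen_queries = set()
    for i, item in enumerate(eval_set):
        query = str(item.get("query", "")).strip()
        expected = str(item.get("expected", "")).strip()
        if not query:
            problems.append(f"entry {i}: missing or empty query")
        if not expected:
            problems.append(f"entry {i}: missing or empty expected path")
        # Duplicate queries count the same lookup twice and skew the average.
        key = query.lower()
        if query and key in seen_queries:
            problems.append(f"entry {i}: duplicate query {query!r}")
        seen_queries.add(key)
    return problems


# Toy input with one duplicate and one missing path (hypothetical data).
toy_set = [
    {"query": "where is auth defined", "expected": "src/middleware/auth.py"},
    {"query": "Where is auth defined", "expected": "src/middleware/auth.py"},
    {"query": "retry logic", "expected": ""},
]
for problem in validate_eval_set(toy_set):
    print(problem)
```

This only checks the shape of the data. Whether each expected path is really the best answer for its query still needs a human to confirm.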
If your eval set has fewer than 30 queries, the variance is too high to trust. A single query flipping between rank 1 and a miss moves MRR by more than 0.03 (3+ points on a 100-point scale). At 10 queries that swing is 0.1, so you're essentially measuring noise. Build to 30 minimum before drawing conclusions.
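The effect of eval set size on noise can be sketched in two ways: the largest swing a single query can cause, and a simple percentile bootstrap interval around the mean. The bootstrap is a generic resampling technique brought in here for illustration, not something the guide prescribes, and the function names, resample count, and sample data are all assumptions.

```python
import random


def max_swing_per_query(n_queries: int) -> float:
    """Largest change in MRR when one query flips between rank 1 and a miss."""
    return 1.0 / n_queries


def bootstrap_interval(
    reciprocal_ranks: list[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-query reciprocal ranks."""
    rng = random.Random(seed)  # seeded so repeated runs give the same interval
    n = len(reciprocal_ranks)
    means = []
    for _ in range(n_resamples):
        sample = [reciprocal_ranks[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


for n in (10, 30, 50):
    print(f"{n} queries: one query can move MRR by up to {max_swing_per_query(n):.3f}")

# Hypothetical per-query reciprocal ranks from a 10-query eval set.
sample_rr = [1.0, 1.0, 0.5, 0.0, 1.0, 0.33, 0.0, 1.0, 0.5, 0.0]
low, high = bootstrap_interval(sample_rr)
print(f"mean {sum(sample_rr) / len(sample_rr):.3f}, approx. 95% interval {low:.3f} to {high:.3f}")
```

If the interval is wide enough to span both "usable" and "strong", the eval set is too small to tell those cases apart.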
Step 3: Run the Evaluation Loop
The eval loop is straightforward: for each (query, expected) pair, call the Pyckle search API, pull the top-10 results, and check whether the expected file appears and at what rank. Collect those ranks and compute MRR.
Run this against your indexed codebase. If you haven't indexed it yet, see the Embed a Codebase and Query It guide first — this loop assumes your content is already in Pyckle.
import os
import json
import requests
from eval_set import EVAL_SET

PYCKLE_API_KEY = os.environ["PYCKLE_API_KEY"]
PYCKLE_BASE_URL = "https://api.pyckle.dev/v1"


def search(query: str, top_k: int = 10) -> list[str]:
    """Return the file paths of the top_k results for a query, best first."""
    resp = requests.post(
        f"{PYCKLE_BASE_URL}/search",
        headers={"Authorization": f"Bearer {PYCKLE_API_KEY}"},
        json={"query": query, "top_k": top_k},
        timeout=30,  # fail instead of hanging on a stalled request
    )
    resp.raise_for_status()
    return [r["file_path"] for r in resp.json()["results"]]


def run_eval(eval_set: list[dict]) -> dict:
    """Search for every query, print per-query ranks, and return MRR plus raw results."""
    results = []
    for item in eval_set:
        ranked = search(item["query"])
        results.append({
            "query": item["query"],
            "expected": item["expected"],
            "ranked": ranked
        })
        print(f" {item['query'][:60]}")
        if item["expected"] in ranked:
            rank = ranked.index(item["expected"]) + 1
            print(f" ✓ rank {rank}")
        else:
            print(" ✗ not found")
    score = mrr_at_10(results)
    return {"mrr_at_10": score, "results": results}


def mrr_at_10(results: list[dict]) -> float:
    reciprocal_ranks = []
    for r in results:
        ranked = r["ranked"][:10]
        expected = r["expected"]
        try:
            rank = ranked.index(expected) + 1  # 1-indexed
            reciprocal_ranks.append(1.0 / rank)
        except ValueError:
            reciprocal_ranks.append(0.0)  # not in top 10
    return sum(reciprocal_ranks) / len(reciprocal_ranks)


if __name__ == "__main__":
    print(f"Running eval on {len(EVAL_SET)} queries...\n")
    output = run_eval(EVAL_SET)
    print(f"\nMRR@10: {output['mrr_at_10']:.3f}")
    with open("eval_results.json", "w") as f:
        json.dump(output, f, indent=2)
    print("Results saved to eval_results.json")
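Before spending API calls, the loop described above can be dry-run against a stub search function to confirm the harness computes what you expect. This is a minimal sketch: run_eval_with, fake_search, and the canned rankings are hypothetical stand-ins, and the only thing it assumes about the real search function is that it takes a query and returns an ordered list of file paths.

```python
from typing import Callable


def mrr_at_10(results: list[dict]) -> float:
    """Mean reciprocal rank of the expected file within each top-10 list."""
    total = 0.0
    for r in results:
        top = r["ranked"][:10]
        if r["expected"] in top:
            total += 1.0 / (top.index(r["expected"]) + 1)
    return total / len(results)


def run_eval_with(search_fn: Callable[[str], list[str]], eval_set: list[dict]) -> dict:
    """Same shape as the eval loop, but the search function is passed in."""
    results = [
        {"query": item["query"], "expected": item["expected"], "ranked": search_fn(item["query"])}
        for item in eval_set
    ]
    return {"mrr_at_10": mrr_at_10(results), "results": results}


# Canned rankings standing in for the search API (hypothetical data).
FAKE_RANKINGS = {
    "auth middleware": ["src/middleware/auth.py", "src/app.py"],
    "retry logic": ["src/app.py", "src/utils/retry.py"],
    "job queue": ["src/app.py"],
}


def fake_search(query: str) -> list[str]:
    return FAKE_RANKINGS.get(query, [])


toy_eval_set = [
    {"query": "auth middleware", "expected": "src/middleware/auth.py"},  # rank 1
    {"query": "retry logic", "expected": "src/utils/retry.py"},  # rank 2
    {"query": "job queue", "expected": "src/jobs/queue.py"},  # miss
]

dry_run = run_eval_with(fake_search, toy_eval_set)
print(f"dry-run MRR@10: {dry_run['mrr_at_10']:.3f}")  # (1 + 0.5 + 0) / 3 = 0.500
```

Passing the search function as an argument also makes it easy to swap in a different backend later without touching the scoring code.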
Step 4: Interpret Your Scores
Raw scores mean nothing without reference points. Here's what the numbers actually mean in a code search context.
Scores above 0.4 are strong: developers are usually hitting the right file within the first few results. Scores between 0.2 and 0.4 are usable, but worth investigating the misses. Scores below 0.2 mean something is systematically wrong: bad chunking, wrong model for your content type, or an eval set that doesn't reflect real queries.
def interpret_score(mrr: float) -> str:
    if mrr >= 0.4:
        return "Strong. Developers will find what they need."
    elif mrr >= 0.2:
        return "Usable. Worth examining which queries are failing."
    else:
        return "Poor. Investigate chunking strategy and model fit."


def failure_analysis(output: dict) -> None:
    """Print the queries where the expected file wasn't in the top 10.

    output: the dict returned by run_eval, with a "results" list.
    """
    print("\nFailed queries (expected not in top 10):")
    for r in output["results"]:
        if r["expected"] not in r["ranked"][:10]:
            print(f" Query: {r['query']}")
            print(f" Expected: {r['expected']}")
            print(f" Got: {r['ranked'][:3]}")
            print()


# after running eval:
print(interpret_score(output["mrr_at_10"]))
failure_analysis(output)
PyckLM scores 0.456 MRR@10 on CodeSearchNet — 62% better than GraphCodeBERT on the same benchmark. But your codebase isn't CodeSearchNet. Run your own eval set. The relative improvement usually holds, but your absolute score depends on your content and your queries.
Step 5: Compare Models on Your Eval Set
If you want to compare PyckLM against another model — a hosted service, Cohere, or a self-hosted option — run the same eval set through each. Use the same queries, the same expected files, the same MRR computation. The only variable should be the embedding model.
The pattern below wraps each model behind the same interface so the eval loop doesn't care which one it's calling. Add or remove adapters as needed.
# Reuses os, requests, EVAL_SET, PYCKLE_API_KEY, PYCKLE_BASE_URL, and mrr_at_10
# from the eval script in Step 3.

def search_pyckle(query: str, top_k: int = 10) -> list[str]:
    resp = requests.post(
        f"{PYCKLE_BASE_URL}/search",
        headers={"Authorization": f"Bearer {PYCKLE_API_KEY}"},
        json={"query": query, "top_k": top_k}
    )
    resp.raise_for_status()
    return [r["file_path"] for r in resp.json()["results"]]


def search_provider(query: str, top_k: int = 10) -> list[str]:
    # assumes you've indexed your codebase with your existing provider separately
    resp = requests.post(
        "https://your-existing-index-endpoint/search",
        headers={"Authorization": f"Bearer {os.environ['PROVIDER_API_KEY']}"},
        json={"query": query, "top_k": top_k}
    )
    resp.raise_for_status()
    return [r["file_path"] for r in resp.json()["results"]]


MODELS = {
    "pyckle": search_pyckle,
    "provider": search_provider,
}

comparison = {}
for model_name, search_fn in MODELS.items():
    results = []
    for item in EVAL_SET:
        ranked = search_fn(item["query"])
        results.append({
            "query": item["query"],
            "expected": item["expected"],
            "ranked": ranked
        })
    score = mrr_at_10(results)
    comparison[model_name] = score
    print(f"{model_name}: {score:.3f}")

winner = max(comparison, key=comparison.get)
print(f"\nWinner: {winner} ({comparison[winner]:.3f})")
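A difference in average MRR can hide which queries actually changed. One way to look closer, sketched here under the assumption that both models were run over the same eval set in the same order, is to tally per-query wins, losses, and ties. The names paired_comparison and reciprocal_rank and the sample results are hypothetical.

```python
def reciprocal_rank(ranked: list[str], expected: str, k: int = 10) -> float:
    """1/rank of the expected item within the top k, or 0.0 if absent."""
    top = ranked[:k]
    return 1.0 / (top.index(expected) + 1) if expected in top else 0.0


def paired_comparison(results_a: list[dict], results_b: list[dict]) -> dict:
    """Count the queries where model A ranks the expected file higher, lower, or the same."""
    tally = {"a_better": 0, "b_better": 0, "tie": 0}
    for a, b in zip(results_a, results_b):
        if a["query"] != b["query"]:
            raise ValueError("result lists must cover the same queries in the same order")
        rr_a = reciprocal_rank(a["ranked"], a["expected"])
        rr_b = reciprocal_rank(b["ranked"], b["expected"])
        if rr_a > rr_b:
            tally["a_better"] += 1
        elif rr_b > rr_a:
            tally["b_better"] += 1
        else:
            tally["tie"] += 1
    return tally


# Hypothetical results for two models over the same three queries.
model_a = [
    {"query": "q1", "expected": "a.py", "ranked": ["a.py", "b.py"]},
    {"query": "q2", "expected": "c.py", "ranked": ["x.py", "c.py"]},
    {"query": "q3", "expected": "d.py", "ranked": ["y.py"]},
]
model_b = [
    {"query": "q1", "expected": "a.py", "ranked": ["b.py", "a.py"]},
    {"query": "q2", "expected": "c.py", "ranked": ["c.py"]},
    {"query": "q3", "expected": "d.py", "ranked": ["z.py"]},
]
print(paired_comparison(model_a, model_b))
```

If one model wins by a small margin on average but the wins and losses are nearly even, the gap may not survive a different set of queries.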
Sort your eval set results by reciprocal rank, lowest first, and read the 10 worst queries. Patterns emerge fast: often a single chunking issue or a naming mismatch explains half your misses. Fix those before switching models.
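The sorting step in the tip above could look like the following sketch. It assumes the per-query result dicts produced by the eval loop ("query", "expected", "ranked"); worst_queries is a hypothetical helper name and the sample data is made up.

```python
def worst_queries(results: list[dict], n: int = 10, k: int = 10) -> list[tuple[float, dict]]:
    """Return the n results with the lowest reciprocal rank, worst first."""
    scored = []
    for r in results:
        top = r["ranked"][:k]
        rr = 1.0 / (top.index(r["expected"]) + 1) if r["expected"] in top else 0.0
        scored.append((rr, r))
    # Sort on the score only; the sort is stable, so ties keep eval-set order.
    scored.sort(key=lambda pair: pair[0])
    return scored[:n]


# Hypothetical results: one hit at rank 1, one miss, one hit at rank 3.
sample_results = [
    {"query": "q1", "expected": "a.py", "ranked": ["a.py"]},
    {"query": "q2", "expected": "b.py", "ranked": ["x.py", "y.py"]},
    {"query": "q3", "expected": "c.py", "ranked": ["x.py", "y.py", "c.py"]},
]
for rr, r in worst_queries(sample_results, n=2):
    print(f"{rr:.2f}  {r['query']}  expected={r['expected']}  got={r['ranked'][:3]}")
```

Printing the top three returned paths next to the expected one makes naming mismatches and chunking problems easier to spot by eye.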
Step 6: Know When Good Enough Is Good Enough
Optimizing retrieval metrics can become a distraction. The real signal is developer behavior. If your team finds what they need in one or two queries, the model is working — regardless of whether your MRR is 0.41 or 0.52.
Stop optimizing when: your MRR is above 0.4, your failure analysis shows no systematic patterns, and developers aren't complaining. Don't chase marginal gains on your eval set. An eval set is a proxy. The actual goal is less time spent searching and more time spent building.
# Quick check: what's the distribution of ranks for found results?
def rank_distribution(output: dict) -> dict:
    """Count how many queries landed at each rank 1-10, plus misses.

    output: the dict returned by run_eval, with a "results" list.
    """
    dist = {i: 0 for i in range(1, 11)}
    dist["not_found"] = 0
    for r in output["results"]:
        ranked = r["ranked"][:10]
        expected = r["expected"]
        if expected in ranked:
            rank = ranked.index(expected) + 1
            dist[rank] += 1
        else:
            dist["not_found"] += 1
    return dist


dist = rank_distribution(output)
print("\nRank distribution:")
for rank, count in dist.items():
    bar = "█" * count
    print(f" Rank {rank:>9}: {bar} ({count})")
If most of your results cluster at rank 1 and 2, with a long tail of misses, your model is healthy. Investigate the misses individually — they're usually edge cases, not systemic failures. If results are spread evenly across ranks 1 through 10, that's a model problem, not a query problem.
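To turn the reading above into numbers, the distribution can be collapsed into three shares: ranks 1 and 2, ranks 3 through 10, and misses. This sketch assumes a dict keyed by ranks 1 to 10 plus "not_found"; summarize_distribution is a hypothetical helper and the counts are invented.

```python
def summarize_distribution(dist: dict) -> dict:
    """Collapse a rank histogram into shares: top two, ranks 3-10, and misses."""
    total = sum(dist.values())
    if total == 0:
        return {"top_two": 0.0, "ranks_3_to_10": 0.0, "not_found": 0.0}
    top_two = dist.get(1, 0) + dist.get(2, 0)
    middle = sum(dist.get(rank, 0) for rank in range(3, 11))
    missed = dist.get("not_found", 0)
    return {
        "top_two": top_two / total,
        "ranks_3_to_10": middle / total,
        "not_found": missed / total,
    }


# Hypothetical histogram for a 10-query eval set.
example_dist = {rank: 0 for rank in range(1, 11)}
example_dist["not_found"] = 0
example_dist[1] = 6
example_dist[2] = 2
example_dist[5] = 1
example_dist["not_found"] = 1

summary = summarize_distribution(example_dist)
for bucket, share in summary.items():
    print(f"{bucket:>14}: {share:.0%}")
```

A large top-two share with a small miss share matches the healthy pattern described above, while a large middle share suggests results are being found but ranked poorly.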
One last check before you declare victory: run your eval set again after any reindexing or config change. Scores drift when the index changes. Keep eval_results.json in version control so you have a baseline to compare against.
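A baseline comparison along those lines might look like the sketch below. It assumes both files have the shape written by the eval script (an "mrr_at_10" number and a "results" list); compare_to_baseline, load_results, and the file name baseline.json are hypothetical.

```python
import json
from pathlib import Path
from typing import Optional


def load_results(path: str) -> dict:
    """Read a saved eval output file."""
    return json.loads(Path(path).read_text())


def rank_of(result: dict, k: int = 10) -> Optional[int]:
    """1-indexed rank of the expected file in the top k, or None if missing."""
    top = result["ranked"][:k]
    return top.index(result["expected"]) + 1 if result["expected"] in top else None


def compare_to_baseline(baseline: dict, current: dict) -> dict:
    """Report the change in MRR and the queries whose rank got worse."""
    before = {r["query"]: rank_of(r) for r in baseline["results"]}
    regressions = []
    for r in current["results"]:
        if r["query"] not in before:
            continue  # new query, nothing to compare against
        old, new = before[r["query"]], rank_of(r)
        # Worse means: previously found, and now missing or at a lower position.
        if old is not None and (new is None or new > old):
            regressions.append((r["query"], old, new))
    return {
        "mrr_delta": current["mrr_at_10"] - baseline["mrr_at_10"],
        "regressions": regressions,
    }


# Hypothetical before/after outputs; with real files you would call, for example:
#   compare_to_baseline(load_results("baseline.json"), load_results("eval_results.json"))
old_run = {"mrr_at_10": 0.75, "results": [
    {"query": "q1", "expected": "a.py", "ranked": ["a.py"]},
    {"query": "q2", "expected": "b.py", "ranked": ["x.py", "b.py"]},
]}
new_run = {"mrr_at_10": 0.5, "results": [
    {"query": "q1", "expected": "a.py", "ranked": ["a.py"]},
    {"query": "q2", "expected": "b.py", "ranked": ["x.py", "y.py"]},
]}
print(compare_to_baseline(old_run, new_run))
```

Listing the regressed queries alongside the score change shows whether a drop comes from one broken file or from a broad shift in ranking.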