What Cosine Similarity Actually Tells You (And What It Doesn't)

A cosine similarity score of 0.82 means the angle between two vectors is about 35 degrees. That's it. It does not mean the retrieved document is relevant. It does not mean you found what you were looking for. It means two points in high-dimensional space are pointing in roughly the same direction. Whether that direction corresponds to what your user actually needs — that's a separate problem, and most embedding pipelines don't solve it.

What the metric actually measures

Cosine similarity computes the cosine of the angle between two vectors. A score of 1 means zero degrees (perfect alignment), 0 means 90 degrees (orthogonal), and -1 means 180 degrees (opposite). The formula normalizes for magnitude, so a 10-word document and a 10,000-word document can return identical similarity scores if their "direction" matches.
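Here is a minimal sketch of that computation in plain Python, so you can check the numbers yourself. The vectors and the helper names (`cosine_similarity`, `angle_degrees`) are illustrative, not taken from any particular library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def angle_degrees(score):
    """Convert a cosine similarity score back to an angle in degrees."""
    # Clamp to [-1, 1] to guard against floating-point drift before acos
    return math.degrees(math.acos(max(-1.0, min(1.0, score))))

short_doc = [1.0, 2.0, 3.0]
long_doc = [100.0, 200.0, 300.0]   # same direction, 100x the magnitude
query = [3.0, 2.0, 1.0]

same_direction = cosine_similarity(short_doc, long_doc)
score_short = cosine_similarity(query, short_doc)
score_long = cosine_similarity(query, long_doc)
```

Both documents score the same against the query because only direction counts, and `angle_degrees(0.82)` comes out just under 35 degrees.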

That's useful for some tasks. It's the wrong frame for retrieval.

Retrieval is a ranking problem, not a geometry problem. You're not asking "are these two things similar?" You're asking "is this the thing the user needs?" Those questions have different answers. A file containing the word "authentication" will score highly against a query about authentication middleware even if the file is a changelog entry from 2019 with one passing reference. The angle between vectors doesn't capture intent, specificity, or task relevance.

The model doesn't know what you want. It knows what words tend to appear near each other. Cosine similarity faithfully reports that relationship. It doesn't report usefulness.

The mushy middle

Here's the problem that actually kills retrieval quality in practice.

Many embedding models are trained with a regression-style loss such as MSE against target similarity scores, or with contrastive loss variants that push similar things together and dissimilar things apart. The issue is compression. When you train on a large, diverse corpus, the model learns to spread representations across the space, but an MSE objective in particular rewards predictions that sit near the mean of the targets. You end up with most of your retrieval scores clustered between 0.65 and 0.85.

The correct file scores 0.78. A completely irrelevant file scores 0.74. That 0.04 gap is statistically meaningless given typical retrieval noise. Your system can't tell them apart, so it returns both.

This isn't a threshold-tuning problem. Moving your cutoff from 0.70 to 0.73 doesn't fix it: the irrelevant file at 0.74 still gets through. Move it to 0.76 and you drop that bad result, but you also drop the correct file on any other query where it happens to score 0.75. The scores are too compressed to discriminate. No static cutoff survives this.
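To see why, here is a small sketch that sweeps candidate cutoffs across two queries. The first query uses the 0.78 and 0.74 from above; the second query and its scores are hypothetical, chosen only to show what happens when scores shift slightly between queries.

```python
# (score, is_relevant) pairs per query
queries = {
    "query_a": [(0.78, True), (0.74, False)],   # figures from the text
    "query_b": [(0.73, True), (0.69, False)],   # hypothetical second query
}

def errors_at(cutoff, results):
    """Count irrelevant results kept plus relevant results dropped at a cutoff."""
    kept_bad = sum(1 for s, rel in results if s >= cutoff and not rel)
    dropped_good = sum(1 for s, rel in results if s < cutoff and rel)
    return kept_bad + dropped_good

all_results = [pair for results in queries.values() for pair in results]
cutoffs = [round(0.65 + 0.01 * i, 2) for i in range(21)]   # 0.65 through 0.85
best_cutoff = min(cutoffs, key=lambda c: errors_at(c, all_results))
min_errors = errors_at(best_cutoff, all_results)
```

A cutoff of 0.76 is clean for `query_a` alone, but across both queries every cutoff in the sweep makes at least one mistake.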

The mushy middle is a training problem, not a deployment problem. You can't tune your way out of it at inference time.

Static top_k is a trap

The standard retrieval setup picks a fixed number of results — top_k=5, top_k=10, top_k=20 — and injects all of them into the context. The assumption is that more context is better, or at least not harmful.

It's harmful.

If only one file in your codebase is actually relevant to the query, top_k=20 means you're passing 19 noise files to the model. That's not safety margin. That's intentional pollution. The model now has to reason through irrelevant code, may anchor on wrong patterns, and burns tokens doing it.

Here's a practical test. Run your retrieval pipeline with top_k=5. Note the model's output. Run it again with top_k=20. If the output quality doesn't meaningfully change, your embeddings aren't discriminating. Either the extra 15 results contained nothing useful, or the model has learned to ignore retrieval context entirely, which is worse.

Good retrieval is selective retrieval. If your embedding model is calibrated correctly, top_k=3 should outperform top_k=20 on most queries, because you're returning only the chunks that actually matter. If top_k=20 consistently helps, the model doesn't know which 3 are right.
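The cost of a fixed top_k is easy to quantify if you have relevance labels for a ranked result list. The sketch below is one way to count it; the ranking is hypothetical (one relevant file out of 50 candidates) and `noise_in_top_k` is an illustrative helper name.

```python
def noise_in_top_k(ranked_relevance, top_k):
    """Return (relevant, noise) counts among the first top_k ranked results."""
    window = ranked_relevance[:top_k]
    relevant = sum(window)
    return relevant, len(window) - relevant

# Hypothetical ranking: only the top result is relevant out of 50 candidates
ranked = [True] + [False] * 49

for k in (3, 5, 20):
    relevant, noise = noise_in_top_k(ranked, k)
    print(f"top_k={k}: {relevant} relevant, {noise} noise, precision={relevant / k:.2f}")
```

With one relevant file, every increase in top_k adds only noise, which is the 19-noise-files case described above.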

Margin Ranking Loss and why it matters

The fix at training time is Margin Ranking Loss. Instead of pushing similar things together globally, you train the model so that the correct result scores significantly higher than hard negatives — documents that are superficially similar but wrong.

The loss function looks like this:

import torch

def margin_ranking_loss(pos_score, neg_score, margin=0.3):
    # pos_score and neg_score are tensors of similarity scores, one pair per example.
    # The correct result must outscore the hard negative by at least `margin`.
    return torch.clamp(margin - (pos_score - neg_score), min=0).mean()

The margin parameter is the lever. Set it too low and the model learns to barely separate correct from incorrect — you're back to the mushy middle. Set it high (0.3+) and the model is forced to place correct results in a genuinely distinct region of score space.
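To make the hinge concrete, here is a plain-Python restatement for a single pair of scores, so it runs without PyTorch. The helper name `margin_ranking_loss_scalar` and the second pair of scores are illustrative assumptions; the 0.78 and 0.74 are the compressed scores from earlier.

```python
def margin_ranking_loss_scalar(pos_score, neg_score, margin=0.3):
    """Hinge on the score gap: zero once pos beats neg by at least `margin`."""
    return max(0.0, margin - (pos_score - neg_score))

# Compressed scores: a gap of only 0.04, so the loss is large
compressed = margin_ranking_loss_scalar(0.78, 0.74)

# Hypothetical well-separated pair: the gap exceeds the margin, so the loss is zero
separated = margin_ranking_loss_scalar(0.94, 0.60)
```

The compressed pair is penalized by roughly 0.26, while the separated pair contributes nothing. The gradient only flows from pairs that have not yet cleared the margin.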

Hard negatives are the other key ingredient. Easy negatives are documents that are obviously wrong — a Python file about sorting algorithms when you queried for database connection pooling. The model figures those out quickly and stops learning. Hard negatives are documents that look relevant but aren't: a file that uses the same class names, imports the same libraries, but doesn't implement the behavior you need. Training on hard negatives forces the model to develop finer-grained discrimination.
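One common way to find hard negatives is to score the corpus with your current model and keep the highest-scoring documents that are not labeled relevant. The sketch below assumes you already have such labels; the file names, scores, and the `mine_hard_negatives` helper are hypothetical.

```python
def mine_hard_negatives(scored_docs, relevant_ids, num_negatives=2):
    """Pick the highest-scoring documents that are NOT labeled relevant."""
    negatives = [(doc_id, score) for doc_id, score in scored_docs
                 if doc_id not in relevant_ids]
    negatives.sort(key=lambda pair: pair[1], reverse=True)
    return [doc_id for doc_id, _ in negatives[:num_negatives]]

# Hypothetical scores from the current model for one query
scored = [
    ("db/pool.py", 0.81),
    ("db/connection.py", 0.78),
    ("utils/retry.py", 0.74),
    ("algorithms/sort.py", 0.31),   # easy negative: obviously unrelated
]
hard = mine_hard_negatives(scored, relevant_ids={"db/pool.py"})
```

The easy negative never makes the cut. One caveat: a high-scoring unlabeled document may actually be relevant, so mined negatives are only as trustworthy as your labels.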

The result isn't just better recall. The score distribution opens up. You get clear separation between relevant and irrelevant results, which means threshold-based filtering actually works.

Adaptive threshold calibration

Once your model is producing discriminative scores, you can replace static top_k with threshold-based retrieval. Return everything above a learned cutoff; return nothing below it.

The calibration step fits a threshold on held-out examples: you look at the score distribution for known-relevant and known-irrelevant results and find where they diverge. On well-trained models, this is a clean break: relevant documents cluster above 0.85, irrelevant ones cluster below 0.70, and few scores land in between. The threshold sits in that gap.
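A minimal sketch of that calibration step might look like the following. The midpoint rule is an assumption made for illustration; a production system might instead pick the cutoff that maximizes F1 or another metric on the held-out set. All scores here are hypothetical.

```python
def calibrate_threshold(relevant_scores, irrelevant_scores):
    """Pick a cutoff from held-out scores.

    If the two groups separate cleanly, return the midpoint of the gap.
    Otherwise return None to signal that no clean cutoff exists.
    """
    lowest_relevant = min(relevant_scores)
    highest_irrelevant = max(irrelevant_scores)
    if lowest_relevant <= highest_irrelevant:
        return None  # distributions overlap: the mushy middle
    return (lowest_relevant + highest_irrelevant) / 2

# Hypothetical held-out scores
well_separated = calibrate_threshold([0.94, 0.91, 0.87], [0.70, 0.58, 0.41])
overlapping = calibrate_threshold([0.81, 0.78, 0.73], [0.75, 0.74, 0.72])
```

Returning `None` for overlapping distributions is deliberate: it tells the caller to fall back to top_k rather than trust a cutoff that cannot separate the two groups.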

The practical setup:

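import numpy as np

def cosine_similarity(query_embedding, embeddings):
    # Normalize both sides so the dot product equals the cosine of the angle
    q = query_embedding / np.linalg.norm(query_embedding)
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return e @ q

def adaptive_retrieve(query_embedding, corpus, threshold=None, top_k=5):
    # `corpus` is assumed to expose `.embeddings` (2-D array) and support array indexing
    scores = cosine_similarity(query_embedding, corpus.embeddings)

    if threshold is not None:
        # Return only results at or above the calibrated cutoff
        mask = scores >= threshold
        return corpus[mask], scores[mask]

    # Fall back to top_k (default 5) if the threshold is not calibrated yet
    idx = scores.argsort()[::-1][:top_k]
    return corpus[idx], scores[idx]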

The threshold isn't static across queries — it should adapt to query type. A narrow, specific query ("find the rate limiter middleware") has a tighter threshold than a broad one ("show me authentication code"). Some retrieval systems learn this per-query-type using a lightweight classifier. That's worth the engineering effort at scale.
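As a rough sketch of the per-query-type idea, the lookup itself can be very small. Everything here is hypothetical: the cutoff values, the type names, and especially `toy_classify`, which is a string heuristic standing in for a learned classifier.

```python
def threshold_for(query, classify, thresholds, default=0.80):
    """Look up the cutoff for a query's type, falling back to a default."""
    return thresholds.get(classify(query), default)

def toy_classify(query):
    # Toy heuristic standing in for a learned classifier
    return "narrow" if query.lower().startswith("find ") else "broad"

# Hypothetical per-type cutoffs; real values would come from calibration
thresholds = {"narrow": 0.85, "broad": 0.75}

narrow_cutoff = threshold_for("find the rate limiter middleware", toy_classify, thresholds)
broad_cutoff = threshold_for("show me authentication code", toy_classify, thresholds)
```

Passing the classifier in as a function keeps the lookup independent of how query types are predicted, so the heuristic can be swapped for a trained model later.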

In Pyckle's retrieval pipeline, shifting from static top_k to calibrated thresholding cut token usage by 95% on typical queries. Not 10%, not 30% — 95%. Most queries are narrow. The answer lives in 1-3 chunks. The rest is noise you were paying to process.

What a calibrated score looks like

With MSE-trained embeddings, the score distribution for a typical codebase query might look like this:

Query: "connection pool timeout handling"

Rank  Score  File
1     0.81   db/pool.py          ← what you want
2     0.78   db/connection.py    ← close, adjacent
3     0.75   tests/test_pool.py  ← tangentially relevant
4     0.74   utils/retry.py      ← probably noise
5     0.72   config/settings.py  ← definitely noise

With Margin Ranking Loss + calibration:

Query: "connection pool timeout handling"

Rank  Score  File
1     0.94   db/pool.py          ← what you want
2     0.71   db/connection.py    ← below threshold, not returned
3     0.58   tests/test_pool.py  ← not returned
4     0.41   utils/retry.py      ← not returned

The model now knows it found what it was looking for. That confidence is structural, not coincidental — it's the margin doing its job.
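You can put a number on that separation by measuring the gap between the top result and the runner-up in each of the two tables above. The helper name `top_gap` is illustrative; the scores are copied from the tables.

```python
mse_scores = [0.81, 0.78, 0.75, 0.74, 0.72]
margin_scores = [0.94, 0.71, 0.58, 0.41]

def top_gap(scores):
    """Difference between the best score and the second-best score."""
    ranked = sorted(scores, reverse=True)
    return ranked[0] - ranked[1]

mse_gap = top_gap(mse_scores)        # about 0.03
margin_gap = top_gap(margin_scores)  # about 0.23
```

A gap of about 0.03 leaves no room for a cutoff. A gap of about 0.23 leaves plenty.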

Cosine similarity didn't change. What changed is that the embedding space was trained to make cosine similarity meaningful for this specific task. The metric was always measuring angle. Now the angles actually correspond to something.

If you're not doing this calibration work, you're running on the assumption that your embedding model's training distribution matches your retrieval task. It probably doesn't. The gap between those two distributions is where your retrieval quality went.