Most teams switching embedding providers expect the hard part to be the API integration. It isn't. The endpoint swap takes about ten minutes if your new provider is API-compatible. What actually burns teams is the part nobody writes about: your similarity thresholds are now wrong, and they'll stay wrong silently until your search quality degrades and someone notices.
The Endpoint Swap Is Not the Migration
If you're moving to a standard embedding API with a compatible interface — same request format, same response shape — the mechanical work is trivial. One environment variable.
# Before
EMBEDDING_API_BASE=https://api.openai.com/v1
EMBEDDING_MODEL=text-embedding-3-small
# After
EMBEDDING_API_BASE=https://api.pyckle.dev/v1
EMBEDDING_MODEL=pyckle-embed-v1
Same SDK. Same calling code. Your pipeline doesn't care. If the Python client you already use works against both providers, the code side is done in under an hour, testing included.
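To make "one environment variable" concrete, here is a minimal sketch of calling code that never names a provider. It assumes an OpenAI-style `/embeddings` endpoint that accepts `model` and `input` in a JSON body. The `EMBEDDING_API_KEY` variable and the `build_embedding_request` helper are hypothetical names, not part of either provider's SDK.

```python
import json
import os
import urllib.request


def build_embedding_request(texts):
    """Build (but do not send) an embeddings request from environment config.

    Assumes an OpenAI-style API: POST {base}/embeddings with a JSON body
    containing "model" and "input".
    """
    base = os.environ["EMBEDDING_API_BASE"].rstrip("/")
    model = os.environ["EMBEDDING_MODEL"]
    # EMBEDDING_API_KEY is a hypothetical variable name for this sketch.
    api_key = os.environ.get("EMBEDDING_API_KEY", "")

    body = json.dumps({"model": model, "input": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{base}/embeddings",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
```

Nothing in the function changes when the two variables do, which is exactly why the swap feels finished before it is.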
The trap is thinking you're done. You're not. You have a search system built on top of embeddings, and somewhere in that system there are thresholds — cutoff scores below which results get filtered, ranked differently, or discarded entirely. Those thresholds were calibrated against your old embedding space. They are now wrong.
Why Thresholds Break Across Models
Cosine similarity scores are not universal. A 0.85 in one embedding space does not mean the same thing as a 0.85 in another. Each model has its own geometry — its own way of distributing semantic relationships across the vector space. Some models cluster tightly, producing scores in a narrow band. Others spread out. Some produce high baseline similarity even between unrelated content.
When you move from your current embedding model to a different one, your entire score distribution shifts. If your previous model produced relevant results between 0.80 and 0.95, and your new model's equivalent quality range is 0.65 to 0.85, your old threshold of 0.80 is now filtering out good results. The inverse also happens: if the new model scores higher across the board, a threshold that used to be tight is now too permissive.
You can't compare scores across models. Full stop. It doesn't matter that the numbers look similar — they're measuring distance in different spaces. Treat your thresholds as model-specific configuration, not universal constants.
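One way to enforce "model-specific configuration" is to key every threshold by model name and fail loudly when a model has no calibrated value. This is a sketch, not a prescribed layout. The numbers are placeholders rather than measured cutoffs, and `ThresholdConfig` and `threshold_for` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ThresholdConfig:
    model: str
    min_similarity: float


# Placeholder values for illustration only; calibrate your own per model.
THRESHOLDS = {
    "text-embedding-3-small": ThresholdConfig("text-embedding-3-small", 0.80),
    "pyckle-embed-v1": ThresholdConfig("pyckle-embed-v1", 0.65),
}


def threshold_for(model: str) -> float:
    """Return the calibrated cutoff for a model, or raise if none exists."""
    try:
        return THRESHOLDS[model].min_similarity
    except KeyError:
        raise KeyError(
            f"No calibrated threshold for model {model!r}; recalibrate before use"
        ) from None
```

The point of raising instead of falling back to a default is that a missing entry becomes a deploy-time error rather than a silent quality regression.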
The Recalibration Process
You don't need to re-embed everything to recalibrate. You need a representative sample — 500 to 1,000 chunks from your actual data, covering the spread of content types your system handles. Pull a stratified sample, not just the first 500 records.
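A stratified pull can be sketched in a few lines. This version assumes each chunk carries some field you can group on (a hypothetical `content_type` in the usage below) and samples each group in proportion to its share of the corpus, with a floor of one chunk so rare content types still show up.

```python
import random
from collections import defaultdict


def stratified_sample(chunks, key, total, seed=0):
    """Sample roughly `total` chunks, proportional to each stratum's size."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible

    by_stratum = defaultdict(list)
    for chunk in chunks:
        by_stratum[key(chunk)].append(chunk)

    sample = []
    for members in by_stratum.values():
        # Proportional share, but at least one chunk per stratum.
        share = max(1, round(total * len(members) / len(chunks)))
        sample.extend(rng.sample(members, min(share, len(members))))
    return sample
```

Usage would look something like `stratified_sample(chunks, key=lambda c: c["content_type"], total=500)`. Because of rounding and the one-per-stratum floor, the result can land slightly above or below `total`.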
Re-embed that sample with your new model. Then run your existing eval queries against it. These should be queries where you already know what a good result looks like — queries you've used before to sanity-check your search, or a small set you've labeled by hand.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumes eval_set maps query text -> known-good chunk IDs, chunk_vecs holds the
# re-embedded sample (one row per chunk), and chunk_ids lists IDs in row order.
results = []
for query, expected_ids in eval_set.items():
    query_vec = new_model.embed(query)
    scores = cosine_similarity([query_vec], chunk_vecs)[0]
    expected = set(expected_ids)

    # Track what threshold recovers your known-good results
    for threshold in np.arange(0.5, 0.95, 0.05):
        retrieved = {chunk_ids[i] for i, s in enumerate(scores) if s >= threshold}
        hits = len(retrieved & expected)
        recall = hits / len(expected)
        precision = hits / max(len(retrieved), 1)  # avoid dividing by zero
        results.append((threshold, recall, precision))
Plot recall and precision across the threshold range. The right cutoff is wherever the tradeoff is acceptable for your use case. If missing relevant results is expensive, bias toward recall. If noise is expensive, bias toward precision. This is a product decision, not a math problem — the math just shows you the curve.
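If you'd rather read the curve as numbers before plotting it, the `results` list from the loop above can be averaged per threshold. The sketch below also shows one possible selection rule: take the highest threshold that still meets a recall target. That rule is an assumption for illustration, and `summarize` and `pick_threshold` are hypothetical helpers.

```python
from collections import defaultdict
from statistics import mean


def summarize(results):
    """Average recall and precision per threshold across all eval queries."""
    grouped = defaultdict(list)
    for threshold, recall, precision in results:
        # Round so float noise from np.arange doesn't split one threshold in two.
        grouped[round(float(threshold), 2)].append((recall, precision))
    return {
        t: (mean(r for r, _ in rows), mean(p for _, p in rows))
        for t, rows in sorted(grouped.items())
    }


def pick_threshold(curve, min_recall):
    """Highest threshold whose mean recall meets the target, else None."""
    candidates = [t for t, (recall, _) in curve.items() if recall >= min_recall]
    return max(candidates) if candidates else None
```

A recall-first rule like this suits systems where a missed result is the expensive failure. If noise is your expensive failure, flip it: pick the lowest threshold that meets a precision target.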
Don't assume that higher similarity scores from the new model mean better results. Some models score higher across the board. Higher numbers aren't signal. Relative ordering and your recall/precision curve are.
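To look at relative ordering directly, you can score each query by where the first known-good chunk lands in the ranking and ignore the similarity values entirely. Reciprocal rank is one common way to do that. It is a choice made for this sketch, not the only rank-based metric.

```python
def reciprocal_rank(ranked_ids, expected_ids):
    """Return 1/position of the first expected chunk in the ranking, or 0.0."""
    expected = set(expected_ids)
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in expected:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(rankings):
    """rankings: list of (ranked_ids, expected_ids) pairs, one per query."""
    if not rankings:
        return 0.0
    return sum(reciprocal_rank(r, e) for r, e in rankings) / len(rankings)
```

Compute this for the old and new model on the same eval queries. A model whose scores are lower across the board can still win here, because the metric only sees order.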
Rolling Out the Full Re-Embed
Once you have a calibrated threshold on the sample, re-embed your full dataset in batches. Don't try to do this in one shot if you have meaningful volume — batch sizes of 500 to 1,000 chunks let you monitor and stop cleanly if something goes wrong.
Run your old and new indices in parallel during rollout if you can afford the storage. Shadow traffic against both, compare result sets on a sample of live queries. You're looking for cases where the new model surfaces clearly worse results — content that's semantically adjacent but wrong for the query.
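A simple way to triage shadow traffic is to measure how much the two top-k result sets overlap and review only the queries where they diverge sharply. The sketch below uses Jaccard overlap. The `shadow_log` shape and the 0.3 cutoff are assumptions for illustration.

```python
def overlap_at_k(old_ids, new_ids, k=10):
    """Jaccard overlap between the top-k results of two indices."""
    old_top, new_top = set(old_ids[:k]), set(new_ids[:k])
    union = old_top | new_top
    # Two empty result sets agree trivially.
    return len(old_top & new_top) / len(union) if union else 1.0


def flag_divergent(shadow_log, k=10, min_overlap=0.3):
    """Return queries whose old and new top-k results barely overlap.

    shadow_log: iterable of (query, old_ranked_ids, new_ranked_ids).
    """
    return [
        query
        for query, old_ids, new_ids in shadow_log
        if overlap_at_k(old_ids, new_ids, k) < min_overlap
    ]
```

Low overlap doesn't mean the new model is wrong, only that it disagrees. The flagged queries are the ones worth a human look.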
# Batch re-embed with checkpointing
BATCH_SIZE=500
CHECKPOINT_FILE="embed_progress.json"

python3 re_embed.py \
  --source chunks.jsonl \
  --model pyckle-embed-v1 \
  --batch-size "$BATCH_SIZE" \
  --checkpoint "$CHECKPOINT_FILE" \
  --output new_embeddings.jsonl
Monitor quality during rollout, not just after. If you're seeing a degraded result set on a class of queries — code search behaving differently than doc search, for instance — you may need segment-specific threshold tuning, not a single global cutoff.
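Spotting a degraded class of queries means breaking your eval numbers out by segment instead of averaging them together. A minimal sketch, assuming each eval row is tagged with a segment label such as "code" or "docs" (the row shape is hypothetical):

```python
from collections import defaultdict


def recall_by_segment(eval_rows):
    """Compute recall per segment.

    eval_rows: iterable of (segment, retrieved_ids, expected_ids).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for segment, retrieved, expected in eval_rows:
        hits[segment] += len(set(retrieved) & set(expected))
        totals[segment] += len(expected)
    # Skip segments with no expected results to avoid dividing by zero.
    return {seg: hits[seg] / totals[seg] for seg in totals if totals[seg]}
```

If one segment's recall sits well below the others at the same cutoff, that segment is the candidate for its own threshold.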
The checkpoint matters. Re-embedding 100,000 chunks and failing at 80,000 without a resume point is an avoidable problem.
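What a resume point might look like inside a script such as `re_embed.py` is sketched below. This is an assumption about one reasonable design, not the contents of that script. `embed_batch` stands in for whatever function calls your embedding API, and chunks are assumed to be dicts with `id` and `text` keys.

```python
import json
import os


def load_checkpoint(path):
    """Return how many chunks are already embedded (0 if no checkpoint)."""
    if not os.path.exists(path):
        return 0
    with open(path) as f:
        return json.load(f)["done"]


def save_checkpoint(path, done):
    # Write to a temp file, then rename, so a crash never leaves a
    # half-written checkpoint behind.
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"done": done}, f)
    os.replace(tmp, path)


def re_embed(chunks, embed_batch, out_path, checkpoint_path, batch_size=500):
    """Embed chunks in batches, resuming from the last completed batch."""
    start = load_checkpoint(checkpoint_path)
    with open(out_path, "a") as out:  # append so earlier batches survive
        for i in range(start, len(chunks), batch_size):
            batch = chunks[i:i + batch_size]
            vectors = embed_batch([c["text"] for c in batch])
            for chunk, vec in zip(batch, vectors):
                row = {"id": chunk["id"], "embedding": list(vec)}
                out.write(json.dumps(row) + "\n")
            out.flush()
            # Checkpoint only after the batch is written. A crash between
            # the write and this line can duplicate one batch on resume,
            # so dedupe by id downstream.
            save_checkpoint(checkpoint_path, i + len(batch))
```

Checkpointing after each batch means the worst case on failure is redoing one batch, not 80,000 chunks.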
The Inventory Step Nobody Does
Before any of this: audit where your thresholds actually live. This sounds obvious. In practice, most teams find thresholds scattered across three different places when they go looking — a config file, a hardcoded constant in a retrieval function, and a database column in a search settings table that nobody's touched in a year.
The thresholds that burn you are the ones you forgot you set. A filter that quietly drops results below 0.75 somewhere in a pipeline step, set by someone who tuned it by feel eighteen months ago and never wrote it down. That filter is now miscalibrated and you won't know it because your system still runs — it just surfaces worse results.
Map every place in your codebase where a similarity score is compared against a fixed value. Grep for similarity, score, threshold, cutoff. Make a list. Every one of those values needs to be recalibrated against the new model before you cut over production traffic.
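# Find hardcoded thresholds in your codebase
# The second grep keeps lines containing any decimal literal (0.8, 0.75, .5),
# not just two-digit ones.
grep -rn "similarity\|\.score\|threshold\|cutoff" src/ \
  --include="*.py" \
  | grep -E "[0-9]*\.[0-9]+" \
  | grep -v "test_"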
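Grep matches text, so it also catches comments and version strings. For Python sources, a complementary sketch is to walk the syntax tree and report only real comparisons against a float literal between 0 and 1. The `find_float_comparisons` helper is hypothetical. It will not see thresholds held in named constants, config files, or database rows, so it adds to the inventory rather than replacing it.

```python
import ast


def find_float_comparisons(source, filename="<string>"):
    """Return (line, code) for comparisons against a float literal in (0, 1)."""
    tree = ast.parse(source, filename=filename)
    found = []
    for node in ast.walk(tree):
        if not isinstance(node, ast.Compare):
            continue
        operands = [node.left, *node.comparators]
        if any(
            isinstance(op, ast.Constant)
            and isinstance(op.value, float)
            and 0.0 < op.value < 1.0
            for op in operands
        ):
            found.append((node.lineno, ast.get_source_segment(source, node)))
    # ast.walk is breadth-first, so sort by line for readable output.
    return sorted(found)


sample = "def keep(hit):\n    return hit.score >= 0.75\n\nLIMIT = 10\n"
print(find_float_comparisons(sample))
```

Run it over each file's contents and add the hits to the same list as the grep output.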
If your threshold was arbitrary — you picked 0.80 because it felt right — the recalibration process is also a chance to do it properly. Use the eval set. Build the curve. Make a real decision about where precision and recall trade off for your users.
The migration itself is not the hard part. The hard part is the gap between "my pipeline runs" and "my search is as good as it was." Those are different bars, and the similarity threshold is almost always what sits in between them. Get the inventory done first, recalibrate on a real eval set, and don't trust score magnitudes across model boundaries. The numbers lie. The recall curve doesn't.