Everyone talks about which embedding model to use. Almost nobody talks about what happens after you pick one. That's where the real cost lives.
The model is a download. The infrastructure is a job.
The Part Nobody Puts in the Blog Post
You've benchmarked your options. MTEB scores, recall@10, latency numbers on your hardware. You picked a model. Good. Now: where does it run?
If you're going to production, you need a GPU instance that doesn't flake out under load. You need to decide between ONNX Runtime and PyTorch — ONNX is faster for inference, PyTorch is easier to update, and neither choice is obviously right once you're dealing with 10+ model variants and a team that didn't make the original call. You need batching logic that doesn't crater throughput when requests come in bursts. You need a versioning strategy so you don't silently break your index when you update the model.
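To make the batching concern concrete, here is a minimal sketch of budget-capped batching. It assumes a hypothetical `count_tokens` callable and caps (`max_items`, `max_tokens`) that you would tune for your own model and hardware. It only shows how a burst of inputs gets split into bounded batches; queueing, timeouts, and backpressure are out of scope.

```python
from typing import Callable, Iterable, Iterator


def batch_by_budget(
    texts: Iterable[str],
    count_tokens: Callable[[str], int],
    max_items: int = 32,      # assumed cap, tune per model and GPU
    max_tokens: int = 8192,   # assumed cap, tune per model and GPU
) -> Iterator[list[str]]:
    """Yield batches that respect both an item cap and a token cap."""
    batch: list[str] = []
    used = 0
    for text in texts:
        cost = count_tokens(text)
        # Flush the current batch if adding this text would exceed either cap.
        if batch and (len(batch) >= max_items or used + cost > max_tokens):
            yield batch
            batch, used = [], 0
        # A single text larger than max_tokens still gets its own batch.
        batch.append(text)
        used += cost
    if batch:
        yield batch


def whitespace_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer; an assumption for illustration only.
    return len(text.split())
```

Capping on tokens as well as item count is the design choice that matters here: a burst of long documents can blow past GPU memory even when the item count looks small.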
None of that is hard. It's just time. And once you build it, you own it. The engineer who built it leaves. The P40 instance that "worked fine" turns out to be a single point of failure. The model update you meant to ship six months ago keeps getting deprioritized because something else is always on fire.
That's the build-it-yourself trap. You spend two to three weeks on embedding infrastructure that isn't your product, and then you spend the next two years maintaining it.
What the Cost Comparison Actually Looks Like
The headline numbers can look fine for DIY. OpenAI's text-embedding-3-small runs around $0.02 per million tokens. If you're doing high volume, self-hosting can look like it pencils out: a P40 GPU on Lambda runs roughly $3–5/hour, and a good embedding model can push tens of millions of tokens per hour at full batch throughput.
But that math assumes full utilization, zero ops overhead, and no engineer time. None of those assumptions hold.
Most teams run embedding workloads intermittently — code indexing runs, batch re-embeddings, search queries. You're paying for GPU time whether it's busy or not. And the ops overhead is real: monitoring inference latency, debugging OOM errors, managing model cold starts, handling version upgrades without breaking the index downstream.
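The utilization point is easy to check with arithmetic. The sketch below computes an effective price per million tokens for an always-on GPU. The function name and the sample inputs are assumptions for illustration, and engineer time is deliberately left out.

```python
def self_hosted_cost_per_million(
    gpu_dollars_per_hour: float,
    peak_tokens_per_hour: float,
    utilization: float,
) -> float:
    """Effective dollars per million tokens for a GPU billed around the clock.

    utilization is the fraction of paid hours the GPU spends doing useful work.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    # Tokens actually processed per paid hour, averaged over idle time.
    effective_tokens_per_hour = peak_tokens_per_hour * utilization
    return gpu_dollars_per_hour / (effective_tokens_per_hour / 1_000_000)
```

With an assumed $4/hour instance and an assumed peak of 40 million tokens per hour, this gives $0.10 per million tokens at full utilization and $0.40 at 25% utilization. With these inputs both figures sit above the $0.02 API price quoted earlier, before any ops time is counted.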
Then there's version drift. Embedding model upgrades don't degrade gracefully. When you update a model, your existing index is semantically misaligned: vectors from model v1 aren't comparable to vectors from model v2. You need to re-embed everything. If your corpus is large, that's a job. If you didn't build re-indexing automation, it's a manual job. If you have downstream consumers that cached embeddings, you have a data consistency problem.
None of this shows up in the per-token cost calculation.
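One way to keep the model-update problem described above from going unnoticed is to store the model version next to every vector and refuse to mix versions. This is a minimal sketch, not a full index: `StoredVector`, `stale_doc_ids`, and `dot` are hypothetical names, and a real system would put the same check in its vector store.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StoredVector:
    doc_id: str
    model_version: str            # recorded at embed time, e.g. "bge-large-en-v1.5"
    values: tuple[float, ...]


def stale_doc_ids(index: list[StoredVector], current_model: str) -> list[str]:
    """Return IDs of documents embedded with a different model version."""
    return [v.doc_id for v in index if v.model_version != current_model]


def dot(a: StoredVector, b: StoredVector) -> float:
    """Dot product that refuses to compare vectors from different models."""
    if a.model_version != b.model_version:
        raise ValueError(
            f"cannot compare {a.model_version!r} with {b.model_version!r}"
        )
    return sum(x * y for x, y in zip(a.values, b.values))
```

The list returned by `stale_doc_ids` is the re-embedding backlog. Failing loudly in `dot` turns a silent relevance regression into an error you can see.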
The OpenAI-Compatible API Is the Whole Point
The barrier to switching embedding providers has historically been the migration cost. You're not just swapping a model — you're rewriting API calls, changing authentication, updating your batching logic, and hoping the new response format is close enough to your existing parsing code.
An OpenAI-compatible drop-in endpoint removes most of that. You change the client configuration:
from openai import OpenAI
client = OpenAI(
    api_key="your-pyckle-api-key",
    base_url="https://api.pyckle.dev/v1",
)
response = client.embeddings.create(
    model="bge-large-en-v1.5",
    input=["semantic search over developer documentation"],
)
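That's it. Your existing LangChain chains, your LlamaIndex pipelines, your custom RAG stack: they all keep working once they point at the new endpoint. The API key, the base URL, and the model name change. Nothing else in the calling code does. (If the model itself changes, you still re-embed your corpus, as with any model swap.)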
This matters more than it sounds. Teams don't avoid switching providers because they love their current provider. They avoid it because migration is risk with no upside. A drop-in API converts a migration project into a config change.
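Here is roughly what "a config change" can look like in practice: a sketch that reads the provider settings from environment variables so the calling code never hardcodes a vendor. The variable names (`EMBEDDINGS_API_KEY`, `EMBEDDINGS_BASE_URL`, `EMBEDDINGS_MODEL`) and the function name are hypothetical; the defaults are OpenAI's public endpoint and the model mentioned earlier.

```python
import os
from typing import Mapping


def embedding_client_settings(env: Mapping[str, str] = os.environ) -> dict[str, str]:
    """Collect provider settings from the environment.

    The variable names are assumptions; pick whatever fits your deployment.
    """
    return {
        "api_key": env["EMBEDDINGS_API_KEY"],
        "base_url": env.get("EMBEDDINGS_BASE_URL", "https://api.openai.com/v1"),
        "model": env.get("EMBEDDINGS_MODEL", "text-embedding-3-small"),
    }


# Usage with the openai package (not imported here so the sketch stays stdlib-only):
#   settings = embedding_client_settings()
#   client = OpenAI(api_key=settings["api_key"], base_url=settings["base_url"])
#   client.embeddings.create(model=settings["model"], input=["some text"])
```

Switching providers then means changing three environment variables and redeploying, with no code diff to review.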
Model Selection Is Actually Complicated
Most embedding services give you one or two models and call it done. The problem is that "embeddings" covers wildly different use cases, and the model that's best for one is mediocre for another.
Code search is not the same as document retrieval. Multilingual content has different requirements than English-only. Long-context documents — API references, technical specs, lengthy READMEs — need 8K context windows, not 512 tokens. A model trained on web text will systematically underperform on code, because the token distribution is completely different.
What actually matters in practice:
- Code search — you want a model trained on code. Specialized models outperform general-purpose ones by a meaningful margin on CodeSearchNet benchmarks. This is what PyckLM is built for, and it's reserved for Pyckle platform users.
- RAG over docs — bge-large-en-v1.5 is the workhorse here. Consistently strong across BEIR benchmarks, well-calibrated, wide adoption means plenty of production evidence.
- Multilingual — bge-m3 handles 100+ languages with competitive performance. If your users write in more than one language, this is the tier you need.
- Semantic search with nuance — e5-mistral-7b is a 7B decoder-based model. Slower, more expensive, but it catches semantic relationships that smaller models miss. Worth it when precision matters more than throughput.
- Long documents — 8K context window models handle full files, long-form documentation, and extended code contexts without chunking hacks.
Having the right model for the job isn't a luxury. Embedding the wrong content with a general-purpose model and then wondering why retrieval quality is off is a common failure mode. The answer usually isn't "tune your retrieval pipeline" — it's "use a model that was trained for this."
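The list above boils down to a lookup table. A minimal sketch, assuming hypothetical use-case keys and assuming the model names from this post are accepted verbatim as API model identifiers (only `bge-large-en-v1.5` appears in the earlier example). Code search and long-document models are left out because their identifiers aren't given here.

```python
# Use-case keys are hypothetical; model names are taken from the list above.
MODEL_BY_USE_CASE = {
    "docs_rag": "bge-large-en-v1.5",
    "multilingual": "bge-m3",
    "high_precision": "e5-mistral-7b",
}


def pick_model(use_case: str) -> str:
    """Return the embedding model configured for a use case, or fail loudly."""
    try:
        return MODEL_BY_USE_CASE[use_case]
    except KeyError:
        raise ValueError(
            f"no embedding model configured for {use_case!r}"
        ) from None
```

Failing on an unknown use case, instead of quietly falling back to a general-purpose default, is the point: the fallback is exactly the failure mode described above.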
What You're Actually Buying
This isn't a pitch for outsourcing everything. Some teams have the GPU capacity, the ML infra experience, and the maintenance bandwidth to run this well in-house. Those teams exist.
Most teams don't. And for most teams, the decision to self-host embeddings is a decision to permanently staff a small piece of ML infrastructure that has nothing to do with their product. Every hour spent on that is an hour not spent on whatever the product actually does.
What an embeddings service is selling isn't access to a model. You can download most of the relevant models from HuggingFace right now, for free. What it's selling is:
- Time back. No provisioning, no ops, no version management. You call an API and get vectors.
- Reliability you didn't have to build. SLAs, uptime monitoring, and inference infrastructure maintained by people whose entire job is inference infrastructure.
- Calibration that's already done. Models tuned for developer workflows, tested against real code search benchmarks, updated without breaking your downstream index.
The math on build vs. buy almost always favors buy for non-core infrastructure — not because cloud is cheap, but because the fully-loaded cost of ownership for self-hosted ML infra is consistently underestimated at planning time and consistently painful at maintenance time.
Pick the model that fits your use case. Let someone else keep the GPU warm.