Our local model study found a hard context cliff at 32K in several model families. The natural question was: do commercial API models show the same failure mode? We built API-adapted needle-in-a-haystack (NIAH) scanners and tested five commercial models using the same needle/haystack methodology as the local study.
Commercial Results Summary
| Model | Provider | Tested Through | HITs | Verdict |
|---|---|---|---|---|
| Claude Haiku | Anthropic | 128K | 18/18 | Clean |
| Claude Sonnet | Anthropic | 192K | 21/21 | Clean |
| GPT-4o-mini | OpenAI | 128K | 24/24 | Clean |
| GPT-4o | OpenAI | 28K | 9/9 | Tier 1 TPM limit |
| Gemini 2.5 Flash | Google | 512K | 30/30 | Clean |
None of the five commercial models showed a recall cliff at any context length we were able to test. Where tests stopped short, the cause was an API rate limit or quota, not a model recall failure. Every model that was fully tested returned a 1.00 recall rate throughout its practical context window.
The Gemini 2.5 Flash Result
The most striking result: Gemini 2.5 Flash showed perfect recall (30/30 HITs) across all three needle positions (start, middle, end) and all five context lengths (32K, 64K, 128K, 256K, 512K). It is a reasoning model, and testing stopped at 512K only because the 1M-token scan did not fit within the free-tier daily quota. No degradation was observed at any tested length.
The Gemini result is consistent with what we found locally for MQA models — that architecture is purpose-built for large context, and the deployed models reflect that.
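For readers unfamiliar with the method, the position-by-length grid described above can be sketched roughly as follows. This is a minimal illustration, not the scanner used in the study: the needle text, filler sentence, question, and `fake_model` stand-in are all hypothetical, and lengths are measured in characters here rather than tokens to keep the example dependency-free.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical needle and filler; the study's actual strings are not given in this post.
NEEDLE = "The secret passphrase is indigo-falcon-42."
ANSWER = "indigo-falcon-42"
FILLER = "The quick brown fox jumps over the lazy dog. "

# Fractional depth of the needle inside the haystack.
POSITIONS = {"start": 0.0, "middle": 0.5, "end": 1.0}


def build_prompt(total_chars: int, position: str) -> str:
    """Pad filler to roughly total_chars and embed the needle at the given position."""
    n_sentences = max(1, total_chars // len(FILLER))
    sentences = [FILLER] * n_sentences
    idx = int(POSITIONS[position] * n_sentences)
    sentences.insert(idx, NEEDLE + " ")
    return "".join(sentences) + "\nWhat is the secret passphrase?"


def scan(ask: Callable[[str], str], lengths: List[int]) -> Dict[Tuple[int, str], bool]:
    """Run every (length, position) cell and record HIT (True) or MISS (False)."""
    results = {}
    for length in lengths:
        for position in POSITIONS:
            reply = ask(build_prompt(length, position))
            results[(length, position)] = ANSWER in reply
    return results


def recall(results: Dict[Tuple[int, str], bool]) -> float:
    """Fraction of cells in which the model returned the needle."""
    return sum(results.values()) / len(results)


# Stand-in "model" that finds the needle by string search; swap in a real API call.
def fake_model(prompt: str) -> str:
    return ANSWER if ANSWER in prompt else "I don't know."


grid = scan(fake_model, lengths=[1_000, 2_000, 4_000])
print(f"{sum(grid.values())}/{len(grid)} HITs, recall = {recall(grid):.2f}")
```

A real scanner would replace `fake_model` with a call to the provider's API and size the haystack with the provider's tokenizer instead of character counts.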
GPT-4o: An Infrastructure Limit, Not a Model Limit
GPT-4o's Tier 1 rate limit is 30,000 TPM (tokens per minute). A single 32K NIAH call requests 32,042 tokens — which exceeds the entire per-minute budget. Every 32K+ call returned a 429 rate limit error regardless of inter-call delay.
We confirmed clean recall through 28K context (9/9 HITs) and accepted the Tier 1 ceiling as an infrastructure constraint rather than a model capability issue. GPT-4o-mini covers the OpenAI family across the full 128K range; for GPT-4o itself, 28K is as far as our measurements go.
The GPT-4o wall is not a recall cliff — it is a rate limit artifact. Tier 2+ access would allow full testing. We report what we measured: clean through 28K, blocked by rate limits above that.
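The arithmetic behind the wall is simple enough to check directly. The sketch below uses only the two figures reported above; the helper name `fits_tpm_budget` is made up for illustration, and the assumption that the limiter rejects any single request larger than the per-minute budget is inferred from the 429 responses we saw, not from a documented guarantee.

```python
def fits_tpm_budget(request_tokens: int, tpm_limit: int) -> bool:
    """A single request can only succeed if it fits within one minute's token budget."""
    return request_tokens <= tpm_limit


TIER1_TPM = 30_000        # GPT-4o Tier 1 limit reported above
CALL_32K_TOKENS = 32_042  # tokens requested by one 32K NIAH call, as reported above

# False: the request alone exceeds the budget, so no inter-call delay can help.
print(fits_tpm_budget(CALL_32K_TOKENS, TIER1_TPM))

# Tokens over budget for a single call.
print(CALL_32K_TOKENS - TIER1_TPM)
```

This is why spacing out calls made no difference: the limit is hit by one request, not by the rate of requests.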
Why No Cliffs?
Commercial API models differ from local quantized models in several relevant ways:
- Full-precision inference: no quantization artifacts in the weights or the KV cache. The 4-bit quantization we use locally reduces weights to 4-bit precision and may affect long-range attention.
- Infrastructure-level optimization: production serving systems apply techniques (Flash Attention, KV cache management, speculative decoding) that make long contexts practical to serve without cutting precision or truncating the window.
- Training distribution — commercial models are explicitly trained and fine-tuned for their advertised context windows, with context length as a first-class product requirement.
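To give a feel for what "4-bit precision" means in the first point, here is a toy round trip through a symmetric 4-bit grid. It is only an illustration under simplifying assumptions: the weights are made up, a single scale is used for the whole list, and real local runtimes use more elaborate group-wise schemes. It shows the rounding error that quantization introduces, not its effect on recall.

```python
from typing import List


def quantize_4bit(weights: List[float]) -> List[float]:
    """Symmetric round-to-nearest onto 15 signed levels (-7..7), then map back to floats."""
    # One scale for the whole list; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / 7 or 1.0
    return [max(-7, min(7, round(w / scale))) * scale for w in weights]


# Made-up example weights, not taken from any model.
weights = [0.82, -0.31, 0.05, -0.77, 0.49, 0.002]
restored = quantize_4bit(weights)
errors = [abs(w - r) for w, r in zip(weights, restored)]

for w, r in zip(weights, restored):
    print(f"{w:+.3f} -> {r:+.3f}")
print(f"max abs error: {max(errors):.3f}")
```

Small weights collapse to zero and every value moves by up to half a quantization step, which is the kind of perturbation the first point refers to.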
The local model cliffs we documented are real — they are observed on real hardware, running real workloads, at the quantization levels developers actually use. Commercial APIs trade cost and privacy for extended effective context range.
What This Means for Architecture Decisions
If you use commercial APIs
The recall cliff is unlikely to be your primary concern. The value of Pyckle is predominantly cost reduction: 190x fewer tokens means 190x lower API spend per query. At $15/M tokens, a 32K query costs $0.48. With injection, the same query costs $0.0025. That difference compounds across millions of developer queries per month.
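The figures above can be reproduced with a few lines of arithmetic. This sketch assumes "32K" means 32,000 input tokens and takes the injected size to be the 169 tokens cited later in this post; it prices input tokens only.

```python
PRICE_PER_MILLION_USD = 15.00  # rate quoted above
FULL_CONTEXT_TOKENS = 32_000   # "32K" read as 32,000 tokens (assumption)
INJECTED_TOKENS = 169          # injected context size cited later in this post


def query_cost(tokens: int, price_per_million: float = PRICE_PER_MILLION_USD) -> float:
    """Input-token cost in USD for a single query."""
    return tokens * price_per_million / 1_000_000


full = query_cost(FULL_CONTEXT_TOKENS)
injected = query_cost(INJECTED_TOKENS)

print(f"full context: ${full:.2f}")
print(f"injected:     ${injected:.4f}")
print(f"reduction:    {FULL_CONTEXT_TOKENS / INJECTED_TOKENS:.0f}x")
```

Under these assumptions the ratio works out to about 189x, which the text rounds to 190x.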
If you use local models
The cliff is an active risk for full-attention and sliding-window model families running past 32K context. MQA models eliminate the cliff for users who need long context locally — but the efficiency argument still holds: even at full precision, 169 tokens is a faster and cheaper prefill than 32K.
If you use both
A consistent context injection layer across local and cloud backends means your session history is portable. The same semantic index works regardless of which model backend handles a given query.