The Architecture Determines the Cliff: A 7-Model NIAH Study

We ran the same needle-in-a-haystack benchmark across 7 models from 4 attention mechanism families. The context cliff is not universal. It is architecture-specific.

After establishing the 32K context cliff in a full-attention 4B model (previous post), the natural question was: is this specific to that model family, or is it universal? We pulled 5 additional models spanning three architecturally distinct families and ran the same test.

The short answer: the cliff exists in some families and is absent in others. The determining factor is the attention mechanism.

The Four Families

Full Causal Attention

Every token attends to every previous token. The KV cache grows linearly with sequence length. At some point, the effective attention range cannot cover the full context, and recall drops hard. Both tested models in this family hold perfect recall at 32K and drop to zero by 64K, which suggests the boundary is family-level, not parameter-count-level.

GQA (Grouped-Query Attention)

GQA reduces the number of KV heads, here from 32 (as in full multi-head attention) to 8, by sharing key/value projections across groups of query heads. With everything else equal, this shrinks the KV cache by 4x. Our prediction: a later cliff. The result: the 8B model degrades gradually (0.67 at 64K, 0.00 at 96K), while the 3B model holds perfect recall through 96K and fails at 128K.
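To make the cache-size argument concrete, here is a minimal sketch of the standard KV cache size calculation. The layer count, head dimension, and 2-byte value width are illustrative assumptions, not the configuration of any model we tested; only the 32 and 8 KV-head counts come from the text above.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_value=2):
    """Keys plus values for every layer, KV head, and cached token."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

# Hypothetical configuration, chosen only for illustration.
N_LAYERS, HEAD_DIM, SEQ_LEN = 32, 128, 32_768

full_mha = kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv_heads=32, head_dim=HEAD_DIM)
gqa = kv_cache_bytes(SEQ_LEN, N_LAYERS, n_kv_heads=8, head_dim=HEAD_DIM)

print(f"32 KV heads: {full_mha / 2**30:.1f} GiB")
print(f" 8 KV heads: {gqa / 2**30:.1f} GiB")
print(f"reduction:   {full_mha / gqa:.0f}x")
```

The cache scales linearly with both sequence length and KV-head count, so cutting KV heads from 32 to 8 buys a 4x longer context for the same memory under these assumptions.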

Sliding Window Attention

Each layer attends only to the last 4,096 tokens. On paper this should produce gradual degradation as content scrolls out of the sliding window. In practice, the sliding-window model shows the same hard cliff as the full-attention models, at the same location. Likely reason: the model was not trained to handle contexts beyond its practical limit, so out-of-distribution inputs produce the same binary failure mode.
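As a rough illustration of what "attends only to the last 4,096 tokens" means within one layer, the sketch below builds a causal sliding-window mask in plain Python. The function names are hypothetical, and real implementations use tensor operations rather than nested lists.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k."""
    return [
        [q - window < k <= q for k in range(seq_len)]
        for q in range(seq_len)
    ]

def visible_keys(q, window):
    """Key positions visible to query position q within a single layer."""
    return range(max(0, q - window + 1), q + 1)

# Tiny example: 6 tokens, window of 3.
for row in sliding_window_mask(6, 3):
    print("".join("x" if allowed else "." for allowed in row))

# With the 4,096-token window from the text, the last token of a 32K prompt
# sees only the final 4,096 positions directly.
keys = visible_keys(32_767, 4_096)
print(keys.start, keys.stop - 1)
```

A needle placed early in a long prompt is therefore outside the direct window of the final token, which is why gradual degradation was the expected outcome.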

MQA + Interleaved Global/Local Attention

Multi-Query Attention uses a single shared KV head, giving the smallest possible KV cache per token. This family also interleaves local sliding-window attention with global attention layers, distributing the context load. This combination was designed and trained for 128K. The result matches that design target: no cliff anywhere in the tested range.

The Results

Model               Attention          32K   64K   96K   128K  Pattern
4B full-attention   Full causal        1.00  0.00  0.00  0.00  Hard cliff
9B full-attention   Full causal        1.00  0.00  0.00  0.00  Hard cliff
7B sliding-window   Sliding window     1.00  0.00  0.00  0.00  Hard cliff
8B GQA              GQA                1.00  0.67  0.00  0.00  Graduated
3B GQA              GQA                1.00  1.00  1.00  0.00  Later cliff
4B MQA/interleaved  MQA / interleaved  1.00  1.00  1.00  1.00  No cliff

The MQA model: 30/30 HITs across all positions and all context lengths through 128K. A 4B model with 4-bit quantization on consumer hardware recalled every needle — whether placed at the start, middle, or end of a 128K context. This is the most significant single data point in the study.
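One way to read the table programmatically is to find, for each model, the longest tested context with perfect recall and the first context where recall hits zero. The sketch below does this with the recall values from the table; the helper names are our own and are not part of the benchmark harness.

```python
CONTEXT_K = [32, 64, 96, 128]

# Recall per context length, copied from the results table.
RESULTS = {
    "4B full-attention":  [1.00, 0.00, 0.00, 0.00],
    "9B full-attention":  [1.00, 0.00, 0.00, 0.00],
    "7B sliding-window":  [1.00, 0.00, 0.00, 0.00],
    "8B GQA":             [1.00, 0.67, 0.00, 0.00],
    "3B GQA":             [1.00, 1.00, 1.00, 0.00],
    "4B MQA/interleaved": [1.00, 1.00, 1.00, 1.00],
}

def last_full_recall(recalls):
    """Longest tested context (K tokens) before recall first drops below 1.0."""
    best = None
    for length, recall in zip(CONTEXT_K, recalls):
        if recall < 1.0:
            break
        best = length
    return best

def first_zero_recall(recalls):
    """First tested context (K tokens) with zero recall, or None."""
    for length, recall in zip(CONTEXT_K, recalls):
        if recall == 0.0:
            return length
    return None

for model, recalls in RESULTS.items():
    zero = first_zero_recall(recalls)
    zero_text = f"{zero}K" if zero is not None else "not reached"
    print(f"{model:20s} full recall through {last_full_recall(recalls)}K, "
          f"zero recall at {zero_text}")
```

A gap between the two numbers wider than one step (as with the 8B GQA model) is what the table labels "Graduated"; a gap of exactly one step is a hard or later cliff.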

Why This Matters

The common assumption is that small models cannot handle long contexts. The MQA result is a direct counterexample. The constraint is not parameter count. It is the KV cache architecture and what the model was trained to handle.

The MQA model's KV cache at 128K context is approximately 3.4GB — well within the VRAM budget of a 16GB card running a 4B 4-bit quantized model (~3.3GB for weights). The architecture was designed for 128K, and it delivers exactly that in practice.
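The budget arithmetic behind that claim is sketched below. The 3.4GB, 3.3GB, and 16GB figures are the ones quoted above; treating "128K" as 131,072 tokens and a GB as 10^9 bytes are our assumptions, and the sum ignores activations and runtime overhead.

```python
WEIGHTS_GB = 3.3    # from the text: 4B model, 4-bit quantized
KV_CACHE_GB = 3.4   # from the text: KV cache at 128K context
VRAM_GB = 16.0      # from the text: card capacity

CONTEXT_TOKENS = 128 * 1024  # assumption: "128K" means 131,072 tokens

used_gb = WEIGHTS_GB + KV_CACHE_GB
headroom_gb = VRAM_GB - used_gb

# Implied average KV footprint per cached token (assumes 1 GB = 1e9 bytes).
kv_kb_per_token = KV_CACHE_GB * 1e9 / CONTEXT_TOKENS / 1e3

print(f"weights + KV cache: {used_gb:.1f} GB")
print(f"headroom on card:   {headroom_gb:.1f} GB")
print(f"KV per token:       {kv_kb_per_token:.1f} KB")
```

Even with generous allowance for activations and framework overhead, weights plus cache use well under half of the card.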

The Sliding-Window Surprise

We expected the sliding-window model to produce gradual degradation — content would simply scroll out of the 4,096-token window per layer. Instead, we saw a hard cliff identical to the full-attention models.

The most likely explanation: the model was not trained to handle contexts much beyond its practical training distribution. When you push past that boundary, the model fails the same way a full-attention model fails. This is not because sliding window is equivalent to full attention, but because the out-of-distribution behavior converges to the same failure mode.

Implication for model selection: If you need reliable recall beyond 32K, architectural family matters more than parameter count. A 4B MQA model outperforms a 7B sliding-window model at 64K+ context on recall tasks. Architecture first, size second.

Pyckle's Role by Architecture

These results change how we think about where Pyckle provides value:

  • Full-attention / Sliding-window models — Cliff protection is critical. Sessions approaching 32K must have injection in place or recall fails completely.
  • GQA models — Graduated cliff gives more headroom, but protection is still needed for sessions running past 64–96K. Pyckle's efficiency benefit (token reduction) also applies throughout.
  • MQA models — No cliff to protect against through 128K. Pyckle's value here is purely efficiency: token reduction and faster prefill, not recall rescue.

The token reduction benefit is architecture-agnostic. Regardless of whether your model cliffs at 32K or 128K, injecting 169 tokens instead of 32K means roughly 190x fewer tokens to prefill on every query.
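The reduction factor is simple division, sketched below. Whether "32K" means 32,000 or 32,768 tokens is not specified, so both are shown; the function name is hypothetical.

```python
def reduction_factor(full_tokens, injected_tokens):
    """How many times fewer tokens are sent to prefill."""
    return full_tokens / injected_tokens

INJECTED_TOKENS = 169  # from the text

for full_tokens in (32_000, 32_768):
    factor = reduction_factor(full_tokens, INJECTED_TOKENS)
    print(f"{full_tokens:,} -> {INJECTED_TOKENS} tokens: {factor:.1f}x fewer")
```

Either reading lands near 190x in token count. Actual prefill time savings depend on the runtime and hardware and are not modeled here.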
