The Context Cliff: What Happens When Your AI Runs Out of Memory

We ran a needle-in-a-haystack benchmark across context lengths from 32K to 128K tokens. At 48K, recall for needles at the start and middle of the context dropped from 1.00 to 0.00. Instantly.

Every LLM has an advertised context window. Most developers assume the model degrades gracefully as it approaches that limit — a slow, manageable decline you can observe and compensate for. Our benchmarks show the opposite. The decline is not gradual. It is a cliff.

The Needle-in-a-Haystack Methodology

To measure how reliably a model retrieves information from long contexts, we used a standardized NIAH (Needle-in-a-Haystack) test. The setup is straightforward:

  • Construct a context of a specific token length filled with realistic-looking noise text
  • Embed a specific "needle" fact at a known position (start, middle, or end)
  • Ask the model to retrieve that exact fact from the context
  • Repeat across 3 positions × 2 trials per context length
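The steps above can be sketched as a small harness. This is a minimal illustration, not the benchmark code used for this article: the filler vocabulary, the needle sentence, and the `ask_model` callable are all hypothetical, and context size is counted in words as a rough proxy for tokens.

```python
import random
from typing import Callable, Dict, List

# Hypothetical filler vocabulary standing in for realistic-looking noise text.
FILLER_WORDS = ["system", "report", "update", "module", "review", "status", "client", "window"]
# Needle positions expressed as a fraction of the way through the context.
POSITIONS = {"start": 0.0, "middle": 0.5, "end": 1.0}


def build_haystack(n_words: int, needle: str, fraction: float, seed: int = 0) -> str:
    """Return n_words of filler with the needle inserted `fraction` of the way through."""
    rng = random.Random(seed)
    words = [rng.choice(FILLER_WORDS) for _ in range(n_words)]
    words.insert(int(fraction * len(words)), needle)
    return " ".join(words)


def run_niah(ask_model: Callable[[str], str], context_sizes: List[int],
             secret: str = "7421", trials: int = 2) -> Dict[int, Dict[str, float]]:
    """Score recall per context size and position.

    Recall is the fraction of trials in which the model's answer contains the secret.
    `ask_model` is any function that takes a prompt and returns the model's reply.
    """
    needle = f"The vault code is {secret}."
    question = "\n\nWhat is the vault code? Answer with the number only."
    results: Dict[int, Dict[str, float]] = {}
    for size in context_sizes:
        results[size] = {}
        for name, fraction in POSITIONS.items():
            hits = 0
            for trial in range(trials):
                prompt = build_haystack(size, needle, fraction, seed=trial) + question
                if secret in ask_model(prompt):
                    hits += 1
            results[size][name] = hits / trials
    return results
```

To run it against a real model, `ask_model` would wrap whatever client you use to send a prompt and read back the reply. A real harness should also measure context length with the model's own tokenizer rather than word counts.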

We tested a 4-bit quantized 4B local model via Ollama, at context lengths of 32K, 48K, 64K, 96K, and 128K tokens.

The Results

Context Length   Start Recall   Middle Recall   End Recall   Verdict
32K              1.00           1.00            1.00         Perfect
48K              0.00           0.00            1.00         Cliff
64K+             0.00           0.00            0.00         Total failure

At 32K, the model recalls every needle perfectly regardless of where it appears. At 48K, recall at the start and middle positions collapses to zero, while a needle at the end is still retrieved. By 64K, even end-of-context recall fails. This is not degradation; it is a hard boundary.

The cliff is not somewhere beyond the advertised limit, with a comfortable margin before it. The test model advertises a 32K context window, and the cliff appears right at that boundary. Once you exceed it, even by 50% (48K), the model stops retrieving anything from the early or middle portions of the context.

What This Means for Real Sessions

A typical coding session with an AI assistant accumulates context quickly. Each turn adds: the user message, the model's response, any code snippets, and tool outputs. By turn 20–30 in an active session, you are likely past 32K tokens.

At that point, the model has not just "forgotten" some context. It has lost access to everything from the early turns — the initial problem statement, the constraints you established, the architecture decisions you made together. The model continues responding, but it is working from an increasingly incomplete picture.

The failure is invisible. The model does not say "I cannot recall that." It generates a plausible response based on what it can see. That plausibility is the problem.
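To get a feel for how quickly a session reaches the boundary, here is a back-of-envelope sketch. The per-turn token count is an assumption, not a measured value, and 32K is treated as 32,000 tokens; real sessions vary widely with code snippets and tool output.

```python
def turns_until_limit(tokens_per_turn: int, limit_tokens: int = 32_000) -> int:
    """Return the first turn at which cumulative context reaches the limit.

    Assumes every turn adds the same number of tokens (user message, response,
    snippets, and tool output combined), which is a simplification.
    """
    if tokens_per_turn <= 0:
        raise ValueError("tokens_per_turn must be positive")
    # Ceiling division: the turn on which the running total first meets the limit.
    return -(-limit_tokens // tokens_per_turn)


# Hypothetical session averaging 1,500 tokens per turn.
print(turns_until_limit(1_500))
```

At an assumed 1,500 tokens per turn, the limit is reached on turn 22, which falls inside the 20–30 turn range described above.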

Why Not Just Use a Larger Context Window?

The cliff is not a function of window size alone — it is a function of the attention mechanism's effective range at a given quantization level. A 9B full-attention model shows the same cliff at the same location: 32K. The family architecture, not the parameter count, determines the boundary.

More importantly: filling a 128K context window does not help if attention is concentrated in the last 32K tokens anyway. You are paying the prefill cost for context the model cannot effectively attend to.

The Goodput Problem

We measured the performance impact of long-context inference in concrete terms. At 32K context, the model achieves:

  • 122.6 tok/s decode: the raw speed at which it generates tokens
  • 2,136 tok/s prefill: the speed at which it processes input context
  • 6.3 tok/s goodput: the effective output rate once you account for the time spent prefilling all that context

At 32K context, 95% of wall-clock time is spent on prefill, not generation. You are paying a 17x tax on every query just to process context the model may not reliably use.
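The relationship between these figures can be sketched as simple arithmetic. The decode and prefill rates below come from the measurements above; the 100-token response length and the reading of 32K as 32,000 tokens are assumptions chosen for illustration, since the article does not state them.

```python
from typing import Tuple


def goodput(context_tokens: int, output_tokens: int,
            prefill_tps: float, decode_tps: float) -> Tuple[float, float]:
    """Return (goodput in tok/s, fraction of wall-clock time spent on prefill).

    Goodput here means output tokens divided by total time, where total time
    is prefill time plus decode time.
    """
    prefill_s = context_tokens / prefill_tps
    decode_s = output_tokens / decode_tps
    total_s = prefill_s + decode_s
    return output_tokens / total_s, prefill_s / total_s


# Measured rates from the article; output length of 100 tokens is assumed.
rate, prefill_share = goodput(32_000, 100, prefill_tps=2_136, decode_tps=122.6)
print(f"goodput: {rate:.1f} tok/s, prefill share: {prefill_share:.0%}")
```

Under those assumptions the result lands close to the reported 6.3 tok/s with roughly 95% of the time in prefill. Longer responses raise goodput, because the fixed prefill cost is spread over more output tokens.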

What Pyckle Does Instead

Rather than feeding the full conversation history on every query, Pyckle maintains a semantic index of your session. When you send a query, it retrieves the 3–5 most relevant turns and injects only those — typically 169 tokens total — regardless of how long your session has been running.

The result: 99.5% token reduction vs. full-context at 32K, 18x goodput improvement, and — critically — recall that never hits the cliff because the injected context never exceeds the model's effective attention range.
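The general idea of retrieving only the most relevant turns can be illustrated with a toy example. This is not Pyckle's implementation: it uses plain word-count cosine similarity as a crude stand-in for a semantic index, and the function names are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def vectorize(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def select_turns(history: List[str], query: str, k: int = 3) -> List[str]:
    """Return the k turns most similar to the query, in original conversation order."""
    q = vectorize(query)
    ranked = sorted(range(len(history)),
                    key=lambda i: cosine(vectorize(history[i]), q),
                    reverse=True)
    return [history[i] for i in sorted(ranked[:k])]


history = [
    "We decided to use PostgreSQL for the orders database.",
    "The weather was nice today.",
    "Remember the API must stay backward compatible.",
]
print(select_turns(history, "Which database did we pick for orders?", k=1))
```

The prompt sent to the model would then contain only the selected turns plus the new query, so its size depends on `k` rather than on how long the session has run. A production system would replace `vectorize` with real embeddings, since word overlap misses paraphrases.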

The Broader Pattern

We subsequently tested 10 additional models across different architecture families. The finding holds, with one important nuance: the cliff is mechanism-specific, not universal.

Full-attention and sliding-window model families both show hard cliffs at their training distribution boundary. GQA models show graduated degradation with more headroom. MQA models with interleaved attention show no cliff at all through 128K.

The architecture determines the failure mode. We cover that in detail here →
