190x Fewer Tokens: The Math Behind Pyckle's Context Compression

At 32K context, Pyckle injects 169 tokens instead of 32,133. Here is what that ratio means for throughput, cost, and recall.


Token reduction numbers like "99.5%" are easy to cite and hard to believe without the underlying math. This post walks through the measured numbers and explains why the improvement is larger than it looks.

  • Token reduction: 99.5% (169 tokens vs 32,133)
  • Compression ratio: 190x (measured at 32K context)
  • Goodput improvement: 18x (113.6 vs 6.3 useful tok/s)
  • QPM improvement: 17x (68 QPM vs 4 QPM)

Why Prefill Is the Bottleneck

LLM inference has two distinct cost centers: prefill (processing the input tokens) and decode (generating the output tokens). For most developer queries, the output is short — a few sentences or a code snippet. The input is what grows unboundedly as the session continues.

On a 4-bit quantized 4B local model, we measured:

| Operation | Speed | Notes |
|---|---|---|
| Prefill | 2,136 tok/s | Processing input context |
| Decode | 122.6 tok/s | Generating output tokens |
| Prefill/Decode ratio | 17.4x | Prefill is 17x faster per token |

Prefill is fast per token, but at 32K tokens the total adds up: 32,000 / 2,136 ≈ 15 seconds to process the context before generation starts. Add ~3 seconds of generation for a typical response, and each query takes roughly 18 seconds of wall-clock time.

With a 169-token injection: 169 / 2,136 ≈ 0.08 seconds for prefill. The same query now takes under 1 second total.
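The prefill arithmetic above can be reproduced in a few lines. This is a minimal sketch that uses only the two measured speeds quoted in this post; the function and constant names are illustrative and not part of Pyckle.

```python
# Measured speeds quoted in this post (4-bit quantized 4B local model).
PREFILL_TOK_PER_S = 2136.0
DECODE_TOK_PER_S = 122.6


def prefill_seconds(input_tokens: int, prefill_speed: float = PREFILL_TOK_PER_S) -> float:
    """Seconds spent processing the input context before the first output token."""
    return input_tokens / prefill_speed


full_context = prefill_seconds(32_000)  # full 32K history
injected = prefill_seconds(169)         # 169-token injection

print(f"full context: {full_context:.2f} s")
print(f"injection:    {injected:.3f} s")
print(f"prefill/decode speed ratio: {PREFILL_TOK_PER_S / DECODE_TOK_PER_S:.1f}x")
```

Prefill time is linear in input length under this model, so any reduction in injected tokens translates directly into time saved before generation starts.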

Goodput: The Metric That Actually Matters

Decode speed (tok/s) is the number developers see in Ollama's output. But it only measures generation speed — it excludes the time spent waiting for prefill to complete. The metric that matters is goodput: useful output tokens produced per second of wall-clock time.

At 32K context with full history injection:

  • ~15s prefill + ~3s generation = 18s per query
  • ~113 useful output tokens / 18s = 6.3 useful tok/s
  • 4 QPM ceiling before the next query is ready

With Pyckle's 169-token injection:

  • ~0.08s prefill + ~0.83s generation = ~1s per query
  • ~113 useful output tokens / 1s = 113.6 useful tok/s
  • 68 QPM ceiling

The 18x goodput improvement comes almost entirely from reducing prefill time: by the figures above, roughly 15 of the 17 seconds saved per query are prefill. Decode speed does not change. The model is the same. The variable is how many tokens it needs to process before it can start generating.
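As a rough check, goodput and a queries-per-minute ceiling can be computed from the rounded per-query times above. This is a sketch under stated assumptions: the helper names are hypothetical, and the inputs are the rounded values from the bullets (about 113 output tokens, 18 s and 1 s per query). The results therefore land close to the reported figures without matching them exactly, since 60 / 18 is about 3.3 QPM rather than the reported 4.

```python
def goodput(output_tokens: float, wall_clock_s: float) -> float:
    """Useful output tokens per second of wall-clock time (prefill + generation)."""
    return output_tokens / wall_clock_s


def qpm_ceiling(wall_clock_s: float) -> float:
    """Upper bound on sequential queries per minute for one model instance."""
    return 60.0 / wall_clock_s


OUTPUT_TOKENS = 113  # approximate response length used in this post

baseline = goodput(OUTPUT_TOKENS, wall_clock_s=18.0)  # ~15 s prefill + ~3 s generation
compressed = goodput(OUTPUT_TOKENS, wall_clock_s=1.0)  # ~0.08 s prefill + ~0.83 s generation

print(f"baseline goodput:   {baseline:.1f} tok/s, {qpm_ceiling(18.0):.1f} QPM")
print(f"compressed goodput: {compressed:.1f} tok/s, {qpm_ceiling(1.0):.1f} QPM")
print(f"improvement: {compressed / baseline:.0f}x")
```

Because the output length is the same in both cases, the goodput ratio reduces to the ratio of wall-clock times per query.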

Why Token Reduction Compounds at Scale

The improvement scales with context length. At 32K, the injection is 169 tokens (a fixed cost, not proportional to session length). As sessions grow longer, the baseline cost grows — but Pyckle's cost does not.

At 64K context, a naive approach injects 64K tokens. Pyckle still injects ~169. The ratio becomes 379x. At 128K, it becomes 757x. The efficiency gap grows the longer your sessions run.
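The ratios above follow from dividing the context length by the fixed injection size. A small sketch, assuming "64K" and "128K" mean 64,000 and 128,000 tokens (that reading reproduces the 379x and 757x figures; 64 × 1,024 would give about 388x instead):

```python
INJECTED_TOKENS = 169  # fixed injection size reported at 32K


def compression_ratio(context_tokens: int, injected_tokens: int = INJECTED_TOKENS) -> float:
    """How many times fewer tokens are injected than a full-history approach."""
    return context_tokens / injected_tokens


for context_tokens in (32_133, 64_000, 128_000):
    ratio = compression_ratio(context_tokens)
    print(f"{context_tokens:>7} tokens -> {ratio:.0f}x")
```

The ratio grows linearly with context length only as long as the injection stays near 169 tokens, which is the assumption the paragraph above makes.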

For commercial API users, this maps directly to cost. At $15/M input tokens (Claude Opus tier), a 32K query costs $0.48. With Pyckle's injection, the same query costs $0.0025 — a 190x reduction in API spend per query, with recall held at 1.00.
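The per-query cost figures can be checked the same way. This sketch covers input tokens only, at the $15 per million rate quoted above; output-token charges are left out because they are the same in both cases. The function name is illustrative.

```python
def input_cost_usd(input_tokens: int, usd_per_million: float = 15.0) -> float:
    """Input-token cost of one query at a flat per-million-token price."""
    return input_tokens * usd_per_million / 1_000_000


full_cost = input_cost_usd(32_000)   # full 32K context
injected_cost = input_cost_usd(169)  # 169-token injection

print(f"full context: ${full_cost:.2f} per query")
print(f"injection:    ${injected_cost:.4f} per query")
print(f"per 1,000 queries: ${full_cost * 1000:.2f} vs ${injected_cost * 1000:.2f}")
```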

Does Recall Hold?

The concern with aggressive compression is information loss. If the model only sees 169 tokens of context instead of 32K, does it retrieve facts correctly?

Across all tested sessions:

  • Single-decision recall: 1.00 at all positions and lengths up to 32K (3 trials each)
  • Multi-decision recall: 1.00 (5 simultaneous decisions, 500 turns, tested in v2)
  • Post-distillation recall: 1.00 even after applying rule-based compression to verbose sessions (v4)

The model retrieves the right information because the semantic search selects the turns that actually contain that information — not a random 169-token slice, but the 3–5 most relevant turns from the entire session history.
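To make the selection step concrete, here is a toy sketch of top-k turn retrieval. It is not Pyckle's implementation: the post does not describe the embedding model or ranking method, so this example substitutes a bag-of-words vector and cosine similarity, and every name in it (`embed`, `select_turns`, the sample history) is hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a real system would use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_turns(history: list[str], query: str, k: int = 3) -> list[str]:
    """Return the k turns most similar to the query, in chronological order."""
    query_vec = embed(query)
    ranked = sorted(
        range(len(history)),
        key=lambda i: cosine(embed(history[i]), query_vec),
        reverse=True,
    )
    return [history[i] for i in sorted(ranked[:k])]


history = [
    "We decided to use PostgreSQL for the orders database.",
    "The CI pipeline failed because of a flaky test.",
    "Let's name the service billing-gateway.",
    "Lunch options near the office are limited.",
]
selected = select_turns(history, "Which database did we decide to use for orders?", k=1)
print(selected)
```

Only the selected turns are injected, so the injected size depends on k and on the length of those turns rather than on the length of the session.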

The Distillation Layer (v4)

v4 of the whitepaper adds a rule-based distillation layer on top of the injection pipeline. Verbose reasoning turns (debugging narratives, analysis chains, 80–300 tokens) are compressed to their essential content using a no-LLM regex pipeline.

Results on 5 verbose reasoning turns: 468 tokens → 193 tokens — a 59% additional reduction. The 169-token warm injection baseline grows to ~225 tokens in verbose sessions without distillation; with distillation, that ceiling holds.
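A no-LLM regex pass of this kind might look like the sketch below. The filler patterns are invented for illustration, since the whitepaper's actual rules are not given in this post; only the 468 → 193 token figures come from the text above.

```python
import re

# Hypothetical filler patterns; the real rule set is not described in this post.
FILLER_PATTERNS = [
    r"\b(?:okay|ok|so|well|hmm)\b[,.]?\s*",
    r"\b(?:let me think|i think|it seems like|it looks like|basically|actually)\b[,.]?\s*",
]


def distill(turn: str) -> str:
    """Strip filler phrases with regexes, then collapse leftover whitespace."""
    out = turn
    for pattern in FILLER_PATTERNS:
        out = re.sub(pattern, "", out, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", out).strip()


def reduction_pct(before: int, after: int) -> float:
    """Percentage of tokens removed."""
    return 100.0 * (before - after) / before


verbose = "Okay, so let me think. I think the bug is basically in the cache layer."
print(distill(verbose))
print(f"{reduction_pct(468, 193):.0f}% reduction")  # figures reported above
```

Rules like these are deterministic and cheap to run, which fits a pipeline that avoids a second model call, though a production rule set would need to be far more careful about what it removes.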
