```yaml
---
title: "Prompt Compression in Production"
subtitle: "Reducing Token Count Without Losing What the Model Needs"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Engineers running LLM systems at scale where context size directly impacts latency and cost"
estimated_pages: 65
chapters:
  - "1. Why Compression Is Necessary"
  - "2. What Can Be Compressed and What Cannot"
  - "3. Extractive Compression: Summary and Selection"
  - "4. Abstractive Compression: Rewriting for Density"
  - "5. Learned Compression: Soft Tokens and Embeddings"
  - "6. Compression Evaluation: Faithfulness and Task Performance"
  - "7. Integration Patterns: Where Compression Lives in the Pipeline"
  - "8. Production Considerations"
tags:
  - pyckle
  - ebook
  - prompt-compression
  - token-optimization
  - llm
  - inference
  - production
---
```

<!--
  DESIGN / LAYOUT NOTES
  =====================
  Font: Body — Charter or Georgia 11pt. Code — JetBrains Mono 9.5pt.
  Margins: 1.25in all sides. Line height: 1.6.
  Chapter headings: H1, bold, 22pt, top padding 2rem.
  Section headings: H2, semibold, 16pt.
  Sub-section: H3, semibold italic, 13pt.
  Code blocks: light gray background (#F4F4F4), 1px border, border-radius 4px.
  Callout boxes (NOTE / WARNING): left border 3px solid #2A6EBB, background #EEF4FB.
  Page numbers: footer center.
  Chapter breaks: new page.
  Table of contents: dotted leader tabs to page numbers.
  Cover: dark background (#0D1117), white title text, accent line in #2A6EBB.
-->

# Prompt Compression in Production
## Reducing Token Count Without Losing What the Model Needs

**By David Kelly Price**
Version 1.0 — April 2026

---

## Table of Contents

1. Why Compression Is Necessary
2. What Can Be Compressed and What Cannot
3. Extractive Compression: Summary and Selection
4. Abstractive Compression: Rewriting for Density
5. Learned Compression: Soft Tokens and Embeddings
6. Compression Evaluation: Faithfulness and Task Performance
7. Integration Patterns: Where Compression Lives in the Pipeline
8. Production Considerations

---

## About This Guide

Every token a model receives costs something — time, money, or both. At small scale, that cost is invisible. At scale, it defines whether a product is viable. This guide is about the engineering discipline of prompt compression: the set of techniques that reduce what you send to a model without degrading what comes back. It covers the theory necessary to understand why compression works, the practical methods in active use today, and the tradeoffs that separate approaches that hold up in production from those that only work in demos.

The intended reader is already building with LLMs. They understand context windows, they've thought about retrieval, and they've probably hit a wall — latency, cost, or context overflow — that made them search for something beyond "just use a bigger model." This guide does not spend time on prompt engineering basics or model selection. It assumes you're past that and looking for what comes next. The methods described range from simple heuristics you can ship today to learned compression techniques that require training infrastructure, and the guide is honest about the cost-benefit profile of each.

After reading, you'll be able to audit a production pipeline for where tokens are being wasted, select the right compression approach for a given constraint, implement the core techniques without reaching for a library you don't understand, and measure whether compression is actually working — not just on token count, but on task performance. That last part matters more than most treatments of this subject acknowledge. Compression that saves tokens while degrading outputs isn't optimization. This guide treats faithfulness as a first-class concern throughout.

---



# Chapter 1: Why Compression Is Necessary

## Chapter Overview

Context windows have grown from 4K tokens to 200K+ in under two years. That sounds like a solved problem until you run a production system where every request carries 50K tokens of documentation, conversation history, and retrieved context. The bills mount. Response times drift from 2 seconds to 8. Users complain about lag. You're paying for tokens that contribute nothing to the output. Compression isn't an optimization exercise. It's the difference between a system that scales and one that bleeds money while frustrating users.

### The Cost Curve Nobody Warns You About

Start with the numbers. A single GPT-4 API call with 50,000 input tokens costs roughly $1.50 (at $0.03 per 1,000 input tokens). If your application serves 10,000 users making 5 requests per day, that's 50,000 requests and $75,000 per day, or about $2.25M over a 30-day month, just on input tokens. Double the context size because you're including full documentation dumps, and you've crossed $4.5M before accounting for output tokens, infrastructure, or anything else.
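
The arithmetic is worth scripting so you can plug in your own traffic. The sketch below is a minimal calculator, not a billing tool: the function name and parameters are illustrative, and the $0.03 per 1,000 tokens rate is simply the rate implied by a $1.50 call at 50,000 tokens.

```python
def monthly_input_cost(users: int, requests_per_user_per_day: int,
                       tokens_per_request: int, usd_per_1k_tokens: float,
                       days: int = 30) -> float:
    """Estimate monthly spend on input tokens only (output tokens excluded)."""
    requests_per_day = users * requests_per_user_per_day
    cost_per_request = tokens_per_request / 1000 * usd_per_1k_tokens
    return requests_per_day * cost_per_request * days


# Inputs from the scenario above; swap in your own traffic and pricing.
baseline = monthly_input_cost(10_000, 5, 50_000, 0.03)
doubled = monthly_input_cost(10_000, 5, 100_000, 0.03)
print(f"50K-token prompts:  ${baseline:,.0f} per month")
print(f"100K-token prompts: ${doubled:,.0f} per month")
```

Because cost is linear in tokens, every percentage point of compression on the input side comes straight off this number.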

The cost function is linear in token count, but context size compounds as features accumulate. You add a help desk integration, and now every prompt carries ticket history. You enable multi-turn conversations, and history accumulates. You implement RAG, and retrieved chunks stack up. Each feature adds context that nobody estimated during planning.

Most teams discover this curve after launch. The MVP worked fine with 5K token prompts. Production traffic revealed that real-world context averages 45K tokens because users don't behave like test cases. Cutting features isn't an option. Raising prices kills growth. The only lever left is compression.

**Warning**: Provisioning budgets based on average token counts will burn you. The 95th percentile matters more than the mean. A single request carrying 200K tokens of conversation history costs as much as 40 requests at the 5K-token size the MVP was tested with.
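
To see how far the mean can sit from the tail, compare the two on logged token counts. This is a rough sketch using only the standard library; the sample data is invented for illustration, and in practice `token_counts` would come from your request logs.

```python
import statistics


def token_budget_stats(token_counts: list[int]) -> dict[str, float]:
    """Summarize per-request input token counts for budgeting."""
    mean = statistics.fmean(token_counts)
    # quantiles(n=20) returns 19 cut points; the last one is the 95th percentile.
    p95 = statistics.quantiles(token_counts, n=20, method="inclusive")[-1]
    return {"mean": mean, "p95": p95, "max": max(token_counts)}


# Hypothetical log: 90 typical requests and 10 with very long histories.
sample = [5_000] * 90 + [200_000] * 10
stats = token_budget_stats(sample)
print(stats)
```

A budget sized to the mean of this sample would be roughly an eighth of what the heaviest requests actually need.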

### Latency Isn't Just About Model Speed

Response time decomposition tells the real story. Model inference might take 800ms, but token processing overhead adds another 1,200ms when you're feeding 60K tokens into the context window. Users perceive the total: 2 seconds from submit to first token.

Time-to-first-token (TTFT) scales roughly linearly with input token count for most providers. Anthropic's benchmarks show that going from 10K to 100K input tokens can add 3-5 seconds to TTFT, depending on model and load. That delay happens before the model generates a single useful character.

This isn't theoretical. Streaming responses mitigate perceived latency only after the first token arrives. If users wait 6 seconds watching a spinner before the stream starts, streaming bought you nothing. The interaction feels broken. Compression that cuts 60K tokens to 15K tokens can recover 3+ seconds of wait time—the difference between responsive and sluggish.

```python
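# Measuring the impact of context size on latency
import time
from typing import Optional

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def measure_ttft(prompt: str, context_size: int) -> Optional[float]:
    """Return seconds until the first streamed text chunk, or None if nothing streams."""
    # Pad the prompt to roughly the target size. The filler phrase is only a few
    # tokens long, so this approximates the token count rather than hitting it exactly.
    padding = "Background information. " * (context_size // 3)
    full_prompt = padding + prompt

    start = time.perf_counter()  # monotonic clock, better suited to timing than time.time()
    with client.messages.stream(
        model="claude-3-5-sonnet-20241022",
        max_tokens=100,
        messages=[{"role": "user", "content": full_prompt}],
    ) as stream:
        for _ in stream.text_stream:
            # Stop at the first text chunk: that is the time-to-first-token.
            return time.perf_counter() - start
    return None

# Results from real measurement:
# 5K tokens: 0.8s TTFT
# 25K tokens: 2.1s TTFT
# 50K tokens: 4.3s TTFT
# 100K tokens: 7.8s TTFT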
```

### Context Dilution Degrades Output Quality

More context doesn't always help. It often hurts. The "lost in the middle" problem is well-documented: models pay less attention to information buried in the middle of long contexts. If the critical instruction sits at token 35,000 in a 70,000-token prompt, the model's attention to it degrades compared to the same instruction at token 1,000 in a 5,000-token prompt.

This isn't a theoretical concern from academic papers. Production logs reveal it constantly. A support bot that pulls 50 KB of documentation performs worse than one that pulls 5 KB of the right documentation. The model gets distracted by irrelevant details, hedges its answers, or synthesizes information from the wrong section because everything looks equally important.

Compression forces prioritization. You can't include everything, so you include what matters. The act of deciding what to keep and what to discard improves output quality as a side effect. A 10K-token compressed prompt that contains only relevant context outperforms a 60K-token dump that includes everything tangentially related.

**Key Insight**: The optimal context size for quality is often far smaller than the maximum window size. Compression isn't just about cost—it's about signal-to-noise ratio.

### Provider Limits Still Bite

Yes, context windows hit 200K tokens. That's the maximum, not the practical limit. Actual constraints are tighter:

- **Rate limits**: Many providers cap input tokens per minute, not just requests per minute. A few 100K-token requests exhaust your quota as fast as hundreds of small requests.
- **Timeouts**: Large context requests are more likely to hit timeout limits, especially under load.
- **Cost tiers**: Some pricing structures penalize extremely large contexts with multiplicative factors beyond linear scaling.
- **Regional availability**: Extended context windows often aren't available in all regions or deployment types.

Even if you're willing to pay, you can't always use the full window. And when you can, the cost and latency penalties often make it a bad trade. Compression keeps you under practical thresholds where the system behaves predictably.
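
The rate-limit point is easy to quantify. The sketch below assumes a provider that caps input tokens per minute; the 400,000 figure is a made-up example, not any provider's actual limit.

```python
def max_requests_per_minute(input_token_cap: int, tokens_per_request: int) -> int:
    """How many requests of a given size fit under a tokens-per-minute cap."""
    return input_token_cap // tokens_per_request


CAP = 400_000  # hypothetical input-tokens-per-minute limit
throughput = {size: max_requests_per_minute(CAP, size)
              for size in (5_000, 15_000, 50_000, 100_000)}
for size, rpm in throughput.items():
    print(f"{size:>7} tokens per request -> {rpm} requests per minute")
```

Under this assumed cap, compressing 50K-token prompts to 15K roughly triples the traffic you can serve before throttling.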

### Efficiency as a Competitive Moat

The team that solves compression first ships faster features. When every new capability doesn't balloon context size, you're not constantly firefighting cost overruns or latency regressions. You can add conversation memory, file upload processing, or multi-agent coordination without the system collapsing under its own weight.

Competitors who ignore compression hit a scaling wall. They optimize prompts, switch providers, cache aggressively—all important, but none of it addresses the root issue. If your baseline context is 50% smaller than theirs, you're operating with better unit economics and faster response times before any other optimization. That advantage compounds.

Compression isn't a one-time fix. It's a discipline. Teams that build compression into their workflow from day one maintain lean, fast systems as they scale. Teams that treat it as an afterthought spend months in remediation mode, rewriting prompts and rearchitecting retrieval pipelines under pressure.

## Key Takeaways

- Cost scales linearly with tokens, but real-world context compounds as features accumulate; compression is the only sustainable answer.
- Time-to-first-token increases with input size; cutting context from 60K to 15K can recover multiple seconds of perceived latency.
- Smaller, focused context often produces better outputs than exhaustive context dumps due to reduced noise and attention dilution.
- Provider limits on rate, timeout, and regional availability make maximum context windows less accessible than marketing materials suggest.
- Building compression practices early creates compounding advantages in both cost and iteration speed.

## Try This

Measure your current context usage. Pick three representative requests from your production logs (or realistic examples if you're pre-launch). Count the input tokens. Break them down by category: instructions, examples, retrieved documents, conversation history, other. Calculate what each category costs per 1,000 requests at your provider's pricing. Identify which category contributes the most tokens. That's your compression target. Don't optimize yet—just measure and understand where the bloat lives.


---

# Chapter 2: What Can Be Compressed and What Cannot

### Chapter Overview

Not all tokens are equal. Some carry load-bearing information — remove them and the model breaks. Others are scaffolding: useful at authoring time, but dead weight at inference time. The core skill of prompt compression is learning to tell the difference before you ship to production. This chapter builds that intuition by working through the anatomy of a typical LLM input, identifying which components compress well, which compress poorly, and which should never be touched.

---

### The Anatomy of a Production Prompt

A production prompt is rarely just a user question. By the time a request hits the model, it's usually an assembled stack: a system prompt, retrieved context, conversation history, tool schemas, formatting instructions, and finally the user's actual input. Each layer has different characteristics and different compression tolerances.

Start by pulling apart a real prompt. Here's a simplified but representative example from a customer support system:

```
[SYSTEM]
You are a helpful support assistant for Acme Corp. You should be polite, concise,
and accurate. If you don't know the answer, say so. Always follow our tone guidelines:
professional but approachable. Do not speculate about pricing. Refer billing questions
to the billing team. Never make promises about delivery timelines. ...

[CONTEXT - retrieved from knowledge base]
Article 47: Return Policy
Acme Corp accepts returns within 30 days of purchase. Items must be in original
packaging. Electronics are non-refundable after 14 days. To initiate a return,
customers should visit acme.com/returns and follow the prompts. Returns initiated
after 30 days require manager approval. Approval typically takes 2-3 business days...

[HISTORY]
User: Hi, I bought a laptop last week and it's not working.
Assistant: I'm sorry to hear that. Can you tell me more about the issue?
User: It won't turn on at all.
Assistant: That sounds like it could be a hardware issue. Have you tried charging it for at least an hour?
User: Yes, still nothing.

[USER]
Can I return it?
```

This is maybe 300 tokens. Real systems often run 8,000–32,000. The question is: which of these four layers has room to shrink?

---

### What Compresses Well: Retrieved Context

Retrieved context — chunks pulled from a vector store, database, or document corpus — is usually the biggest opportunity and the most abused one. A naive RAG system retrieves top-K chunks and concatenates them wholesale. The problem is that chunk boundaries rarely align with information relevance. You often get 400-token chunks where 80 tokens contain the answer and 320 tokens contain related-but-irrelevant prose.

Extractive compression works well here: identify the sentences within each chunk that are semantically close to the query and discard the rest. This is measurable. If the query is "return policy for electronics," the sentence "Electronics are non-refundable after 14 days" scores high. The sentences about original packaging and manager approval score low.

The same logic applies to long documents passed as context. A 10-page product specification doesn't need to arrive in full when the question is about one feature. Compress at retrieval time, not at generation time — you want a smaller input going into the model, not a smaller output coming out.

> **Key Insight**
>
> Retrieved context compresses best because it has a clear relevance target: the user's query. Every sentence in every retrieved chunk can be scored against that query. Sentences that score below a threshold are candidates for removal. This is tractable, automatable, and cheap relative to the cost of sending extra tokens to the model.
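
A minimal sketch of that scoring loop follows. To stay dependency-free it uses bag-of-words cosine similarity as a stand-in for an embedding model, which is an assumption made for illustration: a production system would score with real embeddings, and the 0.15 threshold here is arbitrary.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Lowercased word counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def filter_sentences(chunk: str, query: str, threshold: float = 0.15) -> list[str]:
    """Keep only the sentences whose similarity to the query clears the threshold."""
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    query_vec = bag_of_words(query)
    return [s for s in sentences if cosine(bag_of_words(s), query_vec) >= threshold]


chunk = (
    "Acme Corp accepts returns within 30 days of purchase. "
    "Items must be in original packaging. "
    "Electronics are non-refundable after 14 days. "
    "To initiate a return, customers should visit acme.com/returns and follow the prompts. "
    "Returns initiated after 30 days require manager approval."
)
kept = filter_sentences(chunk, "return policy for electronics")
print(kept)
```

The lexical stand-in treats "return" and "returns" as unrelated words, which is exactly the kind of miss an embedding model avoids. Tune the threshold against your evals rather than by eye.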

---

### What Compresses Poorly: System Prompts

System prompts are where engineers tend to overreach. The instinct makes sense — system prompts can balloon to thousands of tokens as teams layer on edge case handling, persona instructions, and policy guardrails. Compressing them feels like free savings.

The reality is messier. System prompts are behavioral specifications. They're often brittle: remove the wrong clause and you lose a behavior you needed, usually in a case you won't see in testing. A line that reads like redundant padding ("always be professional") might be load-bearing because without it, the model defaults to a tone that doesn't match your product.

That doesn't mean system prompts can't be tightened. Verbose explanations of *why* a rule exists usually compress safely. The rule itself often cannot. "Do not speculate about pricing. Refer billing questions to the billing team." is already dense — there's nothing to remove. The paragraph explaining the legal reasons for that policy can go.

The test: if you remove a clause and run evals, does task performance change? For system prompts, run this rigorously. Don't eyeball it.

> **Warning**
>
> System prompt compression without eval coverage is a liability. The model's behavior is a function of every token in that prompt. Removing tokens you assume are redundant is a hypothesis, not a fact. Test it, measure it, and keep the diff reviewable. A 200-token saving that breaks a compliance behavior is not a win.
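
One way to make that test routine is a clause-by-clause ablation: remove one clause at a time, rerun the evals, and record the score change. The harness below is a sketch under assumptions. `run_evals` is a hypothetical callable standing in for your eval suite (prompt in, score out), and the toy scorer exists only to make the example runnable.

```python
from typing import Callable


def ablate_clauses(clauses: list[str],
                   run_evals: Callable[[str], float],
                   tolerance: float = 0.01) -> tuple[float, list[dict]]:
    """Score the full prompt, then re-score it with each clause removed in turn."""
    baseline = run_evals(" ".join(clauses))
    report = []
    for i, clause in enumerate(clauses):
        candidate = " ".join(c for j, c in enumerate(clauses) if j != i)
        drop = baseline - run_evals(candidate)
        report.append({
            "clause": clause,
            "score_drop": drop,
            "safe_to_remove": drop <= tolerance,
        })
    return baseline, report


# Toy eval for illustration only: pretend the pricing rule is load-bearing.
def toy_evals(prompt: str) -> float:
    return 1.0 if "Do not speculate about pricing." in prompt else 0.6


clauses = [
    "Do not speculate about pricing.",
    "We adopted this rule after a legal review.",
]
baseline, report = ablate_clauses(clauses, toy_evals)
for row in report:
    print(row)
```

Single-clause ablation misses interactions between clauses, so treat a "safe" result as a candidate for removal and confirm it against the combined diff.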

---

### What Compresses Well: Conversation History

Conversation history is often the most compressible component in a multi-turn system, and it's chronologically structured — which makes compression strategies straightforward.

The standard approach: summarize older turns, keep recent turns verbatim. The intuition is sound. The model needs precise recall of the last few exchanges to maintain coherence. It needs the gist of earlier exchanges to maintain context. You rarely need verbatim fidelity to a conversation that happened 20 turns ago.

```python
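from typing import Callable

def format_turns(turns: list[dict]) -> str:
    """Render turns as 'Role: content' lines for the summarizer."""
    return "\n".join(f"{t['role'].capitalize()}: {t['content']}" for t in turns)

def compress_history(
    turns: list[dict],
    call_model: Callable[[str], str],  # your LLM wrapper: prompt in, text out
    keep_recent: int = 4,
) -> list[dict]:
    """Summarize older turns; keep the most recent `keep_recent` turns verbatim."""
    if len(turns) <= keep_recent:
        return turns

    # Split by index so keep_recent=0 works (turns[-0:] would return everything).
    split = len(turns) - keep_recent
    older, recent = turns[:split], turns[split:]

    summary_prompt = f"""Summarize the following conversation history in 2-3 sentences.
Preserve key facts, decisions, and unresolved issues. Be specific, not general.

History:
{format_turns(older)}"""

    summary = call_model(summary_prompt)

    # One summary message replaces the older turns. Some chat APIs take system text
    # as a separate parameter rather than a message role; adapt as needed.
    return [{"role": "system", "content": f"Earlier conversation summary: {summary}"}] + recent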
```

The failure mode here is over-summarization. A summary that says "the user had a technical issue with their device" loses the specific detail that the device won't power on, which is exactly what the support agent needs. Compression at the history layer requires preserving entities, facts, and states — not just topics.
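
A cheap guard against over-summarization is to check that the summary still mentions the facts you cannot afford to lose, and fall back to the verbatim turns when it doesn't. The sketch below uses plain substring matching, which is a deliberately crude assumption. The `required_facts` list is hypothetical and would come from entity extraction or a hand-maintained list in a real system.

```python
def missing_facts(summary: str, required_facts: list[str]) -> list[str]:
    """Return the required facts that do not appear in the summary (case-insensitive)."""
    lowered = summary.lower()
    return [fact for fact in required_facts if fact.lower() not in lowered]


required = ["laptop", "won't turn on"]

vague = "The user had a technical issue with their device."
specific = "User's laptop, bought last week, won't turn on even after an hour of charging."

print(missing_facts(vague, required))
print(missing_facts(specific, required))
```

If the returned list is non-empty, either regenerate the summary with the missing facts named explicitly in the prompt or keep the original turns.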

---

### What Cannot Be Compressed: Structured Inputs

Some inputs are already at minimal representation. Structured data — JSON payloads, SQL schemas, function signatures, tool definitions — tends to resist compression because every token is semantic. Remove a field name and the model may misread the schema. Truncate a tool definition and you'll get malformed calls.

```json
{
  "tool": "create_ticket",
  "parameters": {
    "title": "string",
    "priority": "low | medium | high | critical",
    "assigned_to": "user_id",
    "due_date": "ISO 8601"
  }
}
```

There's nothing to compress here. The information density is already high. The right optimization for tool schemas isn't compression — it's selection. Don't send all 30 tools when the current task only plausibly requires 3. Dynamic tool selection at routing time reduces token count without touching the schema itself.
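
Selection can start very simply. The following sketch routes by keyword overlap between the user message and a per-tool keyword set. The registry, tool names other than `create_ticket`, and keywords are all invented for illustration, and a real router might use embeddings or a classifier instead.

```python
import re

# Hypothetical registry: each tool carries routing keywords and its full schema.
REGISTRY = {
    "create_ticket": {
        "keywords": {"ticket", "bug", "issue"},
        "schema": {"tool": "create_ticket"},
    },
    "refund_order": {
        "keywords": {"refund", "return", "order"},
        "schema": {"tool": "refund_order"},
    },
    "lookup_invoice": {
        "keywords": {"invoice", "billing", "charge"},
        "schema": {"tool": "lookup_invoice"},
    },
}


def select_tools(user_message: str, registry: dict, max_tools: int = 3) -> list[dict]:
    """Return schemas only for tools whose keywords overlap the message."""
    words = set(re.findall(r"[a-z0-9_]+", user_message.lower()))
    scored = []
    for name, spec in registry.items():
        overlap = len(words & spec["keywords"])
        if overlap:
            scored.append((overlap, name))
    # Highest overlap first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [registry[name]["schema"] for _, name in scored[:max_tools]]


selected = select_tools("Please open a ticket for this bug", REGISTRY)
print(selected)
```

The schemas that do get sent are passed through untouched, which keeps the token savings without risking malformed calls.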

The same applies to code. If you're passing source code as context, resist the urge to strip whitespace or remove docstrings heuristically. Whitespace is syntactically meaningful in Python. Docstrings often carry type information the model uses for reasoning. Compression strategies built for prose don't transfer cleanly to code.

---

### The Compression Budget Framework

A useful mental model: treat each component in your prompt as having a compression budget — a percentage by which it can safely shrink without measurable degradation. Rough starting estimates:

| Component | Compression Budget |
|---|---|
| Retrieved context (prose) | 40–70% |
| Conversation history (older turns) | 50–80% |
| System prompt (explanatory prose) | 10–25% |
| Tool schemas / structured data | 0–5% |
| Recent conversation turns | 0% |
| User query | 0% |

These are starting points, not guarantees. The actual numbers depend on your task, your model, and your eval suite. But the ordering is consistent: older conversation history and retrieved context compress most, structured inputs compress least, and the user's query should never be touched.

Budget compression works because it forces you to make explicit decisions about where to spend your optimization effort. A system that's at 20,000 tokens shouldn't try to compress everything equally — it should find the highest-budget components and go there first.
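
A small worked example of that prioritization, assuming a 20,000-token prompt with an invented component breakdown and the conservative (low) end of each budget range from the table:

```python
# Hypothetical token counts for a 20,000-token prompt.
component_tokens = {
    "retrieved_context": 12_000,
    "older_history": 4_000,
    "system_prompt": 3_000,
    "tool_schemas": 1_000,
}

# Low end of each range in the table above, as a conservative starting point.
budget = {
    "retrieved_context": 0.40,
    "older_history": 0.50,
    "system_prompt": 0.10,
    "tool_schemas": 0.00,
}


def rank_by_expected_savings(tokens: dict[str, int],
                             budgets: dict[str, float]) -> list[tuple[str, int]]:
    """Order components by how many tokens their budget could plausibly save."""
    savings = {name: round(count * budgets[name]) for name, count in tokens.items()}
    return sorted(savings.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_expected_savings(component_tokens, budget)
for name, saved in ranking:
    print(f"{name}: up to {saved} tokens")
```

Note that older history has the higher percentage budget, but retrieved context wins on absolute savings because it is three times larger. Size and budget both matter.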

---

### Key Takeaways

- Retrieved context is the highest-leverage compression target because every sentence can be scored against a clear relevance signal — the user's query.
- System prompts compress poorly at the rule level and better at the explanation level. Always validate changes with evals, not intuition.
- Conversation history follows a recency curve: older turns summarize well, recent turns need to stay verbatim.
- Structured inputs — tool schemas, JSON, code — have near-zero compression budget. Optimize them through selection, not compression.
- Not all token reduction is equivalent. A 30% reduction in retrieved context is a different risk profile than a 30% reduction in the system prompt. Treat compression budgets as component-specific, not prompt-wide.

---

### Try This

Pull a prompt from a system you're currently running — or build a representative one. Break it into components: system prompt, retrieved context, history, user query. Estimate the token count of each component using a tokenizer (`tiktoken` works for OpenAI-compatible models; `transformers` for others). Calculate what percentage of total tokens each component represents.

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

components = {
    "system_prompt": "...",
    "retrieved_context": "...",
    "history": "...",
    "user_query": "..."
}

total = sum(len(enc.encode(v)) for v in components.values())
for name, text in components.items():
    tokens = len(enc.encode(text))
    print(f"{name}: {tokens} tokens ({tokens/total:.0%})")
```

Most engineers who run this for the first time are surprised by how much of the budget goes to retrieved context. That surprise is the starting point. Now look at that context and ask: if you had to cut it by half, which sentences would you keep?


---

# Chapter 3: Extractive Compression: Summary and Selection

### Chapter Overview

Extractive compression keeps the original text intact (no rewriting, no paraphrasing) and instead makes decisions about what to include and what to drop. It's the oldest form of compression and still the most predictable. When you need to cut a 10,000-token document to 2,000 tokens without introducing any new artifacts or hallucination risk, extraction is where you start. This chapter covers the two primary extraction strategies, selection by retrieval score and sentence-level extractive summarization, and the practical engineering decisions that determine whether they work in production or fall apart under pressure.

---

### What Extractive Actually Means

The word gets misused. Extractive compression means the output is a strict subset of the input. Every token in the compressed output existed in the original, in the original order. Nothing is rephrased. Nothing is inferred. The model either sees a chunk of text or it doesn't.

This constraint is both the strength and the limitation. The strength: no hallucination introduced by the compressor. If your source document says the API rate limit is 1,000 requests per minute, extractive compression cannot accidentally turn that into 100. What's kept is verbatim. The limitation: you're at the mercy of how information is distributed across the source. If the critical sentence is buried in a dense paragraph with surrounding noise, you either keep the whole paragraph or risk dropping the sentence.

Most production systems underestimate how often this distribution problem kills naive extraction. Relevance is rarely localized to a clean sentence boundary. Context leaks across paragraphs, and chunking decisions that looked reasonable during development create silent failures in production — the right information was technically present, but split across a chunk boundary that got discarded.

---

### Chunking Strategy Is Compression Strategy

Before you can select, you need to chunk. And chunking decisions are, functionally, compression decisions made upstream of the selection step.

Fixed-size chunking — split every N tokens with M tokens of overlap — is the default in most retrieval pipelines for good reason: it's simple, deterministic, and fast. But it optimizes for pipeline simplicity, not information density. A 512-token chunk that spans the end of one topic and the beginning of another is a worse unit of selection than a 300-token chunk that cleanly covers one complete idea.
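
For reference, fixed-size chunking with overlap fits in a few lines. This sketch operates on a pre-tokenized list; the function name and the character-level toy input are illustrative, and in practice the tokens would come from your tokenizer.

```python
def fixed_size_chunks(tokens: list[str], size: int, overlap: int) -> list[list[str]]:
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the input
    return chunks


chunks = fixed_size_chunks(list("abcdefghij"), size=4, overlap=1)
print(chunks)
```

Nothing in this loop looks at content, which is the point: boundaries fall wherever the arithmetic puts them, regardless of where a topic starts or ends.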

Semantic chunking — splitting at topic boundaries rather than token boundaries — produces better selection candidates. The implementation typically involves embedding sentences, computing cosine similarity between adjacent sentences, and cutting where similarity drops below a threshold:

```python
from sentence_transformers import SentenceTransformer
import numpy as np

model = SentenceTransformer("all-MiniLM-L6-v2")

def semantic_chunks(sentences: list[str], threshold: float = 0.75) -> list[list[str]]:
    if not sentences:
        return []  # nothing to chunk; avoids indexing into an empty list

    embeddings = model.encode(sentences)
    chunks, current = [], [sentences[0]]

    for i in range(1, len(sentences)):
        # Cosine similarity between this sentence and the one before it.
        sim = np.dot(embeddings[i-1], embeddings[i]) / (
            np.linalg.norm(embeddings[i-1]) * np.linalg.norm(embeddings[i])
        )
        if sim < threshold:
            # Similarity dropped: treat it as a topic boundary and start a new chunk.
            chunks.append(current)
            current = []
        current.append(sentences[i])

    chunks.append(current)
    return chunks
```

The threshold is tunable. Higher values produce smaller, tighter chunks. Lower values produce fewer, larger ones. What you're optimizing for matters: a customer support system retrieving specific policy details wants tight, precise chunks. A research assistant summarizing long documents can tolerate broader ones.

> **Key Insight:** The chunking step determines your ceiling. If the right information doesn't land cleanly in a retrievable unit, no selection algorithm fixes it downstream. Invest in chunking before tuning retrieval.

---

### Selection by Retrieval Score

Given a query and a corpus of chunks, the default extractive selection strategy is straightforward: embed the query, embed the chunks, rank by cosine similarity, take the top K. This is the backbone of most RAG pipelines in production today.

The engineering reality is that cosine similarity alone leaves significant performance on the table. Dense retrieval — embedding-based ranking — is excellent at semantic relevance but weak at exact match. Sparse retrieval — BM25, TF-IDF — handles exact terms and identifiers well but misses paraphrases. Hybrid search, combining both with a fusion step like Reciprocal Rank Fusion (RRF), consistently outperforms either alone.

```python
def reciprocal_rank_fusion(dense_ranks: list[str], sparse_ranks: list[str], k: int = 60) -> list[str]:
    scores = {}
    for rank, doc_id in enumerate(dense_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    for rank, doc_id in enumerate(sparse_ranks):
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    return sorted(scores, key=lambda x: scores[x], reverse=True)
```

After ranking, you still face the budget question: how many chunks fit in context, and which K maximizes answer quality per token? The naive answer is "take the top K until you hit the limit." The smarter answer accounts for redundancy. If chunks 1 and 3 are near-duplicates, including both wastes budget. Maximal Marginal Relevance (MMR) selection penalizes chunks that are too similar to already-selected chunks, trading some raw relevance for coverage diversity.
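
A compact sketch of greedy MMR selection follows. It assumes the similarities are already computed (query-to-chunk and chunk-to-chunk); the parameter name `lambda_` and the toy numbers are illustrative choices, not values from any particular library.

```python
def mmr_select(query_sims: list[float],
               pairwise_sims: list[list[float]],
               k: int,
               lambda_: float = 0.7) -> list[int]:
    """Greedily pick up to k chunk indices, trading relevance against redundancy."""
    selected: list[int] = []
    candidates = list(range(len(query_sims)))

    while candidates and len(selected) < k:
        def mmr_score(i: int) -> float:
            # Redundancy is the highest similarity to anything already selected.
            redundancy = max((pairwise_sims[i][j] for j in selected), default=0.0)
            return lambda_ * query_sims[i] - (1 - lambda_) * redundancy

        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)

    return selected


# Toy data: chunks 0 and 1 are near-duplicates; chunk 2 is less relevant but distinct.
query_sims = [0.90, 0.85, 0.60]
pairwise_sims = [
    [1.00, 0.95, 0.10],
    [0.95, 1.00, 0.10],
    [0.10, 0.10, 1.00],
]
picked = mmr_select(query_sims, pairwise_sims, k=2)
print(picked)
```

Plain top-2 ranking would have taken both near-duplicates. With `lambda_` at 1.0 the function degenerates to exactly that, so the parameter is the dial between relevance and coverage.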

> **Warning:** Re-ranking with a cross-encoder after initial retrieval improves precision but adds latency. Measure this cost explicitly. For latency-sensitive applications, a well-tuned bi-encoder with hybrid search often closes most of the gap without the round-trip.

---

### Sentence-Level Extraction

When the unit isn't a retrieved chunk but a long document you need to compress in-place — a transcript, a legal filing, a support thread — sentence-level extraction gives you finer control.

The classic approach is TextRank: treat sentences as nodes in a graph, draw edges weighted by similarity, and run a PageRank-style algorithm to surface the most "central" sentences. Central here means connected to many other sentences by similar language — a reasonable proxy for summary-worthy content in informational text.

```python
import networkx as nx
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def textrank_sentences(sentences: list[str], model, top_n: int = 5) -> list[str]:
    # `model` is any encoder with an .encode() method, e.g. a SentenceTransformer.
    embeddings = model.encode(sentences)
    sim_matrix = cosine_similarity(embeddings)
    np.fill_diagonal(sim_matrix, 0)  # no self-loops
    sim_matrix = np.clip(sim_matrix, 0, None)  # PageRank expects non-negative edge weights

    graph = nx.from_numpy_array(sim_matrix)
    scores = nx.pagerank(graph)
    ranked = sorted(scores, key=scores.get, reverse=True)

    # Return in original document order
    top_indices = sorted(ranked[:top_n])
    return [sentences[i] for i in top_indices]
```

Note the last step: return sentences in original document order, not ranked order. Presenting sentences in the order a PageRank score assigned them produces incoherent output. Preserving original sequence is a small change that dramatically affects usability.

TextRank works well for dense, informational documents. It fails on conversational text where central sentences are often pleasantries, and on procedural text where each step is intentionally distinct from others — exactly the structure that low cross-sentence similarity characterizes as unimportant.

---

### Handling Position Bias and Structural Signals

Not all positions in a document are equal. In news articles, the first paragraph typically contains the highest-density summary. In technical documentation, section headers and the first sentence of each section carry disproportionate signal. In support tickets, the last message often resolves what the first ten couldn't.

Production extractive systems that ignore position bias leave quality on the table. The simplest correction is positional boosting: apply a weight to retrieval or extraction scores based on position in the document. Heavier weight to the first and last N sentences, lighter to the middle.
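
Positional boosting can be as small as a multiplier applied to edge sentences. The sketch below is one possible form; the function name, the two-sentence edge, and the weight are all assumptions to tune for your documents.

```python
def positional_boost(scores: list[float],
                     edge_n: int = 2,
                     edge_weight: float = 1.5) -> list[float]:
    """Multiply the scores of the first and last `edge_n` sentences by `edge_weight`."""
    n = len(scores)
    boosted = []
    for i, score in enumerate(scores):
        is_edge = i < edge_n or i >= n - edge_n
        boosted.append(score * edge_weight if is_edge else score)
    return boosted


# Six sentences with identical base scores, so only position changes the result.
boosted = positional_boost([1.0] * 6, edge_n=2, edge_weight=2.0)
print(boosted)
```

Apply the boost after similarity scoring and before ranking, so it nudges the ordering without overriding a strongly relevant middle sentence.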

More structured documents warrant structural extraction instead: parse headers, pull the first sentence of each section, then fill remaining budget with the highest-ranked remaining content. This gives you a skeleton of the document's own organization before adding flesh.
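
For Markdown sources, the skeleton step might look like the sketch below. It assumes ATX-style `#` headers and a naive punctuation-based sentence split, and it covers only the skeleton; filling the remaining budget with ranked content is left to whichever scorer you already use.

```python
import re


def first_sentence(text: str) -> str:
    """Return the text up to the first sentence-ending punctuation mark."""
    text = text.strip()
    return re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0] if text else ""


def structural_skeleton(markdown_text: str) -> list[tuple[str, str]]:
    """Pair each header with the first sentence of the section beneath it."""
    sections: list[list] = []  # each entry is [header, list_of_body_lines]
    for line in markdown_text.splitlines():
        match = re.match(r"^#{1,6}\s+(.*)", line)
        if match:
            sections.append([match.group(1).strip(), []])
        elif sections and line.strip():
            sections[-1][1].append(line.strip())
    return [(header, first_sentence(" ".join(body))) for header, body in sections]


doc = """# Returns
Items can be returned within 30 days. Keep the receipt.

## Electronics
Electronics are non-refundable after 14 days. See exceptions below.
"""
skeleton = structural_skeleton(doc)
print(skeleton)
```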

> **Try This:** Take a long internal document — a product spec, a design doc, a support thread — and run both TextRank extraction and a naive first-N-sentences extraction on it. Compare which version you'd rather hand to an LLM. The gap tells you exactly how much structural signal pure similarity-ranking is discarding.

---

### When Extraction Is the Wrong Tool

Extraction has a ceiling. You cannot get more information density from extraction than the source document's own structure allows. If the source is repetitive, verbose, or poorly organized, extraction will compress it to a repetitive, verbose, or poorly organized subset.

The other failure mode is context dependency. A sentence that reads perfectly in its original paragraph can become ambiguous or misleading in isolation. "This approach is not recommended" extracted without its antecedent leaves the model guessing what approach.

These limitations aren't arguments against extraction — they're arguments for using it where it fits. Tight factual documents with good structure, retrieval over large corpora, reducing a known-clean corpus to its most relevant subset: extraction is fast, safe, and explainable. When the source material is messy, repetitive, or highly interdependent, the next tool set handles what extraction cannot.

---

### Key Takeaways

- Chunking decisions are upstream compression decisions — poor chunk boundaries create ceiling effects that no downstream selection algorithm can fix.
- Hybrid retrieval (dense + sparse, fused via RRF) consistently outperforms either approach alone without requiring a slower reranking step.
- Sentence-level extraction with TextRank should return results in original document order, not relevance order, to maintain coherence.
- Positional signals — first paragraphs, section headers, terminal messages — carry structural weight that pure similarity scoring ignores.
- Extraction introduces no new information and no hallucination risk, but it cannot compress beyond the source document's own structural density.

---

### Try This

Take any document longer than 3,000 tokens that your system currently sends to a model in full. Apply three extraction strategies to it: (1) first-N-sentences until you hit 800 tokens, (2) TextRank top-10 sentences in document order, and (3) hybrid retrieval against a fixed test query that represents a typical use case. Measure the token count and read each output. Note which one preserves the information a model would need to answer your test query correctly. That comparison — not benchmarks, not theory — is what calibrates your extraction strategy for your actual documents.


---

## Chapter 4: Abstractive Compression: Rewriting for Density

### Chapter Overview

Extractive compression keeps what's already there — it selects, trims, filters. Abstractive compression does something fundamentally different: it rewrites. The output isn't a subset of the input; it's a new artifact that encodes the same information in fewer tokens. This distinction matters because it unlocks compression ratios that selection alone can't reach, but it comes with a different risk profile. You're no longer just cutting — you're trusting a model to preserve meaning through transformation. Done well, abstractive compression can shrink context by 60–80% without meaningful degradation in downstream task quality. Done carelessly, it silently drops the fact that determines the answer.

---

### What Abstractive Compression Actually Means

When a model reads a 2,000-token document and produces a 400-token summary, it's not extracting sentences — it's constructing new ones that represent the source material's information content. That's abstractive compression. The output could contain words or phrasings that never appeared in the original.

This is what makes it powerful and what makes it dangerous in the same breath.

In a retrieval-augmented system, you might compress retrieved chunks before stuffing them into a prompt. Instead of passing the full 600-token customer support ticket to your LLM, you pass a 120-token digest that captures the issue, the affected component, and any resolution steps already tried. The LLM sees less text, processes faster, and costs less per call.

The compression model and the task model don't have to be the same. In fact, they usually shouldn't be. A smaller, cheaper model — tuned or prompted for summarization — handles the compression step. The expensive frontier model handles the final task. You're paying inference costs proportional to the work each model actually needs to do.

The key distinction from extractive methods is that abstractive compression requires a model to make semantic judgment calls. It has to decide what "matters." That decision is where quality is won or lost.

---

### Prompt Design for Compression

The compression prompt is not an afterthought. It's the spec. A vague instruction like "summarize this" produces summaries optimized for human reading — which isn't the same as summaries optimized for downstream LLM reasoning.

You want compression that preserves the features your task model actually needs. That requires telling the compression model exactly what to keep.

```python
COMPRESSION_PROMPT = """
You are compressing text for use as LLM context in a customer support resolution task.

Preserve:
- The user's stated problem, verbatim or paraphrased precisely
- Any error codes, product names, or version numbers mentioned
- Steps the user has already tried
- Any timestamps relevant to the issue timeline

Drop:
- Pleasantries, greetings, sign-offs
- Repetition and restatement
- Background context not relevant to the technical issue

Output only the compressed text. No preamble. Target: under {target_tokens} tokens.

TEXT TO COMPRESS:
{input_text}
"""
```

Notice what's explicit: the task context, the preservation rules, the drop rules, and the token target. A compression prompt that doesn't specify the downstream task will produce generic summaries. Generic summaries lose task-relevant detail. That's where performance degradation comes from — not from abstractive compression as a technique, but from under-specified compression prompts.

> **Key Insight:** The compression prompt should be written by someone who understands the task, not just the data. If your task model needs causal relationships preserved, say so. If it needs numerical specifics, say so. Generic compression is a trade-off you're making blindly.

---

### Measuring Compression Quality

You can't just compress and ship. You need to know if the compressed context still supports correct answers.

The standard approach is to run a parallel evaluation: take a set of questions that require the source documents to answer, run them against the full context, run them against the compressed context, and compare accuracy. This gives you a compression quality score — the fraction of questions answered correctly with compressed context relative to full context.

```python
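def evaluate_compression(
    questions: list[dict],
    full_docs: list[str],
    compressed_docs: list[str],
    task_model: str
) -> float:
    # `answer` and `grade` are assumed helpers: `answer` queries the task model with the
    # given context; `grade` returns True when an answer matches the expected one.
    full_answers = [answer(q, full_docs, task_model) for q in questions]
    compressed_answers = [answer(q, compressed_docs, task_model) for q in questions]

    eligible = 0  # questions the full-context run answered correctly
    correct = 0   # of those, questions the compressed-context run also answered correctly
    for q, fa, ca in zip(questions, full_answers, compressed_answers):
        # Grade each full-context answer once, so a non-deterministic grader
        # cannot disagree with itself between the numerator and the denominator
        if not grade(q["expected"], fa):
            continue  # task model error, not compression error
        eligible += 1
        if grade(q["expected"], ca):
            correct += 1
    return correct / eligible if eligible else 0.0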
```

This design scores only the cases where the full-context run answered correctly. Otherwise you're mixing compression error with task model error, and the compression signal gets lost in the noise.

Acceptable compression quality depends on your tolerance. For a customer support triage system, 95%+ is a reasonable floor. For a research assistant where the user is doing their own verification, 85% might be fine. Define the threshold before you measure, not after.

> **Warning:** If you compress training data or few-shot examples, the same quality measurement applies. Compressed few-shot examples that subtly alter the demonstrated task format can degrade task model performance in ways that are hard to trace. Treat compressed examples with at least as much skepticism as compressed context.

---

### Hierarchical and Multi-Stage Compression

Single-pass compression has a ceiling. For very long documents, a single compression call may not achieve the ratio you need, or the compression model may lose coherence when asked to process too much at once.

Hierarchical compression addresses this by chunking first, compressing each chunk, then compressing the compressed chunks again if needed.

```
Document (10,000 tokens)
    → Chunk into 500-token segments (20 chunks)
    → Compress each chunk to ~100 tokens (20 compressed chunks, ~2,000 tokens total)
    → If still too long: compress the compressed chunks
    → Final context: ~400–600 tokens
```

Each compression stage needs its own prompt tuned to that stage's input type. The first-stage prompt handles raw source text. The second-stage prompt handles already-compressed summaries — a different shape of input with different failure modes.

The risk in hierarchical compression is cumulative information loss. Each pass discards something. Run quality evaluation at each stage separately so you can see where the degradation is happening, not just that it's happening.

For most production systems, two stages is the practical limit before information loss becomes unacceptable. If you need a third stage, the better question is whether you should be retrieving less in the first place.
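
The flow above can be sketched as a short loop. This is a sketch under stated assumptions: `compress` is a hypothetical callable that wraps your compression model and receives the stage index so it can pick a stage-specific prompt, and whitespace-separated words stand in for real tokens.

```python
from typing import Callable


def chunk_by_tokens(tokens: list[str], size: int) -> list[list[str]]:
    """Split a token list into consecutive segments of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def hierarchical_compress(
    text: str,
    compress: Callable[[str, int, int], str],  # (chunk_text, target_tokens, stage) -> compressed text
    chunk_size: int = 500,
    per_chunk_target: int = 100,
    final_target: int = 600,
    max_stages: int = 2,
) -> str:
    current = text
    for stage in range(max_stages):
        tokens = current.split()
        if len(tokens) <= final_target:
            break  # already within budget; stop before losing more information
        chunks = chunk_by_tokens(tokens, chunk_size)
        current = " ".join(compress(" ".join(chunk), per_chunk_target, stage) for chunk in chunks)
    return current
```

With the defaults, a 10,000-token input becomes 20 chunks, then roughly 2,000 tokens, then 4 chunks and roughly 400 tokens. Capping `max_stages` at 2 encodes the two-stage limit directly, and logging the intermediate `current` value at each stage gives you the per-stage artifacts to evaluate.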

---

### Handling Structured and Semi-Structured Input

Plain prose compresses more predictably than structured data. When your input contains tables, JSON, code, or key-value records, standard summarization prompts often handle them badly — either preserving structure at the cost of verbosity, or stripping structure and losing precision.

The right approach is to treat structured content as a separate compression case with its own logic.

For JSON payloads, consider field-level filtering before any model-based compression. Drop fields your task model doesn't use. That's compression with zero quality risk.

```python
IMPORTANT_FIELDS = {"user_id", "error_code", "timestamp", "affected_component"}

def compress_json_record(record: dict) -> dict:
    return {k: v for k, v in record.items() if k in IMPORTANT_FIELDS}
```

For tables, instruct the compression model explicitly:

```
This is a data table. Preserve column headers and all numerical values exactly.
Compress row descriptions where rows are substantially similar.
Do not paraphrase numbers.
```

Code is generally the hardest to compress abstractively because in code, precision is the meaning: a summarized function signature loses the contract. For code, extractive approaches (stripping comments, removing docstrings, omitting unused functions) usually outperform abstractive summarization. Reserve abstractive compression for the natural language context around code, not the code itself.
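
For Python source specifically, the standard library can do the extractive step. The sketch below assumes Python 3.9+ (for `ast.unparse`) and input that parses cleanly; the function name is illustrative. Comments disappear because the parser never stores them, and docstrings are removed explicitly.

```python
import ast


def strip_comments_and_docstrings(source: str) -> str:
    """Return equivalent Python source with comments and docstrings removed."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.Module, ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            continue
        body = node.body
        has_docstring = (
            bool(body)
            and isinstance(body[0], ast.Expr)
            and isinstance(body[0].value, ast.Constant)
            and isinstance(body[0].value.value, str)
        )
        if has_docstring:
            # Keep the body non-empty so the result is still valid Python
            node.body = body[1:] or [ast.Pass()]
    return ast.unparse(tree)
```

Note that `ast.unparse` also normalizes formatting, so the output will not match the original layout line for line. If the task model needs exact line numbers (for example, to discuss a stack trace), this approach is the wrong fit.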

---

### Caching Compressed Context

Once you've compressed a document, the compressed version is a reusable artifact. Storing it saves the compression inference cost on every subsequent retrieval of that document.

The operational question is cache invalidation. If the source document changes, the cached compressed version is stale. For static document corpora — product manuals, policy documents, historical records — stale compression is rarely a problem. For live data, you need a versioning strategy.

A simple approach: cache the compressed output alongside a hash of the source. On retrieval, hash the current source, compare against the cached hash, and recompress only on mismatch.

```python
import hashlib

def get_compressed(doc_id: str, source_text: str, cache: dict) -> str:
    source_hash = hashlib.sha256(source_text.encode()).hexdigest()
    if doc_id in cache and cache[doc_id]["hash"] == source_hash:
        return cache[doc_id]["compressed"]
    compressed = run_compression(source_text)
    cache[doc_id] = {"hash": source_hash, "compressed": compressed}
    return compressed
```

Compression cache hits skip the compression call entirely: no inference cost, and only a hash and a lookup's worth of latency. For document corpora that are read far more often than they're updated, this alone can reduce compression overhead to near zero in steady state.

---

### Key Takeaways

- Abstractive compression rewrites rather than selects — it achieves higher compression ratios but requires a model to make semantic judgment calls that can silently drop critical information.
- The compression prompt determines quality. Specify the downstream task, what to preserve, and what to drop. A generic "summarize this" instruction is an uncontrolled trade-off.
- Measure compression quality with parallel evaluation against a question set. Define your acceptable accuracy threshold before you run the measurement, not after.
- Structured content — JSON, tables, code — often compresses better with extractive or rule-based methods than abstractive summarization. Use the right tool for the input type.
- Caching compressed artifacts with source hashing eliminates repeated compression inference costs on static or slow-changing document corpora.

---

### Try This

Take a document your system currently passes in full context — a retrieved chunk, a policy doc, a support ticket. Write a compression prompt that explicitly names the downstream task and lists three specific things to preserve and two to drop. Run the compressed version through your task model on five representative queries. Compare accuracy against the full-context run. If accuracy holds, measure the token reduction and calculate the cost savings at your current query volume. That number is what abstractive compression is worth to your system — not in theory, but in production.


---

## Chapter 5: Learned Compression: Soft Tokens and Embeddings

### Chapter Overview

The compression techniques covered so far — summarization, extraction, rewriting — all operate on tokens the model was trained to understand. Learned compression takes a different path: instead of shrinking the text, it compresses the *representation* itself, training auxiliary components to encode rich context into a small set of continuous vectors that the model consumes directly. This chapter covers how that works mechanically, where it genuinely outperforms text-based approaches, and what you're signing up for when you adopt it in a production system.

---

### What Soft Tokens Actually Are

A standard prompt is a sequence of discrete tokens — integers mapped to vocabulary entries. The model embeds each token into a high-dimensional vector, processes the sequence, and generates a response. Soft tokens bypass the vocabulary entirely. They're learned embedding vectors inserted directly into the model's input stream, with no corresponding text representation.

The term "soft prompt" comes from the fine-tuning literature, where researchers found that prepending a few learned vectors to a frozen model's input could steer behavior nearly as effectively as full fine-tuning, particularly on larger models, while training only a tiny fraction of the parameters. Compression borrows the same mechanic for a different purpose: instead of steering behavior, you use a trained encoder to compress a long document into a small set of soft tokens that carry the document's semantic content in dense form.

The encoder is typically a smaller transformer trained to produce these vectors. The target model — frozen, unchanged — receives them as a prefix. From its perspective, the soft tokens look like a normal embedded input sequence, just one it can't decode back into text. The system is trained end-to-end on a task objective: given soft-token-compressed context, the model should answer questions, generate summaries, or complete tasks as accurately as if it had received the full text.

This is not a trick or an approximation in the naive sense. The compression is learned specifically to preserve what the downstream task requires. A soft token prefix of length 32 can, in principle, encode the semantic content of thousands of tokens, provided the encoder is trained well against the frozen target model.

---

### The Mechanics of Training an Encoder

Training requires a frozen target LLM and a trainable encoder. The standard setup looks like this:

1. Sample a long document and a task (question answering, summarization, classification).
2. Run the encoder over the document to produce K soft tokens.
3. Prepend those soft tokens to the task instruction and pass the combined sequence to the frozen LLM.
4. Compute loss against the ground-truth output.
5. Backpropagate through the LLM (to get gradients on the soft tokens) and into the encoder.

The LLM weights stay frozen. Only the encoder trains. This is important: you're not modifying the model you're serving, which preserves its general capabilities and avoids the cost of full fine-tuning.

```python
import torch
import torch.nn as nn

class SoftTokenEncoder(nn.Module):
    def __init__(self, encoder_model, num_soft_tokens, embed_dim):
        super().__init__()
        self.encoder = encoder_model  # e.g., a small BERT-class transformer
        # Emit all K soft tokens from one projection so the K vectors can differ.
        # (Expanding a single pooled vector K times would yield K identical tokens.)
        self.projection = nn.Linear(
            encoder_model.config.hidden_size, num_soft_tokens * embed_dim
        )
        self.num_soft_tokens = num_soft_tokens
        self.embed_dim = embed_dim

    def forward(self, input_ids, attention_mask):
        # Encode the document
        outputs = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state  # (batch, seq_len, hidden)
        # Simple masked mean pool (padding excluded); real systems use learned pooling
        mask = attention_mask.unsqueeze(-1).to(hidden.dtype)  # (batch, seq_len, 1)
        pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)  # (batch, hidden)
        # Project to K vectors in the target model's embedding dim
        soft_tokens = self.projection(pooled).view(-1, self.num_soft_tokens, self.embed_dim)
        return soft_tokens  # (batch, K, embed_dim)
```

In practice, learned pooling, where the K output vectors are produced by K learned query vectors attending over the full encoder output, preserves significantly more information than mean pooling. The K queries are free to specialize: one might attend to entities, another to temporal structure, another to causal relationships. The model learns whatever division of labor helps the task, without explicit supervision.
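
A minimal sketch of learned pooling, assuming PyTorch 1.9+ (for `batch_first=True`) and a `hidden_size` divisible by `num_heads`. The class name and the initialization scale are illustrative choices, not taken from any particular published architecture.

```python
import torch
import torch.nn as nn


class LearnedQueryPooling(nn.Module):
    """Pool a variable-length encoder output into K vectors using K learned queries."""

    def __init__(self, hidden_size: int, num_soft_tokens: int, embed_dim: int, num_heads: int = 4):
        super().__init__()
        # K trainable query vectors, shared across all documents
        self.queries = nn.Parameter(torch.randn(num_soft_tokens, hidden_size) * 0.02)
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        self.projection = nn.Linear(hidden_size, embed_dim)

    def forward(self, hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, seq_len, hidden); attention_mask: (batch, seq_len), 1 = real token
        batch = hidden.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)  # (batch, K, hidden)
        # key_padding_mask expects True at positions that should be ignored
        pooled, _ = self.attn(q, hidden, hidden, key_padding_mask=(attention_mask == 0))
        return self.projection(pooled)  # (batch, K, embed_dim)
```

Each query produces its own attention pattern over the document, so the K outputs can differ from one another, unlike a single mean-pooled vector. This module would replace the pooling and projection steps inside an encoder wrapper; the encoder itself is unchanged.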

> **Key Insight:** The number of soft tokens K is your compression ratio control. K=16 is aggressive compression; K=128 gives the encoder more room to work. Start at K=64 and measure task performance before tuning further. The relationship between K and performance is task-dependent and non-linear.

---

### Where Learned Compression Wins

Text-based compression has a ceiling. Abstractive summarization can lose precise details — numbers, names, conditional logic — that the task requires. Extractive compression can miss cross-sentence inferences. Soft token compression is trained directly on task loss, so it learns to preserve exactly what the downstream objective penalizes it for losing.

For retrieval-augmented systems with stable document sets — product documentation, legal corpora, internal knowledge bases — this is a significant advantage. You compress each document once, cache the soft token embeddings, and serve them at inference time with no per-query recomputation. The latency profile shifts: encoding cost is amortized, and the model receives a dramatically shorter sequence.

The gains compound at scale. If your average retrieved context is 3,000 tokens and you compress to K=64 soft tokens, you've cut context length by ~98%. That's not a marginal improvement — it changes the economics of the system.
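
The arithmetic is simple enough to check for your own numbers. The helper below is an illustrative sketch; it counts only the retrieved context, not the query or instruction tokens that remain as text.

```python
def context_reduction(original_tokens: int, soft_tokens: int) -> float:
    """Fraction of context length removed by replacing text with K soft tokens."""
    return 1 - soft_tokens / original_tokens


for k in (16, 64, 128):
    print(f"K={k:>3}: {context_reduction(3000, k):.1%} shorter")
```

For a 3,000-token context, K=64 works out to about 97.9%, which is where the ~98% figure comes from. Substitute your own average context length to see how the ratio shifts across K.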

There's also a quality dimension that surprises people the first time they see it. Because the encoder is trained on task loss rather than reconstruction loss, it can sometimes match or even outperform the full, uncompressed context on held-out tasks. The compression filters noise. A 5,000-token document often contains redundant structure, repeated assertions, and off-topic content. Soft token compression trained on QA tasks learns to ignore most of it.

---

### The Deployment Reality

The advantages are real. So are the constraints, and they're non-trivial.

**Model coupling.** Soft tokens are tied to the specific LLM they were trained against — its embedding dimension, its layer structure, its tokenizer. If you upgrade the base model, you retrain the encoder. There's no transfer between model families. This is the most significant operational constraint.

**Training cost.** You need task-specific training data at scale — typically tens of thousands of (document, query, answer) triples. If your use case has limited labeled data, learned compression will underperform text-based alternatives that require no training.

**No interpretability.** You cannot read a soft token prefix. Debugging a failure means treating the encoder as a black box. If the model answers incorrectly, determining whether the failure is in the encoder (bad compression) or the LLM (bad reasoning over the compressed context) requires ablation studies.

> **Warning:** Do not deploy a soft token system without establishing a faithfulness baseline. Run the same tasks with full context and with compressed context. Measure the gap. If you're seeing >5% degradation on critical tasks, your K is too small or your training data doesn't cover the failure cases. Soft token compression that degrades task performance is just an expensive way to make your system worse.

**Infrastructure requirements.** You're now running two models: the encoder and the LLM. The encoder adds latency for new documents (offset by caching for stable corpora) and requires a pipeline that can inject continuous embeddings into the LLM's input stream. Not all serving frameworks support this out of the box.

---

### Practical Caching Strategy

For corpora that don't change frequently, precomputed soft token caches are where the economics get interesting. The workflow:

1. Run the encoder over every document in your corpus offline.
2. Store the resulting soft token tensors alongside your document metadata.
3. At query time, retrieve documents as usual, but instead of passing text to the LLM, fetch and concatenate their cached soft token representations.
4. The LLM receives a short prefix of soft tokens plus the query — no long-context overhead.

```python
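import numpy as np
import torch

class SoftTokenCache:
    def __init__(self, cache_path):
        self.cache = {}
        self.cache_path = cache_path

    def precompute(self, doc_id, document_text, encoder, tokenizer, device):
        # max_length must not exceed what the encoder model supports
        inputs = tokenizer(document_text, return_tensors="pt",
                          truncation=True, max_length=4096).to(device)
        with torch.no_grad():
            # Pass only the arguments the encoder's forward() accepts; some
            # tokenizers also return fields such as token_type_ids
            soft_tokens = encoder(inputs["input_ids"], inputs["attention_mask"])  # (1, K, embed_dim)
        self.cache[doc_id] = soft_tokens.cpu().numpy()

    def retrieve(self, doc_ids):
        # Documents missing from the cache are skipped silently
        return [torch.tensor(self.cache[did]) for did in doc_ids if did in self.cache]

    def save(self):
        # np.savez takes arrays as keyword arguments, so doc_ids must be strings
        np.savez(self.cache_path, **self.cache)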
```

For document sets that update frequently, you need an incremental encoding pipeline — new and modified documents get re-encoded and cached; stable documents remain untouched. The cache hit rate directly determines whether the system's latency profile is competitive with text-based alternatives.
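
A rough sketch of the bookkeeping behind an incremental pipeline, assuming you keep a manifest that maps each document ID to the SHA-256 hash of the text it was last encoded from. The function name and the manifest shape are hypothetical.

```python
import hashlib


def docs_to_reencode(corpus: dict[str, str], manifest: dict[str, str]) -> list[str]:
    """Return IDs of documents that are new or whose text changed since the last encoding run."""
    stale = []
    for doc_id, text in corpus.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if manifest.get(doc_id) != digest:
            stale.append(doc_id)
    return stale
```

Only the returned IDs need to go back through the encoder; everything else is served from the existing cache. The manifest should also record which encoder version produced each entry, since a retrained encoder invalidates every cached tensor regardless of the source hash.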

> **Try This:** If you're running a RAG system with a stable document corpus, measure the average token count of your retrieved context chunks. Now estimate what K=64 soft tokens per chunk would reduce that to. If the reduction is more than 50%, soft token caching is worth prototyping — even a rough encoder trained on synthetic QA pairs over your corpus will give you a meaningful signal on whether the compression quality is sufficient for your use case.

---

### Key Takeaways

- Soft tokens encode document content into continuous vectors the LLM consumes directly — compression ratio is controlled by K, the number of output tokens from the encoder.
- The encoder trains on task loss, not reconstruction loss, which means it learns to preserve what the task requires and can sometimes outperform full, uncompressed context on specific tasks.
- Soft token systems are model-coupled: retraining the encoder is required when the base LLM changes, which makes model versioning an operational concern from day one.
- Precomputed caching turns learned compression into a latency advantage for stable corpora — encoding cost is paid once, retrieval serves compressed representations at inference time.
- Debugging failures requires treating the encoder as a black box; establishing a full-context performance baseline before deployment is non-negotiable.

---

### Try This

Pick a document set you currently retrieve from — ideally 50–200 documents you know well. Using a small encoder (a 6-layer BERT-class model works fine for this experiment), train it with K=32 on a simple QA task over those documents. Use synthetic questions generated by an LLM if you lack labeled data. After 1,000 training steps, run your test set twice: once with full document text, once with soft token compression. Compute exact-match or ROUGE scores for both. The gap you measure is the real cost of compression at K=32 for your specific content type. Adjust K and retrain — two or three iterations will tell you whether the accuracy-efficiency trade-off is viable for your use case before you commit to building the full pipeline.


---

## Chapter 6: Compression Evaluation: Faithfulness and Task Performance

### Chapter Overview

Compression without measurement is just data loss. Every technique covered so far — token pruning, semantic chunking, summarization, soft token approaches — produces a smaller prompt. The question this chapter addresses is whether that smaller prompt still produces the right outputs. Evaluation for compression is distinct from general LLM evaluation: you're not asking "is this output good?" in isolation. You're asking "did removing context change the answer, and by how much?" That requires purpose-built metrics, test harnesses that isolate compression as the variable, and a clear distinction between faithfulness (did we preserve meaning?) and task performance (did the downstream result change?). Both matter. Neither alone is enough.

---

### What You're Actually Measuring

There are two failure modes in compressed prompts, and they're easy to conflate.

The first is faithfulness failure: the compressed prompt no longer accurately represents the information in the original. A summarizer drops a key constraint. A pruner removes the one sentence that disambiguates an ambiguous term. The model receives a distorted input and produces a confident, coherent, wrong answer.

The second is task performance failure: even if the compressed prompt is semantically faithful, the model's behavior changes. This happens because models are sensitive to phrasing, order, and density in ways that aren't fully captured by meaning alone. A fact can survive compression and still produce a different generation if the surrounding context that cued its relevance was trimmed away.

These failures are independent. You can have high faithfulness scores and poor task performance if your compression scheme preserves content but destroys the structural signals the model was relying on. You can also, less commonly, see decent task performance despite low faithfulness scores — this happens when the removed content was redundant for the specific task, even if it wasn't semantically redundant in general.

The practical implication: run both types of evaluation, on your actual tasks, with your actual model. Scores from someone else's benchmark tell you almost nothing useful.

---

### Faithfulness Metrics

Faithfulness measures whether the compressed representation preserves the meaning of the original. The two dominant approaches are reference-based and model-based.

Reference-based metrics (ROUGE, BLEU, BERTScore) compare the compressed output against the source using token overlap or embedding similarity. They're fast and cheap. They're also unreliable for compression evaluation specifically, because overlap-based scores penalize valid paraphrase and reward superficial similarity. A compressed prompt that uses different words to convey identical meaning will score poorly on ROUGE even if a human would call it perfectly faithful.

Model-based faithfulness evaluation is more expensive but more accurate. The standard approach: take a set of factual claims extracted from the original context, then ask a judge model whether each claim is entailed, contradicted, or neutral with respect to the compressed version.

```python
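def evaluate_faithfulness(original: str, compressed: str, claims: list[str], judge_model) -> dict:
    # `claims` are factual assertions extracted from `original`; the judge sees only
    # `compressed`. `original` is kept in the signature for logging and traceability.
    results = []
    for claim in claims:
        prompt = f"""Given this compressed context:
---
{compressed}
---

Is the following claim: ENTAILED, CONTRADICTED, or NEUTRAL?
Claim: {claim}
Answer with one word only."""
        verdict = judge_model.complete(prompt).strip().upper()
        results.append({"claim": claim, "verdict": verdict})

    verdicts = [r["verdict"] for r in results]
    total = len(results) or 1  # avoid division by zero when no claims are supplied
    return {
        "faithfulness_score": verdicts.count("ENTAILED") / total,
        "contradiction_rate": verdicts.count("CONTRADICTED") / total,  # facts misrepresented
        "omission_rate": verdicts.count("NEUTRAL") / total,  # facts no longer supported
        "details": results
    }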
```

Extract claims automatically using the same judge model — prompt it to list factual assertions from the original context, then evaluate each one. Aim for 10–20 claims per document for reliable coverage without excessive cost.
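
A minimal sketch of that extraction step, assuming the judge exposes the same `complete(prompt) -> str` method used above. The prompt wording and the line-cleaning rules are illustrative and worth adapting to how your judge model actually formats lists.

```python
import re


def extract_claims(original: str, judge_model, max_claims: int = 20) -> list[str]:
    """Ask the judge model for factual assertions, one per line, and clean up the result."""
    prompt = (
        "List the distinct factual assertions made in the text below, one per line, "
        f"with no numbering or commentary. List at most {max_claims}.\n\n"
        f"TEXT:\n{original}"
    )
    raw = judge_model.complete(prompt)
    claims = []
    for line in raw.splitlines():
        # Tolerate bullets or "1." / "1)" numbering the model adds despite the instruction
        cleaned = re.sub(r"^\s*(?:[-*•]|\d+[.)])\s+", "", line).strip()
        if cleaned and cleaned not in claims:
            claims.append(cleaned)
    return claims[:max_claims]
```

Spot-check the extracted claims by hand on a few documents before trusting them. If the extractor misses a critical fact, the faithfulness score cannot see its loss.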

> **Key Insight**
> Faithfulness is not symmetric. A compressed prompt that adds no new information but omits critical facts scores the same as one that misrepresents those facts — both produce low entailment rates. Track omission rate and contradiction rate separately. Omissions are usually acceptable at low rates; contradictions almost never are.

---

### Task Performance Evaluation

Faithfulness tells you about the prompt in isolation. Task performance tells you whether compression changes the model's answer on the job it's actually doing.

The setup is straightforward in principle: run the same task against the original prompt and the compressed prompt, then compare outputs. Where it gets complicated is defining what "the same answer" means when outputs are generative.

For structured outputs — classification, extraction, slot-filling — comparison is exact or near-exact. A compressed prompt that changes a sentiment label from neutral to negative is a measurable regression.

For generative tasks — summarization, Q&A, code generation — you need a scoring function. Options include:

- **Answer equivalence**: ask a judge model whether two answers are semantically equivalent given the question
- **Factual consistency**: measure whether the generated answer contains the same factual claims regardless of phrasing
- **Task-specific rubrics**: for code generation, does the output pass the test suite? For SQL, does it return the same rows?

```python
def task_performance_delta(
    task_input: str,
    original_context: str,
    compressed_context: str,
    model,
    score_fn
) -> dict:
    original_output = model.complete(original_context + "\n\n" + task_input)
    compressed_output = model.complete(compressed_context + "\n\n" + task_input)

    original_score = score_fn(task_input, original_output)
    compressed_score = score_fn(task_input, compressed_output)

    return {
        "original_score": original_score,
        "compressed_score": compressed_score,
        "delta": compressed_score - original_score,
        "relative_change": (compressed_score - original_score) / (original_score + 1e-8)
    }
```

The `delta` is your compression cost. If you're compressing to 50% of original tokens and your task score drops by 2%, that's probably a good trade. If it drops 15%, you've over-compressed.
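
For structured outputs, the scoring function can be fully deterministic. The sketch below assumes a `score_fn(task_input, output)` signature and a lookup of expected labels keyed by task input; both helper names are hypothetical.

```python
from statistics import mean


def make_exact_match_scorer(expected_by_input: dict[str, str]):
    """Build a score_fn(task_input, output) that returns 1.0 on a normalized exact match."""
    def score_fn(task_input: str, output: str) -> float:
        expected = expected_by_input[task_input]
        return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0
    return score_fn


def mean_delta(per_example: list[dict]) -> float:
    """Average the `delta` field across per-example results for a whole suite."""
    return mean(result["delta"] for result in per_example)
```

A single example's delta is noisy, especially with sampling enabled. Aggregating across the suite, and reading the individual regressions behind the average, is what makes the number usable for a decision.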

> **Warning**
> Never evaluate compression using the same model you're compressing for. If GPT-4o is your production model and you use GPT-4o to score faithfulness and task performance, you're measuring the judge's preferences, not ground truth. Use a different model family, human evaluators, or deterministic scoring wherever possible.

---

### Building a Compression Eval Suite

A compression eval suite is a dataset of (context, task, expected output) triples where the expected output is known and stable. The "known and stable" requirement is what makes this hard to build and easy to neglect.

Three things to get right:

**Diversity of information types.** Your suite needs examples where the critical information is numerical, where it's a named entity, where it's a causal relationship, and where it's a negative statement ("the product does not support X"). These compress differently and fail differently.

**Compression ratio coverage.** Don't just test your target compression ratio. Test the full spectrum from 90% retention down to 20%. This produces a performance curve that reveals where your specific compressor degrades, which is rarely a clean cliff and usually a slope with an elbow.

**Adversarial cases.** Include contexts where the critical information is near the end (compressors that favor leading content, and any truncation fallback, tend to cut the tail first), where it's expressed parenthetically, and where it contradicts more prominent information. These cases expose brittleness that average-case evaluation misses.

Start with 50–100 examples. That's enough to get a signal on gross compression failures. Scale to 500+ before making production decisions about compression ratios.
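
One way to keep the three requirements honest is to tag each example and count the buckets. The sketch below is illustrative: the tag names, the `EvalExample` fields, and the `min_per_bucket` default are all assumptions to adapt to your own taxonomy.

```python
from collections import Counter
from dataclasses import dataclass, field

# Hypothetical tag vocabularies; rename to match your own taxonomy.
INFO_TYPES = {"numerical", "named_entity", "causal", "negation"}
ADVERSARIAL_TAGS = {"tail_position", "parenthetical", "contradiction"}

@dataclass
class EvalExample:
    context: str
    task: str
    expected: str
    info_type: str  # one of INFO_TYPES
    adversarial: set[str] = field(default_factory=set)

def coverage_gaps(suite: list[EvalExample], min_per_bucket: int = 5) -> dict[str, int]:
    """Return each under-filled bucket mapped to its current example count."""
    counts = Counter(ex.info_type for ex in suite)
    for ex in suite:
        counts.update(ex.adversarial)
    buckets = INFO_TYPES | ADVERSARIAL_TAGS
    return {b: counts[b] for b in sorted(buckets) if counts[b] < min_per_bucket}
```

Run `coverage_gaps` whenever you add examples; an empty result means every information type and adversarial pattern has at least the minimum number of cases.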

> **Try This**
> Take five recent production queries where you have ground-truth answers. Compress each context to 60% of original tokens using your current method. Run both the original and compressed contexts through your production model and compare outputs. Don't use a metric — read them manually. You'll immediately see the failure modes specific to your content type. This is worth more than any benchmark score.

---

### The Compression-Performance Tradeoff Curve

Once you have an eval suite and a scoring function, you can plot the curve that actually drives compression decisions: task performance vs. compression ratio.

This curve is specific to your compressor, your model, your tasks, and your content. General results from papers do not transfer. Run it yourself.

A typical curve looks like this: performance stays flat or near-flat from 100% down to about 70% retention, then begins a gradual decline, then drops sharply below some threshold (often around 40–50% retention, but it varies). The goal is to operate just above the elbow.

```python
def compression_curve(suite, compressor, model, score_fn, ratios=None):
    if ratios is None:
        ratios = [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3]

    results = {}
    for ratio in ratios:
        deltas = []
        for example in suite:
            compressed = compressor.compress(example["context"], target_ratio=ratio)
            delta = task_performance_delta(
                example["task"], example["context"], compressed, model, score_fn
            )
            # Average absolute deltas: per-example relative change blows up
            # whenever the original score is zero (common with exact match).
            deltas.append(delta["delta"])
        results[ratio] = sum(deltas) / len(deltas)

    return results
```

Plot this and the answer to "how much should we compress?" becomes obvious rather than a judgment call. The elbow is usually visible. Operate just above the elbow, not at maximum compression.
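
If you want a number rather than an eyeballed elbow, one simple rule is to walk the curve from least to most compression and stop at the last ratio that stays within a tolerance of the baseline. This is a sketch of that rule, not a standard elbow-detection algorithm; `max_drop` is a hypothetical tolerance and the example curve values are made up.

```python
def pick_operating_ratio(curve: dict[float, float], max_drop: float = 0.02) -> float:
    """Return the lowest retention ratio reached before the score change
    first falls below -max_drop, walking from least to most compression.

    curve maps retention ratio -> average score change versus the
    uncompressed baseline (negative means worse).
    """
    chosen = 1.0
    for ratio in sorted(curve, reverse=True):
        if curve[ratio] < -max_drop:
            break  # past the elbow; stop before over-compressing
        chosen = ratio
    return chosen

# Made-up curve for illustration only.
example_curve = {1.0: 0.0, 0.9: -0.002, 0.8: -0.005, 0.7: -0.01,
                 0.6: -0.018, 0.5: -0.06, 0.4: -0.15, 0.3: -0.31}
operating_ratio = pick_operating_ratio(example_curve)  # 0.6
```

Stopping at the first violation, rather than taking the minimum ratio that happens to pass, keeps a noisy recovery further down the curve from talking you into over-compressing.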

One more thing: rerun this curve whenever the task distribution shifts, the model is updated, or the compressor changes. The elbow moves.

---

### Key Takeaways

- Faithfulness and task performance are separate metrics that can fail independently — measure both.
- Model-based claim entailment is more reliable than token-overlap metrics for faithfulness evaluation. Track omission and contradiction rates separately.
- Task performance evaluation requires a scoring function matched to the task type: exact match for structured outputs, judge-model equivalence or rubric-based scoring for generative tasks.
- Your compression eval suite should include adversarial cases — tail-end critical information, parenthetical facts, internal contradictions — not just representative samples.
- The compression-performance tradeoff curve is specific to your system. Plot it, find the elbow, and operate just above it. Rerun it when anything changes.

---

### Try This

Pick one task your system handles regularly. Find ten examples with known correct answers. Apply your compression pipeline at three different target ratios: 80%, 60%, and 40% of original token count. For each example at each ratio, score the output against the known answer using whatever function makes sense for your task type — exact match, keyword presence, or a quick judge-model call. Plot average score vs. compression ratio. You now have a real compression-performance curve for your system. Compare where you're currently compressing to where the elbow actually is. If they don't match, adjust accordingly.


---

## Chapter 7: Integration Patterns: Where Compression Lives in the Pipeline

### Chapter Overview

Compression doesn't work in isolation. The algorithm matters, but where you place it in the pipeline matters just as much — sometimes more. A technically sound compression strategy deployed in the wrong position will either produce worse output than no compression at all, or it'll compress inputs that didn't need compressing while leaving the real cost drivers untouched. This chapter covers the integration patterns that actually work in production: where compression fits relative to retrieval, caching, batching, and inference; what the tradeoffs are at each position; and how to wire these pieces together without creating a system that's harder to debug than the problem it solves.

---

### Position Zero: Compress Before or After Retrieval?

The most common mistake in RAG pipelines is placing compression after retrieval without thinking through what that means for faithfulness. You retrieve twenty chunks, concatenate them into a context window, then compress — and the compressor doesn't know which chunks matter. It'll drop content based on statistical salience, not relevance to the query.

The better default: compress *after* reranking, not just after retrieval. Reranking gives you a relevance signal the compressor can work with. Once you know chunk 3 is more relevant than chunk 17, you can compress chunk 17 aggressively and preserve chunk 3 nearly verbatim.

That said, there's a case for pre-retrieval compression too. If your document corpus contains long passages with significant redundancy — legal contracts, technical documentation with repeated boilerplate — compressing documents before indexing can reduce embedding computation costs and improve retrieval precision. This is corpus-level compression, distinct from context-level compression, and it operates on a completely different schedule.

These are two separate problems. Conflating them leads to pipelines that do neither well.

```python
# Post-rerank compression: respect relevance scores.
# RankedChunk is any object exposing .text and a non-negative .score;
# compress_chunk is your chunk-level compressor.
def compress_ranked_chunks(chunks: list["RankedChunk"], target_tokens: int) -> str:
    # Allocate token budget proportional to relevance score
    total_score = sum(c.score for c in chunks)
    if total_score <= 0:
        raise ValueError("relevance scores must sum to a positive value")
    compressed = []
    for chunk in chunks:
        budget = int((chunk.score / total_score) * target_tokens)
        compressed.append(compress_chunk(chunk.text, max_tokens=budget))
    return "\n\n".join(compressed)
```

This isn't the only approach, but the principle holds: let relevance scores inform compression budget allocation.

---

### Compression and the Cache Layer

If you're running a cache in front of your LLM — semantic cache, prefix cache, exact-match cache — compression interacts with it in non-obvious ways.

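Prefix caching, which many providers now support either automatically or through explicit cache markers, works by recognizing that the beginning of your prompt is identical across requests. System prompts, instructions, few-shot examples: if these are stable, the provider caches the KV state and you don't pay the full prefill cost on every call. Compress your system prompt once and it stays compressed. That's straightforward. The corollary is that anything recompressed per request belongs after the stable prefix, because a prefix that changes between calls never hits the cache.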

The subtler issue is semantic caching, where you cache responses keyed on embedding similarity of the input. If you compress queries before the cache lookup, you need to ensure that semantically equivalent queries compress to semantically similar representations. Aggressive compression can break this — two queries that meant the same thing now look different after compression, and you miss cache hits you should have gotten.

**Key Insight:** Run semantic cache lookups on the *uncompressed* query. Only compress the context that gets sent to the model. The query itself is usually short enough that compressing it buys nothing and risks breaking cache coherence.

This means your pipeline should branch: query goes to the cache lookup path unchanged, while the retrieved context goes through compression before assembly. Don't let compression logic bleed across the pipeline without explicit intent.
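
The branch can be sketched in a few lines. Everything here is a stand-in: `embed` is a toy bag-of-words vector so the example runs without a model, `SemanticCache` is a hypothetical in-memory cache, and `retrieve`, `compress`, and `complete` are whatever callables your pipeline already has.

```python
import math
from collections import Counter
from typing import Optional

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch runs without a model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def lookup(self, query: str) -> Optional[str]:
        q = embed(query)
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response
        return None

    def store(self, query: str, response: str) -> None:
        self.entries.append((embed(query), response))

def answer(query, cache, retrieve, compress, complete):
    # Path 1: the raw, uncompressed query goes to the cache lookup.
    cached = cache.lookup(query)
    if cached is not None:
        return cached
    # Path 2: only the retrieved context is compressed before assembly.
    context = compress(retrieve(query))
    response = complete(f"{context}\n\n{query}")
    cache.store(query, response)
    return response
```

Note that `compress` never sees the query on the lookup path, so changing compressors cannot change which queries hit the cache.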

---

### Batching and Compression Amortization

Compression is itself a compute operation. If you're using an LLM-based compressor — summarization, extraction, rewriting — you're making an additional API call. At scale, that's real cost and real latency.

Batching helps when your access patterns allow it. Offline ingestion pipelines are the obvious case: process documents in bulk, compress once, store compressed versions. The compression cost is paid at write time, not query time.

For real-time pipelines, batching is harder but not impossible. If you have multiple concurrent requests hitting the same document or context, you can deduplicate compression work: one compression call whose result gets reused across requests that share context. This requires a short-lived compression cache — not the same as your response cache — keyed on a hash of the input content.

```python
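import hashlib
from collections import OrderedDict

_CACHE_MAX = 512
_cache: OrderedDict[tuple[str, int], str] = OrderedDict()

def compress_with_cache(content: str, target_tokens: int) -> str:
    # Key on a digest of the content plus the budget, so the cache never
    # holds the full input text as a key.
    key = (hashlib.sha256(content.encode()).hexdigest()[:16], target_tokens)
    if key in _cache:
        _cache.move_to_end(key)  # mark as recently used
        return _cache[key]
    result = llm_compress(content, target_tokens)  # your LLM-based compressor
    _cache[key] = result
    if len(_cache) > _CACHE_MAX:
        _cache.popitem(last=False)  # evict the least recently used entry
    return result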
```

In practice, cache hit rates on this layer depend heavily on your traffic patterns. High overlap in retrieved context — common in domain-specific applications — makes this worth implementing. Fully diverse retrieval across requests means you'll get near-zero hits and have paid for cache overhead with no benefit.

**Warning:** Don't compress and cache at the document chunk level if your compression is query-aware (i.e., if the compressed output changes based on what question is being asked). Query-aware compression cannot be shared across requests with different queries. Cache only query-agnostic compression results — summaries, entity extractions, boilerplate removals.

---

### Streaming and Latency Budget

Streaming responses change the compression calculus. When you stream, users start receiving tokens before the full completion is generated — latency to first token matters as much as total generation time. If your compression step adds 800ms before any tokens are sent, you've traded user-perceived latency for a cost saving the user can't see.

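The solution isn't to skip compression. It's to keep as much compression work as possible off the critical path to first token. Compression can't start until chunks exist, but the work around it can overlap with retrieval: load or warm the compressor while the retrieval call is in flight, and if retrieval draws on several sources, start compressing each batch as it arrives instead of waiting for the slowest one. The compression cost is then partly absorbed rather than fully added.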

```python
import asyncio
from collections.abc import AsyncIterator

async def rag_pipeline(query: str) -> AsyncIterator[str]:
    # Overlap retrieval with compressor warm-up so model loading is off the
    # critical path; the compression call itself still has to wait for chunks.
    chunks, _ = await asyncio.gather(
        retrieve_chunks(query),
        warmup_compressor()  # pre-load model if needed
    )
    reranked = rerank(chunks, query)
    # Still on the critical path to first token, so keep this step fast.
    compressed_context = await compress_async(reranked)
    async for token in stream_completion(query, compressed_context):
        yield token
```

The ideal scenario: the compression work you overlap with retrieval finishes inside the retrieval window, so only a small remainder lands on the critical path and the rest is effectively free from the user's perspective. Measure both. If compression is consistently slower than retrieval, you have a compression implementation problem, not an integration problem.
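
When retrieval fans out to more than one backend, the overlap can go further than a warm-up. The sketch below uses stand-ins throughout: `retrieve_from` and `compress_batch` are hypothetical stubs that sleep instead of calling real services, and the source names and delays are invented.

```python
import asyncio

async def retrieve_from(source: str, delay: float) -> list[str]:
    # Stand-in for a network call to one retrieval backend.
    await asyncio.sleep(delay)
    return [f"{source} chunk {i}" for i in range(2)]

async def compress_batch(chunks: list[str]) -> list[str]:
    # Stand-in for a compressor call; here it just truncates each chunk.
    await asyncio.sleep(0.01)
    return [c[:12] for c in chunks]

async def retrieve_and_compress(source: str, delay: float) -> list[str]:
    # Compression for this source starts as soon as its results land,
    # while slower sources are still in flight.
    chunks = await retrieve_from(source, delay)
    return await compress_batch(chunks)

async def gather_compressed(sources: dict[str, float]) -> list[str]:
    batches = await asyncio.gather(
        *(retrieve_and_compress(name, delay) for name, delay in sources.items())
    )
    # gather preserves input order, so results stay grouped by source.
    return [chunk for batch in batches for chunk in batch]

compressed = asyncio.run(gather_compressed({"vector": 0.03, "keyword": 0.01}))
```

This ordering compresses before any global rerank, so it suits query-agnostic cleanup such as boilerplate removal or deduplication. Relevance-weighted budgets still have to wait for the reranker.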

---

### Multi-Stage Pipelines and Compression Placement

Some pipelines have multiple LLM calls — an extraction step, a reasoning step, a generation step. Each call is a potential compression point, but not all of them are worth compressing.

The extraction call often takes long context and produces short output. That's where compression earns its keep. The reasoning call typically works with structured or already-concise inputs — compressing there may remove information the reasoning step needs. The generation call works from the reasoning output, which is already compact.

So the rule is: compress at ingestion boundaries, not at every call. When raw, messy, long content enters the pipeline — from retrieval, from documents, from tool outputs — that's where compression pays off. Internal handoffs between pipeline stages that already deal in structured or concise data usually don't need it.

**Try This:** Map out your pipeline on paper. Mark every stage where content enters from outside the pipeline (retrieval, API responses, user uploads). Those are your compression candidates. Mark every internal handoff. Draw a line between them. Compression belongs in the first category; adding it to the second usually hurts more than it helps.

---

### Monitoring Compression in Production

You can't tune what you can't see. Compression pipelines need their own instrumentation, separate from the inference metrics you're probably already tracking.

At minimum, track: compression ratio per request (tokens in vs. tokens out), compression latency, and cache hit rate on the compression cache if you're running one. Over time, you want to see how compression ratio correlates with task performance — if you built the evaluation pipeline from the previous chapter, feed those scores back here.

The useful alert thresholds are: compression ratio shifting suddenly in either direction (compressor is broken or context structure changed), compression latency spiking (compressor is overloaded or an upstream change caused bigger inputs), and compression cache hit rate collapsing (your traffic pattern changed, invalidating the caching assumption).

Log the uncompressed context length too. If average context length is trending up, it means something upstream changed — retrieval is returning more chunks, documents got longer, prompts grew. Compression can absorb that growth for a while, but it's a signal worth catching early.
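
A rolling baseline is enough to catch the sudden ratio shifts described above. This is a minimal sketch: the class name, the `window` and `tolerance` defaults, and the 20-sample warm-up are all assumptions to tune against your own traffic.

```python
from collections import deque
from statistics import mean

class RatioShiftDetector:
    """Flag sudden shifts in compression ratio against a rolling baseline.

    window and tolerance are hypothetical defaults; tune them to your traffic.
    """
    def __init__(self, window: int = 200, tolerance: float = 0.15):
        self.history: deque[float] = deque(maxlen=window)
        self.tolerance = tolerance

    def observe(self, tokens_in: int, tokens_out: int) -> bool:
        """Record one request; return True if it deviates from the baseline."""
        if tokens_in <= 0:
            return False  # nothing was compressed, nothing to learn
        ratio = tokens_out / tokens_in
        # Require a minimum amount of history before alerting.
        alert = (
            len(self.history) >= 20
            and abs(ratio - mean(self.history)) > self.tolerance
        )
        self.history.append(ratio)
        return alert
```

Alerting on a single request is noisy; in practice you would likely page only when several consecutive observations trip the check.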

---

### Key Takeaways

- Compress after reranking, not just after retrieval — relevance scores should inform token budget allocation across chunks.
- Keep semantic cache lookups on uncompressed queries; only compress the context that reaches the model.
- Cache compression results for query-agnostic operations; never cache query-aware compression across different queries.
- Overlap compression compute with retrieval latency to minimize the impact on time-to-first-token in streaming pipelines.
- Compression belongs at ingestion boundaries — where external, messy content enters the pipeline — not at internal handoffs between stages.

### Try This

Take one real request from your production logs. Trace it through every stage of your pipeline and record, for each stage: the token count of the content being processed, whether any compression is applied, and what the latency of that stage is. Now identify the single highest token-count input in the trace that has no compression applied. Calculate what compressing that input to 60% of its original tokens would mean for your total context cost on that request. Then check whether compression latency at that stage would exceed the stage's current latency. That gap, cost savings versus latency addition, is the number that tells you whether integration work there is worth prioritizing.


---

## Chapter 8: Production Considerations

### Chapter Overview

Compression strategies that work beautifully in a notebook fall apart under production load — not because the math changes, but because real systems have budgets, SLAs, failure modes, and users who behave unpredictably. This chapter covers what actually changes when you move from prototype to production: how to measure compression's cost against its benefit, how to build systems that degrade gracefully, how to handle the edge cases that only appear at scale, and how to make compression observable so you can tell when something goes wrong.

---

### Measuring What Actually Matters

The naive metric for compression is compression ratio — tokens in versus tokens out. It's useful as a starting signal, but it doesn't tell you whether compression is earning its keep.

What matters in production is the *net* impact on your three real constraints: latency, cost, and output quality. Compression adds processing time on the front end. If your model call takes 800ms and your compression step takes 400ms, you've improved cost but degraded user experience. That tradeoff might be worth it, or it might not — but you need to measure it to know.

Build a simple accounting framework before you ship anything. Track pre-compression token count, post-compression token count, compression latency, total request latency, and a quality proxy (embedding similarity between compressed and original, or a lightweight judge score). Log these per-request. You want to be able to ask, after a week of production traffic: "Is compression helping or hurting, and by how much?"

```python
from dataclasses import dataclass

@dataclass
class CompressionMetrics:
    original_tokens: int
    compressed_tokens: int
    compression_latency_ms: float
    total_latency_ms: float
    quality_score: float  # 0.0 - 1.0

    @property
    def compression_ratio(self) -> float:
        # Fraction of tokens retained; guard against empty inputs.
        if self.original_tokens == 0:
            return 1.0
        return self.compressed_tokens / self.original_tokens

    @property
    def tokens_saved(self) -> int:
        return self.original_tokens - self.compressed_tokens

    def worth_it(self, token_cost_usd: float, latency_budget_ms: float) -> bool:
        # token_cost_usd is the price per input token.
        cost_savings = self.tokens_saved * token_cost_usd
        latency_overhead_ok = self.compression_latency_ms < latency_budget_ms
        return cost_savings > 0 and latency_overhead_ok and self.quality_score > 0.85
```

Log these to wherever your other application metrics live — Datadog, CloudWatch, a Postgres table you query with Metabase. The tool doesn't matter. The discipline of measuring does.
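
To answer the "helping or hurting, and by how much?" question after a week, the per-request logs need a roll-up. The sketch below assumes each record is a dict carrying the fields listed above; `summarize` and `usd_per_token` are illustrative names, and the summary ignores what the compressor itself costs to run.

```python
from statistics import median

def summarize(records: list[dict], usd_per_token: float) -> dict:
    """Roll per-request compression logs up into a net-impact summary.

    Each record is assumed to carry: original_tokens, compressed_tokens,
    compression_latency_ms, total_latency_ms, quality_score.
    """
    saved = sum(r["original_tokens"] - r["compressed_tokens"] for r in records)
    # Share of each request's wall-clock time spent in the compression step.
    overhead = [r["compression_latency_ms"] / r["total_latency_ms"] for r in records]
    return {
        "requests": len(records),
        "tokens_saved": saved,
        "usd_saved": saved * usd_per_token,
        "median_latency_share": median(overhead),
        "min_quality": min(r["quality_score"] for r in records),
    }
```

If you use an LLM-based compressor, subtract its per-request cost from `usd_saved` before drawing conclusions; a compressor that costs more than it saves will still show a positive number here.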

---

### Graceful Degradation

Compression can fail. A model used for summarization returns garbage. A regex-based cleaner mangles structured data. A retrieval step surfaces the wrong context. If your pipeline has no fallback, that failure propagates directly to the user.

The pattern that works: define a compression budget, attempt compression, and fall back to truncation if the result falls outside acceptable bounds. This isn't elegant, but it's robust.

```python
import logging

log = logging.getLogger(__name__)

def compress_with_fallback(text: str, max_tokens: int, compressor, tokenizer) -> str:
    try:
        compressed = compressor.compress(text)
        token_count = len(tokenizer.encode(compressed))

        if token_count > max_tokens * 1.1:  # 10% tolerance
            raise ValueError(f"Compression insufficient: {token_count} tokens")

        # estimate_quality is your quality proxy, returning a score in 0.0-1.0
        quality = estimate_quality(text, compressed)
        if quality < 0.8:
            raise ValueError(f"Quality too low: {quality:.2f}")

        return compressed

    except Exception as e:
        # Any failure, including the checks above, degrades to plain truncation.
        log.warning("Compression failed (%s), falling back to truncation", e)
        tokens = tokenizer.encode(text)
        return tokenizer.decode(tokens[:max_tokens])
```

The fallback here is truncation — crude, but predictable. A predictable bad outcome is almost always better than an unpredictable one. Your on-call engineer can reason about truncation at 2am. They cannot reason about a summarization model that hallucinated a nonsense summary.

> **Warning:** Don't skip the quality check on compression output. A compressor that silently produces low-quality output is worse than one that fails loudly — you'll ship degraded responses without knowing it.
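
The fallback code calls an `estimate_quality` function without defining it. One cheap, deterministic stand-in, offered here as an assumption rather than the recommended estimator, is to check how many "anchor" strings (numbers and capitalized words) from the original survive in the compressed text. It is not a faithfulness metric, and it over-counts sentence-initial words as entities.

```python
import re

_NUMBER = re.compile(r"\d+(?:[.,]\d+)*%?")
_ENTITY = re.compile(r"\b[A-Z][a-zA-Z0-9]+\b")

def estimate_quality(original: str, compressed: str) -> float:
    """Fraction of anchor strings from the original found in the compressed text.

    Anchors are numbers and capitalized words, a rough stand-in for the
    facts a compressor is most likely to drop. Returns 1.0 when the
    original has no anchors to check.
    """
    anchors = set(_NUMBER.findall(original)) | set(_ENTITY.findall(original))
    if not anchors:
        return 1.0
    kept = sum(1 for a in anchors if a in compressed)
    return kept / len(anchors)

score = estimate_quality(
    "Acme raised prices by 12% in March, affecting 4,500 customers.",
    "Acme raised prices 12% in March.",
)  # 0.75: the customer count was dropped
```

A check like this catches dropped figures and names quickly and costs almost nothing per request; pair it with a sampled judge-model score if you need to catch rewording errors too.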

---

### Caching Compressed Context

If the same base context gets compressed repeatedly — a system prompt, a knowledge base excerpt, a user's profile — you're paying the compression cost on every request. Cache the output instead.

The key for the cache should be a hash of the *original* content, not the compressed output. If the original changes, you want a cache miss. If it hasn't changed, you want a hit.

```python
import hashlib
import json

class CompressionCache:
    def __init__(self, backend):
        # backend: any client with Redis-style get(key) and setex(key, ttl, value)
        self.backend = backend

    def get_or_compress(self, text: str, compressor, ttl_seconds: int = 3600) -> str:
        # Key on the original text so any change to it produces a cache miss.
        key = f"compressed:{hashlib.sha256(text.encode()).hexdigest()[:16]}"
        cached = self.backend.get(key)

        if cached is not None:
            return json.loads(cached)["text"]

        compressed = compressor.compress(text)
        self.backend.setex(key, ttl_seconds, json.dumps({"text": compressed}))
        return compressed
```

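TTL matters here. Because the key is a hash of the original content, edited content misses the cache on its own; the TTL mainly bounds how long unused entries linger and how long output from an older compressor version keeps being served. Static content like system prompts can have long TTLs. Dynamic content (recent conversation history, live data pulled from a database) rarely repeats byte-for-byte, so give it a short TTL or skip caching it. The real staleness risk comes from keying on an identifier such as a document or user ID instead of the content: then the TTL is the only thing standing between you and stale context, which produces subtly wrong outputs that are hard to debug.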

> **Key Insight:** For multi-tenant systems, scope your cache keys by tenant or user where the content differs per-user. A compressed profile for User A should never be served to User B, even if the original texts are similar.
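
Scoping can live entirely in the key. The sketch below is one possible layout: `compressor_version` and `tenant_id` are hypothetical fields, and the `compressed:` prefix simply mirrors the cache class above.

```python
import hashlib
from typing import Optional

def scoped_cache_key(text: str, compressor_version: str,
                     tenant_id: Optional[str] = None) -> str:
    """Cache key for compressed context.

    compressor_version invalidates entries when the compressor changes;
    tenant_id keeps per-tenant content in its own namespace. Pass
    tenant_id=None only for content that is genuinely shared, such as a
    global system prompt.
    """
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    scope = f"tenant:{tenant_id}" if tenant_id is not None else "shared"
    return f"compressed:{compressor_version}:{scope}:{digest}"
```

Putting the version in the key means a compressor upgrade takes effect immediately instead of waiting for old entries to expire.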

---

### Handling the Edge Cases That Appear at Scale

In development, your inputs are clean. In production, they're not. Users paste in entire PDFs. They submit the same question 40 times with slight variations. They find creative ways to exceed every limit you thought you'd accounted for.

Three edge cases to handle explicitly before you ship:

**Very short inputs.** If a user sends a 50-token message, compression adds overhead with no benefit. Set a minimum token threshold below which compression is skipped entirely. Something around 200–300 tokens is usually right, but calibrate to your model's pricing and your compressor's fixed overhead cost.

**Repetitive inputs.** Some compression strategies perform poorly on highly repetitive text — not because they fail, but because they over-compress and lose the signal that the repetition was intentional. Conversation history where the user keeps rephrasing the same question is a good example. A quick entropy check before compression can catch this.

**Structured data in free text.** JSON, code blocks, and tables embedded in conversational context are compression landmines. A summarization compressor will try to paraphrase them. Don't let it. Detect structured sections before compression and pass them through unchanged, then re-splice after.

```python
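import re

def extract_structured_blocks(text: str) -> tuple[str, dict[str, str]]:
    """Replace fenced and inline code spans with placeholders.

    Returns the cleaned text and a placeholder -> original block mapping.
    JSON or tables that are not inside backticks need their own detectors.
    """
    blocks: dict[str, str] = {}

    def replacer(match: re.Match) -> str:
        key = f"__BLOCK_{len(blocks)}__"
        blocks[key] = match.group(0)
        return key

    cleaned = re.sub(r'```[\s\S]*?```|`[^`]+`', replacer, text)
    return cleaned, blocks

def reinsert_blocks(text: str, blocks: dict[str, str]) -> str:
    # A placeholder the compressor dropped or altered is silently skipped here;
    # check `key in text` first if you need to detect that.
    for key, value in blocks.items():
        text = text.replace(key, value)
    return text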
```
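
The first two edge cases, short inputs and repetitive inputs, can share one gate in front of the compressor. This is a sketch under assumptions: the entropy here is word-level Shannon entropy scaled by the log of the total word count, and `min_tokens` and `min_entropy` are hypothetical thresholds you would calibrate on your own traffic.

```python
import math
from collections import Counter

def normalized_entropy(text: str) -> float:
    """Word-level Shannon entropy divided by log2(total words).

    Values near 1 mean almost every word is distinct; values near 0 mean
    the same few words repeat over and over.
    """
    words = text.lower().split()
    total = len(words)
    if total < 2:
        return 1.0  # too short to call repetitive
    counts = Counter(words)
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return entropy / math.log2(total)

def compression_gate(text: str, token_count: int,
                     min_tokens: int = 250, min_entropy: float = 0.6) -> str:
    """Return 'skip', 'review', or 'compress' for an incoming text."""
    if token_count < min_tokens:
        return "skip"      # too short for compression to pay off
    if normalized_entropy(text) < min_entropy:
        return "review"    # repetitive: route to a gentler strategy
    return "compress"
```

What "review" means is up to you: deduplicating near-identical turns before compressing, or passing the text through untouched, are both reasonable choices.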

---

### Making Compression Observable

You cannot fix what you cannot see. Compression introduces a new class of bugs that are subtle: the model answers correctly but missed a detail that was compressed away. You won't catch these in logs that only record inputs and outputs. You need to log the compression step itself.

At minimum, log: original token count, compressed token count, which compressor was used, compression latency, and — if you have a quality estimator — the quality score. If you're running A/B tests between compressors, log the variant too.

Build a dashboard that shows compression ratio over time, broken down by content type if you have multiple content types flowing through the system. Sudden shifts in compression ratio often indicate a change in input distribution: users are sending shorter messages, or a new feature is injecting more tokens. Sudden drops in quality scores indicate a compressor regression or a new input pattern the compressor handles poorly.

Alerting thresholds worth setting: compression latency p99 above your budget (start at 3x your median), quality score median below 0.8 for any content type, and fallback rate above 5% (high fallback rate means your primary compressor is struggling).
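
Those three thresholds translate directly into a per-window check. The function below is a sketch: the name, the alert labels, and the idea of evaluating one window of logged values at a time are assumptions, and it needs at least two latency samples to compute a percentile.

```python
from statistics import median, quantiles

def compression_alerts(latencies_ms: list[float], quality_scores: list[float],
                       fallbacks: int, total_requests: int) -> list[str]:
    """Evaluate the three alert thresholds for one time window."""
    alerts = []
    p99 = quantiles(latencies_ms, n=100)[98]  # 99th percentile cut point
    if p99 > 3 * median(latencies_ms):
        alerts.append("latency_p99")
    if median(quality_scores) < 0.8:
        alerts.append("quality_median")
    if total_requests and fallbacks / total_requests > 0.05:
        alerts.append("fallback_rate")
    return alerts
```

Run it per content type rather than on pooled traffic; a regression in one content type is easy to lose inside a healthy global median.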

> **Try This:** Add a `compression_fingerprint` field to your request logs, a short hash of the compressed output, and store the compressed text itself under that hash in a short-retention store. When a user reports a bad response, you can look up the exact compressed context that was sent to the model, not just the original. This makes debugging dramatically faster.

---

### Key Takeaways

- Compression ratio alone is not a production metric. Track net latency, cost delta, and quality score together; compression that saves money but degrades quality is a net loss.
- Always implement a fallback. Truncation is not ideal, but it's predictable. Predictable failure is operationally manageable.
- Cache compressed context aggressively for static or semi-static content, and scope cache keys correctly in multi-tenant systems. Stale context is worse than no cache.
- Structured content — code blocks, JSON, tables — must be extracted before compression and reinserted after. Compressors that summarize these blocks introduce silent correctness errors.
- Observability on the compression step itself is non-negotiable at scale. You need logs and alerts on the compression layer, not just the final model call.

---

### Try This

Pick one prompt in your current system that's regularly hitting context limits or driving up cost. Add instrumentation around it this week: log the token count before and after any compression or truncation you're already doing, log the total request latency, and log a simple quality proxy (cosine similarity between embeddings of the original and the compressed text is fine). Run it for 48 hours of real traffic. You'll almost certainly discover that one input pattern is responsible for a disproportionate share of your token spend, and that's where to focus compression work first. Don't optimize across the board. Find the expensive case and solve that.


---

## Conclusion

Context optimization is not a late-stage concern. By the time latency is visibly hurting user experience and token costs are showing up in board decks, the window for clean architectural decisions has mostly closed. The engineers who treat compression as a first-class design choice — not a patch applied under pressure — end up with systems that scale without requiring a full rewrite. That is the practical upside of everything covered here: front-loading the thinking saves the scramble later.

What to tackle first depends on where the waste is. Most production systems bleed context in one of three places: retrieval pipelines returning too much, conversation history carrying stale state, or system prompts that accumulated over time without ever being pruned. Extractive methods (scoring and selecting) are the right starting point because they are fast to implement, easy to audit, and free of rewriting risk: whatever survives is verbatim source text, so the only failure mode is omission. Once those gains are captured and measured, abstractive rewriting and learned compression open the path to tighter information density without sacrificing fidelity. The evaluation framework matters throughout: compressing tokens while degrading task performance is not optimization, it is just breaking things slowly.

The field is moving. Soft token methods will improve as training pipelines mature. Context windows will keep expanding, which will not eliminate compression so much as shift where it is most valuable. The engineers who understand the tradeoffs at each layer — what can be compressed, what cannot, and at what cost — will navigate those changes without being caught off guard by them. That understanding is the durable takeaway. The specific tools and thresholds will shift; the underlying logic will not.

---

## Appendix A: Glossary

**Abstractive Compression** — A compression method that rewrites source content into a shorter form, generating new text that preserves meaning rather than selecting existing spans. Introduces semantic transformation risk not present in extractive approaches.

**BM25** — A probabilistic ranking function used in information retrieval to score document relevance against a query. Relies on term frequency and inverse document frequency; does not use embeddings. Common baseline in hybrid retrieval pipelines.

**Context Window** — The maximum number of tokens a model can attend to in a single request, encompassing both prompt and generated output. A hard limit set by the model's architecture and serving configuration; exceeding it truncates input or raises an error.

**Density (Information Density)** — The ratio of semantically load-bearing content to total tokens in a prompt segment. High-density text conveys more task-relevant information per token. The primary target metric for abstractive compression.

**Extractive Compression** — A compression method that reduces content by selecting and retaining original spans — sentences, passages, chunks — and discarding the rest. Preserves exact source language; introduces no semantic rewriting.

**Faithfulness** — The degree to which a compressed representation accurately preserves the factual and semantic content of the source. Evaluated independently from task performance; a compression can be unfaithful without causing measurable task degradation, and vice versa.

**KV Cache** — Key-value cache. In transformer inference, the stored attention keys and values for already-processed tokens. Prompt compression reduces the tokens that must be encoded, which directly reduces KV cache memory usage and prefill latency.

**Learned Compression** — Compression methods that use trained models to produce representations — typically soft tokens or embeddings — rather than selecting or rewriting natural language. Representations are not human-readable and are model-specific.

**LLMLingua** — A family of token-level compression methods from Microsoft Research that use a small proxy language model to score token importance and filter low-salience tokens from prompts before inference with a larger model.

**Perplexity Scoring** — Using a language model's per-token loss (negative log-likelihood) as a proxy for information content. High-perplexity tokens are less predictable given context and carry more information; low-perplexity tokens are predictable and carry less. Used in LLMLingua-style filtering to identify droppable tokens.

**RAG (Retrieval-Augmented Generation)** — An architecture that retrieves external documents or passages at inference time and injects them into the prompt as context before generation. Retrieved chunks are a primary target for compression, as retrieval often returns more content than the model needs.

**Reranking** — A second-stage retrieval step that scores an initial candidate set of retrieved chunks for relevance to a query and returns only the top-scoring subset. A form of extractive compression applied at the retrieval boundary.

**Semantic Chunking** — A document splitting strategy that partitions text at natural semantic boundaries — topic shifts, entity transitions — rather than at fixed character or token counts. Produces chunks with higher internal coherence, which improves both retrieval precision and compression fidelity.

**Soft Tokens** — Learned embedding vectors, usually a fixed number per input, that represent compressed source content. Prepended to a prompt and processed by a model trained to condition on them. Not interpretable as natural language; they typically require training the compressor, the target model, or both.

**Task Performance** — In compression evaluation, the accuracy, F1, or other task-specific metric measured on a benchmark after compression, compared against an uncompressed baseline. The primary downstream signal for whether compression is safe to deploy.

**Token Budget** — An explicit constraint placed on the maximum number of tokens allocated to a prompt segment or full prompt. Used to drive compression decisions; when retrieved context exceeds budget, compression or selection is applied to fit within it.
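The selection half of this definition can be sketched as a greedy fill. The example assumes chunks arrive already ranked by relevance and counts tokens by whitespace splitting, which is only a rough proxy for a model tokenizer; `fit_to_budget` and `count_tokens` are hypothetical helpers.

```python
def count_tokens(text):
    """Whitespace token count. Assumption: replace with the target
    model's tokenizer for real budgets."""
    return len(text.split())

def fit_to_budget(ranked_chunks, budget, count_fn=count_tokens):
    """Greedily keep chunks in rank order while they fit within the budget."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = count_fn(chunk)
        if used + cost > budget:
            continue  # too large; a smaller lower-ranked chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used

ranked = [
    "Refunds take five business days total.",
    "Customers must include the order number and original payment method.",
    "Tickets close after approval.",
]
selected, used = fit_to_budget(ranked, budget=10)
print(used, selected)
```

Skipping an oversized chunk rather than stopping at it trades strict rank order for better budget utilization. Stopping at the first overflow is the stricter alternative.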

---

## Appendix B: Tools & Resources

### Compression Libraries

**LLMLingua / LongLLMLingua** — Microsoft Research. Token-level prompt compression using a small proxy LLM to score token importance. LongLLMLingua adds question-aware coarse-to-fine compression and document reordering for long-context scenarios. Available on GitHub and via PyPI.

**Selective Context** — Lightweight library that filters redundant content from prompts by computing self-information scores with a small language model. Minimal dependencies, easy to integrate as a preprocessing step.

**RECOMP** — Research implementation of abstractive and extractive compressors trained specifically for RAG pipelines. Compressors are trained to produce summaries conditioned on a query.

### Retrieval & Reranking

**Cohere Rerank** — Hosted reranking API. Takes a query and a list of retrieved passages, returns relevance scores. Usable as a drop-in compression step at the retrieval boundary.

**FlashRank** — Lightweight, local reranking library using cross-encoder models. No API calls required; suitable for latency-sensitive or air-gapped deployments.

**LlamaIndex** — RAG orchestration framework with built-in support for chunk scoring, reranking, and node postprocessors. Compression logic can be inserted at multiple pipeline stages.

**LangChain** — General LLM orchestration framework with document compression abstractions, including `ContextualCompressionRetriever` for wrapping existing retrievers with filtering or summarization steps.

### Evaluation

**RAGAS** — Evaluation framework for RAG pipelines. Measures faithfulness, answer relevance, and context precision/recall. Useful for evaluating compression impact on retrieval-augmented tasks.

**TruLens** — Feedback function framework for LLM application evaluation. Supports custom metrics and integrates with multiple model providers for groundedness and relevance scoring.

**BERTScore** — Reference-based metric using contextual embeddings to measure semantic similarity between generated and reference text. Useful for evaluating abstractive compression fidelity without requiring exact lexical overlap.

### Infrastructure & Observability

**LangSmith** — LLM observability platform from LangChain. Traces token counts per pipeline step, making it straightforward to measure compression ratios in production and track regressions.

**Helicone** — Proxy-based LLM observability tool. Logs requests and responses with token counts and latency; useful for establishing baseline cost and latency benchmarks before compression is applied.

**OpenTelemetry** — Vendor-neutral instrumentation standard. Appropriate for instrumenting compression latency and token reduction as spans within existing observability stacks.

---

## Appendix C: Further Reading

**"Lost in the Middle: How Language Models Use Long Contexts"** — Liu et al., 2023. Documents the empirical finding that models underweight information placed in the middle of long contexts. Motivates position-aware compression and ordering strategies.

**"LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models"** — Jiang et al., 2023. Introduces the perplexity-based token filtering approach that underlies the LLMLingua family. Core paper for understanding token-level extractive compression.

**"RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation"** — Xu et al., 2023. Proposes training dedicated compressors for RAG contexts; compares abstractive and extractive approaches on downstream task performance.

**"Efficient Transformers: A Survey"** — Tay et al., 2022. Broad survey of architectural approaches to reducing attention complexity. Background for understanding why long-context inference is expensive and what the architecture-level alternatives to prompt compression look like.

**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** — Lewis et al., 2020. Original RAG paper. Establishes the baseline architecture that most production retrieval pipelines extend — and that compression is most commonly applied to.

**"Benchmarking Large Language Models in Retrieval-Augmented Generation"** — Chen et al., 2023. Evaluates LLM performance specifically on RAG tasks; useful for understanding which task types are most sensitive to context quality and, by extension, compression fidelity.

**"The Power of Noise: Redefining Retrieval for RAG Systems"** — Cuconasu et al., 2024. Counterintuitive finding that irrelevant retrieved context can sometimes improve generation. Informs how aggressively context should be filtered.

**"Compressing Context to Enhance Inference Efficiency of Large Language Models"** — Li et al., 2023. Introduces Selective Context, which prunes low-information content from the input using self-information computed by a small language model. The paper behind the Selective Context library listed in Appendix B.

**"Improving Language Models by Retrieving from Trillions of Tokens"** — Borgeaud et al., 2022 (RETRO). Demonstrates retrieval-conditioned generation at scale; relevant context for understanding chunking and retrieval boundary design decisions.

**"Attention Is All You Need"** — Vaswani et al., 2017. The transformer paper. Worth rereading specifically for the attention complexity analysis — O(n²) in sequence length — which is the foundational reason context window costs scale the way they do.
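The quadratic term in the entry above can be made concrete with a little arithmetic. The sketch below assumes attention cost is proportional to the number of entries in the n × n score matrix and deliberately ignores everything else (feed-forward layers, KV caching, optimized kernels), so it describes only the attention term, not end-to-end latency or billing. The function names are hypothetical.

```python
def attention_score_entries(n_tokens):
    """Entries in the n x n attention score matrix for one head in one layer."""
    return n_tokens * n_tokens

def relative_attention_cost(original_tokens, compressed_tokens):
    """Attention-term cost after compression, as a fraction of the original."""
    return attention_score_entries(compressed_tokens) / attention_score_entries(
        original_tokens
    )

# Halving an 8,000-token prompt quarters the attention-term cost
ratio = relative_attention_cost(8000, 4000)
print(ratio)  # 0.25
```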

**"Constitutional AI: Harmlessness from AI Feedback"** — Bai et al., 2022. Included not for its primary subject but for its use of model-generated critiques and revisions — a technique directly applicable to abstractive compression pipelines that use LLMs to rewrite source content.

**"Semantic Chunking for RAG"** — Greg Kamradt, 2023. Practical walkthrough of breakpoint-based semantic chunking using embedding similarity. Not a formal paper, but the implementation reference most practitioners actually use.

---

*Prompt Compression in Production — Version 1.0 — April 2026*
*By David Kelly Price | pyckle.co*

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*

---

## Related Blog Posts

- [Prompt Compression Is Going to Production](https://pyckle.co/blog/prompt-compression-is-going-to-production-the-benchmarks-still-arent-ready.html)
- [More Context Is Not Better Context](https://pyckle.co/blog/more-context-is-not-better-context.html)
- [68% of That LLM Bill Was Optional](https://pyckle.co/blog/sixty-eight-percent-of-that-llm-bill-was-optional.html)

