---
title: "Context Engineering for Developers"
subtitle: "How to Give AI the Right Information at the Right Time"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Software developers using AI coding assistants daily — comfortable with LLMs, frustrated with context failures, ready to be intentional about what goes in the window"
estimated_pages: 75
chapters:
  - "The Context Problem"
  - "What Goes in the Window"
  - "Context Hierarchy: Not All Tokens Are Equal"
  - "Dynamic Context: Assembling the Right Pieces"
  - "Retrieval as Context Management"
  - "Compression Without Distortion"
  - "Session Memory and Persistence"
  - "Context for Multi-File Tasks"
  - "Measuring Context Quality"
tags:
  - pyckle
  - ebook
  - context-engineering
  - llm
  - prompt-engineering
  - developer-tools
  - context-window
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Context Engineering for Developers

## How to Give AI the Right Information at the Right Time

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: The Context Problem
- Chapter 2: What Goes in the Window
- Chapter 3: Context Hierarchy — Not All Tokens Are Equal
- Chapter 4: Dynamic Context — Assembling the Right Pieces
- Chapter 5: Retrieval as Context Management
- Chapter 6: Compression Without Distortion
- Chapter 7: Session Memory and Persistence
- Chapter 8: Context for Multi-File Tasks
- Chapter 9: Measuring Context Quality
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

This book is about a specific kind of failure — the kind where you have a capable model, a clear question, and you still get a useless answer. Not because the model is bad. Because it didn't have what it needed to be good.

That failure has a name: a context problem. And once you understand it at a mechanical level, you start seeing it everywhere. In the hallucinated function that would have been correct if the model had seen the right file. In the refactor that broke three dependencies the assistant didn't know about. In the multi-step task that lost the thread halfway through because nothing anchored the session.

Context engineering is the practice of intentionally managing what goes into an AI's context window so that outputs are accurate, relevant, and grounded in reality. It sits between prompt engineering — which focuses on how you phrase requests — and model architecture — which is someone else's problem. Context engineering is the layer most developers underestimate, and it's where a disproportionate share of real-world AI quality comes from.

This guide is for developers who are already using AI assistants daily. You're not a skeptic trying to be convinced. You've seen the good and the bad. What you may not have done yet is think systematically about why the good happens and why the bad happens — and what you can do to shift that ratio.

The chapters are structured to build on each other, but they're also designed to be read independently if you know the problem you're trying to solve. Chapter 1 frames the problem. Chapters 2 and 3 cover the fundamentals. Chapters 4 through 8 get progressively more technical and applied. Chapter 9 wraps it up with measurement — because you can't improve what you can't evaluate.

There's no preamble fluff here. Every section earns its place. If something seems obvious, it's there because the obvious parts are the ones people skip — and skipping them is usually where the problems start.

---

## Chapter 1: The Context Problem

The model isn't the bottleneck.

That's a hard thing to accept after years of framing AI progress as a model quality problem — bigger parameters, better benchmarks, longer context windows. But for most developers using AI assistants in production work, the model is not where things break down. A coding assistant backed by a frontier model and given poor context will perform worse than a smaller model given excellent context. Every time.

The context window is the totality of what a model can see when generating a response. It includes your prompt, any prior conversation, code snippets, documentation, tool outputs, system instructions, and anything else that gets passed to the model at inference time. The model can only reason about what's in that window. It doesn't have persistent memory between sessions. It doesn't have access to your filesystem unless you give it that access explicitly. It doesn't know about the PR you merged last Thursday or the architectural decision your team made in Confluence. It knows what it was trained on — which is frozen — and what you put in the window — which is your job.

That's the context problem in its simplest form: the gap between what the model needs to know to be useful and what you actually give it.

### Why This Gap Exists

The gap exists for two reasons that compound each other.

First, most developers don't think about context explicitly. You write a prompt, maybe paste in a function, and expect the model to figure it out. That works well enough for simple tasks — rename this variable, explain this algorithm, write a unit test for this function. But as tasks grow in complexity and scope, the implicit context requirements grow faster than the explicit context you're providing. You know your codebase. You know the intent behind that function. You know that the `UserService` module is deprecated and anything touching it should use `AuthAdapter` instead. The model doesn't know any of that unless you tell it.

Second, putting the right context in the window is not a trivial problem. Larger codebases have hundreds of files. Conversations have hours of history. Documentation is scattered. The model's context window has hard token limits. You can't dump everything in and call it done — you'll hit the limit, you'll degrade quality with noise, and you'll waste money on tokens that don't contribute to a better answer. You have to be selective. And being selective well requires understanding what the model actually needs, which is the skill this book is about.

### The Context Gap in Practice

Consider a concrete scenario. You're working on a Python service that handles payments. You open your AI assistant and ask it to write a function that charges a customer. The model generates something reasonable — it creates a Stripe charge, handles the response, raises an exception on failure. Clean code.

But your actual Stripe integration doesn't use the Stripe library directly. Three years ago, your team wrapped it in an internal abstraction layer called `PaymentGateway` because you wanted to be able to swap payment processors. Every payment operation goes through `PaymentGateway.charge()`, which handles retry logic, idempotency keys, and audit logging that compliance requires. The model didn't generate any of that. It couldn't — it had no idea `PaymentGateway` existed.

Now you have a choice: paste in the correct code, write it yourself, or provide the model with enough context to do it right in the first place. The third option is almost always the fastest if you set it up correctly. But it requires knowing what to provide.

The `PaymentGateway` class definition, the interface it exposes, maybe one example of an existing payment function — that's probably less than 200 lines of code and it completely changes what the model produces. Instead of generic Stripe boilerplate, you get something that fits your actual architecture. That's not the model getting smarter. That's context engineering.

### Context Failures Versus Model Failures

There's a practical reason to distinguish these two failure modes: they have different fixes.

A model failure is when the model, given complete and accurate context, still produces a wrong or poor output. These are relatively rare with frontier models on well-defined tasks. They do happen — models hallucinate, make reasoning errors, get confused by ambiguous specifications. But for everyday development work, genuine model failure is not the dominant failure mode.

A context failure is when the model produces a wrong or poor output because it was missing information, had the wrong information, or had too much noise obscuring the signal. This is the dominant failure mode. The fix isn't a better model. It's better context.

When you're debugging a bad AI output, the first question should always be: what did the model not know? Not: is the model capable of this? Capability is almost never the issue. Information almost always is.

**> Key Insight**
> A hallucination is often a context failure in disguise. When a model invents a function name or an API endpoint, it's usually filling a gap you left open. Give it the real thing and the hallucination disappears.

### The Intentionality Shift

Most developers relate to AI assistants the way they relate to a search engine — you type something, you get something back, you evaluate the result. It's reactive. You're not thinking about what the search engine needs; you're just asking and filtering.

Context engineering requires a different mental posture. You have to think about what the model needs before you ask. What files are relevant to this task? What conventions does my codebase follow? What does the model need to know about my constraints — time, tech stack, existing patterns — to produce something I can actually use?

This is a small cognitive shift with disproportionately large returns. You're not doing more work overall. You're front-loading a small amount of context preparation instead of back-loading a large amount of output correction.

The developers who get the most out of AI assistants are not the ones who are best at prompting in the rhetorical sense — constructing elaborate instructions or magic phrases. They're the ones who are best at selecting and organizing relevant information before they ask. That's the skill. Everything in this book is a specific application of it.

### What This Book Will Not Solve

It's worth being clear about scope.

Context engineering will not make a model competent at tasks it fundamentally cannot do. There are things that require reasoning capabilities beyond what current models reliably achieve — novel research, complex multi-step mathematical proofs, subtle judgment calls that require accumulated real-world experience. Better context won't fix these.

It also won't help you if your problem is a poorly defined task. If you don't know what you want, the model can't know either. Context engineering assumes you have a clear task. If you don't, start there.

And it won't eliminate the need for human review. AI-generated code should be read by a human before it's deployed. Context engineering makes the output better and more trustworthy. It doesn't make it infallible.

What it will do: dramatically reduce the failure rate on everyday development tasks, make multi-step and multi-file operations work reliably, and give you a framework for diagnosing and fixing context-related quality problems when they appear.

---

**Chapter 1 Key Takeaways**

1. The context window is the totality of what a model can see at inference time. It is the only input that matters.
2. The gap between what a model needs and what you give it is the primary source of AI quality failures in practice.
3. Model failures and context failures have different fixes. Identifying which you're dealing with is the first diagnostic step.
4. Context engineering requires intentional, pre-task information selection — not just better prompting.
5. Better context almost always outperforms the marginal benefit of a better model on the same inadequate context.

**Try This — Chapter 1**

Pick a task where your AI assistant recently produced a mediocre or wrong output. Write down everything you knew at the time of the prompt that was relevant to the task. Now look at what you actually gave the model. Measure the gap. That gap is your starting point.

---

## Chapter 2: What Goes in the Window

Before you can optimize context, you need an accurate model of what context actually is. Most developers think of it as "the prompt plus some code." That's incomplete in ways that matter.

The context window of a modern language model is a flat sequence of tokens. Everything the model receives — system instructions, conversation history, files you've attached, tool call outputs, your most recent message — gets serialized into that sequence and processed together. There is no privileged internal structure from the model's perspective. A system prompt is not "more important" at a mechanical level; it's just tokens that happen to appear earlier. (We'll get into why position matters in Chapter 3.) Understanding this flat structure is the foundation for understanding how to work with it effectively.

### The Six Categories of Context

Context in a typical AI coding session comes from six distinct sources. Each has different characteristics, different costs, and different value profiles.

**System prompts** are the instructions that frame the model's role, behavior, and constraints. They appear before any user interaction. In managed AI tools they're often invisible to you; in API-based systems you control them directly. A well-designed system prompt sets up the model to behave consistently across a session — it might specify coding style, preferred patterns, output format requirements, and behavioral constraints like "never modify files without explicit confirmation." A poorly designed system prompt wastes tokens on things the model already does well, or contradicts instructions you'll give later.

**Conversation history** is the accumulated exchange of messages in a session. Every prior turn — your messages and the model's responses — typically gets included in subsequent requests. This is what gives a session the feeling of continuity. It also means that conversation history is an ever-growing consumer of your context budget. Long sessions accumulate history that dilutes the signal-to-noise ratio as early exchanges become less relevant to the current task.

**Attached files and code** are the explicit context additions you make per request. This is where most context engineering decisions happen: which files to include, which functions to highlight, which examples to show. The quality of these decisions determines a large fraction of output quality.

**Tool outputs** are increasingly common as AI systems integrate with external tools — file systems, search engines, code execution environments, APIs. When a model calls a tool and gets a result, that result gets added to the context. Tool outputs can be extremely valuable (the model gets real, current data) or extremely noisy (a tool dumps 10,000 tokens of logs into the window when 200 were relevant).

**Retrieved content** is context assembled via search — semantic search over documentation, vector search over a codebase, or keyword search over a knowledge base. We'll spend an entire chapter on retrieval, but it belongs in this taxonomy: retrieved content is context you didn't explicitly attach, but that gets assembled programmatically based on relevance to the task.

**Metadata and state** is the softer category: current file position, cursor location, recent edits, branch name, environment variables. Many AI coding tools inject this automatically. When done well it's invisible and useful. When done poorly it's just token burn.

### Token Economics

Every word, character, and piece of code in your context costs tokens. Tokens are not a theoretical abstraction — they're the unit of measure for what goes in, and they have hard limits.

The relationship between text and tokens is approximately 1 token per 4 characters for English text, though code tends to be more token-dense because of symbols, indentation, and structure. A 1,000-line Python file might cost 4,000–8,000 tokens depending on how dense the code is. A verbose docstring-heavy module could cost more than a tightly written equivalent.

Modern frontier models have context windows ranging from 32,000 tokens to over 1,000,000 tokens. The larger numbers sound like they eliminate the problem. They don't, for three reasons.

First, larger windows are slower and more expensive. Processing a 200,000-token context costs dramatically more than processing a 20,000-token context, both in latency and API pricing. For interactive tools this matters immediately; for batch operations it matters at scale.

Second, model quality degrades with context length in ways that aren't fully linear. This is sometimes called the "lost in the middle" problem — information that appears in the middle of a very long context receives less reliable attention than information at the beginning or end. We'll cover this in depth in Chapter 3, but the implication is that stuffing more tokens in doesn't always produce better results.

Third, most real-world codebases are much larger than even the biggest context windows. A mature service might have 500,000 lines of code across thousands of files. Even with a million-token window, you can't fit the whole thing. Selection is always required.

**> Key Insight**
> Token limits force you to make decisions about what matters. That's not a bug. It's a useful constraint that makes you think about relevance before you waste time getting the model to think about everything.

### What Models Actually Do with Context

Understanding the mechanism helps you make better decisions about what to include.

Transformers — the architecture underlying most current language models — process context through self-attention. At a high level, every token attends to every other token in the context, and the model uses those attention relationships to build its understanding of what each token means in context. This is why context is not just reference material sitting in a database — it actively shapes how the model interprets and generates text.

This has practical implications. When you include a code example in context, the model doesn't just "look it up" — it processes the example in relation to everything else in the window. A function signature becomes more interpretable when the module it belongs to is also present. A bug description becomes more actionable when the actual buggy code is visible. Context isn't just data; it's the material the model reasons with.

It also means that irrelevant context isn't neutral. It dilutes the signal. If you include three pages of code that aren't related to the task, those tokens compete with the tokens that are relevant. The model's attention is not infinitely precise — background noise matters.

### Relevance vs. Presence

The single most useful distinction in context engineering is between relevant context and present context.

Present context is easy. It's whatever you've put in the window — whatever is technically there. Relevant context is what would actually help the model perform the task correctly.

A developer working without a context strategy tends to add present context. They paste in the file they have open, the error message they're looking at, maybe a recent conversation snippet. Some of this is relevant. Some is not. The ratio is whatever it happens to be.

Context engineering is the practice of increasing relevance density — making sure a higher fraction of what's in the window is actually useful for the task at hand.

This requires knowing the task clearly enough to know what information it requires. A refactoring task requires different context than a debugging task. Adding a new feature requires different context than optimizing an existing one. The context selection logic follows the task type. We'll build that out in Chapter 4.

**> Warning**
> Don't confuse file size with context value. A 2,000-line file that's mostly irrelevant to the current task is worse context than 50 lines of directly relevant code. Bigger is not better.

### Format Matters

How you present context to the model matters, not just what you present.

Models are trained on enormous amounts of structured text — documentation, code, API references, comments, README files. They're pattern-matching machines, and they respond well to patterns they've seen a lot. This means that well-formatted context is easier for a model to process than equivalent content presented in a disorganized way.

For code, this means providing complete function signatures rather than truncated snippets, using proper indentation, and including type annotations where they exist. A function with proper typing tells the model more per token than the same function without it.

For documentation, this means structure. Bullet points and headers create visual organization that the model has seen associated with reference content during training. A paragraph with the same information as a properly formatted API reference is harder to use effectively.

For error messages, this means the full trace. The "relevant part" of a stack trace is often not obvious from a human perspective. Include the full trace and let the model identify what's relevant.

For natural language instructions, this means specificity over length. "Refactor this function to use the Repository pattern" is better context than two paragraphs explaining what the Repository pattern is in general terms. Assume the model knows the pattern. Tell it what you want done.

### What to Leave Out

Knowing what to exclude is as important as knowing what to include.

**Redundant context** — information that the model already has from training or from earlier in the conversation — wastes tokens without adding value. You don't need to explain what a Python decorator is. You don't need to explain REST conventions. You don't need to summarize a function you included in full.

**Speculative context** — files or documentation you "might need" — is usually a mistake. Include what you know is relevant. If you're wrong, you'll find out from the output and can correct.

**Stale context** — outdated versions of files, superseded documentation, old conversation turns — can actively mislead. It's not neutral noise; it's noise that points in a wrong direction.

**Organizational context** — team structures, project history, business rationale for decisions — is rarely relevant to a technical task. It fills tokens without improving output.

The discipline is choosing deliberately. Not "what might be useful?" but "what does this specific task require?"

---

**Chapter 2 Key Takeaways**

1. The context window is a flat token sequence. System prompts, history, files, tool outputs, and retrieved content are all just tokens.
2. Token economics are real — larger windows aren't free, and quality degrades at extreme lengths.
3. Models reason with context, not just from it. Irrelevant content is active noise, not neutral background.
4. Relevance density — the fraction of context that's actually useful — is the primary metric to optimize.
5. Format matters. Well-structured context is more efficiently processed than equivalent unstructured content.

**Try This — Chapter 2**

For your next three AI coding sessions, keep a running log of what you put in the context window and why. After each session, mark each piece of context as "high relevance," "medium relevance," or "low relevance" based on whether it actually appeared to influence the output. Estimate the token cost of each category. Most developers find they're spending 30–60% of their budget on low-relevance content.

---

## Chapter 3: Context Hierarchy — Not All Tokens Are Equal

If all tokens were processed equally, context engineering would be simpler. You'd optimize for relevance, maximize token count within budget, and call it done. But tokens are not equal. Their position in the sequence, their relationship to adjacent tokens, and their structural role all affect how reliably the model attends to them.

Understanding these dynamics is what separates context that looks correct from context that actually works.

### The Attention Gradient

Transformer models process context through attention mechanisms that, in practice, don't distribute attention evenly across all tokens. Two positions in particular receive disproportionate attention: the beginning of the context and the end.

This is sometimes called the "primacy-recency effect" by analogy with human memory, but the mechanism is different. In humans, primacy and recency effects arise from how memory encoding and retrieval work. In transformers, the pattern emerges from how positional encoding and attention patterns interact during training — models see beginnings and ends more often as reference points, and this shapes how they process them.

The practical implication: information you put at the beginning of the context or immediately before the model's generation point gets more reliable attention than information buried in the middle.

This matters most in long contexts. In a 2,000-token exchange it's not significant. In a 100,000-token context it can be the difference between the model correctly applying a constraint you specified and completely ignoring it.

**> Key Insight**
> If a constraint, requirement, or critical piece of information must be reliably respected, don't bury it in the middle of a long context. Put it near the beginning (as a system prompt or prominent instruction) or repeat it near the end, immediately before the task description.

### The "Lost in the Middle" Problem

Research on long-context language models has consistently found that information placed in the middle of a very long context is retrieved less reliably than information at either end. This effect is pronounced enough to affect real-world outputs and has been replicated across multiple model families.

This has a counterintuitive consequence: increasing context window size doesn't proportionally increase the useful information a model can work with. A model given 50,000 tokens of context with the key information in the middle may perform worse than the same model given 10,000 tokens with the key information well-positioned.

The "lost in the middle" effect should change how you structure context assembly, particularly for long sessions or large codebases. Critical information — the function being modified, the error to be fixed, the interface being implemented — belongs at the boundaries, not in the middle of a pile of supporting material.

Figure 1 illustrates a typical attention distribution across a long context window. Attention is highest at the beginning and end, with a valley in the middle region.

```
Attention weight (schematic)

High |██ ·  ·  ·  ·  ·  ·  ·  ·  ·  · ██|
     |                                    |
Low  |    ·  ·  ·· ·· ·  ·  ·  ·         |
     +------------------------------------+
     Start                              End
     Position in context window
```

*Figure 1: Schematic attention distribution across a long context window. Critical information should be placed at boundaries, not the middle.*

### Structural Signals

Models respond to structural signals in context — headers, delimiters, code blocks, docstrings — because these patterns were associated with specific types of content during training. A function signature inside a code block signals "this is an API to use." A comment in the form of a docstring signals "this describes how the function works." A bold header in markdown signals "this is a section boundary."

Using these structural signals intentionally can improve how reliably the model applies context. When you're providing reference material the model should treat as an API specification, format it like one. When you're providing a constraint, put it in a prominent structural position — not buried in a paragraph.

Some developers develop conventions for marking critical context explicitly:

```python
# CONSTRAINT: This service only handles USD. Do not generate currency conversion logic.

# CONTEXT: PaymentGateway wraps Stripe. Use PaymentGateway.charge(), never stripe.charge() directly.

# CURRENT TASK: Add retry logic to the refund function
```

This is a bit heavy-handed for short, focused sessions, but in long-running tasks where the context grows complex, explicit markers help the model reliably find what it needs.

**> Warning**
> Repeating critical information too many times can backfire. If you echo the same constraint five times in hopes that the model will respect it, you're adding noise and signaling — to both yourself and the model — that you don't trust the system. Fix the structural problem instead.

### Signal vs. Noise

In information theory, signal-to-noise ratio describes how much useful information exists relative to background interference. Context has a signal-to-noise problem. Every token that isn't relevant to the task is noise that competes with tokens that are.

The noise problem is compounded by the attention gradient: noisy tokens in the right position (beginning or end) can actually do more damage than noisy tokens in the middle, because they receive disproportionate attention. A verbose, unfocused system prompt that rambles for 2,000 tokens before getting to the actual constraints is a signal-quality problem. The model is paying attention, but it's paying attention to noise.

High-signal context is dense with relevant information. Every token earns its place. Low-signal context is padded with related-but-not-relevant material that looks useful but isn't.

The practical test: if removing a piece of context wouldn't change what the model needs to know to do the task correctly, remove it.

### Recency Bias in Conversations

Conversation history creates a specific form of the hierarchy problem: later messages receive more attention than earlier ones, even when earlier messages contain more relevant information.

In a session where you set up the task at the beginning and then issued follow-up instructions over twenty turns, the model is more reliably attending to the last two or three exchanges than to the setup. If the setup established important constraints — patterns to follow, things to avoid, architectural decisions to respect — those constraints may degrade in influence as the conversation grows.

This is why long sessions tend to drift. It's not that the model forgot your earlier instructions in any meaningful sense; they're still in the context. It's that recency bias means the weight of those earlier instructions diminishes relative to whatever just happened.

There are two responses to this problem. One is to periodically restate critical constraints — not as repetition for its own sake, but as deliberate anchor re-insertion. The other is structural: don't rely on conversation history to hold critical information. Keep it in the system prompt or in a persistent reference block that gets re-included with each request.

### Priority Tiers

A practical framework for thinking about context hierarchy uses three tiers based on criticality and reliability requirements:

**Tier 1 — Must Reliably Reach the Model**: Constraints that cannot be violated, the primary task description, critical type signatures or interfaces, and security requirements. This information goes in structurally prominent positions — system prompt or immediately adjacent to the task.

**Tier 2 — Should Reliably Reach the Model**: Supporting code context, relevant examples, related functions, documented conventions. This gets included in the main context body, structured clearly, and positioned toward the boundaries if possible.

**Tier 3 — Nice to Have**: Background documentation, historical context, explanatory material. This is the first thing to cut when you're approaching token limits. It's also the content most likely to end up in the "lost in the middle" zone.

Most context assembly mistakes are Tier 3 content being included when there's not enough budget for Tier 1 and 2 content to breathe.

**> Try This**
> Take your standard context template — whatever you usually include when starting an AI coding session — and assign every element to Tier 1, 2, or 3. Then check the positions: is your Tier 1 content at the boundaries? Is your Tier 3 content being included at all? Adjust accordingly and measure the difference in output quality.

### Context Salience and Examples

One category of context that consistently punches above its weight: examples. A single worked example of "how we do X in this codebase" is often worth more than a paragraph describing X, because models learn from patterns far more efficiently than from descriptions.

If your codebase has a convention for writing repository classes, include one well-chosen existing repository class rather than explaining the convention in prose. The model will extract the pattern from the example and apply it — typically more accurately than if you tried to describe the pattern explicitly.

This is particularly true for:
- Naming conventions
- Error handling patterns
- Test structure and fixture conventions
- Comment and documentation style
- API response formats

The salience advantage of examples comes from their concreteness. They show exactly what the output should look like, in the actual syntax and style of your codebase. An explanation tells; an example demonstrates.

---

**Chapter 3 Key Takeaways**

1. Tokens at the beginning and end of context receive more reliable attention than tokens in the middle.
2. The "lost in the middle" problem means larger context windows don't proportionally increase effective information capacity.
3. Structural signals — headers, code blocks, markers — affect how models categorize and weight context.
4. Conversation history creates recency bias; early constraints and requirements can lose influence as sessions grow.
5. Concrete examples are among the highest-signal context you can provide — more reliable than prose descriptions of the same pattern.

**Try This — Chapter 3**

Take a task where you normally write a paragraph of instructions. Replace half the prose with a concrete example — a before/after code snippet, an existing function that implements the target pattern, or a worked analog. Compare the output quality. In most cases you'll get better results with fewer tokens.

---

## Chapter 4: Dynamic Context — Assembling the Right Pieces

Static context — a fixed system prompt and a manually curated set of files — works for simple, repeating tasks. For anything more complex, it falls apart. The task changes, the relevant files shift, the requirements evolve, and a static context that was accurate at 9am is misleading by 2pm.

Dynamic context assembly is the practice of building the context window programmatically, selecting and composing relevant information based on the current task at runtime. It's what separates developers who get consistently good results from developers who get inconsistent ones.

### Why Static Context Fails at Scale

Static context has two failure modes that compound each other.

The first is staleness. Code changes. The function you included in your standard context template gets refactored. The interface you described in your system prompt gets updated. If your context is manually curated and not automatically refreshed, it drifts from reality. Stale context is worse than no context — it's confidently wrong.

The second is irrelevance. A static context can't anticipate every task type. If you front-load a general-purpose coding context with your most commonly relevant files, those files will be irrelevant to a significant fraction of tasks. Every irrelevant token in your standard context is noise you're paying for across every request.

Dynamic assembly addresses both problems. Context is assembled fresh for each task, drawn from authoritative sources, and scoped to what the current task actually needs.

### The Task-Context Relationship

Before you can assemble context dynamically, you need a model of what different task types require. The context requirements vary substantially:

**Debugging a specific error**: Requires the error message and full stack trace, the function(s) in the call stack, any relevant configuration, and the test or condition that triggered it. Does not require broad codebase context.

**Adding a new feature**: Requires the interface where the feature will be added, existing similar features as examples, any relevant data models, and the patterns used by adjacent code. May require entry points or routing if the feature involves new API endpoints.

**Refactoring a function**: Requires the function itself, every call site, its dependencies, and the target pattern. A refactor that touches call sites without knowing about them is incomplete.

**Writing tests**: Requires the unit under test, the test framework in use, existing tests for similar units (as examples), any fixtures or test utilities, and mock/stub patterns used by the codebase.

**Code review or explanation**: Requires the code being reviewed, any interfaces it implements, the documented conventions it should follow, and context about what it replaced or extends if there's meaningful history.

None of these task types benefits from a generic "here's the whole project" context dump. Each has a specific profile. Building a context assembly strategy means mapping task types to their context profiles.

**> Key Insight**
> Context assembly should be driven by task type, not by habit. The question is always: what does this specific task require to be done correctly, and where does that information live?

### Rule-Based Assembly

The simplest form of dynamic assembly is rule-based: you define rules that map task characteristics to context components, and apply them mechanically.

A rule-based system might look like this:

```python
class ContextAssembler:
    def assemble(self, task: Task) -> Context:
        ctx = Context()

        # Always include: system constraints, current file
        ctx.add(self.system_constraints)
        ctx.add(self.current_file(task.target_file))

        # Task-specific additions
        if task.type == "debug":
            ctx.add(self.error_trace(task.error))
            ctx.add(self.call_stack_files(task.stack_trace))

        elif task.type == "feature":
            ctx.add(self.similar_features(task.description, limit=2))
            ctx.add(self.relevant_models(task.target_module))
            ctx.add(self.interface_definitions(task.target_file))

        elif task.type == "refactor":
            ctx.add(self.all_call_sites(task.target_function))
            ctx.add(self.dependency_graph(task.target_function))

        elif task.type == "test":
            ctx.add(self.existing_tests(task.target_file))
            ctx.add(self.test_fixtures(task.target_module))
            ctx.add(self.test_utilities())

        # Apply token budget
        return ctx.trim_to_budget(self.token_budget)
```

This is obviously simplified, but the structure is sound. Task type determines what gets included. A token budget enforces selectivity. The assembly logic is explicit and auditable.

Rule-based assembly works well when task types are predictable and the cost of implementing retrieval is not worth it. For many teams and projects, this is the right first step — systematic without being over-engineered.

### Retrieval-Based Assembly

For larger codebases where rules alone can't anticipate which specific files or functions are relevant, retrieval-based assembly extends the rule-based approach with search.

Instead of hardcoding which files to include for a "feature" task, a retrieval-based system searches the codebase for content relevant to the task description:

```python
async def assemble_feature_context(task: Task) -> Context:
    ctx = Context()
    ctx.add(system_constraints)
    ctx.add(current_file(task.target_file))

    # Retrieve similar code based on task description
    similar_code = await code_search(task.description, limit=5)
    ctx.add(similar_code)

    # Retrieve relevant interfaces
    interfaces = await interface_search(task.target_module)
    ctx.add(interfaces)

    # Retrieve relevant data models via import graph
    models = await import_neighbors(task.target_file)
    ctx.add(models.filter(type="model"))

    return ctx.trim_to_budget(token_budget)
```

Retrieval-based assembly is more powerful but has failure modes that rule-based systems don't: search can return irrelevant results, which are then included as noise. Chapter 5 covers retrieval in depth, including how to make search reliable enough to trust for context assembly.

### Priority and Budget Management

Any dynamic assembly system needs a mechanism for managing the token budget when the total desired context exceeds the available window.

A simple approach is priority-weighted trimming:

```python
class Context:
    def __init__(self):
        self.items: list[ContextItem] = []

    def add(self, item: ContextItem, priority: int = 2):
        item.priority = priority
        self.items.append(item)

    def trim_to_budget(self, budget_tokens: int) -> "Context":
        # Sort by priority descending, then add until budget exhausted
        sorted_items = sorted(self.items, key=lambda x: x.priority, reverse=True)

        result = Context()
        remaining = budget_tokens

        for item in sorted_items:
            if item.token_count <= remaining:
                result.items.append(item)
                remaining -= item.token_count
            elif item.priority == 3:  # Critical — truncate rather than skip
                truncated = item.truncate_to(remaining)
                result.items.append(truncated)
                remaining = 0
                break

        return result
```

Priority 3 is critical (system constraints, primary task), priority 2 is supporting context, priority 1 is optional background. When the budget is exceeded, low-priority items get dropped first. Critical items get truncated if necessary rather than omitted.

**> Warning**
> Be careful with truncation of code. A truncated function that shows only the signature and the first few lines might be worse than no function at all — it provides false confidence that the relevant code is present. Truncation works better for documentation and prose than for code.

### Context Invalidation

Dynamic assembly is only as good as the freshness of the data it draws from. If your context assembly system pulls from a search index, that index needs to stay current with the codebase. Stale indexes produce stale context, which produces wrong outputs.

Context invalidation — knowing when to refresh which components — is an often-overlooked part of dynamic context systems. The right approach depends on the size and velocity of the codebase:

- **Small, low-velocity codebases**: Re-index on every session start. The cost is negligible.
- **Medium codebases**: Re-index on git commits. Hook into your CI system or pre-commit hooks.
- **Large, high-velocity codebases**: Incremental indexing triggered by file changes. More complex, but necessary for accuracy.

Whatever the approach, the goal is that the context assembly system is pulling from an index that reflects the actual current state of the codebase. A context that's accurate about yesterday's code is wrong about today's.

### Composition Patterns

A few patterns for context composition that consistently work well in practice:

**Anchor + Radiate**: Start with the file or function that's directly involved in the task (the anchor), then expand outward to its direct dependencies, call sites, and related modules (the radiating layer). The anchor is always Tier 1; the radiating layer is Tier 2. This keeps the context focused while ensuring necessary surrounding structure is present.

**Example + Spec**: For tasks that require following existing patterns, include one high-quality example of the target pattern alongside the specification or interface. The example shows what the output should look like; the spec shows what it must satisfy.

**Error + Context**: For debugging tasks, always pair the error message with the code that produced it. An error message alone tells the model what went wrong but not where. The code context tells the model what could have caused it. Both together enable a diagnosis.

**History-Sensitive Assembly**: For tasks in ongoing sessions, notice when the most relevant context was already discussed and summarized earlier in the conversation. Rather than re-including the raw files, reference the existing summary. This avoids redundancy and preserves budget for new material.

---

**Chapter 4 Key Takeaways**

1. Static context degrades through staleness and irrelevance. Dynamic assembly solves both problems by selecting context at runtime.
2. Different task types require different context profiles. Map your common task types to their specific context requirements.
3. Rule-based assembly is the right starting point — systematic without requiring search infrastructure.
4. Retrieval-based assembly scales to larger codebases but requires reliable search to avoid introducing noise.
5. Token budget management should be priority-weighted: critical context is preserved under compression, optional context is dropped first.

**Try This — Chapter 4**

Write a context assembly function for your most common AI coding task type. Make it explicit: what files, what components, what supporting material does the task require? Implement it as a simple script that assembles and prints a context block. Use it for five sessions and observe whether the outputs improve.

---

## Chapter 5: Retrieval as Context Management

Retrieval is the mechanism that makes context assembly scale. Without it, you're limited to what you can manually identify and include. With it, the system can find relevant code and documentation that you didn't know to include — the function in a module you forgot about, the pattern established in a file you haven't touched in six months, the documented convention buried in a wiki page you've never opened.

Retrieval-augmented generation (RAG) is the term for architectures that include a retrieval step between a user query and a model response. In practice, for AI coding assistants, retrieval means searching your codebase and returning the most relevant chunks to include in context. The quality of that search is directly proportional to the quality of the assembled context.

### How Retrieval Works

At the core of retrieval-based context management is an index — a precomputed data structure that makes search fast and accurate.

For semantic search (finding code by meaning rather than exact keywords), the most common approach is a vector index:

1. **Chunking**: The codebase is split into pieces — functions, classes, methods, documentation blocks. Each chunk is a candidate for retrieval.
2. **Embedding**: Each chunk is converted to a dense vector representation using an embedding model. The embedding model maps the meaning of the text to a point in high-dimensional vector space.
3. **Storage**: Vectors are stored in a vector database (ChromaDB, Pinecone, Weaviate, pgvector, and others). The database supports fast approximate nearest-neighbor search.
4. **Query**: At request time, the query (typically the task description or a reformulation of it) is embedded into the same vector space, and the database returns the chunks whose embeddings are nearest — meaning most semantically similar — to the query.

The result is a list of code chunks ranked by semantic similarity to the query. The top results go into context.

This is powerful because it finds relevant code even when the exact words don't match. A query for "payment retry logic" might surface a function called `handle_charge_failure` that the developer never would have thought to search for manually.

### Keyword vs. Semantic Search

Semantic search is not always better than keyword search. They have different strengths and complement each other.

**Keyword search** (BM25, TF-IDF, exact match) is excellent at finding exact function names, class names, error codes, and specific strings. When you're looking for `PaymentGateway.charge()` calls across the codebase, keyword search finds them reliably. Semantic search may miss them if the embedding space doesn't closely associate the query with those exact tokens.

**Semantic search** is better at conceptual queries — "how does the authentication middleware work," "where is rate limiting implemented," "show me examples of the repository pattern." These queries describe meaning, not text, and semantic search finds meaning.

Hybrid search combines both signals, typically using Reciprocal Rank Fusion (RRF) to merge the ranked lists from keyword and semantic search into a single list. Hybrid search consistently outperforms either approach alone for code retrieval tasks.

```python
def hybrid_search(query: str, k: int = 5) -> list[CodeChunk]:
    # Parallel search
    semantic_results = vector_search(query, k=k*2)
    keyword_results = bm25_search(query, k=k*2)

    # Reciprocal Rank Fusion
    scores = defaultdict(float)
    constant = 60  # RRF constant, typically 60

    for rank, chunk in enumerate(semantic_results):
        scores[chunk.id] += 1.0 / (constant + rank + 1)

    for rank, chunk in enumerate(keyword_results):
        scores[chunk.id] += 1.0 / (constant + rank + 1)

    # Sort by combined score, return top k
    ranked = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [get_chunk(chunk_id) for chunk_id, _ in ranked[:k]]
```

**> Key Insight**
> For codebase search, use hybrid search. Pure semantic search misses exact name matches. Pure keyword search misses conceptual matches. Hybrid handles both.

### Chunking Strategy

How you split code before indexing substantially affects retrieval quality. Poor chunking is one of the most common reasons retrieval underperforms — and it's fixable.

**Bad chunking — fixed character counts**: Splitting every 500 characters produces chunks that cut in the middle of functions, separate type signatures from implementations, and destroy the semantic coherence that makes retrieval accurate. Fixed character chunking is fast to implement and consistently mediocre.

**Better chunking — AST-based splitting**: Parse the code with a language-specific parser and split at logical boundaries — functions, methods, classes. Each chunk is a complete semantic unit. This is more work to implement but dramatically improves retrieval precision.

```python
import ast

def chunk_python_file(file_path: str) -> list[CodeChunk]:
    source = Path(file_path).read_text()
    tree = ast.parse(source)
    chunks = []

    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            start_line = node.lineno - 1
            end_line = node.end_lineno
            chunk_source = "\n".join(source.split("\n")[start_line:end_line])

            chunks.append(CodeChunk(
                id=f"{file_path}::{node.name}",
                content=chunk_source,
                file_path=file_path,
                name=node.name,
                type="function" if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) else "class",
                start_line=node.lineno,
                end_line=node.end_lineno
            ))

    return chunks
```

**Even better — context-augmented chunks**: Include a header with the file path, module name, and enclosing class name (for methods). Embeddings of chunks with this header encode more contextual information and retrieve more accurately.

```
# File: payments/gateway.py
# Module: payments.gateway
# Class: PaymentGateway

def charge(self, amount: int, customer_id: str, idempotency_key: str) -> ChargeResult:
    """Process a payment charge through the configured processor."""
    ...
```

### Re-Ranking

First-stage retrieval — the vector search — optimizes for speed. It finds candidates efficiently using approximate nearest-neighbor algorithms. But the candidates aren't always perfectly ranked. A cross-encoder re-ranker can improve the final ranking by doing a more expensive, more precise relevance comparison between the query and each candidate.

Cross-encoder re-ranking takes the top N results from first-stage retrieval (typically 20–50) and re-scores them using a model that processes the query and candidate together, rather than separately. The computational cost is manageable because you're only re-scoring a small candidate set.

In practice, re-ranking improves precision noticeably for technical queries and is worth implementing when context budget is tight — when you can only include 5 chunks and the difference between top-5 and correct-top-5 matters.

### When Retrieval Fails

Retrieval is not magic. It has specific failure modes:

**Query-document mismatch**: The query doesn't use the same terminology as the codebase. A query about "order processing" won't retrieve code that uses "transaction management" if the embeddings don't map these concepts together. Solution: query expansion — generate multiple reformulations of the query and search with all of them.

**Stale index**: The index doesn't reflect recent code changes. Solution: incremental re-indexing on file changes.

**Chunk boundary problems**: A critical function is split across two chunks, neither of which is complete enough to be useful. Solution: AST-based chunking plus overlap (include a small number of lines from adjacent chunks).

**Retrieval of irrelevant "similar-looking" code**: The search returns code that uses the same words but in a different context. Solution: metadata filtering — limit retrieval to specific directories, modules, or file types when the task is scoped.

**> Warning**
> Retrieved context that looks relevant but isn't is more dangerous than no context. The model will reason from it confidently. Always validate retrieval quality before trusting it in production context assembly pipelines.

### Practical Retrieval Infrastructure

For a team building retrieval-based context management, the minimum viable infrastructure is:

1. An embedding model (a hosted API or a local model like `nomic-embed-text` or `all-MiniLM-L6-v2`)
2. A vector store (ChromaDB for simplicity, pgvector for Postgres integration, or a hosted service)
3. An indexing pipeline (runs on code change or on demand)
4. A retrieval function that accepts a query and returns ranked chunks
5. A token counter to ensure results fit in budget before inclusion

This is a few hundred lines of code. It's not a large infrastructure investment for the quality improvement it provides.

---

**Chapter 5 Key Takeaways**

1. Retrieval makes context assembly scale by finding relevant code that wasn't manually identified.
2. Hybrid search — combining semantic and keyword signals — outperforms either approach alone for code retrieval.
3. Chunking strategy is a first-order determinant of retrieval quality. AST-based chunking outperforms fixed character counts significantly.
4. Re-ranking improves precision on a small candidate set without the cost of re-ranking the full index.
5. Retrieval has specific failure modes that must be understood and mitigated for production use.

**Try This — Chapter 5**

Set up a minimal semantic search over a codebase you're actively working on. Use ChromaDB and any embedding model accessible via API. Index the codebase at the function level. Then, for your next five coding tasks, run the task description through search and look at what it returns. Evaluate each result: is it actually relevant? This will calibrate your intuition for where retrieval helps and where it fails.

---

## Chapter 6: Compression Without Distortion

The context window has a fixed budget. The information relevant to a task often exceeds it. Compression is how you fit more into less space without losing what matters.

Compression in the context window sense means reducing the token count of a piece of content while preserving its usefulness — the information the model needs to produce a correct output. The challenge is that not all compression is equal. Some techniques preserve signal while reducing noise. Others reduce both signal and noise proportionally. And some — applied carelessly — reduce signal faster than noise, which is the worst outcome.

### Two Kinds of Compression

The first kind is **lossy compression**: you deliberately discard information, accepting that some content is lost. The bet is that what you discard was low-value and what you keep is high-value. Summarization is lossy. Truncation is lossy.

The second kind is **lossless compression**: you represent the same information in fewer tokens. Better formatting is sometimes lossless — removing redundant whitespace, condensing verbose docstrings, eliminating duplicate information. This is the better starting point when possible, because there's no risk of discarding something important.

In practice, you'll use both. The question is knowing when each is appropriate and how to execute it without distorting the signal.

### Summarization

Summarization is the most common form of lossy compression for context. You take a long piece of text — a document, a conversation history, a file — and replace it with a shorter version that captures the key points.

For conversation history, summarization is almost always the right move once a session grows past a certain length. A conversation that started with "help me build a payment module" and has gone through twenty turns of design discussion, implementation, and debugging can be summarized into a paragraph that preserves the decisions made and the current state without the exploratory back-and-forth that's no longer relevant.

```
Session Summary (Turns 1-22):
Implementing PaymentGateway wrapper around Stripe for the payments module.
Key decisions:
- Using PaymentGateway.charge() for all payment operations, never Stripe directly
- Idempotency keys are generated by the caller, not the gateway
- Retry logic is handled inside the gateway with exponential backoff, max 3 attempts
- Current state: charge() and refund() implemented and tested; webhook handler in progress
Open questions: How to handle partial refunds on multi-item orders
```

That summary might be 120 tokens. The conversation it replaced was 8,000. The model picks up the session with accurate context about what was built and what's still needed.

**> Key Insight**
> Summarize conversation history aggressively. Decisions and current state are high signal. The exploration that led to those decisions is usually low signal once the decision is made.

### Code Compression

Code compression is harder than text compression because code has specific semantics that prose does not. You can paraphrase a paragraph. You cannot paraphrase a function signature without potentially changing what it means.

That said, there are effective approaches to reducing code token count without losing relevant information:

**Interface-only representation**: For functions or classes that provide an API the model needs to call but that aren't the subject of modification, include only the signature and docstring, not the implementation.

```python
# Full (180 tokens):
def charge(self, amount: int, customer_id: str, idempotency_key: str) -> ChargeResult:
    """Process a payment charge."""
    logger.info(f"Charging {amount} for customer {customer_id}")
    try:
        response = self._stripe_client.payment_intents.create(
            amount=amount,
            currency="usd",
            customer=customer_id,
            idempotency_key=idempotency_key,
            metadata={"internal_id": str(uuid4())}
        )
        return ChargeResult(
            success=True,
            charge_id=response.id,
            amount=response.amount
        )
    except stripe.error.CardError as e:
        logger.error(f"Card declined: {e.user_message}")
        return ChargeResult(success=False, error=e.user_message)
    except stripe.error.StripeError as e:
        logger.error(f"Stripe error: {str(e)}")
        raise PaymentGatewayError(str(e)) from e

# Compressed (35 tokens):
def charge(self, amount: int, customer_id: str, idempotency_key: str) -> ChargeResult:
    """Process a payment charge. Raises PaymentGatewayError on Stripe errors."""
    ...
```

The compressed version tells the model everything it needs to know to call this function correctly. The implementation is irrelevant unless the model is being asked to debug or modify it.

**Selective inclusion of class methods**: For large classes, include only the methods relevant to the task, not the full class. Pair it with a comment listing the omitted methods so the model knows they exist.

```python
class PaymentGateway:
    # Also contains: refund(), webhook_verify(), get_customer(), update_customer()

    def charge(self, amount: int, customer_id: str, idempotency_key: str) -> ChargeResult:
        """..."""
        ...

    def get_charge(self, charge_id: str) -> ChargeResult:
        """..."""
        ...
```

**Comment pruning**: Remove comments that restate what the code does (if the code is readable), but preserve comments that explain why or document non-obvious behavior.

**> Warning**
> Don't compress code that's the subject of the task. If the model is supposed to modify a function, it needs the full function — truncating it creates risk of the model "completing" the truncated version incorrectly or missing the section that needs modification.

### Documentation Compression

Documentation — README files, wiki pages, API references — is often verbose. It's written for humans who need context, orientation, and examples. For model context, you typically need only the factual content: what a function does, what arguments it accepts, what it returns, what errors it can throw.

One technique: use the model itself to pre-compress documentation. Run the documentation through a summarization prompt once and store the compressed version. Use the compressed version in context instead of the original.

```
Compress the following API documentation to the minimum content necessary for a developer to correctly call the API. Preserve all parameter names, types, return types, error conditions, and behavioral constraints. Remove prose explanations, examples, and rationale.
```

This preprocessing step is done once, not at query time, so it doesn't add latency to requests. The result is a version of the documentation that's 30–50% smaller while retaining all the information relevant to calling the API correctly.

### Conversation History Compression Strategies

Conversation history grows monotonically. Every turn adds tokens. Without management, a long coding session eventually exceeds the context budget, and older turns start being truncated from the beginning — which means the model loses the task setup and early decisions.

Three strategies for managing conversation history:

**Rolling window**: Keep only the last N turns in context. Simple to implement, but loses early decisions and can cause the model to lose track of the original task in long sessions.

**Selective retention**: Keep the initial task description (always high value), any explicitly stated constraints, and the last M turns. Drop the exploratory middle. This preserves the anchors while keeping context fresh.

**Progressive summarization**: When conversation history exceeds a threshold, summarize the oldest segment and replace those turns with the summary. This preserves semantic content without the raw turn-by-turn history.

Progressive summarization is the most sophisticated and the most effective for long sessions:

```python
class ConversationManager:
    def __init__(self, max_tokens: int = 8000, summary_threshold: int = 6000):
        self.turns: list[Turn] = []
        self.summaries: list[str] = []
        self.max_tokens = max_tokens
        self.summary_threshold = summary_threshold

    def add_turn(self, role: str, content: str):
        self.turns.append(Turn(role=role, content=content))

        if self.current_token_count() > self.summary_threshold:
            self._compress_oldest_turns()

    def _compress_oldest_turns(self):
        # Take oldest 10 turns, summarize them
        to_compress = self.turns[:10]
        summary = summarize_turns(to_compress)
        self.summaries.append(summary)
        self.turns = self.turns[10:]

    def build_context(self) -> str:
        parts = []
        if self.summaries:
            parts.append("## Prior Session Summary\n" + "\n".join(self.summaries))
        for turn in self.turns:
            parts.append(f"**{turn.role}**: {turn.content}")
        return "\n\n".join(parts)
```

### What Not to Compress

Not everything should be compressed. Some content loses too much value in compression to be worth the token savings:

**Error messages and stack traces**: Compress these and you risk losing the specific error type, the exact line number, or the variable value that makes the trace actionable. Include them in full.

**Type signatures and interfaces**: These are already compressed. A well-written type signature conveys maximum information per token. Summarizing it further usually means losing precision.

**Security constraints and requirements**: Never summarize security requirements. If there's a constraint that certain data must never be logged, that must be present verbatim and prominently. Compression introduces ambiguity. Ambiguous security constraints get ignored.

**The primary task description**: The thing you're asking the model to do should be stated clearly and completely, not compressed. Compression serves supporting material, not the core request.

---

**Chapter 6 Key Takeaways**

1. Compression is necessary because relevant information exceeds context budget. The goal is reducing tokens while preserving signal.
2. Lossless compression (reformatting, deduplication) is always preferable to lossy compression (summarization, truncation).
3. Interface-only representation of code reduces tokens dramatically with minimal signal loss for APIs the model calls but doesn't modify.
4. Progressive summarization of conversation history is the best strategy for long sessions.
5. Don't compress error messages, type signatures, security requirements, or the primary task description.

**Try This — Chapter 6**

Take a session context that's approaching or exceeding your token budget. Identify the three largest components by token count. For each, determine: can it be losslessly compressed? If so, compress it. If not, is it high or low signal? Low-signal, high-token content is the first candidate for removal or summarization.

---

## Chapter 7: Session Memory and Persistence

Every time you start a new conversation with an AI assistant, you start from zero. The model has no memory of the previous session. The context decisions you don't make explicitly, the patterns that have been established, the architectural decisions from last week's session — none of it is automatically available. You are, in a very literal sense, reintroducing yourself every time.

This is not a theoretical limitation. It's a practical problem that shows up in how AI assistants perform over time on evolving codebases. Without a memory strategy, every session is its own island. The developer has to carry all the relevant history in their head and re-inject it manually. That's cognitive overhead, it's error-prone, and it means the early turns of every session are spent on setup rather than work.

Session memory and persistence is the practice of capturing, storing, and re-injecting information that would otherwise have to be established manually.

### What Needs to Be Remembered

Not everything deserves persistence. The value of a piece of session information depends on how durable it is and how expensive it is to re-establish.

High-persistence, high-reinstatement-cost information — the kind that benefits most from explicit memory:

- **Architectural decisions**: "We're using the Repository pattern throughout. All database access goes through repositories in `db/repositories/`."
- **Project-specific conventions**: "Error handling follows this pattern: catch at service layer, log, return a typed Result object. Don't let exceptions propagate to controllers."
- **Constraints and requirements**: "This service runs on Python 3.10. Don't use match statements or structural pattern matching."
- **Current task state**: "We're in the middle of implementing the webhook handler. ChargeCreated and RefundCreated events are done. ChargeRefunded is next."
- **Decisions made and why**: "We chose to store idempotency keys in Redis rather than the database because the database lookup was adding 40ms to every payment request."

Low-persistence or low-reinstatement-cost information — typically not worth explicit memory:

- General knowledge the model already has from training
- Content that's already in the codebase and easily retrievable
- Intermediate outputs that were transient

### Types of Memory Architecture

Memory architectures for AI systems generally fall into four categories, each with different characteristics:

**In-session context**: Information maintained within a single session through the conversation history. This is the default, requires no additional infrastructure, and has the obvious limitation that it doesn't persist across sessions.

**External short-term memory**: A file or database that persists session state across conversations. At the simplest, this is a text file that gets prepended to each session's system prompt. More sophisticated versions include structured state with timestamps and relevance scores.

**External long-term memory**: A persistent knowledge store that accumulates over time — architecture decisions, conventions, project context — and gets queried at session start to populate the initial context. This is retrieved, not fully included, which means it scales to large amounts of accumulated knowledge.

**Codebase as memory**: The codebase itself is the most reliable memory store. Comments, docstrings, README files, architecture decision records (ADRs) — these are memory that persists in the version control system and is always current. Treating the codebase as a memory resource means investing in keeping these artifacts up to date.

### Practical Session Persistence

For most developers, the right starting point is a simple session state file — a markdown file that gets updated at the end of each working session and loaded at the start of the next.

```markdown
# Payment Module Session State
Last updated: 2026-04-18

## Current Status
Implementing webhook handler. ChargeCreated and RefundCreated done.
Next: ChargeRefunded event type.

## Architectural Decisions
- All payment ops via PaymentGateway, never Stripe directly
- Idempotency keys: caller-generated, UUID format
- Retry: exponential backoff, max 3 attempts, in gateway layer
- Errors: typed Result objects at service layer, never propagate to controllers

## Active Constraints
- Python 3.10 — no structural pattern matching
- Payments are USD-only for now
- Stripe API version: 2024-04-10 (pinned)

## Files Being Worked On
- payments/webhook_handler.py (in progress)
- payments/gateway.py (complete)
- payments/models.py (complete)
- tests/test_webhook_handler.py (in progress)
```

This file is under 200 tokens. At session start, it gets loaded into the system prompt. The model has the context it needs to pick up where the previous session left off.

**> Try This**
> Create a session state file for your current active project. Write it as if you were briefing a colleague who was going to take over your work for an afternoon. Include: current status, key decisions and their rationale, active constraints, and what's in progress. Load this at the start of your next AI coding session and observe how much setup time you save.

### Memory as Code Comments

The most durable form of memory is embedded in the code itself. Comments, docstrings, and type annotations persist across all sessions, don't require any external infrastructure, and are always associated with the relevant code.

The practice of writing useful comments changes slightly in the context of AI-assisted development. A comment that explains why a decision was made — "Using Redis for idempotency keys; DB lookup was adding 40ms" — is not just useful to human readers. It's context the AI assistant will pick up every time that file is included. It anchors the model's understanding of the code's intent.

Architecture decision records (ADRs) are a formalization of this idea at the project level. An ADR documents a significant architectural decision, the context that prompted it, the options considered, and the decision made. Stored in the codebase (typically `docs/decisions/` or `architecture/`), they're retrievable by semantic search, version-controlled, and naturally structured for model consumption.

A well-maintained ADR directory is one of the highest-ROI investments a team can make for AI-assisted development. It creates persistent, structured, authoritative context about why the codebase is the way it is — the hardest kind of context to reconstruct on the fly.

### Memory Quality Decay

Memory isn't static. Codebase decisions that were accurate six months ago may have been superseded. Architectural patterns that were in use when a session state file was written may have been replaced. Stale memory is actively harmful — it's confidently wrong context.

Any memory system needs to account for decay. Strategies:

**Timestamp everything**: Every memory item should have a creation date and, ideally, an expiry or review date. Old memory items get flagged for review rather than automatically included.

**Ground memory in the codebase**: Memory claims that can be verified against the codebase should be. If the session state says "all database access goes through `db/repositories/`," that claim should be checkable. If it's no longer true, the memory is stale.

**Session-start memory validation**: At the start of each session, spend one turn having the model verify that the session state is consistent with the current state of the relevant files. This catches stale memory before it influences work.

```
At the start of this session: Review the session state below and verify it against the current codebase files provided. Flag any inconsistencies between the session state and the current code before we begin work.
```

**> Warning**
> Stale memory about security constraints or error handling is particularly dangerous. An AI that "remembers" an old error handling pattern and applies it instead of the current one can introduce bugs in security-sensitive code. Validate memory for high-stakes domains before relying on it.

### Multi-Developer Memory

When multiple developers are working on the same codebase with AI assistance, shared memory becomes a question of coordination. A session state file that one developer maintains won't reflect the context of another developer's sessions. Architectural decisions made in one session should be available to all sessions.

The best solutions here leverage existing coordination mechanisms:

- **ADRs in version control**: Shared, versioned, authoritative.
- **Team conventions in CLAUDE.md or equivalent**: A project-level instruction file that all sessions load automatically.
- **Codebase conventions in comments and docstrings**: Present to all sessions that read the relevant files.

The pattern to avoid: a single developer maintaining a personal session state file and not systematizing what they've learned into the codebase or shared project documents. That developer's AI sessions get progressively better while their teammates' sessions remain at baseline — and the knowledge is lost when the session state isn't maintained.

---

**Chapter 7 Key Takeaways**

1. Models have no persistent memory between sessions. Re-injecting relevant context manually is cognitively expensive and error-prone.
2. High-value memory: architectural decisions, project-specific conventions, active constraints, current task state.
3. A session state file under 200 tokens, loaded at session start, is a practical and immediately effective memory strategy.
4. The codebase is the most durable memory store. Comments, docstrings, and ADRs persist across all sessions and tools.
5. Memory decays. Timestamp and validate memory items, especially for high-stakes domains.

**Try This — Chapter 7**

At the end of your next working session, spend five minutes writing a session state note in the format above. At the start of the following session, load it as the first thing in context. Track whether you spend less time on setup and whether the early outputs are more accurately anchored to your actual codebase state.

---

## Chapter 8: Context for Multi-File Tasks

Single-file tasks are the easy case. You paste in a function, ask a question, get an answer. The context is bounded and obvious. But real development work is almost always multi-file. Refactoring a service touches the service, its interfaces, its tests, and its callers. Adding a feature involves models, repositories, service logic, API endpoints, and their corresponding tests. Debugging a subtle bug requires tracing through multiple files to understand how a bad state propagated.

Multi-file tasks are where context engineering gets hard — and where the payoff for getting it right is largest. A model given the wrong subset of files in a multi-file task doesn't produce a mediocre answer. It produces a confidently wrong answer, one that looks correct in isolation but breaks the system when deployed.

### The Multi-File Context Problem

The fundamental challenge of multi-file context is that you need to include enough files to give the model a complete picture, without including so many files that you exceed the context budget or overwhelm the relevant signal with noise.

This requires understanding the dependency relationships between files. You can't select the right files without knowing which files are connected to the task at hand.

For a modification to `payments/gateway.py`:
- Direct dependencies: `payments/models.py`, `payments/exceptions.py` (imported by gateway)
- Direct dependents: `payments/service.py`, `api/endpoints/payments.py` (import from gateway)
- Indirect dependents: anything that calls the payment service (potentially many files)
- Tests: `tests/test_gateway.py`, possibly `tests/integration/test_payments.py`

The right context for this modification includes the file being changed, its direct dependencies (because the model needs to know the interfaces it works with), and its direct dependents (because the model needs to know how the modified code will be used). The indirect dependents are usually too many to include but should be checked for breaking changes after the modification.

### Import Graph Analysis

The most reliable way to determine file relationships is to analyze the import graph. An import graph is a directed graph where each node is a file and each edge represents an import relationship: A → B means A imports from B.

```python
import ast
from pathlib import Path
from collections import defaultdict

def build_import_graph(codebase_path: str) -> dict:
    graph = defaultdict(set)          # file -> set of files it imports
    reverse_graph = defaultdict(set)  # file -> set of files that import it

    for py_file in Path(codebase_path).rglob("*.py"):
        rel_path = str(py_file.relative_to(codebase_path))
        try:
            tree = ast.parse(py_file.read_text())
        except SyntaxError:
            continue

        for node in ast.walk(tree):
            if isinstance(node, (ast.Import, ast.ImportFrom)):
                if isinstance(node, ast.ImportFrom) and node.module:
                    module_path = node.module.replace(".", "/") + ".py"
                    full_path = Path(codebase_path) / module_path
                    if full_path.exists():
                        graph[rel_path].add(module_path)
                        reverse_graph[module_path].add(rel_path)

    return graph, reverse_graph

def get_context_files(target_file: str, graph: dict, reverse_graph: dict) -> dict:
    return {
        "target": [target_file],
        "dependencies": list(graph.get(target_file, [])),       # files target imports
        "dependents": list(reverse_graph.get(target_file, [])), # files that import target
    }
```

This analysis tells you exactly which files are directly connected to the file under modification. For context assembly, start with this set — it's almost always the right first tier.

**> Key Insight**
> The import graph is ground truth for file relationships. Guessing which files are related is unreliable. Deriving the set from the actual imports is precise and automatable.

### Blast Radius Analysis

For larger tasks — refactoring a widely-used interface, changing a shared data model, modifying error handling across a layer — you need to understand the blast radius: how many files will be affected by this change, and which ones?

Blast radius analysis extends import graph analysis to multiple hops:

```python
def blast_radius(target_file: str, reverse_graph: dict, max_depth: int = 3) -> dict:
    """Returns files affected at each hop distance from target."""
    affected = {0: {target_file}}
    visited = {target_file}

    for depth in range(1, max_depth + 1):
        current_level = set()
        for file in affected[depth - 1]:
            for dependent in reverse_graph.get(file, []):
                if dependent not in visited:
                    current_level.add(dependent)
                    visited.add(dependent)
        affected[depth] = current_level

    return affected

# Example output:
# {
#   0: {"payments/gateway.py"},
#   1: {"payments/service.py", "api/endpoints/payments.py"},
#   2: {"api/middleware/auth.py", "background/tasks.py"},
#   3: {"main.py", "scripts/process_pending.py"}
# }
```

Blast radius analysis serves two purposes in context engineering:

1. It identifies files you must include in context for a complete picture of the change's impact.
2. It provides information for the model to reason about consequences — "these 8 files depend on the function you're modifying; here's how they use it."

For modifications with large blast radii (dozens of affected files), you can't include all of them in context. Prioritize: include the direct dependents (depth 1) in full, provide a list of deeper dependents so the model is aware they exist, and structure the task to address them systematically rather than all at once.

### Incremental Context Building

For truly large multi-file tasks, the right approach is not a single massive context but an incremental strategy: break the task into sub-tasks, each with appropriate context for that sub-task, and chain the results.

Consider a refactor that involves changing a data model and updating all the code that uses it. A monolithic approach tries to fit everything in one context: the model, the service, the repository, the API endpoints, the tests, and the migration. That's often too much.

An incremental approach:

1. **Sub-task 1**: Update the data model. Context: model file, its migrations, any serialization code. Output: updated model.
2. **Sub-task 2**: Update the repository. Context: repository file, the updated model (from sub-task 1), existing tests. Output: updated repository.
3. **Sub-task 3**: Update the service. Context: service file, the updated model and repository interfaces (from sub-tasks 1 and 2). Output: updated service.
4. **Sub-task 4**: Update the API endpoints. Context: endpoint files, the updated service interface (from sub-task 3). Output: updated endpoints.
5. **Sub-task 5**: Update tests. Context: all updated interfaces (from sub-tasks 1–4), existing test files. Output: updated tests.

Each sub-task has a focused, appropriate context. The output of each sub-task becomes context for subsequent sub-tasks. The total work is the same, but each model call has high-quality, scoped context.

**> Try This**
> Take a multi-file task you have coming up. Before starting, draw the import relationships between the files involved. Identify the primary target, its direct dependencies, and its direct dependents. Assemble context with all of these, explicitly labeled. Run the task and compare output quality to your typical approach.

### Test Context

Tests deserve special attention in multi-file context because they occupy a dual role: they're both documentation (showing how the code should be used) and constraints (defining what "correct" means for this code).

When assembling context for a modification task, including the existing tests for the modified code is almost always high value:

- They show the model what behavior must be preserved
- They show existing usage patterns
- They constrain the modifications to not break existing expectations

When assembling context for a testing task, including existing tests for similar units is high value:

- They show the test patterns used in the codebase
- They show fixture and mock conventions
- They calibrate the expected level of coverage

```python
def assemble_modification_context(target_file: str) -> Context:
    ctx = Context()
    ctx.add(system_constraints, priority=3)
    ctx.add(read_file(target_file), priority=3)

    graph, reverse_graph = build_import_graph(codebase_path)

    # Direct dependencies
    for dep in graph.get(target_file, []):
        ctx.add(read_file(dep), priority=2)

    # Direct dependents (as reference, lower priority)
    for dep in list(reverse_graph.get(target_file, []))[:3]:
        ctx.add(read_file(dep), priority=1)

    # Existing tests — high value for understanding expected behavior
    test_file = find_test_file(target_file)
    if test_file:
        ctx.add(read_file(test_file), priority=2)

    return ctx.trim_to_budget(token_budget)
```

### Cross-Cutting Concerns

Some multi-file tasks involve cross-cutting concerns — logging, authentication, error handling, observability — that affect code scattered across many files without clear import relationships. Updating error handling patterns, for instance, might touch 30 files that have no import relationship with each other.

For these tasks, retrieval is essential. A semantic search for "error handling" or "exception handling" across the codebase surfaces the affected code regardless of file structure. This is exactly the use case where keyword-based file selection fails and semantic retrieval wins.

The pattern: use retrieval to identify affected files, then group them by module or layer, then address each group as a sub-task with appropriate context for that group.

**> Warning**
> Cross-cutting changes are the highest-risk multi-file tasks for context failures. The affected files have no obvious structural connection, which means manual file selection almost always misses something. Use retrieval, not intuition, to find the full scope.

---

**Chapter 8 Key Takeaways**

1. Multi-file tasks require understanding dependency relationships before assembling context.
2. Import graph analysis provides ground truth about which files are directly connected to the target file.
3. Blast radius analysis identifies the full scope of impact for a change — essential for understanding consequences.
4. Incremental context building (chained sub-tasks) scales to larger refactors where monolithic context isn't feasible.
5. Test files are high-value context for modification tasks — they define expected behavior and constrain the acceptable modifications.

**Try This — Chapter 8**

Build a simple import graph for your primary codebase. For your most-used core module, run blast radius analysis to depth 3. Look at the results. Does the blast radius match your intuition? Are there dependents you didn't know about? This exercise often reveals structural coupling that wasn't visible before.

---

## Chapter 9: Measuring Context Quality

Everything covered in the previous eight chapters is only useful if you can tell whether it's working. Context quality is not a subjective impression — it's measurable. And if you're making deliberate decisions about what goes in the context window, you should be measuring the effect of those decisions.

Most developers don't do this. They assess AI output quality impressionistically: "that seemed better," "this session felt more productive." Impressionistic assessment is better than nothing but it doesn't scale and it doesn't tell you which decisions are driving the improvement.

This chapter covers how to measure context quality systematically — the metrics, the methods, and the feedback loops that let you improve context strategy over time.

### What Good Context Looks Like

Before measuring, it helps to have a clear description of what you're trying to achieve.

Good context is:

**Relevant**: A high fraction of the tokens in the window are actually useful for the task. Low-relevance context wastes budget and dilutes signal.

**Complete**: Everything the model needs to produce a correct output is present. Missing context leads to hallucinated completions or outputs that don't fit the actual system.

**Current**: The context accurately reflects the current state of the codebase and requirements. Stale context leads to outputs that would have been correct last month.

**Well-structured**: Content is formatted clearly, positioned appropriately (per Chapter 3), and labeled so the model can reliably interpret what each component is.

**Appropriately compressed**: Long content is represented at the appropriate level of detail — full where detail matters, summarized or interface-only where it doesn't.

These five properties correspond to five things that can go wrong, and each has a measurable indicator.

### Relevance Measurement

Relevance measures how much of the context actually influenced the output.

The conceptual measure is token efficiency: what fraction of tokens in the context contributed to the correct generation? In practice you can't measure this directly, but you can approximate it.

**Post-hoc relevance evaluation**: After a session, review the context you provided and rate each component as high, medium, or low relevance based on whether it appeared to influence the output. Track this across sessions and look for patterns — are specific types of content consistently low relevance?

**A/B context testing**: For repeating task types, run the same task with different context compositions and compare output quality. Remove one component from context and observe whether output quality changes. If removing a component doesn't change quality, it was low relevance.

**Token efficiency metric**: Track total tokens in context versus output quality score. If you can achieve the same output quality with 30% fewer tokens, you've identified 30% noise.

```python
class ContextQualityTracker:
    def __init__(self):
        self.sessions: list[SessionRecord] = []

    def record_session(
        self,
        context_components: list[ContextComponent],
        output_quality_score: float,  # 1-5 human rating
        task_type: str
    ):
        self.sessions.append(SessionRecord(
            total_tokens=sum(c.token_count for c in context_components),
            components=context_components,
            quality_score=output_quality_score,
            task_type=task_type,
            timestamp=datetime.now()
        ))

    def component_quality_correlation(self, component_type: str) -> float:
        """Returns correlation between including this component type and output quality."""
        with_component = [s.quality_score for s in self.sessions
                         if any(c.type == component_type for c in s.components)]
        without_component = [s.quality_score for s in self.sessions
                            if not any(c.type == component_type for c in s.components)]

        if not with_component or not without_component:
            return None

        return mean(with_component) - mean(without_component)
```

### Completeness Measurement

Completeness is harder to measure because you can't always know what the model needed that it didn't have. But there are indicators:

**Hallucination rate**: When a model invents a function, class, or API that doesn't exist, it's filling a gap you left open. Track the rate at which model outputs include invented code elements. A high hallucination rate is a completeness signal.

**Correction rate**: How often do you have to correct an output because it assumed something incorrect about your system? Each correction represents a piece of context that should have been present.

**First-pass acceptance rate**: What fraction of model outputs can be used with no or minimal modification? This is the broadest quality signal, combining relevance, completeness, and accuracy.

For debugging tasks specifically, resolution rate is informative: what fraction of debugging sessions with AI assistance reach a correct diagnosis on the first attempt versus requiring multiple rounds?

**> Key Insight**
> Tracking corrections is the highest-signal measurement available without specialized tooling. Every time you correct a model output, that correction represents a context gap. After a month of tracking corrections, you'll have a clear picture of what your context is consistently missing.

### Output Quality Metrics by Task Type

Different task types have different natural quality metrics:

| Task Type | Primary Quality Metric | Measurement Method |
|---|---|---|
| Code generation | Runs without modification | Automated linting + runtime check |
| Debugging | Correct root cause identified | Human verification |
| Refactoring | Passes existing tests | Test suite run |
| Documentation | Accurately describes code | Human review |
| Test generation | Tests catch real bugs | Mutation testing |
| Code review | Issues found match real issues | Post-deploy comparison |

Where automated measurement is possible, use it. Automated signals at scale are more reliable than human ratings, which are subject to fatigue, mood, and inconsistency.

For code generation tasks, a CI pipeline that runs the model output through type checking, linting, and a quick test suite is achievable and produces an objective signal:

```bash
#!/bin/bash
# evaluate_generation.sh

OUTPUT_FILE=$1

# Type check
mypy $OUTPUT_FILE --ignore-missing-imports > /dev/null 2>&1
TYPE_OK=$?

# Lint
ruff check $OUTPUT_FILE > /dev/null 2>&1
LINT_OK=$?

# Run tests if test file provided
if [ -f $2 ]; then
    pytest $2 -q > /dev/null 2>&1
    TEST_OK=$?
else
    TEST_OK=0  # No tests to run
fi

# Score: 0, 1, 2, or 3
SCORE=$((3 - TYPE_OK - LINT_OK - TEST_OK))
echo $SCORE
```

This gives you a reproducible quality score per generation that you can correlate with context configuration.

### Retrieval Quality Metrics

For systems using retrieval-based context assembly, retrieval quality is a separate measurement concern:

**Precision at K**: Of the top K results returned by retrieval, what fraction are actually relevant to the task? This requires human labeling but can be sampled.

**Recall**: Of all the actually relevant chunks in the codebase, what fraction does retrieval surface in the top K? This is harder to measure because you need ground truth about all relevant chunks.

**Mean Reciprocal Rank (MRR)**: For queries where there's a known correct result, what rank does retrieval assign to it? MRR averages the reciprocal rank across a test set.

For practical purposes, precision@5 measured on a sample of 50 tasks is sufficient to detect gross retrieval quality problems and track improvement over time.

### Building a Feedback Loop

Measurement is only valuable if it feeds back into decisions. The feedback loop for context quality looks like this:

1. **Define task types** for the work you do regularly.
2. **Instrument your sessions** to track context composition and output quality.
3. **Identify the lowest-quality task types** — where is output quality worst?
4. **Diagnose the context gaps** for those task types — what's missing, stale, or poorly structured?
5. **Make a specific context change** — add a new context component, remove a low-relevance one, change assembly rules.
6. **Measure the effect** over the next N sessions.
7. **Iterate**.

This is not a heavy process. It can be as simple as a spreadsheet where you rate each session 1–5, note the task type, and note what context you used. After 20–30 sessions you have enough data to see patterns.

**> Warning**
> Don't mistake output volume for output quality. A model that produces more code is not producing better code. Measure quality — does it work, does it fit the codebase, does it require correction — not quantity.

### Context Quality in Team Settings

When multiple developers are using AI assistants on the same codebase, context quality becomes a shared concern. One developer's improvement to the context strategy benefits everyone if it's systematized. One developer's poor context habits affect only them.

Practical approaches for team-level context quality:

**Shared context templates**: Standardize the context components used for common task types. Document them in the project. New team members start with a context strategy, not at baseline.

**Review AI-assisted code as you would any code**: The code review process catches context failures by catching their outputs. An AI-generated function that doesn't follow project conventions indicates a context gap. Note it and fix the context, not just the code.

**Track AI-related regressions**: When a bug is introduced by AI-generated code, investigate the context that produced it. Was the relevant constraint present? Was the related code visible? This is the most reliable signal about systematic context gaps.

**ADR maintenance as team habit**: Architectural decision records, kept current, are shared memory for all AI sessions across the team. A team that maintains good ADRs gives every AI session better context with zero per-session effort.

### The Meta-Metric: Time to Correct Output

All the specific metrics above serve a single meta-metric: how long does it take to get from a task description to a correct, production-ready output?

Context engineering should reduce this time. If your context strategy is working, you should see:
- Fewer rounds of correction needed
- Less time spent fixing model outputs to fit your actual codebase
- More first-pass acceptance of generated code
- Faster completion of complex multi-file tasks

If you're spending the same amount of time editing model outputs as before, your context strategy isn't working — regardless of what the individual metrics say. The meta-metric is the honest measure.

---

**Chapter 9 Key Takeaways**

1. Context quality is measurable. Impressionistic assessment is insufficient for systematic improvement.
2. Track hallucination rate, correction rate, and first-pass acceptance rate as leading indicators of context quality.
3. Automated quality signals (type checking, linting, test pass rate) are more reliable than human ratings at scale.
4. Retrieval quality requires its own measurement: precision@K and MRR on sampled queries.
5. The meta-metric is time-to-correct-output. If that doesn't improve, the strategy isn't working.

**Try This — Chapter 9**

Start tracking your corrections. For the next two weeks, every time you significantly modify a model output to fix a mistake, note: what was wrong, and what context would have prevented it? After two weeks, look for the most common gap. That's your next context engineering improvement.

---

## Conclusion

Context engineering is not a new discipline given a fancy name. It's the operationalization of a simple insight: what an AI produces is a function of what you give it. Control what you give it, and you control the output.

The chapters in this book cover the mechanics — token economics, attention gradients, retrieval systems, compression techniques, session persistence, dependency graphs, measurement frameworks. But the mechanics serve a single practical goal: closing the gap between what the model needs and what you provide.

That gap is the source of most AI quality failures in real development work. Not model limitations. Not prompt phrasing. The gap between the information required for a correct output and the information actually present in the window when the model generates.

Every technique covered here is a way to close some part of that gap:

- Dynamic assembly closes the gap between static, stale context and the specific, current context a task requires.
- Retrieval closes the gap between what you know to include and what's actually relevant.
- Compression closes the gap between the information that should be present and the token budget you have to fit it in.
- Session persistence closes the gap between a model's zero-memory state and the accumulated knowledge from prior sessions.
- Dependency analysis closes the gap between single-file thinking and multi-file reality.
- Measurement closes the gap between guessing whether it's working and knowing.

None of these are complete solutions in isolation. A retrieval system with poor chunking strategy fails. A compression strategy that truncates code produces wrong outputs. A measurement framework that tracks the wrong metric optimizes for the wrong thing. The pieces work together.

Where to start depends on where the gap is largest for your specific work. If you're building on a small codebase with well-defined task types, start with dynamic assembly. If your codebase is large and you're spending time manually hunting for relevant files, start with retrieval. If your sessions keep losing the thread, start with session persistence. If you're doing frequent multi-file refactors, start with import graph analysis. If you don't know where the gap is largest, start with measurement.

The developers who get the most out of AI coding assistants are not the ones who use the most powerful models or write the cleverest prompts. They're the ones who are most deliberate about what goes in the window. That deliberateness is learnable. This book was the starting point.

---

## Appendix A: Glossary

**Attention mechanism**: The component of transformer models that allows each token to consider all other tokens in the context. Attention weight distributions determine which tokens most influence each generated token.

**BM25**: A keyword-based ranking function used in information retrieval. In hybrid search, BM25 scores supplement semantic similarity scores.

**Blast radius**: The set of files affected when a target file is modified, determined by tracing the import graph to dependent files at increasing hop distances.

**Chunking**: The process of splitting documents or code into smaller pieces for indexing and retrieval. Chunk boundaries significantly affect retrieval quality.

**Context engineering**: The practice of intentionally managing what goes into an AI's context window to optimize output quality, relevance, and accuracy.

**Context window**: The fixed-size input sequence that a language model processes at inference time. All context — prompts, history, files, tool outputs — must fit within this window.

**Cross-encoder re-ranking**: A re-ranking approach that scores query-document pairs by processing them together, producing more precise relevance scores than bi-encoder (embedding) approaches.

**Dynamic context assembly**: The practice of programmatically selecting and composing context components at request time, based on the current task, rather than using a static predefined context.

**Embedding**: A dense vector representation of text in a high-dimensional space. Embeddings map semantic meaning to vector geometry, enabling semantic similarity search.

**Hybrid search**: A retrieval approach combining keyword-based search (BM25) and semantic search (vector similarity), typically merged using Reciprocal Rank Fusion.

**Import graph**: A directed graph representing import relationships between files in a codebase. Used for dependency analysis and blast radius calculation.

**Lost in the middle**: The empirically observed phenomenon where language models attend less reliably to information positioned in the middle of a long context, compared to the beginning and end.

**Lossy compression**: Context compression that discards some information, accepting that the compressed version captures only the most relevant content.

**Lossless compression**: Context compression that represents the same information in fewer tokens without discarding content, through reformatting or deduplication.

**Mean Reciprocal Rank (MRR)**: A retrieval quality metric measuring, on average, how highly the correct result ranks in search results. Computed as the mean of the reciprocal ranks across a test set.

**Precision@K**: The fraction of the top K retrieval results that are actually relevant. A measure of retrieval precision.

**Progressive summarization**: A conversation history management strategy where the oldest conversation segments are periodically summarized and replaced with the summary, preserving semantic content while reducing token count.

**RAG (Retrieval-Augmented Generation)**: An architecture where relevant documents or code chunks are retrieved from an index and included in context before model generation.

**Reciprocal Rank Fusion (RRF)**: A technique for merging multiple ranked lists into a single list by combining the reciprocal ranks of each item across the input lists.

**Relevance density**: The fraction of tokens in a context window that are actually useful for the current task. Optimizing relevance density is a central goal of context engineering.

**Session state**: Information about the current state of a working session — decisions made, constraints established, current task progress — that can be persisted and re-loaded to provide continuity across sessions.

**Signal-to-noise ratio**: In context engineering, the ratio of relevant to irrelevant tokens in the context window. Higher signal-to-noise produces better model outputs.

**System prompt**: Instructions provided to a language model before the user interaction begins. Typically used to establish role, constraints, and behavioral guidelines.

**Token**: The basic unit of text processed by language models. Approximately 4 characters for English text; code is typically more token-dense.

**Token budget**: The maximum number of tokens allocated for context assembly. Context that would exceed the budget must be prioritized and trimmed.

**Vector database**: A database optimized for storing and searching embedding vectors. Examples: ChromaDB, Pinecone, Weaviate, pgvector.

---

## Appendix B: Tools & Resources

### Embedding Models

**Hosted APIs**
- OpenAI text-embedding-3-small / text-embedding-3-large — Strong general-purpose embeddings, well-suited for code
- Anthropic — No dedicated embedding API; use OpenAI or open-source for embeddings in Claude-based systems
- Cohere Embed — Good multilingual support, competitive on code

**Open-Source / Self-Hosted**
- `nomic-embed-text` — Strong performance, Apache 2.0 license, runs locally via Ollama
- `all-MiniLM-L6-v2` — Small, fast, good quality-per-compute-cost ratio
- `bge-large-en-v1.5` — Strong performance, popular in RAG applications
- `CodeBERT` / `GraphCodeBERT` — Trained specifically on code; useful for pure code retrieval tasks

### Vector Databases

**Embedded / Local**
- **ChromaDB** — Easiest to get started with; runs in-process, no server required; good for development and small-to-medium codebases
- **LanceDB** — Embedded, supports multimodal, good performance

**Server-Based / Self-Hosted**
- **Weaviate** — Full-featured, supports hybrid search natively, Docker-deployable
- **Qdrant** — Fast, resource-efficient, supports filtering; good for production deployments
- **Milvus** — High-scale, Kubernetes-friendly; more operational overhead

**Managed Services**
- **Pinecone** — Managed, scalable; good for teams that don't want to operate infrastructure
- **pgvector** — Postgres extension; ideal if you're already on Postgres; no additional infrastructure

### Code Parsing

- **tree-sitter** — Fast, incremental parser with support for most languages; the standard tool for language-aware chunking
- **libcst** — Concrete syntax tree library for Python; preserves formatting and comments; useful for Python-specific tools
- **ast** (Python stdlib) — Built-in Python AST module; simpler than tree-sitter for Python-only projects

### Search Frameworks

- **LlamaIndex** — Framework for RAG applications; handles chunking, indexing, retrieval, and query pipelines; extensive integrations
- **LangChain** — Broader AI application framework with RAG components; more opinionated than LlamaIndex
- **Haystack** — Production-oriented RAG and search framework; strong evaluation tooling
- **txtai** — Lightweight, embeddings-focused; good for smaller projects

### Token Counting

- **tiktoken** (OpenAI) — Fast BPE tokenizer for OpenAI models; also useful as an approximation for other models
- **anthropic.count_tokens()** — Built into Anthropic SDK; accurate for Claude models
- **transformers.AutoTokenizer** — Hugging Face library; accurate for any tokenizer you can load

### Context Management Libraries

- **ContextCite** — Research tool for attributing model outputs to specific context components; useful for understanding what context actually influenced a response
- **LangChain ConversationSummaryMemory** — Automatic summarization of conversation history; easy to integrate into LangChain-based systems
- **Memento** — Lightweight conversation memory with compression and persistence

### Development Tools

- **Pyckle** — Semantic code search and session context management for AI-assisted development; designed specifically for the context engineering use cases in this book
- **Continue** — Open-source AI coding assistant with configurable context; supports custom context providers
- **Aider** — Command-line AI coding assistant with strong multi-file context handling; uses repo-map for dependency awareness

---

## Appendix C: Further Reading

### Papers

**On Retrieval-Augmented Generation**

- Lewis et al. (2020). *Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.* The paper that defined RAG as an architecture. Foundational reading for understanding why retrieval-in-context works.

- Gao et al. (2023). *Retrieval-Augmented Generation for Large Language Models: A Survey.* Comprehensive survey covering RAG variants, retrieval strategies, and evaluation methods through 2023.

**On Context Length and Attention**

- Liu et al. (2023). *Lost in the Middle: How Language Models Use Long Contexts.* The empirical study behind the "lost in the middle" phenomenon. Essential reading for understanding position effects in long context windows.

- Shi et al. (2023). *Large Language Models Can Be Easily Distracted by Irrelevant Context.* Demonstrates that irrelevant context degrades performance even when sufficient relevant context is present.

**On Code Retrieval**

- Feng et al. (2020). *CodeBERT: A Pre-Trained Model for Programming and Natural Languages.* Foundational paper on code embeddings; useful for understanding why code retrieval requires different considerations than text retrieval.

- Guo et al. (2021). *GraphCodeBERT: Pre-training Code Representations with Data Flow.* Extends CodeBERT with data flow information; relevant for understanding structure-aware code retrieval.

**On Chunking and Indexing**

- LlamaIndex blog series on chunking strategies — practical, empirically grounded, regularly updated. Covers fixed-size, recursive, semantic, and document-structure-aware chunking with quality comparisons.

**On Memory in Language Models**

- Park et al. (2023). *Generative Agents: Interactive Simulacra of Human Behavior.* While focused on agent simulation, the memory architecture described — retrieval-based episodic memory with relevance scoring — is directly applicable to developer tool memory systems.

### Books and Long-Form Resources

- *Designing Data-Intensive Applications* (Kleppmann, 2017) — Not about AI, but essential background for understanding the data infrastructure underlying retrieval systems: indexing, search, consistency, and scale.

- *Information Retrieval: Implementing and Evaluating Search Engines* (Büttcher, Clarke, Cormack, 2010) — Deep treatment of search and retrieval theory; useful for understanding the algorithms behind the tools.

### Online Resources

- The LlamaIndex documentation and blog — Among the most practical resources on RAG implementation, chunking strategy, evaluation, and advanced retrieval patterns.

- Eugene Yan's blog (eugeneyan.com) — ML practitioner writing on applied retrieval, recommendation systems, and production AI systems. Consistently useful.

- The Anthropic documentation on prompt engineering and context window management — First-party guidance on how Claude specifically uses context; worth reading alongside model-agnostic resources.

- Papers With Code (paperswithcode.com) — For tracking state-of-the-art on embedding models, retrieval benchmarks, and context management research.

---

*Context Engineering for Developers* — Version 1.0 — April 2026

By David Kelly Price


---

## Related Blog Posts

- [The 1M Context Window Trap](https://pyckle.co/blog/the-1m-context-window-trap.html)
- [More Context Is Not Better Context](https://pyckle.co/blog/more-context-is-not-better-context.html)
- [Long Context Windows Don't Replace Retrieval — They Replace Excuses](https://pyckle.co/blog/long-context-windows-dont-replace-retrieval-they-replace-excuses.html)

---

*[Browse all free guides →](https://pyckle.co/books.html)*