---
title: "Memory Systems for AI Developer Tools"
subtitle: "Persistent Context, Knowledge Graphs, and Long-Term Recall for Coding Assistants"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Architects and senior engineers building or evaluating AI developer tools — interested in how tools can accumulate and reuse knowledge across sessions"
estimated_pages: 75
chapters:
  - "The Problem with Stateless AI"
  - "Types of Memory: Episodic, Semantic, Procedural"
  - "Session Memory: Within-Conversation Recall"
  - "Cross-Session Persistence"
  - "Code as Memory: What the Codebase Knows"
  - "Knowledge Graphs for Code Understanding"
  - "Retrieval-Augmented Memory"
  - "Memory Decay and Refresh"
  - "Evaluation: Does the Tool Actually Remember?"
tags:
  - pyckle
  - ebook
  - memory-systems
  - ai-tools
  - context
  - knowledge-graph
  - persistence
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Memory Systems for AI Developer Tools

## Persistent Context, Knowledge Graphs, and Long-Term Recall for Coding Assistants

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: The Problem with Stateless AI
- Chapter 2: Types of Memory: Episodic, Semantic, Procedural
- Chapter 3: Session Memory: Within-Conversation Recall
- Chapter 4: Cross-Session Persistence
- Chapter 5: Code as Memory: What the Codebase Knows
- Chapter 6: Knowledge Graphs for Code Understanding
- Chapter 7: Retrieval-Augmented Memory
- Chapter 8: Memory Decay and Refresh
- Chapter 9: Evaluation: Does the Tool Actually Remember?
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

This book is about a gap that doesn't get talked about enough: AI coding assistants that can't remember anything between sessions, and what it actually takes to fix that.

The gap matters more than people realize. When a developer asks a coding assistant about a bug they discussed yesterday, and the assistant has no idea what yesterday was — that's not a minor inconvenience. That's a fundamentally broken product experience dressed up in impressive demo clothing. The tool looks smart until the moment you need it to actually know something, and then it doesn't.

Memory systems for AI developer tools are an engineering problem, not a research problem. The underlying cognitive science is well-established. Retrieval-augmented generation has been a known technique for years. Vector databases are commodity infrastructure. What's missing, almost universally, is the architectural thinking that connects these pieces into something that genuinely serves a developer across sessions, projects, and months of accumulated context.

That's what this book covers.

The audience is architects and senior engineers who are either building AI developer tools or evaluating them for adoption. This is not a beginner's guide to language models. There's no section explaining what embeddings are from first principles. The assumption is that you've worked with these systems and you're trying to make them actually good — not just functional in a demo.

Each chapter covers one layer of the memory problem: what kind of memory is needed, how to build it, what breaks in practice, and how to know if it's working. The chapters are meant to be read in order — the later chapters build on concepts from the earlier ones — but each is self-contained enough to work as a standalone reference once you've read through once.

The code examples are real. The trade-offs are real. The warnings about what doesn't work are things that have been tried and found wanting, not things that sound plausible from a whiteboard.

---

## Chapter 1: The Problem with Stateless AI

Every time a developer starts a new conversation with an AI coding assistant, they're starting over. The assistant knows nothing about the project, nothing about the decisions made last week, nothing about the bug that took three days to track down. It knows everything the model was trained on and nothing the user has told it before.

That's statelessness. And for a tool that's supposed to help professional engineers do complex work over extended time horizons, statelessness is a fundamental architectural defect.

### Why Statelessness Feels Fine at First

The initial experience with AI coding assistants is usually impressive. You paste in a function and ask what it does. You describe a bug and ask for a fix. You ask for a code review. These interactions work well because they're self-contained — all the information needed to answer the question is present in the question itself.

But professional engineering work is not a series of self-contained questions. It's a continuous project with accumulated decisions, established patterns, known constraints, and shared context built up over time. The moment a developer starts relying on an AI assistant for real work — not demos, not experiments, but daily engineering — they start running into the edges of statelessness.

They paste in context that they've pasted in before. They re-explain architectural decisions the tool should already know. They ask about a module the tool helped them build three weeks ago and get back generic advice with no awareness of how that module actually works. The assistant is smart but has no memory of having been smart before.

**Key Insight:** Statelessness is not a limitation of the underlying model. It's an architectural choice — or more accurately, an architectural omission. The model's intelligence is preserved across sessions. Only the context is lost.

### The Tax on Developer Productivity

Stateless tools impose a hidden tax. Every interaction requires context reconstruction: explaining the project, explaining the conventions, explaining what's been tried. That tax is small per interaction and enormous in aggregate.

Consider a developer working with an AI assistant for eight hours a day. Each morning, they need to re-establish context. Each time they switch topics or start a new conversation thread, they re-explain. If they're generous and say this costs ten minutes per context switch, and they switch contexts six times a day, that's an hour — twelve and a half percent of the working day — spent telling the tool things it should already know.

The subtler cost is cognitive. The developer must maintain a mental model of what the AI knows and what it doesn't, and manage the gap. That's overhead imposed on the person who should be the tool's primary beneficiary.

There's also a quality cost. When a tool lacks context, it gives generic advice. Generic advice isn't wrong — it's just not tailored to the specific problem, which means it's less useful and often actively misleading in context-dependent situations. An assistant that doesn't know your codebase uses patterns that don't match your conventions. One that doesn't know your architectural decisions suggests changes that contradict them.

### The Context Window Is Not Memory

A common response to the statelessness problem is: just put everything in the context window. Modern models have context windows of 128k, 200k, even a million tokens. Can't you just stuff the whole codebase in there?

You can't, for several reasons.

First, context windows are expensive. Token cost scales linearly, and for large codebases, the raw material cost of including everything on every request becomes prohibitive. A 500,000-token codebase at current API pricing is not something you include on every developer keystroke.

Second, context windows degrade. Research consistently shows that models perform worse on retrieval tasks when relevant information is buried in the middle of a very long context. This is the "lost in the middle" problem — the model's attention is not uniformly distributed across the context. Stuffing the context full of code does not guarantee the model will find and use the right code.

Third, context windows don't persist. At the end of the conversation, the context is gone. The next conversation starts empty. The context window solves the within-session problem but does nothing for the cross-session problem, which is where most of the real cost lives.

**Warning:** Don't confuse a large context window with a memory system. They solve related but distinct problems. A context window is working memory — fast, temporary, expensive. A memory system is long-term storage — slower to retrieve, persistent, and almost always necessary for tools that are used seriously.

### What Stateful Behavior Actually Requires

Building a stateful AI developer tool requires solving three distinct problems.

The first is capture: identifying what's worth remembering. Not every interaction deserves to be preserved. Decisions, patterns, architectural constraints, and hard-won bug fixes deserve to be remembered. Routine questions about syntax don't. A memory system that captures everything will be overwhelmed with noise. A memory system that captures nothing defeats its own purpose.

The second is organization: structuring what's been captured so it can be found. Raw storage is not memory. A pile of conversation transcripts is not a useful memory system. The captured information needs to be organized in a way that supports retrieval — by topic, by recency, by relevance to the current task.

The third is retrieval: finding the right memory at the right moment without requiring the developer to remember what they told the tool and when. Retrieval must be fast enough to not interrupt the flow of work and accurate enough to return useful results rather than tangentially related ones.

These three problems compound on each other. A system that captures too much makes organization hard. Poor organization makes retrieval unreliable. Unreliable retrieval makes the whole system feel broken even if the underlying data is there.

### The Stakes for Tool Builders

For teams building AI developer tools, statelessness is a competitive liability. The tools that win over time will be the ones that genuinely accumulate useful knowledge about how their users work. The tools that stay stateless will remain impressive in demos and frustrating in daily use.

Users tolerate statelessness when they don't have a better option. Once they've used a tool that remembers — that knows their conventions, their project structure, their architectural constraints — they find stateless tools intolerable.

This is what makes memory systems strategically important, not just technically interesting. They're the mechanism by which a good tool becomes irreplaceable.

---

**Key Takeaways**

1. Statelessness is an architectural choice, not a model limitation. The intelligence persists; only the context is discarded.
2. The cost of statelessness is hidden but real: time spent reconstructing context, degraded advice quality, and cognitive overhead.
3. Context windows are working memory, not long-term storage. They don't solve the cross-session problem.
4. A memory system requires solving three distinct problems: capture, organization, and retrieval.
5. For tool builders, memory is a competitive differentiator — the mechanism by which a tool becomes genuinely irreplaceable.

**Try This:** For one week, keep a log every time you re-explain something to an AI assistant that you've explained before. Count the instances and estimate the time cost. This is the problem you're solving.

---

## Chapter 2: Types of Memory: Episodic, Semantic, Procedural

Cognitive science has spent decades developing a taxonomy of memory. That taxonomy was built to describe human cognition, but it maps surprisingly well onto the kinds of memory AI developer tools need. Understanding the distinction between episodic, semantic, and procedural memory isn't just conceptually interesting — it drives practical decisions about what to store, how to store it, and how to retrieve it.

### Episodic Memory: What Happened

Episodic memory is memory of events. Specific things that happened, at a specific time, in a specific context. In human cognition, this is autobiographical memory — the recollection of experiences rather than abstract facts.

For an AI developer tool, episodic memory is the record of interactions. The conversation where the developer and the assistant worked through a tricky database migration. The session where they decided to use event sourcing for the order system. The exchange where the developer explained why the legacy authentication module was untouchable for the next six months.

Episodic memory is the raw material of experience. It's temporally organized — things happened in a certain order, and that order sometimes matters. It's also highly specific — the value is in the particulars, not just the general principle.

The challenge with episodic memory is scale. Interactions accumulate fast. A developer using an AI assistant seriously might generate hundreds of meaningful interactions per month. If episodic memory stores every interaction in raw form, retrieval becomes needle-in-haystack at best and incoherent at worst.

Good episodic memory systems for AI tools don't store transcripts verbatim. They extract and compress: what was decided, what was learned, what was an important constraint or discovery. The raw transcript is the evidence; the episodic memory is the meaningful summary.

**Key Insight:** The goal of episodic memory is not a complete journal of every interaction. It's a record of the interactions that changed what the tool knows about the user, the project, or the problem domain.

### Semantic Memory: What Is True

Semantic memory is memory of facts and concepts, abstracted away from the specific experience in which they were learned. You know that Python is dynamically typed not because you remember the specific conversation where you learned it, but because it's become a stable fact in your world model.

For AI developer tools, semantic memory is knowledge about the project, the codebase, and the domain. The fact that this system uses PostgreSQL. That the authentication service is the source of truth for user identity. That the data pipeline runs nightly at 2 AM and has a hard timeout at 30 minutes. That the team uses a specific branching strategy and the main branch is protected.

Semantic memory is more durable than episodic memory. A fact about the database schema is unlikely to change next week. Once established, it should be available without needing to re-derive it from context every time.

The relationship between episodic and semantic memory is important: semantic memory is often derived from episodic memory. The fact that the system uses PostgreSQL was learned in a specific conversation (an episodic memory), but once it's known, it doesn't need to be anchored to that conversation anymore — it becomes a stable fact (a semantic memory).

Implementing semantic memory well requires a mechanism for promoting episodic observations into stable factual knowledge. This is a non-trivial design problem. Not every observation from an episodic memory deserves to become a semantic fact. Some things are transient. Some are uncertain. The promotion process needs to handle these cases.

### Procedural Memory: How To Do Things

Procedural memory is knowledge of how to do things — skills and patterns rather than facts or experiences. In human cognition, this includes motor skills, habits, and learned workflows. You don't think about the mechanics of typing; you just type.

For AI developer tools, procedural memory is patterns and conventions. This team writes tests before code. Error handling in this codebase always includes structured logging with a specific format. Database queries go through a query builder, never raw SQL. API responses always wrap data in a specific envelope structure.

Procedural memory is qualitatively different from the other types. It's not about what happened or what is true — it's about how things are done. And for a coding assistant, this is extraordinarily important. A tool that doesn't know how things are done in a specific codebase will consistently suggest code that doesn't fit, triggering corrections that could have been avoided.

Extracting procedural memory from code is a different problem than extracting it from conversations. Code embeds patterns implicitly. The conventions are in the code, but they're not labeled as conventions — you have to recognize them as patterns by observing them repeatedly.

**Warning:** Don't conflate procedural memory with style preferences. Procedural memory is load-bearing: it affects whether suggested code actually works in context. Style preferences are cosmetic. Treating them with equal weight wastes storage and retrieval bandwidth on low-value information.

### How the Types Work Together

In practice, these three types of memory interact continuously. A developer reports a bug in the authentication service. The tool's episodic memory includes a session from six months ago where the authentication service was redesigned. The semantic memory includes the fact that authentication tokens expire after 24 hours and that the token refresh logic has a known edge case with timezone-aware datetimes. The procedural memory includes the pattern this team uses for writing authentication middleware.

With all three types available, the tool can give genuinely useful help. Without them, it gives generic advice about authentication bugs that might be technically correct and completely useless in context.

The three types also have different decay rates. Episodic memories become less relevant over time as projects evolve. Semantic facts can become stale when the codebase changes — the fact that the system uses PostgreSQL stops being accurate the day the team migrates to a different database. Procedural patterns are the most durable, though they too can change when teams make deliberate decisions to shift conventions.

This is why memory systems can't be static storage. They need to handle the dynamics of each type differently, and they need mechanisms for updating and invalidating stale memories. That problem gets its own chapter later.

### Mapping to Implementation

The practical implication of this taxonomy is that a memory system for an AI developer tool probably needs at least three distinct storage and retrieval mechanisms, not one unified store.

Episodic memory wants something like a log or journal with temporal ordering — structured records of interactions with timestamps and enough context to make them searchable. Vector similarity search works well here because retrieval is usually "what have we discussed that's related to this current topic."

Semantic memory wants something closer to a knowledge base — structured facts with clear provenance and explicit schema. A graph structure works well because facts often relate to each other (the authentication service depends on the user database, which uses PostgreSQL). Key-value storage works for simpler cases.

Procedural memory wants something like a pattern library — examples of how things are done in this codebase, indexed by context. These patterns are often best extracted from the codebase itself rather than from conversations, which makes code understanding tools an important part of the system.

The temptation is to use one storage mechanism for everything because it simplifies the architecture. Resist it. The access patterns and update dynamics for these three types are different enough that forcing them into a single store creates friction that shows up as poor retrieval quality.

```python
# Example: Distinguishing memory types at capture time
class MemoryClassifier:
    def classify(self, content: str, context: dict) -> MemoryType:
        """
        Route captured content to the appropriate memory store.
        Returns MemoryType.EPISODIC, SEMANTIC, or PROCEDURAL.
        """
        if self._is_decision_or_event(content, context):
            return MemoryType.EPISODIC
        elif self._is_stable_fact(content, context):
            return MemoryType.SEMANTIC
        elif self._is_pattern_or_convention(content, context):
            return MemoryType.PROCEDURAL
        else:
            return MemoryType.EPISODIC  # Default: store as episodic, promote later

    def _is_stable_fact(self, content: str, context: dict) -> bool:
        # Indicators: declarative sentences, present tense, architectural claims
        # "The system uses X", "Authentication is handled by Y", "The schema has Z"
        stable_fact_patterns = [
            r'\bthe system\b.*\buses\b',
            r'\bis (handled|managed|owned) by\b',
            r'\bthe schema\b',
            r'\bthe database\b.*\b(has|contains|stores)\b',
        ]
        return any(re.search(p, content, re.IGNORECASE)
                   for p in stable_fact_patterns)

    def _is_pattern_or_convention(self, content: str, context: dict) -> bool:
        # Indicators: always/never patterns, "we do X", team conventions
        convention_patterns = [
            r'\bwe always\b',
            r'\bwe never\b',
            r'\bthe convention\b',
            r'\bour pattern\b',
            r'\bstandard (way|approach|practice)\b',
        ]
        return any(re.search(p, content, re.IGNORECASE)
                   for p in convention_patterns)
```

This classification isn't perfect — it doesn't need to be. The goal is good-enough routing with a fallback to episodic (the most flexible store) when classification is uncertain.

### A Note on Prospective Memory

Human cognitive science includes a fourth type sometimes called prospective memory — memory of intended future actions. "Remember to check the cache invalidation logic when you touch the payment service." This maps to something like a task or reminder system for AI tools.

It's worth acknowledging but not overcomplicating. For most AI developer tools, prospective memory is better handled by explicit task tracking systems than by a general-purpose memory architecture. The blur happens when a developer tells the assistant something that's part instruction and part future constraint: "We're planning to migrate off Redis next quarter, so don't suggest anything that adds new Redis dependencies." That's both a semantic fact and a future constraint. Store it as semantic, flag it with a future-state marker, and let the retrieval system surface it when relevant.

---

**Key Takeaways**

1. Episodic memory records events and interactions. It's temporally ordered, specific, and degrades in relevance over time.
2. Semantic memory records stable facts about the project and domain. It's durable, queryable, and must be invalidated when facts change.
3. Procedural memory records patterns and conventions. It's the most stable type and most directly affects code suggestion quality.
4. The three types have different storage and retrieval requirements — don't force them into a single store.
5. Classification at capture time routes memories to the right store and enables different decay and retrieval strategies per type.

**Try This:** Take ten recent AI coding interactions and manually classify each as primarily episodic, semantic, or procedural. Notice which type dominates. That's the type your memory system most urgently needs.

---

## Chapter 3: Session Memory: Within-Conversation Recall

Session memory is the closest thing to solved in the memory problem space. Within a single conversation, a language model has access to everything that's been said — that's the context window. But "solved" overstates it. Session memory has real constraints that require active management, and the decisions made about within-session memory directly affect how much useful information makes it into cross-session storage.

### The Context Window as Working Memory

The context window is working memory for language models. It's fast — everything in it is immediately available to the model without retrieval. It's temporary — it disappears when the session ends. It's bounded — there's a hard limit, and pushing past it either truncates or degrades.

These properties map exactly to working memory in cognitive science. Human working memory holds seven plus or minus two chunks of information, processes them quickly, and loses them if not transferred to long-term storage. The context window holds N tokens, processes them without retrieval latency, and loses them at session end.

The implication is that session memory management should borrow from what cognitive science knows about working memory optimization. The most important information should be the most accessible. Old, low-relevance content should be compressed or discarded to make room. The transfer to long-term storage (cross-session persistence) should happen on significant events, not just at session end.

### Context Window Pressure

Context window pressure is what happens when the conversation gets long. In most current AI systems, the context window fills up and the oldest content gets truncated. This is brutal from a memory perspective — the oldest content often includes the most important context: the initial problem description, the architectural constraints established early in the session, the constraints the developer mentioned that make certain approaches impossible.

A smarter approach is selective compression. Rather than truncating old content by age, compress it by relevance. Content that established stable facts can be summarized to a few sentences. Content that was exploratory and led nowhere can be discarded entirely. Content that contains specific decisions or constraints should be preserved verbatim or with only light compression.

```python
# Selective compression strategy for context window management
class ContextCompressor:
    def compress_old_turns(
        self,
        turns: list[ConversationTurn],
        target_token_budget: int
    ) -> list[ConversationTurn]:
        """
        Compress older turns to free context window space.
        Preserves high-signal content, compresses low-signal.
        """
        if self._token_count(turns) <= target_token_budget:
            return turns

        # Score turns by signal value
        scored = [(turn, self._signal_score(turn)) for turn in turns]

        # Sort by (recency_weight * signal_score) — recent + high-signal wins
        compressed = []
        budget_used = 0

        for turn, score in sorted(scored, key=lambda x: x[1], reverse=True):
            if budget_used + turn.token_count <= target_token_budget:
                compressed.append(turn)
                budget_used += turn.token_count
            elif score > HIGH_SIGNAL_THRESHOLD:
                # Compress rather than discard
                summary = self._summarize_turn(turn)
                compressed.append(summary)
                budget_used += summary.token_count

        return sorted(compressed, key=lambda t: t.timestamp)

    def _signal_score(self, turn: ConversationTurn) -> float:
        score = 0.0

        # Decisions and constraints are high-signal
        if turn.contains_decision():
            score += 3.0
        if turn.contains_constraint():
            score += 2.5

        # Code that was accepted by the user is high-signal
        if turn.has_accepted_code():
            score += 2.0

        # Pure exploration that went nowhere is low-signal
        if turn.is_exploratory() and not turn.was_followed_up():
            score -= 1.0

        return score
```

This kind of selective compression keeps the most important context alive even in long sessions. The key insight is that not all content deserves equal preservation — compression budgets should be spent on the lowest-signal content first.

### Attention and the Position Bias Problem

Even within the context window, there's a retrieval problem. Models don't attend uniformly across their context. They tend to attend more strongly to content at the beginning (the system prompt, early context) and at the end (the most recent turns). Content in the middle gets less reliable attention.

This is the "lost in the middle" problem, and it has real consequences for session memory. If a developer mentions an important constraint in the middle of a long conversation, the model may effectively ignore it when responding to later questions — not because the information isn't there, but because attention doesn't reliably reach it.

There are two practical mitigations.

The first is periodic re-injection. When retrieving from the cross-session store, inject retrieved memories at the end of the context (near the current query) rather than the beginning. When important session-established facts are at risk of being lost in the middle, explicitly re-surface them near the current query.

The second is structured system prompts. Rather than letting important context drift into the middle of the conversation history, pin it to the system prompt — which reliably gets strong attention. When a developer establishes a critical constraint, the system should update the system prompt to include it explicitly.

```python
# Re-injecting critical context near the current query
class SessionContextManager:
    def prepare_context_for_query(
        self,
        query: str,
        conversation_history: list[ConversationTurn],
        memory_store: MemoryStore
    ) -> tuple[str, list[ConversationTurn]]:
        """
        Prepare context for the next model call.
        Returns updated system prompt and conversation history.
        """
        # Retrieve relevant memories from cross-session store
        relevant_memories = memory_store.retrieve(query, top_k=5)

        # Identify critical session-established facts that might be
        # buried in the middle of conversation history
        critical_session_facts = self._extract_critical_facts(
            conversation_history
        )

        # Build updated system prompt with pinned facts
        updated_system_prompt = self._build_system_prompt(
            base_prompt=self.base_system_prompt,
            pinned_facts=critical_session_facts,
            retrieved_memories=relevant_memories,
            query=query
        )

        return updated_system_prompt, conversation_history
```

**Key Insight:** The context window is not a neutral container. Position matters. Content at the edges gets more attention than content in the middle. Managing what lands where is part of session memory engineering.

### Turn Summarization vs. Turn Preservation

There's a design choice in session memory between summarizing turns as they age and preserving them verbatim until forced to truncate.

Summarization loses information but saves tokens. Verbatim preservation is accurate but expensive. The right answer depends on the task type.

For code-focused conversations, verbatim preservation is almost always worth the cost for turns that contain actual code. Code is dense with specific information — variable names, function signatures, algorithmic structure — that compression routinely destroys. A summary of a code discussion that doesn't include the code itself is often useless.

For conversational planning discussions — "what's the right approach for this problem?" — summarization is appropriate. The specific words matter less than the conclusions.

A practical heuristic: preserve any turn that contains code, a decision, or a constraint, verbatim. Summarize everything else.

### Handoff to Cross-Session Storage

The session ends. What happens to the session memory?

This transition — from session memory to cross-session storage — is where most AI developer tools fall down. Either they don't persist anything (stateless), or they dump the entire conversation transcript into storage (expensive and poorly organized), or they do some basic summarization that strips out the specific technical details that make the memory actually useful.

Good session-end handoff does several things:

It extracts decisions, constraints, and key facts from the session and promotes them into the semantic memory store. "We decided to use a queue-based approach rather than polling" is a semantic fact worth preserving, not buried in a transcript.

It identifies code artifacts produced during the session — functions written, bugs fixed, patterns established — and indexes them into the procedural memory store.

It creates a compact episodic record of the session: what was worked on, what was resolved, what was left open. Not a transcript — a summary that preserves the important arc of the conversation.

It triggers index updates. If the codebase was changed during the session, any code-understanding indexes need to be refreshed to reflect the changes.

```python
# Session-end memory extraction
class SessionHandoff:
    def __init__(self, llm_client, memory_store):
        self.llm = llm_client
        self.store = memory_store

    async def process_session_end(
        self,
        session: Session
    ) -> SessionHandoffResult:
        # Run extractions in parallel
        decisions, code_artifacts, episode = await asyncio.gather(
            self._extract_decisions(session),
            self._extract_code_artifacts(session),
            self._create_episode_record(session)
        )

        # Write to appropriate stores
        await asyncio.gather(
            self.store.semantic.batch_write(decisions),
            self.store.procedural.batch_write(code_artifacts),
            self.store.episodic.write(episode)
        )

        return SessionHandoffResult(
            decisions_captured=len(decisions),
            artifacts_captured=len(code_artifacts),
            episode_id=episode.id
        )

    async def _extract_decisions(self, session: Session) -> list[SemanticFact]:
        prompt = f"""
        Review this conversation and extract:
        1. Architectural decisions made
        2. Technical constraints established
        3. Facts about the codebase or project that were confirmed or discovered

        Format each as a structured fact with: content, confidence (0-1), category.
        Skip exploratory discussion that didn't conclude in a decision.

        Conversation:
        {session.transcript_summary}
        """
        response = await self.llm.complete(prompt)
        return self._parse_facts(response)
```

The handoff is the bridge between session memory and cross-session memory. Most tools either ignore it or oversimplify it. Getting it right is where the compound value of a memory system begins to accumulate.

---

**Key Takeaways**

1. The context window is working memory — fast, bounded, and temporary. It must be actively managed, not passively allowed to fill.
2. Selective compression by signal value outperforms naive truncation by age.
3. Position within the context window affects model attention — critical content should be pinned near the system prompt or re-injected near the current query.
4. Code turns should be preserved verbatim; conversational turns can be summarized.
5. Session-end handoff is the critical bridge between session memory and cross-session storage. Most tools neglect it.

**Try This:** Instrument a single long AI session. At regular intervals, ask the model to summarize what it knows about the project constraints. Compare those summaries as the session progresses. Notice when and where important context disappears — that's where your session memory system needs the most work.

---

## Chapter 4: Cross-Session Persistence

Cross-session persistence is the hard part. Getting information out of one session and reliably into the next requires solving engineering problems that have no clean analog in the context-window-as-memory model. This chapter covers the architectural options, the trade-offs between them, and what actually matters in practice.

### What Persistence Means

Persistence, in this context, means that information survives the end of a session and is available at the start of the next one — reliably, quickly, and without requiring the user to re-provide it.

This seems obvious but the operational requirements are worth stating explicitly. The information must be stored somewhere that survives process termination, server restarts, and deployment updates. It must be indexed in a way that supports retrieval by topic or relevance, not just by timestamp or ID. It must be available fast enough that injecting it into a new session doesn't introduce noticeable latency. And it must be kept current as the underlying project evolves.

None of these are hard problems individually. Together, they require choosing an architecture that doesn't optimize one at the expense of the others.

### Storage Architecture Options

**Plain file storage** is the simplest option. Write memory records to files in a structured format (JSON, YAML, Markdown). Read them back at session start.

This works at small scale and has real advantages: it's transparent, it's version-controllable, it's portable, and it doesn't require external infrastructure. The developer can read their own memory files, which builds trust. The disadvantage is retrieval: finding the right file at the right moment requires either loading everything (expensive for large histories) or maintaining a secondary index (complexity creep).

```
memory/
├── MEMORY.md              # Index file — loaded every session
├── semantic/
│   ├── project_facts.md   # Stable facts about the project
│   ├── team_conventions.md
│   └── architecture.md
├── episodic/
│   ├── 2026-01-15_auth_refactor.md
│   ├── 2026-01-22_payment_bug.md
│   └── 2026-02-03_db_migration.md
└── procedural/
    ├── error_handling_patterns.md
    ├── test_conventions.md
    └── api_response_patterns.md
```

**Vector database storage** enables semantic retrieval — finding memories by meaning rather than by keyword or exact path. The trade-off is infrastructure dependency and opacity: the developer can't easily inspect what's stored or debug retrieval problems.

Chroma, Qdrant, and Weaviate are the common choices for locally-hosted vector storage. Pinecone and similar managed services work for cloud-hosted tools. The choice between local and managed depends on whether the tool needs to work offline and what the operational support model looks like.

**Relational storage** (SQLite for local, PostgreSQL for hosted) is appropriate for structured semantic memories — facts with clear schemas, relationships between entities, queries that benefit from filtering by category or recency.

**Graph databases** are appropriate when the relationships between pieces of memory are as important as the memories themselves. Which architectural decisions are related to which modules? Which conventions apply to which parts of the codebase? Neo4j and similar graph databases make these relationship queries efficient in ways that relational databases don't.

### The Practical Architecture

For most AI developer tools, the right architecture combines two or three storage mechanisms, not one.

A common pattern: vector storage for semantic and episodic retrieval (find memories by topic), relational storage for structured metadata (filter by recency, category, confidence), and file storage for the raw content and audit trail.

```
Query: "What do we know about the authentication service?"
   ↓
Vector similarity search against memory embeddings
   → Returns top-K semantically relevant memories
   ↓
Metadata filter: recency > 90 days AND confidence > 0.7
   → Filters to high-confidence, recent memories
   ↓
Load full content from file store
   → Returns complete memory records
   ↓
Inject into context window
```

This layered approach is slightly more complex to implement but provides much better retrieval quality than any single store alone.

**Key Insight:** The storage mechanism determines retrieval quality. Semantic retrieval requires vector search. Structural queries require relational storage. You can't get both from either alone.

### Embedding Strategy

Vector search requires embeddings — numeric representations of memory content that capture semantic meaning. The quality of these embeddings directly determines retrieval quality.

General-purpose embedding models (OpenAI's text-embedding-3-small, Cohere's embed-v3, or open-source alternatives like BGE or E5) work reasonably well for most memory content. Code-specific embedding models perform better when memories contain substantial code.

A few specific decisions affect embedding quality significantly.

First, what to embed. Embedding full memory records — including metadata, timestamps, and contextual notes — often produces worse retrieval than embedding just the semantic core content. A memory record might say "Decision made on 2026-01-15: use event sourcing for the order service because of audit requirements." Embedding the full sentence with the date included makes the embedding partially about the date, which hurts retrieval when queries don't include dates.

Second, chunking strategy. For longer memory records, embedding the full record produces an embedding that represents the average meaning across all content, which dilutes the specific concepts. Chunk larger records and embed chunks separately. This increases storage and query cost but significantly improves retrieval precision.

Third, embedding freshness. When memory content is updated, the embeddings must be regenerated. A stale embedding pointing to updated content will return the right record but the semantic match quality degrades. Track embedding generation timestamps and queue re-embedding when content changes.

```python
# Memory indexing with embedding strategy
class MemoryIndexer:
    def __init__(self, embedding_model, vector_store):
        self.embedder = embedding_model
        self.store = vector_store

    def index_memory(self, memory: Memory) -> list[str]:
        """
        Index a memory record. Returns list of chunk IDs created.
        """
        # Extract embeddable content (strip metadata, timestamps)
        core_content = self._extract_core_content(memory)

        # Chunk if content exceeds threshold
        if len(core_content) > CHUNK_THRESHOLD_CHARS:
            chunks = self._chunk_content(core_content, overlap=50)
        else:
            chunks = [core_content]

        chunk_ids = []
        for i, chunk in enumerate(chunks):
            embedding = self.embedder.embed(chunk)
            chunk_id = f"{memory.id}_chunk_{i}"

            self.store.upsert(
                id=chunk_id,
                embedding=embedding,
                metadata={
                    "memory_id": memory.id,
                    "memory_type": memory.type.value,
                    "created_at": memory.created_at.isoformat(),
                    "confidence": memory.confidence,
                    "chunk_index": i,
                    "total_chunks": len(chunks),
                }
            )
            chunk_ids.append(chunk_id)

        return chunk_ids
```

### Retrieval at Session Start

When a new session starts, the memory system needs to prime the context with relevant background. This is different from query-time retrieval — it happens before the user has asked anything, based on minimal context.

What's available at session start: the project identifier, the current working directory, recently modified files, and possibly a brief description of what the developer is working on.

A practical approach: retrieve a small set of high-confidence semantic memories (stable facts about the project, key architectural decisions) that should always be available, plus a recency-weighted sample of recent episodic memories (what was worked on recently). This gives the model baseline project knowledge plus continuity with recent work.

```python
# Session initialization context
async def initialize_session(
    project_id: str,
    recent_files: list[str],
    memory_store: MemoryStore
) -> SessionContext:

    # Always-on: stable project facts
    stable_facts = await memory_store.semantic.get_high_confidence(
        project_id=project_id,
        min_confidence=0.85,
        limit=10
    )

    # Recent work context
    recent_episodes = await memory_store.episodic.get_recent(
        project_id=project_id,
        days=14,
        limit=5
    )

    # File-specific context: what do we know about recently touched files?
    file_memories = await memory_store.get_for_files(
        file_paths=recent_files,
        project_id=project_id
    )

    return SessionContext(
        stable_facts=stable_facts,
        recent_episodes=recent_episodes,
        file_memories=file_memories
    )
```

**Warning:** Don't inject everything at session start. Dumping the entire memory store into the context window at initialization defeats the purpose of having a retrieval system. Inject the minimum needed to establish baseline context, then retrieve more dynamically as queries come in.

### Identity and Multi-Project Management

AI developer tools often need to handle multiple projects per user, and sometimes multiple users per deployment. The memory system must cleanly separate memories by project (and by user in multi-user deployments) while still supporting cross-project retrieval when the user is explicitly working across projects.

Project scoping is typically done at the storage level: every memory record includes a project identifier, and retrieval queries default to filtering by the current project. Cross-project retrieval is an explicit capability, not the default behavior.

User scoping in multi-user deployments requires careful attention to data isolation. Memory records often contain sensitive project information. Storage and retrieval must enforce user-level access controls. This is not a memory-specific problem — it's a standard access control problem — but it has to be designed in from the beginning, not bolted on later.

---

**Key Takeaways**

1. Cross-session persistence requires choosing storage that matches the retrieval pattern: vector search for semantic retrieval, relational for structured queries, file storage for transparency and portability.
2. Most practical systems combine two or three storage mechanisms rather than relying on one.
3. Embedding quality determines retrieval quality. Embed semantic core content, not full records including metadata. Re-embed when content changes.
4. Session initialization should inject minimal context — enough for baseline awareness, not everything in the store.
5. Project and user scoping must be designed in from the start. They cannot be retrofitted cleanly.

**Try This:** Design the schema for a cross-session memory store for your current project. Before writing any code, answer: how will you retrieve by topic? How will you handle updates and deletions? How will you scope by project and user? Most architectural problems become visible at schema design time.

---

## Chapter 5: Code as Memory: What the Codebase Knows

The codebase is the largest and richest memory artifact available to an AI developer tool — and it's almost always underused. Every function, class, module, comment, test, commit message, and documentation file encodes knowledge about how the system works, what decisions were made, and how things are supposed to be done. The codebase is the team's memory, externalized.

The problem is that this knowledge is implicit and unstructured. It's in the code, but reading code to extract knowledge is not the same as having that knowledge in a retrievable form.

### The Codebase as Accumulated Decision History

Look at any mature codebase and you're looking at the accumulated history of thousands of decisions. Why is the authentication module structured the way it is? Because three years ago there was a security incident, and the team refactored to separate token validation from session management. Why does the payment service have that unusual retry logic? Because the payment gateway has a specific behavior on timeout that caused duplicate charges, and the retry logic is tuned to avoid triggering it.

None of that context is in the code itself, usually. But the code's structure — the separation of concerns, the unusual retry behavior — is a clue that decisions were made. A tool that can read the code and recognize the structure is one step toward surfacing those decisions. A tool that can also read the git history and find the commit that introduced that structure can surface the actual decision.

This is what "code as memory" means. The codebase isn't just source code to be understood on demand — it's a persistent record of what has been built, decided, and learned. A memory system for an AI developer tool that doesn't tap into this record is leaving most of the available signal on the table.

### Static Analysis for Knowledge Extraction

Static analysis tools produce structured representations of code that are significantly more useful for memory purposes than raw source text.

Abstract Syntax Trees (ASTs) expose the logical structure of code independently of whitespace and formatting. From an AST, you can extract: every function definition and its signature, every class and its methods, every import and dependency, every docstring and comment.

```python
import ast
import pathlib

class CodeKnowledgeExtractor:
    def extract_from_file(self, file_path: str) -> FileKnowledge:
        source = pathlib.Path(file_path).read_text()
        tree = ast.parse(source)

        visitor = KnowledgeVisitor()
        visitor.visit(tree)

        return FileKnowledge(
            file_path=file_path,
            functions=visitor.functions,
            classes=visitor.classes,
            imports=visitor.imports,
            docstrings=visitor.docstrings,
            complexity_signals=visitor.complexity_signals
        )

class KnowledgeVisitor(ast.NodeVisitor):
    def __init__(self):
        self.functions = []
        self.classes = []
        self.imports = []
        self.docstrings = []
        self.complexity_signals = []

    def visit_FunctionDef(self, node):
        func_info = {
            "name": node.name,
            "line": node.lineno,
            "args": [arg.arg for arg in node.args.args],
            "returns": ast.unparse(node.returns) if node.returns else None,
            "docstring": ast.get_docstring(node),
            "is_async": isinstance(node, ast.AsyncFunctionDef),
            "decorator_count": len(node.decorator_list),
        }
        self.functions.append(func_info)

        # High complexity signals are worth flagging
        if len(node.body) > 50 or self._has_deep_nesting(node):
            self.complexity_signals.append({
                "type": "high_complexity",
                "location": f"{node.name}:{node.lineno}",
                "signal": "large function or deep nesting"
            })

        self.generic_visit(node)
```

This structured extraction is the foundation for code-as-memory. The extracted knowledge can be indexed for retrieval, used to answer structural questions about the codebase, and combined with semantic memory to produce contextually aware responses.

### Git History as Episodic Memory

The git history is episodic memory for the codebase. Every commit records what changed, when, who made the change, and (in the commit message) why. This is exactly the kind of episodic memory that makes a coding assistant useful — not just what the code looks like now, but how it got here.

Mining git history for useful memories requires some processing. Raw commit messages vary wildly in quality. A commit message saying "fix bug" is episodic memory with almost no signal. A message saying "fix race condition in token refresh — concurrent requests during token expiry window could issue duplicate tokens" is rich episodic memory worth preserving.

```python
import subprocess
import json

class GitHistoryMiner:
    def extract_high_signal_commits(
        self,
        repo_path: str,
        min_message_length: int = 50
    ) -> list[CommitMemory]:
        """
        Extract commits with substantive commit messages.
        Short messages (fixup, merge, etc.) are filtered out.
        """
        result = subprocess.run(
            ["git", "log", "--format=%H|%an|%at|%s|%b", "--", "."],
            cwd=repo_path,
            capture_output=True,
            text=True
        )

        memories = []
        for line in result.stdout.strip().split("\n"):
            parts = line.split("|", 4)
            if len(parts) < 5:
                continue

            commit_hash, author, timestamp, subject, body = parts
            full_message = f"{subject}\n{body}".strip()

            if len(full_message) < min_message_length:
                continue  # Low-signal commit

            # Get files changed
            files = self._get_changed_files(repo_path, commit_hash)

            memories.append(CommitMemory(
                commit_hash=commit_hash,
                author=author,
                timestamp=int(timestamp),
                message=full_message,
                files_changed=files,
                memory_type=MemoryType.EPISODIC,
            ))

        return memories
```

Not all of git history needs to be indexed — that would be expensive and full of low-signal commits. The heuristic above (filter by message length) is crude but effective. Better heuristics add: filtering for commits that touch specific high-importance files, commits that contain bug-related keywords, and commits with a high changed-lines-to-files ratio (suggesting significant rewrites rather than routine changes).

### Test Files as Documented Behavior

Test files are an underappreciated source of knowledge. Well-written tests document intended behavior, edge cases, and invariants in a way that production code doesn't. A test named `test_payment_fails_gracefully_when_gateway_timeout_exceeds_30s` tells you something important about how the system is supposed to behave that isn't obvious from reading the payment processing code.

Extracting knowledge from tests is structurally similar to extracting knowledge from production code, with different emphasis: test names are the knowledge signals (they describe expected behavior), and test structure (what's being set up, what's being asserted) adds detail.

```python
def extract_test_knowledge(self, test_file_path: str) -> list[BehaviorFact]:
    """
    Extract behavioral facts from test files.
    Each test is a documented expectation about system behavior.
    """
    source = pathlib.Path(test_file_path).read_text()
    tree = ast.parse(source)

    facts = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            if node.name.startswith("test_"):
                # Test name describes expected behavior
                behavior_desc = self._parse_test_name(node.name)

                # Extract what's being tested (setup + assertion)
                context = self._extract_test_context(node)

                facts.append(BehaviorFact(
                    description=behavior_desc,
                    context=context,
                    source_file=test_file_path,
                    source_line=node.lineno,
                    confidence=0.8,  # Tests are high-confidence behavior docs
                ))

    return facts
```

### Inline Comments and Docstrings as Semantic Memory

Comments in code are explicit knowledge — the developer took the time to write something down because the code alone doesn't explain it. This is exactly the kind of content that deserves to be in semantic memory.

The challenge is distinguishing high-value comments from low-value ones. "Increment counter" above `i += 1` is not a memory worth preserving. "Workaround for timezone bug in Python's datetime.astimezone() before 3.9 — see issue #1842" is worth preserving.

Signal indicators for high-value comments:
- References to issue trackers, PR numbers, or external resources
- "Workaround", "hack", "NOTE", "TODO", "FIXME" markers
- Explanations of non-obvious behavior or constraints
- Performance justifications ("this avoids N+1 query by...")
- Safety constraints ("never call this outside a transaction")

```python
IMPORTANT_COMMENT_PATTERNS = [
    r'\b(workaround|hack|NOTE|FIXME|WARNING|IMPORTANT|CRITICAL)\b',
    r'#\s*\d+|issue|PR|ticket',  # Issue/PR references
    r'\bavoid\b.*\b(N\+1|race\s+condition|deadlock)\b',
    r'\bnever\b|\balways\b|\bmust\b',  # Strong invariant language
    r'https?://',  # Links to documentation or issues
]
```

**Warning:** Don't index stale comments. Comments that reference `TODO: fix in version 2.0` from three years ago and version 4.0 is the current release are noise, not signal. Apply recency weighting to comment-derived memories, and flag comments that reference specific version numbers for freshness review.

### Keeping Code Memory Current

The codebase changes constantly. Any code-as-memory system must handle these changes without requiring full re-indexing every time.

The practical approach is incremental indexing triggered by file changes. When a file is modified — detected via filesystem events or post-commit hooks — the knowledge extracted from that file is re-extracted and the old records are invalidated.

```bash
# post-commit hook: trigger incremental index update
#!/bin/bash
changed_files=$(git diff-tree --no-commit-id -r --name-only HEAD)
echo "$changed_files" | while read file; do
    if [[ "$file" == *.py ]] || [[ "$file" == *.ts ]] || [[ "$file" == *.go ]]; then
        python3 -m memory_indexer update --file "$file"
    fi
done
```

Full re-indexing should be triggered less frequently: on initial setup, after major refactors that touch many files, or on a scheduled cadence (weekly, for instance) to catch drift between incremental updates.

---

**Key Takeaways**

1. The codebase is the largest and richest memory artifact available — and it's almost always underused. It encodes decisions, constraints, and patterns implicitly.
2. Static analysis (ASTs) converts raw source code into structured, indexable knowledge about functions, classes, and dependencies.
3. Git history is episodic memory for the codebase. High-signal commits (substantive messages, bug fixes, significant rewrites) deserve extraction and indexing.
4. Test files document intended behavior. Test names are behavioral facts worth indexing.
5. Incremental indexing (triggered by file changes) is more practical than full re-indexing for keeping code memory current.

**Try This:** Run a static analysis extraction pass on your current project. Count the number of distinct facts extractable — function signatures, docstrings, high-signal comments, test descriptions. That number approximates the size of the code-knowledge corpus you could index.

---

## Chapter 6: Knowledge Graphs for Code Understanding

A list of facts about a codebase is useful. A graph of how those facts relate to each other is significantly more useful. Knowledge graphs add the connective tissue — the edges between nodes — that allow a system to reason about relationships rather than just retrieve isolated facts.

For code understanding specifically, the graph structure is not optional. Code is inherently relational: modules depend on modules, functions call functions, classes inherit from classes, services communicate with services. The moment you flatten code knowledge into independent facts, you lose the structural information that often matters most.

### What Goes in a Code Knowledge Graph

A code knowledge graph has entities (nodes) and relationships (edges). The right entities depend on the codebase and the use cases, but a practical baseline includes:

**Entities:**
- Files and modules
- Classes and interfaces
- Functions and methods
- Configuration values and constants
- External dependencies and services
- Data types and schemas
- Team members (for blame/ownership)

**Relationships:**
- `IMPORTS` / `DEPENDS_ON` — module dependencies
- `CALLS` — function call relationships
- `INHERITS_FROM` / `IMPLEMENTS` — class hierarchies
- `DEFINED_IN` — function/class to file
- `OWNED_BY` — file to team or person
- `DOCUMENTED_BY` — code to test or documentation
- `MODIFIED_RECENTLY` — temporal relationship for staleness
- `FREQUENTLY_CHANGED_TOGETHER` — co-change relationships from git history

The co-change relationship is worth highlighting. Files that frequently change together are likely coupled — either explicitly (they share an interface) or implicitly (they implement related logic). This relationship isn't visible from a static snapshot of the code; it emerges from git history. But it's valuable for predicting blast radius: if you're changing file A, and file A frequently changes with file B, you should probably look at file B too.

### Building the Graph

Graph construction starts with static analysis (covered in the previous chapter) and adds relationship extraction.

```python
from typing import Generator
import networkx as nx

class CodeKnowledgeGraph:
    def __init__(self):
        self.graph = nx.DiGraph()

    def build_from_project(self, project_path: str) -> None:
        extractor = CodeKnowledgeExtractor()

        # Build nodes from all source files
        for py_file in pathlib.Path(project_path).rglob("*.py"):
            knowledge = extractor.extract_from_file(str(py_file))
            self._add_file_nodes(knowledge)

        # Build edges from import analysis
        self._build_import_edges(project_path)

        # Build call graph edges
        self._build_call_graph(project_path)

        # Build co-change edges from git history
        miner = GitHistoryMiner()
        self._build_cochange_edges(project_path, miner)

    def _build_import_edges(self, project_path: str) -> None:
        for py_file in pathlib.Path(project_path).rglob("*.py"):
            source = py_file.read_text()
            tree = ast.parse(source)

            for node in ast.walk(tree):
                if isinstance(node, ast.Import):
                    for alias in node.names:
                        self.graph.add_edge(
                            str(py_file),
                            alias.name,
                            relation="IMPORTS"
                        )
                elif isinstance(node, ast.ImportFrom):
                    if node.module:
                        self.graph.add_edge(
                            str(py_file),
                            node.module,
                            relation="IMPORTS_FROM"
                        )

    def get_impact_radius(
        self,
        file_path: str,
        max_depth: int = 3
    ) -> set[str]:
        """
        Return all files that depend on this file, up to max_depth hops.
        Used to predict blast radius of changes.
        """
        affected = set()
        queue = [(file_path, 0)]

        while queue:
            current, depth = queue.pop(0)
            if depth >= max_depth:
                continue

            # Find all files that import from current
            for predecessor in self.graph.predecessors(current):
                if predecessor not in affected:
                    affected.add(predecessor)
                    queue.append((predecessor, depth + 1))

        return affected
```

**Key Insight:** Impact radius analysis — knowing which files are likely affected by a change in a given file — is one of the most practically useful outputs of a code knowledge graph. It turns "I'm editing this function" into "here are 12 other files that probably need review."

### Graph Embeddings for Semantic Search

Raw graph traversal answers structural questions: what imports this module, what does this function call. Semantic questions — "what part of the codebase handles retry logic" — require a different approach.

One approach is embedding the graph: creating a numeric representation of each node that captures both its content and its position in the graph. Node2Vec and similar graph embedding algorithms walk the graph to generate training data, then learn embeddings that place similar nodes (by structure and content) near each other in the embedding space.

A simpler approach that works well in practice: embed the content of each node (the function's code, docstring, and name) using a standard text embedding model, then augment the retrieval result with graph neighbors. When a retrieval returns a function, automatically include its callers, callees, and sibling functions in the same class as additional context.

```python
class GraphAugmentedRetrieval:
    def __init__(self, vector_store, knowledge_graph):
        self.vector_store = vector_store
        self.graph = knowledge_graph

    def retrieve(
        self,
        query: str,
        top_k: int = 5,
        graph_hops: int = 1
    ) -> list[RetrievalResult]:
        # Semantic retrieval from vector store
        base_results = self.vector_store.query(query, top_k=top_k)

        # Augment with graph neighbors
        augmented = set()
        for result in base_results:
            augmented.add(result.node_id)

            # Add direct neighbors
            neighbors = self.graph.get_neighbors(
                result.node_id,
                max_hops=graph_hops
            )
            augmented.update(neighbors)

        # Fetch full content for all nodes
        return [
            self._fetch_node_content(node_id)
            for node_id in augmented
        ]
```

This graph-augmented retrieval gives the model context about a function plus the functions that call it and the functions it calls. For debugging and refactoring tasks, this surrounding context is often as important as the function itself.

### Schema Graphs for Data Understanding

Code knowledge graphs typically focus on functions and modules. Data schemas deserve their own treatment.

In any non-trivial system, data models are the load-bearing structure. A query that touches the wrong table, a response that uses the wrong field name, an assumption about data types that doesn't match the actual schema — these mistakes are common and expensive. A tool that knows the schema produces fewer of these mistakes.

Schema knowledge includes: table and column definitions, relationships (foreign keys, joins), validation constraints, nullable vs. required fields, and index definitions (which often tell you a lot about query patterns).

```python
# Extract schema knowledge from SQLAlchemy models
import sqlalchemy
from sqlalchemy import inspect

class SchemaKnowledgeExtractor:
    def extract_from_engine(
        self,
        engine: sqlalchemy.Engine
    ) -> list[SchemaFact]:
        inspector = inspect(engine)
        facts = []

        for table_name in inspector.get_table_names():
            columns = inspector.get_columns(table_name)
            foreign_keys = inspector.get_foreign_keys(table_name)
            indexes = inspector.get_indexes(table_name)

            # Table-level fact
            facts.append(SchemaFact(
                entity=table_name,
                fact_type="table_structure",
                content={
                    "columns": columns,
                    "foreign_keys": foreign_keys,
                    "indexes": indexes
                }
            ))

            # Column-level facts for non-obvious columns
            for col in columns:
                if col.get("comment") or not col["nullable"]:
                    facts.append(SchemaFact(
                        entity=f"{table_name}.{col['name']}",
                        fact_type="column_constraint",
                        content=col
                    ))

        return facts
```

### Temporal Graphs: How the Graph Changes Over Time

A snapshot of the code knowledge graph at one point in time is useful. The history of how the graph has changed is more useful for understanding why things are the way they are.

Temporal graph analysis can answer questions like: which parts of the codebase are actively changing versus stable? Which modules have had the most structural churn? Which functions have had the most bug-related commits?

This is best approached not as a full temporal graph database but as a lightweight overlay: record graph diffs at each indexed commit, and use those diffs to annotate nodes with stability metrics.

```python
def compute_stability_score(
    node_id: str,
    graph_history: list[GraphSnapshot],
    lookback_days: int = 90
) -> float:
    """
    Stability score: 1.0 = never changed, 0.0 = changes constantly.
    Based on structural change frequency over the lookback period.
    """
    recent_snapshots = [
        s for s in graph_history
        if s.age_days <= lookback_days
    ]

    if not recent_snapshots:
        return 1.0  # No history = assume stable

    change_count = sum(
        1 for s in recent_snapshots
        if node_id in s.changed_nodes
    )

    return 1.0 - (change_count / len(recent_snapshots))
```

Stability scores feed back into memory management — high-stability nodes can be indexed less frequently, and their memories are less likely to become stale. High-churn nodes need frequent re-indexing and their associated memories need more aggressive expiry.

**Warning:** Large codebases produce large graphs. A million-node graph is not unusual for an enterprise codebase. At that scale, in-memory graph processing becomes impractical. Plan for disk-backed graph storage (graph databases, or at minimum serialized adjacency lists) before you hit memory limits.

---

**Key Takeaways**

1. Knowledge graphs add relationships between facts — making it possible to reason about code structure, not just retrieve isolated facts.
2. Co-change relationships from git history surface implicit coupling that static analysis misses.
3. Graph-augmented retrieval (semantic search + graph neighbors) provides better context for debugging and refactoring than either approach alone.
4. Schema knowledge is essential for avoiding data-model mistakes in suggested code.
5. Stability scores from temporal graph analysis inform memory refresh priorities.

**Try This:** For your current project, build a simple import graph using Python's AST or your language's equivalent. Visualize it. The clusters of highly-connected modules reveal your actual architecture — often different from what you'd describe on a whiteboard.

---

## Chapter 7: Retrieval-Augmented Memory

Retrieval-augmented generation (RAG) was developed to let language models answer questions about information that wasn't in their training data. The same technique, applied to developer tool memory, is what makes the memory system actually usable at inference time. Without retrieval, you have a storage system. With retrieval, you have a memory system.

### The Retrieval Problem for Developer Tools

Retrieval for developer tools has requirements that differ from typical RAG applications.

Speed matters more. A document retrieval system that takes two seconds to return results is acceptable. A coding assistant that pauses for two seconds after every keystroke is not. The retrieval pipeline must be fast enough to run on every query without the user noticing.

Precision matters more than recall. In most RAG applications, it's acceptable to return ten relevant documents and let the model figure out which ones matter. In a coding context, irrelevant code in the context window doesn't just fail to help — it actively misleads the model. A retrieved function from the wrong part of the codebase can cause the model to produce code that looks plausible but doesn't fit.

Code-specific understanding matters. Semantic similarity for code is not the same as semantic similarity for natural language. Two functions can be semantically similar as English prose (they both "handle payment processing") while being quite different in their actual code structure and relevance to a specific query.

### Hybrid Search: Combining Dense and Sparse Retrieval

Dense retrieval uses vector embeddings and cosine similarity. It's good at finding semantically related content — things that mean the same thing expressed differently. It's bad at finding exact matches for specific identifiers: the function `process_payment`, the class `TokenRefreshManager`, the configuration key `MAX_RETRY_ATTEMPTS`.

Sparse retrieval (BM25 or TF-IDF) uses keyword matching. It's excellent at finding exact identifiers but doesn't generalize to semantic similarity. If the query says "token refresh" but the code says "access token renewal," sparse retrieval may miss it.

Hybrid search combines both. Reciprocal Rank Fusion (RRF) is the standard method for combining rankings from multiple retrievers without needing to calibrate score scales between them.

```python
def reciprocal_rank_fusion(
    rankings: list[list[str]],
    k: int = 60
) -> list[tuple[str, float]]:
    """
    Combine multiple ranked lists using Reciprocal Rank Fusion.

    Args:
        rankings: List of ranked document ID lists (from different retrievers)
        k: Constant to prevent very high scores for top-ranked documents

    Returns:
        Merged ranking as (doc_id, score) tuples, sorted by score descending
    """
    scores: dict[str, float] = {}

    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id not in scores:
                scores[doc_id] = 0.0
            scores[doc_id] += 1.0 / (k + rank)

    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

class HybridRetriever:
    def __init__(self, vector_store, bm25_index):
        self.vector_store = vector_store
        self.bm25 = bm25_index

    def retrieve(
        self,
        query: str,
        top_k: int = 10,
        candidate_pool: int = 50
    ) -> list[RetrievalResult]:
        # Dense retrieval
        dense_results = self.vector_store.query(
            query, top_k=candidate_pool
        )
        dense_ids = [r.id for r in dense_results]

        # Sparse retrieval
        sparse_results = self.bm25.query(
            query, top_k=candidate_pool
        )
        sparse_ids = [r.id for r in sparse_results]

        # Fuse rankings
        fused = reciprocal_rank_fusion([dense_ids, sparse_ids])

        # Take top_k from fused ranking
        top_ids = [doc_id for doc_id, _ in fused[:top_k]]

        # Fetch full content
        return self._fetch_results(top_ids)
```

**Key Insight:** For developer tool memory retrieval, hybrid search is almost always better than either dense or sparse alone. Developers query with a mix of semantic intent and specific identifiers. Only hybrid handles both well.

### Contextual Retrieval: What Query to Use

A naive retrieval implementation uses the user's message as the query. For simple questions ("how does the payment service handle retries?"), this works fine. For complex conversational contexts, the message alone is a poor query.

"It keeps failing on the second try" is a real developer message during a debugging session. Out of context, it's nearly unsearchable. In context — the conversation is about the retry logic in the payment service, and the developer has been describing a specific error — it should retrieve the retry implementation, recent bug fixes in that area, and any memory records about known issues with the payment gateway.

Query rewriting — using the model to reformulate the raw message into a better search query — improves retrieval quality significantly.

```python
async def rewrite_query_for_retrieval(
    user_message: str,
    conversation_context: str,
    llm_client
) -> list[str]:
    """
    Generate multiple search queries from a user message.
    Returns 2-3 queries that together cover the search space.
    """
    prompt = f"""
    You are helping a developer get relevant code context.

    Conversation context (last few turns):
    {conversation_context}

    Current message: {user_message}

    Generate 2-3 specific search queries that would retrieve relevant code
    and memory records. Focus on:
    - Specific function/class names that might be involved
    - The technical concept being discussed
    - Any error patterns or behaviors mentioned

    Return only the queries, one per line. No explanations.
    """

    response = await llm_client.complete(prompt, max_tokens=200)
    queries = [q.strip() for q in response.strip().split("\n") if q.strip()]
    return queries[:3]  # Cap at 3 queries
```

Running retrieval against multiple rewritten queries, then fusing the results with RRF, consistently outperforms single-query retrieval. The cost is 2-3x more embedding lookups — usually still well within latency budgets.

### Re-Ranking: Filtering the Candidate Set

First-stage retrieval (vector search or BM25) optimizes for recall — finding candidates that might be relevant. Re-ranking optimizes for precision — identifying which candidates are actually most relevant to the specific query in context.

Cross-encoder re-rankers are the standard approach: a model that takes a (query, document) pair and outputs a relevance score. Cross-encoders are slower than bi-encoders (which is what vector search uses) because they process the pair jointly rather than pre-computing document embeddings. But for a candidate set of 20-50 documents, even a relatively slow cross-encoder is fast enough.

```python
class CrossEncoderReranker:
    def __init__(self, model_name: str = "cross-encoder/ms-marco-MiniLM-L-6-v2"):
        from sentence_transformers import CrossEncoder
        self.model = CrossEncoder(model_name)

    def rerank(
        self,
        query: str,
        candidates: list[RetrievalResult],
        top_k: int = 5
    ) -> list[RetrievalResult]:
        if not candidates:
            return []

        # Prepare (query, document) pairs
        pairs = [(query, candidate.content) for candidate in candidates]

        # Score all pairs
        scores = self.model.predict(pairs)

        # Sort by score and take top_k
        scored = sorted(
            zip(candidates, scores),
            key=lambda x: x[1],
            reverse=True
        )

        return [candidate for candidate, _ in scored[:top_k]]
```

For code retrieval specifically, re-ranking benefits from code-aware features. A candidate that contains the exact function name mentioned in the query should score higher than one that's semantically similar but refers to different code. Adding a simple exact-match bonus to the re-ranking score handles this without needing a code-specific re-ranker.

### Memory Routing: Querying the Right Store

Not every query should hit every memory store. Routing queries to the appropriate store — or stores — reduces latency and noise.

A query about a specific architectural decision probably wants episodic and semantic memory. A query about how to write a test probably wants procedural memory and code examples. A query about a specific function probably wants code memory and the knowledge graph.

Simple rule-based routing handles most cases:

```python
class MemoryRouter:
    def route(self, query: str, query_type: QueryType) -> list[MemoryStore]:
        stores = []

        # Always include semantic memory (stable facts are always relevant)
        stores.append(StoreType.SEMANTIC)

        if query_type == QueryType.DEBUGGING:
            # Debugging wants recent episodic memory + code
            stores.extend([StoreType.EPISODIC, StoreType.CODE])

        elif query_type == QueryType.IMPLEMENTATION:
            # Implementation wants procedural patterns + code
            stores.extend([StoreType.PROCEDURAL, StoreType.CODE])

        elif query_type == QueryType.ARCHITECTURE:
            # Architecture queries want episodic decisions + semantic facts
            stores.append(StoreType.EPISODIC)

        elif query_type == QueryType.EXPLANATION:
            # Explanation wants all memory types
            stores.extend([StoreType.EPISODIC, StoreType.PROCEDURAL, StoreType.CODE])

        return list(set(stores))  # Deduplicate

    def classify_query(self, query: str) -> QueryType:
        # Simple rule-based classification
        debug_signals = ["bug", "error", "fail", "broken", "wrong", "why doesn't"]
        impl_signals = ["how to", "implement", "write", "create", "add"]
        arch_signals = ["architecture", "design", "decision", "approach", "structure"]

        query_lower = query.lower()

        if any(s in query_lower for s in debug_signals):
            return QueryType.DEBUGGING
        elif any(s in query_lower for s in arch_signals):
            return QueryType.ARCHITECTURE
        elif any(s in query_lower for s in impl_signals):
            return QueryType.IMPLEMENTATION
        else:
            return QueryType.EXPLANATION
```

**Warning:** Don't make routing too aggressive. A query that's misclassified hits the wrong stores and misses relevant memories. When in doubt, include more stores with a higher relevance threshold rather than strictly limiting to one store.

### Assembling the Final Context

Retrieved memories must be assembled into a coherent context block that the model can use. This is not just concatenation — the order and formatting of retrieved memories affects how the model uses them.

Current practice: put the most relevant memory at the bottom (closest to the query), put higher-level background context earlier, and separate memories clearly so the model understands each one as a distinct piece of context.

```python
def assemble_memory_context(
    memories: list[RetrievalResult],
    query: str,
    max_tokens: int = 2000
) -> str:
    """
    Assemble retrieved memories into a context block.
    Most relevant memory goes last (closest to query).
    """
    # Sort by relevance (least to most relevant, so most relevant is last)
    sorted_memories = sorted(memories, key=lambda m: m.relevance_score)

    context_parts = []
    token_count = 0

    for memory in sorted_memories:
        memory_text = format_memory_for_context(memory)
        memory_tokens = estimate_tokens(memory_text)

        if token_count + memory_tokens > max_tokens:
            break  # Stop adding if we'd exceed budget

        context_parts.append(memory_text)
        token_count += memory_tokens

    if not context_parts:
        return ""

    header = "## Relevant Context from Project Memory\n\n"
    return header + "\n\n---\n\n".join(context_parts)
```

---

**Key Takeaways**

1. Hybrid search (dense + sparse, combined with RRF) outperforms either alone for developer tool retrieval.
2. Query rewriting — using the model to reformulate raw user messages into better search queries — significantly improves retrieval quality.
3. Cross-encoder re-ranking filters the candidate set from first-stage retrieval, improving precision without sacrificing recall.
4. Memory routing directs queries to the appropriate stores, reducing latency and noise.
5. Assembly order matters: most relevant memory closest to the query, background context earlier.

**Try This:** On a representative sample of developer queries, run retrieval with and without query rewriting. Measure the overlap between retrieved results and the results a human would identify as relevant. The gap tells you how much rewriting is helping.

---

## Chapter 8: Memory Decay and Refresh

Memory systems accumulate information over time. Without active management, they accumulate noise: stale facts that no longer reflect reality, obsolete code patterns that were replaced during a refactor, episodic records about bugs that were fixed months ago. Stale memory is worse than no memory — it causes the tool to confidently assert things that are no longer true.

Memory decay and refresh is the least glamorous part of memory system engineering and one of the most operationally important.

### Why Memories Go Stale

Code changes. That's the entire story, but the implications are numerous.

A fact entered into semantic memory — "the system uses Redis for session storage" — becomes stale the day the team migrates to database-backed sessions. A procedural pattern — "use X library for HTTP clients" — becomes stale when the team standardizes on a different library. An episodic memory about a bug — "fixed the N+1 query in user listing" — becomes stale if the user listing is later rewritten and the N+1 reintroduced in a different form.

The codebase is the ground truth. When the codebase and the memory system disagree, the codebase is right. The memory system needs mechanisms to detect and resolve these disagreements continuously.

### Types of Staleness

Different kinds of staleness require different detection strategies.

**Factual staleness** is when a stored fact no longer matches reality. "The system uses PostgreSQL" is false after a migration to MySQL. Detection requires comparing stored facts against current codebase state.

**Structural staleness** is when a memory references code that no longer exists in the same form. A memory about the `UserAuthService` class is stale if that class was renamed to `IdentityService` or deleted. Detection requires resolving memory references against the current code knowledge graph.

**Temporal staleness** is when a memory is old enough that it may no longer be relevant, even if it hasn't been explicitly invalidated. An episodic memory about a decision made two years ago, in a codebase that has since been substantially rewritten, may no longer be applicable. Detection is time-based with configurable decay rates.

**Confidence staleness** is when a memory was captured with moderate confidence and hasn't been reinforced by subsequent interactions. A fact that was uncertain when first captured and hasn't been confirmed since should have its confidence decay over time.

### Implementing Decay

Time-based decay can be implemented as a simple exponential decay function applied to memory confidence scores.

```python
import math
from datetime import datetime, timedelta

def compute_decayed_confidence(
    original_confidence: float,
    captured_at: datetime,
    memory_type: MemoryType,
    reinforcement_count: int = 0,
    now: datetime = None
) -> float:
    """
    Compute current confidence after time-based decay and reinforcement.

    Decay rates by type:
    - EPISODIC: half-life = 90 days (events become less relevant)
    - SEMANTIC: half-life = 180 days (facts decay slower)
    - PROCEDURAL: half-life = 365 days (patterns are most durable)
    """
    if now is None:
        now = datetime.utcnow()

    half_lives = {
        MemoryType.EPISODIC: 90,
        MemoryType.SEMANTIC: 180,
        MemoryType.PROCEDURAL: 365,
    }

    half_life_days = half_lives[memory_type]
    age_days = (now - captured_at).days

    # Exponential decay
    decay_factor = math.pow(0.5, age_days / half_life_days)

    # Reinforcement bonus: each confirmation extends effective lifetime
    reinforcement_bonus = min(reinforcement_count * 0.1, 0.3)

    decayed = original_confidence * decay_factor + reinforcement_bonus
    return min(decayed, 1.0)  # Cap at 1.0
```

Memories whose confidence falls below a threshold should be flagged for review or removed from the active retrieval index. They can be archived rather than deleted — useful for audit trails and for re-examining when context changes.

**Key Insight:** Decay should be gradual and reversible, not binary. A memory that decays to low confidence but hasn't been explicitly invalidated shouldn't be deleted — it should be deprioritized in retrieval and flagged for verification.

### Structural Validation: Checking References Against Code

Any memory that references a specific code entity — a class name, a function, a file path, a database table — can be validated by checking whether that entity still exists.

This validation can run as a background process, checking stored references against the current code index.

```python
class MemoryValidator:
    def __init__(self, code_index: CodeKnowledgeGraph, memory_store: MemoryStore):
        self.code_index = code_index
        self.store = memory_store

    async def validate_all_memories(self) -> ValidationReport:
        all_memories = await self.store.get_all_active()

        stale = []
        valid = []

        for memory in all_memories:
            references = self._extract_code_references(memory)

            if not references:
                valid.append(memory)  # No references to validate
                continue

            broken_refs = []
            for ref in references:
                if not self.code_index.entity_exists(ref):
                    broken_refs.append(ref)

            if broken_refs:
                stale.append(StaleMemory(
                    memory=memory,
                    broken_references=broken_refs,
                    staleness_reason=StalenessReason.REFERENCE_NOT_FOUND
                ))
            else:
                valid.append(memory)

        return ValidationReport(stale=stale, valid=valid)

    def _extract_code_references(self, memory: Memory) -> list[CodeReference]:
        """
        Extract function names, class names, file paths from memory content.
        Uses heuristics: CamelCase = likely class, snake_case() = likely function.
        """
        references = []
        content = memory.content

        # Class-like references (CamelCase)
        class_pattern = r'\b[A-Z][a-zA-Z]+(?:Service|Manager|Handler|Repository|Client)\b'
        for match in re.finditer(class_pattern, content):
            references.append(CodeReference(
                name=match.group(),
                ref_type=ReferenceType.CLASS
            ))

        # Function-like references (snake_case with context)
        func_pattern = r'\b([a-z_][a-z0-9_]+)\(\)'
        for match in re.finditer(func_pattern, content):
            references.append(CodeReference(
                name=match.group(1),
                ref_type=ReferenceType.FUNCTION
            ))

        # File path references
        path_pattern = r'[a-zA-Z0-9_/]+\.py\b'
        for match in re.finditer(path_pattern, content):
            references.append(CodeReference(
                name=match.group(),
                ref_type=ReferenceType.FILE
            ))

        return references
```

### Refresh Triggers

Passive decay handles gradual staleness. Active refresh is needed when codebase changes make memories obsolete immediately.

The trigger for active refresh is the same as the trigger for incremental code indexing: file changes. When a file changes, any memories that reference that file or entities defined in that file should be flagged for re-validation.

```python
# Example: Change-triggered memory invalidation
class ChangeTriggeredInvalidator:
    def on_file_changed(self, file_path: str) -> None:
        """
        Called when a file is modified. Invalidates affected memories.
        """
        # Find all memories that reference this file
        affected_memories = self.store.find_memories_referencing(file_path)

        # Also find memories that reference entities defined in this file
        defined_entities = self.code_index.get_entities_defined_in(file_path)
        for entity in defined_entities:
            entity_memories = self.store.find_memories_referencing(entity.name)
            affected_memories.extend(entity_memories)

        # Mark affected memories for re-validation
        for memory in affected_memories:
            self.store.flag_for_revalidation(
                memory_id=memory.id,
                reason=f"Referenced file {file_path} was modified"
            )
```

### Resolving Stale Memories

When a memory is flagged as stale, there are three options: update it, archive it, or delete it.

**Update** when the memory's core value is still valid but the specific references need updating. "We use event sourcing for the order service" might reference a class that was renamed — the architectural fact is still true, the reference needs updating.

**Archive** when the memory is no longer current but might be historically interesting. Bug fixes, superseded architectural decisions, and resolved incidents fall into this category.

**Delete** when the memory has no value in any form. Exploratory notes from sessions that went nowhere, transient debugging hypotheses, and low-confidence facts that were never confirmed — delete these once they decay below threshold.

```python
async def resolve_stale_memory(
    memory: Memory,
    updated_code_context: str,
    llm_client
) -> MemoryResolution:
    """
    Use LLM to determine how to handle a stale memory.
    """
    prompt = f"""
    A memory record has been flagged as potentially stale.

    Memory content:
    {memory.content}

    Reason flagged: {memory.staleness_reason}

    Current codebase context:
    {updated_code_context}

    Determine the best action:
    - UPDATE: The core fact is still valid, update to reflect current state
    - ARCHIVE: The memory was valid historically, archive for reference
    - DELETE: The memory has no current or historical value

    If UPDATE, provide the updated content.

    Respond in JSON: {{"action": "UPDATE|ARCHIVE|DELETE", "updated_content": "..."}}
    """

    response = await llm_client.complete(prompt)
    return MemoryResolution.from_json(response)
```

**Warning:** Automated deletion of stale memories is irreversible. Build in a review step for any automated deletion pipeline, at least initially. False positives in staleness detection will delete valid memories if there's no check. Archive first, delete after a confirmation period.

### Scheduled Maintenance

In addition to event-driven refresh, schedule periodic maintenance passes.

A weekly maintenance job should: run structural validation against all memories, apply time-based decay to confidence scores, archive or delete memories below confidence thresholds, and generate a report of memory health metrics (how many memories are active, how many were stale, what's the average confidence across types).

```python
# Scheduled maintenance pipeline
async def run_weekly_maintenance(
    memory_store: MemoryStore,
    code_index: CodeKnowledgeGraph
) -> MaintenanceReport:
    validator = MemoryValidator(code_index, memory_store)

    # Validate all memories
    validation_report = await validator.validate_all_memories()

    # Apply decay to all memories
    decayed_count = 0
    for memory in await memory_store.get_all_active():
        new_confidence = compute_decayed_confidence(
            original_confidence=memory.confidence,
            captured_at=memory.created_at,
            memory_type=memory.type,
            reinforcement_count=memory.reinforcement_count
        )

        if abs(new_confidence - memory.confidence) > 0.01:
            await memory_store.update_confidence(memory.id, new_confidence)
            decayed_count += 1

    # Archive memories below threshold
    archived_count = await memory_store.archive_below_threshold(
        confidence_threshold=0.2
    )

    return MaintenanceReport(
        stale_found=len(validation_report.stale),
        decayed_updated=decayed_count,
        archived=archived_count,
        timestamp=datetime.utcnow()
    )
```

---

**Key Takeaways**

1. Stale memory is worse than no memory — it produces confident, incorrect outputs. Active management is non-negotiable.
2. Staleness has multiple types: factual, structural, temporal, and confidence-based. Each requires different detection.
3. Time-based decay should be gradual and type-dependent. Episodic memories decay faster than semantic; semantic faster than procedural.
4. Structural validation — checking that memory references still exist in the codebase — catches the most impactful staleness immediately.
5. Archive before deleting. Automated deletion pipelines without human review will occasionally delete valid memories.

**Try This:** Run a manual audit of the information you've given an AI assistant in the past 30 days. How much of it is still accurate? How much has changed? That ratio is your staleness rate — and it's probably higher than you expect.

---

## Chapter 9: Evaluation: Does the Tool Actually Remember?

Everything in the preceding chapters is infrastructure. None of it matters if the tool doesn't actually remember in ways that developers notice and value. Evaluation is how you close the loop — how you know whether the system is working, what specifically is failing, and whether changes you make improve or degrade the experience.

Evaluating memory systems is harder than evaluating factual question-answering. The ground truth is fuzzy. Whether a developer would have found a retrieved memory relevant is a judgment call. Whether the absence of a memory actually hurt a specific interaction is rarely observable. And the most important outcomes — does the tool feel more useful over time? — are long-horizon and highly subjective.

That said, there are concrete metrics worth tracking and concrete evaluations worth running.

### Metric 1: Retrieval Precision and Recall

The most direct measure of memory system health is retrieval quality: when the system retrieves memories, are they the right ones?

To measure this, you need labeled data — a set of queries with known-relevant memories. Building this dataset is itself the work. The practical approach is to create a small evaluation set from real interactions: take 50-100 developer queries, manually annotate which memories in the store are relevant to each, and use this as the benchmark for retrieval evaluation.

```python
# Retrieval evaluation against labeled benchmark
class RetrievalEvaluator:
    def evaluate(
        self,
        benchmark: list[BenchmarkQuery],
        retriever: HybridRetriever,
        k: int = 5
    ) -> RetrievalMetrics:
        precision_scores = []
        recall_scores = []
        ndcg_scores = []

        for query in benchmark:
            retrieved = retriever.retrieve(query.text, top_k=k)
            retrieved_ids = {r.id for r in retrieved}
            relevant_ids = set(query.relevant_memory_ids)

            # Precision@k: what fraction of retrieved are relevant?
            precision = len(retrieved_ids & relevant_ids) / k
            precision_scores.append(precision)

            # Recall@k: what fraction of relevant were retrieved?
            if relevant_ids:
                recall = len(retrieved_ids & relevant_ids) / len(relevant_ids)
                recall_scores.append(recall)

            # NDCG: position-weighted measure of ranking quality
            ndcg = self._compute_ndcg(retrieved, query.relevant_memory_ids)
            ndcg_scores.append(ndcg)

        return RetrievalMetrics(
            precision_at_k=sum(precision_scores) / len(precision_scores),
            recall_at_k=sum(recall_scores) / len(recall_scores),
            ndcg=sum(ndcg_scores) / len(ndcg_scores)
        )
```

Target metrics depend on context, but for developer tools, precision matters more than recall. False positives (irrelevant memories injected into context) actively degrade output quality. A precision@5 above 0.7 is a reasonable target; above 0.85 is excellent.

### Metric 2: Memory Coverage

Memory coverage measures whether the important things the developer has told the system (or that the system has learned from the codebase) are actually captured and available.

This is harder to measure directly, but a proxy works: create a set of factual questions about the project that a properly-functioning memory system should be able to answer, and check whether the retrieval system surfaces the relevant memories.

```python
# Coverage evaluation: can the system recall specific facts?
class CoverageEvaluator:
    def evaluate_coverage(
        self,
        fact_questions: list[FactQuestion],
        memory_store: MemoryStore,
        threshold: float = 0.6
    ) -> CoverageReport:
        """
        For each question, check whether the correct memory is retrievable
        with sufficient confidence.
        """
        results = []

        for question in fact_questions:
            memories = memory_store.retrieve(
                query=question.query_text,
                top_k=10
            )

            # Check if the expected memory is in top-10 results
            expected_found = any(
                m.id == question.expected_memory_id
                for m in memories
            )

            # Check confidence of expected memory if found
            if expected_found:
                expected_memory = next(
                    m for m in memories
                    if m.id == question.expected_memory_id
                )
                covered = expected_memory.confidence >= threshold
            else:
                covered = False

            results.append(CoverageResult(
                question=question,
                covered=covered,
                found_in_top_k=expected_found
            ))

        coverage_rate = sum(1 for r in results if r.covered) / len(results)

        return CoverageReport(
            coverage_rate=coverage_rate,
            results=results,
            uncovered=[r for r in results if not r.covered]
        )
```

**Key Insight:** Coverage evaluation doubles as a diagnostic tool. When coverage is low on a category of questions, it tells you exactly what the memory system is failing to capture or retain.

### Metric 3: Staleness Rate

The fraction of active memories that are stale measures memory system hygiene. This should be checked periodically and tracked as a time series.

```python
def compute_staleness_rate(
    memory_store: MemoryStore,
    code_index: CodeKnowledgeGraph
) -> float:
    """
    Fraction of active memories with at least one broken code reference.
    """
    all_memories = memory_store.get_all_active()
    stale_count = 0

    for memory in all_memories:
        references = extract_code_references(memory.content)
        if any(not code_index.entity_exists(ref) for ref in references):
            stale_count += 1

    return stale_count / len(all_memories) if all_memories else 0.0
```

A staleness rate above 20% is a signal that the refresh pipeline isn't keeping up with codebase change velocity. Investigate which types of memories are going stale fastest and tune the refresh triggers accordingly.

### Metric 4: Context Utilization

When memories are retrieved and injected into context, does the model actually use them? Context utilization measures whether retrieved memories appear in the model's outputs.

This is imprecise but trackable: run a classifier on model outputs that checks whether the output references concepts from the injected memory context.

```python
async def measure_context_utilization(
    interactions: list[Interaction],
    llm_client
) -> float:
    """
    Fraction of interactions where the model's response references
    content from the injected memory context.
    """
    utilized_count = 0

    for interaction in interactions:
        if not interaction.injected_memories:
            continue

        prompt = f"""
        Memory context provided to the model:
        {interaction.injected_memories_text}

        Model response:
        {interaction.model_response}

        Does the model's response reference or build on any specific information
        from the memory context? Answer yes or no.
        """

        response = await llm_client.complete(prompt, max_tokens=5)
        if "yes" in response.lower():
            utilized_count += 1

    return utilized_count / len(
        [i for i in interactions if i.injected_memories]
    )
```

Low context utilization (below 50%) suggests that the retrieved memories aren't actually relevant to the queries they're being paired with. This points back to retrieval quality issues.

### Qualitative Evaluation: Developer Experience

The quantitative metrics above measure system components. They don't measure whether the overall experience of using the tool has actually improved.

Qualitative evaluation requires talking to developers who use the tool seriously. Not in a structured survey — in a conversation. The questions that reveal the most:

- When did you notice the tool knowing something you hadn't told it in this session?
- When did the tool say something that was wrong because it was relying on outdated information?
- Have you stopped re-explaining things you used to re-explain? What things?
- Does the tool feel like it knows your codebase, or does it still feel generic?

These are not rigorous metrics, but they reveal signal that no metric captures: whether the memory system has crossed the threshold from "technically functional" to "genuinely useful."

**Warning:** Don't rely exclusively on quantitative metrics. A system with good precision@5 and coverage rates can still feel useless if the memories it captures are the wrong ones. Developer experience interviews catch this; retrieval benchmarks don't.

### A/B Testing Memory System Changes

When making changes to the memory architecture — new retrieval strategies, different decay rates, changed capture heuristics — A/B testing provides clean signal about whether the change helped.

The challenge is that memory quality compounds over time. A change that improves memory capture today produces better retrievable context next week. The measurement horizon needs to match the compounding time.

Practical approach: run the A/B test over at least two weeks, use held-out developer teams rather than alternating individual developers (to avoid contamination between conditions), and measure both retrieval metrics and developer satisfaction at the end.

### Building an Evaluation Dataset

Everything above requires data. Most teams don't have it. Building it is a prerequisite for serious memory system evaluation.

The evaluation dataset should include:

- A representative sample of developer queries with labeled relevant memories (for retrieval evaluation)
- A set of factual questions about the project with known correct answers (for coverage evaluation)
- A set of interactions with human-labeled judgments about response quality with and without memory context

Building this dataset takes time — probably 20-40 hours for a thorough initial version — but it pays off immediately in the ability to make changes with confidence, and it grows more valuable over time as the system evolves.

```python
# Evaluation dataset builder
class EvalDatasetBuilder:
    def sample_representative_queries(
        self,
        interaction_log: list[Interaction],
        n_samples: int = 100
    ) -> list[LabeledQuery]:
        """
        Sample diverse queries from interaction log for annotation.
        Stratified by query type to ensure coverage.
        """
        query_types = [QueryType.DEBUGGING, QueryType.IMPLEMENTATION,
                      QueryType.ARCHITECTURE, QueryType.EXPLANATION]

        per_type = n_samples // len(query_types)
        sampled = []

        for query_type in query_types:
            type_interactions = [
                i for i in interaction_log
                if classify_query(i.query) == query_type
            ]

            sample = random.sample(
                type_interactions,
                min(per_type, len(type_interactions))
            )

            sampled.extend([
                LabeledQuery(
                    query=i.query,
                    query_type=query_type,
                    relevant_memory_ids=[]  # To be annotated by human
                )
                for i in sample
            ])

        return sampled
```

---

**Key Takeaways**

1. Quantitative metrics — retrieval precision/recall, memory coverage, staleness rate, context utilization — measure system health. None of them measure whether the system is actually useful.
2. Coverage evaluation is the most practical diagnostic: specify what the system should know, then check whether it can retrieve it.
3. A staleness rate above 20% indicates the refresh pipeline isn't keeping up with codebase velocity.
4. Context utilization below 50% indicates retrieval is returning irrelevant memories that the model correctly ignores.
5. Qualitative developer experience evaluation catches what quantitative metrics miss. Do it regularly.

**Try This:** Write down ten things your AI coding assistant should know about your project that it currently doesn't. That list is your initial coverage evaluation dataset. Run retrieval against each item and measure how many are actually in the memory store. The answer is usually sobering.

---

## Conclusion

Memory systems for AI developer tools are not a single feature. They're a discipline — an engineering practice that spans capture, organization, retrieval, and maintenance, applied continuously across the lifecycle of a developer's relationship with a tool.

The chapters of this book trace a path from the problem (stateless tools that impose a hidden tax on developers) through the components of a solution (memory types, session management, cross-session persistence, code understanding, graph structures, retrieval systems) and into the operational concerns (decay, refresh, evaluation) that determine whether the theory holds up in production.

Some things are worth emphasizing as you take these ideas into practice.

**Memory is not a database problem.** The storage is trivial compared to the hard parts: deciding what's worth capturing, building retrieval that returns what's actually relevant, keeping the system honest as the codebase evolves. Teams that treat memory as a storage engineering problem build systems that hold information but don't remember.

**Retrieval quality is the ceiling.** A memory system is only as useful as its worst retrieval. You can have perfectly captured, perfectly organized memories and still fail the developer if retrieval surfaces the wrong ones at the wrong time. Invest in retrieval quality — hybrid search, query rewriting, re-ranking — before investing in capture breadth.

**The codebase is underused.** Most AI developer tools treat the codebase as content to analyze on demand, not as persistent memory to mine continuously. The git history, the test suite, the inline comments, the schema definitions — all of it is externalized knowledge that a well-designed system can turn into an indexed, retrievable memory store. Teams that do this effectively build tools that genuinely know their users' codebases in a way that tools relying only on conversation history cannot match.

**Staleness is inevitable and manageable.** Any memory that references code will eventually go stale. Designing for this from the beginning — change-triggered validation, confidence decay, scheduled maintenance — is the difference between a memory system that degrades gracefully and one that quietly becomes a liability.

**Evaluation must happen.** Building memory infrastructure without evaluating whether it works is cargo-cult engineering. The evaluation frameworks in Chapter 9 are not sophisticated research-grade protocols — they're practical baselines that any team can implement. Run them. Find out what your system actually knows versus what it should know. Fix the gaps.

The tools that win over the next several years will be the ones that genuinely accumulate useful knowledge about how their users work. Not because of any single insight or architectural choice, but because they treat memory as a first-class engineering concern and invest in it consistently.

The developers who build those tools are the ones reading books like this.

---

## Appendix A: Glossary

**BM25** — A probabilistic ranking function used in information retrieval. Scores documents based on query term frequency and inverse document frequency. The standard baseline for keyword-based (sparse) retrieval.

**Chunking** — The process of splitting large text or code into smaller segments for embedding and indexing. Chunk size and overlap affect retrieval precision.

**ChromaDB** — An open-source vector database designed for AI applications. Commonly used for local development and smaller-scale production deployments.

**Context Window** — The maximum number of tokens a language model can process in a single call. Represents the model's working memory during inference.

**Cross-Encoder** — A neural model that processes a (query, document) pair jointly to produce a relevance score. Used for re-ranking retrieval candidates. Slower than bi-encoders but more accurate.

**Dense Retrieval** — Retrieval using vector embeddings and similarity search. Finds semantically similar content even when wording differs.

**Embedding** — A numeric vector representation of text (or code) that captures semantic meaning. Produced by embedding models; similar content produces similar embeddings.

**Episodic Memory** — Memory of specific events and interactions, temporally ordered. The record of what happened, when.

**Graph Database** — A database that stores entities and relationships as a graph (nodes and edges). Examples: Neo4j, Amazon Neptune. Efficient for traversal queries.

**Hybrid Search** — Combining dense (vector) and sparse (keyword) retrieval, typically using Reciprocal Rank Fusion to merge rankings.

**Knowledge Graph** — A graph structure representing entities and their relationships. For code, this includes modules, functions, classes, and their dependencies.

**Lost in the Middle** — The empirically observed tendency of language models to attend less reliably to content in the middle of long contexts. Content at the beginning and end of the context window receives stronger attention.

**Procedural Memory** — Memory of how to do things — patterns, conventions, and established practices rather than facts or events.

**RAG (Retrieval-Augmented Generation)** — A technique that retrieves relevant documents from an external store and includes them in the prompt before generation. Enables models to answer questions about information not in their training data.

**Re-Ranking** — A second-stage retrieval step that scores retrieved candidates using a more sophisticated model (typically a cross-encoder) to improve precision.

**Reciprocal Rank Fusion (RRF)** — An algorithm for combining multiple ranked lists into a single ranking. Used in hybrid search to merge dense and sparse retrieval results.

**Semantic Memory** — Memory of stable facts and concepts, abstracted from the specific interaction in which they were learned.

**Sparse Retrieval** — Retrieval using keyword matching (BM25 or TF-IDF). Precise for exact identifier matches; weak at semantic generalization.

**Staleness** — The condition of a memory record whose content no longer accurately reflects the current state of the codebase or project.

**Vector Store** — A database optimized for storing and querying vector embeddings. Examples: Chroma, Qdrant, Pinecone, Weaviate.

---

## Appendix B: Tools & Resources

### Vector Databases

**Chroma** — Open-source, embeds in-process or runs as a server. Best for development and smaller deployments. Strong Python SDK.

**Qdrant** — Open-source, written in Rust. Good performance characteristics, supports filtering and payload storage. Self-hosted or managed cloud.

**Weaviate** — Open-source with a strong managed offering. Built-in support for hybrid (dense + sparse) search. Good for production deployments.

**Pinecone** — Fully managed, no self-hosting. Simple API, scales without operational overhead. Not suitable for air-gapped or fully local deployments.

**pgvector** — A PostgreSQL extension that adds vector similarity search. The pragmatic choice if you already run PostgreSQL and want to minimize infrastructure surface area.

### Embedding Models

**OpenAI text-embedding-3-small / text-embedding-3-large** — Strong general-purpose embeddings. Small is fast and cheap; large is higher quality at higher cost. Requires API access.

**Cohere embed-v3** — Competitive quality, includes a multilingual model. Requires API access.

**BGE (BAAI General Embeddings)** — Strong open-source models available locally via Hugging Face. BGE-large-en-v1.5 is a solid choice for English-language code and documentation.

**E5 family** — Competitive open-source embeddings. E5-large-v2 and E5-mistral-7b-instruct cover different cost/quality points.

**CodeBERT / GraphCodeBERT** — Code-specific embedding models. Better than general models for pure code retrieval; weaker for mixed code/prose content.

### Graph Tools

**NetworkX** — Python library for graph analysis. Excellent for development and analysis; not suitable for production scale (in-memory only).

**Neo4j** — Full graph database with its own query language (Cypher). Mature, production-grade, significant operational complexity.

**Kuzu** — Embedded graph database. Low overhead, no server required. Good option when you want graph capabilities without Neo4j's complexity.

**DuckDB** — Not a graph database, but its array functions and graph extensions make it a practical choice for small-to-medium code graph queries.

### Retrieval Libraries

**sentence-transformers** — Python library for embedding models and cross-encoder re-ranking. The standard toolkit for building retrieval pipelines.

**rank-bm25** — Lightweight Python BM25 implementation. Good for sparse retrieval in hybrid search pipelines.

**LlamaIndex** — Framework for building RAG applications. Handles chunking, indexing, retrieval, and context assembly with opinionated defaults.

**LangChain** — Broader framework that includes retrieval primitives. More flexible than LlamaIndex, more configuration required.

### Code Analysis

**tree-sitter** — Parser library with support for many languages. Produces concrete syntax trees rather than Python's `ast` module's ASTs. Better for polyglot codebases.

**Python `ast` module** — Built-in. Good enough for Python-only codebases, zero external dependencies.

**Sourcegraph** — Enterprise code intelligence platform. Useful for understanding large, multi-repo codebases. Source of inspiration for code graph approaches even if the tool itself isn't adopted.

---

## Appendix C: Further Reading

### Foundational Papers

**Attention Is All You Need** (Vaswani et al., 2017) — The transformer architecture paper. Foundation for understanding how attention mechanisms work and why position in context matters.

**REALM: Retrieval-Augmented Language Model Pre-Training** (Guu et al., 2020) — Early influential work on integrating retrieval into language model inference.

**Lost in the Middle: How Language Models Use Long Contexts** (Liu et al., 2023) — The definitive paper on position bias in long-context language models. Required reading for anyone managing context window content.

**Improving Language Models by Retrieving from Trillions of Tokens** (Borgeaud et al., 2022) — RETRO paper from DeepMind. Demonstrates retrieval-augmented generation at scale.

**Dense Passage Retrieval for Open-Domain Question Answering** (Karpukhin et al., 2020) — Foundational paper on dense retrieval with bi-encoder models.

### Retrieval and RAG

**BEIR: A Heterogeneous Benchmark for Zero-shot Evaluation of Information Retrieval Models** (Thakur et al., 2021) — Standard benchmark for retrieval model evaluation. Useful for understanding how different retrieval approaches compare across domains.

**Precise Zero-Shot Dense Retrieval without Relevance Labels** (Gao et al., 2022) — HyDE paper. Generates a hypothetical document for a query, then retrieves using that document as a query. Improves dense retrieval for questions.

**Contextual Retrieval** (Anthropic, 2024) — Technical blog post. Demonstrates prepending chunk context before embedding to improve retrieval quality.

### Knowledge Graphs and Code Understanding

**Code2Vec: Learning Distributed Representations of Code** (Alon et al., 2019) — Influential work on code representation learning.

**A Survey of the State of Explainable AI for Natural Language Processing** (Danilevsky et al., 2020) — Background on extracting structured knowledge from unstructured text.

**Enriching Word Vectors with Subword Information** (Bojanowski et al., 2017) — Background on subword tokenization. Relevant for understanding why code identifiers are challenging for general-purpose embeddings.

### Memory Systems

**Cognitive Architectures for Language Agents** (Sumers et al., 2023) — Survey of how cognitive science memory frameworks apply to language agent design. The closest direct analog to the framework in this book.

**MemGPT: Towards LLMs as Operating Systems** (Packer et al., 2023) — Proposes treating the LLM as an operating system that manages its own memory hierarchy. Influential in framing the context window as working memory.

**Voyager: An Open-Ended Embodied Agent with Large Language Models** (Wang et al., 2023) — Implements a procedural memory (skill library) for an autonomous agent. Good reference for procedural memory design.

---

*Memory Systems for AI Developer Tools* — Version 1.0, April 2026

*David Kelly Price is the founder of Pyckle, a developer tools company focused on context optimization and semantic code understanding for AI-assisted engineering.*

---

---

*© 2026 Pyckle. All rights reserved. This guide may be shared freely for personal and educational use. Commercial reproduction or redistribution requires written permission. Contact kellyprice@pyckle.co.*

---

## Related Blog Posts

- [Search Is Commoditized. Memory Is the Moat.](https://pyckle.co/blog/search-is-commoditized-memory-is-the-moat.html)
- [Your Team's Knowledge Lives in Multiple Places](https://pyckle.co/blog/your-teams-knowledge-lives-in-multiple-places-and-your-ai-only-sees-one.html)
- [Why Some Tools Age and Others Compound](https://pyckle.co/blog/why-some-tools-age-and-others-compound.html)

---

*[Browse all free guides →](https://pyckle.co/books.html)*