---
title: "From RAG to Agents: Code-Aware Agentic Pipelines"
subtitle: "How Retrieval Feeds Agentic Loops and Why the Combination Changes What's Possible"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Architects and senior ML engineers building AI systems that go beyond single-turn generation — designing pipelines where retrieval, planning, and action are interleaved"
estimated_pages: 80
chapters:
  - "The Limits of Single-Pass RAG"
  - "Agents: Planning, Tool Use, and Loops"
  - "Code Retrieval as an Agent Tool"
  - "Multi-Step Retrieval for Complex Tasks"
  - "Grounding Agents in Codebase Reality"
  - "Memory and State Across Agent Steps"
  - "Evaluation for Agentic Systems"
  - "Production Architecture and Safety"
tags:
  - pyckle
  - ebook
  - agentic-pipelines
  - rag
  - ai-agents
  - architecture
  - retrieval
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# From RAG to Agents: Code-Aware Agentic Pipelines

## How Retrieval Feeds Agentic Loops and Why the Combination Changes What's Possible

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: The Limits of Single-Pass RAG
- Chapter 2: Agents: Planning, Tool Use, and Loops
- Chapter 3: Code Retrieval as an Agent Tool
- Chapter 4: Multi-Step Retrieval for Complex Tasks
- Chapter 5: Grounding Agents in Codebase Reality
- Chapter 6: Memory and State Across Agent Steps
- Chapter 7: Evaluation for Agentic Systems
- Chapter 8: Production Architecture and Safety
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

This book is for engineers who have already built something with retrieval-augmented generation and found themselves staring at its ceiling. You got the basic architecture working. You chunked documents, embedded them, stood up a vector store, and wired a language model to the retrieval results. And it worked — right up until users asked questions that required more than one step.

The moment a question has genuine complexity — "refactor this service so it no longer depends on the legacy auth module" or "find all the places where we're not handling this error type and explain the risk pattern" — single-pass RAG runs out of runway. The architecture isn't wrong. It's just incomplete.

What this book covers is the next layer: agentic pipelines where retrieval is a repeatable action rather than a one-time setup step, where a model can decide what to retrieve based on what it already knows, and where the combination of planning, tool use, and grounded context creates systems that can handle tasks that used to require a senior engineer's full attention.

The focus is specifically on code. Code is where agentic retrieval is hardest and most valuable. Code has structure, dependencies, implicit contracts between modules, and a kind of semantic density that flat document retrieval handles poorly. Getting code retrieval right inside an agentic loop is a different problem than getting it right in a single-turn Q&A system, and this book treats it as such.

Each chapter builds on the previous one. You can read it front-to-back or jump to the chapter that addresses the problem in front of you, though the later chapters assume familiarity with the concepts from the earlier ones.

The code examples are in Python. The architectural patterns are language-agnostic.

---

## Chapter 1: The Limits of Single-Pass RAG

Retrieval-augmented generation solved a real problem. Before it, large language models were useful but brittle: they hallucinated facts, they couldn't reference private knowledge, and their training cutoffs meant anything recent was a gamble. RAG addressed all three. You retrieve relevant documents, stuff them in the context window alongside the question, and the model generates an answer that's grounded in actual source material.

The pattern is simple enough that it became the default. Ask a question, retrieve chunks, generate an answer. Repeat. For a large class of tasks, that's genuinely sufficient.

But spend enough time with RAG systems in production and you start noticing the failure modes cluster around a specific shape of problem: tasks that require the system to know what to look for before it can look for it.

### The Single-Pass Assumption

Every traditional RAG pipeline makes the same structural assumption: the query you start with is sufficient to retrieve the context you need. The sequence is fixed. Query comes in, retrieval runs once, model generates once, answer goes out. There's no feedback. No iteration. No ability for the model to say "actually, now that I've read these chunks, I realize I need something else."

This assumption holds for lookup tasks. "What does this config parameter do?" "Find the function that handles OAuth tokens." "Summarize what this module is responsible for." One question, one retrieval, one answer. The query is self-contained.

It breaks down for reasoning tasks. "Why is this service slow?" isn't a lookup. It's an investigation. The first retrieval might surface the API handler. But the bottleneck might be in the database query it calls, which calls a utility that does something unexpected, which depends on a shared cache that's configured somewhere else entirely. Getting to the actual answer requires following a chain of evidence — and you can't write that chain before you start following it.
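
The fixed sequence described above can be sketched in a few lines. This is a minimal illustration, not a real pipeline: `retrieve` and `generate` are hypothetical callables standing in for a vector store lookup and a model call.

```python
from typing import Callable

# Hypothetical signatures: retrieve() returns text chunks, generate() returns an answer.
Retriever = Callable[[str, int], list[str]]
Generator = Callable[[str], str]


def single_pass_rag(query: str, retrieve: Retriever, generate: Generator, top_k: int = 5) -> str:
    """Fixed sequence: one retrieval, one generation, no feedback."""
    chunks = retrieve(query, top_k)        # the only retrieval that ever runs
    context = "\n\n".join(chunks)
    prompt = f"Context:\n{context}\n\nQuestion: {query}"
    return generate(prompt)                # the model has no way to ask for more
```

Nothing in this function lets the output of `generate` influence a second call to `retrieve`, which is exactly the limitation the "Why is this service slow?" example runs into.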

> **Key Insight**
> The fundamental limit of single-pass RAG isn't the retrieval quality — it's the assumption that the first query is the only query. Any task that requires discovering what to look for next cannot be solved in a single pass.

### What Breaks and Why

There are four failure categories worth naming precisely, because they inform the architecture that comes later.

**Query-answer mismatch at retrieval time.** The model receives chunks that are topically related to the query but not actually relevant to answering it. A question about authentication failure modes retrieves documentation about the authentication module's API, not the error handling code buried two layers deeper. The model tries to answer from what it has and either hallucinates the missing piece or gives a partial answer it presents as complete.

**Dependency blindness.** Code is a graph, not a collection of independent documents. A function's behavior depends on what it calls, what calls it, what assumptions it inherits from shared utilities, and what configuration governs its execution. Single-pass retrieval typically returns the function and maybe its immediate neighbors. It misses the transitive dependencies that are often where the actual answer lives.

**Context window exhaustion without coverage.** You can't just retrieve more chunks to compensate. At some chunk size and retrieval breadth, you hit the context window limit before you've actually covered the relevant surface area. You end up with a context that looks full but isn't actually complete — which is worse than a context that's obviously incomplete, because the model has no signal that it's missing something.

**Semantic drift between query intent and document content.** The query is phrased in terms of the problem. The relevant code is phrased in terms of the implementation. These don't always embed near each other. "Fix the race condition" doesn't naturally retrieve `threading.Lock()` usage or the specific flag that governs concurrency behavior, because those terms don't appear in the query. The semantic gap is bridgeable with good hybrid search, but it can't be fully closed with a single embedding-based retrieval step.
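
Dependency blindness is the easiest of the four to show concretely. The sketch below uses a small, made-up call graph (the function names are hypothetical) and compares what a single retrieval plus direct callees would cover against everything reachable by following call edges.

```python
from collections import deque

# Hypothetical call graph for illustration: each key calls the functions in its list.
CALLS = {
    "api.handle_request": ["db.run_query"],
    "db.run_query": ["utils.build_filter"],
    "utils.build_filter": ["cache.get_config"],
    "cache.get_config": [],
}


def immediate_neighbors(fn: str) -> set[str]:
    """Roughly what one retrieval step plus direct callees covers."""
    return {fn, *CALLS.get(fn, [])}


def transitive_dependencies(fn: str) -> set[str]:
    """Everything reachable from fn by following call edges (breadth-first)."""
    seen = {fn}
    queue = deque([fn])
    while queue:
        current = queue.popleft()
        for callee in CALLS.get(current, []):
            if callee not in seen:
                seen.add(callee)
                queue.append(callee)
    return seen


missed = transitive_dependencies("api.handle_request") - immediate_neighbors("api.handle_request")
print(sorted(missed))  # ['cache.get_config', 'utils.build_filter']
```

The two functions that single-pass retrieval never sees are the ones furthest down the chain, which in the slow-service example is where the answer lives.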

### The Evaluation Problem Hides the Architecture Problem

Here's what makes this particularly tricky in practice: naive evaluation of RAG systems doesn't surface these failure modes. If you evaluate your system on questions where single-pass retrieval is sufficient — which is most benchmark datasets — you'll get good numbers. The system looks like it's working.

The failure modes only appear on the hard queries: the multi-hop questions, the dependency-chain investigations, the refactoring tasks that require understanding impact across multiple modules. Those queries are underrepresented in standard benchmarks because they're harder to construct ground truth for.

So teams ship systems that perform well in evaluation and encounter real-world failure in production, often without understanding why. The architecture isn't wrong for easy queries. It's just wrong for hard ones.
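
One way to make this gap visible is to tag each evaluation query with the number of retrieval hops a person needed to answer it, then report accuracy per group instead of a single aggregate. The records below are toy data invented for illustration, and the `hops` and `correct` fields are assumptions about how you might label your own set.

```python
from collections import defaultdict

# Hypothetical evaluation records: "hops" is how many retrieval steps a human
# needed to answer the query; "correct" is the graded outcome for the system.
RESULTS = [
    {"query": "What does MAX_RETRIES control?", "hops": 1, "correct": True},
    {"query": "Where are OAuth tokens validated?", "hops": 1, "correct": True},
    {"query": "Why is checkout slow?", "hops": 3, "correct": False},
    {"query": "What breaks if we remove legacy auth?", "hops": 4, "correct": False},
    {"query": "Which handlers ignore TimeoutError?", "hops": 3, "correct": True},
]


def accuracy_by_hops(results: list[dict]) -> dict[int, float]:
    """Group graded results by hop count and report accuracy per group."""
    buckets: dict[int, list[bool]] = defaultdict(list)
    for record in results:
        buckets[record["hops"]].append(record["correct"])
    return {hops: sum(flags) / len(flags) for hops, flags in sorted(buckets.items())}


print(accuracy_by_hops(RESULTS))  # {1: 1.0, 3: 0.5, 4: 0.0}
```

An aggregate score over the same records would read 60%, which hides the fact that every failure sits in the multi-hop groups.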

> **Warning**
> Evaluation datasets built from FAQ-style questions systematically underrepresent the failure modes of single-pass RAG. If your system scores well on standard benchmarks but underperforms on real user queries, this is the likely explanation. Build your evaluation set from actual hard queries.

### What the Architecture Would Need to Do Instead

The solution isn't more retrieval at the start. It's retrieval on demand throughout the process. A system that can recognize "I need more context before I can answer this well" and then go get that context — repeatedly, if necessary — can handle the hard queries that break single-pass pipelines.

That's an agent. Not in the academic sense, not in the sci-fi sense, but in the practical systems sense: a language model that can decide to use tools, use retrieval as one of those tools, and loop until the task is complete.

The rest of this book is about how to build that system correctly. Before getting into architecture, though, it helps to understand what "agent" actually means at the code level — which is what Chapter 2 covers.

### Key Takeaways

1. Single-pass RAG works for lookup tasks. It fails for reasoning tasks that require discovering what to look for next.
2. The four main failure modes are query-answer mismatch, dependency blindness, context window exhaustion without coverage, and semantic drift.
3. Standard evaluation datasets don't surface these failures because they're built on easy queries.
4. The architectural fix is retrieval on demand throughout a loop, not more retrieval at the start.
5. An agent is the system pattern that enables this — a model that can decide to retrieve, retrieve, and decide again.

> **Try This**
> Take your existing RAG system and run ten real user queries through it — queries that came from actual users, not from your test set. For each query that gets a partial or wrong answer, trace back to the retrieval step: what chunks were retrieved? What would have needed to be retrieved to answer the question correctly? Could you have written the right retrieval query before seeing the answer? If not, you've found a query that requires an agentic loop.

---

## Chapter 2: Agents: Planning, Tool Use, and Loops

The word "agent" has accumulated enough baggage to be nearly useless without qualification. In academic AI it means something specific and historically loaded. In popular usage it means almost anything. For building production systems, neither of those meanings is particularly helpful.

Here's a working definition that's useful for system design: an agent is a language model in a loop that can take actions. The loop is the part people underestimate. The actions are the part people over-engineer.

### The Anatomy of an Agent

Strip away the abstraction and an agent is three things operating in sequence: perceive, decide, act.

**Perceive** means receiving the current state — what's in the context window. That includes the original task, the history of what's been done so far, any tool outputs that have come back, and any retrieved context. The model's "perception" is whatever is in the context at the moment it generates.

**Decide** means producing an output that either answers the task or specifies an action to take. In practice this looks like a structured output: either the final answer or a function call with parameters.

**Act** means executing that function call, getting a result, and feeding it back into the context for the next perception step.

That loop runs until the model decides it has enough information to produce a final answer. Or until you hit a step limit. Or until something errors out. The control flow is straightforward — what makes it interesting is the quality of the decisions in the middle.
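
Before wiring in a real model, the control flow can be sketched with the model replaced by a plain function. Everything here is illustrative: `decide` is a stand-in for the model call, and the decision format (`final` versus `tool` and `args`) is an assumption made for this sketch.

```python
from typing import Callable

# A decision is either {"final": "..."} or {"tool": name, "args": {...}}.
Decision = dict


def run_loop(
    task: str,
    decide: Callable[[list[dict]], Decision],
    tools: dict[str, Callable[..., object]],
    max_steps: int = 10,
) -> str:
    """Model-free skeleton of the loop; decide() stands in for the model call."""
    context = [{"role": "task", "content": task}]             # perceive: the state so far
    for _ in range(max_steps):
        decision = decide(context)                            # decide: answer or action
        if "final" in decision:
            return decision["final"]
        result = tools[decision["tool"]](**decision["args"])  # act: run the tool
        context.append({"role": "tool", "name": decision["tool"], "content": result})
    return "Step limit reached."


def scripted_decide(context: list[dict]) -> Decision:
    """Toy policy: search once, then answer from the observation."""
    observations = [m for m in context if m["role"] == "tool"]
    if not observations:
        return {"tool": "search", "args": {"query": "token validation"}}
    return {"final": f"Found: {observations[-1]['content']}"}


answer = run_loop(
    "Where are tokens validated?",
    scripted_decide,
    {"search": lambda query: "middleware/auth.py"},
)
print(answer)  # Found: middleware/auth.py
```

Swapping `scripted_decide` for a model call changes the quality of the decisions, not the shape of the loop.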

### Tool Use Is the Interface

The mechanism that makes agents extensible is tool use. Tools are functions the model can call, defined by a schema the model understands, executed by external code. The model doesn't run the tool — it decides to call it, specifies the parameters, and waits for the result.

A minimal tool definition looks like this:

```python
def search_codebase(query: str, top_k: int = 5) -> list[dict]:
    """
    Search the indexed codebase for relevant code chunks.

    Args:
        query: Natural language description of what to find
        top_k: Number of results to return

    Returns:
        List of chunks with file_path, content, and score
    """
    return vector_store.search(query, top_k=top_k)
```

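The function signature and docstring carry everything the schema needs: the name, the parameters, their types, and the description. The model never sees the Python source itself. It sees a schema derived from it (Chapter 3 shows the JSON form), and from that it decides whether to call the tool, what arguments to pass, and what to do with the result.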

This is important because it separates capability from control. Adding a new capability means adding a new tool. The model decides how and when to use it. You don't need to hardcode decision trees or routing logic — the model handles that, and you can adjust its behavior by changing the tool descriptions.
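
Since the tool definition passed to the model is a JSON-style dictionary, it can help to generate it from the function rather than maintain both by hand. The following is a rough sketch using only the standard library. The `tool_schema` helper and the `JSON_TYPES` mapping are assumptions for illustration, and the output shape follows the `name`, `description`, and `input_schema` layout used for tool definitions later in this book.

```python
import inspect
from typing import get_type_hints

# Assumed mapping from Python annotations to JSON Schema types.
JSON_TYPES = {str: "string", int: "integer", float: "number", bool: "boolean"}


def tool_schema(fn) -> dict:
    """Build a tool definition (name, description, input_schema) from a function."""
    hints = get_type_hints(fn)
    properties, required = {}, []
    for name, param in inspect.signature(fn).parameters.items():
        # Unknown or missing annotations fall back to "string" in this sketch
        properties[name] = {"type": JSON_TYPES.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default means the model must supply it
        else:
            properties[name]["default"] = param.default
    return {
        "name": fn.__name__,
        "description": inspect.getdoc(fn) or "",
        "input_schema": {"type": "object", "properties": properties, "required": required},
    }


def search_codebase(query: str, top_k: int = 5) -> list[dict]:
    """Search the indexed codebase for relevant code chunks."""
    return []  # placeholder body; only the signature and docstring matter here


schema = tool_schema(search_codebase)
print(schema["input_schema"]["required"])  # ['query']
```

A generated schema only covers the mechanical parts. Per-parameter descriptions, which matter a great deal for how the model uses the tool, still need to be written by hand.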

> **Key Insight**
> Tool descriptions are the primary control surface for agent behavior. A well-written tool description shapes how and when the model uses the tool far more than the underlying implementation. Invest in descriptions the way you'd invest in an API contract.

### Planning: What It Is and What It Isn't

Planning in agentic systems is often presented as something sophisticated — a model that reasons about a multi-step task, decomposes it, and executes each step in sequence. The reality is more prosaic.

Language model "planning" is next-token prediction that happens to produce a plausible sequence of steps. It works not because the model has a true world model, but because it was trained on enough examples of structured problem-solving that it produces outputs that look like plans and, often, are useful as plans.

What this means practically: don't over-rely on the model's planning ability for complex, novel tasks. Do rely on it for decomposing familiar task types into known steps. A model that's been prompted well and has access to the right tools can handle "investigate this bug" competently not because it reasons about bugs from first principles, but because it's seen enough bug investigation in training to know the shape of the process.

The practical implication is that planning quality is heavily dependent on prompt design. A prompt that makes the task structure explicit and gives the model concrete examples of how to decompose similar tasks will produce dramatically better plans than a prompt that just describes the end goal.

```python
AGENT_SYSTEM_PROMPT = """
You are a code investigation agent. You have access to tools for searching a codebase and reading files.

When given a task:
1. Identify what you need to find to complete the task
2. Use search_codebase to locate relevant code
3. Use read_file to examine specific files
4. If your search returns unexpected results, refine your query and search again
5. When you have enough context to answer confidently, produce your final response

Do not guess about code behavior. If you haven't read the relevant code, read it before answering.
"""
```

This prompt isn't magic. But it gives the model a concrete step sequence to follow, a clear signal about when to keep going versus when to stop, and a constraint that prevents confident-sounding guesses.

### The Loop Implementation

The loop itself is simple. Here's a minimal but complete implementation:

```python
import anthropic
import json

client = anthropic.Anthropic()

def run_agent(task: str, tools: list, max_steps: int = 20) -> str:
    messages = [{"role": "user", "content": task}]

    for step in range(max_steps):
        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            system=AGENT_SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )

        # Model produced a final answer
        if response.stop_reason == "end_turn":
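            # Concatenate the text blocks of the final response
            return "".join(block.text for block in response.content if block.type == "text")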

        # Model wants to use a tool
        if response.stop_reason == "tool_use":
            tool_results = []

            for block in response.content:
                if block.type == "tool_use":
                    result = execute_tool(block.name, block.input)
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": block.id,
                        "content": json.dumps(result),
                    })

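            # Add assistant response and tool results to history
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
            continue

        # Any other stop reason (for example "max_tokens") leaves the history
        # unchanged, so looping again would resend the identical request.
        return f"Stopped early: stop_reason={response.stop_reason!r}"

    return "Max steps reached without completing task."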

def execute_tool(name: str, params: dict) -> object:
    tool_map = {
        "search_codebase": search_codebase,
        "read_file": read_file,
        "list_directory": list_directory,
    }
    if name in tool_map:
        return tool_map[name](**params)
    raise ValueError(f"Unknown tool: {name}")
```

The structure is the same regardless of what tools you attach. The loop runs until the model signals it's done or until the step limit fires. On each iteration the model sees the full history of what it has done, which gives it the context to make the next decision.

### When to Stop

Step limits matter more than they appear to. Without them, a model with a broken tool or an ambiguous task can loop indefinitely. With a limit that's too low, legitimate complex tasks get cut off. The right limit depends on the task distribution.

For code investigation tasks — the main focus of this book — a limit of 15-25 steps covers the vast majority of legitimate cases. If a task genuinely requires more than 25 retrieval and read operations to answer, either the task is too coarse-grained or the retrieval quality is poor enough that the model is spinning. Both of those are signals to address at the architecture level, not by raising the step limit.

There are also conditions other than step count that should terminate a loop: repeated identical tool calls (the model is stuck), tool errors on every step (something is broken upstream), context window approaching the limit (the model is about to lose early context in a long conversation and its behavior will degrade).
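
These extra conditions can be tracked by a small helper that the loop consults after every tool call. The sketch below is one possible shape. The `LoopGuard` class and its thresholds are illustrative assumptions, and the context budget is approximated by counting characters of tool output instead of tokens.

```python
import json
from typing import Optional


class LoopGuard:
    """Tracks loop health; thresholds are illustrative, not recommendations."""

    def __init__(self, max_repeats: int = 2, max_consecutive_errors: int = 3,
                 context_budget_chars: int = 400_000):
        self.max_repeats = max_repeats
        self.max_consecutive_errors = max_consecutive_errors
        self.context_budget_chars = context_budget_chars
        self.seen_calls: dict[str, int] = {}
        self.consecutive_errors = 0
        self.context_chars = 0

    def record(self, tool_name: str, tool_input: dict, output: str,
               is_error: bool) -> Optional[str]:
        """Record one tool call; return a stop reason, or None to keep going."""
        # Identical name and arguments map to the same key regardless of key order
        key = tool_name + ":" + json.dumps(tool_input, sort_keys=True)
        self.seen_calls[key] = self.seen_calls.get(key, 0) + 1
        self.consecutive_errors = self.consecutive_errors + 1 if is_error else 0
        self.context_chars += len(output)

        if self.seen_calls[key] > self.max_repeats:
            return "repeated identical tool call"
        if self.consecutive_errors >= self.max_consecutive_errors:
            return "tool errors on consecutive steps"
        if self.context_chars > self.context_budget_chars:
            return "context budget nearly exhausted"
        return None
```

Inside the agent loop, a non-`None` return value from `record` would be a reason to stop and report, in the same way the step limit is.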

> **Warning**
> A step limit is not a safety guarantee. A model that runs a destructive action on step 1 and then loops until step 25 has already caused the damage by the time the limit fires. Safety constraints belong at the tool execution layer, not only at the loop control layer. Every tool that can modify state needs its own validation logic.
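
As an illustration of validation at the tool execution layer, here is a sketch of a hypothetical `write_file` tool that checks its arguments before touching the filesystem. The tool itself, the suffix allowlist, and the exception type are all assumptions made for this example. The point is only that the checks run inside the tool, independent of how many steps the loop has taken.

```python
from pathlib import Path


class ToolValidationError(Exception):
    """Raised when a tool call fails validation before execution."""


def resolve_inside_repo(repo_root: str, relative_path: str) -> Path:
    """Resolve a path and refuse anything that escapes the repository root."""
    root = Path(repo_root).resolve()
    target = (root / relative_path).resolve()
    if target != root and root not in target.parents:
        raise ToolValidationError(f"Path escapes repository root: {relative_path}")
    return target


def write_file(repo_root: str, path: str, content: str,
               allowed_suffixes: tuple[str, ...] = (".py", ".md")) -> str:
    """State-modifying tool: validate the path and file type before writing."""
    target = resolve_inside_repo(repo_root, path)
    if target.suffix not in allowed_suffixes:
        raise ToolValidationError(f"File type not allowed: {path}")
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return f"Wrote {len(content)} characters to {path}"
```

A call such as `write_file(repo, "../outside.py", "...")` fails validation on the first step it is attempted, which is the property a step limit cannot provide.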

### ReAct: The Pattern That Clarifies Everything

Most production agentic systems, whether they call it this or not, implement something close to the ReAct pattern (Reasoning + Acting). The model alternates between writing out its reasoning and calling a tool. The reasoning trace serves two purposes: it helps the model maintain coherent state across steps, and it gives you visibility into what the model is "thinking" before you see the action it takes.

```
Thought: I need to find where the authentication middleware is defined before I can understand
         how the token validation works. Let me search for it.

Action: search_codebase(query="authentication middleware definition", top_k=3)

Observation: [
  {"file": "middleware/auth.py", "content": "class AuthMiddleware:\n    def __init__..."},
  {"file": "config/middleware.py", "content": "AUTH_MIDDLEWARE_CLASS = 'middleware.auth.AuthMiddleware'"},
  ...
]

Thought: The middleware is in middleware/auth.py. I need to read the full implementation to
         understand how it validates tokens — especially the part that handles expiry.

Action: read_file(path="middleware/auth.py")
```

The "Thought" step isn't wasted tokens. It's working memory. Without it, the model would need to derive its reasoning purely from the tool outputs, which is harder and less reliable for multi-step tasks.

In practice, you implement this by prompting the model to think before acting, or by structuring the tool call protocol to include a reasoning field. The exact implementation varies by model and framework. What matters is that the reasoning happens inside the loop, not just at the beginning.
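
The second option, a reasoning field in the tool call protocol, can be sketched as a small transformation over tool definitions. This assumes the `name` / `input_schema` dictionary layout used for tools elsewhere in this book. The helper names and the wording of the field description are illustrative.

```python
import copy


def with_reasoning_field(tool: dict) -> dict:
    """Return a copy of a tool definition that requires a 'reasoning' argument."""
    patched = copy.deepcopy(tool)  # leave the original definition untouched
    schema = patched["input_schema"]
    schema.setdefault("properties", {})["reasoning"] = {
        "type": "string",
        "description": "One or two sentences on why this call is the right next step.",
    }
    required = schema.setdefault("required", [])
    if "reasoning" not in required:
        required.append("reasoning")
    return patched


def split_reasoning(tool_input: dict) -> tuple[str, dict]:
    """Separate the reasoning text from the arguments the tool actually accepts."""
    args = dict(tool_input)
    return args.pop("reasoning", ""), args
```

The executor would call `split_reasoning` on each incoming tool input, log the reasoning text, and pass only the remaining arguments to the underlying function.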

### Key Takeaways

1. An agent is a language model in a loop that can take actions. The loop is the essential ingredient.
2. Tools are the interface between model decisions and external capabilities. Tool descriptions are the primary control surface.
3. Planning quality depends on prompt design, not just model capability. Explicit task structure in prompts produces better plans.
4. The loop implementation is simple. Correctness comes from step limits, error handling, and careful tool design.
5. The ReAct pattern — alternating reasoning and action — improves reliability for multi-step tasks.

> **Try This**
> Implement the minimal agent loop above and attach a single tool: a codebase search function against a small repository you know well. Give it five real questions that require following a dependency chain to answer. Watch the reasoning trace. Note where the model makes good decisions about what to retrieve next versus where it gets lost. Those failure points are where you'll focus your tool design and prompt work.

---

## Chapter 3: Code Retrieval as an Agent Tool

General document retrieval and code retrieval share a surface-level similarity: both chunk content, embed it, and find nearest neighbors by query. Below that surface they're solving different problems.

Documents are relatively flat. A paragraph about database indexing is mostly self-contained. Pull it out of context and you lose some surrounding structure, but the paragraph still means roughly the same thing. Code is a graph. A function without its dependencies, its callers, its type imports, and its configuration context is often not just incomplete — it's actively misleading.

Getting code retrieval right inside an agent loop requires understanding what makes code different and building retrieval infrastructure that accounts for it.

### Why Code Requires Different Chunking

The standard approach to chunking documents is to split on size — fixed-size chunks, maybe with some overlap, maybe respecting sentence boundaries. Apply that to code and you get chunks that cut through function bodies, split class definitions, and sever import blocks from the code that uses them.

The result isn't just aesthetically unpleasant. It degrades retrieval quality because the semantic unit of code isn't a fixed-size window — it's a syntactic construct. A function is a unit. A class is a unit. An import block is a unit. A module-level constant definition is a unit.
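
A quick way to see the damage is to split a small function at a fixed character count and check whether each piece is still valid Python on its own. The snippet and the chunk size of 60 are arbitrary choices for illustration.

```python
import ast

SOURCE = '''def validate_token(token):
    payload = decode(token)
    if payload["exp"] < now():
        raise TokenExpired()
    return payload
'''


def fixed_size_chunks(text: str, size: int) -> list[str]:
    """Naive chunking: split every `size` characters, ignoring syntax."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def parses(chunk: str) -> bool:
    """True if the chunk is syntactically valid Python on its own."""
    try:
        ast.parse(chunk)
        return True
    except SyntaxError:
        return False


chunks = fixed_size_chunks(SOURCE, 60)
for chunk in chunks:
    print(parses(chunk), repr(chunk[:30]))
```

The whole function parses, but the split lands in the middle of the `if` statement, so at least one of the resulting pieces does not. A chunk that happens to parse is not necessarily meaningful either, since it may end partway through the function's logic.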

Chunk at syntactic boundaries instead:

```python
import ast
from pathlib import Path

def chunk_python_file(file_path: str) -> list[dict]:
    """Parse a Python file and chunk at function/class boundaries."""
    source = Path(file_path).read_text()
    tree = ast.parse(source)
    chunks = []

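    lines = source.splitlines()
    module_imports = extract_imports(tree)
    definition_types = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)

    # Only take top-level and class-level definitions. ast.walk(tree) would
    # also yield functions nested inside other functions, so iterate over the
    # module body and the bodies of top-level classes explicitly.
    candidates = []
    for node in tree.body:
        if isinstance(node, definition_types):
            candidates.append(node)
        if isinstance(node, ast.ClassDef):
            candidates.extend(
                child for child in node.body if isinstance(child, definition_types)
            )

    for node in candidates:
        start_line = node.lineno
        end_line = node.end_lineno

        chunks.append({
            "file_path": file_path,
            "name": node.name,
            "type": type(node).__name__,
            "start_line": start_line,
            "end_line": end_line,
            # Source lines for this node (lineno is 1-indexed)
            "source": "\n".join(lines[start_line - 1:end_line]),
            "docstring": ast.get_docstring(node) or "",
            "imports": module_imports,
        })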

    return chunks

def extract_imports(tree: ast.Module) -> list[str]:
    """Extract all import statements from the module."""
    imports = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            imports.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # node.module is None for relative imports such as "from . import x"
            imports.append("." * node.level + (node.module or ""))
    return imports
```

This produces chunks that align with how developers actually think about code — and, more importantly, how a model reasons about code. When the model retrieves a function, it gets the complete function definition, not a fragment.

> **Key Insight**
> Syntactic chunking isn't just cleaner than fixed-size chunking — it changes what the model can do with the retrieved content. A complete function with its signature, body, and docstring gives the model enough context to understand it. A fragment of a function often doesn't.

### Embedding Code: What Works and What Doesn't

Standard text embedding models were trained primarily on natural language. They can embed code, but they were not trained to understand code semantics the way they understand prose. The nearest-neighbor relationships they learn for code are less reliable than those for natural language.

This matters less than you'd expect for queries phrased in natural language. "Find the function that validates JWT tokens" will retrieve near the right result even with a text-only embedding model, because function names, docstrings, and comments provide enough natural language signal.

It matters more for queries phrased in code. "Find functions that call `requests.get`" is not a natural language query and will not embed near functions that contain `requests.get` in any reliable way.

The practical answer is hybrid search: combine dense embedding search (good for semantic similarity) with sparse keyword search like BM25 (good for exact term matching). Reciprocal Rank Fusion is a simple way to merge the ranked lists:

```python
def hybrid_search(query: str, top_k: int = 10) -> list[dict]:
    # Dense semantic search
    embedding = embed(query)
    dense_results = vector_store.search(embedding, top_k=top_k * 2)

    # Sparse keyword search
    sparse_results = bm25_index.search(query, top_k=top_k * 2)

    # Reciprocal Rank Fusion
    scores = {}
    k = 60  # RRF constant

    for rank, result in enumerate(dense_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    for rank, result in enumerate(sparse_results):
        doc_id = result["id"]
        scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)

    # Sort by combined score and return top_k
    sorted_ids = sorted(scores.keys(), key=lambda x: scores[x], reverse=True)
    return [get_chunk(doc_id) for doc_id in sorted_ids[:top_k]]
```

In practice this tends to outperform either dense or sparse search alone on code retrieval tasks, particularly for queries that mix natural language intent with specific code terms. Measure it against your own query set before relying on it.
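
To make the fusion arithmetic concrete: each document's score is the sum of 1 / (k + rank) over the lists it appears in, with ranks starting at 1. The sketch below applies that to two short ranked lists. The chunk ids are made up, and `rrf_fuse` is a standalone helper written for this example, separate from the `hybrid_search` function above.

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of document ids with Reciprocal Rank Fusion."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Hypothetical chunk ids, ordered best-first by each retriever.
dense = ["auth.validate_token", "auth.refresh_token", "http.fetch_user"]
sparse = ["http.fetch_user", "auth.validate_token", "billing.charge"]

for doc_id, score in rrf_fuse([dense, sparse]):
    print(f"{doc_id:22s} {score:.5f}")
```

`auth.validate_token` scores 1/61 + 1/62 and `http.fetch_user` scores 1/63 + 1/61, so both land ahead of the chunks that only one retriever returned. Agreement between the two lists counts for more than a high rank in only one of them.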

### The Metadata Problem

A chunk of code without metadata is just text. You need to know where it lives in the repository, what it depends on, and what depends on it. Without that information, the model can retrieve a function but can't navigate to its dependencies or understand its role in the system.

The minimum useful metadata for a code chunk:

```python
{
    "file_path": "src/auth/middleware.py",
    "module": "auth.middleware",
    "name": "validate_token",
    "type": "FunctionDef",
    "start_line": 47,
    "end_line": 89,
    "imports": ["jwt", "datetime", "typing"],
    "calls": ["jwt.decode", "datetime.utcnow", "_get_secret"],
    "called_by": ["AuthMiddleware.process_request"],
    "parent_class": "AuthMiddleware",
    "decorators": ["@staticmethod"],
}
```

This metadata is extractable statically at index time. The `calls` and `called_by` fields require a bit more work — you need to build a call graph — but they're the entries that make the difference between a retrieval system that returns relevant chunks and one that returns the specific chunk the agent needs to follow the chain.
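
The simpler fields can be read straight off the syntax tree. The following sketch collects the line range, parent class, and decorators for each function. It assumes Python 3.9 or later for `ast.unparse`, only looks at module-level functions and methods of classes (nested functions are skipped), and leaves `calls` and `called_by` to the call graph discussed next.

```python
import ast


def function_metadata(source: str, file_path: str) -> list[dict]:
    """Collect name, line range, parent class, and decorators for each function."""
    tree = ast.parse(source)
    records = []

    def visit(body: list, parent_class) -> None:
        for node in body:
            if isinstance(node, ast.ClassDef):
                visit(node.body, node.name)  # methods record the enclosing class
            elif isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                records.append({
                    "file_path": file_path,
                    "name": node.name,
                    "type": type(node).__name__,
                    "start_line": node.lineno,  # the "def" line, not the decorator
                    "end_line": node.end_lineno,
                    "parent_class": parent_class,
                    "decorators": ["@" + ast.unparse(d) for d in node.decorator_list],
                })

    visit(tree.body, None)
    return records
```

Running this over a file containing a `@staticmethod` method on `AuthMiddleware` would yield a record with `parent_class` set to `"AuthMiddleware"` and `decorators` set to `["@staticmethod"]`, matching the shape shown above.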

### The Call Graph as a Navigation Layer

The call graph is what makes code retrieval actually agentic. Without it, the model has to guess what to look at next based on source code text alone. With it, the agent can navigate dependencies explicitly.

```python
def build_call_graph(repo_path: str) -> dict:
    """
    Build a call graph mapping each function to its callees and callers.
    Returns: {
        "function_fqn": {
            "calls": ["other.function", ...],
            "called_by": ["another.function", ...]
        }
    }
    """
    graph = {}

    for py_file in Path(repo_path).rglob("*.py"):
        try:
            source = py_file.read_text()
            tree = ast.parse(source)

            for node in ast.walk(tree):
                if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
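                    # Dotted module path relative to the repo root, e.g. "auth.middleware",
                    # so keys match the fully-qualified names the agent tool expects
                    module = ".".join(py_file.relative_to(repo_path).with_suffix("").parts)
                    fqn = f"{module}.{node.name}"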
                    graph[fqn] = {"calls": [], "called_by": []}

                    # Find all Call nodes within this function
                    for child in ast.walk(node):
                        if isinstance(child, ast.Call):
                            callee = extract_call_name(child)
                            if callee:
                                graph[fqn]["calls"].append(callee)
        except (SyntaxError, UnicodeDecodeError):
            # Skip files that cannot be decoded or parsed
            continue

    # Build reverse edges (called_by)
    for fn, data in graph.items():
        for callee in data["calls"]:
            if callee in graph:
                graph[callee]["called_by"].append(fn)

    return graph
```

Expose this as an agent tool:

```python
def get_function_neighbors(function_name: str) -> dict:
    """
    Returns the immediate callers and callees of a function.
    Use this when you've found a relevant function and need to understand
    what it calls or what calls it.

    Args:
        function_name: Fully-qualified function name (e.g., 'auth.middleware.validate_token')

    Returns:
        dict with 'calls' and 'called_by' lists
    """
    return call_graph.get(function_name, {"calls": [], "called_by": []})
```

Now the agent can do something qualitatively different: it can follow dependency chains. "Find the authentication failure" becomes searchable not just by keyword proximity, but by actually tracing through the call graph from the entry point to the point of failure.
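
The `build_call_graph` listing relies on an `extract_call_name` helper that is not shown. One possible version, offered as a sketch and not as the definitive implementation, walks the attribute chain of the call target and returns the dotted name as written in the source. The `calls_in` function is a hypothetical convenience wrapper for trying it out.

```python
import ast
from typing import Optional


def extract_call_name(call: ast.Call) -> Optional[str]:
    """Return the dotted name of a call target as written in the source.

    Handles plain names (foo()) and attribute chains (jwt.decode()).
    Returns None for targets with no static name, such as handlers[0]().
    """
    parts = []
    target = call.func
    while isinstance(target, ast.Attribute):
        parts.append(target.attr)
        target = target.value
    if isinstance(target, ast.Name):
        parts.append(target.id)
        return ".".join(reversed(parts))
    return None


def calls_in(source: str) -> list[str]:
    """List the named calls in a snippet, in the order ast.walk visits them."""
    names = (
        extract_call_name(node)
        for node in ast.walk(ast.parse(source))
        if isinstance(node, ast.Call)
    )
    return [name for name in names if name]


print(calls_in("payload = jwt.decode(token)"))  # ['jwt.decode']
```

Names come back exactly as they appear in the code, so `self.cache.get` or an aliased import will not match a `module.function` key in the graph without an extra resolution step. Reverse edges built from these raw names will therefore be incomplete for anything beyond direct, unaliased calls.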

### Structuring Tools for Agent Use

The specific tools you expose to the agent matter as much as the underlying retrieval quality. Three tools form the core of a code-aware agent's retrieval toolkit:

**Search** — semantic + keyword search over the indexed codebase. The agent uses this when it doesn't know where something is.

**Read** — read the full content of a specific file or a line range within a file. The agent uses this when it knows where something is and needs the complete source.

**Navigate** — given a function or class, return its callers, callees, and import neighbors. The agent uses this to follow dependency chains without guessing.

A fourth tool that often proves valuable is **list_directory**, which lets the agent explore the repository structure when it doesn't know where to start searching.
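
A minimal sketch of what a `list_directory` handler could look like follows. The signature, the ignored-directory set, and the `max_entries` cap are all assumptions chosen for illustration; the text only says the tool lets the agent explore repository structure.

```python
from pathlib import Path

# Assumed defaults: directories that rarely help an agent orient itself
IGNORED_DIRS = {".git", "node_modules", "__pycache__", ".venv"}

def list_directory(repo_root: str, path: str = ".", max_entries: int = 200) -> dict:
    """List one directory level, directories first, with paths relative to the repo root."""
    root = Path(repo_root).resolve()
    target = (root / path).resolve()

    # Refuse paths that escape the repository root (e.g. "../..")
    if target != root and root not in target.parents:
        return {"error": f"Path '{path}' is outside the repository"}
    if not target.is_dir():
        return {"error": f"'{path}' is not a directory"}

    entries = []
    for child in sorted(target.iterdir(), key=lambda p: (p.is_file(), p.name)):
        if child.name in IGNORED_DIRS:
            continue
        rel = child.relative_to(root).as_posix()
        entries.append(rel + "/" if child.is_dir() else rel)

    # Cap the output so one huge directory can't flood the context window
    return {"entries": entries[:max_entries], "truncated": len(entries) > max_entries}
```

Returning one level at a time, rather than a recursive tree, keeps each result small and lets the agent decide which branch to descend into.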

```python
TOOLS = [
    {
        "name": "search_codebase",
        "description": "Search the indexed codebase using natural language or code terms. "
                       "Returns relevant code chunks with file paths and line numbers. "
                       "Use this when you're looking for something but don't know where it is.",
        "input_schema": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "What to search for. Can be natural language or code terms."
                },
                "top_k": {
                    "type": "integer",
                    "description": "Number of results to return. Default 5, max 20.",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    },
    {
        "name": "read_file",
        "description": "Read the contents of a specific file. Optionally specify line range. "
                       "Use this when you know the file path and need to see its full content.",
        "input_schema": {
            "type": "object",
            "properties": {
                "path": {"type": "string", "description": "File path relative to repo root"},
                "start_line": {"type": "integer", "description": "First line to read (1-indexed)"},
                "end_line": {"type": "integer", "description": "Last line to read (inclusive)"}
            },
            "required": ["path"]
        }
    },
    {
        "name": "get_neighbors",
        "description": "Get the callers and callees of a function. Use this to follow "
                       "dependency chains — to find what a function calls, or what calls it.",
        "input_schema": {
            "type": "object",
            "properties": {
                "function_name": {
                    "type": "string",
                    "description": "Fully-qualified function name (module.ClassName.method_name)"
                }
            },
            "required": ["function_name"]
        }
    }
]
```

Notice that the descriptions are specific about when to use each tool, not just what the tool does. This is the primary lever for guiding agent behavior, and it's worth spending time on.
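
The schemas above describe the interface only. As a hedged sketch of the handler side, here is one way `read_file` might honor the 1-indexed, inclusive line range from its schema. The `repo_root` parameter and the shape of the returned dict are assumptions.

```python
from pathlib import Path
from typing import Optional

def read_file(repo_root: str, path: str,
              start_line: Optional[int] = None,
              end_line: Optional[int] = None) -> dict:
    """Read a file (or a 1-indexed, inclusive line range) relative to the repo root."""
    full_path = Path(repo_root) / path
    if not full_path.is_file():
        return {"error": f"File not found: {path}"}

    lines = full_path.read_text(encoding="utf-8", errors="replace").splitlines()
    # Clamp the requested range to the file's actual bounds
    start = max(start_line or 1, 1)
    end = min(end_line or len(lines), len(lines))

    # Prefix each line with its number so the agent can cite exact locations
    numbered = [f"{n}: {lines[n - 1]}" for n in range(start, end + 1)]
    return {
        "path": path,
        "start_line": start,
        "end_line": end,
        "total_lines": len(lines),
        "content": "\n".join(numbered),
    }
```

Including `total_lines` in the result tells the agent whether it has seen the whole file or only a window of it.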

### Key Takeaways

1. Code retrieval requires syntactic chunking, not fixed-size chunking. Chunk at function and class boundaries.
2. Hybrid search — dense embeddings plus sparse BM25, fused with RRF — outperforms either alone for code.
3. Metadata is as important as the chunk content. File path, imports, and especially call graph edges enable navigation.
4. The call graph as a navigation tool lets agents follow dependency chains rather than guessing.
5. Tool descriptions shape agent behavior. Specificity about when to use a tool matters as much as what the tool does.

> **Try This**
> Build the three-tool toolkit (search, read, navigate) against a real repository. Write ten queries of increasing complexity: start with "find the function that does X" (one retrieval step) and end with "trace the full execution path from HTTP request to database write" (many steps). Log every tool call the agent makes. At what query complexity does the agent start making poor tool selection decisions? That boundary tells you where to focus your tool description and prompt work.

---

## Chapter 4: Multi-Step Retrieval for Complex Tasks

A single retrieval step answers a lookup question. Two retrieval steps can answer a follow-up. At some point — and the threshold varies by task — retrieval becomes an iterative investigation, not a single operation.

Multi-step retrieval is the pattern that turns an agent with code search tools into something that can handle genuinely complex tasks: full impact analyses, cross-module refactoring plans, root-cause investigations across a large codebase. The mechanics are the same as single-step retrieval. What changes is the strategy for how steps connect to each other.

### The Three Retrieval Strategies

In practice, multi-step retrieval in agentic systems follows three distinct strategies. Most real tasks use some combination of all three.

**Sequential refinement.** Each retrieval step uses the results of the previous step to form a better query. The agent searches for authentication middleware, reads the result, discovers that token validation is handled by a separate utility, searches for that utility, reads it, discovers it depends on a secret manager, and so on. Each step is informed by the previous one. This is the natural mode for dependency tracing.

**Parallel expansion.** The agent identifies multiple aspects of a task that can be investigated independently and issues several retrieval queries. This is useful for tasks like "identify all the places in the codebase that handle X" or "collect everything relevant to a proposed API change." The results are then synthesized together.

**Hypothesis testing.** The agent forms a hypothesis about the code ("the bug is probably in the validation layer because..."), retrieves evidence to test it, updates the hypothesis based on what it finds, and retrieves again. This is the mode that most closely resembles how an experienced engineer investigates an unfamiliar bug.

These strategies aren't mutually exclusive and the agent doesn't need to explicitly choose between them. A well-prompted agent with good tools will naturally employ all three as the task demands. Your job as the system designer is to make sure the retrieval infrastructure supports each strategy efficiently.

### Sequential Refinement in Practice

Sequential refinement is the most common multi-step pattern because most code investigation tasks are fundamentally sequential: you start at an entry point, follow references, and keep going until you reach the answer.

The key to making this work well is making sure each retrieval result gives the agent enough information to form the next query. That means chunks need to include their dependencies, and the call graph navigation tool needs to be fast enough to use freely.

Here's what a sequential refinement trace looks like for a real task:

```
Task: "Why does user login fail when the email contains a plus sign?"

Step 1: search_codebase("login email validation")
→ Returns: auth/views.py:LoginView.post, auth/forms.py:LoginForm.clean_email

Step 2: read_file("auth/forms.py", start_line=34, end_line=67)
→ Returns: LoginForm.clean_email — normalizes email with lower(), strips whitespace

Step 3: get_neighbors("auth.forms.LoginForm.clean_email")
→ Returns: called_by=["auth.views.LoginView.post"]
           calls=["validators.validate_email_format"]

Step 4: search_codebase("validate_email_format implementation")
→ Returns: validators.py:validate_email_format

Step 5: read_file("validators.py", start_line=12, end_line=29)
→ Returns: validate_email_format uses re.match with pattern r'^[\w\.-]+@[\w\.-]+\.\w+$'
           Note: \w does not match '+', so plus signs fail validation

Answer: The regex in validators.py:validate_email_format (line 15) uses a character class
        that doesn't include '+'. Email validation fails before authentication is attempted.
        Fix: update pattern to r'^[\w\.\+\-]+@[\w\.-]+\.\w+$'
```

Five steps from task to specific line-level answer. A senior engineer investigating this manually might follow the same path, but it would take much longer because they'd be navigating a code editor, reading surrounding context, and holding the chain in working memory. The agent does it systematically and doesn't lose track.
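
The agent's conclusion in this trace is easy to check directly. The snippet below runs the two patterns from the trace against a sample address; the helper name `is_valid` and the sample addresses are illustrative.

```python
import re

# Patterns copied from the trace above
ORIGINAL = r'^[\w\.-]+@[\w\.-]+\.\w+$'
PROPOSED = r'^[\w\.\+\-]+@[\w\.-]+\.\w+$'

def is_valid(pattern: str, email: str) -> bool:
    return re.match(pattern, email) is not None

print(is_valid(ORIGINAL, "user+tag@example.com"))  # False: '+' is not in the character class
print(is_valid(PROPOSED, "user+tag@example.com"))  # True
print(is_valid(ORIGINAL, "user@example.com"))      # True: plain addresses were never affected
```

A check like this is cheap enough to run as part of the agent's own verification before it reports a fix.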

### Parallel Expansion for Impact Analysis

Impact analysis is the canonical use case for parallel retrieval: "if I change this function, what breaks?" Following every call chain sequentially would be slow. Running the initial search breadth-first and then reading the results in parallel is much faster.

This requires explicit orchestration. A basic agent loop executes tool calls one at a time, and the model won't reliably request independent retrievals together unless the tools and prompt encourage it. To get parallel behavior, you need either a multi-agent architecture where sub-agents handle individual branches, or an explicit batching mechanism where you collect multiple tool calls from a single model turn and execute them concurrently.

A pragmatic middle ground is to implement a batch search tool:

```python
def search_codebase_batch(queries: list[str], top_k_per_query: int = 5) -> dict[str, list]:
    """
    Run multiple search queries simultaneously.
    Use this when you need to search for several independent things at once.

    Args:
        queries: List of search queries to run in parallel
        top_k_per_query: Results per query

    Returns:
        Dict mapping each query to its results
    """
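    import concurrent.futures

    # Drop duplicate queries while preserving the order they were given in
    unique_queries = list(dict.fromkeys(queries))
    results = {}

    with concurrent.futures.ThreadPoolExecutor() as executor:
        future_map = {
            executor.submit(search_codebase, q, top_k_per_query): q
            for q in unique_queries
        }
        for future in concurrent.futures.as_completed(future_map):
            query = future_map[future]
            try:
                results[query] = future.result()
            except Exception as exc:
                # One failed query shouldn't discard the rest of the batch
                results[query] = [{"error": str(exc)}]

    return results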
```

The agent can now request multiple retrieval operations in a single step when it knows what it needs. For impact analysis tasks, this typically cuts step count by 40-60%.

> **Key Insight**
> Parallel retrieval via batch tools is often the highest-leverage optimization for multi-step retrieval tasks. A task that takes 20 sequential steps can often be restructured to take 8 steps when some of them can run concurrently. The model will use batch retrieval if you provide the tool and describe it clearly.
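
The other route mentioned earlier, executing several tool calls from one model turn concurrently, can be sketched as follows. The call shape (`id`, `name`, `input`) and the `handlers` mapping are assumptions that loosely mirror how tool-use requests are usually represented; adapt them to your client library.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def execute_tool_calls(tool_calls: list, handlers: dict) -> list:
    """Run each requested call in a worker thread; results come back in request order."""

    def run_one(call: dict) -> dict:
        handler: Callable = handlers.get(call["name"])
        if handler is None:
            return {"id": call["id"], "error": f"Unknown tool: {call['name']}"}
        try:
            return {"id": call["id"], "result": handler(**call["input"])}
        except Exception as exc:
            # Report the failure to the model instead of crashing the loop
            return {"id": call["id"], "error": str(exc)}

    if not tool_calls:
        return []
    with ThreadPoolExecutor(max_workers=min(8, len(tool_calls))) as pool:
        # pool.map preserves input order, so results line up with the requests
        return list(pool.map(run_one, tool_calls))

# Usage with stand-in handlers
handlers = {"add": lambda a, b: a + b}
calls = [
    {"id": "t1", "name": "add", "input": {"a": 1, "b": 2}},
    {"id": "t2", "name": "nope", "input": {}},
]
print(execute_tool_calls(calls, handlers))
```

Threads fit here because retrieval handlers are typically I/O-bound (index lookups, file reads). CPU-heavy handlers would need a different executor.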

### Managing Context Growth

Multi-step retrieval creates a new problem that single-step RAG doesn't have: the context window fills up across steps. Each tool result gets added to the conversation history. After ten steps of reading code files, the context can contain 20,000-30,000 tokens of retrieved content. That's before the model's own reasoning and the original task.

Two approaches to managing this:

**Summarization.** After each set of tool calls, the agent produces a brief synthesis of what it found. Rather than accumulating raw tool outputs, you keep the summary and discard the raw outputs. This requires either prompting the model to summarize after each step or inserting a summarization step explicitly in the loop.

**Selective context.** Instead of keeping all retrieved content, the loop tracks which files have been read and stores them in a side channel. The model receives a lightweight manifest of what's been retrieved, reads new content on demand, and revisits stored content by referencing the manifest rather than having the full content in the main context.

The summarization approach is simpler to implement. The selective context approach preserves more information but requires more infrastructure. For most tasks under 30 steps, summarization is sufficient.

```python
def run_agent_with_compression(task: str, tools: list, max_steps: int = 30) -> str:
    messages = [{"role": "user", "content": task}]
    step_summaries = []

    for step in range(max_steps):
        # If context is getting long, compress old tool results
        if estimate_tokens(messages) > 60000:
            messages = compress_history(messages, step_summaries)

        response = client.messages.create(
            model="claude-opus-4-7",
            max_tokens=4096,
            system=AGENT_SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        )

        if response.stop_reason == "end_turn":
            return extract_text(response)

        if response.stop_reason == "tool_use":
            # Execute tools
            tool_results = execute_all_tools(response)

            # Summarize this step
            step_summary = f"Step {step + 1}: Used {get_tool_names(response)}. "
            step_summary += summarize_results(tool_results)
            step_summaries.append(step_summary)

            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

    return "Max steps reached."

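def compress_history(messages: list, summaries: list) -> list:
    """Replace old tool results with a summary to free context space."""
    # Nothing to drop until there are more than 5 steps (10 messages) after the task
    if len(messages) <= 11:
        return messages
    summary_message = {
        "role": "user",
        "content": "Previous investigation summary:\n" + "\n".join(summaries[:-5])
    }
    # Keep the original task, the summary, and the last 5 steps
    # (each step is an assistant message plus its tool results).
    # The system prompt is passed separately and is not part of messages.
    recent_messages = messages[-10:]
    return [messages[0], summary_message] + recent_messages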
```

This isn't perfect — compression loses information. But losing some information from early steps gracefully is better than running out of context and failing the task entirely.
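
For comparison, the selective context approach can be sketched with a small side-channel store. The class name, the key format, and the idea of a `recall_content` tool are assumptions made for illustration; the text only specifies a manifest plus on-demand access.

```python
class RetrievedContentStore:
    """Holds full tool outputs outside the prompt; the model sees only a manifest."""

    def __init__(self):
        self._content = {}  # key -> full retrieved text

    def put(self, key: str, content: str) -> None:
        self._content[key] = content

    def get(self, key: str) -> str:
        # Backing function for a hypothetical recall_content(key) tool
        return self._content.get(key, f"[no stored content for '{key}']")

    def manifest(self) -> str:
        """Lightweight listing to inject into the context instead of raw outputs."""
        lines = ["Retrieved so far (call recall_content(key) to re-read):"]
        for key, content in self._content.items():
            n_lines = content.count("\n") + 1
            lines.append(f"- {key} ({n_lines} lines)")
        return "\n".join(lines)

store = RetrievedContentStore()
store.put("auth/forms.py:34-67", "def clean_email(self):\n    ...")
print(store.manifest())
```

The manifest costs a few tokens per entry regardless of file size, which is what makes this approach scale to longer investigations.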

### Knowing When to Stop

Multi-step retrieval needs a clear completion criterion. Without one, an agent will often keep retrieving — there's always one more function to read, one more dependency to trace. This is especially true in large codebases where the dependency graph is deep.

Good stopping criteria:

- The agent's reasoning makes a confident claim that doesn't require further evidence ("The issue is in file X at line Y because...")
- The agent has retrieved the same information twice without learning anything new (diminishing returns)
- The task question can be answered with what's been gathered (the agent should check this explicitly at each step)

Prompt the agent to self-assess:

```
After each retrieval step, before deciding whether to continue, ask yourself:
- Can I answer the original question with what I have now? If yes, answer it.
- Is there specific missing information I know I need? If yes, get it.
- Am I retrieving more out of thoroughness rather than necessity? If yes, stop.
```

This self-assessment prompt significantly reduces unnecessary retrieval steps without sacrificing answer quality for tasks where more retrieval is genuinely needed.
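
The diminishing-returns criterion can also be enforced in the loop itself rather than left to the model. The sketch below is one assumed way to do it: hash each tool output and stop once the agent has re-fetched already-seen content a set number of times. The class name and threshold are illustrative.

```python
import hashlib

class RepeatDetector:
    """Flags diminishing returns when tool outputs start repeating."""

    def __init__(self, max_repeats: int = 2):
        self.max_repeats = max_repeats
        self._seen = set()
        self.repeat_count = 0

    def record(self, tool_output: str) -> bool:
        """Record one tool output; return True once the loop should stop."""
        digest = hashlib.sha256(tool_output.encode("utf-8")).hexdigest()
        if digest in self._seen:
            self.repeat_count += 1
        else:
            self._seen.add(digest)
        return self.repeat_count >= self.max_repeats

detector = RepeatDetector(max_repeats=2)
for output in ["chunk A", "chunk B", "chunk A", "chunk B"]:
    should_stop = detector.record(output)
print(should_stop)  # True: two outputs were exact repeats
```

Exact-match hashing only catches literal repeats. Overlapping line ranges of the same file would need a coarser key, such as file path plus range.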

### Key Takeaways

1. Multi-step retrieval follows three main strategies: sequential refinement, parallel expansion, and hypothesis testing. Real tasks use all three.
2. Sequential refinement is the natural mode for dependency tracing. Call graph metadata is essential.
3. Parallel expansion via batch retrieval tools can cut step counts significantly for breadth-first tasks.
4. Context growth is a real constraint. Summarization-based compression is the pragmatic solution.
5. Explicit stopping criteria and self-assessment prompts reduce unnecessary retrieval without sacrificing quality.

> **Try This**
> Design an impact analysis task: pick a function in a medium-sized repository and ask the agent "if I change the signature of X function, what else would need to change?" Log every retrieval step. Count how many steps were necessary versus informative-but-not-required versus clearly redundant. Then add the self-assessment prompt and run the same task. Compare step counts. The reduction in redundant steps is the self-assessment prompt's contribution.

---

## Chapter 5: Grounding Agents in Codebase Reality

An agent that retrieves code and reasons about it can still go wrong in a specific way: it can produce answers that are semantically coherent but structurally incorrect. The code it describes doesn't actually exist in the form it describes. The function it says to call doesn't have that signature. The module it recommends importing isn't available in the environment.

This is distinct from hallucination in the general sense. The agent isn't making things up entirely — it's reasoning from retrieved context that's either incomplete or subtly outdated, and it's filling gaps in ways that sound plausible but aren't accurate. Grounding is the set of techniques that prevent this.

### The Grounding Problem in Code Contexts

Language models have seen enormous amounts of code in training. They have strong priors about how code tends to be structured, what common library APIs look like, how authentication is typically implemented, and so on. These priors are usually helpful. But in an agentic code context, they can mislead.

The specific failure mode: the agent retrieves a function, reads its signature, and then reasons about how it's probably used based on training priors rather than what the codebase actually shows. If the codebase uses a slightly unusual calling convention, or if the function has been refactored and the signature has changed, the agent's answer will reflect the prior, not the reality.

The practical consequence is that agents produce code suggestions that are wrong in ways that are hard to catch without running the code. The reasoning looks correct. The answer has the right shape. But it doesn't match the actual codebase.

> **Warning**
> Agents grounded in retrieved context are more accurate than agents working from memory alone — but grounding reduces hallucination, it doesn't eliminate it. Every suggestion the agent makes about code it hasn't explicitly retrieved is a potential point of failure. Design your verification layer accordingly.

### Grounding Through Explicit Read Confirmation

The first line of defense is a prompt constraint: the agent should not make claims about code it hasn't read.

This sounds obvious but it's not the default behavior. Without an explicit constraint, agents will freely combine retrieved context with training priors to fill gaps. The resulting answers often look better than answers based purely on retrieved context — they're more fluent and complete — but they're less accurate.

```python
GROUNDING_PROMPT = """
You must not make claims about code you haven't explicitly read during this session.

If you need to describe how a function works, read it first with read_file.
If you need to reference an import, verify it exists in the actual file.
If you're unsure whether something exists, search for it before asserting it does.

Uncertainty is acceptable. "I haven't read that file yet" is a valid statement.
Confident claims about unread code are not acceptable.
"""
```

Adding this constraint reliably reduces the gap between agent claims and codebase reality, at the cost of more retrieval steps. That's the right tradeoff for any task where the output gets acted on rather than just read.
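
A prompt constraint is easier to trust when something checks it. As a rough sketch, the loop can compare file paths mentioned in the final answer against the set of files the agent actually read. The regular expression, the extension list, and the function name are assumptions; a production check would be tuned to your repository.

```python
import re

# Assumed heuristic: path-like tokens ending in a known source extension
PATH_PATTERN = re.compile(r'\b[\w./-]+\.(?:py|js|ts|go|java|rs)\b')

def find_unread_references(answer: str, files_read: set) -> list:
    """Return file paths the answer mentions that were never read this session."""
    mentioned = set(PATH_PATTERN.findall(answer))
    return sorted(mentioned - files_read)

answer = (
    "The bug is in auth/forms.py, and the regex lives in "
    "validators.py:validate_email_format."
)
print(find_unread_references(answer, files_read={"auth/forms.py"}))
# ['validators.py']
```

A non-empty result doesn't prove the claim is wrong, only that it wasn't grounded in a read. Feeding the list back to the agent as a prompt to go read those files is one reasonable response.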

### Schema and Signature Verification

For tasks where the agent produces code (refactoring suggestions, new function implementations, API integrations), there's a specific grounding check that's worth running before the agent's output goes anywhere: verify that every function call, class instantiation, and import in the generated code matches what's actually in the codebase.

This can be done statically with the AST:

```python
import ast

def verify_generated_code(generated_code: str, codebase_index: dict) -> list[str]:
    """
    Check generated code against the actual codebase for structural correctness.
    Returns list of grounding errors found.
    """
    errors = []

    try:
        tree = ast.parse(generated_code)
    except SyntaxError as e:
        return [f"Syntax error: {e}"]

    for node in ast.walk(tree):
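        if isinstance(node, ast.Import):
            for alias in node.names:
                if not module_exists_in_codebase(alias.name, codebase_index):
                    errors.append(f"Import '{alias.name}' not found in codebase")

        elif isinstance(node, ast.ImportFrom):
            # `from x import y`; node.module is None for `from . import y`
            if node.module and not module_exists_in_codebase(node.module, codebase_index):
                errors.append(f"Import '{node.module}' not found in codebase")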

        elif isinstance(node, ast.Call):
            func_name = extract_call_name(node)
            if func_name and is_local_function(func_name, codebase_index):
                actual_sig = get_function_signature(func_name, codebase_index)
                if actual_sig and not call_matches_signature(node, actual_sig):
                    errors.append(
                        f"Call to '{func_name}' doesn't match signature: {actual_sig}"
                    )

    return errors
```
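
The helper `call_matches_signature` is left undefined above. One possible sketch is shown here, under the assumption that the stored signature is the function's `ast.FunctionDef` node (the original doesn't say what form `actual_sig` takes). It checks argument counts and keyword names only, not types, and it doesn't account for the implicit `self` on method calls.

```python
import ast

def call_matches_signature(call: ast.Call, func_def: ast.FunctionDef) -> bool:
    """Statically check whether a call could bind to a function definition."""
    # Calls that unpack *args or **kwargs can't be checked without running code
    if any(isinstance(a, ast.Starred) for a in call.args):
        return True
    if any(kw.arg is None for kw in call.keywords):
        return True

    params = func_def.args
    positional = [a.arg for a in params.posonlyargs + params.args]
    keyword_ok = {a.arg for a in params.args + params.kwonlyargs}

    # Too many positional arguments, and no *args to absorb them
    if len(call.args) > len(positional) and params.vararg is None:
        return False

    bound = set(positional[:len(call.args)])
    for kw in call.keywords:
        if kw.arg in keyword_ok:
            if kw.arg in bound:
                return False  # same parameter supplied twice
            bound.add(kw.arg)
        elif params.kwarg is None:
            return False  # unknown keyword, and no **kwargs to absorb it

    # Every parameter without a default must be bound
    n_required = len(positional) - len(params.defaults)
    required = positional[:n_required]
    required += [a.arg for a, d in zip(params.kwonlyargs, params.kw_defaults) if d is None]
    return all(name in bound for name in required)

func_def = ast.parse("def send(to, subject, *, retries=3): ...").body[0]
good = ast.parse("send('a@b.c', 'hi', retries=1)").body[0].value
bad = ast.parse("send('a@b.c')").body[0].value
print(call_matches_signature(good, func_def), call_matches_signature(bad, func_def))
# True False
```

Returning `True` for calls it can't analyze keeps the check conservative: it reports only mismatches it can actually establish.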

Feed these errors back to the agent as a tool result:

```python
def validate_code(code: str) -> dict:
    """
    Validate generated code against the actual codebase.
    Use this before finalizing any code suggestion.

    Returns: {'valid': bool, 'errors': list[str]}
    """
    errors = verify_generated_code(code, codebase_index)
    return {
        "valid": len(errors) == 0,
        "errors": errors
    }
```

This creates a self-correcting loop: the agent generates code, validates it, receives any errors, corrects them, and validates again. In practice, one or two correction cycles are usually sufficient.
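
If you prefer to drive the loop from the orchestrator rather than rely on the agent to call the tool, a bounded version might look like the sketch below. The `generate` and `validate` callables are placeholders: `generate` stands in for a model call that receives the previous errors, and `validate` is assumed to return the same `{'valid', 'errors'}` shape as `validate_code`.

```python
from typing import Callable

def generate_with_validation(
    generate: Callable[[list], str],
    validate: Callable[[str], dict],
    max_cycles: int = 3,
) -> dict:
    """Generate code, validate it, and retry with the errors as feedback."""
    errors = []
    code = ""
    for cycle in range(1, max_cycles + 1):
        code = generate(errors)  # previous errors are passed back as feedback
        result = validate(code)
        errors = result["errors"]
        if result["valid"]:
            return {"code": code, "valid": True, "cycles": cycle, "errors": []}
    # Give up after max_cycles and surface the remaining errors to the caller
    return {"code": code, "valid": False, "cycles": max_cycles, "errors": errors}

# Stand-ins for the model call and the validator
def fake_generate(previous_errors: list) -> str:
    return "fixed()" if previous_errors else "broken()"

def fake_validate(code: str) -> dict:
    ok = code == "fixed()"
    return {"valid": ok, "errors": [] if ok else ["Call to 'broken' not found"]}

print(generate_with_validation(fake_generate, fake_validate))
```

The hard cap matters: without it, a model that keeps producing the same invalid call would loop indefinitely.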

### Version and Staleness Awareness

Code changes. An index built yesterday may not reflect what was merged this morning. An agent working from a stale index can give accurate-sounding answers based on code that no longer exists in the form it was indexed.

Three approaches to staleness:

**Index freshness metadata.** Track when each chunk was last indexed and surface this as part of the retrieval result. The agent can factor in staleness when deciding how confident to be.

**Git-aware retrieval.** Index from HEAD at query time rather than from a snapshot. This is slower but guarantees freshness. For most production systems, a hybrid approach works: full re-index periodically, incremental updates on commit.

**Explicit staleness checking.** Before finalizing any answer that references specific file locations, confirm the file still exists and the relevant lines haven't changed significantly:

```python
def check_file_freshness(file_path: str, expected_hash: str) -> dict:
    """
    Verify that a file's content matches the indexed version.

    Args:
        file_path: Path to the file
        expected_hash: Hash from the index at indexing time

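    Returns:
        {'exists': bool, 'fresh': bool, 'changed': bool, 'current_hash': str or None}
    """
    import hashlib
    from pathlib import Path

    path = Path(file_path)
    if not path.is_file():
        # The file was deleted or moved since it was indexed
        return {"exists": False, "fresh": False, "changed": True, "current_hash": None}

    # The truncated hash must be computed the same way as at indexing time
    current_content = path.read_text()
    current_hash = hashlib.sha256(current_content.encode()).hexdigest()[:8]

    return {
        "exists": True,
        "fresh": current_hash == expected_hash,
        "changed": current_hash != expected_hash,
        "current_hash": current_hash,
    }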
```

For long-running agents on active codebases, staleness checking on high-confidence claims before finalizing output is worth the overhead.
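
The first approach, index freshness metadata, needs very little code. The sketch below assumes each chunk carries an ISO 8601 `indexed_at` timestamp and that 24 hours is the staleness threshold; both are illustrative choices, not requirements from the text.

```python
from datetime import datetime, timedelta, timezone

def annotate_freshness(chunk: dict, now: datetime,
                       max_age: timedelta = timedelta(hours=24)) -> dict:
    """Return a copy of the chunk with its index age and a staleness flag."""
    indexed_at = datetime.fromisoformat(chunk["indexed_at"])
    age = now - indexed_at
    return {
        **chunk,
        "index_age_hours": round(age.total_seconds() / 3600, 1),
        "possibly_stale": age > max_age,
    }

chunk = {"file_path": "auth/forms.py", "indexed_at": "2026-04-19T00:00:00+00:00"}
now = datetime(2026, 4, 20, 12, 0, tzinfo=timezone.utc)
print(annotate_freshness(chunk, now))
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the helper deterministic and easy to test.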

### Symbol Resolution

One grounding failure that's easy to miss: the agent retrieves a function called `process_event` and references it in its answer — but the codebase has three functions with that name in different modules. The agent synthesizes an answer based on the one it retrieved, but the actual behavior the user cares about is in a different one.

Symbol resolution — making sure the agent knows which exact symbol it's talking about — requires fully-qualified names in your retrieval results. File path plus module path plus class name plus function name. Not just the function name.

Build this into your chunk metadata and enforce it in tool responses:

```python
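def format_chunk_for_agent(chunk: dict) -> str:
    """Format a chunk with full context for agent consumption."""
    # Skip parent_class when absent so module-level functions
    # render as 'module.name' rather than 'module..name'
    parts = [chunk['module'], chunk.get('parent_class'), chunk['name']]
    symbol = ".".join(p for p in parts if p)
    return f"""
File: {chunk['file_path']}
Symbol: {symbol}
Lines: {chunk['start_line']}-{chunk['end_line']}

{chunk['source']}
"""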
```

An agent that sees fully-qualified names makes fewer symbol resolution errors. The overhead is minor and the reliability gain is real.
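
It can also help to know in advance which names are ambiguous. The following sketch scans chunk metadata for short names that map to more than one fully-qualified symbol. It assumes the same `module`, `parent_class`, and `name` keys used in the formatter above; the function name is illustrative.

```python
from collections import defaultdict

def find_ambiguous_symbols(chunks: list) -> dict:
    """Map each short name to its fully-qualified variants when there is more than one."""
    by_name = defaultdict(set)
    for chunk in chunks:
        parts = [chunk["module"], chunk.get("parent_class"), chunk["name"]]
        fqn = ".".join(p for p in parts if p)
        by_name[chunk["name"]].add(fqn)
    return {name: sorted(fqns) for name, fqns in by_name.items() if len(fqns) > 1}

chunks = [
    {"module": "billing.events", "name": "process_event"},
    {"module": "audit.handlers", "parent_class": "AuditLog", "name": "process_event"},
    {"module": "auth.tokens", "name": "validate_token"},
]
print(find_ambiguous_symbols(chunks))
# {'process_event': ['audit.handlers.AuditLog.process_event', 'billing.events.process_event']}
```

A search tool could attach a short warning to any result whose name appears in this map, nudging the agent to confirm which variant the task is about.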

### Codebase Topology as Ground Truth

For complex architectural questions, retrieval alone isn't sufficient grounding. The agent also needs to understand the codebase's structure: which modules are at the core, which are peripheral, what the main dependency directions are, where the system's boundaries are.

This high-level topology can be provided as a stable system prompt addition:

```
Repository structure overview:
- src/core/ — Core domain logic. No external dependencies.
- src/api/ — HTTP layer. Depends on core, not the reverse.
- src/infra/ — Database, cache, external services. Core depends on interfaces, not implementations.
- tests/ — Unit and integration tests. Follows the same structure.
- config/ — Environment-specific configuration. Loaded at startup.

Architectural constraints:
- Core modules must not import from api/ or infra/
- All external service access goes through infra/
- No circular imports
```

An agent with this context makes structurally sound suggestions. Without it, the agent's suggestions may be semantically correct but architecturally incorrect — importing an infrastructure module into a core module, for example, or suggesting a change that creates a circular dependency.
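
Constraints like these can be checked mechanically as well as stated in the prompt. Here is a minimal sketch that flags forbidden imports in generated code using the AST. The mapping from directory prefixes to dotted module names (`src.api`, `src.infra`) is an assumption about how the example repository is packaged.

```python
import ast

# Mirrors "Core modules must not import from api/ or infra/"; module names are assumed
FORBIDDEN_IMPORTS = {"src/core/": ("src.api", "src.infra")}

def check_layer_imports(code: str, file_path: str) -> list:
    """Return a violation message for each import that crosses a forbidden boundary."""
    forbidden = tuple(
        module
        for prefix, modules in FORBIDDEN_IMPORTS.items()
        if file_path.startswith(prefix)
        for module in modules
    )
    if not forbidden:
        return []

    violations = []
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, ast.Import):
            imported = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.module:
            imported = [node.module]
        else:
            continue
        for module in imported:
            # Match the package itself or anything beneath it
            if any(module == f or module.startswith(f + ".") for f in forbidden):
                violations.append(f"{file_path}: line {node.lineno} imports '{module}'")
    return violations

suggestion = "from src.infra.db import Session\nimport json\n"
print(check_layer_imports(suggestion, "src/core/orders.py"))
```

Relative imports (`from . import x`) are skipped in this sketch, since resolving them requires knowing the package layout.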

### Key Takeaways

1. Grounding reduces hallucination but doesn't eliminate it. Every claim about unread code is a risk.
2. Explicit constraints ("don't claim what you haven't read") reliably reduce grounding failures at the cost of more retrieval steps.
3. Signature verification on generated code catches structural errors before they reach a developer.
4. Index staleness is a real problem for active codebases. Incremental updates and freshness metadata mitigate it.
5. Fully-qualified symbol names in retrieval results prevent symbol resolution errors.

> **Try This**
> Take a code suggestion from your agent — one it produced with apparent confidence — and verify it manually against the actual codebase. Check every function call signature, every import, every method name. Count the errors. Then add the grounding prompt constraint and the code validation tool, and run the same task again. The reduction in errors is your grounding improvement. If the errors don't reduce, the constraint isn't being respected — check whether the tool description makes its purpose clear enough.

---

## Chapter 6: Memory and State Across Agent Steps

Within a single agent run, the conversation history is the memory. Everything the model needs to know about what it has done is in the context window. This works for tasks that complete in a reasonable number of steps. It breaks down for tasks that span multiple sessions, tasks that build on previous investigations, and cases where the same agent is repeatedly asked about the same codebase.

Memory in agentic systems is a spectrum from purely in-context to fully externalized. Where a system sits on that spectrum depends on what kind of continuity it needs.

### What Needs to Persist

Not everything worth remembering is worth persisting across sessions. The distinction matters because external memory adds complexity, latency, and failure modes that in-context memory avoids.

**Worth persisting:** Findings about the codebase that required significant computation to produce. If it took 20 retrieval steps and 10 minutes to understand why a particular subsystem behaves the way it does, that finding should be available to the next agent run without re-doing the work. Architectural insights. Known gotchas. The results of past investigations.

**Not worth persisting:** The full transcript of what the agent did to produce those findings. The intermediate retrieval results. The step-by-step reasoning. These are path-dependent and rebuilding from scratch is usually faster than trying to resume from a detailed transcript.

**Worth tracking but not persisting in a traditional memory store:** Which files were recently read (recency for re-indexing priority), which functions were most frequently retrieved (hotspot identification), which queries produced no useful results (search quality signals).

### Session-Level Memory: Working Notes

The most useful form of memory for a code-aware agent is something like a working notes document: structured prose that captures what the agent has established about the codebase and what questions remain open.

```python
from datetime import datetime

class WorkingNotes:
    def __init__(self):
        self.established_facts = []
        self.open_questions = []
        self.architectural_observations = []
        self.investigated_files = set()
        self.gotchas = []

    def add_fact(self, fact: str, source: str):
        """Record a confirmed finding with its source."""
        self.established_facts.append({
            "fact": fact,
            "source": source,
            "timestamp": datetime.utcnow().isoformat()
        })

    def mark_investigated(self, file_path: str):
        self.investigated_files.add(file_path)

    def to_context_block(self) -> str:
        """Format notes for injection into agent context."""
        lines = ["## Investigation Notes"]

        if self.established_facts:
            lines.append("\n### Established Facts")
            for item in self.established_facts:
                lines.append(f"- {item['fact']} (source: {item['source']})")

        if self.open_questions:
            lines.append("\n### Open Questions")
            for q in self.open_questions:
                lines.append(f"- {q}")

        if self.investigated_files:
            lines.append(f"\n### Files Investigated ({len(self.investigated_files)} total)")
            lines.append(", ".join(sorted(self.investigated_files)))

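        # Render the remaining note types too (assumed to be lists of strings)
        if self.architectural_observations:
            lines.append("\n### Architectural Observations")
            for obs in self.architectural_observations:
                lines.append(f"- {obs}")

        if self.gotchas:
            lines.append("\n### Gotchas")
            for gotcha in self.gotchas:
                lines.append(f"- {gotcha}")

        return "\n".join(lines)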
```

Inject working notes at the start of each agent run and update them throughout the session. This gives the model a structured view of what's already been established, preventing duplicate work and providing a foundation for incremental investigation.

### Cross-Session Memory: The Knowledge Store

For persistent memory across sessions, the agent needs an external store. The right design depends on query patterns:

- If the agent will be asked questions like "what did we find about the auth system last week?" — it needs retrievable text, so a vector store with semantic search is appropriate.
- If the agent needs to track the state of an ongoing investigation — it needs structured state, so a document store or key-value store is more appropriate.
- If the agent needs to reason about what it knows at a system level — it needs a knowledge graph.

For most practical code-aware agents, a simple hybrid works: a vector store for past investigation findings, a structured document for active investigation state.

```python
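from datetime import datetime

# VectorStore, embed, and generate_id are assumed to be provided elsewhere
# (your vector database client, embedding function, and ID helper).
def save_investigation_to_memory(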
    investigation_title: str,
    findings: list[str],
    codebase_path: str,
    relevant_files: list[str],
    memory_store: VectorStore
):
    """Persist investigation results for future retrieval."""
    document = {
        "title": investigation_title,
        "findings": findings,
        "codebase": codebase_path,
        "files": relevant_files,
        "timestamp": datetime.utcnow().isoformat(),
    }

    # Embed the combined findings for semantic retrieval
    combined_text = f"{investigation_title}\n\n" + "\n".join(findings)
    embedding = embed(combined_text)

    memory_store.upsert(
        id=generate_id(investigation_title, codebase_path),
        embedding=embedding,
        metadata=document,
        document=combined_text
    )

def recall_relevant_investigations(
    current_task: str,
    codebase_path: str,
    memory_store: VectorStore,
    top_k: int = 3
) -> list[dict]:
    """Retrieve past investigations relevant to the current task."""
    embedding = embed(current_task)
    results = memory_store.search(
        embedding=embedding,
        filter={"codebase": codebase_path},
        top_k=top_k
    )
    return [r.metadata for r in results]
```

Make this available as an agent tool:

```python
def recall_past_findings(query: str) -> list[dict]:
    """
    Search for past investigation findings relevant to the current task.
    Use this at the start of an investigation to check whether similar
    work has already been done.

    Args:
        query: Description of what you're investigating

    Returns:
        List of past investigation summaries
    """
    return recall_relevant_investigations(query, CODEBASE_PATH, memory_store)
```

The agent can now start a new investigation by checking what's already known, avoiding re-investigation of already-solved problems and building on established findings rather than starting from scratch.
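
To expose `recall_past_findings` to the model, it needs a tool definition alongside the function. The sketch below assumes the `name` / `description` / `input_schema` shape used for Anthropic tool use; `RECALL_TOOL` and `dispatch_tool` are hypothetical names introduced here for illustration, not part of any library.

```python
# Hypothetical tool definition for recall_past_findings.
# The input_schema is ordinary JSON Schema describing the function's arguments.
RECALL_TOOL = {
    "name": "recall_past_findings",
    "description": (
        "Search past investigation findings relevant to the current task. "
        "Call this at the start of an investigation to check whether "
        "similar work has already been done."
    ),
    "input_schema": {
        "type": "object",
        "properties": {
            "query": {
                "type": "string",
                "description": "Description of what you're investigating",
            }
        },
        "required": ["query"],
    },
}


def dispatch_tool(name: str, params: dict, registry: dict) -> object:
    """Look up a tool by name and call it with the model-supplied params."""
    if name not in registry:
        raise KeyError(f"Unknown tool: {name}")
    return registry[name](**params)
```

Keeping the description close to the function's docstring matters more than it looks: the description is the only thing the model sees when deciding whether to call the tool.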

> **Key Insight**
> Memory in agentic systems is primarily a cost optimization, not a quality improvement. The agent can re-derive most findings by re-running retrieval. Memory makes that re-derivation unnecessary. Design memory around the most expensive computations to reproduce, not around everything the agent has ever done.

### State Management for Long-Running Tasks

Some tasks genuinely span multiple sessions: a large refactoring that requires investigating twenty subsystems, or a security audit that needs to cover every access control point in the codebase. These tasks can't complete in a single run, and their intermediate state needs to survive between sessions.

A task state schema:

```python
from dataclasses import dataclass, field
from typing import Optional
from datetime import datetime
import json

@dataclass
class AgentTaskState:
    task_id: str
    task_description: str
    status: str  # "in_progress", "blocked", "complete"

    # What's been done
    completed_subtasks: list[str] = field(default_factory=list)
    investigated_files: list[str] = field(default_factory=list)
    established_facts: list[dict] = field(default_factory=list)

    # What's left to do
    pending_subtasks: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)

    # Blocking state
    blocked_reason: Optional[str] = None

    # Metadata
    created_at: str = field(default_factory=lambda: datetime.utcnow().isoformat())
    last_updated: str = field(default_factory=lambda: datetime.utcnow().isoformat())

    def to_context_summary(self) -> str:
        """Generate a context-efficient summary for injection into agent context."""
        lines = [
            f"Task: {self.task_description}",
            f"Status: {self.status}",
            f"Completed subtasks ({len(self.completed_subtasks)}): {', '.join(self.completed_subtasks[:5])}",
            f"Pending subtasks ({len(self.pending_subtasks)}): {', '.join(self.pending_subtasks)}",
            f"Files investigated: {len(self.investigated_files)}",
        ]
        if self.established_facts:
            lines.append("\nKey findings:")
            for fact in self.established_facts[-5:]:  # Last 5 to keep context small
                lines.append(f"  - {fact['fact']}")
        return "\n".join(lines)

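    def save(self, storage_path: str):
        # Refresh the timestamp so last_updated reflects the most recent checkpoint
        self.last_updated = datetime.utcnow().isoformat()
        with open(f"{storage_path}/{self.task_id}.json", "w") as f:
            json.dump(self.__dict__, f, indent=2)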

    @classmethod
    def load(cls, task_id: str, storage_path: str) -> "AgentTaskState":
        with open(f"{storage_path}/{task_id}.json") as f:
            data = json.load(f)
        return cls(**data)
```

When the agent resumes a task, it loads the state, injects the summary into context, and continues from where it left off. The state update happens incrementally during the session, ensuring that a crash or timeout loses at most the current step's work.
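
Losing at most one step's work depends on two things: saving after every step, and never leaving a half-written state file behind. A minimal sketch of both, using a plain dict with the same field names as `AgentTaskState`. The names `checkpoint_state`, `run_with_checkpoints`, and the `do_subtask` callable are hypothetical, introduced only for this example.

```python
import json
import os
import tempfile


def checkpoint_state(state: dict, storage_path: str, task_id: str) -> str:
    """Write task state via a temp file and rename, so readers never see a partial file."""
    os.makedirs(storage_path, exist_ok=True)
    final_path = os.path.join(storage_path, f"{task_id}.json")
    # The temp file lives in the same directory so the rename stays on one filesystem
    fd, tmp_path = tempfile.mkstemp(dir=storage_path, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
        os.replace(tmp_path, final_path)  # replaces the old checkpoint in one step
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
    return final_path


def run_with_checkpoints(state: dict, storage_path: str, task_id: str, do_subtask) -> dict:
    """Work through pending subtasks, checkpointing after each one."""
    while state["pending_subtasks"]:
        subtask = state["pending_subtasks"][0]
        do_subtask(subtask)  # hypothetical callable that runs one agent step
        state["pending_subtasks"].pop(0)
        state["completed_subtasks"].append(subtask)
        checkpoint_state(state, storage_path, task_id)
    state["status"] = "complete"
    checkpoint_state(state, storage_path, task_id)
    return state
```

A subtask is only moved to `completed_subtasks` after it finishes, so a crash mid-step leaves it at the head of the pending list and the resumed run retries it.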

### Context Window as Working Memory

The context window itself is a form of memory — and the most accessible one. For single-session tasks, the entire history of retrieval, reasoning, and decisions is available throughout the run. This is often sufficient.

The limit is size. As covered in Chapter 4, context grows with each step. Managing context effectively means being intentional about what stays versus what gets summarized.

A useful heuristic: the model needs full-fidelity context for the last 3-5 steps (the working set of the current reasoning chain) and summary context for earlier steps. Implementing this as a sliding window with summarization at the back end handles most cases:

```python
def maintain_context_window(
    messages: list,
    max_tokens: int = 80000,
    preserve_recent_n: int = 6
) -> list:
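    """
    Keep the context window manageable by summarizing old tool results.
    Always preserves the first two messages (the original task and the agent's
    first turn) and the most recent steps. The system prompt is passed
    separately to the API, so it is never part of `messages`.
    """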
    if estimate_tokens(messages) < max_tokens:
        return messages

    # Identify what to compress: everything except the first 2 messages and last N
    if len(messages) <= 2 + preserve_recent_n:
        return messages  # Nothing to compress

    compressible = messages[2:-preserve_recent_n]

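    # Flatten the compressible range to plain text before summarizing. Sending the
    # raw messages would leave tool results without their matching tool calls.
    # render_as_text is a hypothetical helper that serializes messages to a transcript.
    transcript = render_as_text(compressible)
    summary_prompt = [
        {"role": "user", "content": "Summarize what has been investigated so far "
         "and what key facts have been established, in bullet points.\n\n"
         f"<transcript>\n{transcript}\n</transcript>"},
    ]

    summary_response = client.messages.create(
        model="claude-haiku-4-5-20251001",  # Fast + cheap for summarization
        max_tokens=1000,
        messages=summary_prompt,
    )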

    summary = extract_text(summary_response)
    summary_message = {
        "role": "assistant",
        "content": f"[Investigation summary up to this point]\n{summary}"
    }

    return messages[:2] + [summary_message] + messages[-preserve_recent_n:]
```

This keeps context manageable without losing the thread of the investigation.
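
Two details are worth sketching. `estimate_tokens` is not defined above, and a fixed `preserve_recent_n` can place the cut between a tool call and its result, which tool-calling APIs generally reject. The version below is one possible approach, assuming messages are plain dicts whose `content` is either a string or a list of typed blocks. The 4-characters-per-token ratio is a rough heuristic rather than a tokenizer, and `has_block` and `safe_tail_start` are hypothetical helper names.

```python
import json


def estimate_tokens(messages: list) -> int:
    """Rough token estimate: about 4 characters per token (heuristic only)."""
    total_chars = sum(len(json.dumps(m, default=str)) for m in messages)
    return total_chars // 4


def has_block(message: dict, block_type: str) -> bool:
    """True if a message's content is a block list containing the given block type."""
    content = message.get("content")
    if not isinstance(content, list):
        return False
    return any(isinstance(b, dict) and b.get("type") == block_type for b in content)


def safe_tail_start(messages: list, preserve_recent_n: int) -> int:
    """
    Index where the preserved tail should start. Moves the cut earlier if it
    would separate a tool_result message from the tool_use message before it.
    """
    start = max(len(messages) - preserve_recent_n, 0)
    while start > 0 and has_block(messages[start], "tool_result"):
        start -= 1  # pull the matching tool_use message into the tail
    return start
```

With this helper, the tail becomes `messages[safe_tail_start(messages, preserve_recent_n):]`, and the compressible range ends at the same index.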

### Key Takeaways

1. Not everything worth knowing is worth persisting externally. Design persistence around the most expensive computations to reproduce.
2. Working notes — structured facts, open questions, investigated files — are more useful than full transcripts.
3. Cross-session memory requires an external store. A vector store for findings, a document for active state.
4. Long-running tasks need serializable state that survives session boundaries.
5. The context window is the primary working memory. Manage it actively with summarization for long runs.

> **Try This**
> Build a simple WorkingNotes implementation and integrate it into your agent loop. Run an investigation that takes 15+ steps. After the investigation, read the generated notes. Are they sufficient to resume the investigation if the context were cleared? If not, what's missing? Add tools that let the agent explicitly update its notes ("I've confirmed that X is true because I read file Y at line Z") and see how that changes the quality of the notes.

---

## Chapter 7: Evaluation for Agentic Systems

Evaluating single-pass RAG is already harder than it looks. Evaluating agentic systems is harder still, because the thing you care about is not just the output quality but the quality of the process that produced it.

An agent that gets to the right answer via a tortured path — unnecessary retrievals, redundant reads, wrong hypotheses corrected only by accident — is not a good agent even if the output looks correct. Conversely, an agent that follows a tight, well-reasoned investigation path but reaches a wrong conclusion may be revealing a retrieval gap rather than a reasoning failure.

Untangling these requires evaluation at three levels: task success, process quality, and failure mode diagnosis.

### Task-Level Evaluation

Task success is the most obvious dimension. Did the agent answer the question correctly? Did the code it generated actually work? Did the impact analysis it produced match the actual impact of the change?

The challenge is building ground truth for complex, multi-step tasks. For simple lookup tasks, ground truth is easy: there's a correct answer and you can check against it. For complex tasks, ground truth requires a human expert to establish what the correct answer would be, and that's expensive.

A pragmatic approach for code tasks: use the repository itself as ground truth. If the task is "find all the callers of function X," the ground truth is what a call-graph tool returns, or `grep` as a rough first pass (it also matches comments, strings, and unrelated functions that share the name). If the task is "identify functions without error handling," the ground truth is whatever a static analysis tool finds. If the task is "suggest a refactoring to eliminate the dependency between module A and module B," the ground truth is whether the suggested change passes tests when applied.

This gives you objective ground truth for a useful subset of tasks without requiring human expert annotation.

```python
def evaluate_task_success(
    task_type: str,
    agent_output: str,
    ground_truth: dict,
    repo_path: str
) -> dict:
    """Evaluate agent task success against objective ground truth."""

    if task_type == "find_callers":
        # Ground truth: actual callers via static analysis
        actual_callers = set(get_actual_callers(ground_truth["function"], repo_path))
        claimed_callers = set(extract_callers_from_output(agent_output))
        correct = claimed_callers & actual_callers

        # Compare as sets so duplicate mentions don't distort the scores;
        # max(..., 1) avoids division by zero when either set is empty
        precision = len(correct) / max(len(claimed_callers), 1)
        recall = len(correct) / max(len(actual_callers), 1)
        f1 = 2 * precision * recall / (precision + recall) if correct else 0.0

        return {"precision": precision, "recall": recall, "f1": f1}

    elif task_type == "code_suggestion":
        # Apply suggestion to a copy, run tests
        result = apply_and_test(agent_output, repo_path)
        return {
            "applies_cleanly": result["applies"],
            "tests_pass": result["tests_pass"],
            "syntax_valid": result["syntax_valid"]
        }

    elif task_type == "impact_analysis":
        # Compare to actual git diff after making the change
        actual_impact = get_actual_impact(ground_truth["change"], repo_path)
        claimed_impact = extract_impact_from_output(agent_output)
        return compute_overlap_metrics(actual_impact, claimed_impact)

    return {"error": f"Unknown task type: {task_type}"}
```
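
The `get_actual_callers` helper is left undefined above. For a Python-only repository, one way to sketch it is with the standard-library `ast` module. This is an approximation under stated assumptions: it matches calls by name only, so it cannot tell apart two functions that share a name, and a call inside a nested function is attributed to both the inner and the outer function. The `path::function` return format is an assumption made for this example.

```python
import ast
from pathlib import Path


def find_callers_in_source(source: str, function_name: str) -> set[str]:
    """Return names of functions in `source` whose bodies call `function_name`."""
    callers = set()
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for child in ast.walk(node):
            if not isinstance(child, ast.Call):
                continue
            func = child.func
            # Match both plain calls f(...) and attribute calls obj.f(...)
            called = func.id if isinstance(func, ast.Name) else getattr(func, "attr", None)
            if called == function_name:
                callers.add(node.name)
    return callers


def get_actual_callers(function_name: str, repo_path: str) -> set[str]:
    """Scan every .py file under repo_path; returns 'relative/path.py::caller' strings."""
    results = set()
    root = Path(repo_path)
    for path in root.rglob("*.py"):
        try:
            source = path.read_text(encoding="utf-8")
            names = find_callers_in_source(source, function_name)
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that don't parse or decode
        results.update(f"{path.relative_to(root)}::{name}" for name in names)
    return results
```

Whatever format the ground truth uses, `extract_callers_from_output` has to normalize the agent's claims into the same format, or precision and recall will both read as zero.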

### Process-Level Evaluation

Process quality is about whether the agent took a good path to the answer, independent of whether the answer was correct. A good process:

- Retrieves relevant content before making claims about it
- Doesn't repeat the same retrieval twice without reason
- Follows dependency chains to appropriate depth rather than stopping too early or going too deep
- Uses the right tool for each step (search when location is unknown, read when it is)
- Produces a reasoning trace that makes its chain of inference legible

These properties can be measured automatically:

```python
def evaluate_process_quality(trace: list[dict]) -> dict:
    """
    Evaluate the quality of an agent's retrieval and reasoning process.

    trace: List of step records, each with 'tool', 'params', 'result', 'reasoning'
    """
    metrics = {}

    # Redundant retrieval: same query issued more than once
    queries = [s["params"].get("query", "") for s in trace if s["tool"] == "search_codebase"]
    metrics["redundant_query_rate"] = (len(queries) - len(set(queries))) / max(len(queries), 1)

    # Redundant reads: same file read more than once
    reads = [s["params"].get("path", "") for s in trace if s["tool"] == "read_file"]
    metrics["redundant_read_rate"] = (len(reads) - len(set(reads))) / max(len(reads), 1)

    # Tool appropriateness: search used when file path unknown, read when known
    search_steps = [s for s in trace if s["tool"] == "search_codebase"]
    appropriate_searches = sum(
        1 for s in search_steps
        if not has_explicit_file_path(s["reasoning"])
    )
    metrics["search_appropriateness"] = appropriate_searches / max(len(search_steps), 1)

    # Total step count efficiency
    metrics["total_steps"] = len(trace)

    # Claims without reads: assertions about code not read in this session
    metrics["ungrounded_claim_rate"] = count_ungrounded_claims(trace)

    return metrics
```

Track these metrics over time and across task types. Rising redundant retrieval rates usually indicate that the context window is getting too large for the model to remember what it's already retrieved. Rising ungrounded claim rates indicate that the grounding prompt needs reinforcement.
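
The search-appropriateness metric depends on `has_explicit_file_path`, which is not defined above. A minimal regex-based sketch follows. The extension list is an assumption to adapt to your repository, and the pattern is a heuristic: it will also fire on product names that look like filenames, such as "Node.js".

```python
import re

# Matches tokens that look like source file paths, e.g. src/auth/session.py
# or utils.ts. The extension list is an assumption; extend it for your repo.
FILE_PATH_PATTERN = re.compile(
    r"(?<![\w/.-])(?:[\w.-]+/)*[\w-]+\.(?:py|js|ts|tsx|go|rs|java|rb)\b"
)


def has_explicit_file_path(reasoning: str) -> bool:
    """True if the reasoning text already names a concrete file path."""
    return FILE_PATH_PATTERN.search(reasoning or "") is not None
```

A search step whose reasoning already names the file is a candidate for "should have been a read", which is what `search_appropriateness` penalizes.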

> **Key Insight**
> Process metrics are leading indicators. A rising redundant retrieval rate often precedes a decline in task success, so it can flag a problem before the failures are obvious in user-facing metrics. Build process monitoring early.

### Failure Mode Diagnosis

When an agent fails a task, there are three root causes worth distinguishing, because they have different fixes:

**Retrieval failure.** The relevant information wasn't retrieved. Either it wasn't in the index, the query didn't surface it, or the retrieval results didn't contain enough context. Fix: improve chunking, index coverage, or query formulation.

**Reasoning failure.** The relevant information was retrieved but the model drew the wrong conclusion from it. Fix: improve the prompt, provide more structured context, or try a more capable model.

**Task specification failure.** The task as given was ambiguous or under-specified. The agent made a reasonable interpretation, but not the intended one. Fix: improve the task formulation pipeline or add clarification steps.

Attributing failures correctly requires logging enough information to trace back through the process:

```python
from datetime import datetime

def log_agent_run(
    task: str,
    trace: list[dict],
    output: str,
    success_metrics: dict,
    process_metrics: dict,
    run_id: str
):
    """Log a complete agent run for post-hoc analysis."""
    log_entry = {
        "run_id": run_id,
        "timestamp": datetime.utcnow().isoformat(),
        "task": task,
        "output": output,
        "success_metrics": success_metrics,
        "process_metrics": process_metrics,
        "trace": trace,  # Full step-by-step record
        # Reuse the helper that classify_failure_mode uses, so the log and the
        # classifier agree on which files count as retrieved
        "retrieved_files": sorted(get_retrieved_files(trace)),
    }

    # Classify failure mode if task failed
    if not output_looks_successful(output, success_metrics):
        log_entry["failure_mode"] = classify_failure_mode(trace, output, task)

    write_to_log_store(log_entry)

def classify_failure_mode(trace: list, output: str, task: str) -> str:
    """Simple heuristic failure mode classification."""
    retrieved_files = get_retrieved_files(trace)

    # If nothing was retrieved, likely a retrieval failure
    if not retrieved_files:
        return "retrieval_failure"

    # If files were retrieved but output contradicts them, likely reasoning failure
    if retrieved_content_contradicts_output(retrieved_files, output):
        return "reasoning_failure"

    # Otherwise default to spec failure. This bucket also catches runs where
    # the wrong files were retrieved, so treat it as a starting label for review.
    return "task_specification_failure"
```

Build a dashboard that shows the distribution of failure modes over time. If retrieval failures dominate, invest in retrieval infrastructure. If reasoning failures dominate, invest in prompt engineering and model selection. If specification failures dominate, invest in the user-facing task formulation layer.
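
The aggregation behind such a dashboard can be small. A sketch, assuming log entries shaped like the ones `log_agent_run` writes (an ISO `timestamp` and an optional `failure_mode`). The function name and the weekly bucketing are choices made for this example.

```python
from collections import Counter, defaultdict
from datetime import datetime


def failure_mode_distribution(log_entries: list[dict]) -> dict[str, dict[str, float]]:
    """
    Group failed runs by ISO week and return, per week, the share of each
    failure mode. Entries without a 'failure_mode' value are treated as successes.
    """
    counts_by_week: dict[str, Counter] = defaultdict(Counter)
    for entry in log_entries:
        mode = entry.get("failure_mode")
        if not mode:
            continue
        ts = datetime.fromisoformat(entry["timestamp"])
        year, week, _ = ts.isocalendar()
        counts_by_week[f"{year}-W{week:02d}"][mode] += 1

    distribution = {}
    for week_key, counts in sorted(counts_by_week.items()):
        total = sum(counts.values())
        distribution[week_key] = {mode: n / total for mode, n in counts.items()}
    return distribution
```

Shares rather than raw counts make weeks with different traffic comparable. Keep the raw counts nearby, though, since a 100% share of one failure mode means little when the week had a single failed run.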

### Human Evaluation for the Hard Cases

Automated evaluation covers a lot of ground but can't cover everything. Some tasks don't have objective ground truth. Some failures are subtle enough that automated classification gets them wrong.

Build a human evaluation workflow for the hard cases. Not for every run — that doesn't scale — but for a sample, especially for failure cases and for tasks in new domains where the automated metrics haven't been calibrated yet.

The most useful human evaluation question isn't "is this answer correct?" — humans often can't evaluate correctness faster than they could answer the question themselves. The most useful question is "would you trust this answer enough to act on it without double-checking?" That's a judgment about reliability, not correctness, and it's a judgment humans can make quickly.

Rate on a 3-point scale: would not act on this (needs rework), would verify before acting, would act on this directly. The distribution of these ratings over time is a better measure of production utility than any individual correctness metric.
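
Tracking that distribution takes only a few lines. In this sketch the three label strings are hypothetical names for the scale points described above.

```python
from collections import Counter

# Hypothetical labels for the 3-point trust scale
RATINGS = ("would_not_act", "would_verify", "would_act")


def trust_distribution(ratings: list[str]) -> dict[str, float]:
    """Share of each rating on the 3-point trust scale; unknown labels raise."""
    unknown = set(ratings) - set(RATINGS)
    if unknown:
        raise ValueError(f"Unknown ratings: {sorted(unknown)}")
    counts = Counter(ratings)
    total = len(ratings)
    return {label: (counts[label] / total if total else 0.0) for label in RATINGS}
```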

### Regression Testing for Agentic Behavior

Changes to prompts, tools, retrieval configurations, or models can all affect agent behavior in unexpected ways. A prompt change that improves one task type can degrade another. A new tool that helps with complex tasks can confuse the agent on simple ones.

Maintain a regression test suite of representative tasks with known-good outputs. Run it on every significant change. The test cases don't need to be exhaustive — 20-30 representative tasks across your task type distribution gives meaningful signal. What matters is that it runs automatically and produces a clear pass/fail before any change ships.

```python
def run_regression_suite(agent, test_cases: list[dict]) -> dict:
    """
    Run agent against a set of regression test cases.
    Each test case: {task, expected_output_pattern, success_criteria}
    """
    results = []

    for test in test_cases:
        output = agent.run(test["task"])
        success = evaluate_against_criteria(output, test["success_criteria"])

        results.append({
            "task": test["task"],
            "success": success,
            "output_snippet": output[:200],
        })

    # max(..., 1) keeps an empty suite from raising ZeroDivisionError
    pass_rate = sum(1 for r in results if r["success"]) / max(len(results), 1)
    failed = [r for r in results if not r["success"]]

    return {
        "pass_rate": pass_rate,
        "total_cases": len(results),
        "failed_cases": failed,
    }
```
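
The suite relies on `evaluate_against_criteria`, which is not defined above. Because agent output varies from run to run, exact-match comparison is usually too brittle. One simple alternative is to check for required and forbidden content. The criteria keys below (`must_contain`, `must_not_contain`, `must_match`) are hypothetical, chosen for this sketch.

```python
import re


def evaluate_against_criteria(output: str, criteria: dict) -> bool:
    """
    Check agent output against a simple criteria dict. Supported keys
    (all optional, all hypothetical):
      must_contain:     substrings that must all appear
      must_not_contain: substrings that must not appear
      must_match:       regex patterns that must each match somewhere
    """
    text = output or ""
    if any(s not in text for s in criteria.get("must_contain", [])):
        return False
    if any(s in text for s in criteria.get("must_not_contain", [])):
        return False
    if any(re.search(p, text) is None for p in criteria.get("must_match", [])):
        return False
    return True
```

A test case for a caller lookup might then use `{"must_contain": ["refresh_session"], "must_not_contain": ["I could not find"]}`, which passes regardless of how the agent phrases the rest of its answer.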

### Key Takeaways

1. Evaluate at three levels: task success, process quality, and failure mode diagnosis.
2. Use the repository itself as objective ground truth for code tasks — callers, test results, static analysis outputs.
3. Process metrics (redundant retrieval, ungrounded claims) are leading indicators for task success degradation.
4. Attribute failures to the right root cause: retrieval, reasoning, or task specification.
5. Regression test suites catch regressions from prompt and configuration changes before they hit production.

> **Try This**
> Build a 10-task regression suite for your agent. Include at least three task types (lookup, investigation, code suggestion). Run it against your current system. Then make one change — update a tool description, adjust the system prompt, or change the chunking strategy — and run the suite again. Compare pass rates. What improved? What regressed? This is your baseline evaluation loop for ongoing development.

---

## Chapter 8: Production Architecture and Safety

Building a working agent in a development environment and running one in production are different problems. The development environment is forgiving: you can restart when things go wrong, inspect every step, and accept latency that would be unacceptable in production. Production is not forgiving.

The architecture decisions that matter most for production agents are about failure handling, safety constraints, cost management, and operational observability. These aren't optional concerns to address after the agent "works" — they determine whether the agent is deployable at all.

### Request Handling and Latency

Agent runs take time. A 15-step investigation with retrieval, reading, and generation at each step can easily take 60-90 seconds end-to-end. For most use cases, that's acceptable as long as the user gets feedback during the wait.

Streaming the agent's output as it generates is the most effective way to make long runs feel acceptable. Stream the reasoning trace in real time, surface tool calls as they happen, and give the user a sense of progress rather than a spinner.

```python
async def run_agent_streaming(task: str, tools: list, websocket) -> None:
    """Run agent with streaming output to a WebSocket connection."""
    messages = [{"role": "user", "content": task}]

    for step in range(MAX_STEPS):
        async with client.messages.stream(
            model="claude-opus-4-7",
            max_tokens=4096,
            system=AGENT_SYSTEM_PROMPT,
            tools=tools,
            messages=messages,
        ) as stream:
            tool_calls = []
            current_text = []

            async for event in stream:
                if event.type == "content_block_delta":
                    if hasattr(event.delta, "text"):
                        # Stream reasoning text immediately
                        await websocket.send_json({
                            "type": "reasoning",
                            "content": event.delta.text
                        })
                        current_text.append(event.delta.text)

            final_message = await stream.get_final_message()

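            # Stop on anything other than a tool request (end_turn, max_tokens, ...);
            # continuing without tool calls would append an empty tool-result turn
            if final_message.stop_reason != "tool_use":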
                await websocket.send_json({
                    "type": "complete",
                    "content": "".join(current_text)
                })
                return

            # Execute tool calls and signal to user
            for block in final_message.content:
                if block.type == "tool_use":
                    await websocket.send_json({
                        "type": "tool_call",
                        "tool": block.name,
                        "params": block.input
                    })
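                    # execute_tool is synchronous and blocks the event loop while it runs;
                    # for slow tools, consider offloading it (for example with asyncio.to_thread)
                    result = execute_tool(block.name, block.input)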
                    await websocket.send_json({
                        "type": "tool_result",
                        "tool": block.name,
                        "result_summary": summarize_result(result)
                    })
                    tool_calls.append({"block": block, "result": result})

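            # Add to message history and continue
            messages.append({"role": "assistant", "content": final_message.content})
            messages.append({"role": "user", "content": format_tool_results(tool_calls)})

    # Loop ended without a final answer: tell the client instead of going silent
    await websocket.send_json({"type": "error", "content": "Step limit reached."})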
```

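Streaming also gives you the raw material for an audit trail: every step is emitted as a discrete event, so persisting the same events you send to the client yields a step-by-step record that is valuable both for user trust and for debugging production issues.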

### Safety at the Tool Layer

Any agent that can take actions needs safety constraints. The level of constraint depends on what actions are possible:

**Read-only agents** (retrieval and analysis only) have a small blast radius. The main risks are cost (runaway loops) and information exposure (returning sensitive code to unauthorized users). Both are manageable with standard controls.

**Write-capable agents** (code generation, file modification, branch creation) have a larger blast radius. Bad code that gets applied to a codebase can require significant effort to untangle. These need explicit approval gates before any write action executes.

**Execution-capable agents** (running tests, deploying services, making API calls) have the largest blast radius. Mistakes here can have consequences outside the development environment. These need the strictest controls and the most careful architecture.

For code-aware agents, a common pattern is to separate the investigation phase from the action phase. The agent runs through its analysis in read-only mode, produces a specific, reviewable plan, and then waits for human approval before executing any write operations.

```python
class SafeCodeAgent:
    """An agent that separates investigation from action, requiring approval for writes."""

    def __init__(self, codebase_path: str, approval_required: bool = True):
        self.codebase_path = codebase_path
        self.approval_required = approval_required
        self.proposed_changes = []

    def investigate(self, task: str) -> str:
        """Run investigation phase — read-only."""
        tools = [SEARCH_TOOL, READ_TOOL, NAVIGATE_TOOL]  # No write tools
        return run_agent(task, tools)

    def plan(self, investigation_result: str) -> list[dict]:
        """Generate a specific, reviewable action plan."""
        plan_task = f"""
        Based on this investigation:
        {investigation_result}

        Produce a specific list of file changes needed, in this exact format:
        FILE: path/to/file.py
        ACTION: modify | create | delete
        CHANGE: exact description of what to change
        ---
        """
        plan_output = run_agent(plan_task, [READ_TOOL])
        self.proposed_changes = parse_action_plan(plan_output)
        return self.proposed_changes

    def execute(self, approved_changes: list[dict]) -> list[str]:
        """Execute only the explicitly approved changes."""
        results = []
        for change in approved_changes:
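            if change not in self.proposed_changes:
                # PermissionError is a built-in; only changes from the reviewed plan may run
                raise PermissionError("Attempting to execute a change that was not in the proposed plan")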
            result = apply_change(change, self.codebase_path)
            results.append(result)
        return results
```

This pattern makes the agent's actions auditable and gated: the proposed changes can be reviewed before any of them are applied, and anything that looks wrong can be rejected before it touches the codebase.

> **Warning**
> Don't trust the model's self-certification of safety. A model that says "this change is safe" is expressing a probabilistic judgment, not running a proof. Verify write operations structurally (does the generated code parse? do tests pass?) rather than relying on the model's assessment of its own output.
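
A minimal sketch of those structural checks for Python code. `check_python_syntax` parses without executing anything. `run_test_suite` assumes the project uses pytest and that `repo_copy` is a scratch copy of the repository, so a bad change never runs against the real working tree. Both function names are hypothetical.

```python
import ast
import subprocess
import sys


def check_python_syntax(source: str, filename: str = "<generated>") -> dict:
    """Parse generated Python without executing it; report the first syntax error."""
    try:
        ast.parse(source, filename=filename)
    except SyntaxError as exc:
        return {"syntax_valid": False, "error": f"{exc.msg} (line {exc.lineno})"}
    return {"syntax_valid": True, "error": None}


def run_test_suite(repo_copy: str, timeout_seconds: int = 600) -> bool:
    """Run pytest in a scratch copy of the repo; True only on a clean exit."""
    try:
        proc = subprocess.run(
            [sys.executable, "-m", "pytest", "-q"],
            cwd=repo_copy,
            capture_output=True,
            text=True,
            timeout=timeout_seconds,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung test run counts as a failure
    return proc.returncode == 0
```

Run the syntax check first: it is cheap, needs no sandbox, and filters out changes that would only waste a test run.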

### Cost Control

Token costs for agentic systems are significantly higher than for single-pass RAG. A 20-step agent run might consume 50,000-100,000 tokens per task, versus 5,000-10,000 for a single-pass RAG response. At scale, this adds up fast.

Three cost control mechanisms worth implementing:

**Per-run token budgets.** Set a hard limit on tokens per run. When the budget is approached, the agent is instructed to synthesize what it has and produce its best answer from current context rather than continuing to retrieve.

**Model tiering.** Use a less expensive model for straightforward steps (simple reads, summarization, planning) and reserve the most capable model for complex reasoning steps. The reasoning trace generation, in particular, can often be done at lower cost.

**Caching.** Tool results for the same inputs should be cached within a run (and potentially across runs for stable data). Reading the same file twice in one agent run is wasteful; returning the cached result on the second read is free.

```python
import json
from typing import Any

class CostAwareAgent:
    def __init__(self, token_budget: int = 100_000):
        self.token_budget = token_budget
        self.tokens_used = 0
        self.tool_cache = {}

    def execute_tool_cached(self, name: str, params: dict) -> Any:
        """Execute tool with caching and cost tracking."""
        cache_key = f"{name}:{json.dumps(params, sort_keys=True)}"

        if cache_key in self.tool_cache:
            return self.tool_cache[cache_key]  # Free cache hit

        result = execute_tool(name, params)
        self.tool_cache[cache_key] = result
        return result

    def check_budget(self, response) -> bool:
        """Check if we're approaching the token budget."""
        self.tokens_used += response.usage.input_tokens + response.usage.output_tokens
        return self.tokens_used < self.token_budget * 0.85  # 85% threshold

    def run(self, task: str, tools: list) -> str:
        messages = [{"role": "user", "content": task}]

        for step in range(MAX_STEPS):
            response = client.messages.create(
                model="claude-opus-4-7",
                messages=messages,
                tools=tools,
                system=AGENT_SYSTEM_PROMPT,
                max_tokens=4096,
            )

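            if not self.check_budget(response):
                # Budget nearly exhausted: force completion
                if response.stop_reason == "end_turn":
                    return extract_text(response)
                messages.append({"role": "assistant", "content": response.content})
                # Every pending tool_use block needs a matching tool_result,
                # so answer each one with a "skipped" result before the wrap-up text
                wrap_up = [
                    format_tool_result(block.id, "Skipped: token budget nearly exhausted.")
                    for block in response.content
                    if block.type == "tool_use"
                ]
                wrap_up.append({
                    "type": "text",
                    "text": "Token budget nearly exhausted. Synthesize your findings "
                            "and provide your best answer based on what you've gathered.",
                })
                messages.append({"role": "user", "content": wrap_up})
                final = client.messages.create(
                    model="claude-opus-4-7",
                    messages=messages,
                    tools=tools,  # the history contains tool blocks, so keep tools defined
                    system=AGENT_SYSTEM_PROMPT,
                    max_tokens=2000,
                )
                return extract_text(final)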

            if response.stop_reason == "end_turn":
                return extract_text(response)

            # Execute tools with caching
            tool_results = []
            for block in response.content:
                if block.type == "tool_use":
                    result = self.execute_tool_cached(block.name, block.input)
                    tool_results.append(format_tool_result(block.id, result))

            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})

        return "Step limit reached."
```
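
The class above covers budgets and caching but not model tiering. A minimal routing sketch follows. The step categories are hypothetical labels that the orchestrating code would assign to each step, and the model IDs are the ones already used in this guide's examples.

```python
# Model IDs as used elsewhere in this guide; the step kinds are hypothetical labels
CHEAP_MODEL = "claude-haiku-4-5-20251001"
CAPABLE_MODEL = "claude-opus-4-7"

# Routine steps that rarely need the most capable model
CHEAP_STEP_KINDS = {"summarize", "read", "plan"}


def pick_model(step_kind: str) -> str:
    """Route routine steps to the cheaper model; default to the capable one."""
    return CHEAP_MODEL if step_kind in CHEAP_STEP_KINDS else CAPABLE_MODEL
```

Defaulting unknown step kinds to the capable model is deliberate: a misrouted hard step costs more in wrong answers than a misrouted easy step costs in tokens.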

### Observability

An agent system without observability is a black box that occasionally produces results. You need to be able to answer these questions in production:

- How long did each run take?
- How many tokens did each run consume?
- What was the step-by-step trace for any given run?
- What retrieval queries were issued?
- What files were most frequently accessed?
- What's the distribution of failure modes?

Instrument everything:

```python
import time
from dataclasses import dataclass, field

@dataclass
class RunMetrics:
    run_id: str
    task_hash: str
    start_time: float = field(default_factory=time.time)
    end_time: float = 0.0
    total_tokens: int = 0
    input_tokens: int = 0
    output_tokens: int = 0
    step_count: int = 0
    tool_call_counts: dict = field(default_factory=dict)
    retrieved_files: list = field(default_factory=list)
    success: bool = False
    failure_mode: str = ""

    @property
    def duration_seconds(self) -> float:
        return self.end_time - self.start_time

    def record_tool_call(self, tool_name: str, file_path: str = ""):
        self.tool_call_counts[tool_name] = self.tool_call_counts.get(tool_name, 0) + 1
        if file_path:
            self.retrieved_files.append(file_path)

    def finalize(self, success: bool, failure_mode: str = ""):
        self.end_time = time.time()
        self.success = success
        self.failure_mode = failure_mode
```

Aggregate these metrics over time and surface them in a dashboard. The patterns they reveal — hot files, expensive task types, common failure modes — directly inform where to invest in system improvement.
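
As a rough sketch of that aggregation, assuming each run has been serialized to a dict with the same field names as `RunMetrics` (the `summarize_runs` name and the choice of medians are illustrative):

```python
from collections import Counter
from statistics import median


def summarize_runs(runs: list[dict], top_n: int = 5) -> dict:
    """
    Aggregate per-run metric dicts (same fields as RunMetrics) into
    dashboard-level numbers. Returns zeros for an empty input.
    """
    if not runs:
        return {"run_count": 0, "success_rate": 0.0, "median_duration_s": 0.0,
                "median_tokens": 0, "hot_files": [], "failure_modes": {}}

    durations = [r["end_time"] - r["start_time"] for r in runs]
    # Count how often each file appears across all runs to find hot files
    file_counts = Counter(f for r in runs for f in r.get("retrieved_files", []))
    failure_counts = Counter(r["failure_mode"] for r in runs if r.get("failure_mode"))

    return {
        "run_count": len(runs),
        "success_rate": sum(1 for r in runs if r.get("success")) / len(runs),
        "median_duration_s": median(durations),
        "median_tokens": median(r.get("total_tokens", 0) for r in runs),
        "hot_files": file_counts.most_common(top_n),
        "failure_modes": dict(failure_counts),
    }
```

Medians are used instead of means because a handful of runaway runs would otherwise dominate both the duration and the token figures.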

### Deployment Architecture

A production code-aware agent system has several components beyond the model and retrieval:

```
                    ┌─────────────────────────────────────────────┐
                    │              API Gateway / Auth              │
                    └─────────────────────┬───────────────────────┘
                                          │
                    ┌─────────────────────▼───────────────────────┐
                    │              Task Queue (Redis/SQS)          │
                    │         Rate limiting, priority, dedup       │
                    └─────────────────────┬───────────────────────┘
                                          │
              ┌───────────────────────────▼───────────────────────┐
              │                   Agent Workers                    │
              │    ┌──────────┐   ┌──────────┐   ┌──────────┐    │
              │    │ Worker 1 │   │ Worker 2 │   │ Worker 3 │    │
              │    └────┬─────┘   └────┬─────┘   └────┬─────┘    │
              └─────────┼──────────────┼──────────────┼──────────┘
                        │              │              │
              ┌──────────▼──────────────▼──────────────▼──────────┐
              │                   Tool Layer                       │
              │  ┌───────────┐  ┌──────────┐  ┌───────────────┐  │
              │  │  Vector   │  │  File    │  │  Call Graph   │  │
              │  │  Store    │  │  Server  │  │  Service      │  │
              │  └───────────┘  └──────────┘  └───────────────┘  │
              └────────────────────────────────────────────────────┘
                        │
              ┌──────────▼────────────────────────────────────────┐
              │                  Data Layer                        │
              │  ┌───────────┐  ┌──────────┐  ┌───────────────┐  │
              │  │ Chroma /  │  │  Git     │  │  Run Logs     │  │
              │  │ Qdrant    │  │  Repo    │  │  & Metrics    │  │
              │  └───────────┘  └──────────┘  └───────────────┘  │
              └────────────────────────────────────────────────────┘
```

*Figure 1: Production agent system architecture. The task queue provides backpressure and rate limiting; multiple workers process concurrently; the tool layer is shared and independently scalable.*

Key deployment decisions:

**Worker concurrency.** Agent runs are CPU-light on the worker side (mostly waiting for API and tool responses) but memory-heavy (each run carries a growing context). Tune worker count based on memory available, not CPU.

**Tool layer isolation.** The vector store, file server, and call graph service should be independent services, not libraries embedded in the worker. This allows independent scaling and prevents a slow retrieval query from blocking other workers.

**Git integration.** The file server should serve from the actual git repository, not a copy. This keeps reads fresh and removes the need to synchronize a second copy. Use git commands rather than filesystem reads for line-range reads and file existence checks, so that behavior is correct for any branch or commit reference, not only the checked-out working tree.
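
A minimal sketch of git-backed reads follows. It shells out to `git cat-file -e` and `git show` using the `<ref>:<path>` revision syntax. The function names and the 1-indexed inclusive line convention are assumptions made for illustration, and a production file server would also want timeouts and binary-file handling.

```python
import subprocess


def file_exists(repo_dir: str, ref: str, path: str) -> bool:
    """Return True if `path` exists at `ref`, without touching the working tree."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "cat-file", "-e", f"{ref}:{path}"],
        capture_output=True,
    )
    return result.returncode == 0


def read_lines(repo_dir: str, ref: str, path: str, start: int, end: int) -> str:
    """Return lines start..end (1-indexed, inclusive) of `path` as of `ref`."""
    result = subprocess.run(
        ["git", "-C", repo_dir, "show", f"{ref}:{path}"],
        capture_output=True,
        text=True,
        check=True,  # raises CalledProcessError if the ref or path is missing
    )
    lines = result.stdout.splitlines()
    return "\n".join(lines[start - 1:end])
```

Because both helpers take an explicit `ref`, the same tool call can read from `HEAD`, a feature branch, or a pinned commit hash without checking anything out.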

**Graceful degradation.** When the vector store is slow, fall back to keyword search. When the call graph service is unavailable, fall back to text-only navigation. Design every tool to have a degraded mode that still returns something useful.
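
One way to express a degraded mode is a thin wrapper around the primary tool, sketched below. The `search_with_fallback` helper, the `degraded` flag, and the two stand-in search functions are hypothetical. The sketch also assumes the primary client enforces its own timeout and raises `TimeoutError` or `ConnectionError` when it fails.

```python
from typing import Callable

SearchFn = Callable[[str], list[str]]


def search_with_fallback(query: str, primary: SearchFn, fallback: SearchFn) -> dict:
    """Try the primary search; on timeout or connection failure use the fallback.

    The returned dict flags degraded results so the agent (and your metrics)
    can tell which path produced them.
    """
    try:
        return {"results": primary(query), "degraded": False}
    except (TimeoutError, ConnectionError) as exc:
        return {
            "results": fallback(query),
            "degraded": True,
            "reason": type(exc).__name__,
        }


# Stand-ins for a vector store client and a keyword index (illustrative only).
def slow_vector_search(query: str) -> list[str]:
    raise TimeoutError("vector store did not respond in time")


def keyword_search(query: str) -> list[str]:
    return ["auth/middleware.py"]


outcome = search_with_fallback("validate token", slow_vector_search, keyword_search)
print(outcome["degraded"], outcome["results"])  # → True ['auth/middleware.py']
```

Surfacing the `degraded` flag in the tool result lets the agent weigh the evidence accordingly, and lets you count degraded calls in your run metrics.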

### Key Takeaways

1. Stream agent output in production — it makes long runs feel acceptable and creates a natural audit trail.
2. Safety constraints belong at the tool layer, not just at the step limit. Write operations need approval gates.
3. Token costs for agentic systems are significantly higher than for single-pass RAG. Budget, tier, and cache.
4. Instrument everything from the start. Observability is not optional for production systems.
5. The tool layer should be independent services, not embedded libraries. Independent scaling matters at load.

> **Try This**
> Instrument your current agent system with per-run metrics. Let it run for a week on real tasks. Review the resulting data: what's the distribution of step counts? What tools are called most frequently? What's the token cost per task? What percentage of runs succeed? These numbers are your baseline for every future improvement. Without them, you're flying blind.

---

## Conclusion

The argument this book has made is not complicated: single-pass RAG has a ceiling, and that ceiling is lower than where most interesting problems live. Agentic pipelines with iterative retrieval raise that ceiling substantially — not infinitely, not without new problems, but enough to handle the class of tasks that actually matter for engineering work.

The path from RAG to agents is incremental. The retrieval infrastructure you have already built is the foundation. Adding an agent loop on top of it is the next step. Hardening that loop for production — grounding, memory, evaluation, safety — is the work that makes it deployable.

What makes code specifically interesting as the domain for this work is that code is already structured as a graph. Functions call functions. Modules depend on modules. Dependencies are explicit, navigable, and indexable. That structure gives an agentic retrieval system something to follow — a topology that makes multi-step investigation tractable in ways it isn't for flat document corpora.

The systems described in this book are not hypothetical. The patterns are in production in various forms across the industry. The tooling to build them — vector stores, call graph analysis, language model APIs — is mature enough to use. What's less mature is the understanding of how to combine these components well, how to evaluate whether the system is actually working, and how to operate it safely at scale.

That's where the real work is. Not in the implementation of any individual component, but in the integration: knowing which retrieval strategy fits which task type, understanding when the agent is spinning versus making progress, catching grounding failures before they surface in user-facing errors, and maintaining evaluation pressure as the system evolves.

Build small first. A one-tool agent against a small codebase is the right starting point. Get the evaluation loop working early — you'll need it before you need sophisticated multi-step retrieval. Add complexity incrementally as you demonstrate that the simpler system handles its task class well.

The goal is not an agent that can do anything. It's an agent that can reliably do the specific things your users need, with predictable cost and behavior, that fails gracefully when it can't succeed. That's a harder bar than "demo works." It's the bar that matters.

---

## Appendix A: Glossary

**Agent loop** — The iterative structure of perceive-decide-act that defines an agentic system. The model generates, tool calls are executed, results are fed back into context, and the model generates again until the task is complete or the step limit fires.

**BM25** — Best Matching 25. A sparse retrieval algorithm based on term frequency and inverse document frequency, with document-length normalization. Effective for exact term matching; used alongside dense embeddings in hybrid search.

**Call graph** — A directed graph representing function call relationships in a codebase. Edges point from caller to callee. Enables navigation across dependency chains without guessing.

**Chunk** — A unit of text (or code) that is indexed and retrieved as a single item. In code retrieval, ideally corresponds to a syntactic unit: function, class, or method.

**Dense retrieval** — Retrieval based on vector embeddings in a high-dimensional semantic space. Effective for semantic similarity; less effective for exact term matching.

**Fully-qualified name (FQN)** — A complete, unambiguous identifier for a symbol, including all enclosing namespaces. Example: `auth.middleware.AuthMiddleware.validate_token`.

**Grounding** — The practice of anchoring model claims to specific, retrieved evidence rather than training priors. A grounded agent doesn't claim what it hasn't read.

**Hybrid search** — A retrieval strategy that combines dense and sparse retrieval, typically fused with Reciprocal Rank Fusion. Generally outperforms either alone for code retrieval.

**Impact analysis** — Determining what would be affected by a change to a specific component. In code, typically involves walking the call graph in both directions: upstream to callers (what calls this) and downstream to callees (what this calls).

**Index staleness** — The condition where the search index no longer reflects the current state of the codebase due to changes that occurred after the last index update.

**RAG (Retrieval-Augmented Generation)** — An architecture that retrieves relevant documents or code chunks and provides them as context for language model generation.

**ReAct** — A prompting strategy (Reasoning + Acting) that interleaves natural language reasoning with tool use. The reasoning trace functions as working memory for multi-step tasks.

**Reciprocal Rank Fusion (RRF)** — An algorithm for merging ranked lists from multiple retrieval sources. For each document, RRF score is the sum of `1 / (k + rank)` across all lists, where `k` is a constant (typically 60).
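
As a worked illustration of the RRF formula, the sketch below fuses two small ranked lists. The file names and the `reciprocal_rank_fusion` helper are made up for the example, and ranks are taken to be 1-based.

```python
from collections import defaultdict


def reciprocal_rank_fusion(
    ranked_lists: list[list[str]], k: int = 60
) -> list[tuple[str, float]]:
    """Fuse ranked lists of document IDs; ranks are 1-based."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for stable output.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["auth.py", "session.py", "tokens.py"]   # e.g. from vector search
sparse = ["tokens.py", "auth.py", "config.py"]   # e.g. from BM25
fused = reciprocal_rank_fusion([dense, sparse])
print([doc_id for doc_id, _ in fused])
# → ['auth.py', 'tokens.py', 'session.py', 'config.py']
```

Documents that appear in both lists (`auth.py`, `tokens.py`) rise above those that appear in only one, even when their individual ranks are not the best.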

**Single-pass RAG** — A RAG architecture where retrieval happens once, at the start of a request. Adequate for lookup tasks; insufficient for multi-step reasoning tasks.

**Sparse retrieval** — Retrieval based on term matching, such as BM25. Effective for exact term matching; less effective for semantic similarity.

**Step limit** — A hard cap on the number of agent loop iterations per run. Prevents runaway loops; should be tuned to the expected step count for the task distribution.

**Syntactic chunking** — Splitting code at language-defined boundaries (function definitions, class bodies) rather than at fixed character or token counts.

**Token budget** — A hard limit on total token consumption for an agent run. Used for cost control and as a forcing function for synthesis when context becomes large.

**Tool** — A function that the agent can call during its loop. Defined by a name, parameter schema, and description. Executed by external code; results are returned to the model.

**Working notes** — A structured document maintained across agent steps recording established facts, open questions, and investigated files. Functions as working memory across context compression events.

---

## Appendix B: Tools & Resources

### Vector Stores
**ChromaDB** — Embedded vector database, suitable for development and small production deployments. Python-native, easy to set up, no external server required for the default configuration.

**Qdrant** — Production-ready vector database with filtering, hybrid search support, and a clean REST API. Suitable for larger codebases and multi-tenant deployments.

**Weaviate** — Vector database with built-in hybrid search, module system, and GraphQL query interface. More complex to set up but flexible for custom retrieval pipelines.

**pgvector** — Vector similarity extension for PostgreSQL. Useful when you're already running Postgres and don't want to introduce another service.

### Embedding Models
**text-embedding-3-small / text-embedding-3-large (OpenAI)** — General-purpose text embeddings. text-embedding-3-small is fast and cost-effective; large is higher quality for complex semantic matching.

**voyage-code-2 (Voyage AI)** — Code-specific embedding model. Outperforms general-purpose models on code retrieval tasks, particularly for cross-language queries.

**nomic-embed-text** — Open-source embedding model with strong performance on retrieval benchmarks. Suitable for on-premises deployments where API costs are a concern.

### Retrieval Infrastructure
**rank-bm25 (Python)** — Simple, effective BM25 implementation for Python. Good for hybrid search when you need the sparse component and want to avoid additional services.

**Elasticsearch / OpenSearch** — Full-featured search engines with strong BM25 implementations, filter support, and production-grade operational tooling.

### Code Analysis
**tree-sitter** — Language-aware parsing library with support for most major programming languages. Better than Python's `ast` module for multi-language codebases.

**Sourcegraph** — Code search and navigation platform. Provides a hosted option and an API; relevant for organizations already running it.

**Semgrep** — Static analysis tool that doubles as a code search engine. Pattern-based queries can supplement semantic search for structural code patterns.

### Agent Frameworks
**Anthropic Claude API** — Direct API access for building custom agent loops. Most control; no opinionated abstractions.

**LangGraph** — Graph-based agent orchestration from LangChain. Useful for complex agent topologies; adds abstraction overhead.

**LlamaIndex** — RAG-first framework with agent support. Good retrieval infrastructure built in; well-suited for document-heavy use cases.

### Evaluation
**RAGAS** — Evaluation framework for RAG systems with metrics for faithfulness, relevance, and answer quality.

**Phoenix (Arize)** — Observability platform for LLM applications. Tracing, evaluation, and monitoring in one.

**LangSmith** — Tracing and evaluation platform from LangChain. Well-integrated with LangChain and LangGraph.

---

## Appendix C: Further Reading

### Retrieval-Augmented Generation
**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"** — Lewis et al., 2020. The original RAG paper. Establishes the basic architecture and motivates the retrieval-augmented approach.

**"Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection"** — Asai et al., 2023. Introduces the idea of retrieval as a conditional action rather than a fixed step — directly relevant to agentic retrieval design.

**"Dense Passage Retrieval for Open-Domain Question Answering"** — Karpukhin et al., 2020. The foundational paper on dense retrieval. Establishes the bi-encoder architecture that underlies most vector search systems.

### Agents and Tool Use
**"ReAct: Synergizing Reasoning and Acting in Language Models"** — Yao et al., 2022. Introduces the ReAct prompting pattern. Required reading for anyone building agentic systems.

**"Toolformer: Language Models Can Teach Themselves to Use Tools"** — Schick et al., 2023. Shows that language models can learn to use tools through self-supervised training. Context for understanding why tool use works so well in modern models.

**"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"** — Wei et al., 2022. The foundational paper on chain-of-thought. Directly relevant to the reasoning trace patterns used in agentic systems.

### Code Intelligence
**"CodeBERT: A Pre-Trained Model for Programming and Natural Languages"** — Feng et al., 2020. Establishes the effectiveness of pre-training on code-natural language pairs. Background for understanding why code-specific embedding models outperform general-purpose ones.

**"Evaluating Large Language Models Trained on Code"** — Chen et al., 2021 (Codex paper). Comprehensive evaluation of code language models. The HumanEval benchmark it introduces is still used.

**"RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation"** — Zhang et al., 2023. Directly relevant — iterative retrieval for code completion is the same architectural pattern as the multi-step retrieval covered in this book.

### Evaluation
**"RAGAS: Automated Evaluation of Retrieval Augmented Generation"** — Es et al., 2023. Introduces automated evaluation metrics for RAG systems. A practical starting point for building your own evaluation pipeline.

**"Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena"** — Zheng et al., 2023. Evaluating language model outputs with other language models. Relevant for cases where objective ground truth isn't available.

### Production Systems
**"Building LLM Applications for Production"** — Chip Huyen, 2023 (blog post). Comprehensive treatment of production considerations for LLM systems: latency, cost, reliability, and safety.

**"Patterns for Building LLM-based Systems & Products"** — Eugene Yan, 2023 (blog post). Practical patterns from an engineer who has built these systems. Strong on the gap between research patterns and production requirements.

---

*From RAG to Agents: Code-Aware Agentic Pipelines*
*By David Kelly Price — Version 1.0, April 2026*
*© 2026 Pyckle. All rights reserved.*



