# The Complete Guide to AI-Assisted Development with Pyckle

*Version 1.0 — April 2026*

---

The promise of AI pair programming has always been seductive: an infinitely patient collaborator who knows your entire codebase, never forgets context, and can surface relevant code instantly. The reality has been different. Every developer who has used Copilot, ChatGPT, or Claude Code for serious work knows the pattern: the AI suggests code that looks plausible but references functions that do not exist, imports modules with the wrong names, or hallucinates APIs based on outdated training data.

The root cause is not model capability — it is context. Large language models are stateless by design. Each request starts from zero. Without access to your actual code, the model has no choice but to guess. Some tools attempt to solve this by dumping files into the context window, but this approach trades one problem for another: token waste, cost explosion, and the model drowning in irrelevant context.

Pyckle solves the context problem with a two-layer architecture. The first layer is Pyckle MCP, a local semantic search engine that indexes your codebase and exposes it via the Model Context Protocol. The second layer is Pyckle Agent, an intelligent router that dynamically retrieves relevant context before each model call. Together, they give AI tools precise, relevant, cost-efficient access to your code — without requiring you to manually paste files or hope the model remembers previous conversations.

This guide walks through both layers in detail, from fundamental concepts to advanced configuration.

---

## Chapter 1: The Context Problem

### The Search-Hope-Hallucinate Cycle

Watch any developer use an AI coding assistant on a real codebase and you will see the same loop repeat:

1. Developer asks a question that requires codebase context
2. AI produces a plausible-sounding answer
3. Answer references code that does not exist, or exists differently than described
4. Developer manually finds the correct code and pastes it
5. AI finally produces useful output

This is the search-hope-hallucinate cycle. The model cannot search your codebase, so it hopes its training data contains something similar. When that hope fails — which is most of the time for proprietary code — it hallucinates. The developer becomes a context shuttle, manually ferrying code between their editor and the AI.

### The Token Window Is Not a Solution

The naive fix is to give the model more context. Dump the relevant files into the prompt. Some tools automate this by including every file in the current directory, or every file that matches a glob pattern.

This approach has three problems:

**Cost scales linearly.** Claude 3.5 Sonnet charges $3 per million input tokens. A medium-sized codebase — 50,000 lines of TypeScript — is roughly 500,000 tokens. At $3/M, that is $1.50 per query just for input tokens. A team of five developers making 100 queries per day would spend $750/month on input tokens alone, before counting output tokens.

**Relevance is inverse to volume.** When you dump 500K tokens of code, maybe 5K tokens are relevant to the current question. The model must sift through 100x more noise than signal. Attention mechanisms dilute across irrelevant content. The model's effective reasoning capacity drops.

**Context windows have hard limits.** Claude's 200K window sounds large until you need to fit the codebase, the conversation history, the system prompt, and leave room for output. In practice, you cannot dump more than 50-100K tokens of code before crowding out everything else.

### The Cost of Context Waste

Consider a concrete example. A developer using Claude Code makes 80 queries per day. Without smart retrieval, each query includes 50K tokens of "just in case" context — every file in the current directory, recent git diffs, the entire README.

| Metric | Calculation | Monthly Cost |
|--------|-------------|--------------|
| Input tokens per query | 50,000 | — |
| Queries per day | 80 | — |
| Days per month | 22 | — |
| Total input tokens | 88M | — |
| Cost at $3/M | — | $264 |

Now consider the same developer with Pyckle, which retrieves only the 5K tokens actually relevant to each query:

| Metric | Calculation | Monthly Cost |
|--------|-------------|--------------|
| Input tokens per query | 5,000 | — |
| Queries per day | 80 | — |
| Days per month | 22 | — |
| Total input tokens | 8.8M | — |
| Cost at $3/M | — | $26.40 |

The difference is $237.60 per developer per month — or $14,256 per year for a team of five. And that is before counting the productivity gains from faster responses and more accurate outputs.

### The Two-Layer Solution

Pyckle addresses context with two complementary layers:

**Pyckle MCP (retrieval layer):** A local server that indexes your codebase into a vector database, exposes semantic search via the Model Context Protocol, and tracks your session activity to maintain "warm" context. It runs on your machine, never sends code to external servers, and integrates directly with Claude Code, Cursor, and Windsurf.

**Pyckle Agent (routing layer):** A lightweight API server that wraps language model calls with automatic context retrieval. Before sending your prompt to the model, the agent queries Pyckle MCP, injects relevant code snippets, and routes to the appropriate model tier based on query complexity. It handles the orchestration so you do not have to manually invoke search tools.

The rest of this guide explains how to use both layers effectively.

---

## Chapter 2: Semantic Search Fundamentals

### How Embedding Models Work for Code

An embedding model converts text into a fixed-length vector of floating-point numbers — typically 384 to 1536 dimensions. These vectors live in a high-dimensional space where semantic similarity corresponds to geometric proximity. Two code snippets that do similar things will have vectors that point in roughly the same direction, even if they use different variable names or syntax.

The conversion process is straightforward: the model tokenizes the input, runs it through a transformer encoder, and pools the final hidden states into a single vector. This vector captures the "meaning" of the input in a way that supports mathematical operations like cosine similarity.

```python
from pyckle import embed

# Embed a code snippet
vector = embed("def authenticate_user(username, password):")
# vector is a list of 384 floats, e.g., [0.023, -0.118, 0.042, ...]

# Embed a natural language query
query_vector = embed("user login function")

# Cosine similarity tells us how related they are
similarity = cosine_similarity(vector, query_vector)  # e.g., 0.847
```

### Why General-Purpose Embeddings Fail at Code

Models like OpenAI's `text-embedding-3-small` or Cohere's `embed-english-v3.0` are trained primarily on natural language text: Wikipedia articles, web pages, books. They understand that "authentication" and "login" are related concepts, but they struggle with code-specific patterns.

Consider this function:

```python
def proc_usr_creds(u, p):
    h = hashlib.sha256(p.encode()).hexdigest()
    return db.users.find_one({"username": u, "pwd_hash": h})
```

A general-purpose embedding model sees `proc`, `usr`, `creds`, `u`, `p`, `h` — tokens with no semantic weight in its English vocabulary. It does not know that `proc_usr_creds` means "process user credentials" or that this function implements authentication. A developer searching for "user authentication" would not find this code.

Code-native embedding models are trained on code-to-code retrieval tasks. They learn that abbreviated variable names follow conventions (`u` for user, `p` for password, `h` for hash), that function names are compressed descriptions of behavior, and that import statements signal domain context.

### PyckLM: Code-Native Embeddings

PyckLM is Pyckle's embedding model, fine-tuned specifically for code retrieval. The training process uses triplets of (query, positive example, negative example) drawn from real codebases:

- **Query:** Natural language description or related code snippet
- **Positive:** Code that correctly answers the query
- **Negative:** Code that is syntactically similar but semantically different

Through contrastive learning, the model learns to place queries and their positive examples close together in vector space while pushing negative examples away.

Benchmarks on internal code retrieval datasets show PyckLM achieving 91% recall@10, compared to 67% for general-purpose models on the same tasks.

### Hybrid Search: Semantic + BM25 Fusion

Semantic search alone has a blind spot: exact matches. If you search for `processUserCredentials` and a function is named exactly that, semantic search might rank it below a function called `authenticateUser` because the embeddings are trained to favor semantic similarity over lexical identity.

BM25 is a classical keyword-matching algorithm that excels at exact and partial matches. It uses term frequency and inverse document frequency to score documents by how well they match query terms.

Pyckle combines both approaches using Reciprocal Rank Fusion (RRF):

```
RRF_score(d) = sum(1 / (k + rank_semantic(d)), 1 / (k + rank_bm25(d)))
```

Where `k` is a constant (typically 60) that controls how much to favor top-ranked results. A document that ranks #1 in both lists gets a higher combined score than a document that ranks #1 in one list and #100 in the other.

In practice, hybrid search catches cases that either approach alone would miss:
- Semantic handles "find the auth middleware" even if the code never uses the word "auth"
- BM25 handles "find AuthenticationMiddleware" when the class is named exactly that

### Threshold Calibration

Not all codebases are alike. A codebase with highly descriptive function names will produce tighter similarity clusters than one with terse, abbreviated names. A fixed similarity threshold (e.g., "return results above 0.7") would over-filter the terse codebase and under-filter the descriptive one.

Pyckle's calibration system measures the similarity distribution in your specific codebase during indexing. It identifies natural breakpoints in the distribution and sets thresholds that maximize precision without sacrificing recall. This calibration is stored alongside the index and applied automatically to every search.

### Chunking Strategy

Before embedding, code must be split into chunks. The chunking strategy determines what unit of code each vector represents.

| Strategy | Pros | Cons |
|----------|------|------|
| **File-level** | Captures full context, good for small files | Large files exceed embedding model limits, dilutes signal |
| **Function-level** | Natural semantic unit, good for most codebases | Misses file-level context like imports |
| **Logical blocks** | Handles classes with methods, nested functions | Complex to implement, edge cases |
| **Sliding window** | Uniform chunks, no boundary detection needed | Splits functions mid-body, loses coherence |

Pyckle uses function-level chunking by default, with special handling for:
- Classes: chunked as class signature + individual methods
- Long functions: split at logical breakpoints (blank lines, comments)
- Configuration files: chunked by top-level keys
- Markdown: chunked by headings

The chunker also preserves context by including the file path, import statements, and parent class/function names in each chunk's metadata.

### Embedding Model Performance Considerations

Embedding generation is the most computationally intensive part of indexing. PyckLM is optimized for both CPU and GPU inference:

**GPU acceleration:** If CUDA is available, PyckLM runs on GPU with batched inference. A 50K-line codebase (roughly 6,500 chunks) indexes in under 2 minutes on an RTX 3080.

**CPU fallback:** On CPU-only machines, indexing is slower but still practical. The same 6,500 chunks take approximately 8-10 minutes on a modern laptop CPU with AVX2 support.

**Memory requirements:** PyckLM requires approximately 400MB of RAM for the model weights. During indexing, peak memory usage depends on batch size — the default configuration keeps memory under 2GB for most codebases.

**Incremental indexing:** After the initial index, subsequent runs only process changed files. Pyckle tracks file modification times and content hashes to detect changes. Re-indexing after a typical commit takes seconds, not minutes.

---

## Chapter 3: Setting Up Pyckle

### Free vs Pro

Pyckle MCP is open-source and free to run locally. The free tier includes:

- Unlimited local indexing
- Full semantic and hybrid search
- Session memory and warm routing
- Graph-based navigation
- Integration with Claude Code, Cursor, Windsurf

The Pro tier ($29/month) adds:

- Pyckle Agent with multi-provider routing
- GitHub/GitLab webhook integration
- Notion OAuth sync
- Diff review in CI
- Custom PyckLM fine-tuning
- Priority support

### Installation

Pyckle MCP requires Python 3.10+ and runs on macOS, Linux, and Windows (WSL recommended).

```bash
pip install pyckle-mcp
```

After installation, authenticate with your Pyckle account (required for Pro features, optional for free tier):

```bash
pyckle auth
```

This opens a browser window for OAuth. The resulting token is stored in `~/.pyckle/config.json`.

### Index Your First Codebase

Indexing parses your codebase, chunks the code, generates embeddings, and stores everything in a local ChromaDB database.

```bash
pyckle index /path/to/your/codebase
```

Or via the MCP tool:

```python
index_codebase("/path/to/your/codebase")
```

What happens under the hood:

1. **Discovery:** Walks the directory tree, respects `.gitignore` and `.pyckleignore`
2. **Parsing:** Uses tree-sitter grammars to parse 20+ languages
3. **Chunking:** Splits code into function-level chunks with metadata
4. **Embedding:** Batches chunks through PyckLM (GPU-accelerated if available)
5. **Storage:** Writes vectors and metadata to ChromaDB at `~/.pyckle/indexes/{codebase_hash}/`

For a typical 50K-line codebase, indexing takes 2-5 minutes on a modern laptop. Subsequent re-indexes are incremental — only changed files are re-processed.

### Connecting to AI Tools

Pyckle MCP exposes tools via the Model Context Protocol. Add it to your tool's MCP configuration:

**Claude Code** (`~/.claude/mcp.json`):
```json
{
  "mcpServers": {
    "pyckle": {
      "command": "pyckle",
      "args": ["mcp", "serve"],
      "env": {}
    }
  }
}
```

**Cursor** (`.cursor/mcp.json` in workspace root):
```json
{
  "mcpServers": {
    "pyckle": {
      "command": "pyckle",
      "args": ["mcp", "serve"]
    }
  }
}
```

**Windsurf** (`~/.windsurf/mcp.json`):
```json
{
  "servers": {
    "pyckle": {
      "command": "pyckle",
      "args": ["mcp", "serve"]
    }
  }
}
```

Restart your editor after adding the configuration. The MCP server starts automatically when the editor launches.

### Your First Search

With the index built and the MCP server running, you can search from your AI tool:

```
search_code("authentication middleware")
```

The response includes:

```json
{
  "results": [
    {
      "file": "/app/middleware/auth.py",
      "function": "authenticate_request",
      "score": 0.892,
      "excerpt": "def authenticate_request(request):\n    token = request.headers.get('Authorization')\n    if not token:\n        raise UnauthorizedError('Missing auth token')\n    ..."
    },
    {
      "file": "/app/utils/jwt.py",
      "function": "verify_jwt_token",
      "score": 0.847,
      "excerpt": "def verify_jwt_token(token: str) -> dict:\n    try:\n        payload = jwt.decode(token, settings.SECRET_KEY, algorithms=['HS256'])\n    ..."
    }
  ],
  "query_time_ms": 42
}
```

### Understanding Results

Each result includes:

- **file:** Absolute path to the source file
- **function:** Name of the function or class containing the match
- **score:** Hybrid similarity score (0-1), with calibrated threshold applied
- **excerpt:** The actual code, truncated for display

Results are sorted by score descending. The calibrated threshold filters out low-confidence matches automatically — you only see results that are likely relevant.

---

## Chapter 4: Session Memory and Warm Routing

### What the Action Graph Tracks

Pyckle maintains a session graph that models your development activity. Every interaction creates or strengthens edges in this graph:

- **File reads:** When you open or search for a file
- **File edits:** When you modify a file (automatically detected via MCP hooks)
- **Search queries:** What you searched for and which results you clicked
- **Time decay:** Older interactions fade, recent ones dominate

The graph is bidirectional: if you edit `auth.py` and then search for "user validation", the search results are boosted for files that import from or are imported by `auth.py`.

### Memory Auto-Capture

When Pyckle MCP is connected to your editor, edits are captured automatically via the `register_edit()` hook. You do not need to manually register files — the MCP client integration handles this.

The hook fires when:
- You save a file in the indexed codebase
- A tool (Claude Code, Cursor) writes to a file
- Git detects a staged change in the working tree

Each edit updates the session graph, increasing the warmth score of the edited file and its neighbors.

### The Warmth Score

Every file in the session has a warmth score from 0 to 1. The score is calculated as:

```
warmth = recency_weight * edit_count + search_hits + neighbor_boost
```

Where:
- **recency_weight:** Exponential decay based on time since last interaction
- **edit_count:** Number of edits in this session
- **search_hits:** Number of times this file appeared in search results
- **neighbor_boost:** Fraction of warmth from connected files (import graph)

Files with high warmth are prioritized in search results and automatically included in context when you start new queries.

### Resuming Work with session_continue()

When you return to a codebase after a break, `session_continue()` restores your context:

```python
session_continue("where did I leave off with the payment refactor")
```

Response:

```json
{
  "warm_files": [
    {"path": "/app/payments/refund.py", "warmth": 0.94, "last_edit": "2h ago"},
    {"path": "/app/payments/models.py", "warmth": 0.81, "last_edit": "2h ago"},
    {"path": "/app/tests/test_refund.py", "warmth": 0.73, "last_edit": "3h ago"}
  ],
  "recent_queries": [
    "refund validation logic",
    "payment state machine",
    "stripe webhook handler"
  ],
  "suggested_context": "You were working on the refund flow, specifically adding validation for partial refunds. The last edit was to refund.py, adding a check for refund amount <= original charge."
}
```

This response can be passed directly to the AI model as context, eliminating the "where was I?" warmup period.

### add_memory() vs Auto-Capture

Auto-capture handles most cases, but sometimes you want to explicitly store context that is not tied to a file:

```python
add_memory(
    content="The payments team decided to deprecate v1 API by Q3",
    tags=["payments", "deprecation", "decision"]
)
```

Use `add_memory()` for:
- Team decisions and rationale
- External documentation links
- Architectural context that spans multiple files
- TODO items and planned changes

### Querying Memory with search_memory()

Memory search is separate from code search — it queries your explicit memory entries:

```python
search_memory("payment API deprecation")
```

Response:

```json
{
  "memories": [
    {
      "content": "The payments team decided to deprecate v1 API by Q3",
      "tags": ["payments", "deprecation", "decision"],
      "created": "2026-04-15T14:32:00Z",
      "score": 0.91
    }
  ]
}
```

### Cross-Session Persistence

The session graph and memories persist in ChromaDB at `~/.pyckle/sessions/`. This means:

- Closing your editor does not lose warmth data
- Rebooting your machine preserves session state
- You can resume work days later and still get relevant context

Session data is tied to the codebase, not the editor. If you switch from Cursor to Claude Code, the same session state is available.

### Pupil Memory Bootstrap

When Pyckle Agent handles a request, it uses the Pupil system to pre-inject warm context before the first model call. The flow:

1. Agent receives user query
2. Agent calls `session_continue()` to get warm files
3. Agent calls `search_code()` with the query
4. Agent merges warm files + search results into a context block
5. Context block is prepended to the model prompt
6. Model sees relevant code immediately, no tool calls needed

This "pupil" (contextual pre-injection) reduces round-trips and improves first-response accuracy.

---

## Chapter 5: Dependency Graphs and Blast Radius

### How Pyckle Builds the Dependency Graph

During indexing, Pyckle's parser extracts import/export relationships:

- **Python:** `import x`, `from x import y`
- **JavaScript/TypeScript:** `import`, `require`, `export`
- **Go:** `import "package"`
- **Java:** `import package.Class`
- **Rust:** `use crate::module`

These relationships form a directed graph where nodes are files and edges are import dependencies. The graph is stored alongside the vector index and queried in constant time.

### graph_neighbors(): Direct Dependencies

`graph_neighbors()` returns the immediate import graph for a file:

```python
graph_neighbors("/app/payments/refund.py")
```

Response:

```json
{
  "file": "/app/payments/refund.py",
  "imports": [
    "/app/payments/models.py",
    "/app/payments/stripe_client.py",
    "/app/utils/validators.py"
  ],
  "imported_by": [
    "/app/api/routes/payments.py",
    "/app/tasks/process_refunds.py",
    "/app/tests/test_refund.py"
  ]
}
```

Use this when you need to understand what a file depends on and what depends on it.

### graph_impact(): Transitive Blast Radius

`graph_impact()` follows the dependency graph transitively to show all files that could be affected by a change:

```python
graph_impact("/app/payments/models.py", max_depth=3)
```

Response:

```json
{
  "file": "/app/payments/models.py",
  "impact_radius": {
    "depth_1": [
      "/app/payments/refund.py",
      "/app/payments/charge.py",
      "/app/payments/subscription.py"
    ],
    "depth_2": [
      "/app/api/routes/payments.py",
      "/app/api/routes/subscriptions.py",
      "/app/tasks/billing.py"
    ],
    "depth_3": [
      "/app/tests/test_payments.py",
      "/app/tests/test_billing.py"
    ]
  },
  "total_affected": 8
}
```

Use this before refactoring to understand how far-reaching a change might be.

### include_neighbors in search_code()

The `search_code()` tool accepts an `include_neighbors` parameter that automatically expands results with their graph neighbors:

```python
search_code("refund validation", include_neighbors=True)
```

This returns not just the matching functions, but also functions they call and functions that call them — useful for understanding the full context around a match.

### Cross-File Signature Visibility

When search results reference external functions, Pyckle can include the signatures of those functions in the result:

```json
{
  "file": "/app/payments/refund.py",
  "function": "process_refund",
  "excerpt": "def process_refund(charge_id, amount):\n    charge = get_charge(charge_id)\n    ...",
  "external_signatures": [
    {
      "name": "get_charge",
      "file": "/app/payments/stripe_client.py",
      "signature": "def get_charge(charge_id: str) -> StripeCharge"
    }
  ]
}
```

This gives the AI model enough context to understand the code without needing to fetch the entire external file.

### Decision Guide: When to Use Each Tool

| Scenario | Tool |
|----------|------|
| "What does this file depend on?" | `graph_neighbors()` |
| "What will break if I change this?" | `graph_impact()` |
| "Show me related code with context" | `search_code(include_neighbors=True)` |
| "I need to understand the full call chain" | `graph_impact()` then `Read` on key files |
| "Quick check before a small edit" | `graph_neighbors()` |
| "Planning a major refactor" | `graph_impact(max_depth=5)` |

### Code Examples

**Finding all callers of a utility function:**

```python
# Find who uses the validate_email function
neighbors = graph_neighbors("/app/utils/validators.py")
callers = neighbors["imported_by"]
# ["/app/api/routes/auth.py", "/app/api/routes/users.py", ...]
```

**Assessing refactor scope:**

```python
# Before renaming a model class
impact = graph_impact("/app/models/user.py", max_depth=4)
print(f"This change affects {impact['total_affected']} files")

if impact["total_affected"] > 20:
    print("Consider a staged migration approach")
```

**Expanding search context:**

```python
# Find error handling with surrounding context
results = search_code("error handling middleware", include_neighbors=True)
for r in results["results"]:
    print(f"{r['file']}: {r['function']}")
    for sig in r.get("external_signatures", []):
        print(f"  -> calls {sig['name']} from {sig['file']}")
```

---

## Chapter 6: Integrations

### GitHub and GitLab Webhooks

Pyckle Pro can index your issue tracker and pull requests, making them searchable alongside code. This is useful for finding:

- The PR that introduced a bug
- Issues related to a feature you are implementing
- Discussion history around architectural decisions

**Webhook Setup (GitHub):**

1. Go to your repository's Settings > Webhooks
2. Add webhook with URL: `https://api.pyckle.co/webhooks/github/{your_project_id}`
3. Select events: Issues, Pull requests, Issue comments, PR comments
4. Set the secret (available in your Pyckle dashboard)

**Webhook Setup (GitLab):**

1. Go to your project's Settings > Webhooks
2. Add webhook with URL: `https://api.pyckle.co/webhooks/gitlab/{your_project_id}`
3. Select triggers: Issues events, Merge request events, Note events
4. Set the secret token

**What Gets Indexed:**

- Issue titles and bodies
- PR/MR titles, descriptions, and diff summaries
- Comments and review comments
- Labels and milestone information

**Incremental Updates:**

After initial setup, use `fetch_incremental()` to pull new items without re-indexing everything:

```python
fetch_incremental(source="github", repo="myorg/myrepo")
```

**Bulk Backfill:**

For existing issues and PRs, use `index_git_issues()`:

```python
index_git_issues(
    repo="myorg/myrepo",
    source="github",
    since="2024-01-01"  # Optional: only index recent items
)
```

### Notion OAuth

Pyckle can index your Notion workspace, making engineering documentation searchable alongside code. This is particularly useful for:

- Design documents referenced in code comments
- Runbooks and operational procedures
- Meeting notes with technical decisions

**OAuth Flow:**

1. Run `pyckle notion connect`
2. Authorize Pyckle in the Notion OAuth popup
3. Select which pages/databases to index

The OAuth flow uses PKCE for security. Tokens are stored locally in `~/.pyckle/notion_token.json` and refreshed automatically.

**What Gets Indexed:**

- Page titles and content
- Database entries with properties
- Nested pages and linked databases
- Code blocks with language detection

**Multi-User Access:**

Each developer authenticates with their own Notion account. Pyckle respects Notion's permission model — you only see pages you have access to.

**Searching Across Sources:**

When you search with `search_code()`, results can include Notion pages if you add the `sources` parameter:

```python
search_code("authentication flow", sources=["code", "notion"])
```

Response:

```json
{
  "results": [
    {"type": "code", "file": "/app/auth/flow.py", "score": 0.89, ...},
    {"type": "notion", "page": "Auth System Design Doc", "score": 0.84, ...}
  ]
}
```

### Diff Review in CI

Pyckle's `review_diff()` tool analyzes pull request diffs using your codebase context. In CI, this catches issues that lint rules miss:

- Changed function calls a deprecated API
- New code duplicates existing utility function
- Modified logic contradicts documented behavior

**GitHub Actions Setup:**

```yaml
name: Pyckle Diff Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  review:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Install Pyckle
        run: pip install pyckle-mcp

      - name: Run Diff Review
        env:
          PYCKLE_API_KEY: ${{ secrets.PYCKLE_API_KEY }}
        run: |
          pyckle review-diff \
            --base ${{ github.event.pull_request.base.sha }} \
            --head ${{ github.event.pull_request.head.sha }} \
            --output pr-review.md

      - name: Post Review Comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const review = fs.readFileSync('pr-review.md', 'utf8');
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: review
            });
```

**Severity Scoring:**

Each finding is scored as HIGH, MEDIUM, or LOW:

- **HIGH:** Breaking changes, security issues, data loss risks
- **MEDIUM:** Logic bugs, missing error handling, API misuse
- **LOW:** Style issues, minor inefficiencies, suggestions

**Non-Blocking Design:**

The review step always exits 0, even if issues are found. This is intentional — Pyckle provides information, not gates. Teams can choose to enforce thresholds via separate workflow logic.

**Review Output Format:**

The review output is structured Markdown suitable for PR comments:

```markdown
## Pyckle Diff Review

### HIGH Severity

**Deprecated API Usage** in `payments/charge.py:45`
> Changed code calls `stripe.Charge.create()` which was deprecated in Stripe API v2023-10.
> Use `stripe.PaymentIntent.create()` instead.
> Related: `/app/payments/stripe_client.py` already uses PaymentIntent

### MEDIUM Severity

**Missing Error Handling** in `api/routes/users.py:89`
> New endpoint `delete_user()` does not handle `UserNotFoundError`.
> Similar endpoints in this file catch this exception explicitly.

### LOW Severity

**Duplicate Utility** in `utils/validators.py:23`
> New function `validate_phone_number()` duplicates existing `phone_validator()`
> in `/app/utils/phone.py:12`. Consider reusing the existing implementation.

---
*Generated by Pyckle Diff Review v2.4*
```

**Customizing Review Behavior:**

You can configure which checks to run and their severity thresholds in `.pyckle/review.json`:

```json
{
  "checks": {
    "deprecated_api": {"enabled": true, "severity": "HIGH"},
    "missing_error_handling": {"enabled": true, "severity": "MEDIUM"},
    "duplicate_code": {"enabled": true, "severity": "LOW"},
    "security_patterns": {"enabled": true, "severity": "HIGH"}
  },
  "ignore_paths": ["vendor/", "generated/"],
  "max_findings": 20
}
```

---

## Chapter 7: Pyckle Agent

### Architecture Overview

Pyckle Agent is a FastAPI server that wraps language model calls with automatic context retrieval. Instead of calling Claude or GPT directly, you call the Agent, which:

1. Analyzes your query to determine context needs
2. Retrieves relevant code via Pyckle MCP
3. Injects context into the prompt
4. Routes to the appropriate model tier
5. Returns the model's response

The agent is stateless — all session state lives in Pyckle MCP. This means you can run multiple agent instances for high availability.

### Multi-Provider Routing

The Agent supports multiple LLM providers:

- **Anthropic:** Claude 4 Opus, Claude 4 Sonnet, Claude 4 Haiku
- **OpenAI:** GPT-4o, GPT-4o-mini
- **Local:** Ollama with any compatible model

Routing decisions are based on query intent classification:

| Intent | Default Route |
|--------|---------------|
| Simple lookup | Haiku / GPT-4o-mini |
| Code explanation | Sonnet / GPT-4o |
| Complex refactor | Opus / GPT-4o |
| Architecture discussion | Opus / GPT-4o |

You can override routing with the `model` parameter in your request.

### The MCP Tool Loop

When the model needs additional context, the Agent allows it to call MCP tools iteratively. The loop:

1. Model receives initial prompt + injected context
2. Model may request additional context via tool calls
3. Agent executes tool calls, returns results
4. Model continues with enriched context
5. Repeat until model produces final response (max 10 iterations)

This loop is invisible to you — the Agent returns only the final response.

### Cost-Aware Dynamic Routing

The Agent tracks token costs and can automatically downgrade routing to stay within budget:

```python
# In agent config
{
  "cost_limit_per_request": 0.10,  # Max $0.10 per request
  "cost_limit_daily": 50.00        # Max $50/day
}
```

When approaching limits, the Agent routes to cheaper models automatically. You receive a warning header in the response when this happens.

### Calling the API

**Endpoint:** `POST /agent/run`

**Request:**

```json
{
  "query": "How does the authentication middleware validate JWTs?",
  "codebase": "/path/to/your/codebase",
  "model": "balanced",
  "max_iterations": 5
}
```

**Response:**

```json
{
  "response": "The authentication middleware in `/app/middleware/auth.py` validates JWTs using the following flow:\n\n1. Extracts the token from the Authorization header...",
  "context_used": [
    {"file": "/app/middleware/auth.py", "function": "authenticate_request"},
    {"file": "/app/utils/jwt.py", "function": "verify_jwt_token"}
  ],
  "model_used": "claude-sonnet-4-20250514",
  "tokens": {"input": 4521, "output": 892},
  "cost": 0.023
}
```

### Model Tiers

| Tier | Description | Default Model |
|------|-------------|---------------|
| `fast` | Quick lookups, simple questions | Claude Haiku |
| `balanced` | Most development tasks | Claude Sonnet |
| `powerful` | Complex reasoning, architecture | Claude Opus |

Specify the tier in your request:

```json
{"query": "...", "model": "powerful"}
```

Or specify an exact model:

```json
{"query": "...", "model": "claude-opus-4-20250514"}
```

### When to Use Agent vs Direct MCP

| Use Case | Recommendation |
|----------|----------------|
| IDE integration (Cursor, Claude Code) | Direct MCP — the IDE handles orchestration |
| Custom application calling LLMs | Agent — handles retrieval automatically |
| CI/CD pipeline review | Agent — stateless, easy to deploy |
| Interactive CLI usage | Direct MCP — more control over tool calls |
| Automated workflows | Agent — reliable, handles retries |

### Deployment Options

**Fly.io (recommended for Pro):**

The Agent is pre-configured to run on Fly.io with auto-sleep for cost efficiency:

```bash
pyckle agent deploy --region ord
```

Cold starts take 2-3 seconds. The instance sleeps after 5 minutes of inactivity.

**Local with Ollama:**

For fully offline operation, run the Agent locally with Ollama as the LLM provider:

```bash
# Start Ollama with a code model
ollama serve
ollama pull codellama:34b

# Start the Agent pointing to Ollama
pyckle agent serve --provider ollama --model codellama:34b
```

This configuration keeps all data and computation on your machine.

### Error Handling and Retries

The Agent handles transient failures gracefully:

- **Model timeouts:** Automatically retries up to 3 times with exponential backoff
- **Rate limits:** Queues requests and retries after the rate limit window
- **Context retrieval failures:** Falls back to cached context if MCP is temporarily unavailable
- **Partial failures:** Returns partial results with a warning rather than failing entirely

You can configure retry behavior:

```json
{
  "max_retries": 3,
  "retry_delay_ms": 1000,
  "retry_backoff_multiplier": 2,
  "timeout_ms": 30000
}
```

### Observability

The Agent exposes Prometheus metrics at `/metrics`:

```
# Request metrics
pyckle_agent_requests_total{model="sonnet", status="success"} 1234
pyckle_agent_request_duration_seconds{quantile="0.99"} 2.3

# Token metrics
pyckle_agent_input_tokens_total 45678900
pyckle_agent_output_tokens_total 12345678

# Context retrieval
pyckle_agent_context_chunks_retrieved{source="search"} 8901
pyckle_agent_context_chunks_retrieved{source="warm"} 2345
```

For distributed tracing, the Agent supports OpenTelemetry. Set `OTEL_EXPORTER_OTLP_ENDPOINT` to enable trace export.

---

## Chapter 8: Advanced — Custom PyckLM Tuning

### When to Consider Tuning

Out-of-the-box PyckLM works well for most codebases. Consider custom tuning when:

- **Search recall is below 80%:** You regularly search for code and get irrelevant results
- **Non-standard naming conventions:** Your codebase uses domain-specific abbreviations (medical, financial, legacy systems)
- **Multi-language with unusual patterns:** Mixed languages with shared concepts that need cross-language retrieval
- **Large codebase (100K+ lines):** More data to learn from, tuning has higher impact

### What Fine-Tuning Does

Tuning adapts PyckLM to your codebase's semantic patterns. The process:

1. **Collects training data:** Your indexed codebase + search history provides positive examples
2. **Generates negatives:** Hard negatives are mined from similar-looking but semantically different code
3. **Fine-tunes the model:** Contrastive learning on your specific patterns
4. **Validates:** Holdout set confirms improvement before deployment

No manual labeling is required. The search history you generate through normal usage provides implicit feedback — results you clicked are positive, results you ignored are weaker signals.

### What to Expect

Typical improvements from custom tuning:

- **2-3x improvement in recall@10:** More relevant results in the top 10
- **50% reduction in "not found" searches:** Queries that returned nothing now return correct results
- **Better handling of abbreviations:** Model learns your naming conventions

Tuning takes 2-4 hours depending on codebase size. You receive an email when the tuned model is ready.

### How to Request Tuning

1. Ensure you have at least 1,000 chunks indexed (check with `index_stats()`)
2. Use Pyckle normally for at least a week to generate search history
3. Request tuning:

```bash
pyckle tune request
```

Or via the API:

```python
checkout("pycklm-tuning")
# Follow the prompts to confirm and upload
```

### Deployment

The tuned model is deployed to your Fly.io instance automatically. It does not replace the base PyckLM — it runs alongside it. You can switch between them:

```python
search_code("authentication", model="base")      # Original PyckLM
search_code("authentication", model="tuned")     # Your custom model
search_code("authentication")                    # Uses tuned if available
```

### Ownership

The tuned model weights are yours. They are stored on your Fly instance and never shared with other customers. You can:

- Export the weights for use in other systems
- Delete the tuned model at any time
- Retain the model even if you cancel Pro

Pyckle does not retain copies of tuned models beyond what is necessary for training.

### Evaluating Tuning Results

After receiving your tuned model, you can evaluate its performance against the base model:

```bash
pyckle tune evaluate --model tuned --queries 100
```

This runs 100 sample queries against both models and reports:

- **Recall@10:** Percentage of correct results in top 10
- **Mean Reciprocal Rank (MRR):** How high the first correct result ranks
- **Query latency:** Time to retrieve results

Typical improvements:

| Metric | Base PyckLM | Tuned PyckLM | Improvement |
|--------|-------------|--------------|-------------|
| Recall@10 | 78% | 92% | +18% |
| MRR | 0.65 | 0.84 | +29% |
| Latency | 42ms | 45ms | +7% (acceptable) |

The slight latency increase comes from the tuned model's additional parameters. In practice, this is imperceptible.

---

## Appendix: Quick Reference

### MCP Tool Reference

| Tool | Description |
|------|-------------|
| `search_code(query, [top_k], [include_neighbors], [sources])` | Semantic + BM25 hybrid search over indexed codebase |
| `graph_neighbors(file_path)` | Returns files that import and are imported by the target file |
| `graph_impact(file_path, max_depth)` | Transitive dependency analysis — files affected by changes |
| `index_codebase(path)` | Index a codebase into the local vector database |
| `index_stats()` | Returns chunk count, file count, last index time |
| `session_continue(query, [top_k])` | Resume work — returns warm files, recent queries, context |
| `session_summary()` | Current session state: files read/edited, queries, hottest files |
| `add_memory(content, [tags])` | Store explicit context (decisions, notes, external info) |
| `search_memory(query)` | Search over explicit memory entries |
| `register_edit(file_path)` | Manually register a file edit (usually auto-captured) |
| `get_coverage(file_path)` | Returns test coverage percentage for a file |
| `review_diff(base, head)` | Analyze diff between two commits with codebase context |
| `index_notion([page_ids])` | Index Notion pages into searchable memory |
| `index_git_issues(repo, source, [since])` | Backfill GitHub/GitLab issues and PRs |

### Common Patterns

**Start of session:**
```python
session_continue("continuing work on X")
```

**Find relevant code:**
```python
search_code("user authentication flow")
```

**Understand dependencies before refactoring:**
```python
graph_impact("/app/models/user.py", max_depth=3)
```

**Store a decision for future reference:**
```python
add_memory("Decided to use JWT instead of sessions for API auth", tags=["auth", "architecture"])
```

**Search across code and docs:**
```python
search_code("rate limiting", sources=["code", "notion"])
```

### Troubleshooting Common Issues

**"No results found" for queries that should match:**

1. Check index status: `index_stats()` — ensure the file is indexed
2. Verify the file is not in `.pyckleignore`
3. Try a more specific query with exact function/class names
4. If persistent, re-index: `index_codebase(path, force=True)`

**Slow search performance:**

1. Check chunk count: over 50K chunks may need more resources
2. Ensure GPU is being used if available: `pyckle info --hardware`
3. Consider indexing only relevant directories, not the entire repo

**Session memory not persisting:**

1. Verify ChromaDB has write permissions: `~/.pyckle/sessions/`
2. Check disk space — ChromaDB requires space for write-ahead logs
3. On network drives, latency can cause timeouts — use local storage

**MCP connection failures:**

1. Verify the server is running: `pyckle mcp status`
2. Check your editor's MCP configuration syntax
3. Restart the editor after configuration changes
4. Review logs: `~/.pyckle/logs/mcp.log`

### Performance Benchmarks

Real-world performance on representative codebases:

| Codebase Size | Index Time | Search Latency (p99) | Memory Usage |
|---------------|------------|----------------------|--------------|
| 10K lines | 30 seconds | 25ms | 200MB |
| 50K lines | 2.5 minutes | 40ms | 450MB |
| 200K lines | 10 minutes | 65ms | 1.2GB |
| 500K lines | 25 minutes | 95ms | 2.5GB |

Benchmarks measured on: MacBook Pro M2, 16GB RAM, PyckLM v2.1.

### Configuration Files

| File | Purpose |
|------|---------|
| `~/.pyckle/config.json` | API keys, default settings |
| `~/.pyckle/indexes/` | Local vector databases |
| `~/.pyckle/sessions/` | Session graphs and memory |
| `.pyckleignore` | Files to exclude from indexing (like .gitignore) |

---

*This guide covers Pyckle MCP v2.4 and Pyckle Agent v1.2. For the latest documentation, visit [pyckle.co/docs](https://pyckle.co/docs).*