---
title: "AI-Assisted Code Review"
subtitle: "Using Semantic Search and AI to Find What Human Reviewers Miss"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Senior engineers and tech leads who run or participate in code review — looking to make review faster, more consistent, and higher signal"
estimated_pages: 75
chapters:
  - "What Code Review Actually Catches (and Misses)"
  - "Where AI Can Help and Where It Cannot"
  - "Semantic Search in the Review Workflow"
  - "Finding Similar Patterns and Prior Implementations"
  - "Flagging Anti-Patterns Automatically"
  - "Security-Focused Review with AI"
  - "Review for Architecture and Dependency Impact"
  - "Building a Hybrid Review Process"
  - "Metrics: Measuring Review Quality"
tags:
  - pyckle
  - ebook
  - code-review
  - ai-tools
  - semantic-search
  - quality
  - developer-workflow
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# AI-Assisted Code Review

## Using Semantic Search and AI to Find What Human Reviewers Miss

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: What Code Review Actually Catches (and Misses)
- Chapter 2: Where AI Can Help and Where It Cannot
- Chapter 3: Semantic Search in the Review Workflow
- Chapter 4: Finding Similar Patterns and Prior Implementations
- Chapter 5: Flagging Anti-Patterns Automatically
- Chapter 6: Security-Focused Review with AI
- Chapter 7: Review for Architecture and Dependency Impact
- Chapter 8: Building a Hybrid Review Process
- Chapter 9: Metrics: Measuring Review Quality
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

Code review is one of those practices engineering teams take for granted. Everyone does it. Most teams will tell you it works. But if you press them on what it actually catches versus what slips through, the answer gets uncomfortable fast.

This guide is not about making developers better at reviewing code manually. There is already a substantial body of writing on that subject. This guide is about the specific ways AI — and semantic search in particular — can extend what a human reviewer can see, remember, and consistently check.

The intended reader is a senior engineer or tech lead who already runs a reasonable review process and wants to know where the leverage is. Not someone looking for a tool to replace thinking, but someone who understands that review quality degrades under time pressure, context loss, and scale, and wants a structural answer to that problem.

The examples throughout this guide are language-agnostic where possible. Where code is necessary, Python and TypeScript are used because they are common enough to be readable without much introduction. The concepts apply across stacks.

One important note on scope: this guide focuses on the use of AI as a supporting instrument in human-led review. Full automation of code review — the idea that you remove the human from the loop entirely — is a separate and much more complex topic. That is not what is argued here. What is argued is that the tools exist to make human reviewers significantly faster, more consistent, and more likely to catch the things that currently slip through.

That is a more modest claim, and it happens to be true.

---

## Chapter 1: What Code Review Actually Catches (and Misses)

The instinct is to think of code review as a quality gate — the place where bad code gets stopped before it reaches production. That framing is not wrong, but it is incomplete in a way that matters when you are trying to improve the process.

Code review is actually several different things happening simultaneously. It is knowledge transfer. It is style enforcement. It is a check on correctness. It is sometimes a design conversation that should have happened two weeks earlier. Understanding which of these functions is happening at any given moment — and which functions your current process actually performs well — is the starting point for any meaningful improvement.

### What Review Is Good At

Human reviewers are genuinely good at a specific category of problem: anything that requires understanding intent. When a reviewer reads a function and says "this doesn't do what I think it was supposed to do," that is the kind of judgment a static tool will not replicate reliably. It requires knowing what the code is trying to accomplish, and that knowledge often lives in a Jira ticket, a Slack conversation, or the reviewer's memory of a previous architectural decision.

Human reviewers also tend to catch surface-level style issues, naming problems, and logic errors that are syntactically obvious: loops that iterate one too many times, conditions that are inverted, null checks that are placed after the value is used. These are the things that show up in a line-by-line read, and they show up because the reviewer's brain pattern-matches them against common mistakes.

Knowledge transfer is the underappreciated function. When a senior engineer reviews a junior engineer's code, the comments they leave are often more valuable as teaching than as defect prevention. The act of explaining why something should be structured differently propagates patterns across the codebase over time. This is review doing important organizational work that has nothing to do with shipping quality code this week.

### Where Review Consistently Fails

The failure modes are predictable and structural. They do not happen because reviewers are careless — they happen because of how human attention and memory work under real conditions.

**Context loss across PRs.** A single PR lives in isolation when reviewed. The reviewer sees the diff, not the codebase. They might not know that an almost identical function exists three modules away, or that this exact pattern was tried eighteen months ago and caused a production incident. That history exists in the codebase and in git history, but it is not in front of them. Most reviewers do not go looking for it unless something specific triggers the memory.

**Review fatigue on large PRs.** There is strong empirical evidence, most famously from SmartBear's study of peer code review at Cisco, that review effectiveness drops sharply after about 400 lines of code reviewed in one session. Reviewers on large PRs spend more time on the early files and less time on the later ones. They catch things in the first diff chunk that they would not catch in the fifth. This is not a discipline problem; it is a cognitive capacity problem.

**Inconsistent depth across reviewers.** The same PR reviewed by two different people will receive qualitatively different feedback. One reviewer focuses on architecture; another focuses on test coverage; a third focuses on variable naming. This inconsistency is not necessarily bad — multiple perspectives have value — but it means that coverage of any particular category of concern is unpredictable.

**Slow to spot cross-cutting changes.** When a change touches authentication logic in one file and a seemingly unrelated API endpoint in another, a reviewer looking at the diff as a set of changes to individual files may not perceive the cross-cutting relationship. Dependency relationships, call graphs, shared state — these are hard to hold in working memory across a diff with dozens of files.

**Security blind spots.** Security-relevant code patterns are often syntactically innocuous. A SQL query built by string concatenation looks like any other string operation. A deserialization call that trusts user input looks like routine data processing. Unless the reviewer is actively thinking about security — and has the specific knowledge to recognize the pattern as dangerous — it gets through.
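To make the security point concrete, here is a minimal sketch using Python's standard `sqlite3` module. The table, function names, and payload are hypothetical and chosen only for illustration. The two lookups look almost identical in a diff, but only the first lets user input rewrite the query.

```python
import sqlite3


def find_user_unsafe(conn, username):
    # Vulnerable: user input is concatenated directly into the SQL text.
    query = "SELECT id, name FROM users WHERE name = '" + username + "'"
    return conn.execute(query).fetchall()


def find_user_safe(conn, username):
    # Parameterized: the value is bound separately and never parsed as SQL.
    return conn.execute(
        "SELECT id, name FROM users WHERE name = ?", (username,)
    ).fetchall()


def make_demo_db():
    # In-memory database with two hypothetical users.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])
    return conn
```

With the input `x' OR '1'='1`, the unsafe version returns every row in the table, while the parameterized version returns none. Nothing about the first function's syntax signals that difference to a reviewer who is not looking for it.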

> **Key Insight:** Review failure is not usually about reviewer skill. It is about reviewer context. The person reviewing your PR often does not have access to the codebase knowledge that would make a particular problem obvious. The goal of AI-assisted review is to give reviewers that context without requiring them to go find it themselves.

### The Data on What Slips Through

Studies of defect detection in code review generally find that review catches somewhere between 60% and 90% of defects — a range so wide it is almost not useful. The more actionable finding is what kind of defects escape. Logic errors that depend on understanding system state tend to escape. Security vulnerabilities that require recognizing a pattern tend to escape. Architectural problems that only become visible when you see the PR in context of the broader codebase tend to escape.

A 2020 analysis of CVEs in open-source projects found that a meaningful percentage of security-related defects had been introduced in commits that passed code review. In many cases, the introduction commit had received substantive review comments — the vulnerability just was not what reviewers were thinking about at the time.

This is important: the problem is not that review is unserious. The problem is that review is bounded. It is bounded by time, by context, and by the working memory of whoever happens to be reviewing at that moment.

### Review as a System

One useful reframe: instead of thinking about individual reviews as quality gates, think about the review process as a system with inputs, throughput constraints, and failure modes. The system has certain things it is reliable at and certain things it is not. The question is how to make the unreliable parts more reliable without breaking the parts that work.

The things that require human judgment — intent verification, design feedback, knowledge transfer — should stay with humans. The things that require breadth of context, consistency across reviewers, and memory of prior patterns are precisely the things machines are better positioned to support.

That is the organizing principle behind AI-assisted review. Not replacement. Augmentation in the specific places where augmentation addresses the actual failure modes.

> **Warning:** Teams that approach AI-assisted review as a way to require fewer or less experienced reviewers tend to get worse outcomes, not better ones. The tools work best when paired with capable reviewers who can use the additional context effectively. Reducing reviewer quality while adding AI tooling is not a tradeoff that pays off.

### Key Takeaways

1. Code review serves multiple functions simultaneously — defect detection, knowledge transfer, style enforcement, design alignment — and does not serve all of them equally well.
2. Review failure is structural, not personal. Context loss, fatigue, and inconsistency are predictable failure modes, not signs of bad reviewers.
3. Human reviewers are strongest at intent-based judgment. They are weakest at cross-file context, historical pattern recognition, and consistent security scrutiny.
4. Treating review as a system rather than a gate makes it easier to identify where improvement has the most leverage.
5. The goal of AI augmentation is to address the structural failure modes, not to replace the human judgment that review depends on.

> **Try This:** Pull the last ten bugs that made it to production from your team's incident log or bug tracker. For each one, classify whether it went through code review and, if so, why it was not caught. Categorize each miss: Was it a context problem? A fatigue problem? A security knowledge problem? The distribution will tell you where your team's review process has the most structural weakness.

---

## Chapter 2: Where AI Can Help and Where It Cannot

The easiest way to be disappointed by AI in code review is to expect it to do the wrong things. Vendors oversell. Engineers who have used these tools in one context apply them to another and find they do not work the same way. The result is either overreliance on something that produces false confidence, or rejection of something genuinely useful because it was tried in the wrong context.

This chapter is a frank accounting of where AI adds real value in review and where the limits are hard. Both matter.

### What AI Does Well

**Pattern recognition at scale.** AI systems, particularly those built on large language models, have been trained on enormous amounts of code. They have seen common anti-patterns, common security mistakes, common architectural problems. When asked to evaluate whether a block of code follows or violates a known pattern, they are often right, and they are more consistent than human reviewers. Two reviewers might disagree on whether a particular abstraction is appropriate; an AI model asked the same question about the same code, with the same prompt and deterministic sampling settings, will usually give the same answer both times.

**Breadth of coverage.** A human reviewer reading a 600-line PR will not give equal attention to every section. An AI-assisted review tool does not get tired. It applies the same level of scrutiny to file 12 as to file 1. This is not nothing — the empirical drop-off in human review quality with PR size is exactly the kind of thing consistent mechanical coverage can partially address.

**Cross-referencing.** With semantic search in the loop, an AI-assisted review tool can look at a function and ask whether similar functions exist elsewhere in the codebase. It can surface prior implementations of the same pattern. This is the kind of cross-file, cross-PR context that is almost never available to a human reviewer relying on memory alone.

**Security pattern matching.** Known vulnerability classes — SQL injection, path traversal, deserialization of untrusted data, use of weak cryptographic primitives — have recognizable code patterns. Matching against those patterns consistently is something AI handles well. It does not require understanding intent; it requires recognizing structure.

**Generating useful questions.** Even when an AI system cannot definitively say whether something is a bug, it can often identify places where something looks unusual enough to warrant a question. Prompting a reviewer to look more closely at a specific section is itself valuable — it channels attention toward the places most likely to have problems.

> **Key Insight:** The best use of AI in review is not to produce verdicts but to produce better-targeted attention. An AI that consistently surfaces the right questions for humans to answer is already a force multiplier, even if it cannot answer those questions itself.

### What AI Does Poorly

**Understanding intent.** An AI system does not know what the code was supposed to do. It can infer from naming and structure, but inference is not the same as knowledge. When a function is wrong because it implements the right logic for the wrong business requirement, AI will often miss it entirely. The code is syntactically and semantically coherent; it just does not match a requirement that lives in a document the model has never seen.

**Novel logic errors.** AI models have strong pattern libraries. For errors that fit known patterns, they perform well. For logic errors that are unusual (errors that arise from the specific state of a specific system in a specific domain), they perform much less reliably. A subtle off-by-one error in a custom scheduling algorithm is not something that matches against a training corpus of known patterns. A human reviewer with domain knowledge is far more likely to catch it than an AI model is.

**Architectural judgment.** Whether a particular abstraction is the right level of abstraction, whether an API surface is well-designed, whether a new module's boundaries are appropriate given where the system is heading — these are judgment calls that depend on understanding the system's history, its likely future, and the constraints the team is working within. AI can surface observations about architectural patterns it has seen elsewhere, but it cannot tell you whether the patterns it suggests are appropriate for your specific system.

**Confidence calibration.** Many current AI tools have poor calibration between how confident they appear and how correct they actually are. A tool that produces a list of potential issues with no differentiation in confidence is providing low-quality signal — reviewers end up spending time investigating false positives and may start ignoring the tool's output entirely. This is one of the more practical problems with deploying AI review tools at scale.

**Understanding context from outside the code.** The AI model sees code. It does not see the Jira ticket, the design doc, the Slack thread where the team decided to approach the problem a particular way. Decisions made in natural language outside the code often explain why the code looks the way it does. Without access to that context, AI systems will sometimes flag reasonable code as unusual and miss genuinely problematic code that happens to be patterned correctly.

> **Warning:** AI-generated review comments that appear with equal confidence regardless of their actual quality create a specific risk: reviewers who are busy or under time pressure will either trust all of them or dismiss all of them. Neither response is appropriate. Before deploying AI review tooling widely, spend time understanding how it presents uncertainty, and decide whether that presentation is calibrated well enough for your reviewers to use productively.
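One way to check calibration before a wide rollout is to triage a sample of the tool's comments by hand and compare the stated confidence against how often the comment turned out to be a real issue. The sketch below assumes the tool exposes a numeric confidence per comment, which not every tool does; `TriagedComment` and `precision_by_confidence` are hypothetical names used only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class TriagedComment:
    stated_confidence: float  # tool-reported confidence in [0, 1] (assumed to exist)
    was_real_issue: bool      # verdict from a human who triaged the comment


def precision_by_confidence(comments, n_buckets=4):
    """Return {(low, high): fraction of comments in that bucket that were real issues}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for c in comments:
        # A confidence of exactly 1.0 belongs in the top bucket.
        bucket = min(int(c.stated_confidence * n_buckets), n_buckets - 1)
        totals[bucket] += 1
        hits[bucket] += int(c.was_real_issue)
    return {
        (b / n_buckets, (b + 1) / n_buckets): hits[b] / totals[b]
        for b in sorted(totals)
    }
```

If the tool is well calibrated, precision should rise from the lowest bucket to the highest. If the buckets look roughly the same, the confidence values carry little information and reviewers have no basis for prioritizing one comment over another.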

### The Middle Ground: Where It Depends

There is a category of review task where AI can help, but the quality of the help depends heavily on how it is deployed. Refactoring suggestions are one example. An AI system that identifies that a function could be more readable by being split, or that a repeated pattern could be extracted, is providing useful signal — but only if the suggestion fits the team's conventions and the reviewer has enough context to evaluate whether the suggestion is appropriate here. The same suggestion applied in the wrong context leads to churn and reviewer frustration.

Documentation quality is another example. AI can evaluate whether a function's documentation matches what the function appears to do. For boilerplate mismatches — a docstring that describes an old version of the function, or comments that refer to variable names that no longer exist — this is reliable. For evaluating whether documentation adequately describes the function's behavior for the next engineer who needs to understand it, the quality degrades quickly.

Test coverage analysis sits in similar territory. AI can identify which code paths are not exercised by the tests in the PR. It cannot evaluate whether those tests are actually testing the right things, or whether the assertions are meaningful.

### Matching Capability to Use Case

The practical implication is that AI-assisted review should be deployed selectively, not uniformly. Use it where it is strong: consistency, breadth, pattern matching, security scrutiny, cross-referencing. Do not lean on it where it is weak: intent verification, architectural judgment, novel logic errors.

This is less a limitation to work around and more a design principle. The most effective hybrid review processes are explicit about which part of the review the AI component is handling and which part the human is handling. They do not attempt to turn AI into a general-purpose reviewer, because that is not what it is. They identify the specific review functions that benefit from AI support and build the workflow around those specific functions.

That design question — how to structure the handoff between AI and human in review — is what later chapters address in detail. But the starting point is being clear-eyed about capability, in both directions.

### Key Takeaways

1. AI is strong at pattern recognition, consistency, breadth of coverage, and security pattern matching — all things that degrade or narrow in human review under real conditions.
2. AI is weak at intent understanding, novel logic errors, architectural judgment, and anything requiring context from outside the code itself.
3. Confidence calibration is a practical deployment problem, not just a technical one — how AI presents its uncertainty affects how reviewers use it.
4. The most effective AI-assisted review processes are explicit about which review functions AI handles and which remain human.
5. The goal is not a general-purpose AI reviewer. It is targeted augmentation of the specific places where human review predictably fails.

> **Try This:** Take a recent non-trivial PR from your codebase and run it through an AI review tool of your choice. For each comment the tool produces, classify it: Is this a genuine concern? A false positive? Something the tool correctly identified but which a human reviewer also would have caught? Something the tool caught that a human reviewer likely would have missed? This exercise calibrates your expectation of the tool's actual value relative to your current process before you commit to integrating it.

---

## Chapter 3: Semantic Search in the Review Workflow

Most engineers have used keyword search in code review. You are looking at a function that seems off and you search the codebase for the function name to see where it is called. That is useful. But it is a fraction of what semantic search makes possible, and understanding the difference explains why semantic search is specifically the right tool for many of the context problems that code review faces.

Keyword search is exact. You get back results that contain the string you searched for. Semantic search is approximate. You get back results that are conceptually similar to what you searched for, regardless of whether they share any specific tokens. That approximation — which sounds like a weakness — turns out to be a strength in exactly the contexts where review needs help most.

### Why Semantic Search Is Different

Consider a scenario: you are reviewing a PR that adds a new function for normalizing user input before it gets stored. You want to know whether there are other input normalization functions in the codebase that this new one should be consistent with, or that should be refactored away in favor of this new one.

With keyword search, you would need to know what to search for. You might search for "normalize," "sanitize," "clean," "format," "validate" — and you might miss the function named `prepareUserData` or `scrubInput` that does essentially the same thing. Keyword search is only as good as your ability to enumerate the vocabulary of the codebase.

Semantic search operates differently. You describe what you are looking for in natural language — "user input normalization before storage" — and the search system returns the most semantically similar code chunks it has indexed. It will surface `prepareUserData` and `scrubInput` alongside `normalizeInput` because the underlying embeddings reflect conceptual similarity, not token overlap.

> **Key Insight:** Keyword search finds what you know how to name. Semantic search finds what you can describe. This is the difference between searching a codebase and understanding a codebase — and it is precisely the difference that matters in code review, where you often cannot name what you are looking for.

### How Embedding-Based Search Works

The technical foundation is straightforward. Code — either whole files, functions, or chunks — is passed through an embedding model that converts it into a dense vector representation. Similar code produces vectors that are close together in embedding space. When you issue a search query, it is also converted to a vector, and the system returns the code chunks whose vectors are most similar to the query vector.
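As a rough illustration of the retrieval step, the sketch below ranks indexed chunks by cosine similarity to a query vector. The three-dimensional vectors and chunk names are made up for readability. A real embedding model produces vectors with hundreds or thousands of dimensions, and production systems typically use an approximate nearest-neighbor index rather than a linear scan.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """index maps chunk id -> embedding. Returns [(chunk_id, score)], best first."""
    scored = [(chunk_id, cosine_similarity(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings": the first two chunks point in nearly the same direction.
toy_index = {
    "normalizeInput": [0.9, 0.1, 0.0],
    "scrubInput": [0.8, 0.2, 0.1],
    "renderInvoice": [0.0, 0.1, 0.9],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded natural-language query
nearest = top_k(query, toy_index, k=2)
```

Here `nearest` contains `normalizeInput` and `scrubInput`, in that order, and leaves out `renderInvoice`, whose vector points in a different direction from the query.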

For code search specifically, the embedding model matters a great deal. Models trained on code tend to understand that `for i in range(len(arr))` and `for item in arr` are semantically similar iteration patterns, even though their token overlap is limited. They understand that `raise ValueError("invalid input")` and `throw new IllegalArgumentException("invalid input")` are doing conceptually the same thing across language boundaries.

More sophisticated implementations use hybrid search — combining semantic embeddings with BM25 keyword scoring, then fusing the results. This preserves the strength of keyword search for exact identifier matches while layering in semantic similarity for conceptual queries. For code review use cases, hybrid search generally outperforms either approach alone.
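Fusion can be done in several ways. One widely used option is reciprocal rank fusion (RRF), which combines ranked lists using only each result's position, so BM25 scores and cosine similarities never need to be put on the same scale. The sketch below is a minimal version; the function names in the rankings are hypothetical, and `k=60` is the conventional default rather than a tuned value.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """rankings: list of ranked id lists, best first. Returns ids sorted by fused score."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank); items in several lists accumulate.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


# Hypothetical results for the same query from two retrievers.
bm25_ranking = ["normalizeInput", "validateEmail", "formatDate"]
semantic_ranking = ["scrubInput", "normalizeInput", "prepareUserData"]
fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
```

`normalizeInput` ends up first in `fused` because it appears near the top of both lists, while results found by only one retriever are still kept further down the ranking.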

### Integrating Semantic Search Into Review

The workflow integration point is not during the diff review itself — at least not primarily. The highest-value use of semantic search in code review is at the beginning, before the reviewer starts reading the diff line by line.

The entry point looks like this: the reviewer (or an automated pre-review step) issues semantic queries against the codebase based on the content of the PR. The queries target the semantically interesting parts of the change — new functions, modified logic, new dependencies. The results populate a pre-review context document that the reviewer reads before opening the diff.

That context document might include:
- Existing functions that are semantically similar to functions being added or modified
- Prior implementations of the same pattern that may have been deprecated or superseded
- Related code that might need to be updated in parallel with this change
- Historical context from comments or documentation attached to similar code

This pre-loading of context directly addresses the context-loss failure mode described in Chapter 1. The reviewer is not starting cold from the diff — they are starting with a picture of where this change fits in the broader codebase.

```python
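# Example: Pre-review context generation using semantic search.
# All types here are illustrative, not a real library's API; extract_functions is a diff parser you supply.
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Iterable, Protocol


@dataclass
class ChangedFunction:
    name: str
    body: str
    file_path: str
    docstring: str | None = None


class SemanticSearchClient(Protocol):
    def search(self, query: str, top_k: int, filters: dict | None = None) -> list: ...


@dataclass
class ReviewContext:
    similar_implementations: dict[str, list] = field(default_factory=dict)
    related_patterns: dict[str, list] = field(default_factory=dict)


def generate_review_context(
    pr_diff: dict,
    search_client: SemanticSearchClient,
    extract_functions: Callable[[dict], Iterable[ChangedFunction]],
) -> ReviewContext:
    context = ReviewContext()

    for fn in extract_functions(pr_diff):
        # Find semantically similar implementations outside the file under review
        description = fn.docstring or fn.body[:200]
        context.similar_implementations[fn.name] = search_client.search(
            query=f"{fn.name}: {description}",
            top_k=5,
            filters={"exclude_file": fn.file_path},
        )

        # Check for prior versions or related patterns anywhere in the codebase
        context.related_patterns[fn.name] = search_client.search(
            query=f"similar pattern to {fn.name}",
            top_k=3,
        )

    return context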
```

> **Try This:** Before your next substantive code review, spend five minutes running semantic queries against your codebase for the main concepts the PR touches. Use plain English descriptions of what the PR is doing, not function names. Note how many times the results surface code you would not have thought to look at manually. That delta is the context gap your current review process has.

### Semantic Search During the Review

Once a reviewer is in the diff, semantic search becomes a real-time lookup tool. A well-integrated review environment lets the reviewer highlight a function or block and immediately query for similar code elsewhere in the codebase, without leaving the review context.

This changes a specific class of review interaction. The reviewer sees a utility function added in the PR. Instead of thinking "this seems like it might already exist somewhere" and then deciding the search is not worth the interruption, they can issue the search in three seconds and see whether close matches exist. An empty result is not proof that the function is novel, but a strong match is a clear prompt to ask about duplication.

The latency of the search matters here. A query that takes ten seconds breaks flow; a query that takes under a second becomes part of the natural rhythm of review. Building or choosing a semantic search system with sub-second query latency is not a performance nicety — it is a prerequisite for workflow integration.

### Semantic Search at the PR Level

Beyond individual function lookups, semantic search enables a PR-level analysis that is difficult to achieve any other way. The question being answered is: "Is this PR consistent with how this class of problem has been solved before in this codebase?"

This is different from checking style consistency. Style linters handle that. What semantic search can detect is pattern consistency — whether a new implementation is conceptually aligned with existing patterns, even if it uses different variable names and different syntactic structure.

Figure 1 illustrates the difference between surface-level consistency checking and semantic pattern consistency checking.

*Figure 1: Keyword search finds exact matches; semantic search finds conceptual matches. Both are useful for different problems, but code review's consistency problems are mostly conceptual, not literal.*

For a codebase with established patterns — a particular approach to error handling, a preferred abstraction for database access, a consistent structure for validation logic — semantic search can surface deviations from those patterns automatically. The PR introduces a new validation function that uses a different pattern than the seventeen existing validation functions. Keyword search would not have surfaced that unless you already knew what the existing functions were named. Semantic search finds it from a description.

> **Warning:** Semantic search relevance is not the same as correctness. A semantically similar result might represent a deprecated pattern, a known bad approach, or code from a completely different domain that happens to use similar abstractions. Surfacing similar code is the beginning of a question, not an answer. Make sure reviewers understand this distinction before relying on semantic search results to make decisions.

### Building the Index

For teams building their own semantic search pipeline, the index quality determines the search quality. The most common mistakes:

**Chunking too coarsely.** Embedding entire files produces vectors that average over too much content to be useful for function-level queries. Function-level chunking, with optional overlap to capture context, produces better results.

**Ignoring docstrings and comments.** Code that has been meaningfully documented carries signal in that documentation that pure code embeddings may not fully capture. Including docstrings and inline comments in the chunk that gets embedded improves retrieval for queries expressed in natural language.
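A minimal sketch of function-level chunking for Python sources, using the standard `ast` module. It emits one chunk per function and keeps the docstring and inline comments in the text that would be embedded. The dictionary layout is an assumption for illustration, overlap between chunks is not handled, and other languages would need their own parser.

```python
import ast


def chunk_python_functions(source, file_path="<unknown>"):
    """Return one chunk per function or method found in the given Python source."""
    tree = ast.parse(source)
    chunks = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            chunks.append({
                "file": file_path,
                "name": node.name,
                "start_line": node.lineno,
                "docstring": ast.get_docstring(node),
                # Slices the original text, so comments inside the body are preserved.
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Each chunk's `text` field is what would be passed to the embedding model. Keeping `file` and `start_line` alongside it lets search results link back to the exact location in the codebase.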

**Not re-indexing frequently enough.** A stale index produces results that reflect the codebase as it was, not as it is. For review use cases specifically, the index should be as current as possible — ideally updated with each merged PR.

**Not filtering by relevance threshold.** Raw top-k retrieval will always return k results regardless of whether those results are actually similar. Setting a minimum similarity threshold prevents the system from surfacing irrelevant code that happens to be the least dissimilar thing in the index.
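A threshold filter can be a thin layer over whatever the search backend returns. In the sketch below, `min_score=0.75` is an arbitrary placeholder; a sensible cutoff depends on the embedding model and should be tuned against queries from your own codebase.

```python
def filter_by_threshold(results, min_score=0.75, top_k=5):
    """results: iterable of (chunk_id, similarity). Keep at most top_k at or above min_score."""
    ranked = sorted(results, key=lambda pair: pair[1], reverse=True)
    return [(chunk_id, score) for chunk_id, score in ranked[:top_k] if score >= min_score]
```

When nothing clears the threshold, the function returns an empty list. For review purposes that is useful information in itself: it tells the reviewer the index holds nothing closely related, rather than presenting the least dissimilar chunks as if they were matches.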

### Key Takeaways

1. Semantic search finds conceptually similar code rather than token-matching results, which is the right tool for the context problems code review faces.
2. The highest-value integration point is pre-review context generation — loading the reviewer with codebase context before they open the diff.
3. Sub-second query latency is a prerequisite for workflow integration, not a performance nice-to-have.
4. Semantic search surfaces questions for reviewers to investigate, not answers for reviewers to accept.
5. Index quality — chunk size, inclusion of documentation, freshness — directly determines search quality.

> **Try This:** Run a semantic search for "authentication token validation" against your codebase. Count the distinct locations where authentication logic exists. Then count how many of those locations you would have thought to check during a review of a new authentication-related PR. The gap between those two numbers is your baseline context deficit.

---

## Chapter 4: Finding Similar Patterns and Prior Implementations

The codebase is a record. Every function that has been written, every abstraction that has been tried, every pattern that has been adopted or rejected — it is all there, in varying states of currency. The problem is that this record is effectively inaccessible to most reviewers most of the time. They have the code in front of them and a mental model of the parts of the codebase they work in regularly. The rest is dark.

This chapter is about using semantic search specifically to illuminate that darkness at review time — to surface prior implementations and existing patterns so that reviewers can evaluate new code against the actual history of the codebase rather than against a much smaller mental model.

### Why Duplication Is a Review Problem

Duplicate or near-duplicate implementations are not just a code quality problem. They are a maintenance problem that compounds over time, and they are disproportionately created under exactly the conditions that code review is meant to compensate for: time pressure, incomplete context, unfamiliarity with a part of the codebase.

When a developer writes a new function to parse a date string, there is often already a date parsing function in the codebase. When they write a new wrapper for an HTTP client, there is often an existing one. When they write a new validation helper for email addresses, there is almost certainly an existing one. They did not set out to duplicate — they did not know the existing function was there.

Code review is theoretically the place where this gets caught. But it only gets caught if the reviewer knows that the existing function exists, which requires either being familiar enough with the codebase to remember it or actively searching for it. Under time pressure, that search does not happen. The new function gets approved, the codebase acquires another implementation of the same thing, and the next engineer who encounters both has to figure out which one to use and why they differ.

Semantic search makes that lookup automatic.

### The Near-Duplicate Detection Pattern

The basic pattern is straightforward: for each function added or substantially modified in a PR, issue a semantic query that finds the most similar existing functions. The query should be derived from the function's semantics, not just its name.

```python
def find_similar_implementations(
    new_function: FunctionChunk,
    search_client: SemanticSearchClient,
    similarity_threshold: float = 0.82
) -> list[SimilarFunction]:
    """
    Returns existing functions semantically similar to new_function.
    Excludes the new function's own file from results.
    """
    # Build a rich query from both name and body semantics
    query = build_function_query(new_function)

    results = search_client.search(
        query=query,
        top_k=10,
        filters={
            "file_path_not": new_function.file_path,
            "min_similarity": similarity_threshold
        }
    )

    # Filter to functions (not classes, modules, or comments)
    function_results = [r for r in results if r.chunk_type == "function"]

    return [
        SimilarFunction(
            name=r.name,
            file_path=r.file_path,
            similarity=r.score,
            snippet=r.content[:300]
        )
        for r in function_results
    ]


def build_function_query(func: FunctionChunk) -> str:
    """
    Constructs a semantic query from function name, docstring, and the first
    lines of the body. Combining them retrieves better than any one alone.
    """
    parts = []

    # Normalized name gives strong signal
    parts.append(func.name.replace("_", " "))

    # Docstring is often the clearest semantic description
    if func.docstring:
        parts.append(func.docstring[:200])

    # First few lines of body capture the structural intent
    body_lines = func.body.split("\n")[:10]
    parts.append(" ".join(body_lines))

    return " ".join(parts)
```

This generates a list of candidate duplicates, which the review tool can surface as a comment or annotation on the PR.
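
How the candidates reach the reviewer is left open in the text. As one possibility, here is a minimal sketch that renders them as a Markdown PR comment. The `SimilarFunction` dataclass mirrors the fields used in the listing above (its real definition is not shown), and `format_duplicate_comment` and the three-item cap are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SimilarFunction:
    # Hypothetical stand-in mirroring the fields used in the listing above
    name: str
    file_path: str
    similarity: float
    snippet: str


def format_duplicate_comment(
    new_name: str,
    candidates: list[SimilarFunction],
    max_items: int = 3,
) -> str:
    """Render candidate near-duplicates as a Markdown PR comment."""
    if not candidates:
        return ""  # nothing to say: post no comment at all
    ranked = sorted(candidates, key=lambda c: c.similarity, reverse=True)
    lines = [f"Possible existing implementations similar to `{new_name}`:", ""]
    for c in ranked[:max_items]:
        lines.append(
            f"- `{c.name}` in `{c.file_path}` (similarity {c.similarity:.2f})"
        )
    lines += ["", "Please confirm whether one of these can be reused or consolidated."]
    return "\n".join(lines)
```

Capping the list and phrasing the comment as a question keeps the annotation short and leaves the decision with the reviewer.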

> **Key Insight:** The goal is not to flag all similar functions as duplicates — it is to make the reviewer aware of them so they can make a deliberate decision. The reviewer might conclude that the new implementation is better and the old one should be removed. They might conclude they are different enough to coexist. They might conclude the new one should be removed in favor of the existing one. All three outcomes are better than the reviewer never knowing the existing function exists.

### Pattern Similarity vs. Semantic Duplication

Not all semantically similar code is duplicative. Two functions that both perform sorted list insertion are semantically similar but might serve genuinely different purposes — one for a priority queue, one for maintaining a sorted display list. The pattern is similar; the application is distinct.

This is an important distinction for surfacing useful results without creating reviewer fatigue from irrelevant matches. The most useful results are not just functions with high embedding similarity to the new function — they are functions with high similarity *and* similar surrounding context.

One approach: include the calling context of both functions in the similarity computation. A function that is called from authentication middleware and a function that is called from display formatting logic may be structurally similar but contextually distinct. Including call-site context in the comparison reduces false matches.

```python
def contextual_similarity(
    new_func: FunctionChunk,
    candidate: FunctionChunk,
    search_client: SemanticSearchClient
) -> float:
    """
    Computes similarity that accounts for calling context,
    reducing false matches between structurally similar but
    contextually distinct functions.
    """
    # Get immediate callers of each function
    new_callers = search_client.get_callers(new_func.file_path, new_func.name)
    candidate_callers = search_client.get_callers(candidate.file_path, candidate.name)

    # Embed the calling contexts
    new_context_vec = search_client.embed(
        " ".join(c.name for c in new_callers)
    )
    candidate_context_vec = search_client.embed(
        " ".join(c.name for c in candidate_callers)
    )

    # Combine function similarity with context similarity
    func_similarity = cosine_similarity(new_func.vector, candidate.vector)
    context_similarity = cosine_similarity(new_context_vec, candidate_context_vec)

    # Weight function similarity more, but let context penalize divergent use
    return 0.7 * func_similarity + 0.3 * context_similarity
```
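
The listing above calls `cosine_similarity` without defining it, and it does not say what happens when a function has no callers yet, which is common for code introduced in the PR under review. A minimal pure-Python sketch of both pieces follows. `blended_similarity` and its fallback rule are assumptions, not part of the original design.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def blended_similarity(
    func_sim: float,
    context_sim: Optional[float],
    func_weight: float = 0.7,
) -> float:
    """
    Weighted blend of function and call-site similarity.
    Assumption: with no caller information (context_sim is None),
    fall back to function similarity rather than penalizing the match.
    """
    if context_sim is None:
        return func_sim
    return func_weight * func_sim + (1.0 - func_weight) * context_sim
```

With the 0.7/0.3 weights from the listing, a pair with function similarity 0.9 and context similarity 0.2 blends to 0.69, which would fall below the 0.82 threshold used earlier for duplicate candidates.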

### Prior Implementation History in Git

Semantic search against the current state of the codebase tells you what exists now. Git history tells you what has existed before — including patterns that were tried and removed.

For a reviewer considering a new approach, knowing that the same approach was tried eighteen months ago and later reverted is highly relevant information. The function may not exist in the current index, but the commits do.

Combining semantic search with git history search enables a more complete picture. The workflow:

1. Find semantically similar current implementations (semantic search against codebase).
2. Search git log for commits that mention similar concepts (semantic search against commit messages and deleted code).
3. Flag both: existing implementations that might be consolidated, and prior implementations that were removed (with context on why, if available in the commit message or PR description).

This is the difference between knowing the current state of a pattern and knowing its history. Both matter for a complete review.
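
Step 2 can be approximated without a semantic index over history. The sketch below swaps in git's lexical "pickaxe" search (`git log -S`), which lists commits that changed the number of occurrences of a string, so it catches both the commit that added an identifier and the one that removed it. It is a simpler stand-in for the semantic search described above, and the function names are hypothetical.

```python
import subprocess


def pickaxe_command(term: str, max_commits: int = 20) -> list[str]:
    """Build a git log command that finds commits adding or removing `term`."""
    return [
        "git", "log",
        f"-S{term}",
        f"--max-count={max_commits}",
        "--date=short",
        "--format=%H%x09%ad%x09%s",  # hash, author date, subject; tab-separated
    ]


def parse_log_output(output: str) -> list[dict]:
    """Parse the tab-separated output produced by pickaxe_command."""
    commits = []
    for line in output.splitlines():
        if not line.strip():
            continue
        sha, date, subject = line.split("\t", 2)
        commits.append({"sha": sha, "date": date, "subject": subject})
    return commits


def find_commits_touching(term: str, repo_path: str = ".") -> list[dict]:
    """Run the search in repo_path; raises CalledProcessError if git fails."""
    result = subprocess.run(
        pickaxe_command(term),
        cwd=repo_path,
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_log_output(result.stdout)
```

The commit subjects returned here are the cheapest source of removal context. If they are uninformative, the hash is enough to look up the associated PR description.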

> **Warning:** Prior implementations that were removed may have been removed for good reasons — performance, correctness, security. When surfacing "this pattern was tried before," always include the removal context if available. A reviewer who sees "similar implementation existed in 2024" without knowing it was removed due to a production incident may view the information as neutral when it should be a red flag.

### Consistency Across the Codebase

Beyond duplication, there is the consistency question: does new code follow the patterns established by existing code in the same domain?

Consistency matters for maintainability. A codebase where authentication is handled five different ways, where error handling follows four different conventions, where database access is wrapped in three different abstractions — that codebase has a high overhead for any engineer trying to understand it. Every new context requires learning the local conventions.

Semantic search enables a consistency check: given the set of existing functions in a particular domain, does the new function follow the established pattern?

This is harder than duplication detection. You are not looking for high-similarity matches; you are asking whether a new function is stylistically and structurally consistent with a set of related but not identical functions.

One practical approach: build a representation of the "centroid" of an existing pattern by averaging the embeddings of known instances, then measure the distance of the new function from that centroid. A new authentication handler that is far from the centroid of existing authentication handlers is likely deviating from established patterns.

```python
def pattern_consistency_score(
    new_func: FunctionChunk,
    pattern_name: str,
    search_client: SemanticSearchClient
) -> float | None:
    """
    Scores how consistent a new function is with established
    implementations of a named pattern. Returns 0.0 (inconsistent)
    to 1.0 (highly consistent), or None when fewer than three
    existing implementations are found.
    """
    # Find existing implementations of this pattern
    existing = search_client.search(
        query=pattern_name,
        top_k=20,
        filters={"min_similarity": 0.75}
    )

    if len(existing) < 3:
        # Not enough data to define a pattern
        return None

    # Compute centroid of existing implementations
    centroid = average_vectors([e.vector for e in existing])

    # Measure new function's distance from centroid
    distance = cosine_distance(new_func.vector, centroid)

    # Convert to similarity score (1.0 = identical to centroid)
    return 1.0 - min(distance, 1.0)
```

A low consistency score does not necessarily mean the new code is wrong — it might mean the new code is better, and the old pattern should be updated. But it surfaces a conversation that should be happening in the review.
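
The two helpers that the consistency listing relies on, `average_vectors` and `cosine_distance`, are not defined in the text. A minimal sketch of what they might look like, assuming embeddings are plain lists of floats of equal length:

```python
import math
from typing import Sequence


def average_vectors(vectors: Sequence[Sequence[float]]) -> list[float]:
    """Component-wise mean of equal-length vectors (the pattern centroid)."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity; ranges from 0.0 (same direction) to 2.0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # assumption: treat a zero vector as unrelated
    return 1.0 - dot / (norm_a * norm_b)
```

Because cosine distance can exceed 1.0 for vectors pointing in opposite directions, the `min(distance, 1.0)` clamp in the listing is what keeps the final score inside the 0.0 to 1.0 range.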

### Key Takeaways

1. Duplicate implementations are a review failure mode that happens not because of carelessness but because of context loss — reviewers do not know what exists.
2. Semantic search automates the search for similar existing implementations, surfacing them before the reviewer has to know what to look for.
3. Contextual similarity (including call-site context) reduces false matches between structurally similar but functionally distinct functions.
4. Git history extends the context to prior implementations that were removed, which is often the most important context of all.
5. Pattern consistency scoring — measuring distance from the centroid of established implementations — enables consistency checking beyond simple duplication detection.

> **Try This:** Pick a utility function in your codebase that you wrote, or that you reviewed and approved. Run a semantic search for its purpose. Count the results with similarity above 0.80. If you find more than two functions doing essentially the same thing, you have a data point on how often your current review process misses near-duplication.

---

## Chapter 5: Flagging Anti-Patterns Automatically

Anti-patterns are the recurring mistakes that teams make often enough to have names for them. God objects. Premature optimization. Magic numbers. Callback hell. Hardcoded credentials. Sequential database calls inside a loop. Every engineering team has its own list — some drawn from the literature, some discovered through painful production incidents that were specific to their codebase.

The consistent property of anti-patterns is that they are recognizable. A sufficiently experienced reviewer knows one when they see it. The problem is that this depends on an experienced reviewer looking at the right code at the right time, and that is not guaranteed. Detecting known anti-patterns is precisely the kind of work that benefits most from automation: it is mostly pattern recognition rather than judgment; it needs to be consistent; and it should not be gated on reviewer availability.

### The Two Categories of Anti-Pattern Detection

Anti-pattern detection in code review falls into two broad categories, and they require different approaches.

**Structural anti-patterns** are detectable from the code's syntax and structure. Sequential database calls inside a loop (the N+1 query problem) have a recognizable code shape. Mutable default arguments in Python have a recognizable syntax. Unguarded string interpolation into SQL queries has a recognizable structure. These patterns can be detected reliably without semantic understanding — they are pattern matches against syntax trees or even against code text.

**Semantic anti-patterns** require understanding what the code is doing, not just what it looks like. A function that does too many things is a semantic anti-pattern — it requires judgment about what "too many things" means in context. A class with too many responsibilities is similar. These are harder to detect automatically because they require semantic understanding of intent and scope.

AI-assisted review handles both categories, but differently. Structural detection is rule-based and high-precision. Semantic detection is model-based (LLMs or embeddings) and lower-precision, requiring human confirmation.

### Structural Anti-Pattern Detection

The most reliable approach to structural anti-patterns is a combination of AST-based rules and an LLM-based second pass. The AST analysis provides precision (low false positive rate); the LLM layer provides coverage for patterns that are structurally varied but semantically identical.

```python
# Example: Detecting N+1 query patterns using AST analysis
import ast
from dataclasses import dataclass

@dataclass
class AntiPatternHit:
    pattern_name: str
    file_path: str
    line_start: int
    line_end: int
    description: str
    severity: str  # "error", "warning", "info"

class NPlusOneDetector(ast.NodeVisitor):
    """
    Detects database query calls inside loop bodies.
    Covers common ORM patterns (Django, SQLAlchemy, Peewee).
    """
    DB_PATTERNS = {
        "filter", "get", "all", "first", "last",
        "query", "execute", "fetchone", "fetchall",
        "find", "find_one", "find_many"
    }

    def __init__(self):
        self.hits: list[AntiPatternHit] = []
        self._loop_depth = 0

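    def _visit_in_loop(self, nodes):
        # Visit nodes that execute once per iteration
        self._loop_depth += 1
        for n in nodes:
            self.visit(n)
        self._loop_depth -= 1

    def visit_For(self, node: ast.For):
        # The iterable is evaluated once, before the loop starts, so a
        # query there (for u in User.objects.all():) is not an N+1
        self.visit(node.iter)
        self._visit_in_loop(node.body)
        for stmt in node.orelse:
            self.visit(stmt)

    visit_AsyncFor = visit_For

    def visit_While(self, node: ast.While):
        # The condition and the body both run on every iteration
        self._loop_depth += 1
        self.generic_visit(node)
        self._loop_depth -= 1

    def _visit_comprehension(self, node):
        # Only the first generator's iterable is evaluated once
        first, *rest = node.generators
        self.visit(first.iter)
        per_item = [*first.ifs, *rest]
        per_item += [getattr(node, f) for f in ("elt", "key", "value") if hasattr(node, f)]
        self._visit_in_loop(per_item)

    visit_ListComp = _visit_comprehension
    visit_SetComp = _visit_comprehension
    visit_DictComp = _visit_comprehension
    visit_GeneratorExp = _visit_comprehension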

    def visit_Call(self, node: ast.Call):
        if self._loop_depth > 0:
            call_name = self._extract_call_name(node)
            if call_name and call_name.lower() in self.DB_PATTERNS:
                self.hits.append(AntiPatternHit(
                    pattern_name="n_plus_one_query",
                    file_path="",  # set by caller
                    line_start=node.lineno,
                    line_end=node.end_lineno,
                    description=(
                        f"Potential N+1: database call '{call_name}' "
                        f"inside a loop. Consider batching or prefetching."
                    ),
                    severity="warning"
                ))
        self.generic_visit(node)

    def _extract_call_name(self, node: ast.Call) -> str | None:
        if isinstance(node.func, ast.Attribute):
            return node.func.attr
        if isinstance(node.func, ast.Name):
            return node.func.id
        return None
```

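This kind of detector is precise about code shape (a call inside a loop body), but matching on bare method names such as `get` or `all` will also flag non-database calls like `dict.get`, so in practice the name set should be narrowed to the receivers your ORM actually uses. The other limitation is coverage: not every N+1 pattern has the same syntactic form, and patterns specific to less common ORMs may not be covered by an off-the-shelf detector.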

The LLM layer complements the AST layer for patterns that are syntactically varied. Rather than attempting to write AST rules for every variant, you pass suspicious code sections to an LLM with a focused prompt.

```python
def check_semantic_antipatterns(
    code_chunk: str,
    llm_client: LLMClient,
    antipattern_list: list[str]
) -> list[AntiPatternHit]:
    """
    Uses an LLM to identify semantic anti-patterns in code.
    antipattern_list constrains the LLM to known patterns,
    reducing false positive rate compared to open-ended analysis.
    """
    patterns_text = "\n".join(f"- {p}" for p in antipattern_list)
    # Built at runtime so the prompt can contain a code fence without
    # terminating the fence around this listing
    fence = "`" * 3

    prompt = f"""Analyze the following code for anti-patterns.
Only flag issues that clearly match one of these specific patterns:
{patterns_text}

For each issue found, respond with:
PATTERN: <pattern name>
LINE: <line number>
REASON: <one sentence explanation>

If no issues are found, respond with: NONE

Code:
{fence}
{code_chunk}
{fence}"""

    response = llm_client.complete(prompt, max_tokens=500)
    return parse_antipattern_response(response, code_chunk)
```

The key design choice here is constraining the LLM to a defined list of patterns. An unconstrained prompt asking "what is wrong with this code?" produces inconsistent, low-precision output. A constrained prompt asking "does this code exhibit any of these specific patterns?" produces output that is more consistent and more useful.
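
The constraint only helps if the parser enforces it. The listing above leaves `parse_antipattern_response` undefined; the sketch below shows one way such a parser might look for the PATTERN/LINE/REASON format. The `LLMFinding` type and the `parse_findings` signature are hypothetical. The parser discards any finding whose pattern name is not on the allowed list or whose line number falls outside the chunk, which guards against invented patterns and invented locations.

```python
import re
from dataclasses import dataclass


@dataclass
class LLMFinding:
    pattern_name: str
    line: int
    reason: str


# One finding = three consecutive labeled lines, as requested in the prompt
_FINDING_RE = re.compile(
    r"PATTERN:\s*(?P<pattern>.+?)\s*\n"
    r"\s*LINE:\s*(?P<line>\d+)\s*\n"
    r"\s*REASON:\s*(?P<reason>.+)"
)


def parse_findings(
    response: str,
    allowed_patterns: set[str],
    num_lines: int,
) -> list[LLMFinding]:
    """Parse the model's reply, keeping only findings that can be validated."""
    if response.strip().upper() == "NONE":
        return []
    findings = []
    for match in _FINDING_RE.finditer(response):
        name = match.group("pattern").strip()
        line = int(match.group("line"))
        if name not in allowed_patterns:
            continue  # model named a pattern outside the constrained list
        if not 1 <= line <= num_lines:
            continue  # line number does not exist in the chunk
        findings.append(LLMFinding(name, line, match.group("reason").strip()))
    return findings
```

A reply that matches neither `NONE` nor the expected format yields an empty list here. A production version would probably log that case, since it usually means the prompt or the model changed.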

> **Key Insight:** Anti-pattern detection with AI is most reliable when the detection is constrained. An LLM asked to find "any problems" in code will find things — but the signal-to-noise ratio is poor. An LLM asked specifically whether code exhibits one of a defined list of known anti-patterns produces results that are actionable.

### Building a Team-Specific Anti-Pattern Library

Generic anti-patterns from the literature are a starting point. The more valuable library is the one your team builds from your own incidents and review history.

Every time a bug makes it to production, there is a post-incident question: "Was there a code pattern that contributed to this?" If yes, that pattern belongs in your anti-pattern library. Over time, this library becomes a distillation of your codebase's specific failure modes — patterns that are dangerous in your system, with your dependencies, under your load patterns.

The library entry format matters. A useful anti-pattern entry includes:

```yaml
# Anti-pattern library entry format
- name: "cached_result_mutation"
  description: |
    Mutating an object returned from a cache without
    making a copy first. Mutations propagate to other
    callers that have cached the same object.
  detection_hints:
    - Cache retrieval followed immediately by attribute assignment
    - get() or fetch() return value used as left-hand side of assignment
  example: |
    user = cache.get(user_id)
    user.last_seen = datetime.now()  # mutates the cached object
    db.save(user)
  fix: |
    user = copy.deepcopy(cache.get(user_id))
    # or refresh from DB to ensure fresh state
  severity: "error"
  context: "First encountered in incident 2024-11-12. Cache invalidation bug in user service."
```

The `context` field is critical. It explains why this pattern is in the library. Without it, engineers who maintain the library later cannot tell which patterns were added from first principles and which came from painful experience, and cannot prioritize accordingly.

### False Positive Management

Automated anti-pattern detection has a noise problem. Not every match is a real problem. An N+1 detector that flags every database call inside a loop will catch genuine N+1 problems and also flag code that is iterating over a list that was already fetched in a single query, code that is intentionally performing one database call per item because batching is not possible, and code where the loop iterates over a very small fixed set.

If reviewers see too many false positives, they ignore the tool. This is not a hypothetical — it is well-documented in the static analysis literature, and it applies to AI-based detection as well.

Approaches to false positive management:

**Threshold by severity.** Surface only "error" severity findings by default. Require an explicit opt-in for "warning" findings. Do not surface "info" findings in review at all — put those in a separate dashboard or scheduled report.

**Confidence scoring.** If the detection pipeline produces a confidence score, filter to results above a threshold before surfacing. Start with a high threshold (0.85+) and lower it only if reviewers are missing things.

**Suppression annotations.** Provide a mechanism for engineers to annotate code with a suppression comment that tells the tool this is a known exception. This gives engineers control without requiring the tool to be smarter.

```python
# pyckle: ignore=n_plus_one_query - batch API not available for this endpoint
for item_id in item_ids:
    item = db.get_item(item_id)
    process(item)
```

The suppression annotation also functions as documentation. "Batch API not available for this endpoint" is more informative than no comment at all, and it creates a record that this was a deliberate choice rather than an oversight.
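
To honor an annotation like the one above, the tool has to decide which lines it covers. The sketch below assumes the comment sits on its own line and applies to the whole statement that starts on the next line, so a comment above a `for` loop covers the loop body. The regular expression follows the comment format shown above; `Suppression`, `find_suppressions`, and `is_suppressed` are hypothetical names.

```python
import ast
import re
from dataclasses import dataclass

SUPPRESSION_RE = re.compile(
    r"#\s*pyckle:\s*ignore=(?P<pattern>[a-z0-9_]+)(?:\s*-\s*(?P<reason>.+))?"
)


@dataclass
class Suppression:
    pattern: str
    reason: str
    start_line: int
    end_line: int


def find_suppressions(source: str) -> list[Suppression]:
    """Find suppression comments and the statement range each one covers."""
    # Index the outermost statement starting on each line
    # (ast.walk is breadth-first, so outer statements are seen first)
    first_stmt_at: dict[int, ast.stmt] = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.stmt):
            first_stmt_at.setdefault(node.lineno, node)

    found = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        match = SUPPRESSION_RE.match(line.strip())
        stmt = first_stmt_at.get(lineno + 1)
        if match and stmt is not None:
            reason = (match.group("reason") or "").strip()
            found.append(
                Suppression(match.group("pattern"), reason, stmt.lineno, stmt.end_lineno)
            )
    return found


def is_suppressed(pattern_name: str, line: int, suppressions: list[Suppression]) -> bool:
    """True if a finding for pattern_name at line falls under a suppression."""
    return any(
        s.pattern == pattern_name and s.start_line <= line <= s.end_line
        for s in suppressions
    )
```

Keeping the reason as a separate field makes it easy to reject suppressions that give none, or to list every active suppression with its justification in a periodic report.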

> **Warning:** Anti-pattern suppression annotations should be reviewed just like any other code annotation. An engineer who suppresses an "error" severity finding without a good reason is not using the tool correctly, and that suppression should be visible in the PR and reviewable by the team.

### Key Takeaways

1. Anti-pattern detection separates into structural (AST-based, high precision) and semantic (model-based, lower precision, needs human confirmation) categories.
2. Constrained LLM prompts — asking specifically about known patterns — produce more actionable output than open-ended "what's wrong" prompts.
3. A team-specific anti-pattern library built from incident history is more valuable than generic patterns from the literature.
4. False positive management is a prerequisite for adoption — reviewers who see too much noise stop reading the output.
5. Suppression annotations create a visible record of deliberate exceptions, which is better than silence.

> **Try This:** Take the last three production incidents your team has had. For each, write an anti-pattern entry in the format above. Then check whether any of those patterns appear in the current codebase. The result tells you both how useful your existing tools are at detecting these patterns and how often the pattern is currently present.

---

## Chapter 6: Security-Focused Review with AI

Security review occupies a special place in the code review conversation because the failure modes are asymmetric. A false negative in security review — a vulnerability that gets through — can be catastrophically expensive. A false positive — flagging safe code as insecure — costs developer time to investigate. These are not symmetric costs, and any framework for AI-assisted security review has to account for that asymmetry.

The good news is that many of the most common vulnerability classes have recognizable code patterns. The bad news is that recognizing a pattern and correctly classifying it as exploitable requires context that is not always available from the code alone. AI-assisted security review handles the first problem well. It handles the second imperfectly.

### Vulnerability Classes That Are Pattern-Detectable

Not every vulnerability is pattern-detectable. But a meaningful subset of the OWASP Top 10 and common CWE categories is.

**Injection vulnerabilities** — SQL injection, command injection, LDAP injection — all involve building a query or command by concatenating user-controlled input with a query structure. The pattern is: untrusted input flows into a query or command construction without going through a parameterized mechanism.

**Path traversal** involves using user-controlled input to construct a file path. The pattern is: user input used in a file system operation without sanitization.

**Insecure deserialization** involves deserializing data from an untrusted source using a format that allows arbitrary object instantiation. In Python, `pickle.loads()` on untrusted data is the canonical case. In Java, `ObjectInputStream.readObject()` on untrusted data. The pattern is recognizable.

**Hardcoded secrets** — API keys, passwords, private keys in source code — are detectable by pattern matching against the structure of common secret formats (AWS key patterns, JWT structures, high-entropy strings in string assignments).

**Weak cryptography** — use of MD5 or SHA-1 for password hashing, use of ECB mode, use of a fixed IV — involves calling specific functions from cryptographic libraries that are known to be weak.

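For the input-driven classes (injection, path traversal, insecure deserialization), the detection pattern is: identify the function or operation that is unsafe, trace whether its inputs include untrusted data, and flag if they do. Hardcoded secrets and weak cryptography are simpler: the presence of the literal or the call is itself the finding.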

```python
# Simplified taint analysis for injection detection
from dataclasses import dataclass, field
from typing import Optional
import ast

@dataclass
class TaintResult:
    is_tainted: bool
    source: Optional[str] = None
    sink: Optional[str] = None
    path: list[str] = field(default_factory=list)
    confidence: float = 0.0

# Sources of untrusted data
TAINT_SOURCES = {
    # Web framework request inputs
    "request.args", "request.form", "request.json",
    "request.data", "request.cookies", "request.headers",
    # Environment inputs
    "os.environ", "sys.argv",
    # File inputs
    "open", "read", "readline"
}

# Dangerous sinks for injection
SQL_SINKS = {
    "execute", "executemany", "raw", "rawQuery",
    "cursor.execute", "db.execute"
}

def check_sql_injection(func_code: str) -> list[TaintResult]:
    """
    Performs lightweight taint analysis on a function body
    to detect potential SQL injection patterns.

    Returns results with confidence scores to support
    human reviewer decision-making.
    """
    tree = ast.parse(func_code)
    analyzer = SQLInjectionAnalyzer()
    analyzer.visit(tree)
    return analyzer.results
```

The taint analysis approach — tracking where data comes from (source) to where it ends up (sink) — is more precise than simple pattern matching against call sites. Code that calls `cursor.execute()` with a parameterized query is not vulnerable; code that calls `cursor.execute()` with a string built from user input is. Pattern matching without taint analysis produces more false positives.
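
The listing above refers to a `SQLInjectionAnalyzer` that is not shown. As an illustration of the source-to-sink idea, here is a deliberately small sketch. It is intraprocedural and treats a variable as tainted once it has been assigned from any expression that mentions a source. It has no notion of sanitizers, and the source and sink sets are trimmed to a few assumed entries. All names in it are hypothetical.

```python
import ast

SOURCE_PREFIXES = ("request.args", "request.form", "request.json")
SINK_METHODS = {"execute", "executemany"}


def _dotted(node: ast.AST) -> str:
    """Render a Name/Attribute chain such as request.args.get as a string."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return ""


class SQLInjectionSketch(ast.NodeVisitor):
    def __init__(self):
        self.tainted: set[str] = set()
        self.findings: list[tuple[int, str]] = []

    def _is_tainted(self, expr: ast.AST) -> bool:
        for sub in ast.walk(expr):
            if isinstance(sub, ast.Name) and sub.id in self.tainted:
                return True
            if isinstance(sub, ast.Attribute):
                name = _dotted(sub)
                if any(name == p or name.startswith(p + ".") for p in SOURCE_PREFIXES):
                    return True
        return False

    def visit_Assign(self, node: ast.Assign):
        # Propagate taint from the right-hand side to simple name targets
        if self._is_tainted(node.value):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.tainted.add(target.id)
        self.generic_visit(node)

    def visit_Call(self, node: ast.Call):
        is_sink = isinstance(node.func, ast.Attribute) and node.func.attr in SINK_METHODS
        # Only the query text (first argument) matters; bound parameters are safe
        if is_sink and node.args and self._is_tainted(node.args[0]):
            self.findings.append((node.lineno, "untrusted data reaches SQL text"))
        self.generic_visit(node)


def find_sql_injection(func_code: str) -> list[tuple[int, str]]:
    analyzer = SQLInjectionSketch()
    analyzer.visit(ast.parse(func_code))
    return analyzer.findings
```

The sketch inspects only the first argument of the sink, which is what separates a parameterized call (user input passed as a bound parameter) from one where user input is interpolated into the query string.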

> **Key Insight:** Security detection without taint analysis treats every call to a potentially dangerous function as a vulnerability, regardless of whether the inputs are controlled. Taint-aware detection is more precise and reduces the false positive burden on security reviewers.

### Semantic Search for Security Context

Beyond detecting known vulnerability patterns in the current PR, semantic search adds a dimension that static analysis misses: consistency with existing security practices.

Every codebase has established patterns for handling security-sensitive operations. Authentication, authorization checks, input validation, session management — these have been implemented somewhere. The question for a new PR is whether it follows those established patterns.

```python
def check_auth_consistency(
    pr_function: FunctionChunk,
    search_client: SemanticSearchClient
) -> AuthConsistencyResult:
    """
    Checks whether an endpoint or function that handles authenticated
    requests performs authorization checks consistent with existing
    endpoints in the codebase.
    """
    # Find existing endpoints that handle similar operations
    similar_endpoints = search_client.search(
        query=f"endpoint authentication authorization {pr_function.route or pr_function.name}",
        top_k=10,
        filters={"min_similarity": 0.75}
    )

    # Extract authorization patterns from similar endpoints
    auth_patterns = extract_auth_patterns(similar_endpoints)

    # Extract the new function's own patterns, then diff against its peers
    function_auth = extract_auth_patterns([pr_function])

    missing_patterns = auth_patterns - function_auth

    return AuthConsistencyResult(
        has_authorization=bool(function_auth),
        missing_patterns=missing_patterns,
        similar_endpoints=[e.file_path for e in similar_endpoints],
        is_consistent=len(missing_patterns) == 0
    )
```

A new API endpoint that does not perform the same authorization checks as similar existing endpoints is a security concern worth surfacing. The reviewer may have a good reason for the difference — perhaps the endpoint is intentionally public — but that decision should be explicit and reviewed, not implicit and overlooked.
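
What `extract_auth_patterns` does is not specified in the text. For codebases that express authorization through decorators, one simple stand-in is to collect decorator names from a known set. The sketch below works on raw source rather than on search results, and the decorator names in `AUTH_DECORATORS` are invented for illustration.

```python
import ast

# Hypothetical decorator names; replace with the ones your codebase uses
AUTH_DECORATORS = frozenset({"login_required", "require_role", "require_tenant"})


def extract_auth_decorators(
    func_source: str,
    known: frozenset[str] = AUTH_DECORATORS,
) -> set[str]:
    """Return the known authorization decorators applied in func_source."""
    found = set()
    for node in ast.walk(ast.parse(func_source)):
        if not isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            continue
        for decorator in node.decorator_list:
            # @require_role("admin") is a Call wrapping the decorator name
            target = decorator.func if isinstance(decorator, ast.Call) else decorator
            if isinstance(target, ast.Attribute):
                name = target.attr
            elif isinstance(target, ast.Name):
                name = target.id
            else:
                continue
            if name in known:
                found.add(name)
    return found
```

The set difference between the decorators on similar existing endpoints and those on the new one gives the `missing_patterns` value used in the listing above. Authorization enforced inside the function body, or in middleware, is invisible to this approach.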

### LLM-Based Security Analysis

For vulnerability classes that are harder to detect structurally — business logic flaws, authorization bypass, insecure direct object reference — LLM-based analysis can surface concerns that rule-based systems miss.

The prompt design matters significantly here.

```
Security review the following code for authorization and access control issues.

Context: This is an API endpoint in a multi-tenant SaaS application.
Users should only be able to access their own data. Tenant isolation
is enforced by requiring all database queries to include a tenant_id filter.

Code:
[function code here]

For each potential issue found, specify:
1. The specific line(s) involved
2. The vulnerability class (OWASP category if applicable)
3. A concrete attack scenario that would exploit this
4. Your confidence level: LOW, MEDIUM, or HIGH

If you are not confident about a finding, report it at LOW confidence
rather than omitting it. Do not report anything you cannot provide a
concrete attack scenario for.
```

The instruction to provide a concrete attack scenario serves two purposes. It forces the LLM to reason about exploitability rather than just pattern-matching, which improves precision. And it gives the human reviewer the information they need to quickly evaluate whether the concern is real.

The instruction to report low-confidence findings rather than omitting them is deliberate. In security review, a false negative is more costly than a false positive. The reviewer can triage a low-confidence finding in seconds; they cannot un-miss a vulnerability that was never surfaced.

> **Warning:** LLM-based security analysis produces results of highly variable quality depending on the model, the prompt, and the nature of the code. Never treat LLM security findings as authoritative. Treat them as a list of questions to investigate. High-confidence findings should be verified by a human with security expertise before being resolved. Low-confidence findings should be triaged, not dismissed.

### Secrets Detection

Hardcoded secrets are one of the most damaging and most preventable categories of security vulnerability in code review. The combination of high impact and high detectability makes this an area where automated detection should be table stakes.

The detection problem is harder than it appears. Simple pattern matching against known secret formats (AWS key format: `AKIA[0-9A-Z]{16}`) produces reasonable precision but misses secrets that do not match standard formats. High-entropy string detection catches more, but generates more false positives (base64-encoded content that is not a secret, for example).

The practical approach for code review integration:

1. Pattern matching against known high-value secret formats (cloud provider keys, common API key structures, private key headers).
2. High-entropy string detection with false positive reduction (filter out strings that are clearly test data, clearly hashes of content, clearly user-visible strings).
3. Context-aware filtering: report assignments to variables prefixed with `test_`, `mock_`, or `fake_` at a reduced severity.

```python
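import math
import re

# Bits per character. A string of n characters has at most log2(n) bits of
# Shannon entropy per character, so only candidates of 23 or more characters
# can exceed this threshold.
HIGH_ENTROPY_THRESHOLD = 4.5


def looks_like_hash(s: str) -> bool:
    """Illustrative placeholder: hex strings of MD5, SHA-1, or SHA-256 length."""
    return len(s) in (32, 40, 64) and re.fullmatch(r"[0-9a-fA-F]+", s) is not None


def looks_like_b64_content(s: str) -> bool:
    """Illustrative placeholder: very long base64 runs are usually embedded data."""
    return len(s) > 200 and re.fullmatch(r"[A-Za-z0-9+/]+={0,2}", s) is not None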

KNOWN_SECRET_PATTERNS = [
    (r"AKIA[0-9A-Z]{16}", "AWS Access Key ID", "critical"),
    (r"(?i)api[_-]?key[\"'\s]*[:=][\"'\s]*[a-zA-Z0-9_\-]{20,}", "Generic API Key", "high"),
    (r"-----BEGIN (RSA |EC )?PRIVATE KEY-----", "Private Key", "critical"),
    (r"(?i)password[\"'\s]*[:=][\"'\s]*[^\s\"']{8,}", "Hardcoded Password", "high"),
    (r"ghp_[a-zA-Z0-9]{36}", "GitHub Personal Access Token", "critical"),
    (r"sk-[a-zA-Z0-9]{48}", "OpenAI API Key", "critical"),
]

def check_entropy(s: str) -> float:
    """Shannon entropy of a string."""
    if not s:
        return 0.0
    freq = {}
    for c in s:
        freq[c] = freq.get(c, 0) + 1
    return -sum(
        (count / len(s)) * math.log2(count / len(s))
        for count in freq.values()
    )

def scan_for_secrets(file_content: str, file_path: str) -> list[dict]:
    findings = []
    lines = file_content.split("\n")

    for i, line in enumerate(lines, 1):
        # Skip comment-only lines; test files are not filtered here
        stripped = line.strip()
        if stripped.startswith("#") or stripped.startswith("//"):
            continue

        # Check known patterns
        for pattern, label, severity in KNOWN_SECRET_PATTERNS:
            if re.search(pattern, line):
                findings.append({
                    "line": i,
                    "type": label,
                    "severity": severity,
                    "content": line.strip()[:80],
                    "detection_method": "pattern"
                })

        # Check high-entropy strings in assignments
        assignment_match = re.search(r'["\']([a-zA-Z0-9+/=_\-]{20,})["\']', line)
        if assignment_match:
            candidate = assignment_match.group(1)
            if check_entropy(candidate) > HIGH_ENTROPY_THRESHOLD:
                # Reduce false positives for obvious non-secrets
                if not looks_like_hash(candidate) and not looks_like_b64_content(candidate):
                    findings.append({
                        "line": i,
                        "type": "High-Entropy String",
                        "severity": "medium",
                        "content": line.strip()[:80],
                        "detection_method": "entropy"
                    })

    return findings
```
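
The scanner above covers the first two steps of the list but not the third. A minimal sketch of context-aware severity reduction follows; the function name `adjust_severity` and the one-step downgrade policy are assumptions for illustration, not part of any particular tool.

```python
import re

SEVERITY_ORDER = ["low", "medium", "high", "critical"]

# Matches an assignment whose target name starts with test_, mock_, or fake_.
TEST_NAME_ASSIGNMENT = re.compile(
    r"^\s*(?:test|mock|fake)_\w*\s*[:=]", re.IGNORECASE
)


def adjust_severity(line: str, severity: str) -> str:
    """Lower severity by one step when the value is assigned to a test-style name."""
    if TEST_NAME_ASSIGNMENT.search(line):
        index = SEVERITY_ORDER.index(severity)
        return SEVERITY_ORDER[max(index - 1, 0)]
    return severity
```

Downgrading instead of dropping the finding keeps it visible: a variable named `test_key` can still hold a live credential.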

### Key Takeaways

1. Many common vulnerability classes — injection, path traversal, hardcoded secrets, weak cryptography — have recognizable patterns that AI can detect consistently and automatically.
2. Taint analysis improves precision over simple pattern matching by tracking whether unsafe operations receive untrusted inputs.
3. Semantic search enables security consistency checking: new code that deviates from established security patterns gets flagged for review.
4. LLM-based security analysis should require a concrete attack scenario from the model — this improves precision and gives reviewers actionable information.
5. Secrets detection combines pattern matching and entropy analysis; the balance between recall and precision should favor recall in security contexts.

> **Try This:** Run a secrets scan on your codebase right now. Use one of the tools listed in Appendix B or the pattern matching code above. The result is not a score on your security posture. It is a baseline showing what manual review alone has let through. If you find active secrets in the codebase that slipped through review, you have a data point on the cost of not having automated detection.

---

## Chapter 7: Review for Architecture and Dependency Impact

Most code review happens at the function level: does this code do what it is supposed to do, is it well structured, does it handle edge cases? That is the right level of granularity for correctness review. It is the wrong level for architecture review.

Architecture problems are emergent. A function that introduces a circular dependency between two modules looks correct in isolation. A function that couples a business logic layer to an infrastructure concern looks correct in isolation. A new service that replicates functionality that already exists in another service looks correct in isolation. The problem only becomes visible when you look at the broader context — the call graph, the dependency graph, the module boundaries.

That broader context is exactly where AI and semantic search can extend what a human reviewer can see.

### Dependency Graph Analysis

The most tractable form of architecture review is dependency analysis — understanding what a PR's changes do to the dependency relationships between modules, packages, or services.

The inputs to this analysis are the import graphs of the files being changed. A PR that adds an import of module B to module A creates a new dependency edge from A to B. Whether that edge is acceptable depends on the architecture: which direction are dependencies supposed to flow? What are the layering rules? Does this create a cycle?

Automated dependency analysis at review time can surface:

- New dependencies that violate layering rules
- Circular dependencies introduced by the change
- Unusually large changes to a module's dependency footprint
- New dependencies on external packages not previously used

```python
# Dependency graph analysis for code review
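from collections import defaultdict
from dataclasses import dataclass

# file_to_module and find_new_cycles are assumed to be defined elsewhere
# in the review tooling.


@dataclass
class DependencyViolation:
    # Shape inferred from how violations are constructed below.
    module: str
    dependency: str
    violation_type: str
    description: str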

class DependencyAnalyzer:
    def __init__(self, layer_rules: dict[str, list[str]]):
        """
        layer_rules defines which layers a module can import from.
        Example: {"api": ["domain", "utils"], "domain": ["utils"]}
        This enforces that api can import domain and utils,
        but domain cannot import api.
        """
        self.layer_rules = layer_rules

    def analyze_pr(
        self,
        changed_files: list[str],
        current_graph: dict[str, set[str]],
        new_graph: dict[str, set[str]]
    ) -> list[DependencyViolation]:
        violations = []

        for file_path in changed_files:
            module = file_to_module(file_path)
            module_layer = self.get_layer(module)

            if module_layer is None:
                continue

            # Check new dependencies
            old_deps = current_graph.get(module, set())
            new_deps = new_graph.get(module, set())
            added_deps = new_deps - old_deps

            for dep in added_deps:
                dep_layer = self.get_layer(dep)
                if dep_layer and not self.is_allowed(module_layer, dep_layer):
                    violations.append(DependencyViolation(
                        module=module,
                        dependency=dep,
                        violation_type="layer_violation",
                        description=(
                            f"{module_layer} layer ({module}) importing "
                            f"from {dep_layer} layer ({dep}) — "
                            f"violates dependency direction rules."
                        )
                    ))

        # Check for new cycles
        new_cycles = find_new_cycles(current_graph, new_graph, changed_files)
        violations.extend([
            DependencyViolation(
                module=cycle[0],
                dependency=cycle[-1],
                violation_type="circular_dependency",
                description=f"New circular dependency: {' -> '.join(cycle)}"
            )
            for cycle in new_cycles
        ])

        return violations

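    def is_allowed(self, from_layer: str, to_layer: str) -> bool:
        # Imports within a layer are always allowed; cross-layer imports
        # must be listed explicitly in layer_rules.
        if from_layer == to_layer:
            return True
        return to_layer in self.layer_rules.get(from_layer, [])

    def get_layer(self, module: str) -> str | None:
        # Assumes a module's top-level package names its layer,
        # e.g. "domain.orders" belongs to the "domain" layer.
        known_layers = set(self.layer_rules)
        for allowed in self.layer_rules.values():
            known_layers.update(allowed)
        top_level = module.split(".", 1)[0]
        return top_level if top_level in known_layers else None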
```

*Figure 2: A dependency violation — the domain layer importing from the API layer, reversing the intended dependency direction. This is invisible in line-by-line diff review and only visible in the dependency graph.*

The layer rules are team-specific and need to be encoded explicitly. But once encoded, the analysis runs automatically and flags violations before they are merged.
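
The analyzer above relies on a `find_new_cycles` helper that is not shown. One possible shape for it is sketched below under a simplifying assumption: a cycle counts as new only if it is closed by an edge the PR adds. The name `find_cycles_from_new_edges` and its two-argument signature are illustrative, not the helper's actual interface.

```python
from collections import deque


def find_cycles_from_new_edges(
    current_graph: dict[str, set[str]],
    new_graph: dict[str, set[str]],
) -> list[list[str]]:
    """Return one cycle for each newly added edge that closes a loop.

    For every edge src -> dst that exists in new_graph but not in
    current_graph, look for a path from dst back to src in new_graph.
    """
    cycles = []
    for src, deps in new_graph.items():
        for dst in sorted(deps - current_graph.get(src, set())):
            path = _shortest_path(new_graph, dst, src)
            if path is not None:
                cycles.append([src] + path)  # src -> dst -> ... -> src
    return cycles


def _shortest_path(graph, start, goal):
    """Breadth-first search; returns the nodes from start to goal, or None."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in sorted(graph.get(node, set())):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None
```

Restricting the search to added edges means cycles that already existed before the PR are not reported, which keeps the finding attributable to the change under review.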

### Blast Radius Analysis

A different but related question: if this code is merged and something goes wrong with it, how much of the system is affected?

Blast radius analysis starts from the files changed in a PR and traverses the call graph outward: what functions call these changed functions, and what calls those callers, and so on. The result is a set of code paths that could be affected by defects in the changed code.

This serves two purposes in review. First, it tells reviewers how critical their review is — a change with a blast radius that includes the authentication critical path deserves more scrutiny than a change with a blast radius limited to a single utility module. Second, it identifies related tests and integration points that should be exercised before merging.

```python
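from collections import defaultdict

# BlastRadiusResult, get_functions, and is_critical_path are assumed to be
# defined elsewhere in the review tooling.


def compute_blast_radius(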
    changed_files: list[str],
    call_graph: dict[str, set[str]],  # caller -> set of callees
    max_depth: int = 4
) -> BlastRadiusResult:
    """
    Traverses the call graph outward from changed code to identify
    all code paths that could be affected by a defect in the changes.

    Returns both the affected code paths and a severity classification
    based on whether critical system components are in scope.
    """
    # Reverse the call graph to find callers
    reverse_graph = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            reverse_graph[callee].add(caller)

    # Seed with changed functions; changed code is part of its own blast radius
    frontier = {
        f"{file_path}:{func}"
        for file_path in changed_files
        for func in get_functions(file_path)
    }
    affected = set(frontier)
    depth_reached = 0

    # BFS outward through callers, one level of the call graph per iteration
    while frontier and depth_reached < max_depth:
        next_frontier = set()
        for node in frontier:
            next_frontier.update(reverse_graph.get(node, set()) - affected)
        if not next_frontier:
            break
        # Record each level as it is discovered so the last level is not lost
        affected.update(next_frontier)
        frontier = next_frontier
        depth_reached += 1

    # Classify severity based on what's in blast radius
    critical_paths = [
        node for node in affected
        if is_critical_path(node)  # auth, payments, data integrity
    ]

    return BlastRadiusResult(
        affected_nodes=affected,
        critical_paths=critical_paths,
        depth_reached=depth_reached,
        severity="critical" if critical_paths else "normal"
    )
```

> **Key Insight:** Blast radius is not a metric of code quality — it is a metric of review priority. High blast radius does not mean the code is bad; it means the review needs to be more thorough and the testing more comprehensive. Making this visible to reviewers before they start reading the diff changes how they allocate attention.

### Semantic Similarity for Architecture Signals

Beyond graph analysis, semantic search enables architectural review that is harder to formalize. The question: does this PR introduce a new pattern for a problem that is already solved in the codebase?

This is different from the duplication question in Chapter 4. There, the concern was a single function being duplicated. Here, the concern is an architectural pattern — a new way of handling an entire concern — being introduced alongside existing ways.

Examples: a PR that adds a new caching strategy in a codebase that already has two caching strategies. A PR that adds a new approach to background job processing in a codebase with an existing job queue. A PR that adds a new configuration loading mechanism in a codebase with an established configuration system.

Semantic search can surface the existing solutions when a new approach is being introduced.

```python
def check_architectural_novelty(
    pr_summary: str,
    search_client: SemanticSearchClient
) -> list[ArchitecturalConflict]:
    """
    Searches for existing architectural patterns that overlap
    with what the PR introduces. Flags for reviewer attention.

    pr_summary should describe what the PR is implementing,
    not what it changes. E.g., "adding background job queue
    for async email sending" not "changes to mailer.py".
    """
    results = search_client.search(
        query=pr_summary,
        top_k=8,
        filters={"min_similarity": 0.72}
    )

    # Group by architectural category
    by_category = defaultdict(list)
    for result in results:
        category = classify_architectural_category(result)
        by_category[category].append(result)

    conflicts = []
    for category, matches in by_category.items():
        if len(matches) >= 2:
            conflicts.append(ArchitecturalConflict(
                category=category,
                existing_implementations=[m.file_path for m in matches],
                description=(
                    f"PR introduces a new {category} implementation. "
                    f"Existing implementations found in {len(matches)} locations. "
                    f"Consider whether these should be consolidated."
                )
            ))

    return conflicts
```

### Microservice Boundary Analysis

For distributed systems, the architecture review problem extends beyond the codebase to service boundaries. A PR that moves business logic into a new service, or that has one service calling another for functionality that could be local, has architectural implications that extend beyond what any single repository's analysis can see.

This is one area where AI-assisted review has real but limited applicability. Within a repository, dependency analysis is tractable. Across service boundaries, it requires a service mesh with tracing data, a service catalog, or knowledge encoded in annotations or documentation.

The practical approach is to flag PRs that add new inter-service calls for architectural review, then rely on human reviewers with system-level context to evaluate whether those calls are appropriate. Automation provides the flag; human judgment makes the call.
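
The flagging step can be approximated directly from the diff. The sketch below assumes internal services are addressed as `http(s)://<name>.internal` and that the set of service names comes from a service catalog. Both the URL convention and the names in `KNOWN_SERVICES` are hypothetical.

```python
import re

# Hypothetical service names; in practice these would come from a service catalog.
KNOWN_SERVICES = {"billing", "inventory", "notifications"}

# Assumes internal services are addressed as http(s)://<name>.internal
SERVICE_URL = re.compile(r"https?://([a-z0-9-]+)\.internal\b")


def find_added_service_calls(diff_text: str) -> dict[str, set[str]]:
    """Map each changed file to known services referenced on added lines only."""
    added: dict[str, set[str]] = {}
    removed: dict[str, set[str]] = {}
    current_file = None

    for line in diff_text.splitlines():
        if line.startswith("+++ "):
            current_file = line[4:].removeprefix("b/")
            continue
        if line.startswith("--- ") or current_file is None:
            continue
        if line.startswith("+"):
            bucket = added
        elif line.startswith("-"):
            bucket = removed
        else:
            continue
        for name in SERVICE_URL.findall(line):
            if name in KNOWN_SERVICES:
                bucket.setdefault(current_file, set()).add(name)

    # A service that also appears on a removed line was probably moved, not added.
    flagged = {}
    for path, services in added.items():
        new_services = services - removed.get(path, set())
        if new_services:
            flagged[path] = new_services
    return flagged
```

A non-empty result is only a prompt for architectural review. Whether the new call is appropriate remains a human decision.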

> **Warning:** Architecture review cannot be fully automated, and attempting to automate it too aggressively creates friction without proportionate benefit. Automated dependency and blast radius analysis work well. Automated judgment about whether an architectural decision is correct for the system's direction does not. Know where the automation ends and the human judgment begins.

### Key Takeaways

1. Dependency graph analysis catches architecture violations — layering rule breaches, circular dependencies — that are invisible in line-by-line review.
2. Blast radius analysis quantifies the scope of impact from potential defects in a PR, enabling better allocation of reviewer attention.
3. Encoding dependency direction rules explicitly enables automated checking without requiring reviewers to hold architectural rules in memory.
4. Semantic search can surface existing architectural patterns when a PR introduces a new approach to an already-solved problem.
5. Architecture review cannot be fully automated; the value of AI is in surfacing the right questions and signals for human judgment, not in making architectural decisions.

> **Try This:** Draw the intended dependency direction for your codebase's three main layers (API, domain/business logic, infrastructure). Then run an import analysis on the codebase as it currently stands to find violations of those rules. The number of existing violations tells you how consistently your current review process enforces architectural boundaries — and gives you a baseline for measuring improvement.

---

## Chapter 8: Building a Hybrid Review Process

The previous chapters have described capabilities — what semantic search can do, what AI pattern detection can do, what dependency analysis can do. This chapter is about how to assemble those capabilities into a process that actually gets used.

A process that is not used is not a process. It is documentation. The failure mode for AI-assisted review tooling is not technical — it is adoption. Teams adopt processes that fit into their workflow and produce visible value quickly. They do not adopt processes that add friction, produce noise, or require significant behavior change for unclear benefit.

The design principle for a hybrid review process: AI handles what it is good at, humans handle what they are good at, and the handoff between them is explicit and well-designed.

### The Three-Phase Review Model

An effective hybrid process has three distinct phases, each with different tooling and human responsibilities.

**Phase 1: Automated pre-review.** Before any human reviews the PR, automated analysis runs and produces a structured pre-review report. This report covers: secrets scanning, anti-pattern detection, dependency violations, blast radius classification, and semantically similar implementations in the codebase. The report does not block the PR — it informs the reviewer.

**Phase 2: Human review with context.** The reviewer opens the pre-review report before opening the diff. They understand what patterns the automated system flagged, what similar code exists in the codebase, and what the blast radius of the change is. Then they review the diff with that context active.

**Phase 3: Post-review synthesis.** After the review is complete and before merge, a final automated check confirms that flagged items from Phase 1 were either addressed or explicitly acknowledged. This is a lightweight gate — it does not require resolution of every automated finding, only acknowledgment that each one was considered.

```yaml
# Example CI/CD integration for hybrid review process
# .github/workflows/pr-review.yml

name: AI-Assisted Review

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  pre-review-analysis:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      pull-requests: write  # needed to post the report as a PR comment
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Need full history for context

      - name: Run secrets scan
        run: pyckle scan-secrets --format json > /tmp/secrets.json

      - name: Run anti-pattern detection
        run: pyckle scan-patterns --config .pyckle/patterns.yaml --format json > /tmp/patterns.json

      - name: Run dependency analysis
        run: pyckle analyze-deps --rules .pyckle/layer-rules.yaml --format json > /tmp/deps.json

      - name: Run blast radius analysis
        run: pyckle blast-radius --changed-files $(git diff --name-only origin/main) --format json > /tmp/blast.json

      - name: Semantic similarity search
        run: pyckle find-similar --pr-diff "$(git diff origin/main)" --format json > /tmp/similar.json

      - name: Generate pre-review report
        run: |
          pyckle generate-report \
            --secrets /tmp/secrets.json \
            --patterns /tmp/patterns.json \
            --deps /tmp/deps.json \
            --blast /tmp/blast.json \
            --similar /tmp/similar.json \
            --output pre-review-report.md

      - name: Post report as PR comment
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('pre-review-report.md', 'utf8');
            await github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: report
            });
```
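
The workflow above covers Phase 1 only. The Phase 3 gate could be sketched as follows, assuming each Phase 1 finding carries a stable identifier and that reviewers record a disposition for it. The `Finding` type, the severity labels, and the disposition names are illustrative assumptions.

```python
from dataclasses import dataclass

# Dispositions a reviewer can record for a finding (assumed labels).
VALID_DISPOSITIONS = {"fixed", "suppressed", "accepted_risk"}


@dataclass(frozen=True)
class Finding:
    finding_id: str
    severity: str  # "critical", "warning", or "context"


def unacknowledged_findings(
    findings: list[Finding],
    acknowledgments: dict[str, str],
) -> list[Finding]:
    """Return actionable findings that have no recorded disposition.

    Context-level findings are informational and never block the gate.
    """
    return [
        finding for finding in findings
        if finding.severity != "context"
        and acknowledgments.get(finding.finding_id) not in VALID_DISPOSITIONS
    ]


def phase3_gate_passes(
    findings: list[Finding],
    acknowledgments: dict[str, str],
) -> bool:
    """The gate passes once every actionable finding has been considered."""
    return not unacknowledged_findings(findings, acknowledgments)
```

The gate checks for acknowledgment, not resolution: an `accepted_risk` disposition passes just as a fix does, which matches the lightweight intent of Phase 3.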

### Designing the Pre-Review Report

The pre-review report format has a direct impact on how useful it is to reviewers. A report that dumps every finding with equal prominence trains reviewers to ignore it. A report structured around severity and reviewer decision points creates a productive starting point.

A useful report format:

```markdown
## Pre-Review Analysis

**Blast Radius:** High — changes touch authentication path
**Review Priority:** Required before merge

---

### 🔴 Critical Findings (1)
Require resolution or explicit sign-off

1. **Hardcoded API Key** — `src/integrations/stripe.py:47`
    ```python
    STRIPE_KEY = "sk_live_2jRk..."
    ```
   Move to environment variable. Do not merge with this in place.

---

### 🟡 Warnings (2)
Warrant reviewer consideration

1. **Potential N+1 Query** — `src/api/orders.py:89`
   Loop over `order_items` with DB call inside.
   *If item count is bounded and small, this is acceptable — annotate with `# pyckle: ignore` if intentional.*

2. **Similar Implementation Found** — `src/utils/validators.py:34`
   `validate_email()` exists at similarity 0.91.
   Consider reusing or consolidating.

---

### 🔵 Context (no action required)
2 related functions found in codebase for reviewer awareness:
- `src/auth/session.py:112` — handles similar token validation logic
- `src/api/auth.py:67` — existing authorization pattern for this endpoint type
```

This format gives reviewers immediate clarity: what requires action, what is worth considering, and what is informational. Reviewers can scan the 🔴 items first, decide whether to investigate the 🟡 items, and use the 🔵 items as context while reading the diff.
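
A minimal sketch of how such a report might be assembled from raw findings is shown below. The finding fields (`severity`, `title`, `location`) and the function name `render_report` are assumptions for illustration, and the output is a simplified version of the format above.

```python
# (severity key, section heading, optional note shown under the heading)
SECTIONS = [
    ("critical", "Critical Findings", "Require resolution or explicit sign-off"),
    ("warning", "Warnings", "Warrant reviewer consideration"),
    ("context", "Context (no action required)", None),
]


def render_report(findings: list[dict]) -> str:
    """Render findings as markdown, most severe section first."""
    lines = ["## Pre-Review Analysis", ""]
    for severity, heading, note in SECTIONS:
        group = [f for f in findings if f["severity"] == severity]
        if not group:
            continue  # empty sections are omitted to keep the report short
        lines.append(f"### {heading} ({len(group)})")
        if note:
            lines.append(note)
        lines.append("")
        for number, finding in enumerate(group, 1):
            lines.append(
                f"{number}. **{finding['title']}** at `{finding['location']}`"
            )
        lines.append("")
    return "\n".join(lines)
```

Fixing the section order in one place means the most severe findings always appear first, regardless of the order in which the individual analyzers report them.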

> **Key Insight:** The pre-review report is the primary interface between the automated analysis and the human reviewer. Spend as much design effort on the report format as on the detection capabilities that feed it. Noise in the report degrades the entire process.

### Managing the Rollout

Introducing AI-assisted review tooling into an established team is a change management problem as much as a technical one. A few principles that hold across teams:

**Start with informational-only findings.** Do not block PRs or require acknowledgment of automated findings in the first month. Let teams observe what the system surfaces, calibrate their trust in it, and give feedback on false positive rates before you make any findings required.

**Calibrate publicly.** When the system surfaces a finding and a reviewer says "that's not actually a problem," log it. When the system surfaces a finding that does turn out to be a bug or a security issue, log it too. Share both statistics with the team. Teams that see concrete evidence that the tool catches real things — and an honest accounting of when it doesn't — develop calibrated trust rather than either blind trust or dismissal.

**Let teams own their patterns library.** The anti-pattern library is most valuable when it is built from the team's own experience. Give each team the ability to add patterns, suppress findings, and tune sensitivity. A centrally mandated configuration that does not match team context produces friction and disengagement.

**Integrate with existing review workflows.** If the team uses GitHub PRs, post findings as PR comments. If they use GitLab, use the GitLab API. Do not require engineers to open a separate dashboard to see automated findings. Findings that require context switching have much lower adoption rates than findings that appear in the existing review interface.

> **Warning:** Automated review findings that block PRs without a human override option create adversarial dynamics. Engineers will find ways to suppress or bypass the checks rather than engage with them. Reserve hard blocks for the highest-severity, highest-confidence findings only (e.g., hardcoded secrets, known critical vulnerabilities). Everything else should be reviewable and overridable.

### Roles in the Hybrid Process

A well-designed hybrid process has explicit role assignments. The AI tooling is not a reviewer — it is a pre-analysis step. The human reviewer is still accountable for the review. The distinction matters because it affects how reviewers engage with automated findings.

**AI tooling responsibilities:**
- Run automated analysis (secrets, anti-patterns, dependencies, blast radius, similarity)
- Produce a structured pre-review report
- Surface relevant similar code and prior implementations
- Flag items for reviewer consideration

**Human reviewer responsibilities:**
- Read the pre-review report and load context before reviewing the diff
- Investigate flagged items and reach a conclusion (fix, suppress, or accept risk)
- Evaluate intent and correctness — things AI cannot assess
- Make design and architectural judgments
- Approve or reject the PR

This separation of concerns prevents the most common failure mode of AI-assisted review: reviewers who treat automated findings as authoritative and approve PRs without actually reviewing the code.

### Key Takeaways

1. The three-phase model — automated pre-review, human review with context, post-review synthesis — provides a structure that gets used because it fits existing workflow patterns.
2. Pre-review report format is as important as the detection capabilities that feed it; high-noise reports train reviewers to ignore them.
3. Rollout should be informational-only initially, with calibration data shared publicly to build appropriate trust in the tooling.
4. Hard blocks on PRs should be reserved for the highest-severity, highest-confidence findings; everything else should be overridable.
5. The AI tooling is a pre-analysis step, not a reviewer. The human reviewer retains accountability for the review.

> **Try This:** Draft a pre-review report format for your team's most common types of PRs. Include the categories of findings you would want to surface and how you would present them. Share the draft with two or three engineers on your team and ask: "Would you read this before every review? What would make you trust it more?" Their answers will shape your actual implementation more than any generic recommendation.

---

## Chapter 9: Metrics: Measuring Review Quality

The most common objection to investing in review quality tooling is that it is hard to measure whether things get better. PRs either get approved and code ships, or they get rejected and revisited. The defects that escaped review show up later — sometimes much later — and by then the connection to the review process is hard to trace.

This ambiguity is real, but it is not an excuse for not measuring. The absence of good metrics has historically meant that code review was judged by proxy metrics — review time, comments per PR, lines reviewed per hour — that do not actually correlate with what matters. The introduction of AI-assisted tooling is an opportunity to define and track metrics that do.

### The Metrics That Actually Matter

Review quality metrics fall into three categories: process metrics, outcome metrics, and quality signal metrics.

**Process metrics** measure what the review process is doing. They are leading indicators.

- **Review coverage rate:** What percentage of merged code went through review? This should be close to 100% for most codebases. If it is not, the review process is not the quality gate it is supposed to be.
- **Time from PR open to first substantive review:** Not just time to any review comment, but time to the first comment that indicates the reviewer actually read the code. Long times indicate bottlenecks that may cause developers to merge without review.
- **Review depth per PR size:** Average number of substantive comments relative to PR size (in lines or functions changed). If large PRs consistently receive fewer comments per unit of code than small PRs, review fatigue is occurring.
- **Automated finding acknowledgment rate:** What percentage of automated pre-review findings were explicitly acknowledged (addressed or suppressed) before merge? Low rates indicate reviewers are not reading the pre-review report.

**Outcome metrics** measure what happens after code ships. They are lagging indicators.

- **Defect escape rate:** Defects found post-merge as a percentage of all defects (pre-merge + post-merge). The denominator requires a good tracking system; you need to capture the bugs that code review did catch, not just the ones that escaped.
- **Security finding post-merge rate:** Security vulnerabilities found in production or in security audits that were present in code that passed review. This is the metric that makes the case for security-focused review investment.
- **Repeat defect rate:** How often does a defect class that escaped review once escape again? High repeat rates indicate the review process is not updating in response to failure.
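
The first and third of these can be computed from simple counts. The sketch below is one possible operationalization. In particular, treating a defect as a "repeat" when its class has already escaped earlier in the list is an assumption, and the defect class labels are whatever taxonomy the team already uses.

```python
def defect_escape_rate(caught_pre_merge: int, found_post_merge: int) -> float:
    """Post-merge defects as a fraction of all known defects."""
    total = caught_pre_merge + found_post_merge
    return found_post_merge / total if total else 0.0


def repeat_defect_rate(escaped_defect_classes: list[str]) -> float:
    """Fraction of escaped defects whose class had already escaped before.

    escaped_defect_classes holds one class label per escaped defect,
    in the order the defects were discovered.
    """
    if not escaped_defect_classes:
        return 0.0
    seen = set()
    repeats = 0
    for defect_class in escaped_defect_classes:
        if defect_class in seen:
            repeats += 1
        seen.add(defect_class)
    return repeats / len(escaped_defect_classes)
```

For example, 30 defects caught in review and 10 found after merge give an escape rate of 10 / 40 = 0.25.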

**Quality signal metrics** measure whether the AI tooling is contributing useful signal.

- **True positive rate of automated findings:** For findings that reviewers investigate, what percentage turn out to be genuine concerns? Tracked over time and by finding category.
- **Suppression rate by category:** What percentage of automated findings are suppressed without being addressed? High suppression rates for a category indicate either poor detection quality or poor calibration of the threshold.
- **Findings-that-became-bugs rate:** Among automated findings that were suppressed or ignored, how many showed up later as bugs? This is the metric that validates the detection capabilities most directly.
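
The first two signals can be derived from a log of reviewer dispositions. The sketch below assumes each logged finding records its category and three booleans (`investigated`, `genuine`, `suppressed`); these field names are hypothetical. Note that the true positive rate as defined here is computed over investigated findings only, which makes it a measure of precision.

```python
from collections import defaultdict
from typing import Optional


def finding_quality_by_category(
    findings: list[dict],
) -> dict[str, dict[str, Optional[float]]]:
    """Compute true positive rate and suppression rate per finding category."""
    grouped = defaultdict(list)
    for finding in findings:
        grouped[finding["category"]].append(finding)

    summary = {}
    for category, group in grouped.items():
        investigated = [f for f in group if f["investigated"]]
        genuine = [f for f in investigated if f["genuine"]]
        suppressed = [f for f in group if f["suppressed"]]
        summary[category] = {
            # None when nothing was investigated: the rate is unknown, not zero
            "true_positive_rate": (
                len(genuine) / len(investigated) if investigated else None
            ),
            "suppression_rate": len(suppressed) / len(group),
        }
    return summary
```

Reporting `None` instead of zero for categories nobody investigated avoids penalizing a detector for findings that were simply never looked at.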

> **Key Insight:** The most important metric to track is not how many findings the AI system produces but the true positive rate of those findings. A system that produces 50 findings per PR with a 10% true positive rate (5 real issues buried among 45 false alarms) is less useful than a system that produces 5 findings with an 80% true positive rate (4 real issues and 1 false alarm). Calibration matters more than coverage.

### Establishing a Baseline

Before making any changes to the review process, establish a baseline for the metrics you intend to track. This is non-negotiable. Without a baseline, you cannot know whether your changes had any effect.

The baseline measurement period should be long enough to capture natural variation — at least eight weeks, ideally twelve. Code review effectiveness varies by sprint cycle, by team composition, and by the nature of the code being reviewed. A baseline that captures one sprint may not represent typical conditions.

For outcome metrics, the baseline period needs to extend backward in time — you need historical data on defect escape rates before you start measuring the impact of new tooling. Pull data from your issue tracker, your incident log, and your post-mortems. Map bugs and incidents back to the commits that introduced them and calculate how many were introduced in code that went through review.

This historical analysis often produces surprising findings. Teams that believe their review process is catching most defects frequently discover that a significant fraction of post-merge bugs were introduced in reviewed code. That finding is uncomfortable but useful — it establishes the actual state of the system before improvement work begins.

### Instrumenting the Review Process

Capturing these metrics requires instrumentation. The good news is that most version control and issue tracking platforms expose enough data through their APIs to compute the metrics described above. The investment is in building the pipeline, not in capturing the raw data.

```python
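# Example: Computing review depth metrics from the GitHub REST API.
# get_merged_prs, get_review_comments, and compute_time_to_first_review
# are referenced below but not defined in this excerpt.
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

import httpx


@dataclass
class PRReviewMetrics:
    """Per-PR review depth record (field types are an assumed shape)."""
    pr_number: int
    pr_size_lines: int
    total_comments: int
    substantive_comments: int
    comments_per_100_lines: float
    time_to_first_review: Optional[timedelta]  # None if never reviewed
    merged_at: str  # ISO 8601 timestamp as returned by the API

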
class ReviewMetricsCollector:
    def __init__(self, github_token: str, repo: str):
        self.client = httpx.Client(
            base_url="https://api.github.com",
            headers={"Authorization": f"token {github_token}"}
        )
        self.repo = repo

    def get_review_depth_metrics(
        self,
        start_date: datetime,
        end_date: datetime
    ) -> list[PRReviewMetrics]:
        prs = self.get_merged_prs(start_date, end_date)
        metrics = []

        for pr in prs:
            comments = self.get_review_comments(pr["number"])
            substantive_comments = [
                c for c in comments
                if self.is_substantive(c["body"])
            ]

            # additions/deletions are returned by the single-PR endpoint;
            # the PR list endpoint omits them.
            pr_size = pr["additions"] + pr["deletions"]

            metrics.append(PRReviewMetrics(
                pr_number=pr["number"],
                pr_size_lines=pr_size,
                total_comments=len(comments),
                substantive_comments=len(substantive_comments),
                comments_per_100_lines=(
                    len(substantive_comments) / max(pr_size, 1) * 100
                ),
                time_to_first_review=self.compute_time_to_first_review(pr),
                merged_at=pr["merged_at"]
            ))

        return metrics

    def is_substantive(self, comment_body: str) -> bool:
        """
        Heuristic: a substantive comment has at least 30 characters
        and does not open with a simple approval (LGTM, looks good, +1, etc.).

        Known limitation: a comment that opens with an approval phrase and
        then raises a real concern is also counted as non-substantive.
        """
        if len(comment_body) < 30:
            return False
        approval_patterns = ("lgtm", "looks good", "ship it", "+1", "approved")
        lower = comment_body.lower().strip()
        # str.startswith accepts a tuple of prefixes.
        return not lower.startswith(approval_patterns)
```
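
Per-PR records only become a baseline once they are rolled up over time. As a rough sketch, the snippet below groups review depth by ISO week and takes the median, which is less sensitive than the mean to one unusually contentious PR. It works on plain `(merged_at, comments_per_100_lines)` pairs so it stands alone; the function name and sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from statistics import median


def weekly_median_depth(rows: list[tuple[str, float]]) -> dict[str, float]:
    """Group (merged_at, comments_per_100_lines) rows by ISO week; take the median."""
    by_week: dict[str, list[float]] = defaultdict(list)
    for merged_at, depth in rows:
        # GitHub timestamps end in "Z"; normalise so fromisoformat
        # also accepts them on older Python versions.
        ts = datetime.fromisoformat(merged_at.replace("Z", "+00:00"))
        year, week, _ = ts.isocalendar()
        by_week[f"{year}-W{week:02d}"].append(depth)
    return {wk: median(vals) for wk, vals in sorted(by_week.items())}


# Hypothetical sample: three PRs merged in one week, two in the next.
rows = [
    ("2026-03-02T10:15:00Z", 1.2),
    ("2026-03-04T16:40:00Z", 0.4),
    ("2026-03-05T09:05:00Z", 2.0),
    ("2026-03-10T11:30:00Z", 0.8),
    ("2026-03-12T14:00:00Z", 1.0),
]
baseline = weekly_median_depth(rows)
```

Eight to twelve of these weekly values give you both a central estimate and a feel for week-to-week variation, which is what you need before you can tell a real change from noise.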

### Interpreting Metric Changes

Metrics change in response to many things, not just tooling changes. Team composition shifts, the nature of the work changes, and sprint pressure rises and falls. Attributing metric changes cleanly to tooling is difficult.

The most rigorous approach is a controlled experiment: enable the new tooling for some teams and not others, measure both groups over the same period, and compare outcomes. This is feasible at organizations with multiple teams working in similar codebases. At smaller organizations, a before/after comparison is usually the only option, with the caveat that other factors may confound the result.
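
When you do have two comparable groups, a simple significance check helps you avoid reading too much into a small difference. The sketch below uses a two-proportion z-test, a standard technique rather than anything specific to review tooling, implemented with the standard library only. The counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(
    x1: int, n1: int, x2: int, n2: int
) -> tuple[float, float]:
    """Return (z, two-sided p-value) for H0: both groups share one rate."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled rate is 0 or 1; the test is undefined")
    z = (p1 - p2) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical counts: merged PRs that later produced an escaped defect.
control_escaped, control_prs = 30, 400  # teams without the tooling
treated_escaped, treated_prs = 18, 400  # teams with the tooling
z, p = two_proportion_z_test(
    control_escaped, control_prs, treated_escaped, treated_prs
)
```

With these numbers the escape rate drops from 7.5% to 4.5%, yet the p-value lands a little above the conventional 0.05 cutoff. That is typical: escaped defects are rare events, so even a large relative improvement needs many PRs before it is statistically distinguishable from chance. Treat the test as a rough screen, since PRs from the same team are not truly independent observations.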

The metrics most sensitive to review tooling changes, and thus most useful for evaluation, are:

- **Automated finding acknowledgment rate** — this should increase immediately when tooling is introduced, if reviewers are reading the pre-review reports
- **True positive rate of automated findings** — this should stabilize after an initial calibration period; if it trends downward over time, the anti-pattern library needs maintenance
- **Repeat defect rate** — this should decrease as the library incorporates patterns from past incidents; a flat or increasing repeat defect rate suggests the library is not being maintained
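
A downward trend in true positive rate is easier to act on if it is detected mechanically instead of by eyeballing a dashboard. One minimal way to do that is to fit a least-squares slope to the weekly values, as sketched below. The weekly figures and the alert threshold of one percentage point per week are assumptions chosen for illustration.

```python
def weekly_slope(values: list[float]) -> float:
    """Least-squares slope of a metric against week index (0, 1, 2, ...)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two weekly observations")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    cov = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return cov / var


# Hypothetical weekly true positive rates after the calibration period.
tpr_by_week = [0.78, 0.80, 0.77, 0.74, 0.72, 0.69, 0.66, 0.64]
slope = weekly_slope(tpr_by_week)

# Assumed threshold: flag the library for maintenance when the rate
# is falling by more than one percentage point per week.
needs_maintenance = slope < -0.01
```

Here the rate is falling by a little over two points per week, so the check fires. The same function works for repeat defect rate, with the sign of the threshold reversed.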

> **Warning:** Do not tie review quality metrics to individual reviewer performance evaluations. When metrics are attached to performance evaluations, reviewers optimize for the metrics rather than for review quality. Comment count goes up; comment quality does not. Review time decreases; defect escape rate does not improve. This is Goodhart's Law applied to code review, and it is predictable.

### The Long-Term Feedback Loop

The most valuable thing about establishing metrics early is what they enable over time. With a solid baseline and consistent measurement, you can run experiments. Does adding blast radius information to the pre-review report change review depth on high-blast-radius PRs? Does a new anti-pattern in the library actually reduce recurrence of that pattern? Does moving from informational findings to required acknowledgment change anything?

These are empirical questions. With metrics, they have empirical answers. Without metrics, you are relying on intuition about whether anything is working.

The feedback loop from metrics to process improvement is the mechanism that turns AI-assisted review from a one-time tooling investment into a continuously improving system. The anti-pattern library grows from incident analysis. The similarity thresholds are tuned based on true positive rates. The pre-review report format evolves based on what reviewers actually read and act on.
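
Threshold tuning is the most mechanical part of that loop. As a sketch, suppose reviewers have labeled a sample of past similarity-based findings as real or not. You can then pick the lowest similarity cutoff at which the surviving findings still meet a target true positive rate. The function name, the labeled sample, and the 80% target are all hypothetical.

```python
from typing import Optional


def lowest_threshold_meeting_target(
    labeled: list[tuple[float, bool]], target_tpr: float
) -> Optional[float]:
    """Lowest cutoff whose findings (score >= cutoff) meet the target rate."""
    # Walk from the highest score down, tracking the running rate.
    ranked = sorted(labeled, key=lambda pair: pair[0], reverse=True)
    best = None
    hits = 0
    for i, (score, is_real) in enumerate(ranked):
        hits += is_real
        # Only evaluate once every finding with this score is counted.
        last_of_tie = i + 1 == len(ranked) or ranked[i + 1][0] != score
        if last_of_tie and hits / (i + 1) >= target_tpr:
            best = score
    return best


# Hypothetical reviewer labels: (similarity score, was it a real issue?)
labeled = [
    (0.95, True), (0.91, True), (0.88, True), (0.84, False),
    (0.82, True), (0.79, False), (0.75, False), (0.71, True),
]
threshold = lowest_threshold_meeting_target(labeled, target_tpr=0.80)
```

On this sample the cutoff comes out at 0.82: five findings survive and four of them are real. Choosing the lowest qualifying cutoff keeps as much coverage as the target allows. Rerun the selection whenever a new batch of labels arrives, because the right cutoff drifts as the codebase and the embedding index change.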

Teams that measure improve. Teams that do not measure plateau.

### Key Takeaways

1. Review quality metrics fall into process metrics (leading indicators), outcome metrics (lagging indicators), and quality signal metrics (measuring the AI tooling specifically).
2. Establishing a baseline before making changes is non-negotiable — without it, you cannot measure whether anything improved.
3. True positive rate of automated findings is more important than total finding count — calibration matters more than coverage.
4. Do not tie review quality metrics to individual performance evaluations; reviewers then optimize for the metric rather than for review quality.
5. The measurement infrastructure enables experiments and continuous improvement over time — this is the long-term payoff of the investment.

> **Try This:** Define three metrics you would use to measure whether AI-assisted review is improving your team's process. For each, specify: what data source you would use, how you would compute the baseline, and what change in the metric would indicate success. The act of defining measurable success criteria often clarifies what the goal of the improvement actually is — which is worth knowing before you start building.

---

## Conclusion

The case for AI-assisted code review is not that AI reviews code better than humans. It does not. The case is that review quality degrades predictably under real conditions — time pressure, context loss, scale — and that AI can address those specific failure modes in ways that are practical to implement now.

What changes when AI enters the review loop: the reviewer starts with context they did not have to manually assemble. They see similar existing code they would not have thought to look for. They receive consistent scrutiny of patterns their current process misses. They have blast radius information before they start reading the diff. They do not have to rely entirely on memory or on having the right person available at the right time.

None of this removes the need for reviewers who understand the code, the business context, and the architectural direction. It does mean those reviewers spend more of their attention on the things only they can evaluate, and less on the things that can be handled mechanically.

The practical path forward is not to wait for the tooling to be perfect before starting. Build the baseline measurements now. Introduce automated pre-review reports as informational-only, calibrate, and expand scope as trust develops. Build the anti-pattern library from your own incident history — that library is specific to your codebase in a way no generic tool can replicate.

Code review is one of the highest-leverage activities in software development. It is also one of the most inconsistently executed. The gap between what review could be and what it usually is — in most teams, most of the time — is wide enough to justify the investment. The tools to close that gap are available.

---

## Appendix A: Glossary

**Anti-pattern library** — A curated, team-specific collection of code patterns identified as problematic, typically built from incident analysis and code review history. Serves as the input to automated anti-pattern detection.

**AST (Abstract Syntax Tree)** — A tree representation of the structure of source code, used in analysis tools to inspect code structure without executing it. AST-based analysis is language-specific and provides high-precision pattern detection.

**Blast radius** — The set of code paths potentially affected by a defect introduced in a specific change. Computed by traversing the call graph outward from changed code. Used to classify review priority.

**BM25** — A text ranking function used in information retrieval, based on term frequency and inverse document frequency. Used in hybrid search systems alongside semantic embeddings.

**Chunk** — A unit of code (typically a function, class method, or block) used as the basic unit of indexing in a semantic search system.

**Cosine similarity** — A measure of similarity between two vectors, computed as the cosine of the angle between them. Used to compare embedding vectors in semantic search.

**Defect escape rate** — The fraction of defects that make it past code review to post-merge discovery. A key outcome metric for review quality.

**Embedding** — A dense vector representation of a piece of code or text, produced by a model trained to represent semantic similarity as geometric proximity. Semantically similar items produce vectors that are close together in embedding space.

**Hybrid search** — A retrieval approach combining semantic embeddings with keyword-based scoring (typically BM25), fusing results to get the benefits of both approaches.

**N+1 query problem** — A performance anti-pattern where code makes one database query to retrieve a list of N items, then makes N additional queries to retrieve associated data for each item individually. Usually addressable by prefetching or batching.

**Pre-review report** — A structured document produced by automated analysis before human review begins. Summarizes automated findings, similar code, blast radius, and other context to inform the human reviewer.

**Semantic search** — Search that finds results by conceptual similarity rather than exact token matching. Implemented using embedding models that represent code as dense vectors.

**Taint analysis** — A program analysis technique that tracks the flow of untrusted data (from sources such as user input) through a program to potentially dangerous operations (sinks such as SQL query execution). Used to distinguish real injection vulnerabilities from false pattern matches.

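**True positive rate** — For a detection system, the fraction of findings that represent genuine concerns. In standard classification terminology this quantity is called precision; this guide uses "true positive rate" in that sense. The key quality metric for automated review tooling.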

**Vector centroid** — The average of a set of vectors. Used to represent a pattern by averaging the embeddings of known instances, enabling consistency scoring for new code.

---

## Appendix B: Tools & Resources

### Semantic Code Search

**Pyckle** — Semantic code search and review tooling built on hybrid search (BM25 + embeddings). Provides codebase indexing, PR pre-review analysis, anti-pattern detection, and blast radius analysis. Designed for integration with CI/CD workflows.

**Sourcegraph** — Code intelligence platform with semantic search capabilities. Strong integration with most version control systems. Useful for cross-repository search in larger organizations.

**GitHub Copilot for PRs** — GitHub's built-in AI review tooling. Limited to GitHub-hosted repositories; provides natural language PR summaries and some automated review suggestions.

**Codeium** — IDE-integrated code search with semantic capabilities. Particularly useful for individual developer context during review.

### Anti-Pattern and Security Detection

**Semgrep** — Static analysis tool supporting custom rule patterns using a high-level rule syntax. Strong for structural anti-pattern detection. Supports most major languages. Open-source core with commercial features.

**Bandit** — Python-specific security linter. Covers common Python security patterns including SQL injection, command injection, insecure deserialization, and hardcoded secrets.

**CodeQL** — GitHub's semantic analysis engine. Supports taint analysis for injection vulnerability detection. More setup overhead than simpler tools but higher precision for complex vulnerability patterns.

**Gitleaks** — Secrets scanning tool. Pattern library covers most common secret formats. Can be integrated into pre-commit hooks or CI pipelines.

**Trufflehog** — Secrets scanning with entropy analysis. Covers more secret formats than simple pattern matching and produces fewer false positives than entropy-only approaches.

### Dependency and Architecture Analysis

**Dep-check** — Multi-language dependency vulnerability scanner. Checks package dependencies against known CVE databases.

**py-dep-graph** / **dependency-cruiser** — Dependency visualization and analysis tools for Python and JavaScript/TypeScript respectively. Support encoding and checking architectural rules.

**Snyk** — Developer security platform with dependency vulnerability scanning and license compliance checking. Strong CI/CD integration.

### Metrics and Instrumentation

**GitHub Advanced Security** — Provides code scanning, secret scanning, and dependency review integrated into GitHub PRs. Metric dashboards for security findings over time.

**Codecov / Coveralls** — Test coverage tracking with PR-level integration. Useful for tracking whether test coverage is maintained as the codebase grows.

**SonarQube / SonarCloud** — Code quality platform with long-term metric tracking. Useful for trend analysis on technical debt, duplications, and code coverage.

---

## Appendix C: Further Reading

### Code Review Fundamentals

**"Best Kept Secrets of Peer Code Review"** — SmartBear Software. The most-cited empirical study of code review effectiveness. Includes the Cisco research on review throughput and defect detection rates. Foundational for understanding what review actually does and where it breaks down.

**"Code Complete"** — Steve McConnell. Chapter 21 covers collaborative construction including code review in depth. The empirical grounding in this chapter holds up well despite the book's age.

**"Accelerate: The Science of Lean Software and DevOps"** — Forsgren, Humble, Kim. Research on which engineering practices actually correlate with delivery performance. Code review appears as a component of trunk-based development and continuous integration practices. Relevant for teams measuring the impact of review process improvements.

### Retrieval and Semantic Search

**"Pretrained Transformers for Text Ranking: BERT and Beyond"** — Lin, Nogueira, Yates. Academic survey of transformer-based retrieval methods. Covers dense retrieval, hybrid methods, and reranking. Technical but accessible for engineers building retrieval systems.

**"Neural Information Retrieval at Microsoft"** — Microsoft Research. Practical survey of how modern retrieval systems are built at scale. Useful for understanding the engineering decisions in production semantic search systems.

### Static Analysis and Security

**"The Art of Software Security Assessment"** — Dowd, McDonald, Schuh. Deep coverage of vulnerability classes and code review techniques for security. Comprehensive on injection, memory safety, and authentication vulnerabilities. Reference-level depth rather than a cover-to-cover read.

**OWASP Testing Guide** — Open Web Application Security Project. Current edition available at owasp.org. The canonical reference for web application security testing and review. Updated regularly with current vulnerability classes and detection techniques.

**"Secure by Design"** — Johnsson, Deogun, Sawano. Focuses on designing security in rather than testing it in. Relevant for understanding why certain patterns are secure and others are not — the "why" behind security pattern detection.

### Software Architecture

**"Clean Architecture"** — Robert Martin. Covers the dependency rule and the layering principles that form the conceptual foundation of dependency direction analysis. Prescriptive but useful for encoding architectural rules into automated analysis.

**"Building Microservices"** — Sam Newman. Second edition covers service boundary decisions and inter-service dependency management. Relevant for teams extending review to distributed system boundaries.

**"Software Architecture Metrics"** — Tornhill, Bowes, et al. Case studies in measuring software quality through structural analysis. Covers coupling metrics, hotspot analysis, and architectural decay — provides empirical grounding for metric-based architecture review.

---

*AI-Assisted Code Review — Version 1.0*
*David Kelly Price — April 2026*
*Pyckle*

