---
title: "Semantic Routing: Design and Implementation"
subtitle: "Directing Queries to the Right Handler Based on Meaning, Not Keywords"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Senior engineers building multi-tool or multi-model AI systems — familiar with embeddings and classification, building beyond simple single-path pipelines"
estimated_pages: 70
chapters:
  - "What Semantic Routing Solves"
  - "Routing vs. Classification: Key Distinctions"
  - "Embedding-Based Routing"
  - "Threshold Calibration"
  - "Multi-Stage Routing Pipelines"
  - "Routing for Code Queries"
  - "Fallback Strategies and Graceful Degradation"
  - "Evaluation and Monitoring"
tags:
  - pyckle
  - ebook
  - semantic-routing
  - classification
  - embeddings
  - architecture
  - llm
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Semantic Routing: Design and Implementation

## Directing Queries to the Right Handler Based on Meaning, Not Keywords

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- [About This Guide](#about-this-guide)
- [Chapter 1: What Semantic Routing Solves](#chapter-1-what-semantic-routing-solves)
- [Chapter 2: Routing vs. Classification: Key Distinctions](#chapter-2-routing-vs-classification-key-distinctions)
- [Chapter 3: Embedding-Based Routing](#chapter-3-embedding-based-routing)
- [Chapter 4: Threshold Calibration](#chapter-4-threshold-calibration)
- [Chapter 5: Multi-Stage Routing Pipelines](#chapter-5-multi-stage-routing-pipelines)
- [Chapter 6: Routing for Code Queries](#chapter-6-routing-for-code-queries)
- [Chapter 7: Fallback Strategies and Graceful Degradation](#chapter-7-fallback-strategies-and-graceful-degradation)
- [Chapter 8: Evaluation and Monitoring](#chapter-8-evaluation-and-monitoring)
- [Conclusion](#conclusion)
- [Appendix A: Glossary](#appendix-a-glossary)
- [Appendix B: Tools & Resources](#appendix-b-tools--resources)
- [Appendix C: Further Reading](#appendix-c-further-reading)

---

## About This Guide

This book is for engineers who have already built something with LLMs and are now hitting the wall where a single pipeline no longer works. You have multiple tools, multiple models, maybe multiple backends — and you need to figure out which one a given query should go to. Not based on a keyword match. Not based on a regex. Based on what the query actually means.

That problem is semantic routing.

The goal here is not to explain what embeddings are. You already know. The goal is to show you how to use meaning as a routing signal at production scale — how to calibrate thresholds so you're not constantly over- or under-triggering, how to compose routing stages without creating a spaghetti of conditions, and how to know when your router is failing before your users do.

Each chapter addresses a specific design problem. The progression is intentional: the early chapters build the conceptual foundation, the middle chapters get into implementation details, and the later chapters address the harder operational questions that only come up once something is actually running in production.

Code examples are written in Python. They assume you're working with something like `sentence-transformers` and `numpy`, and that you have a basic understanding of cosine similarity. When specific libraries are used, they're named. The patterns transfer to any stack.

This book does not recommend a specific routing library or framework as the one true answer. The space is moving fast. What it does instead is give you the principles that will outlast whatever the current popular tool is.

---

## Chapter 1: What Semantic Routing Solves

### The Problem With Single-Path Pipelines

When you build the first version of an AI-powered feature, a single pipeline makes sense. User sends a query, the query goes to an LLM, the LLM responds. Maybe there's some context retrieval in the middle, but the path is linear and predictable. This architecture handles the common case well, which is why it's where almost everyone starts.

The problems start showing up around the second or third capability you add.

Suppose you have a coding assistant. It answers general programming questions, but it also has access to your internal documentation, your codebase, and an external package registry. A query about sorting algorithms in Python probably shouldn't hit the internal codebase retriever — there's nothing company-specific about it. A query about the internal authentication library probably shouldn't go to a generic LLM without pulling relevant context first. A question about a third-party package should probably hit the registry, not your docs.

Now you have a routing problem. And the naive approach, a chain of if/elif/else conditions, breaks down almost immediately. Not because it can't handle the easy cases, but because language doesn't cooperate. Users don't phrase things to match your conditions. They ask "how does login work in this app?" and your keyword check fails to connect "login" to the authentication module because the module is called `auth_service`. They ask "can this library handle unicode?" and your regex looking for the word "package" misses it entirely.

The issue is that keyword-based routing matches form, not function. Semantic routing matches meaning.

### What Semantic Routing Actually Does

Semantic routing is the practice of directing queries to handlers based on the semantic content of those queries — their meaning — rather than their surface-level text features. It uses vector representations of meaning (embeddings) to measure how conceptually similar a query is to each possible routing destination, then sends the query to the best match.

At its core, it's doing something simple: comparing the meaning of an input against the meaning of a set of possible destinations and picking the closest one. But that simplicity is deceptive. Getting it right at production scale involves a series of non-obvious decisions about how similarity is measured, what counts as "close enough," how to handle ambiguous cases, and when to trust the router versus escalate.

> **Key Insight:** Semantic routing isn't a replacement for all your existing routing logic. It's the tool you reach for when the routing signal is conceptual rather than structural. Use keyword routing for clear categorical signals ("this is always a billing question if the word 'invoice' appears"). Use semantic routing when the distinction requires understanding meaning.

The distinction matters because semantic routing has a cost: embedding computation, similarity calculation, possible latency from vector lookup. You don't want to pay that cost when a simple condition works perfectly well. But when a simple condition doesn't work, because the routing signal is buried in meaning rather than surface text, semantic routing is often the only thing that scales.

### The Failure Modes of Keyword Routing

To appreciate why semantic routing exists, it's worth being specific about how keyword routing fails. There are three recurring patterns.

**Vocabulary mismatch.** The user uses different words than the routing condition expects. "I can't log in" and "authentication failing" and "my password doesn't work" all mean roughly the same thing. A keyword system watching for "login" catches the first, might catch the third, and probably misses the second. You can add synonyms, but you're playing whack-a-mole with human language variation. Semantic routing handles this naturally because synonyms have similar embeddings.

**Context sensitivity.** Some words mean different things in different contexts. "Model" could mean a trained ML model, a data model, a fashion model, or a physical scale model. A keyword match on "model" will fire in all four cases. Semantic routing uses the full query context to disambiguate — it doesn't just see the word, it sees the meaning.

**Compositional intent.** "How do I make the thing faster" is a perfectly natural query that contains no keywords at all to indicate whether it's about database performance, rendering performance, model inference, or something else entirely. Only context can resolve it — and routing based on meaning is far better equipped to do that than any keyword list.
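
To make the vocabulary-mismatch failure concrete, here is a minimal sketch of a keyword router. The rule table and the handler names (`KEYWORD_RULES`, `auth_help`, `billing`, `general_llm`) are hypothetical and exist only for illustration.

```python
# Hypothetical keyword rules: handler name -> trigger words (illustrative only).
KEYWORD_RULES = {
    "auth_help": ["login", "log in", "password"],
    "billing": ["invoice", "refund"],
}


def keyword_route(query: str, default: str = "general_llm") -> str:
    """Return the first handler whose trigger word appears in the query."""
    text = query.lower()
    for handler, keywords in KEYWORD_RULES.items():
        if any(keyword in text for keyword in keywords):
            return handler
    return default


queries = [
    "I can't log in",            # contains the trigger "log in"
    "my password doesn't work",  # contains the trigger "password"
    "authentication failing",    # same meaning, but no trigger word present
]
for q in queries:
    print(f"{q!r} -> {keyword_route(q)}")
```

The third query falls through to the default handler even though it means the same thing as the first two. Adding "authentication" to the rule list fixes this one case and leaves the next unseen phrasing uncovered.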

> **Warning:** The temptation when keyword routing fails is to add more keywords. This technically improves recall on the cases you've seen, but it degrades precision on the cases you haven't. Keyword rules accumulate technical debt fast. At some point it becomes easier to replace the whole system than to maintain the rule list.

### Where Semantic Routing Fits in the Stack

Semantic routing sits at the entry point of your pipeline — before query processing, before retrieval, before LLM calls. Its job is to decide what happens next. Think of it as a dispatcher: it receives the raw query and routes it to whatever handler is best suited to deal with it.

In a multi-tool system, handlers might be different retrieval backends, different LLM instances, different prompt templates, or even entirely different processing pipelines. The router doesn't know or care what the handler does with the query — it just needs to make the routing decision accurately.

This architectural position means routing decisions happen early, and routing errors cascade. A misrouted query that ends up in the wrong handler wastes compute, produces a worse response, and might completely fail to answer the user's actual question. Getting routing right isn't an optimization — it's load-bearing.

```
Figure 1: Position of the semantic router in a multi-handler pipeline.

[User Query]
     │
     ▼
[Semantic Router]
     │
     ├──► [Handler A: Internal Docs Retrieval]
     ├──► [Handler B: Code Search]
     ├──► [Handler C: External Package Registry]
     └──► [Handler D: General LLM Response]
```

The router in Figure 1 is making a single-stage decision — one query, one destination. Chapter 5 covers multi-stage routing, where the query passes through multiple routing decisions before reaching a final handler. But single-stage routing is the right starting point.

### When Semantic Routing Isn't the Answer

Not every routing problem needs semantic routing. Some queries carry explicit structural signals that are more reliable than semantic similarity: if a query comes from a specific API endpoint, route it accordingly; if it includes a `type` field in a structured request, use that directly; if it's triggered by a specific UI control, you already know the context. Semantic routing is for when you don't have those structural signals and need to infer intent from natural language.
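
One way to express this ordering is to check structural signals first and only fall back to semantic routing when none is present. The sketch below assumes a request arrives as a plain dict; the field names (`type`, `endpoint`, `query`) and the `ENDPOINT_ROUTES` mapping are hypothetical.

```python
from typing import Callable, Optional

# Hypothetical mapping from API endpoint to handler name.
ENDPOINT_ROUTES = {
    "/api/billing": "billing",
    "/api/search": "code_search",
}


def choose_route(
    request: dict,
    semantic_route: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Prefer explicit structural signals; fall back to semantic routing."""
    # 1. An explicit `type` field in a structured request wins outright.
    if request.get("type"):
        return request["type"]
    # 2. A known endpoint implies its handler.
    endpoint = request.get("endpoint")
    if endpoint in ENDPOINT_ROUTES:
        return ENDPOINT_ROUTES[endpoint]
    # 3. No structural signal: infer the route from the query text.
    return semantic_route(request.get("query", ""))
```

The `semantic_route` argument can be any callable that maps query text to a route name, so the embedding cost is only paid on the third branch.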

It also doesn't scale well as the only tool in complex multi-domain systems. When you have dozens of possible handlers, the similarity scores become ambiguous: many handlers will be somewhat similar to a given query, and the signal-to-noise ratio drops. Multi-stage routing, fallbacks, and confidence thresholds address this, but you still need to design carefully.

The sweet spot for semantic routing is a small-to-medium number of meaningfully distinct handlers (say, 3 to 15), where the distinctions between them are conceptual rather than structural, and where user language is unpredictable.

---

### Key Takeaways

1. Single-path pipelines fail when you have multiple handlers with different strengths — routing becomes necessary as systems grow.
2. Keyword routing fails at vocabulary mismatch, context sensitivity, and compositional intent — the three places where meaning diverges from surface text.
3. Semantic routing uses vector representations to route based on meaning, not form.
4. The router sits at pipeline entry, before processing — routing errors cascade, so accuracy matters more than speed.
5. Not every routing problem needs semantic routing. Use it when the routing signal is conceptual; use simpler approaches when structural signals exist.

---

> **Try This:** Take the last ten queries your system received that produced poor responses. Identify the routing decision each one required. How many would keyword routing have gotten right? How many required understanding the meaning of the query to route correctly? That ratio tells you how much semantic routing could help.

---

## Chapter 2: Routing vs. Classification: Key Distinctions

### A Terminology Problem Worth Solving Early

Routing and classification are related enough that practitioners use the terms interchangeably, which creates confusion when the design decisions actually differ. Before going further, it's worth being precise.

Classification is a task: given an input, assign it to one of a fixed set of categories. The output is a label. The question classification answers is "what kind of thing is this?"

Routing is a behavior: given an input, send it somewhere. The output is an action. The question routing answers is "where should this go?"

In practice, routing often involves classification as a sub-step — you classify the query, then use the classification result to determine the route. But conflating the two causes design problems. A classification model optimized for labeling accuracy isn't necessarily a good router. A routing system that works well doesn't have to produce clean classifications.

Understanding the distinction helps you make better decisions about architecture, evaluation, and what to optimize.

### Classification Models Aren't Always Good Routers

A standard text classifier takes a query, runs it through a model, and outputs class probabilities. You pick the highest probability class. This sounds like routing, but there are critical differences in how the system needs to behave.

A classifier trained on labeled data is optimized to assign correct labels on the training distribution. It can be brittle outside that distribution, because it learned the features that distinguish classes in the training data, not a general understanding of what each class means.

A routing system, by contrast, operates on live user input that may fall nowhere near the training distribution. Users ask things you didn't anticipate. New handlers get added. The system needs to work on queries it has never seen before, with concepts it may have never explicitly trained on.

Embedding-based semantic routing handles this better because it's not relying on training distribution coverage — it's measuring similarity in a high-dimensional semantic space that generalizes across vocabulary and phrasing. A router built on embeddings can correctly route a query that uses entirely different words than any query it's been calibrated against, as long as the meaning is similar.

> **Key Insight:** If your routing options are stable and well-defined, and you have labeled training data, a fine-tuned classifier may outperform embedding-based routing on in-distribution queries. But if your routing destinations change frequently, or if you're routing in domains where labeled data is sparse, embedding-based routing gives you better generalization.

This doesn't mean classifiers have no place in routing pipelines. Chapter 5 covers how classification and embedding-based routing can be combined in multi-stage architectures. The point here is that they're different tools with different tradeoffs, and defaulting to "train a classifier" when you need a router is a common mistake.

### The Output Difference

Classification produces a label and a confidence score. Routing produces a destination — and optionally, metadata about why the routing decision was made.

This distinction matters for downstream behavior. When you're doing routing, the confidence score needs to be interpretable in terms of routing behavior: not "I'm 72% confident this is category B," but "I'm confident enough in this routing decision to send the query to handler B without fallback."

Mapping from classification confidence to routing confidence is non-trivial. A model can be 72% confident and still be completely wrong. It can be 51% confident and be right. The threshold at which you trust a routing decision enough to act on it is a calibration problem that's separate from the classification problem — and it's one of the most underappreciated challenges in building routing systems. Chapter 4 covers threshold calibration in detail.
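
As a rough illustration of the difference between "highest score" and "safe to act on", the sketch below turns a set of per-route scores into a routing decision. The margin check against the runner-up is an assumption added here, not something prescribed above, and both default values are placeholders to be replaced by calibrated ones.

```python
from typing import Dict, Optional


def decide_route(
    scores: Dict[str, float],
    min_score: float = 0.6,    # illustrative value, not a recommendation
    min_margin: float = 0.1,   # illustrative value, not a recommendation
) -> Optional[str]:
    """Return a route name only when the decision looks safe to act on."""
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_name, best_score = ranked[0]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    if best_score < min_score:
        return None  # not confident enough in any route
    if best_score - runner_up < min_margin:
        return None  # too close to call between the top two routes
    return best_name
```

Returning `None` leaves the fallback decision to the caller, which keeps the "which route scored highest" question separate from the "do we act on it" question.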

### Routing Is Stateful; Classification Usually Isn't

A classifier receives a query and produces an output, independent of everything else. A router exists inside a system — it has a set of available handlers, it knows which ones are healthy, it may have per-session state about what's already been tried, and it makes decisions in context.

This statefulness changes the design requirements considerably. A routing system needs to handle the case where the best-matching handler is unavailable. It needs to update its available routes as the system changes. In multi-turn conversations, it may need to maintain routing context across turns so a follow-up question doesn't get routed to a completely different handler than the question it's following up on.

None of these concerns are native to a classification model. They have to be built around it.
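
A minimal sketch of what "built around it" can look like: the selection step below takes similarity scores as given and layers handler health and per-session context on top. The `stickiness` bonus for staying with the previous turn's handler is a hypothetical mechanism chosen for illustration, and its default value is arbitrary.

```python
from typing import Dict, Optional, Set


def pick_available_route(
    scores: Dict[str, float],
    healthy: Set[str],
    previous_route: Optional[str] = None,
    stickiness: float = 0.05,  # illustrative bonus for staying on the same handler
) -> Optional[str]:
    """Pick the best-scoring route among handlers that are currently healthy."""
    # Drop handlers that are unavailable right now.
    candidates = {name: s for name, s in scores.items() if name in healthy}
    if not candidates:
        return None  # nothing available: caller handles fallback
    # Nudge follow-up questions toward the handler used on the previous turn.
    if previous_route in candidates:
        candidates[previous_route] += stickiness
    return max(candidates, key=candidates.get)
```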

> **Warning:** Don't mistake good scores on accuracy metrics borrowed from classification (precision, recall, F1 on labeled test sets) for evidence that your router is actually doing its job. A router that achieves 94% accuracy on a held-out test set but routes coherently for only 70% of production queries has a calibration or distribution shift problem that accuracy alone won't reveal. Evaluation of routing systems is covered in Chapter 8.

### Intent vs. Topic: What Are You Actually Routing On?

One of the most useful distinctions in routing system design is between routing on intent and routing on topic.

Topic-based routing asks: what is this query about? It groups queries by subject matter domain. "How do I sort a list in Python?" and "What's the time complexity of merge sort?" are both about algorithms — same topic.

Intent-based routing asks: what does the user want to do? It groups queries by the action they require. "How do I sort a list in Python?" is asking for code (intent: generate). "What's the time complexity of merge sort?" is asking for explanation (intent: explain). Different intent, same topic.

Most routing systems are implicitly doing topic-based routing because that's the natural way to think about distributing queries across specialized handlers. But intent-based routing is often the more useful signal when the choice is between types of response rather than between specialized knowledge sources.

A complete routing system frequently needs both. You might first route on topic (this is a code question, not a data analysis question) and then route on intent (the user wants to generate code, not have existing code reviewed). Multi-stage pipelines formalize this — Chapter 5.

```python
# Simplified example: intent extraction before topic routing
from enum import Enum

from sklearn.metrics.pairwise import cosine_similarity  # assumes scikit-learn is installed

class QueryIntent(Enum):
    GENERATE = "generate"
    EXPLAIN = "explain"
    DEBUG = "debug"
    REVIEW = "review"
    UNKNOWN = "unknown"

INTENT_TEMPLATES = {
    QueryIntent.GENERATE: "Write code that does the following:",
    QueryIntent.EXPLAIN: "Explain how the following concept works:",
    QueryIntent.DEBUG: "Find the bug in the following code:",
    QueryIntent.REVIEW: "Review the following code for quality and correctness:",
}

def classify_intent(query: str, embedder) -> QueryIntent:
    """Route based on intent, not just topic."""
    intent_anchors = {
        QueryIntent.GENERATE: ["write code for", "implement", "create a function",
                                "generate", "how do I build", "show me how to"],
        QueryIntent.EXPLAIN: ["explain", "what is", "how does", "why does",
                               "what's the difference between", "describe"],
        QueryIntent.DEBUG: ["fix", "debug", "why is this failing", "error",
                             "not working", "broken", "exception"],
        QueryIntent.REVIEW: ["review", "improve", "is this good", "better way",
                              "code quality", "refactor"],
    }

    query_emb = embedder.encode(query)
    best_intent = QueryIntent.UNKNOWN
    best_score = 0.5  # minimum confidence threshold

    for intent, anchors in intent_anchors.items():
        anchor_embs = embedder.encode(anchors)
        scores = cosine_similarity([query_emb], anchor_embs)[0]
        max_score = scores.max()
        if max_score > best_score:
            best_score = max_score
            best_intent = intent

    return best_intent
```

This is intent routing using anchor phrases as synthetic class representatives — a pattern covered in depth in Chapter 3.

### When You Need Routing, Not Classification

The cleanest heuristic: if you're building a system where queries need to be handled differently based on meaning, and the handling involves meaningfully distinct execution paths (different tools, different models, different retrieval strategies), you need a router.

If you're building a system that needs to label data, organize content, or report on what kinds of things are happening — but doesn't need to change execution based on that — you need a classifier.

When the same system needs both, build them separately and compose them. The classifier serves analytics and monitoring; the router drives execution. They can share embedding infrastructure, but their outputs serve different purposes and should be evaluated on different criteria.

---

### Key Takeaways

1. Classification assigns labels; routing drives behavior. They're related but not the same, and optimizing for one doesn't guarantee success at the other.
2. Embedding-based routing generalizes better than trained classifiers for novel queries and unstable routing destinations.
3. Confidence score interpretation is different in routing than in classification — routing requires knowing when a score is good enough to act on, not just which score is highest.
4. Routing is stateful; it exists inside a system with available handlers, health state, and context. Classifiers are stateless.
5. Distinguishing intent from topic enables more precise routing — combining them in multi-stage pipelines gives the best results.

---

> **Try This:** Audit your current routing or dispatch logic. Identify each decision point and label it: is this routing on topic, intent, or both? Are any of these currently keyword-based where the routing signal is actually conceptual? Start there with semantic routing.

---

## Chapter 3: Embedding-Based Routing

### The Core Mechanism

Embedding-based routing works by converting both the incoming query and a set of routing anchors into vectors in a shared semantic space, then measuring which anchor the query is closest to. The handler associated with the closest anchor receives the query.

That's it. Everything else — anchor selection, similarity metric choice, threshold calibration, multi-stage composition — is scaffolding built around this core operation.

Understanding why this works requires understanding what embeddings are representing. A well-trained embedding model encodes semantic content: two strings with similar meaning produce vectors that are geometrically close. "I can't log in" and "my account access is broken" end up near each other in the embedding space even though they share no significant keywords. This is the property that makes embedding-based routing generalize in ways that keyword matching cannot.

The embedding model is doing most of the heavy lifting here. Choosing the right model — or the wrong one — has a larger impact on routing quality than almost any other decision in the system.

### Choosing an Embedding Model

Not all embedding models are equally suited for routing. The main considerations are domain coverage, dimensionality, and latency.

**Domain coverage** is the most important factor. A general-purpose embedding model trained on web text handles everyday language well but may be weak on specialized technical domains. If you're routing code-related queries, a model that has seen substantial code during training will produce better embeddings for those queries. If you're in a highly specialized domain (medical, legal, financial), consider whether a domain-adapted model is available and worth the tradeoff.

**Dimensionality** affects both accuracy and performance. Higher-dimensional embeddings generally capture more nuance, but at the cost of more memory and slower similarity computation. For most routing use cases, 256 to 768 dimensions is the practical sweet spot. 1536-dimensional embeddings (as produced by some OpenAI models) offer marginally better quality in some settings but rarely enough to justify the compute cost in a low-latency routing layer.

**Latency** matters because routing happens before everything else. If routing adds 100ms to every request, that's 100ms added to every response. Smaller, faster embedding models (all-MiniLM-L6-v2 at 384 dimensions is a common choice) are often the better fit when latency is a constraint, because the routing-quality gain from moving to a larger model is usually small compared with the added per-query latency.

> **Key Insight:** Cache embeddings aggressively. For routing anchors, pre-compute and store embeddings at startup — they don't change unless your routing configuration changes. For queries, consider caching embeddings for recently seen or high-frequency inputs. Embedding computation is the majority of routing latency in most implementations.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional, Tuple

import numpy as np
from sentence_transformers import SentenceTransformer

@dataclass
class Route:
    name: str
    handler: Callable[[str], Any]
    anchors: List[str]
    anchor_embeddings: Optional[np.ndarray] = None  # shape: (n_anchors, dim)

class SemanticRouter:
    def __init__(self, model_name: str = "all-MiniLM-L6-v2"):
        self.model = SentenceTransformer(model_name)
        self.routes: List[Route] = []

    def add_route(self, name: str, handler: Callable[[str], Any],
                  anchors: List[str]) -> None:
        # Anchor embeddings are computed once, when the route is registered
        embeddings = self.model.encode(anchors, normalize_embeddings=True)
        route = Route(name=name, handler=handler, anchors=anchors,
                      anchor_embeddings=embeddings)
        self.routes.append(route)

    def route(self, query: str) -> Tuple[str, float]:
        if not self.routes:
            raise ValueError("No routes configured")

        query_embedding = self.model.encode(query, normalize_embeddings=True)
        best_route = None
        best_score = -1.0

        for route in self.routes:
            # Embeddings are L2-normalized, so the dot product is cosine similarity.
            # Score is max similarity to any anchor in this route
            scores = query_embedding @ route.anchor_embeddings.T
            route_score = float(scores.max())

            if route_score > best_score:
                best_score = route_score
                best_route = route

        return best_route.name, best_score

    def dispatch(self, query: str, threshold: float = 0.35) -> Any:
        route_name, score = self.route(query)
        if score < threshold:
            return None  # No confident route; caller handles fallback
        route = next(r for r in self.routes if r.name == route_name)
        return route.handler(query)

This is a minimal but complete implementation. The `route` method returns both the best route name and the confidence score. The `dispatch` method applies a threshold before invoking the handler, returning `None` when confidence is insufficient. The caller handles the fallback — the router's job ends at the decision.
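
The query-side caching advice above can be sketched as a thin wrapper around whatever function produces embeddings. `CachedEncoder` is a hypothetical helper, not part of any library, and collapsing whitespace to build the cache key is an assumption that should be checked against how your model tokenizes input.

```python
from collections import OrderedDict
from typing import Callable, Sequence


class CachedEncoder:
    """Wrap an encode function with a small LRU cache keyed on query text."""

    def __init__(self, encode: Callable[[str], Sequence[float]],
                 max_size: int = 1024):
        self._encode = encode
        self._max_size = max_size
        self._cache: "OrderedDict[str, Sequence[float]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def __call__(self, query: str) -> Sequence[float]:
        key = " ".join(query.split())  # collapse whitespace (an assumption)
        if key in self._cache:
            self._cache.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        embedding = self._encode(query)
        self._cache[key] = embedding
        if len(self._cache) > self._max_size:
            self._cache.popitem(last=False)  # evict least recently used
        return embedding
```

With the router above, one option is to wrap the model call, for example `CachedEncoder(lambda q: router.model.encode(q, normalize_embeddings=True))`. Cached embeddings are returned as shared objects, so callers should treat them as read-only. The `hits` and `misses` counters give a quick read on whether the cache is earning its memory.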

### The Role of Anchor Phrases

Anchor phrases are the routing signals. They're the examples or descriptions that define what each route means semantically. Their quality determines routing quality more than almost anything else.

There are two strategies for creating anchors: **example-based anchors** and **description-based anchors**.

Example-based anchors are representative queries that should route to this handler — what real users might ask. "How do I fix a merge conflict?" and "my branch has conflicts" are example-based anchors for a git help route.

Description-based anchors describe the domain or capability in natural language. "Git version control questions and issues" is a description-based anchor for the same route.

In practice, mixing both produces the best coverage. Description-based anchors capture the general conceptual territory; example-based anchors pull in the actual phrasing patterns users use. Neither alone is sufficient: descriptions can be too abstract to attract edge cases, and examples can miss the space between them.

```python
# Example: mixed anchor strategy for a code assistance system
router = SemanticRouter()

router.add_route(
    name="code_generation",
    handler=code_generation_handler,
    anchors=[
        # Description-based
        "Writing new code, implementing functions or classes, code generation",
        "Creating programs or scripts from scratch",
        # Example-based
        "write a function that does X",
        "implement a binary search tree",
        "create a Python script to parse CSV files",
        "how do I write code to connect to a database",
        "generate a REST API endpoint",
    ]
)

router.add_route(
    name="code_debugging",
    handler=debugging_handler,
    anchors=[
        # Description-based
        "Debugging errors, fixing broken code, diagnosing failures and exceptions",
        "Troubleshooting code that isn't working correctly",
        # Example-based
        "why is this throwing a KeyError",
        "my code crashes with a segfault",
        "this function returns None and I don't know why",
        "how do I fix this TypeError",
        "my tests are failing and I can't figure out why",
    ]
)

router.add_route(
    name="code_review",
    handler=review_handler,
    anchors=[
        # Description-based
        "Reviewing existing code for quality, correctness, and best practices",
        "Code quality assessment and improvement suggestions",
        # Example-based
        "is this implementation good",
        "can you review my pull request",
        "what's wrong with this approach",
        "how would you improve this code",
        "is there a more Pythonic way to do this",
    ]
)
```

The number of anchors per route is a tuning parameter. More anchors provide broader coverage but also increase the chance of a high-similarity match to the wrong route if anchors overlap semantically. A good starting point is 6 to 12 anchors per route, covering both description and representative examples.

### Similarity Metrics

Cosine similarity is the standard choice for semantic routing, and it's the right default. It measures the angle between two vectors in the embedding space, which corresponds to semantic similarity independent of vector magnitude. When embeddings are L2-normalized (as in the implementation above), cosine similarity reduces to a dot product, which is fast.

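The range is -1 to 1, though in practice routing scores cluster between 0.2 and 0.9. As a rough guide, scores below 0.3 typically indicate no meaningful semantic relationship, and scores above 0.75 typically indicate strong semantic alignment. These bands shift with the embedding model, so treat them as starting points rather than fixed cutoffs.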

Dot product without normalization is sometimes used when magnitude carries signal — which it does in some embedding models that encode frequency or importance information in vector magnitude. For most routing use cases, normalized embeddings and cosine similarity are the right choice.

Euclidean distance is occasionally used but has no advantage over cosine similarity for routing, and it's harder to interpret — smaller means closer, the range isn't bounded, and it varies with embedding dimensionality.
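
For unit-length vectors the two metrics are tied together by the identity ‖a − b‖² = 2 − 2·cos(a, b), so ranking anchors by Euclidean distance gives the same order as ranking them by cosine similarity. The following is a minimal NumPy check of that identity; the random vectors are stand-ins for real embeddings, and the dimension of 384 is just an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)

# Random vectors stand in for real embeddings (illustrative only)
query = rng.normal(size=384)
anchors = rng.normal(size=(5, 384))

# L2-normalize so every vector has unit length
query /= np.linalg.norm(query)
anchors /= np.linalg.norm(anchors, axis=1, keepdims=True)

cosine = anchors @ query                             # higher = closer
euclidean = np.linalg.norm(anchors - query, axis=1)  # lower = closer

# Identity for unit vectors: ||a - b||^2 == 2 - 2 * cos(a, b)
identity_holds = np.allclose(euclidean ** 2, 2 - 2 * cosine)

# Both metrics rank the anchors identically
same_ranking = np.array_equal(np.argsort(-cosine), np.argsort(euclidean))
```

Since the rankings agree, the choice between the two on normalized embeddings comes down to interpretability, which is where cosine similarity has the edge.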

### Aggregating Scores Across Anchors

Given multiple anchors per route, you need a strategy for converting multiple similarity scores into a single route score. The main options are maximum, mean, and top-k mean.

**Maximum** (as used in the example above) selects the highest similarity across all anchors. It asks: "what's the closest thing in this route to the query?" This is appropriate when you want to capture specialized coverage — if any anchor is a strong match, the query probably belongs here.

**Mean** averages all anchor similarities. It asks: "how generally aligned is this query with this route?" This is more conservative and less likely to fire on edge cases that happen to be lexically close to one specific anchor.

**Top-k mean** takes the mean of the k highest-scoring anchors. It's a middle ground that reduces the noise sensitivity of the pure mean while not firing as aggressively as the maximum. Top-3 mean is a common configuration.

The right aggregation strategy depends on your anchor construction. If your anchors are tightly curated representative examples (low variance, high quality), maximum works well. If your anchors are broader and more varied, top-k mean provides more stability.
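
The three strategies can be compared side by side with a small helper. This is a sketch, not part of the router above: the function name `aggregate_scores`, the strategy labels, and the sample scores are all made up for illustration.

```python
import numpy as np

def aggregate_scores(scores: np.ndarray, strategy: str = "max", k: int = 3) -> float:
    """Collapse per-anchor similarities into a single route score."""
    if scores.size == 0:
        raise ValueError("scores must contain at least one anchor similarity")
    if strategy == "max":
        return float(scores.max())
    if strategy == "mean":
        return float(scores.mean())
    if strategy == "top_k_mean":
        k = max(1, min(k, scores.size))   # clamp k to the number of anchors
        top_k = np.sort(scores)[-k:]      # the k highest similarities
        return float(top_k.mean())
    raise ValueError(f"unknown strategy: {strategy}")

# One strong anchor match among otherwise weak ones (made-up numbers)
scores = np.array([0.82, 0.40, 0.35, 0.30, 0.28])
max_score = aggregate_scores(scores, "max")
mean_score = aggregate_scores(scores, "mean")
top3_score = aggregate_scores(scores, "top_k_mean", k=3)
```

With a single strong match and several weak ones, the maximum reports the strong match, the mean is pulled down by the weak anchors, and the top-3 mean lands in between.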

> **Warning:** More anchors don't always mean better routing. Anchors that overlap semantically with anchors in other routes create ambiguity. Before adding anchors to improve recall for one route, check whether they could also increase false positives from related routes. Chapter 4 covers how to detect this through threshold calibration.

### Batch Processing and Caching

In production, routing needs to be fast. Two optimizations have the most impact: pre-computing anchor embeddings and batching query embeddings when possible.

Pre-computing anchor embeddings is straightforward — do it at startup and store in memory. Anchors only change when routing configuration changes, which should be rare and controlled.

```python
import hashlib
from collections import OrderedDict
from typing import Tuple

import numpy as np

class CachedSemanticRouter(SemanticRouter):
    """Router with query embedding cache for high-throughput scenarios."""

    def __init__(self, model_name: str = "all-MiniLM-L6-v2",
                 cache_size: int = 1024):
        super().__init__(model_name)
        # OrderedDict tracks recency with O(1) updates (a list needs O(n) removes)
        self._cache: "OrderedDict[str, np.ndarray]" = OrderedDict()
        self._cache_size = cache_size

    def _get_query_embedding(self, query: str) -> np.ndarray:
        cache_key = hashlib.md5(query.encode()).hexdigest()

        if cache_key in self._cache:
            self._cache.move_to_end(cache_key)  # mark as most recently used
            return self._cache[cache_key]

        embedding = self.model.encode(query, normalize_embeddings=True)

        if len(self._cache) >= self._cache_size:
            self._cache.popitem(last=False)  # evict the least recently used entry

        self._cache[cache_key] = embedding
        return embedding

    def route(self, query: str) -> Tuple[str, float]:
        query_embedding = self._get_query_embedding(query)
        best_route = None
        best_score = -1.0

        for route in self.routes:
            scores = query_embedding @ route.anchor_embeddings.T
            route_score = float(scores.max())
            if route_score > best_score:
                best_score = route_score
                best_route = route

        return best_route.name, best_score
```

For systems handling high query volume, consider replacing the in-process cache with a distributed cache (Redis is common) keyed on the normalized query string. This allows embedding reuse across multiple router instances without redundant computation.
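
What "normalized" means for the cache key is a design decision. The sketch below assumes one reasonable definition (trim, lowercase, collapse whitespace) and also folds the model name into the key, since embeddings from different models are not interchangeable. The function name and the `emb:` key prefix are hypothetical.

```python
import hashlib
import re

def normalized_cache_key(query: str, model_name: str) -> str:
    """Build a cache key that is stable across trivial query variations."""
    # Assumed normalization: trim, lowercase, collapse runs of whitespace
    normalized = re.sub(r"\s+", " ", query.strip().lower())
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    # Include the model name so a model change never serves stale vectors
    return f"emb:{model_name}:{digest}"

key_a = normalized_cache_key("  Fix   this Bug ", "all-MiniLM-L6-v2")
key_b = normalized_cache_key("fix this bug", "all-MiniLM-L6-v2")
key_c = normalized_cache_key("fix this bug", "some-other-model")
```

If you normalize the key this way, embed the normalized string as well. Otherwise two differently cased queries would share one cached vector that was computed from only one of them.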

---

### Key Takeaways

1. Embedding-based routing measures semantic similarity between queries and anchor phrases — the embedding model choice determines most of the system's baseline quality.
2. Anchor phrase construction is as important as model choice: mix description-based and example-based anchors, targeting 6-12 per route.
3. Cosine similarity on normalized embeddings is the right default metric: fast to compute, bounded, and easy to interpret. The raw scores still need per-system threshold calibration (Chapter 4).
4. Score aggregation strategy (max, mean, top-k mean) should match the variance of your anchors — tightly curated anchors work well with max; broader anchors need averaging.
5. Cache anchor embeddings at startup; cache query embeddings for high-frequency inputs. These two optimizations account for most of the latency reduction available in the routing layer.

---

> **Try This:** Build a minimal semantic router with three routes corresponding to three meaningfully distinct handlers in a system you're working on. Write 8-10 anchors per route. Run 20 queries you've received from real users through it and evaluate the routing decision for each. Pay attention not just to whether the route was right, but how high the score was — the score distribution tells you a lot about calibration before you even start tuning thresholds.

---

## Chapter 4: Threshold Calibration

### Why Thresholds Matter

Every semantic router produces a score. The score tells you how semantically similar the query is to the best-matching route. But a score alone doesn't tell you whether to act on it.

A score of 0.62 might mean "confidently belongs to handler B" in one system and "ambiguously somewhere between handler A and handler B" in another, depending on how the anchors are constructed, which embedding model was used, and how different the routes are from each other. The threshold is what converts a continuous similarity score into a binary routing decision: route this query, or don't.

Getting threshold calibration wrong has two failure modes. Set the threshold too low and the router fires confidently on queries it shouldn't — misrouting queries with insufficient signal. Set it too high and the router refuses to route anything except the most obvious cases, pushing too many queries to fallback handlers.

Neither failure is acceptable in production. Miscalibration is one of the most common reasons semantic routing systems underperform in practice, despite the underlying similarity computation being correct.

### The Distribution of Routing Scores

The first step in calibration is understanding your score distribution. Before setting any thresholds, collect similarity scores across a representative sample of real or realistic queries and plot the distribution.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple

import numpy as np

def collect_routing_scores(
    router: SemanticRouter,
    queries: List[str],
) -> Tuple[Dict[str, List[float]], List[Dict[str, Any]]]:
    """
    Collect routing scores across a query set.

    Returns (route_scores, all_scores): best-match scores grouped by
    winning route, plus one record per query for detailed inspection.
    """
    route_scores: Dict[str, List[float]] = defaultdict(list)
    all_scores: List[Dict[str, Any]] = []

    for query in queries:
        best_route, best_score = router.route(query)
        route_scores[best_route].append(best_score)
        all_scores.append({
            "query": query,
            "route": best_route,
            "score": best_score
        })

    return route_scores, all_scores

def score_percentiles(scores: List[float]) -> Dict[str, float]:
    arr = np.array(scores)
    return {
        "p10": float(np.percentile(arr, 10)),
        "p25": float(np.percentile(arr, 25)),
        "p50": float(np.percentile(arr, 50)),
        "p75": float(np.percentile(arr, 75)),
        "p90": float(np.percentile(arr, 90)),
        "mean": float(arr.mean()),
        "std": float(arr.std()),
    }
```

What you're looking for in the distribution:

1. **Bimodality**: A well-calibrated router with clear routes often shows a bimodal distribution — a cluster of high-confidence correct routes and a cluster of low-confidence ambiguous queries. The valley between these clusters is a natural threshold location.

2. **Compression at the top**: If most scores cluster above 0.7, your anchors may be too similar to each other or the embedding model is producing compressed similarity values. The threshold needs to go higher to create meaningful discrimination.

3. **Flat distribution**: If scores are roughly uniformly distributed across [0.3, 0.8], the routing signal is weak. This usually indicates that routes aren't semantically distinct enough, or that the query sample doesn't match the routing categories well.
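
For the bimodal case, the valley can be located with a simple histogram heuristic. The sketch below is one rough way to do it, not a standard algorithm: it takes the tallest bin, finds the tallest bin at least `min_separation` bins away, and returns the centre of the emptiest bin between them. The function name, the bin count, the separation parameter, and the synthetic scores are all assumptions, and it only looks at scores in [0, 1].

```python
from typing import Optional, Sequence

import numpy as np

def find_valley_threshold(scores: Sequence[float], bins: int = 20,
                          min_separation: int = 3) -> Optional[float]:
    """Centre of the emptiest bin between the two tallest separated peaks."""
    counts, edges = np.histogram(scores, bins=bins, range=(0.0, 1.0))
    centres = (edges[:-1] + edges[1:]) / 2

    first = int(np.argmax(counts))
    # Hide bins close to the first peak, then look for a second peak
    masked = counts.copy()
    lo = max(0, first - min_separation + 1)
    hi = min(bins, first + min_separation)
    masked[lo:hi] = 0
    second = int(np.argmax(masked))
    if masked[second] == 0:
        return None  # no second cluster: distribution is not bimodal

    left, right = sorted((first, second))
    valley = left + int(np.argmin(counts[left:right + 1]))
    return float(centres[valley])

# Synthetic bimodal scores: an ambiguous cluster and a confident cluster
rng = np.random.default_rng(42)
scores = np.clip(np.concatenate([
    rng.normal(0.35, 0.04, 300),
    rng.normal(0.75, 0.04, 300),
]), 0.0, 1.0)
threshold = find_valley_threshold(scores)
```

Treat the result as a starting point for calibration rather than a final threshold. On a flat or unimodal distribution the heuristic either returns `None` or a value that means little, so always check it against the plotted histogram.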

### Setting Per-Route Thresholds

A single global threshold is rarely optimal. Different routes have different semantic breadth, and a threshold that works well for a narrow, specialized route will be too conservative for a broad, general route.

Per-route thresholds let you tune each route to its own characteristics. The route for "questions about internal API authentication" can have a higher threshold (it's narrow, so high confidence should be required) than the route for "general Python programming questions" (it's broad, and many queries have some relevance to it).

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Optional

import numpy as np

@dataclass
class Route:
    name: str
    handler: Callable[[str], Any]
    anchors: List[str]
    anchor_embeddings: Optional[np.ndarray] = None
    threshold: float = 0.35  # Per-route threshold

class ThresholdRouter(SemanticRouter):
    def dispatch(self, query: str) -> Any:
        query_embedding = self.model.encode(query, normalize_embeddings=True)
        best_route = None
        best_score = -1.0

        for route in self.routes:
            scores = query_embedding @ route.anchor_embeddings.T
            route_score = float(scores.max())
            if route_score > best_score:
                best_score = route_score
                best_route = route

        # No routes configured, or best match is below that route's threshold
        if best_route is None or best_score < best_route.threshold:
            return None

        return best_route.handler(query)

Per-route thresholds require per-route calibration data. Collect a sample of queries that should and shouldn't route to each handler, run them through the router, and find the threshold that maximizes correct routing while minimizing misroutes for each route independently.

### Precision-Recall Tradeoffs

Threshold calibration is fundamentally a precision-recall tradeoff. Raising the threshold improves precision (when the router fires, it's more likely to be right) but hurts recall (more queries fall through to fallback). Lowering the threshold improves recall but degrades precision.

The right tradeoff depends on the cost asymmetry in your system. If a misrouted query produces a visibly wrong response (user asks about billing, gets sent to the coding assistant), precision matters more. If a query falling through to a slow or expensive fallback is the main cost, recall matters more.
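
To make the tradeoff concrete, the sketch below computes precision and recall for a single route at two thresholds. The helper name `precision_recall_at` and the scores and labels are invented for illustration; `labels` marks whether each query truly belongs to the route.

```python
import numpy as np

def precision_recall_at(scores, labels, threshold):
    """Precision and recall of 'score >= threshold' as a routing rule."""
    scores_arr = np.asarray(scores, dtype=float)
    labels_arr = np.asarray(labels, dtype=bool)
    routed = scores_arr >= threshold
    true_pos = int((routed & labels_arr).sum())
    # By convention, report 1.0 when the denominator is empty
    precision = true_pos / routed.sum() if routed.any() else 1.0
    recall = true_pos / labels_arr.sum() if labels_arr.any() else 1.0
    return float(precision), float(recall)

# Made-up scores for one route; True = query really belongs to this route
scores = [0.81, 0.74, 0.66, 0.58, 0.52, 0.47, 0.39, 0.31]
labels = [True, True, True, False, True, False, False, False]

low = precision_recall_at(scores, labels, 0.45)   # permissive threshold
high = precision_recall_at(scores, labels, 0.70)  # strict threshold
```

In this toy data the permissive threshold catches every matching query but routes two that do not belong, while the strict threshold makes no mistakes and misses half of the matching queries.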

Formalize this with a cost matrix:

```python
def optimal_threshold(scores: List[float],
                       labels: List[bool],
                       misroute_cost: float = 2.0,
                       fallback_cost: float = 1.0) -> float:
    """
    Find threshold that minimizes expected cost.

    labels: True = should route to this handler, False = should not
    misroute_cost: cost when a non-matching query is routed here
    fallback_cost: cost when a matching query falls to fallback
    """
    scores_arr = np.array(scores, dtype=float)
    labels_arr = np.array(labels, dtype=bool)  # keeps ~ a logical NOT, even for 0/1 labels
    thresholds = np.linspace(scores_arr.min(), scores_arr.max(), 200)

    best_threshold = thresholds[0]
    best_cost = float('inf')

    for t in thresholds:
        routed = scores_arr >= t
        misroutes = (routed & ~labels_arr).sum()   # FP: routed when shouldn't
        fallbacks = (~routed & labels_arr).sum()   # FN: not routed when should

        cost = misroute_cost * misroutes + fallback_cost * fallbacks
        if cost < best_cost:
            best_cost = cost
            best_threshold = t

    return float(best_threshold)
```

Setting `misroute_cost` higher than `fallback_cost` biases toward higher thresholds (conservative routing). Equal costs produce a balanced threshold. Adjust both based on actual production consequences.

### Score Gap Analysis

One refinement that improves routing reliability is considering not just the best route's score, but the gap between the best and second-best scores.

A query that scores 0.72 on route A and 0.68 on route B is more ambiguous than a query that scores 0.72 on route A and 0.41 on route B, even though the best score is identical. The gap captures this ambiguity.

```python
def route_with_gap_check(self, query: str,
                          min_gap: float = 0.1) -> Tuple[str, float, bool]:
    """
    Route query, flagging low-gap decisions as potentially ambiguous.

    Returns (route_name, best_score, is_confident)
    """
    query_embedding = self.model.encode(query, normalize_embeddings=True)
    route_scores = []

    for route in self.routes:
        scores = query_embedding @ route.anchor_embeddings.T
        route_scores.append((route.name, float(scores.max())))

    route_scores.sort(key=lambda x: x[1], reverse=True)
    best_name, best_score = route_scores[0]
    second_score = route_scores[1][1] if len(route_scores) > 1 else 0.0

    gap = best_score - second_score
    is_confident = gap >= min_gap

    return best_name, best_score, is_confident
```

Use gap analysis as a secondary signal: high score but small gap → route to best match but flag as potentially ambiguous; high score and large gap → route confidently; low score regardless of gap → fall to default.
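
That three-way rule is small enough to write down directly. In the sketch below, the function name, the outcome labels, and the default values are illustrative choices rather than part of the router above.

```python
def classify_decision(best_score: float, gap: float,
                      threshold: float = 0.35, min_gap: float = 0.1) -> str:
    """Combine the score threshold and the score gap into one outcome."""
    if best_score < threshold:
        return "fallback"        # low score, regardless of gap
    if gap < min_gap:
        return "route_flagged"   # good score, but a close runner-up
    return "route"               # good score and a clear winner

ambiguous = classify_decision(best_score=0.72, gap=0.04)
confident = classify_decision(best_score=0.72, gap=0.31)
weak = classify_decision(best_score=0.28, gap=0.20)
```

Checking the threshold first means a large gap can never rescue a weak match, which matches the rule that low scores fall to the default regardless of gap.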

> **Key Insight:** Score gap analysis is particularly useful for detecting queries that sit at the boundary between two semantically adjacent routes. If you consistently see small-gap decisions between two specific routes, it's a signal that those routes aren't semantically distinct enough — their anchor spaces overlap. The fix is usually to refine the anchor phrases, not to adjust the threshold.

### Dynamic Threshold Adjustment

Production systems drift. The distribution of queries changes over time, new topics emerge, and routes that were well-calibrated at launch start under- or over-triggering. Static thresholds set at launch don't stay optimal.

A practical approach is periodic threshold recalibration based on production signals. The signals needed for this are: routing decisions made (route, score), and routing quality indicators (downstream response quality ratings, escalations, explicit user feedback). Collecting these and periodically running calibration analysis gives you the data needed to update thresholds without manual inspection.

A more aggressive approach uses online threshold adjustment — updating thresholds in response to observed feedback signals in near-real-time. This is complex to implement correctly and creates its own instabilities, so it's only worth considering if your query distribution changes very rapidly and you have reliable automated quality signals. Most systems are better served by periodic batch recalibration.
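
One possible shape for the batch recalibration job is sketched below. Everything specific here is an assumption: the log format `(route, score, was_correct)`, the criterion (the lowest threshold that still meets a target precision on logged decisions), and the `min_samples` guard against recalibrating on too little data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

def recalibrate_thresholds(log: Iterable[Tuple[str, float, bool]],
                           target_precision: float = 0.9,
                           min_samples: int = 20) -> Dict[str, float]:
    """Per route, find the lowest score cutoff that meets the target precision."""
    by_route = defaultdict(list)
    for route, score, was_correct in log:
        by_route[route].append((score, was_correct))

    thresholds: Dict[str, float] = {}
    for route, records in by_route.items():
        if len(records) < min_samples:
            continue  # too little evidence: keep the existing threshold
        records.sort(key=lambda r: r[0], reverse=True)
        correct, best = 0, None
        for seen, (score, was_correct) in enumerate(records, start=1):
            correct += was_correct
            if correct / seen >= target_precision:
                best = score  # lowest cutoff so far that meets the target
        if best is not None:
            thresholds[route] = best
    return thresholds

# Toy feedback log (made-up values)
log = [
    ("debug", 0.9, True), ("debug", 0.8, True), ("debug", 0.7, False),
    ("debug", 0.6, True), ("debug", 0.5, False),
    ("review", 0.8, True), ("review", 0.4, False),
]
updated = recalibrate_thresholds(log, target_precision=0.75, min_samples=4)
```

One caveat with this kind of job: if the log only contains queries that cleared the old threshold, it can justify raising a threshold but says little about lowering it. Logging scores for queries that fell through to fallback closes that gap.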

---

### Key Takeaways

1. Thresholds convert continuous similarity scores into binary routing decisions — miscalibration is one of the most common production failure modes.
2. Analyze your score distribution before setting thresholds — look for bimodality, compression, and flatness as signals of routing system health.
3. Per-route thresholds outperform global thresholds for systems with routes of varying semantic breadth.
4. The optimal threshold depends on the cost asymmetry between misrouting and fallback — formalize this with a cost matrix.
5. Score gap analysis detects ambiguous routing decisions that a threshold check alone would miss.

---

> **Try This:** Take 100 labeled queries for your system — 50 that should route to each of two adjacent handlers — and plot their similarity score distributions. Identify the score range where the distributions overlap. That overlap zone is your calibration problem: it contains both the queries you'll misroute and the queries you'll under-route. The width of the overlap zone tells you how hard the calibration problem actually is.

---

## Chapter 5: Multi-Stage Routing Pipelines

### When Single-Stage Routing Isn't Enough

Single-stage routing — one decision, one destination — is the right starting architecture. It's simple, fast, and handles most cases well. But as the number of handlers grows and the distinctions between them become more nuanced, a single routing decision starts to compress too many distinctions into one step.

The result is either too many routes (hard to calibrate, high ambiguity between adjacent routes) or routes that are too coarse (accurate routing but handlers that are still too generic to be useful). Multi-stage routing addresses this by decomposing the routing decision into a sequence of coarser-to-finer decisions, each building on the previous.

The principle is the same one that makes hierarchical classification work: it's easier to make a confident high-level decision and then a confident low-level decision than to make the combined decision in one step.

```
Figure 2: Two-stage routing pipeline.

[Query]
  │
  ▼
[Stage 1: Domain Router]
  │
  ├──► [Domain: Code] ──► [Stage 2: Code Intent Router]
  │                              │
  │                              ├──► [Generate Handler]
  │                              ├──► [Debug Handler]
  │                              └──► [Review Handler]
  │
  ├──► [Domain: Data] ──► [Stage 2: Data Task Router]
  │                              │
  │                              ├──► [Analysis Handler]
  │                              └──► [Visualization Handler]
  │
  └──► [Domain: General] ──► [General LLM Handler]
```

In Figure 2, the first stage makes a high-confidence domain decision. The second stage, within each domain, makes a more specific intent decision. Each stage only needs to distinguish between a small number of options, making calibration at each stage tractable.

### Designing Stage Boundaries

The central design question in multi-stage routing is where to draw the boundaries. Bad boundaries create routers that fight each other; good boundaries create clear, composable separation of concerns.

Stage boundaries should align with natural semantic clusters in your query space. If you find yourself drawing a boundary between two categories that are hard to distinguish semantically, the boundary is in the wrong place. The best multi-stage designs have first-stage decisions that are high-confidence and coarse (domain, capability area, interaction type) and later-stage decisions that apply only within a narrower context where distinctions are sharper.

Some heuristics for stage boundary design:

- If a routing decision changes what context you need to retrieve, it belongs in its own stage before retrieval.
- If a routing decision only makes sense given information from a previous stage, it belongs in a later stage.
- If two routes are semantically similar in isolation but clearly different given domain context, they belong in a later stage after domain routing.

> **Warning:** Multi-stage routing compounds errors. If stage 1 is 95% accurate and stage 2 is 95% accurate, and their errors are independent, the combined accuracy for a query that passes through both is approximately 90% (0.95 × 0.95 ≈ 0.90). Each stage adds a potential failure point. Design the minimum number of stages needed; don't add stages just because they seem like clean abstractions.
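
The arithmetic behind that warning generalizes to any number of stages. The short sketch below assumes each stage's errors are independent of the others, which is a simplification; the function name is illustrative.

```python
from math import prod

def pipeline_accuracy(stage_accuracies):
    """End-to-end accuracy when every stage errs independently."""
    return prod(stage_accuracies)

two_stage = pipeline_accuracy([0.95, 0.95])          # about 0.90
three_stage = pipeline_accuracy([0.95, 0.95, 0.95])  # about 0.86
```

Adding a third 95%-accurate stage costs roughly another four and a half points of end-to-end accuracy, so a new stage has to earn its place by making the remaining decisions meaningfully easier.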

### Implementation: A Composable Pipeline

The implementation challenge in multi-stage routing is composability — each stage should be self-contained and testable, but stages need to pass context to each other.

```python
from typing import Optional, Dict, Any, Callable
from dataclasses import dataclass, field

@dataclass
class RoutingContext:
    """Carries query and routing state through pipeline stages."""
    query: str
    stage_results: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)

    def record(self, stage_name: str, route: str, score: float) -> None:
        self.stage_results[stage_name] = {"route": route, "score": score}

    def get_route(self, stage_name: str) -> Optional[str]:
        result = self.stage_results.get(stage_name)
        return result["route"] if result else None

@dataclass
class RouterStage:
    """A single routing stage in a multi-stage pipeline."""
    name: str
    router: SemanticRouter
    threshold: float = 0.35
    fallback_route: Optional[str] = None

    def execute(self, context: RoutingContext) -> Optional[str]:
        route_name, score = self.router.route(context.query)

        if score < self.threshold:
            route_name = self.fallback_route
            score = 0.0

        context.record(self.name, route_name, score)
        return route_name

class MultiStageRouter:
    def __init__(self):
        self.stages: Dict[str, Dict[str, RouterStage]] = {}
        self.entry_stage: Optional[RouterStage] = None

    def set_entry(self, stage: RouterStage) -> None:
        self.entry_stage = stage

    def add_stage(self, parent_route: str, stage: RouterStage) -> None:
        if parent_route not in self.stages:
            self.stages[parent_route] = {}
        self.stages[parent_route][stage.name] = stage

    def route(self, query: str) -> RoutingContext:
        context = RoutingContext(query=query)

        if not self.entry_stage:
            raise ValueError("No entry stage configured")

        domain = self.entry_stage.execute(context)

        if domain and domain in self.stages:
            for stage_name, stage in self.stages[domain].items():
                stage.execute(context)

        return context
```

This implementation keeps routing context in a `RoutingContext` object that accumulates stage results as the query moves through the pipeline. Each stage records its decision, producing a full trace of the routing path.

```python
# Usage: building a two-stage code/data/general router
pipeline = MultiStageRouter()

# Build entry domain router
domain_router = SemanticRouter()
domain_router.add_route("code", handler=None, anchors=[
    "Programming, software development, code writing and review",
    "Functions, classes, algorithms, debugging code",
    "write a function", "fix this bug", "review my code",
    "how do I implement", "why does this crash",
])
domain_router.add_route("data", handler=None, anchors=[
    "Data analysis, statistics, data processing and transformation",
    "Datasets, DataFrames, SQL queries, aggregations",
    "analyze this dataset", "calculate the average", "plot this data",
    "SQL query for", "pandas DataFrame",
])
domain_router.add_route("general", handler=None, anchors=[
    "General questions, explanations, factual information",
    "explain this concept", "what is the difference between",
    "how does X work", "what should I know about",
])

pipeline.set_entry(RouterStage(
    name="domain",
    router=domain_router,
    threshold=0.35,
    fallback_route="general"
))

# Build code-specific intent router
code_intent_router = SemanticRouter()
code_intent_router.add_route("generate", handler=code_gen_handler, anchors=[
    "Writing new code, implementing features",
    "write", "implement", "create", "build", "generate code for",
])
code_intent_router.add_route("debug", handler=debug_handler, anchors=[
    "Fixing broken code, diagnosing errors and exceptions",
    "fix", "debug", "error", "crash", "not working", "exception",
])
code_intent_router.add_route("review", handler=review_handler, anchors=[
    "Reviewing existing code, code quality, improvements",
    "review", "improve", "better way", "is this good", "refactor",
])

pipeline.add_stage("code", RouterStage(
    name="code_intent",
    router=code_intent_router,
    threshold=0.30,
    fallback_route="generate"
))

# Route a query
context = pipeline.route("why does my function return None instead of the value")
print(context.stage_results)
# Example output (scores are illustrative and will vary by model):
# {'domain': {'route': 'code', 'score': 0.71},
#  'code_intent': {'route': 'debug', 'score': 0.68}}
```

### Conditional Routing: Using Previous Stage Results

Sometimes the routing decision at stage N should consider the routing results from stages 1 through N-1. A query about "performance" means different things in the code domain versus the data domain — the second stage can use the first stage's result to narrow the interpretation.

```python
class ContextAwareStage(RouterStage):
    """Stage that uses prior routing context in its decision."""

    def __init__(self, name: str, routers: Dict[str, SemanticRouter],
                 threshold: float = 0.35, fallback_route: Optional[str] = None):
        self.name = name
        self.routers = routers
        self.threshold = threshold
        self.fallback_route = fallback_route

    def execute(self, context: RoutingContext) -> Optional[str]:
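        # The most recently recorded stage is treated as the parent
        # (dicts preserve insertion order in Python 3.7+)
        parent_route = None
        if context.stage_results:
            parent_route = list(context.stage_results.values())[-1].get("route")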

        router = self.routers.get(parent_route, self.routers.get("default"))
        if not router:
            context.record(self.name, self.fallback_route, 0.0)
            return self.fallback_route

        route_name, score = router.route(context.query)
        if score < self.threshold:
            route_name = self.fallback_route
            score = 0.0

        context.record(self.name, route_name, score)
        return route_name
```

### Routing with Retrieved Context

A particularly powerful multi-stage pattern is routing before retrieval, then re-routing after retrieval. The first stage makes a coarse routing decision with only the query. A retrieval step fetches relevant context. The second stage uses both the query and retrieved context to make a refined routing decision.

This is more expensive, since you're doing retrieval before you know the final route, but it can substantially improve accuracy for queries whose intent is ambiguous without context. A query like "how does this handle authentication?" is nearly impossible to route correctly without knowing what "this" refers to. After retrieval, you have the document context, and routing becomes much clearer.

The decision of whether to use this pattern depends on whether your routing ambiguity is fundamentally about underspecified queries. If most misrouting comes from queries that are ambiguous in isolation but clear with context, retrieval-enhanced routing is worth the cost.
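
The route, retrieve, re-route flow can be sketched as a function that takes its three steps as plain callables. This is a minimal sketch under assumptions: the function name, the callable signatures, the `max_passages` limit, and the choice to concatenate query and passages before re-routing are all illustrative, and the stub functions stand in for real routers and a real retriever.

```python
from typing import Callable, List, Tuple

RouteFn = Callable[[str], Tuple[str, float]]   # text -> (route, score)
RetrieveFn = Callable[[str, str], List[str]]   # (query, coarse route) -> passages

def route_with_retrieval(query: str,
                         coarse_route: RouteFn,
                         retrieve: RetrieveFn,
                         fine_route: RouteFn,
                         max_passages: int = 2) -> Tuple[str, float]:
    """Coarse route on the query alone, then re-route with retrieved context."""
    coarse_name, _ = coarse_route(query)
    passages = retrieve(query, coarse_name)[:max_passages]
    # Second decision sees the query plus the retrieved context
    enriched = "\n".join([query, *passages])
    return fine_route(enriched)

# Toy stand-ins for real components (hypothetical behavior)
def coarse(text: str) -> Tuple[str, float]:
    return ("code", 0.6)

def retrieve(query: str, route: str) -> List[str]:
    return ["def login(user, token): verify_jwt(token)"]

def fine(text: str) -> Tuple[str, float]:
    if "verify_jwt" in text:
        return ("auth_explain", 0.8)
    return ("general_explain", 0.4)

result = route_with_retrieval("how does this handle authentication?",
                              coarse, retrieve, fine)
```

Passing the coarse route into `retrieve` lets the retriever scope its search to the right corpus, which keeps the extra retrieval step cheaper than searching everything.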

---

### Key Takeaways

1. Multi-stage routing decomposes complex decisions into sequences of simpler decisions — more accurate than single-stage routing when distinctions are nuanced.
2. Stage boundaries should align with natural semantic clusters — coarse-to-fine is the right progression.
3. Errors compound across stages; use the minimum number of stages that solves the problem.
4. The `RoutingContext` pattern — carrying state through stages — enables both context-aware routing and full decision traceability.
5. Retrieval-enhanced routing uses fetched context to resolve underspecified queries; use it when routing ambiguity is fundamentally about missing context, not missing signal.

---

> **Try This:** Map the routing decisions in a system you're working on to a tree structure: what's the first distinction that needs to be made, and what are the second-level distinctions within each branch? Count the depth of the deepest path. If it's more than 3, your routing taxonomy is probably too complex, and you should consolidate some handlers rather than adding more stages.

---

## Chapter 6: Routing for Code Queries

### Why Code Is Different

Routing code queries requires a different approach than routing natural language queries, for reasons that are partly about the nature of code and partly about how users phrase code-related requests.

Code itself contains dense semantic content in a highly formalized syntax. A function signature tells you more about what the function does than a paragraph of natural language description. But users don't always present code in their queries — they describe it. "Why isn't my async function returning the right value" contains no code at all. "How do I make this faster" might include a 50-line implementation. The routing system needs to handle both.

The intent taxonomy for code queries is also richer and more operationally significant than for general queries. The difference between "generate this code" and "debug this code" and "explain this code" and "review this code" maps directly to different handler behaviors — different prompts, different context retrieval strategies, different output formats. Getting intent routing right for code queries is higher-stakes than for general queries.

### The Code Query Taxonomy

A workable taxonomy for code query routing has four primary dimensions: **intent**, **language/framework**, **artifact type**, and **specificity**.

**Intent** is the action the user wants. The primary intents are: generate (create new code), debug (diagnose and fix broken code), explain (understand existing code), review (assess quality), optimize (improve performance), and convert (translate between languages or paradigms).

**Language and framework** determines which specialist tools or models to involve. A query about React hooks should invoke a handler that has current React documentation context. A query about Rust's borrow checker needs Rust-specific context. For systems with deep language-specific knowledge, routing on language is as important as routing on intent.

**Artifact type** indicates what kind of code entity the query is about: function, class, module, configuration file, test, documentation. This matters because different artifact types need different handling — a question about a test needs test-specific context and evaluation criteria; a question about a configuration file needs different docs than a question about a service class.

**Specificity** is the distinction between general questions ("how do async functions work in Python") and specific questions about a particular piece of code ("why does this specific async function not return"). Specific queries need retrieval of the actual code being discussed; general queries may not.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class CodeQueryMetadata:
    intent: str
    language: Optional[str]
    artifact_type: Optional[str]
    is_specific: bool  # True if query references specific code
    has_code_block: bool  # True if query contains embedded code

def analyze_code_query(query: str, router: SemanticRouter,
                        lang_detector: Callable[[str], Optional[str]]) -> CodeQueryMetadata:
    """
    Extract routing metadata from a code query.
    """
    intent, _ = router.route(query)  # the score is not needed here
    language = lang_detector(query)
    lowered = query.lower()

    # Either an embedded code block or a phrase pointing at specific code
    has_code = "```" in query or any(
        marker in lowered
        for marker in ["this function", "this code", "my function", "the following"]
    )
    is_specific = has_code or any(
        phrase in lowered
        for phrase in ["this file", "in my", "our codebase", "the implementation"]
    )

    artifact_keywords = {
        "function": ["function", "method", "def ", "func "],
        "class": ["class", "object", "instance"],
        "test": ["test", "spec", "unittest", "pytest", "jest"],
        "config": ["config", "configuration", "yaml", "json", "env"],
        "module": ["module", "package", "import", "library"],
    }
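    # First matching artifact wins. Substring matching is crude: "spec" also
    # matches "specific" and "class" matches "classify", so treat the result
    # as a hint rather than a reliable label.
    artifact_type = None
    for artifact, keywords in artifact_keywords.items():
        if any(kw in query.lower() for kw in keywords):
            artifact_type = artifact
            break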

    return CodeQueryMetadata(
        intent=intent,
        language=language,
        artifact_type=artifact_type,
        is_specific=is_specific,
        has_code_block="```" in query
    )
```

### Embedding Models for Code

General-purpose text embedding models work for routing natural language descriptions of code tasks. For queries that include actual code, or for systems doing deep code-semantic routing, code-specialized embedding models typically produce better results.

Models like `microsoft/codebert-base` or `Salesforce/codet5-base` are trained on code corpora, and their embeddings capture structural and semantic properties of code that general text models tend to miss. A function that sorts a list in ascending order and one that sorts in descending order look nearly identical to a general text model; a code-trained model is better placed to represent them as similar in purpose but different in behavior. Both are pretrained checkpoints rather than models tuned specifically for similarity search, so validate embedding quality on your own queries before relying on it.

For routing purposes, the practical guidance is:

- If routing decisions turn primarily on natural language (user is describing what they want, not including code), a general-purpose embedding model is adequate.
- If routing decisions turn on code properties (what a code snippet does, how it's structured, what patterns it uses), use a code-specialized model.
- For hybrid queries — natural language description plus embedded code — consider two-stage embedding that handles each modality separately and combines the signals.

```python
from typing import List

import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel

class CodeAwareRouter:
    """
    Router that uses separate models for text and code content.
    """
    def __init__(self, text_model: str = "all-MiniLM-L6-v2",
                 code_model: str = "microsoft/codebert-base"):
        from sentence_transformers import SentenceTransformer
        self.text_embedder = SentenceTransformer(text_model)
        self.code_tokenizer = AutoTokenizer.from_pretrained(code_model)
        self.code_model = AutoModel.from_pretrained(code_model)
        self.routes: List[Route] = []

    def _extract_code_blocks(self, text: str) -> tuple[str, List[str]]:
        """Separate natural language from code blocks."""
        import re
        code_blocks = re.findall(r'```[\w]*\n?(.*?)```', text, re.DOTALL)
        clean_text = re.sub(r'```[\w]*\n?.*?```', '[CODE]', text, flags=re.DOTALL)
        return clean_text, code_blocks

    def _embed_code(self, code: str) -> np.ndarray:
        inputs = self.code_tokenizer(code, return_tensors="pt",
                                      max_length=512, truncation=True)
        with torch.no_grad():
            outputs = self.code_model(**inputs)
        return outputs.last_hidden_state[:, 0, :].numpy()[0]

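    def embed_query(self, query: str) -> np.ndarray:
        text_part, code_blocks = self._extract_code_blocks(query)
        text_emb = self.text_embedder.encode(text_part, normalize_embeddings=True)

        # The two models emit different dimensions (384 for all-MiniLM-L6-v2,
        # 768 for CodeBERT), so the vectors cannot be summed. Concatenate them
        # instead; a text-only query gets a zero code segment so every query
        # has the same length. Route anchors must be embedded the same way.
        if code_blocks:
            code_embs = [self._embed_code(block) for block in code_blocks]
            code_emb = np.mean(code_embs, axis=0)
            code_emb = code_emb / (np.linalg.norm(code_emb) + 1e-8)
        else:
            code_emb = np.zeros(self.code_model.config.hidden_size,
                                dtype=text_emb.dtype)

        # Square-root weights: when both vectors carry code, their dot product
        # is 0.6 * text similarity + 0.4 * code similarity.
        combined = np.concatenate([np.sqrt(0.6) * text_emb,
                                   np.sqrt(0.4) * code_emb])
        return combined / (np.linalg.norm(combined) + 1e-8)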
```

### Language and Framework Detection

For systems with language-specific handlers, language detection is a critical pre-routing step. For query-level language detection (detecting which programming language a query is about), lightweight approaches work well:

```python
LANGUAGE_SIGNATURES = {
    "python": ["python", "pip", "django", "flask", "pandas", "numpy",
               "def ", ".py", "pytest", "asyncio", "pydantic", "fastapi"],
    "javascript": ["javascript", "js", "node", "npm", "react", "vue",
                   "typescript", "webpack", "eslint", "const ", "let ",
                   "=>", "async/await"],
    "go": ["golang", " go ", "goroutine", "channel", "interface{}",
           "go mod", "gofmt", ":=", "func "],
    # Signatures are compared against the lowercased query, so keep them lowercase.
    "rust": ["rust", "cargo", "crates.io", "borrow checker", "lifetime",
             "ownership", "result<", "option<", "impl "],
    "sql": ["sql", "select ", "from ", "where ", "join", "database",
            "postgres", "mysql", "sqlite", "query"],
}

def detect_language(query: str) -> Optional[str]:
    query_lower = query.lower()
    scores = {}

    for lang, signatures in LANGUAGE_SIGNATURES.items():
        score = sum(1 for sig in signatures if sig in query_lower)
        if score > 0:
            scores[lang] = score

    if not scores:
        return None
    return max(scores, key=scores.get)
```

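This is intentionally simple. Keyword matching is appropriate for language detection because programming language names and syntax markers are reliable categorical signals, not the kind of conceptual signal that requires semantic routing. Signatures are matched as substrings, so short or generic entries (`js` inside `json`, `query` in almost any question) can inflate a score; tighten them if they cause misdetections.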
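A minimal sketch of a stricter variant that matches signatures on word boundaries, so that `js` does not fire inside `json`. The trimmed `SIGNATURES` table and the name `detect_language_strict` are illustrative assumptions, not part of the detector above, and punctuation-based signatures such as `:=` would need separate handling.

```python
import re
from typing import Dict, List, Optional

# Hypothetical, trimmed-down signature table for illustration only.
SIGNATURES: Dict[str, List[str]] = {
    "python": ["python", "pip", "pytest", "django"],
    "javascript": ["javascript", "js", "npm", "react"],
    "sql": ["sql", "postgres", "mysql"],
}

def _compile(signatures: List[str]) -> re.Pattern:
    # \b on both sides keeps "js" from matching inside "json".
    alternatives = "|".join(re.escape(sig) for sig in signatures)
    return re.compile(rf"\b(?:{alternatives})\b", re.IGNORECASE)

PATTERNS = {lang: _compile(sigs) for lang, sigs in SIGNATURES.items()}

def detect_language_strict(query: str) -> Optional[str]:
    # Score each language by the number of distinct signatures found.
    scores = {
        lang: len({match.lower() for match in pattern.findall(query)})
        for lang, pattern in PATTERNS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None
```

Counting distinct signatures rather than raw occurrences keeps one repeated word from dominating the score.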

### Context Retrieval Routing for Code

One of the highest-value routing decisions in a code system is whether to retrieve context at all, and if so, from what. A general query needs no retrieval; a query about specific code needs the specific code; a query about a framework needs framework documentation.

```python
from enum import Enum

class RetrievalStrategy(Enum):
    NONE = "none"         # General question, answer from model knowledge
    CODEBASE = "codebase" # Retrieve from indexed codebase
    DOCS = "docs"         # Retrieve from documentation
    BOTH = "both"         # Retrieve from both, merge context

def route_retrieval_strategy(query: str,
                               metadata: CodeQueryMetadata,
                               router: SemanticRouter) -> RetrievalStrategy:
    """
    Determine retrieval strategy based on query and metadata.
    """
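    if not metadata.is_specific and not metadata.has_code_block:
        # compute_general_score is a helper (not shown) that scores the query
        # against anchors for general, codebase-independent questions.
        general_anchors_score = compute_general_score(query, router)
        if general_anchors_score > 0.65:
            return RetrievalStrategy.NONE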

    if metadata.is_specific or metadata.has_code_block:
        if metadata.intent in ("debug", "review", "explain", "optimize"):
            return RetrievalStrategy.CODEBASE

    if metadata.language and not metadata.is_specific:
        if metadata.intent in ("generate", "explain"):
            return RetrievalStrategy.DOCS

    if metadata.is_specific and metadata.language:
        if metadata.intent == "generate":
            return RetrievalStrategy.BOTH

    return RetrievalStrategy.CODEBASE  # Default for code queries
```

This retrieval routing happens before the main handler dispatch and shapes what context is assembled for the handler. The routing decision and the retrieval decision are separate, even if they're related — routing determines which handler processes the query, retrieval determines what context that handler receives.
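The `compute_general_score` helper called in `route_retrieval_strategy` is not defined in this chapter. One plausible shape for it, sketched under assumptions: it takes an `embed` callable instead of the router so it can run in isolation (with the router above, that callable would be something like `router.model.encode`), and the `GENERAL_ANCHORS` phrases are hypothetical placeholders you would replace with your own.

```python
from typing import Callable, Sequence

import numpy as np

# Hypothetical anchors describing general, codebase-independent questions.
GENERAL_ANCHORS = [
    "how does this language feature work in general",
    "what is the difference between two programming concepts",
    "explain a common design pattern",
]

def compute_general_score(query: str,
                          embed: Callable[[str], np.ndarray],
                          anchors: Sequence[str] = GENERAL_ANCHORS) -> float:
    """Max cosine similarity between the query and the general-question anchors."""
    def unit(vector: np.ndarray) -> np.ndarray:
        vector = np.asarray(vector, dtype=float)
        return vector / (np.linalg.norm(vector) + 1e-8)

    query_vec = unit(embed(query))
    anchor_matrix = np.stack([unit(embed(anchor)) for anchor in anchors])
    return float((anchor_matrix @ query_vec).max())
```

In practice the anchor embeddings would be computed once and cached rather than re-embedded on every call.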

---

### Key Takeaways

1. Code queries need richer routing metadata than general queries: intent, language, artifact type, and specificity all matter for handler selection.
2. Code-specialized embedding models tend to outperform general models when queries contain actual code. Use hybrid embedding for mixed text-and-code queries.
3. Language detection can use lightweight keyword matching — programming languages have reliable syntactic signatures that don't require semantic similarity.
4. Retrieval strategy routing (whether and where to retrieve context) is as important as handler routing for code systems.
5. The four primary code intents — generate, debug, explain, review — map to fundamentally different handler behaviors and should be routed distinctly.

---

> **Try This:** Collect 50 code-related queries from your system's logs and manually label them on the four dimensions: intent, language, artifact type, and specificity. Run them through your current routing system and count how many routing decisions would change if all four dimensions were available. That gap represents the improvement available from richer code query routing.

---

## Chapter 7: Fallback Strategies and Graceful Degradation

### The Query Your Router Can't Route

Every routing system, no matter how well-calibrated, will encounter queries it can't confidently route. Some queries are genuinely ambiguous. Some fall between defined route categories. Some are unlike anything the routing system was designed to handle. The question is what happens next.

Fallback handling is what separates production routing systems from demos. In a demo, every query is cherry-picked to route cleanly. In production, there's always a tail of difficult, ambiguous, and unexpected queries. How the system handles that tail determines user experience for a non-trivial fraction of real traffic.

There are three types of unroutable queries, and they require different fallback strategies.

**Ambiguous queries** have high similarity to multiple routes and the router can't distinguish between them. The score is reasonable, but the gap between the top two routes is too small to be confident. These queries might belong to either route — the information in the query simply doesn't resolve the ambiguity.

**Out-of-scope queries** have low similarity to all routes. The query is about something the routing system doesn't cover. No route is a good match — the user is asking about something outside the handled domain.

**Malformed or adversarial queries** are structurally unusual — very short ("hi"), very long without meaningful content, or deliberately crafted to confuse the routing layer. These are less common but need handling.
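Malformed queries can often be screened with cheap structural checks before any embedding is computed. The sketch below is one way to do that; the function name `screen_query` and all three thresholds are assumptions chosen for illustration, and it does not attempt to detect deliberately adversarial input.

```python
from typing import Optional

def screen_query(query: str,
                 min_chars: int = 3,
                 max_chars: int = 4000,
                 min_alnum_ratio: float = 0.3) -> Optional[str]:
    """Return a reason string if the query should skip semantic routing, else None."""
    text = query.strip()
    if len(text) < min_chars:
        return "too_short"
    if len(text) > max_chars:
        return "too_long"
    # Mostly punctuation or symbols: little meaningful content to embed.
    alnum_count = sum(ch.isalnum() for ch in text)
    if alnum_count / len(text) < min_alnum_ratio:
        return "low_content"
    return None
```

A non-`None` result can be sent straight to a greeting, clarification, or rejection response without spending an embedding call on it.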

### The Tiered Fallback Architecture

The most robust fallback architecture has multiple tiers, with each tier providing a different quality/cost tradeoff:

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Callable, List, Optional

class FallbackTier(Enum):
    CLARIFICATION = "clarification"   # Ask user for more information
    GENERAL_HANDLER = "general"       # Send to a general-purpose handler
    BEST_EFFORT = "best_effort"       # Route to closest match, flag uncertainty
    REJECTION = "rejection"           # Inform user the query can't be handled

@dataclass
class FallbackConfig:
    ambiguous_gap_threshold: float = 0.1   # Top-two gaps below this are treated as ambiguous
    low_confidence_threshold: float = 0.30  # Min score for any routing attempt
    rejection_threshold: float = 0.20       # Below this, inform user

class FallbackRouter:
    def __init__(self, base_router: SemanticRouter,
                 general_handler: Callable,
                 clarification_handler: Callable,
                 rejection_message: str,
                 config: Optional[FallbackConfig] = None):
        self.router = base_router
        self.general_handler = general_handler
        self.clarification_handler = clarification_handler
        self.rejection_message = rejection_message
        self.config = config or FallbackConfig()

    def dispatch(self, query: str) -> dict:
        scores = self._get_all_scores(query)
        best_route, best_score = scores[0]
        second_score = scores[1][1] if len(scores) > 1 else 0.0
        gap = best_score - second_score

        result = {
            "query": query,
            "routed_to": None,
            "score": best_score,
            "gap": gap,
            "fallback_tier": None,
            "response": None
        }

        # Tier 0: Rejection — query too dissimilar to all handlers
        if best_score < self.config.rejection_threshold:
            result["fallback_tier"] = FallbackTier.REJECTION
            result["response"] = self.rejection_message
            return result

        # Tier 1: General handler — low but non-trivial confidence
        if best_score < self.config.low_confidence_threshold:
            result["fallback_tier"] = FallbackTier.GENERAL_HANDLER
            result["routed_to"] = "general"
            result["response"] = self.general_handler(query)
            return result

        # Tier 2: Ambiguous — good scores but too close to distinguish
        if gap < self.config.ambiguous_gap_threshold:
            result["fallback_tier"] = FallbackTier.CLARIFICATION
            result["response"] = self.clarification_handler(query, scores[:2])
            return result

        # Tier 3: Normal routing with score
        route = next(r for r in self.router.routes if r.name == best_route)
        result["routed_to"] = best_route
        result["response"] = route.handler(query)
        return result

    def _get_all_scores(self, query: str) -> List[tuple]:
        query_emb = self.router.model.encode(query, normalize_embeddings=True)
        scores = []
        for route in self.router.routes:
            s = float((query_emb @ route.anchor_embeddings.T).max())
            scores.append((route.name, s))
        return sorted(scores, key=lambda x: x[1], reverse=True)
```

### Clarification as a Routing Strategy

When a query is genuinely ambiguous between two routes, the most honest fallback is to ask for clarification rather than guess. This isn't always appropriate — high-volume systems can't add a clarification round-trip to every ambiguous query — but for conversational systems, it's often the best user experience.

Effective clarification prompts don't ask the user to re-phrase their query. They ask a specific question that resolves the ambiguity the router is facing.

```python
from typing import Dict, List

def generate_clarification(query: str,
                             top_routes: List[tuple],
                             route_descriptions: Dict[str, str]) -> str:
    """
    Generate a targeted clarification question based on ambiguous routing.

    top_routes: [(route_name, score), ...] for top 2 routes
    route_descriptions: human-readable descriptions of each route
    """
    route_a, score_a = top_routes[0]
    route_b, score_b = top_routes[1]

    desc_a = route_descriptions.get(route_a, route_a)
    desc_b = route_descriptions.get(route_b, route_b)

    return (f"Your question could be about {desc_a} or {desc_b}. "
            f"Could you clarify which you're asking about?")

# Route descriptions for user-facing clarification
ROUTE_DESCRIPTIONS = {
    "code_generation": "writing new code or implementing a feature",
    "code_debugging": "diagnosing a bug or error in existing code",
    "code_review": "assessing the quality of existing code",
    "internal_docs": "our internal APIs and documentation",
    "general_coding": "general programming concepts and best practices",
}
```

The generated clarification is targeted — it tells the user exactly what ambiguity the system faced and asks for the minimum information needed to resolve it. "Could you clarify what you're asking?" is useless; "Are you asking how to write this code, or why existing code is failing?" is useful.
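For route pairs that are confused often, a hand-written question usually reads better than a template. A small sketch of that idea, assuming a lookup keyed by the unordered pair of route names; `PAIR_QUESTIONS` and `clarification_for` are hypothetical names, and the generic template is used only when no specific question exists.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical hand-written questions for frequently confused route pairs.
PAIR_QUESTIONS: Dict[FrozenSet[str], str] = {
    frozenset({"code_generation", "code_debugging"}):
        "Are you asking how to write this code, or why existing code is failing?",
}

def clarification_for(top_routes: List[Tuple[str, float]],
                      descriptions: Dict[str, str]) -> str:
    (route_a, _), (route_b, _) = top_routes[:2]
    # frozenset makes the lookup independent of which route scored higher.
    question = PAIR_QUESTIONS.get(frozenset({route_a, route_b}))
    if question:
        return question
    desc_a = descriptions.get(route_a, route_a)
    desc_b = descriptions.get(route_b, route_b)
    return f"Is your question about {desc_a}, or about {desc_b}?"
```

Production logs of clarification events are a natural source for deciding which pairs deserve their own question.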

### Best-Effort Routing with Uncertainty Flags

Sometimes a fallback isn't the right choice even when confidence is low. If the best-effort routing result is likely to be useful even if it's not perfect, routing with an uncertainty flag is better than refusing to route.

This is particularly true for informational queries where an imperfect but relevant answer beats no answer, and for high-volume systems where the cost of clarification round-trips is too high to apply broadly.

```python
def route_with_uncertainty(router: SemanticRouter,
                            query: str,
                            confidence_threshold: float = 0.45,
                            uncertainty_flag_threshold: float = 0.55) -> dict:
    """
    Route with uncertainty flagging for downstream handlers.

    Returns routing result with uncertainty metadata for handler use.
    """
    route_name, score = router.route(query)

    routing_quality = "high" if score >= uncertainty_flag_threshold else \
                      "medium" if score >= confidence_threshold else "low"

    return {
        "route": route_name,
        "score": score,
        "quality": routing_quality,
        "flagged": routing_quality in ("medium", "low"),
        "query": query,
    }
```

Downstream handlers can use the `flagged` field to modify their behavior: a flagged response might include a disclaimer ("I'm interpreting this as a question about X. Let me know if that's not what you meant"), responses below a quality gate might be held for review, and in high-stakes systems flagged responses might be routed to a human queue.
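As an illustration of consuming that metadata, the sketch below wraps a handler call and adapts the response. It assumes a routing dict with the `route`, `quality`, `flagged`, and `query` keys returned by `route_with_uncertainty`; the `handlers` and `topics` mappings and the `needs_review` output field are hypothetical.

```python
from typing import Callable, Dict

def respond_with_uncertainty(routing: dict,
                             handlers: Dict[str, Callable[[str], str]],
                             topics: Dict[str, str]) -> dict:
    answer = handlers[routing["route"]](routing["query"])
    # Hold only the lowest-confidence responses for review.
    needs_review = routing["quality"] == "low"
    if routing["flagged"]:
        topic = topics.get(routing["route"], routing["route"])
        answer = (f"I'm interpreting this as a question about {topic}. "
                  f"Let me know if that's not what you meant.\n\n{answer}")
    return {"response": answer, "needs_review": needs_review}
```

Whether a disclaimer, a review hold, or both is appropriate depends on how costly a wrong answer is in your system.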

> **Key Insight:** Uncertainty propagation — passing the routing confidence downstream to handlers — is more valuable than trying to resolve all uncertainty at the routing layer. Some ambiguity can only be resolved by the handler, which has access to more context. Let it.

### Degradation Patterns for Component Failures

Routing systems operate inside larger systems, and larger systems fail. An embedding model can be unavailable. A handler can be down. A vector index can be slow. Graceful degradation means the system continues to function in a degraded but acceptable state when components fail.

For routing specifically:

**Embedding model unavailable:** Fall back to keyword-based routing on a predefined rule set. This is less accurate but it's faster and doesn't depend on the embedding model. The rule set should cover the most common query patterns.

```python
import re

KEYWORD_FALLBACK_RULES = [
    # (pattern, route)
    (re.compile(r'\b(fix|bug|error|exception|crash|failing)\b', re.I), 'debug'),
    (re.compile(r'\b(write|implement|create|generate|build)\b', re.I), 'generate'),
    (re.compile(r'\b(review|improve|better|quality|refactor)\b', re.I), 'review'),
    (re.compile(r'\b(explain|what is|how does|understand)\b', re.I), 'explain'),
]

def keyword_fallback_route(query: str) -> str:
    for pattern, route in KEYWORD_FALLBACK_RULES:
        if pattern.search(query):
            return route
    return "general"
```
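The switch from semantic to keyword routing can be wrapped in a single function so callers never see the embedding failure. A minimal sketch, assuming the semantic router is passed in as a callable returning `(route, score)` and the keyword router as a callable returning a route name; the function name and the `degraded` field are illustrative.

```python
from typing import Callable, Tuple

def route_with_degradation(query: str,
                           semantic_route: Callable[[str], Tuple[str, float]],
                           keyword_route: Callable[[str], str]) -> dict:
    try:
        route, score = semantic_route(query)
        return {"route": route, "score": score, "degraded": False}
    except Exception as exc:
        # Broad catch is deliberate in this sketch: any embedding failure
        # (timeout, out-of-memory, model not loaded) triggers the fallback.
        return {
            "route": keyword_route(query),
            "score": None,
            "degraded": True,
            "error": type(exc).__name__,
        }
```

Recording `degraded` on every decision makes it easy to see afterwards how much traffic was routed without embeddings.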

**Handler unavailable:** Route to an alternative handler or fall back to the general handler. Record which queries were affected so they can be re-processed when the primary handler recovers.
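A sketch of that failover, assuming handlers are plain callables and that an in-memory list is an acceptable stand-in for whatever durable store you would use for replay; `dispatch_with_failover` and `replay_log` are hypothetical names.

```python
import time
from typing import Callable, List

def dispatch_with_failover(query: str,
                           primary: Callable[[str], str],
                           fallback: Callable[[str], str],
                           replay_log: List[dict]) -> str:
    try:
        return primary(query)
    except Exception as exc:
        # Record the affected query so it can be re-processed once the
        # primary handler recovers, then answer from the fallback.
        replay_log.append({
            "query": query,
            "error": repr(exc),
            "timestamp": time.time(),
        })
        return fallback(query)
```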

**Circuit breaker pattern:** When a handler is repeatedly failing, stop routing to it temporarily and log the degradation. This prevents cascading failures where a failing handler absorbs queries that could otherwise succeed with an alternative.

```python
import time
from collections import deque
from typing import Any, Callable

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 5,
                 reset_timeout: float = 60.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = deque(maxlen=failure_threshold)
        self.last_failure_time = 0.0
        self.state = "closed"  # closed=healthy, open=failed, half-open=testing

    def call(self, func: Callable, *args, **kwargs) -> Any:
        if self.state == "open":
            if time.time() - self.last_failure_time > self.reset_timeout:
                self.state = "half-open"
            else:
                raise RuntimeError("Circuit open — handler unavailable")

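        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures.append(time.time())
            self.last_failure_time = time.time()
            # A failed probe in half-open reopens immediately; otherwise open
            # once the consecutive-failure count reaches the threshold.
            if (self.state == "half-open"
                    or len(self.failures) >= self.failure_threshold):
                self.state = "open"
            raise
        # Any success resets the count, so only consecutive failures trip
        # the breaker rather than failures accumulated over its lifetime.
        self.failures.clear()
        self.state = "closed"
        return result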
```

> **Warning:** Don't implement fallback paths and then never test them. The failure modes that trigger fallbacks are, by definition, unusual — which means they're also the modes least likely to be caught in normal testing. Schedule periodic drills where you intentionally disable components to verify that fallback paths work as expected.

---

### Key Takeaways

1. Every production routing system encounters queries it can't route — design fallback paths before you need them.
2. Three types of unroutable queries require different fallback strategies: ambiguous queries, out-of-scope queries, and malformed queries.
3. Tiered fallbacks — clarification, general handler, best-effort, rejection — give you graduated responses proportional to query difficulty.
4. Uncertainty propagation passes routing confidence to handlers, enabling downstream adaptation to low-confidence routing decisions.
5. Graceful degradation for component failures requires pre-built fallback paths and circuit breakers — test them deliberately before they're needed in production.

---

> **Try This:** Identify the three most common categories of queries your system currently handles poorly. For each, determine whether the issue is routing (wrong handler selected), confidence (correct handler but borderline score), or something post-routing (handler receives the query correctly but produces a poor response). This diagnosis determines whether fallback improvements help, or whether the real fix is elsewhere.

---

## Chapter 8: Evaluation and Monitoring

### What Success Means for a Routing System

Routing systems are infrastructure, and infrastructure needs monitoring. Unlike application features where user satisfaction is a direct signal, routing layer quality is often invisible to users — they see the response from the handler, not the routing decision that sent the query there. This makes routing monitoring both more important (failures are harder to attribute) and more difficult (you don't have direct user feedback on routing quality).

Defining what success means for a routing system before building the monitoring is essential. There are three distinct definitions, and they're related but not the same.

**Routing accuracy** is the fraction of queries sent to the correct handler. This is the direct measure of routing quality, but it requires labeled ground truth — you need to know what the correct handler was for each query, which requires either human labeling or a proxy signal.

**Routing coverage** is the fraction of queries that receive a confident routing decision (above threshold) rather than falling to a fallback. High coverage means the routing system is handling most queries; low coverage means too many queries are falling through, either because thresholds are too strict or because the routing categories don't cover the actual query distribution.

**Downstream quality** is the quality of responses produced by handlers after routing. This is the outcome that actually matters to users. Routing accuracy is instrumentally important because it drives downstream quality, but downstream quality is what you ultimately optimize for.

A routing system can have high accuracy and low downstream quality (handlers are getting the right queries but producing poor responses); high downstream quality but low accuracy (handlers are forgiving enough that misrouted queries still produce decent responses); or any combination. Monitoring all three, and understanding the relationships between them, gives you the full picture.
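The first two definitions can be computed directly from routing logs. A minimal sketch, assuming each log record is a dict with a `score`, a `predicted` route, and an optional `expected` label; these field names are illustrative. Downstream quality needs a separate signal, such as ratings or task success, and is not computed here.

```python
from typing import List

def routing_metrics(records: List[dict], threshold: float) -> dict:
    total = len(records)
    if total == 0:
        return {"coverage": None, "accuracy_on_covered": None}
    # Coverage: share of queries routed with a score at or above threshold.
    covered = [r for r in records if r["score"] >= threshold]
    # Accuracy is only measurable where a ground-truth label exists.
    labeled = [r for r in covered if r.get("expected") is not None]
    correct = sum(r["predicted"] == r["expected"] for r in labeled)
    return {
        "coverage": len(covered) / total,
        "accuracy_on_covered": correct / len(labeled) if labeled else None,
    }
```

Reporting accuracy only over covered queries keeps the two metrics separate: tightening the threshold typically raises accuracy on what is routed while lowering coverage.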

### Building a Routing Evaluation Dataset

The foundation of routing evaluation is a labeled dataset of queries with correct routes. Building this is unglamorous but essential.

There are three practical sources for routing evaluation data:

**Historical queries with known correct routes.** If your system logs which handler ultimately produced a satisfactory response, those logs are a source of routing ground truth. This requires some post-processing to filter out cases where the response quality was poor regardless of routing.

**Manual labeling.** A human reads each query and assigns the correct route. This is expensive but produces the highest quality ground truth. For a routing system with fewer than 10 routes, 50 to 100 labeled queries per route is a workable evaluation set.

**LLM-assisted labeling.** Use a capable LLM to label queries at scale, then spot-check a sample for quality. This isn't as reliable as human labeling, but it scales better. The prompt should include detailed descriptions of each route and worked examples.

```python
def build_evaluation_dataset(queries: List[str],
                              route_descriptions: Dict[str, str],
                              labeler_llm: callable,
                              sample_for_review: int = 50) -> List[dict]:
    """
    Build routing evaluation dataset using LLM-assisted labeling.
    """
    labeled = []

    for query in queries:
        routes_text = "\n".join(
            f"- {name}: {desc}"
            for name, desc in route_descriptions.items()
        )

        prompt = f"""Given this query: "{query}"

And these routing options:
{routes_text}

Which single route should this query be sent to? Reply with just the route name.
If none apply, reply "none"."""

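        label = labeler_llm(prompt).strip().lower()
        # Flag anything that is not a known route (or "none") for human review
        # instead of silently keeping a malformed or hallucinated label.
        valid_labels = {name.lower() for name in route_descriptions} | {"none"}

        labeled.append({
            "query": query,
            "assigned_route": label,
            "labeling_method": "llm",
            "needs_review": label not in valid_labels
        })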

    import random
    review_sample = random.sample(labeled, min(sample_for_review, len(labeled)))
    for item in review_sample:
        item["needs_review"] = True

    return labeled
```

### Offline Evaluation Metrics

With a labeled dataset, you can compute a full suite of routing quality metrics.

```python
from typing import List

import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

def evaluate_routing(router: SemanticRouter,
                     dataset: List[dict],
                     threshold: float = 0.35) -> dict:
    """
    Comprehensive routing evaluation against labeled dataset.
    """
    results = []

    for item in dataset:
        query = item["query"]
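        true_route = item["assigned_route"]
        # The labeling prompt uses "none" for out-of-scope queries; treat it
        # as the fallback class so correct fallbacks count as correct.
        if true_route == "none":
            true_route = "fallback"
        predicted_route, score = router.route(query)

        if score < threshold:
            predicted_route = "fallback"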

        results.append({
            "query": query,
            "true_route": true_route,
            "predicted_route": predicted_route,
            "score": score,
            "correct": predicted_route == true_route,
        })

    df = pd.DataFrame(results)
    accuracy = df["correct"].mean()

    report = classification_report(
        df["true_route"],
        df["predicted_route"],
        output_dict=True
    )

    score_stats = {
        "mean": df["score"].mean(),
        "std": df["score"].std(),
        "p25": df["score"].quantile(0.25),
        "p50": df["score"].quantile(0.50),
        "p75": df["score"].quantile(0.75),
        "below_threshold": (df["score"] < threshold).mean(),
    }

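    # Include predicted-only labels (such as "fallback") so their columns
    # are not dropped from the confusion matrix.
    routes = sorted(set(df["true_route"]) | set(df["predicted_route"]))
    cm = confusion_matrix(df["true_route"], df["predicted_route"],
                           labels=routes)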

    return {
        "accuracy": accuracy,
        "per_route": report,
        "score_stats": score_stats,
        "confusion_matrix": cm.tolist(),
        "routes": routes,
    }
```

The confusion matrix is particularly valuable for routing evaluation. It shows you exactly which routes are confused with each other — which directly informs anchor refinement. If route A is frequently misclassified as route B, the anchor spaces for A and B overlap, and you need to either differentiate the anchors or reconsider whether A and B should be separate routes.
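Reading the matrix by eye gets harder as the number of routes grows. The sketch below pulls out the most frequent off-diagonal cells, assuming the `confusion_matrix` and `routes` values returned by `evaluate_routing` (rows are true routes, columns are predicted routes, as in scikit-learn); `top_confusions` is a hypothetical helper name.

```python
from typing import List, Tuple

import numpy as np

def top_confusions(cm: List[List[int]],
                   routes: List[str],
                   k: int = 3) -> List[Tuple[str, str, int]]:
    """Return up to k (true_route, predicted_route, count) tuples, largest first."""
    matrix = np.array(cm)
    pairs = []
    for i, true_route in enumerate(routes):
        for j, predicted_route in enumerate(routes):
            # Off-diagonal cells are the misroutes.
            if i != j and matrix[i, j] > 0:
                pairs.append((true_route, predicted_route, int(matrix[i, j])))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)[:k]
```

Each returned pair points at two anchor sets worth comparing side by side.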

### Production Monitoring

Offline evaluation tells you how the router performs on the evaluation dataset. Production monitoring tells you how it performs on real traffic, which is the only distribution that actually matters.

The minimum monitoring set for a production routing system:

**Routing decision distribution:** The fraction of queries routed to each handler over time. Sudden shifts in this distribution are often a signal of query distribution change — the topics users are asking about have shifted, which may mean your routing categories are no longer covering the distribution well.

**Score distribution:** The distribution of routing scores over time. Shifts toward lower scores indicate the routing system is becoming less confident — possibly because the query distribution is drifting away from the anchor space.

**Fallback rate:** The fraction of queries falling to fallback handlers. A rising fallback rate is an early signal of routing system degradation.

**Gap distribution:** The distribution of score gaps (best score minus second-best score). Shrinking gaps mean more ambiguous queries: either the routing categories are becoming less distinct or the incoming queries are becoming less clear.

```python
import time
from collections import defaultdict, deque
from threading import Lock

import numpy as np

class RoutingMonitor:
    """
    Thread-safe real-time monitoring for production routing.
    """
    def __init__(self, window_minutes: int = 60):
        self.window = window_minutes * 60  # seconds
        self.events: deque = deque()
        self.lock = Lock()

    def record(self, route: str, score: float, gap: float,
               is_fallback: bool) -> None:
        with self.lock:
            self.events.append({
                "timestamp": time.time(),
                "route": route,
                "score": score,
                "gap": gap,
                "is_fallback": is_fallback,
            })
            self._prune()

    def _prune(self) -> None:
        cutoff = time.time() - self.window
        while self.events and self.events[0]["timestamp"] < cutoff:
            self.events.popleft()

    def summary(self) -> dict:
        with self.lock:
            # Drop stale events even if nothing was recorded recently.
            self._prune()
            if not self.events:
                return {}
            events = list(self.events)

        scores = [e["score"] for e in events]
        gaps = [e["gap"] for e in events]
        fallback_count = sum(1 for e in events if e["is_fallback"])

        route_counts = defaultdict(int)
        for e in events:
            route_counts[e["route"]] += 1

        total = len(events)
        return {
            "total_queries": total,
            "fallback_rate": fallback_count / total,
            "route_distribution": {k: v/total for k, v in route_counts.items()},
            "score_mean": np.mean(scores),
            "score_p25": np.percentile(scores, 25),
            "score_p75": np.percentile(scores, 75),
            "gap_mean": np.mean(gaps),
            "gap_p25": np.percentile(gaps, 25),
        }
```

### Alerting Thresholds

Define alerting thresholds for routing health metrics and treat routing degradation as an infrastructure alert, not a background concern.

Reasonable starting thresholds (adjust based on your baseline):

- **Fallback rate > 20%:** routing is not covering the incoming query distribution
- **Mean score < 0.40:** routing confidence is generally low — anchor quality or distribution shift
- **Mean gap < 0.08:** too many ambiguous routing decisions — route boundaries may be collapsing
- **Any single route > 60% of traffic:** unexpected concentration — possible query distribution shift or anchor dominance
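These thresholds can be checked mechanically against a monitoring summary. A sketch, assuming a summary dict with the `fallback_rate`, `score_mean`, `gap_mean`, and `route_distribution` keys produced by `RoutingMonitor.summary()`; the function name, threshold key names, and message wording are illustrative.

```python
from typing import Dict, List, Optional

DEFAULT_THRESHOLDS: Dict[str, float] = {
    "max_fallback_rate": 0.20,
    "min_score_mean": 0.40,
    "min_gap_mean": 0.08,
    "max_route_share": 0.60,
}

def check_alerts(summary: dict,
                 thresholds: Optional[Dict[str, float]] = None) -> List[str]:
    limits = thresholds or DEFAULT_THRESHOLDS
    alerts: List[str] = []
    if not summary:  # empty window: nothing to evaluate
        return alerts
    if summary["fallback_rate"] > limits["max_fallback_rate"]:
        alerts.append(f"fallback rate {summary['fallback_rate']:.0%} above limit")
    if summary["score_mean"] < limits["min_score_mean"]:
        alerts.append(f"mean score {summary['score_mean']:.2f} below limit")
    if summary["gap_mean"] < limits["min_gap_mean"]:
        alerts.append(f"mean gap {summary['gap_mean']:.2f} below limit")
    for route, share in summary["route_distribution"].items():
        if share > limits["max_route_share"]:
            alerts.append(f"route '{route}' has {share:.0%} of traffic")
    return alerts
```

The returned messages can be forwarded to whatever alerting channel the rest of your infrastructure already uses.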

> **Key Insight:** The most important thing routing monitoring tells you is when to recalibrate. Routing systems built on good embeddings and well-designed anchors tend to stay accurate for a long time, but they do drift as query patterns change. Monitoring gives you the signal to know when recalibration is needed before users notice degradation.

### A/B Testing Routing Changes

When updating routing configurations — new anchors, changed thresholds, new routes — validate changes with controlled A/B testing before full deployment. Route a fraction of traffic through the new configuration, compare routing accuracy and downstream quality metrics, and promote if the new configuration improves on both.

The challenge with A/B testing routing changes is that routing affects all downstream metrics. A routing change that improves routing accuracy but routes queries to a slightly weaker handler can appear to worsen user experience metrics even when routing quality has improved. Design A/B tests to measure routing decisions independently of handler quality where possible.

```python
import hashlib
import random
from typing import Optional

class ABTestRouter:
    def __init__(self, control: SemanticRouter, treatment: SemanticRouter,
                 treatment_fraction: float = 0.10):
        self.control = control
        self.treatment = treatment
        self.treatment_fraction = treatment_fraction

    def route(self, query: str, user_id: Optional[str] = None) -> tuple:
        # Deterministic assignment by user_id (stable across queries and restarts).
        # Built-in hash() is salted per process for strings, so use a real digest.
        if user_id:
            digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
            bucket = int(digest, 16) % 100
            use_treatment = bucket < (self.treatment_fraction * 100)
        else:
            use_treatment = random.random() < self.treatment_fraction

        variant = "treatment" if use_treatment else "control"
        router = self.treatment if use_treatment else self.control

        route_name, score = router.route(query)
        return route_name, score, variant
```
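
One way to measure routing decisions independently of handler quality is to replay the same labeled queries through both configurations offline and compare only the routes they choose. The sketch below assumes each router exposes the `route(query)` method used above, returning `(route_name, score)`, and that a labeled set of `(query, expected_route)` pairs is available. `compare_routers` and `StubRouter` are hypothetical names; the stub only stands in for a real router so the example runs on its own.

```python
def compare_routers(control, treatment, labeled_queries):
    """Replay labeled queries through both routers; compare route choices only."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    control_correct = 0
    treatment_correct = 0
    disagreements = []  # queries where the two configurations pick different routes
    for query, expected in labeled_queries:
        c_route, _ = control.route(query)
        t_route, _ = treatment.route(query)
        control_correct += int(c_route == expected)
        treatment_correct += int(t_route == expected)
        if c_route != t_route:
            disagreements.append((query, c_route, t_route, expected))
    n = len(labeled_queries)
    return {
        "control_accuracy": control_correct / n,
        "treatment_accuracy": treatment_correct / n,
        "disagreements": disagreements,
    }


class StubRouter:
    """Stand-in for a real router: returns preset decisions (illustration only)."""
    def __init__(self, decisions, default="general"):
        self.decisions = decisions
        self.default = default

    def route(self, query):
        return self.decisions.get(query, self.default), 1.0


labeled = [
    ("how do I sort a list", "code_help"),
    ("reset my password", "account"),
    ("explain this stack trace", "code_help"),
]
control = StubRouter({"how do I sort a list": "code_help"})
treatment = StubRouter({"how do I sort a list": "code_help",
                        "reset my password": "account"})
report = compare_routers(control, treatment, labeled)
print(report["control_accuracy"], report["treatment_accuracy"])
```

The disagreement list is usually the most useful output: it is the set of queries whose handling actually changes if the treatment configuration is promoted.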

---

### Key Takeaways

1. Routing evaluation requires three distinct metrics: accuracy (correct routes), coverage (fraction confidently routed), and downstream quality (handler response quality).
2. Building a routing evaluation dataset requires upfront investment in labeling — LLM-assisted labeling with human spot-checking is a practical approach at scale.
3. Confusion matrices for routing are uniquely informative — they directly reveal which anchor spaces overlap and need refinement.
4. Production monitoring should track routing distribution, score distribution, fallback rate, and gap distribution as continuous health metrics.
5. A/B test routing configuration changes before full deployment — routing affects all downstream metrics, so validate changes in controlled experiments.

---

> **Try This:** Set up a weekly routing health report: routing distribution, mean score, fallback rate, and gap distribution compared to the prior week. Run it for a month. You'll quickly develop intuition for what "normal" looks like in your system, and anomalies will become obvious before they become user-visible problems.
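
A starting point for that report might look like the sketch below. It assumes you already aggregate weekly metrics into plain dictionaries of metric name to value; `weekly_health_report` and the metric names are hypothetical.

```python
def weekly_health_report(current, previous):
    """Format a week-over-week comparison of routing health metrics."""
    lines = [f"{'metric':<16}{'prev':>8}{'curr':>8}{'delta':>9}"]
    for name in sorted(current):
        curr = current[name]
        prev = previous.get(name)
        if prev is None:
            # Metric is new this week, so there is no delta to show
            lines.append(f"{name:<16}{'n/a':>8}{curr:>8.3f}{'n/a':>9}")
        else:
            lines.append(f"{name:<16}{prev:>8.3f}{curr:>8.3f}{curr - prev:>+9.3f}")
    return "\n".join(lines)


this_week = {"fallback_rate": 0.12, "score_mean": 0.55, "gap_mean": 0.11}
last_week = {"fallback_rate": 0.10, "score_mean": 0.57, "gap_mean": 0.12}
print(weekly_health_report(this_week, last_week))
```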

---

## Conclusion

The fundamental challenge semantic routing addresses is this: systems that do more than one thing need to know which thing to do. When the signal that determines that decision is buried in meaning rather than visible in form, keyword matching and structural rules don't scale. Semantic routing uses vector representations of meaning to make that decision accurately, at scale, and without brittle rule maintenance.

The key ideas from each chapter compose into a coherent design.

Start with embedding-based routing as the core mechanism — pre-compute anchor embeddings, measure cosine similarity at query time, select the highest-scoring route. This handles the common case well with minimal infrastructure.

Design anchors carefully. Use a mix of description-based and example-based anchors. More anchors isn't always better — overlap between anchor spaces creates ambiguity. Audit anchor quality before tuning thresholds.

Calibrate thresholds empirically, not intuitively. Collect score distributions, understand the precision-recall tradeoff, and use per-route thresholds for routes with different semantic breadth. Score gap analysis catches the ambiguous cases that threshold checks alone miss.

Compose multi-stage pipelines when single-stage routing can't capture the distinctions your system needs. Keep stages coarse-to-fine, minimize the total number of stages, and propagate routing context through the pipeline for traceability and context-aware decisions.

For code systems, use the full routing metadata: intent, language, artifact type, and specificity all matter and they map to distinct handler behaviors. Code-specialized embeddings improve quality when queries contain actual code.

Build fallback paths before you need them. Tiered fallbacks — clarification, general handler, best-effort, rejection — handle the tail of difficult queries gracefully. Uncertainty propagation passes routing confidence to handlers so they can adapt. Exercise every fallback path in tests rather than discovering its behavior in production.

Measure everything. Routing accuracy, coverage, downstream quality. Score distribution, gap distribution, fallback rate. A/B test configuration changes. Set alerts for degradation. The routing layer is infrastructure — treat it with the same operational rigor as any other critical component.

One final note: the specifics of which embedding model, which library, which threshold will change. The space moves fast, and the optimal choices today may not be optimal in six months. The patterns here — anchor design, threshold calibration, multi-stage composition, fallback architecture, monitoring — are more durable than any specific implementation. Understand the patterns, and the implementation decisions will follow.

---

## Appendix A: Glossary

**Anchor phrase**: A text string used as a representative example or description of a routing destination. Queries are routed based on similarity to anchor phrases. Also called a routing example or seed phrase.

**Circuit breaker**: A reliability pattern that prevents a failing component from receiving requests until it has recovered. In routing, applied to handlers that are returning errors above a threshold rate.

**Cosine similarity**: A metric equal to the cosine of the angle between two vectors, yielding a value between -1 and 1. For L2-normalized embeddings, it is equivalent to the dot product. The standard similarity metric for semantic routing.

**Embedding**: A dense vector representation of text (or other data) that encodes semantic content. Similar meanings produce geometrically close vectors in the embedding space.

**Embedding model**: A neural network that converts text into embeddings. Model quality determines most of the baseline routing quality in an embedding-based routing system.

**Fallback handler**: A handler that receives queries when no primary handler can be confidently selected. Typically a general-purpose handler, a clarification prompt, or a rejection.

**Gap analysis**: Measuring the difference between the highest and second-highest routing scores for a query. Small gaps indicate ambiguous queries where two routes score similarly.

**Handler**: A function or service that processes a query after it has been routed. In a multi-tool AI system, handlers might be different retrieval backends, LLM instances, or processing pipelines.

**L2 normalization**: Scaling a vector to unit length (magnitude = 1). Applied to embeddings before cosine similarity computation; makes the similarity calculation equivalent to a dot product.

**Route**: A named destination consisting of a handler and a set of anchor phrases. Queries are assigned to routes based on semantic similarity to anchors.

**Routing accuracy**: The fraction of queries sent to the correct handler. Requires labeled ground truth to compute.

**Routing context**: State accumulated as a query passes through a multi-stage routing pipeline. Contains stage results, scores, and metadata for downstream stages and handlers.

**Routing coverage**: The fraction of queries receiving a confident routing decision above threshold, rather than falling to a fallback.

**Score gap**: The difference between the highest routing score and the second-highest routing score for a given query. A signal of routing confidence beyond the raw score.

**Semantic routing**: Routing queries to handlers based on semantic similarity (meaning) rather than keyword matching or structural rules.

**Threshold**: A minimum similarity score required before a routing decision is executed. Queries scoring below threshold fall to fallback handlers.

**Top-k mean**: A score aggregation strategy that averages the k highest anchor similarity scores for a route. A middle ground between maximum (aggressive) and mean (conservative).
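
Several of the numeric terms above (L2 normalization, cosine similarity, top-k mean, score gap) fit in a few lines of code. The following is a pure-Python sketch for reference only; the function names are illustrative, and a production router would normally use vectorized NumPy operations instead.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product of the unit vectors."""
    return sum(x * y for x, y in zip(l2_normalize(a), l2_normalize(b)))


def top_k_mean(anchor_scores, k=3):
    """Average of the k highest anchor similarity scores for one route."""
    if not anchor_scores:
        raise ValueError("anchor_scores must not be empty")
    top = sorted(anchor_scores, reverse=True)[:k]
    return sum(top) / len(top)


def score_gap(route_scores):
    """Difference between the highest and second-highest route scores."""
    if len(route_scores) < 2:
        raise ValueError("need at least two routes to compute a gap")
    ranked = sorted(route_scores.values(), reverse=True)
    return ranked[0] - ranked[1]


print(cosine_similarity([3.0, 4.0], [4.0, 3.0]))
print(top_k_mean([0.9, 0.5, 0.1], k=2))
print(score_gap({"code_help": 0.8, "docs": 0.7, "general": 0.2}))
```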

---

## Appendix B: Tools & Resources

### Embedding Models

**sentence-transformers** (`pip install sentence-transformers`)
The standard Python library for using pre-trained sentence embedding models. Includes models of varying size/quality tradeoffs. `all-MiniLM-L6-v2` (384-dim) is a common starting point for routing; `all-mpnet-base-v2` (768-dim) offers better quality at higher latency.
GitHub: UKPLab/sentence-transformers

**CodeBERT** (`microsoft/codebert-base` on HuggingFace)
Microsoft's code-specialized BERT variant, pre-trained on six programming languages. Better than general text models for routing queries that contain code. Available via the `transformers` library.

**OpenAI text-embedding-ada-002 / text-embedding-3-small**
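API-based embeddings. High quality, no local compute required, but every call adds API latency and cost. Queries must be embedded with the same model as the anchors, so choosing an API model for anchor pre-computation also puts an API call on the query path. Suitable for offline evaluation and for routing paths that can tolerate the extra round trip; less appropriate for real-time routing in latency-sensitive paths.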

### Vector Storage

**NumPy** — For small routing systems (fewer than 20 routes, fewer than 200 anchors total), in-memory NumPy matrix operations are fast enough. No additional infrastructure required.

**FAISS** (`pip install faiss-cpu`) — Facebook's similarity search library. Useful when you have large numbers of anchors or are doing nearest-neighbor routing across a large destination space. Provides both exact and approximate nearest-neighbor search, with configurable accuracy/speed tradeoffs for the approximate indexes.

**ChromaDB** (`pip install chromadb`) — Lightweight embedded vector database. Suitable for routing systems that need persistence, filtering, or metadata alongside similarity search.

**Qdrant** — A more production-focused vector database with filtering, quantization, and distributed deployment options. Worth considering when routing scales to many destinations or when you need filtering by metadata alongside similarity.

### Routing Libraries

**semantic-router** (`pip install semantic-router`) — A Python library specifically for semantic routing use cases. Implements the anchor-based routing pattern with built-in support for multiple embedding providers, threshold management, and route management. A useful starting point for systems where building from scratch isn't the goal.

### Monitoring and Evaluation

**scikit-learn** — Provides `classification_report`, `confusion_matrix`, and other evaluation utilities suitable for routing evaluation.

**Prometheus + Grafana** — Standard infrastructure monitoring stack for production metrics. Export routing health metrics (fallback rate, score distribution, route distribution) as Prometheus metrics and visualize in Grafana.

**MLflow / Weights & Biases** — Experiment tracking for routing system calibration experiments. Useful when running systematic threshold calibration across multiple parameter settings.

---

## Appendix C: Further Reading

### Foundational Papers

**"Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks"**
Reimers & Gurevych, EMNLP 2019.
The paper behind sentence-transformers. Essential reading for understanding how sentence-level embeddings are trained and what properties they have. The training objective, and its effect on the semantic similarity properties of the embeddings, explains much of why embedding-based routing works.

**"Dense Passage Retrieval for Open-Domain Question Answering"**
Karpukhin et al., EMNLP 2020.
While focused on retrieval rather than routing, the problem structure — finding the most semantically relevant item from a set given a query — is directly analogous. The discussion of in-batch negative training and hard negative selection applies to anchor quality in routing systems.

**"CodeBERT: A Pre-Trained Model for Programming and Natural Language"**
Feng et al., Findings of EMNLP 2020.
The paper introducing CodeBERT, covering the training objective and evaluation on code-related tasks. Relevant for Chapter 6's discussion of code-specialized embeddings.

### System Design

**"Patterns for Building LLM-based Systems & Products"**
Eugene Yan, 2023.
A practical survey of architectural patterns in LLM system design, including routing, orchestration, and fallback patterns. Grounded in production experience.

**"Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks"**
Lewis et al., NeurIPS 2020.
The foundational paper for RAG architectures. The interplay between retrieval and generation is directly relevant to routing decisions about when and where to retrieve context.

### Evaluation

Standard machine learning evaluation texts apply well to routing evaluation, which is fundamentally a classification problem at the decision layer. The specifics of multi-class precision/recall and confusion matrix interpretation are directly applicable. Any thorough treatment of precision-recall tradeoffs in binary and multi-class settings — the framing used in Chapter 4 — covers the underlying concepts in depth.

### Operational Practices

**"The Twelve-Factor App"** (12factor.net) — While not specific to AI systems, the operational principles around configuration, logging, and process management apply directly to routing system deployment. Config (including thresholds and anchor phrases) should be environment-variable driven; routing decisions should be logged as structured events.

**Google's Site Reliability Engineering book** (freely available online) — The circuit breaker, graceful degradation, and monitoring patterns in Chapters 7 and 8 come from SRE practice. The SRE book covers these patterns in depth, with production war stories that sharpen intuition for where they matter most.
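
The Twelve-Factor advice above translates into very little code. The sketch below uses only the standard library and assumes a hypothetical naming scheme for the environment variables (`ROUTE_THRESHOLD_<ROUTE>`) and hypothetical field names for the log event; adapt both to your own conventions.

```python
import json
import logging
import os


def load_thresholds(route_names, default=0.5, environ=os.environ):
    """Read per-route thresholds from variables like ROUTE_THRESHOLD_DOCS."""
    thresholds = {}
    for name in route_names:
        key = f"ROUTE_THRESHOLD_{name.upper()}"
        thresholds[name] = float(environ.get(key, default))
    return thresholds


def routing_event(query_id, route, score, threshold):
    """Serialize one routing decision as a single JSON log line."""
    return json.dumps({
        "event": "routing_decision",
        "query_id": query_id,
        "route": route,
        "score": round(score, 4),
        "threshold": threshold,
        "fallback": score < threshold,
    }, sort_keys=True)


logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("routing")

# A plain dict stands in for os.environ so the example is reproducible
thresholds = load_thresholds(["code_search", "docs"],
                             environ={"ROUTE_THRESHOLD_DOCS": "0.62"})
logger.info(routing_event("q-1", "docs", 0.58, thresholds["docs"]))
```

Emitting one JSON object per decision keeps the routing log queryable by the same tooling you use for other structured events.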

---

*Semantic Routing: Design and Implementation — Version 1.0*
*David Kelly Price — April 2026*
*Published by Pyckle*


---

## Related Blog Posts

- [Semantic Routing Explained](https://pyckle.co/blog/semantic-routing-explained.html)
- [Semantic Routing: The Decision Layer AI Coding Tools Actually Need](https://pyckle.co/blog/semantic-routing-the-decision-layer-ai-coding-tools-actually-need.html)
