---
title: "KV Cache and Inference Optimization"
subtitle: "The Infrastructure Layer That Determines Your Real LLM Costs"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Platform engineers and ML engineers running inference infrastructure — deploying or evaluating self-hosted LLMs and inference APIs, responsible for latency and cost"
estimated_pages: 70
chapters:
  - "How Attention Works (and Why It Matters for Cost)"
  - "The KV Cache Explained"
  - "Prompt Caching in Hosted APIs"
  - "Prefill vs. Decode: Where Time Goes"
  - "Batching Strategies for Throughput"
  - "Quantization and Its Impact on Cache"
  - "Speculative Decoding"
  - "Monitoring Inference Performance"
tags:
  - pyckle
  - ebook
  - kv-cache
  - inference
  - optimization
  - llm-infrastructure
  - latency
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# KV Cache and Inference Optimization

## The Infrastructure Layer That Determines Your Real LLM Costs

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: How Attention Works (and Why It Matters for Cost)
- Chapter 2: The KV Cache Explained
- Chapter 3: Prompt Caching in Hosted APIs
- Chapter 4: Prefill vs. Decode: Where Time Goes
- Chapter 5: Batching Strategies for Throughput
- Chapter 6: Quantization and Its Impact on Cache
- Chapter 7: Speculative Decoding
- Chapter 8: Monitoring Inference Performance
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

Most engineers who deploy LLMs spend their first six months optimizing the wrong things. They tune prompt templates, chase model benchmark scores, argue about which API provider has better latency SLAs. Then they get a real production bill and start asking harder questions.

The honest answer to "why does this cost so much" almost always traces back to inference mechanics — specifically to how attention is computed, how intermediate state is managed, and whether the system is structured to reuse work it's already done. KV caching sits at the center of all of that. It's not a single knob you turn. It's a set of architectural behaviors that determine whether your inference system does unnecessary work at every request, or accumulates efficiencies over time.

This guide is written for engineers who are responsible for inference performance — whether you're self-hosting models on GPU clusters, evaluating hosted APIs, or building applications where latency and cost are real constraints rather than theoretical ones. The goal is to give you a working mental model of the mechanics, practical guidance on the tradeoffs, and enough depth that you can read a profiling trace or an API cost breakdown and actually understand what's driving the numbers.

A few assumptions: you can read Python, you're comfortable with the idea of tensors and matrix operations without needing a deep math background, and you've run at least one LLM inference workload before. You don't need to be a GPU architecture expert. The concepts here are explained at the level where they become actionable — which is the only level that matters for this kind of work.

Every chapter ends with key takeaways and a practical exercise. These aren't summaries of what you just read. They're starting points for applying the concepts in your own environment. The fastest path from understanding to competence is running the thing yourself.

---

## Chapter 1: How Attention Works (and Why It Matters for Cost)

Before KV caching makes sense, attention has to make sense. Not the mathematical formulation from the paper — the intuition behind why transformers are expensive in the way they're expensive, and why that expense grows with context length in a way that catches engineers off guard.

### The Core Idea

A transformer processes a sequence of tokens. Each token needs to figure out which other tokens are relevant to understanding its meaning. The mechanism it uses to do that is attention: every token queries every other token in the sequence, and the result is a weighted mixture of all the information in the context, shaped by those relevance scores.

This is genuinely powerful. It's why transformers generalize so well across language tasks: the model can build arbitrary relationships between tokens regardless of how far apart they sit in the sequence. A token at position 1000 can attend to the token at position 3 just as easily as to the token at position 999.

The problem is what "every token queries every other token" costs computationally.

### The Math, Without the Math

For a sequence of length N, attention requires computing N × N relationships. Each of those relationships involves a dot product between two vectors, and those vectors have dimension d (typically 64 to 128 per attention head in modern models). The result is that the computational cost of attention scales with N², which is the part that matters.

A sequence of 1,000 tokens requires roughly 1 million pair comparisons per attention head. A sequence of 4,000 tokens requires 16 million. An 8K context requires 64 million. Quadratic scaling is brutal at long contexts, and modern deployed models commonly handle 32K, 128K, or even 1M token contexts.

This is why context length is a cost driver, not just a capability dimension.
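The figures above come straight from squaring the sequence length. A minimal sketch, assuming a hypothetical helper named `attention_pairs` that counts the full N × N score matrix for one head (causal masking would roughly halve the count without changing the growth rate):

```python
def attention_pairs(seq_len: int) -> int:
    """Number of query-key score computations for one attention head."""
    return seq_len * seq_len


for n in (1_000, 4_000, 8_000, 32_000, 128_000):
    print(f"{n:>7,} tokens -> {attention_pairs(n):>18,} pairs per head")

# Doubling the context quadruples the pair count.
ratio = attention_pairs(8_000) / attention_pairs(4_000)
print(f"8K vs 4K: {ratio:.0f}x the work")
```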

> **Key Insight:** The quadratic scaling of attention means that doubling context length approximately quadruples the attention computation. This relationship is what makes KV cache reuse so valuable — if you can avoid recomputing attention over static context, you're eliminating work that grows nonlinearly with context size.

### Heads, Layers, and the Full Forward Pass

In practice, transformers don't use a single attention computation. They use multi-head attention, where the model runs several parallel attention operations (the "heads") and concatenates the results. A 70B parameter model might have 64 attention heads per layer, and 80 transformer layers. Every forward pass runs attention at every layer, for every head.

The numbers compound quickly. If you're running a model with 64 heads, 80 layers, and an 8K context, the attention computation alone (ignoring the feed-forward layers) is substantial. Feed-forward layers, which make up the majority of parameter count, have their own costs — but they scale linearly with sequence length, not quadratically. Attention is the quadratic component.

This is why "just use a bigger context window" is not free. It's an explicit cost tradeoff that has implications for latency, throughput, and memory usage.

### Why This Matters for Cost

When you pay for inference — whether through a hosted API or by running your own hardware — you're ultimately paying for compute and memory. Attention's quadratic growth means both of those increase faster than linearly with context length.

For self-hosted deployments, this shows up in GPU utilization patterns. Long contexts saturate memory bandwidth before they saturate compute — the bottleneck shifts. For hosted APIs, this often shows up in the pricing model: most providers charge differently for input tokens (the prompt) versus output tokens (the generated response). Input token processing corresponds closely to the prefill computation, which includes all the attention over your context. The longer the context, the more prefill work, and the more that work dominates your cost.

There's a second-order effect that's easy to miss: latency. The time-to-first-token (TTFT) is primarily determined by prefill. If your system prompt is 4,000 tokens and a user adds a 200-token query, the model has to process all 4,200 tokens before it can start generating the response. The user waits. KV caching is the mechanism that eliminates most of that wait on subsequent requests — but only if the infrastructure is set up to use it correctly.

### Attention Variants and Their Cost Profiles

The original "scaled dot-product attention" described above is still the dominant mechanism in deployed models, but several variants have emerged specifically to reduce its cost.

**Multi-query attention (MQA)** reduces the number of key-value heads while keeping multiple query heads. Instead of each head having its own K and V projections, all heads share a single set. This reduces the memory footprint of the KV cache significantly — often by 8× or more — at some cost to model quality.

**Grouped-query attention (GQA)** splits the difference: queries are grouped, and each group shares K and V heads. Llama 3, Mistral, and most recent open-weights models use GQA. It provides most of the memory savings of MQA with less quality degradation.

**Sliding window attention (SWA)** limits each token to attending only to a local window of previous tokens, rather than the full context. This reduces attention complexity from O(N²) to O(N × window_size). The original Mistral 7B release uses it with a 4,096-token window. The tradeoff is that information can only travel across the window boundary slowly, over multiple layers.

These variants matter for infrastructure engineers because they directly determine KV cache size and memory requirements. A model with 64 query heads that uses GQA with 8 KV heads stores 8× less cache per layer per token than the same model with full multi-head attention (64 KV heads). That's not a minor difference when you're trying to fit a large batch in GPU memory.
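To make the per-token difference concrete, here is a small sketch. The function name and the defaults (head dimension 128, fp16 storage) are assumptions chosen for illustration, not values taken from any specific model:

```python
def kv_bytes_per_token_per_layer(kv_heads: int, head_dim: int = 128,
                                 bytes_per_element: int = 2) -> int:
    """Bytes of K plus V stored for one token in one layer."""
    return 2 * kv_heads * head_dim * bytes_per_element


mha = kv_bytes_per_token_per_layer(kv_heads=64)  # one KV head per query head
gqa = kv_bytes_per_token_per_layer(kv_heads=8)   # 8 query heads share each KV head
mqa = kv_bytes_per_token_per_layer(kv_heads=1)   # all query heads share one KV head

print(f"MHA: {mha:,} B  GQA: {gqa:,} B  MQA: {mqa:,} B")
print(f"GQA stores {mha // gqa}x less than MHA, MQA stores {mha // mqa}x less")
```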

> **Warning:** Don't infer the attention variant from parameter count or architecture family. Check the config. Llama 2 7B and 13B use multi-head attention, while Llama 2 70B and every Llama 3 size use GQA. The KV cache memory requirements differ substantially, which affects batching capacity.

### The Practical Implication

Understanding attention mechanics gives you a framework for evaluating almost every other inference optimization. Batching works because attention over multiple sequences can be computed together. KV caching works because the keys and values computed over a prompt don't change if the prompt doesn't change. Speculative decoding works because short sequences are cheap to verify. Quantization affects cache size because the keys and values are stored at reduced precision.

None of those optimizations are magic. They're all consequences of the structure of attention and the specific ways it's expensive. Once the structure is clear, the optimizations follow.

**Key Takeaways:**
1. Attention scales quadratically with sequence length — this is the primary driver of inference cost at long contexts.
2. Multi-head attention compounds the compute: every layer, every head, every forward pass.
3. Attention variants (MQA, GQA, SWA) exist specifically to reduce KV cache memory at some quality tradeoff.
4. Time-to-first-token is driven primarily by prefill, which includes all attention over the input.
5. Context length is a cost dimension as real as parameter count. Treat it that way.

> **Try This:** Take a model you're currently running or evaluating. Find its architecture config (usually in `config.json` for HuggingFace models). Identify the number of attention heads, number of KV heads, number of layers, and maximum sequence length. Calculate the theoretical KV cache size in bytes for a single sequence at maximum context: `2 × layers × kv_heads × head_dim × max_seq_len × bytes_per_element`. For fp16, bytes_per_element is 2. Compare this to the model's parameter memory footprint. The ratio is often surprising.
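One way to run this exercise is sketched below. It assumes the config uses the common HuggingFace Llama-style key names (`num_hidden_layers`, `num_attention_heads`, `num_key_value_heads`, `hidden_size`, `max_position_embeddings`); other architectures may name these fields differently. The example values are illustrative and resemble an 8B Llama-style model.

```python
import json
from typing import Optional


def kv_cache_bytes(config: dict, seq_len: Optional[int] = None,
                   bytes_per_element: int = 2) -> int:
    """Theoretical KV cache size for one sequence, from a Llama-style config."""
    layers = config["num_hidden_layers"]
    q_heads = config["num_attention_heads"]
    # Configs without num_key_value_heads use plain multi-head attention.
    kv_heads = config.get("num_key_value_heads", q_heads)
    head_dim = config.get("head_dim") or config["hidden_size"] // q_heads
    if seq_len is None:
        seq_len = config["max_position_embeddings"]
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_element


# In practice: config = json.load(open("config.json"))
example_config = json.loads("""{
  "num_hidden_layers": 32, "num_attention_heads": 32,
  "num_key_value_heads": 8, "hidden_size": 4096,
  "max_position_embeddings": 8192
}""")

size = kv_cache_bytes(example_config)
print(f"{size / 2**30:.2f} GiB per sequence at maximum context")
```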

---

## Chapter 2: The KV Cache Explained

When a transformer generates a token, it runs attention over the entire context — every token that came before the new one. Without optimization, this means the model recomputes the keys and values for every previous token on every generation step. For a 2,000-token context generating a 500-token response, that's 500 full attention computations over a growing sequence. The KV cache eliminates the redundancy.

### What Gets Cached

During attention computation, each token's representation is projected into three vectors: a query (Q), a key (K), and a value (V). The query is used to ask "what should I pay attention to?" The keys are what each token offers as a descriptor of itself. The values are the actual content that gets mixed into the output when a token is attended to.

For a token at position i, its K and V vectors are computed once. They don't change based on what comes after — the transformer architecture is causal, meaning each token can only attend to previous tokens, never future ones. This is the property that makes caching possible: the K and V vectors for a given token at a given position are fixed once computed, regardless of how many more tokens get added to the sequence.

The KV cache stores those K and V tensors. On each new generation step, the model only needs to compute Q for the new token, and then attend to the cached K and V tensors from all previous positions. The output of attention is the same as if all keys and values had been recomputed from scratch — because for static content, they would be identical.

> **Key Insight:** The KV cache doesn't change the math of attention. It changes whether the model recalculates K and V tensors that haven't changed, or reads them from memory. This is a pure compute savings, not an approximation.
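The equivalence can be checked with a toy single-head model in pure Python. Everything here (the dimensions, random weights, and function names) is made up for illustration; the point is only that the cached path produces the same outputs while performing far fewer projections.

```python
import math
import random

random.seed(0)
DIM, N_TOKENS = 4, 6


def rand_matrix():
    return [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]


W_q, W_k, W_v = rand_matrix(), rand_matrix(), rand_matrix()
embeddings = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(N_TOKENS)]
projections = {"count": 0}


def project(x, w):
    """Vector-matrix product x @ w, counting each call."""
    projections["count"] += 1
    return [sum(xi * wi for xi, wi in zip(x, col)) for col in zip(*w)]


def attend(q, keys, values):
    """Softmax-weighted mix of values for a single query vector."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def generate_without_cache():
    outputs = []
    for i in range(N_TOKENS):
        # Recompute K and V for every position 0..i on every step.
        keys = [project(x, W_k) for x in embeddings[: i + 1]]
        values = [project(x, W_v) for x in embeddings[: i + 1]]
        outputs.append(attend(project(embeddings[i], W_q), keys, values))
    return outputs


def generate_with_cache():
    k_cache, v_cache, outputs = [], [], []
    for x in embeddings:
        # Only the new token's K and V are computed; the rest are reused.
        k_cache.append(project(x, W_k))
        v_cache.append(project(x, W_v))
        outputs.append(attend(project(x, W_q), k_cache, v_cache))
    return outputs


projections["count"] = 0
baseline = generate_without_cache()
uncached_projections = projections["count"]

projections["count"] = 0
cached = generate_with_cache()
cached_projections = projections["count"]

outputs_match = all(math.isclose(a, b, abs_tol=1e-12)
                    for row_a, row_b in zip(baseline, cached)
                    for a, b in zip(row_a, row_b))
print(outputs_match, uncached_projections, cached_projections)
```

The uncached projection count grows quadratically with the number of generated tokens, while the cached count grows linearly.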

### The Memory Layout

Understanding the physical layout of the KV cache helps clarify both the optimization opportunities and the constraints.

For a model with L layers, H KV heads, D head dimension, and a sequence of length N tokens, the full KV cache has shape:

```
[2, L, H, N, D]
```

The factor of 2 is for K and V. Each element is typically stored in fp16 (2 bytes) or bf16 (2 bytes), occasionally in fp8 (1 byte). For a concrete example:

- Llama 3 8B: 32 layers, 8 KV heads, head dim 128
- For 4,096 tokens in fp16:
  - `2 × 32 × 8 × 4096 × 128 × 2 bytes = 536,870,912 bytes ≈ 512 MB`

That is for a single sequence at 4K context. A 70B model with 80 layers and 8 KV heads at the same context length:

```python
layers = 80
kv_heads = 8
head_dim = 128
seq_len = 4096
bytes_per_element = 2  # fp16

kv_cache_bytes = 2 * layers * kv_heads * seq_len * head_dim * bytes_per_element
# = 2 * 80 * 8 * 4096 * 128 * 2
# = 1,342,177,280 bytes ≈ 1.25 GB
```

1.25 GB per sequence. If you're batching 8 sequences simultaneously, that's 10 GB of KV cache alone — before model weights, before activations, before anything else. On an 80 GB A100, this isn't catastrophic, but it means KV cache is a significant fraction of your available memory budget.

This is why attention variants matter so much. A model with 64 KV heads instead of 8 would require 8× the cache: 10 GB per sequence. Batching 8 of those is 80 GB — your entire A100, just for cache.
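The same arithmetic can be turned around to estimate how many sequences fit. This is a rough sketch: the function name is illustrative, and the 40 GiB reserved for weights and activations is a hypothetical figure chosen only to make the comparison visible.

```python
def max_concurrent_sequences(gpu_mem_gib: float, reserved_gib: float,
                             layers: int, kv_heads: int, head_dim: int,
                             seq_len: int, bytes_per_element: int = 2) -> int:
    """How many full-length sequences fit in the memory left for KV cache."""
    per_seq = 2 * layers * kv_heads * head_dim * seq_len * bytes_per_element
    available = (gpu_mem_gib - reserved_gib) * 2**30
    return max(0, int(available // per_seq))


# Hypothetical budget: 80 GiB device, 40 GiB held back for everything else.
gqa_batch = max_concurrent_sequences(80, 40, layers=80, kv_heads=8,
                                     head_dim=128, seq_len=4096)
mha_batch = max_concurrent_sequences(80, 40, layers=80, kv_heads=64,
                                     head_dim=128, seq_len=4096)
print(f"8 KV heads: {gqa_batch} sequences, 64 KV heads: {mha_batch} sequences")
```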

### How It's Managed at Runtime

Naive KV cache management allocates a fixed contiguous block of memory per sequence, sized to the maximum possible sequence length. This wastes memory when sequences are shorter, and it makes it difficult to batch sequences of different lengths efficiently.

**PagedAttention**, introduced by vLLM, addressed this. It manages KV cache the way an operating system manages virtual memory: in fixed-size pages (called "blocks") that can be allocated non-contiguously and shared across sequences. Each block holds KV tensors for a fixed number of tokens (the block size, often 16 or 32 tokens).

This has two major consequences. First, memory waste drops dramatically — instead of pre-allocating space for the maximum context length, blocks are allocated on demand as the sequence grows. Second, pages can be physically shared between sequences that have identical prefixes, which is the foundation for prefix caching.

```
Traditional allocation (4,096 max length, 800 token sequence):
[--800 tokens used--][---3,296 tokens wasted---]

PagedAttention (block size = 16, 800 tokens):
[block 1: tokens 0-15]
[block 2: tokens 16-31]
...
[block 50: tokens 784-799]
No wasted space.
```
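The block-table idea can be sketched as a toy allocator. This is not vLLM's implementation; the class and method names are invented for illustration, and the "blocks" are just integer ids rather than GPU memory.

```python
import math


class BlockAllocator:
    """Toy allocator: hands out fixed-size KV blocks from a shared free list."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.lengths = {}       # sequence id -> tokens stored so far

    def append_tokens(self, seq_id: str, n_tokens: int) -> None:
        table = self.block_tables.setdefault(seq_id, [])
        new_len = self.lengths.get(seq_id, 0) + n_tokens
        needed = math.ceil(new_len / self.block_size) - len(table)
        if needed > len(self.free_blocks):
            raise MemoryError("out of KV cache blocks")
        for _ in range(needed):
            table.append(self.free_blocks.pop())  # blocks need not be contiguous
        self.lengths[seq_id] = new_len

    def free(self, seq_id: str) -> None:
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.lengths.pop(seq_id, None)

    def wasted_slots(self, seq_id: str) -> int:
        return len(self.block_tables[seq_id]) * self.block_size - self.lengths[seq_id]


allocator = BlockAllocator(num_blocks=256)
allocator.append_tokens("seq-a", 800)  # the 800-token sequence from the diagram
print(len(allocator.block_tables["seq-a"]), allocator.wasted_slots("seq-a"))
allocator.append_tokens("seq-a", 1)    # one generated token opens a new block
print(len(allocator.block_tables["seq-a"]), allocator.wasted_slots("seq-a"))
```

Waste is bounded by one partially filled block per sequence, rather than by the gap between actual and maximum sequence length.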

### Prefix Caching

Prefix caching extends the basic KV cache idea to work across requests, not just within a single generation sequence. If two requests share an identical prefix — say, a system prompt — the server can compute the KV tensors for that prefix once, cache them, and reuse them for every subsequent request that starts with the same prefix.

This is an enormous win for applications with long, static system prompts. An agent framework that prepends a 2,000-token instruction set to every request, running 10,000 requests per day, would otherwise compute KV tensors for those 2,000 tokens 10,000 times. With prefix caching, it computes them once. The prefill cost for those tokens essentially goes to zero on cache hits.

The cache key is typically a hash of the token sequence. When a new request arrives, the system checks whether it can find any cached KV blocks matching the beginning of the sequence. If it finds a match for the first k tokens, it loads those blocks and only computes prefill for the remaining tokens.
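A toy version of that lookup is sketched below. It assumes block-granular hashing where each block's hash also covers everything before it; the names and the 16-token block size are illustrative, and this is not any particular framework's implementation.

```python
import hashlib

BLOCK_SIZE = 16


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with the hash of everything before it."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        block = list(token_ids[start:start + block_size])
        digest = hashlib.sha256(parent + repr(block).encode()).digest()
        hashes.append(digest)
        parent = digest
    return hashes


class PrefixCache:
    def __init__(self):
        self.blocks = {}  # block hash -> placeholder for stored KV tensors

    def insert(self, token_ids):
        for h in block_hashes(token_ids):
            self.blocks.setdefault(h, "kv-tensors")

    def matched_tokens(self, token_ids):
        """Length in tokens of the longest cached prefix."""
        matched = 0
        for h in block_hashes(token_ids):
            if h not in self.blocks:
                break
            matched += BLOCK_SIZE
        return matched


system_prompt = list(range(1000, 1064))  # 64 stand-in token ids
cache = PrefixCache()
cache.insert(system_prompt + [7, 8, 9])

hit = cache.matched_tokens(system_prompt + [42, 43])  # same prefix, new query
miss = cache.matched_tokens([42] + system_prompt)     # dynamic content first
print(hit, miss)
```

Because every block hash depends on its predecessors, a single changed token at the start shifts every subsequent hash, so nothing after it can match.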

> **Warning:** Prefix caching only helps if the prefix is exactly identical at the token level — not semantically similar, but byte-for-byte identical in the token sequence. Whitespace differences, prompt template variations, or any dynamic content inserted before the static content will invalidate the cache hit. Structure your prompts so that static content comes first and dynamic content comes after.

### The Cache Eviction Problem

KV cache is bounded by GPU memory. When the cache fills up, old entries must be evicted to make room for new ones. The eviction policy determines which entries are discarded — and getting this wrong has direct latency consequences, because an evicted entry means the next request that would have hit it must recompute from scratch.

Common eviction strategies:

**LRU (Least Recently Used):** Evict the entry that hasn't been accessed in the longest time. Simple, effective, but can be suboptimal for patterns where some sequences are revisited infrequently but are expensive to recompute.

**LFU (Least Frequently Used):** Evict the entry accessed the fewest times. Better for skewed access patterns but requires frequency tracking.

**Prefix-aware eviction:** Evict leaf nodes in the prefix tree before parent nodes, preserving shared prefixes. This is what vLLM's cache manager implements. The intuition is that a shared prefix serving N sequences is N times as valuable as a unique sequence's cache.

**Beam search cache:** For scenarios using beam search, the cache needs to support branching — multiple sequence prefixes sharing a common trunk. This is a specialized case that most serving frameworks handle through copy-on-write semantics.

For production deployments, cache eviction is a tunable behavior. If you're seeing high recompute rates in your traces, the first question is whether your working set fits in cache. If it doesn't, the options are: increase cache budget (more GPU memory, or reduce model weight precision to free space), reduce sequence diversity (structure prompts to share more prefix), or accept the miss rate and size the GPU cluster accordingly.
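The "does the working set fit" question can be illustrated with a small LRU simulation. This is a simplified sketch with invented names; it treats each cached prefix as one entry and ignores entry sizes.

```python
from collections import OrderedDict


class LRUBlockCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.entries = OrderedDict()  # key -> cached value, oldest first
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self.entries:
            self.entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        return None

    def put(self, key, value):
        self.entries[key] = value
        self.entries.move_to_end(key)
        while len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used


def recompute_rate(cache, request_keys):
    """Fraction of requests that had to recompute their prefix."""
    for key in request_keys:
        if cache.get(key) is None:
            cache.put(key, "kv-blocks")
    return cache.misses / (cache.hits + cache.misses)


# Working set of 3 prefixes, requested round-robin.
requests = ["a", "b", "c"] * 10
fits = recompute_rate(LRUBlockCache(capacity=3), requests)
thrashes = recompute_rate(LRUBlockCache(capacity=2), requests)
print(f"capacity 3: {fits:.0%} recompute, capacity 2: {thrashes:.0%} recompute")
```

With a cache one entry too small, a cyclic access pattern makes LRU miss on every request, which is why a modest increase in cache budget can produce a large drop in recompute rate.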

### KV Cache in Multi-GPU Deployments

Things get more complicated when the model is distributed across multiple GPUs. Tensor parallelism splits attention heads across devices: each GPU computes a subset of heads and holds the corresponding KV cache fragments. For a model with 64 attention heads distributed across 4 GPUs, each GPU handles 16 query heads and stores KV cache only for the KV heads that serve them (16 under multi-head attention, 2 for a GQA model with 8 KV heads).

Pipeline parallelism, which splits layers across GPUs, means different GPUs hold cache for different layers. A 4-GPU pipeline-parallel deployment of an 80-layer model might give each GPU 20 layers of KV cache.

Both approaches require careful memory planning. The total KV cache memory is the same regardless of how it's distributed, but the distribution affects whether any single GPU becomes a bottleneck. An imbalanced cache allocation can cause one GPU to evict aggressively while others sit with ample space.
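For planning purposes, the per-GPU share can be estimated with a short calculation. This sketch assumes KV heads and layers divide evenly across devices, which real deployments do not always guarantee; the function name is illustrative.

```python
def per_gpu_kv_bytes(layers: int, kv_heads: int, head_dim: int, seq_len: int,
                     tensor_parallel: int = 1, pipeline_parallel: int = 1,
                     bytes_per_element: int = 2) -> int:
    """KV cache bytes held by one GPU for one sequence."""
    if kv_heads % tensor_parallel or layers % pipeline_parallel:
        raise ValueError("this sketch assumes heads and layers divide evenly")
    local_heads = kv_heads // tensor_parallel     # tensor parallelism splits heads
    local_layers = layers // pipeline_parallel    # pipeline parallelism splits layers
    return 2 * local_layers * local_heads * head_dim * seq_len * bytes_per_element


total = per_gpu_kv_bytes(80, 8, 128, 4096)
tp4 = per_gpu_kv_bytes(80, 8, 128, 4096, tensor_parallel=4)
pp4 = per_gpu_kv_bytes(80, 8, 128, 4096, pipeline_parallel=4)
print(f"single GPU: {total / 2**30:.2f} GiB, TP=4: {tp4 / 2**30:.4f} GiB, "
      f"PP=4: {pp4 / 2**30:.4f} GiB")
```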

**Key Takeaways:**
1. The KV cache stores pre-computed key and value tensors, eliminating redundant attention computation during generation.
2. Cache size scales with layers × KV heads × head_dim × sequence_length × batch_size. This is often 10-30% of total GPU memory in a well-configured deployment.
3. PagedAttention allocates cache in fixed blocks, enabling near-zero memory waste and cross-request prefix sharing.
4. Prefix caching dramatically reduces prefill cost for workloads with shared static prompts — but only for exact token matches.
5. Cache eviction policy affects effective cache utilization as much as raw cache size.

> **Try This:** Install vLLM locally or in a dev environment. Run a simple benchmark: serve a model with a 1,000-token system prompt, then fire 50 requests that all start with that system prompt. Profile the TTFT on the first request versus subsequent requests. The gap tells you how much prefix caching is saving. Then shuffle your prompt structure so the static content is not a prefix (put dynamic content first). Rerun. The gap should collapse.

---

## Chapter 3: Prompt Caching in Hosted APIs

When you run inference through a hosted API rather than self-hosted infrastructure, you don't control the KV cache directly. But several major providers expose caching behavior through pricing and API mechanisms — and understanding how it works determines whether you design your applications to benefit from it or accidentally work against it.

### How Hosted Providers Implement Caching

Anthropic's Claude API, Google's Gemini API, and OpenAI's API all offer some form of prompt caching, though the mechanics and pricing differ.

**Anthropic (Claude)** uses explicit cache control. You mark specific parts of your prompt with a `cache_control` parameter, telling the API where to create cache checkpoints. Tokens up to that checkpoint are cached server-side, and subsequent requests that share the same content up to that point pay a reduced read price rather than the full input price.

```python
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-opus-4-7",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a helpful assistant with deep expertise in "
                    "distributed systems. You will answer questions about "
                    "infrastructure, databases, and scalability...",
        },
        {
            "type": "text",
            "text": (
                "RAFT CONSENSUS REFERENCE\n\n"
                "Leader Election: A Raft cluster maintains one leader at a time. "
                "Followers wait a randomized election timeout (150–300 ms). If no "
                "heartbeat arrives, the follower becomes a candidate, increments its "
                "term, and sends RequestVote RPCs. A majority vote wins the election.\n\n"
                "Log Replication: The leader appends client entries to its log and "
                "sends AppendEntries RPCs to followers. An entry is committed once "
                "acknowledged by a majority. Committed entries are never overwritten.\n\n"
                "Safety: A candidate can only win if its log is at least as up-to-date "
                "as any voter's log — determined by term and index comparison. This "
                "guarantees that every committed entry is present on any future leader.\n\n"
                "Membership Changes: Cluster reconfiguration uses joint consensus. Both "
                "the old and new configurations must independently achieve majorities "
                "before the new configuration takes effect, preventing split-brain "
                "during topology changes.\n\n"
                "PAXOS VS RAFT: Paxos allows concurrent proposals; Raft serializes "
                "all writes through a single leader, simplifying reasoning about "
                "ordering at the cost of leader-bottleneck throughput.\n\n"
                "CONSISTENCY MODELS: Strong consistency requires linearizability — "
                "every read reflects the most recent committed write. Eventual "
                "consistency relaxes this, allowing stale reads in exchange for "
                "higher availability under partition."
            ),
            "cache_control": {"type": "ephemeral"},  # Cache checkpoint here
        }
    ],
    messages=[
        {"role": "user", "content": "How does Raft handle leader election?"}
    ]
)
```

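The `cache_control: ephemeral` marker tells the API to cache all tokens up to that point. The cache entry lives for approximately 5 minutes, and that lifetime is refreshed each time the entry is read. Subsequent requests that share the same prefix up to the cache checkpoint will pay the cache read price (~10% of the normal input price) rather than the full input price. Providers also enforce a minimum cacheable prefix length (on the order of 1,024 tokens for many Claude models; check the current documentation), so a real prefix must be longer than the abbreviated reference text shown here.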

**OpenAI** offers automatic prompt caching for prompts of 1,024 tokens or more. There's no explicit API parameter; the caching happens automatically when your prompt prefix is long enough and matches a cached prefix on their servers. The discount is also applied automatically: cached tokens were billed at 50% of the standard input price when the feature launched, and the rate varies by model.

**Google Gemini** has an explicit "context caching" feature. You create a named cache object from your content, receive a cache ID, and reference that ID in subsequent requests. The cache has a configurable TTL, which defaults to one hour if you don't set it. This model makes caching more predictable, since you know exactly what's cached and for how long, but it requires managing cache lifecycle explicitly.

```python
import datetime

import google.generativeai as genai

# Create cache. Context caching expects a pinned model version, and real cached
# content must meet the model's minimum token count; this short text is illustrative.
cache = genai.caching.CachedContent.create(
    model="models/gemini-1.5-pro-001",
    contents=[
        {
            "role": "user",
            "parts": [{"text": (
                "ENTERPRISE SOFTWARE LICENSE AGREEMENT v4.2\n\n"
                "1. GRANT OF LICENSE\nSubject to the terms of this Agreement, "
                "Licensor grants Licensee a non-exclusive, non-transferable, "
                "limited license to install and use the Software solely for "
                "Licensee's internal business operations.\n\n"
                "2. RESTRICTIONS\nLicensee shall not: (a) sublicense, sell, "
                "resell, transfer, assign, or otherwise dispose of the Software; "
                "(b) modify or make derivative works based upon the Software; "
                "(c) reverse engineer or access the Software to build a competitive "
                "product or service.\n\n"
                "3. PRICING SCHEDULE\nTier 1 (1–50 seats): $42/seat/month. "
                "Tier 2 (51–250 seats): $38/seat/month. "
                "Tier 3 (251–1000 seats): $31/seat/month. "
                "Enterprise (1000+ seats): negotiated contract.\n\n"
                "4. SUPPORT AND SLA\nPlatinum support: 99.99% uptime, 15-minute "
                "response for P1 incidents. Gold support: 99.9% uptime, 1-hour "
                "response for P1 incidents. Standard support: 99.5% uptime, "
                "next-business-day response."
            )}]
        }
    ],
    ttl=datetime.timedelta(hours=1),  # this SDK takes a timedelta, not a "3600s" string
)

# Use cache in requests
model = genai.GenerativeModel.from_cached_content(cache)
response = model.generate_content("What does section 3 say about pricing?")
```

### The Economics

The cost differential between cached and uncached input tokens is significant enough that prompt structure becomes a financial engineering problem.

At Anthropic's pricing for Claude Sonnet 4.6:
- Standard input: $3.00 per million tokens
- Cache write: $3.75 per million tokens (25% premium on first write)
- Cache read: $0.30 per million tokens (90% discount on subsequent reads)

The break-even calculation is straightforward. If your cached prefix is P tokens and you make N requests:
- Without caching: N × P × $3.00 / 1M
- With caching: 1 × P × $3.75 / 1M + (N-1) × P × $0.30 / 1M

Setting the two expressions equal gives N = (3.75 − 0.30) / (3.00 − 0.30) ≈ 1.3, so the cache write premium is recovered on the very first cache hit. From the second request onward, you're saving money. After 100 requests with the same prefix, you're paying roughly 11% of what you'd pay uncached.

For a 4,000-token system prompt across 10,000 requests per day, the math is stark:
- Uncached daily cost: 10,000 × 4,000 × $3.00 / 1M = $120/day
- Cached daily cost: $0.015 (first write) + 9,999 × 4,000 × $0.30 / 1M = $12/day
- Monthly savings: roughly $3,240
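The arithmetic above can be reproduced with a few lines of Python. The prices are the ones quoted in this section; the function names are illustrative, and the model assumes the cache never expires between requests, which is optimistic for a short TTL.

```python
INPUT, CACHE_WRITE, CACHE_READ = 3.00, 3.75, 0.30  # USD per million tokens


def uncached_cost(prefix_tokens: int, requests: int) -> float:
    return requests * prefix_tokens * INPUT / 1e6


def cached_cost(prefix_tokens: int, requests: int) -> float:
    """One cache write followed by cache reads for every later request."""
    if requests == 0:
        return 0.0
    return prefix_tokens * (CACHE_WRITE + (requests - 1) * CACHE_READ) / 1e6


# Solve N * INPUT = CACHE_WRITE + (N - 1) * CACHE_READ for N.
break_even = (CACHE_WRITE - CACHE_READ) / (INPUT - CACHE_READ)

daily_uncached = uncached_cost(4_000, 10_000)
daily_cached = cached_cost(4_000, 10_000)
monthly_savings = 30 * (daily_uncached - daily_cached)

print(f"break-even at {break_even:.2f} requests")
print(f"daily: ${daily_uncached:.2f} uncached vs ${daily_cached:.2f} cached")
print(f"monthly savings: ${monthly_savings:,.0f}")
```

If the cache expires and has to be rewritten many times per day, each rewrite adds another write premium, so treat the cached figure as a lower bound.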

> **Key Insight:** Prompt caching at hosted APIs is first and foremost a cost feature, and it often provides the largest single optimization available in API-based applications. Before tuning anything else, audit whether your cache hit rate is near 100% for your static content.

### Designing for Cache Hits

Most cache misses in production aren't random. They follow predictable patterns caused by prompt structure choices that work against caching.

**The static-first rule.** Cache checkpoints work from the beginning of the prompt. If your prompt looks like this:

```
[dynamic user context: 200 tokens]
[static system instructions: 2,000 tokens]
[user query: 50 tokens]
```

The cache can't match anything, because the prefix is dynamic. Restructure to:

```
[static system instructions: 2,000 tokens]  ← cache checkpoint here
[dynamic user context: 200 tokens]
[user query: 50 tokens]
```

Now the first 2,000 tokens are always identical, and the cache hit rate approaches 100%.

**Conversation threading.** In multi-turn conversations, each turn appends new messages to the history. The cache checkpoint should be placed after the static system prompt, not at the end of the conversation history. The system prompt is always at the beginning and never changes; the conversation history grows. If you place the checkpoint at the end of history, you're caching a different, growing prefix every turn.

**Templating side effects.** Dynamic values injected into an otherwise-static prompt destroy cache hits. A system prompt that includes `"Current date: {date}"` will never cache, because the date changes. Consider whether the dynamic element is actually necessary in the prompt. If it is, move it after the cache checkpoint.
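One way to keep a date out of the cached prefix is to split the system prompt into two blocks. The sketch below only builds the block list in the same shape as the `system` parameter shown earlier in this chapter and makes no API call; the helper name and the instruction text are hypothetical.

```python
from datetime import date

STATIC_INSTRUCTIONS = "You are a support assistant. Follow the policies below..."


def build_system_blocks(static_text: str, dynamic_text: str) -> list:
    """Static block first with the cache checkpoint, dynamic block after it."""
    return [
        {
            "type": "text",
            "text": static_text,
            "cache_control": {"type": "ephemeral"},  # checkpoint ends here
        },
        {"type": "text", "text": dynamic_text},      # changes freely, never cached
    ]


blocks_monday = build_system_blocks(
    STATIC_INSTRUCTIONS, f"Current date: {date(2026, 4, 20)}")
blocks_tuesday = build_system_blocks(
    STATIC_INSTRUCTIONS, f"Current date: {date(2026, 4, 21)}")

# The cacheable block is identical across days; only the trailing block differs.
prefix_is_stable = blocks_monday[0] == blocks_tuesday[0]
print(prefix_is_stable)
```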

> **Warning:** API-level prompt caching is session-scoped and server-side — you don't control which server handles your request in a multi-region deployment. Some providers tie cache to a specific server instance, meaning cross-region failover can invalidate your cache. Check your provider's documentation on cache scope before assuming 100% hit rates.

### Measuring Cache Performance

Most providers return cache usage information in API responses. For Anthropic:

```python
response = client.messages.create(...)

# Check cache usage
print(response.usage)
# Example output on a cache hit (field values are illustrative):
# Usage(
#   input_tokens=150,
#   output_tokens=423,
#   cache_creation_input_tokens=0,
#   cache_read_input_tokens=2000
# )
```

`cache_creation_input_tokens > 0` means this was the request that populated the cache. `cache_read_input_tokens > 0` means tokens were read from cache. If both are zero and input_tokens is large, you're not caching.

Build monitoring around these fields from the start. A dashboard that shows cache hit rate over time, segmented by endpoint or application, tells you immediately whether a deployment change broke your caching structure.

```python
def track_cache_efficiency(response):
    """Summarize cache usage for one response; missing or None fields count as 0."""
    usage = response.usage
    cache_reads = getattr(usage, "cache_read_input_tokens", None) or 0
    cache_writes = getattr(usage, "cache_creation_input_tokens", None) or 0
    # input_tokens excludes cached tokens, so sum all three for the effective total.
    total_input = usage.input_tokens + cache_writes + cache_reads
    hit_rate = cache_reads / total_input if total_input > 0 else 0.0

    return {
        "hit_rate": hit_rate,
        "cache_reads": cache_reads,
        "cache_writes": cache_writes,
        "uncached_input": usage.input_tokens,
    }

### Multi-Turn Applications

Conversation applications deserve special attention because they have a natural structure for caching: the system prompt and early conversation history are static relative to the current turn.

The strategy is to place a cache checkpoint after the system prompt and let the conversation history grow uncached. As conversation progresses, the fraction of total tokens that are cached decreases — you can't cache a conversation that's still happening. But you preserve the savings on the static system prompt across all turns.

For very long conversations, some applications re-checkpoint periodically: marking a cache point after a certain number of accumulated turns, summarizing older context, and restarting the cache from the summary. This is a tradeoff between implementation complexity and cache efficiency for long sessions.

**Key Takeaways:**
1. Hosted API caching operates on exact token prefix matches — structure your prompts so static content is always first.
2. Break-even on cache write cost typically happens within 2-3 requests; savings compound quickly at scale.
3. Measure cache hit rates explicitly from response usage fields — don't assume.
4. Conversation applications should cache the system prompt, not the conversation history.
5. Dynamic values in prompts (dates, user IDs, timestamps) destroy cache hits for everything that follows them.

> **Try This:** Pick an existing application or API integration. Log the `usage` field from every response for one day of traffic. Calculate the average `cache_read_input_tokens` as a fraction of total effective input. If it's below 60% and your application uses a system prompt, you have room to improve your prompt structure. Identify the first dynamic element in your prompt and move everything before it into a cacheable prefix.

---

## Chapter 4: Prefill vs. Decode: Where Time Goes

Inference latency is not a single number. It's the sum of two fundamentally different phases: prefill, which processes the entire input before generating a single token, and decode, which generates output tokens one at a time. Understanding each phase separately is required for understanding why your latency profile looks the way it does.

### Prefill: Parallel Processing of the Prompt

During prefill, the model processes all input tokens simultaneously. This is a matrix multiplication-heavy operation that maps well onto GPU hardware — GPUs excel at parallel computation, and processing N tokens in parallel is exactly the kind of workload they're designed for.

The prefill phase produces the KV cache entries for the input tokens and generates the logits for the next token (the first output token). Time-to-first-token (TTFT) is essentially the duration of prefill, plus any time the request spends queued or in transit. Nearly everything the user waits for before seeing any response is prefill time.

The cost of prefill's weight matrix multiplications (the projections and feed-forward layers) scales linearly with N, the sequence length. Attention, however, scales as O(N²) because every token attends to every earlier token, so for very long prompts the attention term dominates and TTFT grows faster than linearly.

Prefill is primarily compute-bound on modern hardware. The GPU is doing a lot of arithmetic on a large batch of data, and the arithmetic units are the bottleneck, not memory bandwidth. This means prefill benefits directly from more powerful compute hardware and from techniques that reduce the number of attention operations (like sliding window attention or prefix caching).
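
To get a feel for where the quadratic attention term starts to matter, here is a back-of-the-envelope sketch. It assumes the common approximations of about 2 FLOPs per parameter per token for the weight matmuls and about 4·N²·d FLOPs per layer for attention (halved for causal masking). The 70B-class shape (80 layers, model width 8192) is an illustrative assumption.

```python
def prefill_flops(n_tokens: int, n_params: float, n_layers: int, d_model: int) -> dict:
    """Rough forward-pass FLOP estimate for prefilling n_tokens."""
    # Weight matmuls: ~2 FLOPs per parameter per token (linear in N).
    linear = 2.0 * n_params * n_tokens
    # Attention scores (QK^T) and weighted sum (AV): ~2*N^2*d each per layer,
    # halved overall by the causal mask (quadratic in N).
    attention = 0.5 * 4.0 * n_layers * d_model * n_tokens**2
    return {
        "linear": linear,
        "attention": attention,
        "attention_share": attention / (linear + attention),
    }

for n in (1_000, 8_000, 32_000, 128_000):
    est = prefill_flops(n, n_params=70e9, n_layers=80, d_model=8192)
    print(f"{n:>7} tokens: attention is {est['attention_share']:.0%} of prefill FLOPs")
```

Under these assumptions attention is around 1% of prefill work at 1K tokens and more than half at 128K, which is why reducing attention work pays off mainly on long prompts.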

### Decode: Sequential Token Generation

After prefill, the model generates tokens one at a time. Each decode step produces exactly one token, adds it to the context, and uses it as input for the next step. This sequential dependency — each token depends on all previous tokens — means decode cannot be parallelized across tokens the way prefill can.

At each decode step, the model runs a forward pass for a single token (or a small batch of tokens, one per sequence in a batch). This looks very different from prefill computationally: you're running matrix multiplications where one dimension is 1 (or the batch size, which is small), which is extremely inefficient on GPU hardware.

GPUs have thousands of arithmetic units and are designed to keep them all busy simultaneously. A matrix multiplication where one dimension is 1 barely scratches the surface of that capacity. The GPU spends most of its time waiting on memory — reading the model weights, reading the KV cache — rather than doing arithmetic. Decode is memory bandwidth-bound.

This creates a hard asymmetry:
- Prefill: compute-bound, benefits from faster arithmetic
- Decode: memory bandwidth-bound, benefits from faster memory and smaller models

The throughput characteristics follow the same pattern. During prefill, you can process a large batch of long prompts efficiently because the GPU is compute-saturated. During decode, throughput is limited by how fast you can move data from memory, and batching helps by amortizing that cost across multiple sequences — but there are limits.

> **Key Insight:** If your TTFT is acceptable but your total response latency is high, the problem is decode speed, not prefill. If your first-token latency is high, the problem is prefill. These require different solutions and different profiling approaches.

### The Arithmetic: Where Time Actually Goes

For a typical request with a 1,000-token prompt and 200-token response:

```
Prefill:  Process 1,000 tokens in parallel
          → Produces KV cache for 1,000 tokens
          → Generates first token
          → Duration: proportional to N × compute_cost

Decode:   Generate 200 tokens sequentially
          → Each step: one forward pass + KV cache lookup + KV cache append
          → Duration: 200 × per_step_cost
```

The per-step cost during decode includes:
1. Loading model weights from memory (this dominates for large models)
2. Reading KV cache for all previous tokens
3. Running attention over cached K and V
4. Feed-forward computation
5. Sampling the next token

On a fast GPU, a single decode step for a 7B model might take 20-30ms. For a 70B model, 150-200ms. Generating 200 tokens at 150ms/step means 30 seconds of generation time, which is often unacceptable for interactive applications.

This is why there's sustained investment in reducing decode latency: better hardware with higher memory bandwidth, quantization to make model weights smaller, and speculative decoding to generate multiple tokens per step.
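
Since loading weights and reading the KV cache dominate the per-step cost, a simple bandwidth calculation gives a floor on decode step time. The sketch below assumes every weight byte and cached K/V byte is read once per step and ignores kernel overheads and multi-GPU communication, so measured step times will be higher. The bandwidth and size figures are illustrative inputs.

```python
def decode_step_floor_ms(weight_bytes: float, kv_bytes: float,
                         bandwidth_bytes_per_s: float) -> float:
    """Lower bound on one decode step: all weights and cached K/V are read once."""
    return (weight_bytes + kv_bytes) / bandwidth_bytes_per_s * 1000.0

GB = 1e9
TB = 1e12

# Assumed inputs: 70B model, ~2.68 GB of KV cache for one sequence, 2.0 TB/s of bandwidth.
fp16_floor = decode_step_floor_ms(140 * GB, 2.68 * GB, 2.0 * TB)
int4_floor = decode_step_floor_ms(35 * GB, 2.68 * GB, 2.0 * TB)
print(f"70B fp16 weights: at least {fp16_floor:.1f} ms per step")
print(f"70B int4 weights: at least {int4_floor:.1f} ms per step")
```

The floor moves only when bytes shrink or bandwidth grows, which is exactly what quantization and faster memory provide.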

### Inter-Token Latency and User Experience

For interactive applications, the perceived experience during generation depends more on inter-token latency (ITL) — the time between successive tokens — than on total generation time.

A response that streams tokens at 20 tokens/second feels faster than one that buffers for 15 seconds and then dumps 300 tokens at once, even though the total generation time is identical. This is because streaming shows progress, and users tolerate latency better when they can see movement.

For engineers designing systems, this means:
- Enable streaming in your API calls whenever possible
- Monitor per-token latency, not just total generation time
- Separate TTFT monitoring from ITL monitoring — they indicate different system behaviors

A degradation in ITL while TTFT stays stable indicates decode pressure: the GPU is being pushed harder on the generation phase, probably due to increased batch size or longer accumulated KV caches. A degradation in TTFT with stable ITL indicates prefill pressure: longer prompts or reduced prefix cache hit rate.
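
Separating the two metrics is straightforward if you record a timestamp when the request is sent and another as each streamed token arrives. A minimal sketch, with the function name and return keys chosen here purely for illustration:

```python
import statistics

def latency_metrics(request_sent: float, token_times: list[float]) -> dict:
    """Compute TTFT and ITL stats from a send time and token arrival times (seconds)."""
    if not token_times:
        raise ValueError("no tokens received")
    ttft = token_times[0] - request_sent
    # Gaps between successive tokens are the inter-token latencies.
    gaps = [later - earlier for earlier, later in zip(token_times, token_times[1:])]
    return {
        "ttft_s": ttft,
        "itl_mean_s": statistics.mean(gaps) if gaps else 0.0,
        "itl_max_s": max(gaps) if gaps else 0.0,
    }
```

In a streaming client you would call `time.perf_counter()` before sending and once per received chunk, then pass those values in. Tracking the maximum gap alongside the mean helps surface the stalls that averages hide.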

### Chunked Prefill

A significant operational challenge in shared inference systems is that prefill and decode compete for GPU resources. A single long prefill request can stall decode for other sequences in the batch, spiking their inter-token latency.

The scenario: you have 8 sequences decoding simultaneously at a comfortable pace. A new request arrives with a 32K-token prompt. The system needs to run prefill for that prompt. Prefill is compute-bound and saturates the GPU for, say, 2 seconds. During those 2 seconds, the 8 active decode sequences receive no updates. Their users see a 2-second freeze in the token stream.

Chunked prefill addresses this by splitting long prompts into smaller chunks and interleaving prefill chunks with decode steps. Instead of running the full 32K prefill in one shot, the system processes 512 tokens of prefill, then runs a decode step, then processes another 512 tokens of prefill, and so on.

This extends the TTFT for the new request but prevents the latency spike for existing requests. The tradeoff is tunable via chunk size — larger chunks favor TTFT for new requests, smaller chunks favor ITL stability for existing sequences.

```
Without chunked prefill:
[---long prefill: 32K tokens---] [decode][decode][decode]...
 ↑ decode for other sequences stalls here

With chunked prefill (chunk=512):
[prefill:512][decode][prefill:512][decode][prefill:512]...
 ↑ decode interleaved, no stall
```
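
The interleaving in the diagram can be sketched as a simple generator. This models strict alternation between one prefill chunk and one decode step; real schedulers typically work from a per-step token budget and may combine prefill and decode tokens in the same step, so treat this as a toy model rather than any framework's actual policy.

```python
def chunked_prefill_schedule(prompt_tokens: int, chunk_size: int):
    """Yield (kind, n_tokens) steps that alternate prefill chunks with decode steps."""
    remaining = prompt_tokens
    while remaining > 0:
        chunk = min(chunk_size, remaining)
        yield ("prefill", chunk)
        remaining -= chunk
        # One decode step for the already-running sequences between chunks.
        yield ("decode", 1)

steps = list(chunked_prefill_schedule(prompt_tokens=32_768, chunk_size=512))
n_chunks = sum(1 for kind, _ in steps if kind == "prefill")
print(f"{n_chunks} prefill chunks, each followed by a decode step")
```

With a 32K prompt and 512-token chunks, existing sequences get 64 decode opportunities during the prefill instead of none.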

> **Warning:** Chunked prefill increases TTFT for long prompts. If your application has strict TTFT requirements for certain request types, you may need to route them to a dedicated pool that doesn't share resources with decode-heavy workloads.

### Continuous Batching

Traditional batching runs a fixed batch until all sequences complete, then starts a new batch. This wastes GPU resources badly: as shorter sequences finish and leave the batch, remaining sequences underutilize available capacity.

Continuous batching (also called iteration-level scheduling) adds new sequences to the batch at each decode step, filling the slots vacated by completed sequences. The batch composition is dynamic — at any given step, you might have sequences at different stages of their generation.

This has profound effects on throughput. A GPU serving requests with variable output lengths (some at 50 tokens, some at 500) can maintain near-constant batch utilization rather than degrading as short sequences complete.

vLLM, TensorRT-LLM, and SGLang all implement continuous batching. It's now table stakes for production inference serving.

The interaction with KV cache management is direct: continuous batching requires that KV cache blocks can be allocated and freed per-sequence at any time, not just at batch boundaries. This is the problem PagedAttention solves, which is why the two optimizations are complementary and almost always deployed together.
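
A small simulation makes the utilization difference concrete. The sketch below counts decode steps only, ignores prefill, and assumes every request is already queued. The function names and the workload are illustrative.

```python
from collections import deque

def static_batch_steps(output_lengths: list[int], slots: int) -> int:
    """Decode steps when each fixed batch runs until its longest sequence finishes."""
    return sum(
        max(output_lengths[i:i + slots])
        for i in range(0, len(output_lengths), slots)
    )

def continuous_batch_steps(output_lengths: list[int], slots: int) -> int:
    """Decode steps when a finished sequence's slot is refilled at the next step."""
    waiting = deque(output_lengths)
    active: list[int] = []  # remaining tokens for each running sequence
    steps = 0
    while waiting or active:
        # Fill free slots from the queue before each step.
        while waiting and len(active) < slots:
            active.append(waiting.popleft())
        steps += 1
        # Each running sequence emits one token; finished ones leave the batch.
        active = [r - 1 for r in active if r > 1]
    return steps

lengths = [50, 500, 50, 500, 50, 50, 50, 50]
print(static_batch_steps(lengths, slots=2), continuous_batch_steps(lengths, slots=2))
```

For this mix of 50-token and 500-token outputs on two slots, static batching needs 1,100 steps and continuous batching needs 650, the minimum possible for 1,300 total tokens.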

**Key Takeaways:**
1. Prefill is compute-bound; decode is memory bandwidth-bound. Different phases respond to different optimization strategies.
2. TTFT is primarily determined by prefill length. ITL is primarily determined by decode speed and batch composition.
3. Chunked prefill prevents long prompts from stalling decode for other active sequences.
4. Continuous batching dramatically improves GPU utilization for variable-length workloads.
5. Stream tokens to clients — it improves perceived latency even when total generation time is unchanged.

> **Try This:** Run two benchmarks against the same inference endpoint. First, send 100 requests with a 4,000-token prompt and a 50-token response. Measure average TTFT and ITL. Second, send 100 requests with a 50-token prompt and a 500-token response. Measure the same metrics. Case 1 should show a much higher TTFT, since it prefills 80× more tokens, though fixed per-request overhead usually keeps the ratio below 80×. If TTFT barely changes between the two cases, check whether prefix caching is active: identical prompts are served from cache after the first request.

---

## Chapter 5: Batching Strategies for Throughput

If latency is how fast you serve one request, throughput is how many requests you serve per unit time. These objectives conflict. Optimizing purely for throughput often increases latency; optimizing purely for latency wastes capacity. Production systems need to manage the tradeoff deliberately, and batching is the primary mechanism for doing so.

### Why Batching Matters for GPU Economics

GPU economics are driven by utilization. An A100 costs roughly $2-3/hour on cloud infrastructure. At 100% utilization, every operation the GPU performs is economically productive. At 10% utilization — typical for single-request inference with small models — you're paying for compute you're not using.

Batching increases utilization by running multiple requests through the model simultaneously. The model weights are loaded into memory once. The matrix multiplications that dominate computation can be performed over the combined batch, using the GPU's parallel architecture efficiently. Memory bandwidth, the bottleneck during decode, is amortized across multiple sequences.

The ideal batch size for compute efficiency is typically large — on an A100 with a 70B model, saturating the GPU at peak decode throughput often requires batch sizes of 64-256 depending on sequence length. But larger batches mean requests must wait for the batch to fill, which increases latency.

The operational question is: given your latency SLA, what's the largest batch size you can sustain?

### Static Batching vs. Dynamic Batching

**Static batching** collects a fixed number of requests and runs them as a group. Simple to implement, easy to reason about, but terrible for utilization when request rates are variable. If your batch size is 16 and requests arrive at 4/second, batches take 4 seconds to fill. If your latency budget is 2 seconds, a batch size of 16 is unusable at that arrival rate.

**Dynamic batching** with a timeout: collect requests for up to T milliseconds (or until batch size K, whichever comes first), then run the batch. This is better — the batch runs even if not full, bounding the wait time. The tradeoff is that small request spikes might fill the batch quickly (low latency, low waste), while quiet periods run small batches with low utilization.

```python
import asyncio
from collections import deque

class DynamicBatcher:
    def __init__(self, max_batch_size=16, max_wait_ms=50):
        self.queue = deque()
        self.max_batch_size = max_batch_size
        self.max_wait_ms = max_wait_ms

    async def add_request(self, request):
        # Each caller gets a future that resolves when its batch completes
        future = asyncio.get_running_loop().create_future()
        self.queue.append((request, future))
        return await future

    async def batching_loop(self, model):
        loop = asyncio.get_running_loop()
        while True:
            # Wait for first request
            while not self.queue:
                await asyncio.sleep(0.001)

            # Collect until the batch is full or the wait window closes
            deadline = loop.time() + (self.max_wait_ms / 1000)
            batch = []
            futures = []

            while len(batch) < self.max_batch_size:
                if self.queue:
                    req, fut = self.queue.popleft()
                    batch.append(req)
                    futures.append(fut)
                elif loop.time() >= deadline:
                    break
                else:
                    # Queue is empty but the window is still open
                    await asyncio.sleep(0.001)

            # Run batch; propagate failures so callers don't wait forever
            try:
                results = await model.generate_batch(batch)
            except Exception as exc:
                for fut in futures:
                    if not fut.done():
                        fut.set_exception(exc)
                continue
            for fut, result in zip(futures, results):
                if not fut.done():  # skip callers that were cancelled
                    fut.set_result(result)
```

### Continuous Batching in Practice

As discussed in the previous chapter, continuous batching (iteration-level scheduling) eliminates the structural waste of static batching by allowing sequences to join and leave the batch at each decode step.

The key operational parameters in continuous batching systems:

**`max_num_seqs`**: Maximum number of sequences that can be in the batch simultaneously. Higher values increase throughput but require more KV cache memory.

**`max_num_batched_tokens`**: Maximum total tokens processed per step. This bounds the prefill cost per step. If set too high, a single long prefill can dominate step time.

**`block_size`**: The number of tokens per KV cache block. Larger blocks reduce block-table bookkeeping overhead but increase internal fragmentation, because the last block of every sequence is only partially filled.

For vLLM, these are passed at server startup:

```bash
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --tensor-parallel-size 4 \
  --max-num-seqs 256 \
  --max-num-batched-tokens 32768 \
  --block-size 16 \
  --gpu-memory-utilization 0.90
```

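The `--gpu-memory-utilization 0.90` parameter is critical: it sets the fraction of each GPU's memory that vLLM is allowed to use in total. Model weights and activation workspace come out of that budget first, and whatever remains is preallocated as KV cache blocks. Setting it too high risks OOM, especially if other processes share the GPU; too low wastes memory that could hold more cache blocks and serve larger batches.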

### Sequence Packing

For fixed-length batching scenarios (fine-tuning, certain evaluation workloads), sequence packing improves utilization by fitting multiple short sequences into each batch slot. Instead of padding short sequences to the maximum length, sequences are concatenated and processed together with attention masks preventing cross-sequence attention.

This reduces the fraction of computation wasted on padding tokens. For datasets with high length variance — sequences ranging from 50 to 2,000 tokens — naive batching can waste 50-70% of compute on padding. Packing reduces that to near zero.
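
To make the padding arithmetic concrete, here is a small sketch that measures padding waste and packs sequences with a greedy first-fit-decreasing heuristic. It is one simple packing strategy among several, shown under the assumption that no sequence exceeds `max_len`. It is not a description of how any particular library packs.

```python
def padding_waste(lengths: list[int], max_len: int) -> float:
    """Fraction of compute spent on padding when every sequence is padded to max_len."""
    return 1.0 - sum(lengths) / (len(lengths) * max_len)

def pack_first_fit(lengths: list[int], max_len: int) -> list[list[int]]:
    """Greedy first-fit-decreasing packing of sequence lengths into rows of max_len."""
    rows: list[list[int]] = []
    for length in sorted(lengths, reverse=True):
        for row in rows:
            if sum(row) + length <= max_len:
                row.append(length)
                break
        else:
            # No existing row has room, so start a new one.
            rows.append([length])
    return rows

lengths = [50, 2000, 300, 1200, 80, 700, 150, 400]
rows = pack_first_fit(lengths, max_len=2048)
packed_waste = 1.0 - sum(lengths) / (len(rows) * 2048)
print(f"padded: {padding_waste(lengths, 2048):.0%} wasted, packed: {packed_waste:.0%} wasted")
```

With only eight sequences the packed rows still carry some slack. Larger datasets give the packer more combinations to work with, which is where waste approaches zero.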

Hugging Face's `DataCollatorWithFlattening` handles this for training workloads. For inference, it's less commonly needed because continuous batching already handles variable lengths gracefully.

### Priority Queuing and Request Scheduling

Not all requests are equal. An interactive chat session has a lower latency tolerance than a batch document analysis job. A premium customer's request may have stricter SLAs than a background task.

Priority queuing separates requests into classes and schedules high-priority requests ahead of low-priority ones. The simplest version has two queues: interactive and batch. Interactive requests are always run next; batch requests fill remaining capacity.
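
The two-queue version can be sketched with a single heap keyed on priority class. The class and constant names below are illustrative. Note that strict priority like this can starve batch work under sustained interactive load, so a production scheduler would likely add aging or a reserved share for the batch class.

```python
import heapq
import itertools

INTERACTIVE, BATCH = 0, 1  # lower value is served first

class PriorityScheduler:
    """Two-class scheduler: interactive requests always run before batch requests."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker keeps FIFO order within a class

    def submit(self, request, priority=BATCH):
        heapq.heappush(self._heap, (priority, next(self._counter), request))

    def next_batch(self, max_size):
        """Pop up to max_size requests, highest priority first."""
        batch = []
        while self._heap and len(batch) < max_size:
            _, _, request = heapq.heappop(self._heap)
            batch.append(request)
        return batch
```

Calling `next_batch` before each scheduling step fills the batch with interactive requests first and lets batch jobs take whatever slots remain.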

More sophisticated schedulers balance multiple objectives simultaneously: KV cache reuse (prefer requests that share prefixes with currently cached content), sequence length (shorter sequences complete faster, freeing batch slots), memory pressure (if cache is full, prefer sequences that fit in existing blocks), and priority.

SGLang's RadixAttention implements a tree-structured cache where shared prefixes are automatically discovered and reused across requests. The scheduler prefers requests that maximize cache hits, which reduces prefill cost and memory pressure simultaneously.

> **Key Insight:** Request scheduling is a form of cache-aware workload management. A scheduler that routes requests to servers with warm caches for their prefix — even across a multi-server deployment — can achieve cache hit rates far above what random routing would produce.

### Speculative Execution and Request Hedging

For latency-sensitive applications, speculative execution sends the same request to multiple servers simultaneously and uses whichever responds first. This trades cost (paying for N times the compute) for reduced tail latency (p99, p99.9). It's rarely appropriate for high-volume applications but can be the right choice for SLA-sensitive low-volume workloads where the tail latency cost is higher than the compute cost.

Hedging is a softer version: if a request hasn't completed within a threshold time, dispatch it to a second server. If the first server completes first, cancel the second request. This amortizes the hedging cost based on actual behavior rather than paying for it on every request.
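
Hedging maps naturally onto `asyncio`. In the sketch below, `send` is a hypothetical coroutine function that issues the request to one server. Error handling is deliberately minimal: if the first task to finish raised, the exception propagates to the caller.

```python
import asyncio

async def hedged_call(send, servers, hedge_after_s):
    """Send to servers[0]; if no reply within hedge_after_s, also send to servers[1]."""
    primary = asyncio.create_task(send(servers[0]))
    done, _ = await asyncio.wait({primary}, timeout=hedge_after_s)
    if done:
        return primary.result()

    # Primary is slow: launch the hedge and take whichever finishes first.
    backup = asyncio.create_task(send(servers[1]))
    done, pending = await asyncio.wait(
        {primary, backup}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()  # stop paying for the loser
    return done.pop().result()
```

The threshold `hedge_after_s` is the tuning knob: setting it near your observed p95 latency means only about one request in twenty triggers a second dispatch.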

### The Throughput-Latency Frontier

Every inference system has a throughput-latency frontier: a curve showing the best achievable latency at each throughput level. At low throughput, latency is near the single-request minimum. As throughput increases, latency rises — larger effective batches, longer queue wait times, more memory pressure. At some throughput level, latency degrades sharply as the system saturates.

Understanding where your system sits on this frontier is essential for capacity planning. A system operating at 80% of its saturation point has headroom for load spikes. A system at 98% will see severe latency degradation under any additional load.

The frontier shifts based on:
- Model size and architecture (determines base decode speed)
- Hardware (memory bandwidth, compute throughput)
- Sequence length distribution (longer sequences fill KV cache faster)
- KV cache configuration (determines max batch size before eviction)

Benchmarking tools like `llm-bench`, vLLM's benchmarking utilities, and Locust-based harnesses can map this frontier by sweeping request rates and measuring P50/P95/P99 TTFT and ITL.

**Key Takeaways:**
1. Batching is the primary mechanism for GPU utilization — but larger batches mean longer queue wait times. The tradeoff must match your latency SLA.
2. Continuous batching eliminates the structural waste of static batching and is now standard in all serious serving frameworks.
3. KV cache size determines maximum batch capacity — `max_num_seqs` is bounded by available cache blocks.
4. Cache-aware scheduling can dramatically improve effective cache hit rates at the fleet level.
5. Always know where your system sits on the throughput-latency frontier — that's the capacity planning curve.

> **Try This:** Run a sweep on your inference server: increase request concurrency from 1 to 2, 4, 8, 16, 32. At each level, measure P50 and P95 TTFT and ITL. Plot the curves. Find the knee — the concurrency level where latency starts degrading sharply. That's your effective saturation point. If it's lower than you need, the constraint is either compute (upgrade hardware), memory (reduce model size or precision), or batching configuration (tune max_num_seqs).

---

## Chapter 6: Quantization and Its Impact on Cache

Quantization is one of the most effective levers for reducing inference cost and latency, but its interaction with the KV cache is underappreciated. The decision to quantize a model — and at what precision — affects not just model weight memory, but the size and quality of every cached key-value pair.

### What Quantization Does

Neural networks are trained with floating-point weights, typically bf16 or fp32. Each weight is a 16-bit or 32-bit number representing a real value. Quantization converts these to lower-precision representations — int8 (8-bit integers), int4, fp8, or even lower — to reduce memory footprint and increase compute efficiency.

The memory reduction is proportional to the precision reduction. An fp16 model uses 16 bits per parameter. An int8 version uses 8 bits — half the memory. An int4 version uses 4 bits — a quarter. A 70B parameter model in fp16 requires approximately 140 GB of GPU memory just for weights. In int4, the same model requires approximately 35 GB. That's the difference between needing 2 A100s and needing 1.

Memory savings matter beyond just fitting the model. GPU memory freed from model weights can be used for KV cache. More KV cache means larger batch sizes and higher throughput.

### Weight Quantization vs. Activation Quantization

Weight quantization reduces the precision of the learned parameters (the weight matrices). This is the most common form and the most mature. Methods like GPTQ and AWQ, and the quantized formats used with GGUF, all quantize weights, typically to int4 or int8, while keeping activations in fp16 during inference.

Activation quantization reduces the precision of the intermediate tensors computed during the forward pass. This is harder because activations have more dynamic range than weights — their values vary based on input, making it difficult to choose a fixed quantization scale. Techniques like SmoothQuant and LLM.int8() handle this with various strategies for managing activation outliers.

The distinction matters for KV cache because keys and values are activations, not weights.
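
The outlier problem is easy to see with a toy example. The sketch below applies symmetric per-tensor int8 quantization with an absmax scale, one of the simplest schemes and not the method any specific library uses, to a handful of made-up values with and without a single large outlier.

```python
def quantize_int8(values: list[float]) -> tuple[list[int], float]:
    """Symmetric per-tensor int8 quantization using an absmax scale."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    return [max(-127, min(127, round(v / scale))) for v in values], scale

def dequantize(q: list[int], scale: float) -> list[float]:
    return [x * scale for x in q]

def mean_abs_error(values: list[float]) -> float:
    """Average round-trip error after quantizing and dequantizing."""
    q, scale = quantize_int8(values)
    restored = dequantize(q, scale)
    return sum(abs(a - b) for a, b in zip(values, restored)) / len(values)

typical = [0.12, -0.40, 0.33, 0.05, -0.27, 0.48, -0.09, 0.21]
with_outlier = typical + [60.0]  # one large activation stretches the scale

print(f"error without outlier: {mean_abs_error(typical):.5f}")
print(f"error with outlier:    {mean_abs_error(with_outlier):.5f}")
```

One outlier forces a scale so coarse that most of the ordinary values round to 0 or ±1, which is why practical schemes use finer-grained scales or treat outliers separately.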

### Quantizing the KV Cache

KV cache quantization stores the cached K and V tensors in reduced precision. This directly reduces cache memory footprint, allowing more tokens (and thus more sequences) to be cached simultaneously.

For a model with 80 layers, 8 KV heads, head dim 128, and 8K context:
- fp16: `2 × 80 × 8 × 8192 × 128 × 2 bytes = 2.68 GB` per sequence
- int8: `2 × 80 × 8 × 8192 × 128 × 1 byte = 1.34 GB` per sequence
- fp8: same as int8 with different numerical properties

Halving cache size means fitting twice as many sequences in the same memory budget — directly doubling maximum batch size, which directly increases throughput.
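
The per-sequence figures above follow from a one-line formula, sketched here so you can plug in your own model shape. The 45 GB cache budget in the example is a hypothetical figure (for instance, an 80 GB GPU after 35 GB of weights), and real deployments lose additional memory to activations and framework overhead.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   n_tokens: int, bytes_per_elem: int) -> int:
    """Bytes of K and V cached for one sequence of n_tokens."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_elem

def max_sequences(cache_budget_gb: float, per_seq_bytes: int) -> int:
    """Whole sequences that fit in a given cache budget."""
    return int(cache_budget_gb * 1e9 // per_seq_bytes)

fp16 = kv_cache_bytes(80, 8, 128, 8192, bytes_per_elem=2)
fp8 = kv_cache_bytes(80, 8, 128, 8192, bytes_per_elem=1)
print(f"fp16: {fp16 / 1e9:.2f} GB per sequence, fp8/int8: {fp8 / 1e9:.2f} GB per sequence")
print(f"45 GB cache budget: {max_sequences(45, fp16)} vs {max_sequences(45, fp8)} sequences")
```

With these inputs the budget holds 16 fp16 sequences or 33 at one byte per element.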

The quality tradeoff depends on the quantization method:

**fp8 KV cache**: The fp8 format (available on H100s and newer hardware) offers near-fp16 quality for most workloads. The larger dynamic range of fp8 compared to int8 means fewer outlier values are clipped or rounded badly. This is increasingly the default for high-throughput deployments on modern hardware.

**int8 KV cache**: More aggressive. Quality degradation is workload-dependent — tasks requiring precise recall of specific values in long contexts (RAG, document QA, code generation) can be more sensitive than open-ended generation. Benchmarking on your actual use case is essential.

**int4 KV cache**: Significant quality degradation for most tasks. Not commonly used in production for interactive workloads, but appears in edge deployment and memory-constrained scenarios.

```python
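# vLLM: enable FP8 KV cache quantization
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Meta-Llama-3-70B-Instruct",
    kv_cache_dtype="fp8",
    tensor_parallel_size=4,
    gpu_memory_utilization=0.90,
)

# Smoke test: confirm the model loads and generates with the fp8 cache.
# The outputs do not report the cache dtype; check the engine configuration
# that vLLM logs at startup to verify it.
outputs = llm.generate(
    ["What is the capital of France?"],
    SamplingParams(max_tokens=50)
)
print(outputs[0].outputs[0].text)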
```

> **Warning:** KV cache quantization is not the same as model weight quantization. A model loaded in fp16 with fp8 KV cache stores weights at full precision but caches activations at reduced precision. The combination is valid and commonly used. Don't conflate them in your configuration or your mental model.

### The Quantization Hierarchy in Practice

Production deployments often layer multiple quantization strategies:

1. **Model weights**: int4 or int8 using AWQ or GPTQ for memory reduction
2. **KV cache**: fp8 for memory efficiency with minimal quality loss
3. **Activations** (during computation): kept in fp16 for accuracy

This layered approach optimizes for different objectives at each layer. Weight quantization shrinks the bytes read on every decode step, which can speed up memory-bound decode, and frees memory for KV cache. KV cache quantization primarily benefits batch size and sequence length capacity.

Estimated memory figures for these combinations on a 70B model:

| Configuration | Weight Memory | KV Cache per Seq (8K) | Max Batch (80GB) |
|---|---|---|---|
| fp16 weights, fp16 KV | 140 GB | 2.68 GB | Needs 2+ GPUs |
| int4 weights, fp16 KV | 35 GB | 2.68 GB | ~16 sequences |
| int4 weights, fp8 KV | 35 GB | 1.34 GB | ~33 sequences |
| int4 weights, int8 KV | 35 GB | 1.34 GB | ~33 sequences |

Numbers are approximate and exclude OS overhead, framework buffers, and activations. But the order of magnitude is right, and the ratios are reliable.

### Quality Benchmarking for Quantized Cache

The practical risk with KV cache quantization is task-specific degradation that doesn't show up in standard benchmarks. MMLU, HellaSwag, and similar benchmarks may not detect issues that appear in:
- Long document summarization (outlier values in late positions)
- Multi-hop reasoning across long contexts
- Code generation with precise numerical outputs
- Tasks where exact copying of values from context is required

Before deploying fp8 or int8 KV cache in production, run your actual workload or a representative sample. Measure output quality directly — not with generic benchmarks, but with domain-specific evaluations. If you're serving a code assistant, test with code. If you're serving a document QA system, test with your documents.

> **Key Insight:** The right quantization level is workload-dependent, not a universal answer. fp8 KV cache is safe for most tasks. Int8 is usually safe for open-ended generation. Int4 cache is a last resort for constrained environments. Evaluate on your actual use case before committing to a configuration.

### Hardware Considerations

Quantization benefits depend on hardware support. Native fp8 arithmetic is available on H100 and H200 GPUs. On A100s, fp8 operations are emulated and may not provide the expected speedup. Int8 has native support on A100 (via Tensor Core int8 operations) but the throughput advantage is smaller than the theoretical 2× because other factors become bottlenecks.

When evaluating hardware for an inference deployment, check the hardware's memory bandwidth and its supported quantization data types. An H100 at 3.35 TB/s memory bandwidth with native fp8 can substantially outperform an A100 at 2.0 TB/s on decode-bound workloads: the H100 can move the same quantized cache data roughly 1.7× as fast.

**Key Takeaways:**
1. KV cache quantization (fp8, int8) reduces cache memory footprint, enabling larger batches for the same GPU memory.
2. Weight quantization and cache quantization are independent decisions with different tradeoffs — layer them for maximum effect.
3. fp8 is the practical default for KV cache on modern hardware — quality loss is minimal, memory savings are 50%.
4. Quality impact from cache quantization is task-dependent; benchmark on your actual workload.
5. Hardware support matters: native fp8 operations on H100/H200 provide real speedups; on A100s, the benefit is primarily memory reduction.

> **Try This:** If you're running vLLM or a similar framework, serve the same model with `kv_cache_dtype="auto"` (fp16 or bf16) and with `kv_cache_dtype="fp8"`. Run your production prompt distribution through both configurations. Compare quality on a random sample using your normal evaluation metric, and compare GPU memory utilization and throughput. If quality is equivalent and throughput improves, fp8 is a free win.

---

## Chapter 7: Speculative Decoding

Speculative decoding is one of the most counterintuitive optimizations in LLM inference. The idea sounds wasteful at first: use a small, fast model to speculatively generate multiple tokens, then have the large model verify them all in a single forward pass. But the math works out to a significant speedup, and understanding why requires a clear picture of where decode time is actually spent.

### The Core Problem with Auto-Regressive Generation

Standard token generation is inherently sequential. The model generates one token, appends it to the context, generates the next token, appends it, and so on. Each step requires a full forward pass through all layers of the model.

For a 70B parameter model in fp16 on A100-class hardware, a single forward pass for one new decode token takes roughly 150-200ms at small batch sizes. Generating 200 tokens takes 30-40 seconds. This is slow, not because the GPU isn't working hard, but because decode is memory bandwidth-bound: the hardware spends most of its time streaming the 140GB of model weights (sharded across several GPUs, since they exceed a single A100's 80GB) from GPU memory into the arithmetic units, over and over, for each token.

The insight behind speculative decoding: the bottleneck is memory bandwidth, not arithmetic. If you can batch multiple token verifications into a single forward pass, you amortize the memory bandwidth cost across all of them simultaneously.

### How Speculative Decoding Works

The algorithm has two components: a **draft model** (small and fast) and a **target model** (large and accurate).

The draft model generates K tokens speculatively — quickly, without waiting for verification. These are candidate continuations of the current context. The target model then runs a single forward pass that evaluates all K candidate tokens simultaneously by running its attention over the full context including all K candidates.

The target model's forward pass computes, in parallel:
- The probability of token K[0] given the context
- The probability of token K[1] given context + K[0]
- The probability of token K[2] given context + K[0] + K[1]
- ... and so on

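Using a rejection sampling procedure, each draft token is accepted with a probability that depends on how the target model's probability for that token compares with the draft model's: tokens the target rates at least as likely as the draft did are always accepted, and the rest are accepted proportionally. If all K tokens are accepted, the system has generated K tokens (plus one extra token sampled from the target's own output) in approximately the time of one target model forward pass. At the first rejection, the target model supplies the replacement token for that position, the remaining draft tokens are discarded, and drafting restarts from there.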
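
The accept/reject step can be sketched as follows. This uses the standard speculative sampling rule: accept a draft token x with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution, and on rejection resample from the normalized positive part of p - q. The function name, the dict-based distributions, and the toy vocabulary are illustrative assumptions, not any framework's API.

```python
import random


def verify_draft(draft_tokens, draft_probs, target_probs, rng=random):
    """Return the tokens kept from one speculative step.

    draft_tokens: proposed tokens, in order.
    draft_probs / target_probs: one dict per position mapping token -> probability
    under the draft (q) and target (p) models. target_probs has one extra entry
    for the position after the last draft token.
    """
    kept = []
    for i, token in enumerate(draft_tokens):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p.get(token, 0.0) / q[token]):
            kept.append(token)  # accepted: move on to the next draft token
            continue
        # First rejection: resample from the normalized positive part of (p - q),
        # then stop. Later draft tokens are discarded.
        residual = {t: max(0.0, p[t] - q.get(t, 0.0)) for t in p}
        tokens, weights = zip(*residual.items())
        kept.append(rng.choices(tokens, weights=weights)[0])
        return kept
    # Every draft token accepted: take one bonus token from the target.
    bonus = target_probs[len(draft_tokens)]
    tokens, weights = zip(*bonus.items())
    kept.append(rng.choices(tokens, weights=weights)[0])
    return kept


# Toy case: draft and target agree exactly, so both drafts are kept plus a bonus token.
target = [{"a": 1.0}, {"b": 1.0}, {"c": 1.0}]
draft = [{"a": 1.0}, {"b": 1.0}]
print(verify_draft(["a", "b"], draft, target))  # ['a', 'b', 'c']
```

The resampling step is what keeps the output distribution identical to sampling from the target model alone, which is why the technique is lossless in principle.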

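The expected number of tokens generated per target model forward pass (the "acceptance length") determines the speedup; the "acceptance rate" is the per-token probability that a draft token is accepted. If the draft model consistently proposes tokens the target model agrees with, acceptance rates of 80-90% are achievable. At an acceptance rate of 90% with K=4 draft tokens, a simple estimate of accepted tokens per step is K × 0.9 = 3.6. Because verification stops at the first rejection, the expected number of accepted draft tokens is closer to 3.1, but the target pass always contributes one token of its own, for roughly 4.1 tokens per step. Either way, each target pass yields three to four tokens instead of one, before accounting for draft overhead.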

```
Without speculative decoding:
Step 1: generate token 1 (150ms)
Step 2: generate token 2 (150ms)
Step 3: generate token 3 (150ms)
Total for 3 tokens: 450ms

With speculative decoding (K=4, 90% acceptance):
Step 1: draft generates tokens 1-4 (small model, ~20ms)
         target verifies all 4 in one pass (150ms)
         accepts 3.6 tokens on average
Total for ~3.6 tokens: 170ms
Per-token time: ~47ms vs ~150ms — 3.2× speedup
```
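
The 3.6 figure above multiplies K by the acceptance rate. A slightly more careful estimate is sketched below. It assumes each draft token is accepted independently with probability alpha and that verification stops at the first rejection; both the function names and the independence assumption are simplifications for illustration, since real acceptance varies by position and prompt.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target pass with per-token acceptance alpha.

    Counts accepted draft tokens plus the one token the target always
    contributes (a correction on rejection, or a bonus if all k are accepted).
    """
    if alpha >= 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def per_token_ms(alpha: float, k: int, target_ms: float, draft_ms_total: float) -> float:
    """Average latency per emitted token for one draft-then-verify step."""
    return (draft_ms_total + target_ms) / expected_tokens_per_step(alpha, k)


for k in (2, 4, 8):
    tokens = expected_tokens_per_step(0.9, k)
    # Assumes 5 ms per draft token and a 150 ms target pass, as in the example above.
    latency = per_token_ms(0.9, k, target_ms=150.0, draft_ms_total=5.0 * k)
    print(f"K={k}: {tokens:.2f} tokens/step, {latency:.1f} ms/token")
```

Sweeping K this way also shows the diminishing return: each additional draft token is only reached if all earlier ones were accepted, so its contribution shrinks geometrically while its draft cost does not.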

> **Key Insight:** Speculative decoding doesn't make a target model forward pass any cheaper. It increases the expected token yield per pass, so fewer passes are needed for the same output. The speedup comes from amortizing the target model's memory bandwidth cost across multiple tokens simultaneously.

### Choosing a Draft Model

The draft model needs to be:
1. Fast enough that generating K draft tokens is cheap relative to one target model pass
2. Accurate enough that acceptance rates are high

The typical approach is to use a smaller model from the same family as the target. If your target is Llama 3 70B, a good draft model might be Llama 3 8B. They share a vocabulary, similar training distribution, and style. The draft model is often 10-20× smaller than the target.

**Medusa** is an alternative that replaces the separate draft model with additional prediction heads attached directly to the target model. Each head predicts tokens at different offsets (token i+1, i+2, i+3, etc.) in a single forward pass. This eliminates the draft model entirely, trading some accuracy for a simpler deployment with no second model to serve. Medusa heads are fine-tuned on the same data used to fine-tune the target model.

**EAGLE** (Extrapolation Algorithm for Greater Language-model Efficiency) goes further by training the draft head to use the target model's intermediate hidden states as context, improving acceptance rates substantially. EAGLE-2 achieves acceptance lengths of 4-5 tokens on many tasks.

**Self-speculative decoding** uses early exit: the target model itself produces a draft by running only the first N layers (which is fast), then verifies by running all layers. This requires no separate model but works best when the task's difficulty is uneven — easy tokens can be predicted from shallow layers alone.

### Token Trees and Parallel Sampling

Rather than generating a single linear sequence of K draft tokens, some implementations generate a tree of possible continuations. At each draft step, the draft model can propose multiple candidate tokens. The result is a tree where each branch represents a different possible continuation.

The target model then evaluates all branches simultaneously using tree attention (a specialized attention mask that lets each candidate token attend only to the context and its own ancestors in the tree, so sibling branches never see each other). This further improves token yield per target model pass by exploring multiple possible paths in parallel.

```
Linear draft (K=4):
Context → [A] → [B] → [C] → [D]

Tree draft (branching factor 2, depth 2):
Context → [A1] → [C1]
                → [C2]
        → [A2] → [D1]
                → [D2]
```
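
One way to picture tree attention is as a mask in which each draft token may attend only to itself and its ancestors in the tree (plus the shared context, omitted here). A minimal sketch, assuming the tree is given as a hypothetical parent-index list; production kernels build and apply this mask differently.

```python
def tree_attention_mask(parents):
    """Build a boolean mask over draft-tree nodes.

    parents[i] is the index of node i's parent, or -1 if the node hangs
    directly off the context. mask[i][j] is True when node i may attend
    to node j, i.e. j is i itself or one of i's ancestors.
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:
            mask[i][j] = True
            j = parents[j]
    return mask


# The tree from the diagram: A1 -> (C1, C2), A2 -> (D1, D2)
names = ["A1", "A2", "C1", "C2", "D1", "D2"]
parents = [-1, -1, 0, 0, 1, 1]
for name, row in zip(names, tree_attention_mask(parents)):
    visible = [names[j] for j, allowed in enumerate(row) if allowed]
    print(f"{name} attends to {visible}")
```

Because every path from the context to a leaf sees only its own ancestors, a single forward pass scores each branch as if it were the only continuation.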

Tree speculative decoding with an acceptance-rate-optimized drafting strategy can achieve average accepted lengths of 6-8 tokens per target pass on well-matched workloads.

### When Speculative Decoding Helps (and When It Doesn't)

Speculative decoding provides the most benefit when:
- The workload is dominated by decode time (high output-to-input token ratio)
- The target model is large (memory bandwidth bound)
- The draft model achieves high acceptance rates for the specific domain

It provides less benefit when:
- Batch sizes are large (the target model is already efficient per token at large batches)
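- The task has high output entropy (poetry, creative writing, high-temperature sampling), where the draft model often guesses wrong
- The target and draft models have different tokenizers or vocabularies (verification compares token probabilities directly, so a mismatch generally rules out the standard algorithm rather than merely reducing its benefit)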

The relationship between batch size and speculative decoding benefit is particularly important. At batch size 1, the target model is severely underutilized. Speculative decoding's ability to run multiple token verifications in one pass provides a large relative improvement. At batch size 256, the target model is already reasonably well-utilized, and speculative decoding's benefit shrinks — in some cases, the overhead of running the draft model plus tree attention can make things worse.

> **Warning:** Don't deploy speculative decoding without benchmarking at your actual batch size distribution. The speedup from published benchmarks is often measured at batch size 1. If you run mixed batch sizes in production, the effective speedup will be lower. For high-throughput batch deployments, speculative decoding may not be beneficial at all.

### Implementation in Production Frameworks

vLLM supports speculative decoding natively:

```bash
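# Flag names vary by vLLM version (newer releases use --speculative-config); check --help for yours.
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Meta-Llama-3-70B-Instruct \
  --speculative-model meta-llama/Meta-Llama-3-8B-Instruct \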
  --num-speculative-tokens 5 \
  --tensor-parallel-size 4
```

The `--num-speculative-tokens` parameter sets K. Higher K increases expected token yield when acceptance rates are high, but increases the cost when acceptance rates are low (rejected tokens were computed but not used). For most workloads, K=4 or K=5 is a reasonable starting point.

Monitoring acceptance rate is essential after deployment. vLLM exposes this metric:

```python
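# Check speculative acceptance rate from the vLLM metrics endpoint
import requests

metrics = requests.get("http://localhost:8000/metrics", timeout=5).text

# Print every speculative-decoding series, e.g.
# vllm:spec_decode_draft_acceptance_rate (exact names vary by vLLM version)
for line in metrics.splitlines():
    if line.startswith("vllm:spec_decode"):
        print(line)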
```

If acceptance rate drops below ~70%, the speedup from speculative decoding may not justify the overhead. Revisit the draft model choice or the K value.
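
A threshold check on the scraped text might look like the sketch below. It parses the Prometheus text exposition format by hand and assumes the gauge name shown above. The helper names and the 0.70 cutoff are illustrative, and metric names differ between vLLM versions, so adjust to whatever your `/metrics` output actually contains.

```python
def read_gauge(metrics_text: str, name: str):
    """Return the first sample value for `name` from Prometheus text, or None.

    Assumes samples have no trailing timestamp, so the value is the last field.
    """
    for line in metrics_text.splitlines():
        if line.startswith("#") or not line.strip():
            continue  # skip HELP/TYPE comments and blank lines
        series, _, value = line.rpartition(" ")
        # A series is either `name` or `name{label="..."}`
        if series == name or series.startswith(name + "{"):
            return float(value)
    return None


def acceptance_is_healthy(metrics_text: str, threshold: float = 0.70) -> bool:
    """True when the acceptance-rate gauge is present and at or above threshold."""
    rate = read_gauge(metrics_text, "vllm:spec_decode_draft_acceptance_rate")
    return rate is not None and rate >= threshold


sample = '''# TYPE vllm:spec_decode_draft_acceptance_rate gauge
vllm:spec_decode_draft_acceptance_rate{model_name="target"} 0.64
'''
print(acceptance_is_healthy(sample))  # False
```

Treating a missing metric as unhealthy is a deliberate choice here: if the series disappears after an upgrade, the check fails loudly instead of passing silently.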

**Key Takeaways:**
1. Speculative decoding generates multiple tokens per target model forward pass, amortizing the memory bandwidth cost that dominates decode latency.
2. Speedup depends on acceptance rate — draft and target model must agree frequently for the optimization to pay off.
3. Batch size inversely affects speculative decoding benefit; it's most valuable at small batch sizes where the target model is least utilized.
4. Tree drafting and Medusa/EAGLE variants extend the basic algorithm with higher expected token yields.
5. Always monitor acceptance rate in production. A falling acceptance rate erases the benefit and can slow things down.

> **Try This:** If you have a production endpoint running a large model, add a speculative decoding configuration with the same model family's smallest variant as the draft model. Benchmark at your median concurrent request count. Compare P50 and P99 ITL with and without speculative decoding. If you see improvement at median load, check whether it degrades under peak load when batch sizes grow. The crossover point where speculative decoding stops helping tells you your effective batch size ceiling.

---

## Chapter 8: Monitoring Inference Performance

All the optimization strategies in this guide have one thing in common: they only work if you know they're working. Monitoring inference performance isn't a secondary concern or something you add later. It's the mechanism by which you make informed decisions about every configuration choice, every hardware decision, and every code change.

The goal isn't dashboards for their own sake. It's signal: specific metrics that tell you whether the system is behaving the way you expect, and specific indicators that tell you where to look when it isn't.

### The Core Metrics

**Time-to-first-token (TTFT)**: The duration from when a request is received to when the first output token is generated. This is prefill time plus any queue wait. It's what the user experiences as "response latency."

**Inter-token latency (ITL)**: The time between successive output tokens during generation. Also called time-per-output-token (TPOT). This is what users experience as "generation speed" — how fast tokens stream in.

**Request throughput**: Requests completed per second. This is the system-level efficiency metric.

**Token throughput**: Output tokens generated per second. More granular than request throughput because it accounts for variable output lengths.

**KV cache utilization**: What fraction of allocated KV cache blocks are currently in use. High utilization means you're close to your capacity ceiling. Near 100% means you're evicting aggressively.

**Cache hit rate**: For prefix caching, the fraction of input tokens served from cache rather than recomputed. Low hit rate on a workload with long system prompts indicates a structural problem with your prompt design or cache configuration.

**Queue depth**: How many requests are waiting to be processed. Rising queue depth indicates the system is running behind. Combined with TTFT, it tells you whether latency is driven by compute time or wait time.
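
The latency metrics above can be derived from per-token timestamps. A minimal sketch, assuming you record a request arrival time and a wall-clock time for each streamed token; the function and field names are illustrative.

```python
from statistics import mean


def latency_metrics(arrival_s: float, token_times_s: list[float]) -> dict:
    """Compute TTFT, mean ITL, and token throughput for one request.

    arrival_s: when the request was received.
    token_times_s: when each output token was emitted, in order.
    """
    if not token_times_s:
        raise ValueError("request produced no tokens")
    ttft = token_times_s[0] - arrival_s  # queue wait + prefill
    gaps = [b - a for a, b in zip(token_times_s, token_times_s[1:])]
    itl = mean(gaps) if gaps else 0.0  # no gaps for single-token outputs; report 0
    duration = token_times_s[-1] - arrival_s
    return {
        "ttft_s": ttft,
        "itl_s": itl,
        "tokens_per_s": len(token_times_s) / duration if duration > 0 else 0.0,
    }


m = latency_metrics(arrival_s=0.0, token_times_s=[0.5, 0.75, 1.0, 1.25])
print(m)
```

Computing these client-side, at the gateway, is a useful cross-check on server-reported histograms because it includes network and serialization time the server never sees.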

### Instrumentation

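vLLM exposes a Prometheus endpoint at `/metrics`. The key metrics to scrape are listed below; names have changed across vLLM releases, so confirm them against your own deployment's `/metrics` output: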

```
# Latency histograms
vllm:e2e_request_latency_seconds
vllm:time_to_first_token_seconds
vllm:time_per_output_token_seconds
vllm:request_queue_time_seconds

# Throughput
vllm:prompt_tokens_total
vllm:generation_tokens_total
vllm:request_success_total

# KV Cache
vllm:gpu_cache_usage_perc
vllm:gpu_prefix_cache_hit_rate

# Speculative decoding (if enabled)
vllm:spec_decode_draft_acceptance_rate
vllm:spec_decode_efficiency
```

A minimal Prometheus scrape config:

```yaml
scrape_configs:
  - job_name: 'vllm'
    static_configs:
      - targets: ['localhost:8000']
    metrics_path: '/metrics'
    scrape_interval: 15s
```

Pair this with Grafana dashboards. The vLLM project maintains a Grafana dashboard template that covers the core metrics out of the box.

### Distributed Tracing

For production systems handling multiple request types, distributed tracing provides request-level detail that aggregate metrics miss. Trace a request through the full pipeline: API receipt, queue wait, prefill execution, KV cache lookup, decode execution, response serialization.

OpenTelemetry is the standard instrumentation framework. vLLM has nascent OpenTelemetry support; most production teams add it at the gateway layer:

```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
# insecure=True because a local collector on port 4317 typically has no TLS
processor = BatchSpanProcessor(OTLPSpanExporter(endpoint="localhost:4317", insecure=True))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

tracer = trace.get_tracer(__name__)

# `model` and the request fields below stand in for your own serving client and request type.
async def handle_inference_request(request):
    with tracer.start_as_current_span("inference_request") as span:
        span.set_attribute("model", request.model)
        span.set_attribute("prompt_tokens", len(request.tokens))

        with tracer.start_as_current_span("prefill"):
            first_token = await model.prefill(request)

        span.set_attribute("ttft_ms", first_token.ttft_ms)

        tokens = []
        with tracer.start_as_current_span("decode"):
            async for token in model.decode(first_token):
                tokens.append(token)

        span.set_attribute("output_tokens", len(tokens))
        span.set_attribute("cache_hit_rate", request.cache_stats.hit_rate)

        return tokens
```

### Alerting Thresholds

Metrics only matter if someone responds when they go wrong. Tie each alert to a specific failure mode and a specific response, and tune it so that it neither fires constantly nor stays silent through real degradation.

**TTFT P95 > 2× baseline**: Indicates either increased prompt length, reduced cache hit rate, or GPU pressure. Investigate before it becomes P50.

**ITL P95 > 3× baseline**: Decode is degrading. Likely causes: batch size growing (more sequences sharing each step), KV cache pressure increasing (preempted sequences have to be recomputed or swapped back in), or hardware contention.

**KV cache utilization > 90%**: Approaching memory pressure. Above 95%, eviction rates spike and latency follows. Add capacity or reduce batch size limit.

**Prefix cache hit rate < target (e.g., < 80%)**: For workloads designed for high cache hit rates, this indicates a prompt structure change, a deployment issue, or a routing change. Investigate immediately.

**Queue depth > 50**: Requests accumulating faster than they're being processed. The system is under load. Autoscale if applicable, or investigate whether a runaway request is blocking the queue.

> **Key Insight:** The most useful alert threshold is rate of change, not absolute value. A P95 TTFT that's been stable at 800ms for a month is a baseline. P95 TTFT rising from 800ms to 1,200ms over 4 hours is a signal. Build derivative-based alerts alongside threshold-based ones.
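
A rate-of-change check can be as simple as comparing a recent window against a longer baseline. A sketch, assuming you already have P95 TTFT samples in chronological order; the window size and the 1.25× ratio are illustrative choices, not recommendations.

```python
from statistics import median


def drift_alert(samples_ms, recent_n=4, max_ratio=1.25):
    """Flag when the recent median exceeds the baseline median by max_ratio.

    samples_ms: chronological P95 TTFT readings (for example, one per hour).
    The last `recent_n` readings form the recent window; everything before
    them is the baseline.
    """
    if len(samples_ms) <= recent_n:
        return False  # not enough history to form a baseline
    baseline = median(samples_ms[:-recent_n])
    recent = median(samples_ms[-recent_n:])
    return recent > baseline * max_ratio


stable = [800, 810, 790, 805, 800, 795, 810, 800]
rising = [800, 810, 790, 805, 900, 1000, 1100, 1200]
print(drift_alert(stable), drift_alert(rising))  # False True
```

Medians are used on both sides so that a single slow scrape does not trip the alert. In a Prometheus setup the same comparison would normally live in an alerting rule rather than application code.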

### Profiling Under Load

When metrics indicate a problem, the next step is profiling. For GPU inference, the two primary profiling tools are NVIDIA Nsight Systems (nsys) and PyTorch Profiler.

PyTorch Profiler traces individual operations and memory allocations:

```python
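import torch
from torch.profiler import profile, record_function, ProfilerActivity

# Assumes `model` (a causal LM already on the GPU) and `input_ids` are defined.
with torch.inference_mode(), profile(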
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    record_shapes=True,
    profile_memory=True,
    with_stack=True
) as prof:
    with record_function("model_inference"):
        output = model.generate(input_ids, max_new_tokens=100)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=20))
prof.export_chrome_trace("inference_trace.json")
```

The Chrome trace format can be viewed at `chrome://tracing` or in Perfetto. It shows the timeline of every CUDA kernel, when the GPU was active versus waiting, and where time is concentrated.

For KV cache-specific issues, look for:
- High time in memory copy operations (suggests cache blocks being moved)
- CPU-GPU synchronization points (suggests cache eviction decisions happening on CPU)
- Attention kernels that are slower than expected for the sequence length

**nvtx markers** in vLLM's source code label key regions — prefill, decode, cache operations — making them visible in Nsight profiles. These are the breadcrumbs that turn raw profiling data into interpretable timelines.

### Capacity Planning From Metrics

Historical metrics are the foundation of capacity planning. With 90 days of request traces, you can answer:
- What's the P99 prompt length? This sets your maximum cache block requirement.
- What's the peak concurrent request rate? This determines required server count.
- What's the average output length? This determines decode load per request.
- What's the cache hit rate at peak load? This determines whether cache pressure scales with load.
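
Most of these questions reduce to percentile and mean computations over trace records. A minimal sketch using a nearest-rank percentile; the record field names (`prompt_tokens`, `output_tokens`) are assumptions about how your traces are stored.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of data at or below it."""
    if not values:
        raise ValueError("no data")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(traces):
    """traces: list of dicts with 'prompt_tokens' and 'output_tokens' (assumed field names)."""
    prompts = [t["prompt_tokens"] for t in traces]
    outputs = [t["output_tokens"] for t in traces]
    return {
        "p50_prompt_tokens": percentile(prompts, 50),
        "p99_prompt_tokens": percentile(prompts, 99),
        "avg_output_tokens": sum(outputs) / len(outputs),
    }


# Synthetic traces: prompt lengths 100, 200, ..., 10000 and a fixed output length.
traces = [{"prompt_tokens": 100 * i, "output_tokens": 200} for i in range(1, 101)]
print(summarize(traces))
```

Keeping the median and the P99 side by side makes the gap between typical and worst-case requests visible, which is the number that drives cache sizing.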

A simple capacity model:

```python
import math


def estimate_required_gpus(
    requests_per_second,
    avg_prompt_tokens,
    avg_output_tokens,
    cache_hit_rate,
    model="llama3-70b"  # label only; the timing constants below are not derived from it
):
    # Effective prefill tokens (accounting for cache hits)
    prefill_tokens_per_req = avg_prompt_tokens * (1 - cache_hit_rate)
    decode_tokens_per_req = avg_output_tokens

    # Time per request on single A100 (rough estimates)
    prefill_ms_per_1k_tokens = 50   # varies with model
    decode_ms_per_token = 150       # varies with model and batch

    prefill_ms = (prefill_tokens_per_req / 1000) * prefill_ms_per_1k_tokens
    decode_ms = decode_tokens_per_req * decode_ms_per_token
    total_ms_per_request = prefill_ms + decode_ms

    # Requests per second a single GPU can handle (single stream)
    rps_per_gpu = 1000 / total_ms_per_request

    # Add 20% headroom for bursts and variance, rounding up to whole GPUs
    required_gpus = (requests_per_second / rps_per_gpu) * 1.2
    return max(1, math.ceil(required_gpus))
```

This is a back-of-envelope estimate, not a precise model — real throughput depends on batching efficiency, which depends on sequence length distribution. But it gives you an order-of-magnitude starting point for sizing discussions.

For example: 10 requests/second, 2,000-token average prompt, 300-token average output, 70% cache hit rate:

```python
prefill_tokens = 2000 * (1 - 0.70)  # 600 tokens effective
prefill_ms = (600 / 1000) * 50      # 30ms
decode_ms  = 300 * 150              # 45,000ms — clearly the dominant term
# → total: ~45 seconds single-stream per request
# → rps_per_gpu: 1000 / 45,030 ≈ 0.0222
# → required_gpus (10 rps): 10 / 0.0222 * 1.2 ≈ 540
```

The numbers expose how aggressively you need continuous batching. That 150ms per decode token is a serial assumption — at batch size 64, per-token effective cost drops to 5-10ms, which changes the GPU count by an order of magnitude. The model's value is identifying which inputs move the needle: cache hit rate, output length, and batch efficiency are the three that matter most.
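
To see how much batching moves the estimate, the sketch below reruns the same arithmetic with an effective per-token decode cost instead of the serial 150ms. It is a standalone simplification of the capacity model above: the function name is hypothetical, and the 5ms and 10ms inputs are the rough batch-64 range quoted in the text, not measurements.

```python
import math


def gpus_needed(rps, prefill_ms, output_tokens, decode_ms_per_token, headroom=1.2):
    """GPU count when each request costs prefill_ms + output_tokens * decode_ms_per_token."""
    ms_per_request = prefill_ms + output_tokens * decode_ms_per_token
    rps_per_gpu = 1000.0 / ms_per_request
    return max(1, math.ceil(rps / rps_per_gpu * headroom))


scenarios = [
    ("serial", 150.0),
    ("batched, pessimistic", 10.0),
    ("batched, optimistic", 5.0),
]
for label, decode_ms in scenarios:
    # 10 rps, 30 ms effective prefill, 300 output tokens, as in the worked example
    print(f"{label}: {gpus_needed(10, 30.0, 300, decode_ms)} GPUs")
```

The spread between the serial and batched rows is the point: the estimate is far more sensitive to batch efficiency than to any other input, so that is the number worth measuring first.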

> **Warning:** Capacity planning models built on average metrics will underestimate requirements. P95 and P99 values are the right inputs for sizing. Average request rates don't capture burst patterns; average prompt lengths don't capture the long-tail requests that saturate KV cache. Design for the tail, not the mean.

### Building a Metrics Culture

The most common operational failure mode for inference deployments isn't technical — it's organizational. Systems are deployed, they work well enough initially, and then load grows, models get updated, prompt structures change, and nobody notices that TTFT has drifted from 400ms to 1,200ms over six months because nobody is watching.

Establish baselines during initial deployment. Document expected behavior: "TTFT P95 < 800ms at 50 concurrent requests with 2K average prompt length." When metrics deviate from baseline, investigate. Make metric review part of your regular operational rhythm, not just an on-call escalation trigger.

The metrics aren't the product — they're the signal that the product is working. Treat them accordingly.

**Key Takeaways:**
1. The four primary inference metrics are TTFT, ITL, KV cache utilization, and prefix cache hit rate. Monitor all four from day one.
2. Distributed tracing reveals per-request behavior that aggregate metrics hide — essential for diagnosing specific latency patterns.
3. Alert on rate of change, not just absolute thresholds — a sudden drift is more meaningful than a stable high value.
4. PyTorch Profiler and Nsight are the tools for GPU-level investigation when metrics indicate a problem.
5. Capacity planning should use P95/P99 inputs, not averages. Design for the tail.

> **Try This:** Set up a minimal monitoring stack: vLLM with Prometheus scraping enabled, a Grafana instance with the vLLM dashboard template, and three alerts: TTFT P95 rising 50% above a 7-day baseline, KV cache utilization exceeding 90%, and prefix cache hit rate dropping below your established baseline. Run this for two weeks on a non-critical workload. The alerts will fire for reasons you didn't anticipate. Investigating each one builds intuition that no amount of reading will replicate.

---

## Conclusion

The concepts in this guide form a chain. Attention's quadratic scaling makes the KV cache necessary. The KV cache's memory cost makes quantization valuable. Quantization's effect on cache size changes the batching calculus. Batching strategy affects how prefill and decode compete for GPU resources. Speculative decoding addresses decode latency specifically. And monitoring is the feedback loop that makes any of this actionable.

None of these optimizations exist in isolation, and the right configuration for your system isn't the one that maximizes any single metric. It's the one that hits your latency targets at the lowest cost, given your actual workload distribution.

The workload distribution is the part most engineers underinvest in understanding. Before tuning KV cache block size, batching parameters, or speculative decoding acceptance thresholds, know your prompt length distribution, your output length distribution, your concurrency patterns, and your cache hit baseline. Most of the optimization decisions in this guide become straightforward once the workload is characterized. Most of the wrong decisions trace back to optimizing for an assumed workload that doesn't match reality.

The other underinvestment is in measurement. The tools are available and not particularly expensive to operate. Prometheus and Grafana are free. PyTorch Profiler is built into the framework. vLLM exposes metrics endpoints by default. What's missing is usually the habit of looking.

LLM inference is maturing quickly. The techniques described here — PagedAttention, continuous batching, speculative decoding, prefix caching — were all introduced between 2022 and 2024. The hardware is evolving in parallel: H100 and H200 GPUs with native fp8 support, NVLink interconnects with higher bandwidth for tensor parallelism, and specialized inference chips from multiple vendors. The optimization landscape will continue shifting.

The underlying principles won't change. Memory bandwidth limits decode speed. Quadratic attention scaling drives the need for caching. Fixed hardware means utilization determines cost-efficiency. Those constraints are structural. The optimizations that work against them today will continue working, in some form, as the hardware and frameworks evolve.

Start with the basics: measure what you have, understand what's bottlenecking, apply the most appropriate optimization. The sophisticated techniques are worth understanding, but the basics close the most gaps. A 90% prefix cache hit rate on a well-structured prompt will often outperform speculative decoding with a poorly-configured draft model.

Know the fundamentals. Measure everything. Optimize deliberately.

---

## Appendix A: Glossary

**Attention**: The mechanism by which transformer models weigh the relevance of different tokens to each other. Each token queries all others and receives a weighted mixture of their information. Scales quadratically with sequence length.

**Batch size**: The number of sequences processed simultaneously in a single forward pass. Larger batches improve GPU utilization but increase per-request latency if requests must wait for the batch to fill.

**Block (KV cache)**: A fixed-size allocation unit for KV cache in PagedAttention-based systems. Typically 16-32 tokens per block. Allows non-contiguous cache allocation and cross-sequence sharing.

**BF16 (Brain Float 16)**: A 16-bit floating-point format with the same exponent range as float32 but reduced mantissa precision. Standard for training and inference on modern hardware.

**Cache eviction**: The removal of KV cache entries when memory capacity is exhausted. Evicted entries must be recomputed on next access.

**Cache hit rate**: The fraction of input tokens that are served from the KV cache rather than recomputed. High hit rates indicate efficient cache utilization.

**Chunked prefill**: A technique that splits long prompt prefill operations into smaller chunks, interleaved with decode steps, to prevent long prefills from stalling active decode sequences.

**Continuous batching**: Also called iteration-level scheduling. A scheduling strategy that allows sequences to join and leave the batch at each decode step, rather than at fixed batch boundaries. Enables near-constant GPU utilization.

**Decode phase**: The token-by-token generation phase that follows prefill. Memory bandwidth-bound. Generates one new token per forward pass by attending to cached KV tensors.

**Draft model**: In speculative decoding, the small fast model that generates candidate tokens for verification by the target model.

**EAGLE**: A speculative decoding variant that trains draft heads using the target model's hidden states for improved acceptance rates.

**FP8**: An 8-bit floating-point format natively supported on H100/H200 GPUs. Used for KV cache quantization and weight storage to reduce memory footprint while preserving most of the numerical range of fp16.

**GQA (Grouped-Query Attention)**: An attention variant where groups of query heads share key-value head pairs. Reduces KV cache memory relative to multi-head attention. Used in Llama 3, Mistral, and most modern open-weights models.

**GPTQ**: A post-training quantization method for weight quantization to int4 or int8. Calibrated using a small dataset to minimize quantization error.

**Int8**: 8-bit integer representation. When used for weights or KV cache, halves the memory footprint relative to fp16.

**Inter-token latency (ITL)**: The time between successive output tokens during generation. Also called time-per-output-token (TPOT). Directly perceived as generation speed by users.

**KV cache**: Stored key and value tensors from attention computation, enabling reuse of previously computed attention state during token generation.

**Medusa**: A speculative decoding variant that adds multiple prediction heads to the target model itself, eliminating the need for a separate draft model.

**MHA (Multi-Head Attention)**: The standard attention mechanism using separate K and V projections for each attention head.

**MQA (Multi-Query Attention)**: An attention variant where all query heads share a single set of K and V heads. Maximally reduces KV cache size.

**PagedAttention**: A KV cache management algorithm that allocates cache in fixed non-contiguous blocks, analogous to virtual memory paging. Enables near-zero memory waste and cross-request prefix sharing.

**Prefill phase**: The initial parallel processing of all input tokens. Compute-bound. Produces KV cache entries for the input and generates the first output token's logits.

**Prefix caching**: Caching KV tensors for shared prompt prefixes across requests. Requests sharing the same prefix only compute KV for new tokens, not the shared prefix.

**Quantization**: Reducing the numerical precision of weights or activations to lower-bit formats to reduce memory and improve throughput. Types include weight quantization (GPTQ, AWQ), activation quantization, and KV cache quantization.

**Rejection sampling (speculative decoding)**: The process by which the target model accepts or rejects draft tokens. Accepted tokens are appended; at the first rejection, the target model's own prediction is used and speculative sampling restarts.

**SGLang**: A serving framework for LLMs with a focus on structured generation and high-performance batch processing. Features RadixAttention for automatic prefix sharing.

**Speculative decoding**: An inference optimization that generates multiple draft tokens using a small model, then verifies them in a single target model forward pass, increasing token yield per pass.

**SWA (Sliding Window Attention)**: An attention variant where each token attends only to a local window of previous tokens. O(N × window_size) complexity instead of O(N²).

**Target model**: In speculative decoding, the large accurate model that verifies draft tokens.

**Tensor parallelism**: Distributing model weight tensors across multiple GPUs so that each GPU computes a portion of each forward pass. Standard for models too large to fit on a single GPU.

**Time-to-first-token (TTFT)**: The duration from request receipt to generation of the first output token. Primarily determined by prefill time.

**TensorRT-LLM**: NVIDIA's optimized inference library for LLMs. Provides pre-built kernels for attention, quantization, and speculative decoding on NVIDIA hardware.

**vLLM**: An open-source LLM serving framework featuring PagedAttention, continuous batching, and speculative decoding. The most widely deployed open-source inference server for production use.

**AWQ (Activation-aware Weight Quantization)**: A weight quantization method that protects important weights identified by activation analysis, achieving better quality than GPTQ at similar compression levels.

---

## Appendix B: Tools & Resources

### Inference Serving Frameworks

**vLLM** — `https://github.com/vllm-project/vllm`
The production standard for open-source LLM serving. Implements PagedAttention, continuous batching, speculative decoding, fp8 KV cache, and multi-LoRA serving. OpenAI-compatible API. Prometheus metrics endpoint.

**SGLang** — `https://github.com/sgl-project/sglang`
High-performance serving with RadixAttention for automatic prefix sharing, structured generation (JSON, regex), and batch inference. Often faster than vLLM for structured output workloads.

**TensorRT-LLM** — `https://github.com/NVIDIA/TensorRT-LLM`
NVIDIA's production inference library. Highly optimized for NVIDIA hardware. More complex deployment model but maximum performance on A100/H100.

**llama.cpp** — `https://github.com/ggerganov/llama.cpp`
CPU and mixed CPU/GPU inference. Excellent for edge deployment, development environments, and architectures where full GPU clusters aren't available. GGUF quantization format.

**Ollama** — `https://github.com/ollama/ollama`
Developer-friendly wrapper around llama.cpp. Not suitable for production high-throughput serving but useful for local development and prototyping.

### Quantization Tools

**AutoAWQ** — `https://github.com/casper-hansen/AutoAWQ`
Activation-aware weight quantization (AWQ). Produces int4 models with better quality retention than GPTQ at equivalent compression.

**AutoGPTQ** — `https://github.com/PanQiWei/AutoGPTQ`
GPTQ quantization implementation for HuggingFace models. Well-supported, wide model coverage.

**HuggingFace Transformers Quantization** — `https://huggingface.co/docs/transformers/quantization`
Native quantization support in Transformers via bitsandbytes (int8, int4), GPTQ, and AWQ. Useful for evaluating quantized models before committing to a specific format.

### Benchmarking and Profiling

**vLLM Benchmarks** — `vllm/benchmarks/benchmark_throughput.py` and `benchmark_latency.py`
Built-in benchmark scripts for throughput and latency. Support ShareGPT and custom datasets.

**lm-evaluation-harness** — `https://github.com/EleutherAI/lm-evaluation-harness`
Standard benchmark suite for model quality evaluation. Use for baseline quality measurements before and after quantization.

**NVIDIA Nsight Systems** — `https://developer.nvidia.com/nsight-systems`
GPU profiling and tracing. Essential for understanding GPU kernel timing, memory transfer patterns, and identifying bottlenecks in custom inference code.

**PyTorch Profiler** — `torch.profiler`
Built into PyTorch. Provides CUDA kernel timing and memory tracking with Chrome trace output.

### Monitoring

**Prometheus + Grafana** — `https://prometheus.io`, `https://grafana.com`
Standard metrics stack. vLLM, SGLang, and TensorRT-LLM all expose Prometheus endpoints.

**vLLM Grafana Dashboard** — Available in the vLLM repository under `examples/`
Pre-built dashboard covering TTFT, ITL, cache utilization, throughput, and speculative decoding metrics.

### Hosted API Documentation

**Anthropic Prompt Caching** — `https://docs.anthropic.com/en/docs/build-with-claude/prompt-caching`
Full documentation for Claude's `cache_control` API, pricing, TTL behavior, and best practices.

**OpenAI Prompt Caching** — `https://platform.openai.com/docs/guides/prompt-caching`
Automatic caching documentation. Covers eligibility, pricing, and cache hit detection.

**Google Gemini Context Caching** — `https://ai.google.dev/gemini-api/docs/caching`
Explicit cache management API with named caches and configurable TTL.

### Hardware References

- **NVIDIA A100 Datasheet** — Memory bandwidth 2.0 TB/s (HBM2e), 80 GB variant
- **NVIDIA H100 Datasheet** — Memory bandwidth 3.35 TB/s (HBM3), 80 GB variant, native fp8 support
- **NVIDIA H200 Datasheet** — Memory bandwidth 4.8 TB/s (HBM3e), 141 GB variant

Memory bandwidth is the critical specification for decode-phase performance, because each decode step has to stream the model weights and the KV cache from GPU memory. When comparing hardware for inference deployment, give memory bandwidth more weight than raw FLOPS for large-model, long-context workloads.
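
To make the comparison concrete, a rough roofline-style estimate treats each decode step as one full read of the weights plus the KV cache, so the token rate for a single stream is bounded by bandwidth divided by bytes read per step. The sketch below applies that bound to the three datasheet bandwidths listed above. The 16 GB of weights (roughly an 8B-parameter model at fp16) and the 4 GB of KV cache are illustrative assumptions, and measured throughput will sit below this ceiling.

```python
# Bandwidth-bound ceiling on single-stream decode speed.
# Assumption: every decode step reads all weights and the whole KV cache once.

# Peak memory bandwidth in TB/s, taken from the datasheet figures listed above.
BANDWIDTH_TB_S = {
    "A100 80GB": 2.0,
    "H100 80GB": 3.35,
    "H200 141GB": 4.8,
}


def decode_tokens_per_second(bandwidth_tb_s: float, weight_gb: float, kv_cache_gb: float) -> float:
    """Return the upper bound on tokens/s for one decode stream."""
    bytes_per_step = (weight_gb + kv_cache_gb) * 1e9
    return bandwidth_tb_s * 1e12 / bytes_per_step


if __name__ == "__main__":
    # Hypothetical workload: 16 GB of fp16 weights and 4 GB of KV cache.
    for gpu, bandwidth in BANDWIDTH_TB_S.items():
        bound = decode_tokens_per_second(bandwidth, weight_gb=16, kv_cache_gb=4)
        print(f"{gpu}: at most {bound:.1f} tokens/s per stream")
```

With these assumed sizes the ceilings are about 100, 167.5, and 240 tokens/s, which tracks the bandwidth ratio between the three GPUs rather than their FLOPS ratio. Batching raises aggregate throughput because the weight read is shared across sequences, while each sequence still adds its own KV cache traffic.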

---

## Appendix C: Further Reading

### Foundational Papers

**"Attention Is All You Need"** — Vaswani et al., 2017
The transformer paper. The scaled dot-product attention mechanism described here is still the basis for modern LLMs. The paper is accessible and concise — worth reading in full even if you know the concepts.

**"Efficient Large Scale Language Modeling with Mixtures of Experts"** — Artetxe et al., Meta AI, 2021
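A large-scale empirical comparison of mixture-of-experts and dense language models. Useful background because MoE models behave differently from dense transformers at inference time: only a subset of the expert weights is active for each token, which changes the memory and batching trade-offs.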

**"Fast Transformer Decoding: One Write-Head is All You Need"** — Shazeer, 2019
The original multi-query attention paper. Short. Establishes the theoretical basis for the KV cache memory reductions that MQA and GQA exploit.

**"GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints"** — Ainslie et al., Google, 2023
The GQA paper. Explains why grouped-query attention achieves quality close to MHA at dramatically lower KV memory cost. Directly relevant to understanding modern model architectures.

**"Efficient Memory Management for Large Language Model Serving with PagedAttention"** — Kwon et al., UC Berkeley, 2023
The vLLM paper. Introduces PagedAttention and the block-based KV cache management that is now standard in production serving.

**"How Continuous Batching Enables 23x Throughput in LLM Inference While Reducing p50 Latency"** — Anyscale engineering blog, 2023
Practical explanation of continuous batching and its throughput implications. More accessible than academic papers.

**"Fast Inference from Transformers via Speculative Decoding"** — Leviathan et al., Google, 2023
The formal treatment of speculative decoding with acceptance rate analysis. The math is tractable and worth understanding.

**"EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty"** — Li et al., 2024
The original EAGLE paper (EAGLE-2 is a separate follow-up). Demonstrates substantially higher acceptance rates with a lightweight draft head that works from the target model's own hidden-state features instead of a standalone draft model.

**"Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads"** — Cai et al., 2024
The Medusa paper. Self-contained speculative decoding without a separate draft model.

**"SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models"** — Xiao et al., MIT/NVIDIA, 2022
The key paper on handling activation outliers in quantization. Background for understanding why int8 activation quantization is harder than int8 weight quantization.

**"AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration"** — Lin et al., MIT, 2023
The AWQ method. Explains the insight behind protecting salient weights, identified from activation statistics, and why it outperforms plain round-to-nearest quantization at equivalent bit widths.

### Engineering References and Blogs

**vLLM Blog** — `https://blog.vllm.ai`
Engineering posts from the vLLM team. Covers performance improvements, new features, and real-world benchmarking as they're released.

**Hugging Face Blog** — `https://huggingface.co/blog`
Consistently good technical posts on quantization, fine-tuning, and deployment. The posts on bitsandbytes, GPTQ, and FlashAttention are particularly useful.

**NVIDIA Technical Blog** — `https://developer.nvidia.com/blog`
GPU architecture explanations and inference optimization techniques from the hardware side. The H100 inference architecture post and the TensorRT-LLM performance posts are worth reading.

**Lilian Weng's Blog (Lil'Log)** — `https://lilianweng.github.io`
Deep technical posts on LLM architecture, attention variants, and efficiency techniques. The post "Large Transformer Model Inference Optimization" is comprehensive reference material for the topics in this guide.

**Tim Dettmers' Blog** — `https://timdettmers.com`
A widely cited practical resource on GPU selection and quantization for LLMs. "Which GPU(s) to Get for Deep Learning" and the bitsandbytes quantization posts are essential reading for hardware and quantization decisions.

### Community Resources

**r/LocalLLaMA** — Performance reports, benchmarks, and practical experience with specific models and configurations from practitioners running self-hosted deployments.

**Hugging Face Forums** — Inference and deployment discussion with responses from engineers at Hugging Face and the broader community.

**vLLM GitHub Discussions** — Production deployment questions and issues. The issue tracker and discussions contain a wealth of practical experience with specific workloads and configurations.

---

*KV Cache and Inference Optimization: The Infrastructure Layer That Determines Your Real LLM Costs*
*Version 1.0 — April 2026*
*By David Kelly Price*

*Published by Pyckle*


---

## Related Blog Posts

- [The Hidden Tax on Every Inference](https://pyckle.co/blog/the-hidden-tax-on-every-inference-why-kv-cache-compression-is-having-a-moment.html)
- [1 Million Tokens on a Budget](https://pyckle.co/blog/1-million-tokens-on-a-budget-gpu-changes-nothing-and-everything.html)
- [The Token Tax Is Real](https://pyckle.co/blog/the-token-tax-is-realand-developers-are-finally-doing-something-about-it.html)
