---
title: "Token Economics"
subtitle: "Cutting Your LLM Bill Without Cutting Quality"
author: "David Kelly Price"
version: "1.0"
date: 2026-04-20
status: draft
type: ebook
target_audience: "Engineering managers, senior engineers, and platform engineers responsible for LLM-powered systems and their costs — making decisions about inference spend"
estimated_pages: 70
chapters:
  - "Where the Money Actually Goes"
  - "The Token Tax: Input vs. Output Costs"
  - "Context Compression Strategies"
  - "Prompt Caching and When to Use It"
  - "Model Selection and Cost-Quality Tradeoffs"
  - "Batching, Routing, and Request Shaping"
  - "Measuring What You're Spending"
  - "Building a Token Budget System"
tags:
  - pyckle
  - ebook
  - token-cost
  - llm-cost
  - optimization
  - inference
  - engineering-management
  - draft
---

<!-- DESIGN & LAYOUT NOTES

Target formats:
- Primary: Markdown (source of truth)
- Export: PDF via Pandoc, web page
- Print-ready: Letter size, 1" margins

Typography:
- Headers: Sans-serif (brand-consistent)
- Body: Serif or clean sans-serif for readability
- Code: Monospace, syntax highlighted, line-numbered where helpful

Callout box types:
- **Try This** — Exercises and hands-on activities
- **Key Insight** — Important concepts worth remembering
- **Warning** — Common mistakes or gotchas

Figures:
- Captioned and numbered (Figure 1, Figure 2, etc.)
- Referenced by number in body text
-->

---

# Token Economics

## Cutting Your LLM Bill Without Cutting Quality

**By David Kelly Price**

Version 1.0 — April 2026

---

## Table of Contents

- About This Guide
- Chapter 1: Where the Money Actually Goes
- Chapter 2: The Token Tax: Input vs. Output Costs
- Chapter 3: Context Compression Strategies
- Chapter 4: Prompt Caching and When to Use It
- Chapter 5: Model Selection and Cost-Quality Tradeoffs
- Chapter 6: Batching, Routing, and Request Shaping
- Chapter 7: Measuring What You're Spending
- Chapter 8: Building a Token Budget System
- Conclusion
- Appendix A: Glossary
- Appendix B: Tools & Resources
- Appendix C: Further Reading

---

## About This Guide

LLM pricing looks simple on the surface. Providers publish a rate card — dollars per million tokens, in and out — and you do the math. Except the math almost never matches the bill.

The gap between what you expect to spend and what you actually spend is almost always a product of misunderstanding what you're purchasing. Tokens are the unit of billing, but they are also the unit of computation, the unit of attention, and the unit of quality. Cutting them blindly costs you quality. Not understanding them costs you money. The goal is to understand them well enough to make principled decisions about both.

This guide is written for teams that are past the prototype stage. You have something running in production, or you're close to it, and you're starting to think seriously about what it will cost to operate at real scale. You're not trying to make the model cheaper by making it worse. You're trying to identify the waste — the tokens you're spending without getting value back.

There is a lot of waste. In almost every production system I've seen, between 20 and 50 percent of token spend is recoverable without any meaningful quality degradation. The techniques to recover it aren't exotic. They're engineering disciplines applied to a new kind of resource.

Each chapter covers one lever. By the end, you'll have a framework for thinking about inference spend the way you'd think about any other infrastructure cost — not as a fixed overhead you manage, but as a variable you can engineer.

The numbers in this guide use Claude pricing (Anthropic) as the primary reference point, with periodic comparisons to OpenAI and Google where instructive. The specific rates will drift; the relationships between them tend to be more stable. When in doubt, verify against current provider documentation.

---

## Chapter 1: Where the Money Actually Goes

Most LLM cost conversations start in the wrong place. Teams look at the model pricing page, estimate their average request size, multiply by expected volume, and produce a number that's usually off by a factor of two to five. Not because the math is wrong, but because the model of the system is wrong.

Understanding where inference spend actually accumulates requires understanding the shape of a typical production request — not just the prompt you wrote, but everything that rides along with it.

### The Anatomy of a Request

When you call an LLM API, you send tokens and receive tokens. The input side includes:

- Your system prompt
- Any retrieved context (documents, code, database records)
- Conversation history, if the system maintains it
- The user's actual message

The output side includes:

- The model's response
- Any intermediate reasoning (chain-of-thought, if you're using it)
- Tool call arguments and structured data, if the model is using function calls

What surprises most teams is the relative size of these components. The user's message is typically the smallest piece. In a well-designed RAG system, retrieved context often accounts for 60 to 80 percent of input tokens. In a multi-turn chat system with naive history management, conversation history grows without bound and eventually dominates everything.

> **Key Insight:** Your users' messages are rarely your biggest cost driver. System prompts, retrieved context, and conversation history almost always are.

### The Context Window as a Budget

The context window is both a capability and a constraint. A 200,000-token context window sounds liberating until you realize that filling it costs proportionally. On Claude Sonnet 4.5, you're paying roughly $3 per million input tokens. A single request that fills a 200K context window costs $0.60 in input tokens alone — before the model generates a single word of response.

This reframes the context window not as free space but as expensive real estate. Every token you add to context should earn its place by meaningfully improving response quality. Every token that doesn't earn its place is waste.

The concept of context efficiency — the ratio of value produced to tokens consumed — is the first principle of token economics. A 10,000-token context that produces a great answer is more efficient than a 100,000-token context that produces the same answer. They don't always produce the same answer, which is why "just truncate everything" is not a strategy. But they also don't always produce meaningfully different answers, which is why "add everything that might help" is an expensive habit.
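
The arithmetic in this section is easy to wrap in a helper. This is a minimal sketch, assuming the roughly $3 per million input tokens quoted above; the function name and default rate are illustrative and not part of any provider SDK.

```python
def input_cost_usd(input_tokens: int, usd_per_million: float = 3.00) -> float:
    """Cost of the input side of one request, in US dollars."""
    return input_tokens / 1_000_000 * usd_per_million


# A request that fills a 200K-token window at $3 per million input tokens.
full_window = input_cost_usd(200_000)
# The same question answered from a 10K-token context.
lean_context = input_cost_usd(10_000)

print(f"full window: ${full_window:.2f}, lean context: ${lean_context:.2f}")
```

If both contexts produce an equally good answer, the lean request does the same job for one-twentieth of the input spend.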

### Where Costs Concentrate in Practice

**Conversational systems** accumulate cost primarily through history growth. If you're appending every exchange to a growing context and passing the full history with each request, your token count per request grows linearly with conversation length, and the cumulative total grows quadratically. A 50-turn conversation doesn't cost 50 times a single exchange. Turn 1 sends one exchange's worth of tokens, turn 2 sends two, and so on up to 50, so the conversation as a whole bills 1 + 2 + ... + 50 = 1,275 exchange-sized units of input.

**RAG systems** accumulate cost through context stuffing. The naive approach to retrieval is to fetch the top-k chunks and concatenate them. If k is 10 and your chunks are 500 tokens each, you've added 5,000 tokens before counting the system prompt, history, or user message. Teams often increase k when quality seems low, which is frequently the wrong response: more chunks of mediocre relevance don't outperform fewer chunks of high relevance, and they cost more.

**Agentic systems** accumulate cost through tool call overhead and scratchpad growth. Each step in a multi-step agent pipeline carries the full history of prior steps. A five-step agent that uses 1,000 tokens per step doesn't cost 5,000 tokens — it costs roughly 15,000, because each step includes the accumulated output of all prior steps.

**Code generation systems** accumulate cost through large input files. Sending an entire 2,000-line file to ask about a 50-line function is a common pattern that's expensive and usually unnecessary.

> **Warning:** Agentic systems have multiplicative cost structures, not additive ones. Model your costs before going to production, or the first real traffic spike will be educational in an expensive way.
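
The growth pattern behind the conversational and agentic figures above can be checked in a few lines. This sketch assumes every step adds the same number of new tokens and that each request resends everything that came before it; real systems are lumpier, and the function names are illustrative.

```python
def cumulative_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total input tokens billed when step i resends everything from steps 1..i."""
    return sum(i * tokens_per_step for i in range(1, steps + 1))


def flat_input_tokens(steps: int, tokens_per_step: int) -> int:
    """Total if each step only sent its own new tokens (the naive estimate)."""
    return steps * tokens_per_step


# Five-step agent, 1,000 new tokens per step.
agent_naive = flat_input_tokens(5, 1_000)
agent_actual = cumulative_input_tokens(5, 1_000)

# 50-turn chat, measured in units of one exchange.
chat_units = cumulative_input_tokens(50, 1)

print(agent_naive, agent_actual, chat_units)
```

The gap between the naive and cumulative totals widens with every additional step, which is why step count is worth capping explicitly in agent designs.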

### The Output Side

Output tokens are typically two to five times more expensive per token than input tokens, depending on the provider and model. This makes verbose response styles costly in a way that's easy to underestimate.

A model that responds with 300 tokens when 100 would suffice costs three times more for that output. At scale, this is significant. A system processing 100,000 requests per day where average output length could be reduced from 300 to 150 tokens, without changing quality, would see output costs cut in half. The dollar saving is half of whatever you currently spend on output: if output tokens account for $6,000 of a $10,000/month inference bill, that's $3,000/month.

Output length is partially a model behavior and partially a prompt behavior. Many teams don't realize that the way you write prompts significantly influences how long responses are. Explicit length guidance, response format specifications, and even small phrasing choices substantially affect output token counts.

### Hidden Token Costs

Beyond the primary prompt-response cycle, several patterns add tokens that don't show up in naive cost models:

**Retry logic** silently inflates your costs if you're retrying failed or low-quality requests without tracking that spend separately. A 5% retry rate adds roughly 5% to your bill, and more than that if retries cluster on your most expensive requests.

**Streaming responses** have the same token cost as non-streaming responses. Streaming is about latency, not cost. Some teams assume streaming is cheaper because they can interrupt early — but the tokens are metered at generation, not delivery.

**Embedded metadata** adds up in both prompts and responses. JSON structures with field names, XML tags, Markdown formatting, and verbose delimiters all cost tokens. A response returned as a rich JSON object with descriptive keys might cost 20 to 30 percent more than the same data returned as a flat structure with terse keys.

**Templating waste** — the invisible cost of how prompts are constructed — is endemic in systems where prompts are assembled from multiple sources. Redundant whitespace, repeated instructions, and duplicated context sections are common in prompt templates that have been edited by multiple people over time.
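
Some templating waste can be removed mechanically before a prompt is sent. The sketch below is one possible clean-up pass, not a standard tool: `tidy_prompt` is a hypothetical helper that collapses runs of spaces, squeezes repeated blank lines, and drops lines that repeat an earlier line exactly (ignoring case). It should not be applied to prompts containing indentation-sensitive content such as code samples.

```python
import re


def tidy_prompt(prompt: str) -> str:
    """Collapse redundant whitespace and drop exact duplicate lines."""
    seen: set[str] = set()
    kept: list[str] = []
    for raw_line in prompt.splitlines():
        line = re.sub(r"[ \t]+", " ", raw_line).strip()
        if not line:
            # Keep at most one blank line in a row.
            if kept and kept[-1] != "":
                kept.append("")
            continue
        key = line.lower()
        if key in seen:
            continue  # this line was already stated verbatim earlier
        seen.add(key)
        kept.append(line)
    return "\n".join(kept).strip()


messy = (
    "You are a helpful assistant.\n\n\n"
    "   Answer in   English.  \n"
    "You are a helpful assistant.\n"
    "Be brief.\n"
)
clean = tidy_prompt(messy)
print(clean)
```

Exact-match deduplication only catches the easy cases. Instructions that repeat the same constraint in different words still need a human read.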

### Building a Cost Model

Before you can optimize anything, you need to understand where you are. A cost model for an LLM system should account for:

| Component | Typical % of Input Tokens | Notes |
|---|---|---|
| System prompt | 5–20% | Fixed; amortizes well |
| Retrieved context | 30–60% | Scales with k and chunk size |
| Conversation history | 0–50% | Grows with turn count |
| User message | 5–20% | Irreducible |
| Tool definitions | 2–10% | Often overlooked |

The output side is simpler: model behavior plus prompt influence. The key variable is your average output length, which you should measure empirically rather than estimate.

Once you have this breakdown, you know where to focus. Most optimization effort should go where most tokens go.

> **Try This:** Pull a sample of 100 real requests from your production logs. Break down the token counts by component (system prompt, context, history, user message) and calculate the percentage each component represents. If you don't have this breakdown in your logs yet, add it — you can't optimize what you can't measure.
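
A breakdown like this takes only a few lines once per-component counts are in your logs. The sketch below assumes each logged request is a dictionary mapping a component name to its input token count; the field names are hypothetical and should be replaced with whatever your logging actually records.

```python
from collections import Counter


def component_shares(requests: list[dict[str, int]]) -> dict[str, float]:
    """Percentage of total input tokens attributable to each component."""
    totals: Counter[str] = Counter()
    for request in requests:
        totals.update(request)  # adds this request's counts to the running totals
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    # most_common() orders components from largest to smallest.
    return {name: 100 * count / grand_total for name, count in totals.most_common()}


sample_log = [
    {"system_prompt": 400, "retrieved_context": 2500, "history": 800, "user_message": 300},
    {"system_prompt": 400, "retrieved_context": 1500, "history": 1600, "user_message": 500},
]
shares = component_shares(sample_log)
for name, pct in shares.items():
    print(f"{name:>18}: {pct:5.1f}%")
```

The first entry in the result is the largest token consumer in the sample, which is the component to examine first.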

### Key Takeaways

1. The user's message is rarely the dominant cost. System prompts, context, and history usually are.
2. The context window is expensive real estate, not free space. Every token should earn its place.
3. Conversational systems have quadratic cost growth with naive history management. Agentic systems have multiplicative costs per step.
4. Output tokens cost two to five times more than input tokens, depending on provider and model.
5. Hidden costs — retries, metadata overhead, templating waste — are common and measurable.

**Chapter Exercise:** For your primary LLM system, build a per-request cost breakdown. Instrument your API calls to log token counts by component. Calculate the cost-per-request and cost-per-day at current volume. Identify the single largest token consumer. That's where Chapter 3 starts.

---

## Chapter 2: The Token Tax: Input vs. Output Costs

Every major LLM provider prices input and output tokens separately, and output tokens are almost always more expensive. Understanding why — and what that means for how you design systems — is the foundation of effective cost management.

### Why Output Is More Expensive

The asymmetry isn't arbitrary. From a computational standpoint, generating a token requires more work than processing one. During inference, input tokens are processed in parallel — the attention mechanism operates across all input tokens simultaneously. Output tokens, by contrast, are generated sequentially: each new token depends on all prior tokens, including the output generated so far. This sequential dependency is the fundamental bottleneck in autoregressive generation.

More compute per token means higher marginal cost. Providers pass this through in pricing. The typical input-to-output ratio is roughly 1:3 to 1:5, meaning an output token costs three to five times an input token. At the time of writing, Claude Sonnet 4.5 is priced at $3 per million input tokens and $15 per million output tokens — a 1:5 ratio.

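This ratio has a significant implication: at 1:5, every output token costs as much as five input tokens, so trimming 100 tokens from a response saves as much as trimming 500 tokens from the prompt. Doubling output length doubles output cost while the input-side cost barely moves, so output cost is what you're really managing when you think about response verbosity.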

### Calculating Your Effective Token Rate

Because input and output tokens occur in a mix, your "effective" cost per token depends on the ratio of input to output in your workload. If a typical request has 2,000 input tokens and 300 output tokens:

```
Effective rate = (2000 × $0.000003 + 300 × $0.000015) / 2300
              = ($0.006 + $0.0045) / 2300
              = $0.0105 / 2300
              ≈ $0.00000457 per token
```

Compare that to a system with 500 input tokens and 1,000 output tokens:

```
Effective rate = (500 × $0.000003 + 1000 × $0.000015) / 1500
              = ($0.0015 + $0.015) / 1500
              = $0.0165 / 1500
              ≈ $0.000011 per token
```

The second system's effective rate is more than double — driven entirely by the output-heavy ratio. This is why coding assistants and summarization tools tend to have very different cost profiles even if their average request size looks similar.

> **Key Insight:** Your input-to-output ratio is one of the most important cost variables in your system. Measure it. Different use cases within the same product can have ratios that differ by an order of magnitude.
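
The two worked examples above generalize to a small function you can run against measured token counts for each use case. This is a sketch using the Claude Sonnet 4.5 rates quoted in this chapter as defaults; the function name is illustrative.

```python
def effective_rate(
    input_tokens: int,
    output_tokens: int,
    input_usd_per_m: float = 3.00,
    output_usd_per_m: float = 15.00,
) -> float:
    """Blended cost per token (USD) for a request with the given input/output mix."""
    total_tokens = input_tokens + output_tokens
    if total_tokens <= 0:
        raise ValueError("request must contain at least one token")
    total_cost = (input_tokens * input_usd_per_m + output_tokens * output_usd_per_m) / 1_000_000
    return total_cost / total_tokens


input_heavy = effective_rate(2_000, 300)    # first worked example above
output_heavy = effective_rate(500, 1_000)   # second worked example above

print(f"{input_heavy:.8f} vs {output_heavy:.8f} USD per token")
```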

### Output Control as a Cost Lever

Because output is expensive, controlling output length is one of the highest-leverage optimizations available. The good news is that output length is meaningfully controllable without degrading quality — if you're explicit about what you need.

**Max tokens parameter.** Every major provider exposes a parameter that hard-caps response length (`max_tokens` or a close variant, depending on the API). This is a blunt instrument, because the model is cut off mid-thought if it hits the cap, but used correctly it prevents runaway verbose responses and can be tuned per use case. If your typical valid response is under 500 tokens, setting `max_tokens` to 750 protects you from occasional 2,000-token responses without affecting normal behavior.
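
One way to choose the cap is to derive it from output lengths you have already observed. The helper below is a hypothetical sketch: taking the 95th percentile and multiplying by 1.5 is an assumption chosen to mirror the 500-to-750 example, not a provider recommendation.

```python
import math


def suggest_max_tokens(
    observed_lengths: list[int],
    percentile: float = 0.95,
    headroom: float = 1.5,
) -> int:
    """Cap = nearest-rank percentile of observed output lengths, times headroom."""
    if not observed_lengths:
        raise ValueError("need at least one observed output length")
    ordered = sorted(observed_lengths)
    rank = max(1, math.ceil(percentile * len(ordered)))  # nearest-rank method
    return math.ceil(ordered[rank - 1] * headroom)


# Output token counts from ten valid responses (illustrative numbers).
lengths = [120, 200, 250, 300, 310, 350, 400, 420, 480, 500]
cap = suggest_max_tokens(lengths)
print(cap)
```

Recompute the cap per use case rather than globally, since a summarization endpoint and a classification endpoint will have very different length distributions.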

**Prompt-level length instructions.** Models follow explicit instructions about response length, and these instructions are far more effective than most teams expect. Telling a model to respond in "2-3 sentences" or "under 150 words" reliably produces shorter responses. The key is precision — "be concise" is less effective than "respond in under 100 words."

The same principle applies to format. Asking for a JSON response instead of natural language prose often reduces token count, because prose has connecting tissue — transitions, hedges, context-setting — that structured data doesn't need. A JSON object with five fields might be 150 tokens; a natural language equivalent might be 400.

**Response format specification.** Defining the exact structure you expect — whether that's a JSON schema, a numbered list, a fixed template — compresses output in two ways: it eliminates structural ambiguity that models fill with words, and it allows you to parse outputs efficiently rather than having the model explain its work.

```python
# Verbose output (natural language)
# "Based on my analysis, the sentiment of this text is positive.
#  The author uses enthusiastic language and several positive
#  indicators were found..."
# ~40 tokens

# Compact output (JSON)
# {"sentiment": "positive", "confidence": 0.87}
# ~12 tokens
```

At scale, the difference between these two approaches compounds quickly. 100,000 requests per day at 28 fewer output tokens each saves 2.8 million output tokens daily. At $15 per million output tokens, that's $42/day — $1,260/month — from a single formatting decision.
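
The same projection can be rerun with your own volume and rates. This is a minimal sketch, assuming a 30-day month and the $15 per million output rate used above; the function name is illustrative.

```python
def monthly_output_savings(
    requests_per_day: int,
    tokens_saved_per_request: int,
    output_usd_per_m: float = 15.00,
    days_per_month: int = 30,
) -> float:
    """Projected monthly saving (USD) from trimming output tokens per request."""
    tokens_saved_per_day = requests_per_day * tokens_saved_per_request
    return tokens_saved_per_day / 1_000_000 * output_usd_per_m * days_per_month


# 100,000 requests/day, ~40-token prose answer replaced by ~12-token JSON.
savings = monthly_output_savings(100_000, 40 - 12)
print(f"${savings:,.0f} per month")
```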

### The Hidden Cost of Chain-of-Thought

Chain-of-thought prompting — asking the model to "think step by step" or show its reasoning — reliably improves accuracy on complex tasks. It's also expensive, because reasoning tokens are output tokens.

A response that includes 500 tokens of reasoning followed by 100 tokens of answer costs the same as any other 600-token response. If the reasoning is necessary for answer quality, the cost is justified. If the task is simple enough that the model reasons correctly without it, the reasoning is pure overhead.

The appropriate use of chain-of-thought is calibrated to task complexity. Simple classification, extraction, and lookup tasks rarely benefit from extended reasoning. Complex multi-step problems, ambiguous contexts, and tasks requiring planning often do. The mistake is applying it uniformly.

Newer models with explicit "extended thinking" modes (such as Claude's extended thinking) make this cost explicit. Thinking tokens are generally billed as output tokens, at the output rate, even when the provider returns only a summary of the reasoning or hides it entirely. Check how each provider meters and caps thinking, because a generous thinking budget can dominate the cost of a request.

> **Warning:** Asking for chain-of-thought on every request without calibration can double or triple your output token count. Measure the quality improvement on your specific task before making it a default.

### Structured Outputs and Schema Overhead

Modern APIs support constrained decoding — forcing the model to produce output that conforms to a specific JSON schema. This is useful for reliability, but it has a token cost dimension worth understanding.

The schema itself costs tokens when it's included in the prompt. A detailed schema with 10 fields, descriptions, and enum values might cost 200–400 tokens per request. At high volume, this becomes meaningful.

The tradeoff is between schema overhead and response parsing reliability. For complex schemas in high-volume systems, it's worth measuring whether a simpler prompt-level instruction ("respond ONLY with valid JSON, format: {...}") produces sufficiently reliable output, since it typically costs fewer tokens than a full schema definition.

### Comparing Costs Across Providers

As of early 2026, the major provider pricing landscape looks roughly like this (approximate; always verify current rates):

| Provider / Model | Input ($/M tokens) | Output ($/M tokens) | Ratio |
|---|---|---|---|
| Claude Sonnet 4.5 | $3.00 | $15.00 | 1:5 |
| Claude Haiku 4.5 | $1.00 | $5.00 | 1:5 |
| GPT-4o | $2.50 | $10.00 | 1:4 |
| GPT-4o mini | $0.15 | $0.60 | 1:4 |
| Gemini 1.5 Flash | $0.075 | $0.30 | 1:4 |
| Gemini 1.5 Pro | $1.25 | $5.00 | 1:4 |

The ratio is remarkably consistent across providers, which suggests it reflects a genuine computational asymmetry rather than a pricing strategy. What varies is the absolute level: Haiku costs roughly a quarter to a third of what Sonnet does for the same token counts.

### Thinking About Cost Per Value Unit

The trap in pure cost optimization is treating tokens like a resource to minimize rather than a resource to allocate. The right frame is return on token investment: what value do you get per token spent?

An expensive research query that produces a high-quality 1,000-word analysis for a paying customer might have better unit economics than a cheap 50-token classification that is wrong one time in three. Cost per token matters less than cost per successful outcome.

This is why output quality tracking belongs in any cost management system. You can't know whether your optimization efforts are working without knowing whether they're degrading quality. The two signals must be measured together.
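
Cost per successful outcome is simple to compute once you track a success rate alongside spend. The sketch below uses made-up prices and accuracy figures purely for illustration, and it ignores the downstream cost of acting on a wrong answer, which usually makes the cheap, unreliable option look even worse.

```python
def cost_per_success(cost_per_request_usd: float, success_rate: float) -> float:
    """Average spend needed to obtain one successful outcome."""
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_request_usd / success_rate


# Hypothetical numbers: a cheap classifier that is right two times in three,
# versus a slightly pricier setup that is right 98% of the time.
cheap = cost_per_success(0.0010, 2 / 3)
reliable = cost_per_success(0.0012, 0.98)

print(f"cheap: ${cheap:.5f}  reliable: ${reliable:.5f} per correct answer")
```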

> **Try This:** Pick your top three system prompts by token volume. Time how long it takes you to read each one. If any section feels redundant, hedging, or like it's there "just in case," highlight it. Then measure whether removing those sections changes response quality on a set of representative test cases. Clarity almost always beats volume.

### Key Takeaways

1. Output tokens are typically 3–5× more expensive than input tokens due to the sequential computation required to generate them.
2. Your effective cost per token depends on your input-to-output ratio — measure it per use case.
3. Explicit length and format instructions are highly effective at reducing output tokens without sacrificing quality.
4. Chain-of-thought prompting is worth the output cost on complex tasks; it's overhead on simple ones.
5. Schema definitions for structured outputs add input token overhead that may or may not be worth it depending on task complexity and volume.

**Chapter Exercise:** For your most-used endpoint, run 50 test requests with and without explicit length instructions (e.g., "Respond in under 200 words"). Compare average output token counts and manually rate the quality of each set. Calculate the projected monthly savings if the shorter outputs are equally good.

---

## Chapter 3: Context Compression Strategies

The context window is the costliest piece of real estate in your stack. Unlike compute or storage, every token in context gets billed on every request; unless you cache it (Chapter 4), there's no amortization once data is loaded. The question isn't whether to manage context aggressively; it's which techniques apply to your specific system.

### The Compression Mindset

Context compression isn't about making the model work with less information. It's about identifying which information is doing work — contributing to better responses — and which is just taking up space. Those are different problems.

Irrelevant tokens in context don't just cost money; they can actively degrade performance. This is counterintuitive, but well-documented. Large language models have a known "lost in the middle" problem: information at the beginning and end of long contexts is retrieved more reliably than information in the middle. Stuffing a context full of marginally relevant documents can bury the actually relevant information under noise.

Context compression, done well, improves quality and reduces cost simultaneously. That's the target.

### Conversation History Management

For conversational systems, unbounded history is the fastest path to runaway costs. The standard approaches are:

**Sliding window.** Keep only the last N turns of conversation. Simple, predictable, and works well when conversations are mostly sequential — each message builds on the prior one rather than referencing something from early in the conversation. The failure mode is losing relevant early context (a user's stated preference in turn 1 that you need in turn 15).

**Summarization.** Periodically compress older conversation turns into a summary. "The user asked about X, we discussed Y, they decided on Z" is far more compact than the verbatim exchange. The cost of generating the summary is real but typically one-time — the summary replaces multiple turns and stays compressed.

A practical pattern is a two-tier history: recent turns verbatim, older turns as a rolling summary. This preserves the immediacy of recent context while managing growth.

```python
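def summarize_turns(turns: list[dict], max_chars: int = 600) -> str:
    """Placeholder summarizer (hypothetical): joins truncated turns.

    Swap in a lightweight model call or an extractive summary in production.
    """
    lines = [f"{t['role']}: {t['content'][:120]}" for t in turns]
    return " | ".join(lines)[:max_chars]


def build_context(history: list[dict], max_recent_turns: int = 5) -> list[dict]:
    """
    Returns recent turns verbatim; older turns as a single summary message.
    """
    if len(history) <= max_recent_turns:
        return history

    older = history[:-max_recent_turns]
    recent = history[-max_recent_turns:]

    summary = summarize_turns(older)  # call model or extractive summary

    # Some APIs take the system prompt as a separate parameter rather than
    # as a message with role "system"; adapt this to your provider.
    return [
        {"role": "system", "content": f"Earlier conversation summary: {summary}"},
        *recent,
    ]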
```

**Extractive summaries.** If you can identify key facts from conversation history — user preferences, decisions made, constraints stated — you can extract just those facts rather than summarizing entire exchanges. A structured "user profile" that accumulates facts over a conversation is often more useful than a narrative summary, and can be far more compact.

**Token budget enforcement.** Set a hard limit on conversation history tokens and prune from the oldest end when the budget is exceeded. This is blunter than summarization but requires no model call and is entirely deterministic. Works well when recency is what matters most.
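
Budget enforcement can be sketched as a walk from the newest turn backwards. The version below assumes each turn is a dictionary with a `content` string, and it uses a crude four-characters-per-token estimate as a stand-in for a real tokenizer; both the helper names and the estimate are assumptions for illustration.

```python
def approx_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: roughly 4 characters per token."""
    return max(1, len(text) // 4)


def prune_history(history: list[dict], max_tokens: int) -> list[dict]:
    """Drop the oldest turns until the remaining history fits the budget."""
    kept: list[dict] = []
    used = 0
    # Walk from newest to oldest so the most recent turns are kept first.
    for turn in reversed(history):
        cost = approx_tokens(turn["content"])
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    {"role": "user", "content": "a" * 400},
    {"role": "assistant", "content": "b" * 400},
    {"role": "user", "content": "c" * 400},
]
pruned = prune_history(history, max_tokens=250)
print(len(pruned))
```

Because pruning stops at the first turn that does not fit, the kept history is always a contiguous run of the most recent turns.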

> **Key Insight:** The right history management strategy depends on your conversation structure. Sequential conversations (each message builds on the last) do well with sliding windows. Reference-heavy conversations (users frequently refer back to earlier decisions) need summarization or fact extraction.

### Retrieval-Augmented Generation Compression

RAG systems have their own compression challenge: the retrieval step produces candidates, and you have to decide which candidates actually go into the context.

**Chunk size calibration.** Most RAG systems use fixed-size chunks (e.g., 512 tokens). The right chunk size depends on the density of relevant information in your documents. Technical documentation benefits from smaller chunks because the relevant fact is often localized. Narrative documents benefit from larger chunks because removing surrounding context degrades comprehension.

If your chunks are large relative to the relevance of individual sections, you're paying for a lot of context that isn't earning its place.

**Re-ranking and filtering.** Semantic retrieval returns documents by embedding similarity, which is a proxy for relevance but not a perfect measure. A re-ranking step — using a smaller, cheaper model to score retrieved chunks for relevance to the specific query — can dramatically reduce the number of chunks that make it into the final context.

A typical pattern: retrieve the top 20 chunks by embedding similarity, re-rank them with a lightweight model, and send only the top 5 to the primary model. The re-ranking call costs a fraction of what you save by sending 5 chunks instead of 20 to the expensive model.

```python
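def get_context(query: str, vector_store, reranker,
                k_retrieve: int = 20, k_final: int = 5) -> list[str]:
    """
    Retrieve broadly, re-rank, and keep only the best chunks.

    `vector_store` and `reranker` are whatever clients your stack provides.
    This sketch assumes `search(query, k=...)` returns objects with a `.text`
    attribute and `score(query, texts)` returns one float per text.
    """
    # Step 1: broad retrieval
    candidates = vector_store.search(query, k=k_retrieve)

    # Step 2: re-rank with lightweight model or cross-encoder
    scores = reranker.score(query, [c.text for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)

    # Step 3: return only top-k
    return [chunk.text for chunk, _ in ranked[:k_final]]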
```

**Contextual compression.** Rather than sending entire document chunks, extract the sentences or passages most relevant to the specific query. This is computationally more expensive but can reduce context size by 50–70% while maintaining or improving answer quality.

LangChain's `ContextualCompressionRetriever` and similar patterns implement this. The tradeoff is latency and the cost of the compression step — which usually runs much cheaper than sending the full chunks to a frontier model.

**Maximal Marginal Relevance (MMR).** When you retrieve multiple chunks, they often overlap significantly in content. MMR selects chunks to maximize relevance while minimizing redundancy. Instead of five chunks that all say the same thing about a topic, MMR gives you five chunks that each contribute distinct information. This is close to a pure win: same token budget, more information diversity, at the price of occasionally passing over a slightly more relevant near-duplicate.
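
A compact version of MMR can be written in plain Python. This sketch assumes you already have embedding vectors for the query and for each candidate chunk; `lambda_` is the usual relevance-versus-diversity weight, and the default of 0.5 here is an arbitrary starting point rather than a recommended value.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(
    query_vec: list[float],
    chunk_vecs: list[list[float]],
    k: int,
    lambda_: float = 0.5,
) -> list[int]:
    """Return indices of up to k chunks chosen by Maximal Marginal Relevance."""
    relevance = [cosine(query_vec, v) for v in chunk_vecs]
    selected: list[int] = []
    remaining = list(range(len(chunk_vecs)))

    while remaining and len(selected) < k:
        def score(i: int) -> float:
            # Redundancy = similarity to the closest chunk already selected.
            redundancy = max(
                (cosine(chunk_vecs[i], chunk_vecs[j]) for j in selected),
                default=0.0,
            )
            return lambda_ * relevance[i] - (1 - lambda_) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 1.0]
chunks = [
    [1.0, 0.90],  # 0: highly relevant
    [1.0, 0.92],  # 1: near-duplicate of chunk 0, marginally more relevant
    [0.3, 1.00],  # 2: less relevant, but different content
]
diverse = mmr_select(query, chunks, k=2, lambda_=0.5)
relevance_only = mmr_select(query, chunks, k=2, lambda_=1.0)
print(diverse, relevance_only)
```

With `lambda_=1.0` the selection degenerates to plain relevance ranking and picks both near-duplicates. Lowering it makes the second pick favor the chunk that adds new information.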

### Document-Level Strategies

When working with large documents, the question isn't just which chunks to retrieve — it's what level of granularity to retrieve at.

**Hierarchical retrieval** first identifies the relevant section of a document (chapter, section, heading group) and then retrieves at fine granularity within that section. This dramatically reduces the search space and concentrates retrieved tokens on the genuinely relevant portion of the document.

**Selective context injection.** For tasks that require only specific information from a document (e.g., "what is the refund policy in this terms of service?"), you can pre-process documents to create targeted indexes rather than full-text indexes. A "policy registry" that maps policy topics to relevant excerpts costs a fraction of sending the full document.

**Dynamic context sizing.** Not every query needs the same amount of context. A factual lookup needs one relevant passage. A synthesis task needs multiple sources. An analysis task needs broad context. Routing queries to different context-size tiers based on query type is an underused optimization.
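
Dynamic context sizing can start as a lookup table keyed by query type. The sketch below is deliberately simplistic: the tier sizes and the keyword heuristic in `classify_query` are placeholders, and a production router would more likely use a small classifier model.

```python
# Number of chunks to retrieve per query type (illustrative values).
CONTEXT_TIERS = {"lookup": 1, "synthesis": 5, "analysis": 10}


def classify_query(query: str) -> str:
    """Toy keyword heuristic that maps a query to a context tier."""
    text = query.lower()
    if any(word in text for word in ("analyze", "analyse", "evaluate", "assess")):
        return "analysis"
    if any(word in text for word in ("compare", "summarize", "summarise", "overview")):
        return "synthesis"
    return "lookup"


def chunks_for(query: str) -> int:
    """How many retrieved chunks this query should receive."""
    return CONTEXT_TIERS[classify_query(query)]


print(chunks_for("What is the refund window?"))
print(chunks_for("Compare the 2024 and 2025 pricing plans"))
print(chunks_for("Analyze churn drivers across all regions"))
```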

### System Prompt Compression

System prompts are often the most neglected source of token waste. They're written once, serve multiple purposes, and grow by accretion as teams add new instructions, edge case handlers, and formatting requirements over time.

A system prompt audit is worth doing quarterly:

1. Read every instruction and ask whether removing it would observably change behavior.
2. Identify redundant instructions (same constraint stated multiple times in different words).
3. Look for examples that could be replaced with clearer instructions.
4. Find formatting overhead — long XML tags, verbose JSON keys, excessive whitespace.

The goal isn't minimalism for its own sake. Some instructions are load-bearing. But most system prompts have 20–30% removable overhead that accumulated without intent.

One useful technique: rephrase instructions as constraints rather than explanations. "Do not include disclaimers in your responses" (under 10 tokens) communicates essentially the same thing as "When responding to user queries, please remember that it is not necessary or helpful to include disclaimers or caveats about the limitations of your knowledge, as users are aware of this already" (roughly 40 tokens, depending on the tokenizer). The shorter version usually works just as well.

> **Warning:** Aggressive system prompt compression can break subtle behaviors that you didn't realize the prompt was controlling. Always test against a regression suite before deploying a significantly shortened system prompt.

### Caching as Compression

Caching isn't compression in the traditional sense, but it functionally reduces the cost of context for repeated requests. This is covered in detail in Chapter 4, but the core concept belongs here: if a large portion of your context is static across requests, you can cache it, pay the full processing cost once per cache window, and pay a steeply reduced rate each time it is reused.

The compression mindset and the caching mindset are complementary. Compress what can be compressed; cache what can't be compressed further.

### Measuring Compression Effectiveness

Any compression strategy should be evaluated on three dimensions:

1. **Token reduction** — how many fewer tokens per request?
2. **Quality impact** — does answer quality change? (requires evaluation)
3. **Latency impact** — does the compression step add meaningful latency?

Token reduction is easy to measure. Quality impact requires either human evaluation, automated scoring against a benchmark, or proxy metrics (user corrections, thumbs-down rates, escalations). Latency impact is straightforward to instrument.

The right compression strategy is the one that maximizes token reduction while minimizing quality impact per unit of implementation complexity.

> **Try This:** Take your current RAG system and run the same 20 representative queries with k=10 and k=3. Compare the answer quality (either manually or with an LLM-as-judge setup). If k=3 produces comparable quality, you've found a 70% context reduction with zero architectural change.

### Key Takeaways

1. Context compression improves quality and reduces cost — irrelevant tokens don't just cost money, they actively hurt model performance.
2. Conversation history needs active management: sliding windows, summarization, or fact extraction depending on your conversation structure.
3. RAG systems benefit from re-ranking and contextual compression to reduce chunk count without losing relevant information.
4. System prompts grow by accretion. A periodic audit almost always reveals 20–30% removable overhead.
5. Measure compression strategies on token reduction, quality impact, and latency together — never just one.

**Chapter Exercise:** Implement a simple conversation history summarizer. After 8 turns, compress turns 1–4 into a 150-token summary using a lightweight model call. Measure the token count difference on a 20-turn conversation and verify that summary-based context is functionally equivalent to full-history context on your use case.

---

## Chapter 4: Prompt Caching and When to Use It

Prompt caching is one of the most cost-effective optimizations available for production LLM systems, and one of the most underused. When it applies, it can cut input token costs by 80 to 90 percent on the cached portion. When it doesn't apply, you pay a premium on cache writes without saving anything. Understanding the conditions that determine which situation you're in is the practical challenge.

### How Prompt Caching Works

Every major provider now offers some form of prompt caching, though the implementation details vary. The common principle: if the beginning of your prompt is identical across multiple requests, the provider processes it once and caches the key-value (KV) state from the attention computation. Subsequent requests that share the same prefix skip the recomputation and pay a reduced rate for the cached portion.

On Anthropic's API, cached tokens are charged at $0.30 per million for reads (vs. $3.00 for normal input) and $3.75 per million for writes (the first request that populates the cache). The cache persists for five minutes and is refreshed on each hit.

```python
import anthropic

client = anthropic.Anthropic()

# Mark the system prompt as cacheable
response = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LARGE_SYSTEM_PROMPT,  # 5000-token system prompt
            "cache_control": {"type": "ephemeral"}  # Mark for caching
        }
    ],
    messages=[
        {"role": "user", "content": user_message}
    ]
)
```

The cached portion is processed once when the cache is populated, then each subsequent request within the cache window reads the cached state at one-tenth the normal cost.

### The Conditions That Make Caching Valuable

Caching is only valuable when three conditions hold:

**The cached prefix is large.** Caching a 100-token system prompt saves fractions of a cent. Caching a 50,000-token system prompt with retrieved documents included saves real money. The minimum cache size for a meaningful benefit is typically 1,000 tokens, and providers enforce minimum thresholds anyway (Anthropic requires at least 1,024 tokens for a prompt to be eligible for caching, and more on some models).

**The same prefix recurs frequently.** A cache that's populated once and hit zero times costs money on the write. The write cost ($3.75/M on Anthropic) is higher than the normal input cost ($3.00/M), so a cache that's never re-read is worse than no cache. For caching to pay off, each write needs to be followed by at least one hit within the cache window.

The breakeven formula is simple. The request that populates the cache would have paid the normal input rate anyway, so the only cost to recover is the write premium:

```
Write premium   = cache_write_cost - normal_input_cost = 3.75 - 3.00 = 0.75
Savings per hit = normal_input_cost - cache_read_cost  = 3.00 - 0.30 = 2.70

Breakeven hits  = write premium / savings per hit
                = 0.75 / 2.70
                ≈ 0.28 hits
```

A single cache hit inside the window more than recovers the write premium. Counting the entire write as overhead (3.75 / 2.70 ≈ 1.4 hits) gives a more conservative rule of thumb of roughly 2 hits. Past that point, every additional hit is pure savings.
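
As a rough check on these numbers, the sketch below compares the cost of sending the same prefix with and without caching inside one cache window. It assumes the Sonnet-class rates quoted above and treats the request that writes the cache as one that would otherwise have paid the normal input rate. `prefix_cost` is an illustrative helper, not part of any SDK.

```python
# Per-million-token rates quoted in this chapter (assumed; check current pricing)
NORMAL_INPUT = 3.00
CACHE_WRITE = 3.75
CACHE_READ = 0.30


def prefix_cost(prefix_tokens: int, hits: int, cached: bool) -> float:
    """Dollar cost of the shared prefix across 1 + hits requests in one cache window."""
    millions = prefix_tokens / 1_000_000
    if cached:
        # The first request writes the cache, every later request reads it
        return millions * (CACHE_WRITE + hits * CACHE_READ)
    # Without caching, every request pays the normal input rate
    return millions * NORMAL_INPUT * (1 + hits)


for hits in (0, 1, 2, 10):
    with_cache = prefix_cost(5_000, hits, cached=True)
    without = prefix_cost(5_000, hits, cached=False)
    print(f"{hits:>2} hits: cached ${with_cache:.5f} vs uncached ${without:.5f}")
```

With zero hits the cached path costs more. From the first hit onward it costs less, and the gap widens with every additional hit.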

**The prefix is truly stable.** Caches are invalidated by any change to the prefix. A system prompt that includes the current timestamp, user-specific data, or dynamic content will have zero cache hit rate regardless of how frequently you call the API. The prefix must be identical — byte-for-byte — across all requests that share a cache.

> **Key Insight:** Prompt caching rewards architectural discipline. Systems that keep static content static and dynamic content separate benefit most. Systems that mix dynamic data throughout their prompts cannot use caching effectively.

### Structuring Prompts for Caching

The single most important structural decision is putting static content first. Caching works on prefixes — the longest shared prefix across requests is the cache boundary. Dynamic content anywhere before the end of the cached section breaks the cache for everything after it.

Optimal structure for cacheable prompts:

```
1. [STATIC] System instructions        ← cache boundary
2. [STATIC] Retrieved background docs  ← cache boundary
3. [DYNAMIC] User-specific context     ← NOT cached
4. [DYNAMIC] Conversation history      ← NOT cached
5. [DYNAMIC] Current user message      ← NOT cached
```

Non-optimal (and unfortunately common) structure:

```
1. [STATIC] Some instructions
2. [DYNAMIC] User name and preferences  ← breaks cache here
3. [STATIC] More instructions
4. [STATIC] Background documents
5. [DYNAMIC] User message
```

In the second structure, the cache boundary is after the first static block — everything after the dynamic user data gets processed fresh on every request.

Refactoring to make caching work often requires moving user-specific data to a later position in the prompt. This is sometimes uncomfortable because it feels like the instructions should know about the user before the documents. In practice, models handle this separation well — you can refer forward and backward in context.
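
One way to enforce this ordering is to build the system blocks through a single helper. The sketch below is a minimal illustration: `build_system_blocks` and its arguments are hypothetical names, and the block shape follows the `system` list used in the caching example earlier in this chapter.

```python
def build_system_blocks(instructions: str, background_docs: str, user_context: str) -> list[dict]:
    """Order system content so the static prefix is identical for every user."""
    return [
        {"type": "text", "text": instructions},
        {
            "type": "text",
            "text": background_docs,
            # Breakpoint: everything up to and including this block is cacheable
            "cache_control": {"type": "ephemeral"},
        },
        # Dynamic content sits after the breakpoint so it cannot invalidate the prefix
        {"type": "text", "text": f"User context: {user_context}"},
    ]


ada = build_system_blocks("Answer from the docs only.", "PRODUCT DOCS ...", "name=Ada, plan=pro")
grace = build_system_blocks("Answer from the docs only.", "PRODUCT DOCS ...", "name=Grace, plan=free")

# The first two blocks match exactly, so both users can share one cache entry
print(ada[:2] == grace[:2])
```

Routing every call through one builder also makes it harder for someone to slip a timestamp or user name into the static section later.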

### Where Caching Applies in Practice

**Large system prompts.** If your system prompt is several thousand tokens (detailed persona, extensive instructions, example conversations), caching is straightforward — the system prompt is almost always static.

**RAG with shared documents.** If all users of a feature query from the same document corpus, the retrieved documents can be pre-fetched for common queries and cached as part of a shared prefix. This requires a layer that identifies likely documents before the user query, which is a more sophisticated architecture but can produce dramatic savings at scale.

**Code context and large file injection.** Coding assistants that send full file contents as context are natural candidates for caching. If the file hasn't changed between requests (which is common during iterative editing of the same file), the file content can be cached.

**Knowledge base injection.** Some applications load a static knowledge base into every request — product documentation, FAQ content, policy documents. This content is inherently cacheable and often represents hundreds or thousands of tokens per request.

**Conversation history with shared prefix.** In a multi-turn conversation, earlier turns are shared by definition across each subsequent request. Caching up to the most recent exchange pays dividends in long conversations.

### When Caching Doesn't Help

**High uniqueness workloads.** If every request has a unique, large context (e.g., a document processing pipeline where each request processes a different document), there are no shared prefixes and caching provides no benefit.

**Short sessions.** If most cache writes are not followed by at least one hit before the window expires, caching costs more than it saves. Short sessions with few requests before the user leaves don't generate enough hits.

**Fully dynamic prompts.** Any system that generates prompts from scratch for each request — interpolating user data throughout — cannot benefit from prefix caching without architectural changes.

**Low-traffic systems.** At low request volumes, the cache might never get hit before the TTL expires. Caching infrastructure adds complexity; make sure the volume justifies it.

> **Warning:** Cache write costs are higher than normal input costs. If your traffic is bursty and your cache window consistently expires before you get meaningful hits, you're paying extra on every write. Monitor cache hit rates, not just cache savings.

### Monitoring Cache Effectiveness

The Anthropic API returns cache usage data in every response:

```python
response = client.messages.create(...)

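# Cache fields can be None when caching is not in play, so default to 0
cache_creation_input_tokens = response.usage.cache_creation_input_tokens or 0
cache_read_input_tokens = response.usage.cache_read_input_tokens or 0
input_tokens = response.usage.input_tokens  # Uncached input tokens only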

# Calculate effective cost
input_cost = input_tokens * 0.000003
cache_read_cost = cache_read_input_tokens * 0.0000003
cache_write_cost = cache_creation_input_tokens * 0.00000375

total_cost = input_cost + cache_read_cost + cache_write_cost
savings = cache_read_input_tokens * (0.000003 - 0.0000003)
```

Track this per endpoint and session. A cache hit rate below 80% on a system where you expect high caching suggests prompt structure issues — probably dynamic content in the static prefix area.
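
A hit rate can be defined in more than one way. The sketch below uses a token-level definition: of all tokens that went through the cache, what share was read rather than written. The `Usage` record is a hypothetical stand-in for whatever you log per request, with field names mirroring the API usage fields above.

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Hypothetical per-request usage record."""
    input_tokens: int
    cache_read_input_tokens: int = 0
    cache_creation_input_tokens: int = 0


def cache_hit_rate(records: list[Usage]) -> float:
    """Share of cache-eligible tokens that were read from cache rather than written."""
    reads = sum(r.cache_read_input_tokens for r in records)
    writes = sum(r.cache_creation_input_tokens for r in records)
    eligible = reads + writes
    return reads / eligible if eligible else 0.0


session = [
    Usage(input_tokens=120, cache_creation_input_tokens=5_000),  # First request writes
    Usage(input_tokens=95, cache_read_input_tokens=5_000),
    Usage(input_tokens=140, cache_read_input_tokens=5_000),
    Usage(input_tokens=80, cache_read_input_tokens=5_000),
]
rate = cache_hit_rate(session)
print(f"Cache hit rate: {rate:.0%}")
```

A four-request session like this one tops out at 75% under this definition, which is worth keeping in mind before applying a fixed threshold to short sessions.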

### Multi-Turn Caching Patterns

For conversational systems, each exchange extends the shared prefix for subsequent turns. The pattern of caching the full conversation history up to the last message is supported via `cache_control` markers:

```python
messages = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is the capital of France?",
         "cache_control": {"type": "ephemeral"}}  # Cache after this turn
    ]},
    {"role": "assistant", "content": "Paris."},
    {"role": "user", "content": "What's its population?"}  # Current turn, not cached
]
```

This tells the provider to cache everything up to the marked message, so the current turn's processing can benefit from cached prior context.
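
In a real conversation the marker has to move forward as turns accumulate. A minimal sketch of that, assuming history is stored as a plain list of message dicts: `with_cache_breakpoint` is a hypothetical helper that marks the last prior turn and appends the current user message unmarked.

```python
def with_cache_breakpoint(history: list[dict], current_user_text: str) -> list[dict]:
    """Return messages with a cache marker on the last prior turn.

    The input history is not modified.
    """
    messages = [dict(m) for m in history]  # Shallow copy of each message
    if messages:
        last = messages[-1]
        content = last["content"]
        # Normalize string content into a list of text blocks
        if isinstance(content, str):
            blocks = [{"type": "text", "text": content}]
        else:
            blocks = [dict(b) for b in content]
        if blocks:
            blocks[-1]["cache_control"] = {"type": "ephemeral"}
        last["content"] = blocks
    messages.append({"role": "user", "content": current_user_text})
    return messages


history = [
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "Paris."},
]
messages = with_cache_breakpoint(history, "What's its population?")
```

A conversation this short falls below the minimum cacheable size mentioned earlier, so the pattern only pays off once the history has grown.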

### Key Takeaways

1. Prompt caching reduces cached input token cost by ~90% with a ~5-minute TTL. You need roughly 2 cache hits to break even on write cost.
2. Caching requires a stable prefix. Any dynamic content before the cache boundary invalidates the cache for everything after it.
3. Structure prompts with static content first (instructions, background docs) and dynamic content last (user message, session-specific data).
4. Monitor cache hit rates actively — a low hit rate on a supposedly cacheable system often indicates prompt structure problems.
5. High-uniqueness workloads (one document per request) cannot benefit from prefix caching.

**Chapter Exercise:** Review your primary system prompt. Identify any dynamic content currently interpolated into the static portion. Refactor to move dynamic elements to the end of the prompt. Measure the cache hit rate before and after.

---

## Chapter 5: Model Selection and Cost-Quality Tradeoffs

The model you choose is the single biggest lever on both cost and quality. A 10x difference in price between a frontier model and a smaller model is common. Whether that price difference is justified depends entirely on whether the frontier model produces meaningfully better outputs for your specific task.

The mistake most teams make is choosing a model once during development — when they're exploring capabilities and frontier models are the obvious choice — and never revisiting that choice once they understand their actual workload.

### The Model Landscape in 2026

The current generation of LLMs spans several capability tiers, each with dramatically different cost profiles:

**Frontier / reasoning models** (GPT-4o, Claude Sonnet 4.5, Gemini 1.5 Pro, Claude Opus): $2–$15/M input, $10–$75/M output. These are the most capable models for complex reasoning, nuanced writing, multi-step planning, and tasks requiring broad world knowledge. They're appropriate when task quality directly drives business value and cheaper models demonstrably underperform.

**Capable mid-tier models** (GPT-4o mini, Claude Haiku 4.5, Gemini Flash): $0.10–$1/M input, $0.30–$4/M output. These are 5–20× cheaper than frontier models and perform adequately on structured tasks, simple Q&A, classification, extraction, and summarization. The quality gap to frontier models is real but often irrelevant to the task at hand.

**Specialized models**: Some providers offer models optimized for specific tasks — code generation, embedding, reranking — that outperform general-purpose models on their target task at lower cost. AWS Bedrock, Cohere, and others have offerings in this space.

**Self-hosted open models**: Running Llama 3, Mistral, or similar open-weight models on your own infrastructure eliminates per-token API costs entirely. The economics work at sufficient scale; the operational burden is real.

### The Task-Model Fit Framework

The right model for a task is the cheapest one that produces acceptable quality on that task. "Acceptable" is defined by your business requirements, not by the model's theoretical capabilities.

To find the right fit, you need to know what your task actually requires. A useful taxonomy:

**Complexity dimension:**
- Low: classification, extraction, lookup, simple formatting
- Medium: summarization, simple Q&A, structured generation
- High: multi-step reasoning, complex writing, code generation for novel problems

**Ambiguity dimension:**
- Low: well-defined input and output formats, little interpretation required
- Medium: some interpretation required, but within clear bounds
- High: open-ended, requires judgment about what the user wants

**Domain knowledge dimension:**
- Generic: widely represented in training data (news, code, general knowledge)
- Specialized: narrow domain knowledge (specific medical procedures, proprietary codebase understanding, specific regulatory frameworks)

Tasks that are low complexity, low ambiguity, and use generic knowledge are strong candidates for smaller, cheaper models. Tasks that are high on any of these dimensions may genuinely require frontier models.

> **Key Insight:** Most production workloads are not uniformly complex. A customer service system handles everything from "where is my order?" (low complexity) to "I want to escalate this to a supervisor and I'm threatening legal action" (high ambiguity). Routing these differently can cut costs by 40–60% without affecting quality where it matters.

### Running Model Comparison Experiments

Model selection should be empirical, not intuitive. The process:

1. Define your task and success criteria (human ratings, accuracy metrics, format compliance rate).
2. Build a representative test set — at minimum 50 examples, ideally 200+.
3. Run all candidate models against the test set.
4. Score outputs against your criteria.
5. Compare cost per successful outcome.

Cost per successful outcome is more informative than cost per request. A model that costs $0.05 per request and succeeds 95% of the time has a cost per success of $0.053. A model that costs $0.10 per request and succeeds 99% of the time has a cost per success of $0.101. The cheaper model has better unit economics even though it fails more often — in this case.

But if failure means human review at $2 per case, the calculation flips. Include the downstream cost of failure in your model.

```python
def cost_per_successful_outcome(
    model_cost_per_request: float,
    success_rate: float,
    failure_handling_cost: float = 0.0
) -> float:
    """
    model_cost_per_request: API cost in dollars
    success_rate: 0.0 to 1.0
    failure_handling_cost: downstream cost when model fails
    """
    avg_cost = model_cost_per_request + (1 - success_rate) * failure_handling_cost
    return avg_cost / success_rate
```
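
Plugging the two models from the example above into this calculation shows the flip. The function is repeated here so the snippet runs on its own; the $2 review cost is the figure used in the text, and the model labels are placeholders.

```python
def cost_per_successful_outcome(
    model_cost_per_request: float,
    success_rate: float,
    failure_handling_cost: float = 0.0,
) -> float:
    avg_cost = model_cost_per_request + (1 - success_rate) * failure_handling_cost
    return avg_cost / success_rate


# (cost per request in dollars, success rate)
candidates = {"cheaper": (0.05, 0.95), "pricier": (0.10, 0.99)}

table = {}
for review_cost in (0.0, 2.0):
    for name, (cost, rate) in candidates.items():
        table[(name, review_cost)] = cost_per_successful_outcome(cost, rate, review_cost)
        print(f"review=${review_cost:.2f} {name}: ${table[(name, review_cost)]:.4f} per success")
```

With no failure cost the cheaper model wins at roughly $0.053 against $0.101. With a $2 review on every failure it loses, at roughly $0.158 against $0.121.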

### The Cascading / Routing Architecture

One of the most cost-effective architectures for mixed-complexity workloads is the cascade: attempt the task with a cheap model first; route to a more expensive model only if the cheap model fails or signals low confidence.

This works best when:
- You can reliably identify low-confidence outputs from the cheap model
- Failures are cheap to detect (format errors, missing required fields)
- The workload has a mix of easy and hard cases

A simple cascade for document classification:

```python
def classify_document(doc: str) -> dict:
    # First attempt: cheap model
    result = haiku_client.classify(doc)

    if result.confidence > 0.85 and result.format_valid:
        return result  # Accept cheap result

    # Escalate to frontier model
    return sonnet_client.classify(doc)
```

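The economics depend on your difficulty distribution. If 80% of cases are easy and resolve correctly with the cheap model, and 20% escalate to the expensive model, your blended cost is:

```
1.00 × cheap_cost + 0.20 × expensive_cost
```

Every request pays for the cheap attempt; only the escalated 20% also pay for the expensive model. That's typically 30–50% of the all-expensive cost, depending on the price ratio between models. With a 10× price ratio, for example, it works out to 0.10 + 0.20 = 30%.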
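
The blended-cost arithmetic can be sketched as a small function. It assumes every request pays for the cheap attempt and only escalated requests also pay for the expensive model. The per-request prices are hypothetical.

```python
def cascade_blended_cost(cheap_cost: float, expensive_cost: float, escalation_rate: float) -> float:
    """Average per-request cost when every request tries the cheap model first."""
    return cheap_cost + escalation_rate * expensive_cost


EXPENSIVE = 0.010  # Hypothetical frontier cost per request, in dollars
CHEAP = 0.001      # Hypothetical cheap-model cost per request (10x cheaper)

for escalation_rate in (0.1, 0.2, 0.4):
    blended = cascade_blended_cost(CHEAP, EXPENSIVE, escalation_rate)
    share = blended / EXPENSIVE
    print(f"escalation {escalation_rate:.0%}: ${blended:.4f} per request, {share:.0%} of all-expensive")
```

The escalation rate dominates the result, which is why the confidence threshold in the cascade deserves as much tuning as the model choice.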

### Fine-Tuning as Cost Reduction

Fine-tuning a smaller model on examples of good outputs from a frontier model (a technique sometimes called distillation) can produce a model that matches frontier performance on your specific task at a fraction of the cost.

The economics are favorable when:
- Your task is well-defined and consistent (not open-ended)
- You have sufficient examples (typically hundreds to thousands)
- Volume is high enough that training cost amortizes over many requests
- Inference cost is currently significant

Fine-tuning a small model like GPT-4o mini on your specific classification task might bring its accuracy from 87% to 97% — matching or exceeding the frontier model — at one-tenth the inference cost. The training cost is one-time and amortizes over volume.

The caution: fine-tuned models are specialized. They perform well on the distribution they were trained on and degrade on anything outside that distribution. Narrow tasks with stable input distributions are good candidates; open-ended tasks are not.

> **Warning:** Fine-tuning is not free. Data preparation, training compute, and ongoing maintenance are real costs. Calculate your time-to-ROI before committing.

### On Prompt Engineering Across Models

A prompt written for one model often doesn't transfer cleanly to another. Smaller models are typically less robust to ambiguous or complex instructions; they benefit from simpler, more explicit prompts. A frontier model prompt that relies on inference and implicit understanding may need to be rewritten as explicit step-by-step instructions for a smaller model to follow reliably.

When switching models, test your prompts — don't just swap the model name. The cost of one afternoon's prompt refinement is trivial compared to the cost of a production failure.

### Key Takeaways

1. Model selection is the highest-leverage cost decision. Revisit it once you understand your actual workload — don't stay on the model you used during prototyping.
2. Right-size to task complexity. Most workloads have both simple and complex requests that don't need the same model.
3. Measure cost per successful outcome, not cost per request. Include downstream failure costs.
4. Cascade architectures (cheap first, expensive on escalation) routinely cut blended costs by 30–50% for mixed-complexity workloads.
5. Fine-tuning can close the quality gap between small and large models for well-defined, high-volume tasks.

**Chapter Exercise:** Take your second most expensive endpoint (by total monthly spend). Build a test set of 50 examples. Run it against your current model and the next cheaper tier. Score accuracy manually or with an LLM judge. Calculate the cost-per-successful-outcome for each. The result will tell you whether the cost delta is justified.

---

## Chapter 6: Batching, Routing, and Request Shaping

So far, the optimizations in this guide have focused on reducing what you send per request. This chapter covers the other dimension: how you send it. The structure of your API call patterns — timing, grouping, formatting, and routing — has meaningful cost implications that don't require changing a single token of prompt content.

### Batch Processing

Major providers offer batch inference APIs at significantly reduced rates. Anthropic's Message Batches API and OpenAI's Batch API both run at 50% of standard pricing for requests that can tolerate higher latency (typically completed within 24 hours).

The half-price discount is substantial. For any workload that doesn't require real-time responses, batch processing is essentially free money.

Batch-appropriate workloads:

- Document processing (OCR correction, categorization, extraction)
- Offline summarization pipelines
- Scheduled report generation
- Training data generation and evaluation
- Nightly enrichment jobs
- Async annotation tasks

Non-batch workloads:

- Any user-facing interaction requiring a response in under a second
- Real-time content moderation
- Interactive coding assistants

The architecture is simple. Instead of making N individual API calls, you submit a single batch request containing all N prompts, poll for completion, and retrieve results.

```python
import time

import anthropic

client = anthropic.Anthropic()

# `documents` is assumed to be a list of document strings loaded elsewhere.
# Prepare one batch entry per document; custom_id ties results back to inputs.
requests = [
    {
        "custom_id": f"document-{i}",
        "params": {
            "model": "claude-haiku-4-5-20251001",
            "max_tokens": 256,
            "messages": [
                {"role": "user", "content": f"Categorize this document: {doc}"}
            ]
        }
    }
    for i, doc in enumerate(documents)
]

# Submit batch
batch = client.messages.batches.create(requests=requests)
batch_id = batch.id

# Poll for completion
while True:
    status = client.messages.batches.retrieve(batch_id)
    if status.processing_status == "ended":
        break
    time.sleep(60)

# Retrieve results; individual entries can fail, expire, or be canceled
results = {}
for result in client.messages.batches.results(batch_id):
    if result.result.type == "succeeded":
        results[result.custom_id] = result.result.message.content
    else:
        results[result.custom_id] = None  # Keep the ID so it can be retried
```

For large data processing pipelines, the combination of batch pricing (50% off) and smaller model selection (10–20× off) can reduce inference costs by 95% compared to calling a frontier model synchronously. This is not an edge case — it's the standard setup for any serious data processing workload.

> **Key Insight:** Batch processing at 50% off is one of the easiest optimizations to implement. If you're making synchronous API calls for any offline workload, you're overpaying by definition.

### Request Rate Shaping

Most LLM providers implement rate limits in terms of tokens per minute (TPM) rather than requests per minute. This creates an optimization opportunity: the shape of your requests affects how efficiently you use your rate limit budget.

A request with 10,000 tokens uses the same rate budget as ten requests of 1,000 tokens each. The difference is in scheduling: the single large request claims 10,000 tokens of your rate limit window in one step, while the ten smaller requests can be spread across the window and interleaved with other traffic.

For systems with variable request sizes, rate limit pressure is disproportionately created by large requests. Strategies to manage this:

**Request splitting.** Break very large requests into smaller, parallel requests when the task permits it. Processing a 50,000-token document? Split it into five 10,000-token segments and process in parallel (if parallelism is semantically valid for your task).

**Priority queuing.** Route high-priority (user-facing) requests ahead of low-priority (background processing) requests in your queue. This ensures latency-sensitive workloads aren't starved by batch work.

**Smoothing.** Don't fire all your background processing requests at once. Spread them across time to avoid consuming your rate limit budget in bursts that starve synchronous traffic.
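
Priority queuing and smoothing can be combined in one small scheduler. The sketch below is an in-memory illustration, not a production queue: `ShapedQueue` is a hypothetical class that releases requests one rate-limit window at a time, highest priority first, and holds back whatever does not fit in the window's token budget.

```python
import heapq
import itertools
from dataclasses import dataclass, field

USER_FACING, BACKGROUND = 0, 1  # Lower number means higher priority


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int  # Tie-breaker that keeps submission order within a priority
    name: str = field(compare=False)
    tokens: int = field(compare=False)


class ShapedQueue:
    def __init__(self, tokens_per_window: int):
        self.tokens_per_window = tokens_per_window
        self._heap: list[QueuedRequest] = []
        self._seq = itertools.count()

    def submit(self, name: str, tokens: int, priority: int) -> None:
        if tokens > self.tokens_per_window:
            # A request larger than the window can never be released; split it first
            raise ValueError(f"{name} exceeds the per-window token budget")
        heapq.heappush(self._heap, QueuedRequest(priority, next(self._seq), name, tokens))

    def next_window(self) -> list[str]:
        """Release requests for one window, deferring any that do not fit."""
        budget = self.tokens_per_window
        released, deferred = [], []
        while self._heap:
            req = heapq.heappop(self._heap)
            if req.tokens <= budget:
                budget -= req.tokens
                released.append(req.name)
            else:
                deferred.append(req)
        for req in deferred:
            heapq.heappush(self._heap, req)
        return released


queue = ShapedQueue(tokens_per_window=10_000)
queue.submit("nightly-1", 6_000, BACKGROUND)
queue.submit("nightly-2", 6_000, BACKGROUND)
queue.submit("chat-1", 3_000, USER_FACING)
queue.submit("chat-2", 2_000, USER_FACING)

windows = [queue.next_window() for _ in range(3)]
print(windows)
```

The user-facing requests go out in the first window even though they were submitted last, and the two background jobs are spread over the following windows instead of competing for the same one.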

### Routing by Task

Request routing — directing different request types to different models or endpoints — was touched on in Chapter 5 from a quality perspective. It has an equally important cost dimension.

A routing layer can classify incoming requests by complexity, type, or expected cost before forwarding them to the appropriate backend. The router itself should be inexpensive — a lightweight classifier, a heuristic rule engine, or a small model.

Common routing signals:

- **Query length**: short queries are often simpler
- **Detected task type**: extraction vs. generation vs. analysis
- **User tier**: premium users might get frontier models; standard users might get mid-tier
- **Context size**: requests under a token threshold go to a cheaper model; large contexts go to a frontier model
- **Confidence scoring**: a fast, cheap first pass estimates difficulty before routing

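```python
from dataclasses import dataclass


@dataclass
class Request:
    messages: list[dict]         # Chat messages; the last entry is the current turn
    task_type: str               # e.g. "classification", "extraction", "code_generation"
    estimated_input_tokens: int  # Pre-computed token estimate for the full prompt


class ModelRouter:
    def route(self, request: Request) -> str:
        # Simple heuristics: short classification/extraction queries go to the cheap model
        if len(request.messages[-1]["content"]) < 100:
            if request.task_type in ("classification", "extraction"):
                return "claude-haiku-4-5"

        if request.estimated_input_tokens > 50_000:
            return "claude-sonnet-4-6"  # Long context needs capable model

        if request.task_type == "code_generation":
            return "claude-sonnet-4-6"  # Code is high-complexity

        return "claude-haiku-4-5"  # Default to cheaper model
```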

The routing logic can be as simple or sophisticated as your workload demands. Even a simple rule-based router that sends 40% of requests to a cheaper model cuts costs without touching prompt quality.

### Request Deduplication and Caching

Not all requests are unique. In many systems, the same question gets asked repeatedly — either by different users asking about common topics, or by the same user asking the same thing in a new session.

**Semantic caching** stores responses indexed by embedding similarity to the query. Incoming queries are compared against cached queries; if a sufficiently similar query exists in cache, the cached response is returned without an API call.

This is different from prompt caching (Chapter 4), which caches computation inside the API. Semantic caching is a layer above the API that avoids the call entirely.

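```python
import math
from typing import Callable


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed: Callable[[str], list[float]], similarity_threshold: float = 0.95):
        self.embed = embed  # Embedding function supplied by the caller
        self.threshold = similarity_threshold
        self.cache = {}  # In practice: Redis or a vector store

    def get(self, query: str) -> str | None:
        query_embedding = self.embed(query)
        # Linear scan for the closest cached query; a vector index replaces this at scale
        best_score, best_response = 0.0, None
        for cached_embedding, cached_response in self.cache.values():
            similarity = cosine_similarity(query_embedding, cached_embedding)
            if similarity > best_score:
                best_score, best_response = similarity, cached_response
        return best_response if best_score > self.threshold else None

    def set(self, query: str, response: str):
        self.cache[query] = (self.embed(query), response)
```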

The hit rate depends heavily on your workload. FAQ systems, customer service bots, and knowledge base assistants often have 30–60% cache hit rates because user questions cluster around common topics. Open-ended creative or analytical assistants will have much lower rates.

The cost of a cache hit is close to zero: one embedding call and a similarity lookup. The savings are the full API cost of the avoided request. At a 30% hit rate on a $0.01/request system processing 1M requests/month, semantic caching saves $3,000/month.

> **Warning:** Cached responses go stale. A semantic cache that serves outdated information is a liability, not an asset. Include TTLs and invalidation logic. Know the staleness tolerance of your use case before deploying.
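
TTLs and explicit invalidation are simple to add at the storage layer. The sketch below is one minimal way to do it in memory. `ExpiringStore` is a hypothetical class, and the clock is injectable only so that expiry can be demonstrated without sleeping.

```python
import time
from typing import Any, Callable


class ExpiringStore:
    """In-memory store whose entries expire ttl_seconds after they are written."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._entries: dict[str, tuple[float, Any]] = {}

    def set(self, key: str, value: Any) -> None:
        self._entries[key] = (self.clock() + self.ttl, value)

    def get(self, key: str) -> Any:
        """Return the stored value, or None if it is missing or expired."""
        entry = self._entries.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self.clock() >= expires_at:
            del self._entries[key]  # Lazily evict stale entries on read
            return None
        return value

    def invalidate(self, key: str) -> None:
        """Drop an entry early, for example when the source document changes."""
        self._entries.pop(key, None)


# A fake clock stands in for the passage of time
now = [0.0]
store = ExpiringStore(ttl_seconds=3600, clock=lambda: now[0])
store.set("refund policy?", "Refunds are available within 30 days.")
fresh = store.get("refund policy?")  # Still inside the TTL
now[0] = 3601.0
stale = store.get("refund policy?")  # Past the TTL, so the entry is gone
```

The right TTL depends on how quickly the underlying facts change: minutes for inventory or pricing, days for stable documentation.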

### Request Coalescing

In systems with many concurrent users, similar requests often arrive within seconds of each other. Coalescing — holding requests briefly to batch them together — allows you to serve multiple users from a single API call in some architectures.

This applies most naturally to streaming inference endpoints where you control the generation infrastructure (self-hosted models). For API-based providers, coalescing doesn't directly save tokens, but it reduces the number of API calls (relevant for rate limits) and can enable shared context optimization.

For shared resources like real-time document summarization ("summarize this article"), coalescing requests for the same document within a short window means one API call serves all concurrent requesters. At high concurrency on popular content, this can substantially reduce total API volume.
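
The same-document case can be sketched with `asyncio`: concurrent callers asking for the same key share one in-flight task instead of each issuing a call. `Coalescer` is a hypothetical helper, and `summarize` is a stand-in for the real API call.

```python
import asyncio
from typing import Any, Awaitable, Callable


class Coalescer:
    """Share one in-flight call among concurrent requests for the same key."""

    def __init__(self, fetch: Callable[[str], Awaitable[Any]]):
        self._fetch = fetch
        self._in_flight: dict[str, asyncio.Task] = {}

    async def get(self, key: str) -> Any:
        task = self._in_flight.get(key)
        if task is None:
            # First caller for this key starts the real work
            task = asyncio.create_task(self._fetch(key))
            self._in_flight[key] = task
            # Forget the task once it finishes so later requests start fresh
            task.add_done_callback(lambda _: self._in_flight.pop(key, None))
        return await task


call_log: list[str] = []


async def summarize(doc_id: str) -> str:
    """Stand-in for an LLM summarization call."""
    call_log.append(doc_id)
    await asyncio.sleep(0.01)
    return f"summary of {doc_id}"


async def main() -> list[str]:
    coalescer = Coalescer(summarize)
    # Five users request the same article at the same moment
    return await asyncio.gather(*(coalescer.get("article-42") for _ in range(5)))


results = asyncio.run(main())
print(len(results), "responses served by", len(call_log), "call")
```

This only deduplicates requests that overlap in time. Pairing it with a response cache covers requests that arrive after the first call has finished.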

### Response Streaming and Progressive Delivery

Streaming responses don't cost less than non-streaming responses, but they have latency implications that affect how you architect request pipelines. A streaming response that allows the application to begin processing before the full response is available can enable earlier termination of irrelevant responses.

If your application can evaluate response quality mid-stream — detecting an off-topic answer or a formatting error early — you can cancel the stream before the full response is generated. On API providers that meter only delivered tokens, this directly reduces cost.

Not all providers meter this way; verify before building your architecture around it.

> **Try This:** Review your current LLM API call patterns. Identify any calls that happen in a scheduled job, offline pipeline, or background worker. Calculate how much these cost per month. Then calculate the same cost using the batch API at 50% off. The difference is money you can recover with a few hours of engineering work.

### Key Takeaways

1. Batch processing at 50% off is available from major providers and applies to any non-real-time workload. If you're not using it for background jobs, you're overpaying.
2. A routing layer that directs requests to appropriately sized models based on complexity typically cuts blended costs by 30–50%.
3. Semantic caching can eliminate API calls entirely for repeated or similar queries; hit rates of 30–60% are achievable in FAQ-type systems.
4. Rate shaping and priority queuing protect latency-sensitive workloads from being starved by bulk processing.
5. Request splitting and coalescing are useful for managing rate limits and shared-resource efficiency at high concurrency.

**Chapter Exercise:** Identify your three highest-volume API endpoints. For each, answer: (1) Is this latency-sensitive? (2) What fraction of queries are likely duplicates or near-duplicates? (3) What fraction of requests could be served by a cheaper model? Use these answers to draft a routing and caching strategy.

---

## Chapter 7: Measuring What You're Spending

The prerequisite for every optimization strategy in this guide is knowing what you're spending and where. Most teams running LLM systems in production are operating with inadequate visibility into their token consumption. They see the monthly bill; they don't see the cost structure that produced it.

This chapter is about building the measurement foundation that makes everything else possible.

### What to Measure

Token cost visibility requires four types of data:

**Per-request metrics:**
- Input tokens (total, by component if possible)
- Output tokens
- Cached tokens (read and write separately)
- Model used
- Endpoint / use case
- User or session identifier
- Request timestamp and latency
- Success/failure flag

**Aggregate metrics:**
- Cost per endpoint (daily, weekly, monthly)
- Cost per user segment
- Cost per use case
- Average request cost over time
- Token efficiency trends (cost per successful outcome)

**Quality signals (cost-adjusted):**
- Quality score per request or session (however you define quality)
- Cost per successful outcome
- Model escalation rates (if using cascades)

**Anomaly indicators:**
- Requests significantly above average token count
- Sudden cost spikes by endpoint
- Unexplained cache hit rate drops

### Instrumenting API Calls

Every LLM API call should be logged with full token counts. This is straightforward to add at the client layer:

```python
import anthropic
import time
from dataclasses import dataclass
from typing import Any

@dataclass
class RequestMetrics:
    endpoint: str
    model: str
    input_tokens: int
    output_tokens: int
    cache_read_tokens: int
    cache_write_tokens: int
    latency_ms: float
    cost_usd: float
    success: bool
    session_id: str | None = None

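# Illustrative per-token rates (USD); check the provider's current pricing page before relying on them
PRICING = {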
    "claude-sonnet-4-6": {
        "input": 3.00 / 1_000_000,
        "output": 15.00 / 1_000_000,
        "cache_read": 0.30 / 1_000_000,
        "cache_write": 3.75 / 1_000_000,
    },
    "claude-haiku-4-5-20251001": {
        "input": 0.80 / 1_000_000,
        "output": 4.00 / 1_000_000,
        "cache_read": 0.08 / 1_000_000,
        "cache_write": 1.00 / 1_000_000,
    }
}

def calculate_cost(model: str, usage) -> float:
    # Unknown models fall back to Sonnet rates; add new models to PRICING to keep costs accurate
    rates = PRICING.get(model, PRICING["claude-sonnet-4-6"])
    # Cache fields may be absent or None, so normalize them to 0
    cache_read = getattr(usage, "cache_read_input_tokens", 0) or 0
    cache_write = getattr(usage, "cache_creation_input_tokens", 0) or 0
    return (
        usage.input_tokens * rates["input"] +
        usage.output_tokens * rates["output"] +
        cache_read * rates["cache_read"] +
        cache_write * rates["cache_write"]
    )

class InstrumentedClient:
    def __init__(self, metrics_sink, endpoint: str, session_id: str | None = None):
        self.client = anthropic.Anthropic()
        self.metrics = metrics_sink
        self.endpoint = endpoint
        self.session_id = session_id

    def create(self, **kwargs) -> anthropic.types.Message:
        start = time.monotonic()

        try:
            response = self.client.messages.create(**kwargs)
        except Exception:
            # Log failed calls too (zero tokens, zero cost) so error rates stay visible
            self._record(kwargs.get("model", "unknown"), None, start, success=False)
            raise

        self._record(kwargs["model"], response.usage, start, success=True)
        return response

    def _record(self, model: str, usage, start: float, success: bool) -> None:
        # usage is None when the call failed before a response was returned
        self.metrics.record(RequestMetrics(
            endpoint=self.endpoint,
            model=model,
            input_tokens=usage.input_tokens if usage else 0,
            output_tokens=usage.output_tokens if usage else 0,
            cache_read_tokens=getattr(usage, "cache_read_input_tokens", 0) or 0,
            cache_write_tokens=getattr(usage, "cache_creation_input_tokens", 0) or 0,
            latency_ms=(time.monotonic() - start) * 1000,
            cost_usd=calculate_cost(model, usage) if usage else 0.0,
            success=success,
            session_id=self.session_id,
        ))
```
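
The `metrics_sink` passed to `InstrumentedClient` only needs a `record()` method. As a minimal sketch of what that could look like, the hypothetical `InMemoryMetricsSink` below keeps running totals per endpoint in process memory. The class name and its helper methods are illustrative assumptions; a production sink would more likely write to your metrics store.

```python
from collections import defaultdict
from typing import Any


class InMemoryMetricsSink:
    """Hypothetical sink: keeps running totals per endpoint in process memory."""

    def __init__(self) -> None:
        self.records: list[Any] = []
        self.cost_by_endpoint: dict[str, float] = defaultdict(float)
        self.requests_by_endpoint: dict[str, int] = defaultdict(int)

    def record(self, metrics: Any) -> None:
        # Accepts any object exposing .endpoint and .cost_usd,
        # such as the RequestMetrics dataclass above.
        self.records.append(metrics)
        self.cost_by_endpoint[metrics.endpoint] += metrics.cost_usd
        self.requests_by_endpoint[metrics.endpoint] += 1

    def average_cost(self, endpoint: str) -> float:
        # Average cost per request for one endpoint; 0.0 if nothing was recorded
        count = self.requests_by_endpoint.get(endpoint, 0)
        if count == 0:
            return 0.0
        return self.cost_by_endpoint[endpoint] / count
```

An in-memory sink like this is handy in tests and local development because it lets you assert on recorded cost without standing up a database.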

### Building a Metrics Pipeline

For most teams, the right storage for LLM metrics is the same observability infrastructure you use for everything else — but organized to support cost queries.

A minimal schema for a metrics table (SQL or equivalent):

```sql
CREATE TABLE llm_requests (
    id              BIGSERIAL PRIMARY KEY,
    timestamp       TIMESTAMPTZ NOT NULL DEFAULT NOW(),
    endpoint        TEXT NOT NULL,
    model           TEXT NOT NULL,
    session_id      TEXT,
    user_id         TEXT,
    input_tokens    INTEGER NOT NULL,
    output_tokens   INTEGER NOT NULL,
    cache_read_tokens INTEGER DEFAULT 0,
    cache_write_tokens INTEGER DEFAULT 0,
    cost_usd        DECIMAL(10, 8) NOT NULL,
    latency_ms      INTEGER,
    success         BOOLEAN DEFAULT TRUE,
    metadata        JSONB
);

-- Indexes for common query patterns
CREATE INDEX idx_llm_requests_endpoint ON llm_requests(endpoint, timestamp);
CREATE INDEX idx_llm_requests_model ON llm_requests(model, timestamp);
CREATE INDEX idx_llm_requests_user ON llm_requests(user_id, timestamp);
```

With this schema, you can answer questions like:

```sql
-- Cost by endpoint, last 7 days
SELECT endpoint,
       SUM(cost_usd) as total_cost,
       AVG(cost_usd) as avg_cost,
       COUNT(*) as request_count,
       AVG(input_tokens) as avg_input,
       AVG(output_tokens) as avg_output
FROM llm_requests
WHERE timestamp > NOW() - INTERVAL '7 days'
GROUP BY endpoint
ORDER BY total_cost DESC;

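-- Cache efficiency (rates assume claude-sonnet-4-6: $3.00/M input, $0.30/M cache read, $3.75/M cache write)
-- Anthropic reports input_tokens exclusive of cached tokens, so total prompt size is the sum of all three columns.
SELECT endpoint,
       SUM(cache_read_tokens) as total_cached_reads,
       SUM(input_tokens + cache_read_tokens + cache_write_tokens) as total_input,
       SUM(cache_read_tokens)::FLOAT
           / NULLIF(SUM(input_tokens + cache_read_tokens + cache_write_tokens), 0) as cache_hit_rate,
       SUM(cost_usd) as actual_cost,
       -- Without caching, reads would cost $2.70/M more and writes $0.75/M less
       SUM(cost_usd + cache_read_tokens * 0.0000027 - cache_write_tokens * 0.00000075) as cost_without_cache,
       SUM(cache_read_tokens * 0.0000027 - cache_write_tokens * 0.00000075)
           / NULLIF(SUM(cost_usd + cache_read_tokens * 0.0000027 - cache_write_tokens * 0.00000075), 0) as savings_rate
FROM llm_requests
WHERE timestamp > NOW() - INTERVAL '7 days'
  AND model = 'claude-sonnet-4-6'
GROUP BY endpoint;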
```
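
The arithmetic behind a cache-efficiency figure is easy to get wrong, so it can help to work it through in plain Python. The sketch below assumes the Sonnet rates from the `PRICING` table and assumes that `input_tokens` excludes cached tokens, which is how Anthropic reports usage. The function name `cache_savings` is hypothetical, and the calculation covers input-side cost only.

```python
def cache_savings(
    input_tokens: int,
    cache_read_tokens: int,
    cache_write_tokens: int,
    input_rate: float = 3.00 / 1_000_000,
    cache_read_rate: float = 0.30 / 1_000_000,
    cache_write_rate: float = 3.75 / 1_000_000,
) -> dict[str, float]:
    """Input-side cost with and without prompt caching (hypothetical helper)."""
    # Total prompt size is the sum of uncached, cache-read, and cache-write tokens
    total_prompt = input_tokens + cache_read_tokens + cache_write_tokens
    actual = (
        input_tokens * input_rate
        + cache_read_tokens * cache_read_rate
        + cache_write_tokens * cache_write_rate
    )
    # Without caching, every prompt token is billed at the base input rate
    uncached = total_prompt * input_rate
    return {
        "cache_hit_rate": cache_read_tokens / total_prompt if total_prompt else 0.0,
        "actual_input_cost": actual,
        "uncached_input_cost": uncached,
        "savings_rate": (uncached - actual) / uncached if uncached else 0.0,
    }


example = cache_savings(
    input_tokens=200_000, cache_read_tokens=700_000, cache_write_tokens=100_000
)
```

With these example volumes, 70% of prompt tokens are served from cache and the input-side bill falls from $3.00 to about $1.19. Cache writes cost more than uncached input, so a low hit rate can make the savings negative.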

### Dashboards and Alerting

The goal of instrumentation is to make cost behavior visible and actionable. At minimum, you want:

**A cost dashboard** with:
- Daily spend by endpoint and model (trend line, last 30 days)
- Average tokens per request by endpoint (input and output separately)
- Cache hit rates by endpoint
- Cost per user segment if you have user tiers

**Alerts** for:
- Daily spend 30% above the rolling 7-day average (spike detection)
- Cache hit rate dropping below threshold (cache invalidation or prompt change issue)
- Average request cost above a per-endpoint ceiling (regression in efficiency)
- Any single request above an absolute cost threshold (runaway context)

These alerts don't require sophisticated infrastructure — a simple cron job that queries your metrics table and sends a Slack message covers the basics.

> **Warning:** Cost spikes often indicate bugs, not just usage spikes. A request with 50,000 tokens where you expect 2,000 is more likely an infinite loop or a context management bug than genuine usage. Alert on outlier requests, not just aggregate spend.
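
The first and last alert rules above can be sketched as two small functions that a cron job might call after querying the metrics table. The function names, the input shapes, and the choice to stay silent with fewer than seven days of history are assumptions for illustration, not a prescribed design.

```python
from statistics import mean


def spend_spike(daily_spend: list[float], today: float, threshold: float = 0.30) -> bool:
    """True when today's spend exceeds the trailing 7-day average by `threshold`."""
    window = daily_spend[-7:]
    if len(window) < 7:
        return False  # not enough history for a meaningful baseline
    baseline = mean(window)
    return baseline > 0 and today > baseline * (1 + threshold)


def outlier_requests(
    request_costs: list[tuple[str, float]], ceiling_usd: float
) -> list[tuple[str, float]]:
    """Return (request_id, cost) pairs above an absolute per-request ceiling."""
    return [(rid, cost) for rid, cost in request_costs if cost > ceiling_usd]
```

Keeping the checks as pure functions means the cron job only has to fetch rows and forward any hits to Slack, and the thresholds can be unit-tested without a database.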

### Attribution and Chargeback

If you're running LLM infrastructure across multiple teams or features, cost attribution is a management necessity. Without it, you can't hold teams accountable for their spending or make informed resource allocation decisions.

The minimum requirement for attribution is tagging every request with:
- Feature or product area
- Team or cost center
- User tier (if relevant to pricing)

This metadata allows you to produce a per-team cost breakdown that's actually useful for conversations about optimization priorities.

**Unit economics per feature.** For each feature that uses LLM APIs, calculate the cost per feature-specific value metric. For a summarization feature, that might be cost per document summarized. For a Q&A feature, cost per question answered. For a coding assistant, cost per code suggestion accepted.

This reframes cost from an infrastructure expense to a product unit cost, which is a much more actionable framing for business conversations.

> **Key Insight:** "We spent $30,000 on LLM APIs last month" is not actionable. "Our document summarization feature costs $0.047 per document, and it processed 250,000 documents" is — because you can now evaluate whether that cost is worth it at scale.

### Identifying Optimization Opportunities

Once you have measurement in place, opportunities surface quickly. Common patterns:

**The high-tail problem.** 5% of requests might account for 30% of cost. These outlier requests often reveal specific patterns: multi-turn conversations that went long, RAG queries that retrieved large contexts, or specific user behaviors that trigger expensive paths.

**The endpoint cost ranking surprise.** The most-called endpoint is often not the most expensive one. A rarely used but expensive feature can rank higher on cost than a high-volume cheap feature. Measurement reveals this; intuition rarely does.

**The quality-cost gap.** Correlating quality metrics with cost often reveals that expensive requests are not better — they're just longer. When high-quality and low-quality responses have the same average cost, you have headroom to optimize length without sacrificing quality.

> **Try This:** Generate a cost ranking of your top 10 API endpoints for the past 30 days. Rank them by total cost, then by request volume. If the two rankings differ substantially (they usually do), the endpoints that rank high on cost but low on volume are where your expensive, low-frequency requests are hiding.
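
The high-tail pattern described above can be checked directly from per-request costs. This is a minimal sketch; the function name `tail_cost_share` and the default 5% cut-off are illustrative assumptions.

```python
def tail_cost_share(costs: list[float], top_fraction: float = 0.05) -> float:
    """Share of total cost contributed by the most expensive `top_fraction` of requests."""
    if not costs:
        return 0.0
    ordered = sorted(costs, reverse=True)
    # Always include at least one request so tiny samples still produce a number
    k = max(1, round(len(ordered) * top_fraction))
    total = sum(ordered)
    return sum(ordered[:k]) / total if total else 0.0


# 5 expensive requests at $10 and 95 cheap ones at $1
share = tail_cost_share([10.0] * 5 + [1.0] * 95)
```

In this example the top 5% of requests carry about 34% of the cost. A share well above the fraction you passed in suggests that inspecting the individual outliers will pay off faster than tuning the average request.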

### Key Takeaways

1. You cannot optimize what you cannot measure. Per-request cost attribution is the prerequisite for everything else.
2. Log input tokens, output tokens, cached tokens, model, endpoint, and user segment on every request.
3. Aggregate metrics should surface cost by endpoint, model, and user tier on a daily basis.
4. Alerts should trigger on spend spikes, cache hit rate drops, and outlier requests — not just total spend.
5. Unit economics (cost per value metric, not just cost per request) is the framing that makes cost conversations actionable.

**Chapter Exercise:** Instrument one production API endpoint with full token count logging today. Build a simple daily cost query against the data and run it for one week. Write down the three most surprising things you learn.

---

## Chapter 8: Building a Token Budget System

The optimizations in the prior chapters are point solutions. This chapter is about making cost management a first-class architectural concern — building systems that enforce token budgets proactively rather than discovering overruns reactively.

A token budget system treats token consumption the way a well-run engineering team treats compute or database capacity: as a resource with predictable costs, explicit allocations, and automatic responses when limits are approached.

### What a Token Budget System Is

A token budget system has three components:

1. **Budget definitions** — explicit limits on token consumption per request, session, user, or time period
2. **Budget enforcement** — runtime logic that monitors consumption against limits and takes action when limits are approached or exceeded
3. **Budget visibility** — dashboards and alerts that surface consumption against budget in real time

This is distinct from pure monitoring (which tells you what happened) and pure optimization (which reduces consumption). A budget system actively constrains consumption in real time, which creates a forcing function for efficiency that pure observation doesn't provide.

### Designing Budget Hierarchies

Budgets can exist at multiple levels:

**Request-level budgets** are the most granular and the most impactful for preventing single runaway requests. They cap the tokens any single API call can consume.

**Session-level budgets** constrain the total token consumption across a multi-turn conversation. When a session approaches its budget, the system can trigger summarization, warn the user, or gracefully end the conversation.

**User-level budgets** implement per-user consumption limits — relevant for metered products or when you want to prevent individual users from disproportionately consuming shared capacity.

**Feature-level budgets** allocate token budgets across product features. If your document analysis feature is budgeted at 10M tokens/day, exceeding that limit triggers a fallback, a queue, or a rate limit — rather than runaway spend.

**Monthly budgets** provide the guardrail against month-end billing surprises. A monthly limit set at 120% of your expected spend, wired to an escalation path, catches overruns before they become catastrophic.

These budgets form a hierarchy. A request-level budget prevents single outliers; a session-level budget manages conversation growth; a feature-level budget distributes capacity across the product; a monthly budget is the absolute backstop.

### Implementing Request-Level Budgets

The simplest form of request-level budget enforcement is the `max_tokens` parameter combined with context size monitoring:

```python
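import anthropic


class TokenBudgetExceeded(Exception):
    """Raised when a request's estimated input exceeds its token budget."""


class BudgetedRequest: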
    def __init__(
        self,
        max_input_tokens: int = 8000,
        max_output_tokens: int = 500,
        model: str = "claude-haiku-4-5-20251001"
    ):
        self.max_input = max_input_tokens
        self.max_output = max_output_tokens
        self.model = model
        self.client = anthropic.Anthropic()

    def create(self, system: str, messages: list) -> anthropic.types.Message:
        # Estimate input size before calling
        estimated_input = self._estimate_tokens(system, messages)

        if estimated_input > self.max_input:
            raise TokenBudgetExceeded(
                f"Estimated input {estimated_input} exceeds budget {self.max_input}"
            )

        return self.client.messages.create(
            model=self.model,
            max_tokens=self.max_output,
            system=system,
            messages=messages,
        )

    def _estimate_tokens(self, system: str, messages: list) -> int:
        # Rough estimate: 4 chars per token
        total_chars = len(system)
        for m in messages:
            content = m.get("content", "")
            if isinstance(content, str):
                total_chars += len(content)
            elif isinstance(content, list):
                total_chars += sum(len(c.get("text", "")) for c in content)
        return total_chars // 4
```

A more robust implementation asks the provider to count tokens rather than relying on a character heuristic. Anthropic exposes token counting via the API:

```python
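import anthropic

client = anthropic.Anthropic()
INPUT_BUDGET = 8000  # per-request input budget, in tokens

# Count tokens before committing to the full request
# (system_prompt and messages come from the calling code)
token_count = client.messages.count_tokens(
    model="claude-haiku-4-5-20251001",
    system=system_prompt,
    messages=messages,
)

if token_count.input_tokens > INPUT_BUDGET:
    # Compress context before proceeding. compress_history is a placeholder
    # for your own compression routine (Chapter 3 covers strategies).
    system_tokens = len(system_prompt) // 4  # rough estimate of the system prompt's share
    messages = compress_history(messages, target_tokens=INPUT_BUDGET - system_tokens)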
```
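
The `compress_history` call in the snippet above is not defined in this guide. One simple way to fill it in, shown here as a sketch under stated assumptions, is to drop the oldest turns until a character-based estimate fits the target. The 4-characters-per-token heuristic, the `keep_recent` parameter, and the decision to trim rather than summarize are all illustrative choices.

```python
def estimate_tokens(text: str) -> int:
    # Rough heuristic used elsewhere in this chapter: about 4 characters per token
    return len(text) // 4


def compress_history(
    messages: list[dict], target_tokens: int, keep_recent: int = 2
) -> list[dict]:
    """Drop the oldest turns until the estimate fits (hypothetical helper).

    Always keeps at least the most recent `keep_recent` messages.
    """
    kept = list(messages)

    def total(msgs: list[dict]) -> int:
        # Only plain-string content is counted in this sketch
        return sum(
            estimate_tokens(m["content"])
            for m in msgs
            if isinstance(m.get("content"), str)
        )

    while len(kept) > keep_recent and total(kept) > target_tokens:
        kept.pop(0)

    # Trim any leading assistant turn so the history still opens with the user
    while len(kept) > 1 and kept[0]["role"] != "user":
        kept.pop(0)
    return kept
```

Dropping turns is the cheapest strategy and loses the most information. Swapping the loop body for a summarization call keeps the same interface while preserving more context.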

### Session Budget Management

Session budgets require tracking cumulative consumption across multiple turns. A session budget manager:

```python
class SessionBudgetManager:
    def __init__(
        self,
        session_id: str,
        total_token_budget: int = 50_000,
        compression_threshold: float = 0.7
    ):
        self.session_id = session_id
        self.total_budget = total_token_budget
        self.threshold = compression_threshold
        self.tokens_used = 0
        self.history = []

    @property
    def budget_remaining(self) -> int:
        return self.total_budget - self.tokens_used

    @property
    def budget_fraction_used(self) -> float:
        return self.tokens_used / self.total_budget

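    def add_turn(self, role: str, content: str):
        # Append each completed user/assistant turn so later requests can replay it
        self.history.append({"role": role, "content": content})

    def record_usage(self, input_tokens: int, output_tokens: int):
        self.tokens_used += input_tokens + output_tokens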

    def should_compress(self) -> bool:
        return self.budget_fraction_used >= self.threshold

    def get_context_for_request(self, new_message: str) -> list:
        if self.should_compress():
            return self._compressed_history() + [
                {"role": "user", "content": new_message}
            ]
        return self.history + [{"role": "user", "content": new_message}]

    def _compressed_history(self) -> list:
        # Summarize older history, keep recent turns verbatim
        if len(self.history) <= 4:
            return self.history

        older = self.history[:-4]
        recent = self.history[-4:]
        summary = self._summarize(older)

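        # The Messages API takes system text as a top-level parameter, not as a
        # "system" role inside messages, so carry the summary as a user turn.
        return [
            {"role": "user", "content": f"Summary of the earlier conversation: {summary}"},
            *recent
        ]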

    def _summarize(self, turns: list) -> str:
        # Cheap summarization call
        ...
```

> **Key Insight:** A session budget that triggers automatic compression at 70% usage heads off the cost spike at 100% without a user-visible interruption. The compression runs behind the scenes and keeps the session affordable, though summaries can drop detail, so spot-check answer quality on long sessions.

### Feature-Level Budget Enforcement

Feature-level budgets require a shared counter accessible across all instances of your application. Redis is the natural choice because it provides atomic increment operations, TTL support, and sub-millisecond reads — the right properties for a distributed budget counter that needs to be both accurate and fast.

```python
import redis
from datetime import datetime, timezone

class FeatureBudget:
    def __init__(
        self,
        feature_name: str,
        daily_token_limit: int,
        redis_client: redis.Redis
    ):
        self.feature = feature_name
        self.daily_limit = daily_token_limit
        self.redis = redis_client

    @property
    def _key(self) -> str:
        # datetime.utcnow() is deprecated; use an explicit UTC-aware timestamp
        date_str = datetime.now(timezone.utc).strftime("%Y-%m-%d")
        return f"token_budget:{self.feature}:{date_str}"

    def consume(self, tokens: int) -> bool:
        """Returns True if budget available, False if exceeded."""
        # Resolve the key once so a midnight rollover can't split
        # the increment and the rollback across two different days
        key = self._key
        current = self.redis.incrby(key, tokens)
        self.redis.expire(key, 86400 * 2)  # 48hr TTL

        if current > self.daily_limit:
            # Roll back: this request would push the counter over the limit
            self.redis.decrby(key, tokens)
            return False
        return True

    def remaining(self) -> int:
        current = int(self.redis.get(self._key) or 0)
        return max(0, self.daily_limit - current)

    def usage_fraction(self) -> float:
        current = int(self.redis.get(self._key) or 0)
        return current / self.daily_limit
```

Usage in your application:

```python
feature_budget = FeatureBudget(
    "document-analysis",
    daily_token_limit=5_000_000,
    redis_client=redis.Redis()  # pass a client instance, not the module; connection settings omitted
)

def analyze_document(doc: str) -> str:
    estimated_tokens = len(doc) // 4 + 500  # rough estimate

    if not feature_budget.consume(estimated_tokens):
        return queue_for_batch_processing(doc)  # graceful fallback

    return llm_analyze(doc)
```

The pattern extends naturally to soft warnings. A `usage_fraction()` check before each request lets you alert your on-call rotation at 80% consumption and start degrading gracefully at 90%, rather than hitting the wall at 100% with no warning.
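
The tiered response described above can be sketched as a small mapping from usage fraction to action. The `BudgetAction` enum and the `budget_action` function are hypothetical names; the 80% and 90% defaults come from the text, and the caller would pass in the value returned by `usage_fraction()`.

```python
from enum import Enum


class BudgetAction(Enum):
    PROCEED = "proceed"
    ALERT = "alert"      # notify on-call, keep serving normally
    DEGRADE = "degrade"  # switch to cheaper behavior
    REJECT = "reject"    # budget exhausted, use the fallback path


def budget_action(
    usage_fraction: float, alert_at: float = 0.80, degrade_at: float = 0.90
) -> BudgetAction:
    """Map a budget usage fraction to the action the application should take."""
    if usage_fraction >= 1.0:
        return BudgetAction.REJECT
    if usage_fraction >= degrade_at:
        return BudgetAction.DEGRADE
    if usage_fraction >= alert_at:
        return BudgetAction.ALERT
    return BudgetAction.PROCEED
```

Returning an enum rather than acting inside the budget class keeps the policy (what "degrade" means for a given feature) in application code, where product owners can reason about it.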

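> **Warning:** `incrby` and `decrby` are each atomic, but the increment-check-rollback sequence is not. Under high concurrency, a request that arrives between another request's increment and its rollback sees an inflated counter and can be rejected even though budget remains. For strict enforcement at thousands of requests per second, use a Lua script (run via `EVAL`) that performs the check and the increment atomically in a single round trip.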
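
A minimal sketch of the atomic variant mentioned in the warning is shown below. It relies on redis-py's `register_script`, which returns a callable that runs the Lua script on the server. The class name `AtomicFeatureBudget`, the argument layout, and the 48-hour TTL are assumptions, and the client is duck-typed here so the example does not depend on a live Redis connection.

```python
from typing import Any

# Check and increment in one server-side step; returns 1 if consumed, 0 if over limit
ATOMIC_CONSUME_LUA = """
local current = tonumber(redis.call('GET', KEYS[1]) or '0')
local tokens = tonumber(ARGV[1])
local limit = tonumber(ARGV[2])
if current + tokens > limit then
    return 0
end
redis.call('INCRBY', KEYS[1], tokens)
redis.call('EXPIRE', KEYS[1], tonumber(ARGV[3]))
return 1
"""


class AtomicFeatureBudget:
    """Hypothetical drop-in for the consume() path of FeatureBudget."""

    def __init__(self, redis_client: Any, daily_limit: int, ttl_seconds: int = 172_800):
        self.daily_limit = daily_limit
        self.ttl = ttl_seconds
        # register_script uploads the script once and returns a callable wrapper
        self._consume = redis_client.register_script(ATOMIC_CONSUME_LUA)

    def consume(self, key: str, tokens: int) -> bool:
        # The script never increments past the limit, so no rollback is needed
        result = self._consume(keys=[key], args=[tokens, self.daily_limit, self.ttl])
        return result == 1
```

Because Redis runs each script to completion before serving other commands, no other request can observe the counter between the check and the increment.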

### Monthly Budget Management

Feature-level daily budgets handle operational spikes. Monthly budgets handle the billing cycle. They're a different concern — one is about protecting real-time system behavior, the other is about preventing end-of-month surprises on a $50,000 invoice.

A monthly budget layer works best as a soft limit combined with an escalation path:

```python
import redis
from datetime import datetime, timezone

class MonthlyBudget:
    def __init__(
        self,
        monthly_usd_limit: float,
        redis_client: redis.Redis,
        alert_fn,
        soft_threshold: float = 0.80,
        hard_threshold: float = 1.0,
    ):
        # redis_client and alert_fn are required: record_spend cannot work without them
        self.monthly_limit = monthly_usd_limit
        self.soft_threshold = soft_threshold
        self.hard_threshold = hard_threshold
        self.redis = redis_client
        self.alert = alert_fn

    @property
    def _key(self) -> str:
        now = datetime.now(timezone.utc)
        return f"monthly_spend:{now.year}:{now.month:02d}"

    def record_spend(self, cost_usd: float):
        key = self._key
        new_total = float(self.redis.incrbyfloat(key, cost_usd))
        self.redis.expire(key, 86400 * 35)  # slightly over one month

        fraction = new_total / self.monthly_limit

        if fraction >= self.hard_threshold:
            self._alert_once(key, "hard", f"HARD LIMIT: Monthly LLM spend ${new_total:.2f} reached {fraction*100:.0f}% of budget")
        elif fraction >= self.soft_threshold:
            self._alert_once(key, "soft", f"WARNING: Monthly LLM spend ${new_total:.2f} at {fraction*100:.0f}% of budget")

    def _alert_once(self, key: str, level: str, message: str):
        # SET NX succeeds only for the first caller, so each threshold
        # alerts once per month instead of on every request after crossing it
        if self.redis.set(f"{key}:alerted:{level}", 1, nx=True, ex=86400 * 35):
            self.alert(message)

    def current_spend(self) -> float:
        return float(self.redis.get(self._key) or 0)

    def is_over_hard_limit(self) -> bool:
        return self.current_spend() >= self.monthly_limit * self.hard_threshold
```

The hard limit here doesn't automatically block requests — hard-blocking on a monthly counter creates failure modes that are worse than the overage. Instead, it triggers human review. Let the on-call engineer decide whether to cut spend or accept the overage. The system's job is to make sure that decision happens before the bill arrives.

### Budget Visibility and Dashboards

A budget system without visibility is theater. The enforcement logic tells you what's happening; dashboards tell you what's coming.

The key views for a budget dashboard:

**Budget burn rate.** For each feature, plot daily token consumption against the daily budget. A feature consistently burning at 90% of budget is a different problem than one that's bursty — the latter might need burst capacity rather than a higher baseline limit.

**Budget headroom by feature.** A simple table: feature name, daily limit, yesterday's consumption, 7-day average, days until limit at current burn rate. This tells you at a glance which features are running hot before they trip.

**Cost per unit of work.** For each feature, the cost per business unit — cost per document, cost per query answered, cost per code review. Plotted over time, this shows whether your optimization efforts are working. A rising cost-per-unit on a stable feature is a signal that something changed (prompt, model, context size) and nobody noticed.

**Budget utilization heatmap.** Token consumption by hour of day and day of week, by feature. Traffic patterns matter for capacity planning and for understanding why budget limits trip at specific times.

> **Key Insight:** Budget dashboards should face product managers, not just engineers. When a PM can see that the document analysis feature costs $0.047 per document and processed 250,000 documents last week, they have the information to make pricing and capacity decisions. Keep the data locked in infrastructure dashboards and those decisions get made with guesswork.
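
The utilization heatmap view reduces to a simple aggregation over request timestamps. The sketch below assumes events arrive as `(timestamp, tokens)` pairs for a single feature; the function name and the `(weekday, hour)` key layout are illustrative.

```python
from collections import defaultdict
from datetime import datetime


def utilization_heatmap(events: list[tuple[datetime, int]]) -> dict[tuple[int, int], int]:
    """Sum tokens into (weekday, hour) buckets; weekday 0 is Monday."""
    grid: dict[tuple[int, int], int] = defaultdict(int)
    for ts, tokens in events:
        grid[(ts.weekday(), ts.hour)] += tokens
    return dict(grid)


events = [
    (datetime(2026, 4, 20, 9, 15), 1_200),
    (datetime(2026, 4, 20, 9, 45), 800),
    (datetime(2026, 4, 21, 14, 5), 500),
]
heatmap = utilization_heatmap(events)
```

The resulting dictionary maps directly onto a 7-by-24 grid in most charting tools. Normalize timestamps to a single timezone before bucketing, or the busy hours will smear across adjacent cells.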

### Cost Anomaly Detection

Budget systems should include automatic anomaly detection — not just hard limits, but soft alerts when consumption patterns deviate from expectations.

A simple approach: maintain a rolling average of daily spend per feature and alert when the current day's spending exceeds the rolling average by more than 2 standard deviations.

```python
import numpy as np

def detect_spend_anomaly(
    feature: str,
    current_daily_spend: float,
    historical_daily_spends: list[float],
    alert_threshold: float = 2.0
) -> bool:
    if len(historical_daily_spends) < 7:
        return False  # Not enough history

    mean = np.mean(historical_daily_spends)
    std = np.std(historical_daily_spends)

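    if std == 0:
        # Flat history: fall back to a fixed 50% margin over the mean
        return bool(current_daily_spend > mean * 1.5)

    z_score = (current_daily_spend - mean) / std
    return bool(z_score > alert_threshold)  # return a plain bool, not numpy.bool_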
```

### Graceful Degradation Strategies

A budget system without graceful degradation is just a hard wall. When budgets are exceeded, the experience degrades — but degradation should be designed, not accidental.

Common graceful degradation strategies:

**Queue and batch.** When real-time budget is exhausted, queue requests for batch processing at half price. Users get their results later; you don't blow the budget.

**Model downgrade.** When an expensive model's quota is exhausted, route to a cheaper model with a transparent message that response quality might differ.

**Feature reduction.** Disable optional enhancements (detailed explanations, alternative suggestions, verbose formatting) when approaching budget limits. Core functionality remains available.

**Rate limiting with retry.** Implement rate limiting at the application layer with retry-after semantics. Users are told to try again in N minutes. The budget recovers as the time window rolls.

> **Warning:** A budget system that hard-fails without graceful degradation creates user-visible outages when it triggers. Design the fallback behavior before you hit the limit, not during the incident when you're over it.

### The Token Budget as Product Feature

In metered products — where users pay based on usage — the token budget system is a product feature, not just infrastructure. Users need to:

- See their current usage against their plan limit
- Get warnings before they hit limits
- Understand what happens when they do
- Have a path to increase their limit (upgrade)

The data model that supports your internal budget enforcement is the same data that powers the user-facing usage dashboard. Building it once to serve both purposes is more efficient than parallel implementations.

### Putting It Together: A Budget Architecture

A production token budget system integrates the components above into a coherent architecture:

```
Request arrives
       │
       ▼
Feature budget check ──── Over limit ──► Queue for batch / return 429
       │ (Redis counter)
   Available
       │
       ▼
Session budget check ──── Near limit ──► Trigger history compression
       │ (in-memory or Redis)
   Available
       │
       ▼
Request-level budget ──── Over limit ──► Compress context before API call
       │ (token count API)
   Within budget
       │
       ▼
API call (with max_tokens)
       │
       ▼
Record usage ──────────────────────────► Metrics pipeline
       │
       ▼
Update session, feature, and monthly counters
       │
       ▼
Check anomaly threshold ── Anomaly ────► Alert
       │
   Normal
       │
       ▼
Return response
```

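Most steps in this pipeline add negligible latency: the Redis operations complete in under a millisecond. The exception is token counting via the API, which adds roughly 50–100ms per request but prevents expensive mistakes. For most workloads that overhead is well worth it; on latency-critical paths, the character-based estimate is a cheaper first check.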
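
The flow in the diagram can be sketched as a single function with each check injected as a callable. Everything here is an illustrative assumption: the `PipelineHooks` dataclass, the hook names, and the response shape. The monthly counter update and the anomaly check are assumed to live inside `record_usage`.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PipelineHooks:
    feature_has_budget: Callable[[int], bool]       # e.g. a feature budget's consume()
    session_near_limit: Callable[[], bool]          # e.g. a session manager's should_compress()
    count_tokens: Callable[[list[dict]], int]       # estimate or token counting API
    compress: Callable[[list[dict]], list[dict]]    # history compression routine
    call_model: Callable[[list[dict]], tuple[str, int, int]]  # text, input, output tokens
    record_usage: Callable[[int, int], None]        # metrics, counters, anomaly check


def handle_request(
    messages: list[dict],
    hooks: PipelineHooks,
    input_budget: int,
    estimated_tokens: int,
) -> dict:
    # 1. Feature budget: reject or queue before doing any work
    if not hooks.feature_has_budget(estimated_tokens):
        return {"status": 429, "body": None}

    # 2. Session budget: compress history when the session is running hot
    if hooks.session_near_limit():
        messages = hooks.compress(messages)

    # 3. Request budget: compress again if this call is still too large
    if hooks.count_tokens(messages) > input_budget:
        messages = hooks.compress(messages)

    # 4. API call, then record actual usage against all counters
    text, input_tokens, output_tokens = hooks.call_model(messages)
    hooks.record_usage(input_tokens, output_tokens)
    return {"status": 200, "body": text}
```

Injecting the checks keeps the orchestration testable with stubs and lets each budget layer evolve independently, for example swapping a character estimate for the token counting API.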

> **Try This:** Define token budgets for your three most expensive features today — not to enforce them immediately, but to make the budgets explicit. Write them down: daily token limit, monthly dollar limit, graceful fallback behavior. You'll find that the act of writing the fallback behavior reveals product decisions you haven't made yet. Make them now, before a billing incident forces your hand.

### Key Takeaways

1. A token budget system makes cost management proactive rather than reactive. It treats tokens as a resource with explicit allocations, not an unlimited expense.
2. Budgets should exist at multiple levels: per-request, per-session, per-feature, and monthly — each serving a different purpose.
3. Session budget managers that trigger automatic compression at 70% usage prevent runaway conversation costs without interrupting the user experience.
4. Monthly budgets are soft guardrails that trigger human decisions, not automatic blocks — the goal is an alert before the invoice, not a service outage at an arbitrary threshold.
5. Budget dashboards belong in front of product managers and finance, not just in engineering runbooks. Cost per unit of work is the framing that drives real decisions.
6. Graceful degradation strategies — queue, downgrade, reduce — should be designed before limits are hit, not during incidents.

**Chapter Exercise:** Pick your most expensive feature and implement a daily token budget with a Redis counter. Set the initial limit at 150% of current average daily consumption (so it doesn't trigger immediately). Implement one graceful fallback. Run it for two weeks and observe whether the budget ever triggers. If it does, the fallback will catch it; if it doesn't, you've established your baseline for a tighter budget.

---

## Conclusion

Token economics is not a new discipline — it's cost engineering applied to a new resource. The principles are the same: measure before you optimize, understand the cost structure before you cut, and make changes that preserve quality while eliminating waste.

The eight chapters in this guide each address one lever. In practice, they work best as a portfolio. Prompt caching without context compression leaves the compressible portions expensive. Model routing without measurement leaves routing decisions uninformed. Budget enforcement without graceful degradation creates user-visible outages.

The teams that manage LLM costs effectively aren't the ones with the cleverest prompts or the most aggressive truncation strategies. They're the ones that treat inference spend as an engineering problem: instrumented, modeled, and continuously improved against explicit quality targets.

The numbers matter. A team running a $50,000/month inference bill that applies the strategies in this guide — context compression, prompt caching, model routing, batch processing — routinely reaches $20,000–$30,000/month without touching quality. That's a recurring saving that compounds across the life of the product.

More importantly, the discipline required to achieve those savings — measuring carefully, making decisions empirically, designing systems that degrade gracefully — produces better engineering across the board. Token economics is the forcing function. Better systems are the outcome.

---

## Appendix A: Glossary

**Attention mechanism.** The core computational operation in transformer models. During generation, it computes relationships between all tokens in context. Scales quadratically with context length, which is why long contexts are expensive.

**Autoregressive generation.** The mode by which LLMs produce output: one token at a time, each token conditioned on all prior tokens. Sequential by nature, which is why output tokens are computationally more expensive than input tokens.

**Batch inference.** Submitting multiple prompts as a single job for asynchronous processing, typically at 50% of the synchronous API cost. Suitable for non-real-time workloads.

**Cache hit rate.** The fraction of requests that benefit from an existing cache entry. For prompt caching, this is the fraction of requests where the cached prefix was already populated and valid.

**Cascade / routing architecture.** A pattern that routes requests to progressively more capable (and expensive) models based on the complexity or requirements of the request. Usually starts with a cheap model and escalates only when necessary.

**Chain-of-thought prompting.** A technique that instructs the model to reason step-by-step before producing a final answer. Improves accuracy on complex tasks; generates additional output tokens for the reasoning.

**Chunk.** In RAG systems, a unit of text extracted from a document for indexing and retrieval. Typically 200–1,000 tokens, depending on document type and granularity requirements.

**Context compression.** Techniques that reduce the number of tokens in context without proportionally reducing the information value. Includes summarization, extraction, and re-ranking.

**Context window.** The maximum number of tokens a model can process in a single inference call, including both input and output. Frontier models in 2026 typically support 128K–1M token windows.

**Effective token rate.** The blended cost per token for a workload, accounting for the actual mix of input and output tokens.

**Embedding.** A numerical vector representation of text. Used for semantic similarity comparison in RAG systems and semantic caches.

**Fine-tuning.** Adapting a pre-trained model to a specific task or domain by training on additional examples. Can produce smaller, specialized models that match frontier performance on narrow tasks.

**KV cache / Key-Value cache.** The internal state computed by transformer models during attention computation. Prompt caching stores and reuses this state across requests with a shared prefix.

**Maximal Marginal Relevance (MMR).** A retrieval strategy that selects documents to maximize both relevance to the query and diversity among selected documents.

**Output tokens.** Tokens generated by the model in response to the prompt. Typically 3–5× more expensive than input tokens due to sequential generation.

**Prompt caching.** An API feature offered by major providers that caches the KV state for a static prompt prefix, reducing the cost of subsequent requests that share the same prefix.

**RAG (Retrieval-Augmented Generation).** A technique that retrieves relevant documents from an external index and includes them in the prompt context before generating a response.

**Re-ranking.** A post-retrieval step that scores retrieved documents for relevance to the specific query, allowing the system to select only the most relevant subset for inclusion in context.

**Semantic cache.** An application-layer cache that stores LLM responses keyed by an embedding of the query. When a new query is semantically similar enough to a stored one, the cached response is returned without an API call.
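The lookup logic can be sketched as a small in-memory class. Everything here is hypothetical: `SemanticCache`, the caller-supplied `embed_fn`, and the 0.9 threshold are assumptions for illustration, and a real deployment would use a vector index rather than a linear scan.

```python
import math
from typing import Callable, Optional


class SemanticCache:
    """Toy in-memory semantic cache; embed_fn is supplied by the caller."""

    def __init__(self, embed_fn: Callable[[str], list[float]],
                 threshold: float = 0.9) -> None:
        self.embed_fn = embed_fn
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries: list[tuple[list[float], str]] = []

    @staticmethod
    def _cosine(a: list[float], b: list[float]) -> float:
        dot = sum(x * y for x, y in zip(a, b))
        norm = (math.sqrt(sum(x * x for x in a))
                * math.sqrt(sum(y * y for y in b)))
        return dot / norm if norm else 0.0

    def put(self, query: str, response: str) -> None:
        """Store a response under the embedding of its query."""
        self.entries.append((self.embed_fn(query), response))

    def get(self, query: str) -> Optional[str]:
        """Return the closest stored response, or None below the threshold."""
        q = self.embed_fn(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            s = self._cosine(q, vec)
            if s > best_score:
                best_score, best_response = s, response
        return best_response if best_score >= self.threshold else None
```

The threshold is the main tuning knob: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses.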

**System prompt.** The instructions provided to an LLM at the beginning of a context window, typically defining the model's role, behavior, and constraints. Usually static across requests.

**Token.** The basic unit of text processed by LLMs. Approximately 4 characters or 0.75 words in English, though this varies by language and content type. The unit of API billing.

**Token budget.** An explicit allocation of token consumption for a request, session, feature, or time period. Enforced at runtime to prevent cost overruns.
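Runtime enforcement can be as simple as a counter that refuses work once the allocation is spent. This is a hedged sketch with hypothetical names (`TokenBudget`, `BudgetExceeded`), not the implementation described elsewhere in this guide.

```python
class BudgetExceeded(Exception):
    """Raised when a request would push usage past the budget."""


class TokenBudget:
    """Tracks token consumption against a fixed limit."""

    def __init__(self, limit: int) -> None:
        self.limit = limit
        self.used = 0

    @property
    def remaining(self) -> int:
        return self.limit - self.used

    def reserve(self, tokens: int) -> None:
        """Record usage, or raise before the overrun happens."""
        if tokens < 0:
            raise ValueError("tokens must be non-negative")
        if tokens > self.remaining:
            raise BudgetExceeded(
                f"requested {tokens}, only {self.remaining} remaining")
        self.used += tokens


# Hypothetical per-session budget of 10,000 tokens.
session_budget = TokenBudget(limit=10_000)
session_budget.reserve(4_000)  # leaves 6,000 tokens
```

Checking before the API call, using an estimated or counted token total, is what turns the budget from a report into a control.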

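**TPM (Tokens Per Minute).** The rate limit unit used by most LLM API providers. The number of tokens that can be processed within a rolling 60-second window. Some providers count input and output tokens together, while others meter them separately, so check the provider's rate limit documentation.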
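A client-side guard for a rolling window might look like the sketch below. The class name and the injectable `clock` parameter are assumptions for illustration; providers enforce their own limits server-side, and their exact windowing may differ.

```python
import time
from collections import deque
from typing import Callable


class RollingTokenWindow:
    """Client-side check against a tokens-per-minute limit."""

    def __init__(self, tpm_limit: int,
                 clock: Callable[[], float] = time.monotonic,
                 window_seconds: float = 60.0) -> None:
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window_seconds = window_seconds
        self._events: deque[tuple[float, int]] = deque()  # (timestamp, tokens)
        self._in_window = 0

    def _evict(self, now: float) -> None:
        # Drop usage records that have aged out of the window.
        while self._events and now - self._events[0][0] >= self.window_seconds:
            _, tokens = self._events.popleft()
            self._in_window -= tokens

    def try_consume(self, tokens: int) -> bool:
        """Record usage and return True if it fits, else return False."""
        now = self.clock()
        self._evict(now)
        if self._in_window + tokens > self.tpm_limit:
            return False
        self._events.append((now, tokens))
        self._in_window += tokens
        return True
```

A caller that gets `False` can queue the request or back off, rather than sending it and absorbing a rate limit error.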

---

## Appendix B: Tools & Resources

### API Documentation

- **Anthropic API Reference** — Covers message batches, prompt caching, token counting, and model pricing. Primary reference for Claude-based systems.
- **OpenAI API Reference** — Equivalent documentation for GPT models, including Batch API and caching behavior.
- **Google AI Studio / Vertex AI** — Documentation for Gemini models, including context caching and batch prediction.

### Token Counting and Cost Calculation

- **Anthropic token counting endpoint** (`POST /v1/messages/count_tokens`) — Count tokens before committing to a full request.
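- **tiktoken** (OpenAI) — Python library implementing the tokenizers used by OpenAI models. Useful for estimating costs before API calls; counts do not transfer exactly to other providers' models.
- **Anthropic Tokenizer** — Older versions of the Anthropic Python SDK included a local token counter, but it does not match the tokenizer used by current Claude models. For accurate counts, use the token counting endpoint above.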

### Context Management

- **LangChain ConversationSummaryMemory** — Implements conversation summarization for long contexts.
- **LangChain ContextualCompressionRetriever** — Retrieves and compresses document chunks based on query relevance.
- **LlamaIndex** — Framework with built-in support for hierarchical retrieval, re-ranking, and context window management.

### Observability and Cost Tracking

- **LangSmith** — Observability platform for LLM applications with per-request cost tracking and quality evaluation.
- **Helicone** — Open-source LLM observability proxy that logs requests and responses with cost attribution.
- **Phoenix (Arize)** — Observability and evaluation platform with LLM cost tracking.
- **Langfuse** — Open-source LLM observability with per-session cost tracking and user attribution.

### Semantic Caching

- **GPTCache** — Library implementing semantic caching with multiple embedding backends.
- **Redis** — General-purpose caching infrastructure, used for feature-level budget enforcement and exact-match response caching.

### Evaluation

- **RAGAS** — Evaluation framework for RAG systems, measuring retrieval quality and answer faithfulness.
- **DeepEval** — LLM evaluation framework with cost-adjusted metrics.
- **PromptFoo** — Open-source tool for testing and evaluating LLM outputs against quality benchmarks.

### Cost Modeling

- **LLM pricing calculators (community tools)** — Several community-maintained calculators compare providers for a specific token mix. Search for current options, as both the tools and the prices they track change frequently.

---

## Appendix C: Further Reading

### On LLM Architecture and Cost Fundamentals

**"Attention Is All You Need"** (Vaswani et al., 2017) — The original transformer paper. Understanding the attention mechanism and autoregressive decoding explains why long contexts are expensive and why output tokens cost more than input tokens.

**"Scaling Laws for Neural Language Models"** (Kaplan et al., 2020) — Documents the relationship between model size, compute, and performance. The conceptual foundation for understanding capability-cost tradeoffs across model tiers.

### On Retrieval-Augmented Generation

**"Lost in the Middle: How Language Models Use Long Contexts"** (Liu et al., 2023) — Empirical demonstration that models retrieve information more reliably from the beginning and end of long contexts. The basis for context position engineering.

**"REPLUG: Retrieval-Augmented Black-Box Language Models"** (Shi et al., 2023) — Research on optimizing RAG architectures, including the effect of retrieval precision on downstream quality.

### On Prompt Engineering and Efficiency

**"Large Language Models Are Human-Level Prompt Engineers"** (Zhou et al., 2022) — Introduces automatic generation and selection of instructions, and shows that instruction wording significantly affects task performance. Useful background for prompt-level output control.

**"Chain-of-Thought Prompting Elicits Reasoning in Large Language Models"** (Wei et al., 2022) — Original paper demonstrating chain-of-thought reasoning improvements. Reports the accuracy gains on reasoning benchmarks, which you can weigh against the extra output tokens the reasoning consumes.

### On Model Distillation and Fine-Tuning

**"Knowledge Distillation: A Survey"** (Gou et al., 2021) — Comprehensive overview of techniques for training smaller models to match larger model performance.

**"LIMA: Less Is More for Alignment"** (Zhou et al., 2023) — Demonstrates that fine-tuning on a small, high-quality dataset can produce surprisingly capable specialized models. Relevant to the economics of distillation.

### On Production LLM Systems

**"Building LLM-Powered Applications"** (various authors, O'Reilly, 2024) — Practical coverage of production architecture patterns for LLM systems, including cost management.

**MLOps.community and AI Engineer World's Fair proceedings** — Conference talks from practitioners running LLM systems at scale. More current than most published research and grounded in production experience.

---

*Token Economics: Cutting Your LLM Bill Without Cutting Quality*
*Version 1.0 — April 2026*
*By David Kelly Price*


---

## Related Blog Posts

- [Token Waste Is a Solvable Problem](https://pyckle.co/blog/token-waste-is-a-solvable-problem.html)
- [68% of That LLM Bill Was Optional](https://pyckle.co/blog/sixty-eight-percent-of-that-llm-bill-was-optional.html)
- [The Token Tax Is Real](https://pyckle.co/blog/the-token-tax-is-realand-developers-are-finally-doing-something-about-it.html)
- [1 Million Tokens on a Budget](https://pyckle.co/blog/1-million-tokens-on-a-budget-gpu-changes-nothing-and-everything.html)
