Engineers · ~60–90 pages

KV Cache and Inference Optimization

The Infrastructure Layer That Determines Your Real LLM Costs

Free ebook (EPUB + Markdown) by David Kelly Price

About This Ebook

Who it's for: platform engineers and ML engineers who run inference infrastructure, deploy or evaluate self-hosted LLMs and inference APIs, and are responsible for latency and cost.

What you'll learn:

How Attention Works (and Why It Matters for Cost)
The KV Cache Explained
Prompt Caching in Hosted APIs
Prefill vs. Decode: Where Time Goes
Batching Strategies for Throughput
Quantization and Its Impact on Cache
Speculative Decoding
Monitoring Inference Performance

Get instant access to the EPUB and Markdown versions — read offline, share freely, and explore at your own pace.

Free Semantic Code Search

Try Pyckle in your codebase

The tool this book explores — semantic search, context routing, and code intelligence for Claude Code.
