Engineers
~60–90 pages
KV Cache and Inference Optimization
The Infrastructure Layer That Determines Your Real LLM Costs
Free Ebook
EPUB + Markdown
By David Kelly Price
About This Ebook
For platform engineers and ML engineers who run inference infrastructure: those deploying or evaluating self-hosted LLMs and inference APIs, and responsible for latency and cost.
What you'll learn:
- How Attention Works (and Why It Matters for Cost)
- The KV Cache Explained
- Prompt Caching in Hosted APIs
- Prefill vs. Decode: Where Time Goes
- Batching Strategies for Throughput
- Quantization and Its Impact on Cache
- Speculative Decoding
- Monitoring Inference Performance
Get instant access to the EPUB and Markdown versions — read offline, share freely, and explore at your own pace.
Free Semantic Code Search
Try Pyckle in your codebase
The tool this book explores — semantic search, context routing, and code intelligence for Claude Code.