Engineers · ~60–90 pages

KV Cache and Inference Optimization

The Infrastructure Layer That Determines Your Real LLM Costs

Free Ebook · EPUB + Markdown · By David Kelly Price

About This Ebook

Who it's for: platform engineers and ML engineers who run inference infrastructure, deploy or evaluate self-hosted LLMs and inference APIs, and are responsible for latency and cost.

What you'll learn:

  • How Attention Works (and Why It Matters for Cost)
  • The KV Cache Explained
  • Prompt Caching in Hosted APIs
  • Prefill vs. Decode: Where Time Goes
  • Batching Strategies for Throughput
  • Quantization and Its Impact on Cache
  • Speculative Decoding
  • Monitoring Inference Performance

Get instant access to the EPUB and Markdown versions — read offline, share freely, and explore at your own pace.
