Sixty-Eight Percent of That LLM Bill Was Optional

A post made the rounds this week across r/mlops and r/ClaudeAI: someone broke down their team's $3,200 monthly LLM bill and found that 68% of it, over two thousand dollars, was preventable waste.

The responses split predictably. One camp demanded the breakdown methodology. Another insisted their use case was different. A third quietly bookmarked the thread and went back to ignoring their own token budget.

That last group is the largest. Most teams have no idea where their tokens actually go.

The Visibility Problem

Token billing is opaque by design. A monthly invoice arrives. The number is larger than last month. Someone on the team says "we're using it more" and that becomes the explanation.

Usage is not the same as utility. A model processing 100,000 tokens does not mean 100,000 tokens contributed to the output. Most of what gets sent to LLMs—context, system prompts, conversation history, retrieved documents—is overhead. Necessary overhead in some cases. Redundant in most.

The pattern is consistent: teams know their total spend but cannot answer basic questions about token usage. Which query types consume the most? What percentage of context actually influences the response? How often does the same information get sent repeatedly across sessions?

Without answers, token optimization is guesswork.

Where Waste Accumulates

The 68% figure tracks with what other production teams report. The specific sources of waste are consistent enough to be predictable.

Redundant context loading. The same files, documentation, and system prompts sent with every request. Session-aware caching could eliminate most of this. Few teams implement it.

Naive retrieval. Retrieval-augmented generation (RAG) systems configured to return everything that passes a similarity threshold instead of what actually matters for the query. Twenty retrieved chunks when three would suffice. The model processes all of them. The invoice reflects all of them.

Retry overhead. Malformed prompts, context window overflows, rate limit errors—all triggering retries that double or triple the token cost of a single interaction. Monitoring catches the retries. It does not always catch the root cause.

Kitchen-sink context. The assumption that more information means better results, so everything gets included. Full file contents when function signatures would work. Entire documentation pages when a paragraph is relevant. Dependency trees that exist "just in case."

None of these are malicious. They are the default behaviors of tools optimized for capability, not efficiency.
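To make the "function signatures instead of full file contents" point concrete, here is a minimal Python sketch that reduces a source file to its signatures using the standard-library `ast` module (Python 3.9+ for `ast.unparse`). The `signatures` helper and the sample source are illustrative assumptions, not something from the original post, and whether signatures alone are enough depends on the question being asked.

```python
import ast

# Hypothetical file contents; the bodies are never executed, only parsed.
SAMPLE = '''
def login(user: str, password: str) -> bool:
    record = lookup(user)
    return record is not None and record.check(password)


def logout(user: str) -> None:
    sessions.pop(user, None)
'''


def signatures(source: str) -> list[str]:
    """Return one 'def name(args) -> type' line per function found in source."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            keyword = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            returns = f" -> {ast.unparse(node.returns)}" if node.returns else ""
            found.append(f"{keyword} {node.name}({ast.unparse(node.args)}){returns}")
    return found


compact = "\n".join(signatures(SAMPLE))
print(compact)
print(f"{len(SAMPLE)} characters reduced to {len(compact)}")
```

Character counts are only a stand-in here; actual savings should be measured with the tokenizer of the model in use.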

Less Context Often Means Better Output

Bigger context windows do not justify filling them. That assumption costs money and degrades output.

Models do not weight all context equally. The lost-in-the-middle problem is well documented: critical information buried on page seven of a context dump gets less attention than irrelevant information on page one. Recent tokens influence attention more than distant ones. Signal drowns in noise.

Research on context utilization points the same way. Models given curated, relevant context often outperform models given everything. Past a certain point, the additional tokens do not help. They hurt.

This creates an odd alignment between context compression and output quality. The path to lower bills and the path to better results overlap. Tighter context. Smarter retrieval. Intentional inclusion instead of "send everything and let the model sort it out."

The 68% waste figure is not just about dollars. It represents tokens that made the output worse while making the invoice larger.

The Monitoring Gap

Token costs become manageable when they become visible.

The post that started this conversation succeeded because someone actually instrumented their LLM usage. They tracked token consumption by query type, by workflow, by time of day. The waste became obvious once they could see it.

Most teams do not have this visibility. Usage aggregates into a single monthly number. Optimization conversations happen in the abstract—"we should probably use caching" or "retrieval could be smarter"—without data to prioritize.

The infrastructure exists. Proxies that track per-request token consumption. Logging systems that categorize queries. Dashboards that surface patterns. The challenge is treating this as real infrastructure instead of a nice-to-have.
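As a rough illustration of per-request tracking, the Python sketch below keeps an in-memory ledger and reports each query type's share of total tokens. The `TokenLedger` class, the query-type labels, and the token counts are all hypothetical; a real setup would take the counts from the provider's usage metadata and persist them somewhere durable.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RequestRecord:
    query_type: str      # hypothetical label assigned by the calling code
    input_tokens: int
    output_tokens: int


class TokenLedger:
    """Accumulates per-request token counts so usage can be split by query type."""

    def __init__(self) -> None:
        self.records: list[RequestRecord] = []

    def log(self, query_type: str, input_tokens: int, output_tokens: int) -> None:
        self.records.append(RequestRecord(query_type, input_tokens, output_tokens))

    def totals_by_type(self) -> dict[str, int]:
        totals: dict[str, int] = defaultdict(int)
        for record in self.records:
            totals[record.query_type] += record.input_tokens + record.output_tokens
        return dict(totals)

    def share_by_type(self) -> dict[str, float]:
        totals = self.totals_by_type()
        grand_total = sum(totals.values())
        if grand_total == 0:
            return {}
        return {name: count / grand_total for name, count in totals.items()}


# Made-up numbers, purely for demonstration.
ledger = TokenLedger()
ledger.log("code_review", input_tokens=9_000, output_tokens=1_000)
ledger.log("chat", input_tokens=1_500, output_tokens=500)
ledger.log("code_review", input_tokens=7_000, output_tokens=1_000)
print(ledger.share_by_type())
```

Even a breakdown this coarse answers the first question a bill raises: which query types consume the most.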

The teams that treat token monitoring like they treat application performance monitoring—essential, not optional—tend to find similar efficiency gains. The 68% was always there. It just was not visible.

What Optimization Actually Looks Like

The developers controlling their LLM costs are not using worse models or accepting worse outputs. They are treating context like a resource to be managed.

Semantic retrieval replaces file dumps. Rather than sending entire directories, a chunk-level index matched against each query pulls only what matters. Even generic embedding models outperform alphabetical file listings.
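A minimal sketch of the difference between "everything above a threshold" and "the best few", assuming chunks have already been embedded. The two-dimensional vectors, chunk names, and the `retrieve` function are toy stand-ins invented for this example; real embeddings come from an embedding model and have hundreds of dimensions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec, chunks, top_k=3, min_score=0.0):
    """Return at most top_k (score, text) pairs scoring >= min_score, best first."""
    scored = [(cosine(query_vec, vec), text) for text, vec in chunks]
    scored = [pair for pair in scored if pair[0] >= min_score]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Hypothetical pre-embedded chunks: (text, vector).
chunks = [
    ("auth middleware", [1.0, 0.0]),
    ("login handler", [0.9, 0.1]),
    ("db schema", [0.0, 1.0]),
    ("token refresh", [0.8, 0.2]),
    ("css theme", [0.1, 0.9]),
]
query = [1.0, 0.0]

threshold_only = retrieve(query, chunks, top_k=len(chunks), min_score=0.1)
capped = retrieve(query, chunks, top_k=3, min_score=0.1)
print([text for _, text in threshold_only])
print([text for _, text in capped])
```

With a loose threshold, the weakly related "css theme" chunk still gets sent; the `top_k` cap is what keeps the payload down to the strongest matches.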

Session-aware caching eliminates redundancy. A file analyzed three requests ago does not need full content again. Diffs capture what changed. Summaries capture what matters. References point back to cached context.
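One way to picture this is a small per-session cache that sends a file in full the first time, a short reference when nothing changed, and a unified diff when it did. The `SessionContextCache` class and its message formats are assumptions made for this sketch, and the approach only works if the earlier content is still available to the model, for example in conversation history.

```python
import difflib


class SessionContextCache:
    """Remembers what was last sent for each path within one session."""

    def __init__(self) -> None:
        self._sent: dict[str, str] = {}  # path -> content as last sent

    def payload_for(self, path: str, content: str) -> str:
        previous = self._sent.get(path)
        self._sent[path] = content
        if previous is None:
            # First time this session: send the full content.
            return f"# {path}\n{content}"
        if previous == content:
            # Nothing changed: a one-line reference replaces the body.
            return f"[{path}: unchanged since earlier in this session]"
        # Changed: send only a unified diff against the version sent earlier.
        diff = difflib.unified_diff(
            previous.splitlines(),
            content.splitlines(),
            fromfile=f"{path} (sent earlier)",
            tofile=f"{path} (current)",
            lineterm="",
        )
        return "\n".join(diff)


cache = SessionContextCache()
v1 = "def f():\n    return 1\n"
v2 = "def f():\n    return 2\n"
first = cache.payload_for("auth.py", v1)    # full content
second = cache.payload_for("auth.py", v1)   # short reference
third = cache.payload_for("auth.py", v2)    # diff only
print(second)
print(third)
```

For small files a diff can be longer than the file itself, so a production version would probably compare sizes and send whichever is shorter.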

Query-specific scoping limits context to what the question actually needs. An authentication question does not require database schemas. A UI bug does not need backend service configurations.
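A deliberately simple sketch of scoping, using keyword routing to decide which paths are eligible for inclusion. The topic names, keywords, and directory layout are all hypothetical, and keyword matching is a stand-in chosen for brevity; a classifier or embedding-based router could fill the same role.

```python
import re

# Hypothetical mapping from topic to trigger words and to path prefixes.
TOPIC_KEYWORDS = {
    "auth": ("login", "password", "token", "session"),
    "ui": ("button", "layout", "css", "render"),
    "database": ("query", "schema", "migration", "index"),
}
TOPIC_PATHS = {
    "auth": ("src/auth/",),
    "ui": ("src/frontend/",),
    "database": ("src/db/", "migrations/"),
}


def topics_for(question: str) -> set[str]:
    """Return every topic whose keywords appear as whole words in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    return {topic for topic, kws in TOPIC_KEYWORDS.items() if words & set(kws)}


def scoped_files(question: str, all_files: list[str]) -> list[str]:
    """Keep only files under the path prefixes of the matched topics."""
    prefixes = tuple(p for topic in topics_for(question) for p in TOPIC_PATHS[topic])
    if not prefixes:
        return []  # no topic matched; the caller decides on a fallback
    return [path for path in all_files if path.startswith(prefixes)]


files = [
    "src/auth/login.py",
    "src/auth/tokens.py",
    "src/db/schema.sql",
    "src/frontend/app.tsx",
    "migrations/001_init.sql",
]
print(scoped_files("Why does login fail after the token expires?", files))
```

Returning an empty list when nothing matches is a design choice for the sketch: it forces an explicit fallback rather than quietly sending everything.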

Structured prompts reduce retries. Clear, consistent formatting. Explicit output requirements. Error handling that prevents malformed responses from cascading into multiple retries.
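The sketch below shows one way to keep retries from cascading: require a parseable format (JSON here), cap the number of attempts, and count the tokens every attempt costs. The `call_with_budget` helper and the `fake_model` stub are hypothetical; `call_model` stands in for whatever client function a team uses and is assumed to return the response text plus the tokens consumed.

```python
import json
from typing import Any, Callable, Optional


def call_with_budget(
    call_model: Callable[[str], tuple[str, int]],
    prompt: str,
    max_attempts: int = 2,
) -> tuple[Optional[Any], int, int]:
    """Call the model until it returns valid JSON or attempts run out.

    Returns (parsed_result_or_None, total_tokens_spent, attempts_made).
    """
    total_tokens = 0
    for attempt in range(1, max_attempts + 1):
        text, used = call_model(prompt)
        total_tokens += used  # failed attempts are billed too
        try:
            return json.loads(text), total_tokens, attempt
        except json.JSONDecodeError:
            continue  # malformed output: retry, but only up to the cap
    return None, total_tokens, max_attempts


# Stub model: malformed output first, valid JSON second (made-up token counts).
_responses = iter([("Sure! Here you go", 1200), ('{"status": "ok"}', 1200)])


def fake_model(prompt: str) -> tuple[str, int]:
    return next(_responses)


result, tokens, attempts = call_with_budget(fake_model, "Reply with JSON only.")
print(result, tokens, attempts)
```

Logging the total alongside the attempt count makes the doubling visible: one malformed response turned a 1,200-token interaction into a 2,400-token one.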

None of this requires exotic infrastructure. It requires deciding that token count matters enough to engineer for it.

The Cost Curve

Total LLM spend is heading in one direction: up. Models get more capable. Context windows expand. Enterprise adoption grows, and teams integrate AI into more workflows.

The 68% waste figure is from a team that decided to look. The teams that have not looked are probably wasting more. The teams scaling up usage without addressing the underlying inefficiency will see linear cost growth where they expected efficiency gains.
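The arithmetic behind that linear growth is short enough to write out. This sketch takes the $3,200 bill and 68% waste share from the post and assumes, purely for illustration, that the waste share stays constant as usage grows; the usage multipliers are hypothetical.

```python
MONTHLY_BILL = 3200.00   # from the post
WASTE_FRACTION = 0.68    # from the post; assumed constant as usage scales


def project(bill: float, waste_fraction: float, usage_multiplier: float) -> tuple[float, float]:
    """Return (projected_bill, projected_waste) for a given usage multiplier."""
    scaled_bill = bill * usage_multiplier
    return scaled_bill, scaled_bill * waste_fraction


for multiplier in (1, 2, 5):  # hypothetical growth scenarios
    total, waste = project(MONTHLY_BILL, WASTE_FRACTION, multiplier)
    print(f"{multiplier}x usage: bill ${total:,.0f}, of which waste ${waste:,.0f}")
```

Under that assumption the wasted dollars grow exactly as fast as the bill does.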

This is a solvable problem. The waste is not inherent to LLM usage. It is inherent to LLM usage without visibility, without retrieval optimization, without context management. Those are infrastructure choices, not model limitations.

The meter keeps running. What it runs on is optional.
