Claude Opus 4.5 dropped last week, and the conversation shifted almost immediately from capability to cost. Reddit threads lit up with developers sharing their experiences: impressive outputs, impressive invoices. One post in r/ClaudeAI put it bluntly—"Opus burns so many tokens that I'm not sure every company can afford this cost."
The sentiment isn't new. But the urgency is.
The Math That Changed
When GPT-3 launched, token costs were a curiosity. Most applications fit comfortably within limits, and the bills were manageable enough to ignore. That era is over.
Modern AI coding assistants don't just process prompts; they ingest entire codebases. They read documentation, analyze dependencies, understand architectural patterns. The context window isn't a constraint anymore; it's a firehose. Claude's 200K token window, GPT-4 Turbo's 128K, Gemini's million-token experiments: these aren't features. They're invitations to spend.
And developers accepted the invitation. The default behavior became "send everything." Repository dumps. Full file trees. Entire documentation sites. The reasoning was sound: more context means better outputs. Give the model everything it might need, and it'll figure out what's relevant.
That reasoning has a cost. A literal one.
The 89% Problem
A recent post caught attention not for its complaint but for its solution. A developer built a CLI proxy for Claude Code sessions and reported saving 10 million tokens—89% of their previous usage. The mechanism wasn't sophisticated: aggressive caching, deduplication of repeated context, and smarter batching of requests.
The numbers reveal something uncomfortable. Most of those tokens weren't doing useful work. They were overhead—the same files sent repeatedly, the same context rebuilt from scratch, the same information transmitted because the tooling didn't remember what it already knew.
This isn't a model problem. It's an architecture problem.
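To make the deduplication idea concrete, here is a minimal sketch of a session cache that skips files the model has already seen unchanged. It is not the proxy from the post, whose internals aren't described here; the `ContextCache` class and its `filter_unsent` method are hypothetical names invented for illustration.

```python
import hashlib


class ContextCache:
    """Tracks which file contents were already sent in this session (hypothetical helper)."""

    def __init__(self) -> None:
        # Maps file path -> SHA-256 digest of the version last sent.
        self._sent: dict[str, str] = {}

    def filter_unsent(self, files: dict[str, str]) -> dict[str, str]:
        """Return only the files that are new or changed since they were last sent."""
        fresh: dict[str, str] = {}
        for path, content in files.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self._sent.get(path) != digest:
                fresh[path] = content
                self._sent[path] = digest
        return fresh


cache = ContextCache()

# First request: nothing has been sent yet, so both files go out.
first = cache.filter_unsent({
    "auth.py": "def login(): ...",
    "db.py": "def connect(): ...",
})

# Second request: only db.py changed, so only db.py is resent.
second = cache.filter_unsent({
    "auth.py": "def login(): ...",
    "db.py": "def connect(timeout): ...",
})

print(sorted(first), sorted(second))
```

A real tool would also need to handle the model's context being reset or truncated, at which point the cache has to be invalidated. The sketch ignores that.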
Context as a System, Not a Dump
The instinct to maximize context makes sense at the individual query level. If the model might need a file, include it. If it might reference documentation, send it. The cost of missing context is a wrong answer. The cost of extra context is... invisible. Until the invoice arrives.
But context doesn't have to be all-or-nothing. The developers finding success with token efficiency aren't using less capable models or accepting worse outputs. They're treating context like a system—with retrieval, caching, and selective inclusion based on actual relevance.
The difference between sending 200K tokens and sending 20K relevant tokens isn't just cost. It's often quality. Models perform better with focused context than with everything-and-the-kitchen-sink. The signal-to-noise ratio matters.
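The cost side of that comparison is simple arithmetic. The sketch below uses a placeholder price and request volume, both assumptions chosen for illustration rather than quoted rates, to show how the gap compounds over a day of requests.

```python
def request_cost(input_tokens: int, price_per_million: float) -> float:
    """Input-token cost of one request in dollars."""
    return input_tokens / 1_000_000 * price_per_million


# Both values are hypothetical placeholders, not real pricing or usage data.
PRICE_PER_MILLION = 5.00
REQUESTS_PER_DAY = 200

full_dump = request_cost(200_000, PRICE_PER_MILLION) * REQUESTS_PER_DAY
focused = request_cost(20_000, PRICE_PER_MILLION) * REQUESTS_PER_DAY
savings = 1 - focused / full_dump

print(f"full: ${full_dump:.2f}/day, focused: ${focused:.2f}/day, saved: {savings:.0%}")
```

Whatever the actual price, the ratio holds: a tenfold cut in input tokens is a 90% cut in input cost.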
What Efficient Context Looks Like
The approaches gaining traction share common patterns:
Semantic retrieval over file dumps. Instead of sending entire directories, retrieve the chunks actually relevant to the query. Embedding-based search, even with generic models, beats sending files in whatever order the directory lists them.
Session-aware caching. If the model analyzed a file three requests ago, it doesn't need the full content again. Diffs, summaries, or context compression techniques can maintain continuity without retransmission.
Structured hierarchies. Send the map before the territory. High-level summaries let the model request specifics rather than processing everything speculatively.
Query-specific scoping. A question about authentication doesn't need the entire codebase. Intelligent scoping—even rough heuristics—dramatically reduces irrelevant context.
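The retrieval and scoping patterns above can be sketched in a few lines. This example deliberately swaps in bag-of-words cosine similarity for a real embedding model so that it runs with the standard library alone. The file names, chunk contents, and the `top_chunks` helper are all hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> Counter:
    """Lowercase word counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9_]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunks(query: str, chunks: dict[str, str], k: int = 2) -> list[str]:
    """Return the names of the k chunks most similar to the query."""
    q = tokenize(query)
    ranked = sorted(chunks, key=lambda name: cosine(q, tokenize(chunks[name])), reverse=True)
    return ranked[:k]


# Hypothetical code chunks keyed by file path.
chunks = {
    "auth/login.py": "def login(user, password): verify password hash and issue session token",
    "billing/invoice.py": "def create_invoice(customer, items): compute totals and tax",
    "auth/tokens.py": "def refresh_token(session): rotate the session token before expiry",
}

selected = top_chunks("why does the login session token expire early", chunks, k=2)
print(selected)
```

Even this rough heuristic leaves the billing code out of an authentication question. An embedding model would handle synonyms and paraphrases that word overlap misses.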
None of this is revolutionary. It's the same efficiency thinking applied to databases, caches, and APIs for decades. The difference is that token costs finally made it necessary.
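As one illustration of "the map before the territory," the sketch below builds a compact outline of a Python file using the standard `ast` module. The `outline` helper and the sample source are hypothetical; a real system would need equivalents for other languages.

```python
import ast


def outline(source: str) -> list[str]:
    """List top-level classes and functions with the first line of each docstring."""
    entries = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "def"
            doc_lines = (ast.get_docstring(node) or "").splitlines()
            summary = doc_lines[0] if doc_lines else ""
            entries.append(f"{kind} {node.name}: {summary}" if summary else f"{kind} {node.name}")
    return entries


# Hypothetical file contents used only to demonstrate the outline.
SAMPLE = '''
class SessionStore:
    """Keeps active sessions in memory."""

    def get(self, key):
        return None


def login(user, password):
    """Check credentials and start a session."""
    return True
'''

summary = outline(SAMPLE)
print("\n".join(summary))
```

The outline costs a handful of tokens per definition. The model can then ask for the full body of whichever function it actually needs.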
The Enterprise Squeeze
For individual developers, token costs are annoying but manageable. For teams, the math scales uncomfortably. A ten-person engineering team, each running AI-assisted sessions throughout the day, can generate token bills that rival cloud compute costs. Some already do.
The reaction splits predictably. Some teams retreat to smaller models, accepting capability trade-offs for cost control. Others implement hard limits, capping usage in ways that frustrate developers. Neither approach solves the underlying problem.
The teams getting this right treat token efficiency as infrastructure, not austerity. They invest in context systems that deliver relevant information efficiently, rather than rationing access to capable models.
The Quality Paradox
Here's the part that surprises developers new to context optimization: less is often more.
Large context windows enable impressive capabilities, but they don't guarantee better outputs. Models can get lost in irrelevant information. They can weight recent tokens over important ones. They can struggle to identify what matters when everything is presented as equally important.
Curated context, meaning the right 20K tokens instead of an arbitrary 200K, frequently produces better results. The model spends its attention on relevant information rather than filtering noise.
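Holding context to a fixed size can be as simple as a greedy packer over chunks that are already ranked by relevance. The sketch below assumes a rough four-characters-per-token estimate, which is a heuristic and not a real tokenizer; `estimate_tokens` and `pack_context` are hypothetical names.

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate (assumption: about 4 characters per token)."""
    return max(1, len(text) // 4)


def pack_context(ranked_chunks: list[str], budget: int) -> list[str]:
    """Keep chunks in ranked order until the token budget is spent."""
    kept: list[str] = []
    used = 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > budget:
            continue  # Too large to fit; a smaller chunk further down may still fit.
        kept.append(chunk)
        used += cost
    return kept


# Most relevant chunk first; sizes chosen so the middle chunk exceeds the budget.
ranked = ["a" * 400, "b" * 4000, "c" * 200]
kept = pack_context(ranked, budget=200)
print([len(chunk) for chunk in kept])
```

Skipping an oversized chunk instead of stopping at it is a design choice: it favors filling the budget over strict rank order, which may or may not suit a given workload.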
This creates a paradox. The path to better AI outputs and lower AI costs is the same path: smarter context management. The developers building custom proxies and retrieval systems aren't just saving money. They're getting better answers.
What Comes Next
The token efficiency conversation is just starting. Current solutions are largely custom—individual developers building personal tools, teams creating internal systems. The tooling ecosystem hasn't caught up.
That's changing. The same week Opus 4.5 launched with its capability-and-cost combination, multiple context optimization approaches surfaced across developer communities. The market is responding to genuine pain.
The question isn't whether context efficiency matters. The 89% savings post settled that. The question is whether it becomes standard practice or remains a competitive advantage for teams that figure it out early.
For now, the developers complaining about Opus costs and the developers celebrating Opus capabilities are often looking at the same model through different architectures. The model didn't change. The context systems around it did.
The token tax is real. It's also, increasingly, optional.